* Re: [PATCH] net: Prevent multiple NAPI instances co-existing in the list
From: Herbert Xu @ 2015-01-09 2:34 UTC (permalink / raw)
To: Dennis Chen; +Cc: netdev, Miller, Eric Dumazet
In-Reply-To: <CA+U0gVgS7z601DZvL82EJfYGYb5XQExw9CbnPRpUeN32TWLF7w@mail.gmail.com>
On Fri, Jan 09, 2015 at 10:32:13AM +0800, Dennis Chen wrote:
>
> Thanks, would you pls give me an example of those drivers? I'll study
> it further...
There are no known drivers doing this currently since that would
be a bug and they would be fixed quickly. But if you look back
in history virtio_net was doing this.
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
* Re: [PATCH] net: Prevent multiple NAPI instances co-existing in the list
From: Dennis Chen @ 2015-01-09 2:32 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev, Miller, Eric Dumazet
In-Reply-To: <20150109022752.GA15785@gondor.apana.org.au>
On Fri, Jan 9, 2015 at 10:27 AM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Fri, Jan 09, 2015 at 10:24:18AM +0800, Dennis Chen wrote:
>>
>> Hi Herbert, please see this code piece in napi_poll:
>>
>> /* Some drivers may have called napi_schedule
>> * prior to exhausting their budget.
>> */
>> if (unlikely(!list_empty(&n->poll_list))) {
>> pr_warn_once("%s: Budget exhausted after napi rescheduled\n",
>> n->dev ? n->dev->name : "backlog");
>> goto out_unlock;
>> }
>>
>> Here "Some drivers" may have called napi_schedule, which leaves
>> n->poll_list non-empty. Doesn't that mean "Some drivers" will clear the
>> NAPI_STATE_SCHED bit? Otherwise napi_schedule() would do nothing.
>> Does that answer your question? ;-)
>
> No it tells me that you don't understand the problem at all.
> Those drivers will end up resetting the NAPI_STATE_SCHED bit
> after clearing it.
>
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Thanks, would you pls give me an example of those drivers? I'll study
it further...
--
Den
* Re: [PATCH] net: Prevent multiple NAPI instances co-existing in the list
From: Dennis Chen @ 2015-01-09 2:28 UTC (permalink / raw)
To: Sergei Shtylyov; +Cc: netdev, Herbert Xu, Miller, Eric Dumazet
In-Reply-To: <54AE8048.6040105@cogentembedded.com>
hello,
On Thu, Jan 8, 2015 at 9:04 PM, Sergei Shtylyov
<sergei.shtylyov@cogentembedded.com> wrote:
> Hello.
>
> On 1/8/2015 11:22 AM, Dennis Chen wrote:
>
>> Some drivers may clear the NAPI_STATE_SCHED bit upon the state of the
>> NAPI instance after exhaust the budget in the poll function, which
>> will open a window for next device interrupt handler to insert a same
>> instance to the list after calling list_add_tail(&n->poll_list,
>> repoll) if we don't set this bit.
>
>
>> Signed-off-by: Dennis Chen <kernel.org.gnu@gmail.com>
>> ---
>> net/core/dev.c | 8 ++++++++
>> 1 file changed, 8 insertions(+)
>
>
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 683d493..b3107ac 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -4619,6 +4619,14 @@ static int napi_poll(struct napi_struct *n,
>> struct list_head *repoll)
>> n->dev ? n->dev->name : "backlog");
>> goto out_unlock;
>> }
>> +
>> + /* Some drivers may exit the polling mode when exhaust the
>
>
> s/exhaust/exhausting/.
>
>> + * budget. Set the NAPI_STATE_SCHED bit to prevent multiple NAPI
>> + * instances in the list in case of next device interrupt raised.
>> + */
>> + if (unlikely(!test_and_set_bit(NAPI_STATE_SCHED, &n->state)))
>> + pr_warn_once("%s: exit polling mode after exhaust the
>> budget\n",
>
>
> Likewise. And s/exit/exiting/.
>
> WBR, Sergei
>
will be updated if applied :-)
--
Den
* Re: [PATCH] net: Prevent multiple NAPI instances co-existing in the list
From: Herbert Xu @ 2015-01-09 2:27 UTC (permalink / raw)
To: Dennis Chen; +Cc: netdev, davem, eric.dumazet
In-Reply-To: <CA+U0gVgZr-Mhzest1O_vtCWMwuaQmP30DYb8CLXDG8iTobObYQ@mail.gmail.com>
On Fri, Jan 09, 2015 at 10:24:18AM +0800, Dennis Chen wrote:
>
> Hi Herbert, please see this code piece in napi_poll:
>
> /* Some drivers may have called napi_schedule
> * prior to exhausting their budget.
> */
> if (unlikely(!list_empty(&n->poll_list))) {
> pr_warn_once("%s: Budget exhausted after napi rescheduled\n",
> n->dev ? n->dev->name : "backlog");
> goto out_unlock;
> }
>
> Here "Some drivers" may have called napi_schedule, which leaves
> n->poll_list non-empty. Doesn't that mean "Some drivers" will clear the
> NAPI_STATE_SCHED bit? Otherwise napi_schedule() would do nothing.
> Does that answer your question? ;-)
No it tells me that you don't understand the problem at all.
Those drivers will end up resetting the NAPI_STATE_SCHED bit
after clearing it.
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
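Herbert's point in the reply above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `toy_napi`, `toy_schedule`, `toy_complete` and `toy_core_requeue` are hypothetical stand-ins, not kernel functions, and the poll list is reduced to a counter. It models a driver that completes and then reschedules before returning its full budget: the scheduled bit ends up set again, and the core sees the instance already queued and does not add it a second time.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space model; not the kernel's struct napi_struct. */
struct toy_napi {
    atomic_bool sched;  /* stands in for NAPI_STATE_SCHED */
    int on_list;        /* how many times the instance is queued */
};

/* Models napi_schedule(): queue the instance only if the bit was clear. */
bool toy_schedule(struct toy_napi *n)
{
    if (atomic_exchange(&n->sched, true))
        return false;   /* already scheduled: nothing to do */
    n->on_list++;
    return true;
}

/* Models completion: the instance leaves the list and the bit is cleared. */
void toy_complete(struct toy_napi *n)
{
    n->on_list = 0;
    atomic_store(&n->sched, false);
}

/* Models the core after poll() used its whole budget: re-queue the
 * instance only if the driver has not already queued it itself. */
bool toy_core_requeue(struct toy_napi *n)
{
    if (n->on_list)
        return false;   /* driver rescheduled: do not add it twice */
    n->on_list++;
    return true;
}
```

In this model, calling `toy_complete()` followed by `toy_schedule()` leaves `sched` set and `on_list` at 1, so a later `toy_schedule()` from an interrupt handler is a no-op and `toy_core_requeue()` declines to queue the instance again.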
* Re: [PATCH] net: Prevent multiple NAPI instances co-existing in the list
From: Dennis Chen @ 2015-01-09 2:26 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev, Herbert Xu, Miller
In-Reply-To: <1420728671.5947.47.camel@edumazet-glaptop2.roam.corp.google.com>
I am very curious why you're removing the atomic ops in the stack.
What's the benefit?
On Thu, Jan 8, 2015 at 10:51 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Thu, 2015-01-08 at 16:22 +0800, Dennis Chen wrote:
>> Some drivers may clear the NAPI_STATE_SCHED bit upon the state of the
>> NAPI instance after exhaust the budget in the poll function, which
>> will open a window for next device interrupt handler to insert a same
>> instance to the list after calling list_add_tail(&n->poll_list,
>> repoll) if we don't set this bit.
>>
>> Signed-off-by: Dennis Chen <kernel.org.gnu@gmail.com>
>> ---
>
>
> Well no.
>
> I am removing some atomic ops in the stack, please do not add new ones,
> especially if no driver is that buggy.
>
> The unlikely() wont help the expensive stuff being done here.
>
>
--
Den
* [PATCH net-next v2 2/2 RESEND] r8152: check the status before submitting rx
From: Hayes Wang @ 2015-01-09 2:26 UTC (permalink / raw)
To: netdev; +Cc: nic_swsd, linux-kernel, linux-usb, Hayes Wang
In-Reply-To: <1394712342-15778-114-Taiwan-albertk@realtek.com>
Don't submit the rx if the device is unplugged or stopped, or if
the link is down.
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
---
drivers/net/usb/r8152.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index cd93388..b23426e 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -1789,6 +1789,11 @@ int r8152_submit_rx(struct r8152 *tp, struct rx_agg *agg, gfp_t mem_flags)
{
int ret;
+ /* The rx would be stopped, so skip submitting */
+ if (test_bit(RTL8152_UNPLUG, &tp->flags) ||
+ !test_bit(WORK_ENABLE, &tp->flags) || !netif_carrier_ok(tp->netdev))
+ return 0;
+
usb_fill_bulk_urb(agg->urb, tp->udev, usb_rcvbulkpipe(tp->udev, 1),
agg->head, agg_buf_sz,
(usb_complete_t)read_bulk_callback, agg);
--
2.1.0
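The guard added by the patch above can be modelled outside the kernel. The following is a minimal sketch, assuming a hypothetical `toy_dev` structure whose boolean fields stand in for the `RTL8152_UNPLUG` and `WORK_ENABLE` flag bits and for `netif_carrier_ok()`; it is not the driver's code.

```c
#include <stdbool.h>

/* Hypothetical device state; the real driver keeps these as flag bits. */
struct toy_dev {
    bool unplugged;     /* models RTL8152_UNPLUG */
    bool work_enabled;  /* models WORK_ENABLE */
    bool carrier_ok;    /* models netif_carrier_ok() */
    int  submitted;     /* rx buffers actually handed to the bus */
};

/* Returns 0 both when the buffer is submitted and when it is skipped,
 * mirroring the early "return 0" in the patch. */
int toy_submit_rx(struct toy_dev *d)
{
    /* The rx would be stopped anyway, so skip submitting. */
    if (d->unplugged || !d->work_enabled || !d->carrier_ok)
        return 0;

    d->submitted++;
    return 0;
}
```

With a guard of this shape, rx can only start once the carrier is reported up, which appears to be why patch 1/2 of the series moves `rtl_start_rx()` after `netif_carrier_on()`.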
* [PATCH net-next v2 1/2 RESEND] r8152: call rtl_start_rx after netif_carrier_on
From: Hayes Wang @ 2015-01-09 2:26 UTC (permalink / raw)
To: netdev; +Cc: nic_swsd, linux-kernel, linux-usb, Hayes Wang
In-Reply-To: <1394712342-15778-114-Taiwan-albertk@realtek.com>
Remove rtl_start_rx() from rtl_enable() and put it after calling
netif_carrier_on().
Signed-off-by: Hayes Wang <hayeswang@realtek.com>
---
drivers/net/usb/r8152.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/net/usb/r8152.c b/drivers/net/usb/r8152.c
index 57ec23e..cd93388 100644
--- a/drivers/net/usb/r8152.c
+++ b/drivers/net/usb/r8152.c
@@ -2059,7 +2059,7 @@ static int rtl_enable(struct r8152 *tp)
rxdy_gated_en(tp, false);
- return rtl_start_rx(tp);
+ return 0;
}
static int rtl8152_enable(struct r8152 *tp)
@@ -2874,6 +2874,7 @@ static void set_carrier(struct r8152 *tp)
tp->rtl_ops.enable(tp);
set_bit(RTL8152_SET_RX_MODE, &tp->flags);
netif_carrier_on(netdev);
+ rtl_start_rx(tp);
}
} else {
if (tp->speed & LINK_STATUS) {
--
2.1.0
* [PATCH net-next v2 0/2 RESEND] r8152: adjust r8152_submit_rx
From: Hayes Wang @ 2015-01-09 2:26 UTC (permalink / raw)
To: netdev; +Cc: nic_swsd, linux-kernel, linux-usb, Hayes Wang
v2:
Replace patch #1 with "call rtl_start_rx after netif_carrier_on".
For patch #2, replace checking tp->speed with netif_carrier_ok.
v1:
Prevent r8152_submit_rx() from submitting rx at unexpected
moments. This could reduce the time needed to stop rx.
For patch #1, tp->speed should be updated early. Then
patch #2 could use it to check the current link status.
Hayes Wang (2):
r8152: call rtl_start_rx after netif_carrier_on
r8152: check the status before submitting rx
drivers/net/usb/r8152.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
--
2.1.0
* Re: [PATCH] net: Prevent multiple NAPI instances co-existing in the list
From: Dennis Chen @ 2015-01-09 2:24 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev, davem, eric.dumazet
In-Reply-To: <20150108111554.GA8720@gondor.apana.org.au>
On Thu, Jan 8, 2015 at 7:15 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> Dennis Chen <kernel.org.gnu@gmail.com> wrote:
>> Some drivers may clear the NAPI_STATE_SCHED bit upon the state of the
>> NAPI instance after exhaust the budget in the poll function, which
>> will open a window for next device interrupt handler to insert a same
>> instance to the list after calling list_add_tail(&n->poll_list,
>> repoll) if we don't set this bit.
>>
>> Signed-off-by: Dennis Chen <kernel.org.gnu@gmail.com>
>
> Which driver is doing this?
>
> Cheers,
> --
> Email: Herbert Xu <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Hi Herbert, please see this code piece in napi_poll:
/* Some drivers may have called napi_schedule
* prior to exhausting their budget.
*/
if (unlikely(!list_empty(&n->poll_list))) {
pr_warn_once("%s: Budget exhausted after napi rescheduled\n",
n->dev ? n->dev->name : "backlog");
goto out_unlock;
}
Here "Some drivers" may have called napi_schedule, which leaves
n->poll_list non-empty. Doesn't that mean "Some drivers" will clear the
NAPI_STATE_SCHED bit? Otherwise napi_schedule() would do nothing.
Does that answer your question? ;-)
--
Den
* Re: Clarification regarding IFLA_BRPORT_LEARNING_SYNC and aging of fdb entries learnt via br_fdb_external_learn_add()
From: Scott Feldman @ 2015-01-09 1:46 UTC (permalink / raw)
To: Jiri Pirko; +Cc: Siva Mannem, Netdev
In-Reply-To: <20150107125301.GE1888@nanopsycho.orion>
On Wed, Jan 7, 2015 at 4:53 AM, Jiri Pirko <jiri@resnulli.us> wrote:
> Tue, Dec 30, 2014 at 07:20:21PM CET, siva.mannem.lnx@gmail.com wrote:
>>Hi,
>>
>>I am trying to understand the ongoing switch device offload effort and
>>am following the discussions. I have a question regarding
>>IFLA_BRPORT_LEARNING_SYNC flag and how aging happens when this flag is
>>enabled on a port that is attached to a bridge that has vlan filtering
>>enabled.
>>
>>If I understand correctly, when IFLA_BRPORT_LEARNING_SYNC is set on a
>>bridge port, fdb entries that are learnt externally (may be learnt by
>>hardware and the driver is notified) are synced to the bridge's fdb using
>>br_fdb_external_learn_add(). The fdb
>>entries (fdb->added_by_external_learn set to true) that are learnt via
>>this method are also deleted by the aging logic after the aging time
>>even though L2 data forwarding happens in hardware.
This is correct...
>> Is there a way
>>where aging can be disabled for these entries? and let the entries be
>>removed only via br_fdb_external_learn_delete()? or am I missing
>>something?
>
> Currently externally learned fdb entries are indeed removed during aging
> cleanup. I believe that br_fdb_cleanup should check added_by_external_learn
> and not remove those fdbs. What do you think Scott?
Something like that would work, if we added another brport flag to
control that. With the current arrangement, using bridge for aging
out entries, we want br_fdb_cleanup removing the external_learned
fdbs, but if there was another brport flag we could fine tune that.
Say new flag is IFLA_BRPORT_AGING_OFFLOAD or something like that. I'm
not sure how aging settings for the bridge are pushed down to offload
hw, or if there is a different set for hw.
But, isn't it nice to let Linux bridge control aging? That way,
bridge -s fdb dump shows nice statistics on fdb entries. Hardware
isn't involved in the aging processes (other than being told to remove
an entry). Simple hardware equals simple driver. Linux remains
control point.
-scott
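The idea discussed above, an aging pass that leaves externally learned entries alone when a per-port flag asks for it, can be sketched in user space. Everything here is hypothetical: `toy_fdb_entry` and `toy_fdb_cleanup` are illustrative names, the `aging_offload` parameter stands in for the not-yet-existing flag Scott calls "IFLA_BRPORT_AGING_OFFLOAD or something like that", and the model ignores static and local entries.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified fdb entry. */
struct toy_fdb_entry {
    unsigned long updated;          /* time the entry was last refreshed */
    bool added_by_external_learn;   /* learnt by hardware, synced by driver */
    bool in_use;
};

/* Ages out entries older than ageing_time and returns how many were removed.
 * With aging_offload set, externally learned entries are skipped and are
 * expected to be removed only on an explicit request from the driver. */
size_t toy_fdb_cleanup(struct toy_fdb_entry *tbl, size_t n,
                       unsigned long now, unsigned long ageing_time,
                       bool aging_offload)
{
    size_t removed = 0;

    for (size_t i = 0; i < n; i++) {
        struct toy_fdb_entry *e = &tbl[i];

        if (!e->in_use)
            continue;
        if (aging_offload && e->added_by_external_learn)
            continue;
        if (now - e->updated >= ageing_time) {
            e->in_use = false;
            removed++;
        }
    }
    return removed;
}
```

Calling it with `aging_offload` false reproduces the current arrangement Scott describes, where the bridge ages out externally learned entries as well.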
* Re: [PATCH next] net: e1000e: support txtd update delay via xmit_more
From: Jeff Kirsher @ 2015-01-09 1:22 UTC (permalink / raw)
To: Florian Westphal; +Cc: netdev
In-Reply-To: <1420624162-24206-1-git-send-email-fw@strlen.de>
On Wed, Jan 7, 2015 at 1:49 AM, Florian Westphal <fw@strlen.de> wrote:
> Don't update tx tail descriptor if queue hasn't been stopped
> and we know at least one more skb will be sent right away.
>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
> drivers/net/ethernet/intel/e1000e/netdev.c | 23 +++++++++++++----------
> 1 file changed, 13 insertions(+), 10 deletions(-)
>
Thanks Florian, I will add your patch to my queue.
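The behaviour described in the quoted commit message, deferring the tail descriptor update while more packets are known to follow, can be shown with a toy ring. This is a sketch only: `toy_ring` and `toy_xmit` are hypothetical names, and the tail register write is reduced to a counter.

```c
#include <stdbool.h>

/* Hypothetical transmit ring; the tail register is modelled as a field. */
struct toy_ring {
    unsigned int next_to_use;  /* software producer index */
    unsigned int hw_tail;      /* last value written to the tail register */
    unsigned int doorbells;    /* number of tail register writes */
};

/* Queues one packet. The tail is written only when no further packet is
 * expected right away, or when the queue has been stopped. */
void toy_xmit(struct toy_ring *r, bool xmit_more, bool queue_stopped)
{
    r->next_to_use++;          /* descriptor for this packet is filled in */

    if (!xmit_more || queue_stopped) {
        r->hw_tail = r->next_to_use;
        r->doorbells++;
    }
}
```

Sending three packets with `xmit_more` set on the first two results in a single tail write covering all three descriptors.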
* Re: [PATCH net-next] mac80211: silent build warnings
From: Ying Xue @ 2015-01-09 1:00 UTC (permalink / raw)
To: Sergei Shtylyov, johannes; +Cc: netdev, linux-wireless
In-Reply-To: <54AE7F3F.3070802@cogentembedded.com>
On 01/08/2015 08:59 PM, Sergei Shtylyov wrote:
> Hello.
>
> On 1/8/2015 10:04 AM, Ying Xue wrote:
>
>> Silent the following build warnings:
>
>> net/mac80211/mlme.c: In function ‘ieee80211_rx_mgmt_beacon’:
>> net/mac80211/mlme.c:1348:3: warning: ‘pwr_level_cisco’ may be used
>> uninitialized in this function [-Wuninitialized]
>> net/mac80211/mlme.c:1315:6: note: ‘pwr_level_cisco’ was declared here
>
>> Signed-off-by: Ying Xue <ying.xue@windriver.com>
>> ---
>> net/mac80211/mlme.c | 2 +-
>> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>> diff --git a/net/mac80211/mlme.c b/net/mac80211/mlme.c
>> index 2c36c47..13b5506 100644
>> --- a/net/mac80211/mlme.c
>> +++ b/net/mac80211/mlme.c
>> @@ -1312,7 +1312,7 @@ static u32 ieee80211_handle_pwr_constr(struct
>> ieee80211_sub_if_data *sdata,
>> {
>> bool has_80211h_pwr = false, has_cisco_pwr = false;
>> int chan_pwr = 0, pwr_reduction_80211h = 0;
>> - int pwr_level_cisco, pwr_level_80211h;
>> + int pwr_level_cisco = 0, pwr_level_80211h = 0;
>
> OK, but why are you also initializing the second variable?
>
Although the second variable is not mentioned in the compiler warning
above, it would be flagged too if we look at the code associated with it.
However, as Johannes confirmed, the first change is not necessary at all,
so please ignore the patch.
Thanks,
Ying
> WBR, Sergei
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
* Re: [PATCH net] ipv6: Prevent ipv6_find_hdr() from returning ENOENT for valid non-first fragments
From: Pablo Neira Ayuso @ 2015-01-09 0:05 UTC (permalink / raw)
To: Hannes Frederic Sowa; +Cc: Rahul Sharma, netdev, linux-kernel, netfilter-devel
In-Reply-To: <1420756756.1755002.211556745.0418D128@webmail.messagingengine.com>
On Thu, Jan 08, 2015 at 11:39:16PM +0100, Hannes Frederic Sowa wrote:
> Hi Pablo,
>
> On Thu, Jan 8, 2015, at 21:53, Pablo Neira Ayuso wrote:
> > I'm afraid we cannot just get rid of that !ipv6_ext_hdr() check. The
> > ipv6_find_hdr() function is designed to return the transport protocol.
> > After the proposed change, it will return extension header numbers.
> > This will break existing ip6tables rulesets since the `-p' option
> > relies on this function to match the transport protocol.
> >
> > Note that the AH header is skipped (see code a bit below this
> > problematic fragmentation handling) so the follow up header after the
> > AH header is returned as the transport header.
> >
> > We can probably return the AH protocol number for non-1st fragments.
> > However, that would be something new to ip6tables since nobody has
> > ever seen packet matching `-p ah' rules. Thus, we restore control to
> > the user to allow this, but we would accept all kind of fragmented AH
> > traffic through the firewall since we cannot know what transport
> > protocol contains from non-1st fragments (unless I'm missing anything,
> > I need to have a closer look at this again tomorrow with fresher
> > mind).
>
> The code in question is guarded by (_frag_off != 0), so we are
> definitely processing a non-1st fragment currently. The -p match would
> happen at the time when the packet is reassembled and thus ipv6_find_hdr
> will find the real transport (final) header at this point (I hope I
> followed the code correctly here).
Then, Rahul should get things working by modprobing nf_defrag_ipv6.
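The fragment handling debated above can be illustrated with a deliberately simplified header walk. This is not the kernel's ipv6_find_hdr(); `toy_hdr`, `toy_is_ext` and `toy_find_transport` are hypothetical, each header is reduced to its next-header value and fragment offset, and the behaviour is an assumption based on the thread's description: extension headers such as AH are skipped, and a non-first fragment only yields a protocol when its next header is not an extension header.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* IANA protocol numbers for the headers used in this sketch. */
enum {
    TOY_HOPOPTS = 0, TOY_TCP = 6, TOY_ROUTING = 43,
    TOY_FRAGMENT = 44, TOY_AH = 51, TOY_NONE = 59, TOY_DSTOPTS = 60
};

/* Hypothetical, flattened view of one header in the chain. */
struct toy_hdr {
    uint8_t  nexthdr;   /* type of the header that follows this one */
    uint16_t frag_off;  /* fragment offset; only used by fragment headers */
};

bool toy_is_ext(uint8_t nexthdr)
{
    return nexthdr == TOY_HOPOPTS || nexthdr == TOY_ROUTING ||
           nexthdr == TOY_FRAGMENT || nexthdr == TOY_AH ||
           nexthdr == TOY_NONE || nexthdr == TOY_DSTOPTS;
}

/* `first` is the next-header value of the fixed IPv6 header and chain[]
 * holds the headers that follow it. Returns the transport protocol
 * number, or -ENOENT when it cannot be determined from this packet. */
int toy_find_transport(uint8_t first, const struct toy_hdr *chain, size_t n)
{
    uint8_t nexthdr = first;
    size_t i = 0;

    while (toy_is_ext(nexthdr)) {
        if (nexthdr == TOY_NONE || i >= n)
            return -ENOENT;

        const struct toy_hdr *hp = &chain[i++];

        if (nexthdr == TOY_FRAGMENT && hp->frag_off != 0) {
            /* Non-first fragment: the rest of the chain lives in the
             * first fragment, so only a non-extension value is usable. */
            if (!toy_is_ext(hp->nexthdr))
                return hp->nexthdr;
            return -ENOENT;
        }
        nexthdr = hp->nexthdr;
    }
    return nexthdr;
}
```

In this model a non-first fragment whose fragment header points at AH returns -ENOENT, which is the case the patch under discussion set out to change; after reassembly the same walk skips AH and reaches the real transport header.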
* Re: [PATCH] ethernet: atheros: Add nss-gmac driver
From: Francois Romieu @ 2015-01-09 0:00 UTC (permalink / raw)
To: Stephen Wang
Cc: jcliburn, grant.likely, robh+dt, linux-kernel, netdev, devicetree
In-Reply-To: <1420754626-30121-1-git-send-email-wstephen@codeaurora.org>
Stephen Wang <wstephen@codeaurora.org> :
> This driver adds support for the internal GMACs on IPQ806x SoCs.
> It supports the device-tree and will register up to 4 ethernet
> interfaces.
>
> Signed-off-by: Stephen Wang <wstephen@codeaurora.org>
> ---
[...]
> +/*
> + * NSS GMAC data plane ops, default would be slowpath and can be override by
> + * nss-drv
> + */
> +struct nss_gmac_data_plane_ops {
> + int (*open)(void *ctx, uint32_t tx_desc_ring, uint32_t rx_desc_ring,
> + uint32_t mode);
> + int (*close)(void *ctx);
> + int (*link_state)(void *ctx, uint32_t link_state);
> + int (*mac_addr)(void *ctx, uint8_t *addr);
> + int (*change_mtu)(void *ctx, uint32_t mtu);
> + int (*xmit)(void *ctx, struct sk_buff *os_buf);
> +};
Weak types.
[...]
> +/*
> + * nss_gmac_register_offload()
> + *
> + * @param[netdev] netdev instance that is going to register
> + * @param[dp_ops] dataplan ops for chaning mac addr/mtu/link status
> + * @param[ctx] passing the ctx of this nss_phy_if to gmac
> + *
> + * @return Return SUCCESS or FAILURE
> + */
> +int nss_gmac_override_data_plane(struct net_device *netdev,
> + struct nss_gmac_data_plane_ops *dp_ops,
> + void *ctx)
> +{
> + struct nss_gmac_dev *gmacdev = (struct nss_gmac_dev *)netdev_priv(netdev);
> +
> + BUG_ON(!gmacdev);
> +
> + if (!dp_ops->open || !dp_ops->close || !dp_ops->link_state
> + || !dp_ops->mac_addr || !dp_ops->change_mtu || !dp_ops->xmit) {
> + netdev_dbg(netdev, "%s: All the op functions must be present, reject this registeration",
> + __func__);
> + return NSS_GMAC_FAILURE;
> + }
> +
> + /*
> + * If this gmac is up, close the netdev to force TX/RX stop
> + */
> + if (test_bit(__NSS_GMAC_UP, &gmacdev->flags))
> + nss_gmac_linux_close(netdev);
> +
> + /* Recored the data_plane_ctx, data_plane_ops */
> + gmacdev->data_plane_ctx = ctx;
> + gmacdev->data_plane_ops = dp_ops;
> + gmacdev->first_linkup_done = 0;
> +
> + return NSS_GMAC_SUCCESS;
> +}
> +EXPORT_SYMBOL(nss_gmac_override_data_plane);
Why is this hook needed ?
--
Ueimor
* Re: [PATCH] ethernet: atheros: Add nss-gmac driver
From: Arnd Bergmann @ 2015-01-08 23:35 UTC (permalink / raw)
To: Stephen Wang
Cc: jcliburn, grant.likely, robh+dt, linux-kernel, netdev, devicetree
In-Reply-To: <1420754626-30121-1-git-send-email-wstephen@codeaurora.org>
On Thursday 08 January 2015 14:03:46 Stephen Wang wrote:
> This driver adds support for the internal GMACs on IPQ806x SoCs.
> It supports the device-tree and will register up to 4 ethernet
> interfaces.
>
> Signed-off-by: Stephen Wang <wstephen@codeaurora.org>
Just a very brief review here. The driver is far too long to be
reviewed in one piece, and hopefully is not needed at all if my
suspicion is correct that we already have a driver for the
hardware.
> diff --git a/drivers/net/ethernet/atheros/nss-gmac/LICENSE.txt b/drivers/net/ethernet/atheros/nss-gmac/LICENSE.txt
> new file mode 100644
> index 0000000..806f2e6
> --- /dev/null
> +++ b/drivers/net/ethernet/atheros/nss-gmac/LICENSE.txt
> @@ -0,0 +1,14 @@
> +Linux Driver for 3504 DWC Ether MAC 10/100/1000 Universal
> +Linux Driver for 3507 DWC Ether MAC 10/100 Universal
> +
> +IMPORTANT: Synopsys Ethernet MAC Linux Software Drivers and documentation (hereinafter, "Software") are unsupported proprietary works of Synopsys, Inc. unless otherwise expressly agreed to in writing between Synopsys and you.
> +
> +The Software uses certain Linux kernel functionality and may therefore be subject to the GNU Public License which is available at:
> +http://www.gnu.org/licenses/gpl.html
It sounds like this one is related to the dwmac driver in
drivers/net/ethernet/stmicro/stmmac/. Please move the code into
the same directory and reuse as much as you can.
If this is a completely unrelated part, it should probably go
into drivers/net/ethernet/designware or drivers/net/ethernet/synopsys.
> +#ifdef CONFIG_OF
> +#include <msm_nss_gmac.h>
> +#else
> +#include <mach/msm_nss_gmac.h>
> +#endif
Drop the non-CONFIG_OF part here and elsewhere, we don't support
separate platform directories any more, and mach-qcom is already
DT-only.
> +/**********************************************************
> + * GMAC registers Map
> + * For Pci based system address is BARx + gmac_register_base
> + * For any other system translation is done accordingly
> + **********************************************************/
> +enum gmac_registers {
> + gmac_config = 0x0000, /* Mac config Register */
> + gmac_frame_filter = 0x0004, /* Mac frame filtering controls */
> + gmac_hash_high = 0x0008, /* Multi-cast hash table high */
> + gmac_hash_low = 0x000c, /* Multi-cast hash table low */
> + gmac_gmii_addr = 0x0010, /* GMII address Register(ext. Phy) */
> + gmac_gmii_data = 0x0014, /* GMII data Register(ext. Phy) */
> + gmac_flow_control = 0x0018, /* Flow control Register */
> + gmac_vlan = 0x001c, /* VLAN tag Register (IEEE 802.1Q) */
> + gmac_version = 0x0020, /* GMAC Core Version Register */
> + gmac_wakeup_addr = 0x0028, /* GMAC wake-up frame filter adrress
> + reg */
This looks a lot like dwmac1000 as well.
> + if (of_property_read_u32(np, "qcom,id", &gmacdev->macid)
> + || of_property_read_u32(np, "qcom,emulation", &gmaccfg->emulation)
> + || of_property_read_u32(np, "qcom,phy_mii_type", &gmaccfg->phy_mii_type)
> + || of_property_read_u32(np, "qcom,phy_mdio_addr", &gmaccfg->phy_mdio_addr)
> + || of_property_read_u32(np, "qcom,rgmii_delay", &gmaccfg->rgmii_delay)
> + || of_property_read_u32(np, "qcom,poll_required", &gmaccfg->poll_required)
> + || of_property_read_u32(np, "qcom,forced_speed", &gmaccfg->forced_speed)
> + || of_property_read_u32(np, "qcom,forced_duplex", &gmaccfg->forced_duplex)
> + || of_property_read_u32(np, "qcom,irq", &netdev->irq)
> + || of_property_read_u32(np, "qcom,socver", &gmaccfg->socver)) {
This is not an acceptable way to pass data from DT, please use the standard properties.
> + if (test_bit(__NSS_GMAC_LINKPOLL, &gmacdev->flags)) {
> +#if (LINUX_VERSION_CODE <= KERNEL_VERSION(3, 8, 0))
> + gmacdev->phydev = phy_connect(netdev, (const char *)phy_id,
> + &nss_gmac_adjust_link, 0, phyif);
> +#else
> + gmacdev->phydev = phy_connect(netdev, (const char *)phy_id,
> + &nss_gmac_adjust_link, phyif);
> +#endif
Drop all LINUX_VERSION_CODE checks
> + if (IS_ERR_OR_NULL(gmacdev->phydev)) {
> + netdev_dbg(netdev, "PHY %s attach FAIL", phy_id);
> + ret = -EIO;
> + goto nss_gmac_phy_attach_fail;
> + }
Also drop any IS_ERR_OR_NULL checks; you are using the API incorrectly.
> +static struct of_device_id nss_gmac_dt_ids[] = {
> + { .compatible = "qcom,nss-gmac0" },
> + { .compatible = "qcom,nss-gmac1" },
> + { .compatible = "qcom,nss-gmac2" },
> + { .compatible = "qcom,nss-gmac3" },
> + {},
> +};
> +MODULE_DEVICE_TABLE(of, nss_gmac_dt_ids);
Are all four versions supported by this driver?
Each one of these needs its own devicetree binding that documents which
hardware it is for and what the supported DT properties are.
> +static struct platform_driver nss_gmac_drv = {
> + .probe = nss_gmac_probe,
> + .remove = nss_gmac_remove,
> + .driver = {
> + .name = "nss-gmac",
> + .owner = THIS_MODULE,
> +#ifdef CONFIG_OF
> + .of_match_table = of_match_ptr(nss_gmac_dt_ids),
> +#endif
The redundancy here is redundant and unnecessary.
> +
> +/**
> + * @brief IOCTL interface.
> + * This function is mainly for debugging purpose.
> + * This provides hooks for Register read write, Retrieve descriptor status
> + * and Retreiving Device structure information.
> + * @param[in] pointer to net_device structure.
> + * @param[in] pointer to ifreq structure.
> + * @param[in] ioctl command.
> + * @return Returns 0 on success Error code on failure.
> + */
> +static int32_t nss_gmac_linux_do_ioctl(struct net_device *netdev,
> + struct ifreq *ifr, int32_t cmd)
> +{
Remove the private ioctls.
> +/**
> + * @brief Stats Callback to receive statistics from NSS
> + * @param[in] pointer to gmac context
> + * @param[in] pointer to gmac statistics
> + * @return Returns void.
> + */
> +static void nss_gmac_stats_receive(struct nss_gmac_dev *gmacdev,
> + struct nss_gmac_stats *gstat)
> +{
...
> +}
> +EXPORT_SYMBOL(nss_gmac_receive);
Why multiple modules instead of linking everything together?
> +
> +/**
> + * NSS Driver interface APIs
> + */
What is an NSS driver?
> +/*
> + * nss_gmac_open_work()
> + * Schedule delayed work to open the netdev again
> + */
> +void nss_gmac_open_work(struct work_struct *work)
> +{
> + struct nss_gmac_dev *gmacdev = container_of(to_delayed_work(work),
> + struct nss_gmac_dev, gmacwork);
> +
> + netdev_dbg(gmacdev->netdev, "Do the network up in delayed queue %s\n",
> + gmacdev->netdev->name);
> + nss_gmac_linux_open(gmacdev->netdev);
> +}
You seem to have an operating system abstraction layer in here. We know
which OS we are running on, so please remove the abstraction.
Arnd
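Arnd's remark about IS_ERR_OR_NULL refers to the kernel's error-pointer convention, which can be reproduced in user space for illustration. The helpers below are hypothetical re-implementations (`toy_err_ptr`, `toy_ptr_err`, `toy_is_err`), not the kernel macros, and `toy_phy_connect` is an invented lookup that, by assumption, reports failure only through an encoded error pointer and never through NULL.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Encode a negative errno value in a pointer, and decode it again. */
void *toy_err_ptr(intptr_t err) { return (void *)err; }
intptr_t toy_ptr_err(const void *p) { return (intptr_t)p; }

/* Error pointers occupy the top TOY_MAX_ERRNO addresses; NULL is not one. */
bool toy_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_phy { int addr; };

/* Hypothetical lookup: returns a valid pointer or an encoded errno. */
struct toy_phy *toy_phy_connect(struct toy_phy *phys, size_t n, int addr)
{
    for (size_t i = 0; i < n; i++)
        if (phys[i].addr == addr)
            return &phys[i];
    return toy_err_ptr(-ENODEV);
}

/* The caller checks only for an error pointer and passes the real error
 * code up instead of replacing it with a generic one. */
int toy_attach(struct toy_phy *phys, size_t n, int addr)
{
    struct toy_phy *phy = toy_phy_connect(phys, n, addr);

    if (toy_is_err(phy))
        return (int)toy_ptr_err(phy);
    return 0;
}
```

When a function follows this convention, an additional NULL test adds nothing, and returning the decoded value preserves the original error rather than collapsing it to -EIO as the quoted code does.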
* [PATCH iproute2 -next v3] ip: route: add congestion control metric
From: Daniel Borkmann @ 2015-01-08 23:13 UTC (permalink / raw)
To: stephen; +Cc: fw, netdev
This patch adds configuration and dumping of congestion control metric
for ip route, for example:
ip route add <dst> dev foo congctl [lock] dctcp
Reference: http://thread.gmane.org/gmane.linux.network/344733
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
v2->v3:
- use rta_getattr_u32 et al helpers
- regarding discussed mx(un)lock, I'll do a follow-up for
all users inside iproute_modify() if that is wished
v1->v2:
- adapted mxlock setting to kernel style
- use arg directly in rta_addattr_l
include/linux/rtnetlink.h | 2 ++
ip/iproute.c | 22 ++++++++++++++++++----
man/man8/ip-route.8.in | 19 ++++++++++++++++++-
3 files changed, 38 insertions(+), 5 deletions(-)
diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index 9aa5c2f..ac4af97 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -389,6 +389,8 @@ enum {
#define RTAX_INITRWND RTAX_INITRWND
RTAX_QUICKACK,
#define RTAX_QUICKACK RTAX_QUICKACK
+ RTAX_CC_ALGO,
+#define RTAX_CC_ALGO RTAX_CC_ALGO
__RTAX_MAX
};
diff --git a/ip/iproute.c b/ip/iproute.c
index 5a496a9..76d8e36 100644
--- a/ip/iproute.c
+++ b/ip/iproute.c
@@ -53,6 +53,7 @@ static const char *mx_names[RTAX_MAX+1] = {
[RTAX_RTO_MIN] = "rto_min",
[RTAX_INITRWND] = "initrwnd",
[RTAX_QUICKACK] = "quickack",
+ [RTAX_CC_ALGO] = "congctl",
};
static void usage(void) __attribute__((noreturn));
@@ -80,8 +81,7 @@ static void usage(void)
fprintf(stderr, " [ window NUMBER] [ cwnd NUMBER ] [ initcwnd NUMBER ]\n");
fprintf(stderr, " [ ssthresh NUMBER ] [ realms REALM ] [ src ADDRESS ]\n");
fprintf(stderr, " [ rto_min TIME ] [ hoplimit NUMBER ] [ initrwnd NUMBER ]\n");
- fprintf(stderr, " [ features FEATURES ]\n");
- fprintf(stderr, " [ quickack BOOL ]\n");
+ fprintf(stderr, " [ features FEATURES ] [ quickack BOOL ] [ congctl NAME ]\n");
fprintf(stderr, "TYPE := [ unicast | local | broadcast | multicast | throw |\n");
fprintf(stderr, " unreachable | prohibit | blackhole | nat ]\n");
fprintf(stderr, "TABLE_ID := [ local | main | default | all | NUMBER ]\n");
@@ -536,7 +536,7 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
mxlock = *(unsigned*)RTA_DATA(mxrta[RTAX_LOCK]);
for (i=2; i<= RTAX_MAX; i++) {
- unsigned val;
+ __u32 val;
if (mxrta[i] == NULL)
continue;
@@ -545,10 +545,12 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
fprintf(fp, " %s", mx_names[i]);
else
fprintf(fp, " metric %d", i);
+
if (mxlock & (1<<i))
fprintf(fp, " lock");
+ if (i != RTAX_CC_ALGO)
+ val = rta_getattr_u32(mxrta[i]);
- val = *(unsigned*)RTA_DATA(mxrta[i]);
switch (i) {
case RTAX_FEATURES:
print_rtax_features(fp, val);
@@ -573,6 +575,10 @@ int print_route(const struct sockaddr_nl *who, struct nlmsghdr *n, void *arg)
fprintf(fp, " %gs", val/1e3);
else
fprintf(fp, " %ums", val);
+ break;
+ case RTAX_CC_ALGO:
+ fprintf(fp, " %s", rta_getattr_str(mxrta[i]));
+ break;
}
}
}
@@ -925,6 +931,14 @@ static int iproute_modify(int cmd, unsigned flags, int argc, char **argv)
if (quickack != 1 && quickack != 0)
invarg("\"quickack\" value should be 0 or 1\n", *argv);
rta_addattr32(mxrta, sizeof(mxbuf), RTAX_QUICKACK, quickack);
+ } else if (matches(*argv, "congctl") == 0) {
+ NEXT_ARG();
+ if (strcmp(*argv, "lock") == 0) {
+ mxlock |= 1 << RTAX_CC_ALGO;
+ NEXT_ARG();
+ }
+ rta_addattr_l(mxrta, sizeof(mxbuf), RTAX_CC_ALGO, *argv,
+ strlen(*argv));
} else if (matches(*argv, "rttvar") == 0) {
unsigned win;
NEXT_ARG();
diff --git a/man/man8/ip-route.8.in b/man/man8/ip-route.8.in
index 89960c1..2b1583d 100644
--- a/man/man8/ip-route.8.in
+++ b/man/man8/ip-route.8.in
@@ -116,7 +116,9 @@ replace " } "
.B features
.IR FEATURES " ] [ "
.B quickack
-.IR BOOL " ]"
+.IR BOOL " ] [ "
+.B congctl
+.IR NAME " ]"
.ti -8
.IR TYPE " := [ "
@@ -433,6 +435,21 @@ sysctl is set to 0.
Enable or disable quick ack for connections to this destination.
.TP
+.BI congctl " NAME " "(3.20+ only)"
+.TP
+.BI "congctl lock" " NAME " "(3.20+ only)"
+Sets a specific TCP congestion control algorithm only for a given destination.
+If not specified, Linux keeps the current global default TCP congestion control
+algorithm, or the one set from the application. If the modifier
+.B lock
+is not used, an application may nevertheless overwrite the suggested congestion
+control algorithm for that destination. If the modifier
+.B lock
+is used, then an application is not allowed to overwrite the specified congestion
+control algorithm for that destination, thus it will be enforced/guaranteed to
+use the proposed algorithm.
+
+.TP
.BI advmss " NUMBER " "(2.3.15+ only)"
the MSS ('Maximal Segment Size') to advertise to these
destinations when establishing TCP connections. If it is not given,
--
1.9.0
^ permalink raw reply related
* [PATCH 6/6] openvswitch: Support VXLAN Group Policy extension
From: Thomas Graf @ 2015-01-08 22:47 UTC (permalink / raw)
To: davem, jesse, stephen, pshelar, therbert, alexei.starovoitov; +Cc: netdev, dev
In-Reply-To: <cover.1420756324.git.tgraf@suug.ch>
Introduces support for the group policy extension to the VXLAN virtual
port. The extension is disabled by default and only enabled if the user
has provided the respective configuration.
ovs-vsctl add-port br0 vxlan0 -- \
set Interface vxlan0 type=vxlan options:exts=gbp
The configuration interface to enable the extension is based on a new
attribute OVS_VXLAN_EXT_GBP nested inside OVS_TUNNEL_ATTR_EXTENSION
which can carry additional extensions as needed in the future.
The group policy metadata is stored as a binary blob (struct ovs_vxlan_opts)
internally, just like Geneve options, but is transported as nested Netlink
attributes to user space.
Renames the existing TUNNEL_OPTIONS_PRESENT to TUNNEL_GENEVE_OPT with the
binary value kept intact and introduces a new flag, TUNNEL_VXLAN_OPT.
The attributes OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS and the existing
OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS are implemented as mutually exclusive.
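The mutual exclusion of the two metadata attributes can be sketched in
isolation as follows. This is only an illustration under stated
assumptions: the flag values are taken from the patch but kept in host
byte order (the kernel wraps them in __cpu_to_be16()), and the
example_* names and attribute numbers are hypothetical stand-ins for
the real OVS_TUNNEL_KEY_ATTR_* handling.

```c
#include <stdint.h>
#include <errno.h>

/* Flag values from the patch, in host byte order for simplicity. */
#define TUNNEL_GENEVE_OPT      0x0800
#define TUNNEL_VXLAN_OPT       0x1000
#define TUNNEL_OPTIONS_PRESENT (TUNNEL_GENEVE_OPT | TUNNEL_VXLAN_OPT)

/* Hypothetical stand-ins for the OVS_TUNNEL_KEY_ATTR_* values. */
enum example_attr {
	EXAMPLE_ATTR_OTHER = 1,
	EXAMPLE_ATTR_GENEVE_OPTS,
	EXAMPLE_ATTR_VXLAN_OPTS,
};

/* Walk a list of attribute types and accumulate tunnel flags.
 * Returns the type of the single metadata attribute seen (0 if none),
 * or -EINVAL if more than one metadata block is present.
 */
static int example_parse_tunnel_attrs(const int *types, int n,
				      uint16_t *flags)
{
	int opts_type = 0;
	int i;

	*flags = 0;
	for (i = 0; i < n; i++) {
		switch (types[i]) {
		case EXAMPLE_ATTR_GENEVE_OPTS:
			if (opts_type)
				return -EINVAL;
			*flags |= TUNNEL_GENEVE_OPT;
			opts_type = types[i];
			break;
		case EXAMPLE_ATTR_VXLAN_OPTS:
			if (opts_type)
				return -EINVAL;
			*flags |= TUNNEL_VXLAN_OPT;
			opts_type = types[i];
			break;
		default:
			/* Attributes without metadata are ignored here. */
			break;
		}
	}
	return opts_type;
}
```

Returning the attribute type instead of 0 lets a caller pick the
matching validation step, while a test against TUNNEL_OPTIONS_PRESENT
still answers "is any metadata present" for either flavour.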
Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
v2:
- Addressed Jesse's request to transport VXLAN options as Netlink
attributes instead of a binary blob. Allows a partial transport of
VXLAN extensions. Internally, the datapath continues to use a binary
blob (defined in vport-vxlan.h) for performance reasons.
- Added new TUNNEL_GENEVE_OPT and TUNNEL_VXLAN_OPT flags to mark
tunnel option flavour
- Correctly report VXLAN options to user space
include/net/ip_tunnels.h | 5 +-
include/uapi/linux/openvswitch.h | 11 ++++
net/openvswitch/flow_netlink.c | 114 ++++++++++++++++++++++++++++++++++-----
net/openvswitch/vport-geneve.c | 2 +-
net/openvswitch/vport-vxlan.c | 81 +++++++++++++++++++++++++++-
net/openvswitch/vport-vxlan.h | 11 ++++
6 files changed, 207 insertions(+), 17 deletions(-)
create mode 100644 net/openvswitch/vport-vxlan.h
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 25a59eb..ce4db3c 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -97,7 +97,10 @@ struct ip_tunnel {
#define TUNNEL_DONT_FRAGMENT __cpu_to_be16(0x0100)
#define TUNNEL_OAM __cpu_to_be16(0x0200)
#define TUNNEL_CRIT_OPT __cpu_to_be16(0x0400)
-#define TUNNEL_OPTIONS_PRESENT __cpu_to_be16(0x0800)
+#define TUNNEL_GENEVE_OPT __cpu_to_be16(0x0800)
+#define TUNNEL_VXLAN_OPT __cpu_to_be16(0x1000)
+
+#define TUNNEL_OPTIONS_PRESENT (TUNNEL_GENEVE_OPT | TUNNEL_VXLAN_OPT)
struct tnl_ptk_info {
__be16 flags;
diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 3a6dcaa..e474c95 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -248,11 +248,21 @@ enum ovs_vport_attr {
#define OVS_VPORT_ATTR_MAX (__OVS_VPORT_ATTR_MAX - 1)
+enum {
+ OVS_VXLAN_EXT_UNSPEC,
+ OVS_VXLAN_EXT_GBP, /* Flag or __u32 */
+ __OVS_VXLAN_EXT_MAX,
+};
+
+#define OVS_VXLAN_EXT_MAX (__OVS_VXLAN_EXT_MAX - 1)
+
+
/* OVS_VPORT_ATTR_OPTIONS attributes for tunnels.
*/
enum {
OVS_TUNNEL_ATTR_UNSPEC,
OVS_TUNNEL_ATTR_DST_PORT, /* 16-bit UDP port, used by L4 tunnels. */
+ OVS_TUNNEL_ATTR_EXTENSION,
__OVS_TUNNEL_ATTR_MAX
};
@@ -324,6 +334,7 @@ enum ovs_tunnel_key_attr {
OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS, /* Array of Geneve options. */
OVS_TUNNEL_KEY_ATTR_TP_SRC, /* be16 src Transport Port. */
OVS_TUNNEL_KEY_ATTR_TP_DST, /* be16 dst Transport Port. */
+ OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS, /* Nested OVS_VXLAN_EXT_* */
__OVS_TUNNEL_KEY_ATTR_MAX
};
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 457ccf3..cea492b 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -49,6 +49,7 @@
#include <net/mpls.h>
#include "flow_netlink.h"
+#include "vport-vxlan.h"
struct ovs_len_tbl {
int len;
@@ -268,6 +269,9 @@ size_t ovs_tun_key_attr_size(void)
+ nla_total_size(0) /* OVS_TUNNEL_KEY_ATTR_CSUM */
+ nla_total_size(0) /* OVS_TUNNEL_KEY_ATTR_OAM */
+ nla_total_size(256) /* OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS */
+ /* OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS is mutually exclusive with
+ * OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS and covered by it.
+ */
+ nla_total_size(2) /* OVS_TUNNEL_KEY_ATTR_TP_SRC */
+ nla_total_size(2); /* OVS_TUNNEL_KEY_ATTR_TP_DST */
}
@@ -308,6 +312,7 @@ static const struct ovs_len_tbl ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1]
[OVS_TUNNEL_KEY_ATTR_TP_DST] = { .len = sizeof(u16) },
[OVS_TUNNEL_KEY_ATTR_OAM] = { .len = 0 },
[OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS] = { .len = OVS_ATTR_NESTED },
+ [OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS] = { .len = OVS_ATTR_NESTED },
};
/* The size of the argument for each %OVS_KEY_ATTR_* Netlink attribute. */
@@ -460,6 +465,41 @@ static int genev_tun_opt_from_nlattr(const struct nlattr *a,
return 0;
}
+static const struct nla_policy vxlan_opt_policy[OVS_VXLAN_EXT_MAX + 1] = {
+ [OVS_VXLAN_EXT_GBP] = { .type = NLA_U32 },
+};
+
+static int vxlan_tun_opt_from_nlattr(const struct nlattr *a,
+ struct sw_flow_match *match, bool is_mask,
+ bool log)
+{
+ struct nlattr *tb[OVS_VXLAN_EXT_MAX+1];
+ unsigned long opt_key_offset;
+ struct ovs_vxlan_opts opts;
+ int err;
+
+ BUILD_BUG_ON(sizeof(opts) > sizeof(match->key->tun_opts));
+
+ err = nla_parse_nested(tb, OVS_VXLAN_EXT_MAX, a, vxlan_opt_policy);
+ if (err < 0)
+ return err;
+
+ memset(&opts, 0, sizeof(opts));
+
+ if (tb[OVS_VXLAN_EXT_GBP])
+ opts.gbp = nla_get_u32(tb[OVS_VXLAN_EXT_GBP]);
+
+ if (!is_mask)
+ SW_FLOW_KEY_PUT(match, tun_opts_len, sizeof(opts), false);
+ else
+ SW_FLOW_KEY_PUT(match, tun_opts_len, 0xff, true);
+
+ opt_key_offset = TUN_METADATA_OFFSET(sizeof(opts));
+ SW_FLOW_KEY_MEMCPY_OFFSET(match, opt_key_offset, &opts, sizeof(opts),
+ is_mask);
+ return 0;
+}
+
static int ipv4_tun_from_nlattr(const struct nlattr *attr,
struct sw_flow_match *match, bool is_mask,
bool log)
@@ -468,6 +508,7 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
int rem;
bool ttl = false;
__be16 tun_flags = 0;
+ int opts_type = 0;
nla_for_each_nested(a, attr, rem) {
int type = nla_type(a);
@@ -527,11 +568,30 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
tun_flags |= TUNNEL_OAM;
break;
case OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS:
+ if (opts_type) {
+ OVS_NLERR(log, "Multiple metadata blocks provided");
+ return -EINVAL;
+ }
+
err = genev_tun_opt_from_nlattr(a, match, is_mask, log);
if (err)
return err;
- tun_flags |= TUNNEL_OPTIONS_PRESENT;
+ tun_flags |= TUNNEL_GENEVE_OPT;
+ opts_type = type;
+ break;
+ case OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS:
+ if (opts_type) {
+ OVS_NLERR(log, "Multiple metadata blocks provided");
+ return -EINVAL;
+ }
+
+ err = vxlan_tun_opt_from_nlattr(a, match, is_mask, log);
+ if (err)
+ return err;
+
+ tun_flags |= TUNNEL_VXLAN_OPT;
+ opts_type = type;
break;
default:
OVS_NLERR(log, "Unknown IPv4 tunnel attribute %d",
@@ -560,6 +620,23 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
}
}
+ return opts_type;
+}
+
+static int vxlan_opt_to_nlattr(struct sk_buff *skb,
+ const void *tun_opts, int swkey_tun_opts_len)
+{
+ const struct ovs_vxlan_opts *opts = tun_opts;
+ struct nlattr *nla;
+
+ nla = nla_nest_start(skb, OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS);
+ if (!nla)
+ return -EMSGSIZE;
+
+ if (nla_put_u32(skb, OVS_VXLAN_EXT_GBP, opts->gbp) < 0)
+ return -EMSGSIZE;
+
+ nla_nest_end(skb, nla);
return 0;
}
@@ -596,10 +673,15 @@ static int __ipv4_tun_to_nlattr(struct sk_buff *skb,
if ((output->tun_flags & TUNNEL_OAM) &&
nla_put_flag(skb, OVS_TUNNEL_KEY_ATTR_OAM))
return -EMSGSIZE;
- if (tun_opts &&
- nla_put(skb, OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS,
- swkey_tun_opts_len, tun_opts))
- return -EMSGSIZE;
+ if (tun_opts) {
+ if (output->tun_flags & TUNNEL_GENEVE_OPT &&
+ nla_put(skb, OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS,
+ swkey_tun_opts_len, tun_opts))
+ return -EMSGSIZE;
+ else if (output->tun_flags & TUNNEL_VXLAN_OPT &&
+ vxlan_opt_to_nlattr(skb, tun_opts, swkey_tun_opts_len))
+ return -EMSGSIZE;
+ }
return 0;
}
@@ -680,7 +762,7 @@ static int metadata_from_nlattrs(struct sw_flow_match *match, u64 *attrs,
}
if (*attrs & (1 << OVS_KEY_ATTR_TUNNEL)) {
if (ipv4_tun_from_nlattr(a[OVS_KEY_ATTR_TUNNEL], match,
- is_mask, log))
+ is_mask, log) < 0)
return -EINVAL;
*attrs &= ~(1 << OVS_KEY_ATTR_TUNNEL);
}
@@ -1578,17 +1660,23 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
struct sw_flow_key key;
struct ovs_tunnel_info *tun_info;
struct nlattr *a;
- int err, start;
+ int err, start, opts_type;
ovs_match_init(&match, &key, NULL);
- err = ipv4_tun_from_nlattr(nla_data(attr), &match, false, log);
- if (err)
- return err;
+ opts_type = ipv4_tun_from_nlattr(nla_data(attr), &match, false, log);
+ if (opts_type < 0)
+ return opts_type;
if (key.tun_opts_len) {
- err = validate_and_copy_geneve_opts(&key);
- if (err < 0)
- return err;
+ switch (opts_type) {
+ case OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS:
+ err = validate_and_copy_geneve_opts(&key);
+ if (err < 0)
+ return err;
+ break;
+ case OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS:
+ break;
+ }
};
start = add_nested_action_start(sfa, OVS_ACTION_ATTR_SET, log);
diff --git a/net/openvswitch/vport-geneve.c b/net/openvswitch/vport-geneve.c
index 484864d..902ee4f 100644
--- a/net/openvswitch/vport-geneve.c
+++ b/net/openvswitch/vport-geneve.c
@@ -90,7 +90,7 @@ static void geneve_rcv(struct geneve_sock *gs, struct sk_buff *skb)
opts_len = geneveh->opt_len * 4;
- flags = TUNNEL_KEY | TUNNEL_OPTIONS_PRESENT |
+ flags = TUNNEL_KEY | TUNNEL_GENEVE_OPT |
(udp_hdr(skb)->check != 0 ? TUNNEL_CSUM : 0) |
(geneveh->oam ? TUNNEL_OAM : 0) |
(geneveh->critical ? TUNNEL_CRIT_OPT : 0);
diff --git a/net/openvswitch/vport-vxlan.c b/net/openvswitch/vport-vxlan.c
index 266c595..dbd6c75 100644
--- a/net/openvswitch/vport-vxlan.c
+++ b/net/openvswitch/vport-vxlan.c
@@ -40,6 +40,7 @@
#include "datapath.h"
#include "vport.h"
+#include "vport-vxlan.h"
/**
* struct vxlan_port - Keeps track of open UDP ports
@@ -49,6 +50,7 @@
struct vxlan_port {
struct vxlan_sock *vs;
char name[IFNAMSIZ];
+ u32 exts; /* VXLAN_EXT_* in <net/vxlan.h> */
};
static struct vport_ops ovs_vxlan_vport_ops;
@@ -63,16 +65,26 @@ static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb,
struct vxlan_metadata *md)
{
struct ovs_tunnel_info tun_info;
+ struct vxlan_port *vxlan_port;
struct vport *vport = vs->data;
struct iphdr *iph;
+ struct ovs_vxlan_opts opts = {
+ .gbp = md->gbp,
+ };
__be64 key;
+ __be16 flags;
+
+ flags = TUNNEL_KEY;
+ vxlan_port = vxlan_vport(vport);
+ if (vxlan_port->exts & VXLAN_EXT_GBP)
+ flags |= TUNNEL_VXLAN_OPT;
/* Save outer tunnel values */
iph = ip_hdr(skb);
key = cpu_to_be64(ntohl(md->vni) >> 8);
ovs_flow_tun_info_init(&tun_info, iph,
udp_hdr(skb)->source, udp_hdr(skb)->dest,
- key, TUNNEL_KEY, NULL, 0);
+ key, flags, &opts, sizeof(opts));
ovs_vport_receive(vport, skb, &tun_info);
}
@@ -84,6 +96,21 @@ static int vxlan_get_options(const struct vport *vport, struct sk_buff *skb)
if (nla_put_u16(skb, OVS_TUNNEL_ATTR_DST_PORT, ntohs(dst_port)))
return -EMSGSIZE;
+
+ if (vxlan_port->exts) {
+ struct nlattr *exts;
+
+ exts = nla_nest_start(skb, OVS_TUNNEL_ATTR_EXTENSION);
+ if (!exts)
+ return -EMSGSIZE;
+
+ if (vxlan_port->exts & VXLAN_EXT_GBP &&
+ nla_put_flag(skb, OVS_VXLAN_EXT_GBP))
+ return -EMSGSIZE;
+
+ nla_nest_end(skb, exts);
+ }
+
return 0;
}
@@ -96,6 +123,31 @@ static void vxlan_tnl_destroy(struct vport *vport)
ovs_vport_deferred_free(vport);
}
+static const struct nla_policy exts_policy[OVS_VXLAN_EXT_MAX+1] = {
+ [OVS_VXLAN_EXT_GBP] = { .type = NLA_FLAG, },
+};
+
+static int vxlan_configure_exts(struct vport *vport, struct nlattr *attr)
+{
+ struct nlattr *exts[OVS_VXLAN_EXT_MAX+1];
+ struct vxlan_port *vxlan_port;
+ int err;
+
+ if (nla_len(attr) < sizeof(struct nlattr))
+ return -EINVAL;
+
+ err = nla_parse_nested(exts, OVS_VXLAN_EXT_MAX, attr, exts_policy);
+ if (err < 0)
+ return err;
+
+ vxlan_port = vxlan_vport(vport);
+
+ if (exts[OVS_VXLAN_EXT_GBP])
+ vxlan_port->exts |= VXLAN_EXT_GBP;
+
+ return 0;
+}
+
static struct vport *vxlan_tnl_create(const struct vport_parms *parms)
{
struct net *net = ovs_dp_get_net(parms->dp);
@@ -128,7 +180,17 @@ static struct vport *vxlan_tnl_create(const struct vport_parms *parms)
vxlan_port = vxlan_vport(vport);
strncpy(vxlan_port->name, parms->name, IFNAMSIZ);
- vs = vxlan_sock_add(net, htons(dst_port), vxlan_rcv, vport, true, 0, 0);
+ a = nla_find_nested(options, OVS_TUNNEL_ATTR_EXTENSION);
+ if (a) {
+ err = vxlan_configure_exts(vport, a);
+ if (err) {
+ ovs_vport_free(vport);
+ goto error;
+ }
+ }
+
+ vs = vxlan_sock_add(net, htons(dst_port), vxlan_rcv, vport, true, 0,
+ vxlan_port->exts);
if (IS_ERR(vs)) {
ovs_vport_free(vport);
return (void *)vs;
@@ -141,6 +203,20 @@ error:
return ERR_PTR(err);
}
+static int vxlan_ext_gbp(struct sk_buff *skb)
+{
+ const struct ovs_tunnel_info *tun_info;
+ const struct ovs_vxlan_opts *opts;
+
+ tun_info = OVS_CB(skb)->egress_tun_info;
+ opts = tun_info->options;
+
+ if (tun_info->options_len >= sizeof(*opts))
+ return opts->gbp;
+ else
+ return 0;
+}
+
static int vxlan_tnl_send(struct vport *vport, struct sk_buff *skb)
{
struct net *net = ovs_dp_get_net(vport->dp);
@@ -181,6 +257,7 @@ static int vxlan_tnl_send(struct vport *vport, struct sk_buff *skb)
src_port = udp_flow_src_port(net, skb, 0, 0, true);
md.vni = htonl(be64_to_cpu(tun_key->tun_id) << 8);
+ md.gbp = vxlan_ext_gbp(skb);
err = vxlan_xmit_skb(vxlan_port->vs, rt, skb,
fl.saddr, tun_key->ipv4_dst,
diff --git a/net/openvswitch/vport-vxlan.h b/net/openvswitch/vport-vxlan.h
new file mode 100644
index 0000000..4b08233e
--- /dev/null
+++ b/net/openvswitch/vport-vxlan.h
@@ -0,0 +1,11 @@
+#ifndef VPORT_VXLAN_H
+#define VPORT_VXLAN_H 1
+
+#include <linux/kernel.h>
+#include <linux/types.h>
+
+struct ovs_vxlan_opts {
+ __u32 gbp;
+};
+
+#endif
--
1.9.3
^ permalink raw reply related
* [PATCH 5/6] openvswitch: Allow for any level of nesting in flow attributes
From: Thomas Graf @ 2015-01-08 22:47 UTC (permalink / raw)
To: davem, jesse, stephen, pshelar, therbert, alexei.starovoitov; +Cc: netdev, dev
In-Reply-To: <cover.1420756324.git.tgraf@suug.ch>
nlattr_set() is currently hardcoded to two levels of nesting. This change
introduces struct ovs_len_tbl to define minimal length requirements plus
next level nesting tables to traverse the key attributes to arbitrary depth.
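The table-driven traversal can be sketched outside the kernel as
follows. This is a simplified model, not the patch's implementation:
struct example_attr is a hypothetical in-memory tree, whereas real
Netlink attributes form a packed byte stream, and all example_* names
are made up for illustration. Only the len/next table layout mirrors
struct ovs_len_tbl.

```c
#include <stddef.h>
#include <string.h>

#define EXAMPLE_NESTED -1

/* Mirrors struct ovs_len_tbl: an expected length plus an optional
 * table describing the next nesting level.
 */
struct example_len_tbl {
	int len;
	const struct example_len_tbl *next;
};

/* Hypothetical attribute node; leaves carry data, nested attributes
 * carry children.
 */
struct example_attr {
	int type;
	unsigned char data[4];
	size_t data_len;
	struct example_attr *children;
	size_t n_children;
};

/* Set every leaf payload to val, descending through as many nesting
 * levels as the tables describe. Without a table for the current
 * level, each attribute is treated as a flat payload.
 */
static void example_attr_set(struct example_attr *attrs, size_t n,
			     unsigned char val,
			     const struct example_len_tbl *tbl)
{
	size_t i;

	for (i = 0; i < n; i++) {
		struct example_attr *a = &attrs[i];

		if (tbl && tbl[a->type].len == EXAMPLE_NESTED)
			example_attr_set(a->children, a->n_children, val,
					 tbl[a->type].next);
		else
			memset(a->data, val, a->data_len);
	}
}
```

Because each table entry names the table for the level below it, a
further nesting level only needs a new table; the traversal itself
stays unchanged.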
Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
v2:
- New patch to allow nested Netlink attributes inside
OVS_TUNNEL_KEY_ATTR_VXLAN_OPTS
net/openvswitch/flow_netlink.c | 106 ++++++++++++++++++++++-------------------
1 file changed, 56 insertions(+), 50 deletions(-)
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 8980d32..457ccf3 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -50,6 +50,13 @@
#include "flow_netlink.h"
+struct ovs_len_tbl {
+ int len;
+ const struct ovs_len_tbl *next;
+};
+
+#define OVS_ATTR_NESTED -1
+
static void update_range(struct sw_flow_match *match,
size_t offset, size_t size, bool is_mask)
{
@@ -289,29 +296,44 @@ size_t ovs_key_attr_size(void)
+ nla_total_size(28); /* OVS_KEY_ATTR_ND */
}
+static const struct ovs_len_tbl ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1] = {
+ [OVS_TUNNEL_KEY_ATTR_ID] = { .len = sizeof(u64) },
+ [OVS_TUNNEL_KEY_ATTR_IPV4_SRC] = { .len = sizeof(u32) },
+ [OVS_TUNNEL_KEY_ATTR_IPV4_DST] = { .len = sizeof(u32) },
+ [OVS_TUNNEL_KEY_ATTR_TOS] = { .len = 1 },
+ [OVS_TUNNEL_KEY_ATTR_TTL] = { .len = 1 },
+ [OVS_TUNNEL_KEY_ATTR_DONT_FRAGMENT] = { .len = 0 },
+ [OVS_TUNNEL_KEY_ATTR_CSUM] = { .len = 0 },
+ [OVS_TUNNEL_KEY_ATTR_TP_SRC] = { .len = sizeof(u16) },
+ [OVS_TUNNEL_KEY_ATTR_TP_DST] = { .len = sizeof(u16) },
+ [OVS_TUNNEL_KEY_ATTR_OAM] = { .len = 0 },
+ [OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS] = { .len = OVS_ATTR_NESTED },
+};
+
/* The size of the argument for each %OVS_KEY_ATTR_* Netlink attribute. */
-static const int ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
- [OVS_KEY_ATTR_ENCAP] = -1,
- [OVS_KEY_ATTR_PRIORITY] = sizeof(u32),
- [OVS_KEY_ATTR_IN_PORT] = sizeof(u32),
- [OVS_KEY_ATTR_SKB_MARK] = sizeof(u32),
- [OVS_KEY_ATTR_ETHERNET] = sizeof(struct ovs_key_ethernet),
- [OVS_KEY_ATTR_VLAN] = sizeof(__be16),
- [OVS_KEY_ATTR_ETHERTYPE] = sizeof(__be16),
- [OVS_KEY_ATTR_IPV4] = sizeof(struct ovs_key_ipv4),
- [OVS_KEY_ATTR_IPV6] = sizeof(struct ovs_key_ipv6),
- [OVS_KEY_ATTR_TCP] = sizeof(struct ovs_key_tcp),
- [OVS_KEY_ATTR_TCP_FLAGS] = sizeof(__be16),
- [OVS_KEY_ATTR_UDP] = sizeof(struct ovs_key_udp),
- [OVS_KEY_ATTR_SCTP] = sizeof(struct ovs_key_sctp),
- [OVS_KEY_ATTR_ICMP] = sizeof(struct ovs_key_icmp),
- [OVS_KEY_ATTR_ICMPV6] = sizeof(struct ovs_key_icmpv6),
- [OVS_KEY_ATTR_ARP] = sizeof(struct ovs_key_arp),
- [OVS_KEY_ATTR_ND] = sizeof(struct ovs_key_nd),
- [OVS_KEY_ATTR_RECIRC_ID] = sizeof(u32),
- [OVS_KEY_ATTR_DP_HASH] = sizeof(u32),
- [OVS_KEY_ATTR_TUNNEL] = -1,
- [OVS_KEY_ATTR_MPLS] = sizeof(struct ovs_key_mpls),
+static const struct ovs_len_tbl ovs_key_lens[OVS_KEY_ATTR_MAX + 1] = {
+ [OVS_KEY_ATTR_ENCAP] = { .len = OVS_ATTR_NESTED },
+ [OVS_KEY_ATTR_PRIORITY] = { .len = sizeof(u32) },
+ [OVS_KEY_ATTR_IN_PORT] = { .len = sizeof(u32) },
+ [OVS_KEY_ATTR_SKB_MARK] = { .len = sizeof(u32) },
+ [OVS_KEY_ATTR_ETHERNET] = { .len = sizeof(struct ovs_key_ethernet) },
+ [OVS_KEY_ATTR_VLAN] = { .len = sizeof(__be16) },
+ [OVS_KEY_ATTR_ETHERTYPE] = { .len = sizeof(__be16) },
+ [OVS_KEY_ATTR_IPV4] = { .len = sizeof(struct ovs_key_ipv4) },
+ [OVS_KEY_ATTR_IPV6] = { .len = sizeof(struct ovs_key_ipv6) },
+ [OVS_KEY_ATTR_TCP] = { .len = sizeof(struct ovs_key_tcp) },
+ [OVS_KEY_ATTR_TCP_FLAGS] = { .len = sizeof(__be16) },
+ [OVS_KEY_ATTR_UDP] = { .len = sizeof(struct ovs_key_udp) },
+ [OVS_KEY_ATTR_SCTP] = { .len = sizeof(struct ovs_key_sctp) },
+ [OVS_KEY_ATTR_ICMP] = { .len = sizeof(struct ovs_key_icmp) },
+ [OVS_KEY_ATTR_ICMPV6] = { .len = sizeof(struct ovs_key_icmpv6) },
+ [OVS_KEY_ATTR_ARP] = { .len = sizeof(struct ovs_key_arp) },
+ [OVS_KEY_ATTR_ND] = { .len = sizeof(struct ovs_key_nd) },
+ [OVS_KEY_ATTR_RECIRC_ID] = { .len = sizeof(u32) },
+ [OVS_KEY_ATTR_DP_HASH] = { .len = sizeof(u32) },
+ [OVS_KEY_ATTR_TUNNEL] = { .len = OVS_ATTR_NESTED,
+ .next = ovs_tunnel_key_lens, },
+ [OVS_KEY_ATTR_MPLS] = { .len = sizeof(struct ovs_key_mpls) },
};
static bool is_all_zero(const u8 *fp, size_t size)
@@ -352,8 +374,8 @@ static int __parse_flow_nlattrs(const struct nlattr *attr,
return -EINVAL;
}
- expected_len = ovs_key_lens[type];
- if (nla_len(nla) != expected_len && expected_len != -1) {
+ expected_len = ovs_key_lens[type].len;
+ if (nla_len(nla) != expected_len && expected_len != OVS_ATTR_NESTED) {
OVS_NLERR(log, "Key %d has unexpected len %d expected %d",
type, nla_len(nla), expected_len);
return -EINVAL;
@@ -451,30 +473,16 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
int type = nla_type(a);
int err;
- static const u32 ovs_tunnel_key_lens[OVS_TUNNEL_KEY_ATTR_MAX + 1] = {
- [OVS_TUNNEL_KEY_ATTR_ID] = sizeof(u64),
- [OVS_TUNNEL_KEY_ATTR_IPV4_SRC] = sizeof(u32),
- [OVS_TUNNEL_KEY_ATTR_IPV4_DST] = sizeof(u32),
- [OVS_TUNNEL_KEY_ATTR_TOS] = 1,
- [OVS_TUNNEL_KEY_ATTR_TTL] = 1,
- [OVS_TUNNEL_KEY_ATTR_DONT_FRAGMENT] = 0,
- [OVS_TUNNEL_KEY_ATTR_CSUM] = 0,
- [OVS_TUNNEL_KEY_ATTR_TP_SRC] = sizeof(u16),
- [OVS_TUNNEL_KEY_ATTR_TP_DST] = sizeof(u16),
- [OVS_TUNNEL_KEY_ATTR_OAM] = 0,
- [OVS_TUNNEL_KEY_ATTR_GENEVE_OPTS] = -1,
- };
-
if (type > OVS_TUNNEL_KEY_ATTR_MAX) {
OVS_NLERR(log, "Tunnel attr %d out of range max %d",
type, OVS_TUNNEL_KEY_ATTR_MAX);
return -EINVAL;
}
- if (ovs_tunnel_key_lens[type] != nla_len(a) &&
- ovs_tunnel_key_lens[type] != -1) {
+ if (ovs_tunnel_key_lens[type].len != nla_len(a) &&
+ ovs_tunnel_key_lens[type].len != OVS_ATTR_NESTED) {
OVS_NLERR(log, "Tunnel attr %d has unexpected len %d expected %d",
- type, nla_len(a), ovs_tunnel_key_lens[type]);
+ type, nla_len(a), ovs_tunnel_key_lens[type].len);
return -EINVAL;
}
@@ -912,18 +920,16 @@ static int ovs_key_from_nlattrs(struct sw_flow_match *match, u64 attrs,
return 0;
}
-static void nlattr_set(struct nlattr *attr, u8 val, bool is_attr_mask_key)
+static void nlattr_set(struct nlattr *attr, u8 val,
+ const struct ovs_len_tbl *tbl)
{
struct nlattr *nla;
int rem;
/* The nlattr stream should already have been validated */
nla_for_each_nested(nla, attr, rem) {
- /* We assume that ovs_key_lens[type] == -1 means that type is a
- * nested attribute
- */
- if (is_attr_mask_key && ovs_key_lens[nla_type(nla)] == -1)
- nlattr_set(nla, val, false);
+ if (tbl && tbl[nla_type(nla)].len == OVS_ATTR_NESTED)
+ nlattr_set(nla, val, tbl[nla_type(nla)].next);
else
memset(nla_data(nla), val, nla_len(nla));
}
@@ -931,7 +937,7 @@ static void nlattr_set(struct nlattr *attr, u8 val, bool is_attr_mask_key)
static void mask_set_nlattr(struct nlattr *attr, u8 val)
{
- nlattr_set(attr, val, true);
+ nlattr_set(attr, val, ovs_key_lens);
}
/**
@@ -1628,8 +1634,8 @@ static int validate_set(const struct nlattr *a,
return -EINVAL;
if (key_type > OVS_KEY_ATTR_MAX ||
- (ovs_key_lens[key_type] != nla_len(ovs_key) &&
- ovs_key_lens[key_type] != -1))
+ (ovs_key_lens[key_type].len != nla_len(ovs_key) &&
+ ovs_key_lens[key_type].len != OVS_ATTR_NESTED))
return -EINVAL;
switch (key_type) {
--
1.9.3
^ permalink raw reply related
* [PATCH 4/6] openvswitch: Rename GENEVE_TUN_OPTS() to TUN_METADATA_OPTS()
From: Thomas Graf @ 2015-01-08 22:47 UTC (permalink / raw)
To: davem, jesse, stephen, pshelar, therbert, alexei.starovoitov; +Cc: netdev, dev
In-Reply-To: <cover.1420756324.git.tgraf@suug.ch>
Also factors out Geneve validation code into a new separate function
validate_and_copy_geneve_opts().
A subsequent patch will introduce VXLAN options. Rename the existing
GENEVE_TUN_OPTS() to reflect its extended purpose of carrying generic
tunnel metadata options.
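The offset arithmetic behind the renamed macros can be sketched as
below. It is a userspace illustration only: struct example_flow_key
and its 255-byte option area are assumptions made for this sketch,
and the EXAMPLE_* macros merely imitate TUN_METADATA_OFFSET() and
TUN_METADATA_OPTS().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical flow key; the 255-byte size is an assumption. */
struct example_flow_key {
	unsigned char tun_opts[255];
	unsigned char tun_opts_len;
};

/* Options are right-aligned so that they end at the last byte of
 * tun_opts, independent of what kind of metadata they hold.
 */
#define EXAMPLE_METADATA_OFFSET(opt_len) \
	(sizeof(((struct example_flow_key *)0)->tun_opts) - (opt_len))
#define EXAMPLE_METADATA_OPTS(key, opt_len) \
	((void *)((key)->tun_opts + EXAMPLE_METADATA_OFFSET(opt_len)))

/* Copy opaque tunnel metadata (Geneve or VXLAN) into the key. */
static void example_store_opts(struct example_flow_key *key,
			       const void *opts, unsigned char len)
{
	memcpy(EXAMPLE_METADATA_OPTS(key, len), opts, len);
	key->tun_opts_len = len;
}
```

Returning a void pointer rather than a struct geneve_opt pointer is
what allows the same storage to be reused for other option formats.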
Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
v2:
- Don't rename genev_tun_opt_from_nlattr() and keep it Geneve specific,
pointed out by Jesse.
- Factor out Geneve specific validation code into separate function as
requested by Jesse.
net/openvswitch/flow.c | 2 +-
net/openvswitch/flow.h | 14 ++++----
net/openvswitch/flow_netlink.c | 72 +++++++++++++++++++++++-------------------
3 files changed, 47 insertions(+), 41 deletions(-)
diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index da2fae0..41f2dfd 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -691,7 +691,7 @@ int ovs_flow_key_extract(const struct ovs_tunnel_info *tun_info,
BUILD_BUG_ON((1 << (sizeof(tun_info->options_len) *
8)) - 1
> sizeof(key->tun_opts));
- memcpy(GENEVE_OPTS(key, tun_info->options_len),
+ memcpy(TUN_METADATA_OPTS(key, tun_info->options_len),
tun_info->options, tun_info->options_len);
key->tun_opts_len = tun_info->options_len;
} else {
diff --git a/net/openvswitch/flow.h b/net/openvswitch/flow.h
index a8b30f3..d3d0a40 100644
--- a/net/openvswitch/flow.h
+++ b/net/openvswitch/flow.h
@@ -53,7 +53,7 @@ struct ovs_key_ipv4_tunnel {
struct ovs_tunnel_info {
struct ovs_key_ipv4_tunnel tunnel;
- const struct geneve_opt *options;
+ const void *options;
u8 options_len;
};
@@ -61,10 +61,10 @@ struct ovs_tunnel_info {
* maximum size. This allows us to get the benefits of variable length
* matching for small options.
*/
-#define GENEVE_OPTS(flow_key, opt_len) \
- ((struct geneve_opt *)((flow_key)->tun_opts + \
- FIELD_SIZEOF(struct sw_flow_key, tun_opts) - \
- opt_len))
+#define TUN_METADATA_OFFSET(opt_len) \
+ (FIELD_SIZEOF(struct sw_flow_key, tun_opts) - opt_len)
+#define TUN_METADATA_OPTS(flow_key, opt_len) \
+ ((void *)((flow_key)->tun_opts + TUN_METADATA_OFFSET(opt_len)))
static inline void __ovs_flow_tun_info_init(struct ovs_tunnel_info *tun_info,
__be32 saddr, __be32 daddr,
@@ -73,7 +73,7 @@ static inline void __ovs_flow_tun_info_init(struct ovs_tunnel_info *tun_info,
__be16 tp_dst,
__be64 tun_id,
__be16 tun_flags,
- const struct geneve_opt *opts,
+ const void *opts,
u8 opts_len)
{
tun_info->tunnel.tun_id = tun_id;
@@ -105,7 +105,7 @@ static inline void ovs_flow_tun_info_init(struct ovs_tunnel_info *tun_info,
__be16 tp_dst,
__be64 tun_id,
__be16 tun_flags,
- const struct geneve_opt *opts,
+ const void *opts,
u8 opts_len)
{
__ovs_flow_tun_info_init(tun_info, iph->saddr, iph->daddr,
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index d1eecf7..8980d32 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -432,8 +432,7 @@ static int genev_tun_opt_from_nlattr(const struct nlattr *a,
SW_FLOW_KEY_PUT(match, tun_opts_len, 0xff, true);
}
- opt_key_offset = (unsigned long)GENEVE_OPTS((struct sw_flow_key *)0,
- nla_len(a));
+ opt_key_offset = TUN_METADATA_OFFSET(nla_len(a));
SW_FLOW_KEY_MEMCPY_OFFSET(match, opt_key_offset, nla_data(a),
nla_len(a), is_mask);
return 0;
@@ -558,8 +557,7 @@ static int ipv4_tun_from_nlattr(const struct nlattr *attr,
static int __ipv4_tun_to_nlattr(struct sk_buff *skb,
const struct ovs_key_ipv4_tunnel *output,
- const struct geneve_opt *tun_opts,
- int swkey_tun_opts_len)
+ const void *tun_opts, int swkey_tun_opts_len)
{
if (output->tun_flags & TUNNEL_KEY &&
nla_put_be64(skb, OVS_TUNNEL_KEY_ATTR_ID, output->tun_id))
@@ -600,8 +598,7 @@ static int __ipv4_tun_to_nlattr(struct sk_buff *skb,
static int ipv4_tun_to_nlattr(struct sk_buff *skb,
const struct ovs_key_ipv4_tunnel *output,
- const struct geneve_opt *tun_opts,
- int swkey_tun_opts_len)
+ const void *tun_opts, int swkey_tun_opts_len)
{
struct nlattr *nla;
int err;
@@ -1148,10 +1145,10 @@ int ovs_nla_put_flow(const struct sw_flow_key *swkey,
goto nla_put_failure;
if ((swkey->tun_key.ipv4_dst || is_mask)) {
- const struct geneve_opt *opts = NULL;
+ const void *opts = NULL;
if (output->tun_key.tun_flags & TUNNEL_OPTIONS_PRESENT)
- opts = GENEVE_OPTS(output, swkey->tun_opts_len);
+ opts = TUN_METADATA_OPTS(output, swkey->tun_opts_len);
if (ipv4_tun_to_nlattr(skb, &output->tun_key, opts,
swkey->tun_opts_len))
@@ -1540,6 +1537,34 @@ void ovs_match_init(struct sw_flow_match *match,
}
}
+static int validate_and_copy_geneve_opts(struct sw_flow_key *key)
+{
+ struct geneve_opt *option;
+ int opts_len = key->tun_opts_len;
+ bool crit_opt = false;
+
+ option = (struct geneve_opt *)TUN_METADATA_OPTS(key, key->tun_opts_len);
+ while (opts_len > 0) {
+ int len;
+
+ if (opts_len < sizeof(*option))
+ return -EINVAL;
+
+ len = sizeof(*option) + option->length * 4;
+ if (len > opts_len)
+ return -EINVAL;
+
+ crit_opt |= !!(option->type & GENEVE_CRIT_OPT_TYPE);
+
+ option = (struct geneve_opt *)((u8 *)option + len);
+ opts_len -= len;
+ };
+
+ key->tun_key.tun_flags |= crit_opt ? TUNNEL_CRIT_OPT : 0;
+
+ return 0;
+}
+
static int validate_and_copy_set_tun(const struct nlattr *attr,
struct sw_flow_actions **sfa, bool log)
{
@@ -1555,28 +1580,9 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
return err;
if (key.tun_opts_len) {
- struct geneve_opt *option = GENEVE_OPTS(&key,
- key.tun_opts_len);
- int opts_len = key.tun_opts_len;
- bool crit_opt = false;
-
- while (opts_len > 0) {
- int len;
-
- if (opts_len < sizeof(*option))
- return -EINVAL;
-
- len = sizeof(*option) + option->length * 4;
- if (len > opts_len)
- return -EINVAL;
-
- crit_opt |= !!(option->type & GENEVE_CRIT_OPT_TYPE);
-
- option = (struct geneve_opt *)((u8 *)option + len);
- opts_len -= len;
- };
-
- key.tun_key.tun_flags |= crit_opt ? TUNNEL_CRIT_OPT : 0;
+ err = validate_and_copy_geneve_opts(&key);
+ if (err < 0)
+ return err;
};
start = add_nested_action_start(sfa, OVS_ACTION_ATTR_SET, log);
@@ -1597,9 +1603,9 @@ static int validate_and_copy_set_tun(const struct nlattr *attr,
* everything else will go away after flow setup. We can append
* it to tun_info and then point there.
*/
- memcpy((tun_info + 1), GENEVE_OPTS(&key, key.tun_opts_len),
- key.tun_opts_len);
- tun_info->options = (struct geneve_opt *)(tun_info + 1);
+ memcpy((tun_info + 1),
+ TUN_METADATA_OPTS(&key, key.tun_opts_len), key.tun_opts_len);
+ tun_info->options = (tun_info + 1);
} else {
tun_info->options = NULL;
}
--
1.9.3
^ permalink raw reply related
* [PATCH 2/6] vxlan: Group Policy extension
From: Thomas Graf @ 2015-01-08 22:47 UTC (permalink / raw)
To: davem, jesse, stephen, pshelar, therbert, alexei.starovoitov; +Cc: netdev, dev
In-Reply-To: <cover.1420756324.git.tgraf@suug.ch>
Implements support for the Group Policy VXLAN extension [0] to provide
a lightweight and simple security label mechanism across network peers
based on VXLAN. The security context and associated metadata are mapped
to/from skb->mark. This allows further mapping to a SELinux context
using SECMARK, or implementing ACLs directly with nftables, iptables,
OVS, tc, etc.
The group membership is defined by the lower 16 bits of skb->mark, the
upper 16 bits are used for flags.
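The mark layout described above can be sketched as follows. The 16/16
split comes from this commit message; the example_* helpers and the
two flag bit positions are purely illustrative, since the actual
VXLAN_GBP_* flag values are defined in include/net/vxlan.h and are not
reproduced here.

```c
#include <stdint.h>

/* Lower 16 bits: group policy ID. */
#define EXAMPLE_GBP_ID_MASK 0x0000ffffu
/* Upper 16 bits: flags. These bit positions are hypothetical. */
#define EXAMPLE_GBP_FLAG_A  (1u << 16)
#define EXAMPLE_GBP_FLAG_B  (1u << 17)

/* Extract the group ID from a mark value. */
static uint16_t example_gbp_id(uint32_t mark)
{
	return (uint16_t)(mark & EXAMPLE_GBP_ID_MASK);
}

/* Extract the flag bits, shifted down into a 16-bit value. */
static uint16_t example_gbp_flags(uint32_t mark)
{
	return (uint16_t)(mark >> 16);
}

/* Combine a group ID with already-shifted flag bits. */
static uint32_t example_gbp_mark(uint16_t id, uint32_t flags)
{
	return (flags & ~EXAMPLE_GBP_ID_MASK) | id;
}
```

A mark of 0x200 with no flags set, as used in the iptables example of
this message, therefore denotes group 0x200.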
SELinux allows managing labels to secure local resources. However,
distributed applications require ACLs to be implemented across hosts.
This is typically achieved by matching on L2-L4 fields to identify the
original sending host and process on the receiver. On top of that,
netlabel and specifically CIPSO [1] allow mapping security contexts to
universal labels. However, netlabel and CIPSO are relatively complex.
This patch provides a lightweight alternative for overlay network
environments with a trusted underlay. No additional control protocol
is required.
Host 1: Host 2:
Group A Group B Group B Group A
+-----+ +-------------+ +-------+ +-----+
| lxc | | SELinux CTX | | httpd | | VM |
+--+--+ +--+----------+ +---+---+ +--+--+
\---+---/ \----+---/
| |
+---+---+ +---+---+
| vxlan | | vxlan |
+---+---+ +---+---+
+------------------------------+
Backwards compatibility:
A VXLAN-GBP socket can receive standard VXLAN frames and will assign
the default group 0x0000 to such frames. A Linux VXLAN socket will
drop VXLAN-GBP frames. The extension is therefore disabled by default
and needs to be specifically enabled:
ip link add [...] type vxlan [...] gbp
In a mixed environment with VXLAN and VXLAN-GBP sockets, the GBP socket
must run on a separate port number.
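The compatibility rules above can be sketched as a small decision
function. This is a simplified model under stated assumptions:
EXAMPLE_EXT_GBP and example_rx_group() are hypothetical names, and the
real receive path additionally validates the remaining header bits.

```c
#include <stdint.h>
#include <errno.h>

#define EXAMPLE_EXT_GBP 0x1u /* hypothetical extension bit */

/* Decide which group a received frame is assigned to.
 * sock_exts:   extensions enabled on the receiving socket
 * gbp_present: whether the frame carries the GBP header bit
 * policy_id:   group ID from the frame (ignored if !gbp_present)
 * Returns 0 and stores the group, or -EINVAL if the frame is dropped.
 */
static int example_rx_group(uint32_t sock_exts, int gbp_present,
			    uint16_t policy_id, uint32_t *group)
{
	*group = 0; /* default group 0x0000 for standard VXLAN frames */

	if (!gbp_present)
		return 0;

	/* A socket without the extension rejects GBP frames. */
	if (!(sock_exts & EXAMPLE_EXT_GBP))
		return -EINVAL;

	*group = policy_id;
	return 0;
}
```

The asymmetry is visible in the two early returns: a GBP socket
accepts both frame types, while a plain socket accepts only standard
frames.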
Examples:
iptables:
host1# iptables -I OUTPUT -m owner --uid-owner 101 -j MARK --set-mark 0x200
host2# iptables -I INPUT -m mark --mark 0x200 -j DROP
OVS:
# ovs-ofctl add-flow br0 'in_port=1,actions=load:0x200->NXM_NX_TUN_GBP_ID[],NORMAL'
# ovs-ofctl add-flow br0 'in_port=2,tun_gbp_id=0x200,actions=drop'
[0] https://tools.ietf.org/html/draft-smith-vxlan-group-policy
[1] http://lwn.net/Articles/204905/
Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
v2:
- split GBP header definition into separate struct vxlanhdr_gbp as requested
by Alexei
drivers/net/vxlan.c | 161 ++++++++++++++++++++++++++++++------------
include/net/vxlan.h | 73 +++++++++++++++++--
include/uapi/linux/if_link.h | 8 +++
net/openvswitch/vport-vxlan.c | 9 ++-
4 files changed, 198 insertions(+), 53 deletions(-)
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 4d52aa9..b148739 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -132,6 +132,7 @@ struct vxlan_dev {
__u8 tos; /* TOS override */
__u8 ttl;
u32 flags; /* VXLAN_F_* in vxlan.h */
+ u32 exts; /* Enabled extensions */
struct work_struct sock_work;
struct work_struct igmp_join;
@@ -568,7 +569,8 @@ static struct sk_buff **vxlan_gro_receive(struct sk_buff **head, struct sk_buff
continue;
vh2 = (struct vxlanhdr *)(p->data + off_vx);
- if (vh->vx_vni != vh2->vx_vni) {
+ if (vh->vx_flags != vh2->vx_flags ||
+ vh->vx_vni != vh2->vx_vni) {
NAPI_GRO_CB(p)->same_flow = 0;
continue;
}
@@ -1095,6 +1097,7 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
{
struct vxlan_sock *vs;
struct vxlanhdr *vxh;
+ struct vxlan_metadata md = {0};
/* Need Vxlan and inner Ethernet header to be present */
if (!pskb_may_pull(skb, VXLAN_HLEN))
@@ -1113,6 +1116,22 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
if (vs->exts) {
if (!vxh->vni_present)
goto error_invalid_header;
+
+ if (vxh->gbp_present) {
+ struct vxlanhdr_gbp *gbp;
+
+ if (!(vs->exts & VXLAN_EXT_GBP))
+ goto error_invalid_header;
+
+ gbp = (struct vxlanhdr_gbp *)vxh;
+ md.gbp = ntohs(gbp->policy_id);
+
+ if (gbp->dont_learn)
+ md.gbp |= VXLAN_GBP_DONT_LEARN;
+
+ if (gbp->policy_applied)
+ md.gbp |= VXLAN_GBP_POLICY_APPLIED;
+ }
} else {
if (vxh->vx_flags != htonl(VXLAN_FLAGS) ||
(vxh->vx_vni & htonl(0xff)))
@@ -1122,7 +1141,8 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
if (iptunnel_pull_header(skb, VXLAN_HLEN, htons(ETH_P_TEB)))
goto drop;
- vs->rcv(vs, skb, vxh->vx_vni);
+ md.vni = vxh->vx_vni;
+ vs->rcv(vs, skb, &md);
return 0;
drop:
@@ -1138,8 +1158,8 @@ error:
return 1;
}
-static void vxlan_rcv(struct vxlan_sock *vs,
- struct sk_buff *skb, __be32 vx_vni)
+static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb,
+ struct vxlan_metadata *md)
{
struct iphdr *oip = NULL;
struct ipv6hdr *oip6 = NULL;
@@ -1150,7 +1170,7 @@ static void vxlan_rcv(struct vxlan_sock *vs,
int err = 0;
union vxlan_addr *remote_ip;
- vni = ntohl(vx_vni) >> 8;
+ vni = ntohl(md->vni) >> 8;
/* Is this VNI defined? */
vxlan = vxlan_vs_find_vni(vs, vni);
if (!vxlan)
@@ -1184,6 +1204,7 @@ static void vxlan_rcv(struct vxlan_sock *vs,
goto drop;
skb_reset_network_header(skb);
+ skb->mark = md->gbp;
if (oip6)
err = IP6_ECN_decapsulate(oip6, skb);
@@ -1533,15 +1554,57 @@ static bool route_shortcircuit(struct net_device *dev, struct sk_buff *skb)
return false;
}
+static int vxlan_build_hdr(struct sk_buff *skb, struct vxlan_sock *vs,
+ int min_headroom, struct vxlan_metadata *md)
+{
+ struct vxlanhdr *vxh;
+ int err;
+
+ /* Need space for new headers (invalidates iph ptr) */
+ err = skb_cow_head(skb, min_headroom);
+ if (unlikely(err)) {
+ kfree_skb(skb);
+ return err;
+ }
+
+ skb = vlan_hwaccel_push_inside(skb);
+ if (WARN_ON(!skb))
+ return -ENOMEM;
+
+ vxh = (struct vxlanhdr *)__skb_push(skb, sizeof(*vxh));
+ vxh->vx_flags = htonl(VXLAN_FLAGS);
+ vxh->vx_vni = md->vni;
+
+ if (vs->exts) {
+ if (vs->exts & VXLAN_EXT_GBP) {
+ struct vxlanhdr_gbp *gbp;
+
+ gbp = (struct vxlanhdr_gbp *)vxh;
+ vxh->gbp_present = 1;
+
+ if (md->gbp & VXLAN_GBP_DONT_LEARN)
+ gbp->dont_learn = 1;
+
+ if (md->gbp & VXLAN_GBP_POLICY_APPLIED)
+ gbp->policy_applied = 1;
+
+ gbp->policy_id = htons(md->gbp & VXLAN_GBP_ID_MASK);
+ }
+ }
+
+ skb_set_inner_protocol(skb, htons(ETH_P_TEB));
+
+ return 0;
+}
+
#if IS_ENABLED(CONFIG_IPV6)
static int vxlan6_xmit_skb(struct vxlan_sock *vs,
struct dst_entry *dst, struct sk_buff *skb,
struct net_device *dev, struct in6_addr *saddr,
struct in6_addr *daddr, __u8 prio, __u8 ttl,
- __be16 src_port, __be16 dst_port, __be32 vni,
- bool xnet)
+ __be16 src_port, __be16 dst_port,
+ struct vxlan_metadata *md, bool xnet)
{
- struct vxlanhdr *vxh;
int min_headroom;
int err;
bool udp_sum = !udp_get_no_check6_tx(vs->sock->sk);
@@ -1558,24 +1621,9 @@ static int vxlan6_xmit_skb(struct vxlan_sock *vs,
+ VXLAN_HLEN + sizeof(struct ipv6hdr)
+ (vlan_tx_tag_present(skb) ? VLAN_HLEN : 0);
- /* Need space for new headers (invalidates iph ptr) */
- err = skb_cow_head(skb, min_headroom);
- if (unlikely(err)) {
- kfree_skb(skb);
- goto err;
- }
-
- skb = vlan_hwaccel_push_inside(skb);
- if (WARN_ON(!skb)) {
- err = -ENOMEM;
+ err = vxlan_build_hdr(skb, vs, min_headroom, md);
+ if (err)
goto err;
- }
-
- vxh = (struct vxlanhdr *) __skb_push(skb, sizeof(*vxh));
- vxh->vx_flags = htonl(VXLAN_FLAGS);
- vxh->vx_vni = vni;
-
- skb_set_inner_protocol(skb, htons(ETH_P_TEB));
udp_tunnel6_xmit_skb(vs->sock, dst, skb, dev, saddr, daddr, prio,
ttl, src_port, dst_port);
@@ -1589,9 +1637,9 @@ err:
int vxlan_xmit_skb(struct vxlan_sock *vs,
struct rtable *rt, struct sk_buff *skb,
__be32 src, __be32 dst, __u8 tos, __u8 ttl, __be16 df,
- __be16 src_port, __be16 dst_port, __be32 vni, bool xnet)
+ __be16 src_port, __be16 dst_port,
+ struct vxlan_metadata *md, bool xnet)
{
- struct vxlanhdr *vxh;
int min_headroom;
int err;
bool udp_sum = !vs->sock->sk->sk_no_check_tx;
@@ -1604,22 +1652,9 @@ int vxlan_xmit_skb(struct vxlan_sock *vs,
+ VXLAN_HLEN + sizeof(struct iphdr)
+ (vlan_tx_tag_present(skb) ? VLAN_HLEN : 0);
- /* Need space for new headers (invalidates iph ptr) */
- err = skb_cow_head(skb, min_headroom);
- if (unlikely(err)) {
- kfree_skb(skb);
+ err = vxlan_build_hdr(skb, vs, min_headroom, md);
+ if (err)
return err;
- }
-
- skb = vlan_hwaccel_push_inside(skb);
- if (WARN_ON(!skb))
- return -ENOMEM;
-
- vxh = (struct vxlanhdr *) __skb_push(skb, sizeof(*vxh));
- vxh->vx_flags = htonl(VXLAN_FLAGS);
- vxh->vx_vni = vni;
-
- skb_set_inner_protocol(skb, htons(ETH_P_TEB));
return udp_tunnel_xmit_skb(vs->sock, rt, skb, src, dst, tos,
ttl, df, src_port, dst_port, xnet);
@@ -1679,6 +1714,7 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
const struct iphdr *old_iph;
struct flowi4 fl4;
union vxlan_addr *dst;
+ struct vxlan_metadata md;
__be16 src_port = 0, dst_port;
u32 vni;
__be16 df = 0;
@@ -1749,11 +1785,12 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
tos = ip_tunnel_ecn_encap(tos, old_iph, skb);
ttl = ttl ? : ip4_dst_hoplimit(&rt->dst);
+ md.vni = htonl(vni << 8);
+ md.gbp = skb->mark;
err = vxlan_xmit_skb(vxlan->vn_sock, rt, skb,
fl4.saddr, dst->sin.sin_addr.s_addr,
- tos, ttl, df, src_port, dst_port,
- htonl(vni << 8),
+ tos, ttl, df, src_port, dst_port, &md,
!net_eq(vxlan->net, dev_net(vxlan->dev)));
if (err < 0) {
/* skb is already freed. */
@@ -1806,10 +1843,12 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
}
ttl = ttl ? : ip6_dst_hoplimit(ndst);
+ md.vni = htonl(vni << 8);
+ md.gbp = skb->mark;
err = vxlan6_xmit_skb(vxlan->vn_sock, ndst, skb,
dev, &fl6.saddr, &fl6.daddr, 0, ttl,
- src_port, dst_port, htonl(vni << 8),
+ src_port, dst_port, &md,
!net_eq(vxlan->net, dev_net(vxlan->dev)));
#endif
}
@@ -2210,6 +2249,11 @@ static const struct nla_policy vxlan_policy[IFLA_VXLAN_MAX + 1] = {
[IFLA_VXLAN_UDP_CSUM] = { .type = NLA_U8 },
[IFLA_VXLAN_UDP_ZERO_CSUM6_TX] = { .type = NLA_U8 },
[IFLA_VXLAN_UDP_ZERO_CSUM6_RX] = { .type = NLA_U8 },
+ [IFLA_VXLAN_EXTENSION] = { .type = NLA_NESTED },
+};
+
+static const struct nla_policy vxlan_ext_policy[IFLA_VXLAN_EXT_MAX + 1] = {
+ [IFLA_VXLAN_EXT_GBP] = { .type = NLA_FLAG, },
};
static int vxlan_validate(struct nlattr *tb[], struct nlattr *data[])
@@ -2246,6 +2290,18 @@ static int vxlan_validate(struct nlattr *tb[], struct nlattr *data[])
}
}
+ if (data[IFLA_VXLAN_EXTENSION]) {
+ int err;
+
+ err = nla_validate_nested(data[IFLA_VXLAN_EXTENSION],
+ IFLA_VXLAN_EXT_MAX, vxlan_ext_policy);
+ if (err < 0) {
+ pr_debug("invalid VXLAN extension configuration: %d\n",
+ err);
+ return -EINVAL;
+ }
+ }
+
return 0;
}
@@ -2400,6 +2456,18 @@ static void vxlan_sock_work(struct work_struct *work)
dev_put(vxlan->dev);
}
+static void configure_vxlan_exts(struct vxlan_dev *vxlan, struct nlattr *attr)
+{
+ struct nlattr *exts[IFLA_VXLAN_EXT_MAX+1];
+
+ /* Validated in vxlan_validate() */
+ if (nla_parse_nested(exts, IFLA_VXLAN_EXT_MAX, attr, NULL) < 0)
+ BUG();
+
+ if (exts[IFLA_VXLAN_EXT_GBP])
+ vxlan->exts |= VXLAN_EXT_GBP;
+}
+
static int vxlan_newlink(struct net *net, struct net_device *dev,
struct nlattr *tb[], struct nlattr *data[])
{
@@ -2525,6 +2593,9 @@ static int vxlan_newlink(struct net *net, struct net_device *dev,
nla_get_u8(data[IFLA_VXLAN_UDP_ZERO_CSUM6_RX]))
vxlan->flags |= VXLAN_F_UDP_ZERO_CSUM6_RX;
+ if (data[IFLA_VXLAN_EXTENSION])
+ configure_vxlan_exts(vxlan, data[IFLA_VXLAN_EXTENSION]);
+
if (vxlan_find_vni(net, vni, use_ipv6 ? AF_INET6 : AF_INET,
vxlan->dst_port)) {
pr_info("duplicate VNI %u\n", vni);
diff --git a/include/net/vxlan.h b/include/net/vxlan.h
index 3e98d31..af0526b 100644
--- a/include/net/vxlan.h
+++ b/include/net/vxlan.h
@@ -11,13 +11,65 @@
#define VNI_HASH_BITS 10
#define VNI_HASH_SIZE (1<<VNI_HASH_BITS)
+/*
+ * VXLAN Group Based Policy Extension:
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |1|-|-|-|1|-|-|-|R|D|R|R|A|R|R|R| Group Policy ID |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * | VXLAN Network Identifier (VNI) | Reserved |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ *
+ * D = Don't Learn bit. When set, this bit indicates that the egress
+ * VTEP MUST NOT learn the source address of the encapsulated frame.
+ *
+ * A = Indicates that the group policy has already been applied to
+ * this packet. Policies MUST NOT be applied by devices when the
+ * A bit is set.
+ *
+ * [0] https://tools.ietf.org/html/draft-smith-vxlan-group-policy
+ */
+struct vxlanhdr_gbp {
+ __u8 vx_flags;
+#ifdef __LITTLE_ENDIAN_BITFIELD
+ __u8 reserved_flags1:3,
+ policy_applied:1,
+ reserved_flags2:2,
+ dont_learn:1,
+ reserved_flags3:1;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+ __u8 reserved_flags1:1,
+ dont_learn:1,
+ reserved_flags2:2,
+ policy_applied:1,
+ reserved_flags3:3;
+#else
+#error "Please fix <asm/byteorder.h>"
+#endif
+ __be16 policy_id;
+ __be32 vx_vni;
+};
+
+struct vxlan_gbp {
+} __packed;
+
+/* skb->mark mapping
+ *
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |R|R|R|R|R|R|R|R|R|D|R|R|A|R|R|R| Group Policy ID |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ */
+#define VXLAN_GBP_DONT_LEARN (BIT(6) << 16)
+#define VXLAN_GBP_POLICY_APPLIED (BIT(3) << 16)
+#define VXLAN_GBP_ID_MASK (0xFFFF)
+
/* VXLAN protocol header:
* +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
- * |R|R|R|R|I|R|R|R| Reserved |
+ * |G|R|R|R|I|R|R|R| Reserved |
* +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
* | VXLAN Network Identifier (VNI) | Reserved |
* +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*
+ * G = 1 Group Policy (VXLAN-GBP)
* I = 1 VXLAN Network Identifier (VNI) present
*/
struct vxlanhdr {
@@ -26,9 +78,11 @@ struct vxlanhdr {
#ifdef __LITTLE_ENDIAN_BITFIELD
__u8 reserved_flags1:3,
vni_present:1,
- reserved_flags2:4;
+ reserved_flags2:3,
+ gbp_present:1;
#elif defined(__BIG_ENDIAN_BITFIELD)
- __u8 reserved_flags2:4,
+ __u8 gbp_present:1,
+ reserved_flags2:3,
vni_present:1,
reserved_flags1:3;
#else
@@ -42,8 +96,16 @@ struct vxlanhdr {
__be32 vx_vni;
};
+struct vxlan_metadata {
+ __be32 vni;
+ u32 gbp;
+};
+
struct vxlan_sock;
-typedef void (vxlan_rcv_t)(struct vxlan_sock *vh, struct sk_buff *skb, __be32 key);
+typedef void (vxlan_rcv_t)(struct vxlan_sock *vh, struct sk_buff *skb,
+ struct vxlan_metadata *md);
+
+#define VXLAN_EXT_GBP BIT(0)
/* per UDP socket information */
struct vxlan_sock {
@@ -78,7 +140,8 @@ void vxlan_sock_release(struct vxlan_sock *vs);
int vxlan_xmit_skb(struct vxlan_sock *vs,
struct rtable *rt, struct sk_buff *skb,
__be32 src, __be32 dst, __u8 tos, __u8 ttl, __be16 df,
- __be16 src_port, __be16 dst_port, __be32 vni, bool xnet);
+ __be16 src_port, __be16 dst_port, struct vxlan_metadata *md,
+ bool xnet);
static inline netdev_features_t vxlan_features_check(struct sk_buff *skb,
netdev_features_t features)
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index f7d0d2d..9f07bf5 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -370,10 +370,18 @@ enum {
IFLA_VXLAN_UDP_CSUM,
IFLA_VXLAN_UDP_ZERO_CSUM6_TX,
IFLA_VXLAN_UDP_ZERO_CSUM6_RX,
+ IFLA_VXLAN_EXTENSION,
__IFLA_VXLAN_MAX
};
#define IFLA_VXLAN_MAX (__IFLA_VXLAN_MAX - 1)
+enum {
+ IFLA_VXLAN_EXT_UNSPEC,
+ IFLA_VXLAN_EXT_GBP,
+ __IFLA_VXLAN_EXT_MAX,
+};
+#define IFLA_VXLAN_EXT_MAX (__IFLA_VXLAN_EXT_MAX - 1)
+
struct ifla_vxlan_port_range {
__be16 low;
__be16 high;
diff --git a/net/openvswitch/vport-vxlan.c b/net/openvswitch/vport-vxlan.c
index d7c46b3..dd68c97 100644
--- a/net/openvswitch/vport-vxlan.c
+++ b/net/openvswitch/vport-vxlan.c
@@ -59,7 +59,8 @@ static inline struct vxlan_port *vxlan_vport(const struct vport *vport)
}
/* Called with rcu_read_lock and BH disabled. */
-static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb, __be32 vx_vni)
+static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb,
+ struct vxlan_metadata *md)
{
struct ovs_tunnel_info tun_info;
struct vport *vport = vs->data;
@@ -68,7 +69,7 @@ static void vxlan_rcv(struct vxlan_sock *vs, struct sk_buff *skb, __be32 vx_vni)
/* Save outer tunnel values */
iph = ip_hdr(skb);
- key = cpu_to_be64(ntohl(vx_vni) >> 8);
+ key = cpu_to_be64(ntohl(md->vni) >> 8);
ovs_flow_tun_info_init(&tun_info, iph,
udp_hdr(skb)->source, udp_hdr(skb)->dest,
key, TUNNEL_KEY, NULL, 0);
@@ -146,6 +147,7 @@ static int vxlan_tnl_send(struct vport *vport, struct sk_buff *skb)
struct vxlan_port *vxlan_port = vxlan_vport(vport);
__be16 dst_port = inet_sk(vxlan_port->vs->sock->sk)->inet_sport;
struct ovs_key_ipv4_tunnel *tun_key;
+ struct vxlan_metadata md;
struct rtable *rt;
struct flowi4 fl;
__be16 src_port;
@@ -178,12 +180,13 @@ static int vxlan_tnl_send(struct vport *vport, struct sk_buff *skb)
skb->ignore_df = 1;
src_port = udp_flow_src_port(net, skb, 0, 0, true);
+ md.vni = htonl(be64_to_cpu(tun_key->tun_id) << 8);
err = vxlan_xmit_skb(vxlan_port->vs, rt, skb,
fl.saddr, tun_key->ipv4_dst,
tun_key->ipv4_tos, tun_key->ipv4_ttl, df,
src_port, dst_port,
- htonl(be64_to_cpu(tun_key->tun_id) << 8),
+ &md,
false);
if (err < 0)
ip_rt_put(rt);
--
1.9.3
^ permalink raw reply related
* [PATCH 1/6] vxlan: Allow for VXLAN extensions to be implemented
From: Thomas Graf @ 2015-01-08 22:47 UTC (permalink / raw)
To: davem, jesse, stephen, pshelar, therbert, alexei.starovoitov; +Cc: netdev, dev
In-Reply-To: <cover.1420756324.git.tgraf@suug.ch>
The VXLAN receive code is currently conservative in what it accepts and
will reject any frame that uses any of the reserved VXLAN protocol fields.
The VXLAN draft specifies that "reserved fields MUST be set to zero on
transmit and ignored on receive."
Retain the current conservative parsing behaviour by default, but allow
these fields to be used by VXLAN extensions which are explicitly enabled
on the VXLAN socket or the VXLAN net_device.
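The resulting receive-side decision can be sketched in userspace C as follows. This is only an approximation of the check in vxlan_udp_encap_recv() after this patch: both header words are assumed to be already converted to host byte order, and the function name and the VXLAN_FLAG_I constant are made up for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define VXLAN_FLAGS  UINT32_C(0x08000000) /* default flags word: only I set */
#define VXLAN_FLAG_I UINT32_C(0x08000000) /* hypothetical name for the I bit */

/* flags and vni_word are the two 32-bit VXLAN header words, host order.
 * exts is the bitmask of extensions enabled on the socket.
 */
static bool vxlan_header_acceptable(uint32_t flags, uint32_t vni_word,
				    uint32_t exts)
{
	/* Extensions enabled: reserved fields may be in use, but the
	 * VNI must still be flagged as present.
	 */
	if (exts)
		return (flags & VXLAN_FLAG_I) != 0;

	/* Default: strict parsing, every reserved bit must be zero. */
	return flags == VXLAN_FLAGS && (vni_word & 0xff) == 0;
}
```

A frame with flags word 0x88000200 is therefore dropped by a socket without extensions and accepted by a socket with any extension enabled. Later patches in the series add per-extension checks on top of this.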
Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
v2:
- No change
drivers/net/vxlan.c | 29 +++++++++++++++++++----------
include/net/vxlan.h | 32 +++++++++++++++++++++++++++++---
2 files changed, 48 insertions(+), 13 deletions(-)
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index 2ab0922..4d52aa9 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -65,7 +65,7 @@
#define VXLAN_VID_MASK (VXLAN_N_VID - 1)
#define VXLAN_HLEN (sizeof(struct udphdr) + sizeof(struct vxlanhdr))
-#define VXLAN_FLAGS 0x08000000 /* struct vxlanhdr.vx_flags required value. */
+#define VXLAN_FLAGS 0x08000000 /* struct vxlanhdr.vx_flags default value. */
/* UDP port for VXLAN traffic.
* The IANA assigned port is 4789, but the Linux default is 8472
@@ -1100,22 +1100,28 @@ static int vxlan_udp_encap_recv(struct sock *sk, struct sk_buff *skb)
if (!pskb_may_pull(skb, VXLAN_HLEN))
goto error;
+ vs = rcu_dereference_sk_user_data(sk);
+ if (!vs)
+ goto drop;
+
/* Return packets with reserved bits set */
vxh = (struct vxlanhdr *)(udp_hdr(skb) + 1);
- if (vxh->vx_flags != htonl(VXLAN_FLAGS) ||
- (vxh->vx_vni & htonl(0xff))) {
- netdev_dbg(skb->dev, "invalid vxlan flags=%#x vni=%#x\n",
- ntohl(vxh->vx_flags), ntohl(vxh->vx_vni));
- goto error;
+
+ /* For backwards compatibility, only allow reserved fields to be
+ * used by VXLAN extensions if explicitly requested.
+ */
+ if (vs->exts) {
+ if (!vxh->vni_present)
+ goto error_invalid_header;
+ } else {
+ if (vxh->vx_flags != htonl(VXLAN_FLAGS) ||
+ (vxh->vx_vni & htonl(0xff)))
+ goto error_invalid_header;
}
if (iptunnel_pull_header(skb, VXLAN_HLEN, htons(ETH_P_TEB)))
goto drop;
- vs = rcu_dereference_sk_user_data(sk);
- if (!vs)
- goto drop;
-
vs->rcv(vs, skb, vxh->vx_vni);
return 0;
@@ -1124,6 +1130,9 @@ drop:
kfree_skb(skb);
return 0;
+error_invalid_header:
+ netdev_dbg(skb->dev, "invalid vxlan flags=%#x vni=%#x\n",
+ ntohl(vxh->vx_flags), ntohl(vxh->vx_vni));
error:
/* Return non vxlan pkt */
return 1;
diff --git a/include/net/vxlan.h b/include/net/vxlan.h
index 903461a..3e98d31 100644
--- a/include/net/vxlan.h
+++ b/include/net/vxlan.h
@@ -11,10 +11,35 @@
#define VNI_HASH_BITS 10
#define VNI_HASH_SIZE (1<<VNI_HASH_BITS)
-/* VXLAN protocol header */
+/* VXLAN protocol header:
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * |R|R|R|R|I|R|R|R| Reserved |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * | VXLAN Network Identifier (VNI) | Reserved |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ *
+ * I = 1 VXLAN Network Identifier (VNI) present
+ */
struct vxlanhdr {
- __be32 vx_flags;
- __be32 vx_vni;
+ union {
+ struct {
+#ifdef __LITTLE_ENDIAN_BITFIELD
+ __u8 reserved_flags1:3,
+ vni_present:1,
+ reserved_flags2:4;
+#elif defined(__BIG_ENDIAN_BITFIELD)
+ __u8 reserved_flags2:4,
+ vni_present:1,
+ reserved_flags1:3;
+#else
+#error "Please fix <asm/byteorder.h>"
+#endif
+ __u8 vx_reserved1;
+ __be16 vx_reserved2;
+ };
+ __be32 vx_flags;
+ };
+ __be32 vx_vni;
};
struct vxlan_sock;
@@ -25,6 +50,7 @@ struct vxlan_sock {
struct hlist_node hlist;
vxlan_rcv_t *rcv;
void *data;
+ u32 exts;
struct work_struct del_work;
struct socket *sock;
struct rcu_head rcu;
--
1.9.3
^ permalink raw reply related
* [PATCH 0/6 net-next v2] VXLAN Group Policy Extension
From: Thomas Graf @ 2015-01-08 22:47 UTC (permalink / raw)
To: davem, jesse, stephen, pshelar, therbert, alexei.starovoitov; +Cc: netdev, dev
Implements support for the Group Policy VXLAN extension [0] to provide
a lightweight and simple security label mechanism across network peers
based on VXLAN. The security context and associated metadata are mapped
to/from skb->mark. This allows further mapping to a SELinux context
using SECMARK, or implementing ACLs directly with nftables, iptables,
OVS, tc, etc.
The extension is disabled by default and should be run on a distinct
port in mixed Linux VXLAN VTEP environments. Liberal VXLAN VTEPs
which ignore unknown reserved bits will be able to receive VXLAN-GBP
frames.
Simple usage example:
10.1.1.1:
# ip link add vxlan0 type vxlan id 10 remote 10.1.1.2 gbp
# iptables -I OUTPUT -m owner --uid-owner 101 -j MARK --set-mark 0x200
10.1.1.2:
# ip link add vxlan0 type vxlan id 10 remote 10.1.1.1 gbp
# iptables -I INPUT -m mark --mark 0x200 -j DROP
iproute2 [1] and OVS [2] support will be provided in separate patches.
[0] https://tools.ietf.org/html/draft-smith-vxlan-group-policy
[1] https://github.com/tgraf/iproute2/tree/vxlan-gbp
[2] https://github.com/tgraf/ovs/tree/vxlan-gbp
Thomas Graf (6):
vxlan: Allow for VXLAN extensions to be implemented
vxlan: Group Policy extension
vxlan: Only bind to sockets with correct extensions enabled
openvswitch: Rename GENEVE_TUN_OPTS() to TUN_METADATA_OPTS()
openvswitch: Allow for any level of nesting in flow attributes
openvswitch: Support VXLAN Group Policy extension
drivers/net/vxlan.c | 225 ++++++++++++++++++++----------
include/net/ip_tunnels.h | 5 +-
include/net/vxlan.h | 101 +++++++++++++-
include/uapi/linux/if_link.h | 8 ++
include/uapi/linux/openvswitch.h | 11 ++
net/openvswitch/flow.c | 2 +-
net/openvswitch/flow.h | 14 +-
net/openvswitch/flow_netlink.c | 286 ++++++++++++++++++++++++++-------------
net/openvswitch/vport-geneve.c | 2 +-
net/openvswitch/vport-vxlan.c | 90 +++++++++++-
net/openvswitch/vport-vxlan.h | 11 ++
11 files changed, 572 insertions(+), 183 deletions(-)
create mode 100644 net/openvswitch/vport-vxlan.h
--
1.9.3
^ permalink raw reply
* [PATCH 3/6] vxlan: Only bind to sockets with correct extensions enabled
From: Thomas Graf @ 2015-01-08 22:47 UTC (permalink / raw)
To: davem, jesse, stephen, pshelar, therbert, alexei.starovoitov; +Cc: dev, netdev
In-Reply-To: <cover.1420756324.git.tgraf@suug.ch>
A VXLAN net_device looking for an appropriate socket may only consider
a socket which has a matching set of extensions enabled. If the
extensions don't match, return a conflict to have the caller create a
distinct socket with a distinct port.
The OVS VXLAN port is kept unaware of extensions at this point.
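The matching rule can be illustrated with a small userspace sketch. It mirrors the comparison added to vxlan_find_sock() below, but runs over a plain array instead of the per-namespace hash list; struct sock_entry, find_sock() and the family constants are hypothetical stand-ins for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for AF_INET / AF_INET6. */
enum { FAM_INET = 2, FAM_INET6 = 10 };

/* Hypothetical, simplified view of a VXLAN socket. */
struct sock_entry {
	uint16_t port;
	int family;
	uint32_t exts; /* bitmask of enabled extensions */
};

/* Return the first socket whose port, family and exact extension set
 * all match, or NULL so that the caller creates a new socket.
 */
static const struct sock_entry *find_sock(const struct sock_entry *tbl,
					  size_t n, int family,
					  uint16_t port, uint32_t exts)
{
	for (size_t i = 0; i < n; i++) {
		if (tbl[i].port == port && tbl[i].family == family &&
		    tbl[i].exts == exts)
			return &tbl[i];
	}
	return NULL;
}
```

Note that the comparison is on the exact extension set, so a socket with no extensions is never shared with a device that has GBP enabled, even on the same port and address family.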
Signed-off-by: Thomas Graf <tgraf@suug.ch>
---
v2:
- Improved commit message, reported by Jesse
drivers/net/vxlan.c | 35 +++++++++++++++++++++--------------
include/net/vxlan.h | 2 +-
net/openvswitch/vport-vxlan.c | 2 +-
3 files changed, 23 insertions(+), 16 deletions(-)
diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index b148739..61e1112 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -271,14 +271,15 @@ static inline struct vxlan_rdst *first_remote_rtnl(struct vxlan_fdb *fdb)
}
/* Find VXLAN socket based on network namespace, address family and UDP port */
-static struct vxlan_sock *vxlan_find_sock(struct net *net,
- sa_family_t family, __be16 port)
+static struct vxlan_sock *vxlan_find_sock(struct net *net, sa_family_t family,
+ __be16 port, u32 exts)
{
struct vxlan_sock *vs;
hlist_for_each_entry_rcu(vs, vs_head(net, port), hlist) {
if (inet_sk(vs->sock->sk)->inet_sport == port &&
- inet_sk(vs->sock->sk)->sk.sk_family == family)
+ inet_sk(vs->sock->sk)->sk.sk_family == family &&
+ vs->exts == exts)
return vs;
}
return NULL;
@@ -298,11 +299,12 @@ static struct vxlan_dev *vxlan_vs_find_vni(struct vxlan_sock *vs, u32 id)
/* Look up VNI in a per net namespace table */
static struct vxlan_dev *vxlan_find_vni(struct net *net, u32 id,
- sa_family_t family, __be16 port)
+ sa_family_t family, __be16 port,
+ u32 exts)
{
struct vxlan_sock *vs;
- vs = vxlan_find_sock(net, family, port);
+ vs = vxlan_find_sock(net, family, port, exts);
if (!vs)
return NULL;
@@ -1776,7 +1778,8 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
ip_rt_put(rt);
dst_vxlan = vxlan_find_vni(vxlan->net, vni,
- dst->sa.sa_family, dst_port);
+ dst->sa.sa_family, dst_port,
+ vxlan->exts);
if (!dst_vxlan)
goto tx_error;
vxlan_encap_bypass(skb, vxlan, dst_vxlan);
@@ -1835,7 +1838,8 @@ static void vxlan_xmit_one(struct sk_buff *skb, struct net_device *dev,
dst_release(ndst);
dst_vxlan = vxlan_find_vni(vxlan->net, vni,
- dst->sa.sa_family, dst_port);
+ dst->sa.sa_family, dst_port,
+ vxlan->exts);
if (!dst_vxlan)
goto tx_error;
vxlan_encap_bypass(skb, vxlan, dst_vxlan);
@@ -2005,7 +2009,7 @@ static int vxlan_init(struct net_device *dev)
spin_lock(&vn->sock_lock);
vs = vxlan_find_sock(vxlan->net, ipv6 ? AF_INET6 : AF_INET,
- vxlan->dst_port);
+ vxlan->dst_port, vxlan->exts);
if (vs && atomic_add_unless(&vs->refcnt, 1, 0)) {
/* If we have a socket with same port already, reuse it */
vxlan_vs_add_dev(vs, vxlan);
@@ -2359,7 +2363,7 @@ static struct socket *vxlan_create_sock(struct net *net, bool ipv6,
/* Create new listen socket if needed */
static struct vxlan_sock *vxlan_socket_create(struct net *net, __be16 port,
vxlan_rcv_t *rcv, void *data,
- u32 flags)
+ u32 flags, u32 exts)
{
struct vxlan_net *vn = net_generic(net, vxlan_net_id);
struct vxlan_sock *vs;
@@ -2387,6 +2391,7 @@ static struct vxlan_sock *vxlan_socket_create(struct net *net, __be16 port,
atomic_set(&vs->refcnt, 1);
vs->rcv = rcv;
vs->data = data;
+ vs->exts = exts;
/* Initialize the vxlan udp offloads structure */
vs->udp_offloads.port = port;
@@ -2411,13 +2416,14 @@ static struct vxlan_sock *vxlan_socket_create(struct net *net, __be16 port,
struct vxlan_sock *vxlan_sock_add(struct net *net, __be16 port,
vxlan_rcv_t *rcv, void *data,
- bool no_share, u32 flags)
+ bool no_share, u32 flags,
+ u32 exts)
{
struct vxlan_net *vn = net_generic(net, vxlan_net_id);
struct vxlan_sock *vs;
bool ipv6 = flags & VXLAN_F_IPV6;
- vs = vxlan_socket_create(net, port, rcv, data, flags);
+ vs = vxlan_socket_create(net, port, rcv, data, flags, exts);
if (!IS_ERR(vs))
return vs;
@@ -2425,7 +2431,7 @@ struct vxlan_sock *vxlan_sock_add(struct net *net, __be16 port,
return vs;
spin_lock(&vn->sock_lock);
- vs = vxlan_find_sock(net, ipv6 ? AF_INET6 : AF_INET, port);
+ vs = vxlan_find_sock(net, ipv6 ? AF_INET6 : AF_INET, port, exts);
if (vs && ((vs->rcv != rcv) ||
!atomic_add_unless(&vs->refcnt, 1, 0)))
vs = ERR_PTR(-EBUSY);
@@ -2447,7 +2453,8 @@ static void vxlan_sock_work(struct work_struct *work)
__be16 port = vxlan->dst_port;
struct vxlan_sock *nvs;
- nvs = vxlan_sock_add(net, port, vxlan_rcv, NULL, false, vxlan->flags);
+ nvs = vxlan_sock_add(net, port, vxlan_rcv, NULL, false, vxlan->flags,
+ vxlan->exts);
spin_lock(&vn->sock_lock);
if (!IS_ERR(nvs))
vxlan_vs_add_dev(nvs, vxlan);
@@ -2597,7 +2604,7 @@ static int vxlan_newlink(struct net *net, struct net_device *dev,
configure_vxlan_exts(vxlan, data[IFLA_VXLAN_EXTENSION]);
if (vxlan_find_vni(net, vni, use_ipv6 ? AF_INET6 : AF_INET,
- vxlan->dst_port)) {
+ vxlan->dst_port, vxlan->exts)) {
pr_info("duplicate VNI %u\n", vni);
return -EEXIST;
}
diff --git a/include/net/vxlan.h b/include/net/vxlan.h
index af0526b..416aa2b 100644
--- a/include/net/vxlan.h
+++ b/include/net/vxlan.h
@@ -133,7 +133,7 @@ struct vxlan_sock {
struct vxlan_sock *vxlan_sock_add(struct net *net, __be16 port,
vxlan_rcv_t *rcv, void *data,
- bool no_share, u32 flags);
+ bool no_share, u32 flags, u32 exts);
void vxlan_sock_release(struct vxlan_sock *vs);
diff --git a/net/openvswitch/vport-vxlan.c b/net/openvswitch/vport-vxlan.c
index dd68c97..266c595 100644
--- a/net/openvswitch/vport-vxlan.c
+++ b/net/openvswitch/vport-vxlan.c
@@ -128,7 +128,7 @@ static struct vport *vxlan_tnl_create(const struct vport_parms *parms)
vxlan_port = vxlan_vport(vport);
strncpy(vxlan_port->name, parms->name, IFNAMSIZ);
- vs = vxlan_sock_add(net, htons(dst_port), vxlan_rcv, vport, true, 0);
+ vs = vxlan_sock_add(net, htons(dst_port), vxlan_rcv, vport, true, 0, 0);
if (IS_ERR(vs)) {
ovs_vport_free(vport);
return (void *)vs;
--
1.9.3
^ permalink raw reply related
* Re: [PATCH net] ipv6: Prevent ipv6_find_hdr() from returning ENOENT for valid non-first fragments
From: Hannes Frederic Sowa @ 2015-01-08 22:39 UTC (permalink / raw)
To: Pablo Neira Ayuso; +Cc: Rahul Sharma, netdev, linux-kernel, netfilter-devel
In-Reply-To: <20150108205328.GA3361@salvia>
Hi Pablo,
On Thu, Jan 8, 2015, at 21:53, Pablo Neira Ayuso wrote:
> I'm afraid we cannot just get rid of that !ipv6_ext_hdr() check. The
> ipv6_find_hdr() function is designed to return the transport protocol.
> After the proposed change, it will return extension header numbers.
> This will break existing ip6tables rulesets since the `-p' option
> relies on this function to match the transport protocol.
>
> Note that the AH header is skipped (see code a bit below this
> problematic fragmentation handling) so the follow up header after the
> AH header is returned as the transport header.
>
> We can probably return the AH protocol number for non-1st fragments.
> However, that would be something new to ip6tables since nobody has
> ever seen packets matching `-p ah' rules. Thus, we restore control to
> the user to allow this, but we would accept all kinds of fragmented AH
> traffic through the firewall since we cannot know which transport
> protocol non-1st fragments contain (unless I'm missing anything,
> I need to have a closer look at this again tomorrow with a fresher
> mind).
The code in question is guarded by (_frag_off != 0), so we are
definitely processing a non-1st fragment currently. The -p match would
happen at the time when the packet is reassembled and thus ipv6_find_hdr
will find the real transport (final) header at this point (I hope I
followed the code correctly here).
The next proto field of the fragmentation header is copied from the
first fragment. It may therefore specify an AH header, but only in the
original (unfragmented) packet does an AH header directly follow the
fragmentation header, so the next proto chain is somewhat unreliable
when looking at the fragmented packets only.
Bye,
Hannes
^ permalink raw reply