* Re: [bpf-next V1-RFC PATCH 02/14] xdp/mlx5: setup xdp_rxq_info and extend with qtype
From: Saeed Mahameed @ 2017-12-13 23:03 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Tariq Toukan
Cc: Daniel Borkmann, Alexei Starovoitov, netdev, dsahern, Matan Barak,
gospo, bjorn.topel, michael.chan
In-Reply-To: <20171213144439.2036c59b@redhat.com>
On 12/13/2017 5:44 AM, Jesper Dangaard Brouer wrote:
> On Wed, 13 Dec 2017 14:27:08 +0200
> Tariq Toukan <tariqt@mellanox.com> wrote:
>
>> Hi Jesper,
>> Thanks for taking care of the drop RQ.
>>
>> In general, mlx5 part looks ok to me.
>> Find a few comments below. Mostly pointing out some typos.
>>
>> On 13/12/2017 1:19 PM, Jesper Dangaard Brouer wrote:
>>> The mlx5 driver have a special drop-RQ queue (one per interface) that
>>> simply drops all incoming traffic. It helps driver keep other HW
>>> objects (flow steering) alive upon down/up operations. It is
>>> temporarily pointed by flow steering objects during the interface
>>> setup, and when interface is down. It lacks many fields that are set
>>> in a regular RQ (for example its state is never switched to
>>> MLX5_RQC_STATE_RDY). (Thanks to Tariq Toukan for explaination).
>> typo: explanation
>
> Fixed
>
>>>
>>> The XDP RX-queue info API is extended with a queue-type, and mlx5 uses
>>> this kind of drop/sink-type (RXQ_TYPE_SINK) for this kind of sink queue.
>>>
>>> Driver hook points for xdp_rxq_info:
>>> * init+reg: mlx5e_alloc_rq()
>>> * init+reg: mlx5e_alloc_drop_rq()
>>> * unreg : mlx5e_free_rq()
>>>
>>> Tested on actual hardware with samples/bpf program
>>>
>>> Cc: Saeed Mahameed <saeedm@mellanox.com>
>>> Cc: Matan Barak <matanb@mellanox.com>
>>> Cc: Tariq Toukan <tariqt@mellanox.com>
>>> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
>>> ---
>>> drivers/net/ethernet/mellanox/mlx5/core/en.h | 4 ++++
>>> drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 14 +++++++++++++
>>> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 1 +
>>> include/net/xdp.h | 23 +++++++++++++++++++++
>>> net/core/xdp.c | 6 +++++
>>> 5 files changed, 48 insertions(+)
>>>
>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>> index c0872b3284cb..fe10a042783b 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>> @@ -46,6 +46,7 @@
>>> #include <linux/mlx5/transobj.h>
>>> #include <linux/rhashtable.h>
>>> #include <net/switchdev.h>
>>> +#include <net/xdp.h>
>>> #include "wq.h"
>>> #include "mlx5_core.h"
>>> #include "en_stats.h"
>>> @@ -568,6 +569,9 @@ struct mlx5e_rq {
>>> u32 rqn;
>>> struct mlx5_core_dev *mdev;
>>> struct mlx5_core_mkey umr_mkey;
>>> +
>>> + /* XDP read-mostly */
>>> + struct xdp_rxq_info xdp_rxq;
>>> } ____cacheline_aligned_in_smp;
>>>
>>> struct mlx5e_channel {
>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>> index 0f5c012de52e..ea44b5f25e11 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>> @@ -582,6 +582,12 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
>>> rq->ix = c->ix;
>>> rq->mdev = mdev;
>>>
>>> + /* XDP RX-queue info */
>>> + xdp_rxq_info_init(&rq->xdp_rxq);
>>> + rq->xdp_rxq.dev = rq->netdev;
>>> + rq->xdp_rxq.queue_index = rq->ix;
>>> + xdp_rxq_info_reg(&rq->xdp_rxq);
>>> +
See my comment below and my comment on patch #12 I believe we can reduce
the amount of code duplication, and have a more generic way to register
XDP RXQs, without the need for drivers to take care of xdp_rxq_info
declaration and handling.
>> You don't set type here. This is ok as long as the following hold:
>> 1) RXQ_TYPE_DEFAULT is zero
>
> True
>
>> 2) xdp_rxq is zalloc'ed.
>
> xdp_rxq memory area is part of rq allocation, but in
> xdp_rxq_info_init() I memset/zero the area explicit.
>
>
>>> rq->xdp_prog = params->xdp_prog ?
>>> bpf_prog_inc(params->xdp_prog) : NULL; if (IS_ERR(rq->xdp_prog)) {
>>> err = PTR_ERR(rq->xdp_prog);
>>> @@ -695,6 +701,7 @@ static int mlx5e_alloc_rq(struct mlx5e_channel
>>> *c, err_rq_wq_destroy:
>>> if (rq->xdp_prog)
>>> bpf_prog_put(rq->xdp_prog);
>>> + xdp_rxq_info_unreg(&rq->xdp_rxq);
>>> mlx5_wq_destroy(&rq->wq_ctrl);
>>>
>>> return err;
>>> @@ -707,6 +714,8 @@ static void mlx5e_free_rq(struct mlx5e_rq *rq)
>>> if (rq->xdp_prog)
>>> bpf_prog_put(rq->xdp_prog);
>>>
>>> + xdp_rxq_info_unreg(&rq->xdp_rxq);
>>> +
>>> switch (rq->wq_type) {
>>> case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
>>> mlx5e_rq_free_mpwqe_info(rq);
>>> @@ -2768,6 +2777,11 @@ static int mlx5e_alloc_drop_rq(struct
>>> mlx5_core_dev *mdev, if (err)
>>> return err;
>>>
>>> + /* XDP RX-queue info for "Drop-RQ", packets never reach
>>> XDP */
>>> + xdp_rxq_info_init(&rq->xdp_rxq);
>>> + xdp_rxq_info_type(&rq->xdp_rxq, RXQ_TYPE_SINK);
>>> + xdp_rxq_info_reg(&rq->xdp_rxq);
>>> +
I don't see why you need this, This RQ is not even assigned to any
netdev_rxq! it is a pure HW object that drops traffic in HW when netdev
is down, it even has no buffers or napi handling, just ignore it's
existence for the sake of mlx5 xdp_rxq_info reg/unreg stuff and remove
RXQ_TYPE_SINK, bottom line it is not a real RQ and for sure XDP has
nothing to do with it.
>>> rq->mdev = mdev;
>>>
>>> return 0;
>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
>>> b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c index
>>> 5b499c7a698f..7b38480811d4 100644 ---
>>> a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c +++
>>> b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c @@ -812,6 +812,7
>>> @@ static inline int mlx5e_xdp_handle(struct mlx5e_rq *rq,
>>> xdp_set_data_meta_invalid(&xdp); xdp.data_end = xdp.data + *len;
>>> xdp.data_hard_start = va;
>>> + xdp.rxq = &rq->xdp_rxq;
>>>
>>> act = bpf_prog_run_xdp(prog, &xdp);
>>> switch (act) {
>>> diff --git a/include/net/xdp.h b/include/net/xdp.h
>>> index e4acd198fd60..5be560d943e1 100644
>>> --- a/include/net/xdp.h
>>> +++ b/include/net/xdp.h
>>> @@ -36,10 +36,33 @@ struct xdp_rxq_info {
>>> struct net_device *dev;
>>> u32 queue_index;
>>> u32 reg_state;
>>> + u32 qtype;
>>> } ____cacheline_aligned; /* perf critical, avoid false-sharing */
>>>
>>> void xdp_rxq_info_init(struct xdp_rxq_info *xdp_rxq);
>>> void xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq);
>>> void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq);
>>>
>>> +/**
>>> + * DOC: XDP RX-queue type
>>> + *
>>> + * The XDP RX-queue info can have associated a type.
>>> + *
>>> + * @RXQ_TYPE_DEFAULT: default no specifik queue type need to be
>>> specified
>>
>> typo: specific
>
> Thanks, this is a Danish typo (it's spelled that way in Danish).
>
>>> + *
>>> + * @RXQ_TYPE_SINK: indicate a fake queue that never reach XDP RX
>>> + * code. Some drivers have a need to maintain a lower layer
>>> + * RX-queue as a sink queue, while reconfiguring other
>>> RX-queues.
>>> + */
>>> +#define RXQ_TYPE_DEFAULT 0
>>> +#define RXQ_TYPE_SINK 1
>>> +#define RXQ_TYPE_MAX RXQ_TYPE_SINK
>>
>> Definitions of incremental numbers, enum might be best here, you can
>> give them some enum type and use it in xdp_rxq_info->qtype.
>
> I use defines to make the below BUILD_BUG_ON work, as enums does not
> get expanded to their values in the C-preprocessor stage.
>
>>> +
>>> +static inline
>>> +void xdp_rxq_info_type(struct xdp_rxq_info *xdp_rxq, u32 qtype)
>>> +{
>>> + BUILD_BUG_ON(qtype > RXQ_TYPE_MAX);
>>> + xdp_rxq->qtype = qtype;
>>> +}
>>> +
>>> #endif /* __LINUX_NET_XDP_H__ */
>>> diff --git a/net/core/xdp.c b/net/core/xdp.c
>>> index a9d2dd7b1ede..2a111f5987f6 100644
>>> --- a/net/core/xdp.c
>>> +++ b/net/core/xdp.c
>>> @@ -32,8 +32,14 @@ EXPORT_SYMBOL_GPL(xdp_rxq_info_init);
>>>
>>> void xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq)
>>> {
>>> + if (xdp_rxq->qtype == RXQ_TYPE_SINK)
>>> + goto skip_content_check;
>>> +
>>> + /* Check information setup by driver code */
>>> WARN(!xdp_rxq->dev, "Missing net_device from driver");
>>> WARN(xdp_rxq->queue_index == U32_MAX, "Miss queue_index from driver"); +
>>> +skip_content_check:
>>> WARN(!(xdp_rxq->reg_state == REG_STATE_NEW),"API violation, miss init");
>>> xdp_rxq->reg_state = REG_STATE_REGISTRED;
>> typo: REGISTERED (introduced in a previous patch)
>
> Thanks for catching that! :-)
>
^ permalink raw reply
* Re: BUG REPORT: iproute2 seems to have bug with dsfield/tos in ip-rule and ip-route
From: David Ahern @ 2017-12-13 23:05 UTC (permalink / raw)
To: Daniel Lakeland, Stephen Hemminger; +Cc: netdev
In-Reply-To: <a745127e-58f6-47e6-7b44-a1b5e028bd2d@street-artists.org>
On 12/13/17 3:52 PM, Daniel Lakeland wrote:
> On 12/13/2017 02:40 PM, David Ahern wrote:
>>
>> In fib4_rule_configure, this the check that is failing:
>>
>> if (frh->tos & ~IPTOS_TOS_MASK)
>> goto errout;
>>
>> and EINVAL is returned.
>>
>> IPv4 routes has not checking on tos -- it is passed from user and
>> rtm_tos to fc_tos to fib alias tos.
>
> it seems to me that this IPTOS_TOS_MASK check should be either gotten
> rid of, or equal to 0x03 in modern usage. The bottom 2 bits are ECN and
> I suppose someone might want to route based on congestion... and hence
> maybe the mask should be dropped entirely, but if you refuse to allow
> routes on ECN then you'd want 0x03 as the mask
>
> it seems to me this is left over from before DSCP.
>
> apparently most people don't route on DSCP or work around this with
> firewall marks, and so this doesn't cause trouble enough to have been
> reported before?
>
> I think the follow up question is does anyone have any idea why someone
> who set up routes with dsfield settings is not seeing packets routed?
> The kernel may not handle ip rule with DSCP, but it takes
>
> ip route add default dsfield CS6 dev veth0
>
> just fine... and shows up in the route table, but for example the person
> is not seeing CS6 marked packets going to veth2 and instead is seeing
> them routed to veth0 the default route...
>
>
If you are running a modern kernel (>= ~4.5) there are fib tracepoints
you can use to try to answer that:
perf record -e fib:fib_table_lookup -a -g
perf script [-G]
I've had some doubts about tos handling in the output path but have not
had the time (or motivation) to dig into it. Specifically, the tos
adjustments in ip_route_output_key_hash look weird to me.
^ permalink raw reply
* Re: [PATCH net] vxlan: Restore initial MTU setting based on lower device
From: Stephen Hemminger @ 2017-12-13 23:19 UTC (permalink / raw)
To: Stefano Brivio
Cc: David S . Miller, netdev, Matthias Schiffer, Junhan Yan,
Jiri Benc, Hangbin Liu
In-Reply-To: <0c6caef03156aa51673c50ebb59889fc001b74be.1513198761.git.sbrivio@redhat.com>
On Wed, 13 Dec 2017 23:37:00 +0100
Stefano Brivio <sbrivio@redhat.com> wrote:
> Commit a985343ba906 ("vxlan: refactor verification and
> application of configuration") introduced a change in the
> behaviour of initial MTU setting: earlier, the MTU for a link
> created on top of a given lower device, without an initial MTU
> specification, was set to the MTU of the lower device minus
> headroom as a result of this path in vxlan_dev_configure():
>
> if (!conf->mtu)
> dev->mtu = lowerdev->mtu -
> (use_ipv6 ? VXLAN6_HEADROOM : VXLAN_HEADROOM);
>
> which is now gone. Now, the initial MTU, in absence of a
> configured value, is simply set by ether_setup() to ETH_DATA_LEN
> (1500 bytes).
>
> This breaks userspace expectations in case the MTU of
> the lower device is higher than 1500 bytes minus headroom.
>
> Restore the previous behaviour by calculating, for a new link,
> the MTU from the lower device, if present, and if no value is
> explicitly configured.
>
> Reported-by: Junhan Yan <juyan@redhat.com>
> Fixes: a985343ba906 ("vxlan: refactor verification and application of configuration")
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
Good catch.
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
^ permalink raw reply
* Re: [PATCH] net: thunderx: add support for rgmii internal delay
From: Tim Harvey @ 2017-12-13 23:28 UTC (permalink / raw)
To: Andrew Lunn; +Cc: Sunil Goutham, netdev
In-Reply-To: <20171213111054.GE12446@lunn.ch>
On Wed, Dec 13, 2017 at 3:10 AM, Andrew Lunn <andrew@lunn.ch> wrote:
>> +void xcv_init_hw(int phy_mode)
>> {
>> u64 cfg;
>>
>> @@ -81,12 +81,31 @@ void xcv_init_hw(void)
>> /* Wait for DLL to lock */
>> msleep(1);
>>
>> - /* Configure DLL - enable or bypass
>> - * TX no bypass, RX bypass
>> - */
>> + /* enable/bypass DLL providing MAC based internal TX/RX delays */
>> cfg = readq_relaxed(xcv->reg_base + XCV_DLL_CTL);
>> - cfg &= ~0xFF03;
>> - cfg |= CLKRX_BYP;
>> + cfg &= ~0xffff00;
>> + switch (phy_mode) {
>> + /* RX and TX delays are added by the MAC */
>> + case PHY_INTERFACE_MODE_RGMII:
>> + break;
>> + /* internal RX and TX delays provided by the PHY */
>> + case PHY_INTERFACE_MODE_RGMII_ID:
>> + cfg |= CLKRX_BYP;
>> + cfg |= CLKTX_BYP;
>> + break;
>> + /* internal RX delay provided by the PHY, the MAC
>> + * should not add an RX delay in this case
>> + */
>> + case PHY_INTERFACE_MODE_RGMII_RXID:
>> + cfg |= CLKRX_BYP;
>> + break;
>> + /* internal TX delay provided by the PHY, the MAC
>> + * should not add an TX delay in this case
>> + */
>> + case PHY_INTERFACE_MODE_RGMII_TXID:
>> + cfg |= CLKRX_BYP;
>> + break;
>> + }
>
> Hi Tim
>
> This i don't get. Normally, you leave the PHY to handle delays, if
> needed. The MAC should not add any. Here you seem to assume a delay is
> always needed, and if the PHY is not providing it, the MAC should.
>
> Andrew
Andrew,
The thunder RGX inserts a delay via an on-board DLL. The 'bypass'
register will bypass this DLL and not insert a delay from the MAC
side. By default out of reset CLKTX_BYP=1 causing the RGX transmit
interface to not introduce a delay and CLKRX_BYP=0 causing the RGX
receive interface to introduce a delay.
The current code assumes the opposite setting CLKRX_BYP and clearing
CLKTX_BYP such that the RGX interface introduces a TX delay but not RX
which would be appropriate for
PHY_INTERFACE_MODE_RGMII_RXID/rgmii-txid (right, or is my logic there
backwards?). I may have my commit msg wrong in this case.
At any rate, I've got a board where the phy provides both TX/RX delay
and thus I don't want the RGX to insert any delays which means I need
to set both CLKRX_BYP and CLKTX_BYP which the driver currently doesn't
support.
Tim
^ permalink raw reply
* Re: [PATCH net] vxlan: Restore initial MTU setting based on lower device
From: Matthias Schiffer @ 2017-12-13 23:57 UTC (permalink / raw)
To: Stefano Brivio
Cc: David S . Miller, netdev, Junhan Yan, Jiri Benc, Hangbin Liu
In-Reply-To: <0c6caef03156aa51673c50ebb59889fc001b74be.1513198761.git.sbrivio@redhat.com>
[-- Attachment #1.1: Type: text/plain, Size: 2574 bytes --]
On 12/13/2017 11:37 PM, Stefano Brivio wrote:
> Commit a985343ba906 ("vxlan: refactor verification and
> application of configuration") introduced a change in the
> behaviour of initial MTU setting: earlier, the MTU for a link
> created on top of a given lower device, without an initial MTU
> specification, was set to the MTU of the lower device minus
> headroom as a result of this path in vxlan_dev_configure():
>
> if (!conf->mtu)
> dev->mtu = lowerdev->mtu -
> (use_ipv6 ? VXLAN6_HEADROOM : VXLAN_HEADROOM);
>
> which is now gone. Now, the initial MTU, in absence of a
> configured value, is simply set by ether_setup() to ETH_DATA_LEN
> (1500 bytes).
>
> This breaks userspace expectations in case the MTU of
> the lower device is higher than 1500 bytes minus headroom.
>
> Restore the previous behaviour by calculating, for a new link,
> the MTU from the lower device, if present, and if no value is
> explicitly configured.
>
> Reported-by: Junhan Yan <juyan@redhat.com>
> Fixes: a985343ba906 ("vxlan: refactor verification and application of configuration")
> Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
> ---
> I guess this should be queued up for -stable, back to 4.13.
>
> I'm actually introducing the third occurrence of this calculation (there's
> another one in vxlan_config_apply(), and one in vxlan_change_mtu()). I would
> anyway fix the userspace breakage first, and then plan on getting rid of several
> bits of MTU logic duplication, which spans further than this.
As you note, there is another occurrence of this calculation in
vxlan_config_apply():
[...]
if (lowerdev) {
[...]
max_mtu = lowerdev->mtu - (use_ipv6 ? VXLAN6_HEADROOM :
VXLAN_HEADROOM);
}
if (dev->mtu > max_mtu)
dev->mtu = max_mtu;
[...]
Unless I'm overlooking something, this should already do the same thing and
your patch is redundant.
Regards,
Matthias
>
> drivers/net/vxlan.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
> index 19b9cc51079e..3a7e36cdf2c7 100644
> --- a/drivers/net/vxlan.c
> +++ b/drivers/net/vxlan.c
> @@ -3085,6 +3085,9 @@ static void vxlan_config_apply(struct net_device *dev,
>
> if (conf->mtu)
> dev->mtu = conf->mtu;
> + else if (lowerdev)
> + dev->mtu = lowerdev->mtu - (use_ipv6 ? VXLAN6_HEADROOM :
> + VXLAN_HEADROOM);
>
> vxlan->net = src_net;
> }
>
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH net] vxlan: Restore initial MTU setting based on lower device
From: Stefano Brivio @ 2017-12-14 0:10 UTC (permalink / raw)
To: Matthias Schiffer
Cc: David S . Miller, netdev, Junhan Yan, Jiri Benc, Hangbin Liu
In-Reply-To: <f7ce892a-60e7-4dfd-31d4-ee8fa06c9c1f@universe-factory.net>
On Thu, 14 Dec 2017 00:57:32 +0100
Matthias Schiffer <mschiffer@universe-factory.net> wrote:
> As you note, there is another occurrence of this calculation in
> vxlan_config_apply():
>
>
> [...]
> if (lowerdev) {
> [...]
> max_mtu = lowerdev->mtu - (use_ipv6 ? VXLAN6_HEADROOM :
> VXLAN_HEADROOM);
> }
>
> if (dev->mtu > max_mtu)
> dev->mtu = max_mtu;
> [...]
>
>
> Unless I'm overlooking something, this should already do the same thing and
> your patch is redundant.
The code above sets max_mtu, and only if dev->mtu exceeds that, the
latter is then clamped.
What my patch does is to actually set dev->mtu to that value, no matter
what's the previous value set by ether_setup() (only on creation, and
only if lowerdev is there), just like the previous behaviour used to be.
Let's consider these two cases, on the existing code:
1. lowerdev->mtu is 1500:
- ether_setup(), called by vxlan_setup(), sets dev->mtu to 1500
- here max_mtu is 1450
- we enter the second if clause above (dev->mtu > max_mtu)
- at the end of vxlan_config_apply(), dev->mtu will be 1450
which is consistent with the previous behaviour.
2. lowerdev->mtu is 9000:
- ether_setup(), called by vxlan_setup(), sets dev->mtu to 1500
- here max_mtu is 8950
- we do not enter the second if clause above (dev->mtu < max_mtu)
- at the end of vxlan_config_apply(), dev->mtu will still be 1500
which is not consistent with the previous behaviour, where it used to
be 8950 instead.
--
Stefano
^ permalink raw reply
* Re: [PATCH net] vxlan: Restore initial MTU setting based on lower device
From: Matthias Schiffer @ 2017-12-14 0:25 UTC (permalink / raw)
To: Stefano Brivio
Cc: David S . Miller, netdev, Junhan Yan, Jiri Benc, Hangbin Liu
In-Reply-To: <20171214011042.6b4a2e8b@elisabeth>
[-- Attachment #1.1: Type: text/plain, Size: 2397 bytes --]
On 12/14/2017 01:10 AM, Stefano Brivio wrote:
> On Thu, 14 Dec 2017 00:57:32 +0100
> Matthias Schiffer <mschiffer@universe-factory.net> wrote:
>
>> As you note, there is another occurrence of this calculation in
>> vxlan_config_apply():
>>
>>
>> [...]
>> if (lowerdev) {
>> [...]
>> max_mtu = lowerdev->mtu - (use_ipv6 ? VXLAN6_HEADROOM :
>> VXLAN_HEADROOM);
>> }
>>
>> if (dev->mtu > max_mtu)
>> dev->mtu = max_mtu;
>> [...]
>>
>>
>> Unless I'm overlooking something, this should already do the same thing and
>> your patch is redundant.
>
> The code above sets max_mtu, and only if dev->mtu exceeds that, the
> latter is then clamped.
>
> What my patch does is to actually set dev->mtu to that value, no matter
> what's the previous value set by ether_setup() (only on creation, and
> only if lowerdev is there), just like the previous behaviour used to be.
>
> Let's consider these two cases, on the existing code:
>
> 1. lowerdev->mtu is 1500:
> - ether_setup(), called by vxlan_setup(), sets dev->mtu to 1500
> - here max_mtu is 1450
> - we enter the second if clause above (dev->mtu > max_mtu)
> - at the end of vxlan_config_apply(), dev->mtu will be 1450
>
> which is consistent with the previous behaviour.
>
> 2. lowerdev->mtu is 9000:
> - ether_setup(), called by vxlan_setup(), sets dev->mtu to 1500
> - here max_mtu is 8950
> - we do not enter the second if clause above (dev->mtu < max_mtu)
> - at the end of vxlan_config_apply(), dev->mtu will still be 1500
>
> which is not consistent with the previous behaviour, where it used to
> be 8950 instead.
Ah, thank you for the explanation, I was missing the context that this was
about higher rather than lower MTUs.
Personally, I would prefer a change like the following, as it does not
introduce another duplication of the MTU calculation (not tested at all):
> --- a/drivers/net/vxlan.c
> +++ b/drivers/net/vxlan.c
> @@ -3105,7 +3105,7 @@ static void vxlan_config_apply(struct net_device *dev,
> VXLAN_HEADROOM);
> }
>
> - if (dev->mtu > max_mtu)
> + if (dev->mtu > max_mtu || (!changelink && !conf->mtu))
> dev->mtu = max_mtu;
>
> if (use_ipv6 || conf->flags & VXLAN_F_COLLECT_METADATA)
Regards,
Matthias
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* [PATCH iproute2 net-next] erspan: add erspan version II support
From: William Tu @ 2017-12-14 0:29 UTC (permalink / raw)
To: netdev
The patch adds support for configuring the erspan v2, for both
ipv4 and ipv6 erspan implementation. Three additional fields
are added: 'erspan_ver' for distinguishing v1 or v2, 'erspan_dir'
for specifying direction of the mirrored traffic, and 'erspan_hwid'
for users to set ERSPAN engine ID within a system.
For details of version II, see
https://tools.ietf.org/html/draft-foschiano-erspan-03
Signed-off-by: William Tu <u9012063@gmail.com>
---
include/uapi/linux/if_ether.h | 1 +
include/uapi/linux/if_tunnel.h | 3 +++
ip/link_gre.c | 59 +++++++++++++++++++++++++++++++++++++---
ip/link_gre6.c | 61 +++++++++++++++++++++++++++++++++++++++---
4 files changed, 117 insertions(+), 7 deletions(-)
diff --git a/include/uapi/linux/if_ether.h b/include/uapi/linux/if_ether.h
index 2eb529a90250..5fd5c12ef8e9 100644
--- a/include/uapi/linux/if_ether.h
+++ b/include/uapi/linux/if_ether.h
@@ -47,6 +47,7 @@
#define ETH_P_PUP 0x0200 /* Xerox PUP packet */
#define ETH_P_PUPAT 0x0201 /* Xerox PUP Addr Trans packet */
#define ETH_P_TSN 0x22F0 /* TSN (IEEE 1722) packet */
+#define ETH_P_ERsPAN2 0x22EB /* ERSPAN version 2 (type III) */
#define ETH_P_IP 0x0800 /* Internet Protocol packet */
#define ETH_P_X25 0x0805 /* CCITT X.25 */
#define ETH_P_ARP 0x0806 /* Address Resolution packet */
diff --git a/include/uapi/linux/if_tunnel.h b/include/uapi/linux/if_tunnel.h
index 38cdf90692f8..29602df037e9 100644
--- a/include/uapi/linux/if_tunnel.h
+++ b/include/uapi/linux/if_tunnel.h
@@ -137,6 +137,9 @@ enum {
IFLA_GRE_IGNORE_DF,
IFLA_GRE_FWMARK,
IFLA_GRE_ERSPAN_INDEX,
+ IFLA_GRE_ERSPAN_VER,
+ IFLA_GRE_ERSPAN_DIR,
+ IFLA_GRE_ERSPAN_HWID,
__IFLA_GRE_MAX,
};
diff --git a/ip/link_gre.c b/ip/link_gre.c
index 43cb1af6196a..924a05530f5c 100644
--- a/ip/link_gre.c
+++ b/ip/link_gre.c
@@ -98,6 +98,9 @@ static int gre_parse_opt(struct link_util *lu, int argc, char **argv,
__u8 ignore_df = 0;
__u32 fwmark = 0;
__u32 erspan_idx = 0;
+ __u8 erspan_ver = 0;
+ __u8 erspan_dir = 0;
+ __u16 erspan_hwid = 0;
if (!(n->nlmsg_flags & NLM_F_CREATE)) {
if (rtnl_talk(&rth, &req.n, &answer) < 0) {
@@ -179,6 +182,15 @@ get_failed:
if (greinfo[IFLA_GRE_ERSPAN_INDEX])
erspan_idx = rta_getattr_u32(greinfo[IFLA_GRE_ERSPAN_INDEX]);
+ if (greinfo[IFLA_GRE_ERSPAN_VER])
+ erspan_ver = rta_getattr_u8(greinfo[IFLA_GRE_ERSPAN_VER]);
+
+ if (greinfo[IFLA_GRE_ERSPAN_DIR])
+ erspan_dir = rta_getattr_u8(greinfo[IFLA_GRE_ERSPAN_DIR]);
+
+ if (greinfo[IFLA_GRE_ERSPAN_HWID])
+ erspan_hwid = rta_getattr_u16(greinfo[IFLA_GRE_ERSPAN_HWID]);
+
free(answer);
}
@@ -343,6 +355,22 @@ get_failed:
invarg("invalid erspan index\n", *argv);
if (erspan_idx & ~((1<<20) - 1) || erspan_idx == 0)
invarg("erspan index must be > 0 and <= 20-bit\n", *argv);
+ } else if (strcmp(*argv, "erspan_ver") == 0) {
+ NEXT_ARG();
+ if (get_u8(&erspan_ver, *argv, 0))
+ invarg("invalid erspan version\n", *argv);
+ if (erspan_ver != 1 && erspan_ver != 2)
+ invarg("erspan version must be 1 or 2\n", *argv);
+ } else if (strcmp(*argv, "erspan_dir") == 0) {
+ NEXT_ARG();
+ if (get_u8(&erspan_dir, *argv, 0))
+ invarg("invalid erspan direction\n", *argv);
+ if (erspan_dir != 0 && erspan_dir != 1)
+ invarg("erspan direction must be 0(Ingress) or 1(Egress)\n", *argv);
+ } else if (strcmp(*argv, "erspan_hwid") == 0) {
+ NEXT_ARG();
+ if (get_u16(&erspan_hwid, *argv, 0))
+ invarg("invalid erspan hwid\n", *argv);
} else
usage();
argc--; argv++;
@@ -374,8 +402,15 @@ get_failed:
addattr_l(n, 1024, IFLA_GRE_TTL, &ttl, 1);
addattr_l(n, 1024, IFLA_GRE_TOS, &tos, 1);
addattr32(n, 1024, IFLA_GRE_FWMARK, fwmark);
- if (erspan_idx != 0)
- addattr32(n, 1024, IFLA_GRE_ERSPAN_INDEX, erspan_idx);
+ if (erspan_ver) {
+ addattr8(n, 1024, IFLA_GRE_ERSPAN_VER, erspan_ver);
+ if (erspan_ver == 1 && erspan_idx != 0) {
+ addattr32(n, 1024, IFLA_GRE_ERSPAN_INDEX, erspan_idx);
+ } else if (erspan_ver == 2) {
+ addattr8(n, 1024, IFLA_GRE_ERSPAN_DIR, erspan_dir);
+ addattr16(n, 1024, IFLA_GRE_ERSPAN_HWID, erspan_hwid);
+ }
+ }
} else {
addattr_l(n, 1024, IFLA_GRE_COLLECT_METADATA, NULL, 0);
}
@@ -514,7 +549,25 @@ static void gre_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
if (tb[IFLA_GRE_ERSPAN_INDEX]) {
__u32 erspan_idx = rta_getattr_u32(tb[IFLA_GRE_ERSPAN_INDEX]);
- fprintf(f, "erspan_index %u ", erspan_idx);
+ print_uint(PRINT_ANY, "erspan_index", "erspan_index %u", erspan_idx);
+ }
+
+ if (tb[IFLA_GRE_ERSPAN_VER]) {
+ __u8 erspan_ver = rta_getattr_u8(tb[IFLA_GRE_ERSPAN_VER]);
+
+ print_uint(PRINT_ANY, "erspan_ver", "erspan_ver %u", erspan_ver);
+ }
+
+ if (tb[IFLA_GRE_ERSPAN_DIR]) {
+ __u8 erspan_dir = rta_getattr_u8(tb[IFLA_GRE_ERSPAN_DIR]);
+
+ print_uint(PRINT_ANY, "erspan_dir", "erspan_dir %u", erspan_dir);
+ }
+
+ if (tb[IFLA_GRE_ERSPAN_HWID]) {
+ __u16 erspan_hwid = rta_getattr_u16(tb[IFLA_GRE_ERSPAN_HWID]);
+
+ print_hex(PRINT_ANY, "erspan_hwid", "erspan_hwid %x", erspan_hwid);
}
if (tb[IFLA_GRE_ENCAP_TYPE] &&
diff --git a/ip/link_gre6.c b/ip/link_gre6.c
index 2cb46ca116d0..71181af91bc8 100644
--- a/ip/link_gre6.c
+++ b/ip/link_gre6.c
@@ -109,6 +109,9 @@ static int gre_parse_opt(struct link_util *lu, int argc, char **argv,
int len;
__u32 fwmark = 0;
__u32 erspan_idx = 0;
+ __u8 erspan_ver = 0;
+ __u8 erspan_dir = 0;
+ __u16 erspan_hwid = 0;
if (!(n->nlmsg_flags & NLM_F_CREATE)) {
if (rtnl_talk(&rth, &req.n, &answer) < 0) {
@@ -191,6 +194,15 @@ get_failed:
if (greinfo[IFLA_GRE_ERSPAN_INDEX])
erspan_idx = rta_getattr_u32(greinfo[IFLA_GRE_ERSPAN_INDEX]);
+ if (greinfo[IFLA_GRE_ERSPAN_VER])
+ erspan_ver = rta_getattr_u8(greinfo[IFLA_GRE_ERSPAN_VER]);
+
+ if (greinfo[IFLA_GRE_ERSPAN_DIR])
+ erspan_dir = rta_getattr_u8(greinfo[IFLA_GRE_ERSPAN_DIR]);
+
+ if (greinfo[IFLA_GRE_ERSPAN_HWID])
+ erspan_hwid = rta_getattr_u16(greinfo[IFLA_GRE_ERSPAN_HWID]);
+
free(answer);
}
@@ -389,6 +401,22 @@ get_failed:
invarg("invalid erspan index\n", *argv);
if (erspan_idx & ~((1<<20) - 1) || erspan_idx == 0)
invarg("erspan index must be > 0 and <= 20-bit\n", *argv);
+ } else if (strcmp(*argv, "erspan_ver") == 0) {
+ NEXT_ARG();
+ if (get_u8(&erspan_ver, *argv, 0))
+ invarg("invalid erspan version\n", *argv);
+ if (erspan_ver != 1 && erspan_ver != 2)
+ invarg("erspan version must be 1 or 2\n", *argv);
+ } else if (strcmp(*argv, "erspan_dir") == 0) {
+ NEXT_ARG();
+ if (get_u8(&erspan_dir, *argv, 0))
+ invarg("invalid erspan direction\n", *argv);
+ if (erspan_dir != 0 && erspan_dir != 1)
+ invarg("erspan direction must be 0(Ingress) or 1(Egress)\n", *argv);
+ } else if (strcmp(*argv, "erspan_hwid") == 0) {
+ NEXT_ARG();
+ if (get_u16(&erspan_hwid, *argv, 0))
+ invarg("invalid erspan hwid\n", *argv);
} else
usage();
argc--; argv++;
@@ -408,9 +436,15 @@ get_failed:
addattr_l(n, 1024, IFLA_GRE_FLOWINFO, &flowinfo, 4);
addattr32(n, 1024, IFLA_GRE_FLAGS, flags);
addattr32(n, 1024, IFLA_GRE_FWMARK, fwmark);
- if (erspan_idx != 0)
- addattr32(n, 1024, IFLA_GRE_ERSPAN_INDEX, erspan_idx);
-
+ if (erspan_ver) {
+ addattr8(n, 1024, IFLA_GRE_ERSPAN_VER, erspan_ver);
+ if (erspan_ver == 1 && erspan_idx != 0) {
+ addattr32(n, 1024, IFLA_GRE_ERSPAN_INDEX, erspan_idx);
+ } else {
+ addattr8(n, 1024, IFLA_GRE_ERSPAN_DIR, erspan_dir);
+ addattr16(n, 1024, IFLA_GRE_ERSPAN_HWID, erspan_hwid);
+ }
+ }
addattr16(n, 1024, IFLA_GRE_ENCAP_TYPE, encaptype);
addattr16(n, 1024, IFLA_GRE_ENCAP_FLAGS, encapflags);
addattr16(n, 1024, IFLA_GRE_ENCAP_SPORT, htons(encapsport));
@@ -587,9 +621,28 @@ static void gre_print_opt(struct link_util *lu, FILE *f, struct rtattr *tb[])
if (tb[IFLA_GRE_ERSPAN_INDEX]) {
__u32 erspan_idx = rta_getattr_u32(tb[IFLA_GRE_ERSPAN_INDEX]);
- fprintf(f, "erspan_index %u ", erspan_idx);
+ print_uint(PRINT_ANY, "erspan_index", "erspan_index %u ", erspan_idx);
}
+ if (tb[IFLA_GRE_ERSPAN_VER]) {
+ __u8 erspan_ver = rta_getattr_u8(tb[IFLA_GRE_ERSPAN_VER]);
+
+ print_uint(PRINT_ANY, "erspan_ver", "erspan_ver %u", erspan_ver);
+ }
+
+ if (tb[IFLA_GRE_ERSPAN_DIR]) {
+ __u8 erspan_dir = rta_getattr_u8(tb[IFLA_GRE_ERSPAN_DIR]);
+
+ print_uint(PRINT_ANY, "erspan_dir", "erspan_dir %u", erspan_dir);
+ }
+
+ if (tb[IFLA_GRE_ERSPAN_HWID]) {
+ __u16 erspan_hwid = rta_getattr_u16(tb[IFLA_GRE_ERSPAN_HWID]);
+
+ print_hex(PRINT_ANY, "erspan_hwid", "erspan_hwid %x", erspan_hwid);
+ }
+
+
if (tb[IFLA_GRE_ENCAP_TYPE] &&
rta_getattr_u16(tb[IFLA_GRE_ENCAP_TYPE]) != TUNNEL_ENCAP_NONE) {
__u16 type = rta_getattr_u16(tb[IFLA_GRE_ENCAP_TYPE]);
--
2.7.4
^ permalink raw reply related
* Re: [PATCH net] vxlan: Restore initial MTU setting based on lower device
From: Stefano Brivio @ 2017-12-14 0:31 UTC (permalink / raw)
To: Matthias Schiffer
Cc: David S . Miller, netdev, Junhan Yan, Jiri Benc, Hangbin Liu
In-Reply-To: <42862507-a573-af7a-d4aa-fd8cdd89c01e@universe-factory.net>
On Thu, 14 Dec 2017 01:25:40 +0100
Matthias Schiffer <mschiffer@universe-factory.net> wrote:
> On 12/14/2017 01:10 AM, Stefano Brivio wrote:
> > On Thu, 14 Dec 2017 00:57:32 +0100
> > Matthias Schiffer <mschiffer@universe-factory.net> wrote:
> >
> >> As you note, there is another occurrence of this calculation in
> >> vxlan_config_apply():
> >>
> >>
> >> [...]
> >> if (lowerdev) {
> >> [...]
> >> max_mtu = lowerdev->mtu - (use_ipv6 ? VXLAN6_HEADROOM :
> >> VXLAN_HEADROOM);
> >> }
> >>
> >> if (dev->mtu > max_mtu)
> >> dev->mtu = max_mtu;
> >> [...]
> >>
> >>
> >> Unless I'm overlooking something, this should already do the same thing and
> >> your patch is redundant.
> >
> > The code above sets max_mtu, and only if dev->mtu exceeds that, the
> > latter is then clamped.
> >
> > What my patch does is to actually set dev->mtu to that value, no matter
> > what's the previous value set by ether_setup() (only on creation, and
> > only if lowerdev is there), just like the previous behaviour used to be.
> >
> > Let's consider these two cases, on the existing code:
> >
> > 1. lowerdev->mtu is 1500:
> > - ether_setup(), called by vxlan_setup(), sets dev->mtu to 1500
> > - here max_mtu is 1450
> > - we enter the second if clause above (dev->mtu > max_mtu)
> > - at the end of vxlan_config_apply(), dev->mtu will be 1450
> >
> > which is consistent with the previous behaviour.
> >
> > 2. lowerdev->mtu is 9000:
> > - ether_setup(), called by vxlan_setup(), sets dev->mtu to 1500
> > - here max_mtu is 8950
> > - we do not enter the second if clause above (dev->mtu < max_mtu)
> > - at the end of vxlan_config_apply(), dev->mtu will still be 1500
> >
> > which is not consistent with the previous behaviour, where it used to
> > be 8950 instead.
>
> Ah, thank you for the explanation, I was missing the context that this was
> about higher rather than lower MTUs.
>
> Personally, I would prefer a change like the following, as it does not
> introduce another duplication of the MTU calculation (not tested at all):
>
> > --- a/drivers/net/vxlan.c
> > +++ b/drivers/net/vxlan.c
> > @@ -3105,7 +3105,7 @@ static void vxlan_config_apply(struct net_device *dev,
> > VXLAN_HEADROOM);
> > }
> >
> > - if (dev->mtu > max_mtu)
> > + if (dev->mtu > max_mtu || (!changelink && !conf->mtu))
> > dev->mtu = max_mtu;
You would also need to check that lowerdev is present, though.
Otherwise, you're changing the behaviour again, that is, if lowerdev is
not present, we want to keep 1500 and not set ETH_MAX_MTU (65535).
Sure you can change the if condition to reflect that, but IMHO it
becomes quite awkward.
--
Stefano
^ permalink raw reply
* [PATCH net-next 0/4] ERSPAN version 2 (type III) support
From: William Tu @ 2017-12-14 0:38 UTC (permalink / raw)
To: netdev
ERSPAN has two versions, v1 (type II) and v2 (type III). This patch
series add support for erspan v2 based on existing erspan v1
implementation. The first patch refactors the existing erspan v1's
header structure, making it extensible to put additional v2's header.
The second and third patch introduces erspan v2's implementation to
ipv4 and ipv6 erspan, for both native mode and collect metadata mode.
Finally, test cases are added under the samples/bpf.
Note:
ERSPAN version 2 has many features and this patch does not implement
all. One major use case of version 2 over version 1 is its timestamp
and direction. So the traffic collector is able to distinguish the
mirrorred traffic better. Other features such as SGT (security group
tag), FT (frame type) for carrying non-ethernet packet, and optional
subheader are not implemented yet.
Example commandline for ERSPAN version 2:
ip link add dev ip6erspan11 type ip6erspan seq key 102 \
local fc00:100::2 remote fc00:100::1 \
erspan_ver 2 erspan_dir 1 erspan_hwid 17
The corresponding iproute2 patch:
https://marc.info/?l=linux-netdev&m=151321141525106&w=2
William Tu (4):
net: erspan: refactor existing erspan code
net: erspan: introduce erspan v2 for ip_gre
ip6_gre: add erspan v2 support
samples/bpf: add erspan v2 sample code
include/net/erspan.h | 152 ++++++++++++++++++++++++++++++++++++++---
include/net/ip6_tunnel.h | 3 +
include/net/ip_tunnels.h | 5 +-
include/uapi/linux/if_ether.h | 1 +
include/uapi/linux/if_tunnel.h | 3 +
net/ipv4/ip_gre.c | 124 +++++++++++++++++++++++++++------
net/ipv6/ip6_gre.c | 139 +++++++++++++++++++++++++++++++------
net/openvswitch/flow_netlink.c | 8 +--
samples/bpf/tcbpf2_kern.c | 77 ++++++++++++++++++---
samples/bpf/test_tunnel_bpf.sh | 38 ++++++++---
10 files changed, 472 insertions(+), 78 deletions(-)
--
A simple script to test it:
#!/bin/bash
# In the namespace NS0, create veth0 and ip6erspan00
# Out of the namespace, create veth1 and ip6erspan11
# Ping in and out of namespace using ERSPAN protocol
set -ex
function cleanup() {
set +ex
ip netns del ns0
ip link del ip6erspan11
ip link del veth1
}
function main() {
trap cleanup 0 2 3 9
ip netns add ns0
ip link add veth0 type veth peer name veth1
ip link set veth0 netns ns0
# non-namespace
ip addr add dev veth1 fc00:100::2/96
if [ "$1" == "v1" ]; then
echo "create IP6 ERSPAN v1 tunnel"
ip link add dev ip6erspan11 type ip6erspan seq key 102 \
local fc00:100::2 remote fc00:100::1 \
erspan 123 erspan_ver 1
else
echo "create IP6 ERSPAN v2 tunnel"
ip link add dev ip6erspan11 type ip6erspan seq key 102 \
local fc00:100::2 remote fc00:100::1 \
erspan_ver 2 erspan_dir 1 erspan_hwid 17
fi
ip addr add dev ip6erspan11 fc00:200::2/96
ip addr add dev ip6erspan11 10.10.200.2/24
# namespace: ns0
ip netns exec ns0 ip addr add fc00:100::1/96 dev veth0
if [ "$1" == "v1" ]; then
ip netns exec ns0 \
ip link add dev ip6erspan00 type ip6erspan seq key 102 \
local fc00:100::1 remote fc00:100::2 \
erspan 123 erspan_ver 1
else
ip netns exec ns0 \
ip link add dev ip6erspan00 type ip6erspan seq key 102 \
local fc00:100::1 remote fc00:100::2 \
erspan_ver 2 erspan_dir 1 erspan_hwid 7
fi
ip netns exec ns0 ip addr add dev ip6erspan00 fc00:200::1/96
ip netns exec ns0 ip addr add dev ip6erspan00 10.10.200.1/24
ip link set dev veth1 up
ip link set dev ip6erspan11 up
ip netns exec ns0 ip link set dev ip6erspan00 up
ip netns exec ns0 ip link set dev veth0 up
}
main $1
# Ping underlying
ping6 -c 1 fc00:100::1 || true
# ping overlay
ping -c 3 10.10.200.1
exit 0
^ permalink raw reply
* [PATCH net-next 1/4] net: erspan: refactor existing erspan code
From: William Tu @ 2017-12-14 0:38 UTC (permalink / raw)
To: netdev
In-Reply-To: <1513211938-8749-1-git-send-email-u9012063@gmail.com>
The patch refactors the existing erspan implementation in order
to support erspan version 2, which has additional metadata. So, in
stead of having one 'struct erspanhdr' holding erspan version 1,
breaks it into 'struct erspan_base_hdr' and 'struct erspan_metadata'.
Signed-off-by: William Tu <u9012063@gmail.com>
---
include/net/erspan.h | 34 ++++++++++++++++++++++++----------
net/ipv4/ip_gre.c | 27 +++++++++++++++++----------
net/ipv6/ip6_gre.c | 25 ++++++++++++++++---------
net/openvswitch/flow_netlink.c | 8 ++++----
4 files changed, 61 insertions(+), 33 deletions(-)
diff --git a/include/net/erspan.h b/include/net/erspan.h
index 6e758d08c9ee..70c40c7c75b2 100644
--- a/include/net/erspan.h
+++ b/include/net/erspan.h
@@ -15,7 +15,7 @@
* s, Recur, Flags, Version fields only S (bit 03) is set to 1. The
* other fields are set to zero, so only a sequence number follows.
*
- * ERSPAN Type II header (8 octets [42:49])
+ * ERSPAN Version 1 (Type II) header (8 octets [42:49])
* 0 1 2 3
* 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
* +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
@@ -27,7 +27,7 @@
* GRE proto ERSPAN type II = 0x88BE, type III = 0x22EB
*/
-#define ERSPAN_VERSION 0x1
+#define ERSPAN_VERSION 0x1 /* ERSPAN type II */
#define VER_MASK 0xf000
#define VLAN_MASK 0x0fff
@@ -44,20 +44,29 @@ enum erspan_encap_type {
ERSPAN_ENCAP_INFRAME = 0x3, /* VLAN tag perserved in frame */
};
+#define ERSPAN_V1_MDSIZE 4
+#define ERSPAN_V2_MDSIZE 8
struct erspan_metadata {
- __be32 index; /* type II */
+ union {
+ __be32 index; /* Version 1 (type II)*/
+ } u;
};
-struct erspanhdr {
+struct erspan_base_hdr {
__be16 ver_vlan;
#define VER_OFFSET 12
__be16 session_id;
#define COS_OFFSET 13
#define EN_OFFSET 11
#define T_OFFSET 10
- struct erspan_metadata md;
};
+static inline int erspan_hdr_len(int version)
+{
+ return sizeof(struct erspan_base_hdr) +
+ (version == 1 ? ERSPAN_V1_MDSIZE : ERSPAN_V2_MDSIZE);
+}
+
static inline u8 tos_to_cos(u8 tos)
{
u8 dscp, cos;
@@ -73,7 +82,8 @@ static inline void erspan_build_header(struct sk_buff *skb,
{
struct ethhdr *eth = eth_hdr(skb);
enum erspan_encap_type enc_type;
- struct erspanhdr *ershdr;
+ struct erspan_base_hdr *ershdr;
+ struct erspan_metadata *ersmd;
struct qtag_prefix {
__be16 eth_type;
__be16 tci;
@@ -96,17 +106,21 @@ static inline void erspan_build_header(struct sk_buff *skb,
enc_type = ERSPAN_ENCAP_INFRAME;
}
- skb_push(skb, sizeof(*ershdr));
- ershdr = (struct erspanhdr *)skb->data;
- memset(ershdr, 0, sizeof(*ershdr));
+ skb_push(skb, sizeof(*ershdr) + ERSPAN_V1_MDSIZE);
+ ershdr = (struct erspan_base_hdr *)skb->data;
+ memset(ershdr, 0, sizeof(*ershdr) + ERSPAN_V1_MDSIZE);
+ /* Build base header */
ershdr->ver_vlan = htons((vlan_tci & VLAN_MASK) |
(ERSPAN_VERSION << VER_OFFSET));
ershdr->session_id = htons((u16)(ntohl(id) & ID_MASK) |
((tos_to_cos(tos) << COS_OFFSET) & COS_MASK) |
(enc_type << EN_OFFSET & EN_MASK) |
((truncate << T_OFFSET) & T_MASK));
- ershdr->md.index = htonl(index & INDEX_MASK);
+
+ /* Build metadata */
+ ersmd = (struct erspan_metadata *)(ershdr + 1);
+ ersmd->u.index = htonl(index & INDEX_MASK);
}
#endif
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index d828821d88d7..3e37402147f3 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -256,34 +256,41 @@ static int erspan_rcv(struct sk_buff *skb, struct tnl_ptk_info *tpi,
{
struct net *net = dev_net(skb->dev);
struct metadata_dst *tun_dst = NULL;
+ struct erspan_base_hdr *ershdr;
+ struct erspan_metadata *pkt_md;
struct ip_tunnel_net *itn;
struct ip_tunnel *tunnel;
- struct erspanhdr *ershdr;
const struct iphdr *iph;
- __be32 index;
+ int ver;
int len;
itn = net_generic(net, erspan_net_id);
len = gre_hdr_len + sizeof(*ershdr);
+ /* Check based hdr len */
if (unlikely(!pskb_may_pull(skb, len)))
return -ENOMEM;
iph = ip_hdr(skb);
- ershdr = (struct erspanhdr *)(skb->data + gre_hdr_len);
+ ershdr = (struct erspan_base_hdr *)(skb->data + gre_hdr_len);
+ ver = (ntohs(ershdr->ver_vlan) & VER_MASK) >> VER_OFFSET;
/* The original GRE header does not have key field,
* Use ERSPAN 10-bit session ID as key.
*/
tpi->key = cpu_to_be32(ntohs(ershdr->session_id) & ID_MASK);
- index = ershdr->md.index;
+ pkt_md = (struct erspan_metadata *)(ershdr + 1);
tunnel = ip_tunnel_lookup(itn, skb->dev->ifindex,
tpi->flags | TUNNEL_KEY,
iph->saddr, iph->daddr, tpi->key);
if (tunnel) {
+ len = gre_hdr_len + erspan_hdr_len(ver);
+ if (unlikely(!pskb_may_pull(skb, len)))
+ return -ENOMEM;
+
if (__iptunnel_pull_header(skb,
- gre_hdr_len + sizeof(*ershdr),
+ len,
htons(ETH_P_TEB),
false, false) < 0)
goto drop;
@@ -307,12 +314,12 @@ static int erspan_rcv(struct sk_buff *skb, struct tnl_ptk_info *tpi,
if (!md)
return PACKET_REJECT;
- md->index = index;
+ memcpy(md, pkt_md, sizeof(*md));
info = &tun_dst->u.tun_info;
info->key.tun_flags |= TUNNEL_ERSPAN_OPT;
info->options_len = sizeof(*md);
} else {
- tunnel->index = ntohl(index);
+ tunnel->index = ntohl(pkt_md->u.index);
}
skb_reset_mac_header(skb);
@@ -571,7 +578,7 @@ static void erspan_fb_xmit(struct sk_buff *skb, struct net_device *dev,
key = &tun_info->key;
/* ERSPAN has fixed 8 byte GRE header */
- tunnel_hlen = 8 + sizeof(struct erspanhdr);
+ tunnel_hlen = 8 + sizeof(struct erspan_base_hdr) + ERSPAN_V1_MDSIZE;
rt = prepare_fb_xmit(skb, dev, &fl, tunnel_hlen);
if (!rt)
@@ -590,7 +597,7 @@ static void erspan_fb_xmit(struct sk_buff *skb, struct net_device *dev,
goto err_free_rt;
erspan_build_header(skb, tunnel_id_to_key32(key->tun_id),
- ntohl(md->index), truncate, true);
+ ntohl(md->u.index), truncate, true);
gre_build_header(skb, 8, TUNNEL_SEQ,
htons(ETH_P_ERSPAN), 0, htonl(tunnel->o_seqno++));
@@ -1238,7 +1245,7 @@ static int erspan_tunnel_init(struct net_device *dev)
tunnel->tun_hlen = 8;
tunnel->parms.iph.protocol = IPPROTO_GRE;
tunnel->hlen = tunnel->tun_hlen + tunnel->encap_hlen +
- sizeof(struct erspanhdr);
+ sizeof(struct erspan_base_hdr) + ERSPAN_V1_MDSIZE;
t_hlen = tunnel->hlen + sizeof(struct iphdr);
dev->needed_headroom = LL_MAX_HEADER + t_hlen + 4;
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 4562579797d1..1303d0c44c36 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -501,25 +501,32 @@ static int ip6gre_rcv(struct sk_buff *skb, const struct tnl_ptk_info *tpi)
static int ip6erspan_rcv(struct sk_buff *skb, int gre_hdr_len,
struct tnl_ptk_info *tpi)
{
+ struct erspan_base_hdr *ershdr;
+ struct erspan_metadata *pkt_md;
const struct ipv6hdr *ipv6h;
- struct erspanhdr *ershdr;
struct ip6_tnl *tunnel;
- __be32 index;
+ u8 ver;
ipv6h = ipv6_hdr(skb);
- ershdr = (struct erspanhdr *)skb->data;
+ ershdr = (struct erspan_base_hdr *)skb->data;
if (unlikely(!pskb_may_pull(skb, sizeof(*ershdr))))
return PACKET_REJECT;
+ ver = (ntohs(ershdr->ver_vlan) & VER_MASK) >> VER_OFFSET;
tpi->key = cpu_to_be32(ntohs(ershdr->session_id) & ID_MASK);
- index = ershdr->md.index;
+ pkt_md = (struct erspan_metadata *)(ershdr + 1);
tunnel = ip6gre_tunnel_lookup(skb->dev,
&ipv6h->saddr, &ipv6h->daddr, tpi->key,
tpi->proto);
if (tunnel) {
- if (__iptunnel_pull_header(skb, sizeof(*ershdr),
+ int len = erspan_hdr_len(ver);
+
+ if (unlikely(!pskb_may_pull(skb, len)))
+ return -ENOMEM;
+
+ if (__iptunnel_pull_header(skb, len,
htons(ETH_P_TEB),
false, false) < 0)
return PACKET_REJECT;
@@ -545,14 +552,14 @@ static int ip6erspan_rcv(struct sk_buff *skb, int gre_hdr_len,
if (!md)
return PACKET_REJECT;
- md->index = index;
+ memcpy(md, pkt_md, sizeof(*md));
info->key.tun_flags |= TUNNEL_ERSPAN_OPT;
info->options_len = sizeof(*md);
ip6_tnl_rcv(tunnel, skb, tpi, tun_dst, log_ecn_error);
} else {
- tunnel->parms.index = ntohl(index);
+ tunnel->parms.index = ntohl(pkt_md->u.index);
ip6_tnl_rcv(tunnel, skb, tpi, NULL, log_ecn_error);
}
@@ -921,7 +928,7 @@ static netdev_tx_t ip6erspan_tunnel_xmit(struct sk_buff *skb,
goto tx_err;
erspan_build_header(skb, tunnel_id_to_key32(key->tun_id),
- ntohl(md->index), truncate, false);
+ ntohl(md->u.index), truncate, false);
} else {
switch (skb->protocol) {
@@ -1657,7 +1664,7 @@ static int ip6erspan_tap_init(struct net_device *dev)
tunnel->tun_hlen = 8;
tunnel->hlen = tunnel->tun_hlen + tunnel->encap_hlen +
- sizeof(struct erspanhdr);
+ sizeof(struct erspan_base_hdr) + ERSPAN_V1_MDSIZE;
t_hlen = tunnel->hlen + sizeof(struct ipv6hdr);
dev->hard_header_len = LL_MAX_HEADER + t_hlen;
diff --git a/net/openvswitch/flow_netlink.c b/net/openvswitch/flow_netlink.c
index 624ea74353dd..bce1f78b0de5 100644
--- a/net/openvswitch/flow_netlink.c
+++ b/net/openvswitch/flow_netlink.c
@@ -644,12 +644,12 @@ static int erspan_tun_opt_from_nlattr(const struct nlattr *attr,
BUILD_BUG_ON(sizeof(opts) > sizeof(match->key->tun_opts));
memset(&opts, 0, sizeof(opts));
- opts.index = nla_get_be32(attr);
+ opts.u.index = nla_get_be32(attr);
/* Index has only 20-bit */
- if (ntohl(opts.index) & ~INDEX_MASK) {
+ if (ntohl(opts.u.index) & ~INDEX_MASK) {
OVS_NLERR(log, "ERSPAN index number %x too large.",
- ntohl(opts.index));
+ ntohl(opts.u.index));
return -EINVAL;
}
@@ -907,7 +907,7 @@ static int __ip_tun_to_nlattr(struct sk_buff *skb,
return -EMSGSIZE;
else if (output->tun_flags & TUNNEL_ERSPAN_OPT &&
nla_put_be32(skb, OVS_TUNNEL_KEY_ATTR_ERSPAN_OPTS,
- ((struct erspan_metadata *)tun_opts)->index))
+ ((struct erspan_metadata *)tun_opts)->u.index))
return -EMSGSIZE;
}
--
2.7.4
^ permalink raw reply related
* [PATCH net-next 2/4] net: erspan: introduce erspan v2 for ip_gre
From: William Tu @ 2017-12-14 0:38 UTC (permalink / raw)
To: netdev
In-Reply-To: <1513211938-8749-1-git-send-email-u9012063@gmail.com>
The patch adds support for erspan version 2. Not all features are
supported in this patch. The SGT (security group tag), GRA (timestamp
granularity), FT (frame type) are set to fixed value. Only hardware
ID and direction are configurable. Optional subheader is also not
supported.
Signed-off-by: William Tu <u9012063@gmail.com>
---
include/net/erspan.h | 120 ++++++++++++++++++++++++++++++++++++++++-
include/net/ip_tunnels.h | 5 +-
include/uapi/linux/if_ether.h | 1 +
include/uapi/linux/if_tunnel.h | 3 ++
net/ipv4/ip_gre.c | 105 ++++++++++++++++++++++++++++++------
5 files changed, 216 insertions(+), 18 deletions(-)
diff --git a/include/net/erspan.h b/include/net/erspan.h
index 70c40c7c75b2..acdf6843095d 100644
--- a/include/net/erspan.h
+++ b/include/net/erspan.h
@@ -24,11 +24,29 @@
* | Reserved | Index |
* +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
*
+ *
+ * ERSPAN Version 2 (Type III) header (12 octets [42:49])
+ * 0 1 2 3
+ * 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * | Ver | VLAN | COS |BSO|T| Session ID |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * | Timestamp |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * | SGT |P| FT | Hw ID |D|Gra|O|
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ *
+ * Platform Specific SubHeader (8 octets, optional)
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * | Platf ID | Platform Specific Info |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ * | Platform Specific Info |
+ * +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+ *
* GRE proto ERSPAN type II = 0x88BE, type III = 0x22EB
*/
#define ERSPAN_VERSION 0x1 /* ERSPAN type II */
-
#define VER_MASK 0xf000
#define VLAN_MASK 0x0fff
#define COS_MASK 0xe000
@@ -37,6 +55,28 @@
#define ID_MASK 0x03ff
#define INDEX_MASK 0xfffff
+#define ERSPAN_VERSION2 0x2 /* ERSPAN type III*/
+#define BSO_MASK EN_MASK
+#define SGT_MASK 0xffff0000
+#define P_MASK 0x8000
+#define FT_MASK 0x7c00
+#define HWID_MASK 0x03f0
+#define DIR_MASK 0x0008
+#define GRA_MASK 0x0006
+#define O_MASK 0x0001
+
+/* ERSPAN version 2 metadata header */
+struct erspan_md2 {
+ __be32 timestamp;
+ __be16 sgt; /* security group tag */
+ __be16 flags;
+#define P_OFFSET 15
+#define FT_OFFSET 10
+#define HWID_OFFSET 4
+#define DIR_OFFSET 3
+#define GRA_OFFSET 1
+};
+
enum erspan_encap_type {
ERSPAN_ENCAP_NOVLAN = 0x0, /* originally without VLAN tag */
ERSPAN_ENCAP_ISL = 0x1, /* originally ISL encapsulated */
@@ -48,8 +88,10 @@ enum erspan_encap_type {
#define ERSPAN_V2_MDSIZE 8
struct erspan_metadata {
union {
- __be32 index; /* Version 1 (type II)*/
+ __be32 index; /* Version 1 (type II)*/
+ struct erspan_md2 md2; /* Version 2 (type III) */
} u;
+ int version;
};
struct erspan_base_hdr {
@@ -58,6 +100,7 @@ struct erspan_base_hdr {
__be16 session_id;
#define COS_OFFSET 13
#define EN_OFFSET 11
+#define BSO_OFFSET EN_OFFSET
#define T_OFFSET 10
};
@@ -123,4 +166,77 @@ static inline void erspan_build_header(struct sk_buff *skb,
ersmd->u.index = htonl(index & INDEX_MASK);
}
+/* ERSPAN GRA: timestamp granularity
+ * 00b --> granularity = 100 microseconds
+ * 01b --> granularity = 100 nanoseconds
+ * 10b --> granularity = IEEE 1588
+ * Here we only support 100 microseconds.
+ */
+static inline __be32 erspan_get_timestamp(void)
+{
+ u64 h_usecs;
+ ktime_t kt;
+
+ kt = ktime_get_real();
+ h_usecs = ktime_divns(kt, 100 * NSEC_PER_USEC);
+
+ /* ERSPAN base header only has 32-bit,
+ * so it wraps around 4 days.
+ */
+ return htonl((u32)h_usecs);
+}
+
+static inline void erspan_build_header_v2(struct sk_buff *skb,
+ __be32 id, u8 direction, u16 hwid,
+ bool truncate, bool is_ipv4)
+{
+ struct ethhdr *eth = eth_hdr(skb);
+ struct erspan_base_hdr *ershdr;
+ struct erspan_metadata *md;
+ struct qtag_prefix {
+ __be16 eth_type;
+ __be16 tci;
+ } *qp;
+ u16 vlan_tci = 0;
+ u16 session_id;
+ u8 gra = 0; /* 100 usec */
+ u8 bso = 0; /* Bad/Short/Oversized */
+ u8 sgt = 0;
+ u8 tos;
+
+ tos = is_ipv4 ? ip_hdr(skb)->tos :
+ (ipv6_hdr(skb)->priority << 4) +
+ (ipv6_hdr(skb)->flow_lbl[0] >> 4);
+
+ /* Unlike v1, v2 does not have En field,
+ * so only extract vlan tci field.
+ */
+ if (eth->h_proto == htons(ETH_P_8021Q)) {
+ qp = (struct qtag_prefix *)(skb->data + 2 * ETH_ALEN);
+ vlan_tci = ntohs(qp->tci);
+ }
+
+ skb_push(skb, sizeof(*ershdr) + ERSPAN_V2_MDSIZE);
+ ershdr = (struct erspan_base_hdr *)skb->data;
+ memset(ershdr, 0, sizeof(*ershdr) + ERSPAN_V2_MDSIZE);
+
+ /* Build base header */
+ ershdr->ver_vlan = htons((vlan_tci & VLAN_MASK) |
+ (ERSPAN_VERSION2 << VER_OFFSET));
+ session_id = (u16)(ntohl(id) & ID_MASK) |
+ ((tos_to_cos(tos) << COS_OFFSET) & COS_MASK) |
+ (bso << BSO_OFFSET & BSO_MASK) |
+ ((truncate << T_OFFSET) & T_MASK);
+ ershdr->session_id = htons(session_id);
+
+ /* Build metadata */
+ md = (struct erspan_metadata *)(ershdr + 1);
+ md->u.md2.timestamp = erspan_get_timestamp();
+ md->u.md2.sgt = htons(sgt);
+ md->u.md2.flags = htons(((1 << P_OFFSET) & P_MASK) |
+ ((hwid << HWID_OFFSET) & HWID_MASK) |
+ ((direction << DIR_OFFSET) & DIR_MASK) |
+ ((gra << GRA_OFFSET) & GRA_MASK));
+}
+
#endif
diff --git a/include/net/ip_tunnels.h b/include/net/ip_tunnels.h
index 24628f6b09bf..1f16773cfd76 100644
--- a/include/net/ip_tunnels.h
+++ b/include/net/ip_tunnels.h
@@ -116,8 +116,11 @@ struct ip_tunnel {
u32 o_seqno; /* The last output seqno */
int tun_hlen; /* Precalculated header length */
- /* This field used only by ERSPAN */
+ /* These four fields used only by ERSPAN */
u32 index; /* ERSPAN type II index */
+ u8 erspan_ver; /* ERSPAN version */
+ u8 dir; /* ERSPAN direction */
+ u16 hwid; /* ERSPAN hardware ID */
struct dst_cache dst_cache;
diff --git a/include/uapi/linux/if_ether.h b/include/uapi/linux/if_ether.h
index 3ee3bf7c8526..87b7529fcdfe 100644
--- a/include/uapi/linux/if_ether.h
+++ b/include/uapi/linux/if_ether.h
@@ -47,6 +47,7 @@
#define ETH_P_PUP 0x0200 /* Xerox PUP packet */
#define ETH_P_PUPAT 0x0201 /* Xerox PUP Addr Trans packet */
#define ETH_P_TSN 0x22F0 /* TSN (IEEE 1722) packet */
+#define ETH_P_ERSPAN2 0x22EB /* ERSPAN version 2 (type III) */
#define ETH_P_IP 0x0800 /* Internet Protocol packet */
#define ETH_P_X25 0x0805 /* CCITT X.25 */
#define ETH_P_ARP 0x0806 /* Address Resolution packet */
diff --git a/include/uapi/linux/if_tunnel.h b/include/uapi/linux/if_tunnel.h
index e68dadbd6d45..1b3d148c4560 100644
--- a/include/uapi/linux/if_tunnel.h
+++ b/include/uapi/linux/if_tunnel.h
@@ -137,6 +137,9 @@ enum {
IFLA_GRE_IGNORE_DF,
IFLA_GRE_FWMARK,
IFLA_GRE_ERSPAN_INDEX,
+ IFLA_GRE_ERSPAN_VER,
+ IFLA_GRE_ERSPAN_DIR,
+ IFLA_GRE_ERSPAN_HWID,
__IFLA_GRE_MAX,
};
diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
index 3e37402147f3..004800b923c6 100644
--- a/net/ipv4/ip_gre.c
+++ b/net/ipv4/ip_gre.c
@@ -315,11 +315,26 @@ static int erspan_rcv(struct sk_buff *skb, struct tnl_ptk_info *tpi,
return PACKET_REJECT;
memcpy(md, pkt_md, sizeof(*md));
+ md->version = ver;
+
info = &tun_dst->u.tun_info;
info->key.tun_flags |= TUNNEL_ERSPAN_OPT;
info->options_len = sizeof(*md);
} else {
- tunnel->index = ntohl(pkt_md->u.index);
+ tunnel->erspan_ver = ver;
+ if (ver == 1) {
+ tunnel->index = ntohl(pkt_md->u.index);
+ } else {
+ u16 md2_flags;
+ u16 dir, hwid;
+
+ md2_flags = ntohs(pkt_md->u.md2.flags);
+ dir = (md2_flags & DIR_MASK) >> DIR_OFFSET;
+ hwid = (md2_flags & HWID_MASK) >> HWID_OFFSET;
+ tunnel->dir = dir;
+ tunnel->hwid = hwid;
+ }
+
}
skb_reset_mac_header(skb);
@@ -413,7 +428,8 @@ static int gre_rcv(struct sk_buff *skb)
if (hdr_len < 0)
goto drop;
- if (unlikely(tpi.proto == htons(ETH_P_ERSPAN))) {
+ if (unlikely(tpi.proto == htons(ETH_P_ERSPAN) ||
+ tpi.proto == htons(ETH_P_ERSPAN2))) {
if (erspan_rcv(skb, &tpi, hdr_len) == PACKET_RCVD)
return 0;
}
@@ -568,6 +584,7 @@ static void erspan_fb_xmit(struct sk_buff *skb, struct net_device *dev,
bool truncate = false;
struct flowi4 fl;
int tunnel_hlen;
+ int version;
__be16 df;
tun_info = skb_tunnel_info(skb);
@@ -576,9 +593,13 @@ static void erspan_fb_xmit(struct sk_buff *skb, struct net_device *dev,
goto err_free_skb;
key = &tun_info->key;
+ md = ip_tunnel_info_opts(tun_info);
+ if (!md)
+ goto err_free_rt;
/* ERSPAN has fixed 8 byte GRE header */
- tunnel_hlen = 8 + sizeof(struct erspan_base_hdr) + ERSPAN_V1_MDSIZE;
+ version = md->version;
+ tunnel_hlen = 8 + erspan_hdr_len(version);
rt = prepare_fb_xmit(skb, dev, &fl, tunnel_hlen);
if (!rt)
@@ -592,12 +613,23 @@ static void erspan_fb_xmit(struct sk_buff *skb, struct net_device *dev,
truncate = true;
}
- md = ip_tunnel_info_opts(tun_info);
- if (!md)
- goto err_free_rt;
+ if (version == 1) {
+ erspan_build_header(skb, tunnel_id_to_key32(key->tun_id),
+ ntohl(md->u.index), truncate, true);
+ } else if (version == 2) {
+ u16 md2_flags;
+ u8 direction;
+ u16 hwid;
- erspan_build_header(skb, tunnel_id_to_key32(key->tun_id),
- ntohl(md->u.index), truncate, true);
+ md2_flags = ntohs(md->u.md2.flags);
+ direction = (md2_flags & DIR_MASK) >> DIR_OFFSET;
+ hwid = (md2_flags & HWID_MASK) >> HWID_OFFSET;
+
+ erspan_build_header_v2(skb, tunnel_id_to_key32(key->tun_id),
+ direction, hwid, truncate, true);
+ } else {
+ goto err_free_rt;
+ }
gre_build_header(skb, 8, TUNNEL_SEQ,
htons(ETH_P_ERSPAN), 0, htonl(tunnel->o_seqno++));
@@ -699,8 +731,14 @@ static netdev_tx_t erspan_xmit(struct sk_buff *skb,
}
/* Push ERSPAN header */
- erspan_build_header(skb, tunnel->parms.o_key, tunnel->index,
- truncate, true);
+ if (tunnel->erspan_ver == 1)
+ erspan_build_header(skb, tunnel->parms.o_key, tunnel->index,
+ truncate, true);
+ else
+ erspan_build_header_v2(skb, tunnel->parms.o_key,
+ tunnel->dir, tunnel->hwid,
+ truncate, true);
+
tunnel->parms.o_flags &= ~TUNNEL_KEY;
__gre_xmit(skb, dev, &tunnel->parms.iph, htons(ETH_P_ERSPAN));
return NETDEV_TX_OK;
@@ -1172,13 +1210,32 @@ static int ipgre_netlink_parms(struct net_device *dev,
if (data[IFLA_GRE_FWMARK])
*fwmark = nla_get_u32(data[IFLA_GRE_FWMARK]);
- if (data[IFLA_GRE_ERSPAN_INDEX]) {
- t->index = nla_get_u32(data[IFLA_GRE_ERSPAN_INDEX]);
+ if (data[IFLA_GRE_ERSPAN_VER]) {
+ t->erspan_ver = nla_get_u8(data[IFLA_GRE_ERSPAN_VER]);
- if (t->index & ~INDEX_MASK)
+ if (t->erspan_ver != 1 && t->erspan_ver != 2)
return -EINVAL;
}
+ if (t->erspan_ver == 1) {
+ if (data[IFLA_GRE_ERSPAN_INDEX]) {
+ t->index = nla_get_u32(data[IFLA_GRE_ERSPAN_INDEX]);
+ if (t->index & ~INDEX_MASK)
+ return -EINVAL;
+ }
+ } else if (t->erspan_ver == 2) {
+ if (data[IFLA_GRE_ERSPAN_DIR]) {
+ t->dir = nla_get_u8(data[IFLA_GRE_ERSPAN_DIR]);
+ if (t->dir & ~(DIR_MASK >> DIR_OFFSET))
+ return -EINVAL;
+ }
+ if (data[IFLA_GRE_ERSPAN_HWID]) {
+ t->hwid = nla_get_u16(data[IFLA_GRE_ERSPAN_HWID]);
+ if (t->hwid & ~(HWID_MASK >> HWID_OFFSET))
+ return -EINVAL;
+ }
+ }
+
return 0;
}
@@ -1245,7 +1302,7 @@ static int erspan_tunnel_init(struct net_device *dev)
tunnel->tun_hlen = 8;
tunnel->parms.iph.protocol = IPPROTO_GRE;
tunnel->hlen = tunnel->tun_hlen + tunnel->encap_hlen +
- sizeof(struct erspan_base_hdr) + ERSPAN_V1_MDSIZE;
+ erspan_hdr_len(tunnel->erspan_ver);
t_hlen = tunnel->hlen + sizeof(struct iphdr);
dev->needed_headroom = LL_MAX_HEADER + t_hlen + 4;
@@ -1375,6 +1432,12 @@ static size_t ipgre_get_size(const struct net_device *dev)
nla_total_size(4) +
/* IFLA_GRE_ERSPAN_INDEX */
nla_total_size(4) +
+ /* IFLA_GRE_ERSPAN_VER */
+ nla_total_size(1) +
+ /* IFLA_GRE_ERSPAN_DIR */
+ nla_total_size(1) +
+ /* IFLA_GRE_ERSPAN_HWID */
+ nla_total_size(2) +
0;
}
@@ -1417,9 +1480,18 @@ static int ipgre_fill_info(struct sk_buff *skb, const struct net_device *dev)
goto nla_put_failure;
}
- if (t->index)
+ if (nla_put_u8(skb, IFLA_GRE_ERSPAN_VER, t->erspan_ver))
+ goto nla_put_failure;
+
+ if (t->erspan_ver == 1) {
if (nla_put_u32(skb, IFLA_GRE_ERSPAN_INDEX, t->index))
goto nla_put_failure;
+ } else if (t->erspan_ver == 2) {
+ if (nla_put_u8(skb, IFLA_GRE_ERSPAN_DIR, t->dir))
+ goto nla_put_failure;
+ if (nla_put_u16(skb, IFLA_GRE_ERSPAN_HWID, t->hwid))
+ goto nla_put_failure;
+ }
return 0;
@@ -1455,6 +1527,9 @@ static const struct nla_policy ipgre_policy[IFLA_GRE_MAX + 1] = {
[IFLA_GRE_IGNORE_DF] = { .type = NLA_U8 },
[IFLA_GRE_FWMARK] = { .type = NLA_U32 },
[IFLA_GRE_ERSPAN_INDEX] = { .type = NLA_U32 },
+ [IFLA_GRE_ERSPAN_VER] = { .type = NLA_U8 },
+ [IFLA_GRE_ERSPAN_DIR] = { .type = NLA_U8 },
+ [IFLA_GRE_ERSPAN_HWID] = { .type = NLA_U16 },
};
static struct rtnl_link_ops ipgre_link_ops __read_mostly = {
--
2.7.4
^ permalink raw reply related
* [PATCH net-next 3/4] ip6_gre: add erspan v2 support
From: William Tu @ 2017-12-14 0:38 UTC (permalink / raw)
To: netdev
In-Reply-To: <1513211938-8749-1-git-send-email-u9012063@gmail.com>
Similar to support for ipv4 erspan, this patch adds
erspan v2 to ip6erspan tunnel.
Signed-off-by: William Tu <u9012063@gmail.com>
---
include/net/ip6_tunnel.h | 3 ++
net/ipv6/ip6_gre.c | 120 ++++++++++++++++++++++++++++++++++++++++-------
2 files changed, 107 insertions(+), 16 deletions(-)
diff --git a/include/net/ip6_tunnel.h b/include/net/ip6_tunnel.h
index 109a5a8877ef..236e40ba06bf 100644
--- a/include/net/ip6_tunnel.h
+++ b/include/net/ip6_tunnel.h
@@ -37,6 +37,9 @@ struct __ip6_tnl_parm {
__u32 fwmark;
__u32 index; /* ERSPAN type II index */
+ __u8 erspan_ver; /* ERSPAN version */
+ __u8 dir; /* direction */
+ __u16 hwid; /* hwid */
};
/* IPv6 tunnel */
diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 1303d0c44c36..5c9c65f1d5c2 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -553,13 +553,28 @@ static int ip6erspan_rcv(struct sk_buff *skb, int gre_hdr_len,
return PACKET_REJECT;
memcpy(md, pkt_md, sizeof(*md));
+ md->version = ver;
info->key.tun_flags |= TUNNEL_ERSPAN_OPT;
info->options_len = sizeof(*md);
ip6_tnl_rcv(tunnel, skb, tpi, tun_dst, log_ecn_error);
} else {
- tunnel->parms.index = ntohl(pkt_md->u.index);
+ tunnel->parms.erspan_ver = ver;
+
+ if (ver == 1) {
+ tunnel->parms.index = ntohl(pkt_md->u.index);
+ } else {
+ u16 md2_flags;
+ u16 dir, hwid;
+
+ md2_flags = ntohs(pkt_md->u.md2.flags);
+ dir = (md2_flags & DIR_MASK) >> DIR_OFFSET;
+ hwid = (md2_flags & HWID_MASK) >> HWID_OFFSET;
+ tunnel->parms.dir = dir;
+ tunnel->parms.hwid = hwid;
+ }
+
ip6_tnl_rcv(tunnel, skb, tpi, NULL, log_ecn_error);
}
@@ -582,7 +597,8 @@ static int gre_rcv(struct sk_buff *skb)
if (iptunnel_pull_header(skb, hdr_len, tpi.proto, false))
goto drop;
- if (unlikely(tpi.proto == htons(ETH_P_ERSPAN))) {
+ if (unlikely(tpi.proto == htons(ETH_P_ERSPAN) ||
+ tpi.proto == htons(ETH_P_ERSPAN2))) {
if (ip6erspan_rcv(skb, hdr_len, &tpi) == PACKET_RCVD)
return 0;
goto drop;
@@ -927,9 +943,24 @@ static netdev_tx_t ip6erspan_tunnel_xmit(struct sk_buff *skb,
if (!md)
goto tx_err;
- erspan_build_header(skb, tunnel_id_to_key32(key->tun_id),
- ntohl(md->u.index), truncate, false);
-
+ if (md->version == 1) {
+ erspan_build_header(skb,
+ tunnel_id_to_key32(key->tun_id),
+ ntohl(md->u.index), truncate,
+ false);
+ } else if (md->version == 2) {
+ u16 md2_flags;
+ u16 dir, hwid;
+
+ md2_flags = ntohs(md->u.md2.flags);
+ dir = (md2_flags & DIR_MASK) >> DIR_OFFSET;
+ hwid = (md2_flags & HWID_MASK) >> HWID_OFFSET;
+
+ erspan_build_header_v2(skb,
+ tunnel_id_to_key32(key->tun_id),
+ dir, hwid, truncate,
+ false);
+ }
} else {
switch (skb->protocol) {
case htons(ETH_P_IP):
@@ -949,8 +980,15 @@ static netdev_tx_t ip6erspan_tunnel_xmit(struct sk_buff *skb,
break;
}
- erspan_build_header(skb, t->parms.o_key, t->parms.index,
- truncate, false);
+ if (t->parms.erspan_ver == 1)
+ erspan_build_header(skb, t->parms.o_key,
+ t->parms.index,
+ truncate, false);
+ else
+ erspan_build_header_v2(skb, t->parms.o_key,
+ t->parms.dir,
+ t->parms.hwid,
+ truncate, false);
fl6.daddr = t->parms.raddr;
}
@@ -1514,7 +1552,7 @@ static int ip6erspan_tap_validate(struct nlattr *tb[], struct nlattr *data[],
struct netlink_ext_ack *extack)
{
__be16 flags = 0;
- int ret;
+ int ret, ver = 0;
if (!data)
return 0;
@@ -1543,12 +1581,35 @@ static int ip6erspan_tap_validate(struct nlattr *tb[], struct nlattr *data[],
(ntohl(nla_get_be32(data[IFLA_GRE_OKEY])) & ~ID_MASK))
return -EINVAL;
- if (data[IFLA_GRE_ERSPAN_INDEX]) {
- u32 index = nla_get_u32(data[IFLA_GRE_ERSPAN_INDEX]);
-
- if (index & ~INDEX_MASK)
+ if (data[IFLA_GRE_ERSPAN_VER]) {
+ ver = nla_get_u8(data[IFLA_GRE_ERSPAN_VER]);
+ if (ver != 1 && ver != 2)
return -EINVAL;
}
+
+ if (ver == 1) {
+ if (data[IFLA_GRE_ERSPAN_INDEX]) {
+ u32 index = nla_get_u32(data[IFLA_GRE_ERSPAN_INDEX]);
+
+ if (index & ~INDEX_MASK)
+ return -EINVAL;
+ }
+ } else if (ver == 2) {
+ if (data[IFLA_GRE_ERSPAN_DIR]) {
+ u16 dir = nla_get_u8(data[IFLA_GRE_ERSPAN_DIR]);
+
+ if (dir & ~(DIR_MASK >> DIR_OFFSET))
+ return -EINVAL;
+ }
+
+ if (data[IFLA_GRE_ERSPAN_HWID]) {
+ u16 hwid = nla_get_u16(data[IFLA_GRE_ERSPAN_HWID]);
+
+ if (hwid & ~(HWID_MASK >> HWID_OFFSET))
+ return -EINVAL;
+ }
+ }
+
return 0;
}
@@ -1598,11 +1659,21 @@ static void ip6gre_netlink_parms(struct nlattr *data[],
if (data[IFLA_GRE_FWMARK])
parms->fwmark = nla_get_u32(data[IFLA_GRE_FWMARK]);
- if (data[IFLA_GRE_ERSPAN_INDEX])
- parms->index = nla_get_u32(data[IFLA_GRE_ERSPAN_INDEX]);
-
if (data[IFLA_GRE_COLLECT_METADATA])
parms->collect_md = true;
+
+ if (data[IFLA_GRE_ERSPAN_VER])
+ parms->erspan_ver = nla_get_u8(data[IFLA_GRE_ERSPAN_VER]);
+
+ if (parms->erspan_ver == 1) {
+ if (data[IFLA_GRE_ERSPAN_INDEX])
+ parms->index = nla_get_u32(data[IFLA_GRE_ERSPAN_INDEX]);
+ } else if (parms->erspan_ver == 2) {
+ if (data[IFLA_GRE_ERSPAN_DIR])
+ parms->dir = nla_get_u8(data[IFLA_GRE_ERSPAN_DIR]);
+ if (data[IFLA_GRE_ERSPAN_HWID])
+ parms->hwid = nla_get_u16(data[IFLA_GRE_ERSPAN_HWID]);
+ }
}
static int ip6gre_tap_init(struct net_device *dev)
@@ -1664,7 +1735,7 @@ static int ip6erspan_tap_init(struct net_device *dev)
tunnel->tun_hlen = 8;
tunnel->hlen = tunnel->tun_hlen + tunnel->encap_hlen +
- sizeof(struct erspan_base_hdr) + ERSPAN_V1_MDSIZE;
+ erspan_hdr_len(tunnel->parms.erspan_ver);
t_hlen = tunnel->hlen + sizeof(struct ipv6hdr);
dev->hard_header_len = LL_MAX_HEADER + t_hlen;
@@ -1932,6 +2003,19 @@ static int ip6gre_fill_info(struct sk_buff *skb, const struct net_device *dev)
goto nla_put_failure;
}
+ if (nla_put_u8(skb, IFLA_GRE_ERSPAN_VER, p->erspan_ver))
+ goto nla_put_failure;
+
+ if (p->erspan_ver == 1) {
+ if (nla_put_u32(skb, IFLA_GRE_ERSPAN_INDEX, p->index))
+ goto nla_put_failure;
+ } else if (p->erspan_ver == 2) {
+ if (nla_put_u8(skb, IFLA_GRE_ERSPAN_DIR, p->dir))
+ goto nla_put_failure;
+ if (nla_put_u16(skb, IFLA_GRE_ERSPAN_HWID, p->hwid))
+ goto nla_put_failure;
+ }
+
return 0;
nla_put_failure:
@@ -1957,6 +2041,9 @@ static const struct nla_policy ip6gre_policy[IFLA_GRE_MAX + 1] = {
[IFLA_GRE_COLLECT_METADATA] = { .type = NLA_FLAG },
[IFLA_GRE_FWMARK] = { .type = NLA_U32 },
[IFLA_GRE_ERSPAN_INDEX] = { .type = NLA_U32 },
+ [IFLA_GRE_ERSPAN_VER] = { .type = NLA_U8 },
+ [IFLA_GRE_ERSPAN_DIR] = { .type = NLA_U8 },
+ [IFLA_GRE_ERSPAN_HWID] = { .type = NLA_U16 },
};
static void ip6erspan_tap_setup(struct net_device *dev)
@@ -2078,4 +2165,5 @@ MODULE_AUTHOR("D. Kozlov (xeb@mail.ru)");
MODULE_DESCRIPTION("GRE over IPv6 tunneling device");
MODULE_ALIAS_RTNL_LINK("ip6gre");
MODULE_ALIAS_RTNL_LINK("ip6gretap");
+MODULE_ALIAS_RTNL_LINK("ip6erspan");
MODULE_ALIAS_NETDEV("ip6gre0");
--
2.7.4
^ permalink raw reply related
* [PATCH net-next 4/4] samples/bpf: add erspan v2 sample code
From: William Tu @ 2017-12-14 0:38 UTC (permalink / raw)
To: netdev
In-Reply-To: <1513211938-8749-1-git-send-email-u9012063@gmail.com>
Extend the existing tests for ipv4 ipv6 erspan version 2.
Signed-off-by: William Tu <u9012063@gmail.com>
---
samples/bpf/tcbpf2_kern.c | 77 +++++++++++++++++++++++++++++++++++++-----
samples/bpf/test_tunnel_bpf.sh | 38 +++++++++++++++------
2 files changed, 96 insertions(+), 19 deletions(-)
diff --git a/samples/bpf/tcbpf2_kern.c b/samples/bpf/tcbpf2_kern.c
index 79ad061079dd..f6bbf8f50da3 100644
--- a/samples/bpf/tcbpf2_kern.c
+++ b/samples/bpf/tcbpf2_kern.c
@@ -35,12 +35,22 @@ struct geneve_opt {
u8 opt_data[8]; /* hard-coded to 8 byte */
};
+struct erspan_md2 {
+ __be32 timestamp;
+ __be16 sgt;
+ __be16 flags;
+};
+
struct vxlan_metadata {
u32 gbp;
};
struct erspan_metadata {
- __be32 index;
+ union {
+ __be32 index;
+ struct erspan_md2 md2;
+ } u;
+ int version;
};
SEC("gre_set_tunnel")
@@ -143,7 +153,18 @@ int _erspan_set_tunnel(struct __sk_buff *skb)
return TC_ACT_SHOT;
}
- md.index = htonl(123);
+ __builtin_memset(&md, 0, sizeof(md));
+#ifdef ERSPAN_V1
+ md.version = 1;
+ md.u.index = htonl(123);
+#else
+ u8 direction = 1;
+ u16 hwid = 7;
+
+ md.version = 2;
+ md.u.md2.flags = htons((direction << 3) | (hwid << 4));
+#endif
+
ret = bpf_skb_set_tunnel_opt(skb, &md, sizeof(md));
if (ret < 0) {
ERROR(ret);
@@ -156,7 +177,7 @@ int _erspan_set_tunnel(struct __sk_buff *skb)
SEC("erspan_get_tunnel")
int _erspan_get_tunnel(struct __sk_buff *skb)
{
- char fmt[] = "key %d remote ip 0x%x erspan index 0x%x\n";
+ char fmt[] = "key %d remote ip 0x%x erspan version %d\n";
struct bpf_tunnel_key key;
struct erspan_metadata md;
u32 index;
@@ -174,9 +195,22 @@ int _erspan_get_tunnel(struct __sk_buff *skb)
return TC_ACT_SHOT;
}
- index = bpf_ntohl(md.index);
bpf_trace_printk(fmt, sizeof(fmt),
- key.tunnel_id, key.remote_ipv4, index);
+ key.tunnel_id, key.remote_ipv4, md.version);
+
+#ifdef ERSPAN_V1
+ char fmt2[] = "\tindex %x\n";
+
+ index = bpf_ntohl(md.u.index);
+ bpf_trace_printk(fmt2, sizeof(fmt2), index);
+#else
+ char fmt2[] = "\tdirection %d hwid %x timestamp %u\n";
+
+ bpf_trace_printk(fmt2, sizeof(fmt2),
+ (ntohs(md.u.md2.flags) >> 3) & 0x1,
+ (ntohs(md.u.md2.flags) >> 4) & 0x3f,
+ bpf_ntohl(md.u.md2.timestamp));
+#endif
return TC_ACT_OK;
}
@@ -201,7 +235,19 @@ int _ip4ip6erspan_set_tunnel(struct __sk_buff *skb)
return TC_ACT_SHOT;
}
- md.index = htonl(123);
+ __builtin_memset(&md, 0, sizeof(md));
+
+#ifdef ERSPAN_V1
+ md.u.index = htonl(123);
+ md.version = 1;
+#else
+ u8 direction = 0;
+ u16 hwid = 17;
+
+ md.version = 2;
+ md.u.md2.flags = htons((direction << 3) | (hwid << 4));
+#endif
+
ret = bpf_skb_set_tunnel_opt(skb, &md, sizeof(md));
if (ret < 0) {
ERROR(ret);
@@ -214,7 +260,7 @@ int _ip4ip6erspan_set_tunnel(struct __sk_buff *skb)
SEC("ip4ip6erspan_get_tunnel")
int _ip4ip6erspan_get_tunnel(struct __sk_buff *skb)
{
- char fmt[] = "key %d remote ip6 ::%x erspan index 0x%x\n";
+ char fmt[] = "ip6erspan get key %d remote ip6 ::%x erspan version %d\n";
struct bpf_tunnel_key key;
struct erspan_metadata md;
u32 index;
@@ -232,9 +278,22 @@ int _ip4ip6erspan_get_tunnel(struct __sk_buff *skb)
return TC_ACT_SHOT;
}
- index = bpf_ntohl(md.index);
bpf_trace_printk(fmt, sizeof(fmt),
- key.tunnel_id, key.remote_ipv6[0], index);
+ key.tunnel_id, key.remote_ipv4, md.version);
+
+#ifdef ERSPAN_V1
+ char fmt2[] = "\tindex %x\n";
+
+ index = bpf_ntohl(md.u.index);
+ bpf_trace_printk(fmt2, sizeof(fmt2), index);
+#else
+ char fmt2[] = "\tdirection %d hwid %x timestamp %u\n";
+
+ bpf_trace_printk(fmt2, sizeof(fmt2),
+ (ntohs(md.u.md2.flags) >> 3) & 0x1,
+ (ntohs(md.u.md2.flags) >> 4) & 0x3f,
+ bpf_ntohl(md.u.md2.timestamp));
+#endif
return TC_ACT_OK;
}
diff --git a/samples/bpf/test_tunnel_bpf.sh b/samples/bpf/test_tunnel_bpf.sh
index f53efb62f699..ae7f7c38309b 100755
--- a/samples/bpf/test_tunnel_bpf.sh
+++ b/samples/bpf/test_tunnel_bpf.sh
@@ -59,8 +59,17 @@ function add_ip6gretap_tunnel {
function add_erspan_tunnel {
# in namespace
- ip netns exec at_ns0 \
- ip link add dev $DEV_NS type $TYPE seq key 2 local 172.16.1.100 remote 172.16.1.200 erspan 123
+ if [ "$1" == "v1" ]; then
+ ip netns exec at_ns0 \
+ ip link add dev $DEV_NS type $TYPE seq key 2 \
+ local 172.16.1.100 remote 172.16.1.200 \
+ erspan_ver 1 erspan 123
+ else
+ ip netns exec at_ns0 \
+ ip link add dev $DEV_NS type $TYPE seq key 2 \
+ local 172.16.1.100 remote 172.16.1.200 \
+ erspan_ver 2 erspan_dir 1 erspan_hwid 3
+ fi
ip netns exec at_ns0 ip link set dev $DEV_NS up
ip netns exec at_ns0 ip addr add dev $DEV_NS 10.1.1.100/24
@@ -79,10 +88,17 @@ function add_ip6erspan_tunnel {
ip link set dev veth1 up
# in namespace
- ip netns exec at_ns0 \
- ip link add dev $DEV_NS type $TYPE seq key 2 erspan 123 \
- local ::11 remote ::22
-
+ if [ "$1" == "v1" ]; then
+ ip netns exec at_ns0 \
+ ip link add dev $DEV_NS type $TYPE seq key 2 \
+ local ::11 remote ::22 \
+ erspan_ver 1 erspan 123
+ else
+ ip netns exec at_ns0 \
+ ip link add dev $DEV_NS type $TYPE seq key 2 \
+ local ::11 remote ::22 \
+ erspan_ver 2 erspan_dir 1 erspan_hwid 7
+ fi
ip netns exec at_ns0 ip addr add dev $DEV_NS 10.1.1.100/24
ip netns exec at_ns0 ip link set dev $DEV_NS up
@@ -199,7 +215,7 @@ function test_erspan {
DEV_NS=erspan00
DEV=erspan11
config_device
- add_erspan_tunnel
+ add_erspan_tunnel $1
attach_bpf $DEV erspan_set_tunnel erspan_get_tunnel
ping -c 1 10.1.1.100
ip netns exec at_ns0 ping -c 1 10.1.1.200
@@ -211,7 +227,7 @@ function test_ip6erspan {
DEV_NS=ip6erspan00
DEV=ip6erspan11
config_device
- add_ip6erspan_tunnel
+ add_ip6erspan_tunnel $1
attach_bpf $DEV ip4ip6erspan_set_tunnel ip4ip6erspan_get_tunnel
ping6 -c 3 ::11
ip netns exec at_ns0 ping -c 1 10.1.1.200
@@ -288,9 +304,11 @@ test_ip6gre
echo "Testing IP6GRETAP tunnel..."
test_ip6gretap
echo "Testing ERSPAN tunnel..."
-test_erspan
+test_erspan v1
+test_erspan v2
echo "Testing IP6ERSPAN tunnel..."
-test_ip6erspan
+test_ip6erspan v1
+test_ip6erspan v2
echo "Testing VXLAN tunnel..."
test_vxlan
echo "Testing GENEVE tunnel..."
--
2.7.4
^ permalink raw reply related
* Re: [patch net-next v3 00/10] net: sched: allow qdiscs to share filter block instances
From: Jakub Kicinski @ 2017-12-14 0:46 UTC (permalink / raw)
To: Jiri Pirko, David Ahern
Cc: netdev, davem, jhs, xiyou.wangcong, mlxsw, andrew, vivien.didelot,
f.fainelli, michael.chan, ganeshgr, saeedm, matanb, leonro,
idosch, simon.horman, pieter.jansenvanvuuren, john.hurley,
alexander.h.duyck, ogerlitz, john.fastabend, daniel
In-Reply-To: <20171213184241.GL2031@nanopsycho>
On Wed, 13 Dec 2017 19:42:41 +0100, Jiri Pirko wrote:
> >>>> I plan to do it as a follow-up patch. But this is how things are done
> >>>> now and have to continue to work.
> >>>
> >>> Why is that? You are introducing the notion of a shared block with this
> >>> patch set. What is the legacy "how things are done now" you are
> >>> referring to?
> >>
> >> Well, the filter add/del should just work no matter if the block behind is
> >> shared or not.
> >
> >My argument is that modifying a shared block instance via a dev should
> >not be allowed. Those changes should only be allowed via the shared
> >block. So if a user puts adds a shared block to the device and then
> >attempts to add a filter via the device it should not be allowed.
>
> I don't see why. The handle is the qdisc here.
If you look at it from Linux perspective that makes sense. For people
coming from switching world the fact that we use qdiscs as a handle for
ACL blocks is an implementation detail.. is that the argument here?
^ permalink raw reply
* Re: [PATCH net] vxlan: Restore initial MTU setting based on lower device
From: Matthias Schiffer @ 2017-12-14 0:53 UTC (permalink / raw)
To: Stefano Brivio
Cc: David S . Miller, netdev, Junhan Yan, Jiri Benc, Hangbin Liu
In-Reply-To: <20171214013148.3da52cc9@elisabeth>
[-- Attachment #1.1: Type: text/plain, Size: 3044 bytes --]
On 12/14/2017 01:31 AM, Stefano Brivio wrote:
> On Thu, 14 Dec 2017 01:25:40 +0100
> Matthias Schiffer <mschiffer@universe-factory.net> wrote:
>
>> On 12/14/2017 01:10 AM, Stefano Brivio wrote:
>>> On Thu, 14 Dec 2017 00:57:32 +0100
>>> Matthias Schiffer <mschiffer@universe-factory.net> wrote:
>>>
>>>> As you note, there is another occurrence of this calculation in
>>>> vxlan_config_apply():
>>>>
>>>>
>>>> [...]
>>>> if (lowerdev) {
>>>> [...]
>>>> max_mtu = lowerdev->mtu - (use_ipv6 ? VXLAN6_HEADROOM :
>>>> VXLAN_HEADROOM);
>>>> }
>>>>
>>>> if (dev->mtu > max_mtu)
>>>> dev->mtu = max_mtu;
>>>> [...]
>>>>
>>>>
>>>> Unless I'm overlooking something, this should already do the same thing and
>>>> your patch is redundant.
>>>
>>> The code above sets max_mtu, and only if dev->mtu exceeds that, the
>>> latter is then clamped.
>>>
>>> What my patch does is to actually set dev->mtu to that value, no matter
>>> what's the previous value set by ether_setup() (only on creation, and
>>> only if lowerdev is there), just like the previous behaviour used to be.
>>>
>>> Let's consider these two cases, on the existing code:
>>>
>>> 1. lowerdev->mtu is 1500:
>>> - ether_setup(), called by vxlan_setup(), sets dev->mtu to 1500
>>> - here max_mtu is 1450
>>> - we enter the second if clause above (dev->mtu > max_mtu)
>>> - at the end of vxlan_config_apply(), dev->mtu will be 1450
>>>
>>> which is consistent with the previous behaviour.
>>>
>>> 2. lowerdev->mtu is 9000:
>>> - ether_setup(), called by vxlan_setup(), sets dev->mtu to 1500
>>> - here max_mtu is 8950
>>> - we do not enter the second if clause above (dev->mtu < max_mtu)
>>> - at the end of vxlan_config_apply(), dev->mtu will still be 1500
>>>
>>> which is not consistent with the previous behaviour, where it used to
>>> be 8950 instead.
>>
>> Ah, thank you for the explanation, I was missing the context that this was
>> about higher rather than lower MTUs.
>>
>> Personally, I would prefer a change like the following, as it does not
>> introduce another duplication of the MTU calculation (not tested at all):
>>
>>> --- a/drivers/net/vxlan.c
>>> +++ b/drivers/net/vxlan.c
>>> @@ -3105,7 +3105,7 @@ static void vxlan_config_apply(struct net_device *dev,
>>> VXLAN_HEADROOM);
>>> }
>>>
>>> - if (dev->mtu > max_mtu)
>>> + if (dev->mtu > max_mtu || (!changelink && !conf->mtu))
>>> dev->mtu = max_mtu;
>
> You would also need to check that lowerdev is present, though.
>
> Otherwise, you're changing the behaviour again, that is, if lowerdev is
> not present, we want to keep 1500 and not set ETH_MAX_MTU (65535).
>
> Sure you can change the if condition to reflect that, but IMHO it
> becomes quite awkward.
>
Hmm, right. So here's a
Reviewed-by: Matthias Schiffer <mschiffer@universe-factory.net>
for the minimal change for -stable.
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH] openvswitch: Trim off padding before L3 conntrack processing
From: Pravin Shelar @ 2017-12-14 0:58 UTC (permalink / raw)
To: Ed Swierk
Cc: ovs-dev, Linux Kernel Network Developers, Pravin Shelar,
Lance Richardson, Benjamin Warren, Keith Holleman
In-Reply-To: <1513095439-128864-1-git-send-email-eswierk@skyportsystems.com>
On Tue, Dec 12, 2017 at 8:17 AM, Ed Swierk <eswierk@skyportsystems.com> wrote:
> A short IPv4 packet may have up to 6 bytes of padding following the IP
> payload when received on an Ethernet device.
>
> In the normal IPv4 receive path, ip_rcv() trims the packet to
> ip_hdr->tot_len before invoking NF_INET_PRE_ROUTING hooks (including
> conntrack). Then any subsequent L3+ processing steps, like
> nf_checksum(), use skb->len as the length of the packet, rather than
> referring back to ip_hdr->tot_len. In the IPv6 receive path, ip6_rcv()
> does the same using ipv6_hdr->payload_len.
>
> In the OVS conntrack receive path, this trimming does not occur, so
> the checksum verification in tcp_header() fails, printing "nf_ct_tcp:
> bad TCP checksum". Extra zero bytes don't affect the checksum, but the
> length in the IP pseudoheader does. That length is based on skb->len,
> and without trimming, it doesn't match the length the sender used when
> computing the checksum.
>
> With this change, OVS conntrack trims IPv4 and IPv6 packets prior to
> L3 processing.
>
> Signed-off-by: Ed Swierk <eswierk@skyportsystems.com>
> ---
> net/openvswitch/conntrack.c | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
> diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
> index d558e882ca0c..3a7c9215c431 100644
> --- a/net/openvswitch/conntrack.c
> +++ b/net/openvswitch/conntrack.c
> @@ -1105,12 +1105,29 @@ int ovs_ct_execute(struct net *net, struct sk_buff *skb,
> const struct ovs_conntrack_info *info)
> {
> int nh_ofs;
> + unsigned int nh_len;
> int err;
>
> /* The conntrack module expects to be working at L3. */
> nh_ofs = skb_network_offset(skb);
> skb_pull_rcsum(skb, nh_ofs);
>
> + /* Trim to L3 length since nf_checksum() doesn't expect padding. */
Can you explore if nf_checksum can be changed to avoid the padding?
> + switch (skb->protocol) {
> + case htons(ETH_P_IP):
> + nh_len = ntohs(ip_hdr(skb)->tot_len);
> + break;
> + case htons(ETH_P_IPV6):
> + nh_len = ntohs(ipv6_hdr(skb)->payload_len)
> + + sizeof(struct ipv6hdr);
> + break;
> + default:
> + nh_len = skb->len;
> + }
> + err = pskb_trim_rcsum(skb, nh_len);
> + if (err)
> + return err;
> +
In case of error skb needs to be freed.
^ permalink raw reply
* Re: [patch net-next v3 05/10] net: sched: keep track of offloaded filters and check tc offload feature
From: Jakub Kicinski @ 2017-12-14 1:05 UTC (permalink / raw)
To: Jiri Pirko
Cc: netdev, davem, jhs, xiyou.wangcong, mlxsw, andrew, vivien.didelot,
f.fainelli, michael.chan, ganeshgr, saeedm, matanb, leonro,
idosch, simon.horman, pieter.jansenvanvuuren, john.hurley,
alexander.h.duyck, ogerlitz, john.fastabend, daniel
In-Reply-To: <20171213151038.29665-6-jiri@resnulli.us>
On Wed, 13 Dec 2017 16:10:33 +0100, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@mellanox.com>
>
> During block bind, we need to check tc offload feature. If it is
> disabled yet still the block contains offloaded filters, forbid the
> bind. Also forbid to register callback for a block that already
> containes offloaded filters, as the play back is not supported now.
> For keeping track of offloaded filters there is a new counter
> introduced, alongside with couple of helpers called from cls_* code.
> These helpers set and clear TCA_CLS_FLAGS_IN_HW flag.
>
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
> index de9dbcb..ac25142 100644
> --- a/net/sched/cls_api.c
> +++ b/net/sched/cls_api.c
> @@ -266,31 +266,50 @@ void tcf_chain_put(struct tcf_chain *chain)
> }
> EXPORT_SYMBOL(tcf_chain_put);
>
> -static void tcf_block_offload_cmd(struct tcf_block *block, struct Qdisc *q,
> +static bool tcf_block_offload_in_use(struct tcf_block *block)
> +{
> + return block->offloadcnt;
> +}
> +
> +static void tcf_block_offload_cmd(struct tcf_block *block,
> + struct net_device *dev,
> struct tcf_block_ext_info *ei,
> enum tc_block_command command)
> {
> - struct net_device *dev = q->dev_queue->dev;
> struct tc_block_offload bo = {};
>
> - if (!dev->netdev_ops->ndo_setup_tc)
> - return;
> bo.command = command;
> bo.binder_type = ei->binder_type;
> bo.block = block;
> dev->netdev_ops->ndo_setup_tc(dev, TC_SETUP_BLOCK, &bo);
> }
>
> -static void tcf_block_offload_bind(struct tcf_block *block, struct Qdisc *q,
> - struct tcf_block_ext_info *ei)
> +static int tcf_block_offload_bind(struct tcf_block *block, struct Qdisc *q,
> + struct tcf_block_ext_info *ei)
> {
> - tcf_block_offload_cmd(block, q, ei, TC_BLOCK_BIND);
> + struct net_device *dev = q->dev_queue->dev;
> +
> + if (!dev->netdev_ops->ndo_setup_tc)
> + return 0;
> +
> + /* If tc offload feature is disabled and the block we try to bind
> + * to already has some offloaded filters, forbid to bind.
> + */
> + if (!tc_can_offload(dev) && tcf_block_offload_in_use(block))
> + return -EOPNOTSUPP;
I don't understand the tc_can_offload(dev) check here. The flow is -
on bind if TC offloads are enabled the new port will get a TC_BLOCK_BIND
and request a register, but the register will fail since the block is
offloaded?
But the whole bind operation does not fail, so user will not see an
error. The block will get bound but port's driver has no way to
register the callback. I'm sorry if I'm just being slow here..
> + tcf_block_offload_cmd(block, dev, ei, TC_BLOCK_BIND);
> + return 0;
> }
>
> static void tcf_block_offload_unbind(struct tcf_block *block, struct Qdisc *q,
> struct tcf_block_ext_info *ei)
> {
> - tcf_block_offload_cmd(block, q, ei, TC_BLOCK_UNBIND);
> + struct net_device *dev = q->dev_queue->dev;
> +
> + if (!dev->netdev_ops->ndo_setup_tc)
> + return;
> + tcf_block_offload_cmd(block, dev, ei, TC_BLOCK_UNBIND);
> }
>
> static int
> @@ -499,10 +518,15 @@ int tcf_block_get_ext(struct tcf_block **p_block, struct Qdisc *q,
> if (err)
> goto err_chain_head_change_cb_add;
>
> - tcf_block_offload_bind(block, q, ei);
> + err = tcf_block_offload_bind(block, q, ei);
> + if (err)
> + goto err_block_offload_bind;
Would it perhaps make more sense to add a TC_BLOCK_JOIN or some such?
IIUC the problem is we don't know whether the driver/callee of the new
port is aware of previous callbacks/filters and we can't replay them.
Obviously registering new callbacks on offloaded blocks is a no-go.
For simple bind to a new port of an ASIC which already knows the rule
set, we just need to ask all callbacks if they know the port and as
long as any of them responds "yes" we are safe to assume the bind is OK.
(Existing drivers will all respond with EOPNOTSUPP to a new unknown command.)
Does that make sense?
> *p_block = block;
> return 0;
>
> +err_block_offload_bind:
> + tcf_chain_head_change_cb_del(tcf_block_chain_zero(block), ei);
> err_chain_head_change_cb_add:
> tcf_block_owner_del(block, q, ei->binder_type);
> err_block_owner_add:
> @@ -630,9 +654,16 @@ struct tcf_block_cb *__tcf_block_cb_register(struct tcf_block *block,
> {
> struct tcf_block_cb *block_cb;
>
> + /* At this point, playback of previous block cb calls is not supported,
> + * so forbid to register to block which already has some offloaded
> + * filters present.
> + */
> + if (tcf_block_offload_in_use(block))
> + return ERR_PTR(-EOPNOTSUPP);
>
> block_cb = kzalloc(sizeof(*block_cb), GFP_KERNEL);
> if (!block_cb)
> - return NULL;
> + return ERR_PTR(-ENOMEM);
> block_cb->cb = cb;
> block_cb->cb_ident = cb_ident;
> block_cb->cb_priv = cb_priv;
> @@ -648,7 +679,7 @@ int tcf_block_cb_register(struct tcf_block *block,
> struct tcf_block_cb *block_cb;
>
> block_cb = __tcf_block_cb_register(block, cb, cb_ident, cb_priv);
> - return block_cb ? 0 : -ENOMEM;
> + return IS_ERR(block_cb) ? PTR_ERR(block_cb) : 0;
> }
> EXPORT_SYMBOL(tcf_block_cb_register);
>
^ permalink raw reply
* Re: [PATCH 1/2] hp100: Fix a possible sleep-in-atomic bug in hp100_login_to_vg_hub
From: Joe Perches @ 2017-12-14 1:21 UTC (permalink / raw)
To: Jia-Ju Bai, perex, floeff, acme; +Cc: netdev, linux-kernel
In-Reply-To: <1513158468-14382-1-git-send-email-baijiaju1990@gmail.com>
On Wed, 2017-12-13 at 17:47 +0800, Jia-Ju Bai wrote:
> The driver may sleep under a spinlock.
> The function call path is:
> hp100_set_multicast_list (acquire the spinlock)
> hp100_login_to_vg_hub
> schedule_timeout_interruptible --> may sleep
>
> To fix it, schedule_timeout_interruptible is replaced with udelay.
>
> This bug is found by my static analysis tool(DSAC) and checked by my code review.
Are there any current working VG/AnyLan network installations?
I rather doubt it.
^ permalink raw reply
* Re: [PATCH v6 2/3] sock: Move the socket inuse to namespace.
From: Tonghao Zhang @ 2017-12-14 1:22 UTC (permalink / raw)
To: Cong Wang
Cc: David Miller, Eric Dumazet, Willem de Bruijn, Pavel Emelyanov,
Linux Kernel Network Developers
In-Reply-To: <CAM_iQpUQb-khaHfNnvvdey69ok-FzLHUs2MmHKo2hUO0BUOKyg@mail.gmail.com>
On Wed, Dec 13, 2017 at 2:04 AM, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> On Sun, Dec 10, 2017 at 7:12 AM, Tonghao Zhang <xiangxia.m.yue@gmail.com> wrote:
>> diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
>> index b797832..6c191fb 100644
>> --- a/net/core/net_namespace.c
>> +++ b/net/core/net_namespace.c
>> @@ -363,6 +363,13 @@ static struct net *net_alloc(void)
>> if (!net)
>> goto out_free;
>>
>> +#ifdef CONFIG_PROC_FS
>> + net->core.sock_inuse = alloc_percpu(int);
>> + if (!net->core.sock_inuse) {
>> + kmem_cache_free(net_cachep, net);
>> + goto out_free;
>> + }
>> +#endif
>> rcu_assign_pointer(net->gen, ng);
>> out:
>> return net;
>> @@ -374,6 +381,9 @@ static struct net *net_alloc(void)
>>
>> static void net_free(struct net *net)
>> {
>> +#ifdef CONFIG_PROC_FS
>> + free_percpu(net->core.sock_inuse);
>> +#endif
>> kfree(rcu_access_pointer(net->gen));
>> kmem_cache_free(net_cachep, net);
>> }
>
> Putting socket code in net_namespace.c doesn't look good.
hi cong,
Thanks for your work. If we dont alloc the in the net_alloc, it's
better to counter the sock for userspace
while the sock created in kernel will be omitted.
^ permalink raw reply
* Re: [PATCH net-next 0/4] ERSPAN version 2 (type III) support
From: Stephen Hemminger @ 2017-12-14 1:54 UTC (permalink / raw)
To: William Tu; +Cc: netdev
In-Reply-To: <1513211938-8749-1-git-send-email-u9012063@gmail.com>
On Wed, 13 Dec 2017 16:38:54 -0800
William Tu <u9012063@gmail.com> wrote:
> ERSPAN has two versions, v1 (type II) and v2 (type III). This patch
> series add support for erspan v2 based on existing erspan v1
> implementation. The first patch refactors the existing erspan v1's
> header structure, making it extensible to put additional v2's header.
> The second and third patch introduces erspan v2's implementation to
> ipv4 and ipv6 erspan, for both native mode and collect metadata mode.
> Finally, test cases are added under the samples/bpf.
>
> Note:
> ERSPAN version 2 has many features and this patch does not implement
> all. One major use case of version 2 over version 1 is its timestamp
> and direction. So the traffic collector is able to distinguish the
> mirrorred traffic better. Other features such as SGT (security group
> tag), FT (frame type) for carrying non-ethernet packet, and optional
> subheader are not implemented yet.
>
> Example commandline for ERSPAN version 2:
> ip link add dev ip6erspan11 type ip6erspan seq key 102 \
> local fc00:100::2 remote fc00:100::1 \
> erspan_ver 2 erspan_dir 1 erspan_hwid 17
>
> The corresponding iproute2 patch:
> https://marc.info/?l=linux-netdev&m=151321141525106&w=2
If this is accepted to net-next you will need to
resubmit the iproute2 patch.
^ permalink raw reply
* Re: [PATCH net-next 0/4] ERSPAN version 2 (type III) support
From: William Tu @ 2017-12-14 2:08 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Linux Kernel Network Developers
In-Reply-To: <20171213175446.18c98648@xeon-e3>
On Wed, Dec 13, 2017 at 5:54 PM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Wed, 13 Dec 2017 16:38:54 -0800
> William Tu <u9012063@gmail.com> wrote:
>
>> ERSPAN has two versions, v1 (type II) and v2 (type III). This patch
>> series add support for erspan v2 based on existing erspan v1
>> implementation. The first patch refactors the existing erspan v1's
>> header structure, making it extensible to put additional v2's header.
>> The second and third patch introduces erspan v2's implementation to
>> ipv4 and ipv6 erspan, for both native mode and collect metadata mode.
>> Finally, test cases are added under the samples/bpf.
>>
>> Note:
>> ERSPAN version 2 has many features and this patch does not implement
>> all. One major use case of version 2 over version 1 is its timestamp
>> and direction. So the traffic collector is able to distinguish the
>> mirrorred traffic better. Other features such as SGT (security group
>> tag), FT (frame type) for carrying non-ethernet packet, and optional
>> subheader are not implemented yet.
>>
>> Example commandline for ERSPAN version 2:
>> ip link add dev ip6erspan11 type ip6erspan seq key 102 \
>> local fc00:100::2 remote fc00:100::1 \
>> erspan_ver 2 erspan_dir 1 erspan_hwid 17
>>
>> The corresponding iproute2 patch:
>> https://marc.info/?l=linux-netdev&m=151321141525106&w=2
>
>
> If this is accepted to net-next you will need to
> resubmit the iproute2 patch.
Hi Stephen,
Yes, I noticed that I forgot to update the iproute2 man page. Thanks
William
^ permalink raw reply
* Re: [patch iproute2] tc: fix json array closing
From: Stephen Hemminger @ 2017-12-14 2:19 UTC (permalink / raw)
To: Jiri Pirko; +Cc: netdev
In-Reply-To: <20171213195616.2548-1-jiri@resnulli.us>
On Wed, 13 Dec 2017 20:56:16 +0100
Jiri Pirko <jiri@resnulli.us> wrote:
> From: Jiri Pirko <jiri@mellanox.com>
>
> Fixes: 2704bd625583 ("tc: jsonify actions core")
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Good catch. Applied
^ permalink raw reply
* Re: [PATCH iproute2] ip: add vxcan to help text
From: Stephen Hemminger @ 2017-12-14 2:20 UTC (permalink / raw)
To: Oliver Hartkopp; +Cc: linux-can, netdev
In-Reply-To: <20171213202128.4424-1-socketcan@hartkopp.net>
On Wed, 13 Dec 2017 21:21:28 +0100
Oliver Hartkopp <socketcan@hartkopp.net> wrote:
> Add missing tag 'vxcan' inside the help text which was missing in commit
> efe459c76d35f ('ip: link add vxcan support').
>
> Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net>
Applied. Could you also fix the man page?
^ permalink raw reply
* Re: [bpf-next V1-RFC PATCH 08/14] nfp: setup xdp_rxq_info
From: Jakub Kicinski @ 2017-12-14 2:34 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Daniel Borkmann, Alexei Starovoitov, dsahern, oss-drivers,
Simon Horman, netdev, gospo, bjorn.topel, michael.chan
In-Reply-To: <151316400161.14967.13516011331263133183.stgit@firesoul>
On Wed, 13 Dec 2017 12:20:01 +0100, Jesper Dangaard Brouer wrote:
> Driver hook points for xdp_rxq_info:
> * init+reg: nfp_net_rx_ring_alloc
> * unreg : nfp_net_rx_ring_free
>
> In struct nfp_net_rx_ring moved member @size into a hole on 64-bit.
> Thus, the size remaines the same after adding member @xdp_rxq.
>
> Cc: oss-drivers@netronome.com
> Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
> Cc: Simon Horman <simon.horman@netronome.com>
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h
> index 3801c52098d5..0e564cfabe7e 100644
> --- a/drivers/net/ethernet/netronome/nfp/nfp_net.h
> +++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h
> @@ -47,6 +47,7 @@
> #include <linux/netdevice.h>
> #include <linux/pci.h>
> #include <linux/io-64-nonatomic-hi-lo.h>
> +#include <net/xdp.h>
>
> #include "nfp_net_ctrl.h"
>
> @@ -350,6 +351,7 @@ struct nfp_net_rx_buf {
> * @rxds: Virtual address of FL/RX ring in host memory
> * @dma: DMA address of the FL/RX ring
> * @size: Size, in bytes, of the FL/RX ring (needed to free)
> + * @xdp_rxq: RX-ring info avail for XDP
> */
> struct nfp_net_rx_ring {
> struct nfp_net_r_vector *r_vec;
> @@ -361,13 +363,14 @@ struct nfp_net_rx_ring {
> u32 idx;
>
> int fl_qcidx;
> + unsigned int size;
> u8 __iomem *qcp_fl;
>
> struct nfp_net_rx_buf *rxbufs;
> struct nfp_net_rx_desc *rxds;
>
> dma_addr_t dma;
> - unsigned int size;
> + struct xdp_rxq_info xdp_rxq;
> } ____cacheline_aligned;
The @size member is not in the hole on purpose. IIRC all the members
up to @dma are in the first cacheline. All things which are not needed
on the fast path are after @dma. IOW @size is not used on the fast
path and the hole is for fast path stuff :)
> /**
> diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
> index ad3e9f6a61e5..6474aecd0451 100644
> --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
> +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
> @@ -2252,6 +2253,7 @@ static void nfp_net_rx_ring_free(struct nfp_net_rx_ring *rx_ring)
> struct nfp_net_r_vector *r_vec = rx_ring->r_vec;
> struct nfp_net_dp *dp = &r_vec->nfp_net->dp;
>
> + xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
> kfree(rx_ring->rxbufs);
>
> if (rx_ring->rxds)
> @@ -2277,6 +2279,12 @@ nfp_net_rx_ring_alloc(struct nfp_net_dp *dp, struct nfp_net_rx_ring *rx_ring)
> {
> int sz;
>
> + /* XDP RX-queue info */
> + xdp_rxq_info_init(&rx_ring->xdp_rxq);
> + rx_ring->xdp_rxq.dev = dp->netdev;
> + rx_ring->xdp_rxq.queue_index = rx_ring->idx;
> + xdp_rxq_info_reg(&rx_ring->xdp_rxq);
> +
> rx_ring->cnt = dp->rxd_cnt;
> rx_ring->size = sizeof(*rx_ring->rxds) * rx_ring->cnt;
> rx_ring->rxds = dma_zalloc_coherent(dp->dev, rx_ring->size,
The nfp driver implements the prepare/commit for reallocating rings.
I don't think it matters now, but there can be 2 sets of rings with the
same ID allocated during reconfiguration (see nfp_net_ring_reconfig()).
Maybe place the register/unregister in nfp_net_open_stack() and
nfp_net_close_stack() respectively?
Perhaps that won't be necessary, only cleaner :) I'm not sure how is
the redirect between drivers intended to work WRT freeing rings and
unloading drivers while packets fly...
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox