Netdev List
 help / color / mirror / Atom feed
* Re: [Patch nf] ipvs: initialize tbl->entries after allocation
From: Pablo Neira Ayuso @ 2018-04-26 22:21 UTC (permalink / raw)
  To: Simon Horman
  Cc: Julian Anastasov, Cong Wang, netdev, lvs-devel, netfilter-devel
In-Reply-To: <20180426121423.5c7iy2ddhjy2clzf@verge.net.au>

On Thu, Apr 26, 2018 at 02:14:25PM +0200, Simon Horman wrote:
> On Tue, Apr 24, 2018 at 08:16:14AM +0300, Julian Anastasov wrote:
> > 
> > 	Hello,
> > 
> > On Mon, 23 Apr 2018, Cong Wang wrote:
> > 
> > > tbl->entries is not initialized after kmalloc(), therefore
> > > causes an uninit-value warning in ip_vs_lblc_check_expire()
> > > as reported by syzbot.
> > > 
> > > Reported-by: <syzbot+3dfdea57819073a04f21@syzkaller.appspotmail.com>
> > > Cc: Simon Horman <horms@verge.net.au>
> > > Cc: Julian Anastasov <ja@ssi.bg>
> > > Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> > > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> > 
> > 	Thanks!
> > 
> > Acked-by: Julian Anastasov <ja@ssi.bg>
> 
> Thanks.
> 
> Pablo, could you take this into nf?
> 
> Acked-by: Simon Horman <horms@verge.net.au>

Done, thanks Simon.

^ permalink raw reply

* Re: [Patch nf] ipvs: initialize tbl->entries in ip_vs_lblc_init_svc()
From: Pablo Neira Ayuso @ 2018-04-26 22:21 UTC (permalink / raw)
  To: Simon Horman
  Cc: Julian Anastasov, Cong Wang, netdev, lvs-devel, netfilter-devel
In-Reply-To: <20180426121436.fscassmhtiju2wwi@verge.net.au>

On Thu, Apr 26, 2018 at 02:14:36PM +0200, Simon Horman wrote:
> On Tue, Apr 24, 2018 at 08:17:06AM +0300, Julian Anastasov wrote:
> > 
> > 	Hello,
> > 
> > On Mon, 23 Apr 2018, Cong Wang wrote:
> > 
> > > Similarly, tbl->entries is not initialized after kmalloc(),
> > > therefore causes an uninit-value warning in ip_vs_lblc_check_expire(),
> > > as reported by syzbot.
> > > 
> > > Reported-by: <syzbot+3e9695f147fb529aa9bc@syzkaller.appspotmail.com>
> > > Cc: Simon Horman <horms@verge.net.au>
> > > Cc: Julian Anastasov <ja@ssi.bg>
> > > Cc: Pablo Neira Ayuso <pablo@netfilter.org>
> > > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> > 
> > 	Thanks!
> > 
> > Acked-by: Julian Anastasov <ja@ssi.bg>
> 
> Thanks. 
> 
> Pablo, could you take this into nf?
> 
> Acked-by: Simon Horman <horms@verge.net.au>

Done, thanks.

^ permalink raw reply

* Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Michael S. Tsirkin @ 2018-04-26 22:21 UTC (permalink / raw)
  To: Mikulas Patocka
  Cc: John Stoffel, James Bottomley, Michal, eric.dumazet, netdev,
	jasowang, Randy Dunlap, linux-kernel, Matthew Wilcox, Hocko,
	linux-mm, dm-devel, Vlastimil Babka, Andrew, David Rientjes,
	Morton, virtualization, David Miller, edumazet
In-Reply-To: <alpine.LRH.2.02.1804261726540.13401@file01.intranet.prod.int.rdu2.redhat.com>

On Thu, Apr 26, 2018 at 05:50:20PM -0400, Mikulas Patocka wrote:
> How is the user or developer supposed to learn about this option, if 
> he gets no crash at all?

Look in /sys/kernel/debug/fail* ? That actually lets you
filter by module, process etc.

I think this patch conflates two things:

1. Make kvmalloc use the vmalloc path.
    This seems a bit narrow.
    What is special about kvmalloc? IMHO nothing - it's yet another user
    of __GFP_NORETRY or __GFP_RETRY_MAYFAIL. As any such
    user, it either recovers correctly or not.
    So IMHO it's just a case of
    making __GFP_NORETRY, __GFP_RETRY_MAYFAIL, or both
    fail once in a while.
    Seems like a better extension to me than focusing on vmalloc.
    I think you will find more bugs this way.

2. Ability to control this from a separate config
   option.

   It's still not that clear to me why is this such a
   hard requirement.  If a distro wants to force specific
   boot time options, why isn't CONFIG_CMDLINE sufficient?

   But assuming it's important to control this kind of
   fault injection to be controlled from
   a dedicated menuconfig option, why not the rest of
   faults?

IMHO if you split 1/2 up, and generalize, the path upstream
will be much smoother.

Hope this helps.

-- 
MST

^ permalink raw reply

* Re: [PATCH net-next 0/2] net/sctp: Avoid allocating high order memory with kmalloc()
From: Marcelo Ricardo Leitner @ 2018-04-26 22:28 UTC (permalink / raw)
  To: Oleg Babin
  Cc: netdev, linux-sctp, David S. Miller, Vlad Yasevich, Neil Horman,
	Xin Long, Andrey Ryabinin
In-Reply-To: <27adcf09-830d-48cb-34ab-aaabffa2b202@virtuozzo.com>

On Fri, Apr 27, 2018 at 01:14:56AM +0300, Oleg Babin wrote:
> Hi Marcelo,
>
> On 04/24/2018 12:33 AM, Marcelo Ricardo Leitner wrote:
> > Hi,
> >
> > On Mon, Apr 23, 2018 at 09:41:04PM +0300, Oleg Babin wrote:
> >> Each SCTP association can have up to 65535 input and output streams.
> >> For each stream type an array of sctp_stream_in or sctp_stream_out
> >> structures is allocated using kmalloc_array() function. This function
> >> allocates physically contiguous memory regions, so this can lead
> >> to allocation of memory regions of very high order, i.e.:
> >>
> >>   sizeof(struct sctp_stream_out) == 24,
> >>   ((65535 * 24) / 4096) == 383 memory pages (4096 byte per page),
> >>   which means 9th memory order.
> >>
> >> This can lead to a memory allocation failures on the systems
> >> under a memory stress.
> >
> > Did you do performance tests while actually using these 65k streams
> > and with 256 (so it gets 2 pages)?
> >
> > This will introduce another deref on each access to an element, but
> > I'm not expecting any impact due to it.
> >
>
> No, I didn't do such tests. Could you please tell me what methodology
> do you usually use to measure performance properly?
>
> I'm trying to do measurements with iperf3 on unmodified kernel and get
> very strange results like this:
...

I've been trying to fight this fluctuation for some time now but
couldn't really fix it yet. One thing that usually helps (quite a lot)
is increasing the socket buffer sizes and/or using smaller messages,
so there is more cushion in the buffers.

What I have seen in my tests is that when it floats like this, is
because socket buffers floats between 0 and full and don't get into a
steady state. I believe this is because of socket buffer size is used
for limiting the amount of memory used by the socket, instead of being
the amount of payload that the buffer can hold. This causes some
discrepancy, especially because in SCTP we don't defrag the buffer (as
TCP does, it's the collapse operation), and the announced rwnd may
turn up being a lie in the end, which triggers rx drops, then tx cwnd
reduction, and so on. SCTP min_rto of 1s also doesn't help much on
this situation.

On netperf, you may use -S 200000,200000 -s 200000,200000. That should
help it.

Cheers,
Marcelo

^ permalink raw reply

* Re: [PATCH v2] net: qrtr: Expose tunneling endpoint to user space
From: Chris Lew @ 2018-04-26 22:29 UTC (permalink / raw)
  To: Bjorn Andersson, David S. Miller; +Cc: linux-kernel, netdev, linux-arm-msm
In-Reply-To: <20180423214653.10016-1-bjorn.andersson@linaro.org>



On 4/23/2018 2:46 PM, Bjorn Andersson wrote:
> This implements a misc character device named "qrtr-tun" for the purpose
> of allowing user space applications to implement endpoints in the qrtr
> network.
> 
> This allows more advanced (and dynamic) testing of the qrtr code as well
> as opens up the ability of tunneling qrtr over a network or USB link.
> 
> Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>

Acked-by: Chris Lew <clew@codeaurora.org>

> +static ssize_t qrtr_tun_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	struct file *filp = iocb->ki_filp;
> +	struct qrtr_tun *tun = filp->private_data;
> +	struct sk_buff *skb;
> +	int count;
> +
> +	while (!(skb = skb_dequeue(&tun->queue))) {
> +		if (filp->f_flags & O_NONBLOCK)
> +			return -EAGAIN;
> +
> +		/* Wait until we get data or the endpoint goes away */
> +		if (wait_event_interruptible(tun->readq,
> +					     !skb_queue_empty(&tun->queue)))
> +			return -ERESTARTSYS;
> +	}
> +
> +	count = min_t(size_t, iov_iter_count(to), skb->len);
> +	if (copy_to_iter(skb->data, count, to) != count)
> +		count = -EFAULT;
> +
> +	kfree_skb(skb);

Is it better to use consume_skb() since this is the expected behavior path?

> +
> +	return count;
> +}
> +

Thanks,
Chris

-- 
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
a Linux Foundation Collaborative Project

^ permalink raw reply

* Re: [PATCH net-next v8 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Stephen Hemminger @ 2018-04-26 22:33 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Sridhar Samudrala, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh, jiri, aaron.f.brown
In-Reply-To: <20180426052908-mutt-send-email-mst@kernel.org>

On Thu, 26 Apr 2018 05:30:05 +0300
"Michael S. Tsirkin" <mst@redhat.com> wrote:

> On Wed, Apr 25, 2018 at 05:08:37PM -0700, Stephen Hemminger wrote:
> > On Wed, 25 Apr 2018 16:59:28 -0700
> > Sridhar Samudrala <sridhar.samudrala@intel.com> wrote:
> >   
> > > Use the registration/notification framework supported by the generic
> > > failover infrastructure.
> > > 
> > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>  
> > 
> > NAK unless you prove this works on legacy distributions and with DPDK 18.05
> > without modification.  
> 
> It looks like it should work. What kind of proof are you looking for?
> 


I tried this with working Ubuntu 17 on WS2016.
It boots if the failover driver is configured in (as module).
But if the configuration has:

$ grep FAILOVER .config
# CONFIG_NET_FAILOVER is not set
CONFIG_MAY_USE_NET_FAILOVER=y


The netvsc driver fails on boot with:

[    0.826447] hv_vmbus: registering driver hv_netvsc
[    0.829616] scsi 0:0:0:0: Direct-Access     Msft     Virtual Disk     1.0  PQ: 0 ANSI: 5
[    0.836291] input: Microsoft Vmbus HID-compliant Mouse as /devices/0006:045E:0621.0001/input/input1
[    0.839139] hid-generic 0006:045E:0621.0001: input: <UNKNOWN> HID v0.01 Mouse [Microsoft Vmbus HID-compliant Mouse] on
[    0.964897] hv_vmbus: probe failed for device 849a776e-8120-4e4a-9a36-7e3d95ac75b3 (-95)
[    0.968039] hv_netvsc: probe of 849a776e-8120-4e4a-9a36-7e3d95ac75b3 failed with error -95
[    1.112877] hv_vmbus: probe failed for device 53557f8e-057d-425b-9265-01c0fd7e273e (-95)
[    1.116064] hv_netvsc: probe of 53557f8e-057d-425b-9265-01c0fd7e273e failed with error -95

The system has two virtual networks. eth0 is on vswitch for management.
eth1 is on vswitch with SR-IOV for performance tests.

You probably need to just put the failover part in net/core and select it.


It is trivial to get an evaluation version of Windows Server 2016 and setup a Linux VM.
Please try it.

^ permalink raw reply

* Re: [PATCH] net/mlx5: report persistent netdev stats across ifdown/ifup commands
From: Qing Huang @ 2018-04-26 22:37 UTC (permalink / raw)
  To: Saeed Mahameed, Eran Ben Elisha
  Cc: linux-kernel, RDMA mailing list, Linux Netdev List,
	Leon Romanovsky, Matan Barak, Saeed Mahameed
In-Reply-To: <CALzJLG8=dnyLGRXamLR5-eYJ9Q8WCwKe44sHHZ57=1h_x0RCRA@mail.gmail.com>



On 04/26/2018 02:50 PM, Saeed Mahameed wrote:
> On Thu, Apr 26, 2018 at 1:37 PM, Qing Huang <qing.huang@oracle.com> wrote:
>> Current stats collecting scheme in mlx5 driver is to periodically fetch
>> aggregated stats from all the active mlx5 software channels associated
>> with the device. However when a mlx5 interface is brought down(ifdown),
>> all the channels will be deactivated and closed. A new set of channels
>> will be created when next ifup command or a similar command is called.
>> Unfortunately the new channels will have all stats reset to 0. So you
>> lose the accumulated stats information. This behavior is different from
>> other netdev drivers including the mlx4 driver. In order to fix it, we
>> now save prior mlx5 software stats into netdev stats fields, so all the
>> accumulated stats will survive multiple runs of ifdown/ifup commands and
>> be shown correctly.
>>
>> Orabug: 27548610
>>
>> Signed-off-by: Qing Huang <qing.huang@oracle.com>
>> ---
> Hi Qing,
>
> I am adding Eran since he is currently working on a similar patch,
> He is also taking care of all cores/rings stats to make them
> persistent, so you won't have discrepancy  between
> ethtool and ifconfig stats.
>
> I am ok with this patch, but this means Eran has to work his way around it.
>
> so we have two options:
>
> 1. Temporary accept this patch, and change it later with Eran's work.
> 2. Wait for Eran's work.
>
> I am ok with either one of them, please let me know.
>
> Thanks !
Hi Saeed,

Any idea on rough ETA of Eran's stats work to be in upstream? If it will 
be available soon, I think
we can wait a bit. If it will take a while to redesign the whole stats 
scheme (for both ethtool and ifconfig),
maybe we can go with this incremental fix first?

Thanks!

>
>>   drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 30 +++++++++++++++++++----
>>   1 file changed, 25 insertions(+), 5 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> index f1fe490..5d50e69 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>> @@ -2621,6 +2621,23 @@ static void mlx5e_netdev_set_tcs(struct net_device *netdev)
>>                  netdev_set_tc_queue(netdev, tc, nch, 0);
>>   }
>>
>> +static void mlx5e_netdev_save_stats(struct mlx5e_priv *priv)
>> +{
>> +       struct net_device *netdev = priv->netdev;
>> +
>> +       netdev->stats.rx_packets += priv->stats.sw.rx_packets;
>> +       netdev->stats.rx_bytes   += priv->stats.sw.rx_bytes;
>> +       netdev->stats.tx_packets += priv->stats.sw.tx_packets;
>> +       netdev->stats.tx_bytes   += priv->stats.sw.tx_bytes;
>> +       netdev->stats.tx_dropped += priv->stats.sw.tx_queue_dropped;
>> +
>> +       priv->stats.sw.rx_packets       = 0;
>> +       priv->stats.sw.rx_bytes         = 0;
>> +       priv->stats.sw.tx_packets       = 0;
>> +       priv->stats.sw.tx_bytes         = 0;
>> +       priv->stats.sw.tx_queue_dropped = 0;
>> +}
>> +
> This means that we are now explicitly clearing channels stats on
> ifconfig down or switch_channels.
> and now after ifconfing down, ethtool will always show 0, before this
> patch it didn't.
> Anyway update sw stats function will always override them with the new
> channels stats next time we load new channels.
> so it is not that big of a deal.
>
>
>>   static void mlx5e_build_channels_tx_maps(struct mlx5e_priv *priv)
>>   {
>>          struct mlx5e_channel *c;
>> @@ -2691,6 +2708,7 @@ void mlx5e_switch_priv_channels(struct mlx5e_priv *priv,
>>                  netif_set_real_num_tx_queues(netdev, new_num_txqs);
>>
>>          mlx5e_deactivate_priv_channels(priv);
>> +       mlx5e_netdev_save_stats(priv);
>>          mlx5e_close_channels(&priv->channels);
>>
>>          priv->channels = *new_chs;
>> @@ -2770,6 +2788,7 @@ int mlx5e_close_locked(struct net_device *netdev)
>>
>>          netif_carrier_off(priv->netdev);
>>          mlx5e_deactivate_priv_channels(priv);
>> +       mlx5e_netdev_save_stats(priv);
>>          mlx5e_close_channels(&priv->channels);
>>
>>          return 0;
>> @@ -3215,11 +3234,12 @@ static int mlx5e_setup_tc(struct net_device *dev, enum tc_setup_type type,
>>                  stats->tx_packets = PPORT_802_3_GET(pstats, a_frames_transmitted_ok);
>>                  stats->tx_bytes   = PPORT_802_3_GET(pstats, a_octets_transmitted_ok);
>>          } else {
>> -               stats->rx_packets = sstats->rx_packets;
>> -               stats->rx_bytes   = sstats->rx_bytes;
>> -               stats->tx_packets = sstats->tx_packets;
>> -               stats->tx_bytes   = sstats->tx_bytes;
>> -               stats->tx_dropped = sstats->tx_queue_dropped;
>> +               stats->rx_packets = sstats->rx_packets + dev->stats.rx_packets;
>> +               stats->rx_bytes   = sstats->rx_bytes + dev->stats.rx_bytes;
>> +               stats->tx_packets = sstats->tx_packets + dev->stats.tx_packets;
>> +               stats->tx_bytes   = sstats->tx_bytes + dev->stats.tx_bytes;
>> +               stats->tx_dropped = sstats->tx_queue_dropped +
>> +                                   dev->stats.tx_dropped;
>>          }
>>
>>          stats->rx_dropped = priv->stats.qcnt.rx_out_of_buffer;
>> --
>> 1.8.3.1
>>

^ permalink raw reply

* Re: [PATCH net-next 0/2] net/sctp: Avoid allocating high order memory with kmalloc()
From: Oleg Babin @ 2018-04-26 22:45 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner
  Cc: netdev, linux-sctp, David S. Miller, Vlad Yasevich, Neil Horman,
	Xin Long, Andrey Ryabinin
In-Reply-To: <20180426222814.GA10301@localhost.localdomain>

On 04/27/2018 01:28 AM, Marcelo Ricardo Leitner wrote:
> On Fri, Apr 27, 2018 at 01:14:56AM +0300, Oleg Babin wrote:
>> Hi Marcelo,
>>
>> On 04/24/2018 12:33 AM, Marcelo Ricardo Leitner wrote:
>>> Hi,
>>>
>>> On Mon, Apr 23, 2018 at 09:41:04PM +0300, Oleg Babin wrote:
>>>> Each SCTP association can have up to 65535 input and output streams.
>>>> For each stream type an array of sctp_stream_in or sctp_stream_out
>>>> structures is allocated using kmalloc_array() function. This function
>>>> allocates physically contiguous memory regions, so this can lead
>>>> to allocation of memory regions of very high order, i.e.:
>>>>
>>>>   sizeof(struct sctp_stream_out) == 24,
>>>>   ((65535 * 24) / 4096) == 383 memory pages (4096 byte per page),
>>>>   which means 9th memory order.
>>>>
>>>> This can lead to a memory allocation failures on the systems
>>>> under a memory stress.
>>>
>>> Did you do performance tests while actually using these 65k streams
>>> and with 256 (so it gets 2 pages)?
>>>
>>> This will introduce another deref on each access to an element, but
>>> I'm not expecting any impact due to it.
>>>
>>
>> No, I didn't do such tests. Could you please tell me what methodology
>> do you usually use to measure performance properly?
>>
>> I'm trying to do measurements with iperf3 on unmodified kernel and get
>> very strange results like this:
> ...
> 
> I've been trying to fight this fluctuation for some time now but
> couldn't really fix it yet. One thing that usually helps (quite a lot)
> is increasing the socket buffer sizes and/or using smaller messages,
> so there is more cushion in the buffers.
> 
> What I have seen in my tests is that when it floats like this, is
> because socket buffers floats between 0 and full and don't get into a
> steady state. I believe this is because of socket buffer size is used
> for limiting the amount of memory used by the socket, instead of being
> the amount of payload that the buffer can hold. This causes some
> discrepancy, especially because in SCTP we don't defrag the buffer (as
> TCP does, it's the collapse operation), and the announced rwnd may
> turn up being a lie in the end, which triggers rx drops, then tx cwnd
> reduction, and so on. SCTP min_rto of 1s also doesn't help much on
> this situation.
> 
> On netperf, you may use -S 200000,200000 -s 200000,200000. That should
> help it.
>

Thank you very much! I'll try this and get back with results later.

-- 
Best regards,
Oleg

^ permalink raw reply

* Re: [RFC PATCH ghak32 V2 01/13] audit: add container id
From: Paul Moore @ 2018-04-26 22:47 UTC (permalink / raw)
  To: Richard Guy Briggs
  Cc: cgroups, containers, linux-api, Linux-Audit Mailing List,
	linux-fsdevel, LKML, netdev, ebiederm, luto, jlayton, carlos,
	dhowells, viro, simo, Eric Paris, serge
In-Reply-To: <20180425004031.zutsno6hvmpq3crd@madcap2.tricolour.ca>

On Tue, Apr 24, 2018 at 8:40 PM, Richard Guy Briggs <rgb@redhat.com> wrote:
> On 2018-04-24 15:01, Paul Moore wrote:
>> On Mon, Apr 23, 2018 at 10:02 PM, Richard Guy Briggs <rgb@redhat.com> wrote:
>> > On 2018-04-23 19:15, Paul Moore wrote:
>> >> On Sat, Apr 21, 2018 at 10:34 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
>> >> > On 2018-04-18 19:47, Paul Moore wrote:
>> >> >> On Fri, Mar 16, 2018 at 5:00 AM, Richard Guy Briggs <rgb@redhat.com> wrote:
>> >> >> > Implement the proc fs write to set the audit container ID of a process,
>> >> >> > emitting an AUDIT_CONTAINER record to document the event.
>> >> >> >
>> >> >> > This is a write from the container orchestrator task to a proc entry of
>> >> >> > the form /proc/PID/containerid where PID is the process ID of the newly
>> >> >> > created task that is to become the first task in a container, or an
>> >> >> > additional task added to a container.
>> >> >> >
>> >> >> > The write expects up to a u64 value (unset: 18446744073709551615).
>> >> >> >
>> >> >> > This will produce a record such as this:
>> >> >> > type=CONTAINER msg=audit(1519903238.968:261): op=set pid=596 uid=0 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 auid=0 tty=pts0 ses=1 opid=596 old-contid=18446744073709551615 contid=123455 res=0
>> >> >> >
>> >> >> > The "op" field indicates an initial set.  The "pid" to "ses" fields are
>> >> >> > the orchestrator while the "opid" field is the object's PID, the process
>> >> >> > being "contained".  Old and new container ID values are given in the
>> >> >> > "contid" fields, while res indicates its success.
>> >> >> >
>> >> >> > It is not permitted to self-set, unset or re-set the container ID.  A
>> >> >> > child inherits its parent's container ID, but then can be set only once
>> >> >> > after.
>> >> >> >
>> >> >> > See: https://github.com/linux-audit/audit-kernel/issues/32
>> >> >> >
>> >> >> > Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
>> >> >> > ---
>> >> >> >  fs/proc/base.c             | 37 ++++++++++++++++++++
>> >> >> >  include/linux/audit.h      | 16 +++++++++
>> >> >> >  include/linux/init_task.h  |  4 ++-
>> >> >> >  include/linux/sched.h      |  1 +
>> >> >> >  include/uapi/linux/audit.h |  2 ++
>> >> >> >  kernel/auditsc.c           | 84 ++++++++++++++++++++++++++++++++++++++++++++++
>> >> >> >  6 files changed, 143 insertions(+), 1 deletion(-)
>>
>> ...
>>
>> >> >> >  /* audit_rule_data supports filter rules with both integer and string
>> >> >> >   * fields.  It corresponds with AUDIT_ADD_RULE, AUDIT_DEL_RULE and
>> >> >> > diff --git a/kernel/auditsc.c b/kernel/auditsc.c
>> >> >> > index 4e0a4ac..29c8482 100644
>> >> >> > --- a/kernel/auditsc.c
>> >> >> > +++ b/kernel/auditsc.c
>> >> >> > @@ -2073,6 +2073,90 @@ int audit_set_loginuid(kuid_t loginuid)
>> >> >> >         return rc;
>> >> >> >  }
>> >> >> >
>> >> >> > +static int audit_set_containerid_perm(struct task_struct *task, u64 containerid)
>> >> >> > +{
>> >> >> > +       struct task_struct *parent;
>> >> >> > +       u64 pcontainerid, ccontainerid;
>> >> >> > +
>> >> >> > +       /* Don't allow to set our own containerid */
>> >> >> > +       if (current == task)
>> >> >> > +               return -EPERM;
>> >> >>
>> >> >> Why not?  Is there some obvious security concern that I missing?
>> >> >
>> >> > We then lose the distinction in the AUDIT_CONTAINER record between the
>> >> > initiating PID and the target PID.  This was outlined in the proposal.
>> >>
>> >> I just went back and reread the v3 proposal and I still don't see a
>> >> good explanation of this.  Why is this bad?  What's the security
>> >> concern?
>> >
>> > I don't remember, specifically.  Maybe this has been addressed by the
>> > check for children/threads or identical parent container ID.  So, I'm
>> > reluctantly willing to remove that check for now.
>>
>> Okay.  For the record, if someone can explain to me why this
>> restriction saves us from some terrible situation I'm all for leaving
>> it.  I'm just opposed to restrictions without solid reasoning behind
>> them.
>>
>> >> > Having said that, I'm still not sure we have protected sufficiently from
>> >> > a child turning around and setting it's parent's as yet unset or
>> >> > inherited audit container ID.
>> >>
>> >> Yes, I believe we only want to let a task set the audit container for
>> >> it's children (or itself/threads if we decide to allow that, see
>> >> above).  There *has* to be a function to check to see if a task if a
>> >> child of a given task ... right? ... although this is likely to be a
>> >> pointer traversal and locking nightmare ... hmmm.
>> >
>> > Isn't that just (struct task_struct)parent == (struct
>> > task_struct)child->parent (or ->real_parent)?
>> >
>> > And now that I say that, it is covered by the following patch's child
>> > check, so as long as we keep that, we should be fine.
>>
>> I was thinking of checking not just current's immediate children, but
>> any of it's descendants as I believe that is what we want to limit,
>> yes?  I just worry that it isn't really practical to perform that
>> check.
>
> The child check I'm talking about prevents setting a task's audit
> container ID if it *has* any children or threads, so if it has children
> it is automatically disqualified and its grandchildren are irrelevant.
>
>> >> >> I ask because I suppose it might be possible for some container
>> >> >> runtime to do a fork, setup some of the environment and them exec the
>> >> >> container (before you answer the obvious "namespaces!" please remember
>> >> >> we're not trying to define containers).
>> >> >
>> >> > I don't think namespaces have any bearing on this concern since none are
>> >> > required.
>> >> >
>> >> >> > +       /* Don't allow the containerid to be unset */
>> >> >> > +       if (!cid_valid(containerid))
>> >> >> > +               return -EINVAL;
>> >> >> > +       /* if we don't have caps, reject */
>> >> >> > +       if (!capable(CAP_AUDIT_CONTROL))
>> >> >> > +               return -EPERM;
>> >> >> > +       /* if containerid is unset, allow */
>> >> >> > +       if (!audit_containerid_set(task))
>> >> >> > +               return 0;
>> >> >> > +       /* it is already set, and not inherited from the parent, reject */
>> >> >> > +       ccontainerid = audit_get_containerid(task);
>> >> >> > +       rcu_read_lock();
>> >> >> > +       parent = rcu_dereference(task->real_parent);
>> >> >> > +       rcu_read_unlock();
>> >> >> > +       task_lock(parent);
>> >> >> > +       pcontainerid = audit_get_containerid(parent);
>> >> >> > +       task_unlock(parent);
>> >> >> > +       if (ccontainerid != pcontainerid)
>> >> >> > +               return -EPERM;
>> >> >> > +       return 0;
>>
>> I'm looking at the parent checks again and I wonder if the logic above
>> is what we really want.  Maybe it is, but I'm not sure.
>>
>> Things I'm wondering about:
>>
>> * "ccontainerid" and "containerid" are too close in name, I kept
>> confusing myself when looking at this code.  Please change one.  Bonus
>> points if it is shorter.
>
> Would c_containerid and p_containerid be ok?  child_cid and parent_cid?

Either would be an improvement over ccontainerid/containerid.  I would
give a slight node to child_cid/parent_cid just for length reasons.

> I'd really like it to have the same root as the parameter handed in so
> teh code is easier to follow.  It would be nice to have that across
> caller to local, but that's challenging.

That's fine, but you have to admit that ccontainerid/containerid is
awkward and not easy to quickly differentiate :)

> I've been tempted to use contid or even cid everywhere instead of
> containerid.  Perhaps the longer name doesn't bother me because I
> like its uniqueness and I learned touch-typing in grade 9 and I like
> 100+ character wide terminals?  ;-)

I would definitely appreciate contid/cid or similar, but I don't care
too much either way.  As far as terminal width is concerned, please
make sure your code fits in 80 char terminals.

>> * What if the orchestrator wants to move the task to a new container?
>> Right now it looks like you can only do that once, then then the
>> task's audit container ID will no longer be the same as real_parent
>
> A task's audit container ID can be unset or inherited, and then set
> only once.  After that, if you want it moved to a new container you
> can't and your only option is to spawn another peer to that task or a
> child of it and set that new task's audit container ID.

Okay.  We've had some many discussions about this both on and off list
that I lose track on where we stand for certain things.  I think
preventing task movement is fine for the initial effort so long as we
don't prevent adding it in the future; I don't see anything (other
than the permission checks under discussion, which is fine) preventing
this.

> Currently, the method of detecting if its audit container ID has been
> set (rather than inherited) was to check its parent's audit container
> ID.

Yeah ... those are two different things.  I've been wondering if we
should introduce a set/inherited flag as simply checking the parent
task's audit container ID isn't quite the same; although it may be
"close enough" that it doesn't matter in practice.  However, I'm
beginning to think this parent/child relationship isn't really
important beyond the inheritance issue ... more on this below.

> The only reason to change this might be if the audit container ID
> were not inheritable, but then we lose the accountability of a task
> spawning another process and being able to leave its child's audit
> container ID unset and unaccountable to any existing container.  I think
> the relationship to the parent is crucial, and if something wants to
> change audit container ID it can, by spawning childrent and leaving a
> trail of container IDs in its parent processes.  (So what if a parent
> dies?)

The audit container ID *must* be inherited, I don't really think
anyone is questioning that.  What I'm wondering about is what we
accomplish by comparing the child's and parent's audit container ID?

I've thought about this a bit more and I think we are making this way
too complicated right now.  We basically have three rules for the
audit container ID which we need to follow:

1. Children inherit their parent's audit container ID; this includes
the magic "unset" audit container ID.
2. You can't change the audit container ID once set.
3. In order to set the audit container ID of a process you must have
CAP_AUDIT_CONTROL.

With that in mind, I think the permission checks would be something like this:

[SIDE NOTE: Audit Container ID in acronym form works out to "acid" ;) ]

  int perm(task, acid)
  {
      if (!task || !valid(acid))
         return -EINVAL;
      if (!capable(CAP_AUDIT_CONTROL))
         return -EPERM;
      if (task->acid != UNSET)
         return -EPERM;
      return 0;
  }

>> ... or does the orchestrator change that?  *Can* the orchestrator
>> change real_parent (I suspect the answer is "no")?
>
> I don't think the orchestrator is able to change real_parent.

I didn't think so either, but I didn't do an exhaustive check.

> I've forgotten why there is a ->parent and ->real_parent and how they can
> change.  One is for the wait signal.  I don't remember the purpose of
> the other.

I know ptrace makes use of real_parent when re-parenting the process
being ptrace'd.

> If the parent dies before the child, the child will be re-parented on
> its grandparent if the parent doesn't hang around zombified, if I
> understand correctly.  If anything, a parent dying would likely further
> restrict the ability to set a task's audit container ID because a parent
> with an identical ID could vanish.

All the more reason to go with the simplified approach above.  I think
the parent/child relationship is a bit of a distraction and a
complexity that isn't important (except for the inheritance of
course).

>> * I think the key is the relationship between current and task, not
>> between task and task->real_parent.  I believe what we really care
>> about is that task is a descendant of current.  We might also want to
>> allow current to change the audit container ID if it holds
>> CAP_AUDIT_CONTROL, regardless of it's relationship with task.
>
> Currently, a process with CAP_AUDIT_CONTROL can set the audit container
> ID of any task that hasn't got children or threads, isn't itself, and
> its audit container ID is inherited or unset.  This was to try to
> prevent games with parents and children scratching each other's backs.
>
> I would feel more comfortable if only descendants were settable, so
> adding that restriction sounds like a good idea to me other than the
> tree-climbing excercise and overhead involved.
>
>> >> >> > +static void audit_log_set_containerid(struct task_struct *task, u64 oldcontainerid,
>> >> >> > +                                     u64 containerid, int rc)
>> >> >> > +{
>> >> >> > +       struct audit_buffer *ab;
>> >> >> > +       uid_t uid;
>> >> >> > +       struct tty_struct *tty;
>> >> >> > +
>> >> >> > +       if (!audit_enabled)
>> >> >> > +               return;
>> >> >> > +
>> >> >> > +       ab = audit_log_start(NULL, GFP_KERNEL, AUDIT_CONTAINER);
>> >> >> > +       if (!ab)
>> >> >> > +               return;
>> >> >> > +
>> >> >> > +       uid = from_kuid(&init_user_ns, task_uid(current));
>> >> >> > +       tty = audit_get_tty(current);
>> >> >> > +
>> >> >> > +       audit_log_format(ab, "op=set pid=%d uid=%u", task_tgid_nr(current), uid);
>> >> >> > +       audit_log_task_context(ab);
>> >> >> > +       audit_log_format(ab, " auid=%u tty=%s ses=%u opid=%d old-contid=%llu contid=%llu res=%d",
>> >> >> > +                        from_kuid(&init_user_ns, audit_get_loginuid(current)),
>> >> >> > +                        tty ? tty_name(tty) : "(none)", audit_get_sessionid(current),
>> >> >> > +                        task_tgid_nr(task), oldcontainerid, containerid, !rc);
>> >> >> > +
>> >> >> > +       audit_put_tty(tty);
>> >> >> > +       audit_log_end(ab);
>> >> >> > +}
>> >> >> > +
>> >> >> > +/**
>> >> >> > + * audit_set_containerid - set current task's audit_context containerid
>> >> >> > + * @containerid: containerid value
>> >> >> > + *
>> >> >> > + * Returns 0 on success, -EPERM on permission failure.
>> >> >> > + *
>> >> >> > + * Called (set) from fs/proc/base.c::proc_containerid_write().
>> >> >> > + */
>> >> >> > +int audit_set_containerid(struct task_struct *task, u64 containerid)
>> >> >> > +{
>> >> >> > +       u64 oldcontainerid;
>> >> >> > +       int rc;
>> >> >> > +
>> >> >> > +       oldcontainerid = audit_get_containerid(task);
>> >> >> > +
>> >> >> > +       rc = audit_set_containerid_perm(task, containerid);
>> >> >> > +       if (!rc) {
>> >> >> > +               task_lock(task);
>> >> >> > +               task->containerid = containerid;
>> >> >> > +               task_unlock(task);
>> >> >> > +       }
>> >> >> > +
>> >> >> > +       audit_log_set_containerid(task, oldcontainerid, containerid, rc);
>> >> >> > +       return rc;
>> >> >>
>> >> >> Why are audit_set_containerid_perm() and audit_log_containerid()
>> >> >> separate functions?
>> >> >
>> >> > (I assume you mean audit_log_set_containerid()?)
>> >>
>> >> Yep.  My fingers got tired typing in that function name and decided a
>> >> shortcut was necessary.
>> >>
>> >> > It seemed clearer that all the permission checking was in one function
>> >> > and its return code could be used to report the outcome when logging the
>> >> > (attempted) action.  This is the same structure as audit_set_loginuid()
>> >> > and it made sense.
>> >>
>> >> When possible I really like it when the permission checks are in the
>> >> same function as the code which does the work; it's less likely to get
>> >> abused that way (you have to willfully bypass the access checks).  The
>> >> exceptions might be if you wanted to reuse the access control code, or
>> >> insert a modular access mechanism (e.g. LSMs).
>> >
>> > I don't follow how it could be abused.  The return code from the perm
>> > check gates setting the value and is used in the success field in the
>> > log.
>>
>> If the permission checks are in the same function body as the code
>> which does the work you have to either split the function, or rewrite
>> it, if you want to bypass the permission checks.  It may be more of a
>> style issue than an actual safety issue, but the comments about
>> single-use functions in the same scope is the tie breaker.
>
> Perhaps I'm just being quite dense, but I just don't follow what the
> problem is and how you suggest fixing it.  A bunch of gotos to a label
> such as "out:" to log the refused action?  That seems messy and
> unstructured.

Fold audit_set_containerid_perm() and audit_log_set_containerid() into
their only caller, audit_set_containerid().

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* Proposal
From: MS Zeliha Omer Faruk @ 2018-04-26 22:50 UTC (permalink / raw)





Hello

   Greetings to you today i asked before but i did't get a response please
i know this might come to you as a surprise because you do not know me
personally i have a business proposal for you please reply for more
info.



Best Regards,

Esentepe Mahallesi Büyükdere
Caddesi Kristal Kule Binasi
No:215
 Sisli - Istanbul, Turkey

^ permalink raw reply

* Re: [dm-devel] [PATCH v5] fault-injection: introduce kvmalloc fallback options
From: Mikulas Patocka @ 2018-04-26 22:52 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Andrew, eric.dumazet, linux-mm, edumazet, netdev, Randy Dunlap,
	John Stoffel, linux-kernel, Matthew Wilcox, Hocko,
	James Bottomley, Michal, dm-devel, David Rientjes, Morton,
	virtualization, David Miller, Vlastimil Babka
In-Reply-To: <20180427005213-mutt-send-email-mst@kernel.org>



On Fri, 27 Apr 2018, Michael S. Tsirkin wrote:

> On Thu, Apr 26, 2018 at 05:50:20PM -0400, Mikulas Patocka wrote:
> > How is the user or developer supposed to learn about this option, if 
> > he gets no crash at all?
> 
> Look in /sys/kernel/debug/fail* ? That actually lets you
> filter by module, process etc.
> 
> I think this patch conflates two things:
> 
> 1. Make kvmalloc use the vmalloc path.
>     This seems a bit narrow.
>     What is special about kvmalloc? IMHO nothing - it's yet another user
>     of __GFP_NORETRY or __GFP_RETRY_MAYFAIL. As any such

__GFP_RETRY_MAYFAIL makes the allocator retry the costly_order allocations

>     user, it either recovers correctly or not.
>     So IMHO it's just a case of
>     making __GFP_NORETRY, __GFP_RETRY_MAYFAIL, or both
>     fail once in a while.
>     Seems like a better extension to me than focusing on vmalloc.
>     I think you will find more bugs this way.

If the array is <= PAGE_SIZE, vmalloc will not use __GFP_NORETRY. So it 
still hides some bugs - such as, if a structure grows above 4k, it would 
start randomly crashing due to memory fragmentation.

> 2. Ability to control this from a separate config
>    option.
> 
>    It's still not that clear to me why is this such a
>    hard requirement.  If a distro wants to force specific
>    boot time options, why isn't CONFIG_CMDLINE sufficient?

There are 489 kernel options declared with the __setup keyword. Hardly any 
kernel developer notices that a new one was added and selects it when 
testing his code.

>    But assuming it's important to control this kind of
>    fault injection to be controlled from
>    a dedicated menuconfig option, why not the rest of
>    faults?

The injected faults cause damage to the user, so there's no point to 
enable them by default. vmalloc fallback should not cause any damage 
(assuming that the code is correctly written).

> IMHO if you split 1/2 up, and generalize, the path upstream
> will be much smoother.

This seems like a lost case. So, let's not care about code correctness and 
let's solve crashes only after they are reported. If the upstream wants to 
work this way, there's nothing that can be done about it.

I'm wondering if I can still push it to RHEL or not.

> Hope this helps.
> 
> -- 
> MST

Mikulas

^ permalink raw reply

* Re: [PATCH net-next] net/mlx4_en: optimizes get_fixed_ipv6_csum()
From: Saeed Mahameed @ 2018-04-26 22:56 UTC (permalink / raw)
  To: davem@davemloft.net, edumazet@google.com
  Cc: netdev@vger.kernel.org, eric.dumazet@gmail.com, Tariq Toukan
In-Reply-To: <20180419154929.25718-1-edumazet@google.com>

On Thu, 2018-04-19 at 08:49 -0700, Eric Dumazet wrote:
> While trying to support CHECKSUM_COMPLETE for IPV6 fragments,
> I had to experiments various hacks in get_fixed_ipv6_csum().
> I must admit I could not find how to implement this :/
> 
> However, get_fixed_ipv6_csum() does a lot of redundant operations,
> calling csum_partial() twice.
> 
> First csum_partial() computes the checksum of saddr and daddr,
> put in @csum_pseudo_hdr. Undone later in the second csum_partial()
> computed on whole ipv6 header.
> 
> Then nexthdr is added once, added a second time, then substracted.
> 
> payload_len is added once, then substracted.
> 
> Really all this can be reduced to two add_csum(), to add back 6 bytes
> that were removed by mlx4 when providing hw_checksum in RX
> descriptor.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Saeed Mahameed <saeedm@mellanox.com>
> Cc: Tariq Toukan <tariqt@mellanox.com>
> ---
> Note: This patch, like other mlx4 patches can definitely wait
> Tariq approval, thanks !
> 

LGTM,

Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>

>  drivers/net/ethernet/mellanox/mlx4/en_rx.c | 21 ++++++++----------
> ---
>  1 file changed, 8 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index
> 5c613c6663da51a4ae792eeb4d8956b54655786b..38c56fb6e5f5970f245dd56c38e
> 1fc63a9349a07 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -593,30 +593,25 @@ static int get_fixed_ipv4_csum(__wsum
> hw_checksum, struct sk_buff *skb,
>  }
>  
>  #if IS_ENABLED(CONFIG_IPV6)
> -/* In IPv6 packets, besides subtracting the pseudo header checksum,
> - * we also compute/add the IP header checksum which
> - * is not added by the HW.
> +/* In IPv6 packets, hw_checksum lacks 6 bytes from IPv6 header:
> + * 4 first bytes : priority, version, flow_lbl
> + * and 2 additional bytes : nexthdr, hop_limit.
>   */
>  static int get_fixed_ipv6_csum(__wsum hw_checksum, struct sk_buff
> *skb,
>  			       struct ipv6hdr *ipv6h)
>  {
>  	__u8 nexthdr = ipv6h->nexthdr;
> -	__wsum csum_pseudo_hdr = 0;
> +	__wsum temp;
>  
>  	if (unlikely(nexthdr == IPPROTO_FRAGMENT ||
>  		     nexthdr == IPPROTO_HOPOPTS ||
>  		     nexthdr == IPPROTO_SCTP))
>  		return -1;
> -	hw_checksum = csum_add(hw_checksum, (__force
> __wsum)htons(nexthdr));
>  
> -	csum_pseudo_hdr = csum_partial(&ipv6h->saddr,
> -				       sizeof(ipv6h->saddr) +
> sizeof(ipv6h->daddr), 0);
> -	csum_pseudo_hdr = csum_add(csum_pseudo_hdr, (__force
> __wsum)ipv6h->payload_len);
> -	csum_pseudo_hdr = csum_add(csum_pseudo_hdr,
> -				   (__force __wsum)htons(nexthdr));
> -
> -	skb->csum = csum_sub(hw_checksum, csum_pseudo_hdr);
> -	skb->csum = csum_add(skb->csum, csum_partial(ipv6h,
> sizeof(struct ipv6hdr), 0));
> +	/* priority, version, flow_lbl */
> +	temp = csum_add(hw_checksum, *(__wsum *)ipv6h);
> +	/* nexthdr and hop_limit */
> +	skb->csum = csum_add(temp, (__force __wsum)*(__be16
> *)&ipv6h->nexthdr);
>  	return 0;
>  }
>  #endif

^ permalink raw reply

* [PATCH net-next 0/7] selftests: Add tests for mirroring to gretap
From: Petr Machata @ 2018-04-26 23:17 UTC (permalink / raw)
  To: netdev; +Cc: davem, linux-mlxsw

This suite tests GRE-encapsulated mirroring. The general topology that
most of the tests use is as follows, but each test defines details of
the topology based on its needs, and some tests actually use a somewhat
different topology.

+---------------------+                      +---------------------+
| H1                  |                      |                  H2 |
|     + $h1           |                      |           $h2 +     |
+-----|---------------+                      +---------------|-----+
      |                                                      |
+-----|------------------------------------------------------|-----+
| SW  o---> mirror                                           |     |
| +---|------------------------------------------------------|---+ |
| |   + $swp1               BR                         $swp2 +   | |
| +--------------------------------------------------------------+ |
|                                                                  |
|     + $swp3          + gt6 (ip6gretap)    + gt4 (gretap)         |
+-----|----------------:--------------------:----------------------+
      |                :                    :
+-----|----------------:--------------------:----------------------+
|     + $h3            + h3-gt6(ip6gretap)  + h3-gt4 (gretap)      |
| H3                                                               |
+------------------------------------------------------------------+

The following axes of configuration space are tested:

- ingress and egress mirroring
- mirroring triggered by matchall and flower
- mirroring to ipgretap and ip6gretap
- remote tunnel reachable directly or through a next-hop route
- skip_sw as well as skip_hw configurations

Apart from basic tests with the above mentioned features, the following
tests are included:

- handling of changes to neighbors pertinent to routing decisions in
  mirrored underlay
- handling of configuration changes at the mirrored-to tunnel (endpoint
  addresses, upness)

A suite of mlxsw-specific tests will be part of a separate submission
through linux-mlxsw patch queue.

Petr Machata (7):
  selftests: forwarding: Add libs for gretap mirror testing
  selftests: forwarding: Add test for mirror to gretap
  selftests: forwarding: Test gretap mirror with next-hop remote
  selftests: forwarding: Test mirror to gretap w/ bound dev
  selftests: forwarding: Test flower mirror to gretap
  selftests: forwarding: Test neighbor updates when mirroring to gretap
  selftests: forwarding: Test changes in mirror-to-gretap

 tools/testing/selftests/net/forwarding/lib.sh      |  96 ++++++++++
 .../testing/selftests/net/forwarding/mirror_gre.sh | 139 ++++++++++++++
 .../selftests/net/forwarding/mirror_gre_bound.sh   | 213 +++++++++++++++++++++
 .../selftests/net/forwarding/mirror_gre_changes.sh | 194 +++++++++++++++++++
 .../selftests/net/forwarding/mirror_gre_flower.sh  | 116 +++++++++++
 .../selftests/net/forwarding/mirror_gre_lib.sh     |  85 ++++++++
 .../selftests/net/forwarding/mirror_gre_neigh.sh   | 101 ++++++++++
 .../selftests/net/forwarding/mirror_gre_nh.sh      | 117 +++++++++++
 .../net/forwarding/mirror_gre_topo_lib.sh          | 129 +++++++++++++
 .../testing/selftests/net/forwarding/mirror_lib.sh |  40 ++++
 10 files changed, 1230 insertions(+)
 create mode 100755 tools/testing/selftests/net/forwarding/mirror_gre.sh
 create mode 100755 tools/testing/selftests/net/forwarding/mirror_gre_bound.sh
 create mode 100755 tools/testing/selftests/net/forwarding/mirror_gre_changes.sh
 create mode 100755 tools/testing/selftests/net/forwarding/mirror_gre_flower.sh
 create mode 100644 tools/testing/selftests/net/forwarding/mirror_gre_lib.sh
 create mode 100755 tools/testing/selftests/net/forwarding/mirror_gre_neigh.sh
 create mode 100755 tools/testing/selftests/net/forwarding/mirror_gre_nh.sh
 create mode 100644 tools/testing/selftests/net/forwarding/mirror_gre_topo_lib.sh
 create mode 100644 tools/testing/selftests/net/forwarding/mirror_lib.sh

-- 
2.4.11

^ permalink raw reply

* [PATCH net-next 1/7] selftests: forwarding: Add libs for gretap mirror testing
From: Petr Machata @ 2018-04-26 23:17 UTC (permalink / raw)
  To: netdev; +Cc: davem, linux-mlxsw
In-Reply-To: <cover.1524784510.git.petrm@mellanox.com>

To simplify implementation of mirror-to-gretap tests, extend lib.sh with
several new functions that might potentially be useful more
broadly (although right now the mirroring tests will be the only
client).

Also add mirror_lib.sh with code useful for mirroring tests,
mirror_gre_lib.sh with code specifically useful for mirror-to-gretap
tests, and mirror_gre_topo.sh that primes a given test with a good
baseline topology that the test can then tweak to its liking.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
---
 tools/testing/selftests/net/forwarding/lib.sh      |  96 +++++++++++++++
 .../selftests/net/forwarding/mirror_gre_lib.sh     |  85 ++++++++++++++
 .../net/forwarding/mirror_gre_topo_lib.sh          | 129 +++++++++++++++++++++
 .../testing/selftests/net/forwarding/mirror_lib.sh |  40 +++++++
 4 files changed, 350 insertions(+)
 create mode 100644 tools/testing/selftests/net/forwarding/mirror_gre_lib.sh
 create mode 100644 tools/testing/selftests/net/forwarding/mirror_gre_topo_lib.sh
 create mode 100644 tools/testing/selftests/net/forwarding/mirror_lib.sh

diff --git a/tools/testing/selftests/net/forwarding/lib.sh b/tools/testing/selftests/net/forwarding/lib.sh
index 1ac6c62..a066ca5 100644
--- a/tools/testing/selftests/net/forwarding/lib.sh
+++ b/tools/testing/selftests/net/forwarding/lib.sh
@@ -321,6 +321,25 @@ simple_if_fini()
 	vrf_destroy $vrf_name
 }
 
+tunnel_create()
+{
+	local name=$1; shift
+	local type=$1; shift
+	local local=$1; shift
+	local remote=$1; shift
+
+	ip link add name $name type $type \
+	   local $local remote $remote "$@"
+	ip link set dev $name up
+}
+
+tunnel_destroy()
+{
+	local name=$1; shift
+
+	ip link del dev $name
+}
+
 master_name_get()
 {
 	local if_name=$1
@@ -335,6 +354,15 @@ link_stats_tx_packets_get()
        ip -j -s link show dev $if_name | jq '.[]["stats64"]["tx"]["packets"]'
 }
 
+tc_rule_stats_get()
+{
+	local dev=$1; shift
+	local pref=$1; shift
+
+	tc -j -s filter show dev $dev ingress pref $pref |
+	jq '.[1].options.actions[].stats.packets'
+}
+
 mac_get()
 {
 	local if_name=$1
@@ -381,6 +409,74 @@ tc_offload_check()
 	return 0
 }
 
+slow_path_trap_install()
+{
+	local dev=$1; shift
+	local direction=$1; shift
+
+	if [ "${tcflags/skip_hw}" != "$tcflags" ]; then
+		# For slow-path testing, we need to install a trap to get to
+		# slow path the packets that would otherwise be switched in HW.
+		tc filter add dev $dev $direction pref 1 \
+		   flower skip_sw action trap
+	fi
+}
+
+slow_path_trap_uninstall()
+{
+	local dev=$1; shift
+	local direction=$1; shift
+
+	if [ "${tcflags/skip_hw}" != "$tcflags" ]; then
+		tc filter del dev $dev $direction pref 1 flower skip_sw
+	fi
+}
+
+__icmp_capture_add_del()
+{
+	local add_del=$1; shift
+	local pref=$1; shift
+	local vsuf=$1; shift
+	local tundev=$1; shift
+	local filter=$1; shift
+
+	tc filter $add_del dev "$tundev" ingress \
+	   proto ip$vsuf pref $pref \
+	   flower ip_proto icmp$vsuf $filter \
+	   action pass
+}
+
+icmp_capture_install()
+{
+	__icmp_capture_add_del add 100 "" "$@"
+}
+
+icmp_capture_uninstall()
+{
+	__icmp_capture_add_del del 100 "" "$@"
+}
+
+icmp6_capture_install()
+{
+	__icmp_capture_add_del add 100 v6 "$@"
+}
+
+icmp6_capture_uninstall()
+{
+	__icmp_capture_add_del del 100 v6 "$@"
+}
+
+matchall_sink_create()
+{
+	local dev=$1; shift
+
+	tc qdisc add dev $dev clsact
+	tc filter add dev $dev ingress \
+	   pref 10000 \
+	   matchall \
+	   action drop
+}
+
 ##############################################################################
 # Tests
 
diff --git a/tools/testing/selftests/net/forwarding/mirror_gre_lib.sh b/tools/testing/selftests/net/forwarding/mirror_gre_lib.sh
new file mode 100644
index 0000000..207ffd1
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/mirror_gre_lib.sh
@@ -0,0 +1,85 @@
+# SPDX-License-Identifier: GPL-2.0
+
+do_test_span_gre_dir_ips()
+{
+	local expect=$1; shift
+	local tundev=$1; shift
+	local direction=$1; shift
+	local ip1=$1; shift
+	local ip2=$1; shift
+
+	icmp_capture_install h3-$tundev
+	mirror_test v$h1 $ip1 $ip2 h3-$tundev 100 $expect
+	mirror_test v$h2 $ip2 $ip1 h3-$tundev 100 $expect
+	icmp_capture_uninstall h3-$tundev
+}
+
+quick_test_span_gre_dir_ips()
+{
+	do_test_span_gre_dir_ips 10 "$@"
+}
+
+fail_test_span_gre_dir_ips()
+{
+	do_test_span_gre_dir_ips 0 "$@"
+}
+
+test_span_gre_dir_ips()
+{
+	local tundev=$1; shift
+	local direction=$1; shift
+	local forward_type=$1; shift
+	local backward_type=$1; shift
+	local ip1=$1; shift
+	local ip2=$1; shift
+
+	quick_test_span_gre_dir_ips "$tundev" "$direction" "$ip1" "$ip2"
+
+	icmp_capture_install h3-$tundev "type $forward_type"
+	mirror_test v$h1 $ip1 $ip2 h3-$tundev 100 10
+	icmp_capture_uninstall h3-$tundev
+
+	icmp_capture_install h3-$tundev "type $backward_type"
+	mirror_test v$h2 $ip2 $ip1 h3-$tundev 100 10
+	icmp_capture_uninstall h3-$tundev
+}
+
+full_test_span_gre_dir_ips()
+{
+	local tundev=$1; shift
+	local direction=$1; shift
+	local forward_type=$1; shift
+	local backward_type=$1; shift
+	local what=$1; shift
+	local ip1=$1; shift
+	local ip2=$1; shift
+
+	RET=0
+
+	mirror_install $swp1 $direction $tundev "matchall $tcflags"
+	test_span_gre_dir_ips "$tundev" "$direction" "$forward_type" \
+			      "$backward_type" "$ip1" "$ip2"
+	mirror_uninstall $swp1 $direction
+
+	log_test "$direction $what ($tcflags)"
+}
+
+quick_test_span_gre_dir()
+{
+	quick_test_span_gre_dir_ips "$@" 192.0.2.1 192.0.2.2
+}
+
+fail_test_span_gre_dir()
+{
+	fail_test_span_gre_dir_ips "$@" 192.0.2.1 192.0.2.2
+}
+
+test_span_gre_dir()
+{
+	test_span_gre_dir_ips "$@" 192.0.2.1 192.0.2.2
+}
+
+full_test_span_gre_dir()
+{
+	full_test_span_gre_dir_ips "$@" 192.0.2.1 192.0.2.2
+}
diff --git a/tools/testing/selftests/net/forwarding/mirror_gre_topo_lib.sh b/tools/testing/selftests/net/forwarding/mirror_gre_topo_lib.sh
new file mode 100644
index 0000000..b3ceda2
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/mirror_gre_topo_lib.sh
@@ -0,0 +1,129 @@
+# SPDX-License-Identifier: GPL-2.0
+
+# This is the standard topology for testing mirroring to gretap and ip6gretap
+# netdevices. The tests that use it tweak it in one way or another--importantly,
+# $swp3 and $h3 need to have addresses set up.
+#
+#   +---------------------+                             +---------------------+
+#   | H1                  |                             |                  H2 |
+#   |     + $h1           |                             |           $h2 +     |
+#   |     | 192.0.2.1/28  |                             |  192.0.2.2/28 |     |
+#   +-----|---------------+                             +---------------|-----+
+#         |                                                             |
+#   +-----|-------------------------------------------------------------|-----+
+#   | SW  o--> mirror                                                   |     |
+#   | +---|-------------------------------------------------------------|---+ |
+#   | |   + $swp1                    BR                           $swp2 +   | |
+#   | +---------------------------------------------------------------------+ |
+#   |                                                                         |
+#   |     + $swp3               + gt6 (ip6gretap)      + gt4 (gretap)         |
+#   |     |                     : loc=2001:db8:2::1    : loc=192.0.2.129      |
+#   |     |                     : rem=2001:db8:2::2    : rem=192.0.2.130      |
+#   |     |                     : ttl=100              : ttl=100              |
+#   |     |                     : tos=inherit          : tos=inherit          |
+#   |     |                     :                      :                      |
+#   +-----|---------------------:----------------------:----------------------+
+#         |                     :                      :
+#   +-----|---------------------:----------------------:----------------------+
+#   | H3  + $h3                 + h3-gt6 (ip6gretap)   + h3-gt4 (gretap)      |
+#   |                             loc=2001:db8:2::2      loc=192.0.2.130      |
+#   |                             rem=2001:db8:2::1      rem=192.0.2.129      |
+#   |                             ttl=100                ttl=100              |
+#   |                             tos=inherit            tos=inherit          |
+#   |                                                                         |
+#   +-------------------------------------------------------------------------+
+
+mirror_gre_topo_h1_create()
+{
+	simple_if_init $h1 192.0.2.1/28
+}
+
+mirror_gre_topo_h1_destroy()
+{
+	simple_if_fini $h1 192.0.2.1/28
+}
+
+mirror_gre_topo_h2_create()
+{
+	simple_if_init $h2 192.0.2.2/28
+}
+
+mirror_gre_topo_h2_destroy()
+{
+	simple_if_fini $h2 192.0.2.2/28
+}
+
+mirror_gre_topo_h3_create()
+{
+	simple_if_init $h3
+
+	tunnel_create h3-gt4 gretap 192.0.2.130 192.0.2.129
+	ip link set h3-gt4 vrf v$h3
+	matchall_sink_create h3-gt4
+
+	tunnel_create h3-gt6 ip6gretap 2001:db8:2::2 2001:db8:2::1
+	ip link set h3-gt6 vrf v$h3
+	matchall_sink_create h3-gt6
+}
+
+mirror_gre_topo_h3_destroy()
+{
+	tunnel_destroy h3-gt6
+	tunnel_destroy h3-gt4
+
+	simple_if_fini $h3
+}
+
+mirror_gre_topo_switch_create()
+{
+	ip link set dev $swp3 up
+
+	ip link add name br1 type bridge vlan_filtering 1
+	ip link set dev br1 up
+
+	ip link set dev $swp1 master br1
+	ip link set dev $swp1 up
+
+	ip link set dev $swp2 master br1
+	ip link set dev $swp2 up
+
+	tunnel_create gt4 gretap 192.0.2.129 192.0.2.130 \
+		      ttl 100 tos inherit
+
+	tunnel_create gt6 ip6gretap 2001:db8:2::1 2001:db8:2::2 \
+		      ttl 100 tos inherit allow-localremote
+
+	tc qdisc add dev $swp1 clsact
+}
+
+mirror_gre_topo_switch_destroy()
+{
+	tc qdisc del dev $swp1 clsact
+
+	tunnel_destroy gt6
+	tunnel_destroy gt4
+
+	ip link set dev $swp1 down
+	ip link set dev $swp2 down
+	ip link del dev br1
+
+	ip link set dev $swp3 down
+}
+
+mirror_gre_topo_create()
+{
+	mirror_gre_topo_h1_create
+	mirror_gre_topo_h2_create
+	mirror_gre_topo_h3_create
+
+	mirror_gre_topo_switch_create
+}
+
+mirror_gre_topo_destroy()
+{
+	mirror_gre_topo_switch_destroy
+
+	mirror_gre_topo_h3_destroy
+	mirror_gre_topo_h2_destroy
+	mirror_gre_topo_h1_destroy
+}
diff --git a/tools/testing/selftests/net/forwarding/mirror_lib.sh b/tools/testing/selftests/net/forwarding/mirror_lib.sh
new file mode 100644
index 0000000..e5028a5
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/mirror_lib.sh
@@ -0,0 +1,40 @@
+# SPDX-License-Identifier: GPL-2.0
+
+mirror_install()
+{
+	local from_dev=$1; shift
+	local direction=$1; shift
+	local to_dev=$1; shift
+	local filter=$1; shift
+
+	tc filter add dev $from_dev $direction \
+	   pref 1000 $filter \
+	   action mirred egress mirror dev $to_dev
+}
+
+mirror_uninstall()
+{
+	local from_dev=$1; shift
+	local direction=$1; shift
+
+	tc filter del dev $swp1 $direction pref 1000
+}
+
+mirror_test()
+{
+	local vrf_name=$1; shift
+	local sip=$1; shift
+	local dip=$1; shift
+	local dev=$1; shift
+	local pref=$1; shift
+	local expect=$1; shift
+
+	local t0=$(tc_rule_stats_get $dev $pref)
+	ip vrf exec $vrf_name \
+	   ${PING} ${sip:+-I $sip} $dip -c 10 -i 0.1 -w 2 &> /dev/null
+	local t1=$(tc_rule_stats_get $dev $pref)
+	local delta=$((t1 - t0))
+	# Tolerate a couple stray extra packets.
+	((expect <= delta && delta <= expect + 2))
+	check_err $? "Expected to capture $expect packets, got $delta."
+}
-- 
2.4.11

^ permalink raw reply related

* [PATCH net-next 2/7] selftests: forwarding: Add test for mirror to gretap
From: Petr Machata @ 2018-04-26 23:18 UTC (permalink / raw)
  To: netdev; +Cc: davem, linux-mlxsw
In-Reply-To: <cover.1524784510.git.petrm@mellanox.com>

Add a test for basic mirroring to gretap and ip6gretap netdevices.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
---
 .../testing/selftests/net/forwarding/mirror_gre.sh | 139 +++++++++++++++++++++
 1 file changed, 139 insertions(+)
 create mode 100755 tools/testing/selftests/net/forwarding/mirror_gre.sh

diff --git a/tools/testing/selftests/net/forwarding/mirror_gre.sh b/tools/testing/selftests/net/forwarding/mirror_gre.sh
new file mode 100755
index 0000000..a8abc73
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/mirror_gre.sh
@@ -0,0 +1,139 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# This test uses standard topology for testing gretap. See
+# mirror_gre_topo_lib.sh for more details.
+#
+# Test for "tc action mirred egress mirror" when the device to mirror to is a
+# gretap or ip6gretap netdevice. Expect that the packets come out encapsulated,
+# and another gretap / ip6gretap netdevice is then capable of decapsulating the
+# traffic. Test that the payload is what is expected (ICMP ping request or
+# reply, depending on test).
+
+NUM_NETIFS=6
+source lib.sh
+source mirror_lib.sh
+source mirror_gre_lib.sh
+source mirror_gre_topo_lib.sh
+
+setup_prepare()
+{
+	h1=${NETIFS[p1]}
+	swp1=${NETIFS[p2]}
+
+	swp2=${NETIFS[p3]}
+	h2=${NETIFS[p4]}
+
+	swp3=${NETIFS[p5]}
+	h3=${NETIFS[p6]}
+
+	vrf_prepare
+	mirror_gre_topo_create
+
+	ip address add dev $swp3 192.0.2.129/28
+	ip address add dev $h3 192.0.2.130/28
+
+	ip address add dev $swp3 2001:db8:2::1/64
+	ip address add dev $h3 2001:db8:2::2/64
+}
+
+cleanup()
+{
+	pre_cleanup
+
+	ip address del dev $h3 2001:db8:2::2/64
+	ip address del dev $swp3 2001:db8:2::1/64
+
+	ip address del dev $h3 192.0.2.130/28
+	ip address del dev $swp3 192.0.2.129/28
+
+	mirror_gre_topo_destroy
+	vrf_cleanup
+}
+
+test_span_gre_mac()
+{
+	local tundev=$1; shift
+	local direction=$1; shift
+	local prot=$1; shift
+	local what=$1; shift
+
+	local swp3mac=$(mac_get $swp3)
+	local h3mac=$(mac_get $h3)
+
+	RET=0
+
+	mirror_install $swp1 $direction $tundev "matchall $tcflags"
+	tc qdisc add dev $h3 clsact
+	tc filter add dev $h3 ingress pref 77 prot $prot \
+		flower ip_proto 0x2f src_mac $swp3mac dst_mac $h3mac \
+		action pass
+
+	mirror_test v$h1 192.0.2.1 192.0.2.2 $h3 77 10
+
+	tc filter del dev $h3 ingress pref 77
+	tc qdisc del dev $h3 clsact
+	mirror_uninstall $swp1 $direction
+
+	log_test "$direction $what: envelope MAC ($tcflags)"
+}
+
+test_two_spans()
+{
+	RET=0
+
+	mirror_install $swp1 ingress gt4 "matchall $tcflags"
+	mirror_install $swp1 egress gt6 "matchall $tcflags"
+	quick_test_span_gre_dir gt4 ingress
+	quick_test_span_gre_dir gt6 egress
+
+	mirror_uninstall $swp1 ingress
+	fail_test_span_gre_dir gt4 ingress
+	quick_test_span_gre_dir gt6 egress
+
+	mirror_install $swp1 ingress gt4 "matchall $tcflags"
+	mirror_uninstall $swp1 egress
+	quick_test_span_gre_dir gt4 ingress
+	fail_test_span_gre_dir gt6 egress
+
+	mirror_uninstall $swp1 ingress
+	log_test "two simultaneously configured mirrors ($tcflags)"
+}
+
+test_all()
+{
+	slow_path_trap_install $swp1 ingress
+	slow_path_trap_install $swp1 egress
+
+	full_test_span_gre_dir gt4 ingress 8 0 "mirror to gretap"
+	full_test_span_gre_dir gt6 ingress 8 0 "mirror to ip6gretap"
+	full_test_span_gre_dir gt4 egress 0 8 "mirror to gretap"
+	full_test_span_gre_dir gt6 egress 0 8 "mirror to ip6gretap"
+
+	test_span_gre_mac gt4 ingress ip "mirror to gretap"
+	test_span_gre_mac gt6 ingress ipv6 "mirror to ip6gretap"
+	test_span_gre_mac gt4 egress ip "mirror to gretap"
+	test_span_gre_mac gt6 egress ipv6 "mirror to ip6gretap"
+
+	test_two_spans
+
+	slow_path_trap_uninstall $swp1 egress
+	slow_path_trap_uninstall $swp1 ingress
+}
+
+trap cleanup EXIT
+
+setup_prepare
+setup_wait
+
+tcflags="skip_hw"
+test_all
+
+if ! tc_offload_check; then
+	echo "WARN: Could not test offloaded functionality"
+else
+	tcflags="skip_sw"
+	test_all
+fi
+
+exit $EXIT_STATUS
-- 
2.4.11

^ permalink raw reply related

* [PATCH net-next 3/7] selftests: forwarding: Test gretap mirror with next-hop remote
From: Petr Machata @ 2018-04-26 23:19 UTC (permalink / raw)
  To: netdev; +Cc: davem, linux-mlxsw
In-Reply-To: <cover.1524784510.git.petrm@mellanox.com>

Test mirror to a gretap and an ip6gretap netdevice such that the remote
address of the tunnel is reachable through a next-hop route.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
---
 .../selftests/net/forwarding/mirror_gre_nh.sh      | 117 +++++++++++++++++++++
 1 file changed, 117 insertions(+)
 create mode 100755 tools/testing/selftests/net/forwarding/mirror_gre_nh.sh

diff --git a/tools/testing/selftests/net/forwarding/mirror_gre_nh.sh b/tools/testing/selftests/net/forwarding/mirror_gre_nh.sh
new file mode 100755
index 0000000..9ac7097
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/mirror_gre_nh.sh
@@ -0,0 +1,117 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# This test uses standard topology for testing gretap. See
+# mirror_gre_topo_lib.sh for more details.
+#
+# Test that gretap and ip6gretap mirroring works when the other tunnel endpoint
+# is reachable through a next-hop route (as opposed to directly-attached route).
+
+NUM_NETIFS=6
+source lib.sh
+source mirror_lib.sh
+source mirror_gre_lib.sh
+source mirror_gre_topo_lib.sh
+
+setup_prepare()
+{
+	h1=${NETIFS[p1]}
+	swp1=${NETIFS[p2]}
+
+	swp2=${NETIFS[p3]}
+	h2=${NETIFS[p4]}
+
+	swp3=${NETIFS[p5]}
+	h3=${NETIFS[p6]}
+
+	vrf_prepare
+	mirror_gre_topo_create
+
+	ip address add dev $swp3 192.0.2.161/28
+	ip address add dev $h3 192.0.2.162/28
+	ip address add dev gt4 192.0.2.129/32
+	ip address add dev h3-gt4 192.0.2.130/32
+
+	# IPv6 route can't be added after address. Such routes are rejected due
+	# to the gateway address having been configured on the local system. It
+	# works the other way around though.
+	ip address add dev $swp3 2001:db8:4::1/64
+	ip -6 route add 2001:db8:2::2/128 via 2001:db8:4::2
+	ip address add dev $h3 2001:db8:4::2/64
+	ip address add dev gt6 2001:db8:2::1
+	ip address add dev h3-gt6 2001:db8:2::2
+}
+
+cleanup()
+{
+	pre_cleanup
+
+	ip -6 route del 2001:db8:2::2/128 via 2001:db8:4::2
+	ip address del dev $h3 2001:db8:4::2/64
+	ip address del dev $swp3 2001:db8:4::1/64
+
+	ip address del dev $h3 192.0.2.162/28
+	ip address del dev $swp3 192.0.2.161/28
+
+	mirror_gre_topo_destroy
+	vrf_cleanup
+}
+
+test_gretap()
+{
+	RET=0
+	mirror_install $swp1 ingress gt4 "matchall $tcflags"
+
+	# For IPv4, test that there's no mirroring without the route directing
+	# the traffic to tunnel remote address. Then add it and test that
+	# mirroring starts. For IPv6 we can't test this due to the limitation
+	# that routes for locally-specified IPv6 addresses can't be added.
+	fail_test_span_gre_dir gt4 ingress
+
+	ip route add 192.0.2.130/32 via 192.0.2.162
+	quick_test_span_gre_dir gt4 ingress
+	ip route del 192.0.2.130/32 via 192.0.2.162
+
+	mirror_uninstall $swp1 ingress
+	log_test "mirror to gre with next-hop remote ($tcflags)"
+}
+
+test_ip6gretap()
+{
+	RET=0
+
+	mirror_install $swp1 ingress gt6 "matchall $tcflags"
+	quick_test_span_gre_dir gt6 ingress
+	mirror_uninstall $swp1 ingress
+
+	log_test "mirror to ip6gre with next-hop remote ($tcflags)"
+}
+
+test_all()
+{
+	slow_path_trap_install $swp1 ingress
+	slow_path_trap_install $swp1 egress
+
+	test_gretap
+	test_ip6gretap
+
+	slow_path_trap_uninstall $swp1 egress
+	slow_path_trap_uninstall $swp1 ingress
+}
+
+trap cleanup EXIT
+
+setup_prepare
+setup_wait
+
+tcflags="skip_hw"
+test_all
+
+if ! tc_offload_check; then
+	echo "WARN: Could not test offloaded functionality"
+else
+	tcflags="skip_sw"
+	test_all
+fi
+
+exit $EXIT_STATUS
-- 
2.4.11

^ permalink raw reply related

* [PATCH net-next 4/7] selftests: forwarding: Test mirror to gretap w/ bound dev
From: Petr Machata @ 2018-04-26 23:19 UTC (permalink / raw)
  To: netdev; +Cc: davem, linux-mlxsw
In-Reply-To: <cover.1524784510.git.petrm@mellanox.com>

Test mirroring to a gretap and an ip6gretap netdevice with a bound
device, where the tunnel device and the bound device are in different
VRFs (an overlay / underlay configuration).

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
---
 .../selftests/net/forwarding/mirror_gre_bound.sh   | 213 +++++++++++++++++++++
 1 file changed, 213 insertions(+)
 create mode 100755 tools/testing/selftests/net/forwarding/mirror_gre_bound.sh

diff --git a/tools/testing/selftests/net/forwarding/mirror_gre_bound.sh b/tools/testing/selftests/net/forwarding/mirror_gre_bound.sh
new file mode 100755
index 0000000..3708ac0
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/mirror_gre_bound.sh
@@ -0,0 +1,213 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+#   +---------------------+                             +---------------------+
+#   | H1                  |                             |                  H2 |
+#   |     + $h1           |                             |           $h2 +     |
+#   |     | 192.0.2.1/28  |                             |  192.0.2.2/28 |     |
+#   +-----|---------------+                             +---------------|-----+
+#         |                                                             |
+#   +-----|-------------------------------------------------------------|-----+
+#   | SW  o--> mirror                                                   |     |
+#   | +---|-------------------------------------------------------------|---+ |
+#   | |   + $swp1                    BR                           $swp2 +   | |
+#   | +---------------------------------------------------------------------+ |
+#   |                                                                         |
+#   | +---------------------------------------------------------------------+ |
+#   | | OL                      + gt6 (ip6gretap)      + gt4 (gretap)       | |
+#   | |                         : loc=2001:db8:2::1    : loc=192.0.2.129    | |
+#   | |                         : rem=2001:db8:2::2    : rem=192.0.2.130    | |
+#   | |                         : ttl=100              : ttl=100            | |
+#   | |                         : tos=inherit          : tos=inherit        | |
+#   | +-------------------------:--|-------------------:--|-----------------+ |
+#   |                           :  |                   :  |                   |
+#   | +-------------------------:--|-------------------:--|-----------------+ |
+#   | | UL                      :  |,---------------------'                 | |
+#   | |   + $swp3               :  ||                  :                    | |
+#   | |   | 192.0.2.129/28      :  vv                  :                    | |
+#   | |   | 2001:db8:2::1/64    :  + ul (dummy)        :                    | |
+#   | +---|---------------------:----------------------:--------------------+ |
+#   +-----|---------------------:----------------------:----------------------+
+#         |                     :                      :
+#   +-----|---------------------:----------------------:----------------------+
+#   | H3  + $h3                 + h3-gt6 (ip6gretap)   + h3-gt4 (gretap)      |
+#   |       192.0.2.130/28        loc=2001:db8:2::2      loc=192.0.2.130      |
+#   |       2001:db8:2::2/64      rem=2001:db8:2::1      rem=192.0.2.129      |
+#   |                             ttl=100                ttl=100              |
+#   |                             tos=inherit            tos=inherit          |
+#   |                                                                         |
+#   +-------------------------------------------------------------------------+
+#
+# This tests mirroring to gretap and ip6gretap configured in an overlay /
+# underlay manner, i.e. with a bound dummy device that marks underlay VRF where
+# the encapsulated packed should be routed.
+
+NUM_NETIFS=6
+source lib.sh
+source mirror_lib.sh
+source mirror_gre_lib.sh
+
+h1_create()
+{
+	simple_if_init $h1 192.0.2.1/28
+}
+
+h1_destroy()
+{
+	simple_if_fini $h1 192.0.2.1/28
+}
+
+h2_create()
+{
+	simple_if_init $h2 192.0.2.2/28
+}
+
+h2_destroy()
+{
+	simple_if_fini $h2 192.0.2.2/28
+}
+
+h3_create()
+{
+	simple_if_init $h3 192.0.2.130/28 2001:db8:2::2/64
+
+	tunnel_create h3-gt4 gretap 192.0.2.130 192.0.2.129
+	ip link set h3-gt4 vrf v$h3
+	matchall_sink_create h3-gt4
+
+	tunnel_create h3-gt6 ip6gretap 2001:db8:2::2 2001:db8:2::1
+	ip link set h3-gt6 vrf v$h3
+	matchall_sink_create h3-gt6
+}
+
+h3_destroy()
+{
+	tunnel_destroy h3-gt6
+	tunnel_destroy h3-gt4
+
+	simple_if_fini $h3 192.0.2.130/28 2001:db8:2::2/64
+}
+
+switch_create()
+{
+	# Bridge between H1 and H2.
+
+	ip link add name br1 type bridge vlan_filtering 1
+	ip link set dev br1 up
+
+	ip link set dev $swp1 master br1
+	ip link set dev $swp1 up
+
+	ip link set dev $swp2 master br1
+	ip link set dev $swp2 up
+
+	tc qdisc add dev $swp1 clsact
+
+	# Underlay.
+
+	simple_if_init $swp3 192.0.2.129/28 2001:db8:2::1/64
+
+	ip link add name ul type dummy
+	ip link set dev ul master v$swp3
+	ip link set dev ul up
+
+	# Overlay.
+
+	vrf_create vrf-ol
+	ip link set dev vrf-ol up
+
+	tunnel_create gt4 gretap 192.0.2.129 192.0.2.130 \
+		      ttl 100 tos inherit dev ul
+	ip link set dev gt4 master vrf-ol
+	ip link set dev gt4 up
+
+	tunnel_create gt6 ip6gretap 2001:db8:2::1 2001:db8:2::2 \
+		      ttl 100 tos inherit dev ul allow-localremote
+	ip link set dev gt6 master vrf-ol
+	ip link set dev gt6 up
+}
+
+switch_destroy()
+{
+	vrf_destroy vrf-ol
+
+	tunnel_destroy gt6
+	tunnel_destroy gt4
+
+	simple_if_fini $swp3 192.0.2.129/28 2001:db8:2::1/64
+
+	ip link del dev ul
+
+	tc qdisc del dev $swp1 clsact
+
+	ip link set dev $swp1 down
+	ip link set dev $swp2 down
+	ip link del dev br1
+}
+
+setup_prepare()
+{
+	h1=${NETIFS[p1]}
+	swp1=${NETIFS[p2]}
+
+	swp2=${NETIFS[p3]}
+	h2=${NETIFS[p4]}
+
+	swp3=${NETIFS[p5]}
+	h3=${NETIFS[p6]}
+
+	vrf_prepare
+
+	h1_create
+	h2_create
+	h3_create
+
+	switch_create
+}
+
+cleanup()
+{
+	pre_cleanup
+
+	switch_destroy
+
+	h3_destroy
+	h2_destroy
+	h1_destroy
+
+	vrf_cleanup
+}
+
+test_all()
+{
+	RET=0
+
+	slow_path_trap_install $swp1 ingress
+	slow_path_trap_install $swp1 egress
+
+	full_test_span_gre_dir gt4 ingress 8 0 "mirror to gretap w/ UL"
+	full_test_span_gre_dir gt6 ingress 8 0 "mirror to ip6gretap w/ UL"
+
+	full_test_span_gre_dir gt4 egress  0 8 "mirror to gretap w/ UL"
+	full_test_span_gre_dir gt6 egress  0 8 "mirror to ip6gretap w/ UL"
+
+	slow_path_trap_uninstall $swp1 egress
+	slow_path_trap_uninstall $swp1 ingress
+}
+
+trap cleanup EXIT
+
+setup_prepare
+setup_wait
+
+tcflags="skip_hw"
+test_all
+
+if ! tc_offload_check; then
+	echo "WARN: Could not test offloaded functionality"
+else
+	tcflags="skip_sw"
+	test_all
+fi
+
+exit $EXIT_STATUS
-- 
2.4.11

^ permalink raw reply related

* [PATCH net-next 5/7] selftests: forwarding: Test flower mirror to gretap
From: Petr Machata @ 2018-04-26 23:20 UTC (permalink / raw)
  To: netdev; +Cc: davem, linux-mlxsw
In-Reply-To: <cover.1524784510.git.petrm@mellanox.com>

Add a test for mirroring to a gretap and an ip6gretap netdevices such
that the mirroring action is triggered by a flower match.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
---
 .../selftests/net/forwarding/mirror_gre_flower.sh  | 116 +++++++++++++++++++++
 1 file changed, 116 insertions(+)
 create mode 100755 tools/testing/selftests/net/forwarding/mirror_gre_flower.sh

diff --git a/tools/testing/selftests/net/forwarding/mirror_gre_flower.sh b/tools/testing/selftests/net/forwarding/mirror_gre_flower.sh
new file mode 100755
index 0000000..178a42d
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/mirror_gre_flower.sh
@@ -0,0 +1,116 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# This test uses standard topology for testing gretap. See
+# mirror_gre_topo_lib.sh for more details.
+#
+# This tests flower-triggered mirroring to gretap and ip6gretap netdevices. The
+# interfaces on H1 and H2 have two addresses each. Flower match on one of the
+# addresses is configured with mirror action. It is expected that when pinging
+# this address, mirroring takes place, whereas when pinging the other one,
+# there's no mirroring.
+
+NUM_NETIFS=6
+source lib.sh
+source mirror_lib.sh
+source mirror_gre_lib.sh
+source mirror_gre_topo_lib.sh
+
+setup_prepare()
+{
+	h1=${NETIFS[p1]}
+	swp1=${NETIFS[p2]}
+
+	swp2=${NETIFS[p3]}
+	h2=${NETIFS[p4]}
+
+	swp3=${NETIFS[p5]}
+	h3=${NETIFS[p6]}
+
+	vrf_prepare
+	mirror_gre_topo_create
+
+	ip address add dev $swp3 192.0.2.129/28
+	ip address add dev $h3 192.0.2.130/28
+
+	ip address add dev $swp3 2001:db8:2::1/64
+	ip address add dev $h3 2001:db8:2::2/64
+
+	ip address add dev $h1 192.0.2.3/28
+	ip address add dev $h2 192.0.2.4/28
+}
+
+cleanup()
+{
+	pre_cleanup
+
+	ip address del dev $h2 192.0.2.4/28
+	ip address del dev $h1 192.0.2.3/28
+
+	ip address del dev $h3 2001:db8:2::2/64
+	ip address del dev $swp3 2001:db8:2::1/64
+
+	ip address del dev $h3 192.0.2.130/28
+	ip address del dev $swp3 192.0.2.129/28
+
+	mirror_gre_topo_destroy
+	vrf_cleanup
+}
+
+test_span_gre_dir_acl()
+{
+	test_span_gre_dir_ips "$@" 192.0.2.3 192.0.2.4
+}
+
+full_test_span_gre_dir_acl()
+{
+	local tundev=$1; shift
+	local direction=$1; shift
+	local forward_type=$1; shift
+	local backward_type=$1; shift
+	local match_dip=$1; shift
+	local what=$1; shift
+
+	mirror_install $swp1 $direction $tundev \
+		       "protocol ip flower $tcflags dst_ip $match_dip"
+	fail_test_span_gre_dir $tundev $direction
+	test_span_gre_dir_acl "$tundev" "$direction" \
+			  "$forward_type" "$backward_type"
+	mirror_uninstall $swp1 $direction
+
+	log_test "$direction $what ($tcflags)"
+}
+
+test_all()
+{
+	RET=0
+
+	slow_path_trap_install $swp1 ingress
+	slow_path_trap_install $swp1 egress
+
+	full_test_span_gre_dir_acl gt4 ingress 8 0 192.0.2.4 "ACL mirror to gretap"
+	full_test_span_gre_dir_acl gt6 ingress 8 0 192.0.2.4 "ACL mirror to ip6gretap"
+
+	full_test_span_gre_dir_acl gt4 egress 0 8 192.0.2.3 "ACL mirror to gretap"
+	full_test_span_gre_dir_acl gt6 egress 0 8 192.0.2.3 "ACL mirror to ip6gretap"
+
+	slow_path_trap_uninstall $swp1 egress
+	slow_path_trap_uninstall $swp1 ingress
+}
+
+trap cleanup EXIT
+
+setup_prepare
+setup_wait
+
+tcflags="skip_hw"
+test_all
+
+if ! tc_offload_check; then
+	echo "WARN: Could not test offloaded functionality"
+else
+	tcflags="skip_sw"
+	test_all
+fi
+
+exit $EXIT_STATUS
-- 
2.4.11

^ permalink raw reply related

* [PATCH net-next 6/7] selftests: forwarding: Test neighbor updates when mirroring to gretap
From: Petr Machata @ 2018-04-26 23:21 UTC (permalink / raw)
  To: netdev; +Cc: davem, linux-mlxsw
In-Reply-To: <cover.1524784510.git.petrm@mellanox.com>

Test that when a mirror to gretap or ip6gretap netdevice is configured,
changes to neighbors are reflected.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
---
 .../selftests/net/forwarding/mirror_gre_neigh.sh   | 101 +++++++++++++++++++++
 1 file changed, 101 insertions(+)
 create mode 100755 tools/testing/selftests/net/forwarding/mirror_gre_neigh.sh

diff --git a/tools/testing/selftests/net/forwarding/mirror_gre_neigh.sh b/tools/testing/selftests/net/forwarding/mirror_gre_neigh.sh
new file mode 100755
index 0000000..1ca29ba
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/mirror_gre_neigh.sh
@@ -0,0 +1,101 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# This test uses standard topology for testing gretap. See
+# mirror_gre_topo_lib.sh for more details.
+#
+# Test for mirroring to gretap and ip6gretap, such that the neighbor entry for
+# the tunnel remote address has invalid address at the time that the mirroring
+# is set up. Later on, the neighbor is deleted and it is expected to be
+# reinitialized using the usual ARP process, and the mirroring offload updated.
+
+NUM_NETIFS=6
+source lib.sh
+source mirror_lib.sh
+source mirror_gre_lib.sh
+source mirror_gre_topo_lib.sh
+
+setup_prepare()
+{
+	h1=${NETIFS[p1]}
+	swp1=${NETIFS[p2]}
+
+	swp2=${NETIFS[p3]}
+	h2=${NETIFS[p4]}
+
+	swp3=${NETIFS[p5]}
+	h3=${NETIFS[p6]}
+
+	vrf_prepare
+	mirror_gre_topo_create
+
+	ip address add dev $swp3 192.0.2.129/28
+	ip address add dev $h3 192.0.2.130/28
+
+	ip address add dev $swp3 2001:db8:2::1/64
+	ip address add dev $h3 2001:db8:2::2/64
+}
+
+cleanup()
+{
+	pre_cleanup
+
+	ip address del dev $h3 2001:db8:2::2/64
+	ip address del dev $swp3 2001:db8:2::1/64
+
+	ip address del dev $h3 192.0.2.130/28
+	ip address del dev $swp3 192.0.2.129/28
+
+	mirror_gre_topo_destroy
+	vrf_cleanup
+}
+
+test_span_gre_neigh()
+{
+	local addr=$1; shift
+	local tundev=$1; shift
+	local direction=$1; shift
+	local what=$1; shift
+
+	RET=0
+
+	ip neigh replace dev $swp3 $addr lladdr 00:11:22:33:44:55
+	mirror_install $swp1 $direction $tundev "matchall $tcflags"
+	fail_test_span_gre_dir $tundev ingress
+	ip neigh del dev $swp3 $addr
+	quick_test_span_gre_dir $tundev ingress
+	mirror_uninstall $swp1 $direction
+
+	log_test "$direction $what: neighbor change ($tcflags)"
+}
+
+test_all()
+{
+	slow_path_trap_install $swp1 ingress
+	slow_path_trap_install $swp1 egress
+
+	test_span_gre_neigh 192.0.2.130 gt4 ingress "mirror to gretap"
+	test_span_gre_neigh 192.0.2.130 gt4 egress "mirror to gretap"
+	test_span_gre_neigh 2001:db8:2::2 gt6 ingress "mirror to ip6gretap"
+	test_span_gre_neigh 2001:db8:2::2 gt6 egress "mirror to ip6gretap"
+
+	slow_path_trap_uninstall $swp1 egress
+	slow_path_trap_uninstall $swp1 ingress
+}
+
+trap cleanup EXIT
+
+setup_prepare
+setup_wait
+
+tcflags="skip_hw"
+test_all
+
+if ! tc_offload_check; then
+	echo "WARN: Could not test offloaded functionality"
+else
+	tcflags="skip_sw"
+	test_all
+fi
+
+exit $EXIT_STATUS
-- 
2.4.11

^ permalink raw reply related

* [PATCH net-next 7/7] selftests: forwarding: Test changes in mirror-to-gretap
From: Petr Machata @ 2018-04-26 23:22 UTC (permalink / raw)
  To: netdev; +Cc: davem, linux-mlxsw
In-Reply-To: <cover.1524784510.git.petrm@mellanox.com>

These tests set up mirroring in a situation that the configuration is
incorrect, i.e. mirrored packets, if any, are not supposed to reach
destination tunnel device. Then the configuration is rectified and
mirroring is checked to have started working.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
---
 .../selftests/net/forwarding/mirror_gre_changes.sh | 194 +++++++++++++++++++++
 1 file changed, 194 insertions(+)
 create mode 100755 tools/testing/selftests/net/forwarding/mirror_gre_changes.sh

diff --git a/tools/testing/selftests/net/forwarding/mirror_gre_changes.sh b/tools/testing/selftests/net/forwarding/mirror_gre_changes.sh
new file mode 100755
index 0000000..0ed288a
--- /dev/null
+++ b/tools/testing/selftests/net/forwarding/mirror_gre_changes.sh
@@ -0,0 +1,194 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+
+# This test uses standard topology for testing gretap. See
+# mirror_gre_topo_lib.sh for more details.
+#
+# Test how mirrors to gretap and ip6gretap react to changes to relevant
+# configuration.
+
+NUM_NETIFS=6
+source lib.sh
+source mirror_lib.sh
+source mirror_gre_lib.sh
+source mirror_gre_topo_lib.sh
+
+setup_prepare()
+{
+	h1=${NETIFS[p1]}
+	swp1=${NETIFS[p2]}
+
+	swp2=${NETIFS[p3]}
+	h2=${NETIFS[p4]}
+
+	swp3=${NETIFS[p5]}
+	h3=${NETIFS[p6]}
+
+	vrf_prepare
+	mirror_gre_topo_create
+
+	# This test downs $swp3, which deletes the configured IPv6 address
+	# unless this sysctl is set.
+	local key=net.ipv6.conf.$swp3.keep_addr_on_down
+	SWP3_KEEP_ADDR_ON_DOWN=$(sysctl -n $key)
+	sysctl -qw $key=1
+
+	ip address add dev $swp3 192.0.2.129/28
+	ip address add dev $h3 192.0.2.130/28
+
+	ip address add dev $swp3 2001:db8:2::1/64
+	ip address add dev $h3 2001:db8:2::2/64
+}
+
+cleanup()
+{
+	pre_cleanup
+
+	ip address del dev $h3 2001:db8:2::2/64
+	ip address del dev $swp3 2001:db8:2::1/64
+
+	ip address del dev $h3 192.0.2.130/28
+	ip address del dev $swp3 192.0.2.129/28
+
+	local key=net.ipv6.conf.$swp3.keep_addr_on_down
+	sysctl -qw $key=$SWP3_KEEP_ADDR_ON_DOWN
+
+	mirror_gre_topo_destroy
+	vrf_cleanup
+}
+
+test_span_gre_ttl()
+{
+	local tundev=$1; shift
+	local type=$1; shift
+	local prot=$1; shift
+	local what=$1; shift
+
+	RET=0
+
+	mirror_install $swp1 ingress $tundev "matchall $tcflags"
+	tc qdisc add dev $h3 clsact
+	tc filter add dev $h3 ingress pref 77 prot $prot \
+		flower ip_ttl 50 action pass
+
+	mirror_test v$h1 192.0.2.1 192.0.2.2 $h3 77 0
+
+	ip link set dev $tundev type $type ttl 50
+	mirror_test v$h1 192.0.2.1 192.0.2.2 $h3 77 10
+
+	ip link set dev $tundev type $type ttl 100
+	tc filter del dev $h3 ingress pref 77
+	tc qdisc del dev $h3 clsact
+	mirror_uninstall $swp1 ingress
+
+	log_test "$what: TTL change ($tcflags)"
+}
+
+test_span_gre_tun_up()
+{
+	local tundev=$1; shift
+	local what=$1; shift
+
+	RET=0
+
+	ip link set dev $tundev down
+	mirror_install $swp1 ingress $tundev "matchall $tcflags"
+	fail_test_span_gre_dir $tundev ingress
+
+	ip link set dev $tundev up
+
+	quick_test_span_gre_dir $tundev ingress
+	mirror_uninstall $swp1 ingress
+
+	log_test "$what: tunnel down/up ($tcflags)"
+}
+
+test_span_gre_egress_up()
+{
+	local tundev=$1; shift
+	local remote_ip=$1; shift
+	local what=$1; shift
+
+	RET=0
+
+	ip link set dev $swp3 down
+	mirror_install $swp1 ingress $tundev "matchall $tcflags"
+	fail_test_span_gre_dir $tundev ingress
+
+	# After setting the device up, wait for neighbor to get resolved so that
+	# we can expect mirroring to work.
+	ip link set dev $swp3 up
+	while true; do
+		ip neigh sh dev $swp3 $remote_ip nud reachable |
+		    grep -q ^
+		if [[ $? -ne 0 ]]; then
+			sleep 1
+		else
+			break
+		fi
+	done
+
+	quick_test_span_gre_dir $tundev ingress
+	mirror_uninstall $swp1 ingress
+
+	log_test "$what: egress down/up ($tcflags)"
+}
+
+test_span_gre_remote_ip()
+{
+	local tundev=$1; shift
+	local type=$1; shift
+	local correct_ip=$1; shift
+	local wrong_ip=$1; shift
+	local what=$1; shift
+
+	RET=0
+
+	ip link set dev $tundev type $type remote $wrong_ip
+	mirror_install $swp1 ingress $tundev "matchall $tcflags"
+	fail_test_span_gre_dir $tundev ingress
+
+	ip link set dev $tundev type $type remote $correct_ip
+	quick_test_span_gre_dir $tundev ingress
+	mirror_uninstall $swp1 ingress
+
+	log_test "$what: remote address change ($tcflags)"
+}
+
+test_all()
+{
+	slow_path_trap_install $swp1 ingress
+	slow_path_trap_install $swp1 egress
+
+	test_span_gre_ttl gt4 gretap ip "mirror to gretap"
+	test_span_gre_ttl gt6 ip6gretap ipv6 "mirror to ip6gretap"
+
+	test_span_gre_tun_up gt4 "mirror to gretap"
+	test_span_gre_tun_up gt6 "mirror to ip6gretap"
+
+	test_span_gre_egress_up gt4 192.0.2.130 "mirror to gretap"
+	test_span_gre_egress_up gt6 2001:db8:2::2 "mirror to ip6gretap"
+
+	test_span_gre_remote_ip gt4 gretap 192.0.2.130 192.0.2.132 "mirror to gretap"
+	test_span_gre_remote_ip gt6 ip6gretap 2001:db8:2::2 2001:db8:2::4 "mirror to ip6gretap"
+
+	slow_path_trap_uninstall $swp1 egress
+	slow_path_trap_uninstall $swp1 ingress
+}
+
+trap cleanup EXIT
+
+setup_prepare
+setup_wait
+
+tcflags="skip_hw"
+test_all
+
+if ! tc_offload_check; then
+	echo "WARN: Could not test offloaded functionality"
+else
+	tcflags="skip_sw"
+	test_all
+fi
+
+exit $EXIT_STATUS
-- 
2.4.11

^ permalink raw reply related

* Re: [PATCH] net/mlx5: report persistent netdev stats across ifdown/ifup commands
From: Saeed Mahameed @ 2018-04-26 23:30 UTC (permalink / raw)
  To: Qing Huang
  Cc: Eran Ben Elisha, linux-kernel, RDMA mailing list,
	Linux Netdev List, Leon Romanovsky, Matan Barak, Saeed Mahameed
In-Reply-To: <5cd29779-fc70-930a-7060-ee3332c15558@oracle.com>

On Thu, Apr 26, 2018 at 3:37 PM, Qing Huang <qing.huang@oracle.com> wrote:
>
>
> On 04/26/2018 02:50 PM, Saeed Mahameed wrote:
>>
>> On Thu, Apr 26, 2018 at 1:37 PM, Qing Huang <qing.huang@oracle.com> wrote:
>>>
>>> Current stats collecting scheme in mlx5 driver is to periodically fetch
>>> aggregated stats from all the active mlx5 software channels associated
>>> with the device. However when a mlx5 interface is brought down(ifdown),
>>> all the channels will be deactivated and closed. A new set of channels
>>> will be created when next ifup command or a similar command is called.
>>> Unfortunately the new channels will have all stats reset to 0. So you
>>> lose the accumulated stats information. This behavior is different from
>>> other netdev drivers including the mlx4 driver. In order to fix it, we
>>> now save prior mlx5 software stats into netdev stats fields, so all the
>>> accumulated stats will survive multiple runs of ifdown/ifup commands and
>>> be shown correctly.
>>>
>>> Orabug: 27548610

I don't think we need this internal bug reference, let's remove it.

>>>
>>> Signed-off-by: Qing Huang <qing.huang@oracle.com>
>>> ---
>>
>> Hi Qing,
>>
>> I am adding Eran since he is currently working on a similar patch,
>> He is also taking care of all cores/rings stats to make them
>> persistent, so you won't have discrepancy  between
>> ethtool and ifconfig stats.
>>
>> I am ok with this patch, but this means Eran has to work his way around
>> it.
>>
>> so we have two options:
>>
>> 1. Temporary accept this patch, and change it later with Eran's work.
>> 2. Wait for Eran's work.
>>
>> I am ok with either one of them, please let me know.
>>
>> Thanks !
>
> Hi Saeed,
>
> Any idea on rough ETA of Eran's stats work to be in upstream? If it will be

All i know that it is planned for v4.18.

> available soon, I think
> we can wait a bit. If it will take a while to redesign the whole stats
> scheme (for both ethtool and ifconfig),
> maybe we can go with this incremental fix first?
>

Yes we can go with incremental fix,
but let's fix the commit message as requested above and make the patch
title prefix "net/mlx5e:"
since this is a mlx5 netdev related change, not mlx5/core.
also please see a comments below.

Thanks,
Saeed.

> Thanks!
>
>
>>
>>>   drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 30
>>> +++++++++++++++++++----
>>>   1 file changed, 25 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>> b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>> index f1fe490..5d50e69 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>> @@ -2621,6 +2621,23 @@ static void mlx5e_netdev_set_tcs(struct net_device
>>> *netdev)
>>>                  netdev_set_tc_queue(netdev, tc, nch, 0);
>>>   }
>>>
>>> +static void mlx5e_netdev_save_stats(struct mlx5e_priv *priv)
>>> +{
>>> +       struct net_device *netdev = priv->netdev;
>>> +
>>> +       netdev->stats.rx_packets += priv->stats.sw.rx_packets;
>>> +       netdev->stats.rx_bytes   += priv->stats.sw.rx_bytes;
>>> +       netdev->stats.tx_packets += priv->stats.sw.tx_packets;
>>> +       netdev->stats.tx_bytes   += priv->stats.sw.tx_bytes;
>>> +       netdev->stats.tx_dropped += priv->stats.sw.tx_queue_dropped;
>>> +
>>> +       priv->stats.sw.rx_packets       = 0;
>>> +       priv->stats.sw.rx_bytes         = 0;
>>> +       priv->stats.sw.tx_packets       = 0;
>>> +       priv->stats.sw.tx_bytes         = 0;
>>> +       priv->stats.sw.tx_queue_dropped = 0;

Note: it is safe to do this here since writing to priv->stats.sw is
currently done under priv->state_lock,
Please add a comment to this function that it must be called under
priv->state_lock

>>> +}
>>> +
>>
>> This means that we are now explicitly clearing channels stats on
>> ifconfig down or switch_channels.
>> and now after ifconfing down, ethtool will always show 0, before this
>> patch it didn't.
>> Anyway update sw stats function will always override them with the new
>> channels stats next time we load new channels.
>> so it is not that big of a deal.
>>
>>
>>>   static void mlx5e_build_channels_tx_maps(struct mlx5e_priv *priv)
>>>   {
>>>          struct mlx5e_channel *c;
>>> @@ -2691,6 +2708,7 @@ void mlx5e_switch_priv_channels(struct mlx5e_priv
>>> *priv,
>>>                  netif_set_real_num_tx_queues(netdev, new_num_txqs);
>>>
>>>          mlx5e_deactivate_priv_channels(priv);
>>> +       mlx5e_netdev_save_stats(priv);

This function call doesn't have to be between deactivate and close,
lets move it to after
mlx5e_close_channels here and below to avoid confusion.

also you need to accumulate what is left in the rings/channels stats,
before calling
mlx5e_close_channels, see mlx5e_grp_sw_update_stats.

>>>          mlx5e_close_channels(&priv->channels);
>>>
>>>          priv->channels = *new_chs;
>>> @@ -2770,6 +2788,7 @@ int mlx5e_close_locked(struct net_device *netdev)
>>>
>>>          netif_carrier_off(priv->netdev);
>>>          mlx5e_deactivate_priv_channels(priv);
>>> +       mlx5e_netdev_save_stats(priv);
>>>          mlx5e_close_channels(&priv->channels);
>>>
>>>          return 0;
>>> @@ -3215,11 +3234,12 @@ static int mlx5e_setup_tc(struct net_device *dev,
>>> enum tc_setup_type type,
>>>                  stats->tx_packets = PPORT_802_3_GET(pstats,
>>> a_frames_transmitted_ok);
>>>                  stats->tx_bytes   = PPORT_802_3_GET(pstats,
>>> a_octets_transmitted_ok);
>>>          } else {
>>> -               stats->rx_packets = sstats->rx_packets;
>>> -               stats->rx_bytes   = sstats->rx_bytes;
>>> -               stats->tx_packets = sstats->tx_packets;
>>> -               stats->tx_bytes   = sstats->tx_bytes;
>>> -               stats->tx_dropped = sstats->tx_queue_dropped;
>>> +               stats->rx_packets = sstats->rx_packets +
>>> dev->stats.rx_packets;
>>> +               stats->rx_bytes   = sstats->rx_bytes +
>>> dev->stats.rx_bytes;
>>> +               stats->tx_packets = sstats->tx_packets +
>>> dev->stats.tx_packets;
>>> +               stats->tx_bytes   = sstats->tx_bytes +
>>> dev->stats.tx_bytes;
>>> +               stats->tx_dropped = sstats->tx_queue_dropped +
>>> +                                   dev->stats.tx_dropped;
>>>          }
>>>
>>>          stats->rx_dropped = priv->stats.qcnt.rx_out_of_buffer;
>>> --
>>> 1.8.3.1
>>>
>

^ permalink raw reply

* Re: [PATCH v7 net-next 4/4] netvsc: refactor notifier/event handling code to use the failover framework
From: Michael S. Tsirkin @ 2018-04-26 23:42 UTC (permalink / raw)
  To: Siwei Liu
  Cc: Stephen Hemminger, Jiri Pirko, Sridhar Samudrala, David Miller,
	Netdev, virtualization, virtio-dev, Brandeburg, Jesse,
	Alexander Duyck, Jakub Kicinski, Jason Wang
In-Reply-To: <CADGSJ22D=NTWtjVeLKsgvcesy0b_ugzb9XGHGsWOktUEFhYN_Q@mail.gmail.com>

On Thu, Apr 26, 2018 at 03:14:46PM -0700, Siwei Liu wrote:
> On Wed, Apr 25, 2018 at 7:28 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Wed, Apr 25, 2018 at 03:57:57PM -0700, Siwei Liu wrote:
> >> On Wed, Apr 25, 2018 at 3:22 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> > On Wed, Apr 25, 2018 at 02:38:57PM -0700, Siwei Liu wrote:
> >> >> On Mon, Apr 23, 2018 at 1:06 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> >> > On Mon, Apr 23, 2018 at 12:44:39PM -0700, Siwei Liu wrote:
> >> >> >> On Mon, Apr 23, 2018 at 10:56 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >> >> >> > On Mon, Apr 23, 2018 at 10:44:40AM -0700, Stephen Hemminger wrote:
> >> >> >> >> On Mon, 23 Apr 2018 20:24:56 +0300
> >> >> >> >> "Michael S. Tsirkin" <mst@redhat.com> wrote:
> >> >> >> >>
> >> >> >> >> > On Mon, Apr 23, 2018 at 10:04:06AM -0700, Stephen Hemminger wrote:
> >> >> >> >> > > > >
> >> >> >> >> > > > >I will NAK patches to change to common code for netvsc especially the
> >> >> >> >> > > > >three device model.  MS worked hard with distro vendors to support transparent
> >> >> >> >> > > > >mode, ans we really can't have a new model; or do backport.
> >> >> >> >> > > > >
> >> >> >> >> > > > >Plus, DPDK is now dependent on existing model.
> >> >> >> >> > > >
> >> >> >> >> > > > Sorry, but nobody here cares about dpdk or other similar oddities.
> >> >> >> >> > >
> >> >> >> >> > > The network device model is a userspace API, and DPDK is a userspace application.
> >> >> >> >> >
> >> >> >> >> > It is userspace but are you sure dpdk is actually poking at netdevs?
> >> >> >> >> > AFAIK it's normally banging device registers directly.
> >> >> >> >> >
> >> >> >> >> > > You can't go breaking userspace even if you don't like the application.
> >> >> >> >> >
> >> >> >> >> > Could you please explain how is the proposed patchset breaking
> >> >> >> >> > userspace? Ignoring DPDK for now, I don't think it changes the userspace
> >> >> >> >> > API at all.
> >> >> >> >> >
> >> >> >> >>
> >> >> >> >> The DPDK has a device driver vdev_netvsc which scans the Linux network devices
> >> >> >> >> to look for Linux netvsc device and the paired VF device and setup the
> >> >> >> >> DPDK environment.  This setup creates a DPDK failsafe (bondingish) instance
> >> >> >> >> and sets up TAP support over the Linux netvsc device as well as the Mellanox
> >> >> >> >> VF device.
> >> >> >> >>
> >> >> >> >> So it depends on existing 2 device model. You can't go to a 3 device model
> >> >> >> >> or start hiding devices from userspace.
> >> >> >> >
> >> >> >> > Okay so how does the existing patch break that? IIUC does not go to
> >> >> >> > a 3 device model since netvsc calls failover_register directly.
> >> >> >> >
> >> >> >> >> Also, I am working on associating netvsc and VF device based on serial number
> >> >> >> >> rather than MAC address. The serial number is how Windows works now, and it makes
> >> >> >> >> sense for Linux and Windows to use the same mechanism if possible.
> >> >> >> >
> >> >> >> > Maybe we should support same for virtio ...
> >> >> >> > Which serial do you mean? From vpd?
> >> >> >> >
> >> >> >> > I guess you will want to keep supporting MAC for old hypervisors?
> >> >> >> >
> >> >> >> > It all seems like a reasonable thing to support in the generic core.
> >> >> >>
> >> >> >> That's the reason why I chose explicit identifier rather than rely on
> >> >> >> MAC address to bind/pair a device. MAC address can change. Even if it
> >> >> >> can't, malicious guest user can fake MAC address to skip binding.
> >> >> >>
> >> >> >> -Siwei
> >> >> >
> >> >> > Address should be sampled at device creation to prevent this
> >> >> > kind of hack. Not that it buys the malicious user much:
> >> >> > if you can poke at MAC addresses you probably already can
> >> >> > break networking.
> >> >>
> >> >> I don't understand why poking at MAC address may potentially break
> >> >> networking.
> >> >
> >> > Set a MAC address to match another device on the same LAN,
> >> > packets will stop reaching that MAC.
> >>
> >> What I meant was guest users may create a virtual link, say veth that
> >> has exactly the same MAC address as that for the VF, which can easily
> >> get around of the binding procedure.
> >
> > This patchset limits binding to PCI devices so it won't be affected
> > by any hacks around virtual devices.
> 
> Wait, I vaguely recall you seemed to like to generalize this feature
> to non-PCI device. 

It's purely a layering thing.  It is cleaner not to have PCI specific
data in the device-specific transport-independent section of the virtio
spec.


> But now you're saying it should stick to PCI. It's
> not that I'm reluctant with sticking to PCI. The fact is that I don't
> think we can go with implementation until the semantics of the
> so-called _F_STANDBY feature can be clearly defined into the spec.
> Previously the boundary of using MAC address as the identifier for
> bonding was quite confusing to me. And now PCI adds to the matrix.

PCI is simply one way to exclude software NICs. It's not the most
elegant one, but it will cover many setups.  We can add more types, but
we do want to exclude software devices since these have
not been supplied by the hypervisor.

> However it still does not gurantee uniqueness I think. It's almost
> incorrect of choosing MAC address as the ID in the beginning since
> that has the implication of breaking existing configs.

IMO there's no chance it will break any existing config since
no existing config sets _F_STANDBY.

> I don't think
> libvirt or QEMU today retricts the MAC address to be unique per VM
> instance. Neither the virtio spec mentions that.

You really don't have to.

> In addition, it's difficult to fake PCI device on Linux does not mean
> the same applies to other OSes that is going to implement this VirtIO
> feature. It's a fragile assumption IMHO.

What an OS does internally is its own business.

What we are telling the guest here is simply that the virtio NIC is
actually the same device as some other NIC. At this point we do not
specify this other NIC in any way. So how do you find it?  Well it has
to have the same MAC clearly.

You point out that there could be multiple NICs with the same
MAC in theory. It's a broken config generally but since it
kind of works in some setups maybe it's worth supporting.
If so we can look for ways to make the matching more specific by e.g.
adding more flags but I see that as a separate issue,
and pretty narrow in scope.

> >
> >> There's no explicit flag to
> >> identify a VF or pass-through device AFAIK. And sometimes this happens
> >> maybe due to user misconfiguring the link. This process should be
> >> hardened to avoid from any potential configuration errors.
> >
> > They are still PCI devices though.
> >
> >> >
> >> >> Unlike VF, passthrough PCI endpoint device has its freedom
> >> >> to change the MAC address. Even on a VF setup it's not neccessarily
> >> >> always safe to assume the VF's MAC address cannot or shouldn't be
> >> >> changed. That depends on the specific need whether the host admin
> >> >> wants to restrict guest from changing the MAC address, although in
> >> >> most cases it's true.
> >> >>
> >> >> I understand we can use the perm_addr to distinguish. But as said,
> >> >> this will pose limitation of flexible configuration where one can
> >> >> assign VFs with identical MAC address at all while each VF belongs to
> >> >> different PF and/or different subnet for e.g. load balancing.
> >> >> And
> >> >> furthermore, the QEMU device model never uses MAC address to be
> >> >> interpreted as an identifier, which requires to be unique per VM
> >> >> instance. Why we're introducing this inconsistency?
> >> >>
> >> >> -Siwei
> >> >
> >> > Because it addresses most of the issues and is simple.  That's already
> >> > much better than what we have now which is nothing unless guest
> >> > configures things manually.
> >>
> >> Did you see my QEMU patch for using BDF as the grouping identifier?
> >
> > Yes. And I don't think it can work because bus numbers are
> > guest specified.
> 
> I know it's not ideal but perhaps its the best one can do in the KVM
> world without adding complex config e.g. PCI bridge.

KVM is just a VMX/SVM driver. I think you mean QEMU.  And well -
"best one can do" is a high bar to clear.


> Even if bus
> number is guest specified, it's readily available in the guest and
> recognizable by any OS, while on the QEMU configuration users specify
> an id instead of the bus number. Unlike Hyper-V PCI bus, I don't think
> there exists a para-virtual PCI bus in QEMU backend to expose VPD
> capability to a passthrough device.

We can always add more interfaces if we need them.  But let's be clear
that we are adding an interface and what are we trying to fix by doing
it. Let's not mix it as part of the failover discussion.

> >
> >> And there can be others like what you suggested, but the point is that
> >> it's requried to support explicit grouping mechanism from day one,
> >> before the backup property cast into stones.
> >
> > Let's start with addressing simple configs with just two NICs.
> >
> > Down the road I can see possible extensions that can work: for example,
> > require that devices are on the same pci bridge. Or we could even make
> > the virtio device actually include a pci bridge (as part of same
> > or a child function), the PT would have to be
> > behind it.
> >
> > As long as we are not breaking anything, adding more flags to fix
> > non-working configurations is always fair game.
> 
> While it may work, the PCI bridge has NUMA and IOMMU implications that
> would restrict the current flexibility to group devices.

It's interesting you should mention that.

If you want to be flexible in placing the primary device WRT NUMA and
IOMMU, and given that both IOMMU and NUMA are keyed by the bus address,
then doesn't this completely break the idea of passing
the bus address to the guest?

> I'm not sure
> if vIOMMU would have to be introduced inadvertently for
> isolation/protection of devices under the PCI bridge which may cause
> negative performance impact on the VF.

No idea how do you introduce an IOMMU inadvertently.

> >
> >> This is orthogonal to
> >> device model being proposed, be it 1-netdev or not. Delaying it would
> >> just mean support and compatibility burden, appearing more like a
> >> design flaw rather than a feature to add later on.
> >
> > Well it's mostly myself who gets to support it, and I see the device
> > model as much more fundamental as userspace will come to depend
> > on it. So I'm not too worried, let's take this one step at a time.
> >
> >> >
> >> > I think ideally the infrastructure should suppport flexible matching of
> >> > NICs - netvsc is already reported to be moving to some kind of serial
> >> > address.
> >> >
> >> As Stephen said, Hyper-V supports the serial UUID thing from day-one.
> >> It's just the Linux netvsc guest driver itself does not leverage that
> >> ID from the very beginging.
> >>
> >> Regards,
> >> -Siwei
> >
> > We could add something like this, too. For example,
> > we could add a virtual VPD capability with a UUID.
> 
> I'm not an expert on that and wonder how you could do this (add a
> virtual VPD capability with a UUID to passthrough device) with
> existing QEMU emulation model and native PCI bus.


I think I see an elegant way to do that.

You could put it in the port where you want to stick you PT device.

Here's how it could work then:


- standby virtio device is tied to a pci bridge.

  Tied how? Well it could be 
  - behind this bridge
  - include a bridge internally
  - have the bridge as a PCI function
  - include a bridge and the bridge as a PCI function
  - have a VPD or serial capability with same UUID as the bridge

- primary passthrough device is placed behind a bridge
  *with the same ID*

	- either simply behind the same bridge
	- or behind another bridge with the same UUID.


The treatment could also be limited just to bridges which have a
specific vendor/device id (maybe a good idea), or in any other arbitrary
way.




> >
> > Do you know how exactly does hyperv pass the UUID for NICs?
> 
> Stephen might know it more and can correct me. But my personal
> interpretation is that the SN is a host generated 32 bit sequence
> number which is unique per VM instance and gets propogated to guest
> via the para-virtual Hyper-V PCI bus.
> 
> Regards,
> -Siwei

Ah, so it's a Hyper-V thing.




> >
> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >
> >> >> >>
> >> >> >> >
> >> >> >> > --
> >> >> >> > MST

^ permalink raw reply

* Re: [PATCH] net/mlx5: report persistent netdev stats across ifdown/ifup commands
From: Saeed Mahameed @ 2018-04-26 23:43 UTC (permalink / raw)
  To: Qing Huang
  Cc: Eran Ben Elisha, linux-kernel, RDMA mailing list,
	Linux Netdev List, Leon Romanovsky, Matan Barak, Saeed Mahameed
In-Reply-To: <CALzJLG8atGxoJNQj4F6wR3kRg7jrj4oQi3O-1i0=UOo+3HPq0w@mail.gmail.com>

On Thu, Apr 26, 2018 at 4:30 PM, Saeed Mahameed
<saeedm@dev.mellanox.co.il> wrote:
> On Thu, Apr 26, 2018 at 3:37 PM, Qing Huang <qing.huang@oracle.com> wrote:
>>
>>
>> On 04/26/2018 02:50 PM, Saeed Mahameed wrote:
>>>
>>> On Thu, Apr 26, 2018 at 1:37 PM, Qing Huang <qing.huang@oracle.com> wrote:
>>>>
>>>> Current stats collecting scheme in mlx5 driver is to periodically fetch
>>>> aggregated stats from all the active mlx5 software channels associated
>>>> with the device. However when a mlx5 interface is brought down(ifdown),
>>>> all the channels will be deactivated and closed. A new set of channels
>>>> will be created when next ifup command or a similar command is called.
>>>> Unfortunately the new channels will have all stats reset to 0. So you
>>>> lose the accumulated stats information. This behavior is different from
>>>> other netdev drivers including the mlx4 driver. In order to fix it, we
>>>> now save prior mlx5 software stats into netdev stats fields, so all the
>>>> accumulated stats will survive multiple runs of ifdown/ifup commands and
>>>> be shown correctly.
>>>>
>>>> Orabug: 27548610
>
> I don't think we need this internal bug reference, let's remove it.
>
>>>>
>>>> Signed-off-by: Qing Huang <qing.huang@oracle.com>
>>>> ---
>>>
>>> Hi Qing,
>>>
>>> I am adding Eran since he is currently working on a similar patch,
>>> He is also taking care of all cores/rings stats to make them
>>> persistent, so you won't have discrepancy  between
>>> ethtool and ifconfig stats.
>>>
>>> I am ok with this patch, but this means Eran has to work his way around
>>> it.
>>>
>>> so we have two options:
>>>
>>> 1. Temporary accept this patch, and change it later with Eran's work.
>>> 2. Wait for Eran's work.
>>>
>>> I am ok with either one of them, please let me know.
>>>
>>> Thanks !
>>
>> Hi Saeed,
>>
>> Any idea on rough ETA of Eran's stats work to be in upstream? If it will be
>
> All i know that it is planned for v4.18.
>
>> available soon, I think
>> we can wait a bit. If it will take a while to redesign the whole stats
>> scheme (for both ethtool and ifconfig),
>> maybe we can go with this incremental fix first?
>>

Qing,

Before you address my comments, it looks like Eran's work is
converging and we will finalize the internal
review next week, in which case submission will take 2-3 weeks from today.

So i suggest to wait.

please checkout:
https://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux.git/commit/?h=testing/mlx5e-stats&id=15ffb7f87432d073e8ac0e6b895188d40fdda4d4

>
> Yes we can go with incremental fix,
> but let's fix the commit message as requested above and make the patch
> title prefix "net/mlx5e:"
> since this is a mlx5 netdev related change, not mlx5/core.
> also please see a comments below.
>
> Thanks,
> Saeed.
>
>> Thanks!
>>
>>
>>>
>>>>   drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 30
>>>> +++++++++++++++++++----
>>>>   1 file changed, 25 insertions(+), 5 deletions(-)
>>>>
>>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>>> b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>>> index f1fe490..5d50e69 100644
>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>>> @@ -2621,6 +2621,23 @@ static void mlx5e_netdev_set_tcs(struct net_device
>>>> *netdev)
>>>>                  netdev_set_tc_queue(netdev, tc, nch, 0);
>>>>   }
>>>>
>>>> +static void mlx5e_netdev_save_stats(struct mlx5e_priv *priv)
>>>> +{
>>>> +       struct net_device *netdev = priv->netdev;
>>>> +
>>>> +       netdev->stats.rx_packets += priv->stats.sw.rx_packets;
>>>> +       netdev->stats.rx_bytes   += priv->stats.sw.rx_bytes;
>>>> +       netdev->stats.tx_packets += priv->stats.sw.tx_packets;
>>>> +       netdev->stats.tx_bytes   += priv->stats.sw.tx_bytes;
>>>> +       netdev->stats.tx_dropped += priv->stats.sw.tx_queue_dropped;
>>>> +
>>>> +       priv->stats.sw.rx_packets       = 0;
>>>> +       priv->stats.sw.rx_bytes         = 0;
>>>> +       priv->stats.sw.tx_packets       = 0;
>>>> +       priv->stats.sw.tx_bytes         = 0;
>>>> +       priv->stats.sw.tx_queue_dropped = 0;
>
> Note: it is safe to do this here since writing to priv->stats.sw is
> currently done under priv->state_lock,
> Please add a comment to this function that it must be called under
> priv->state_lock
>
>>>> +}
>>>> +
>>>
>>> This means that we are now explicitly clearing channels stats on
>>> ifconfig down or switch_channels.
>>> and now after ifconfing down, ethtool will always show 0, before this
>>> patch it didn't.
>>> Anyway update sw stats function will always override them with the new
>>> channels stats next time we load new channels.
>>> so it is not that big of a deal.
>>>
>>>
>>>>   static void mlx5e_build_channels_tx_maps(struct mlx5e_priv *priv)
>>>>   {
>>>>          struct mlx5e_channel *c;
>>>> @@ -2691,6 +2708,7 @@ void mlx5e_switch_priv_channels(struct mlx5e_priv
>>>> *priv,
>>>>                  netif_set_real_num_tx_queues(netdev, new_num_txqs);
>>>>
>>>>          mlx5e_deactivate_priv_channels(priv);
>>>> +       mlx5e_netdev_save_stats(priv);
>
> This function call doesn't have to be between deactivate and close,
> lets move it to after
> mlx5e_close_channels here and below to avoid confusion.
>
> also you need to accumulate what is left in the rings/channels stats,
> before calling
> mlx5e_close_channels, see mlx5e_grp_sw_update_stats.
>
>>>>          mlx5e_close_channels(&priv->channels);
>>>>
>>>>          priv->channels = *new_chs;
>>>> @@ -2770,6 +2788,7 @@ int mlx5e_close_locked(struct net_device *netdev)
>>>>
>>>>          netif_carrier_off(priv->netdev);
>>>>          mlx5e_deactivate_priv_channels(priv);
>>>> +       mlx5e_netdev_save_stats(priv);
>>>>          mlx5e_close_channels(&priv->channels);
>>>>
>>>>          return 0;
>>>> @@ -3215,11 +3234,12 @@ static int mlx5e_setup_tc(struct net_device *dev,
>>>> enum tc_setup_type type,
>>>>                  stats->tx_packets = PPORT_802_3_GET(pstats,
>>>> a_frames_transmitted_ok);
>>>>                  stats->tx_bytes   = PPORT_802_3_GET(pstats,
>>>> a_octets_transmitted_ok);
>>>>          } else {
>>>> -               stats->rx_packets = sstats->rx_packets;
>>>> -               stats->rx_bytes   = sstats->rx_bytes;
>>>> -               stats->tx_packets = sstats->tx_packets;
>>>> -               stats->tx_bytes   = sstats->tx_bytes;
>>>> -               stats->tx_dropped = sstats->tx_queue_dropped;
>>>> +               stats->rx_packets = sstats->rx_packets +
>>>> dev->stats.rx_packets;
>>>> +               stats->rx_bytes   = sstats->rx_bytes +
>>>> dev->stats.rx_bytes;
>>>> +               stats->tx_packets = sstats->tx_packets +
>>>> dev->stats.tx_packets;
>>>> +               stats->tx_bytes   = sstats->tx_bytes +
>>>> dev->stats.tx_bytes;
>>>> +               stats->tx_dropped = sstats->tx_queue_dropped +
>>>> +                                   dev->stats.tx_dropped;
>>>>          }
>>>>
>>>>          stats->rx_dropped = priv->stats.qcnt.rx_out_of_buffer;
>>>> --
>>>> 1.8.3.1
>>>>
>>

^ permalink raw reply

* [RFC PATCH] MAINTAINERS: add davem in NETWORKING DRIVERS
From: Vivien Didelot @ 2018-04-26 23:47 UTC (permalink / raw)
  To: David S. Miller, netdev, linux-kernel; +Cc: Joe Perches, Vivien Didelot

"./scripts/get_maintainer.pl -f" does not actually show us David as the
maintainer of drivers/net directories such as team, bonding, phy or dsa.
Adding him in an M: entry of NETWORKING DRIVERS fixes this.

Signed-off-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index a52800867850..21038e0dadc2 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -9732,6 +9732,7 @@ W:	https://fedorahosted.org/dropwatch/
 F:	net/core/drop_monitor.c
 
 NETWORKING DRIVERS
+M:	"David S. Miller" <davem@davemloft.net>
 L:	netdev@vger.kernel.org
 W:	http://www.linuxfoundation.org/en/Net
 Q:	http://patchwork.ozlabs.org/project/netdev/list/
-- 
2.17.0

^ permalink raw reply related

* [PATCH net-next v9 0/4] Prerequisites for Cavium OCTEON-III network driver.
From: Steven J. Hill @ 2018-04-26 23:30 UTC (permalink / raw)
  To: netdev

We want to add the Cavium OCTEON-III network driver.  But since
interacting with the input and output queues is done via special CPU
local memory, we also need to add support to the MIPS/Octeon
architecture code.  Aren't SoCs nice in this way?  These are the
prerequisite patches that are needed before the network driver can be
merged.

Changes in v9:

o Rebased to 'net-next'
o Revert code removal in kernel-entry-init.h
o Move CAVIUM_OCTEON_SCRATCH_OFFSET into tlbex.c

Changes in v8:

o Rebased to v4.16-rc2

Changes in v7:

o Splitting of the patch set only.  These 4 mips patches are unchanged
  from the previous posting.

Changes in v6:

o Added back cleanup patch for previous generation SoC "staging"
  driver, as Greg K-H acked it.

o Moved FPA driver to drivers/net/ethernet/cavium/octeon as it is
  currently only used by the octeon3-ethernet driver.

o Many code formatting fixes as noted by davem.

Changes in v5:

o Removed cleanup patch for previous generation SoC "staging" driver,
  as it will be sent as a follow-on.

o Fixed kernel doc formatting in all patches.

o Removed redundant licensing text boilerplate.

o Reviewed-by: header added to 2/7.

o Rewrote locking code in 3/7 to eliminate inline asm.

Changes in v4:

o Use phy_print_status() instead of open coding the equivalent.

o Print warning on phy mode mismatch.

o Improve dt-bindings and add Acked-by.

Changes in v3:

o Fix PKI (RX path) initialization to work with little endian kernel.

Changes in v2:

o Cleanup and use of standard bindings in the device tree bindings
  document.

o Added (hopefully) clarifying comments about several OCTEON
  architectural peculiarities.

o Removed unused testing code from the driver.

o Removed some module parameters that already default to the proper
  values.

o KConfig cleanup, including testing on x86_64, arm64 and mips.

o Fixed breakage to the driver for previous generation of OCTEON SoCs
  (in the staging directory still).

o Verified bisectability of the patch set.

Carlos Munoz (2):
  MIPS: Octeon: Enable LMTDMA/LMTST operations.
  MIPS: Octeon: Add a global resource manager.

David Daney (2):
  MIPS: Octeon: Automatically provision CVMSEG space.
  staging: octeon: Remove USE_ASYNC_IOBDMA macro.

 arch/mips/cavium-octeon/Kconfig                    |  27 +-
 arch/mips/cavium-octeon/Makefile                   |   1 +
 arch/mips/cavium-octeon/resource-mgr.c             | 351 +++++++++++++++++++++
 arch/mips/cavium-octeon/setup.c                    |  22 +-
 .../asm/mach-cavium-octeon/kernel-entry-init.h     |  15 +-
 arch/mips/include/asm/octeon/octeon.h              |  32 +-
 arch/mips/include/asm/processor.h                  |   2 +-
 arch/mips/kernel/octeon_switch.S                   |   2 -
 arch/mips/mm/tlbex.c                               |  31 +-
 drivers/staging/octeon/ethernet-defines.h          |   6 -
 drivers/staging/octeon/ethernet-rx.c               |  25 +-
 drivers/staging/octeon/ethernet-tx.c               |  85 ++---
 12 files changed, 472 insertions(+), 127 deletions(-)
 create mode 100644 arch/mips/cavium-octeon/resource-mgr.c

-- 
2.1.4

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox