Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH 00/14] ARM: move lpc32xx and dove to multiplatform
From: Arnd Bergmann @ 2019-08-01  7:33 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: soc, Linux ARM, Vladimir Zapolskiy, Sylvain Lemieux,
	Gregory Clement, Linus Walleij, Jason Cooper, Andrew Lunn,
	Sebastian Hesselbarth, David S. Miller, Greg Kroah-Hartman,
	Alan Stern, Guenter Roeck, open list:GPIO SUBSYSTEM, Networking,
	linux-serial, USB list, LINUXWATCHDOG
In-Reply-To: <20190731225303.GC1330@shell.armlinux.org.uk>

On Thu, Aug 1, 2019 at 12:53 AM Russell King - ARM Linux admin
<linux@armlinux.org.uk> wrote:
>
> On Wed, Jul 31, 2019 at 09:56:42PM +0200, Arnd Bergmann wrote:
> > For dove, the patches are basically what I had proposed back in
> > 2015 when all other ARMv6/ARMv7 machines became part of a single
> > kernel build. I don't know what the state is mach-dove support is,
> > compared to the DT based support in mach-mvebu for the same
> > hardware. If they are functionally the same, we could also just
> > remove mach-dove rather than applying my patches.
>
> Well, the good news is that I'm down to a small board support file
> for the Dove Cubox now - but the bad news is, that there's still a
> board support file necessary to support everything the Dove SoC has
> to offer.
>
> Even for a DT based Dove Cubox, I'm still using mach-dove, but it
> may be possible to drop most of mach-dove now.  Without spending a
> lot of time digging through it, it's impossible to really know.

Ok, so we won't remove it then, but I'd like to merge my patches to
at least get away from the special case of requiring a separate kernel
image for it.

Can you try if applying patches 12 and 14 from my series causes
problems for you? (it may be easier to apply the entire set
or pull from [1] to avoid rebase conflicts).

     Arnd

[1] kernel.org:/pub/scm/linux/kernel/git/arnd/playground.git mach-cleanup-5.4

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH bpf-next v4 03/11] libbpf: add flags to umem config
From: Björn Töpel @ 2019-08-01  7:34 UTC (permalink / raw)
  To: Andrii Nakryiko
  Cc: Kevin Laatz, Netdev, Alexei Starovoitov, Daniel Borkmann,
	Björn Töpel, Karlsson, Magnus, Jakub Kicinski,
	Jonathan Lemon, Saeed Mahameed, Maxim Mikityanskiy,
	Stephen Hemminger, Bruce Richardson, ciara.loftus,
	intel-wired-lan, bpf
In-Reply-To: <CAEf4BzbTbX-Teth+4-yiorO-oHp+JhGfW2e08iBoCsBA4JCbMQ@mail.gmail.com>

On Thu, 1 Aug 2019 at 08:59, Andrii Nakryiko <andrii.nakryiko@gmail.com> wrote:
>
> On Wed, Jul 31, 2019 at 8:21 AM Björn Töpel <bjorn.topel@gmail.com> wrote:
> >
> > On Tue, 30 Jul 2019 at 19:43, Kevin Laatz <kevin.laatz@intel.com> wrote:
> > >
> > > This patch adds a 'flags' field to the umem_config and umem_reg structs.
> > > This will allow for more options to be added for configuring umems.
> > >
> > > The first use for the flags field is to add a flag for unaligned chunks
> > > mode. These flags can either be user-provided or filled with a default.
> > >
> > > Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
> > > Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
> > >
> > > ---
> > > v2:
> > >   - Removed the headroom check from this patch. It has moved to the
> > >     previous patch.
> > >
> > > v4:
> > >   - modified chunk flag define
> > > ---
>
> [...]
>
> > > diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
> > > index 833a6e60d065..44a03d8c34b9 100644
> > > --- a/tools/lib/bpf/xsk.h
> > > +++ b/tools/lib/bpf/xsk.h
> > > @@ -170,12 +170,14 @@ LIBBPF_API int xsk_socket__fd(const struct xsk_socket *xsk);
> > >  #define XSK_UMEM__DEFAULT_FRAME_SHIFT    12 /* 4096 bytes */
> > >  #define XSK_UMEM__DEFAULT_FRAME_SIZE     (1 << XSK_UMEM__DEFAULT_FRAME_SHIFT)
> > >  #define XSK_UMEM__DEFAULT_FRAME_HEADROOM 0
> > > +#define XSK_UMEM__DEFAULT_FLAGS 0
> > >
> > >  struct xsk_umem_config {
> > >         __u32 fill_size;
> > >         __u32 comp_size;
> > >         __u32 frame_size;
> > >         __u32 frame_headroom;
> > > +       __u32 flags;
> >
> > And the flags addition here, unfortunately, requires symbol versioning
> > of xsk_umem__create(). That'll be the first in libbpf! :-)
>
> xsk_umem_config is passed by pointer to xsk_umem__create(), so this
> doesn't break ABI, does it?
>

Old application, dynamically linked to new libbpf.so will crash,
right? Old application passes old version of xsk_umem_config, and new
library accesses (non-existing) flag struct member.


Björn


> >
> >
> > Björn
> >
> > >  };
> > >
> > >  /* Flags for the libbpf_flags field. */
> > > --
> > > 2.17.1
> > >
> > > _______________________________________________
> > > Intel-wired-lan mailing list
> > > Intel-wired-lan@osuosl.org
> > > https://lists.osuosl.org/mailman/listinfo/intel-wired-lan

^ permalink raw reply

* Re: next-20190723: bpf/seccomp - systemd/journald issue?
From: Sedat Dilek @ 2019-08-01  7:39 UTC (permalink / raw)
  To: Yonghong Song, Josh Poimboeuf
  Cc: Alexei Starovoitov, Daniel Borkmann, Martin Lau, Song Liu,
	netdev@vger.kernel.org, bpf@vger.kernel.org, Clang-Built-Linux ML,
	Kees Cook, Nick Desaulniers, Nathan Chancellor
In-Reply-To: <c2524c96-d71c-d7db-22ec-12da905dc180@fb.com>

Hi,

just want to let you know that I did a git bisect with Linux v5.3-rc2
(where the problem also exists) and the result (details see [1]):

e55a73251da335873a6e87d68fb17e5aabb8978e is the first bad commit
commit e55a73251da335873a6e87d68fb17e5aabb8978e
Author: Josh Poimboeuf <jpoimboe@redhat.com>
Date:   Thu Jun 27 20:50:47 2019 -0500

    bpf: Fix ORC unwinding in non-JIT BPF code

    Objtool previously ignored ___bpf_prog_run() because it didn't understand
    the jump table.  This resulted in the ORC unwinder not being able to unwind
    through non-JIT BPF code.

    Now that objtool knows how to read jump tables, remove the whitelist and
    annotate the jump table so objtool can recognize it.

    Also add an additional "const" to the jump table definition to clarify that
    the text pointers are constant.  Otherwise GCC sets the section writable
    flag and the assembler spits out warnings.

    Fixes: d15d356887e7 ("perf/x86: Make perf callchains work without
CONFIG_FRAME_POINTER")
    Reported-by: Song Liu <songliubraving@fb.com>
    Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Acked-by: Alexei Starovoitov <ast@kernel.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Kairui Song <kasong@redhat.com>
    Cc: Steven Rostedt <rostedt@goodmis.org>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Daniel Borkmann <daniel@iogearbox.net>
    Link: https://lkml.kernel.org/r/881939122b88f32be4c374d248c09d7527a87e35.1561685471.git.jpoimboe@redhat.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

:040000 040000 4735e9d14fa416c1c361ec3923440a3d586a627d
31de80b85c7b0292e47a719ecb6b1a451de2f8ef M      kernel

Maybe you want to look at this, too.

The object files are attached in [2].

Thanks,
- Sedat -

[0] https://github.com/ClangBuiltLinux/linux/issues/619
[1] https://github.com/ClangBuiltLinux/linux/issues/619#issuecomment-517152467
[2] https://github.com/ClangBuiltLinux/linux/issues/619#issuecomment-517159635

^ permalink raw reply

* Re: [PATCH] net/mlx5e: always initialize frag->last_in_page
From: Tariq Toukan @ 2019-08-01  7:46 UTC (permalink / raw)
  To: Qian Cai, davem@davemloft.net
  Cc: Saeed Mahameed, Tariq Toukan, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <1564599773-9474-1-git-send-email-cai@lca.pw>



On 7/31/2019 10:02 PM, Qian Cai wrote:
> The commit 069d11465a80 ("net/mlx5e: RX, Enhance legacy Receive Queue
> memory scheme") introduced an undefined behaviour below due to
> "frag->last_in_page" is only initialized in
> mlx5e_init_frags_partition() when,
> 
> if (next_frag.offset + frag_info[f].frag_stride > PAGE_SIZE)
> 
> or after bailed out the loop,
> 
> for (i = 0; i < mlx5_wq_cyc_get_size(&rq->wqe.wq); i++)
> 
> As the result, there could be some "frag" have uninitialized
> value of "last_in_page".
> 
> Later, get_frag() obtains those "frag" and check "rag->last_in_page" in
> mlx5e_put_rx_frag() and triggers the error during boot. Fix it by always
> initializing "frag->last_in_page" to "false" in
> mlx5e_init_frags_partition().
> 
> UBSAN: Undefined behaviour in
> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c:325:12
> load of value 170 is not a valid value for type 'bool' (aka '_Bool')
> Call trace:
>   dump_backtrace+0x0/0x264
>   show_stack+0x20/0x2c
>   dump_stack+0xb0/0x104
>   __ubsan_handle_load_invalid_value+0x104/0x128
>   mlx5e_handle_rx_cqe+0x8e8/0x12cc [mlx5_core]
>   mlx5e_poll_rx_cq+0xca8/0x1a94 [mlx5_core]
>   mlx5e_napi_poll+0x17c/0xa30 [mlx5_core]
>   net_rx_action+0x248/0x940
>   __do_softirq+0x350/0x7b8
>   irq_exit+0x200/0x26c
>   __handle_domain_irq+0xc8/0x128
>   gic_handle_irq+0x138/0x228
>   el1_irq+0xb8/0x140
>   arch_cpu_idle+0x1a4/0x348
>   do_idle+0x114/0x1b0
>   cpu_startup_entry+0x24/0x28
>   rest_init+0x1ac/0x1dc
>   arch_call_rest_init+0x10/0x18
>   start_kernel+0x4d4/0x57c
> 
> Fixes: 069d11465a80 ("net/mlx5e: RX, Enhance legacy Receive Queue memory scheme")
> Signed-off-by: Qian Cai <cai@lca.pw>
> ---
>   drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 47eea6b3a1c3..96f5110a9b43 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -336,6 +336,7 @@ static void mlx5e_init_frags_partition(struct mlx5e_rq *rq)
>   
>   	next_frag.di = &rq->wqe.di[0];
>   	next_frag.offset = 0;
> +	next_frag.last_in_page = false;
>   	prev = NULL;
>   
>   	for (i = 0; i < mlx5_wq_cyc_get_size(&rq->wqe.wq); i++) {
> 

Thanks Qian.
Please zero-init the whole struct instead:

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c 
b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 1f433a06e637..55f4f5cc1d8f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -312,11 +312,10 @@ static inline u64 mlx5e_get_mpwqe_offset(struct 
mlx5e_rq *rq, u16 wqe_ix)

  static void mlx5e_init_frags_partition(struct mlx5e_rq *rq)
  {
-       struct mlx5e_wqe_frag_info next_frag, *prev;
+       struct mlx5e_wqe_frag_info next_frag = {}, *prev;
         int i;

         next_frag.di = &rq->wqe.di[0];
-       next_frag.offset = 0;
         prev = NULL;

         for (i = 0; i < mlx5_wq_cyc_get_size(&rq->wqe.wq); i++) {


Thanks,
Tariq

^ permalink raw reply related

* Re: [PATCH] can: flexcan: free error skb if enqueueing failed
From: Martin Hundebøll @ 2019-08-01  7:59 UTC (permalink / raw)
  To: Wolfgang Grandegger, Marc Kleine-Budde, linux-can
  Cc: David S . Miller, netdev, Sean Nyekjaer
In-Reply-To: <20190715185308.104333-1-martin@geanix.com>

On 15/07/2019 20.53, Martin Hundebøll wrote:
> If the call to can_rx_offload_queue_sorted() fails, the passed skb isn't
> consumed, so the caller must do so.
> 
> Fixes: 30164759db1b ("can: flexcan: make use of rx-offload's irq_offload_fifo")
> Signed-off-by: Martin Hundebøll <martin@geanix.com>

Ping.

^ permalink raw reply

* Re: [PATCH V2 7/9] vhost: do not use RCU to synchronize MMU notifier with worker
From: Jason Wang @ 2019-08-01  8:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: kvm, virtualization, netdev, linux-kernel, linux-mm, jgg,
	Paul E. McKenney
In-Reply-To: <20190731132438-mutt-send-email-mst@kernel.org>

On 2019/8/1 上午2:29, Michael S. Tsirkin wrote:
> On Wed, Jul 31, 2019 at 04:46:53AM -0400, Jason Wang wrote:
>> We used to use RCU to synchronize MMU notifier with worker. This leads
>> calling synchronize_rcu() in invalidate_range_start(). But on a busy
>> system, there would be many factors that may slow down the
>> synchronize_rcu() which makes it unsuitable to be called in MMU
>> notifier.
>>
>> A solution is SRCU but its overhead is obvious with the expensive full
>> memory barrier. Another choice is to use seqlock, but it doesn't
>> provide a synchronization method between readers and writers. The last
>> choice is to use vq mutex, but it need to deal with the worst case
>> that MMU notifier must be blocked and wait for the finish of swap in.
>>
>> So this patch switches use a counter to track whether or not the map
>> was used. The counter was increased when vq try to start or finish
>> uses the map. This means, when it was even, we're sure there's no
>> readers and MMU notifier is synchronized. When it was odd, it means
>> there's a reader we need to wait it to be even again then we are
>> synchronized. To avoid full memory barrier, store_release +
>> load_acquire on the counter is used.
>
> Unfortunately this needs a lot of review and testing, so this can't make
> rc2, and I don't think this is the kind of patch I can merge after rc3.
> Subtle memory barrier tricks like this can introduce new bugs while they
> are fixing old ones.

I admit the patch is tricky. Some questions:

- Do we must address the case of e.g swap in? If not, a simple
  vhost_work_flush() instead of synchronize_rcu() may work.
- Having some hard thought, I think we can use seqlock, it looks
  to me smp_wmb() is in write_segcount_begin() is sufficient, we don't
  care vq->map read before smp_wmb(), and for the other we all have
  good data devendency so smp_wmb() in the write_seqbegin_end() is
  sufficient.

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index db2c81cb1e90..6d9501303258 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -363,39 +363,29 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
 
 static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
 {
-	int ref = READ_ONCE(vq->ref);
-
-	smp_store_release(&vq->ref, ref + 1);
-	/* Make sure ref counter is visible before accessing the map */
-	smp_load_acquire(&vq->ref);
+	write_seqcount_begin(&vq->seq);
 }
 
 static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
 {
-	int ref = READ_ONCE(vq->ref);
-
-	/* Make sure vq access is done before increasing ref counter */
-	smp_store_release(&vq->ref, ref + 1);
+	write_seqcount_end(&vq->seq);
 }
 
 static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq)
 {
-	int ref;
+	unsigned int ret;
 
 	/* Make sure map change was done before checking ref counter */
 	smp_mb();
-
-	ref = READ_ONCE(vq->ref);
-	if (ref & 0x1) {
-		/* When ref change, we are sure no reader can see
+	ret = raw_read_seqcount(&vq->seq);
+	if (ret & 0x1) {
+		/* When seq changes, we are sure no reader can see
 		 * previous map */
-		while (READ_ONCE(vq->ref) == ref) {
-			set_current_state(TASK_RUNNING);
+		while (raw_read_seqcount(&vq->seq) == ret)
 			schedule();
-		}
 	}
-	/* Make sure ref counter was checked before any other
-	 * operations that was dene on map. */
+	/* Make sure seq was checked before any other operations that
+	 * was dene on map. */
 	smp_mb();
 }
 
@@ -691,7 +681,7 @@ void vhost_dev_init(struct vhost_dev *dev,
 		vq->indirect = NULL;
 		vq->heads = NULL;
 		vq->dev = dev;
-		vq->ref = 0;
+		seqcount_init(&vq->seq);
 		mutex_init(&vq->mutex);
 		spin_lock_init(&vq->mmu_lock);
 		vhost_vq_reset(dev, vq);
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 3d10da0ae511..1a705e181a84 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -125,7 +125,7 @@ struct vhost_virtqueue {
 	 */
 	struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS];
 #endif
-	int ref;
+	seqcount_t seq;
 	const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
 
 	struct file *kick;
-- 
2.18.1

>
>
>
>
>
>>
>> Consider the read critical section is pretty small the synchronization
>> should be done very fast.
>>
>> Note the patch lead about 3% PPS dropping.
>
> Sorry what do you mean by this last sentence? This degrades performance
> compared to what?

Compare to without this patch.

>
>>
>> Reported-by: Michael S. Tsirkin <mst@redhat.com>
>> Fixes: 7f466032dc9e ("vhost: access vq metadata through kernel virtual address")
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>> ---
>>  drivers/vhost/vhost.c | 145 ++++++++++++++++++++++++++----------------
>>  drivers/vhost/vhost.h |   7 +-
>>  2 files changed, 94 insertions(+), 58 deletions(-)
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index cfc11f9ed9c9..db2c81cb1e90 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -324,17 +324,16 @@ static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
>>  
>>  	spin_lock(&vq->mmu_lock);
>>  	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
>> -		map[i] = rcu_dereference_protected(vq->maps[i],
>> -				  lockdep_is_held(&vq->mmu_lock));
>> +		map[i] = vq->maps[i];
>>  		if (map[i]) {
>>  			vhost_set_map_dirty(vq, map[i], i);
>> -			rcu_assign_pointer(vq->maps[i], NULL);
>> +			vq->maps[i] = NULL;
>>  		}
>>  	}
>>  	spin_unlock(&vq->mmu_lock);
>>  
>> -	/* No need for synchronize_rcu() or kfree_rcu() since we are
>> -	 * serialized with memory accessors (e.g vq mutex held).
>> +	/* No need for synchronization since we are serialized with
>> +	 * memory accessors (e.g vq mutex held).
>>  	 */
>>  
>>  	for (i = 0; i < VHOST_NUM_ADDRS; i++)
>> @@ -362,6 +361,44 @@ static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
>>  	return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
>>  }
>>  
>> +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
>> +{
>> +	int ref = READ_ONCE(vq->ref);
>> +
>> +	smp_store_release(&vq->ref, ref + 1);
>> +	/* Make sure ref counter is visible before accessing the map */
>> +	smp_load_acquire(&vq->ref);
>
> The map access is after this sequence, correct?

Yes.

>
> Just going by the rules in Documentation/memory-barriers.txt,
> I think that this pair will not order following accesses with ref store.
>
> Documentation/memory-barriers.txt says:
>
>
> +     In addition, a RELEASE+ACQUIRE
> +     pair is -not- guaranteed to act as a full memory barrier.
>
>
>
> The guarantee that is made is this:
> 	after
>      an ACQUIRE on a given variable, all memory accesses preceding any prior
>      RELEASE on that same variable are guaranteed to be visible.

Yes, but it's not clear about the order of ACQUIRE the same location
of previous RELEASE. And it only has a example like:

"
	*A = a;
	RELEASE M
	ACQUIRE N
	*B = b;

could occur as:

	ACQUIRE N, STORE *B, STORE *A, RELEASE M
"

But it doesn't explain what happen when

*A = a
RELEASE M
ACQUIRE M
*B = b;

And tools/memory-model/Documentation said

"
First, when a lock-acquire reads from a lock-release, the LKMM
requires that every instruction po-before the lock-release must
execute before any instruction po-after the lock-acquire.
"

Is this a hint that I was correct?

>
>
> And if we also had the reverse rule we'd end up with a full barrier,
> won't we?
>
> Cc Paul in case I missed something here. And if I'm right,
> maybe we should call this out, adding
>
> 	"The opposite is not true: a prior RELEASE is not
> 	 guaranteed to be visible before memory accesses following
> 	 the subsequent ACQUIRE".

That kinds of violates the RELEASE?

"
     This also acts as a one-way permeable barrier.  It guarantees that all
     memory operations before the RELEASE operation will appear to happen
     before the RELEASE operation with respect to the other components of the
"

>
>
>
>> +}
>> +
>> +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
>> +{
>> +	int ref = READ_ONCE(vq->ref);
>> +
>> +	/* Make sure vq access is done before increasing ref counter */
>> +	smp_store_release(&vq->ref, ref + 1);
>> +}
>> +
>> +static void inline vhost_vq_sync_access(struct vhost_virtqueue *vq)
>> +{
>> +	int ref;
>> +
>> +	/* Make sure map change was done before checking ref counter */
>> +	smp_mb();
>> +
>> +	ref = READ_ONCE(vq->ref);
>> +	if (ref & 0x1) {
>
> Please document the even/odd trick here too, not just in the commit log.
>

Ok.

>> +		/* When ref change,
>
> changes
>
>> we are sure no reader can see
>> +		 * previous map */
>> +		while (READ_ONCE(vq->ref) == ref) {
>
>
> what is the below line in aid of?
>
>> +			set_current_state(TASK_RUNNING);
>> +			schedule();
>
>                         if (need_resched())
>                                 schedule();
>
> ?

Yes, better.

>
>> +		}
>
> On an interruptible kernel, there's a risk here is that
> a task got preempted with an odd ref.
> So I suspect we'll have to disable preemption when we
> make ref odd.

I'm not sure I get, if the odd is not the original value we read,
we're sure it won't read the new map here I believe.

>
>
>> +	}
>> +	/* Make sure ref counter was checked before any other
>> +	 * operations that was dene on map. */
>
> was dene -> were done?
>

Yes.

>> +	smp_mb();
>> +}
>> +
>>  static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
>>  				      int index,
>>  				      unsigned long start,
>> @@ -376,16 +413,15 @@ static void vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
>>  	spin_lock(&vq->mmu_lock);
>>  	++vq->invalidate_count;
>>  
>> -	map = rcu_dereference_protected(vq->maps[index],
>> -					lockdep_is_held(&vq->mmu_lock));
>> +	map = vq->maps[index];
>>  	if (map) {
>>  		vhost_set_map_dirty(vq, map, index);
>> -		rcu_assign_pointer(vq->maps[index], NULL);
>> +		vq->maps[index] = NULL;
>>  	}
>>  	spin_unlock(&vq->mmu_lock);
>>  
>>  	if (map) {
>> -		synchronize_rcu();
>> +		vhost_vq_sync_access(vq);
>>  		vhost_map_unprefetch(map);
>>  	}
>>  }
>> @@ -457,7 +493,7 @@ static void vhost_init_maps(struct vhost_dev *dev)
>>  	for (i = 0; i < dev->nvqs; ++i) {
>>  		vq = dev->vqs[i];
>>  		for (j = 0; j < VHOST_NUM_ADDRS; j++)
>> -			RCU_INIT_POINTER(vq->maps[j], NULL);
>> +			vq->maps[j] = NULL;
>>  	}
>>  }
>>  #endif
>> @@ -655,6 +691,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>>  		vq->indirect = NULL;
>>  		vq->heads = NULL;
>>  		vq->dev = dev;
>> +		vq->ref = 0;
>>  		mutex_init(&vq->mutex);
>>  		spin_lock_init(&vq->mmu_lock);
>>  		vhost_vq_reset(dev, vq);
>> @@ -921,7 +958,7 @@ static int vhost_map_prefetch(struct vhost_virtqueue *vq,
>>  	map->npages = npages;
>>  	map->pages = pages;
>>  
>> -	rcu_assign_pointer(vq->maps[index], map);
>> +	vq->maps[index] = map;
>>  	/* No need for a synchronize_rcu(). This function should be
>>  	 * called by dev->worker so we are serialized with all
>>  	 * readers.
>> @@ -1216,18 +1253,18 @@ static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
>>  	struct vring_used *used;
>>  
>>  	if (!vq->iotlb) {
>> -		rcu_read_lock();
>> +		vhost_vq_access_map_begin(vq);
>>  
>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
>> +		map = vq->maps[VHOST_ADDR_USED];
>>  		if (likely(map)) {
>>  			used = map->addr;
>>  			*((__virtio16 *)&used->ring[vq->num]) =
>>  				cpu_to_vhost16(vq, vq->avail_idx);
>> -			rcu_read_unlock();
>> +			vhost_vq_access_map_end(vq);
>>  			return 0;
>>  		}
>>  
>> -		rcu_read_unlock();
>> +		vhost_vq_access_map_end(vq);
>>  	}
>>  #endif
>>  
>> @@ -1245,18 +1282,18 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
>>  	size_t size;
>>  
>>  	if (!vq->iotlb) {
>> -		rcu_read_lock();
>> +		vhost_vq_access_map_begin(vq);
>>  
>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
>> +		map = vq->maps[VHOST_ADDR_USED];
>>  		if (likely(map)) {
>>  			used = map->addr;
>>  			size = count * sizeof(*head);
>>  			memcpy(used->ring + idx, head, size);
>> -			rcu_read_unlock();
>> +			vhost_vq_access_map_end(vq);
>>  			return 0;
>>  		}
>>  
>> -		rcu_read_unlock();
>> +		vhost_vq_access_map_end(vq);
>>  	}
>>  #endif
>>  
>> @@ -1272,17 +1309,17 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
>>  	struct vring_used *used;
>>  
>>  	if (!vq->iotlb) {
>> -		rcu_read_lock();
>> +		vhost_vq_access_map_begin(vq);
>>  
>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
>> +		map = vq->maps[VHOST_ADDR_USED];
>>  		if (likely(map)) {
>>  			used = map->addr;
>>  			used->flags = cpu_to_vhost16(vq, vq->used_flags);
>> -			rcu_read_unlock();
>> +			vhost_vq_access_map_end(vq);
>>  			return 0;
>>  		}
>>  
>> -		rcu_read_unlock();
>> +		vhost_vq_access_map_end(vq);
>>  	}
>>  #endif
>>  
>> @@ -1298,17 +1335,17 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
>>  	struct vring_used *used;
>>  
>>  	if (!vq->iotlb) {
>> -		rcu_read_lock();
>> +		vhost_vq_access_map_begin(vq);
>>  
>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
>> +		map = vq->maps[VHOST_ADDR_USED];
>>  		if (likely(map)) {
>>  			used = map->addr;
>>  			used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
>> -			rcu_read_unlock();
>> +			vhost_vq_access_map_end(vq);
>>  			return 0;
>>  		}
>>  
>> -		rcu_read_unlock();
>> +		vhost_vq_access_map_end(vq);
>>  	}
>>  #endif
>>  
>> @@ -1362,17 +1399,17 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
>>  	struct vring_avail *avail;
>>  
>>  	if (!vq->iotlb) {
>> -		rcu_read_lock();
>> +		vhost_vq_access_map_begin(vq);
>>  
>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
>> +		map = vq->maps[VHOST_ADDR_AVAIL];
>>  		if (likely(map)) {
>>  			avail = map->addr;
>>  			*idx = avail->idx;
>> -			rcu_read_unlock();
>> +			vhost_vq_access_map_end(vq);
>>  			return 0;
>>  		}
>>  
>> -		rcu_read_unlock();
>> +		vhost_vq_access_map_end(vq);
>>  	}
>>  #endif
>>  
>> @@ -1387,17 +1424,17 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
>>  	struct vring_avail *avail;
>>  
>>  	if (!vq->iotlb) {
>> -		rcu_read_lock();
>> +		vhost_vq_access_map_begin(vq);
>>  
>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
>> +		map = vq->maps[VHOST_ADDR_AVAIL];
>>  		if (likely(map)) {
>>  			avail = map->addr;
>>  			*head = avail->ring[idx & (vq->num - 1)];
>> -			rcu_read_unlock();
>> +			vhost_vq_access_map_end(vq);
>>  			return 0;
>>  		}
>>  
>> -		rcu_read_unlock();
>> +		vhost_vq_access_map_end(vq);
>>  	}
>>  #endif
>>  
>> @@ -1413,17 +1450,17 @@ static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
>>  	struct vring_avail *avail;
>>  
>>  	if (!vq->iotlb) {
>> -		rcu_read_lock();
>> +		vhost_vq_access_map_begin(vq);
>>  
>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
>> +		map = vq->maps[VHOST_ADDR_AVAIL];
>>  		if (likely(map)) {
>>  			avail = map->addr;
>>  			*flags = avail->flags;
>> -			rcu_read_unlock();
>> +			vhost_vq_access_map_end(vq);
>>  			return 0;
>>  		}
>>  
>> -		rcu_read_unlock();
>> +		vhost_vq_access_map_end(vq);
>>  	}
>>  #endif
>>  
>> @@ -1438,15 +1475,15 @@ static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
>>  	struct vring_avail *avail;
>>  
>>  	if (!vq->iotlb) {
>> -		rcu_read_lock();
>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_AVAIL]);
>> +		vhost_vq_access_map_begin(vq);
>> +		map = vq->maps[VHOST_ADDR_AVAIL];
>>  		if (likely(map)) {
>>  			avail = map->addr;
>>  			*event = (__virtio16)avail->ring[vq->num];
>> -			rcu_read_unlock();
>> +			vhost_vq_access_map_end(vq);
>>  			return 0;
>>  		}
>> -		rcu_read_unlock();
>> +		vhost_vq_access_map_end(vq);
>>  	}
>>  #endif
>>  
>> @@ -1461,17 +1498,17 @@ static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
>>  	struct vring_used *used;
>>  
>>  	if (!vq->iotlb) {
>> -		rcu_read_lock();
>> +		vhost_vq_access_map_begin(vq);
>>  
>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_USED]);
>> +		map = vq->maps[VHOST_ADDR_USED];
>>  		if (likely(map)) {
>>  			used = map->addr;
>>  			*idx = used->idx;
>> -			rcu_read_unlock();
>> +			vhost_vq_access_map_end(vq);
>>  			return 0;
>>  		}
>>  
>> -		rcu_read_unlock();
>> +		vhost_vq_access_map_end(vq);
>>  	}
>>  #endif
>>  
>> @@ -1486,17 +1523,17 @@ static inline int vhost_get_desc(struct vhost_virtqueue *vq,
>>  	struct vring_desc *d;
>>  
>>  	if (!vq->iotlb) {
>> -		rcu_read_lock();
>> +		vhost_vq_access_map_begin(vq);
>>  
>> -		map = rcu_dereference(vq->maps[VHOST_ADDR_DESC]);
>> +		map = vq->maps[VHOST_ADDR_DESC];
>>  		if (likely(map)) {
>>  			d = map->addr;
>>  			*desc = *(d + idx);
>> -			rcu_read_unlock();
>> +			vhost_vq_access_map_end(vq);
>>  			return 0;
>>  		}
>>  
>> -		rcu_read_unlock();
>> +		vhost_vq_access_map_end(vq);
>>  	}
>>  #endif
>>  
>> @@ -1843,13 +1880,11 @@ static bool iotlb_access_ok(struct vhost_virtqueue *vq,
>>  #if VHOST_ARCH_CAN_ACCEL_UACCESS
>>  static void vhost_vq_map_prefetch(struct vhost_virtqueue *vq)
>>  {
>> -	struct vhost_map __rcu *map;
>> +	struct vhost_map *map;
>>  	int i;
>>  
>>  	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
>> -		rcu_read_lock();
>> -		map = rcu_dereference(vq->maps[i]);
>> -		rcu_read_unlock();
>> +		map = vq->maps[i];
>>  		if (unlikely(!map))
>>  			vhost_map_prefetch(vq, i);
>>  	}
>> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
>> index a9a2a93857d2..f9e9558a529d 100644
>> --- a/drivers/vhost/vhost.h
>> +++ b/drivers/vhost/vhost.h
>> @@ -115,16 +115,17 @@ struct vhost_virtqueue {
>>  #if VHOST_ARCH_CAN_ACCEL_UACCESS
>>  	/* Read by memory accessors, modified by meta data
>>  	 * prefetching, MMU notifier and vring ioctl().
>> -	 * Synchonrized through mmu_lock (writers) and RCU (writers
>> -	 * and readers).
>> +	 * Synchonrized through mmu_lock (writers) and ref counters,
>> +	 * see vhost_vq_access_map_begin()/vhost_vq_access_map_end().
>>  	 */
>> -	struct vhost_map __rcu *maps[VHOST_NUM_ADDRS];
>> +	struct vhost_map *maps[VHOST_NUM_ADDRS];
>>  	/* Read by MMU notifier, modified by vring ioctl(),
>>  	 * synchronized through MMU notifier
>>  	 * registering/unregistering.
>>  	 */
>>  	struct vhost_uaddr uaddrs[VHOST_NUM_ADDRS];
>>  #endif
>> +	int ref;
>
> Is it important that this is signed? If not I'd do unsigned here:
> even though kernel does compile with 2s complement sign overflow,
> it seems cleaner not to depend on that.

Not a must, let me fix.

Thanks

>
>>  	const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
>>  
>>  	struct file *kick;
>> -- 
>> 2.18.1

^ permalink raw reply related

* [v2,0/2] tools: bpftool: add net attach/detach command to attach XDP prog
From: Daniel T. Lee @ 2019-08-01  8:11 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov; +Cc: netdev

Currently, bpftool net only supports dumping progs attached on the
interface. To attach XDP prog on interface, user must use other tool
(eg. iproute2). By this patch, with `bpftool net attach/detach`, user
can attach/detach XDP prog on interface.

    $ ./bpftool prog
    ...
    208: xdp  name xdp_prog1  tag ad822e38b629553f  gpl
      loaded_at 2019-07-28T18:03:11+0900  uid 0
    ...
    $ ./bpftool net attach id 208 xdpdrv enp6s0np1
    $ ./bpftool net
    xdp:
    enp6s0np1(5) driver id 208
    ...
    $ ./bpftool net detach xdpdrv enp6s0np1
    $ ./bpftool net
    xdp:
    ...

While this patch only contains support for XDP, through `net
attach/detach`, bpftool can further support other prog attach types.

XDP attach/detach tested on Mellanox ConnectX-4 and Netronome Agilio.

---
Changes in v2:
  - command 'load/unload' changed to 'attach/detach' for the consistency

Daniel T. Lee (2):
  tools: bpftool: add net attach command to attach XDP on interface
  tools: bpftool: add net detach command to detach XDP on interface

 tools/bpf/bpftool/net.c | 160 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 159 insertions(+), 1 deletion(-)

-- 
2.20.1


^ permalink raw reply

* [v2,1/2] tools: bpftool: add net attach command to attach XDP on interface
From: Daniel T. Lee @ 2019-08-01  8:11 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov; +Cc: netdev
In-Reply-To: <20190801081133.13200-1-danieltimlee@gmail.com>

By this commit, using `bpftool net attach`, user can attach XDP prog on
interface. New type of enum 'net_attach_type' has been made, as stated at
cover-letter, the meaning of 'attach' is, prog will be attached on interface.

BPF prog will be attached through libbpf 'bpf_set_link_xdp_fd'.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
---
Changes in v2:
  - command 'load' changed to 'attach' for the consistency
  - 'NET_ATTACH_TYPE_XDP_DRIVE' changed to 'NET_ATTACH_TYPE_XDP_DRIVER'

 tools/bpf/bpftool/net.c | 107 +++++++++++++++++++++++++++++++++++++++-
 1 file changed, 106 insertions(+), 1 deletion(-)

diff --git a/tools/bpf/bpftool/net.c b/tools/bpf/bpftool/net.c
index 67e99c56bc88..f3b57660b303 100644
--- a/tools/bpf/bpftool/net.c
+++ b/tools/bpf/bpftool/net.c
@@ -55,6 +55,35 @@ struct bpf_attach_info {
 	__u32 flow_dissector_id;
 };
 
+enum net_attach_type {
+	NET_ATTACH_TYPE_XDP,
+	NET_ATTACH_TYPE_XDP_GENERIC,
+	NET_ATTACH_TYPE_XDP_DRIVER,
+	NET_ATTACH_TYPE_XDP_OFFLOAD,
+	__MAX_NET_ATTACH_TYPE
+};
+
+static const char * const attach_type_strings[] = {
+	[NET_ATTACH_TYPE_XDP] = "xdp",
+	[NET_ATTACH_TYPE_XDP_GENERIC] = "xdpgeneric",
+	[NET_ATTACH_TYPE_XDP_DRIVER] = "xdpdrv",
+	[NET_ATTACH_TYPE_XDP_OFFLOAD] = "xdpoffload",
+	[__MAX_NET_ATTACH_TYPE] = NULL,
+};
+
+static enum net_attach_type parse_attach_type(const char *str)
+{
+	enum net_attach_type type;
+
+	for (type = 0; type < __MAX_NET_ATTACH_TYPE; type++) {
+		if (attach_type_strings[type] &&
+		   is_prefix(str, attach_type_strings[type]))
+			return type;
+	}
+
+	return __MAX_NET_ATTACH_TYPE;
+}
+
 static int dump_link_nlmsg(void *cookie, void *msg, struct nlattr **tb)
 {
 	struct bpf_netdev_t *netinfo = cookie;
@@ -223,6 +252,77 @@ static int query_flow_dissector(struct bpf_attach_info *attach_info)
 	return 0;
 }
 
+static int parse_attach_args(int argc, char **argv, int *progfd,
+			     enum net_attach_type *attach_type, int *ifindex)
+{
+	if (!REQ_ARGS(3))
+		return -EINVAL;
+
+	*progfd = prog_parse_fd(&argc, &argv);
+	if (*progfd < 0)
+		return *progfd;
+
+	*attach_type = parse_attach_type(*argv);
+	if (*attach_type == __MAX_NET_ATTACH_TYPE) {
+		p_err("invalid net attach/detach type");
+		return -EINVAL;
+	}
+
+	NEXT_ARG();
+	if (!REQ_ARGS(1))
+		return -EINVAL;
+
+	*ifindex = if_nametoindex(*argv);
+	if (!*ifindex) {
+		p_err("Invalid ifname");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int do_attach_detach_xdp(int *progfd, enum net_attach_type *attach_type,
+				int *ifindex)
+{
+	__u32 flags;
+	int err;
+
+	flags = XDP_FLAGS_UPDATE_IF_NOEXIST;
+	if (*attach_type == NET_ATTACH_TYPE_XDP_GENERIC)
+		flags |= XDP_FLAGS_SKB_MODE;
+	if (*attach_type == NET_ATTACH_TYPE_XDP_DRIVER)
+		flags |= XDP_FLAGS_DRV_MODE;
+	if (*attach_type == NET_ATTACH_TYPE_XDP_OFFLOAD)
+		flags |= XDP_FLAGS_HW_MODE;
+
+	err = bpf_set_link_xdp_fd(*ifindex, *progfd, flags);
+
+	return err;
+}
+
+static int do_attach(int argc, char **argv)
+{
+	enum net_attach_type attach_type;
+	int err, progfd, ifindex;
+
+	err = parse_attach_args(argc, argv, &progfd, &attach_type, &ifindex);
+	if (err)
+		return err;
+
+	if (is_prefix("xdp", attach_type_strings[attach_type]))
+		err = do_attach_detach_xdp(&progfd, &attach_type, &ifindex);
+
+	if (err < 0) {
+		p_err("link set %s failed", attach_type_strings[attach_type]);
+		return -1;
+	}
+
+	if (json_output)
+		jsonw_null(json_wtr);
+
+	return 0;
+}
+
 static int do_show(int argc, char **argv)
 {
 	struct bpf_attach_info attach_info = {};
@@ -305,13 +405,17 @@ static int do_help(int argc, char **argv)
 
 	fprintf(stderr,
 		"Usage: %s %s { show | list } [dev <devname>]\n"
+		"       %s %s attach PROG LOAD_TYPE <devname>\n"
 		"       %s %s help\n"
+		"\n"
+		"       " HELP_SPEC_PROGRAM "\n"
+		"       LOAD_TYPE := { xdp | xdpgeneric | xdpdrv | xdpoffload }\n"
 		"Note: Only xdp and tc attachments are supported now.\n"
 		"      For progs attached to cgroups, use \"bpftool cgroup\"\n"
 		"      to dump program attachments. For program types\n"
 		"      sk_{filter,skb,msg,reuseport} and lwt/seg6, please\n"
 		"      consult iproute2.\n",
-		bin_name, argv[-2], bin_name, argv[-2]);
+		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2]);
 
 	return 0;
 }
@@ -319,6 +423,7 @@ static int do_help(int argc, char **argv)
 static const struct cmd cmds[] = {
 	{ "show",	do_show },
 	{ "list",	do_show },
+	{ "attach",	do_attach },
 	{ "help",	do_help },
 	{ 0 }
 };
-- 
2.20.1


^ permalink raw reply related

* [v2,2/2] tools: bpftool: add net detach command to detach XDP on interface
From: Daniel T. Lee @ 2019-08-01  8:11 UTC (permalink / raw)
  To: Daniel Borkmann, Alexei Starovoitov; +Cc: netdev
In-Reply-To: <20190801081133.13200-1-danieltimlee@gmail.com>

By this commit, using `bpftool net detach`, the attached XDP prog can
be detached. Detaching the BPF prog will be done through libbpf
'bpf_set_link_xdp_fd' with the progfd set to -1.

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
---
Changes in v2:
  - command 'unload' changed to 'detach' for the consistency

 tools/bpf/bpftool/net.c | 55 ++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 54 insertions(+), 1 deletion(-)

diff --git a/tools/bpf/bpftool/net.c b/tools/bpf/bpftool/net.c
index f3b57660b303..2ae9a613b05c 100644
--- a/tools/bpf/bpftool/net.c
+++ b/tools/bpf/bpftool/net.c
@@ -281,6 +281,31 @@ static int parse_attach_args(int argc, char **argv, int *progfd,
 	return 0;
 }
 
+static int parse_detach_args(int argc, char **argv,
+			     enum net_attach_type *attach_type, int *ifindex)
+{
+	if (!REQ_ARGS(2))
+		return -EINVAL;
+
+	*attach_type = parse_attach_type(*argv);
+	if (*attach_type == __MAX_NET_ATTACH_TYPE) {
+		p_err("invalid net attach/detach type");
+		return -EINVAL;
+	}
+
+	NEXT_ARG();
+	if (!REQ_ARGS(1))
+		return -EINVAL;
+
+	*ifindex = if_nametoindex(*argv);
+	if (!*ifindex) {
+		p_err("Invalid ifname");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 static int do_attach_detach_xdp(int *progfd, enum net_attach_type *attach_type,
 				int *ifindex)
 {
@@ -323,6 +348,31 @@ static int do_attach(int argc, char **argv)
 	return 0;
 }
 
+static int do_detach(int argc, char **argv)
+{
+	enum net_attach_type attach_type;
+	int err, progfd, ifindex;
+
+	err = parse_detach_args(argc, argv, &attach_type, &ifindex);
+	if (err)
+		return err;
+
+	/* to detach xdp prog */
+	progfd = -1;
+	if (is_prefix("xdp", attach_type_strings[attach_type]))
+		err = do_attach_detach_xdp(&progfd, &attach_type, &ifindex);
+
+	if (err < 0) {
+		p_err("link set %s failed", attach_type_strings[attach_type]);
+		return -1;
+	}
+
+	if (json_output)
+		jsonw_null(json_wtr);
+
+	return 0;
+}
+
 static int do_show(int argc, char **argv)
 {
 	struct bpf_attach_info attach_info = {};
@@ -406,6 +456,7 @@ static int do_help(int argc, char **argv)
 	fprintf(stderr,
 		"Usage: %s %s { show | list } [dev <devname>]\n"
 		"       %s %s attach PROG LOAD_TYPE <devname>\n"
+		"       %s %s detach LOAD_TYPE <devname>\n"
 		"       %s %s help\n"
 		"\n"
 		"       " HELP_SPEC_PROGRAM "\n"
@@ -415,7 +466,8 @@ static int do_help(int argc, char **argv)
 		"      to dump program attachments. For program types\n"
 		"      sk_{filter,skb,msg,reuseport} and lwt/seg6, please\n"
 		"      consult iproute2.\n",
-		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2]);
+		bin_name, argv[-2], bin_name, argv[-2], bin_name, argv[-2],
+		bin_name, argv[-2]);
 
 	return 0;
 }
@@ -424,6 +476,7 @@ static const struct cmd cmds[] = {
 	{ "show",	do_show },
 	{ "list",	do_show },
 	{ "attach",	do_attach },
+	{ "detach",	do_detach },
 	{ "help",	do_help },
 	{ 0 }
 };
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH 03/12] block: bio_release_pages: use flags arg instead of bool
From: Christoph Hellwig @ 2019-08-01  8:20 UTC (permalink / raw)
  To: Jerome Glisse
  Cc: Christoph Hellwig, Christoph Hellwig, john.hubbard, Andrew Morton,
	Alexander Viro, Anna Schumaker, David S . Miller,
	Dominique Martinet, Eric Van Hensbergen, Jason Gunthorpe,
	Jason Wang, Jens Axboe, Latchesar Ionkov, Michael S . Tsirkin,
	Miklos Szeredi, Trond Myklebust, Matthew Wilcox, linux-mm, LKML,
	ceph-devel, kvm, linux-block, linux-cifs, linux-fsdevel,
	linux-nfs, linux-rdma, netdev, samba-technical, v9fs-developer,
	virtualization, John Hubbard, Minwoo Im
In-Reply-To: <20190730155702.GB10366@redhat.com>

On Tue, Jul 30, 2019 at 11:57:02AM -0400, Jerome Glisse wrote:
> Other user can also add page that are not coming from GUP but need to
> have a reference see __blkdev_direct_IO()

Except for the zero page case I mentioned in my last mail explicitly,
and the KVEC/PIPE type iov vecs from the original mail what other
pages do you see to get added?

^ permalink raw reply

* [PATCH net] ipv4/route: do not check saddr dev if iif is LOOPBACK_IFINDEX
From: Hangbin Liu @ 2019-08-01  8:29 UTC (permalink / raw)
  To: netdev
  Cc: Stefano Brivio, Marcelo Ricardo Leitner, David Ahern,
	David S . Miller, Hangbin Liu

Jianlin reported a bug that for IPv4, ip route get from src_addr would fail
if src_addr is not an address on local system.

\# ip route get 1.1.1.1 from 2.2.2.2
RTNETLINK answers: Invalid argument

While IPv6 would get the default route

\# ip -6 route get 1111::1 from 2222::2
1111::1 from 2222::2 via fe80:52:0:4982::1fe dev eth0 proto ra src \
	2620::2cf:e2ff:fe40:37 metric 1024 hoplimit 64 pref medium

Fix it by disabling the __ip_dev_find() check if iif is LOOPBACK_IFINDEX.

Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 net/ipv4/route.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 517300d587a7..1be78e8f9e3c 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2512,7 +2512,8 @@ struct rtable *ip_route_output_key_hash_rcu(struct net *net, struct flowi4 *fl4,
 			goto make_route;
 		}
 
-		if (!(fl4->flowi4_flags & FLOWI_FLAG_ANYSRC)) {
+		if (!(fl4->flowi4_flags & FLOWI_FLAG_ANYSRC) &&
+		    fl4->flowi4_iif != LOOPBACK_IFINDEX) {
 			/* It is equivalent to inet_addr_type(saddr) == RTN_LOCAL */
 			if (!__ip_dev_find(net, fl4->saddr, false))
 				goto out;
-- 
2.19.2


^ permalink raw reply related

* [PATCH net] ibmveth: use net_err_ratelimited when set_multicast_list
From: Hangbin Liu @ 2019-08-01  9:03 UTC (permalink / raw)
  To: netdev; +Cc: Thomas Falcon, David S . Miller, Hangbin Liu

When setting lots of multicast list on ibmveth, e.g. add 3000 membership on a
multicast group, the following error message flushes our log file

8507    [  901.478251] ibmveth 30000003 env3: h_multicast_ctrl rc=4 when adding an entry to the filter table
...
1718386 [ 5636.808658] ibmveth 30000003 env3: h_multicast_ctrl rc=4 when adding an entry to the filter table

We got 1.5 million lines of messages in 1.3h. Let's replace netdev_err() by
net_err_ratelimited() to avoid this issue. I don't use netdev_err_once() in
case h_multicast_ctrl() return different lpar_rc types.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
---
 drivers/net/ethernet/ibm/ibmveth.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
index d654c234aaf7..3b9406a4ca92 100644
--- a/drivers/net/ethernet/ibm/ibmveth.c
+++ b/drivers/net/ethernet/ibm/ibmveth.c
@@ -1446,9 +1446,11 @@ static void ibmveth_set_multicast_list(struct net_device *netdev)
 						   IbmVethMcastAddFilter,
 						   mcast_addr);
 			if (lpar_rc != H_SUCCESS) {
-				netdev_err(netdev, "h_multicast_ctrl rc=%ld "
-					   "when adding an entry to the filter "
-					   "table\n", lpar_rc);
+				net_err_ratelimited("%s %s%s: h_multicast_ctrl rc=%ld when adding an entry to the filter table\n",
+						    ibmveth_driver_name,
+						    netdev_name(netdev),
+						    netdev_reg_state(netdev),
+						    lpar_rc);
 			}
 		}
 
-- 
2.19.2


^ permalink raw reply related

* [PATCH v2 net-next] be2net: disable bh with spin_lock in be_process_mcc
From: Denis Kirjanov @ 2019-08-01  9:24 UTC (permalink / raw)
  To: sathya.perla, ajit.khaparde, sriharsha.basavapatna; +Cc: netdev, Denis Kirjanov

be_process_mcc() is invoked in 3 different places and
always with BHs disabled except the be_poll function
but since it's invoked from softirq with BHs
disabled it won't hurt.

v1->v2: added explanation to the patch

Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
---
 drivers/net/ethernet/emulex/benet/be_cmds.c | 4 ++--
 drivers/net/ethernet/emulex/benet/be_main.c | 2 --
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/emulex/benet/be_cmds.c b/drivers/net/ethernet/emulex/benet/be_cmds.c
index ef5d61d57597..9365218f4d3b 100644
--- a/drivers/net/ethernet/emulex/benet/be_cmds.c
+++ b/drivers/net/ethernet/emulex/benet/be_cmds.c
@@ -550,7 +550,7 @@ int be_process_mcc(struct be_adapter *adapter)
 	int num = 0, status = 0;
 	struct be_mcc_obj *mcc_obj = &adapter->mcc_obj;
 
-	spin_lock(&adapter->mcc_cq_lock);
+	spin_lock_bh(&adapter->mcc_cq_lock);
 
 	while ((compl = be_mcc_compl_get(adapter))) {
 		if (compl->flags & CQE_FLAGS_ASYNC_MASK) {
@@ -566,7 +566,7 @@ int be_process_mcc(struct be_adapter *adapter)
 	if (num)
 		be_cq_notify(adapter, mcc_obj->cq.id, mcc_obj->rearm_cq, num);
 
-	spin_unlock(&adapter->mcc_cq_lock);
+	spin_unlock_bh(&adapter->mcc_cq_lock);
 	return status;
 }
 
diff --git a/drivers/net/ethernet/emulex/benet/be_main.c b/drivers/net/ethernet/emulex/benet/be_main.c
index 2edb86ec9fe9..4d8e40ac66d2 100644
--- a/drivers/net/ethernet/emulex/benet/be_main.c
+++ b/drivers/net/ethernet/emulex/benet/be_main.c
@@ -5630,9 +5630,7 @@ static void be_worker(struct work_struct *work)
 	 * mcc completions
 	 */
 	if (!netif_running(adapter->netdev)) {
-		local_bh_disable();
 		be_process_mcc(adapter);
-		local_bh_enable();
 		goto reschedule;
 	}
 
-- 
2.12.3


^ permalink raw reply related

* [PATCH net-next] net/mlx5e: Allow dropping specific tunnel packets
From: xiangxia.m.yue @ 2019-08-01  8:40 UTC (permalink / raw)
  To: roid, saeedm; +Cc: netdev, Tonghao Zhang

From: Tonghao Zhang <xiangxia.m.yue@gmail.com>

In some case, we don't want to allow specific tunnel packets
to host that can avoid to take up high CPU (e.g network attacks).
But other tunnel packets which not matched in hardware will be
sent to host too.

    $ tc filter add dev vxlan_sys_4789 \
	    protocol ip chain 0 parent ffff: prio 1 handle 1 \
	    flower dst_ip 1.1.1.100 ip_proto tcp dst_port 80 \
	    enc_dst_ip 2.2.2.100 enc_key_id 100 enc_dst_port 4789 \
	    action tunnel_key unset pipe action drop

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index f3ed028..25d423e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -2485,7 +2485,8 @@ static bool actions_match_supported(struct mlx5e_priv *priv,
 
 	if (flow_flag_test(flow, EGRESS) &&
 	    !((actions & MLX5_FLOW_CONTEXT_ACTION_DECAP) ||
-	      (actions & MLX5_FLOW_CONTEXT_ACTION_VLAN_POP)))
+	      (actions & MLX5_FLOW_CONTEXT_ACTION_VLAN_POP) ||
+	      (actions & MLX5_FLOW_CONTEXT_ACTION_DROP)))
 		return false;
 
 	if (actions & MLX5_FLOW_CONTEXT_ACTION_MOD_HDR)
-- 
1.8.3.1


^ permalink raw reply related

* Re: next-20190723: bpf/seccomp - systemd/journald issue?
From: Sedat Dilek @ 2019-08-01  9:35 UTC (permalink / raw)
  To: Yonghong Song, Josh Poimboeuf
  Cc: Alexei Starovoitov, Daniel Borkmann, Martin Lau, Song Liu,
	netdev@vger.kernel.org, bpf@vger.kernel.org, Clang-Built-Linux ML,
	Kees Cook, Nick Desaulniers, Nathan Chancellor
In-Reply-To: <CA+icZUW1YQqDjFCX5Ek100SbveX0Zevr7T5gbtdpcmZD+kCuZg@mail.gmail.com>

On Thu, Aug 1, 2019 at 9:39 AM Sedat Dilek <sedat.dilek@gmail.com> wrote:
>
> Hi,
>
> just want to let you know that I did a git bisect with Linux v5.3-rc2
> (where the problem also exists) and the result (details see [1]):
>
> e55a73251da335873a6e87d68fb17e5aabb8978e is the first bad commit
> commit e55a73251da335873a6e87d68fb17e5aabb8978e
> Author: Josh Poimboeuf <jpoimboe@redhat.com>
> Date:   Thu Jun 27 20:50:47 2019 -0500
>
>     bpf: Fix ORC unwinding in non-JIT BPF code
>
>     Objtool previously ignored ___bpf_prog_run() because it didn't understand
>     the jump table.  This resulted in the ORC unwinder not being able to unwind
>     through non-JIT BPF code.
>
>     Now that objtool knows how to read jump tables, remove the whitelist and
>     annotate the jump table so objtool can recognize it.
>
>     Also add an additional "const" to the jump table definition to clarify that
>     the text pointers are constant.  Otherwise GCC sets the section writable
>     flag and the assembler spits out warnings.
>
>     Fixes: d15d356887e7 ("perf/x86: Make perf callchains work without
> CONFIG_FRAME_POINTER")
>     Reported-by: Song Liu <songliubraving@fb.com>
>     Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
>     Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>     Acked-by: Alexei Starovoitov <ast@kernel.org>
>     Cc: Peter Zijlstra <peterz@infradead.org>
>     Cc: Kairui Song <kasong@redhat.com>
>     Cc: Steven Rostedt <rostedt@goodmis.org>
>     Cc: Borislav Petkov <bp@alien8.de>
>     Cc: Daniel Borkmann <daniel@iogearbox.net>
>     Link: https://lkml.kernel.org/r/881939122b88f32be4c374d248c09d7527a87e35.1561685471.git.jpoimboe@redhat.com
>     Signed-off-by: Ingo Molnar <mingo@kernel.org>
>
> :040000 040000 4735e9d14fa416c1c361ec3923440a3d586a627d
> 31de80b85c7b0292e47a719ecb6b1a451de2f8ef M      kernel
>
> Maybe you want to look at this, too.
>
> The object files are attached in [2].
>
> Thanks,
> - Sedat -
>
> [0] https://github.com/ClangBuiltLinux/linux/issues/619
> [1] https://github.com/ClangBuiltLinux/linux/issues/619#issuecomment-517152467
> [2] https://github.com/ClangBuiltLinux/linux/issues/619#issuecomment-517159635

After reverting above commit I can boot into Linux v5.3-rc2 built with
clang-9.0.0-rc1 and lld with no issues.

- Sedat -

^ permalink raw reply

* Re: [PATCH bpf-next v4 07/11] mlx5e: modify driver for handling offsets
From: Maxim Mikityanskiy @ 2019-08-01 10:05 UTC (permalink / raw)
  To: Kevin Laatz
  Cc: netdev@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net,
	bjorn.topel@intel.com, magnus.karlsson@intel.com,
	jakub.kicinski@netronome.com, jonathan.lemon@gmail.com,
	Saeed Mahameed, stephen@networkplumber.org,
	bruce.richardson@intel.com, ciara.loftus@intel.com,
	bpf@vger.kernel.org, intel-wired-lan@lists.osuosl.org
In-Reply-To: <20190730085400.10376-8-kevin.laatz@intel.com>

On 2019-07-30 11:53, Kevin Laatz wrote:
> With the addition of the unaligned chunks option, we need to make sure we
> handle the offsets accordingly based on the mode we are currently running
> in. This patch modifies the driver to appropriately mask the address for
> each case.
> 
> Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>

Please note that this patch doesn't actually add the support for the new 
feature, because the validation checks in mlx5e_rx_get_linear_frag_sz 
and mlx5e_validate_xsk_param need to be relaxed. Currently the frame 
size of PAGE_SIZE is forced, and the fragment size is increased to 
PAGE_SIZE in case of XDP (including XSK).

After making the changes required to permit frame sizes smaller than 
PAGE_SIZE, our Striding RQ feature will be used in a way we haven't used 
it before, so we need to verify with the hardware team that this usage 
is legitimate.

> ---
> v3:
>    - Use new helper function to handle offset
> 
> v4:
>    - fixed headroom addition to handle. Using xsk_umem_adjust_headroom()
>      now.
> ---
>   drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c    | 8 ++++++--
>   drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c | 3 ++-
>   2 files changed, 8 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> index b0b982cf69bb..d5245893d2c8 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> @@ -122,6 +122,7 @@ bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di,
>   		      void *va, u16 *rx_headroom, u32 *len, bool xsk)
>   {
>   	struct bpf_prog *prog = READ_ONCE(rq->xdp_prog);
> +	struct xdp_umem *umem = rq->umem;
>   	struct xdp_buff xdp;
>   	u32 act;
>   	int err;
> @@ -138,8 +139,11 @@ bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di,
>   	xdp.rxq = &rq->xdp_rxq;
>   
>   	act = bpf_prog_run_xdp(prog, &xdp);
> -	if (xsk)
> -		xdp.handle += xdp.data - xdp.data_hard_start;
> +	if (xsk) {
> +		u64 off = xdp.data - xdp.data_hard_start;
> +
> +		xdp.handle = xsk_umem_handle_offset(umem, xdp.handle, off);
> +	}
>   	switch (act) {
>   	case XDP_PASS:
>   		*rx_headroom = xdp.data - xdp.data_hard_start;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> index 6a55573ec8f2..7c49a66d28c9 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> @@ -24,7 +24,8 @@ int mlx5e_xsk_page_alloc_umem(struct mlx5e_rq *rq,
>   	if (!xsk_umem_peek_addr_rq(umem, &handle))
>   		return -ENOMEM;
>   
> -	dma_info->xsk.handle = handle + rq->buff.umem_headroom;
> +	dma_info->xsk.handle = xsk_umem_adjust_offset(umem, handle,
> +						      rq->buff.umem_headroom);
>   	dma_info->xsk.data = xdp_umem_get_data(umem, dma_info->xsk.handle);
>   
>   	/* No need to add headroom to the DMA address. In striding RQ case, we
> 


^ permalink raw reply

* [PATCH iproute2-next] ip tunnel: add json output
From: Andrea Claudi @ 2019-08-01 10:12 UTC (permalink / raw)
  To: netdev; +Cc: stephen, dsahern

Add json support on iptunnel and ip6tunnel.
The plain text output format should remain the same.

Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
---
 ip/ip6tunnel.c | 82 +++++++++++++++++++++++++++++++--------------
 ip/iptunnel.c  | 90 +++++++++++++++++++++++++++++++++-----------------
 ip/tunnel.c    | 42 +++++++++++++++++------
 3 files changed, 148 insertions(+), 66 deletions(-)

diff --git a/ip/ip6tunnel.c b/ip/ip6tunnel.c
index d7684a673fdc4..f2b9710c1320f 100644
--- a/ip/ip6tunnel.c
+++ b/ip/ip6tunnel.c
@@ -71,57 +71,90 @@ static void usage(void)
 static void print_tunnel(const void *t)
 {
 	const struct ip6_tnl_parm2 *p = t;
-	char s1[1024];
-	char s2[1024];
+	SPRINT_BUF(b1);
 
 	/* Do not use format_host() for local addr,
 	 * symbolic name will not be useful.
 	 */
-	printf("%s: %s/ipv6 remote %s local %s",
-	       p->name,
-	       tnl_strproto(p->proto),
-	       format_host_r(AF_INET6, 16, &p->raddr, s1, sizeof(s1)),
-	       rt_addr_n2a_r(AF_INET6, 16, &p->laddr, s2, sizeof(s2)));
+	open_json_object(NULL);
+	print_string(PRINT_ANY, "ifname", "%s: ", p->name);
+	snprintf(b1, sizeof(b1), "%s/ipv6", tnl_strproto(p->proto));
+	print_string(PRINT_ANY, "mode", "%s ", b1);
+	print_string(PRINT_ANY,
+		     "remote",
+		     "remote %s ",
+		     format_host_r(AF_INET6, 16, &p->raddr, b1, sizeof(b1)));
+	print_string(PRINT_ANY,
+		     "local",
+		     "local %s",
+		     rt_addr_n2a_r(AF_INET6, 16, &p->laddr, b1, sizeof(b1)));
+
 	if (p->link) {
 		const char *n = ll_index_to_name(p->link);
 
 		if (n)
-			printf(" dev %s", n);
+			print_string(PRINT_ANY, "link", " dev %s", n);
 	}
 
 	if (p->flags & IP6_TNL_F_IGN_ENCAP_LIMIT)
-		printf(" encaplimit none");
+		print_bool(PRINT_ANY,
+			   "ip6_tnl_f_ign_encap_limit",
+			   " encaplimit none",
+			   true);
 	else
-		printf(" encaplimit %u", p->encap_limit);
+		print_uint(PRINT_ANY,
+			   "encap_limit",
+			   " encaplimit %u",
+			   p->encap_limit);
 
 	if (p->hop_limit)
-		printf(" hoplimit %u", p->hop_limit);
+		print_uint(PRINT_ANY, "hoplimit", " hoplimit %u", p->hop_limit);
 	else
-		printf(" hoplimit inherit");
-
-	if (p->flags & IP6_TNL_F_USE_ORIG_TCLASS)
-		printf(" tclass inherit");
-	else {
+		print_string(PRINT_FP, "hoplimit", " hoplimit %s", "inherit");
+
+	if (p->flags & IP6_TNL_F_USE_ORIG_TCLASS) {
+		print_bool(PRINT_ANY,
+			   "ip6_tnl_f_use_orig_tclass",
+			   " tclass inherit",
+			   true);
+	} else {
 		__u32 val = ntohl(p->flowinfo & IP6_FLOWINFO_TCLASS);
 
-		printf(" tclass 0x%02x", (__u8)(val >> 20));
+		snprintf(b1, sizeof(b1), "0x%02x", (__u8)(val >> 20));
+		print_string(PRINT_ANY, "tclass", " tclass %s", b1);
 	}
 
-	if (p->flags & IP6_TNL_F_USE_ORIG_FLOWLABEL)
-		printf(" flowlabel inherit");
-	else
-		printf(" flowlabel 0x%05x", ntohl(p->flowinfo & IP6_FLOWINFO_FLOWLABEL));
+	if (p->flags & IP6_TNL_F_USE_ORIG_FLOWLABEL) {
+		print_bool(PRINT_ANY,
+			   "ip6_tnl_f_use_orig_flowlabel",
+			   " flowlabel inherit",
+			   true);
+	} else {
+		__u32 val = ntohl(p->flowinfo & IP6_FLOWINFO_FLOWLABEL);
 
-	printf(" (flowinfo 0x%08x)", ntohl(p->flowinfo));
+		snprintf(b1, sizeof(b1), "0x%05x", val);
+		print_string(PRINT_ANY, "flowlabel", " flowlabel %s", b1);
+	}
+
+	snprintf(b1, sizeof(b1), "0x%08x", ntohl(p->flowinfo));
+	print_string(PRINT_ANY, "flowinfo", " (flowinfo %s)", b1);
 
 	if (p->flags & IP6_TNL_F_RCV_DSCP_COPY)
-		printf(" dscp inherit");
+		print_bool(PRINT_ANY,
+			   "ip6_tnl_f_rcv_dscp_copy",
+			   " dscp inherit",
+			   true);
 
 	if (p->flags & IP6_TNL_F_ALLOW_LOCAL_REMOTE)
-		printf(" allow-localremote");
+		print_bool(PRINT_ANY,
+			   "ip6_tnl_f_allow_local_remote",
+			   " allow-localremote",
+			   true);
 
 	tnl_print_gre_flags(p->proto, p->i_flags, p->o_flags,
 			    p->i_key, p->o_key);
+
+	close_json_object();
 }
 
 static int parse_args(int argc, char **argv, int cmd, struct ip6_tnl_parm2 *p)
@@ -357,7 +390,6 @@ static int do_show(int argc, char **argv)
 		return -1;
 
 	print_tunnel(&p);
-	fputc('\n', stdout);
 	return 0;
 }
 
diff --git a/ip/iptunnel.c b/ip/iptunnel.c
index 66929e75c7448..84f708c727dc7 100644
--- a/ip/iptunnel.c
+++ b/ip/iptunnel.c
@@ -289,17 +289,23 @@ static void print_tunnel(const void *t)
 {
 	const struct ip_tunnel_parm *p = t;
 	struct ip_tunnel_6rd ip6rd = {};
-	char s1[1024];
-	char s2[1024];
+	SPRINT_BUF(b1);
 
 	/* Do not use format_host() for local addr,
 	 * symbolic name will not be useful.
 	 */
-	printf("%s: %s/ip remote %s local %s",
-	       p->name,
-	       tnl_strproto(p->iph.protocol),
-	       p->iph.daddr ? format_host_r(AF_INET, 4, &p->iph.daddr, s1, sizeof(s1)) : "any",
-	       p->iph.saddr ? rt_addr_n2a_r(AF_INET, 4, &p->iph.saddr, s2, sizeof(s2)) : "any");
+	open_json_object(NULL);
+	print_string(PRINT_ANY, "ifname", "%s: ", p->name);
+	snprintf(b1, sizeof(b1), "%s/ip", tnl_strproto(p->iph.protocol));
+	print_string(PRINT_ANY, "mode", "%s ", b1);
+	print_string(PRINT_ANY, "remote", "remote %s ",
+		     p->iph.daddr || is_json_context()
+			? format_host_r(AF_INET, 4, &p->iph.daddr, b1, sizeof(b1))
+			: "any");
+	print_string(PRINT_ANY, "local", "local %s",
+		     p->iph.saddr || is_json_context()
+			? rt_addr_n2a_r(AF_INET, 4, &p->iph.saddr, b1, sizeof(b1))
+			: "any");
 
 	if (p->iph.protocol == IPPROTO_IPV6 && (p->i_flags & SIT_ISATAP)) {
 		struct ip_tunnel_prl prl[16] = {};
@@ -308,54 +314,78 @@ static void print_tunnel(const void *t)
 		prl[0].datalen = sizeof(prl) - sizeof(prl[0]);
 		prl[0].addr = htonl(INADDR_ANY);
 
-		if (!tnl_prl_ioctl(SIOCGETPRL, p->name, prl))
+		if (!tnl_prl_ioctl(SIOCGETPRL, p->name, prl)) {
 			for (i = 1; i < ARRAY_SIZE(prl); i++) {
-				if (prl[i].addr != htonl(INADDR_ANY)) {
-					printf(" %s %s ",
-					       (prl[i].flags & PRL_DEFAULT) ? "pdr" : "pr",
-					       format_host(AF_INET, 4, &prl[i].addr));
-				}
+				if (prl[i].addr == htonl(INADDR_ANY))
+					continue;
+				if (prl[i].flags & PRL_DEFAULT)
+					print_string(PRINT_ANY,
+						     "pdr",
+						     " pdr %s",
+						     format_host(AF_INET, 4, &prl[i].addr));
+				else
+					print_string(PRINT_ANY,
+						     "pr",
+						     " pr %s",
+						     format_host(AF_INET, 4, &prl[i].addr));
 			}
+		}
 	}
 
 	if (p->link) {
 		const char *n = ll_index_to_name(p->link);
 
 		if (n)
-			printf(" dev %s", n);
+			print_string(PRINT_ANY, "dev", " dev %s", n);
 	}
 
 	if (p->iph.ttl)
-		printf(" ttl %u", p->iph.ttl);
+		print_uint(PRINT_ANY, "ttl", " ttl %u", p->iph.ttl);
 	else
-		printf(" ttl inherit");
+		print_string(PRINT_FP, "ttl", " ttl %s", "inherit");
 
 	if (p->iph.tos) {
-		SPRINT_BUF(b1);
-		printf(" tos");
-		if (p->iph.tos & 1)
-			printf(" inherit");
-		if (p->iph.tos & ~1)
-			printf("%c%s ", p->iph.tos & 1 ? '/' : ' ',
-			       rtnl_dsfield_n2a(p->iph.tos & ~1, b1, sizeof(b1)));
+		SPRINT_BUF(b2);
+
+		if (p->iph.tos != 1) {
+			if (!is_json_context() && p->iph.tos & 1)
+				snprintf(b2, sizeof(b2), "%s%s",
+					 p->iph.tos & 1 ? "inherit/" : "",
+					 rtnl_dsfield_n2a(p->iph.tos & ~1, b1, sizeof(b1)));
+			else
+				snprintf(b2, sizeof(b2), "%s",
+					 rtnl_dsfield_n2a(p->iph.tos, b1, sizeof(b1)));
+			print_string(PRINT_ANY, "tos", " tos %s", b2);
+		} else {
+			print_string(PRINT_FP, NULL, " tos %s", "inherit");
+		}
 	}
 
 	if (!(p->iph.frag_off & htons(IP_DF)))
-		printf(" nopmtudisc");
+		print_bool(PRINT_ANY, "nopmtudisc", " nopmtudisc", true);
 
 	if (p->iph.protocol == IPPROTO_IPV6 && !tnl_ioctl_get_6rd(p->name, &ip6rd) && ip6rd.prefixlen) {
-		printf(" 6rd-prefix %s/%u",
-		       inet_ntop(AF_INET6, &ip6rd.prefix, s1, sizeof(s1)),
-		       ip6rd.prefixlen);
+		print_string(PRINT_ANY,
+			     "6rd-prefix",
+			     " 6rd-prefix %s",
+			     inet_ntop(AF_INET6, &ip6rd.prefix, b1, sizeof(b1)));
+		print_uint(PRINT_ANY, "6rd-prefixlen", "/%u", ip6rd.prefixlen);
 		if (ip6rd.relay_prefix) {
-			printf(" 6rd-relay_prefix %s/%u",
-			       format_host(AF_INET, 4, &ip6rd.relay_prefix),
-			       ip6rd.relay_prefixlen);
+			print_string(PRINT_ANY,
+				     "6rd-relay_prefix",
+				     " 6rd-relay_prefix %s",
+				     format_host(AF_INET, 4, &ip6rd.relay_prefix));
+			print_uint(PRINT_ANY,
+				   "6rd-relay_prefixlen",
+				   "/%u",
+				   ip6rd.relay_prefixlen);
 		}
 	}
 
 	tnl_print_gre_flags(p->iph.protocol, p->i_flags, p->o_flags,
 			    p->i_key, p->o_key);
+
+	close_json_object();
 }
 
 
diff --git a/ip/tunnel.c b/ip/tunnel.c
index 41b0ef3165fdc..c2c0b9bc94356 100644
--- a/ip/tunnel.c
+++ b/ip/tunnel.c
@@ -314,24 +314,42 @@ void tnl_print_gre_flags(__u8 proto,
 {
 	if ((i_flags & GRE_KEY) && (o_flags & GRE_KEY) &&
 	    o_key == i_key) {
-		printf(" key %u", ntohl(i_key));
+		print_uint(PRINT_ANY, "key", " key %u", ntohl(i_key));
 	} else {
 		if (i_flags & GRE_KEY)
-			printf(" ikey %u", ntohl(i_key));
+			print_uint(PRINT_ANY, "ikey", " ikey %u", ntohl(i_key));
 		if (o_flags & GRE_KEY)
-			printf(" okey %u", ntohl(o_key));
+			print_uint(PRINT_ANY, "okey", " okey %u", ntohl(o_key));
 	}
 
+	open_json_array(PRINT_JSON, "flags");
 	if (proto == IPPROTO_GRE) {
-		if (i_flags & GRE_SEQ)
-			printf("%s  Drop packets out of sequence.", _SL_);
-		if (i_flags & GRE_CSUM)
-			printf("%s  Checksum in received packet is required.", _SL_);
-		if (o_flags & GRE_SEQ)
-			printf("%s  Sequence packets on output.", _SL_);
-		if (o_flags & GRE_CSUM)
-			printf("%s  Checksum output packets.", _SL_);
+		if (i_flags & GRE_SEQ) {
+			if (is_json_context())
+				print_string(PRINT_JSON, NULL, "%s", "rx_drop_ooseq");
+			else
+				printf("%s  Drop packets out of sequence.", _SL_);
+		}
+		if (i_flags & GRE_CSUM) {
+			if (is_json_context())
+				print_string(PRINT_JSON, NULL, "%s", "rx_csum");
+			else
+				printf("%s  Checksum in received packet is required.", _SL_);
+		}
+		if (o_flags & GRE_SEQ) {
+			if (is_json_context())
+				print_string(PRINT_JSON, NULL, "%s", "tx_seq");
+			else
+				printf("%s  Sequence packets on output.", _SL_);
+		}
+		if (o_flags & GRE_CSUM) {
+			if (is_json_context())
+				print_string(PRINT_JSON, NULL, "%s", "tx_csum");
+			else
+				printf("%s  Checksum output packets.", _SL_);
+		}
 	}
+	close_json_array(PRINT_JSON, NULL);
 }
 
 static void tnl_print_stats(const struct rtnl_link_stats64 *s)
@@ -417,6 +435,7 @@ static int print_nlmsg_tunnel(struct nlmsghdr *n, void *arg)
 
 int do_tunnels_list(struct tnl_print_nlmsg_info *info)
 {
+	new_json_obj(json);
 	if (rtnl_linkdump_req(&rth, preferred_family) < 0) {
 		perror("Cannot send dump request\n");
 		return -1;
@@ -426,6 +445,7 @@ int do_tunnels_list(struct tnl_print_nlmsg_info *info)
 		fprintf(stderr, "Dump terminated\n");
 		return -1;
 	}
+	delete_json_obj();
 
 	return 0;
 }
-- 
2.20.1


^ permalink raw reply related

* Re: [PATCH net] ibmveth: use net_err_ratelimited when set_multicast_list
From: Joe Perches @ 2019-08-01 10:28 UTC (permalink / raw)
  To: Hangbin Liu, netdev; +Cc: Thomas Falcon, David S . Miller
In-Reply-To: <20190801090347.8258-1-liuhangbin@gmail.com>

On Thu, 2019-08-01 at 17:03 +0800, Hangbin Liu wrote:
> When setting lots of multicast list on ibmveth, e.g. add 3000 membership on a
> multicast group, the following error message flushes our log file
> 
> 8507    [  901.478251] ibmveth 30000003 env3: h_multicast_ctrl rc=4 when adding an entry to the filter table
> ...
> 1718386 [ 5636.808658] ibmveth 30000003 env3: h_multicast_ctrl rc=4 when adding an entry to the filter table
> 
> We got 1.5 million lines of messages in 1.3h. Let's replace netdev_err() by
> net_err_ratelimited() to avoid this issue. I don't use netdev_err_once() in
> case h_multicast_ctrl() return different lpar_rc types.
> 
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> ---
>  drivers/net/ethernet/ibm/ibmveth.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/ibm/ibmveth.c b/drivers/net/ethernet/ibm/ibmveth.c
> index d654c234aaf7..3b9406a4ca92 100644
> --- a/drivers/net/ethernet/ibm/ibmveth.c
> +++ b/drivers/net/ethernet/ibm/ibmveth.c
> @@ -1446,9 +1446,11 @@ static void ibmveth_set_multicast_list(struct net_device *netdev)
>  						   IbmVethMcastAddFilter,
>  						   mcast_addr);
>  			if (lpar_rc != H_SUCCESS) {
> -				netdev_err(netdev, "h_multicast_ctrl rc=%ld "
> -					   "when adding an entry to the filter "
> -					   "table\n", lpar_rc);
> +				net_err_ratelimited("%s %s%s: h_multicast_ctrl rc=%ld when adding an entry to the filter table\n",
> +						    ibmveth_driver_name,
> +						    netdev_name(netdev),
> +						    netdev_reg_state(netdev),
> +						    lpar_rc);

Perhaps add the netdev_<level>_ratelimited variants and use that instead

Somthing like:

---
 include/linux/netdevice.h | 54 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 88292953aa6f..37116019e14f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -4737,6 +4737,60 @@ do {								\
 #define netdev_info_once(dev, fmt, ...) \
 	netdev_level_once(KERN_INFO, dev, fmt, ##__VA_ARGS__)
 
+#define netdev_level_ratelimited(netdev_level, dev, fmt, ...)		\
+do {									\
+	static DEFINE_RATELIMIT_STATE(_rs,				\
+				      DEFAULT_RATELIMIT_INTERVAL,	\
+				      DEFAULT_RATELIMIT_BURST);		\
+	if (__ratelimit(&_rs))						\
+		netdev_level(dev, fmt, ##__VA_ARGS__);			\
+} while (0)
+
+#define netdev_emerg_ratelimited(dev, fmt, ...)				\
+	netdev_level_ratelimited(netdev_emerg, dev, fmt, ##__VA_ARGS__)
+#define netdev_alert_ratelimited(dev, fmt, ...)				\
+	netdev_level_ratelimited(netdev_alert, dev, fmt, ##__VA_ARGS__)
+#define netdev_crit_ratelimited(dev, fmt, ...)				\
+	netdev_level_ratelimited(netdev_crit, dev, fmt, ##__VA_ARGS__)
+#define netdev_err_ratelimited(dev, fmt, ...)				\
+	netdev_level_ratelimited(netdev_err, dev, fmt, ##__VA_ARGS__)
+#define netdev_warn_ratelimited(dev, fmt, ...)				\
+	netdev_level_ratelimited(netdev_warn, dev, fmt, ##__VA_ARGS__)
+#define netdev_notice_ratelimited(dev, fmt, ...)			\
+	netdev_level_ratelimited(netdev_notice, dev, fmt, ##__VA_ARGS__)
+#define netdev_info_ratelimited(dev, fmt, ...)				\
+	netdev_level_ratelimited(netdev_info, dev, fmt, ##__VA_ARGS__)
+
+#if defined(CONFIG_DYNAMIC_DEBUG)
+/* descriptor check is first to prevent flooding with "callbacks suppressed" */
+#define netdev_dbg_ratelimited(dev, fmt, ...)				\
+do {									\
+	static DEFINE_RATELIMIT_STATE(_rs,				\
+				      DEFAULT_RATELIMIT_INTERVAL,	\
+				      DEFAULT_RATELIMIT_BURST);		\
+	DEFINE_DYNAMIC_DEBUG_METADATA(descriptor, fmt);			\
+	if (DYNAMIC_DEBUG_BRANCH(descriptor) &&				\
+	    __ratelimit(&_rs))						\
+		__dynamic_netdev_dbg(&descriptor, dev, fmt,		\
+				     ##__VA_ARGS__);			\
+} while (0)
+#elif defined(DEBUG)
+#define netdev_dbg_ratelimited(dev, fmt, ...)				\
+do {									\
+	static DEFINE_RATELIMIT_STATE(_rs,				\
+				      DEFAULT_RATELIMIT_INTERVAL,	\
+				      DEFAULT_RATELIMIT_BURST);		\
+	if (__ratelimit(&_rs))						\
+		netdev_printk(KERN_DEBUG, dev, fmt, ##__VA_ARGS__);	\
+} while (0)
+#else
+#define netdev_dbg_ratelimited(dev, fmt, ...)				\
+do {									\
+	if (0)								\
+		netdev_printk(KERN_DEBUG, dev, fmt, ##__VA_ARGS__);	\
+} while (0)
+#endif
+
 #define MODULE_ALIAS_NETDEV(device) \
 	MODULE_ALIAS("netdev-" device)
 



^ permalink raw reply related

* Re: [PATCH v4 1/5] vsock/virtio: limit the memory used per-socket
From: Stefano Garzarella @ 2019-08-01 10:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, linux-kernel, Stefan Hajnoczi, David S. Miller,
	virtualization, Jason Wang, kvm
In-Reply-To: <20190730163807-mutt-send-email-mst@kernel.org>

On Tue, Jul 30, 2019 at 04:42:25PM -0400, Michael S. Tsirkin wrote:
> On Tue, Jul 30, 2019 at 11:35:39AM +0200, Stefano Garzarella wrote:

(...)

> > 
> > The problem here is the compatibility. Before this series virtio-vsock
> > and vhost-vsock modules had the RX buffer size hard-coded
> > (VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE = 4K). So, if we send a buffer smaller
> > of 4K, there might be issues.
> 
> Shouldn't be if they are following the spec. If not let's fix
> the broken parts.
> 
> > 
> > Maybe it is the time to add add 'features' to virtio-vsock device.
> > 
> > Thanks,
> > Stefano
> 
> Why would a remote care about buffer sizes?
> 
> Let's first see what the issues are. If they exist
> we can either fix the bugs, or code the bug as a feature in spec.
> 

The vhost_transport '.stream_enqueue' callback
[virtio_transport_stream_enqueue()] calls the virtio_transport_send_pkt_info(),
passing the user message. This function allocates a new packet, copying
the user message, but (before this series) it limits the packet size to
the VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE (4K):

static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
					  struct virtio_vsock_pkt_info *info)
{
 ...
	/* we can send less than pkt_len bytes */
	if (pkt_len > VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE)
		pkt_len = VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE;

	/* virtio_transport_get_credit might return less than pkt_len credit */
	pkt_len = virtio_transport_get_credit(vvs, pkt_len);

	/* Do not send zero length OP_RW pkt */
	if (pkt_len == 0 && info->op == VIRTIO_VSOCK_OP_RW)
		return pkt_len;
 ...
}

then it queues the packet for the TX worker calling .send_pkt()
[vhost_transport_send_pkt() in the vhost_transport case]

The main function executed by the TX worker is
vhost_transport_do_send_pkt() that picks up a buffer from the virtqueue
and it tries to copy the packet (up to 4K) on it.  If the buffer
allocated from the guest will be smaller then 4K, I think here it will
be discarded with an error:

static void
vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
				struct vhost_virtqueue *vq)
{
 ...
		nbytes = copy_to_iter(pkt->buf, pkt->len, &iov_iter);
		if (nbytes != pkt->len) {
			virtio_transport_free_pkt(pkt);
			vq_err(vq, "Faulted on copying pkt buf\n");
			break;
		}
 ...
}


This series changes this behavior since now we will split the packet in
vhost_transport_do_send_pkt() depending on the buffer found in the
virtqueue.

We didn't change the buffer size in this series, so we still backward
compatible, but if we will use buffers smaller than 4K, we should
encounter the error described above.

How do you suggest we proceed if we want to change the buffer size?
Maybe adding a feature to "support any buffer size"?

Thanks,
Stefano

^ permalink raw reply

* Re: [PATCH] net: sctp: Rename fallthrough label to unhandled
From: Neil Horman @ 2019-08-01 10:50 UTC (permalink / raw)
  To: Joe Perches
  Cc: Vlad Yasevich, Marcelo Ricardo Leitner, David S. Miller,
	linux-sctp, netdev, linux-kernel
In-Reply-To: <d68403ce9f7e8a68fff09d6b17e5d1327eb1e12d.camel@perches.com>

On Wed, Jul 31, 2019 at 03:23:46PM -0700, Joe Perches wrote:
> On Wed, 2019-07-31 at 16:58 -0400, Neil Horman wrote:
> > On Wed, Jul 31, 2019 at 09:35:31AM -0700, Joe Perches wrote:
> > > On Wed, 2019-07-31 at 08:16 -0400, Neil Horman wrote:
> > > > On Wed, Jul 31, 2019 at 04:32:43AM -0700, Joe Perches wrote:
> > > > > On Wed, 2019-07-31 at 07:19 -0400, Neil Horman wrote:
> > > > > > On Tue, Jul 30, 2019 at 10:04:37PM -0700, Joe Perches wrote:
> > > > > > > fallthrough may become a pseudo reserved keyword so this only use of
> > > > > > > fallthrough is better renamed to allow it.
> > > > > > > 
> > > > > > > Signed-off-by: Joe Perches <joe@perches.com>
> > > > > > Are you referring to the __attribute__((fallthrough)) statement that gcc
> > > > > > supports?  If so the compiler should by all rights be able to differentiate
> > > > > > between a null statement attribute and a explicit goto and label without the
> > > > > > need for renaming here.  Or are you referring to something else?
> > > > > 
> > > > > Hi.
> > > > > 
> > > > > I sent after this a patch that adds
> > > > > 
> > > > > # define fallthrough                    __attribute__((__fallthrough__))
> > > > > 
> > > > > https://lore.kernel.org/patchwork/patch/1108577/
> > > > > 
> > > > > So this rename is a prerequisite to adding this #define.
> > > > > 
> > > > why not just define __fallthrough instead, like we do for all the other
> > > > attributes we alias (i.e. __read_mostly, __protected_by, __unused, __exception,
> > > > etc)
> > > 
> > > Because it's not as intelligible when used as a statement.
> > I think thats somewhat debatable.  __fallthrough to me looks like an internal
> > macro, whereas fallthrough looks like a comment someone forgot to /* */
> 
> 
> I'd rather see:
> 
> 	switch (foo) {
> 	case FOO:
> 		bar |= baz;
> 		fallthrough;
> 	case BAR:
> 		bar |= qux;
> 		break;
> 	default:
> 		error();
> 	}
> 
> than
> 
> 	switch (foo) {
> 	case FOO:
> 		bar |= baz;
> 		__fallthrough;
> 	case BAR:
> 		bar |= qux;
> 		break;
> 	default:
> 		error();
> 	}
> 
> or esoecially
> 
> 	switch (foo) {
> 	case FOO:
> 		bar |= baz;
> 		/* fallthrough
> */;
> 	case BAR:
> 		bar |= qux;
> 		break;
> 	default:
> 		error();
> 	}
> 
> but <shrug>, bikeshed ahoy!...
You can say that if you want, but you made the point that your think the macro
as you have written is more readable.  You example illustrates though that /*
fallthrough */ is a pretty common comment, and not prefixing it makes it look
like someone didn't add a comment that they meant to.  The __ prefix is standard
practice for defining macros to attributes (212 instances of it by my count).  I
don't mind rewriting the goto labels at all, but I think consistency is
valuable.

Neil

> 
> 
> 

^ permalink raw reply

* [PATCH net 0/2] flow_offload hardware priority fixes
From: Pablo Neira Ayuso @ 2019-08-01 11:28 UTC (permalink / raw)
  To: netfilter-devel
  Cc: davem, netdev, jakub.kicinski, marcelo.leitner, jiri, wenxu,
	saeedm, paulb, gerlitz.or

Hi,

This patchset contains two updates for the flow_offload users:

1) Pass the major tc priority to drivers so they do not have to
   lshift it. This is a preparation patch for the fix coming in
   patch #2.

2) Set the hardware priority from the netfilter basechain priority,
   some drivers break when using the existing hardware priority
   number that is set to zero.

Please, apply, thank you.

Pablo Neira Ayuso (2):
  net: sched: use major priority number as hardware priority
  netfilter: nf_tables: map basechain priority to hardware priority

 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c      |  2 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c   |  2 +-
 drivers/net/ethernet/mscc/ocelot_flower.c            | 12 +++---------
 drivers/net/ethernet/netronome/nfp/flower/qos_conf.c |  2 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c      |  2 +-
 include/net/netfilter/nf_tables_offload.h            |  2 ++
 include/net/pkt_cls.h                                |  2 +-
 net/netfilter/nf_tables_api.c                        |  4 ++++
 net/netfilter/nf_tables_offload.c                    | 18 +++++++++++++++---
 9 files changed, 29 insertions(+), 17 deletions(-)

-- 
2.11.0


^ permalink raw reply

* [PATCH net 2/2,v3] netfilter: nf_tables: map basechain priority to hardware priority
From: Pablo Neira Ayuso @ 2019-08-01 11:28 UTC (permalink / raw)
  To: netfilter-devel
  Cc: davem, netdev, jakub.kicinski, marcelo.leitner, jiri, wenxu,
	saeedm, paulb, gerlitz.or
In-Reply-To: <20190801112817.24976-1-pablo@netfilter.org>

This patch adds initial support for offloading basechains using
the priority range from -32767 to 32767. This is restricting the
netfilter priority range to 16-bit integer since this is what most
drivers assume so far from tc.

The software to hardware priority mapping is not exposed to userspace.
Hence, it should be possible to extend this range of supported priorities
later on once drivers are updated to support for 32-bit integer
priorities.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
v3: update mapping not to skip tc priority space, each subsystem has
    its own priority space. - Jakub Kicinski.

 include/net/netfilter/nf_tables_offload.h |  2 ++
 net/netfilter/nf_tables_api.c             |  4 ++++
 net/netfilter/nf_tables_offload.c         | 18 +++++++++++++++---
 3 files changed, 21 insertions(+), 3 deletions(-)

diff --git a/include/net/netfilter/nf_tables_offload.h b/include/net/netfilter/nf_tables_offload.h
index 3196663a10e3..3c31e9d55028 100644
--- a/include/net/netfilter/nf_tables_offload.h
+++ b/include/net/netfilter/nf_tables_offload.h
@@ -73,4 +73,6 @@ int nft_flow_rule_offload_commit(struct net *net);
 	(__reg)->key		= __key;				\
 	memset(&(__reg)->mask, 0xff, (__reg)->len);
 
+u32 nft_chain_offload_priority(struct nft_base_chain *basechain);
+
 #endif
diff --git a/net/netfilter/nf_tables_api.c b/net/netfilter/nf_tables_api.c
index 605a7cfe7ca7..9cf0fecf5cb9 100644
--- a/net/netfilter/nf_tables_api.c
+++ b/net/netfilter/nf_tables_api.c
@@ -1662,6 +1662,10 @@ static int nf_tables_addchain(struct nft_ctx *ctx, u8 family, u8 genmask,
 
 		chain->flags |= NFT_BASE_CHAIN | flags;
 		basechain->policy = NF_ACCEPT;
+		if (chain->flags & NFT_CHAIN_HW_OFFLOAD &&
+		    !nft_chain_offload_priority(basechain))
+			return -EOPNOTSUPP;
+
 		flow_block_init(&basechain->flow_block);
 	} else {
 		chain = kzalloc(sizeof(*chain), GFP_KERNEL);
diff --git a/net/netfilter/nf_tables_offload.c b/net/netfilter/nf_tables_offload.c
index 64f5fd5f240e..81d636fac571 100644
--- a/net/netfilter/nf_tables_offload.c
+++ b/net/netfilter/nf_tables_offload.c
@@ -103,10 +103,11 @@ void nft_offload_update_dependency(struct nft_offload_ctx *ctx,
 }
 
 static void nft_flow_offload_common_init(struct flow_cls_common_offload *common,
-					 __be16 proto,
-					struct netlink_ext_ack *extack)
+					 __be16 proto, int priority,
+					 struct netlink_ext_ack *extack)
 {
 	common->protocol = proto;
+	common->prio = priority;
 	common->extack = extack;
 }
 
@@ -124,6 +125,15 @@ static int nft_setup_cb_call(struct nft_base_chain *basechain,
 	return 0;
 }
 
+u32 nft_chain_offload_priority(struct nft_base_chain *basechain)
+{
+	if (basechain->ops.priority < SHRT_MIN ||
+	    basechain->ops.priority > SHRT_MAX)
+		return 0;
+
+	return basechain->ops.priority + abs(SHRT_MIN);
+}
+
 static int nft_flow_offload_rule(struct nft_trans *trans,
 				 enum flow_cls_command command)
 {
@@ -142,7 +152,9 @@ static int nft_flow_offload_rule(struct nft_trans *trans,
 	if (flow)
 		proto = flow->proto;
 
-	nft_flow_offload_common_init(&cls_flow.common, proto, &extack);
+	nft_flow_offload_common_init(&cls_flow.common, proto,
+				     nft_chain_offload_priority(basechain),
+				     &extack);
 	cls_flow.command = command;
 	cls_flow.cookie = (unsigned long) rule;
 	if (flow)
-- 
2.11.0


^ permalink raw reply related

* [PATCH net 1/2] net: sched: use major priority number as hardware priority
From: Pablo Neira Ayuso @ 2019-08-01 11:28 UTC (permalink / raw)
  To: netfilter-devel
  Cc: davem, netdev, jakub.kicinski, marcelo.leitner, jiri, wenxu,
	saeedm, paulb, gerlitz.or
In-Reply-To: <20190801112817.24976-1-pablo@netfilter.org>

tc transparently maps the software priority number to hardware. Update
it to pass the major priority which is what most drivers expect. Update
drivers too so they do not need to lshift the priority field of the
flow_cls_common_offload object. The stmmac driver is an exception, since
this code assumes the tc software priority is fine, therefore, lshift it
just to be conservative.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
v1: initial version.

 drivers/net/ethernet/mellanox/mlx5/core/en_tc.c      |  2 +-
 drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c   |  2 +-
 drivers/net/ethernet/mscc/ocelot_flower.c            | 12 +++---------
 drivers/net/ethernet/netronome/nfp/flower/qos_conf.c |  2 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c      |  2 +-
 include/net/pkt_cls.h                                |  2 +-
 6 files changed, 8 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
index cc096f6011d9..744c0c640c10 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
@@ -3158,7 +3158,7 @@ mlx5e_flow_esw_attr_init(struct mlx5_esw_flow_attr *esw_attr,
 
 	esw_attr->parse_attr = parse_attr;
 	esw_attr->chain = f->common.chain_index;
-	esw_attr->prio = TC_H_MAJ(f->common.prio) >> 16;
+	esw_attr->prio = f->common.prio;
 
 	esw_attr->in_rep = in_rep;
 	esw_attr->in_mdev = in_mdev;
diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c
index e8ac90564dbe..84a87d059333 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_acl.c
@@ -471,7 +471,7 @@ int mlxsw_sp_acl_rulei_commit(struct mlxsw_sp_acl_rule_info *rulei)
 void mlxsw_sp_acl_rulei_priority(struct mlxsw_sp_acl_rule_info *rulei,
 				 unsigned int priority)
 {
-	rulei->priority = priority >> 16;
+	rulei->priority = priority;
 }
 
 void mlxsw_sp_acl_rulei_keymask_u32(struct mlxsw_sp_acl_rule_info *rulei,
diff --git a/drivers/net/ethernet/mscc/ocelot_flower.c b/drivers/net/ethernet/mscc/ocelot_flower.c
index 59487d446a09..b894bc0c9c16 100644
--- a/drivers/net/ethernet/mscc/ocelot_flower.c
+++ b/drivers/net/ethernet/mscc/ocelot_flower.c
@@ -13,12 +13,6 @@ struct ocelot_port_block {
 	struct ocelot_port *port;
 };
 
-static u16 get_prio(u32 prio)
-{
-	/* prio starts from 0x1000 while the ids starts from 0 */
-	return prio >> 16;
-}
-
 static int ocelot_flower_parse_action(struct flow_cls_offload *f,
 				      struct ocelot_ace_rule *rule)
 {
@@ -168,7 +162,7 @@ static int ocelot_flower_parse(struct flow_cls_offload *f,
 	}
 
 finished_key_parsing:
-	ocelot_rule->prio = get_prio(f->common.prio);
+	ocelot_rule->prio = f->common.prio;
 	ocelot_rule->id = f->cookie;
 	return ocelot_flower_parse_action(f, ocelot_rule);
 }
@@ -218,7 +212,7 @@ static int ocelot_flower_destroy(struct flow_cls_offload *f,
 	struct ocelot_ace_rule rule;
 	int ret;
 
-	rule.prio = get_prio(f->common.prio);
+	rule.prio = f->common.prio;
 	rule.port = port_block->port;
 	rule.id = f->cookie;
 
@@ -236,7 +230,7 @@ static int ocelot_flower_stats_update(struct flow_cls_offload *f,
 	struct ocelot_ace_rule rule;
 	int ret;
 
-	rule.prio = get_prio(f->common.prio);
+	rule.prio = f->common.prio;
 	rule.port = port_block->port;
 	rule.id = f->cookie;
 	ret = ocelot_ace_rule_stats_update(&rule);
diff --git a/drivers/net/ethernet/netronome/nfp/flower/qos_conf.c b/drivers/net/ethernet/netronome/nfp/flower/qos_conf.c
index 86e968cd5ffd..124a43dc136a 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/qos_conf.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/qos_conf.c
@@ -93,7 +93,7 @@ nfp_flower_install_rate_limiter(struct nfp_app *app, struct net_device *netdev,
 		return -EOPNOTSUPP;
 	}
 
-	if (flow->common.prio != (1 << 16)) {
+	if (flow->common.prio != 1) {
 		NL_SET_ERR_MSG_MOD(extack, "unsupported offload: qos rate limit offload requires highest priority");
 		return -EOPNOTSUPP;
 	}
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c
index 58ea18af9813..5cd040215469 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_tc.c
@@ -94,7 +94,7 @@ static int tc_fill_entry(struct stmmac_priv *priv,
 	struct stmmac_tc_entry *entry, *frag = NULL;
 	struct tc_u32_sel *sel = cls->knode.sel;
 	u32 off, data, mask, real_off, rem;
-	u32 prio = cls->common.prio;
+	u32 prio = cls->common.prio << 16;
 	int ret;
 
 	/* Only 1 match per entry */
diff --git a/include/net/pkt_cls.h b/include/net/pkt_cls.h
index e429809ca90d..98be18ef1ed3 100644
--- a/include/net/pkt_cls.h
+++ b/include/net/pkt_cls.h
@@ -646,7 +646,7 @@ tc_cls_common_offload_init(struct flow_cls_common_offload *cls_common,
 {
 	cls_common->chain_index = tp->chain->index;
 	cls_common->protocol = tp->protocol;
-	cls_common->prio = tp->prio;
+	cls_common->prio = tp->prio >> 16;
 	if (tc_skip_sw(flags) || flags & TCA_CLS_FLAGS_VERBOSE)
 		cls_common->extack = extack;
 }
-- 
2.11.0


^ permalink raw reply related

* Re: [PATCH net-next] net/mlx5e: Allow dropping specific tunnel packets
From: Roi Dayan @ 2019-08-01 11:40 UTC (permalink / raw)
  To: xiangxia.m.yue@gmail.com, Saeed Mahameed; +Cc: netdev@vger.kernel.org
In-Reply-To: <1564648859-17369-1-git-send-email-xiangxia.m.yue@gmail.com>



On 2019-08-01 11:40 AM, xiangxia.m.yue@gmail.com wrote:
> From: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> 
> In some case, we don't want to allow specific tunnel packets
> to host that can avoid to take up high CPU (e.g network attacks).
> But other tunnel packets which not matched in hardware will be
> sent to host too.
> 
>     $ tc filter add dev vxlan_sys_4789 \
> 	    protocol ip chain 0 parent ffff: prio 1 handle 1 \
> 	    flower dst_ip 1.1.1.100 ip_proto tcp dst_port 80 \
> 	    enc_dst_ip 2.2.2.100 enc_key_id 100 enc_dst_port 4789 \
> 	    action tunnel_key unset pipe action drop
> 
> Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> index f3ed028..25d423e 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tc.c
> @@ -2485,7 +2485,8 @@ static bool actions_match_supported(struct mlx5e_priv *priv,
>  
>  	if (flow_flag_test(flow, EGRESS) &&
>  	    !((actions & MLX5_FLOW_CONTEXT_ACTION_DECAP) ||
> -	      (actions & MLX5_FLOW_CONTEXT_ACTION_VLAN_POP)))
> +	      (actions & MLX5_FLOW_CONTEXT_ACTION_VLAN_POP) ||
> +	      (actions & MLX5_FLOW_CONTEXT_ACTION_DROP)))
>  		return false;
>  
>  	if (actions & MLX5_FLOW_CONTEXT_ACTION_MOD_HDR)
> 

thanks!

Reviewed-by: Roi Dayan <roid@mellanox.com>

^ permalink raw reply

* Re: [PATCH net] mvpp2: fix panic on module removal
From: Matteo Croce @ 2019-08-01 11:46 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: netdev, Miquel Raynal, LKML, Lorenzo Bianconi, Maxime Chevallier,
	David S. Miller
In-Reply-To: <20190801071801.GF3579@kwain>

On Thu, Aug 1, 2019 at 9:18 AM Antoine Tenart
<antoine.tenart@bootlin.com> wrote:
>
> Hi Matteo,
>
> On Wed, Jul 31, 2019 at 08:31:16PM +0200, Matteo Croce wrote:
> >
> > diff --git a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
> > index c51f1d5b550b..5002d51fc9d6 100644
> > --- a/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
> > +++ b/drivers/net/ethernet/marvell/mvpp2/mvpp2_main.c
> > @@ -5760,7 +5760,6 @@ static int mvpp2_remove(struct platform_device *pdev)
> >       mvpp2_dbgfs_cleanup(priv);
> >
> >       flush_workqueue(priv->stats_queue);
> > -     destroy_workqueue(priv->stats_queue);
> >
> >       fwnode_for_each_available_child_node(fwnode, port_fwnode) {
> >               if (priv->port_list[i]) {
> > @@ -5770,6 +5769,8 @@ static int mvpp2_remove(struct platform_device *pdev)
> >               i++;
> >       }
>
> Shouldn't you also move flush_workqueue() here?
>

I think that that flush it's unneeded at all, as all port remove calls
cancel_delayed_work_sync().

I tried removing it and it doesn't crash on rmmod.

--
Matteo Croce
per aspera ad upstream

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox