Netdev List
 help / color / mirror / Atom feed
* [PATCH net] bpf, arm64: fix faulty emission of map access in tail calls
From: Daniel Borkmann @ 2017-05-10 23:53 UTC (permalink / raw)
  To: davem
  Cc: alexei.starovoitov, illusionist.neo, zlim.lnx, catalin.marinas,
	netdev, linux-arm-kernel, Daniel Borkmann

Shubham was recently asking on netdev why in arm64 JIT we don't multiply
the index for accessing the tail call map by 8. That led me into testing
out arm64 JIT wrt tail calls and it turned out I got a NULL pointer
dereference on the tail call.

The buggy access is at:

  prog = array->ptrs[index];
  if (prog == NULL)
      goto out;

  [...]
  00000060:  d2800e0a  mov x10, #0x70 // #112
  00000064:  f86a682a  ldr x10, [x1,x10]
  00000068:  f862694b  ldr x11, [x10,x2]
  0000006c:  b40000ab  cbz x11, 0x00000080
  [...]

The code triggering the crash is f862694b. x1 at the time contains the
address of the bpf array, x10 offsetof(struct bpf_array, ptrs). Meaning,
above we load the pointer to the program at map slot 0 into x10. x10
can then be NULL if the slot is not occupied, which we later on try to
access with a user given offset in x2 that is the map index.

Fix this by emitting the following instead:

  [...]
  00000060:  d2800e0a  mov x10, #0x70 // #112
  00000064:  8b0a002a  add x10, x1, x10
  00000068:  d37df04b  lsl x11, x2, #3
  0000006c:  f86b694b  ldr x11, [x10,x11]
  00000070:  b40000ab  cbz x11, 0x00000084
  [...]

This basically adds the offset to ptrs to the base address of the bpf
array we got and we later on access the map with an index * 8 offset
relative to that. The tail call map itself is basically one large area
with meta data at the head followed by the array of prog pointers.
This makes tail calls working again, tested on Cavium ThunderX ARMv8.

Fixes: ddb55992b04d ("arm64: bpf: implement bpf_tail_call() helper")
Reported-by: Shubham Bansal <illusionist.neo@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
---
 arch/arm64/net/bpf_jit_comp.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/net/bpf_jit_comp.c b/arch/arm64/net/bpf_jit_comp.c
index c6e5358..71f9305 100644
--- a/arch/arm64/net/bpf_jit_comp.c
+++ b/arch/arm64/net/bpf_jit_comp.c
@@ -253,8 +253,9 @@ static int emit_bpf_tail_call(struct jit_ctx *ctx)
 	 */
 	off = offsetof(struct bpf_array, ptrs);
 	emit_a64_mov_i64(tmp, off, ctx);
-	emit(A64_LDR64(tmp, r2, tmp), ctx);
-	emit(A64_LDR64(prg, tmp, r3), ctx);
+	emit(A64_ADD(1, tmp, r2, tmp), ctx);
+	emit(A64_LSL(1, prg, r3, 3), ctx);
+	emit(A64_LDR64(prg, tmp, prg), ctx);
 	emit(A64_CBZ(1, prg, jmp_offset), ctx);
 
 	/* goto *(prog->bpf_func + prologue_size); */
-- 
1.9.3

^ permalink raw reply related

* [PATCH net] tcp: avoid fragmenting peculiar skbs in SACK
From: Yuchung Cheng @ 2017-05-11  0:01 UTC (permalink / raw)
  To: davem; +Cc: netdev, ncardwell, edumazet, soheil, Yuchung Cheng

This patch fixes a bug in splitting an SKB during SACK
processing. Specifically if an skb contains multiple
packets and is only partially sacked in the higher sequences,
tcp_match_sack_to_skb() splits the skb and marks the second fragment
as SACKed.

The current code further attempts rounding up the first fragment
to MSS boundaries. But it misses a boundary condition when the
rounded-up fragment size (pkt_len) is exactly skb size.  Spliting
such an skb is pointless and causses a kernel warning and aborts
the SACK processing. This patch universally checks such over-split
before calling tcp_fragment to prevent these unnecessary warnings.

Fixes: adb92db857ee ("tcp: Make SACK code to split only at mss boundaries")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 net/ipv4/tcp_input.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5a3ad09e2786..06e2dbc2b4a2 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1179,13 +1179,14 @@ static int tcp_match_skb_to_sack(struct sock *sk, struct sk_buff *skb,
 		 */
 		if (pkt_len > mss) {
 			unsigned int new_len = (pkt_len / mss) * mss;
-			if (!in_sack && new_len < pkt_len) {
+			if (!in_sack && new_len < pkt_len)
 				new_len += mss;
-				if (new_len >= skb->len)
-					return 0;
-			}
 			pkt_len = new_len;
 		}
+
+		if (pkt_len >= skb->len && !in_sack)
+			return 0;
+
 		err = tcp_fragment(sk, skb, pkt_len, mss, GFP_ATOMIC);
 		if (err < 0)
 			return err;
-- 
2.13.0.rc2.291.g57267f2277-goog

^ permalink raw reply related

* Re: [PATCH 1/2] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING
From: Ding Tianhong @ 2017-05-11  1:15 UTC (permalink / raw)
  To: Casey Leedom, Alexander Duyck
  Cc: Mark Rutland, Gabriele Paoloni, Asit K Mallick, Catalin Marinas,
	Will Deacon, LinuxArm, Raj, Ashok, Bjorn Helgaas, Ganesh GR,
	Jeff Kirsher, Bob Shaw, Patrick J Cramer, Arjun V.,
	Michael Werner, linux-arm-kernel@lists.infradead.org, Amir Ancel,
	Netdev, David Laight, Suravee Suthikulpanit, Robin Murphy,
	David Miller, h
In-Reply-To: <MWHPR12MB1600AC957C7E3B0DB4F06D6CC8EE0@MWHPR12MB1600.namprd12.prod.outlook.com>



On 2017/5/9 8:48, Casey Leedom wrote:
> 
> | From: Alexander Duyck <alexander.duyck@gmail.com>
> | Date: Saturday, May 6, 2017 11:07 AM
> | 
> | | From: Ding Tianhong <dingtianhong@huawei.com>
> | | Date: Fri, May 5, 2017 at 8:08 PM
> | | 
> | | According the suggestion, I could only think of this code:
> | | ..
> | 
> | This is a bit simplistic but it is a start.
> 
>   Yes, something tells me that this is going to be more complicated than any
> of us like ...
> 
> | The other bit I was getting at is that we need to update the core PCIe
> | code so that when we configure devices and the root complex reports no
> | support for relaxed ordering it should be clearing the relaxed
> | ordering bits in the PCIe configuration registers on the upstream
> | facing devices.
> 
>   Of course, this can be written to by the Driver at any time ... and is in
> the case of the cxgb4 Driver ...
> 
>   After a lot of rummaging around, it does look like KVM prohibits writes to
> the PCIe Capability Device Control register in drivers/xen/xen-pciback/
> conf_space_capability.c and conf_space.c simply because writes aren't
> allowed unless "permissive" is set.  So it ~looks~ like a driver running in
> a Virtual Machine can't turn Enable Relaxed Ordering back on ...
> 
> | The last bit we need in all this is a way to allow for setups where
> | peer-to-peer wants to perform relaxed ordering but for writes to the
> | host we have to not use relaxed ordering. For that we need to enable a
> | special case and that isn't handled right now in any of the solutions
> | we have coded up so far.
> 
>   Yes, we do need this.
> 
> 
> | From: Alexander Duyck <alexander.duyck@gmail.com>
> | Date: Saturday, May 8, 2017 08:22 AM
> |
> | The problem is we need to have something that can be communicated
> | through a VM. Your change doesn't work in that regard. That was why I
> | suggested just updating the code so that we when we initialized PCIe
> | devices what we do is either set or clear the relaxed ordering bit in
> | the PCIe device control register. That way when we direct assign an
> | interface it could know just based on the bits int the PCIe
> | configuration if it could use relaxed ordering or not.
> | 
> | At that point the driver code itself becomes very simple since you
> | could just enable the relaxed ordering by default in the igb/ixgbe
> | driver and if the bit is set or cleared in the PCIe configuration then
> | we are either sending with relaxed ordering requests or not and don't
> | have to try and locate the root complex.
> | 
> | So from the sound of it Casey has a special use case where he doesn't
> | want to send relaxed ordering frames to the root complex, but instead
> | would like to send them to another PCIe device. To do that he needs to
> | have a way to enable the relaxed ordering bit in the PCIe
> | configuration but then not send any to the root complex. Odds are that
> | is something he might be able to just implement in the driver, but is
> | something that may become a more general case in the future. I don't
> | see our change here impacting it as long as we keep the solution
> | generic and mostly confined to when we instantiate the devices as the
> | driver could likely make the decision to change the behavior later.
> 
>   It's not just me.  Intel has said that while RO directed at the Root
> Complex Host Coherent Memory has a performance bug (not Data Corruption),
> it's a performance win for Peer-to-Peer writes to MMIO Space.  (I'll be very
> interested in hearing what the bug is if we get that much detail.  The very
> same TLPs directed to the Root Complex Port without Relaxed Ordering set get
> good performance.  So this is essentially a bug in the hardware that was
> ~trying~ to implement a performance win.)
> 
>   Meanwhile, I currently only know of a single PCIe End Point which causes
> catastrophic results: the AMD A1100 ARM SoC ("SEATTLE").  And it's not even
> clear that product is even alive anymore since I haven't been able to get
> any responses from them for several months.
> 
>   What I'm saying is: let's try to architect a solution which doesn't throw
> the baby out with the bath water ...
> 
>   I think that if a Device's Root Complex Port has problems with Relaxed
> Ordering, it ~probably~ makes sense to turn off the PCIe Capability Device
> Control[Enable Relaxed Ordering] when we assign a device to a Virtual
> Machine since the Device Driver can no longer query the Relaxed Ordering
> Support of the Root Complex Port.  The only down side of this would be if we
> assigned two Peers to a VM in an application which wanted to do Peer-to-Peer
> transfers.  But that seems like a hard application to support in any case
> since the PCI Bus:Slot.Function IDs for assigned Devices within a VM don't
> match the actual values.
> 
>   For Devices running in the base OS/Hypervisor, their Drivers can query the
> Relaxed Ordering Support for the Root Complex Port or a Peer Device.  So a
> simple flag within the (struct pci_dev *)->dev_flags would serve for that
> along with a per-Architecture/Platform mechanism for setting it ...
> 

Hi Casey:

Will you continue to work on this solution or send a new version patches?

Thanks
Ding
> Casey
> 
> .
> 

^ permalink raw reply

* Re: [PATCH net] tcp: avoid fragmenting peculiar skbs in SACK
From: Neal Cardwell @ 2017-05-11  1:33 UTC (permalink / raw)
  To: Yuchung Cheng; +Cc: David Miller, Netdev, Eric Dumazet, Soheil Hassas Yeganeh
In-Reply-To: <20170511000127.4249-1-ycheng@google.com>

On Wed, May 10, 2017 at 8:01 PM, Yuchung Cheng <ycheng@google.com> wrote:
>
> This patch fixes a bug in splitting an SKB during SACK
> processing. Specifically if an skb contains multiple
> packets and is only partially sacked in the higher sequences,
> tcp_match_sack_to_skb() splits the skb and marks the second fragment
> as SACKed.
>
> The current code further attempts rounding up the first fragment
> to MSS boundaries. But it misses a boundary condition when the
> rounded-up fragment size (pkt_len) is exactly skb size.  Spliting
> such an skb is pointless and causses a kernel warning and aborts
> the SACK processing. This patch universally checks such over-split
> before calling tcp_fragment to prevent these unnecessary warnings.
>
> Fixes: adb92db857ee ("tcp: Make SACK code to split only at mss boundaries")
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>

Acked-by: Neal Cardwell <ncardwell@google.com>

neal

^ permalink raw reply

* Re: openvswitch MTU patch needed in 4.10 stable
From: David Miller @ 2017-05-11  2:40 UTC (permalink / raw)
  To: stephen; +Cc: netdev
In-Reply-To: <20170509084656.53f2d590@xeon-e3>

From: Stephen Hemminger <stephen@networkplumber.org>
Date: Tue, 9 May 2017 08:46:56 -0700

> Could you queue the patch to stable?

It speeds things along if you actually specify the SHA1 ID of the
specific commit being requested for a -stable backport.

I did this work, but you could have taken a brief amount of time so
that this would not have been necessary.

Thanks.

^ permalink raw reply

* Re: DQL and TCQ_F_CAN_BYPASS destroy performance under virtualizaiton (Was: "Re: net_sched strange in 4.11")
From: Jason Wang @ 2017-05-11  2:43 UTC (permalink / raw)
  To: Anton Ivanov, Stefan Hajnoczi; +Cc: David S. Miller, netdev, Michael S. Tsirkin
In-Reply-To: <c9fe2748-0a14-17f4-3742-3184f0b2831b@cambridgegreys.com>



On 2017年05月10日 17:42, Anton Ivanov wrote:
> On 10/05/17 09:56, Jason Wang wrote:
>>
>>
>> On 2017年05月10日 13:28, Anton Ivanov wrote:
>>> On 10/05/17 03:18, Jason Wang wrote:
>>>>
>>>> On 2017年05月09日 23:11, Stefan Hajnoczi wrote:
>>>>> On Tue, May 09, 2017 at 08:46:46AM +0100, Anton Ivanov wrote:
>>>>>> I have figured it out. Two issues.
>>>>>>
>>>>>> 1) skb->xmit_more is hardly ever set under virtualization because
>>>>>> the qdisc
>>>>>> is usually bypassed because of TCQ_F_CAN_BYPASS. Once
>>>>>> TCQ_F_CAN_BYPASS is
>>>>>> set a virtual NIC driver is not likely see skb->xmit_more (this
>>>>>> answers my
>>>>>> "how does this work at all" question).
>>>>>>
>>>>>> 2) If that flag is turned off (I patched sched_generic to turn it
>>>>>> off in
>>>>>> pfifo_fast while testing), DQL keeps xmit_more from being set. If
>>>>>> the driver
>>>>>> is not DQL enabled xmit_more is never ever set. If the driver is DQL
>>>>>> enabled
>>>>>> the queue is adjusted to ensure xmit_more stops happening within
>>>>>> 10-15 xmit
>>>>>> cycles.
>>>>>>
>>>>>> That is plain *wrong* for virtual NICs - virtio, emulated NICs, etc.
>>>>>> There,
>>>>>> the BIG cost is telling the hypervisor that it needs to "kick" the
>>>>>> packets.
>>>>>> The cost of putting them into the vNIC buffers is negligible. You 
>>>>>> want
>>>>>> xmit_more to happen - it makes between 50% and 300% (depending on 
>>>>>> vNIC
>>>>>> design) difference. If there is no xmit_more the vNIC will 
>>>>>> immediately
>>>>>> "kick" the hypervisor and try to signal that  the packet needs to 
>>>>>> move
>>>>>> straight away (as for example in virtio_net).
>>>> How do you measure the performance? TCP or just measure pps?
>>> In this particular case - tcp from guest. I have a couple of other
>>> benchmarks (forwarding, etc).
>>
>> One more question, is the number for virtio-net or other emulated vNIC?
>
> Other for now - you are cc-ed to keep you in the loop.
>
> Virtio is next on my list - I am revisiting the l2tpv3.c driver in 
> QEMU and looking at how to preserve bulking by adding back sendmmsg 
> (as well as a list of other features/transports).
>
> We had sendmmsg removed for the final inclusion in QEMU 2.1, it 
> presently uses only recvmmsg so for the time being it does not care. 
> That will most likely change once it starts using sendmmsg as well.

An issue is that qemu net API does not support bulking, do you plan to 
add it?

Thanks

^ permalink raw reply

* Re: [PATCH net-next V4 10/10] vhost_net: try batch dequing from skb array
From: Jason Wang @ 2017-05-11  2:47 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel
In-Reply-To: <20170510152848-mutt-send-email-mst@kernel.org>



On 2017年05月10日 20:34, Michael S. Tsirkin wrote:
> On Wed, May 10, 2017 at 11:36:22AM +0800, Jason Wang wrote:
>> We used to dequeue one skb during recvmsg() from skb_array, this could
>> be inefficient because of the bad cache utilization and spinlock
>> touching for each packet. This patch tries to batch them by calling
>> batch dequeuing helpers explicitly on the exported skb array and pass
>> the skb back through msg_control for underlayer socket to finish the
>> userspace copying.
>>
>> Batch dequeuing is also the requirement for more batching improvement
>> on rx.
>>
>> Tests were done by pktgen on tap with XDP1 in guest on top of batch
>> zeroing:
>>
>> rx batch | pps
>>
>> 256        2.41Mpps (+6.16%)
>> 128        2.48Mpps (+8.80%)
>> 64         2.38Mpps (+3.96%) <- Default
>> 16         2.31Mpps (+1.76%)
>> 4          2.31Mpps (+1.76%)
>> 1          2.30Mpps (+1.32%)
>> 0          2.27Mpps (+7.48%)
>>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>> ---
>>   drivers/vhost/net.c | 117 +++++++++++++++++++++++++++++++++++++++++++++++++---
>>   1 file changed, 111 insertions(+), 6 deletions(-)
>>
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index 9b51989..fbaecf3 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -28,6 +28,8 @@
>>   #include <linux/if_macvlan.h>
>>   #include <linux/if_tap.h>
>>   #include <linux/if_vlan.h>
>> +#include <linux/skb_array.h>
>> +#include <linux/skbuff.h>
>>   
>>   #include <net/sock.h>
>>   
>> @@ -85,6 +87,13 @@ struct vhost_net_ubuf_ref {
>>   	struct vhost_virtqueue *vq;
>>   };
>>   
>> +#define VHOST_RX_BATCH 64
>> +struct vhost_net_buf {
>> +	struct sk_buff *queue[VHOST_RX_BATCH];
>> +	int tail;
>> +	int head;
>> +};
>> +
> Do you strictly need to put this inline? This structure is quite big
> already. Do you see a measureabe difference if you make it
>
> 	struct sk_buff **queue;
> 	int tail;
> 	int head;
>
> ?

I don't.

>
> Will also make it easier to play with the size in the future
> should someone want to see how does it work e.g. for different
> ring sizes.
>

Ok, will do this in next version

Thanks

^ permalink raw reply

* Re: [PATCH] libertas: Avoid reading past end of buffer
From: Kalle Valo @ 2017-05-11  3:45 UTC (permalink / raw)
  To: Joe Perches
  Cc: Kees Cook, netdev, libertas-dev, linux-wireless, Daniel Micay,
	linux-kernel
In-Reply-To: <1494457506.2028.1.camel@perches.com>

Joe Perches <joe@perches.com> writes:

> unrelated trivia:
>
> lbs_deb_enter is used incorrectly here at
> function exit as both enter and leave calls.
>
> That type of copy/paste defect may be common.
>
> $ git grep -w lbs_deb_enter | wc -l
> 148
> $ git grep -w lbs_deb_leave | wc -l
> 71
>
> One would expect these numbers to be the same.
>
> Another option would be to delete all these
> calls as ftrace function tracing works well.

Yeah, deleting all the enter/exit calls would be better.

-- 
Kalle Valo

^ permalink raw reply

* Re: [PATCH] net: fec: select queue depending on VLAN priority
From: Stefan Agner @ 2017-05-11  4:07 UTC (permalink / raw)
  To: Andy Duan; +Cc: David Miller, andrew, festevam, netdev, linux-kernel
In-Reply-To: <AM4PR0401MB2260BA1BC2F65B36284CF7D2FFEC0@AM4PR0401MB2260.eurprd04.prod.outlook.com>

On 2017-05-09 19:42, Andy Duan wrote:
> From: David Miller <davem@davemloft.net> Sent: Tuesday, May 09, 2017 9:39 PM
>>To: stefan@agner.ch
>>Cc: Andy Duan <fugang.duan@nxp.com>; andrew@lunn.ch;
>>festevam@gmail.com; netdev@vger.kernel.org; linux-
>>kernel@vger.kernel.org
>>Subject: Re: [PATCH] net: fec: select queue depending on VLAN priority
>>
>>From: Stefan Agner <stefan@agner.ch>
>>Date: Mon,  8 May 2017 22:37:08 -0700
>>
>>> Since the addition of the multi queue code with commit 59d0f7465644
>>> ("net: fec: init multi queue date structure") the queue selection has
>>> been handelt by the default transmit queue selection implementation
>>> which tries to evenly distribute the traffic across all available
>>> queues. This selection presumes that the queues are using an equal
>>> priority, however, the queues 1 and 2 are actually of higher priority
>>> (the classification of the queues is enabled in fec_enet_enable_ring).
>>>
>>> This can lead to net scheduler warnings and continuous TX ring dumps
>>> when exercising the system with iperf.
>>>
>>> Use only queue 0 for all common traffic (no VLAN and P802.1p priority
>>> 0 and 1) and route level 2-7 through queue 1 and 2.
>>>
>>> Signed-off-by: Fugang Duan <fugang.duan@nxp.com>
>>> Fixes: 59d0f7465644 ("net: fec: init multi queue date structure")
>>
>>If the queues are used for prioritization, and it does not have multiple normal
>>priority level queues, multiqueue is not what the driver should have
>>implemented.
> Firstly, HW multiple queues support:
> 	- Traffic-shaping bandwidth distribution supports credit-based and
> round-robin-based policies. Either policy can be combined with
> time-based shaping.
> 	- AVB (Audio Video Bridging, IEEE 802.1Qav) features:
> 		* Credit-based bandwidth distribution policy can be combined with
> time-based shaping
> 		* AVB endpoint talker and listener support
> 		* Support for arbitration between different priority traffic (for
> example, AVB class A, AVB class B, and non-AVB)
> Round-robin-based policies:
> 	It has the same priority for three queues: In the round-robin QoS
> scheme, each queue is given an equal opportunity to transmit one
> frame. For example, if queue n has a frame to transmit, the queue
> transmits its frame. After queue n has transmitted its frame, or if
> queue n does not have a frame to transmit, queue n+1 is then allowed
> to transmit its frame, and so on.
> 
> Credit-based policies:
> 	The AVB credit based shaper acts independently, per class, to control
> the bandwidth distribution between normal traffic and time-sensitive
> traffic with respect to the total link bandwidth available.
> 	Credit based shaper conbined with time-based shaping:  
> 		- priority: ClassA queue > ClassB queue > best effort
> 		- ensure the queue bandwidth as user set based on time-based shaping
> algorithms (transmitter transmit frame from three queue in turn based
> on time-based shaping algorithms)
> 	And in real AVB case,  each streaming can be independent, and are
> fixed on related queue. Then driver level should implement
> .ndo_select_queue() to put the streaming into related queue. That is
> what the patch did.
> 
> The current driver config the three queue to credit-based policies
> (AVB), the patch seems no problem for the implementation. Do you have
> any suggestion ?
> 

I tried using the round robin mode by adding this:

+       /* Set Round-Robin policy */
+       writel(1, fep->hwp + FEC_QOS_SCHEME);

After a while I got the warning non the less:

WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316
dev_watchdog+0x248/0x24c
NETDEV WATCHDOG: eth0 (fec): transmit queue 2 timed out
Modules linked in:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted
4.11.0-rc1-00058-g56d22eced8bc-dirty #377
Hardware name: Freescale i.MX7 Dual (Device Tree)
[<c02266b0>] (unwind_backtrace) from [<c0222d7c>] (show_stack+0x10/0x14)
[<c0222d7c>] (show_stack) from [<c04d4098>] (dump_stack+0x78/0x8c)
[<c04d4098>] (dump_stack) from [<c0236548>] (__warn+0xe8/0x100)
[<c0236548>] (__warn) from [<c0236598>] (warn_slowpath_fmt+0x38/0x48)
[<c0236598>] (warn_slowpath_fmt) from [<c0805904>]
(dev_watchdog+0x248/0x24c)
[<c0805904>] (dev_watchdog) from [<c028a0e8>] (call_timer_fn+0x28/0x98)
[<c028a0e8>] (call_timer_fn) from [<c028a1f8>] (expire_timers+0xa0/0xac)
[<c028a1f8>] (expire_timers) from [<c028a2a0>]
(run_timer_softirq+0x9c/0x194)
[<c028a2a0>] (run_timer_softirq) from [<c023aaf8>]
(__do_softirq+0x114/0x234)
[<c023aaf8>] (__do_softirq) from [<c023aee4>] (irq_exit+0xcc/0x108)
[<c023aee4>] (irq_exit) from [<c0279920>]
(__handle_domain_irq+0x80/0xec)
[<c0279920>] (__handle_domain_irq) from [<c0201544>]
(gic_handle_irq+0x48/0x8c)
[<c0201544>] (gic_handle_irq) from [<c0223838>] (__irq_svc+0x58/0x8c)
Exception stack(0xc1001f28 to 0xc1001f70)
1f20:                   00000001 00000000 00000000 c022fdc0 c1000000
c1003d80
1f40: c1003d34 c0e72ed0 c0bd9b04 c1001f80 00000000 00000000 00000000
c1001f78
1f60: c022048c c0220490 60000013 ffffffff
[<c0223838>] (__irq_svc) from [<c0220490>] (arch_cpu_idle+0x38/0x3c)
[<c0220490>] (arch_cpu_idle) from [<c026ec60>] (do_idle+0x170/0x204)
[<c026ec60>] (do_idle) from [<c026efac>] (cpu_startup_entry+0x18/0x1c)
[<c026efac>] (cpu_startup_entry) from [<c0e00c88>]
(start_kernel+0x394/0x3a0)
---[ end trace a474f341d40e0705 ]---
fec 30be0000.ethernet eth0: TX ring dump

I disabled the regular ring dump and only printed one line. It seems to
come up every 2 seconds, and checking cat /proc/interrupts showed that
queue 2 stayed at its last value (3562218):

 58:    3091320     GIC-0 150 Level     30be0000.ethernet
 59:    3562218     GIC-0 151 Level     30be0000.ethernet
 60:   13377922     GIC-0 152 Level     30be0000.ethernet


--
Stefan

^ permalink raw reply

* RE: [PATCH] net: fec: select queue depending on VLAN priority
From: Andy Duan @ 2017-05-11  4:49 UTC (permalink / raw)
  To: Stefan Agner
  Cc: David Miller, andrew@lunn.ch, festevam@gmail.com,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <ed62e3b2eb91f7ef795279e92e249c51@agner.ch>

From: Stefan Agner <stefan@agner.ch> Sent: Thursday, May 11, 2017 12:08 PM
>To: Andy Duan <fugang.duan@nxp.com>
>Cc: David Miller <davem@davemloft.net>; andrew@lunn.ch;
>festevam@gmail.com; netdev@vger.kernel.org; linux-
>kernel@vger.kernel.org
>Subject: Re: [PATCH] net: fec: select queue depending on VLAN priority
>
>On 2017-05-09 19:42, Andy Duan wrote:
>> From: David Miller <davem@davemloft.net> Sent: Tuesday, May 09, 2017
>> 9:39 PM
>>>To: stefan@agner.ch
>>>Cc: Andy Duan <fugang.duan@nxp.com>; andrew@lunn.ch;
>>>festevam@gmail.com; netdev@vger.kernel.org; linux-
>>>kernel@vger.kernel.org
>>>Subject: Re: [PATCH] net: fec: select queue depending on VLAN priority
>>>
>>>From: Stefan Agner <stefan@agner.ch>
>>>Date: Mon,  8 May 2017 22:37:08 -0700
>>>
>>>> Since the addition of the multi queue code with commit 59d0f7465644
>>>> ("net: fec: init multi queue date structure") the queue selection
>>>> has been handelt by the default transmit queue selection
>>>> implementation which tries to evenly distribute the traffic across
>>>> all available queues. This selection presumes that the queues are
>>>> using an equal priority, however, the queues 1 and 2 are actually of
>>>> higher priority (the classification of the queues is enabled in
>fec_enet_enable_ring).
>>>>
>>>> This can lead to net scheduler warnings and continuous TX ring dumps
>>>> when exercising the system with iperf.
>>>>
>>>> Use only queue 0 for all common traffic (no VLAN and P802.1p
>>>> priority
>>>> 0 and 1) and route level 2-7 through queue 1 and 2.
>>>>
>>>> Signed-off-by: Fugang Duan <fugang.duan@nxp.com>
>>>> Fixes: 59d0f7465644 ("net: fec: init multi queue date structure")
>>>
>>>If the queues are used for prioritization, and it does not have
>>>multiple normal priority level queues, multiqueue is not what the
>>>driver should have implemented.
>> Firstly, HW multiple queues support:
>> 	- Traffic-shaping bandwidth distribution supports credit-based and
>> round-robin-based policies. Either policy can be combined with
>> time-based shaping.
>> 	- AVB (Audio Video Bridging, IEEE 802.1Qav) features:
>> 		* Credit-based bandwidth distribution policy can be combined
>with
>> time-based shaping
>> 		* AVB endpoint talker and listener support
>> 		* Support for arbitration between different priority traffic (for
>> example, AVB class A, AVB class B, and non-AVB) Round-robin-based
>> policies:
>> 	It has the same priority for three queues: In the round-robin QoS
>> scheme, each queue is given an equal opportunity to transmit one
>> frame. For example, if queue n has a frame to transmit, the queue
>> transmits its frame. After queue n has transmitted its frame, or if
>> queue n does not have a frame to transmit, queue n+1 is then allowed
>> to transmit its frame, and so on.
>>
>> Credit-based policies:
>> 	The AVB credit based shaper acts independently, per class, to control
>> the bandwidth distribution between normal traffic and time-sensitive
>> traffic with respect to the total link bandwidth available.
>> 	Credit based shaper conbined with time-based shaping:
>> 		- priority: ClassA queue > ClassB queue > best effort
>> 		- ensure the queue bandwidth as user set based on time-
>based shaping
>> algorithms (transmitter transmit frame from three queue in turn based
>> on time-based shaping algorithms)
>> 	And in real AVB case,  each streaming can be independent, and are
>> fixed on related queue. Then driver level should implement
>> .ndo_select_queue() to put the streaming into related queue. That is
>> what the patch did.
>>
>> The current driver config the three queue to credit-based policies
>> (AVB), the patch seems no problem for the implementation. Do you have
>> any suggestion ?
>>
>
>I tried using the round robin mode by adding this:
>
>+       /* Set Round-Robin policy */
>+       writel(1, fep->hwp + FEC_QOS_SCHEME);
>
>After a while I got the warning non the less:
>
>WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:316
>dev_watchdog+0x248/0x24c NETDEV WATCHDOG: eth0 (fec): transmit queue
>2 timed out Modules linked in:
>CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.11.0-rc1-00058-g56d22eced8bc-
>dirty #377 Hardware name: Freescale i.MX7 Dual (Device Tree) [<c02266b0>]
>(unwind_backtrace) from [<c0222d7c>] (show_stack+0x10/0x14) [<c0222d7c>]
>(show_stack) from [<c04d4098>] (dump_stack+0x78/0x8c) [<c04d4098>]
>(dump_stack) from [<c0236548>] (__warn+0xe8/0x100) [<c0236548>]
>(__warn) from [<c0236598>] (warn_slowpath_fmt+0x38/0x48) [<c0236598>]
>(warn_slowpath_fmt) from [<c0805904>]
>(dev_watchdog+0x248/0x24c)
>[<c0805904>] (dev_watchdog) from [<c028a0e8>] (call_timer_fn+0x28/0x98)
>[<c028a0e8>] (call_timer_fn) from [<c028a1f8>] (expire_timers+0xa0/0xac)
>[<c028a1f8>] (expire_timers) from [<c028a2a0>]
>(run_timer_softirq+0x9c/0x194)
>[<c028a2a0>] (run_timer_softirq) from [<c023aaf8>]
>(__do_softirq+0x114/0x234)
>[<c023aaf8>] (__do_softirq) from [<c023aee4>] (irq_exit+0xcc/0x108)
>[<c023aee4>] (irq_exit) from [<c0279920>]
>(__handle_domain_irq+0x80/0xec)
>[<c0279920>] (__handle_domain_irq) from [<c0201544>]
>(gic_handle_irq+0x48/0x8c)
>[<c0201544>] (gic_handle_irq) from [<c0223838>] (__irq_svc+0x58/0x8c)
>Exception stack(0xc1001f28 to 0xc1001f70)
>1f20:                   00000001 00000000 00000000 c022fdc0 c1000000
>c1003d80
>1f40: c1003d34 c0e72ed0 c0bd9b04 c1001f80 00000000 00000000 00000000
>c1001f78
>1f60: c022048c c0220490 60000013 ffffffff [<c0223838>] (__irq_svc) from
>[<c0220490>] (arch_cpu_idle+0x38/0x3c) [<c0220490>] (arch_cpu_idle) from
>[<c026ec60>] (do_idle+0x170/0x204) [<c026ec60>] (do_idle) from
>[<c026efac>] (cpu_startup_entry+0x18/0x1c) [<c026efac>]
>(cpu_startup_entry) from [<c0e00c88>]
>(start_kernel+0x394/0x3a0)
>---[ end trace a474f341d40e0705 ]---
>fec 30be0000.ethernet eth0: TX ring dump
>
>I disabled the regular ring dump and only printed one line. It seems to come
>up every 2 seconds, and checking cat /proc/interrupts showed that queue 2
>stayed at its last value (3562218):
>
> 58:    3091320     GIC-0 150 Level     30be0000.ethernet
> 59:    3562218     GIC-0 151 Level     30be0000.ethernet
> 60:   13377922     GIC-0 152 Level     30be0000.ethernet

Pls check ENETx_DMAnCFG[16] whether is set, and disable time-based shaping for round robin.

^ permalink raw reply

* [PATCH net] net: sched: optimize class dumps
From: Eric Dumazet @ 2017-05-11  4:59 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Jiri Kosina, Jamal Hadi Salim, Cong Wang, Jiri Pirko

From: Eric Dumazet <edumazet@google.com>

In commit 59cc1f61f09c ("net: sched: convert qdisc linked list to
hashtable") we missed the opportunity to considerably speed up
tc_dump_tclass_root() if a qdisc handle is provided by user.

Instead of iterating all the qdiscs, use qdisc_match_from_root()
to directly get the one we look for.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jiri Kosina <jkosina@suse.cz>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
---
 net/sched/sch_api.c |    6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index bbe57d57b67fd498692bd41db49147511f1bb091..e88342fde1bc409aed6a3c86e7a628030eaac66f 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1831,6 +1831,12 @@ static int tc_dump_tclass_root(struct Qdisc *root, struct sk_buff *skb,
 	if (!qdisc_dev(root))
 		return 0;
 
+	if (tcm->tcm_parent) {
+		q = qdisc_match_from_root(root, TC_H_MAJ(tcm->tcm_parent));
+		if (q && tc_dump_tclass_qdisc(q, skb, tcm, cb, t_p, s_t) < 0)
+			return -1;
+		return 0;
+	}
 	hash_for_each(qdisc_dev(root)->qdisc_hash, b, q, hash) {
 		if (tc_dump_tclass_qdisc(q, skb, tcm, cb, t_p, s_t) < 0)
 			return -1;

^ permalink raw reply related

* Re: DQL and TCQ_F_CAN_BYPASS destroy performance under virtualizaiton (Was: "Re: net_sched strange in 4.11")
From: Anton Ivanov @ 2017-05-11  5:43 UTC (permalink / raw)
  To: Jason Wang, Stefan Hajnoczi; +Cc: David S. Miller, netdev, Michael S. Tsirkin
In-Reply-To: <c11c50db-4e66-f3c2-5ebb-519ad6dbc2fe@redhat.com>

On 11/05/17 03:43, Jason Wang wrote:
>
>
> On 2017年05月10日 17:42, Anton Ivanov wrote:
>> On 10/05/17 09:56, Jason Wang wrote:
>>>
>>>
>>> On 2017年05月10日 13:28, Anton Ivanov wrote:
>>>> On 10/05/17 03:18, Jason Wang wrote:
>>>>>
>>>>> On 2017年05月09日 23:11, Stefan Hajnoczi wrote:
>>>>>> On Tue, May 09, 2017 at 08:46:46AM +0100, Anton Ivanov wrote:
>>>>>>> I have figured it out. Two issues.
>>>>>>>
>>>>>>> 1) skb->xmit_more is hardly ever set under virtualization because
>>>>>>> the qdisc
>>>>>>> is usually bypassed because of TCQ_F_CAN_BYPASS. Once
>>>>>>> TCQ_F_CAN_BYPASS is
>>>>>>> set a virtual NIC driver is not likely see skb->xmit_more (this
>>>>>>> answers my
>>>>>>> "how does this work at all" question).
>>>>>>>
>>>>>>> 2) If that flag is turned off (I patched sched_generic to turn it
>>>>>>> off in
>>>>>>> pfifo_fast while testing), DQL keeps xmit_more from being set. If
>>>>>>> the driver
>>>>>>> is not DQL enabled xmit_more is never ever set. If the driver is
>>>>>>> DQL
>>>>>>> enabled
>>>>>>> the queue is adjusted to ensure xmit_more stops happening within
>>>>>>> 10-15 xmit
>>>>>>> cycles.
>>>>>>>
>>>>>>> That is plain *wrong* for virtual NICs - virtio, emulated NICs,
>>>>>>> etc.
>>>>>>> There,
>>>>>>> the BIG cost is telling the hypervisor that it needs to "kick" the
>>>>>>> packets.
>>>>>>> The cost of putting them into the vNIC buffers is negligible.
>>>>>>> You want
>>>>>>> xmit_more to happen - it makes between 50% and 300% (depending
>>>>>>> on vNIC
>>>>>>> design) difference. If there is no xmit_more the vNIC will
>>>>>>> immediately
>>>>>>> "kick" the hypervisor and try to signal that  the packet needs
>>>>>>> to move
>>>>>>> straight away (as for example in virtio_net).
>>>>> How do you measure the performance? TCP or just measure pps?
>>>> In this particular case - tcp from guest. I have a couple of other
>>>> benchmarks (forwarding, etc).
>>>
>>> One more question, is the number for virtio-net or other emulated vNIC?
>>
>> Other for now - you are cc-ed to keep you in the loop.
>>
>> Virtio is next on my list - I am revisiting the l2tpv3.c driver in
>> QEMU and looking at how to preserve bulking by adding back sendmmsg
>> (as well as a list of other features/transports).
>>
>> We had sendmmsg removed for the final inclusion in QEMU 2.1, it
>> presently uses only recvmmsg so for the time being it does not care.
>> That will most likely change once it starts using sendmmsg as well.
>
> An issue is that qemu net API does not support bulking, do you plan to
> add it?

Yes :)

A.

>
> Thanks
>


-- 
Anton R. Ivanov
Cambridgegreys Limited. Registered in England. Company Number 10273661

^ permalink raw reply

* [PATCH] net/mlx4_core: Use min_t instead of if for consistency
From: Yuval Shaia @ 2017-05-11  8:20 UTC (permalink / raw)
  To: yishaih-VPRAkNaXOzVWk0Htik3J/w, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

Signed-off-by: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
 drivers/net/ethernet/mellanox/mlx4/main.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index 7032054..a58b15a 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -2862,12 +2862,11 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
 	int port = 0;
 
 	if (msi_x) {
-		int nreq = dev->caps.num_ports * num_online_cpus() + 1;
+		int nreq = min_t(int,
+				 dev->caps.num_ports * num_online_cpus() + 1,
+				 dev->caps.num_eqs - dev->caps.reserved_eqs);
 
-		nreq = min_t(int, dev->caps.num_eqs - dev->caps.reserved_eqs,
-			     nreq);
-		if (nreq > MAX_MSIX)
-			nreq = MAX_MSIX;
+		nreq = min_t(nreq, MAX_MSIX);
 
 		entries = kcalloc(nreq, sizeof *entries, GFP_KERNEL);
 		if (!entries)
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH] net/mlx4_core: Use min_t instead of if for consistency
From: Johannes Thumshirn @ 2017-05-11  8:28 UTC (permalink / raw)
  To: Yuval Shaia, yishaih-VPRAkNaXOzVWk0Htik3J/w,
	netdev-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1494490852-5567-1-git-send-email-yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

On 05/11/2017 10:20 AM, Yuval Shaia wrote:
> +		nreq = min_t(nreq, MAX_MSIX);

Ahm...
include/linux/kernel.h +802
#define min_t(type, x, y)				\
	__min(type, type,				\
	      __UNIQUE_ID(min1_), __UNIQUE_ID(min2_),	\
	      x, y)

Did you compile this patch?

-- 
Johannes Thumshirn                                          Storage
jthumshirn-l3A5Bk7waGM@public.gmane.org                                +49 911 74053 689
SUSE LINUX GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Felix Imendörffer, Jane Smithard, Graham Norton
HRB 21284 (AG Nürnberg)
Key fingerprint = EC38 9CAB C2C4 F25D 8600 D0D0 0393 969D 2D76 0850
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH] net/mlx4_core: Use min_t instead of if for consistency
From: Leon Romanovsky @ 2017-05-11  8:28 UTC (permalink / raw)
  To: Yuval Shaia
  Cc: yishaih-VPRAkNaXOzVWk0Htik3J/w, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1494490852-5567-1-git-send-email-yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 1332 bytes --]

On Thu, May 11, 2017 at 11:20:52AM +0300, Yuval Shaia wrote:
> Signed-off-by: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> ---
>  drivers/net/ethernet/mellanox/mlx4/main.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
> index 7032054..a58b15a 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/main.c
> @@ -2862,12 +2862,11 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
>  	int port = 0;
>
>  	if (msi_x) {
> -		int nreq = dev->caps.num_ports * num_online_cpus() + 1;
> +		int nreq = min_t(int,
> +				 dev->caps.num_ports * num_online_cpus() + 1,
> +				 dev->caps.num_eqs - dev->caps.reserved_eqs);
>
> -		nreq = min_t(int, dev->caps.num_eqs - dev->caps.reserved_eqs,
> -			     nreq);
> -		if (nreq > MAX_MSIX)
> -			nreq = MAX_MSIX;
> +		nreq = min_t(nreq, MAX_MSIX);

I don't see min_t (int, ..) here.

>
>  		entries = kcalloc(nreq, sizeof *entries, GFP_KERNEL);
>  		if (!entries)
> --
> 2.7.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* [PATCH v1] net/mlx4_core: Use min_t instead of if for consistency
From: Yuval Shaia @ 2017-05-11  8:46 UTC (permalink / raw)
  To: yishaih-VPRAkNaXOzVWk0Htik3J/w, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

Signed-off-by: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
v0 -> v1:
	* s/"min_t("/"min_t(int"
---
 drivers/net/ethernet/mellanox/mlx4/main.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index 7032054..7bb377e 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -2862,12 +2862,11 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
 	int port = 0;
 
 	if (msi_x) {
-		int nreq = dev->caps.num_ports * num_online_cpus() + 1;
+		int nreq = min_t(int,
+				 dev->caps.num_ports * num_online_cpus() + 1,
+				 dev->caps.num_eqs - dev->caps.reserved_eqs);
 
-		nreq = min_t(int, dev->caps.num_eqs - dev->caps.reserved_eqs,
-			     nreq);
-		if (nreq > MAX_MSIX)
-			nreq = MAX_MSIX;
+		nreq = min_t(int, nreq, MAX_MSIX);
 
 		entries = kcalloc(nreq, sizeof *entries, GFP_KERNEL);
 		if (!entries)
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: arch: arm: bpf: Converting cBPF to eBPF for arm 32 bit
From: Shubham Bansal @ 2017-05-11  9:32 UTC (permalink / raw)
  To: Kees Cook
  Cc: David Miller, Mircea Gherzan, Network Development,
	kernel-hardening, linux-arm-kernel, ast, Daniel Borkmann
In-Reply-To: <CAHgaXdKZ_v+iO7uqEDx7PA7D+xcp1FngGvJ1SRSsGXNQ-iWWDQ@mail.gmail.com>

Hi kees & Daniel,

David suggested following :

"""
eBPF has registers 0 through 10 plus you need to allocate another
temporary register for constant blinding (this is BPF_REG_AX).

I would put all of BPF_REG_0 through BPF_REG_5 in registers if
possible.  BPF_REG_FP is the frame pointer which you don't have to
really allocate.  That leaves BPF_REG_6 through BPF_REG_9, which
are callee saved, for perhaps stack slot allocation.

You seem to have R0 through R10 on ARM plus a separate frame pointer.
And then I see something called "LR" which is probably the function
return address register.Why can't you just use R0 through R9
for BPF_REG_0 through BPF_REG_9, BPF_REG_10 is just FP and then you
have R10 for BPF_REG_AX?
"""

"""
static const u8 bpf2a32[][2] = {
        /* return value from in-kernel function, and exit value from eBPF */
        [BPF_REG_0] = {ARM_R1, ARM_R0},
        /* arguments from eBPF program to in-kernel function */
        [BPF_REG_1] = {ARM_R1, ARM_R0},
        [BPF_REG_2] = {ARM_R3, ARM_R2},
        /* Stored on stack */
        [BPF_REG_3] = {STACK_OFFSET(0), STACK_OFFSET(4)},
        [BPF_REG_4] = {STACK_OFFSET(8), STACK_OFFSET(12)},
        [BPF_REG_5] = {STACK_OFFSET(16), STACK_OFFSET(20)},
"bpf_jit/* callee saved registers that in-kernel function will preserve */
        [BPF_REG_6] = {ARM_R5, ARM_R4},
        [BPF_REG_7] = {STACK_OFFSET(24), STACK_OFFSET(28)},
        /* Stored on stack */
        [BPF_REG_8] = {STACK_OFFSET(32), STACK_OFFSET(36)},
        [BPF_REG_9] = {STACK_OFFSET(40), STACK_OFFSET(44)},
        /* Read only Frame Pointer to access Stack */
        [BPF_REG_FP] = {ARM_FP},
        /* Temperory Register for internal BPF JIT, can be used
         * for constant blindings and others. */
        [TMP_REG_1] = {ARM_R7, ARM_R6},
        [TMP_REG_2] = {ARM_R10, ARM_R8},
        /* Tail call count. */
        [TCALL_CNT] = {STACK_OFFSET(48), STACK_OFFSET(52)},

        [BPF_REG_AX] = {STACK_OFFSET(56), STACK_OFFSET(60)},
};

> How register starved are you?
Super Starved.
>
> eBPF has registers 0 through 10 plus you need to allocate another
> temporary register for constant blinding (this is BPF_REG_AX).
I am storing BPF_REG_AX on stack as of now.
>
> I would put all of BPF_REG_0 through BPF_REG_5 in registers if
> possible.  BPF_REG_FP is the frame pointer which you don't have to
> really allocate.  That leaves BPF_REG_6 through BPF_REG_9, which
> are callee saved, for perhaps stack slot allocation.
>
> You seem to have R0 through R10 on ARM plus a separate frame pointer.
> And then I see something called "LR" which is probably the function
> return address register.  Why can't you just use R0 through R9
> for BPF_REG_0 through BPF_REG_9, BPF_REG_10 is just FP and then you
> have R10 for BPF_REG_AX?
I can't do that. BPF registers are 64 bits and ARM registers are 32
bit. So I have to map each BPF register with 2 arm registers.
Also, I need 4 temp registers which I am currently using.
"""

"""
>> I can't do that. BPF registers are 64 bits and ARM registers are 32
>> bit. So I have to map each BPF register with 2 arm registers.
>> Also, I need 4 temp registers which I am currently using.
>
> Ummm, no you don't.
>
> You can do proper data flow analysis on the register values and you
> can just use plain 32-bit registers when that is all that the data
> flow tells you the register is used for.
I don't understand. Can you explain that with example?

>
> This is what the netronome driver does, it is in the same situation
> you are.  The NPU cpus on their networking card are 32-bits, and
> they have to do 32-bit value analysis while JIT'ing into their
> device.
As far as I know their ISA is more like cBPF? isn't it?
>
> It is actually rare for full 64-bit values to be used.  Those ususally
> come from pointers.  But on arm32, pointers will be 32-bits therefore
> any pointer relative value will be 32-bits as well.
Well, in that case I have to rewrite the whole code. I asked what
mapping I should use when I started and nobody replied so I went ahead
and started implementing. :(
>
> When you actually have to fabricate a full 64-bit operation, yeah
> use a stack slot or something like that.
So you are telling me to store the low 32 bit in registers and high 32
bit in scratch memory?
"""

What do you guys suggest i should implement it? I am almost done with
my current implementation but if you think I should change it to the
way David suggested, its better to suggest now before I send the
patch.

Let me know if you have any questions.
Best,
Shubham Bansal


On Thu, May 11, 2017 at 7:23 AM, Shubham Bansal
<illusionist.neo@gmail.com> wrote:
> Okay. My mistake.
>
> -Shubham
>
> On May 11, 2017 7:22 AM, "David Miller" <davem@davemloft.net> wrote:
>>
>>
>> Please keep this discussion on the mailing list.
>>
>> When you drop the CC:, you exclude the entire world from contributing
>> and continuing to help you.

^ permalink raw reply

* Implementing Dynamic Rerouting in Kernel
From: Ravish Kumar @ 2017-05-11  9:59 UTC (permalink / raw)
  To: Networking, linux-kernel, mst

Hi Experts,

Need expert advice for the one of the requirement Where in VPN
solution we want to dynaically route the packets to different adapter.
We will manage our own DNS cache and , based on DNS to IP lookup, we
can redirect the packet either to Tun device or to a physical adapter.

Please suggest some design what i need to do.

^ permalink raw reply

* Re: [PATCH v1] net/mlx4_core: Use min_t instead of if for consistency
From: Leon Romanovsky @ 2017-05-11 10:23 UTC (permalink / raw)
  To: Yuval Shaia
  Cc: yishaih-VPRAkNaXOzVWk0Htik3J/w, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1494492389-15040-1-git-send-email-yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 1459 bytes --]

On Thu, May 11, 2017 at 11:46:29AM +0300, Yuval Shaia wrote:
> Signed-off-by: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> ---
> v0 -> v1:
> 	* s/"min_t("/"min_t(int"
> ---
>  drivers/net/ethernet/mellanox/mlx4/main.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
> index 7032054..7bb377e 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/main.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/main.c
> @@ -2862,12 +2862,11 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
>  	int port = 0;
>
>  	if (msi_x) {
> -		int nreq = dev->caps.num_ports * num_online_cpus() + 1;
> +		int nreq = min_t(int,
> +				 dev->caps.num_ports * num_online_cpus() + 1,
> +				 dev->caps.num_eqs - dev->caps.reserved_eqs);
>
> -		nreq = min_t(int, dev->caps.num_eqs - dev->caps.reserved_eqs,
> -			     nreq);
> -		if (nreq > MAX_MSIX)
> -			nreq = MAX_MSIX;
> +		nreq = min_t(int, nreq, MAX_MSIX);

You don't need type checking for these variables, they are all int and
you can use directly min3(..).

Thanks

>
>  		entries = kcalloc(nreq, sizeof *entries, GFP_KERNEL);
>  		if (!entries)
> --
> 2.7.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v1] net/mlx4_core: Use min_t instead of if for consistency
From: Yuval Shaia @ 2017-05-11 10:27 UTC (permalink / raw)
  To: Leon Romanovsky; +Cc: yishaih, netdev, linux-rdma
In-Reply-To: <20170511102329.GC3616@mtr-leonro.local>

On Thu, May 11, 2017 at 01:23:29PM +0300, Leon Romanovsky wrote:
> On Thu, May 11, 2017 at 11:46:29AM +0300, Yuval Shaia wrote:
> > Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
> > ---
> > v0 -> v1:
> > 	* s/"min_t("/"min_t(int"
> > ---
> >  drivers/net/ethernet/mellanox/mlx4/main.c | 9 ++++-----
> >  1 file changed, 4 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
> > index 7032054..7bb377e 100644
> > --- a/drivers/net/ethernet/mellanox/mlx4/main.c
> > +++ b/drivers/net/ethernet/mellanox/mlx4/main.c
> > @@ -2862,12 +2862,11 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
> >  	int port = 0;
> >
> >  	if (msi_x) {
> > -		int nreq = dev->caps.num_ports * num_online_cpus() + 1;
> > +		int nreq = min_t(int,
> > +				 dev->caps.num_ports * num_online_cpus() + 1,
> > +				 dev->caps.num_eqs - dev->caps.reserved_eqs);
> >
> > -		nreq = min_t(int, dev->caps.num_eqs - dev->caps.reserved_eqs,
> > -			     nreq);
> > -		if (nreq > MAX_MSIX)
> > -			nreq = MAX_MSIX;
> > +		nreq = min_t(int, nreq, MAX_MSIX);
> 
> You don't need type checking for these variables, they are all int and
> you can use directly min3(..).

Ha ha ha, i can't explain how this macro slipped out from my sight when was
looking for it :).
Hope you can review v2.

> 
> Thanks
> 
> >
> >  		entries = kcalloc(nreq, sizeof *entries, GFP_KERNEL);
> >  		if (!entries)
> > --
> > 2.7.4
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH v2] net/mlx4_core: Use min_t instead of if for consistency
From: Yuval Shaia @ 2017-05-11 10:36 UTC (permalink / raw)
  To: yishaih, netdev, linux-rdma

Signed-off-by: Yuval Shaia <yuval.shaia@oracle.com>
---
v0 -> v1:
	* s/"min_t("/"min_t(int"
v1 -> v2:
	* Use min3 instead of min_t twice
---
 drivers/net/ethernet/mellanox/mlx4/main.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index 7032054..2afa340 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -2862,12 +2862,9 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
 	int port = 0;
 
 	if (msi_x) {
-		int nreq = dev->caps.num_ports * num_online_cpus() + 1;
-
-		nreq = min_t(int, dev->caps.num_eqs - dev->caps.reserved_eqs,
-			     nreq);
-		if (nreq > MAX_MSIX)
-			nreq = MAX_MSIX;
+		int nreq = min3(dev->caps.num_ports * num_online_cpus() + 1,
+				dev->caps.num_eqs - dev->caps.reserved_eqs,
+				MAX_MSIX);
 
 		entries = kcalloc(nreq, sizeof *entries, GFP_KERNEL);
 		if (!entries)
-- 
2.7.4

^ permalink raw reply related

* [PATCH v3] net/mlx4_core: Use min3 to select number of MSI-X vectors
From: Yuval Shaia @ 2017-05-11 10:40 UTC (permalink / raw)
  To: yishaih-VPRAkNaXOzVWk0Htik3J/w, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA

Signed-off-by: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
---
v0 -> v1:
	* s/"min_t("/"min_t(int"
v1 -> v2:
	* Use min3 instead of min_t twice
v2 -> v3:
	* Change commit log header message to reflect the changes made in
	  v2
---
 drivers/net/ethernet/mellanox/mlx4/main.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/main.c b/drivers/net/ethernet/mellanox/mlx4/main.c
index 7032054..2afa340 100644
--- a/drivers/net/ethernet/mellanox/mlx4/main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/main.c
@@ -2862,12 +2862,9 @@ static void mlx4_enable_msi_x(struct mlx4_dev *dev)
 	int port = 0;
 
 	if (msi_x) {
-		int nreq = dev->caps.num_ports * num_online_cpus() + 1;
-
-		nreq = min_t(int, dev->caps.num_eqs - dev->caps.reserved_eqs,
-			     nreq);
-		if (nreq > MAX_MSIX)
-			nreq = MAX_MSIX;
+		int nreq = min3(dev->caps.num_ports * num_online_cpus() + 1,
+				dev->caps.num_eqs - dev->caps.reserved_eqs,
+				MAX_MSIX);
 
 		entries = kcalloc(nreq, sizeof *entries, GFP_KERNEL);
 		if (!entries)
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* Re: [PATCH v2] net/mlx4_core: Use min_t instead of if for consistency
From: Leon Romanovsky @ 2017-05-11 10:42 UTC (permalink / raw)
  To: Yuval Shaia
  Cc: yishaih-VPRAkNaXOzVWk0Htik3J/w, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1494498962-16852-1-git-send-email-yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 444 bytes --]

On Thu, May 11, 2017 at 01:36:02PM +0300, Yuval Shaia wrote:
> Signed-off-by: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> ---
> v0 -> v1:
> 	* s/"min_t("/"min_t(int"
> v1 -> v2:
> 	* Use min3 instead of min_t twice
> ---
>  drivers/net/ethernet/mellanox/mlx4/main.c | 9 +++------
>  1 file changed, 3 insertions(+), 6 deletions(-)
>

Thanks,
Reviewed-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [PATCH v3] net/mlx4_core: Use min3 to select number of MSI-X vectors
From: Leon Romanovsky @ 2017-05-11 10:48 UTC (permalink / raw)
  To: Yuval Shaia
  Cc: yishaih-VPRAkNaXOzVWk0Htik3J/w, netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1494499258-17017-1-git-send-email-yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 533 bytes --]

On Thu, May 11, 2017 at 01:40:58PM +0300, Yuval Shaia wrote:
> Signed-off-by: Yuval Shaia <yuval.shaia-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
> ---
> v0 -> v1:
> 	* s/"min_t("/"min_t(int"
> v1 -> v2:
> 	* Use min3 instead of min_t twice
> v2 -> v3:
> 	* Change commit log header message to reflect the changes made in
> 	  v2
> ---
>  drivers/net/ethernet/mellanox/mlx4/main.c | 9 +++------
>  1 file changed, 3 insertions(+), 6 deletions(-)
>

Thanks,
Reviewed-by: Leon Romanovsky <leonro-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* [PATCH] wlcore: fix 64K page support
From: Arnd Bergmann @ 2017-05-11 11:52 UTC (permalink / raw)
  To: Kalle Valo
  Cc: Arnd Bergmann, stable, Reizer, Eyal, Tony Lindgren, Wei Yongjun,
	linux-wireless, netdev, linux-kernel

In the stable linux-3.16 branch, I ran into a warning in the
wlcore driver:

drivers/net/wireless/ti/wlcore/spi.c: In function 'wl12xx_spi_raw_write':
drivers/net/wireless/ti/wlcore/spi.c:315:1: error: the frame size of 12848 bytes is larger than 2048 bytes [-Werror=frame-larger-than=]

Newer kernels no longer show the warning, but the bug is still there,
as the allocation is based on the CPU page size rather than the
actual capabilities of the hardware.

This replaces the PAGE_SIZE macro with the SZ_4K macro, i.e. 4096 bytes
per buffer.

Cc: stable@vger.kernel.org
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
 drivers/net/wireless/ti/wlcore/spi.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/net/wireless/ti/wlcore/spi.c b/drivers/net/wireless/ti/wlcore/spi.c
index f949ad2bd898..fa3547e06424 100644
--- a/drivers/net/wireless/ti/wlcore/spi.c
+++ b/drivers/net/wireless/ti/wlcore/spi.c
@@ -70,10 +70,10 @@
 #define WSPI_MAX_CHUNK_SIZE    4092
 
 /*
- * wl18xx driver aggregation buffer size is (13 * PAGE_SIZE) compared to
- * (4 * PAGE_SIZE) for wl12xx, so use the larger buffer needed for wl18xx
+ * wl18xx driver aggregation buffer size is (13 * 4K) compared to
+ * (4 * 4K) for wl12xx, so use the larger buffer needed for wl18xx
  */
-#define SPI_AGGR_BUFFER_SIZE (13 * PAGE_SIZE)
+#define SPI_AGGR_BUFFER_SIZE (13 * SZ_4K)
 
 /* Maximum number of SPI write chunks */
 #define WSPI_MAX_NUM_OF_CHUNKS \
-- 
2.9.0

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox