Netdev List

Netdev List
 help / color / mirror / Atom feed

* RE: [PATCH net-next 3/3] net: stmmac: Introducing support for Page Pool
From: Jose Abreu @ 2019-07-18  7:48 UTC (permalink / raw)
  To: Jon Hunter, Jose Abreu, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org, linux-stm32@st-md-mailman.stormreply.com,
	linux-arm-kernel@lists.infradead.org
  Cc: Joao Pinto, David S . Miller, Giuseppe Cavallaro,
	Alexandre Torgue, Maxime Coquelin, Maxime Ripard, Chen-Yu Tsai,
	linux-tegra
In-Reply-To: <29dcc161-f7c8-026e-c3cc-5adb04df128c@nvidia.com>

[-- Attachment #1: Type: text/plain, Size: 192 bytes --]

From: Jon Hunter <jonathanh@nvidia.com>
Date: Jul/17/2019, 19:58:53 (UTC+00:00)

> Let me know if you have any thoughts.

Can you try attached patch ?

---
Thanks,
Jose Miguel Abreu

[-- Attachment #2: 0001-net-stmmac-RX-Descriptors-need-to-be-clean-before-se.patch --]
[-- Type: application/octet-stream, Size: 2042 bytes --]

From 00bfde6f589e60ba1a2d0671c8ba0fcd0964d6e7 Mon Sep 17 00:00:00 2001
Message-Id: <00bfde6f589e60ba1a2d0671c8ba0fcd0964d6e7.1563435927.git.joabreu@synopsys.com>
From: Jose Abreu <joabreu@synopsys.com>
Date: Thu, 18 Jul 2019 09:42:31 +0200
Subject: [PATCH net] net: stmmac: RX Descriptors need to be clean before
 setting buffers

RX Descriptors are being cleaned after setting the buffers which may
lead to buffer addresses being wiped out.

Fix this by clearing earlier the RX Descriptors.

Reported-by: Jon Hunter <jonathanh@nvidia.com>
Fixes: 2af6106ae949 ("net: stmmac: Introducing support for Page Pool")
Signed-off-by: Jose Abreu <joabreu@synopsys.com>

---
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
Cc: Jose Abreu <joabreu@synopsys.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Maxime Coquelin <mcoquelin.stm32@gmail.com>
Cc: netdev@vger.kernel.org
Cc: linux-stm32@st-md-mailman.stormreply.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
---
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index c7c9e5f162e6..5f1294ce0216 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1295,6 +1295,8 @@ static int init_dma_rx_desc_rings(struct net_device *dev, gfp_t flags)
 			  "(%s) dma_rx_phy=0x%08x\n", __func__,
 			  (u32)rx_q->dma_rx_phy);
 
+		stmmac_clear_rx_descriptors(priv, queue);
+
 		for (i = 0; i < DMA_RX_SIZE; i++) {
 			struct dma_desc *p;
 
@@ -1312,8 +1314,6 @@ static int init_dma_rx_desc_rings(struct net_device *dev, gfp_t flags)
 		rx_q->cur_rx = 0;
 		rx_q->dirty_rx = (unsigned int)(i - DMA_RX_SIZE);
 
-		stmmac_clear_rx_descriptors(priv, queue);
-
 		/* Setup the chained descriptor addresses */
 		if (priv->mode == STMMAC_CHAIN_MODE) {
 			if (priv->extend_desc)
-- 
2.7.4


^ permalink raw reply related

* IP GRO verifies csum again?
From: Jacob Wen @ 2019-07-18  7:49 UTC (permalink / raw)
  To: netdev; +Cc: herbert

Hi,

inet_gro_receive verifies IP csum but a NIC already did so and set 
CHECKSUM_UNNECESSARY.

https://github.com/torvalds/linux/blob/v5.2/net/ipv4/af_inet.c#L1432-L1433

if (unlikely(ip_fast_csum((u8 *)iph, 5)))

         goto out_unlock;

Is this a bug?

^ permalink raw reply

* Re: [PATCH v4 4/5] vhost/vsock: split packets to send using multiple buffers
From: Stefano Garzarella @ 2019-07-18  7:50 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, linux-kernel, Stefan Hajnoczi, David S. Miller,
	virtualization, Jason Wang, kvm
In-Reply-To: <20190717105336-mutt-send-email-mst@kernel.org>

On Wed, Jul 17, 2019 at 4:55 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Wed, Jul 17, 2019 at 01:30:29PM +0200, Stefano Garzarella wrote:
> > If the packets to sent to the guest are bigger than the buffer
> > available, we can split them, using multiple buffers and fixing
> > the length in the packet header.
> > This is safe since virtio-vsock supports only stream sockets.
> >
> > Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
>
> So how does it work right now? If an app
> does sendmsg with a 64K buffer and the other
> side publishes 4K buffers - does it just stall?

Before this series, the 64K (or bigger) user messages was split in 4K packets
(fixed in the code) and queued in an internal list for the TX worker.

After this series, we will queue up to 64K packets and then it will be split in
the TX worker, depending on the size of the buffers available in the
vring. (The idea was to allow EWMA or a configuration of the buffers size, but
for now we postponed it)

Note: virtio-vsock only supports stream socket for now.

Thanks,
Stefano

^ permalink raw reply

* Re: [PATCH v4 5/5] vsock/virtio: change the maximum packet size allowed
From: Stefano Garzarella @ 2019-07-18  7:52 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, linux-kernel, Stefan Hajnoczi, David S. Miller,
	virtualization, Jason Wang, kvm
In-Reply-To: <20190717105703-mutt-send-email-mst@kernel.org>

On Wed, Jul 17, 2019 at 5:00 PM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Wed, Jul 17, 2019 at 01:30:30PM +0200, Stefano Garzarella wrote:
> > Since now we are able to split packets, we can avoid limiting
> > their sizes to VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE.
> > Instead, we can use VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the max
> > packet size.
> >
> > Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
>
>
> OK so this is kind of like GSO where we are passing
> 64K packets to the vsock and then split at the
> low level.

Exactly, something like that in the Host->Guest path, instead in the
Guest->Host we use the entire 64K packet.

Thanks,
Stefano

^ permalink raw reply

* Re: [PATCH v4 4/5] vhost/vsock: split packets to send using multiple buffers
From: Michael S. Tsirkin @ 2019-07-18  8:13 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: netdev, linux-kernel, Stefan Hajnoczi, David S. Miller,
	virtualization, Jason Wang, kvm
In-Reply-To: <CAGxU2F45v40qAOHkm1Hk2E69gCS0UwVgS5NS+tDXXuzdF4EixA@mail.gmail.com>

On Thu, Jul 18, 2019 at 09:50:14AM +0200, Stefano Garzarella wrote:
> On Wed, Jul 17, 2019 at 4:55 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Wed, Jul 17, 2019 at 01:30:29PM +0200, Stefano Garzarella wrote:
> > > If the packets to sent to the guest are bigger than the buffer
> > > available, we can split them, using multiple buffers and fixing
> > > the length in the packet header.
> > > This is safe since virtio-vsock supports only stream sockets.
> > >
> > > Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> >
> > So how does it work right now? If an app
> > does sendmsg with a 64K buffer and the other
> > side publishes 4K buffers - does it just stall?
> 
> Before this series, the 64K (or bigger) user messages was split in 4K packets
> (fixed in the code) and queued in an internal list for the TX worker.
> 
> After this series, we will queue up to 64K packets and then it will be split in
> the TX worker, depending on the size of the buffers available in the
> vring. (The idea was to allow EWMA or a configuration of the buffers size, but
> for now we postponed it)

Got it. Using workers for xmit is IMHO a bad idea btw.
Why is it done like this?

> Note: virtio-vsock only supports stream socket for now.
> 
> Thanks,
> Stefano

^ permalink raw reply

* RE: [EXT] [PATCH] qed: Prefer pcie_capability_read_word()
From: Michal Kalderon @ 2019-07-18  8:22 UTC (permalink / raw)
  To: Frederick Lawler, Ariel Elior, GR-everest-linux-l2
  Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	bhelgaas@google.com
In-Reply-To: <20190718020745.8867-7-fred@fredlawl.com>

> From: Frederick Lawler <fred@fredlawl.com>
> Sent: Thursday, July 18, 2019 5:08 AM
> 
> External Email
> 
> ----------------------------------------------------------------------
> Commit 8c0d3a02c130 ("PCI: Add accessors for PCI Express Capability") added
> accessors for the PCI Express Capability so that drivers didn't need to be
> aware of differences between v1 and v2 of the PCI Express Capability.
> 
> Replace pci_read_config_word() and pci_write_config_word() calls with
> pcie_capability_read_word() and pcie_capability_write_word().
> 
> Signed-off-by: Frederick Lawler <fred@fredlawl.com>
> ---
>  drivers/net/ethernet/qlogic/qed/qed_rdma.c | 5 ++---
>  1 file changed, 2 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ethernet/qlogic/qed/qed_rdma.c
> b/drivers/net/ethernet/qlogic/qed/qed_rdma.c
> index 7873d6dfd91f..8d8a920c3195 100644
> --- a/drivers/net/ethernet/qlogic/qed/qed_rdma.c
> +++ b/drivers/net/ethernet/qlogic/qed/qed_rdma.c
> @@ -530,9 +530,8 @@ static void qed_rdma_init_devinfo(struct qed_hwfn
> *p_hwfn,
>  	SET_FIELD(dev->dev_caps,
> QED_RDMA_DEV_CAP_LOCAL_INV_FENCE, 1);
> 
>  	/* Check atomic operations support in PCI configuration space. */
> -	pci_read_config_dword(cdev->pdev,
> -			      cdev->pdev->pcie_cap + PCI_EXP_DEVCTL2,
> -			      &pci_status_control);
> +	pcie_capability_read_dword(cdev->pdev, PCI_EXP_DEVCTL2,
> +				   &pci_status_control);
> 
>  	if (pci_status_control & PCI_EXP_DEVCTL2_LTR_EN)
>  		SET_FIELD(dev->dev_caps,
> QED_RDMA_DEV_CAP_ATOMIC_OP, 1);
> --
> 2.17.1

Thanks, 

Acked-by: Michal Kalderon <michal.kalderon@marvell.com>



^ permalink raw reply

* Re: IP GRO verifies csum again?
From: Eric Dumazet @ 2019-07-18  8:27 UTC (permalink / raw)
  To: Jacob Wen, netdev; +Cc: herbert
In-Reply-To: <6cc12686-a13d-81a8-ad3c-4601397c900e@oracle.com>

On 7/18/19 9:49 AM, Jacob Wen wrote:
> Hi,
> 
> inet_gro_receive verifies IP csum but a NIC already did so and set CHECKSUM_UNNECESSARY.
> 
> 
> https://github.com/torvalds/linux/blob/v5.2/net/ipv4/af_inet.c#L1432-L1433
> 
> if (unlikely(ip_fast_csum((u8 *)iph, 5)))
> 
>         goto out_unlock;
> 
> 
> Is this a bug?
> 

This checksum validates the TCP one, which is the real cost, since we need to touch all
the packet.

We do not bother 'offloading' IPV4 checksum over 20 bytes or so, since in modern cpus,
having the cache line hot in cpu caches means the checksum is almost free.

Adding a test here would not always be a win, say for CHECKSUM_COMPLETE cases,
which we try to generalize in favor of old CHECKSUM_UNNECESSARY.

^ permalink raw reply

* Re: [RFC PATCH 0/5] PTP: add support for Intel's TGPIO controller
From: Felipe Balbi @ 2019-07-18  8:58 UTC (permalink / raw)
  To: Richard Cochran
  Cc: netdev, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, linux-kernel, Christopher S . Hall
In-Reply-To: <20190717173915.GE1464@localhost>


Hi,

Richard Cochran <richardcochran@gmail.com> writes:

> On Wed, Jul 17, 2019 at 09:52:55AM +0300, Felipe Balbi wrote:
>> 
>> It's just a pin, like a GPIO. So it would be a PCB trace, flat flex,
>> copper wire... Anything, really.
>
> Cool.  Are there any Intel CPUs available that have this feature?

At least canon lake has it, but its BIOS doesn't enable it. This is
something that will be more important on future CPUs.

-- 
balbi

^ permalink raw reply

* Re: [RFC PATCH 4/5] PTP: Add flag for non-periodic output
From: Felipe Balbi @ 2019-07-18  8:59 UTC (permalink / raw)
  To: Richard Cochran
  Cc: netdev, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	H . Peter Anvin, x86, linux-kernel, Christopher S . Hall
In-Reply-To: <20190717173645.GD1464@localhost>


Hi,

Richard Cochran <richardcochran@gmail.com> writes:

> On Wed, Jul 17, 2019 at 09:49:13AM +0300, Felipe Balbi wrote:
>> No worries, I'll work on this after vacations (I'll off for 2 weeks
>> starting next week). I thought about adding a new IOCTL until I saw that
>> rsv field. Oh well :-)
>
> It would be great if you could fix up the PTP ioctls as a preface to
> your series.

no problem, anything in particular in mind? Just create new versions of
all the IOCTLs so we can actually use the reserved fields in the future?

cheers

-- 
balbi

^ permalink raw reply

* [PATCH bpf] selftests/bpf: fix "valid read map access into a read-only array 1" on s390
From: Ilya Leoshkevich @ 2019-07-18  9:13 UTC (permalink / raw)
  To: bpf, netdev; +Cc: gor, heiko.carstens, Ilya Leoshkevich

This test looks up a 32-bit map element and then loads it using a 64-bit
load. This does not work on s390, which is a big-endian machine.

Since the point of this test doesn't seem to be loading a smaller value
using a larger load, simply use a 32-bit load.

Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com>
---
 tools/testing/selftests/bpf/verifier/array_access.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/testing/selftests/bpf/verifier/array_access.c b/tools/testing/selftests/bpf/verifier/array_access.c
index bcb83196e459..f3c33e128709 100644
--- a/tools/testing/selftests/bpf/verifier/array_access.c
+++ b/tools/testing/selftests/bpf/verifier/array_access.c
@@ -226,7 +226,7 @@
 	BPF_LD_MAP_FD(BPF_REG_1, 0),
 	BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
 	BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
-	BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_0, 0),
+	BPF_LDX_MEM(BPF_W, BPF_REG_0, BPF_REG_0, 0),
 	BPF_EXIT_INSN(),
 	},
 	.fixup_map_array_ro = { 3 },
-- 
2.21.0


^ permalink raw reply related

* Re: [PATCH net-next 3/3] net: stmmac: Introducing support for Page Pool
From: Jon Hunter @ 2019-07-18  9:16 UTC (permalink / raw)
  To: Jose Abreu, linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	linux-stm32@st-md-mailman.stormreply.com,
	linux-arm-kernel@lists.infradead.org
  Cc: Joao Pinto, David S . Miller, Giuseppe Cavallaro,
	Alexandre Torgue, Maxime Coquelin, Maxime Ripard, Chen-Yu Tsai,
	linux-tegra
In-Reply-To: <BN8PR12MB32661E919A8DEBC7095BAA12D3C80@BN8PR12MB3266.namprd12.prod.outlook.com>


On 18/07/2019 08:48, Jose Abreu wrote:
> From: Jon Hunter <jonathanh@nvidia.com>
> Date: Jul/17/2019, 19:58:53 (UTC+00:00)
> 
>> Let me know if you have any thoughts.
> 
> Can you try attached patch ?

Yes this did not help. I tried enabling the following but no more output
is seen.

CONFIG_DMA_API_DEBUG=y
CONFIG_DMA_API_DEBUG_SG=y

Have you tried using NFS on a board with this ethernet controller?

Cheers,
Jon

-- 
nvpublic

^ permalink raw reply

* WARNING: refcount bug in nr_rx_frame
From: syzbot @ 2019-07-18  9:28 UTC (permalink / raw)
  To: davem, linux-hams, linux-kernel, netdev, ralf, syzkaller-bugs,
	xiyou.wangcong

Hello,

syzbot found the following crash on:

HEAD commit:    0a8ad0ff Merge tag 'for-linus-5.3-ofs1' of git://git.kerne..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=16f3cb70600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=88095c4f62402bcd
dashboard link: https://syzkaller.appspot.com/bug?extid=622bdabb128acc33427d
compiler:       clang version 9.0.0 (/home/glider/llvm/clang  
80fee25776c2fb61e74c1ecb1a523375c2500b69)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=140a8cafa00000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=123b2e68600000

The bug was bisected to:

commit c8c8218ec5af5d2598381883acbefbf604e56b5e
Author: Cong Wang <xiyou.wangcong@gmail.com>
Date:   Thu Jun 27 21:30:58 2019 +0000

     netrom: fix a memory leak in nr_rx_frame()

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1479a100600000
final crash:    https://syzkaller.appspot.com/x/report.txt?x=1679a100600000
console output: https://syzkaller.appspot.com/x/log.txt?x=1279a100600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+622bdabb128acc33427d@syzkaller.appspotmail.com
Fixes: c8c8218ec5af ("netrom: fix a memory leak in nr_rx_frame()")

------------[ cut here ]------------
refcount_t: increment on 0; use-after-free.
WARNING: CPU: 1 PID: 0 at lib/refcount.c:156 refcount_inc_checked+0x4b/0x50  
/lib/refcount.c:156
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.2.0+ #33
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  <IRQ>
  __dump_stack /lib/dump_stack.c:77 [inline]
  dump_stack+0x1d8/0x2f8 /lib/dump_stack.c:113
  panic+0x29b/0x7d9 /kernel/panic.c:219
  __warn+0x22f/0x230 /kernel/panic.c:576
  report_bug+0x190/0x290 /lib/bug.c:186
  fixup_bug /arch/x86/kernel/traps.c:179 [inline]
  do_error_trap+0xd7/0x440 /arch/x86/kernel/traps.c:272
  do_invalid_op+0x36/0x40 /arch/x86/kernel/traps.c:291
  invalid_op+0x14/0x20 /arch/x86/entry/entry_64.S:1008
RIP: 0010:refcount_inc_checked+0x4b/0x50 /lib/refcount.c:156
Code: 3d d9 5e 94 05 01 75 08 e8 22 8f 11 fe 5b 5d c3 e8 1a 8f 11 fe c6 05  
c3 5e 94 05 01 48 c7 c7 49 ca 87 88 31 c0 e8 35 ca e2 fd <0f> 0b eb df 90  
55 48 89 e5 41 57 41 56 41 55 41 54 53 48 83 e4 e0
RSP: 0018:ffff8880aeb09b40 EFLAGS: 00010246
RAX: d6b548c1187c2900 RBX: ffff8880a149f6c0 RCX: ffff8880a98bc340
RDX: 0000000000000302 RSI: 0000000000000302 RDI: 0000000000000000
RBP: ffff8880aeb09b48 R08: ffffffff81600c14 R09: fffffbfff13bb2bb
R10: fffffbfff13bb2bb R11: 0000000000000000 R12: ffff88808240aa40
R13: dffffc0000000000 R14: 0000000000000004 R15: ffff8880a149f640
  sock_hold /./include/net/sock.h:649 [inline]
  sk_add_node /./include/net/sock.h:701 [inline]
  nr_insert_socket /net/netrom/af_netrom.c:137 [inline]
  nr_rx_frame+0x17bc/0x1e40 /net/netrom/af_netrom.c:1023
  nr_loopback_timer+0x6a/0x140 /net/netrom/nr_loopback.c:59
  call_timer_fn+0xec/0x200 /kernel/time/timer.c:1322
  expire_timers /kernel/time/timer.c:1366 [inline]
  __run_timers+0x7cd/0x9c0 /kernel/time/timer.c:1685
  run_timer_softirq+0x4a/0x90 /kernel/time/timer.c:1698
  __do_softirq+0x333/0x7c4 /./arch/x86/include/asm/paravirt.h:777
  invoke_softirq /kernel/softirq.c:373 [inline]
  irq_exit+0x227/0x230 /kernel/softirq.c:413
  exiting_irq /./arch/x86/include/asm/apic.h:537 [inline]
  smp_apic_timer_interrupt+0x113/0x280 /arch/x86/kernel/apic/apic.c:1095
  apic_timer_interrupt+0xf/0x20 /arch/x86/entry/entry_64.S:828
  </IRQ>
RIP: 0010:native_safe_halt+0xe/0x10 /./arch/x86/include/asm/irqflags.h:61
Code: 08 fa eb ae 89 d9 80 e1 07 80 c1 03 38 c1 7c ba 48 89 df e8 a4 04 08  
fa eb b0 90 90 e9 07 00 00 00 0f 00 2d e6 9c 58 00 fb f4 <c3> 90 e9 07 00  
00 00 0f 00 2d d6 9c 58 00 f4 c3 90 90 55 48 89 e5
RSP: 0018:ffff8880a98cfd38 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
RAX: 1ffffffff11950db RBX: ffff8880a98bc340 RCX: dffffc0000000000
RDX: 0000000000000000 RSI: ffffffff812cde8a RDI: ffff8880a98bcb78
RBP: ffff8880a98cfd40 R08: ffff8880a98bcb90 R09: ffffed1015317869
R10: ffffed1015317869 R11: 0000000000000000 R12: 0000000000000001
R13: 1ffff11015317868 R14: dffffc0000000000 R15: dffffc0000000000
  arch_cpu_idle+0xa/0x10 /arch/x86/kernel/process.c:571
  default_idle_call+0x59/0xa0 /kernel/sched/idle.c:94
  cpuidle_idle_call /kernel/sched/idle.c:154 [inline]
  do_idle+0x180/0x780 /kernel/sched/idle.c:263
  cpu_startup_entry+0x25/0x30 /kernel/sched/idle.c:354
  start_secondary+0x3f4/0x490 /arch/x86/kernel/smpboot.c:264
  secondary_startup_64+0xa4/0xb0 /arch/x86/kernel/head_64.S:243
Kernel Offset: disabled
Rebooting in 86400 seconds..


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
For information about bisection process see: https://goo.gl/tpsmEJ#bisection
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply

* Re: [PATCH v4 4/5] vhost/vsock: split packets to send using multiple buffers
From: Stefano Garzarella @ 2019-07-18  9:37 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, linux-kernel, Stefan Hajnoczi, David S. Miller,
	virtualization, Jason Wang, kvm
In-Reply-To: <20190718041234-mutt-send-email-mst@kernel.org>

On Thu, Jul 18, 2019 at 10:13 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> On Thu, Jul 18, 2019 at 09:50:14AM +0200, Stefano Garzarella wrote:
> > On Wed, Jul 17, 2019 at 4:55 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > On Wed, Jul 17, 2019 at 01:30:29PM +0200, Stefano Garzarella wrote:
> > > > If the packets to sent to the guest are bigger than the buffer
> > > > available, we can split them, using multiple buffers and fixing
> > > > the length in the packet header.
> > > > This is safe since virtio-vsock supports only stream sockets.
> > > >
> > > > Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> > >
> > > So how does it work right now? If an app
> > > does sendmsg with a 64K buffer and the other
> > > side publishes 4K buffers - does it just stall?
> >
> > Before this series, the 64K (or bigger) user messages was split in 4K packets
> > (fixed in the code) and queued in an internal list for the TX worker.
> >
> > After this series, we will queue up to 64K packets and then it will be split in
> > the TX worker, depending on the size of the buffers available in the
> > vring. (The idea was to allow EWMA or a configuration of the buffers size, but
> > for now we postponed it)
>
> Got it. Using workers for xmit is IMHO a bad idea btw.
> Why is it done like this?

Honestly, I don't know the exact reasons for this design, but I suppose
that the idea was to have only one worker that uses the vring, and
multiple user threads that enqueue packets in the list.
This can simplify the code and we can put the user threads to sleep if
we don't have "credit" available (this means that the receiver doesn't
have space to receive the packet).

What are the drawbacks in your opinion?


Thanks,
Stefano

^ permalink raw reply

* [PATCH] Signed-off-by: Peter Kosyh <p.kosyh@gmail.com>
From: Peter Kosyh @ 2019-07-18  9:41 UTC (permalink / raw)
  To: p.kosyh; +Cc: davem, David Ahern, Shrijeet Mukherjee, netdev, linux-kernel

vrf_process_v4_outbound() and vrf_process_v6_outbound() do routing
using ip/ipv6 addresses, but don't make sure the header is available in
skb->data[] (skb_headlen() is less then header size).

The situation may occures while forwarding from MPLS layer to vrf, for
example.

So, this patch adds pskb_may_pull() calls in is_ip_tx_frame(), just before
call to vrf_process_... functions.

Signed-off-by: Peter Kosyh <p.kosyh@gmail.com>
---
 drivers/net/vrf.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 54edf8956a25..d552f29a58d1 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -292,13 +292,16 @@ static netdev_tx_t is_ip_tx_frame(struct sk_buff *skb, struct net_device *dev)
 {
 	switch (skb->protocol) {
 	case htons(ETH_P_IP):
+		if (!pskb_may_pull(skb, ETH_HLEN + sizeof(struct iphdr))
+			break;
 		return vrf_process_v4_outbound(skb, dev);
 	case htons(ETH_P_IPV6):
+		if (!pskb_may_pull(skb, ETH_HLEN + sizeof(struct ipv6hdr))
+			break;
 		return vrf_process_v6_outbound(skb, dev);
-	default:
-		vrf_tx_error(dev, skb);
-		return NET_XMIT_DROP;
 	}
+	vrf_tx_error(dev, skb);
+	return NET_XMIT_DROP;
 }
 
 static netdev_tx_t vrf_xmit(struct sk_buff *skb, struct net_device *dev)
-- 
2.11.0


^ permalink raw reply related

* RE: [EXT] [PATCH] bnx2x: Prevent load reordering in tx completion processing
From: Manish Chopra @ 2019-07-18 10:12 UTC (permalink / raw)
  To: Brian King, GR-everest-linux-l2
  Cc: Sudarsana Reddy Kalluru, Ariel Elior, netdev@vger.kernel.org
In-Reply-To: <1563226910-21660-1-git-send-email-brking@linux.vnet.ibm.com>

> -----Original Message-----
> From: Brian King <brking@linux.vnet.ibm.com>
> Sent: Tuesday, July 16, 2019 3:12 AM
> To: GR-everest-linux-l2 <GR-everest-linux-l2@marvell.com>
> Cc: Sudarsana Reddy Kalluru <skalluru@marvell.com>; Ariel Elior
> <aelior@marvell.com>; netdev@vger.kernel.org; Brian King
> <brking@linux.vnet.ibm.com>
> Subject: [EXT] [PATCH] bnx2x: Prevent load reordering in tx completion
> processing
> 
> External Email
> 
> ----------------------------------------------------------------------
> This patch fixes an issue seen on Power systems with bnx2x which results in
> the skb is NULL WARN_ON in bnx2x_free_tx_pkt firing due to the skb pointer
> getting loaded in bnx2x_free_tx_pkt prior to the hw_cons load in
> bnx2x_tx_int. Adding a read memory barrier resolves the issue.
> 
> Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
> ---
>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> index 656ed80..e2be5a6 100644
> --- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> +++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
> @@ -285,6 +285,9 @@ int bnx2x_tx_int(struct bnx2x *bp, struct
> bnx2x_fp_txdata *txdata)
>  	hw_cons = le16_to_cpu(*txdata->tx_cons_sb);
>  	sw_cons = txdata->tx_pkt_cons;
> 
> +	/* Ensure subsequent loads occur after hw_cons */
> +	smp_rmb();
> +
>  	while (sw_cons != hw_cons) {
>  		u16 pkt_cons;
> 
> --
> 1.8.3.1

Could you please explain a bit in detail what could have caused skb to NULL exactly ?
Curious that if skb would have been NULL for some reason it did not cause NULL pointer dereference in bnx2x_free_tx_pkt() on below call -

prefetch(&skb->end);

Which is prior to the said WARN_ON(!skb) in bnx2x_free_tx_pkt().

^ permalink raw reply

* [RFC net] net: generate icmp redirects after netfilter forward hook
From: Florian Westphal @ 2019-07-18 10:24 UTC (permalink / raw)
  To: netdev; +Cc: netfilter-devel, Florian Westphal, Jason Muskat

Quoting from
https://bugzilla.kernel.org/show_bug.cgi?id=204155

 ----
1. Configure an PPPoE interface (e.g., ppp0) with an IPv6 address and
   a prefixlen of 64; e.g., 2001:0DB8::1/64.
2. Configure a Netfilter ip6table rule to drop IN to OUT forwarding of traffic
   across the PPP interface: ip6tables -A FORWARD -i ppp+ -o ppp+ -j DROP
3. Create a Netfilter NFLOG rule to log all outbound traffic.
4. From an Internet IPv6 Address initiate TCP traffic to an IPv6 address
   within the /64 IPv6 address space from step 1, but an IPv6 address that
   is NOT configured on that interface; e.g., ` nc 2001:0DB8::2 80`
5. Observe the NFLOG showing the Netfilter ip6table filter FORWARD rule
   is matched and therefore the traffic should be dropped.
6. Observe the traffic from step 4, that should have been dropped,
   resulted in an outbound ICMPv6 Redirect with a source IPv6 address of
   the PPP interface’s Local Link to the Internet IPv6 Address.
 ----

Problem is that we emit the redirect before passing the packet to the
netfilter FORWARD hook.   The same "problem" exists in ipv4.

There are various counter-arguments to changing this, e.g.  we would
still emit such redirect when packet is dropped later in the stack
(e.g. in POSTROUTING or qdisc).

We will also still emit e.g. ICMPV6 PKTTOOBIG error, as that occurs
before FORWARD as well, and moving that seems wrong (packet
has to be dropped).

The only argument that I can think of in favor of this change
is the lack of a proper alternative to filtering such traffic.

PREROUTING would work, but at that point we lack the
"packet will be forwarded from ppp0 to ppp0" information that
we only have available in FORWARD.

Compile tested only.

Cc: Jason Muskat <jason@metalcastle.ca>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/ipv4/ip_forward.c | 16 +++++------
 net/ipv6/ip6_output.c | 63 ++++++++++++++++++++++++-------------------
 2 files changed, 44 insertions(+), 35 deletions(-)

diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 06f6f280b9ff..33c46470c966 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -69,6 +69,14 @@ static int ip_forward_finish(struct net *net, struct sock *sk, struct sk_buff *s
 	__IP_INC_STATS(net, IPSTATS_MIB_OUTFORWDATAGRAMS);
 	__IP_ADD_STATS(net, IPSTATS_MIB_OUTOCTETS, skb->len);
 
+	/*
+	 *	We now generate an ICMP HOST REDIRECT giving the route
+	 *	we calculated.
+	 */
+	if (IPCB(skb)->flags & IPSKB_DOREDIRECT && !opt->srr &&
+	    !skb_sec_path(skb))
+		ip_rt_send_redirect(skb);
+
 #ifdef CONFIG_NET_SWITCHDEV
 	if (skb->offload_l3_fwd_mark) {
 		consume_skb(skb);
@@ -143,14 +151,6 @@ int ip_forward(struct sk_buff *skb)
 	/* Decrease ttl after skb cow done */
 	ip_decrease_ttl(iph);
 
-	/*
-	 *	We now generate an ICMP HOST REDIRECT giving the route
-	 *	we calculated.
-	 */
-	if (IPCB(skb)->flags & IPSKB_DOREDIRECT && !opt->srr &&
-	    !skb_sec_path(skb))
-		ip_rt_send_redirect(skb);
-
 	if (net->ipv4.sysctl_ip_fwd_update_priority)
 		skb->priority = rt_tos2priority(iph->tos);
 
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 8e49fd62eea9..2dafd2da2926 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -383,11 +383,45 @@ static int ip6_forward_proxy_check(struct sk_buff *skb)
 static inline int ip6_forward_finish(struct net *net, struct sock *sk,
 				     struct sk_buff *skb)
 {
+	struct inet6_skb_parm *opt = IP6CB(skb);
 	struct dst_entry *dst = skb_dst(skb);
 
 	__IP6_INC_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTFORWDATAGRAMS);
 	__IP6_ADD_STATS(net, ip6_dst_idev(dst), IPSTATS_MIB_OUTOCTETS, skb->len);
 
+	/* IPv6 specs say nothing about it, but it is clear that we cannot
+	   send redirects to source routed frames.
+	   We don't send redirects to frames decapsulated from IPsec.
+	 */
+	if (IP6CB(skb)->iif == dst->dev->ifindex &&
+	    opt->srcrt == 0 && !skb_sec_path(skb)) {
+		struct ipv6hdr *hdr = ipv6_hdr(skb);
+		struct in6_addr *target = NULL;
+		struct inet_peer *peer;
+		struct rt6_info *rt;
+
+		/*
+		 *	incoming and outgoing devices are the same
+		 *	send a redirect.
+		 */
+
+		rt = (struct rt6_info *) dst;
+		if (rt->rt6i_flags & RTF_GATEWAY)
+			target = &rt->rt6i_gateway;
+		else
+			target = &hdr->daddr;
+
+		peer = inet_getpeer_v6(net->ipv6.peers, &hdr->daddr, 1);
+
+		/* Limit redirects both by destination (here)
+		   and by source (inside ndisc_send_redirect)
+		 */
+		if (inet_peer_xrlim_allow(peer, 1*HZ))
+			ndisc_send_redirect(skb, target);
+		if (peer)
+			inet_putpeer(peer);
+	}
+
 #ifdef CONFIG_NET_SWITCHDEV
 	if (skb->offload_l3_fwd_mark) {
 		consume_skb(skb);
@@ -498,33 +532,8 @@ int ip6_forward(struct sk_buff *skb)
 	   send redirects to source routed frames.
 	   We don't send redirects to frames decapsulated from IPsec.
 	 */
-	if (IP6CB(skb)->iif == dst->dev->ifindex &&
-	    opt->srcrt == 0 && !skb_sec_path(skb)) {
-		struct in6_addr *target = NULL;
-		struct inet_peer *peer;
-		struct rt6_info *rt;
-
-		/*
-		 *	incoming and outgoing devices are the same
-		 *	send a redirect.
-		 */
-
-		rt = (struct rt6_info *) dst;
-		if (rt->rt6i_flags & RTF_GATEWAY)
-			target = &rt->rt6i_gateway;
-		else
-			target = &hdr->daddr;
-
-		peer = inet_getpeer_v6(net->ipv6.peers, &hdr->daddr, 1);
-
-		/* Limit redirects both by destination (here)
-		   and by source (inside ndisc_send_redirect)
-		 */
-		if (inet_peer_xrlim_allow(peer, 1*HZ))
-			ndisc_send_redirect(skb, target);
-		if (peer)
-			inet_putpeer(peer);
-	} else {
+	if (IP6CB(skb)->iif != dst->dev->ifindex ||
+	    opt->srcrt || skb_sec_path(skb)) {
 		int addrtype = ipv6_addr_type(&hdr->saddr);
 
 		/* This check is security critical. */
-- 
2.21.0


^ permalink raw reply related

* [PATCH] Signed-off-by: Peter Kosyh <p.kosyh@gmail.com>
From: Peter Kosyh @ 2019-07-18 10:26 UTC (permalink / raw)
  To: p.kosyh; +Cc: davem, David Ahern, Shrijeet Mukherjee, netdev, linux-kernel

vrf_process_v4_outbound() and vrf_process_v6_outbound() do routing
using ip/ipv6 addresses, but don't make sure the header is available in
skb->data[] (skb_headlen() is less then header size).

The situation may occures while forwarding from MPLS layer to vrf, for example.

So, patch adds pskb_may_pull() calls in is_ip_tx_frame(), just before call
to vrf_process_... functions.

Signed-off-by: Peter Kosyh <p.kosyh@gmail.com>
---
 drivers/net/vrf.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 54edf8956a25..7c0200bc46f9 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -292,13 +292,16 @@ static netdev_tx_t is_ip_tx_frame(struct sk_buff *skb, struct net_device *dev)
 {
 	switch (skb->protocol) {
 	case htons(ETH_P_IP):
+		if (!pskb_may_pull(skb, ETH_HLEN + sizeof(struct iphdr)))
+			break;
 		return vrf_process_v4_outbound(skb, dev);
 	case htons(ETH_P_IPV6):
+		if (!pskb_may_pull(skb, ETH_HLEN + sizeof(struct ipv6hdr)))
+			break;
 		return vrf_process_v6_outbound(skb, dev);
-	default:
-		vrf_tx_error(dev, skb);
-		return NET_XMIT_DROP;
 	}
+	vrf_tx_error(dev, skb);
+	return NET_XMIT_DROP;
 }
 
 static netdev_tx_t vrf_xmit(struct sk_buff *skb, struct net_device *dev)
-- 
2.11.0


^ permalink raw reply related

* Re: [RFC v2] vhost: introduce mdev based hardware vhost backend
From: Jason Wang @ 2019-07-18 10:31 UTC (permalink / raw)
  To: Tiwei Bie
  Cc: Alex Williamson, mst, maxime.coquelin, linux-kernel, kvm,
	virtualization, netdev, dan.daly, cunming.liang, zhihong.wang,
	idos, Rob Miller, Ariel Adam
In-Reply-To: <1b49aa84-2c1f-eec2-2809-711e1f2dd7de@redhat.com>


On 2019/7/10 下午3:22, Jason Wang wrote:
>> Yeah, that's a major concern. If it's true, is it something
>> that's not acceptable?
>
>
> I think not, but I don't know if any other one that care this.
>
>
>>
>>> And I do see some new RFC for VFIO to add more DMA API.
>> Is there any pointers?
>
>
> I don't remember the details, but it should be something related to 
> SVA support in recent intel IOMMU.


E.g this series:

https://www.spinics.net/lists/iommu/msg37146.html

Thanks


^ permalink raw reply

* Re: regression with napi/softirq ?
From: Sudip Mukherjee @ 2019-07-18 11:18 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), David S. Miller, netdev,
	linux-kernel
In-Reply-To: <052e43b6-26f8-3e46-784e-dc3c6a82bdf0@gmail.com>

Hi Eric,

On Thu, Jul 18, 2019 at 7:58 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 7/17/19 11:52 PM, Thomas Gleixner wrote:
> > Sudip,
> >
> > On Wed, 17 Jul 2019, Sudip Mukherjee wrote:
> >> On Wed, Jul 17, 2019 at 9:53 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >>> You can hack ksoftirq_running() to return always false to avoid this, but
> >>> that might cause application starvation and a huge packet buffer backlog
> >>> when the amount of incoming packets makes the CPU do nothing else than
> >>> softirq processing.
> >>
> >> I tried that now, it is better but still not as good as v3.8
> >> Now I am getting 375.9usec as the maximum time between raising the softirq
> >> and it starting to execute and packet drops still there.
> >>
> >> And just a thought, do you think there should be a CONFIG_ option for
> >> this feature of ksoftirqd_running() so that it can be disabled if needed
> >> by users like us?
> >
> > If at all then a sysctl to allow runtime control.
> >

<snip>

>
> ksoftirqd might be spuriously scheduled from tx path, when
> __qdisc_run() also reacts to need_resched().
>
> By raising NET_TX while we are processing NET_RX (say we send a TCP ACK packet
> in response to incoming packet), we force __do_softirq() to perform
> another loop, but before doing an other round, it will also check need_resched()
> and eventually call wakeup_softirqd()
>
> I wonder if following patch makes any difference.
>
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 11c03cf4aa74b44663c74e0e3284140b0c75d9c4..ab736e974396394ae6ba409868aaea56a50ad57b 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -377,6 +377,8 @@ void __qdisc_run(struct Qdisc *q)
>         int packets;
>
>         while (qdisc_restart(q, &packets)) {
> +               if (qdisc_is_empty(q))
> +                       break;

unfortunately its v4.14.55 and qdisc_is_empty() is not yet introduced.
And I can not backport 28cff537ef2e ("net: sched: add empty status
flag for NOLOCK qdisc")
also as TCQ_F_NOLOCK is there. :(


-- 
Regards
Sudip

^ permalink raw reply

* Re: [PATCH v4 4/5] vhost/vsock: split packets to send using multiple buffers
From: Michael S. Tsirkin @ 2019-07-18 11:35 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: netdev, linux-kernel, Stefan Hajnoczi, David S. Miller,
	virtualization, Jason Wang, kvm
In-Reply-To: <CAGxU2F6oo7Cou7t9o=gG2=wxHMKX9xYQXNxVtDYeHq5fyEhJWg@mail.gmail.com>

On Thu, Jul 18, 2019 at 11:37:30AM +0200, Stefano Garzarella wrote:
> On Thu, Jul 18, 2019 at 10:13 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Thu, Jul 18, 2019 at 09:50:14AM +0200, Stefano Garzarella wrote:
> > > On Wed, Jul 17, 2019 at 4:55 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> > > > On Wed, Jul 17, 2019 at 01:30:29PM +0200, Stefano Garzarella wrote:
> > > > > If the packets to sent to the guest are bigger than the buffer
> > > > > available, we can split them, using multiple buffers and fixing
> > > > > the length in the packet header.
> > > > > This is safe since virtio-vsock supports only stream sockets.
> > > > >
> > > > > Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> > > >
> > > > So how does it work right now? If an app
> > > > does sendmsg with a 64K buffer and the other
> > > > side publishes 4K buffers - does it just stall?
> > >
> > > Before this series, the 64K (or bigger) user messages was split in 4K packets
> > > (fixed in the code) and queued in an internal list for the TX worker.
> > >
> > > After this series, we will queue up to 64K packets and then it will be split in
> > > the TX worker, depending on the size of the buffers available in the
> > > vring. (The idea was to allow EWMA or a configuration of the buffers size, but
> > > for now we postponed it)
> >
> > Got it. Using workers for xmit is IMHO a bad idea btw.
> > Why is it done like this?
> 
> Honestly, I don't know the exact reasons for this design, but I suppose
> that the idea was to have only one worker that uses the vring, and
> multiple user threads that enqueue packets in the list.
> This can simplify the code and we can put the user threads to sleep if
> we don't have "credit" available (this means that the receiver doesn't
> have space to receive the packet).


I think you mean the reverse: even without credits you can copy from
user and queue up data, then process it without waking up the user
thread.
Does it help though? It certainly adds up work outside of
user thread context which means it's not accounted for
correctly.

Maybe we want more VQs. Would help improve parallelism. The question
would then become how to map sockets to VQs. With a simple hash
it's easy to create collisions ...


> 
> What are the drawbacks in your opinion?
> 
> 
> Thanks,
> Stefano

- More pressure on scheduler
- Increased latency


-- 
MST

^ permalink raw reply

* Re: regression with napi/softirq ?
From: Eric Dumazet @ 2019-07-18 11:42 UTC (permalink / raw)
  To: Sudip Mukherjee
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), David S. Miller, netdev,
	linux-kernel
In-Reply-To: <CADVatmN6xNO1iMQ4ihsT5OqV2cuj2ajq+v00NrtUyOHkiKPo-Q@mail.gmail.com>



On 7/18/19 1:18 PM, Sudip Mukherjee wrote:
> Hi Eric,
> 
> On Thu, Jul 18, 2019 at 7:58 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>
>>
>>
>> On 7/17/19 11:52 PM, Thomas Gleixner wrote:
>>> Sudip,
>>>
>>> On Wed, 17 Jul 2019, Sudip Mukherjee wrote:
>>>> On Wed, Jul 17, 2019 at 9:53 PM Thomas Gleixner <tglx@linutronix.de> wrote:
>>>>> You can hack ksoftirq_running() to return always false to avoid this, but
>>>>> that might cause application starvation and a huge packet buffer backlog
>>>>> when the amount of incoming packets makes the CPU do nothing else than
>>>>> softirq processing.
>>>>
>>>> I tried that now, it is better but still not as good as v3.8
>>>> Now I am getting 375.9usec as the maximum time between raising the softirq
>>>> and it starting to execute and packet drops still there.
>>>>
>>>> And just a thought, do you think there should be a CONFIG_ option for
>>>> this feature of ksoftirqd_running() so that it can be disabled if needed
>>>> by users like us?
>>>
>>> If at all then a sysctl to allow runtime control.
>>>
> 
> <snip>
> 
>>
>> ksoftirqd might be spuriously scheduled from tx path, when
>> __qdisc_run() also reacts to need_resched().
>>
>> By raising NET_TX while we are processing NET_RX (say we send a TCP ACK packet
>> in response to incoming packet), we force __do_softirq() to perform
>> another loop, but before doing an other round, it will also check need_resched()
>> and eventually call wakeup_softirqd()
>>
>> I wonder if following patch makes any difference.
>>
>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>> index 11c03cf4aa74b44663c74e0e3284140b0c75d9c4..ab736e974396394ae6ba409868aaea56a50ad57b 100644
>> --- a/net/sched/sch_generic.c
>> +++ b/net/sched/sch_generic.c
>> @@ -377,6 +377,8 @@ void __qdisc_run(struct Qdisc *q)
>>         int packets;
>>
>>         while (qdisc_restart(q, &packets)) {
>> +               if (qdisc_is_empty(q))
>> +                       break;
> 
> unfortunately its v4.14.55 and qdisc_is_empty() is not yet introduced.
> And I can not backport 28cff537ef2e ("net: sched: add empty status
> flag for NOLOCK qdisc")
> also as TCQ_F_NOLOCK is there. :(
> 

On old kernels, you can simply use

static inline bool qdisc_is_empty(struct Qdisc *q)
{
	return !qdisc_qlen(q);
}


^ permalink raw reply

* KASAN: use-after-free Read in nr_insert_socket
From: syzbot @ 2019-07-18 12:18 UTC (permalink / raw)
  To: davem, linux-hams, linux-kernel, netdev, ralf, syzkaller-bugs,
	xiyou.wangcong

Hello,

syzbot found the following crash on:

HEAD commit:    a5b64700 fix: taprio: Change type of txtime-delay paramete..
git tree:       net
console output: https://syzkaller.appspot.com/x/log.txt?x=1588b458600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=87305c3ca9c25c70
dashboard link: https://syzkaller.appspot.com/bug?extid=9399c158fcc09b21d0d2
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=105a61a4600000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=153ef948600000

The bug was bisected to:

commit c8c8218ec5af5d2598381883acbefbf604e56b5e
Author: Cong Wang <xiyou.wangcong@gmail.com>
Date:   Thu Jun 27 21:30:58 2019 +0000

     netrom: fix a memory leak in nr_rx_frame()

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=159ef948600000
final crash:    https://syzkaller.appspot.com/x/report.txt?x=179ef948600000
console output: https://syzkaller.appspot.com/x/log.txt?x=139ef948600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+9399c158fcc09b21d0d2@syzkaller.appspotmail.com
Fixes: c8c8218ec5af ("netrom: fix a memory leak in nr_rx_frame()")

==================================================================
BUG: KASAN: use-after-free in atomic_read  
/./include/asm-generic/atomic-instrumented.h:26 [inline]
BUG: KASAN: use-after-free in refcount_inc_not_zero_checked+0x81/0x200  
/lib/refcount.c:123
Read of size 4 at addr ffff8880a5d3f380 by task swapper/1/0

CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.2.0+ #89
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  <IRQ>
  __dump_stack /lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 /lib/dump_stack.c:113
  print_address_description.cold+0xd4/0x306 /mm/kasan/report.c:351
  __kasan_report.cold+0x1b/0x36 /mm/kasan/report.c:482
  kasan_report+0x12/0x20 /mm/kasan/common.c:612
  check_memory_region_inline /mm/kasan/generic.c:185 [inline]
  check_memory_region+0x134/0x1a0 /mm/kasan/generic.c:192
  __kasan_check_read+0x11/0x20 /mm/kasan/common.c:92
  atomic_read /./include/asm-generic/atomic-instrumented.h:26 [inline]
  refcount_inc_not_zero_checked+0x81/0x200 /lib/refcount.c:123
  refcount_inc_checked+0x17/0x70 /lib/refcount.c:156
  sock_hold /./include/net/sock.h:649 [inline]
  sk_add_node /./include/net/sock.h:701 [inline]
  nr_insert_socket+0x2d/0xe0 /net/netrom/af_netrom.c:137
  nr_rx_frame+0x1605/0x1e80 /net/netrom/af_netrom.c:1023
  nr_loopback_timer+0x7b/0x170 /net/netrom/nr_loopback.c:59
  call_timer_fn+0x1ac/0x780 /kernel/time/timer.c:1322
  expire_timers /kernel/time/timer.c:1366 [inline]
  __run_timers /kernel/time/timer.c:1685 [inline]
  __run_timers /kernel/time/timer.c:1653 [inline]
  run_timer_softirq+0x697/0x17a0 /kernel/time/timer.c:1698
  __do_softirq+0x262/0x98c /kernel/softirq.c:292
  invoke_softirq /kernel/softirq.c:373 [inline]
  irq_exit+0x19b/0x1e0 /kernel/softirq.c:413
  exiting_irq /./arch/x86/include/asm/apic.h:537 [inline]
  smp_apic_timer_interrupt+0x1a3/0x610 /arch/x86/kernel/apic/apic.c:1095
  apic_timer_interrupt+0xf/0x20 /arch/x86/entry/entry_64.S:828
  </IRQ>
RIP: 0010:native_safe_halt+0xe/0x10 /./arch/x86/include/asm/irqflags.h:61
Code: e8 2b 7b fa eb 8a 90 90 90 90 90 90 e9 07 00 00 00 0f 00 2d d4 0e 57  
00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d c4 0e 57 00 fb f4 <c3> 90 55 48 89  
e5 41 57 41 56 41 55 41 54 53 e8 7e 27 2f fa e8 59
RSP: 0018:ffff8880a98e7d68 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
RAX: 1ffffffff11a5ca5 RBX: ffff8880a98ce340 RCX: 0000000000000000
RDX: dffffc0000000000 RSI: 0000000000000006 RDI: ffff8880a98cebcc
RBP: ffff8880a98e7d98 R08: ffff8880a98ce340 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000
R13: ffffffff89a29778 R14: 0000000000000000 R15: 0000000000000001
  arch_cpu_idle+0xa/0x10 /arch/x86/kernel/process.c:571
  default_idle_call+0x84/0xb0 /kernel/sched/idle.c:94
  cpuidle_idle_call /kernel/sched/idle.c:154 [inline]
  do_idle+0x413/0x760 /kernel/sched/idle.c:263
  cpu_startup_entry+0x1b/0x20 /kernel/sched/idle.c:354
  start_secondary+0x315/0x430 /arch/x86/kernel/smpboot.c:264
  secondary_startup_64+0xa4/0xb0 /arch/x86/kernel/head_64.S:243

Allocated by task 0:
  save_stack+0x23/0x90 /mm/kasan/common.c:69
  set_track /mm/kasan/common.c:77 [inline]
  __kasan_kmalloc /mm/kasan/common.c:487 [inline]
  __kasan_kmalloc.constprop.0+0xcf/0xe0 /mm/kasan/common.c:460
  kasan_kmalloc+0x9/0x10 /mm/kasan/common.c:501
  __do_kmalloc /mm/slab.c:3655 [inline]
  __kmalloc+0x163/0x780 /mm/slab.c:3664
  kmalloc /./include/linux/slab.h:557 [inline]
  sk_prot_alloc+0x23a/0x310 /net/core/sock.c:1603
  sk_alloc+0x39/0xf70 /net/core/sock.c:1657
  nr_make_new /net/netrom/af_netrom.c:476 [inline]
  nr_rx_frame+0x733/0x1e80 /net/netrom/af_netrom.c:959
  nr_loopback_timer+0x7b/0x170 /net/netrom/nr_loopback.c:59
  call_timer_fn+0x1ac/0x780 /kernel/time/timer.c:1322
  expire_timers /kernel/time/timer.c:1366 [inline]
  __run_timers /kernel/time/timer.c:1685 [inline]
  __run_timers /kernel/time/timer.c:1653 [inline]
  run_timer_softirq+0x697/0x17a0 /kernel/time/timer.c:1698
  __do_softirq+0x262/0x98c /kernel/softirq.c:292

Freed by task 11342:
  save_stack+0x23/0x90 /mm/kasan/common.c:69
  set_track /mm/kasan/common.c:77 [inline]
  __kasan_slab_free+0x102/0x150 /mm/kasan/common.c:449
  kasan_slab_free+0xe/0x10 /mm/kasan/common.c:457
  __cache_free /mm/slab.c:3425 [inline]
  kfree+0x10a/0x2c0 /mm/slab.c:3756
  sk_prot_free /net/core/sock.c:1640 [inline]
  __sk_destruct+0x4f7/0x6e0 /net/core/sock.c:1726
  sk_destruct+0x86/0xa0 /net/core/sock.c:1734
  __sk_free+0xfb/0x360 /net/core/sock.c:1745
  sk_free+0x42/0x50 /net/core/sock.c:1756
  sock_put /./include/net/sock.h:1725 [inline]
  sock_efree+0x61/0x80 /net/core/sock.c:2042
  skb_release_head_state+0xeb/0x260 /net/core/skbuff.c:652
  skb_release_all+0x16/0x60 /net/core/skbuff.c:663
  __kfree_skb /net/core/skbuff.c:679 [inline]
  kfree_skb /net/core/skbuff.c:697 [inline]
  kfree_skb+0x101/0x3c0 /net/core/skbuff.c:691
  nr_accept+0x570/0x720 /net/netrom/af_netrom.c:819
  __sys_accept4+0x34e/0x6a0 /net/socket.c:1750
  __do_sys_accept4 /net/socket.c:1785 [inline]
  __se_sys_accept4 /net/socket.c:1782 [inline]
  __x64_sys_accept4+0x97/0xf0 /net/socket.c:1782
  do_syscall_64+0xfd/0x6a0 /arch/x86/entry/common.c:296
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff8880a5d3f300
  which belongs to the cache kmalloc-2k of size 2048
The buggy address is located 128 bytes inside of
  2048-byte region [ffff8880a5d3f300, ffff8880a5d3fb00)
The buggy address belongs to the page:
page:ffffea0002974f80 refcount:1 mapcount:0 mapping:ffff8880aa400e00  
index:0x0 compound_mapcount: 0
flags: 0x1fffc0000010200(slab|head)
raw: 01fffc0000010200 ffffea00021bf808 ffffea000282a708 ffff8880aa400e00
raw: 0000000000000000 ffff8880a5d3e200 0000000100000003 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
  ffff8880a5d3f280: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff8880a5d3f300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ffff8880a5d3f380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                    ^
  ffff8880a5d3f400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff8880a5d3f480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
For information about bisection process see: https://goo.gl/tpsmEJ#bisection
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply

* Re: [PATCH v4 5/5] vsock/virtio: change the maximum packet size allowed
From: Michael S. Tsirkin @ 2019-07-18 12:33 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: netdev, linux-kernel, Stefan Hajnoczi, David S. Miller,
	virtualization, Jason Wang, kvm
In-Reply-To: <CAGxU2F5ybg1_8VhS=COMnxSKC4AcW4ZagYwNMi==d6-rNPgzsg@mail.gmail.com>

On Thu, Jul 18, 2019 at 09:52:41AM +0200, Stefano Garzarella wrote:
> On Wed, Jul 17, 2019 at 5:00 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Wed, Jul 17, 2019 at 01:30:30PM +0200, Stefano Garzarella wrote:
> > > Since now we are able to split packets, we can avoid limiting
> > > their sizes to VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE.
> > > Instead, we can use VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the max
> > > packet size.
> > >
> > > Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
> >
> >
> > OK so this is kind of like GSO where we are passing
> > 64K packets to the vsock and then split at the
> > low level.
> 
> Exactly, something like that in the Host->Guest path, instead in the
> Guest->Host we use the entire 64K packet.
> 
> Thanks,
> Stefano

btw two allocations for each packet isn't great. How about
allocating the struct linearly with the data?
And all buffers are same length for you - so you can actually
do alloc_pages.
Allocating/freeing pages in a batch should also be considered.

-- 
MST

^ permalink raw reply

* Re: regression with napi/softirq ?
From: Thomas Gleixner @ 2019-07-18 12:40 UTC (permalink / raw)
  To: Sudip Mukherjee
  Cc: Eric Dumazet, Peter Zijlstra (Intel), David S. Miller, netdev,
	linux-kernel
In-Reply-To: <CADVatmN6xNO1iMQ4ihsT5OqV2cuj2ajq+v00NrtUyOHkiKPo-Q@mail.gmail.com>

On Thu, 18 Jul 2019, Sudip Mukherjee wrote:
> On Thu, Jul 18, 2019 at 7:58 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > ksoftirqd might be spuriously scheduled from tx path, when
> > __qdisc_run() also reacts to need_resched().
> >
> > By raising NET_TX while we are processing NET_RX (say we send a TCP ACK packet
> > in response to incoming packet), we force __do_softirq() to perform
> > another loop, but before doing an other round, it will also check need_resched()
> > and eventually call wakeup_softirqd()
> >
> > I wonder if following patch makes any difference.
> >
> > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> > index 11c03cf4aa74b44663c74e0e3284140b0c75d9c4..ab736e974396394ae6ba409868aaea56a50ad57b 100644
> > --- a/net/sched/sch_generic.c
> > +++ b/net/sched/sch_generic.c
> > @@ -377,6 +377,8 @@ void __qdisc_run(struct Qdisc *q)
> >         int packets;
> >
> >         while (qdisc_restart(q, &packets)) {
> > +               if (qdisc_is_empty(q))
> > +                       break;
> 
> unfortunately its v4.14.55 and qdisc_is_empty() is not yet introduced.
> And I can not backport 28cff537ef2e ("net: sched: add empty status
> flag for NOLOCK qdisc")
> also as TCQ_F_NOLOCK is there. :(

Then please run the test on a current kernel. Backports can be discussed
once the problem is understood.

Thanks,

	tglx

^ permalink raw reply

* Re: regression with napi/softirq ?
From: Sudip Mukherjee @ 2019-07-18 12:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Thomas Gleixner, Peter Zijlstra (Intel), David S. Miller, netdev,
	linux-kernel
In-Reply-To: <8124bbe5-eaa8-2106-2695-4788ec0f6544@gmail.com>

On Thu, Jul 18, 2019 at 12:42 PM Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
>
> On 7/18/19 1:18 PM, Sudip Mukherjee wrote:
> > Hi Eric,
> >
> > On Thu, Jul 18, 2019 at 7:58 AM Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >>
> >>
> >>
> >> On 7/17/19 11:52 PM, Thomas Gleixner wrote:
> >>> Sudip,
> >>>
> >>> On Wed, 17 Jul 2019, Sudip Mukherjee wrote:
> >>>> On Wed, Jul 17, 2019 at 9:53 PM Thomas Gleixner <tglx@linutronix.de> wrote:
> >>>>> You can hack ksoftirq_running() to return always false to avoid this, but
> >>>>> that might cause application starvation and a huge packet buffer backlog
> >>>>> when the amount of incoming packets makes the CPU do nothing else than
> >>>>> softirq processing.
> >>>>
> >>>> I tried that now, it is better but still not as good as v3.8
> >>>> Now I am getting 375.9usec as the maximum time between raising the softirq
> >>>> and it starting to execute and packet drops still there.
> >>>>
> >>>> And just a thought, do you think there should be a CONFIG_ option for
> >>>> this feature of ksoftirqd_running() so that it can be disabled if needed
> >>>> by users like us?
> >>>
> >>> If at all then a sysctl to allow runtime control.
> >>>
> >
> > <snip>
> >
> >>
> >> ksoftirqd might be spuriously scheduled from tx path, when
> >> __qdisc_run() also reacts to need_resched().
> >>
> >> By raising NET_TX while we are processing NET_RX (say we send a TCP ACK packet
> >> in response to incoming packet), we force __do_softirq() to perform
> >> another loop, but before doing an other round, it will also check need_resched()
> >> and eventually call wakeup_softirqd()
> >>
> >> I wonder if following patch makes any difference.
> >>
> >> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> >> index 11c03cf4aa74b44663c74e0e3284140b0c75d9c4..ab736e974396394ae6ba409868aaea56a50ad57b 100644
> >> --- a/net/sched/sch_generic.c
> >> +++ b/net/sched/sch_generic.c
> >> @@ -377,6 +377,8 @@ void __qdisc_run(struct Qdisc *q)
> >>         int packets;
> >>
> >>         while (qdisc_restart(q, &packets)) {
> >> +               if (qdisc_is_empty(q))
> >> +                       break;
> >
> > unfortunately its v4.14.55 and qdisc_is_empty() is not yet introduced.
> > And I can not backport 28cff537ef2e ("net: sched: add empty status
> > flag for NOLOCK qdisc")
> > also as TCQ_F_NOLOCK is there. :(
> >
>
> On old kernels, you can simply use
>
> static inline bool qdisc_is_empty(struct Qdisc *q)
> {
>         return !qdisc_qlen(q);
> }
>

Thanks Eric. But there is no improvement in delay between
softirq_raise and softirq_entry with this change.
But moving to a later kernel (linus master branch? ) like Thomas has
said in the other mail might be difficult atm. I can definitely
move to v4.14.133 if that helps. Thomas ?


-- 
Regards
Sudip

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox