Netdev List

* [Patch net] net_sched: check cops->tcf_block in tc_bind_tclass()
From: Cong Wang @ 2019-09-08 19:11 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang, syzbot+21b29db13c065852f64b, Jamal Hadi Salim,
	Jiri Pirko

At least sch_red and sch_tbf don't implement ->tcf_block()
while still having a non-zero tc "class".

Instead of adding nop implementations to each such qdisc,
we can just check cops->tcf_block in tc_bind_tclass() and
return early when it is NULL. They don't support TC filters anyway.

Reported-by: syzbot+21b29db13c065852f64b@syzkaller.appspotmail.com
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/sched/sch_api.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 04faee7ccbce..1047825d9f48 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1920,6 +1920,8 @@ static void tc_bind_tclass(struct Qdisc *q, u32 portid, u32 clid,
 	cl = cops->find(q, portid);
 	if (!cl)
 		return;
+	if (!cops->tcf_block)
+		return;
 	block = cops->tcf_block(q, cl, NULL);
 	if (!block)
 		return;
-- 
2.21.0
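
For illustration, the optional-callback check that the patch adds can be modeled in plain userspace C. This is only a sketch: struct class_ops, struct block, bind_class() and the two example ops tables are hypothetical stand-ins, not the kernel's Qdisc_class_ops or tc_bind_tclass().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct block { int bound; };

struct class_ops {
	/* Required: look up a class by id; returns 0 when not found. */
	unsigned long (*find)(unsigned int classid);
	/* Optional: qdiscs without filter support leave this NULL. */
	struct block *(*tcf_block)(unsigned long cl);
};

/* Returns 1 if a filter block was bound, 0 if there was nothing to do. */
static int bind_class(const struct class_ops *cops, unsigned int classid)
{
	unsigned long cl;
	struct block *blk;

	assert(cops && cops->find);
	cl = cops->find(classid);
	if (!cl)
		return 0;
	if (!cops->tcf_block)	/* the early return the patch introduces */
		return 0;
	blk = cops->tcf_block(cl);
	if (!blk)
		return 0;
	blk->bound = 1;
	return 1;
}

/* Both example tables know exactly one class, with id 1. */
static unsigned long one_class_find(unsigned int classid)
{
	return classid == 1 ? 1UL : 0UL;
}

static struct block demo_block;

static struct block *demo_tcf_block(unsigned long cl)
{
	(void)cl;
	return &demo_block;
}

/* Models a qdisc that has a class but no filter support. */
static const struct class_ops no_filter_ops = {
	.find = one_class_find,
	.tcf_block = NULL,
};

/* Models a qdisc that implements the optional callback. */
static const struct class_ops filter_ops = {
	.find = one_class_find,
	.tcf_block = demo_tcf_block,
};
```

With no_filter_ops, bind_class() returns before the NULL function pointer would be called, which is the crash the patch is meant to avoid.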


* Re: general protection fault in qdisc_put
From: Linus Torvalds @ 2019-09-08 17:18 UTC (permalink / raw)
  To: syzbot
  Cc: akinobu.mita, Andrew Morton, David Miller, Dmitry Vyukov, jhs,
	jiri, Linux List Kernel Mailing, Michal Hocko, Netdev,
	syzkaller-bugs, Cong Wang
In-Reply-To: <000000000000df42500592047e0a@google.com>

On Sat, Sep 7, 2019 at 11:08 PM syzbot
<syzbot+d5870a903591faaca4ae@syzkaller.appspotmail.com> wrote:
>
> The bug was bisected to:
>
> commit e41d58185f1444368873d4d7422f7664a68be61d
> Author: Dmitry Vyukov <dvyukov@google.com>
> Date:   Wed Jul 12 21:34:35 2017 +0000
>
>      fault-inject: support systematic fault injection

That commit does seem a bit questionable, but not the cause of this
problem (just the trigger).

I think the questionable part is that the new code doesn't honor the
task filtering, and will fail even for protected tasks. Dmitry?

> kasan: GPF could be caused by NULL-ptr deref or user memory access
> general protection fault: 0000 [#1] PREEMPT SMP KASAN
> CPU: 1 PID: 9699 Comm: syz-executor169 Not tainted 5.3.0-rc7+ #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> RIP: 0010:qdisc_put+0x25/0x90 net/sched/sch_generic.c:983

Yes, looks like 'qdisc' is NULL.

This is the

        qdisc_put(q->qdisc);

in sfb_destroy(), called from qdisc_create().

I think what is happening is this (in qdisc_create()):

        if (ops->init) {
                err = ops->init(sch, tca[TCA_OPTIONS], extack);
                if (err != 0)
                        goto err_out5;
        }
        ...
err_out5:
        /* ops->init() failed, we call ->destroy() like qdisc_create_dflt() */
        if (ops->destroy)
                ops->destroy(sch);

and "ops->init" is sfb_init(), which will not initialize q->qdisc if
tcf_block_get() fails.

I see three solutions:

 (a) move the

        q->qdisc = &noop_qdisc;

     up earlier in sfb_init(), so that qdisc is always initialized
after sfb_init(), even on failure.

 (b) make qdisc_put(NULL) silently work as a no-op.

 (c) change all the semantics to not call ->destroy if ->init failed.

Honestly, (a) seems very fragile - do all the other init routines do
this? And (c) sounds like a big change, and very fragile too.

So I'd suggest that qdisc_put() be made to just ignore a NULL pointer
(and maybe an error pointer too?).
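
As a rough userspace model of option (b), assuming nothing about the real qdisc code: struct child, struct sched and every function below are hypothetical names. The point is only that a NULL-tolerant put makes "destroy after a failed init" safe.

```c
#include <assert.h>
#include <stdlib.h>

struct child { int refcnt; };		/* stand-in for the inner qdisc */
struct sched { struct child *qdisc; };	/* stand-in for the sfb private data */

static int frees_done;			/* counts real releases, for checking */

/* Option (b): dropping a reference on NULL is a silent no-op. */
static void child_put(struct child *c)
{
	if (!c)
		return;
	assert(c->refcnt > 0);
	if (--c->refcnt == 0) {
		free(c);
		frees_done++;
	}
}

/* init may fail before the child pointer has been assigned. */
static int sched_init(struct sched *s, int fail_early)
{
	if (fail_early)
		return -1;		/* s->qdisc is still NULL here */
	s->qdisc = calloc(1, sizeof(*s->qdisc));
	if (!s->qdisc)
		return -1;
	s->qdisc->refcnt = 1;
	return 0;
}

static void sched_destroy(struct sched *s)
{
	child_put(s->qdisc);		/* safe even if init never set it */
	s->qdisc = NULL;
}

/* Same shape as the flow quoted above: destroy runs when init fails. */
static int sched_create(struct sched *s, int fail_early)
{
	int err;

	s->qdisc = NULL;
	err = sched_init(s, fail_early);
	if (err)
		sched_destroy(s);
	return err;
}
```

Without the NULL check in child_put(), the sched_create(&s, 1) path would dereference a NULL pointer, which matches the reported fault.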

But I'll leave it to the maintainers to sort out the proper fix.
Maybe people prefer (a)?

                   Linus

* Default qdisc not correctly initialized with custom MTU
From: Holger Hoffstätte @ 2019-09-08 14:13 UTC (permalink / raw)
  To: Netdev

I just installed a better NIC (Aquantia 2.5/5/10Gb, apparently with
multiple queues) and now get the "mq" pseudo-qdisc automatically installed -
so far, so good. I also configure fq_codel as default qdisc via sysctls
and a larger MTU of 9000 for the device. This somehow leads to some
slight confusion about initialization order between the qdiscs and the
device.

Right after booting, where sysctl runs before eth0 setup:

$tc qd show
qdisc noqueue 0: dev lo root refcnt 2
qdisc mq 0: dev eth0 root
qdisc fq_codel 0: dev eth0 parent :8 limit 10240p flows 1024 quantum 1514 ...
qdisc fq_codel 0: dev eth0 parent :7 limit 10240p flows 1024 quantum 1514 ...
qdisc fq_codel 0: dev eth0 parent :6 limit 10240p flows 1024 quantum 1514 ...
qdisc fq_codel 0: dev eth0 parent :5 limit 10240p flows 1024 quantum 1514 ...
qdisc fq_codel 0: dev eth0 parent :4 limit 10240p flows 1024 quantum 1514 ...
qdisc fq_codel 0: dev eth0 parent :3 limit 10240p flows 1024 quantum 1514 ...
qdisc fq_codel 0: dev eth0 parent :2 limit 10240p flows 1024 quantum 1514 ...
qdisc fq_codel 0: dev eth0 parent :1 limit 10240p flows 1024 quantum 1514 ...

Note that fq_codel's quantum (derived from the MTU plus the 14-byte Ethernet
header) is 1514, i.e. it still assumes an MTU of 1500; it just used the
default setting as there was no link yet.

However, the MTU is set to 9000:

$ip link show eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000

It seems the default qdisc is created before the link is set up, and then
just attached to mq without consideration for the actual link configuration.
Simply kicking the whole thing to replace mq with itself again does the trick:

$tc qd replace root dev eth0 mq
$tc qd show
qdisc noqueue 0: dev lo root refcnt 2
qdisc mq 8001: dev eth0 root
qdisc fq_codel 0: dev eth0 parent 8001:8 limit 10240p flows 1024 quantum 9014 ...
qdisc fq_codel 0: dev eth0 parent 8001:7 limit 10240p flows 1024 quantum 9014 ...
qdisc fq_codel 0: dev eth0 parent 8001:6 limit 10240p flows 1024 quantum 9014 ...
qdisc fq_codel 0: dev eth0 parent 8001:5 limit 10240p flows 1024 quantum 9014 ...
qdisc fq_codel 0: dev eth0 parent 8001:4 limit 10240p flows 1024 quantum 9014 ...
qdisc fq_codel 0: dev eth0 parent 8001:3 limit 10240p flows 1024 quantum 9014 ...
qdisc fq_codel 0: dev eth0 parent 8001:2 limit 10240p flows 1024 quantum 9014 ...
qdisc fq_codel 0: dev eth0 parent 8001:1 limit 10240p flows 1024 quantum 9014 ...

Now the quanta are in line with the actual MTU.
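
To make the arithmetic concrete, here is a small userspace model, assuming the quantum is simply a snapshot of MTU plus link-layer header length taken when the qdisc is created. The struct and function names are made up for illustration and are not kernel code.

```c
#include <assert.h>

/* Hypothetical models of a net device and an fq_codel instance. */
struct netdev_model {
	unsigned int mtu;
	unsigned int hard_header_len;	/* 14 for Ethernet */
};

struct fq_model {
	unsigned int quantum;
};

/* The quantum is computed once, from the device state at creation time. */
static void fq_model_init(struct fq_model *q, const struct netdev_model *dev)
{
	assert(q && dev);
	q->quantum = dev->mtu + dev->hard_header_len;
}
```

Creating the model at MTU 1500 gives 1514, and changing the MTU afterwards does not touch the stored value. Only a second init (the "tc qd replace" above) picks up 9000 + 14 = 9014.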

I can't help but feel this is a slight bug in terms of initialization order,
and that the default qdisc should only be created when it's first being
used/attached to a link, not when the sysctls are configured.
Kernel is 5.2.x and I didn't see anything in 5.3 or net-next to address
this yet.

Thoughts?

Holger

* Re: general protection fault in cbs_destroy
From: syzbot @ 2019-09-08 12:33 UTC (permalink / raw)
  To: davem, hdanton, jhs, jiri, leandro.maciel.dorileo, linux-kernel,
	netdev, syzkaller-bugs, vedang.patel, xiyou.wangcong
In-Reply-To: <000000000000d2a5c60592047e58@google.com>

syzbot has bisected this bug to:

commit e0a7683d30e91e30ee6cf96314ae58a0314a095e
Author: Leandro Dorileo <leandro.maciel.dorileo@intel.com>
Date:   Mon Apr 8 17:12:18 2019 +0000

     net/sched: cbs: fix port_rate miscalculation

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=158f3c01600000
start commit:   3b47fd5c Merge tag 'nfs-for-5.3-4' of git://git.linux-nfs...
git tree:       upstream
final crash:    https://syzkaller.appspot.com/x/report.txt?x=178f3c01600000
console output: https://syzkaller.appspot.com/x/log.txt?x=138f3c01600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=144488c6c6c6d2b6
dashboard link: https://syzkaller.appspot.com/bug?extid=3a8d6a998cbb73bcf337
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=17998f9e600000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=10421efa600000

Reported-by: syzbot+3a8d6a998cbb73bcf337@syzkaller.appspotmail.com
Fixes: e0a7683d30e9 ("net/sched: cbs: fix port_rate miscalculation")

For information about bisection process see: https://goo.gl/tpsmEJ#bisection

* Re: [PATCH bpf-next] xdp: Fix race in dev_map_hash_update_elem() when replacing element
From: Jesper Dangaard Brouer @ 2019-09-08 11:28 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: brouer, ast, bpf, daniel, davem, hawk, jakub.kicinski,
	john.fastabend, kafai, linux-kernel, netdev, songliubraving,
	syzkaller-bugs, yhs, syzbot+4e7a85b1432052e8d6f8
In-Reply-To: <20190908082016.17214-1-toke@redhat.com>

On Sun,  8 Sep 2019 09:20:16 +0100
Toke Høiland-Jørgensen <toke@redhat.com> wrote:

> syzbot found a crash in dev_map_hash_update_elem(), when replacing an
> element with a new one. Jesper correctly identified the cause of the crash
> as a race condition between the initial lookup in the map (which is done
> before taking the lock), and the removal of the old element.
> 
> Rather than just add a second lookup into the hashmap after taking the
> lock, fix this by reworking the function logic to take the lock before the
> initial lookup.
> 
> Fixes: 6f9d451ab1a3 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
> Reported-and-tested-by: syzbot+4e7a85b1432052e8d6f8@syzkaller.appspotmail.com
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>

Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
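
The "take the lock before the initial lookup" idea can be sketched in userspace as follows. This is a toy model under stated assumptions: the spinlock is a C11 atomic_flag, the map is a fixed array of singly linked buckets, all names are hypothetical, and the RCU-protected readers of the real devmap are not modeled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define NBUCKETS 8

struct entry {
	unsigned int idx;
	int val;
	struct entry *next;
};

struct devmap_model {
	atomic_flag lock;		/* toy spinlock guarding the buckets */
	struct entry *buckets[NBUCKETS];
};

static void map_init(struct devmap_model *m)
{
	size_t i;

	atomic_flag_clear(&m->lock);
	for (i = 0; i < NBUCKETS; i++)
		m->buckets[i] = NULL;
}

static void map_lock(struct devmap_model *m)
{
	while (atomic_flag_test_and_set_explicit(&m->lock, memory_order_acquire))
		;			/* spin until the flag was clear */
}

static void map_unlock(struct devmap_model *m)
{
	atomic_flag_clear_explicit(&m->lock, memory_order_release);
}

/* Insert new_e, replacing any entry with the same idx.
 * Returns the displaced entry, or NULL if there was none. */
static struct entry *map_update(struct devmap_model *m, struct entry *new_e)
{
	struct entry **pp, *old;

	assert(m && new_e);
	map_lock(m);			/* lock first, then look up */
	for (pp = &m->buckets[new_e->idx % NBUCKETS]; *pp; pp = &(*pp)->next)
		if ((*pp)->idx == new_e->idx)
			break;
	old = *pp;
	new_e->next = old ? old->next : NULL;
	*pp = new_e;			/* lookup and replace are one critical section */
	map_unlock(m);
	return old;
}

static struct entry *map_lookup(struct devmap_model *m, unsigned int idx)
{
	struct entry *e;

	map_lock(m);
	for (e = m->buckets[idx % NBUCKETS]; e; e = e->next)
		if (e->idx == idx)
			break;
	map_unlock(m);
	return e;
}
```

Because the old element is found and unlinked under the same lock hold, no other updater can remove it between the lookup and the replacement.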

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

* Re: [patch net-next v2 3/3] net: devlink: move reload fail indication to devlink core and expose to user
From: Ido Schimmel @ 2019-09-08 11:25 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev@vger.kernel.org, davem@davemloft.net, dsahern@gmail.com,
	jakub.kicinski@netronome.com, Tariq Toukan, mlxsw
In-Reply-To: <20190907205400.14589-4-jiri@resnulli.us>

On Sat, Sep 07, 2019 at 10:54:00PM +0200, Jiri Pirko wrote:
> +bool devlink_is_reload_failed(struct devlink *devlink)

Forgot to mention that this can be 'const'

> +{
> +	return devlink->reload_failed;
> +}
> +EXPORT_SYMBOL_GPL(devlink_is_reload_failed);
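
One reading of the 'const' remark is that the pointer parameter can be const-qualified, since the helper only reads a flag. A minimal userspace sketch of that reading, with a hypothetical struct devlink_model standing in for struct devlink:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in with only the field this helper needs. */
struct devlink_model {
	bool reload_failed;
};

/* Read-only accessor: the const qualifier documents that it never writes. */
static bool is_reload_failed(const struct devlink_model *devlink)
{
	assert(devlink != NULL);
	return devlink->reload_failed;
}
```

With this signature, callers that only hold a const pointer can use the helper without a cast.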

* Re: [PATCH v1 net-next 00/15] tc-taprio offload for SJA1105 DSA
From: Vladimir Oltean @ 2019-09-08 11:07 UTC (permalink / raw)
  To: Andrew Lunn, David Miller
  Cc: f.fainelli, vivien.didelot, vinicius.gomes, vedang.patel,
	richardcochran, weifeng.voon, jiri, m-karicheri2, Jose.Abreu,
	ilias.apalodimas, jhs, xiyou.wangcong, kurt.kanzenbach, netdev
In-Reply-To: <20190907144548.GA21922@lunn.ch>

Hi Andrew, David,

On Sep 7, 2019, at 3:46 PM, Andrew Lunn <andrew@lunn.ch> wrote:
>
> On Fri, Sep 06, 2019 at 02:54:03PM +0200, David Miller wrote:
>>
>>  From: Vladimir Oltean <olteanv@gmail.com>
>>  Date: Mon,  2 Sep 2019 19:25:29 +0300
>>
>>>
>>>  This is the first attempt to submit the tc-taprio offload model for
>>>  inclusion in the net tree.
>>
>>
>>  Someone really needs to review this.
>
> Hi Vladimir
>
> You might have more chance getting this reviewed if you split it up
> into a number of smaller series. Richard could probably review the
> plain PTP changes. Who else has worked on tc-taprio recently? A series
> purely about tc-taprio might be more likely reviewed by a tc-taprio
> person, if it does not contain PTP changes.
>
>     Andrew

I think Richard was around when the taprio and etf qdiscs and SO_TXTIME
were first defined and developed:
https://patchwork.ozlabs.org/cover/808504/
I expect he is capable of delivering a competent review of the entire
series, possibly way more competent than my patch set itself.

The reason why I'm not splitting it up is because I lose around 10 ns
of synchronization offset when using the hardware-corrected PTPCLKVAL
clock for timestamping rather than the PTPTSCLK free-running counter.
This is mostly due to the fact that SPI interaction is reduced to a
minimum when correcting the switch's PHC in software - OTOH when that
correction translates into SPI writes to PTPCLKADD/PTPCLKVAL and
PTPCLKRATE, that's when things go a bit downhill with the precision.
Now the compromise is fully acceptable if the PTP clock is to be used
as the trigger source for the time-aware scheduler, but the conversion
would be quite pointless with no user to really require the hardware
clock.

Additionally, the 802.1AS PTP profile even calls for switches and
end-stations to use timestamping counters that are free-running, and
scale&rate-correct those in software - due to a perceived "double
feedback loop", or "changing the ruler while measuring with it". Now
I'm no expert at all, but it would be interesting if we went on with
the discussion in the direction of what Linux is currently
understanding by a "free-running" PTP counter. On one hand there's the
timecounter/cyclecounter in the kernel which makes for a
software-corrected PHC, and on the other there's the free_running
option in linuxptp which makes for a "nowhere-corrected" PHC that is
only being used in the E2E_TC and P2P_TC profiles. But user space
otherwise has no insight into the PHC implementation from the kernel,
and "free_running" from ptp4l can't really be used to implement the
synchronization mechanism required by 802.1AS.
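
To make "software-corrected PHC" concrete, here is a simplified userspace sketch in the spirit of the timecounter/cyclecounter approach: the hardware counter runs free, and all correction happens in a multiplier applied in software. The names (struct soft_phc and its functions) are hypothetical, and the fractional-nanosecond remainder and overflow handling that a real implementation needs are left out.

```c
#include <assert.h>
#include <stdint.h>

struct soft_phc {
	uint64_t last_cycles;	/* raw counter value at the last update */
	uint64_t ns;		/* corrected time at last_cycles */
	uint32_t mult;		/* nanoseconds per cycle, scaled by 2^shift */
	uint32_t shift;
};

/* Convert the cycles elapsed since the last update and accumulate them.
 * Assumes delta * mult fits in 64 bits, i.e. the clock is read often enough. */
static uint64_t soft_phc_read(struct soft_phc *phc, uint64_t raw_cycles)
{
	uint64_t delta = raw_cycles - phc->last_cycles;

	assert(phc->shift < 64);
	phc->ns += (delta * phc->mult) >> phc->shift;
	phc->last_cycles = raw_cycles;
	return phc->ns;
}

/* Rate correction only changes the multiplier; the hardware is untouched. */
static void soft_phc_adjust_rate(struct soft_phc *phc, uint64_t raw_cycles,
				 uint32_t new_mult)
{
	soft_phc_read(phc, raw_cycles);	/* account elapsed time at the old rate */
	phc->mult = new_mult;
}

/* Offset correction is a plain addition to the software time base. */
static void soft_phc_adjust_time(struct soft_phc *phc, int64_t delta_ns)
{
	phc->ns += (uint64_t)delta_ns;
}
```

For example, with shift = 10 and mult = 8 << 10 (8 ns per cycle), 1000 cycles read as 8000 ns. Bumping mult by one afterwards makes the next 1024 cycles count as 8193 ns instead of 8192.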

To me, the most striking aspect is that this particular recommendation
from 802.1AS is at direct odds with 802.1Qbv (time-based egress) /
802.1Qci (time-based ingress policing) which clearly require a PTP
counter in the NIC that ticks to the wall clock, and not to a random
free-running time since boot up. I simply can't seem to reconcile the
two.
What this particular switch does is that it permits RX and TX
timestamps to be taken in either corrected or uncorrected timebases
(but unfortunately not both at the same time). I think the hardware
designers' idea was to take timestamps off the uncorrected clock
(PTPTSCLK) and then do a sort of phc2sys-to-itself: write the
software-corrected value of the timecounter/cyclecounter into the
PTPCLKVAL hardware registers which get used for Qbv/Qci.
Actually I hate to use those terms when talking about SJA1105 hardware
support, since it's more "in the style of" IEEE rather than strict
compliance (timing of the design vs the standard might have played a
role as well).

But let's leave 802.1AS aside for a second - that's not what the patch
set is about, but rather a bit of background on why there are 2 PTP
clocks in this switch, and why I'm switching from one to the other.
Richard didn't really warm up to the phc2sys-to-itself idea in the
past, and opted for simplicity: just use the hardware-corrected
PTPCLKVAL for everything, which is exactly what I'm doing as of now.

The only people whom I know are working on TSN stuff are mostly
entrenched in papers, standards and generally in the hardware-only
mentality. There is obviously a lot to be done for Linux to be a
proper TSN endpoint, and RT is a big one. For a switch in particular,
things are a bit easier due to the fact that it just needs to ensure
the real-time guarantees of a frame that was supposedly already
delivered in-band with the schedule. And there's no way to do
that other than through a hardware offload - otherwise the software
tc-taprio would only shape the frames egressed by the management CPU
of the switch. The tc-taprio offload for a switch only makes sense
when taken together with the bridging offload, if you will.

I "dared" to submit this for merging maybe because I don't see the
subtleties that prevent it from going in, at least for a switch - it
just works and does the job. I would have loved to see this in 5.4
just so I would have to lug around fewer patches when finally
starting to evaluate the endpoint side of things with the 5.4-rt
patch. But nonetheless, there's no hurry and getting a healthy
discussion going is surely more important than the patches themselves
are. On the other hand there needs to be a balance, and just talking
with no code is no good either - fixes, improvements, rework can
always come later once we commit to the basic offload model.

I happen to be around at Plumbers during the following days to learn
what else is going on in the Linux community, and develop a more
complete mental model for myself for how TSN fits in with all of that.
If anybody happens to also be around, I'd be more than happy to talk.

Regards,
-Vladimir

* [RFC PATCH untested] vhost: block speculation of translated descriptors
From: Michael S. Tsirkin @ 2019-09-08 11:05 UTC (permalink / raw)
  To: linux-kernel; +Cc: Jason Wang, kvm, virtualization, netdev

iovec addresses coming from vhost are assumed to be
pre-validated, but in fact can be speculated to a value
out of range.

Userspace addresses are later validated with array_index_nospec so we can
be sure kernel info does not leak through these addresses, but vhost
must also not leak userspace info outside the allowed memory table to
guests.

Following the defence in depth principle, make sure
the address is not speculated out of node range.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/vhost/vhost.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5dc174ac8cac..0ee375fb7145 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2072,7 +2072,9 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
 		size = node->size - addr + node->start;
 		_iov->iov_len = min((u64)len - s, size);
 		_iov->iov_base = (void __user *)(unsigned long)
-			(node->userspace_addr + addr - node->start);
+			(node->userspace_addr +
+			 array_index_nospec(addr - node->start,
+					    node->size));
 		s += size;
 		addr += size;
 		++ret;
-- 
MST
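
For readers unfamiliar with array_index_nospec(), its intended effect is that an in-range index passes through unchanged and an out-of-range one becomes 0, without a conditional branch. The following is a userspace approximation of that semantics only: index_mask(), clamp_index(), struct region and translate() are hypothetical names, and whether a given compiler and CPU really keep this free of speculation is outside what the sketch can show.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

#define ULONG_BITS (sizeof(unsigned long) * CHAR_BIT)

/* All-ones when index < size, zero otherwise, using only arithmetic.
 * Assumes size <= LONG_MAX. The top bit of (index | (size - 1 - index))
 * is set exactly when index >= size. */
static unsigned long index_mask(unsigned long index, unsigned long size)
{
	unsigned long in_range =
		~(index | (size - 1UL - index)) >> (ULONG_BITS - 1);

	return 0UL - in_range;
}

/* In-range indexes are returned as-is, anything else collapses to 0. */
static unsigned long clamp_index(unsigned long index, unsigned long size)
{
	return index & index_mask(index, size);
}

/* Stand-in for a memory-table node. */
struct region {
	uint64_t start;
	uint64_t size;
	uint64_t userspace_addr;
};

/* Same shape as the patched expression: base plus a clamped offset. */
static uint64_t translate(const struct region *r, uint64_t addr)
{
	assert(addr >= r->start);	/* the caller already found the node */
	return r->userspace_addr +
	       clamp_index((unsigned long)(addr - r->start),
			   (unsigned long)r->size);
}
```

An offset that lands at or past the node size maps back to the start of the node's userspace range rather than to memory outside it.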

* Re: [PATCH 2/2] vhost: re-introducing metadata acceleration through kernel virtual address
From: Michael S. Tsirkin @ 2019-09-08 11:05 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, virtualization, netdev, linux-kernel, jgg, aarcange, jglisse,
	linux-mm, James Bottomley, Christoph Hellwig, David Miller,
	linux-arm-kernel, linux-parisc
In-Reply-To: <20190905122736.19768-3-jasowang@redhat.com>

On Thu, Sep 05, 2019 at 08:27:36PM +0800, Jason Wang wrote:
> This is a rework on the commit 7f466032dc9e ("vhost: access vq
> metadata through kernel virtual address").
> 
> It was noticed that the copy_to/from_user() friends that was used to
> access virtqueue metdata tends to be very expensive for dataplane
> implementation like vhost since it involves lots of software checks,
> speculation barriers,

So if we drop the speculation barrier,
there's a problem here in that the access will now be speculated.
This effectively disables the defence in depth effect of
b3bbfb3fb5d25776b8e3f361d2eedaabb0b496cd
    x86: Introduce __uaccess_begin_nospec() and uaccess_try_nospec


So now we need to sprinkle array_index_nospec or barrier_nospec over the
code whenever we use an index we got from userspace.
See below for some examples.


> hardware feature toggling (e.g SMAP). The
> extra cost will be more obvious when transferring small packets since
> the time spent on metadata accessing become more significant.
> 
> This patch tries to eliminate those overheads by accessing them
> through direct mapping of those pages. Invalidation callbacks is
> implemented for co-operation with general VM management (swap, KSM,
> THP or NUMA balancing). We will try to get the direct mapping of vq
> metadata before each round of packet processing if it doesn't
> exist. If we fail, we will simplely fallback to copy_to/from_user()
> friends.
> 
> This invalidation, direct mapping access and set are synchronized
> through spinlock. This takes a step back from the original commit
> 7f466032dc9e ("vhost: access vq metadata through kernel virtual
> address") which tries to RCU which is suspicious and hard to be
> reviewed. This won't perform as well as RCU because of the atomic,
> this could be addressed by the future optimization.
> 
> This method might does not work for high mem page which requires
> temporary mapping so we just fallback to normal
> copy_to/from_user() and may not for arch that has virtual tagged cache
> since extra cache flushing is needed to eliminate the alias. This will
> result complex logic and bad performance. For those archs, this patch
> simply go for copy_to/from_user() friends. This is done by ruling out
> kernel mapping codes through ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE.
> 
> Note that this is only done when device IOTLB is not enabled. We
> could use similar method to optimize IOTLB in the future.
> 
> Tests shows at most about 22% improvement on TX PPS when using
> virtio-user + vhost_net + xdp1 + TAP on 4.0GHz Kaby Lake.
> 
>         SMAP on | SMAP off
> Before: 4.9Mpps | 6.9Mpps
> After:  6.0Mpps | 7.5Mpps
> 
> On a elder CPU Sandy Bridge without SMAP support. TX PPS doesn't see
> any difference.

Why isn't Kaby Lake with SMAP off the same as Sandy Bridge?


> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
> Cc: Christoph Hellwig <hch@infradead.org>
> Cc: David Miller <davem@davemloft.net>
> Cc: Jerome Glisse <jglisse@redhat.com>
> Cc: Jason Gunthorpe <jgg@mellanox.com>
> Cc: linux-mm@kvack.org
> Cc: linux-arm-kernel@lists.infradead.org
> Cc: linux-parisc@vger.kernel.org
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/vhost/vhost.c | 551 +++++++++++++++++++++++++++++++++++++++++-
>  drivers/vhost/vhost.h |  41 ++++
>  2 files changed, 589 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 791562e03fe0..f98155f28f02 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -298,6 +298,182 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
>  		__vhost_vq_meta_reset(d->vqs[i]);
>  }
>  
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +static void vhost_map_unprefetch(struct vhost_map *map)
> +{
> +	kfree(map->pages);
> +	kfree(map);
> +}
> +
> +static void vhost_set_map_dirty(struct vhost_virtqueue *vq,
> +				struct vhost_map *map, int index)
> +{
> +	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
> +	int i;
> +
> +	if (uaddr->write) {
> +		for (i = 0; i < map->npages; i++)
> +			set_page_dirty(map->pages[i]);
> +	}
> +}
> +
> +static void vhost_uninit_vq_maps(struct vhost_virtqueue *vq)
> +{
> +	struct vhost_map *map[VHOST_NUM_ADDRS];
> +	int i;
> +
> +	spin_lock(&vq->mmu_lock);
> +	for (i = 0; i < VHOST_NUM_ADDRS; i++) {
> +		map[i] = vq->maps[i];
> +		if (map[i]) {
> +			vhost_set_map_dirty(vq, map[i], i);
> +			vq->maps[i] = NULL;
> +		}
> +	}
> +	spin_unlock(&vq->mmu_lock);
> +
> +	/* No need for synchronization since we are serialized with
> +	 * memory accessors (e.g vq mutex held).
> +	 */
> +
> +	for (i = 0; i < VHOST_NUM_ADDRS; i++)
> +		if (map[i])
> +			vhost_map_unprefetch(map[i]);
> +
> +}
> +
> +static void vhost_reset_vq_maps(struct vhost_virtqueue *vq)
> +{
> +	int i;
> +
> +	vhost_uninit_vq_maps(vq);
> +	for (i = 0; i < VHOST_NUM_ADDRS; i++)
> +		vq->uaddrs[i].size = 0;
> +}
> +
> +static bool vhost_map_range_overlap(struct vhost_uaddr *uaddr,
> +				     unsigned long start,
> +				     unsigned long end)
> +{
> +	if (unlikely(!uaddr->size))
> +		return false;
> +
> +	return !(end < uaddr->uaddr || start > uaddr->uaddr - 1 + uaddr->size);
> +}
> +
> +static void inline vhost_vq_access_map_begin(struct vhost_virtqueue *vq)
> +{
> +	spin_lock(&vq->mmu_lock);
> +}
> +
> +static void inline vhost_vq_access_map_end(struct vhost_virtqueue *vq)
> +{
> +	spin_unlock(&vq->mmu_lock);
> +}
> +
> +static int vhost_invalidate_vq_start(struct vhost_virtqueue *vq,
> +				     int index,
> +				     unsigned long start,
> +				     unsigned long end,
> +				     bool blockable)
> +{
> +	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
> +	struct vhost_map *map;
> +
> +	if (!vhost_map_range_overlap(uaddr, start, end))
> +		return 0;
> +	else if (!blockable)
> +		return -EAGAIN;
> +
> +	spin_lock(&vq->mmu_lock);
> +	++vq->invalidate_count;
> +
> +	map = vq->maps[index];
> +	if (map)
> +		vq->maps[index] = NULL;
> +	spin_unlock(&vq->mmu_lock);
> +
> +	if (map) {
> +		vhost_set_map_dirty(vq, map, index);
> +		vhost_map_unprefetch(map);
> +	}
> +
> +	return 0;
> +}
> +
> +static void vhost_invalidate_vq_end(struct vhost_virtqueue *vq,
> +				    int index,
> +				    unsigned long start,
> +				    unsigned long end)
> +{
> +	if (!vhost_map_range_overlap(&vq->uaddrs[index], start, end))
> +		return;
> +
> +	spin_lock(&vq->mmu_lock);
> +	--vq->invalidate_count;
> +	spin_unlock(&vq->mmu_lock);
> +}
> +
> +static int vhost_invalidate_range_start(struct mmu_notifier *mn,
> +					const struct mmu_notifier_range *range)
> +{
> +	struct vhost_dev *dev = container_of(mn, struct vhost_dev,
> +					     mmu_notifier);
> +	bool blockable = mmu_notifier_range_blockable(range);
> +	int i, j, ret;
> +
> +	for (i = 0; i < dev->nvqs; i++) {
> +		struct vhost_virtqueue *vq = dev->vqs[i];
> +
> +		for (j = 0; j < VHOST_NUM_ADDRS; j++) {
> +			ret = vhost_invalidate_vq_start(vq, j,
> +							range->start,
> +							range->end, blockable);
> +			if (ret)
> +				return ret;
> +		}
> +	}
> +
> +	return 0;
> +}
> +
> +static void vhost_invalidate_range_end(struct mmu_notifier *mn,
> +				       const struct mmu_notifier_range *range)
> +{
> +	struct vhost_dev *dev = container_of(mn, struct vhost_dev,
> +					     mmu_notifier);
> +	int i, j;
> +
> +	for (i = 0; i < dev->nvqs; i++) {
> +		struct vhost_virtqueue *vq = dev->vqs[i];
> +
> +		for (j = 0; j < VHOST_NUM_ADDRS; j++)
> +			vhost_invalidate_vq_end(vq, j,
> +						range->start,
> +						range->end);
> +	}
> +}
> +
> +static const struct mmu_notifier_ops vhost_mmu_notifier_ops = {
> +	.invalidate_range_start = vhost_invalidate_range_start,
> +	.invalidate_range_end = vhost_invalidate_range_end,
> +};
> +
> +static void vhost_init_maps(struct vhost_dev *dev)
> +{
> +	struct vhost_virtqueue *vq;
> +	int i, j;
> +
> +	dev->mmu_notifier.ops = &vhost_mmu_notifier_ops;
> +
> +	for (i = 0; i < dev->nvqs; ++i) {
> +		vq = dev->vqs[i];
> +		for (j = 0; j < VHOST_NUM_ADDRS; j++)
> +			vq->maps[j] = NULL;
> +	}
> +}
> +#endif
> +
>  static void vhost_vq_reset(struct vhost_dev *dev,
>  			   struct vhost_virtqueue *vq)
>  {
> @@ -326,7 +502,11 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>  	vq->busyloop_timeout = 0;
>  	vq->umem = NULL;
>  	vq->iotlb = NULL;
> +	vq->invalidate_count = 0;
>  	__vhost_vq_meta_reset(vq);
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	vhost_reset_vq_maps(vq);
> +#endif
>  }
>  
>  static int vhost_worker(void *data)
> @@ -471,12 +651,15 @@ void vhost_dev_init(struct vhost_dev *dev,
>  	dev->iov_limit = iov_limit;
>  	dev->weight = weight;
>  	dev->byte_weight = byte_weight;
> +	dev->has_notifier = false;
>  	init_llist_head(&dev->work_list);
>  	init_waitqueue_head(&dev->wait);
>  	INIT_LIST_HEAD(&dev->read_list);
>  	INIT_LIST_HEAD(&dev->pending_list);
>  	spin_lock_init(&dev->iotlb_lock);
> -
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	vhost_init_maps(dev);
> +#endif
>  
>  	for (i = 0; i < dev->nvqs; ++i) {
>  		vq = dev->vqs[i];
> @@ -485,6 +668,7 @@ void vhost_dev_init(struct vhost_dev *dev,
>  		vq->heads = NULL;
>  		vq->dev = dev;
>  		mutex_init(&vq->mutex);
> +		spin_lock_init(&vq->mmu_lock);
>  		vhost_vq_reset(dev, vq);
>  		if (vq->handle_kick)
>  			vhost_poll_init(&vq->poll, vq->handle_kick,
> @@ -564,7 +748,19 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
>  	if (err)
>  		goto err_cgroup;
>  
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	err = mmu_notifier_register(&dev->mmu_notifier, dev->mm);
> +	if (err)
> +		goto err_mmu_notifier;
> +#endif
> +	dev->has_notifier = true;
> +
>  	return 0;
> +
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +err_mmu_notifier:
> +	vhost_dev_free_iovecs(dev);
> +#endif
>  err_cgroup:
>  	kthread_stop(worker);
>  	dev->worker = NULL;
> @@ -655,6 +851,107 @@ static void vhost_clear_msg(struct vhost_dev *dev)
>  	spin_unlock(&dev->iotlb_lock);
>  }
>  
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +static void vhost_setup_uaddr(struct vhost_virtqueue *vq,
> +			      int index, unsigned long uaddr,
> +			      size_t size, bool write)
> +{
> +	struct vhost_uaddr *addr = &vq->uaddrs[index];
> +
> +	addr->uaddr = uaddr;
> +	addr->size = size;
> +	addr->write = write;
> +}
> +
> +static void vhost_setup_vq_uaddr(struct vhost_virtqueue *vq)
> +{
> +	vhost_setup_uaddr(vq, VHOST_ADDR_DESC,
> +			  (unsigned long)vq->desc,
> +			  vhost_get_desc_size(vq, vq->num),
> +			  false);
> +	vhost_setup_uaddr(vq, VHOST_ADDR_AVAIL,
> +			  (unsigned long)vq->avail,
> +			  vhost_get_avail_size(vq, vq->num),
> +			  false);
> +	vhost_setup_uaddr(vq, VHOST_ADDR_USED,
> +			  (unsigned long)vq->used,
> +			  vhost_get_used_size(vq, vq->num),
> +			  true);
> +}
> +
> +static int vhost_map_prefetch(struct vhost_virtqueue *vq,
> +			       int index)
> +{
> +	struct vhost_map *map;
> +	struct vhost_uaddr *uaddr = &vq->uaddrs[index];
> +	struct page **pages;
> +	int npages = DIV_ROUND_UP(uaddr->size, PAGE_SIZE);
> +	int npinned;
> +	void *vaddr, *v;
> +	int err;
> +	int i;
> +
> +	spin_lock(&vq->mmu_lock);
> +
> +	err = -EFAULT;
> +	if (vq->invalidate_count)
> +		goto err;
> +
> +	err = -ENOMEM;
> +	map = kmalloc(sizeof(*map), GFP_ATOMIC);
> +	if (!map)
> +		goto err;
> +
> +	pages = kmalloc_array(npages, sizeof(struct page *), GFP_ATOMIC);
> +	if (!pages)
> +		goto err_pages;
> +
> +	err = EFAULT;
> +	npinned = __get_user_pages_fast(uaddr->uaddr, npages,
> +					uaddr->write, pages);
> +	if (npinned > 0)
> +		release_pages(pages, npinned);
> +	if (npinned != npages)
> +		goto err_gup;
> +
> +	for (i = 0; i < npinned; i++)
> +		if (PageHighMem(pages[i]))
> +			goto err_gup;
> +
> +	vaddr = v = page_address(pages[0]);
> +
> +	/* For simplicity, fallback to userspace address if VA is not
> +	 * contigious.
> +	 */
> +	for (i = 1; i < npinned; i++) {
> +		v += PAGE_SIZE;
> +		if (v != page_address(pages[i]))
> +			goto err_gup;
> +	}
> +
> +	map->addr = vaddr + (uaddr->uaddr & (PAGE_SIZE - 1));
> +	map->npages = npages;
> +	map->pages = pages;
> +
> +	vq->maps[index] = map;
> +	/* No need for a synchronize_rcu(). This function should be
> +	 * called by dev->worker so we are serialized with all
> +	 * readers.
> +	 */
> +	spin_unlock(&vq->mmu_lock);
> +
> +	return 0;
> +
> +err_gup:
> +	kfree(pages);
> +err_pages:
> +	kfree(map);
> +err:
> +	spin_unlock(&vq->mmu_lock);
> +	return err;
> +}
> +#endif
> +
>  void vhost_dev_cleanup(struct vhost_dev *dev)
>  {
>  	int i;
> @@ -684,8 +981,20 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
>  		kthread_stop(dev->worker);
>  		dev->worker = NULL;
>  	}
> -	if (dev->mm)
> +	if (dev->mm) {
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +		if (dev->has_notifier) {
> +			mmu_notifier_unregister(&dev->mmu_notifier,
> +						dev->mm);
> +			dev->has_notifier = false;
> +		}
> +#endif
>  		mmput(dev->mm);
> +	}
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	for (i = 0; i < dev->nvqs; i++)
> +		vhost_uninit_vq_maps(dev->vqs[i]);
> +#endif
>  	dev->mm = NULL;
>  }
>  EXPORT_SYMBOL_GPL(vhost_dev_cleanup);
> @@ -914,6 +1223,26 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
>  
>  static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
>  {
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	struct vhost_map *map;
> +	struct vring_used *used;
> +
> +	if (!vq->iotlb) {
> +		vhost_vq_access_map_begin(vq);
> +
> +		map = vq->maps[VHOST_ADDR_USED];
> +		if (likely(map)) {
> +			used = map->addr;
> +			*((__virtio16 *)&used->ring[vq->num]) =
> +				cpu_to_vhost16(vq, vq->avail_idx);
> +			vhost_vq_access_map_end(vq);
> +			return 0;
> +		}
> +
> +		vhost_vq_access_map_end(vq);
> +	}
> +#endif
> +
>  	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
>  			      vhost_avail_event(vq));
>  }
> @@ -922,6 +1251,27 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
>  				 struct vring_used_elem *head, int idx,
>  				 int count)
>  {
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	struct vhost_map *map;
> +	struct vring_used *used;
> +	size_t size;
> +
> +	if (!vq->iotlb) {
> +		vhost_vq_access_map_begin(vq);
> +
> +		map = vq->maps[VHOST_ADDR_USED];
> +		if (likely(map)) {
> +			used = map->addr;
> +			size = count * sizeof(*head);
> +			memcpy(used->ring + idx, head, size);
> +			vhost_vq_access_map_end(vq);
> +			return 0;
> +		}
> +
> +		vhost_vq_access_map_end(vq);
> +	}
> +#endif
> +
>  	return vhost_copy_to_user(vq, vq->used->ring + idx, head,
>  				  count * sizeof(*head));
>  }
> @@ -929,6 +1279,25 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
>  static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
>  
>  {
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	struct vhost_map *map;
> +	struct vring_used *used;
> +
> +	if (!vq->iotlb) {
> +		vhost_vq_access_map_begin(vq);
> +
> +		map = vq->maps[VHOST_ADDR_USED];
> +		if (likely(map)) {
> +			used = map->addr;
> +			used->flags = cpu_to_vhost16(vq, vq->used_flags);
> +			vhost_vq_access_map_end(vq);
> +			return 0;
> +		}
> +
> +		vhost_vq_access_map_end(vq);
> +	}
> +#endif
> +
>  	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
>  			      &vq->used->flags);
>  }
> @@ -936,6 +1305,25 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
>  static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
>  
>  {
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	struct vhost_map *map;
> +	struct vring_used *used;
> +
> +	if (!vq->iotlb) {
> +		vhost_vq_access_map_begin(vq);
> +
> +		map = vq->maps[VHOST_ADDR_USED];
> +		if (likely(map)) {
> +			used = map->addr;
> +			used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
> +			vhost_vq_access_map_end(vq);
> +			return 0;
> +		}
> +
> +		vhost_vq_access_map_end(vq);
> +	}
> +#endif
> +
>  	return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
>  			      &vq->used->idx);
>  }
> @@ -981,12 +1369,50 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
>  static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
>  				      __virtio16 *idx)
>  {
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	struct vhost_map *map;
> +	struct vring_avail *avail;
> +
> +	if (!vq->iotlb) {
> +		vhost_vq_access_map_begin(vq);
> +
> +		map = vq->maps[VHOST_ADDR_AVAIL];
> +		if (likely(map)) {
> +			avail = map->addr;
> +			*idx = avail->idx;

index can now be speculated.

> +			vhost_vq_access_map_end(vq);
> +			return 0;
> +		}
> +
> +		vhost_vq_access_map_end(vq);
> +	}
> +#endif
> +
>  	return vhost_get_avail(vq, *idx, &vq->avail->idx);
>  }
>  
>  static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
>  				       __virtio16 *head, int idx)
>  {
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	struct vhost_map *map;
> +	struct vring_avail *avail;
> +
> +	if (!vq->iotlb) {
> +		vhost_vq_access_map_begin(vq);
> +
> +		map = vq->maps[VHOST_ADDR_AVAIL];
> +		if (likely(map)) {
> +			avail = map->addr;
> +			*head = avail->ring[idx & (vq->num - 1)];


Since idx can be speculated, I guess we need array_index_nospec here?


> +			vhost_vq_access_map_end(vq);
> +			return 0;
> +		}
> +
> +		vhost_vq_access_map_end(vq);
> +	}
> +#endif
> +
>  	return vhost_get_avail(vq, *head,
>  			       &vq->avail->ring[idx & (vq->num - 1)]);
>  }
> @@ -994,24 +1420,98 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
>  static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
>  					__virtio16 *flags)
>  {
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	struct vhost_map *map;
> +	struct vring_avail *avail;
> +
> +	if (!vq->iotlb) {
> +		vhost_vq_access_map_begin(vq);
> +
> +		map = vq->maps[VHOST_ADDR_AVAIL];
> +		if (likely(map)) {
> +			avail = map->addr;
> +			*flags = avail->flags;
> +			vhost_vq_access_map_end(vq);
> +			return 0;
> +		}
> +
> +		vhost_vq_access_map_end(vq);
> +	}
> +#endif
> +
>  	return vhost_get_avail(vq, *flags, &vq->avail->flags);
>  }
>  
>  static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
>  				       __virtio16 *event)
>  {
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	struct vhost_map *map;
> +	struct vring_avail *avail;
> +
> +	if (!vq->iotlb) {
> +		vhost_vq_access_map_begin(vq);
> +		map = vq->maps[VHOST_ADDR_AVAIL];
> +		if (likely(map)) {
> +			avail = map->addr;
> +			*event = (__virtio16)avail->ring[vq->num];
> +			vhost_vq_access_map_end(vq);
> +			return 0;
> +		}
> +		vhost_vq_access_map_end(vq);
> +	}
> +#endif
> +
>  	return vhost_get_avail(vq, *event, vhost_used_event(vq));
>  }
>  
>  static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
>  				     __virtio16 *idx)
>  {
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	struct vhost_map *map;
> +	struct vring_used *used;
> +
> +	if (!vq->iotlb) {
> +		vhost_vq_access_map_begin(vq);
> +
> +		map = vq->maps[VHOST_ADDR_USED];
> +		if (likely(map)) {
> +			used = map->addr;
> +			*idx = used->idx;
> +			vhost_vq_access_map_end(vq);
> +			return 0;
> +		}
> +
> +		vhost_vq_access_map_end(vq);
> +	}
> +#endif
> +
>  	return vhost_get_used(vq, *idx, &vq->used->idx);
>  }


This seems to be used during init. Why do we bother
accelerating this?


>  
>  static inline int vhost_get_desc(struct vhost_virtqueue *vq,
>  				 struct vring_desc *desc, int idx)
>  {
> +#if VHOST_ARCH_CAN_ACCEL_UACCESS
> +	struct vhost_map *map;
> +	struct vring_desc *d;
> +
> +	if (!vq->iotlb) {
> +		vhost_vq_access_map_begin(vq);
> +
> +		map = vq->maps[VHOST_ADDR_DESC];
> +		if (likely(map)) {
> +			d = map->addr;
> +			*desc = *(d + idx);


Since idx can be speculated, I guess we need array_index_nospec here?


> +			vhost_vq_access_map_end(vq);
> +			return 0;
> +		}
> +
> +		vhost_vq_access_map_end(vq);
> +	}
> +#endif
> +
>  	return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
>  }
>  

I also wonder about the userspace address we eventually get.
It would seem that we need to prevent that from being speculated,
and that seems like a good idea even if this patch isn't
applied. As you are playing with micro-benchmarks, maybe
you could try the patch below?
It's unfortunately untested.
Thanks a lot in advance!

===>
vhost: block speculation of translated descriptors

iovec addresses coming from vhost are assumed to be
pre-validated, but in fact can be speculated to a value
out of range.

Userspace addresses are later validated with array_index_nospec so we can
be sure kernel info does not leak through these addresses, but vhost
must also not leak userspace info outside the allowed memory table to
guests.

Following the defence in depth principle, make sure
the address is not validated out of node range.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>

---


diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5dc174ac8cac..863e25011ef6 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2072,7 +2076,9 @@ static int translate_desc(struct vhost_virtqueue *vq, u64 addr, u32 len,
 		size = node->size - addr + node->start;
 		_iov->iov_len = min((u64)len - s, size);
 		_iov->iov_base = (void __user *)(unsigned long)
-			(node->userspace_addr + addr - node->start);
+			(node->userspace_addr +
+			 array_index_nospec(addr - node->start,
+					    node->size));
 		s += size;
 		addr += size;
 		++ret;
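
For readers unfamiliar with the helper, the following is a userspace
sketch of the masking idea behind array_index_nospec(). It models the
generic fallback mask from include/linux/nospec.h; the names
index_mask_nospec() and index_nospec() are made up for this illustration,
and architectures may provide their own mask implementation. It only
shows the arithmetic and is not a speculation barrier by itself in
userspace.

```c
#include <assert.h>
#include <limits.h>

#define BITS_PER_LONG ((int)(sizeof(long) * CHAR_BIT))

/*
 * Userspace model of the generic mask: all ones when index < size,
 * zero otherwise, computed without a conditional branch.
 * Relies on arithmetic right shift of a negative long (true for gcc
 * and clang, implementation-defined in ISO C).
 */
static unsigned long index_mask_nospec(unsigned long index, unsigned long size)
{
	return (unsigned long)(~(long)(index | (size - 1UL - index)) >>
			       (BITS_PER_LONG - 1));
}

/* Clamp index to [0, size): in-range values pass through, others become 0. */
static unsigned long index_nospec(unsigned long index, unsigned long size)
{
	/* The trick only works for sizes that fit in a signed long. */
	assert(size > 0 && size <= LONG_MAX);
	return index & index_mask_nospec(index, size);
}
```

In the patch above, the index would be addr - node->start and the size
would be node->size, so an out-of-range offset collapses to 0 instead of
reaching past the node.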

^ permalink raw reply related

* Re: [patch net-next v2 3/3] net: devlink: move reload fail indication to devlink core and expose to user
From: Ido Schimmel @ 2019-09-08 10:39 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev@vger.kernel.org, davem@davemloft.net, dsahern@gmail.com,
	jakub.kicinski@netronome.com, Tariq Toukan, mlxsw
In-Reply-To: <20190907205400.14589-4-jiri@resnulli.us>

On Sat, Sep 07, 2019 at 10:54:00PM +0200, Jiri Pirko wrote:
> diff --git a/include/uapi/linux/devlink.h b/include/uapi/linux/devlink.h
> index 546e75dd74ac..7cb5e8c5ae0d 100644
> --- a/include/uapi/linux/devlink.h
> +++ b/include/uapi/linux/devlink.h
> @@ -410,6 +410,8 @@ enum devlink_attr {
>  	DEVLINK_ATTR_TRAP_METADATA,			/* nested */
>  	DEVLINK_ATTR_TRAP_GROUP_NAME,			/* string */
>  
> +	DEVLINK_ATTR_RELOAD_FAILED,			/* u8 0 or 1 */
> +
>  	/* add new attributes above here, update the policy in devlink.c */
>  
>  	__DEVLINK_ATTR_MAX,
> diff --git a/net/core/devlink.c b/net/core/devlink.c
> index 1e3a2288b0b2..e00a4a643d17 100644
> --- a/net/core/devlink.c
> +++ b/net/core/devlink.c
> @@ -471,6 +471,8 @@ static int devlink_nl_fill(struct sk_buff *msg, struct devlink *devlink,
>  
>  	if (devlink_nl_put_handle(msg, devlink))
>  		goto nla_put_failure;
> +	if (nla_put_u8(msg, DEVLINK_ATTR_RELOAD_FAILED, devlink->reload_failed))

Why not use NLA_FLAG for this?

> +		goto nla_put_failure;
>  
>  	genlmsg_end(msg, hdr);
>  	return 0;
> @@ -2677,6 +2679,21 @@ static bool devlink_reload_supported(struct devlink *devlink)
>  	return devlink->ops->reload_down && devlink->ops->reload_up;
>  }

^ permalink raw reply

* Re: [PATCH v3 1/2] net: core: Notify on changes to dev->promiscuity.
From: Ido Schimmel @ 2019-09-08 10:15 UTC (permalink / raw)
  To: Allan W. Nielsen
  Cc: Jiri Pirko, David Miller, andrew, horatiu.vultur,
	alexandre.belloni, UNGLinuxDriver, ivecera, f.fainelli, netdev,
	linux-kernel
In-Reply-To: <20190903081410.zpcdm2dzqrxyg43c@lx-anielsen.microsemi.net>

On Tue, Sep 03, 2019 at 10:14:12AM +0200, Allan W. Nielsen wrote:
> The 09/03/2019 09:13, Ido Schimmel wrote:
> > On Mon, Sep 02, 2019 at 07:42:31PM +0200, Allan W. Nielsen wrote:
> > With these patches applied I assume I will see the following traffic
> > when running tcpdump on one of the netdevs exposed by the ocelot driver:
> > 
> > - Ingress: All
> > - Egress: Only locally generated traffic and traffic forwarded by the
> >   kernel from interfaces not belonging to the ocelot driver
> > 
> > The above means I will not see any offloaded traffic transmitted by the
> > port. Is that correct?
> Correct - but maybe we should change this.
> 
> In Ocelot and in LANxxxx (the part we are working on now), we can come pretty
> close. We can get the offloaded TX traffic to the CPU, but it will not be
> re-written (it will look like the ingress frame, which is not always the same as
> the egress frame; VLAN tags and others may be re-written).

Yes, this is the same with mlxsw. You can trap the egress frames, but
they will reach the CPU unmodified via the ingress port.

> In some of our chips we can actually do this (not Ocelot, and not the LANxxxx
> part we are working on now) after the frame has been re-written.

Cool.

> > I see that the driver is setting 'offload_fwd_mark' for any traffic trapped
> > from bridged ports, which means the bridge will drop it before it traverses
> > the packet taps on egress.
> Correct.
> 
> > Large parts of the discussion revolve around the fact that switch ports
> > are not any different than other ports. Dave wrote "Please stop
> > portraying switches as special in this regard" and Andrew wrote "[The
> > user] just wants tcpdump to work like on their desktop."
> And we are trying to get as close to this as practically possible, knowing that it
> may not be exactly the same.
> 
> > But if anything, this discussion proves that switch ports are special in
> > this regard and that tcpdump will not work like on the desktop.
> I think it can come really close. Some drivers may be able to fix the TX issue
> you point out, others may not.
> 
> > Beside the fact that I don't agree (but gave up) with the new
> > interpretation of promisc mode, I wonder if we're not asking for trouble
> > with this patchset. Users will see all offloaded traffic on ingress, but
> > none of it on egress. This is in contrast to the server/desktop, where
> > Linux is much more dominant in comparison to switches (let alone hw
> > accelerated ones) and where all the traffic is visible through tcpdump.
> > I can already see myself having to explain this over and over again to
> > confused users.
> > 
> > Now, I understand that showing egress traffic is inherently difficult.
> > It means one of two things:
> > 
> > 1. We allow packets to be forwarded by both the software and the
> > hardware
> > 2. We trap all ingressing traffic from all the ports
> If the HW cannot copy the egress traffic to the CPU (which our HW cannot), then
> you need to do both. All ingress traffic needs to go to the CPU, you need to
> make all the forwarding decisions in the CPU, to figure out what traffic happens
> to go to the port you want to monitor.
> 
> I really doubt this will work in real life. Too much traffic, and HW may make
> different forwarding decisions than the SW (tc rules in HW but not in SW), which
> means that it will not be good for debugging anyway.

I agree.

> 
> > Both options can have devastating effects on the network and therefore
> > should not be triggered by a supposedly innocent invocation of tcpdump.
> Agree.
> 
> > I again wonder if it would not be wiser to solve this by introducing two
> > new flags to tcpdump for ingress/egress (similar to -Q in/out) capturing
> > of offloaded traffic. The capturing of egress offloaded traffic can be
> > documented with the appropriate warnings.
> Not sure I agree, but I will try to spend some more time considering it.
> 
> In the meantime, what TC action was it that Jiri suggested we should use? The
> trap action is no good, and it prevents the forwarding in silicon, and I'm not
> aware of a "COPY-TO-CPU" action.

I agree. We would either need a new action or just extend the existing one
with a new attribute.

> > Anyway, I don't want to hold you up, I merely want to make sure that the
> > above (assuming it's correct) is considered before the patches are
> > applied.
> Sounds good, and thanks for all the time spent on reviewing and asking the
> critical questions.

Thanks for bringing up these issues. I will be happy to review future
patches.

^ permalink raw reply

* Re: Q: fixed link
From: Andrew Lunn @ 2019-09-08  9:05 UTC (permalink / raw)
  To: Ranran; +Cc: netdev
In-Reply-To: <CAJ2oMhKUTUU0eHTmS62itBw6L9Jut=ps6y8GuVDP44xadn03dw@mail.gmail.com>

On Sun, Sep 08, 2019 at 10:30:51AM +0300, Ranran wrote:
> Hello,
> 
> In documentation of fixed-link it is said:"
> Some Ethernet MACs have a "fixed link", and are not connected to a
> normal MDIO-managed PHY device. For those situations, a Device Tree
> binding allows to describe a "fixed link".
> "
> Does that mean that when using an unmanaged switch ("no cpu" mode), it is
> better to use it with fixed-link?

Hi Ranran

Is there a MAC to MAC connection, or PHY to PHY connection?

If the interface MAC is directly connected to the switch MAC, fixed
link is what you should use. The fixed link will then tell the
interface MAC what speed it should use.

If you have back to back PHYs, you need a PHY driver for the PHY
connected to the interface MAC, to configure its speed, duplex
etc. The dumb switch should be controlling its PHY, and auto-neg will
probably work.

	 Andrew

^ permalink raw reply

* Re: [PATCH net-next 3/3] net: dsa: mv88e6xxx: add RXNFC support
From: Andrew Lunn @ 2019-09-08  8:55 UTC (permalink / raw)
  To: Vivien Didelot; +Cc: netdev, davem, f.fainelli
In-Reply-To: <20190907172510.GB27514@t480s.localdomain>

On Sat, Sep 07, 2019 at 05:25:10PM -0400, Vivien Didelot wrote:
> Hi Andrew,
> 
> On Sat, 7 Sep 2019 22:32:56 +0200, Andrew Lunn <andrew@lunn.ch> wrote:
> > > +	policy = devm_kzalloc(chip->dev, sizeof(*policy), GFP_KERNEL);
> > > +	if (!policy)
> > > +		return -ENOMEM;
> > 
> > I think this might be the first time we have done dynamic memory
> > allocation in the mv88e6xxx driver. It might even be a first for a DSA
> > driver?
> > 
> > I'm not saying it is wrong, but maybe we should discuss it. 
> > 
> > I assume you are doing this because the ATU entry itself is not
> > sufficient?
> > 
> > How much memory is involved here, worst case? I assume one struct
> > mv88e6xxx_policy per ATU entry? Which you think is too much to
> > allocate as part of chip? I guess most users will never use this
> > feature, so for most users it would be wasted memory. So i do see the
> > point for dynamically allocating it.
> 
> A layer 2 policy is not limited to the ATU. It can also be based on a VTU
> entry, on the port's Etype, or frame's Etype. We can have 0, 1 or literally
> thousands of policies programmed by the user.

O.K., then it has to be dynamic memory.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* Re: [PATCH 1/2] net: phy: dp83867: Add documentation for SGMII mode type
From: Andrew Lunn @ 2019-09-08  8:54 UTC (permalink / raw)
  To: Vitaly Gaiduk
  Cc: davem@davemloft.net, robh+dt@kernel.org, f.fainelli@gmail.com,
	Mark Rutland, netdev@vger.kernel.org, devicetree@vger.kernel.org,
	linux-kernel@vger.kernel.org, Trent Piepho
In-Reply-To: <2894361567896439@iva5-be053096037b.qloud-c.yandex.net>

On Sun, Sep 08, 2019 at 01:47:19AM +0300, Vitaly Gaiduk wrote:
> Hi, Andrew.
> I’m ready to do this property with such name but is it good practice to do such long names? :)
> Also, Trent Piepho wrote about sgmii-clk and merged all ideas we have “ti,sgmii-ref-clk”.
> It’s better, isn’t it?
> Vitaly.
> 
> 07.09.2019, 18:39, "Andrew Lunn" <andrew@lunn.ch>:
> > On Thu, Sep 05, 2019 at 07:26:00PM +0300, Vitaly Gaiduk wrote:
> > > Add documentation of ti,sgmii-type which can be used to select
> > > SGMII mode type (4 or 6-wire).
> > >
> > > Signed-off-by: Vitaly Gaiduk <vitaly.gaiduk@cloudbear.ru>
> > > ---
> > >  Documentation/devicetree/bindings/net/ti,dp83867.txt | 1 +
> > >  1 file changed, 1 insertion(+)
> > >
> > > diff --git a/Documentation/devicetree/bindings/net/ti,dp83867.txt b/Documentation/devicetree/bindings/net/ti,dp83867.txt
> > > index db6aa3f2215b..18e7fd52897f 100644
> > > --- a/Documentation/devicetree/bindings/net/ti,dp83867.txt
> > > +++ b/Documentation/devicetree/bindings/net/ti,dp83867.txt
> > > @@ -37,6 +37,7 @@ Optional property:
> > >                                for applicable values.  The CLK_OUT pin can also
> > >                                be disabled by this property.  When omitted, the
> > >                                PHY's default will be left as is.
> > > +	- ti,sgmii-type - This denotes the fact which SGMII mode is used (4 or 6-wire).
> >
> > Hi Vitaly
> >
> > You probably want to make this a Boolean. I don't think SGMII type is
> > a good idea. This is about enabling the receive clock to be passed to
> > the MAC. So how about ti,sgmii-ref-clock-output-enable.
> >
> >     Andrew

Hi Vitaly

Please reconfigure your mail client to not obfuscate with HTML.

The length should be O.K. For a PHY node, it should not be too deeply
indented, unless it happens to be part of an Ethernet switch.

	  Andrew

^ permalink raw reply

* Load-balancing considering queue lengths
From: Daniel Schaffrath @ 2019-09-08  8:13 UTC (permalink / raw)
  To: netdev

Hello everybody,

when load balancing packets/bytes among several links, it seems a
natural choice to base the decision about the outgoing device on the
current queue lengths of the available devices. Looking at typical
netfilter configurations or nftlb, this does not seem to be a common
choice, though.

Considering the abilities of eBPF and netfilter I think it would be 
totally possible. But I might be mistaken in either regard (good idea / 
technical possibility).
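
To make the idea concrete, here is a minimal sketch of the selection
step only, assuming the queue lengths have already been sampled
somehow. struct link_state and pick_least_loaded() are hypothetical
names for this illustration; how the lengths would be obtained (for
example from an eBPF program) is left open, and any sampled value may
already be stale when the decision is made.

```c
#include <stddef.h>

/* Hypothetical per-link state; qlen would be sampled from the device queue. */
struct link_state {
	int ifindex;		/* outgoing device */
	unsigned int qlen;	/* packets currently queued */
};

/*
 * Return the array index of the link with the shortest queue,
 * or -1 if there are no links. Ties go to the lowest index.
 */
static int pick_least_loaded(const struct link_state *links, size_t n)
{
	size_t i, best = 0;

	if (n == 0)
		return -1;

	for (i = 1; i < n; i++)
		if (links[i].qlen < links[best].qlen)
			best = i;

	return (int)best;
}
```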

I would be very grateful if you could provide me with any pointers so that
I can educate myself on the matter.

Thanks a lot in advance, Daniel

^ permalink raw reply

* [PATCH bpf-next] xdp: Fix race in dev_map_hash_update_elem() when replacing element
From: Toke Høiland-Jørgensen @ 2019-09-08  8:20 UTC (permalink / raw)
  To: make-wifi-fast, linux-wireless, ast, bpf, daniel, davem, hawk,
	jakub.kicinski, john.fastabend, kafai, linux-kernel, netdev,
	songliubraving, syzkaller-bugs, yhs
  Cc: Toke Høiland-Jørgensen, syzbot+4e7a85b1432052e8d6f8
In-Reply-To: <0000000000005091a70591d3e1d9@google.com>

syzbot found a crash in dev_map_hash_update_elem(), when replacing an
element with a new one. Jesper correctly identified the cause of the crash
as a race condition between the initial lookup in the map (which is done
before taking the lock), and the removal of the old element.

Rather than just add a second lookup into the hashmap after taking the
lock, fix this by reworking the function logic to take the lock before the
initial lookup.

Fixes: 6f9d451ab1a3 ("xdp: Add devmap_hash map type for looking up devices by hashed index")
Reported-and-tested-by: syzbot+4e7a85b1432052e8d6f8@syzkaller.appspotmail.com
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 kernel/bpf/devmap.c | 17 ++++++++++++-----
 1 file changed, 12 insertions(+), 5 deletions(-)

diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
index 9af048a932b5..d27f3b60ff6d 100644
--- a/kernel/bpf/devmap.c
+++ b/kernel/bpf/devmap.c
@@ -650,19 +650,22 @@ static int __dev_map_hash_update_elem(struct net *net, struct bpf_map *map,
 	u32 ifindex = *(u32 *)value;
 	u32 idx = *(u32 *)key;
 	unsigned long flags;
+	int err = -EEXIST;
 
 	if (unlikely(map_flags > BPF_EXIST || !ifindex))
 		return -EINVAL;
 
+	spin_lock_irqsave(&dtab->index_lock, flags);
+
 	old_dev = __dev_map_hash_lookup_elem(map, idx);
 	if (old_dev && (map_flags & BPF_NOEXIST))
-		return -EEXIST;
+		goto out_err;
 
 	dev = __dev_map_alloc_node(net, dtab, ifindex, idx);
-	if (IS_ERR(dev))
-		return PTR_ERR(dev);
-
-	spin_lock_irqsave(&dtab->index_lock, flags);
+	if (IS_ERR(dev)) {
+		err = PTR_ERR(dev);
+		goto out_err;
+	}
 
 	if (old_dev) {
 		hlist_del_rcu(&old_dev->index_hlist);
@@ -683,6 +686,10 @@ static int __dev_map_hash_update_elem(struct net *net, struct bpf_map *map,
 		call_rcu(&old_dev->rcu, __dev_map_entry_free);
 
 	return 0;
+
+out_err:
+	spin_unlock_irqrestore(&dtab->index_lock, flags);
+	return err;
 }
 
 static int dev_map_hash_update_elem(struct bpf_map *map, void *key, void *value,
-- 
2.23.0
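
As an illustration of the pattern described in the commit message (take
the lock first, then look up, then decide), here is a small userspace
sketch. It is not the devmap code: the table layout, the UPD_* flags and
all function names are assumptions made for this example, and a C11
atomic_flag spinlock stands in for the kernel spinlock.

```c
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SLOTS 16
#define UPD_ANY     0	/* create or replace */
#define UPD_NOEXIST 1	/* create only, fail if the key exists */

struct entry {
	int used;
	uint32_t key;
	uint32_t ifindex;
};

struct table {
	atomic_flag lock;
	struct entry slots[TABLE_SLOTS];
};

static void table_lock(struct table *t)
{
	while (atomic_flag_test_and_set_explicit(&t->lock, memory_order_acquire))
		;	/* spin until the flag was previously clear */
}

static void table_unlock(struct table *t)
{
	atomic_flag_clear_explicit(&t->lock, memory_order_release);
}

/* Caller must hold the lock. */
static struct entry *lookup_locked(struct table *t, uint32_t key)
{
	size_t i;

	for (i = 0; i < TABLE_SLOTS; i++)
		if (t->slots[i].used && t->slots[i].key == key)
			return &t->slots[i];
	return NULL;
}

/* Lookup and modification happen under one lock hold, so no other
 * updater can remove or replace the entry in between. */
static int table_update(struct table *t, uint32_t key, uint32_t ifindex,
			int flags)
{
	struct entry *e;
	int err = -EEXIST;
	size_t i;

	if (!ifindex)
		return -EINVAL;

	table_lock(t);

	e = lookup_locked(t, key);
	if (e && (flags & UPD_NOEXIST))
		goto out;

	if (!e) {
		err = -ENOSPC;
		for (i = 0; i < TABLE_SLOTS && t->slots[i].used; i++)
			;
		if (i == TABLE_SLOTS)
			goto out;
		e = &t->slots[i];
		e->used = 1;
		e->key = key;
	}
	e->ifindex = ifindex;
	err = 0;
out:
	table_unlock(t);
	return err;
}

/* Return the stored ifindex, or 0 if the key is absent. */
static uint32_t table_lookup(struct table *t, uint32_t key)
{
	struct entry *e;
	uint32_t ifindex;

	table_lock(t);
	e = lookup_locked(t, key);
	ifindex = e ? e->ifindex : 0;
	table_unlock(t);
	return ifindex;
}
```

The single exit label mirrors the out_err path in the patch: every
return after the lock is taken goes through one unlock.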


^ permalink raw reply related

* Re: general protection fault in dev_map_hash_update_elem
From: Toke Høiland-Jørgensen @ 2019-09-08  8:09 UTC (permalink / raw)
  To: Hillf Danton, syzbot
  Cc: alexei.starovoitov, ast, bpf, daniel, davem, hawk, jakub.kicinski,
	jbrouer, john.fastabend, kafai, linux-kernel, netdev,
	songliubraving, syzkaller-bugs, yhs
In-Reply-To: <20190908030726.7520-1-hdanton@sina.com>

Hillf Danton <hdanton@sina.com> writes:

>> syzbot has found a reproducer for the following crash on Sat, 07 Sep 2019 18:59:06 -0700
>> 
>> HEAD commit:    a2c11b03 kcm: use BPF_PROG_RUN
>> git tree:       bpf-next
>> console output: https://syzkaller.appspot.com/x/log.txt?x=13d46ec1600000
>> kernel config:  https://syzkaller.appspot.com/x/.config?x=cf0c85d15c20ade3
>> dashboard link: https://syzkaller.appspot.com/bug?extid=4e7a85b1432052e8d6f8
>> compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
>> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1220b2d1600000
>> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=1360b26e600000
>> 
>> general protection fault: 0000 [#1] PREEMPT SMP KASAN
>> CPU: 1 PID: 10210 Comm: syz-executor910 Not tainted 5.3.0-rc7+ #0
>> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
>> Google 01/01/2011
>> RIP: 0010:__write_once_size include/linux/compiler.h:226 [inline]
>> RIP: 0010:__hlist_del include/linux/list.h:762 [inline]
>> RIP: 0010:hlist_del_rcu include/linux/rculist.h:455 [inline]
>> RIP: 0010:__dev_map_hash_update_elem kernel/bpf/devmap.c:668 [inline]
>> RIP: 0010:dev_map_hash_update_elem+0x3c8/0x6e0 kernel/bpf/devmap.c:691
>
> Fix commit 6f9d451ab1a3 ("xdp: Add devmap_hash map type for looking
> up devices by hashed index")

While this minimal patch does fix the bug (as Jesper already noted), I
prefer to rework the logic instead of just repeating the lookup; a patch
is on its way :)

-Toke

^ permalink raw reply

* Q: fixed link
From: Ranran @ 2019-09-08  7:30 UTC (permalink / raw)
  To: netdev

Hello,

In documentation of fixed-link it is said:"
Some Ethernet MACs have a "fixed link", and are not connected to a
normal MDIO-managed PHY device. For those situations, a Device Tree
binding allows to describe a "fixed link".
"
Does that mean that when using an unmanaged switch ("no cpu" mode), it is
better to use it with fixed-link?

Thanks,
ranran

^ permalink raw reply

* INFO: rcu detected stall in pppoe_sendmsg
From: syzbot @ 2019-09-08  7:19 UTC (permalink / raw)
  To: davem, jhs, jiri, linux-kernel, netdev, syzkaller-bugs,
	xiyou.wangcong

Hello,

syzbot found the following crash on:

HEAD commit:    1e3778cb Merge tag 'scsi-fixes' of git://git.kernel.org/pu..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=137b2971600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=b89bb446a3faaba4
dashboard link: https://syzkaller.appspot.com/bug?extid=55be5f513bed37fc4367
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+55be5f513bed37fc4367@syzkaller.appspotmail.com

rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 	0-...!: (10501 ticks this GP) idle=06a/1/0x4000000000000002  
softirq=173683/173683 fqs=0
	(t=10502 jiffies g=271749 q=3228)
rcu: rcu_preempt kthread starved for 10503 jiffies! g271749 f0x0  
RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
rcu: RCU grace-period kthread stack dump:
rcu_preempt     I29160    10      2 0x80004000
Call Trace:
  context_switch kernel/sched/core.c:3254 [inline]
  __schedule+0x755/0x1580 kernel/sched/core.c:3880
  schedule+0xd9/0x260 kernel/sched/core.c:3947
  schedule_timeout+0x486/0xc50 kernel/time/timer.c:1807
  rcu_gp_fqs_loop kernel/rcu/tree.c:1611 [inline]
  rcu_gp_kthread+0x9b2/0x18c0 kernel/rcu/tree.c:1768
  kthread+0x361/0x430 kernel/kthread.c:255
  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
NMI backtrace for cpu 0
CPU: 0 PID: 4124 Comm: syz-executor.2 Not tainted 5.3.0-rc7+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  <IRQ>
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  nmi_cpu_backtrace.cold+0x70/0xb2 lib/nmi_backtrace.c:101
  nmi_trigger_cpumask_backtrace+0x23b/0x28b lib/nmi_backtrace.c:62
  arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
  trigger_single_cpu_backtrace include/linux/nmi.h:164 [inline]
  rcu_dump_cpu_stacks+0x183/0x1cf kernel/rcu/tree_stall.h:254
  print_cpu_stall kernel/rcu/tree_stall.h:455 [inline]
  check_cpu_stall kernel/rcu/tree_stall.h:529 [inline]
  rcu_pending kernel/rcu/tree.c:2736 [inline]
  rcu_sched_clock_irq.cold+0x4dd/0xc13 kernel/rcu/tree.c:2183
  update_process_times+0x32/0x80 kernel/time/timer.c:1639
  tick_sched_handle+0xa2/0x190 kernel/time/tick-sched.c:167
  tick_sched_timer+0x53/0x140 kernel/time/tick-sched.c:1296
  __run_hrtimer kernel/time/hrtimer.c:1389 [inline]
  __hrtimer_run_queues+0x364/0xe40 kernel/time/hrtimer.c:1451
  hrtimer_interrupt+0x314/0x770 kernel/time/hrtimer.c:1509
  local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1106 [inline]
  smp_apic_timer_interrupt+0x160/0x610 arch/x86/kernel/apic/apic.c:1131
  apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830
  </IRQ>
RIP: 0010:preempt_count arch/x86/include/asm/preempt.h:26 [inline]
RIP: 0010:check_kcov_mode kernel/kcov.c:68 [inline]
RIP: 0010:__sanitizer_cov_trace_pc+0xd/0x50 kernel/kcov.c:102
Code: 6d 9f e9 ff 48 c7 05 1e 4d 19 09 00 00 00 00 e9 77 e9 ff ff 90 90 90  
90 90 90 90 90 90 55 48 89 e5 65 48 8b 04 25 40 fe 01 00 <65> 8b 15 04 89  
8f 7e 81 e2 00 01 1f 00 48 8b 75 08 75 2b 8b 90 f0
RSP: 0018:ffff88821b02f098 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
RAX: ffff88806c8524c0 RBX: ffff888096f52c38 RCX: ffffc9000bf4d000
RDX: 0000000000000000 RSI: ffffffff85c65fd2 RDI: ffff888096f52cec
RBP: ffff88821b02f098 R08: ffff88806c8524c0 R09: 0000000000000000
R10: fffffbfff134afaf R11: ffff88806c8524c0 R12: dffffc0000000000
R13: ffff888096f52940 R14: 0000000000000000 R15: 0000000000000000
  hhf_dequeue+0x586/0xa20 net/sched/sch_hhf.c:438
  dequeue_skb net/sched/sch_generic.c:258 [inline]
  qdisc_restart net/sched/sch_generic.c:361 [inline]
  __qdisc_run+0x1e7/0x19d0 net/sched/sch_generic.c:379
  __dev_xmit_skb net/core/dev.c:3533 [inline]
  __dev_queue_xmit+0x16f1/0x3650 net/core/dev.c:3838
  dev_queue_xmit+0x18/0x20 net/core/dev.c:3902
  br_dev_queue_push_xmit+0x3f3/0x5c0 net/bridge/br_forward.c:52
  NF_HOOK include/linux/netfilter.h:305 [inline]
  NF_HOOK include/linux/netfilter.h:299 [inline]
  br_forward_finish+0xfa/0x400 net/bridge/br_forward.c:65
  NF_HOOK include/linux/netfilter.h:305 [inline]
  NF_HOOK include/linux/netfilter.h:299 [inline]
  __br_forward+0x641/0xb00 net/bridge/br_forward.c:109
  br_forward+0x47c/0x500 net/bridge/br_forward.c:158
  br_dev_xmit+0xbf0/0x15a0 net/bridge/br_device.c:102
  __netdev_start_xmit include/linux/netdevice.h:4406 [inline]
  netdev_start_xmit include/linux/netdevice.h:4420 [inline]
  xmit_one net/core/dev.c:3280 [inline]
  dev_hard_start_xmit+0x1a3/0x9c0 net/core/dev.c:3296
  __dev_queue_xmit+0x2b15/0x3650 net/core/dev.c:3869
  dev_queue_xmit+0x18/0x20 net/core/dev.c:3902
  pppoe_sendmsg+0x661/0x7f0 drivers/net/ppp/pppoe.c:899
  sock_sendmsg_nosec net/socket.c:637 [inline]
  sock_sendmsg+0xd7/0x130 net/socket.c:657
  ___sys_sendmsg+0x3e2/0x920 net/socket.c:2311
  __sys_sendmmsg+0x1bf/0x4d0 net/socket.c:2413
  __do_sys_sendmmsg net/socket.c:2442 [inline]
  __se_sys_sendmmsg net/socket.c:2439 [inline]
  __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2439
  do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4598e9
Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7  
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff  
ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f20901b4c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
RAX: ffffffffffffffda RBX: 0000000000000004 RCX: 00000000004598e9
RDX: 000000000000033b RSI: 000000002000d180 RDI: 0000000000000003
RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f20901b56d4
R13: 00000000004c70a7 R14: 00000000004dc768 R15: 00000000ffffffff


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

^ permalink raw reply

* Re: WARNING in __vunmap
From: syzbot @ 2019-09-08  7:05 UTC (permalink / raw)
  To: davem, herbert, linux-kernel, netdev, steffen.klassert,
	syzkaller-bugs
In-Reply-To: <00000000000092839d0581fd74ad@google.com>

syzbot has found a reproducer for the following crash on:

HEAD commit:    b3a9964c Merge tag 'char-misc-5.3-rc8' of git://git.kernel..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=16c9f70a600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=144488c6c6c6d2b6
dashboard link: https://syzkaller.appspot.com/bug?extid=5ec9bb042ddfe9644773
compiler:       clang version 9.0.0 (/home/glider/llvm/clang  
80fee25776c2fb61e74c1ecb1a523375c2500b69)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=11a30371600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+5ec9bb042ddfe9644773@syzkaller.appspotmail.com

------------[ cut here ]------------
Trying to vfree() nonexistent vm area (00000000dddfa71b)
WARNING: CPU: 0 PID: 10463 at mm/vmalloc.c:2235 __vunmap+0x148/0xa20  
mm/vmalloc.c:2234
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 10463 Comm: syz-executor.0 Not tainted 5.3.0-rc7+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1d8/0x2f8 lib/dump_stack.c:113
  panic+0x25c/0x799 kernel/panic.c:219
  __warn+0x22f/0x230 kernel/panic.c:576
  report_bug+0x190/0x290 lib/bug.c:186
  fixup_bug arch/x86/kernel/traps.c:179 [inline]
  do_error_trap+0xd7/0x440 arch/x86/kernel/traps.c:272
  do_invalid_op+0x36/0x40 arch/x86/kernel/traps.c:291
  invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1028
RIP: 0010:__vunmap+0x148/0xa20 mm/vmalloc.c:2234
Code: 0c e8 8c 36 d0 ff eb 24 e8 85 36 d0 ff 48 c7 c7 a8 f5 8f 88 e8 69 9b d8 05 48 c7 c7 e5 5c 3e 88 4c 89 f6 31 c0 e8 68 18 a3 ff <0f> 0b 48 83 c4 60 5b 41 5c 41 5d 41 5e 41 5f 5d c3 48 c7 c7 a8 f5
RSP: 0018:ffff8880a81d75b8 EFLAGS: 00010246
RAX: 4324ba28a2c9f400 RBX: 0000000000000000 RCX: ffff8880a48ce080
RDX: 0000000000000000 RSI: 0000000080000000 RDI: 0000000000000000
RBP: ffff8880a81d7640 R08: ffffffff815cfa54 R09: ffffed1015d46088
R10: ffffed1015d46088 R11: 0000000000000000 R12: ffff88808a93f708
R13: dffffc0000000000 R14: ffffc900080f7000 R15: ffffc90008108000
  __vfree mm/vmalloc.c:2299 [inline]
  vfree+0x85/0x130 mm/vmalloc.c:2329
  ipcomp_free_scratches net/xfrm/xfrm_ipcomp.c:212 [inline]
  ipcomp_free_data+0x12a/0x1d0 net/xfrm/xfrm_ipcomp.c:321
  ipcomp_init_state+0x7bf/0x8b0 net/xfrm/xfrm_ipcomp.c:373
  ipcomp6_init_state+0xb7/0x630 net/ipv6/ipcomp6.c:153
  __xfrm_init_state+0x7d0/0xbf0 net/xfrm/xfrm_state.c:2493
  xfrm_state_construct net/xfrm/xfrm_user.c:626 [inline]
  xfrm_add_sa+0x223f/0x38e0 net/xfrm/xfrm_user.c:683
  xfrm_user_rcv_msg+0x3e6/0x650 net/xfrm/xfrm_user.c:2676
  netlink_rcv_skb+0x19e/0x3d0 net/netlink/af_netlink.c:2477
  xfrm_netlink_rcv+0x74/0x90 net/xfrm/xfrm_user.c:2684
  netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
  netlink_unicast+0x787/0x900 net/netlink/af_netlink.c:1328
  netlink_sendmsg+0x993/0xc50 net/netlink/af_netlink.c:1917
  sock_sendmsg_nosec net/socket.c:637 [inline]
  sock_sendmsg net/socket.c:657 [inline]
  ___sys_sendmsg+0x60d/0x910 net/socket.c:2311
  __sys_sendmsg net/socket.c:2356 [inline]
  __do_sys_sendmsg net/socket.c:2365 [inline]
  __se_sys_sendmsg net/socket.c:2363 [inline]
  __x64_sys_sendmsg+0x17c/0x200 net/socket.c:2363
  do_syscall_64+0xfe/0x140 arch/x86/entry/common.c:296
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4598e9
Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f4880699c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004598e9
RDX: 0000000000000000 RSI: 0000000020000840 RDI: 0000000000000003
RBP: 000000000075bfc8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f488069a6d4
R13: 00000000004c7812 R14: 00000000004dd0b0 R15: 00000000ffffffff
Kernel Offset: disabled
Rebooting in 86400 seconds..



* BUG: unable to handle kernel NULL pointer dereference in tc_bind_tclass
From: syzbot @ 2019-09-08  6:08 UTC (permalink / raw)
  To: davem, jhs, jiri, linux-kernel, netdev, syzkaller-bugs,
	xiyou.wangcong

Hello,

syzbot found the following crash on:

HEAD commit:    0e5b36bc r8152: adjust the settings of ups flags
git tree:       net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=10e5ad76600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=67b69b427c3b2dbf
dashboard link: https://syzkaller.appspot.com/bug?extid=21b29db13c065852f64b
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=16cebbda600000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=15fb9d0a600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+21b29db13c065852f64b@syzkaller.appspotmail.com

8021q: adding VLAN 0 to HW filter on device batadv0
BUG: kernel NULL pointer dereference, address: 0000000000000000
#PF: supervisor instruction fetch in kernel mode
#PF: error_code(0x0010) - not-present page
PGD a9ba0067 P4D a9ba0067 PUD a7851067 PMD 0
Oops: 0010 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 8672 Comm: syz-executor994 Not tainted 5.3.0-rc7+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:0x0
Code: Bad RIP value.
RSP: 0018:ffff888097fb74d8 EFLAGS: 00010246
RAX: dffffc0000000000 RBX: ffffffff884a7740 RCX: ffffffff85b55676
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8880a4cd7400
RBP: ffff888097fb75d0 R08: ffff88808dc2e440 R09: ffff888097fb7658
R10: ffffed1012ff6ed9 R11: ffff888097fb76cf R12: ffff8880a4cd7400
R13: 0000000000000001 R14: ffff888097fb75a8 R15: ffffffff884a7740
FS:  0000555556952880(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 000000009c578000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  tc_bind_tclass+0x13e/0x2f0 net/sched/sch_api.c:1923
  tc_ctl_tclass+0xadb/0xcd0 net/sched/sch_api.c:2059
  rtnetlink_rcv_msg+0x463/0xb00 net/core/rtnetlink.c:5223
  netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
  rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5241
  netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
  netlink_unicast+0x531/0x710 net/netlink/af_netlink.c:1328
  netlink_sendmsg+0x8a5/0xd60 net/netlink/af_netlink.c:1917
  sock_sendmsg_nosec net/socket.c:637 [inline]
  sock_sendmsg+0xd7/0x130 net/socket.c:657
  ___sys_sendmsg+0x803/0x920 net/socket.c:2311
  __sys_sendmsg+0x105/0x1d0 net/socket.c:2356
  __do_sys_sendmsg net/socket.c:2365 [inline]
  __se_sys_sendmsg net/socket.c:2363 [inline]
  __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2363
  do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x441cd9
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b 10 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffc9938bcf8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000315f6576616c RCX: 0000000000441cd9
RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005
RBP: 735f656764697262 R08: 0000000001bbbbbb R09: 0000000001bbbbbb
R10: 0000000001bbbbbb R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000403270 R14: 0000000000000000 R15: 0000000000000000
Modules linked in:
CR2: 0000000000000000
---[ end trace d5605e2bdb92fab7 ]---
RIP: 0010:0x0
Code: Bad RIP value.
RSP: 0018:ffff888097fb74d8 EFLAGS: 00010246
RAX: dffffc0000000000 RBX: ffffffff884a7740 RCX: ffffffff85b55676
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8880a4cd7400
RBP: ffff888097fb75d0 R08: ffff88808dc2e440 R09: ffff888097fb7658
R10: ffffed1012ff6ed9 R11: ffff888097fb76cf R12: ffff8880a4cd7400
R13: 0000000000000001 R14: ffff888097fb75a8 R15: ffffffff884a7740
FS:  0000555556952880(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 000000009c578000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches


* INFO: rcu detected stall in mld_ifc_timer_expire
From: syzbot @ 2019-09-08  6:08 UTC (permalink / raw)
  To: davem, jhs, jiri, linux-kernel, netdev, syzkaller-bugs,
	xiyou.wangcong

Hello,

syzbot found the following crash on:

HEAD commit:    3b47fd5c Merge tag 'nfs-for-5.3-4' of git://git.linux-nfs...
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=15807dc6600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=144488c6c6c6d2b6
dashboard link: https://syzkaller.appspot.com/bug?extid=bc6297c11f19ee807dc2
compiler:       clang version 9.0.0 (/home/glider/llvm/clang 80fee25776c2fb61e74c1ecb1a523375c2500b69)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=119ee6c1600000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=15c4eb0a600000

Bisection is inconclusive: the bug happens on the oldest tested release.

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=13b7343e600000
console output: https://syzkaller.appspot.com/x/log.txt?x=17b7343e600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+bc6297c11f19ee807dc2@syzkaller.appspotmail.com

rcu: INFO: rcu_preempt self-detected stall on CPU
rcu: 	0-...!: (10500 ticks this GP) idle=d6e/0/0x3 softirq=9083/9083 fqs=0
	(t=10501 jiffies g=6617 q=143)
rcu: rcu_preempt kthread starved for 10502 jiffies! g6617 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
rcu: RCU grace-period kthread stack dump:
rcu_preempt     I29080    10      2 0x80004000
Call Trace:
  context_switch kernel/sched/core.c:3254 [inline]
  __schedule+0x877/0xc50 kernel/sched/core.c:3880
  schedule+0x131/0x1e0 kernel/sched/core.c:3947
  schedule_timeout+0x14f/0x240 kernel/time/timer.c:1807
  rcu_gp_fqs_loop kernel/rcu/tree.c:1611 [inline]
  rcu_gp_kthread+0xef8/0x1790 kernel/rcu/tree.c:1768
  kthread+0x332/0x350 kernel/kthread.c:255
  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
NMI backtrace for cpu 0
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.3.0-rc7+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
  <IRQ>
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1d8/0x2f8 lib/dump_stack.c:113
  nmi_cpu_backtrace+0xaf/0x1a0 lib/nmi_backtrace.c:101
  nmi_trigger_cpumask_backtrace+0x174/0x290 lib/nmi_backtrace.c:62
  arch_trigger_cpumask_backtrace+0x10/0x20 arch/x86/kernel/apic/hw_nmi.c:38
  trigger_single_cpu_backtrace include/linux/nmi.h:164 [inline]
  rcu_dump_cpu_stacks+0x15a/0x220 kernel/rcu/tree_stall.h:254
  print_cpu_stall kernel/rcu/tree_stall.h:455 [inline]
  check_cpu_stall kernel/rcu/tree_stall.h:529 [inline]
  rcu_pending kernel/rcu/tree.c:2736 [inline]
  rcu_sched_clock_irq+0xb95/0x16d0 kernel/rcu/tree.c:2183
  update_process_times+0x134/0x190 kernel/time/timer.c:1639
  tick_sched_handle kernel/time/tick-sched.c:167 [inline]
  tick_sched_timer+0x263/0x420 kernel/time/tick-sched.c:1296
  __run_hrtimer kernel/time/hrtimer.c:1389 [inline]
  __hrtimer_run_queues+0x403/0x850 kernel/time/hrtimer.c:1451
  hrtimer_interrupt+0x38c/0xda0 kernel/time/hrtimer.c:1509
  local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1106 [inline]
  smp_apic_timer_interrupt+0x109/0x280 arch/x86/kernel/apic/apic.c:1131
  apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830
RIP: 0010:__list_add_valid+0xc/0xc0 lib/list_debug.c:22
Code: 89 e5 53 48 89 fb e8 83 d6 1f fe 48 c7 c7 56 5a 45 88 48 89 de e8 44 fd ff ff 5b 5d c3 90 55 48 89 e5 41 57 41 56 41 55 41 54 <53> 49 89 d6 49 89 f4 49 89 ff 49 bd 00 00 00 00 00 fc ff df 48 8d
RSP: 0018:ffff8880aea09730 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
RAX: 1ffff11012f53d0a RBX: 1ffff11012f53d0b RCX: ffffffff88875a00
RDX: ffff888097a9e850 RSI: ffff888097a9e850 RDI: ffff888097a9e7b8
RBP: ffff8880aea09750 R08: ffffffff860c4d6a R09: 0000000000000000
R10: fffffbfff117be8d R11: 0000000000000000 R12: dffffc0000000000
R13: ffff888097a9e4c0 R14: ffff888097a9e850 R15: ffff888097a9e840
  __list_add include/linux/list.h:60 [inline]
  list_add_tail include/linux/list.h:93 [inline]
  list_move_tail include/linux/list.h:214 [inline]
  hhf_dequeue+0x535/0xaa0 net/sched/sch_hhf.c:439
  dequeue_skb net/sched/sch_generic.c:258 [inline]
  qdisc_restart net/sched/sch_generic.c:361 [inline]
  __qdisc_run+0x217/0x1b30 net/sched/sch_generic.c:379
  __dev_xmit_skb net/core/dev.c:3533 [inline]
  __dev_queue_xmit+0x1161/0x3020 net/core/dev.c:3838
  dev_queue_xmit+0x17/0x20 net/core/dev.c:3902
  neigh_hh_output include/net/neighbour.h:500 [inline]
  neigh_output include/net/neighbour.h:509 [inline]
  ip6_finish_output2+0xff2/0x13d0 net/ipv6/ip6_output.c:116
  __ip6_finish_output+0x693/0x910 net/ipv6/ip6_output.c:142
  ip6_finish_output+0x52/0x1e0 net/ipv6/ip6_output.c:152
  NF_HOOK_COND include/linux/netfilter.h:294 [inline]
  ip6_output+0x26f/0x390 net/ipv6/ip6_output.c:175
  dst_output include/net/dst.h:436 [inline]
  NF_HOOK include/linux/netfilter.h:305 [inline]
  mld_sendpack+0x770/0xb90 net/ipv6/mcast.c:1682
  mld_send_cr net/ipv6/mcast.c:1978 [inline]
  mld_ifc_timer_expire+0x820/0xb70 net/ipv6/mcast.c:2477
  call_timer_fn+0x95/0x170 kernel/time/timer.c:1322
  expire_timers kernel/time/timer.c:1366 [inline]
  __run_timers+0x79e/0x970 kernel/time/timer.c:1685
  run_timer_softirq+0x4a/0x90 kernel/time/timer.c:1698
  __do_softirq+0x333/0x7c4 arch/x86/include/asm/paravirt.h:778
  invoke_softirq kernel/softirq.c:373 [inline]
  irq_exit+0x227/0x230 kernel/softirq.c:413
  exiting_irq arch/x86/include/asm/apic.h:537 [inline]
  smp_apic_timer_interrupt+0x113/0x280 arch/x86/kernel/apic/apic.c:1133
  apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830
  </IRQ>
RIP: 0010:native_safe_halt+0xe/0x10 arch/x86/include/asm/irqflags.h:61
Code: 30 fa eb ae 89 d9 80 e1 07 80 c1 03 38 c1 7c ba 48 89 df e8 f4 b0 30 fa eb b0 90 90 e9 07 00 00 00 0f 00 2d 96 b0 46 00 fb f4 <c3> 90 e9 07 00 00 00 0f 00 2d 86 b0 46 00 f4 c3 90 90 55 48 89 e5
RSP: 0018:ffffffff88807dc0 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
RAX: 1ffffffff11150f3 RBX: ffffffff88875a00 RCX: dffffc0000000000
RDX: 0000000000000000 RSI: ffffffff812b7a1a RDI: ffffffff877bd46a
RBP: ffffffff88807dc8 R08: ffffffff81790e14 R09: fffffbfff110eb41
R10: fffffbfff110eb41 R11: 0000000000000000 R12: dffffc0000000000
R13: 1ffffffff110eb40 R14: dffffc0000000000 R15: 0000000000000000
  arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:571
  default_idle_call+0x59/0xa0 kernel/sched/idle.c:94
  cpuidle_idle_call kernel/sched/idle.c:154 [inline]
  do_idle+0x140/0x6d0 kernel/sched/idle.c:263
  cpu_startup_entry+0x25/0x30 kernel/sched/idle.c:354
  rest_init+0x29d/0x2b0 init/main.c:451
  arch_call_rest_init+0xe/0x10
  start_kernel+0x6f5/0x7f6 init/main.c:785
  x86_64_start_reservations+0x18/0x2e arch/x86/kernel/head64.c:472
  x86_64_start_kernel+0x7a/0x7d arch/x86/kernel/head64.c:453
  secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:241


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
For information about bisection process see: https://goo.gl/tpsmEJ#bisection
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches


* general protection fault in qdisc_put
From: syzbot @ 2019-09-08  6:08 UTC (permalink / raw)
  To: akinobu.mita, akpm, davem, dvyukov, jhs, jiri, linux-kernel,
	mhocko, netdev, syzkaller-bugs, torvalds, xiyou.wangcong

Hello,

syzbot found the following crash on:

HEAD commit:    3b47fd5c Merge tag 'nfs-for-5.3-4' of git://git.linux-nfs...
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=10244dd6600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=b89bb446a3faaba4
dashboard link: https://syzkaller.appspot.com/bug?extid=d5870a903591faaca4ae
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=174743fe600000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=11f8c43e600000

The bug was bisected to:

commit e41d58185f1444368873d4d7422f7664a68be61d
Author: Dmitry Vyukov <dvyukov@google.com>
Date:   Wed Jul 12 21:34:35 2017 +0000

     fault-inject: support systematic fault injection

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=13f66bc6600000
final crash:    https://syzkaller.appspot.com/x/report.txt?x=100e6bc6600000
console output: https://syzkaller.appspot.com/x/log.txt?x=17f66bc6600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+d5870a903591faaca4ae@syzkaller.appspotmail.com
Fixes: e41d58185f14 ("fault-inject: support systematic fault injection")

RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000003
RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000001bbbbbb
R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff
R13: 0000000000000005 R14: 0000000000000000 R15: 0000000000000000
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 9699 Comm: syz-executor169 Not tainted 5.3.0-rc7+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:qdisc_put+0x25/0x90 net/sched/sch_generic.c:983
Code: 00 00 00 00 00 55 48 89 e5 41 54 49 89 fc 53 e8 c1 52 bf fb 49 8d 7c 24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 04 3c 03 7e 54 41 8b 5c 24 10 31 ff 83 e3 01
RSP: 0018:ffff8880944c7488 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: ffff8880945c8540 RCX: ffffffff85b49e8a
RDX: 0000000000000002 RSI: ffffffff85b3228f RDI: 0000000000000010
RBP: ffff8880944c7498 R08: ffff888099d50480 R09: ffffed1012898e45
R10: ffffed1012898e44 R11: 0000000000000003 R12: 0000000000000000
R13: ffff8880945c8540 R14: ffff888094894500 R15: ffff8880945c857c
FS:  0000555557553880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000610 CR3: 000000008c29d000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  sfb_destroy+0x61/0x80 net/sched/sch_sfb.c:468
  qdisc_create+0xbc6/0x1210 net/sched/sch_api.c:1285
  tc_modify_qdisc+0x524/0x1c50 net/sched/sch_api.c:1652
  rtnetlink_rcv_msg+0x463/0xb00 net/core/rtnetlink.c:5223
  netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
  rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5241
  netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
  netlink_unicast+0x531/0x710 net/netlink/af_netlink.c:1328
  netlink_sendmsg+0x8a5/0xd60 net/netlink/af_netlink.c:1917
  sock_sendmsg_nosec net/socket.c:637 [inline]
  sock_sendmsg+0xd7/0x130 net/socket.c:657
  ___sys_sendmsg+0x803/0x920 net/socket.c:2311
  __sys_sendmsg+0x105/0x1d0 net/socket.c:2356
  __do_sys_sendmsg net/socket.c:2365 [inline]
  __se_sys_sendmsg net/socket.c:2363 [inline]
  __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2363
  do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4424f9
Code: e8 9c 07 03 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 3b 0a fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fffed10bed8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004424f9
RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000003
RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000001bbbbbb
R10: 0000000000000000 R11: 0000000000000246 R12: ffffffffffffffff
R13: 0000000000000005 R14: 0000000000000000 R15: 0000000000000000
Modules linked in:
---[ end trace 97e52c48ae7a3cc1 ]---
RIP: 0010:qdisc_put+0x25/0x90 net/sched/sch_generic.c:983
Code: 00 00 00 00 00 55 48 89 e5 41 54 49 89 fc 53 e8 c1 52 bf fb 49 8d 7c 24 10 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 04 3c 03 7e 54 41 8b 5c 24 10 31 ff 83 e3 01
RSP: 0018:ffff8880944c7488 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: ffff8880945c8540 RCX: ffffffff85b49e8a
RDX: 0000000000000002 RSI: ffffffff85b3228f RDI: 0000000000000010
RBP: ffff8880944c7498 R08: ffff888099d50480 R09: ffffed1012898e45
R10: ffffed1012898e44 R11: 0000000000000003 R12: 0000000000000000
R13: ffff8880945c8540 R14: ffff888094894500 R15: ffff8880945c857c
FS:  0000555557553880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000610 CR3: 000000008c29d000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400




* WARNING in cbs_dequeue_soft
From: syzbot @ 2019-09-08  6:08 UTC (permalink / raw)
  To: davem, jhs, jiri, leandro.maciel.dorileo, linux-kernel, netdev,
	syzkaller-bugs, vedang.patel, xiyou.wangcong

Hello,

syzbot found the following crash on:

HEAD commit:    6d028043 Add linux-next specific files for 20190830
git tree:       linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=17f1421a600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=82a6bec43ab0cb69
dashboard link: https://syzkaller.appspot.com/bug?extid=cdbea9b616d35e2365ae
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=147b54d1600000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=16c5da6e600000

The bug was bisected to:

commit e0a7683d30e91e30ee6cf96314ae58a0314a095e
Author: Leandro Dorileo <leandro.maciel.dorileo@intel.com>
Date:   Mon Apr 8 17:12:18 2019 +0000

     net/sched: cbs: fix port_rate miscalculation

bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=130c614e600000
final crash:    https://syzkaller.appspot.com/x/report.txt?x=108c614e600000
console output: https://syzkaller.appspot.com/x/log.txt?x=170c614e600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+cdbea9b616d35e2365ae@syzkaller.appspotmail.com
Fixes: e0a7683d30e9 ("net/sched: cbs: fix port_rate miscalculation")

------------[ cut here ]------------
cbs: dequeue() called with unknown port rate.
WARNING: CPU: 1 PID: 8572 at net/sched/sch_cbs.c:185 cbs_dequeue_soft+0x37e/0x4b0 net/sched/sch_cbs.c:185
Kernel panic - not syncing: panic_on_warn set ...
CPU: 1 PID: 8572 Comm: kworker/1:2 Not tainted 5.3.0-rc6-next-20190830 #75
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: ipv6_addrconf addrconf_dad_work
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  panic+0x2dc/0x755 kernel/panic.c:220
  __warn.cold+0x2f/0x3c kernel/panic.c:581
  report_bug+0x289/0x300 lib/bug.c:195
  fixup_bug arch/x86/kernel/traps.c:179 [inline]
  fixup_bug arch/x86/kernel/traps.c:174 [inline]
  do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:272
  do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:291
  invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1028
RIP: 0010:cbs_dequeue_soft+0x37e/0x4b0 net/sched/sch_cbs.c:185
Code: 1d 2c b3 f5 03 31 ff 89 de e8 fe 6d a6 fb 84 db 75 1a e8 b5 6c a6 fb 48 c7 c7 80 7d 4a 88 c6 05 0c b3 f5 03 01 e8 0a bb 77 fb <0f> 0b 45 31 e4 eb b1 49 bc ff ff ff ff ff ff ff 7f 48 89 55 d0 e8
RSP: 0018:ffff8880a129f3e8 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff815bf786 RDI: ffffed1014253e6f
RBP: ffff8880a129f430 R08: ffff8880a63f4040 R09: fffffbfff14ed341
R10: fffffbfff14ed340 R11: ffffffff8a769a07 R12: ffff8880911a5800
R13: ffff888095de92c8 R14: 0000000f8f3a4493 R15: ffffffffffffffff
  cbs_dequeue+0x34/0x40 net/sched/sch_cbs.c:237
  dequeue_skb net/sched/sch_generic.c:258 [inline]
  qdisc_restart net/sched/sch_generic.c:361 [inline]
  __qdisc_run+0x1e7/0x19d0 net/sched/sch_generic.c:379
  __dev_xmit_skb net/core/dev.c:3533 [inline]
  __dev_queue_xmit+0x16f1/0x37c0 net/core/dev.c:3838
  dev_queue_xmit+0x18/0x20 net/core/dev.c:3902
  neigh_resolve_output net/core/neighbour.c:1490 [inline]
  neigh_resolve_output+0x5a5/0x970 net/core/neighbour.c:1470
  neigh_output include/net/neighbour.h:511 [inline]
  ip6_finish_output2+0x1034/0x2550 net/ipv6/ip6_output.c:116
  __ip6_finish_output+0x444/0xaa0 net/ipv6/ip6_output.c:142
  ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:152
  NF_HOOK_COND include/linux/netfilter.h:294 [inline]
  ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:175
  dst_output include/net/dst.h:436 [inline]
  NF_HOOK include/linux/netfilter.h:305 [inline]
  ndisc_send_skb+0xf29/0x14a0 net/ipv6/ndisc.c:505
  ndisc_send_ns+0x3a9/0x850 net/ipv6/ndisc.c:647
  addrconf_dad_work+0xb88/0x1150 net/ipv6/addrconf.c:4120
  process_one_work+0x9af/0x1740 kernel/workqueue.c:2269
  worker_thread+0x98/0xe40 kernel/workqueue.c:2415
  kthread+0x361/0x430 kernel/kthread.c:255
  ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352
Kernel Offset: disabled
Rebooting in 86400 seconds..




* general protection fault in cbs_destroy
From: syzbot @ 2019-09-08  6:08 UTC (permalink / raw)
  To: davem, jhs, jiri, linux-kernel, netdev, syzkaller-bugs,
	xiyou.wangcong

Hello,

syzbot found the following crash on:

HEAD commit:    3b47fd5c Merge tag 'nfs-for-5.3-4' of git://git.linux-nfs...
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=14854e71600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=144488c6c6c6d2b6
dashboard link: https://syzkaller.appspot.com/bug?extid=3a8d6a998cbb73bcf337
compiler:       clang version 9.0.0 (/home/glider/llvm/clang 80fee25776c2fb61e74c1ecb1a523375c2500b69)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=17998f9e600000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=10421efa600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+3a8d6a998cbb73bcf337@syzkaller.appspotmail.com

8021q: adding VLAN 0 to HW filter on device batadv0
netlink: 24 bytes leftover after parsing attributes in process `syz-executor457'.
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 9249 Comm: syz-executor457 Not tainted 5.3.0-rc7+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__list_del_entry_valid+0x6b/0x100 lib/list_debug.c:51
Code: 4c 89 f7 e8 97 d0 58 fe 48 ba 00 01 00 00 00 00 ad de 49 8b 1e 48 39 d3 74 54 48 83 c2 22 49 39 d7 74 5e 4c 89 f8 48 c1 e8 03 <42> 80 3c 20 00 74 08 4c 89 ff e8 66 d0 58 fe 49 8b 17 4c 39 f2 75
RSP: 0018:ffff88809898f568 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
RDX: dead000000000122 RSI: 0000000000000004 RDI: ffff88809fb5a7e8
RBP: ffff88809898f588 R08: dffffc0000000000 R09: ffffed1013131ea8
R10: ffffed1013131ea8 R11: 0000000000000000 R12: dffffc0000000000
R13: ffff88809fb5a480 R14: ffff88809fb5a7e0 R15: 0000000000000000
FS:  00005555568cb880(0000) GS:ffff8880aea00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000610 CR3: 00000000a3968000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  __list_del_entry include/linux/list.h:131 [inline]
  list_del include/linux/list.h:139 [inline]
  cbs_destroy+0x85/0x3e0 net/sched/sch_cbs.c:435
  qdisc_create+0xff8/0x13e0 net/sched/sch_api.c:1302
  tc_modify_qdisc+0x989/0x1ea0 net/sched/sch_api.c:1652
  rtnetlink_rcv_msg+0x889/0xd40 net/core/rtnetlink.c:5223
  netlink_rcv_skb+0x19e/0x3d0 net/netlink/af_netlink.c:2477
  rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:5241
  netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
  netlink_unicast+0x787/0x900 net/netlink/af_netlink.c:1328
  netlink_sendmsg+0x993/0xc50 net/netlink/af_netlink.c:1917
  sock_sendmsg_nosec net/socket.c:637 [inline]
  sock_sendmsg net/socket.c:657 [inline]
  ___sys_sendmsg+0x60d/0x910 net/socket.c:2311
  __sys_sendmsg net/socket.c:2356 [inline]
  __do_sys_sendmsg net/socket.c:2365 [inline]
  __se_sys_sendmsg net/socket.c:2363 [inline]
  __x64_sys_sendmsg+0x17c/0x200 net/socket.c:2363
  do_syscall_64+0xfe/0x140 arch/x86/entry/common.c:296
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x441b59
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b 10 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffe29572cf8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000441b59
RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000003
RBP: 00007ffe29572d10 R08: 0000000001bbbbbb R09: 0000000001bbbbbb
R10: 0000000001bbbbbb R11: 0000000000000246 R12: 0000000000000000
R13: 00000000004030f0 R14: 0000000000000000 R15: 0000000000000000
Modules linked in:
---[ end trace 226030e488aca074 ]---
RIP: 0010:__list_del_entry_valid+0x6b/0x100 lib/list_debug.c:51
Code: 4c 89 f7 e8 97 d0 58 fe 48 ba 00 01 00 00 00 00 ad de 49 8b 1e 48 39 d3 74 54 48 83 c2 22 49 39 d7 74 5e 4c 89 f8 48 c1 e8 03 <42> 80 3c 20 00 74 08 4c 89 ff e8 66 d0 58 fe 49 8b 17 4c 39 f2 75
RSP: 0018:ffff88809898f568 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000001
RDX: dead000000000122 RSI: 0000000000000004 RDI: ffff88809fb5a7e8
RBP: ffff88809898f588 R08: dffffc0000000000 R09: ffffed1013131ea8
R10: ffffed1013131ea8 R11: 0000000000000000 R12: dffffc0000000000
R13: ffff88809fb5a480 R14: ffff88809fb5a7e0 R15: 0000000000000000
FS:  00005555568cb880(0000) GS:ffff8880aea00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000610 CR3: 00000000a3968000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400


