* Re: [PULL] vhost: cleanups and fixes
From: Michael S. Tsirkin @ 2018-11-02 13:04 UTC (permalink / raw)
To: Mark Rutland
Cc: Linus Torvalds, Kees Cook, kvm, virtualization, netdev,
Linux Kernel Mailing List, Andrew Morton, bijan.mottahedeh,
gedwards, joe, lenaic, liang.z.li, mhocko, mhocko, stefanha,
wei.w.wang
In-Reply-To: <20181102114635.hi3q53kzmz4qljsf@lakrids.cambridge.arm.com>
On Fri, Nov 02, 2018 at 11:46:36AM +0000, Mark Rutland wrote:
> On Thu, Nov 01, 2018 at 04:06:19PM -0700, Linus Torvalds wrote:
> > On Thu, Nov 1, 2018 at 4:00 PM Kees Cook <keescook@chromium.org> wrote:
> > >
> > > + memset(&rsp, 0, sizeof(rsp));
> > > + rsp.response = VIRTIO_SCSI_S_FUNCTION_REJECTED;
> > > + resp = vq->iov[out].iov_base;
> > > + ret = __copy_to_user(resp, &rsp, sizeof(rsp));
> > >
> > > Is it actually safe to trust that iov_base has passed an earlier
> > > access_ok() check here? Why not just use copy_to_user() instead?
> >
> > Good point.
> >
> > We really should have removed those double-underscore things ages ago.
>
> FWIW, on arm64 we always check/sanitize the user address as a result of
> our sanitization of speculated values. Almost all of our uaccess
> routines have an explicit access_ok().
>
> All our uaccess routines mask the user pointer based on addr_limit,
> which prevents speculative or architectural uaccess to kernel addresses
> when addr_limit it USER_DS:
>
> 4d8efc2d5ee4c9cc ("arm64: Use pointer masking to limit uaccess speculation")
>
> We also inhibit speculative stores to addr_limit being forwarded under
> speculation:
>
> c2f0ad4fc089cff8 ("arm64: uaccess: Prevent speculative use of the current addr_limit")
>
> ... and given all that, we folded explicit access_ok() checks into
> __{get,put}_user():
>
> 84624087dd7e3b48 ("arm64: uaccess: Don't bother eliding access_ok checks in __{get, put}_user")
>
> IMO we could/should do the same for __copy_{to,from}_user().
>
> Thanks,
> Mark.
I've tried making access_ok mask the parameter it gets. Works because
access_ok is a macro. Most users pass in a variable so that will block
attempts to use speculation to bypass the access_ok checks.
Not 100% as someone can copy the value before access_ok, but
then it's all mitigation anyway.
Places which call access_ok on a non-lvalue need to be fixed then but
there are not too many of these.
The advantage here is that a code like this:
access_ok
for(...)
__get_user
isn't slowed down as the masking is outside the loop.
OTOH macros changing their arguments are kind of ugly.
What do others think?
Just to show what I mean:
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index aae77eb8491c..c4d12c8f47d7 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -7,6 +7,7 @@
#include <linux/compiler.h>
#include <linux/kasan-checks.h>
#include <linux/string.h>
+#include <linux/nospec.h>
#include <asm/asm.h>
#include <asm/page.h>
#include <asm/smap.h>
@@ -69,6 +70,33 @@ static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, un
__chk_range_not_ok((unsigned long __force)(addr), size, limit); \
})
+/*
+ * Test whether a block of memory is a valid user space address.
+ * Returns 0 if the range is valid, address itself otherwise.
+ */
+static inline unsigned long __verify_range_nospec(unsigned long addr,
+ unsigned long size,
+ unsigned long limit)
+{
+ /* Be careful about overflow */
+ limit = array_index_nospec(limit, size);
+
+ /*
+ * If we have used "sizeof()" for the size,
+ * we know it won't overflow the limit (but
+ * it might overflow the 'addr', so it's
+ * important to subtract the size from the
+ * limit, not add it to the address).
+ */
+ if (__builtin_constant_p(size)) {
+ return array_index_nospec(addr, limit - size + 1);
+ }
+
+ /* Arbitrary sizes? Be careful about overflow */
+ return array_index_mask_nospec(limit, size) &
+ array_index_nospec(addr, limit - size + 1);
+}
+
#ifdef CONFIG_DEBUG_ATOMIC_SLEEP
# define WARN_ON_IN_IRQ() WARN_ON_ONCE(!in_task())
#else
@@ -95,12 +123,46 @@ static inline bool __chk_range_not_ok(unsigned long addr, unsigned long size, un
* checks that the pointer is in the user space range - after calling
* this function, memory access functions may still return -EFAULT.
*/
-#define access_ok(type, addr, size) \
+#define unsafe_access_ok(type, addr, size) \
({ \
WARN_ON_IN_IRQ(); \
likely(!__range_not_ok(addr, size, user_addr_max())); \
})
+/**
+ * access_ok_nospec: - Checks if a user space pointer is valid
+ * @type: Type of access: %VERIFY_READ or %VERIFY_WRITE. Note that
+ * %VERIFY_WRITE is a superset of %VERIFY_READ - if it is safe
+ * to write to a block, it is always safe to read from it.
+ * @addr: User space pointer to start of block to check
+ * @size: Size of block to check
+ *
+ * Context: User context only. This function may sleep if pagefaults are
+ * enabled.
+ *
+ * Checks if a pointer to a block of memory in user space is valid.
+ *
+ * Returns address itself (nonzero) if the memory block may be valid,
+ * zero if it is definitely invalid.
+ *
+ * To prevent speculation, the returned value must then be used
+ * for accesses.
+ *
+ * Note that, depending on architecture, this function probably just
+ * checks that the pointer is in the user space range - after calling
+ * this function, memory access functions may still return -EFAULT.
+ */
+#define access_ok_nospec(type, addr, size) \
+({ \
+ WARN_ON_IN_IRQ(); \
+ __chk_user_ptr(addr); \
+ addr = (typeof(addr) __force) \
+ __verify_range_nospec((unsigned long __force)(addr), \
+ size, user_addr_max()); \
+})
+
+#define access_ok(type, addr, size) access_ok_nospec(type, addr, size)
+
/*
* These are the main single-value transfer routines. They automatically
* use the right size if we just have the right pointer type.
--
MST
^ permalink raw reply related
* Re: [PATCH v3 1/2] kretprobe: produce sane stack traces
From: Steven Rostedt @ 2018-11-02 13:16 UTC (permalink / raw)
To: Aleksa Sarai
Cc: Naveen N. Rao, Anil S Keshavamurthy, David S. Miller,
Masami Hiramatsu, Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
Namhyung Kim, Shuah Khan, Alexei Starovoitov, Daniel Borkmann,
Brendan Gregg, Christian Brauner, Aleksa Sarai, netdev, linux-doc,
linux-kernel
In-Reply-To: <20181102065932.bdt4pubbrkvql4mp@yavin>
On Fri, 2 Nov 2018 17:59:32 +1100
Aleksa Sarai <cyphar@cyphar.com> wrote:
> As an aside, I just tested with the frame unwinder and it isn't thrown
> off-course by kretprobe_trampoline (though obviously the stack is still
> wrong). So I think we just need to hook into the ORC unwinder to get it
> to continue skipping up the stack, as well as add the rewriting code for
> the stack traces (for all unwinders I guess -- though ideally we should
I agree that this is the right solution.
> do this without having to add the same code to every architecture).
True, and there's an art to consolidating the code between
architectures.
I'm currently looking at function graph and seeing if I can consolidate
it too. And I'm also trying to get multiple uses to hook into its
infrastructure. I think I finally figured out a way to do so.
The reason it is difficult, is that you need to maintain state between
the entry of a function and the exit for each task and callback that is
registered. Hence, it's a 3x tuple (function stack, task, callbacks).
And this must be maintained with preemption. A task may sleep for
minutes, and the state needs to be retained.
The only state that must be retained is the function stack with the
task, because if that gets out of sync, the system crashes. But the
callback state can be removed.
Here's what is there now:
When something is registered with the function graph tracer, every
task gets a shadowed stack. A hook is added to fork to add shadow
stacks to new tasks. Once a shadow stack is added to a task, that
shadow stack is never removed until the task exits.
When the function is entered, the real return code is stored in the
shadow stack and the trampoline address is put in its place.
On return, the trampoline is called, and it will pop off the return
code from the shadow stack and return to that.
The issue with multiple users, is that different users may want to
trace different functions. On entry, the user could say it doesn't want
to trace the current function, and the return part must not be called
on exit. Keeping track of which user needs the return called is the
tricky part.
Here's what I plan on implementing:
Along with a shadow stack, I was going to add a 4096 byte (one page)
array that holds 64 8 byte masks to every task as well. This will allow
64 simultaneous users (which is rather extreme). If we need to support
more, we could allocate another page for all tasks. The 8 byte mask
will represent each depth (allowing to do this for 64 function call
stack depth, which should also be enough).
Each user will be assigned one of the masks. Each bit in the mask
represents the depth of the shadow stack. When a function is called,
each user registered with the function graph tracer will get called
(if they asked to be called for this function, via the ftrace_ops
hashes) and if they want to trace the function, then the bit is set in
the mask for that stack depth.
When the function exits the function and we pop off the return code
from the shadow stack, we then look at all the bits set for the
corresponding users, and call their return callbacks, and ignore
anything that is not set.
When a user is unregistered, it the corresponding bits that represent
it are cleared, and it the return callback will not be called. But the
tasks being traced will still have their shadow stack to allow it to
get back to normal.
I'll hopefully have a prototype ready by plumbers.
And this too will require each architecture to probably change. As a
side project to this, I'm going to try to consolidate the function
graph code among all the architectures as well. Not an easy task.
-- Steve
^ permalink raw reply
* [PATCH] macvlan: use per-cpu queues for broadcast and multicast packets
From: Konstantin Khlebnikov @ 2018-11-02 13:36 UTC (permalink / raw)
To: netdev, David S. Miller, linux-kernel; +Cc: Vadim Fedorenko
Currently macvlan has single per-port queue for broadcast and multicast.
This disrupts order of packets when flows from different cpus are mixed.
This patch replaces this queue with single set of per-cpu queues.
Pointer to macvlan port is passed in skb control block.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Reported-add-tested-by: Vadim Fedorenko <vfedorenko@yandex-team.ru>
---
drivers/net/macvlan.c | 65 +++++++++++++++++++++++++++++--------------------
1 file changed, 38 insertions(+), 27 deletions(-)
diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c
index fc8d5f1ee1ad..1e9c37ec43c3 100644
--- a/drivers/net/macvlan.c
+++ b/drivers/net/macvlan.c
@@ -46,8 +46,6 @@ struct macvlan_port {
struct net_device *dev;
struct hlist_head vlan_hash[MACVLAN_HASH_SIZE];
struct list_head vlans;
- struct sk_buff_head bc_queue;
- struct work_struct bc_work;
u32 flags;
int count;
struct hlist_head vlan_source_hash[MACVLAN_HASH_SIZE];
@@ -55,6 +53,11 @@ struct macvlan_port {
unsigned char perm_addr[ETH_ALEN];
};
+struct macvlan_bc_work {
+ struct sk_buff_head bc_queue;
+ struct work_struct bc_work;
+};
+
struct macvlan_source_entry {
struct hlist_node hlist;
struct macvlan_dev *vlan;
@@ -63,6 +66,7 @@ struct macvlan_source_entry {
};
struct macvlan_skb_cb {
+ const struct macvlan_port *port;
const struct macvlan_dev *src;
};
@@ -295,20 +299,23 @@ static void macvlan_broadcast(struct sk_buff *skb,
}
}
+static DEFINE_PER_CPU(struct macvlan_bc_work, macvlan_bc_work);
+
static void macvlan_process_broadcast(struct work_struct *w)
{
- struct macvlan_port *port = container_of(w, struct macvlan_port,
+ struct macvlan_bc_work *work = container_of(w, struct macvlan_bc_work,
bc_work);
struct sk_buff *skb;
struct sk_buff_head list;
__skb_queue_head_init(&list);
- spin_lock_bh(&port->bc_queue.lock);
- skb_queue_splice_tail_init(&port->bc_queue, &list);
- spin_unlock_bh(&port->bc_queue.lock);
+ spin_lock_bh(&work->bc_queue.lock);
+ skb_queue_splice_tail_init(&work->bc_queue, &list);
+ spin_unlock_bh(&work->bc_queue.lock);
while ((skb = __skb_dequeue(&list))) {
+ const struct macvlan_port *port = MACVLAN_SKB_CB(skb)->port;
const struct macvlan_dev *src = MACVLAN_SKB_CB(skb)->src;
rcu_read_lock();
@@ -345,6 +352,7 @@ static void macvlan_broadcast_enqueue(struct macvlan_port *port,
const struct macvlan_dev *src,
struct sk_buff *skb)
{
+ struct macvlan_bc_work *work;
struct sk_buff *nskb;
int err = -ENOMEM;
@@ -352,24 +360,30 @@ static void macvlan_broadcast_enqueue(struct macvlan_port *port,
if (!nskb)
goto err;
+ MACVLAN_SKB_CB(nskb)->port = port;
MACVLAN_SKB_CB(nskb)->src = src;
- spin_lock(&port->bc_queue.lock);
- if (skb_queue_len(&port->bc_queue) < MACVLAN_BC_QUEUE_LEN) {
+ work = get_cpu_ptr(&macvlan_bc_work);
+
+ spin_lock(&work->bc_queue.lock);
+ if (skb_queue_len(&work->bc_queue) < MACVLAN_BC_QUEUE_LEN) {
if (src)
dev_hold(src->dev);
- __skb_queue_tail(&port->bc_queue, nskb);
+ __skb_queue_tail(&work->bc_queue, nskb);
err = 0;
}
- spin_unlock(&port->bc_queue.lock);
+ spin_unlock(&work->bc_queue.lock);
if (err)
goto free_nskb;
- schedule_work(&port->bc_work);
+ schedule_work_on(smp_processor_id(), &work->bc_work);
+ put_cpu_ptr(work);
+
return;
free_nskb:
+ put_cpu_ptr(work);
kfree_skb(nskb);
err:
atomic_long_inc(&skb->dev->rx_dropped);
@@ -1168,9 +1182,6 @@ static int macvlan_port_create(struct net_device *dev)
for (i = 0; i < MACVLAN_HASH_SIZE; i++)
INIT_HLIST_HEAD(&port->vlan_source_hash[i]);
- skb_queue_head_init(&port->bc_queue);
- INIT_WORK(&port->bc_work, macvlan_process_broadcast);
-
err = netdev_rx_handler_register(dev, macvlan_handle_frame, port);
if (err)
kfree(port);
@@ -1182,24 +1193,16 @@ static int macvlan_port_create(struct net_device *dev)
static void macvlan_port_destroy(struct net_device *dev)
{
struct macvlan_port *port = macvlan_port_get_rtnl(dev);
- struct sk_buff *skb;
+ int cpu;
dev->priv_flags &= ~IFF_MACVLAN_PORT;
netdev_rx_handler_unregister(dev);
/* After this point, no packet can schedule bc_work anymore,
- * but we need to cancel it and purge left skbs if any.
+ * but we need to flush work.
*/
- cancel_work_sync(&port->bc_work);
-
- while ((skb = __skb_dequeue(&port->bc_queue))) {
- const struct macvlan_dev *src = MACVLAN_SKB_CB(skb)->src;
-
- if (src)
- dev_put(src->dev);
-
- kfree_skb(skb);
- }
+ for_each_possible_cpu(cpu)
+ flush_work(per_cpu_ptr(&macvlan_bc_work.bc_work, cpu));
/* If the lower device address has been changed by passthru
* macvlan, put it back.
@@ -1702,7 +1705,15 @@ static struct notifier_block macvlan_notifier_block __read_mostly = {
static int __init macvlan_init_module(void)
{
- int err;
+ int err, cpu;
+
+ for_each_possible_cpu(cpu) {
+ struct macvlan_bc_work *work;
+
+ work = per_cpu_ptr(&macvlan_bc_work, cpu);
+ skb_queue_head_init(&work->bc_queue);
+ INIT_WORK(&work->bc_work, macvlan_process_broadcast);
+ }
register_netdevice_notifier(&macvlan_notifier_block);
^ permalink raw reply related
* Re: [RFC PATCH] lib: Introduce generic __cmpxchg_u64() and use it where needed
From: Paul E. McKenney @ 2018-11-02 13:38 UTC (permalink / raw)
To: Peter Zijlstra
Cc: David Laight, Trond Myklebust, mark.rutland@arm.com,
linux-kernel@vger.kernel.org, ralf@linux-mips.org,
jlayton@kernel.org, linuxppc-dev@lists.ozlabs.org,
bfields@fieldses.org, linux-mips@linux-mips.org,
linux@roeck-us.net, linux-nfs@vger.kernel.org,
akpm@linux-foundation.org, will.deacon@arm.com,
boqun.feng@gmail.com, paul.burton@mips.com,
"anna.schumaker@netapp.com" <a
In-Reply-To: <20181102122328.GM3178@hirez.programming.kicks-ass.net>
On Fri, Nov 02, 2018 at 01:23:28PM +0100, Peter Zijlstra wrote:
> On Fri, Nov 02, 2018 at 10:56:31AM +0000, David Laight wrote:
> > From: Paul E. McKenney
> > > Sent: 01 November 2018 17:02
> > ...
> > > And there is a push to define C++ signed arithmetic as 2s complement,
> > > but there are still 1s complement systems with C compilers. Just not
> > > C++ compilers. Legacy...
> >
> > Hmmm... I've used C compilers for DSPs where signed integer arithmetic
> > used the 'data registers' and would saturate, unsigned used the 'address
> > registers' and wrapped.
> > That was deliberate because it is much better to clip analogue values.
>
> Seems a dodgy heuristic if you ask me.
>
> > Then there was the annoying cobol run time that didn't update the
> > result variable if the result wouldn't fit.
> > Took a while to notice that the sum of a list of values was even wrong!
> > That would be perfectly valid for C - if unexpected.
>
> That's just insane ;-)
>
> > > > But for us using -fno-strict-overflow which actually defines signed
> > > > overflow
> >
> > I wonder how much real code 'strict-overflow' gets rid of?
> > IIRC gcc silently turns loops like:
> > int i; for (i = 1; i != 0; i *= 2) ...
> > into infinite ones.
> > Which is never what is required.
>
> Nobody said C was a 'safe' language. But less UB makes a better language
> IMO. Ideally we'd get all UBs filled in -- but I realise C has a few
> very 'interesting' ones that might be hard to get rid of.
There has been an effort to reduce UB, but not sure how far they got.
Thanx, Paul
^ permalink raw reply
* Re: [PATCH v3 1/2] kretprobe: produce sane stack traces
From: Aleksa Sarai @ 2018-11-02 4:37 UTC (permalink / raw)
To: Masami Hiramatsu
Cc: Naveen N. Rao, Anil S Keshavamurthy, David S. Miller,
Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
Namhyung Kim, Steven Rostedt, Shuah Khan, Alexei Starovoitov,
Daniel Borkmann, Brendan Gregg, Christian Brauner, Aleksa Sarai,
netdev, linux-doc, linux-kernel
In-Reply-To: <20181102120441.d00af1b57e6a739d0e7c7a91@kernel.org>
[-- Attachment #1: Type: text/plain, Size: 4443 bytes --]
On 2018-11-02, Masami Hiramatsu <mhiramat@kernel.org> wrote:
> On Fri, 2 Nov 2018 08:13:43 +1100
> Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> > On 2018-11-02, Masami Hiramatsu <mhiramat@kernel.org> wrote:
> > > Please split the test case as an independent patch.
> >
> > Will do. Should the Documentation/ change also be a separate patch?
>
> I think the Documentation change can be coupled with code change
> if the change is small. But selftests is different, that can be
> backported soon for testing the stable kernels.
>
>
> > > > new file mode 100644
> > > > index 000000000000..03146c6a1a3c
> > > > --- /dev/null
> > > > +++ b/tools/testing/selftests/ftrace/test.d/kprobe/kretprobe_stacktrace.tc
> > > > @@ -0,0 +1,25 @@
> > > > +#!/bin/sh
> > > > +# SPDX-License-Identifier: GPL-2.0+
> > > > +# description: Kretprobe dynamic event with a stacktrace
> > > > +
> > > > +[ -f kprobe_events ] || exit_unsupported # this is configurable
> > > > +
> > > > +echo 0 > events/enable
> > > > +echo 1 > options/stacktrace
> > > > +
> > > > +echo 'r:teststackprobe sched_fork $retval' > kprobe_events
> > > > +grep teststackprobe kprobe_events
> > > > +test -d events/kprobes/teststackprobe
> > >
> > > Hmm, what happen if we have 2 or more kretprobes on same stack?
> > > It seems you just save stack in pre_handler, but that stack can already
> > > includes another kretprobe's trampline address...
> >
> > Yeah, you're quite right...
> >
> > My first instinct was to do something awful like iterate over the set of
> > "kretprobe_instance"s with ->task == current, and then correct
> > kretprobe_trampoline entries using ->ret_addr. (I think this would be
> > correct because each task can only be in one context at once, and in
> > order to get to a particular kretprobe all of your caller kretprobes
> > were inserted before you and any sibling or later kretprobe_instances
> > will have been removed. But I might be insanely wrong on this front.)
>
> yeah, you are right.
>
> >
> > However (as I noted in the other thread), there is a problem where
> > kretprobe_trampoline actually stops the unwinder in its tracks and thus
> > you only get the first kretprobe_trampoline. This is something I'm going
> > to look into some more (despite not having made progress on it last
> > time) since now it's something that actually needs to be fixed (and
> > as I mentioned in the other thread, show_stack() actually works on x86
> > in this context unlike the other stack_trace users).
>
> I should read the unwinder code, but anyway, fixing it up in kretprobe
> handler context is not hard. Since each instance is on an hlist, so when
> we hit the kretprobe_trampoline, we can search it.
As in, find the stored stack and merge the two? Interesting idea, though
Steven has shot this down because of the associated cost (I was
considering making it a kprobe flag, but that felt far too dirty).
> However, the problem is the case where the out of kretprobe handler
> context. In that context we need to try to lock the hlist and search
> the list, which will be a costful operation.
I think the best solution would be to unify all of the kretprobe-like
things so that we don't need to handle non-kretprobe contexts for
basically the same underlying idea. If we wanted to do it like this.
I think a potentially better option would be to just fix the unwinder to
correctly handle kretprobes (like it handles ftrace today).
> On the other hand, func-graph tracer introduces a shadow return address
> stack for each task (thread), and when we hit its trampoline on the stack,
> we can easily search the entry from "current" task without locking the
> shadow stack (and we already did it). This may need to pay a cost (1 page)
> for each task, but smarter than kretprobe, which makes a system-wide
> hash-table and need to search from hlist which has return addresses
> of other context coexist.
Probably a silly question (I've been staring at the function_graph code
trying to understand how it handles return addresses -- and I'm really
struggling), is this what ->ret_stack (and ->curr_ret_stack) do?
Can you explain how the .retp handling works, because I'm really missing
how the core arch/ hooks know to pass the correct retp value.
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [PATCH v3 1/2] kretprobe: produce sane stack traces
From: Aleksa Sarai @ 2018-11-02 5:05 UTC (permalink / raw)
To: Steven Rostedt
Cc: Naveen N. Rao, Anil S Keshavamurthy, David S. Miller,
Masami Hiramatsu, Jonathan Corbet, Peter Zijlstra, Ingo Molnar,
Arnaldo Carvalho de Melo, Alexander Shishkin, Jiri Olsa,
Namhyung Kim, Shuah Khan, Alexei Starovoitov, Daniel Borkmann,
Brendan Gregg, Christian Brauner, Aleksa Sarai, netdev, linux-doc,
linux-kernel
In-Reply-To: <20181101204720.6ed3fe37@vmware.local.home>
[-- Attachment #1: Type: text/plain, Size: 2288 bytes --]
On 2018-11-01, Steven Rostedt <rostedt@goodmis.org> wrote:
> On Thu, 1 Nov 2018 19:35:50 +1100
> Aleksa Sarai <cyphar@cyphar.com> wrote:
> > @@ -1834,6 +1853,11 @@ static int pre_handler_kretprobe(struct kprobe *p, struct pt_regs *regs)
> > ri->rp = rp;
> > ri->task = current;
> >
> > + trace.entries = &ri->entry.entries[0];
> > + trace.max_entries = KRETPROBE_TRACE_SIZE;
> > + save_stack_trace_regs(regs, &trace);
> > + ri->entry.nr_entries = trace.nr_entries;
> > +
>
> So basically your saving a copy of the stack trace for each probe, for
> something that may not ever be used?
>
> This adds quite a lot of overhead for something not used often.
>
> I think the real answer is to fix kretprobes (and I just checked, the
> return call of function graph tracer stack trace doesn't go past the
> return_to_handler function either. It's just that a stack trace was
> never done on the return side).
>
> The more I look at this patch, the less I like it.
That's more than fair.
There's also an issue that Masami noted -- nested kretprobes don't give
you the full stack trace with this solution since the stack was already
overwritten. I think that the only real option is to fix the unwinder to
work in a kretprobe context -- which is what I'm looking at now.
Unfortunately, I'm having a lot of trouble understanding how the current
ftrace hooking works -- ORC has a couple of ftrace hooks that seem
reasonable on the surface but I don't understand (for instance) how
HAVE_FUNCTION_GRAPH_RET_ADDR_PTR *actually* works. Though your comment
appears to indicate that it doesn't work for stack traces?
For kretprobes I think it would be fairly easy to reconstruct what
landed you into a kretprobe_trampoline by walking the set of
kretprobe_instances (since all new ones are added to the head, you can
get the real return address in-order).
But I still have to figure out what is actually stopping the
save_stack_trace() unwinder that isn't stopping the show_stacks()
unwinder (though the show_stacks() code is more ... liberal with the
degree of certainty it has about the unwind).
Am I barking up the wrong tree?
--
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH
<https://www.cyphar.com/>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: Aaron Lu @ 2018-11-02 5:23 UTC (permalink / raw)
To: Saeed Mahameed
Cc: brouer@redhat.com, pstaszewski@itcare.pl, eric.dumazet@gmail.com,
netdev@vger.kernel.org, Tariq Toukan, ilias.apalodimas@linaro.org,
yoel@kviknet.dk, mgorman@techsingularity.net
In-Reply-To: <a7cfa596d6b9979c67fa8dbd633f7dd8293337a1.camel@mellanox.com>
On Thu, Nov 01, 2018 at 08:23:19PM +0000, Saeed Mahameed wrote:
> On Thu, 2018-11-01 at 23:27 +0800, Aaron Lu wrote:
> > On Thu, Nov 01, 2018 at 10:22:13AM +0100, Jesper Dangaard Brouer
> > wrote:
> > ... ...
> > > Section copied out:
> > >
> > > mlx5e_poll_tx_cq
> > > |
> > > --16.34%--napi_consume_skb
> > > |
> > > |--12.65%--__free_pages_ok
> > > | |
> > > | --11.86%--free_one_page
> > > | |
> > > | |--10.10%
> > > --queued_spin_lock_slowpath
> > > | |
> > > | --0.65%--_raw_spin_lock
> >
> > This callchain looks like it is freeing higher order pages than order
> > 0:
> > __free_pages_ok is only called for pages whose order are bigger than
> > 0.
>
> mlx5 rx uses only order 0 pages, so i don't know where these high order
> tx SKBs are coming from..
Perhaps here:
__netdev_alloc_skb(), __napi_alloc_skb(), __netdev_alloc_frag() and
__napi_alloc_frag() will all call page_frag_alloc(), which will use
__page_frag_cache_refill() to get an order 3 page if possible, or fall
back to an order 0 page if order 3 page is not available.
I'm not sure if your workload will use the above code path though.
^ permalink raw reply
* Re: [PATCH] macvlan: use per-cpu queues for broadcast and multicast packets
From: Jiri Pirko @ 2018-11-02 14:32 UTC (permalink / raw)
To: Konstantin Khlebnikov
Cc: netdev, David S. Miller, linux-kernel, Vadim Fedorenko
In-Reply-To: <154116580015.953950.9450253307804393677.stgit@buzz>
Fri, Nov 02, 2018 at 02:36:40PM CET, khlebnikov@yandex-team.ru wrote:
>Reported-add-tested-by: Vadim Fedorenko <vfedorenko@yandex-team.ru>
This is supposed to be split into 2 tags.
^ permalink raw reply
* Re: Business Proposal
From: Edward Yuan @ 2018-10-29 14:51 UTC (permalink / raw)
To: netdev
Dear Friend,
My name is Mr. Edward Yuan, a consultant/broker. I know you might be a bit apprehensive because you do not know me. Nevertheless, I have a proposal on behalf of a client, a lucrative business that might be of mutual benefit to you.
If interested in this proposition please kindly and urgently contact me for more details.
Best Regards.
Mr. Edward Yuan.
---
This email has been checked for viruses by AVG.
https://www.avg.com
^ permalink raw reply
* [PATCH] net/mlx5e: fix high stack usage
From: Arnd Bergmann @ 2018-11-02 15:33 UTC (permalink / raw)
To: Saeed Mahameed, Leon Romanovsky, David S. Miller
Cc: Arnd Bergmann, Tariq Toukan, Eran Ben Elisha, Boris Pismenny,
Ilya Lesokhin, Moshe Shemesh, Kamal Heib, netdev, linux-rdma,
linux-kernel
A patch that looks harmless causes the stack usage of the mlx5e_grp_sw_update_stats()
function to drastically increase with x86 gcc-4.9 and higher (tested up to 8.1):
drivers/net/ethernet/mellanox/mlx5/core/en_stats.c: In function ‘mlx5e_grp_sw_update_stats’:
drivers/net/ethernet/mellanox/mlx5/core/en_stats.c:216:1: warning: the frame size of 1276 bytes is larger than 500 bytes [-Wframe-larger-than=]
By splitting out the loop body into a non-inlined function, the stack size goes
back down to under 500 bytes.
Fixes: 779d986d60de ("net/mlx5e: Do not ignore netdevice TX/RX queues number")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
.../ethernet/mellanox/mlx5/core/en_stats.c | 168 +++++++++---------
1 file changed, 86 insertions(+), 82 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
index 1e55b9c27ffc..c270206f3475 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.c
@@ -126,93 +126,97 @@ static int mlx5e_grp_sw_fill_stats(struct mlx5e_priv *priv, u64 *data, int idx)
return idx;
}
+static noinline_for_stack void
+mlx5e_grp_sw_collect_stat(struct mlx5e_priv *priv, struct mlx5e_sw_stats *s, int i)
+{
+ struct mlx5e_channel_stats *channel_stats = &priv->channel_stats[i];
+ struct mlx5e_xdpsq_stats *xdpsq_red_stats = &channel_stats->xdpsq;
+ struct mlx5e_xdpsq_stats *xdpsq_stats = &channel_stats->rq_xdpsq;
+ struct mlx5e_rq_stats *rq_stats = &channel_stats->rq;
+ struct mlx5e_ch_stats *ch_stats = &channel_stats->ch;
+ int j;
+
+ s->rx_packets += rq_stats->packets;
+ s->rx_bytes += rq_stats->bytes;
+ s->rx_lro_packets += rq_stats->lro_packets;
+ s->rx_lro_bytes += rq_stats->lro_bytes;
+ s->rx_ecn_mark += rq_stats->ecn_mark;
+ s->rx_removed_vlan_packets += rq_stats->removed_vlan_packets;
+ s->rx_csum_none += rq_stats->csum_none;
+ s->rx_csum_complete += rq_stats->csum_complete;
+ s->rx_csum_unnecessary += rq_stats->csum_unnecessary;
+ s->rx_csum_unnecessary_inner += rq_stats->csum_unnecessary_inner;
+ s->rx_xdp_drop += rq_stats->xdp_drop;
+ s->rx_xdp_redirect += rq_stats->xdp_redirect;
+ s->rx_xdp_tx_xmit += xdpsq_stats->xmit;
+ s->rx_xdp_tx_full += xdpsq_stats->full;
+ s->rx_xdp_tx_err += xdpsq_stats->err;
+ s->rx_xdp_tx_cqe += xdpsq_stats->cqes;
+ s->rx_wqe_err += rq_stats->wqe_err;
+ s->rx_mpwqe_filler_cqes += rq_stats->mpwqe_filler_cqes;
+ s->rx_mpwqe_filler_strides += rq_stats->mpwqe_filler_strides;
+ s->rx_buff_alloc_err += rq_stats->buff_alloc_err;
+ s->rx_cqe_compress_blks += rq_stats->cqe_compress_blks;
+ s->rx_cqe_compress_pkts += rq_stats->cqe_compress_pkts;
+ s->rx_page_reuse += rq_stats->page_reuse;
+ s->rx_cache_reuse += rq_stats->cache_reuse;
+ s->rx_cache_full += rq_stats->cache_full;
+ s->rx_cache_empty += rq_stats->cache_empty;
+ s->rx_cache_busy += rq_stats->cache_busy;
+ s->rx_cache_waive += rq_stats->cache_waive;
+ s->rx_congst_umr += rq_stats->congst_umr;
+ s->rx_arfs_err += rq_stats->arfs_err;
+ s->ch_events += ch_stats->events;
+ s->ch_poll += ch_stats->poll;
+ s->ch_arm += ch_stats->arm;
+ s->ch_aff_change += ch_stats->aff_change;
+ s->ch_eq_rearm += ch_stats->eq_rearm;
+ /* xdp redirect */
+ s->tx_xdp_xmit += xdpsq_red_stats->xmit;
+ s->tx_xdp_full += xdpsq_red_stats->full;
+ s->tx_xdp_err += xdpsq_red_stats->err;
+ s->tx_xdp_cqes += xdpsq_red_stats->cqes;
+
+ for (j = 0; j < priv->max_opened_tc; j++) {
+ struct mlx5e_sq_stats *sq_stats = &channel_stats->sq[j];
+
+ s->tx_packets += sq_stats->packets;
+ s->tx_bytes += sq_stats->bytes;
+ s->tx_tso_packets += sq_stats->tso_packets;
+ s->tx_tso_bytes += sq_stats->tso_bytes;
+ s->tx_tso_inner_packets += sq_stats->tso_inner_packets;
+ s->tx_tso_inner_bytes += sq_stats->tso_inner_bytes;
+ s->tx_added_vlan_packets += sq_stats->added_vlan_packets;
+ s->tx_nop += sq_stats->nop;
+ s->tx_queue_stopped += sq_stats->stopped;
+ s->tx_queue_wake += sq_stats->wake;
+ s->tx_udp_seg_rem += sq_stats->udp_seg_rem;
+ s->tx_queue_dropped += sq_stats->dropped;
+ s->tx_cqe_err += sq_stats->cqe_err;
+ s->tx_recover += sq_stats->recover;
+ s->tx_xmit_more += sq_stats->xmit_more;
+ s->tx_csum_partial_inner += sq_stats->csum_partial_inner;
+ s->tx_csum_none += sq_stats->csum_none;
+ s->tx_csum_partial += sq_stats->csum_partial;
+#ifdef CONFIG_MLX5_EN_TLS
+ s->tx_tls_ooo += sq_stats->tls_ooo;
+ s->tx_tls_resync_bytes += sq_stats->tls_resync_bytes;
+#endif
+ s->tx_cqes += sq_stats->cqes;
+ }
+}
+
void mlx5e_grp_sw_update_stats(struct mlx5e_priv *priv)
{
- struct mlx5e_sw_stats temp, *s = &temp;
+ struct mlx5e_sw_stats s;
int i;
- memset(s, 0, sizeof(*s));
-
- for (i = 0; i < mlx5e_get_netdev_max_channels(priv->netdev); i++) {
- struct mlx5e_channel_stats *channel_stats =
- &priv->channel_stats[i];
- struct mlx5e_xdpsq_stats *xdpsq_red_stats = &channel_stats->xdpsq;
- struct mlx5e_xdpsq_stats *xdpsq_stats = &channel_stats->rq_xdpsq;
- struct mlx5e_rq_stats *rq_stats = &channel_stats->rq;
- struct mlx5e_ch_stats *ch_stats = &channel_stats->ch;
- int j;
-
- s->rx_packets += rq_stats->packets;
- s->rx_bytes += rq_stats->bytes;
- s->rx_lro_packets += rq_stats->lro_packets;
- s->rx_lro_bytes += rq_stats->lro_bytes;
- s->rx_ecn_mark += rq_stats->ecn_mark;
- s->rx_removed_vlan_packets += rq_stats->removed_vlan_packets;
- s->rx_csum_none += rq_stats->csum_none;
- s->rx_csum_complete += rq_stats->csum_complete;
- s->rx_csum_unnecessary += rq_stats->csum_unnecessary;
- s->rx_csum_unnecessary_inner += rq_stats->csum_unnecessary_inner;
- s->rx_xdp_drop += rq_stats->xdp_drop;
- s->rx_xdp_redirect += rq_stats->xdp_redirect;
- s->rx_xdp_tx_xmit += xdpsq_stats->xmit;
- s->rx_xdp_tx_full += xdpsq_stats->full;
- s->rx_xdp_tx_err += xdpsq_stats->err;
- s->rx_xdp_tx_cqe += xdpsq_stats->cqes;
- s->rx_wqe_err += rq_stats->wqe_err;
- s->rx_mpwqe_filler_cqes += rq_stats->mpwqe_filler_cqes;
- s->rx_mpwqe_filler_strides += rq_stats->mpwqe_filler_strides;
- s->rx_buff_alloc_err += rq_stats->buff_alloc_err;
- s->rx_cqe_compress_blks += rq_stats->cqe_compress_blks;
- s->rx_cqe_compress_pkts += rq_stats->cqe_compress_pkts;
- s->rx_page_reuse += rq_stats->page_reuse;
- s->rx_cache_reuse += rq_stats->cache_reuse;
- s->rx_cache_full += rq_stats->cache_full;
- s->rx_cache_empty += rq_stats->cache_empty;
- s->rx_cache_busy += rq_stats->cache_busy;
- s->rx_cache_waive += rq_stats->cache_waive;
- s->rx_congst_umr += rq_stats->congst_umr;
- s->rx_arfs_err += rq_stats->arfs_err;
- s->ch_events += ch_stats->events;
- s->ch_poll += ch_stats->poll;
- s->ch_arm += ch_stats->arm;
- s->ch_aff_change += ch_stats->aff_change;
- s->ch_eq_rearm += ch_stats->eq_rearm;
- /* xdp redirect */
- s->tx_xdp_xmit += xdpsq_red_stats->xmit;
- s->tx_xdp_full += xdpsq_red_stats->full;
- s->tx_xdp_err += xdpsq_red_stats->err;
- s->tx_xdp_cqes += xdpsq_red_stats->cqes;
-
- for (j = 0; j < priv->max_opened_tc; j++) {
- struct mlx5e_sq_stats *sq_stats = &channel_stats->sq[j];
-
- s->tx_packets += sq_stats->packets;
- s->tx_bytes += sq_stats->bytes;
- s->tx_tso_packets += sq_stats->tso_packets;
- s->tx_tso_bytes += sq_stats->tso_bytes;
- s->tx_tso_inner_packets += sq_stats->tso_inner_packets;
- s->tx_tso_inner_bytes += sq_stats->tso_inner_bytes;
- s->tx_added_vlan_packets += sq_stats->added_vlan_packets;
- s->tx_nop += sq_stats->nop;
- s->tx_queue_stopped += sq_stats->stopped;
- s->tx_queue_wake += sq_stats->wake;
- s->tx_udp_seg_rem += sq_stats->udp_seg_rem;
- s->tx_queue_dropped += sq_stats->dropped;
- s->tx_cqe_err += sq_stats->cqe_err;
- s->tx_recover += sq_stats->recover;
- s->tx_xmit_more += sq_stats->xmit_more;
- s->tx_csum_partial_inner += sq_stats->csum_partial_inner;
- s->tx_csum_none += sq_stats->csum_none;
- s->tx_csum_partial += sq_stats->csum_partial;
-#ifdef CONFIG_MLX5_EN_TLS
- s->tx_tls_ooo += sq_stats->tls_ooo;
- s->tx_tls_resync_bytes += sq_stats->tls_resync_bytes;
-#endif
- s->tx_cqes += sq_stats->cqes;
- }
- }
+ memset(&s, 0, sizeof(s));
+
+ for (i = 0; i < mlx5e_get_netdev_max_channels(priv->netdev); i++)
+ mlx5e_grp_sw_collect_stat(priv, &s, i);
- memcpy(&priv->stats.sw, s, sizeof(*s));
+ memcpy(&priv->stats.sw, &s, sizeof(s));
}
static const struct counter_desc q_stats_desc[] = {
--
2.18.0
^ permalink raw reply related
* [PATCH] qed: fix link config error handling
From: Arnd Bergmann @ 2018-11-02 15:36 UTC (permalink / raw)
To: Ariel Elior, everest-linux-l2, David S. Miller
Cc: Arnd Bergmann, Sudarsana Reddy Kalluru, Tomer Tayar,
Denis Bolotin, Rahul Verma, netdev, linux-kernel
gcc-8 notices that qed_mcp_get_transceiver_data() may fail to
return a result to the caller:
drivers/net/ethernet/qlogic/qed/qed_mcp.c: In function 'qed_mcp_trans_speed_mask':
drivers/net/ethernet/qlogic/qed/qed_mcp.c:1955:2: error: 'transceiver_type' may be used uninitialized in this function [-Werror=maybe-uninitialized]
When an error is returned by qed_mcp_get_transceiver_data(), we
should propagate that to the caller of qed_mcp_trans_speed_mask()
rather than continuing with uninitialized data.
Fixes: c56a8be7e7aa ("qed: Add supported link and advertise link to display in ethtool.")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
drivers/net/ethernet/qlogic/qed/qed_mcp.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/qlogic/qed/qed_mcp.c b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
index f40f654398a0..a96364df4320 100644
--- a/drivers/net/ethernet/qlogic/qed/qed_mcp.c
+++ b/drivers/net/ethernet/qlogic/qed/qed_mcp.c
@@ -1944,9 +1944,12 @@ int qed_mcp_trans_speed_mask(struct qed_hwfn *p_hwfn,
struct qed_ptt *p_ptt, u32 *p_speed_mask)
{
u32 transceiver_type, transceiver_state;
+ int ret;
- qed_mcp_get_transceiver_data(p_hwfn, p_ptt, &transceiver_state,
- &transceiver_type);
+ ret = qed_mcp_get_transceiver_data(p_hwfn, p_ptt, &transceiver_state,
+ &transceiver_type);
+ if (ret)
+ return ret;
if (qed_is_transceiver_ready(transceiver_state, transceiver_type) ==
false)
--
2.18.0
^ permalink raw reply related
* [PATCH] openvswitch: fix linking without CONFIG_NF_CONNTRACK_LABELS
From: Arnd Bergmann @ 2018-11-02 15:36 UTC (permalink / raw)
To: Pravin B Shelar, David S. Miller
Cc: Arnd Bergmann, Pablo Neira Ayuso, Florian Westphal,
Flavio Leitner, Gao Feng, Thierry Du Tre, Yi-Hung Wei, Ed Swierk,
Julia Lawall, netdev, dev, linux-kernel
When CONFIG_CC_OPTIMIZE_FOR_DEBUGGING is enabled, the compiler
fails to optimize out a dead code path, which leads to a link failure:
net/openvswitch/conntrack.o: In function `ovs_ct_set_labels':
conntrack.c:(.text+0x2e60): undefined reference to `nf_connlabels_replace'
In this configuration, we can take a shortcut, and completely
remove the contrack label code. This may also help the regular
optimization.
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
net/openvswitch/conntrack.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 6bec37ab4472..a4660c48ff01 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -1203,7 +1203,8 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
&info->labels.mask);
if (err)
return err;
- } else if (labels_nonzero(&info->labels.mask)) {
+ } else if (IS_ENABLED(CONFIG_NF_CONNTRACK_LABELS) &&
+ labels_nonzero(&info->labels.mask)) {
err = ovs_ct_set_labels(ct, key, &info->labels.value,
&info->labels.mask);
if (err)
--
2.18.0
^ permalink raw reply related
* WARNING in kmem_cache_create_usercopy
From: syzbot @ 2018-11-02 15:37 UTC (permalink / raw)
To: asmadeus, davem, ericvh, linux-kernel, lucho, netdev,
syzkaller-bugs, v9fs-developer
Hello,
syzbot found the following crash on:
HEAD commit: 4794a36bf08d Add linux-next specific files for 20180928
git tree: linux-next
console output: https://syzkaller.appspot.com/x/log.txt?x=124f814e400000
kernel config: https://syzkaller.appspot.com/x/.config?x=b0ba9bb377f8ae91
dashboard link: https://syzkaller.appspot.com/bug?extid=0c1d61e4db7db94102ca
compiler: gcc (GCC) 8.0.1 20180413 (experimental)
syz repro: https://syzkaller.appspot.com/x/repro.syz?x=1511532a400000
C reproducer: https://syzkaller.appspot.com/x/repro.c?x=1701f831400000
IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+0c1d61e4db7db94102ca@syzkaller.appspotmail.com
WARNING: CPU: 0 PID: 6065 at mm/slab_common.c:473
kmem_cache_create_usercopy+0xad/0x240 mm/slab_common.c:473
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 6065 Comm: syz-executor140 Not tainted
4.19.0-rc5-next-20180928+ #84
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1d3/0x2c4 lib/dump_stack.c:113
panic+0x238/0x4e7 kernel/panic.c:184
__warn.cold.8+0x163/0x1ba kernel/panic.c:536
report_bug+0x254/0x2d0 lib/bug.c:186
fixup_bug arch/x86/kernel/traps.c:178 [inline]
do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:271
do_invalid_op+0x36/0x40 arch/x86/kernel/traps.c:290
invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:969
RIP: 0010:kmem_cache_create_usercopy+0xad/0x240 mm/slab_common.c:473
Code: 44 89 f0 25 00 60 de 04 45 85 ed 89 45 cc 75 0b 8b 45 d0 85 c0 0f 85
8e 01 00 00 44 39 eb 72 0a 89 d8 44 29 e8 3b 45 d0 73 7e <0f> 0b c7 45 d0
00 00 00 00 4c 8b 45 10 44 89 fa 89 de 4c 89 e7 8b
RSP: 0018:ffff8801bc23f5d0 EFLAGS: 00010213
RAX: 0000000000000000 RBX: 0000000000000008 RCX: 0000000000000006
RDX: 0000000000000000 RSI: 0000000000000020 RDI: ffffffff88b04b20
RBP: ffff8801bc23f608 R08: fffffbfff1283a2d R09: fffffbfff1283a2c
R10: ffff8801bc23f5c0 R11: ffffffff8941d167 R12: ffffffff88b04b20
R13: 00000000fffffffd R14: 0000000000000000 R15: 0000000000000000
p9_client_create+0xa58/0x1769 net/9p/client.c:1054
v9fs_session_init+0x217/0x1bb0 fs/9p/v9fs.c:421
v9fs_mount+0x7c/0x8f0 fs/9p/vfs_super.c:135
legacy_get_tree+0x131/0x460 fs/fs_context.c:718
vfs_get_tree+0x1cb/0x5c0 fs/super.c:1795
do_new_mount fs/namespace.c:2648 [inline]
do_mount+0x70c/0x1d90 fs/namespace.c:2974
ksys_mount+0x12d/0x140 fs/namespace.c:3190
__do_sys_mount fs/namespace.c:3204 [inline]
__se_sys_mount fs/namespace.c:3201 [inline]
__x64_sys_mount+0xbe/0x150 fs/namespace.c:3201
do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x440189
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffdd30ec3c8 EFLAGS: 00000202 ORIG_RAX: 00000000000000a5
RAX: ffffffffffffffda RBX: 0030656c69662f2e RCX: 0000000000440189
RDX: 00000000200008c0 RSI: 0000000020000000 RDI: 0000000000000000
RBP: 00000000006ca018 R08: 0000000020000a80 R09: 00000000004002c8
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000401a10
R13: 0000000000401aa0 R14: 0000000000000000 R15: 0000000000000000
Kernel Offset: disabled
Rebooting in 86400 seconds..
---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.
syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches
^ permalink raw reply
* general protection fault in xfrmi_lookup
From: syzbot @ 2018-11-02 15:40 UTC (permalink / raw)
To: davem, herbert, linux-kernel, netdev, steffen.klassert,
syzkaller-bugs
Hello,
syzbot found the following crash on:
HEAD commit: e2a322a0c8ce Merge branch 'net-smc-userspace-breakage-fixes'
git tree: net
console output: https://syzkaller.appspot.com/x/log.txt?x=12087ce9400000
kernel config: https://syzkaller.appspot.com/x/.config?x=c0af03fe452b65fb
dashboard link: https://syzkaller.appspot.com/bug?extid=ef5639e6d1435df3a1f1
compiler: gcc (GCC) 8.0.1 20180413 (experimental)
Unfortunately, I don't have any reproducer for this crash yet.
IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+ef5639e6d1435df3a1f1@syzkaller.appspotmail.com
RBP: 0000000020000000 R08: 00000000000000f0 R09: 0000000000000000
R10: 0000000000000064 R11: 0000000000000293 R12: 00007f25cc19f6d4
R13: 00000000004c48a4 R14: 00000000004d7b90 R15: 0000000000000004
kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 32654 Comm: syz-executor4 Not tainted 4.19.0-rc6+ #134
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
RIP: 0010:__read_once_size include/linux/compiler.h:188 [inline]
RIP: 0010:net_generic include/net/netns/generic.h:45 [inline]
RIP: 0010:xfrmi_lookup.isra.18+0x11f/0x6c0 net/xfrm/xfrm_interface.c:61
Code: 57 af d3 fa 45 84 ed 0f 84 4c 04 00 00 e8 79 ae d3 fa 48 8d bb 08 1b
00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f
85 7b 05 00 00 4c 8d 6d 98 48 8b 9b 08 1b 00 00 48
RSP: 0018:ffff880181996878 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000000000800 RCX: ffffc90009ecc000
RDX: 0000000000000461 RSI: ffffffff86ab2717 RDI: 0000000000002308
RBP: ffff880181996998 R08: ffff88019690c700 R09: 0000000000000000
R10: fffffbfff128759a R11: 0000000000000000 R12: 0000000000000036
R13: 0000000000000000 R14: ffff8801d596dd68 R15: ffffea00000000d0
FS: 00007f25cc19f700(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000001cc821000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
xfrmi_rcv_cb+0xef/0x9d0 net/xfrm/xfrm_interface.c:256
xfrm6_rcv_cb+0x220/0x400 net/ipv6/xfrm6_protocol.c:59
xfrm_rcv_cb net/xfrm/xfrm_input.c:108 [inline]
xfrm_input+0x8aa/0x3190 net/xfrm/xfrm_input.c:496
xfrm6_rcv_spi net/ipv6/xfrm6_input.c:31 [inline]
xfrm6_rcv_tnl+0x168/0x1d0 net/ipv6/xfrm6_input.c:74
xfrm6_rcv+0x17/0x20 net/ipv6/xfrm6_input.c:81
xfrm6_esp_rcv+0x1a5/0x3a0 net/ipv6/xfrm6_protocol.c:74
ip6_input_finish+0x3fc/0x1aa0 net/ipv6/ip6_input.c:383
NF_HOOK include/linux/netfilter.h:289 [inline]
ip6_input+0xe9/0x600 net/ipv6/ip6_input.c:426
ip6_mc_input+0x48a/0xd20 net/ipv6/ip6_input.c:503
dst_input include/net/dst.h:450 [inline]
ip6_rcv_finish+0x17a/0x330 net/ipv6/ip6_input.c:76
NF_HOOK include/linux/netfilter.h:289 [inline]
ipv6_rcv+0x120/0x640 net/ipv6/ip6_input.c:271
__netif_receive_skb_one_core+0x14d/0x200 net/core/dev.c:4891
__netif_receive_skb+0x2c/0x1e0 net/core/dev.c:5001
netif_receive_skb_internal+0x12c/0x620 net/core/dev.c:5104
napi_frags_finish net/core/dev.c:5642 [inline]
napi_gro_frags+0x75a/0xc90 net/core/dev.c:5715
tun_get_user+0x3189/0x4250 drivers/net/tun.c:1923
tun_chr_write_iter+0xb9/0x154 drivers/net/tun.c:1968
call_write_iter include/linux/fs.h:1808 [inline]
do_iter_readv_writev+0x8b0/0xa80 fs/read_write.c:680
do_iter_write+0x185/0x5f0 fs/read_write.c:959
vfs_writev+0x1f1/0x360 fs/read_write.c:1004
do_writev+0x11a/0x310 fs/read_write.c:1039
__do_sys_writev fs/read_write.c:1112 [inline]
__se_sys_writev fs/read_write.c:1109 [inline]
__x64_sys_writev+0x75/0xb0 fs/read_write.c:1109
do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457431
Code: 75 14 b8 14 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 54 b5 fb ff c3 48
83 ec 08 e8 1a 2d 00 00 48 89 04 24 b8 14 00 00 00 0f 05 <48> 8b 3c 24 48
89 c2 e8 63 2d 00 00 48 89 d0 48 83 c4 08 48 3d 01
RSP: 002b:00007f25cc19eba0 EFLAGS: 00000293 ORIG_RAX: 0000000000000014
RAX: ffffffffffffffda RBX: 000000000000004a RCX: 0000000000457431
RDX: 0000000000000001 RSI: 00007f25cc19ebf0 RDI: 00000000000000f0
RBP: 0000000020000000 R08: 00000000000000f0 R09: 0000000000000000
R10: 0000000000000064 R11: 0000000000000293 R12: 00007f25cc19f6d4
R13: 00000000004c48a4 R14: 00000000004d7b90 R15: 0000000000000004
Modules linked in:
---[ end trace 803f6dba779493b5 ]---
RIP: 0010:__read_once_size include/linux/compiler.h:188 [inline]
RIP: 0010:net_generic include/net/netns/generic.h:45 [inline]
RIP: 0010:xfrmi_lookup.isra.18+0x11f/0x6c0 net/xfrm/xfrm_interface.c:61
Code: 57 af d3 fa 45 84 ed 0f 84 4c 04 00 00 e8 79 ae d3 fa 48 8d bb 08 1b
00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f
85 7b 05 00 00 4c 8d 6d 98 48 8b 9b 08 1b 00 00 48
RSP: 0018:ffff880181996878 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000000000800 RCX: ffffc90009ecc000
RDX: 0000000000000461 RSI: ffffffff86ab2717 RDI: 0000000000002308
RBP: ffff880181996998 R08: ffff88019690c700 R09: 0000000000000000
R10: fffffbfff128759a R11: 0000000000000000 R12: 0000000000000036
R13: 0000000000000000 R14: ffff8801d596dd68 R15: ffffea00000000d0
FS: 00007f25cc19f700(0000) GS:ffff8801daf00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000001cc821000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.
syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
syzbot.
^ permalink raw reply
* Re: [PATCH v3 1/2] kretprobe: produce sane stack traces
From: Josh Poimboeuf @ 2018-11-02 15:43 UTC (permalink / raw)
To: Steven Rostedt
Cc: Aleksa Sarai, Naveen N. Rao, Anil S Keshavamurthy,
David S. Miller, Masami Hiramatsu, Jonathan Corbet,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Shuah Khan,
Alexei Starovoitov, Daniel Borkmann, Brendan Gregg,
Christian Brauner, Aleksa Sarai, netdev, linux-doc
In-Reply-To: <20181102091658.1bc979a4@gandalf.local.home>
On Fri, Nov 02, 2018 at 09:16:58AM -0400, Steven Rostedt wrote:
> On Fri, 2 Nov 2018 17:59:32 +1100
> Aleksa Sarai <cyphar@cyphar.com> wrote:
>
> > As an aside, I just tested with the frame unwinder and it isn't thrown
> > off-course by kretprobe_trampoline (though obviously the stack is still
> > wrong). So I think we just need to hook into the ORC unwinder to get it
> > to continue skipping up the stack, as well as add the rewriting code for
> > the stack traces (for all unwinders I guess -- though ideally we should
>
> I agree that this is the right solution.
Sounds good to me.
However, it would be *really* nice if function graph and kretprobes
shared the same infrastructure, like they do for function entry.
There's a lot of duplicated effort there.
> > do this without having to add the same code to every architecture).
>
> True, and there's an art to consolidating the code between
> architectures.
>
> I'm currently looking at function graph and seeing if I can consolidate
> it too. And I'm also trying to get multiple uses to hook into its
> infrastructure. I think I finally figured out a way to do so.
>
> The reason it is difficult, is that you need to maintain state between
> the entry of a function and the exit for each task and callback that is
> registered. Hence, it's a 3x tuple (function stack, task, callbacks).
> And this must be maintained with preemption. A task may sleep for
> minutes, and the state needs to be retained.
>
> The only state that must be retained is the function stack with the
> task, because if that gets out of sync, the system crashes. But the
> callback state can be removed.
>
> Here's what is there now:
>
> When something is registered with the function graph tracer, every
> task gets a shadowed stack. A hook is added to fork to add shadow
> stacks to new tasks. Once a shadow stack is added to a task, that
> shadow stack is never removed until the task exits.
>
> When the function is entered, the real return code is stored in the
> shadow stack and the trampoline address is put in its place.
>
> On return, the trampoline is called, and it will pop off the return
> code from the shadow stack and return to that.
>
> The issue with multiple users, is that different users may want to
> trace different functions. On entry, the user could say it doesn't want
> to trace the current function, and the return part must not be called
> on exit. Keeping track of which user needs the return called is the
> tricky part.
>
> Here's what I plan on implementing:
>
> Along with a shadow stack, I was going to add a 4096 byte (one page)
> array that holds 64 8 byte masks to every task as well. This will allow
> 64 simultaneous users (which is rather extreme). If we need to support
> more, we could allocate another page for all tasks. The 8 byte mask
> will represent each depth (allowing to do this for 64 function call
> stack depth, which should also be enough).
>
> Each user will be assigned one of the masks. Each bit in the mask
> represents the depth of the shadow stack. When a function is called,
> each user registered with the function graph tracer will get called
> (if they asked to be called for this function, via the ftrace_ops
> hashes) and if they want to trace the function, then the bit is set in
> the mask for that stack depth.
>
> When the function exits the function and we pop off the return code
> from the shadow stack, we then look at all the bits set for the
> corresponding users, and call their return callbacks, and ignore
> anything that is not set.
>
>
> When a user is unregistered, it the corresponding bits that represent
> it are cleared, and it the return callback will not be called. But the
> tasks being traced will still have their shadow stack to allow it to
> get back to normal.
>
> I'll hopefully have a prototype ready by plumbers.
Why do we need multiple users? It would be a lot simpler if we could
just enforce a single user per fgraphed/kretprobed function (and return
-EBUSY if it's already being traced/probed).
> And this too will require each architecture to probably change. As a
> side project to this, I'm going to try to consolidate the function
> graph code among all the architectures as well. Not an easy task.
Do you mean implementing HAVE_FUNCTION_GRAPH_RET_ADDR_PTR for all the
arches? If so, I think have an old crusty patch which attempted to
that. I could try to dig it up if you're interested.
--
Josh
^ permalink raw reply
* [PATCH 0/1] vhost: parallel virtqueue handling
From: Vitaly Mayatskikh @ 2018-11-02 16:07 UTC (permalink / raw)
To: Michael S . Tsirkin
Cc: Jason Wang, kvm, virtualization, netdev, linux-kernel,
Vitaly Mayatskikh
Hi,
I stumbled across poor performance of virtio-blk while working on a
high-performance network storage protocol. Moving virtio-blk's host
side to kernel did increase single queue IOPS, but multiqueue disk
still was not scaling well. It turned out that vhost handles events
from all virtio queues in one helper thread, and that's pretty much a
big serialization point.
The following patch enables events handling in per-queue thread and
increases IO concurrency, see IOPS numbers:
# num-queues
# bare metal
# virtio-blk
# vhost-blk
1 171k 148k 195k
2 328k 249k 349k
3 479k 179k 501k
4 622k 143k 620k
5 755k 136k 737k
6 887k 131k 830k
7 1004k 126k 926k
8 1099k 117k 1001k
9 1194k 115k 1055k
10 1278k 109k 1130k
11 1345k 110k 1119k
12 1411k 104k 1201k
13 1466k 106k 1260k
14 1517k 103k 1296k
15 1552k 102k 1322k
16 1480k 101k 1346k
Vitaly Mayatskikh (1):
vhost: add per-vq worker thread
drivers/vhost/vhost.c | 123 +++++++++++++++++++++++++++++++-----------
drivers/vhost/vhost.h | 11 +++-
2 files changed, 100 insertions(+), 34 deletions(-)
--
2.17.1
^ permalink raw reply
* Re: [PATCH] net/mlx5e: fix high stack usage
From: Jason Gunthorpe @ 2018-11-02 16:08 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Saeed Mahameed, Leon Romanovsky, David S. Miller, Tariq Toukan,
Eran Ben Elisha, Boris Pismenny, Ilya Lesokhin, Moshe Shemesh,
Kamal Heib, netdev, linux-rdma, linux-kernel
In-Reply-To: <20181102153316.1492515-1-arnd@arndb.de>
On Fri, Nov 02, 2018 at 04:33:03PM +0100, Arnd Bergmann wrote:
> A patch that looks harmless causes the stack usage of the mlx5e_grp_sw_update_stats()
> function to drastically increase with x86 gcc-4.9 and higher (tested up to 8.1):
>
> drivers/net/ethernet/mellanox/mlx5/core/en_stats.c: In function ‘mlx5e_grp_sw_update_stats’:
> drivers/net/ethernet/mellanox/mlx5/core/en_stats.c:216:1: warning: the frame size of 1276 bytes is larger than 500 bytes [-Wframe-larger-than=]
Why is the stack size so big here? The mlx5e_sw_stats is < 500 bytes
and all the other on-stack stuff looks pretty small?
> By splitting out the loop body into a non-inlined function, the stack size goes
> back down to under 500 bytes.
Does this actually reduce the stack consumed or does this just suppress
the warning?
Jason
^ permalink raw reply
* Re: [PATCH v3 1/2] kretprobe: produce sane stack traces
From: Steven Rostedt @ 2018-11-02 16:13 UTC (permalink / raw)
To: Josh Poimboeuf
Cc: Aleksa Sarai, Naveen N. Rao, Anil S Keshavamurthy,
David S. Miller, Masami Hiramatsu, Jonathan Corbet,
Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
Alexander Shishkin, Jiri Olsa, Namhyung Kim, Shuah Khan,
Alexei Starovoitov, Daniel Borkmann, Brendan Gregg,
Christian Brauner, Aleksa Sarai, netdev, linux-doc
In-Reply-To: <20181102154325.bt6xoysl4xdl33wd@treble>
On Fri, 2 Nov 2018 10:43:26 -0500
Josh Poimboeuf <jpoimboe@redhat.com> wrote:
> > I'll hopefully have a prototype ready by plumbers.
>
> Why do we need multiple users? It would be a lot simpler if we could
> just enforce a single user per fgraphed/kretprobed function (and return
> -EBUSY if it's already being traced/probed).
Because that means if function graph tracer is active, then you can't
do a kretprobe, and vice versa. I'd really like to have it working for
multiple users, then we could trace different graph functions and store
them in different buffers. It would also allow for perf to use function
graph tracer too.
>
> > And this too will require each architecture to probably change. As a
> > side project to this, I'm going to try to consolidate the function
> > graph code among all the architectures as well. Not an easy task.
>
> Do you mean implementing HAVE_FUNCTION_GRAPH_RET_ADDR_PTR for all the
> arches? If so, I think have an old crusty patch which attempted to
> that. I could try to dig it up if you're interested.
>
I'd like to have that, but it still requires some work. But I'd just
the truly architecture dependent code be in the architecture (basically
the asm code), and have the ability to move most of the duplicate code
out of the archs.
-- Steve
^ permalink raw reply
* Re: [PULL] vhost: cleanups and fixes
From: Linus Torvalds @ 2018-11-02 16:14 UTC (permalink / raw)
To: mst
Cc: mark.rutland, Kees Cook, kvm, virtualization, netdev,
Linux Kernel Mailing List, Andrew Morton, bijan.mottahedeh,
gedwards, joe, lenaic, liang.z.li, mhocko, mhocko, stefanha,
wei.w.wang
In-Reply-To: <20181102083018-mutt-send-email-mst@kernel.org>
On Fri, Nov 2, 2018 at 6:04 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> I've tried making access_ok mask the parameter it gets.
PLEASE don't do this.
Just use "copy_to/from_user()".
We have had lots of bugs because code bitrots.
And no, the access_ok() checks aren't expensive, not even in a loop.
They *used* to be somewhat expensive compared to the access, but that
simply isn't true any more. The real expense in copy_to_user and
friends are in the user access bit setting (STAC and CLAC on x86),
which easily an order of magnitude more expensive than access_ok().
So just get rid of the double-underscore version. It's basically
always a mis-optimization due to entirely historical reasons. I can
pretty much guarantee that it's not visible in profiles.
Linus
^ permalink raw reply
* [PATCH] ath10k: avoid -Wmaybe-uninitialized warning
From: Arnd Bergmann @ 2018-11-02 16:17 UTC (permalink / raw)
To: Kalle Valo, David S. Miller
Cc: Arnd Bergmann, Rakesh Pillai, Anilkumar Kolli, Balaji Pothunoori,
Yingying Tang, Sriram R, ath10k, linux-wireless, netdev,
linux-kernel
In some configurations the inlining in gcc is suboptimal, causing
a false-positive warning:
drivers/net/wireless/ath/ath10k/mac.c: In function 'ath10k_mac_init_rd':
drivers/net/wireless/ath/ath10k/mac.c:8374:39: error: 'rd' may be used uninitialized in this function [-Werror=maybe-uninitialized]
ar->ath_common.regulatory.current_rd = rd;
If we initialize the output of ath10k_mac_get_wrdd_regulatory()
before returning, this problem goes away.
Fixes: 209b2a68de76 ("ath10k: add platform regulatory domain support")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
---
drivers/net/wireless/ath/ath10k/mac.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
index a1c2801ded10..0d5fde28ee44 100644
--- a/drivers/net/wireless/ath/ath10k/mac.c
+++ b/drivers/net/wireless/ath/ath10k/mac.c
@@ -8321,6 +8321,8 @@ static int ath10k_mac_get_wrdd_regulatory(struct ath10k *ar, u16 *rd)
u32 alpha2_code;
char alpha2[3];
+ *rd = ar->hw_eeprom_rd;
+
root_handle = ACPI_HANDLE(&pdev->dev);
if (!root_handle)
return -EOPNOTSUPP;
@@ -8365,11 +8367,9 @@ static int ath10k_mac_init_rd(struct ath10k *ar)
u16 rd;
ret = ath10k_mac_get_wrdd_regulatory(ar, &rd);
- if (ret) {
+ if (ret)
ath10k_dbg(ar, ATH10K_DBG_BOOT,
"fallback to eeprom programmed regulatory settings\n");
- rd = ar->hw_eeprom_rd;
- }
ar->ath_common.regulatory.current_rd = rd;
return 0;
--
2.18.0
^ permalink raw reply related
* Re: [RFC PATCH] lib: Introduce generic __cmpxchg_u64() and use it where needed
From: Andrey Ryabinin @ 2018-11-02 16:19 UTC (permalink / raw)
To: Peter Zijlstra, Trond Myklebust
Cc: mark.rutland@arm.com, linux-kernel@vger.kernel.org,
ralf@linux-mips.org, jlayton@kernel.org,
linuxppc-dev@lists.ozlabs.org, bfields@fieldses.org,
linux-mips@linux-mips.org, linux@roeck-us.net,
linux-nfs@vger.kernel.org, akpm@linux-foundation.org,
will.deacon@arm.com, boqun.feng@gmail.com, paul.burton@mips.com,
anna.schumaker@netapp.com, jhogan@kernel.org, "netdev@vger.ke
In-Reply-To: <20181101163212.GF3159@hirez.programming.kicks-ass.net>
On 11/01/2018 07:32 PM, Peter Zijlstra wrote:
> On Thu, Nov 01, 2018 at 03:22:15PM +0000, Trond Myklebust wrote:
>> On Thu, 2018-11-01 at 15:59 +0100, Peter Zijlstra wrote:
>>> On Thu, Nov 01, 2018 at 01:18:46PM +0000, Mark Rutland wrote:
>
>>>>> My one question (and the reason why I went with cmpxchg() in the
>>>>> first place) would be about the overflow behaviour for
>>>>> atomic_fetch_inc() and friends. I believe those functions should
>>>>> be OK on x86, so that when we overflow the counter, it behaves
>>>>> like an unsigned value and wraps back around. Is that the case
>>>>> for all architectures?
>>>>>
>>>>> i.e. are atomic_t/atomic64_t always guaranteed to behave like
>>>>> u32/u64 on increment?
>>>>>
>>>>> I could not find any documentation that explicitly stated that
>>>>> they should.
>>>>
>>>> Peter, Will, I understand that the atomic_t/atomic64_t ops are
>>>> required to wrap per 2's-complement. IIUC the refcount code relies
>>>> on this.
>>>>
>>>> Can you confirm?
>>>
>>> There is quite a bit of core code that hard assumes 2s-complement.
>>> Not only for atomics but for any signed integer type. Also see the
>>> kernel using -fno-strict-overflow which implies -fwrapv, which
>>> defines signed overflow to behave like 2s-complement (and rids us of
>>> that particular UB).
>>
>> Fair enough, but there have also been bugfixes to explicitly fix unsafe
>> C standards assumptions for signed integers. See, for instance commit
>> 5a581b367b5d "jiffies: Avoid undefined behavior from signed overflow"
>> from Paul McKenney.
>
> Yes, I feel Paul has been to too many C/C++ committee meetings and got
> properly paranoid. Which isn't always a bad thing :-)
>
> But for us using -fno-strict-overflow which actually defines signed
> overflow, I myself am really not worried. I'm also not sure if KASAN has
> been taught about this, or if it will still (incorrectly) warn about UB
> for signed types.
>
UBSAN warns about signed overflows despite -fno-strict-overflow if gcc version is < 8.
I have learned recently that UBSAN in GCC 8 ignores signed overflows if -fno-strict-overflow of fwrapv is used.
$ cat signed_overflow.c
#include <stdio.h>
__attribute__((noinline))
int foo(int a, int b)
{
return a+b;
}
int main(void)
{
int a = 0x7fffffff;
int b = 2;
printf("%d\n", foo(a,b));
return 0;
}
$ gcc-8.2.0 -fsanitize=signed-integer-overflow signed_overflow.c && ./a.out
signed_overflow.c:6:10: runtime error: signed integer overflow: 2147483647 + 2 cannot be represented in type 'int'
-2147483647
$ gcc-8.2.0 -fno-strict-overflow -fsanitize=signed-integer-overflow signed_overflow.c && ./a.out
-2147483647
$ gcc-7.3.0 -fsanitize=signed-integer-overflow signed_overflow.c && ./a.out
signed_overflow.c:6:10: runtime error: signed integer overflow: 2147483647 + 2 cannot be represented in type 'int'
-2147483647
$ gcc-7.3.0 -fno-strict-overflow -fsanitize=signed-integer-overflow signed_overflow.c && ./a.out
signed_overflow.c:6:10: runtime error: signed integer overflow: 2147483647 + 2 cannot be represented in type 'int'
-2147483647
clang behaves the same way as GCC 8:
$ clang -fsanitize=signed-integer-overflow signed_overflow.c && ./a.out
signed_overflow.c:6:10: runtime error: signed integer overflow: 2147483647 + 2 cannot be represented in type 'int'
-2147483647
$ clang -fno-strict-overflow -fsanitize=signed-integer-overflow signed_overflow.c && ./a.out
-2147483647
We can always just drop -fsanitize=signed-integer-overflow if it considered too noisy.
Although it did catch some real bugs.
^ permalink raw reply
* Re: [PATCH] net/mlx5e: fix high stack usage
From: Arnd Bergmann @ 2018-11-02 16:23 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Saeed Mahameed, Leon Romanovsky, David S. Miller, Tariq Toukan,
Eran Ben Elisha, Boris Pismenny, Ilya Lesokhin, Moshe Shemesh,
Kamal Heib, netdev, linux-rdma, linux-kernel
In-Reply-To: <20181102160858.GA17096@ziepe.ca>
On 11/2/18, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> On Fri, Nov 02, 2018 at 04:33:03PM +0100, Arnd Bergmann wrote:
>> A patch that looks harmless causes the stack usage of the
>> mlx5e_grp_sw_update_stats()
>> function to drastically increase with x86 gcc-4.9 and higher (tested up to
>> 8.1):
>>
>> drivers/net/ethernet/mellanox/mlx5/core/en_stats.c: In function
>> ‘mlx5e_grp_sw_update_stats’:
>> drivers/net/ethernet/mellanox/mlx5/core/en_stats.c:216:1: warning: the
>> frame size of 1276 bytes is larger than 500 bytes [-Wframe-larger-than=]
>
> Why is the stack size so big here? The mlx5e_sw_stats is < 500 bytes
> and all the other on-stack stuff looks pretty small?
I am not entirely sure, but my analysis indicates that gcc tries loop unrolling
or some other optimization that leads to two copies on the stack.
>> By splitting out the loop body into a non-inlined function, the stack size
>> goes
>> back down to under 500 bytes.
>
> Does this actually reduce the stack consumed or does this just suppress
> the warning?
It definitely reduces the total stack usage, the separate functions just
had the expected stack usage that was a few hundred bytes combined.
Arnd
^ permalink raw reply
* Re: [PATCH] ath10k: avoid -Wmaybe-uninitialized warning
From: Brian Norris @ 2018-11-02 16:34 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Kalle Valo, davem, pillair, akolli, bpothuno, yintang, srirrama,
ath10k, linux-wireless, netdev, Linux Kernel
In-Reply-To: <20181102161833.2956376-1-arnd@arndb.de>
Hi,
On Fri, Nov 2, 2018 at 9:19 AM Arnd Bergmann <arnd@arndb.de> wrote:
>
> In some configurations the inlining in gcc is suboptimal, causing
> a false-positive warning:
>
> drivers/net/wireless/ath/ath10k/mac.c: In function 'ath10k_mac_init_rd':
> drivers/net/wireless/ath/ath10k/mac.c:8374:39: error: 'rd' may be used uninitialized in this function [-Werror=maybe-uninitialized]
> ar->ath_common.regulatory.current_rd = rd;
>
> If we initialize the output of ath10k_mac_get_wrdd_regulatory()
> before returning, this problem goes away.
>
> Fixes: 209b2a68de76 ("ath10k: add platform regulatory domain support")
> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> ---
> drivers/net/wireless/ath/ath10k/mac.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/wireless/ath/ath10k/mac.c b/drivers/net/wireless/ath/ath10k/mac.c
> index a1c2801ded10..0d5fde28ee44 100644
> --- a/drivers/net/wireless/ath/ath10k/mac.c
> +++ b/drivers/net/wireless/ath/ath10k/mac.c
> @@ -8321,6 +8321,8 @@ static int ath10k_mac_get_wrdd_regulatory(struct ath10k *ar, u16 *rd)
> u32 alpha2_code;
> char alpha2[3];
>
> + *rd = ar->hw_eeprom_rd;
> +
Maybe it's just me, but it seems kinda weird for this function to
assign a (valid) value to its "output" and still potentially return an
error.
If you really need to work around this compiler bug, maybe just put
the eeprom assignment back in ath10k_mac_init_rd()? I'll leave it up
to Kalle as to whether he wants to work around the compiler at all :)
Oh wait, one more thing: this is actually an invalid refactoring. See
how this function assigns '*rd' later in error cases. Today, we still
treat those as errors and clobber those with the eeprom value, but
now, you're making the fallback case continue to use the erroneous
value (0xffff). You need to make that use a local variable and avoid
clobbering *rd, if you want this to be correct.
> root_handle = ACPI_HANDLE(&pdev->dev);
Side note: your patch made me notice -- this code is *not* only used
on PCI devices, yet it utilizes the 'pci_dev' structure. Fortunately,
it only does this to needless convert it right back to a bare
'device', and then the ACPI code safely handles non-ACPI devices
(should result in -EOPNOTSUPP below), but this is awfully strange
code.
Anyway, I'm just thinking out loud. I'll probably patch the pci_dev out myself.
Brian
> if (!root_handle)
> return -EOPNOTSUPP;
> @@ -8365,11 +8367,9 @@ static int ath10k_mac_init_rd(struct ath10k *ar)
> u16 rd;
>
> ret = ath10k_mac_get_wrdd_regulatory(ar, &rd);
> - if (ret) {
> + if (ret)
> ath10k_dbg(ar, ATH10K_DBG_BOOT,
> "fallback to eeprom programmed regulatory settings\n");
> - rd = ar->hw_eeprom_rd;
> - }
>
> ar->ath_common.regulatory.current_rd = rd;
> return 0;
> --
> 2.18.0
>
^ permalink raw reply
* Re: [PULL] vhost: cleanups and fixes
From: Michael S. Tsirkin @ 2018-11-02 16:59 UTC (permalink / raw)
To: Linus Torvalds
Cc: mark.rutland, Kees Cook, kvm, virtualization, netdev,
Linux Kernel Mailing List, Andrew Morton, bijan.mottahedeh,
gedwards, joe, lenaic, liang.z.li, mhocko, mhocko, stefanha,
wei.w.wang, Jason Wang
In-Reply-To: <CAHk-=wh_bQK5zs+CwQ5eyodq4sQT0eOPp60qzvVL2_EtgETP-g@mail.gmail.com>
On Fri, Nov 02, 2018 at 09:14:51AM -0700, Linus Torvalds wrote:
> On Fri, Nov 2, 2018 at 6:04 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > I've tried making access_ok mask the parameter it gets.
>
> PLEASE don't do this.
Okay.
> Just use "copy_to/from_user()".
Just for completeness I'd like to point out for vhost the copies are
done from the kernel thread. So yes we can switch to copy_to/from_user
but for e.g. 32-bit userspace running on top of a 64 bit kernel it is
IIUC not sufficient - we must *also* do access_ok checks on control path
when addresses are passed to the kernel and when current points to the
correct task struct.
> We have had lots of bugs because code bitrots.
Yes, I wish we did not need these access_ok checks and could just rely
on copy_to/from_user.
> And no, the access_ok() checks aren't expensive, not even in a loop.
> They *used* to be somewhat expensive compared to the access, but that
> simply isn't true any more. The real expense in copy_to_user and
> friends are in the user access bit setting (STAC and CLAC on x86),
> which easily an order of magnitude more expensive than access_ok().
>
> So just get rid of the double-underscore version. It's basically
> always a mis-optimization due to entirely historical reasons. I can
> pretty much guarantee that it's not visible in profiles.
>
> Linus
OK. So maybe we should focus on switching to user_access_begin/end +
unsafe_get_user/unsafe_put_user in a loop which does seem to be
measureable. That moves the barrier out of the loop, which seems to be
consistent with what you would expect.
--
MST
^ permalink raw reply
* Re: [PATCH] net/mlx5e: fix high stack usage
From: Jason Gunthorpe @ 2018-11-02 17:00 UTC (permalink / raw)
To: Arnd Bergmann
Cc: Saeed Mahameed, Leon Romanovsky, David S. Miller, Tariq Toukan,
Eran Ben Elisha, Boris Pismenny, Ilya Lesokhin, Moshe Shemesh,
Kamal Heib, netdev, linux-rdma, linux-kernel
In-Reply-To: <CAK8P3a2pvkfR1A6dHAMm8ETd13E1LSxXrEsQHSOFYWbr-7R2Sw@mail.gmail.com>
On Fri, Nov 02, 2018 at 05:23:26PM +0100, Arnd Bergmann wrote:
> On 11/2/18, Jason Gunthorpe <jgg@ziepe.ca> wrote:
> > On Fri, Nov 02, 2018 at 04:33:03PM +0100, Arnd Bergmann wrote:
> >> A patch that looks harmless causes the stack usage of the
> >> mlx5e_grp_sw_update_stats()
> >> function to drastically increase with x86 gcc-4.9 and higher (tested up to
> >> 8.1):
> >>
> >> drivers/net/ethernet/mellanox/mlx5/core/en_stats.c: In function
> >> ‘mlx5e_grp_sw_update_stats’:
> >> drivers/net/ethernet/mellanox/mlx5/core/en_stats.c:216:1: warning: the
> >> frame size of 1276 bytes is larger than 500 bytes [-Wframe-larger-than=]
> >
> > Why is the stack size so big here? The mlx5e_sw_stats is < 500 bytes
> > and all the other on-stack stuff looks pretty small?
>
> I am not entirely sure, but my analysis indicates that gcc tries loop unrolling
> or some other optimization that leads to two copies on the stack.
Wow how strange
Did you consier adding a
__attribute__((optimize("no-unroll-loops")))
type macro?
Jason
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox