* Re: INFO: rcu detected stall in ext4_write_checks
From: Paul E. McKenney @ 2019-07-14 18:49 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: Theodore Ts'o, syzbot, Andreas Dilger, David Miller, eladr,
Ido Schimmel, Jiri Pirko, John Stultz, linux-ext4, LKML, netdev,
syzkaller-bugs, Thomas Gleixner, Peter Zijlstra, Ingo Molnar
In-Reply-To: <CACT4Y+asYe-uH9OV5R0Nkb-JKP4erYUZ68S9gYNnGg6v+fD20w@mail.gmail.com>
On Sun, Jul 14, 2019 at 05:48:00PM +0300, Dmitry Vyukov wrote:
> On Sun, Jul 7, 2019 at 4:17 AM Paul E. McKenney <paulmck@linux.ibm.com> wrote:
> > > > > I suppose RCU could take the dueling-banjos approach and use increasingly
> > > > > aggressive scheduler policies itself, up to and including SCHED_DEADLINE,
> > > > > until it started getting decent forward progress. However, that
> > > > > sounds like the something that just might have unintended consequences,
> > > > > particularly if other kernel subsystems were to also play similar
> > > > > games of dueling banjos.
> > > >
> > > > So long as the RCU threads are well-behaved, using SCHED_DEADLINE
> > > > shouldn't have much of an impact on the system --- and the scheduling
> > > > parameters that you can specify on SCHED_DEADLINE allows you to
> > > > specify the worst-case impact on the system while also guaranteeing
> > > > that the SCHED_DEADLINE tasks will urn in the first place. After all,
> > > > that's the whole point of SCHED_DEADLINE.
> > > >
> > > > So I wonder if the right approach is during the the first userspace
> > > > system call to shced_setattr to enable a (any) real-time priority
> > > > scheduler (SCHED_DEADLINE, SCHED_FIFO or SCHED_RR) on a userspace
> > > > thread, before that's allowed to proceed, the RCU kernel threads are
> > > > promoted to be SCHED_DEADLINE with appropriately set deadline
> > > > parameters. That way, a root user won't be able to shoot the system
> > > > in the foot, and since the vast majority of the time, there shouldn't
> > > > be any processes running with real-time priorities, we won't be
> > > > changing the behavior of a normal server system.
> > >
> > > It might well be. However, running the RCU kthreads at real-time
> > > priority does not come for free. For example, it tends to crank up the
> > > context-switch rate.
> > >
> > > Plus I have taken several runs at computing SCHED_DEADLINE parameters,
> > > but things like the rcuo callback-offload threads have computational
> > > requirements that are controlled not by RCU, and not just by the rest of
> > > the kernel, but also by userspace (keeping in mind the example of opening
> > > and closing a file in a tight loop, each pass of which queues a callback).
> > > I suspect that RCU is not the only kernel subsystem whose computational
> > > requirements are set not by the subsystem, but rather by external code.
> > >
> > > OK, OK, I suppose I could just set insanely large SCHED_DEADLINE
> > > parameters, following syzkaller's example, and then trust my ability to
> > > keep the RCU code from abusing the resulting awesome power. But wouldn't
> > > a much nicer approach be to put SCHED_DEADLINE between SCHED_RR/SCHED_FIFO
> > > priorities 98 and 99 or some such? Then the same (admittedly somewhat
> > > scary) result could be obtained much more simply via SCHED_FIFO or
> > > SCHED_RR priority 99.
> > >
> > > Some might argue that this is one of those situations where simplicity
> > > is not necessarily an advantage, but then again, you can find someone
> > > who will complain about almost anything. ;-)
> > >
> > > > (I suspect there might be some audio applications that might try to
> > > > set real-time priorities, but for desktop systems, it's probably more
> > > > important that the system not tie its self into knots since the
> > > > average desktop user isn't going to be well equipped to debug the
> > > > problem.)
> > >
> > > Not only that, but if core counts continue to increase, and if reliance
> > > on cloud computing continues to grow, there are going to be an increasing
> > > variety of mixed workloads in increasingly less-controlled environments.
> > >
> > > So, yes, it would be good to solve this problem in some reasonable way.
> > >
> > > I don't see this as urgent just yet, but I am sure you all will let
> > > me know if I am mistaken on that point.
> > >
> > > > > Alternatively, is it possible to provide stricter admission control?
> > > >
> > > > I think that's an orthogonal issue; better admission control would be
> > > > nice, but it looks to me that it's going to be fundamentally an issue
> > > > of tweaking hueristics, and a fool-proof solution that will protect
> > > > against all malicious userspace applications (including syzkaller) is
> > > > going to require solving the halting problem. So while it would be
> > > > nice to improve the admission control, I don't think that's a going to
> > > > be a general solution.
> > >
> > > Agreed, and my earlier point about the need to trust the coding abilities
> > > of those writing ultimate-priority code is all too consistent with your
> > > point about needing to solve the halting problem. Nevertheless, I believe
> > > that we could make something that worked reasonably well in practice.
> > >
> > > Here are a few components of a possible solution, in practice, but
> > > of course not in theory:
> > >
> > > 1. We set limits to SCHED_DEADLINE parameters, perhaps novel ones.
> > > For one example, insist on (say) 10 milliseconds of idle time
> > > every second on each CPU. Yes, you can configure beyond that
> > > given sufficient permissions, but if you do so, you just voided
> > > your warranty.
> > >
> > > 2. Only allow SCHED_DEADLINE on nohz_full CPUs. (Partial solution,
> > > given that such a CPU might be running in the kernel or have
> > > more than one runnable task. Just for fun, I will suggest the
> > > option of disabling SCHED_DEADLINE during such times.)
> > >
> > > 3. RCU detects slowdowns, and does something TBD to increase its
> > > priority, but only while the slowdown persists. This likely
> > > relies on scheduling-clock interrupts to detect the slowdowns,
> > > so there might be additional challenges on a fully nohz_full
> > > system.
> >
> > 4. SCHED_DEADLINE treats the other three scheduling classes as each
> > having a period, deadline, and a modest CPU consumption budget
> > for the members of the class in aggregate. But this has to have
> > been discussed before. How did that go?
> >
> > > 5. Your idea here.
>
> Trying to digest this thread.
>
> Do I understand correctly that setting rcutree.kthread_prio=99 won't
> help because the deadline priority is higher?
> And there are no other existing mechanisms to either fix the stalls
> nor make kernel reject the non well-behaving parameters? Kernel tries
> to filter out non well-behaving parameters, but the check detects only
> obvious misconfigurations, right?
> This reminds of priority inversion/inheritance problem. I wonder if
> there are other kernel subsystems that suffer from the same problem.
> E.g. the background kernel thread that destroys net namespaces and any
> other type of async work. A high prio user process can overload the
> queue and make kernel eat all memory. May be relatively easy to do
> even unintentionally. I suspect the problem is not specific to rcu and
> plumbing just rcu may just make the next problem pop up.
> Should user be able to starve basic kernel services? User should be
> able to prioritize across user processes (potentially in radical
> ways), but perhaps it should not be able to badly starve kernel
> functions that just happened to be asynchronous? I guess it's not as
> simple as setting the highest prio for all kernel threads because in
> normal case we want to reduce latency of user work by making the work
> async. But user must not be able to starve kernel threads
> infinitely... sounds like something similar to the deadline scheduling
> -- kernel threads need to get at least some time slice per unit of
> time.
As I understand it, there is provision for giving other threads slack
time even in SCHED_DEADLINE, but the timing of that slack time depends
on the other tasks' SCHED_DEADLINE settings. And RCU's kthreads do
need some response time: Optimally a few milliseconds, preferably about
a hundred milliseconds, but definitely a second. With the huge cycle
time specified, RCU might not get that.
And yes, I suspect that RCU is not the only thing in the system needing
a little CPU time fairly frequently, for some ill-defined notion of
"fairly frequently".
> But short term I don't see any other solution than stop testing
> sched_setattr because it does not check arguments enough to prevent
> system misbehavior. Which is a pity because syzkaller has found some
> bad misconfigurations that were oversight on checking side.
> Any other suggestions?
Keep the times down to a few seconds? Of course, that might also
fail to find interesting bugs.
Thanx, Paul
^ permalink raw reply
* Re: [PATCH v2 3/9] rcu/sync: Remove custom check for reader-section
From: Paul E. McKenney @ 2019-07-14 18:50 UTC (permalink / raw)
To: Joel Fernandes
Cc: linux-kernel, Oleg Nesterov, Alexey Kuznetsov, Bjorn Helgaas,
Borislav Petkov, c0d1n61at3, David S. Miller, edumazet,
Greg Kroah-Hartman, Hideaki YOSHIFUJI, H. Peter Anvin,
Ingo Molnar, Jonathan Corbet, Josh Triplett, keescook,
kernel-hardening, kernel-team, Lai Jiangshan, Len Brown,
linux-acpi, linux-doc, linux-pci, linux-pm, Mathieu Desnoyers,
neilb, netdev, Pavel Machek, peterz, Rafael J. Wysocki,
Rasmus Villemoes, rcu, Steven Rostedt, Tejun Heo, Thomas Gleixner,
will, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)
In-Reply-To: <20190714183820.GD34501@google.com>
On Sun, Jul 14, 2019 at 02:38:20PM -0400, Joel Fernandes wrote:
> On Sun, Jul 14, 2019 at 02:10:53PM -0400, Joel Fernandes wrote:
> > On Sat, Jul 13, 2019 at 02:28:12PM -0700, Paul E. McKenney wrote:
> [snip]
> > > > > > > > > >
> > > > > > > > > > Cc: Oleg Nesterov <oleg@redhat.com>
> > > > > > > > > > Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> > > > > > > > > > ---
> > > > > > > > > > include/linux/rcu_sync.h | 4 +---
> > > > > > > > > > 1 file changed, 1 insertion(+), 3 deletions(-)
> > > > > > > > > >
> > > > > > > > > > diff --git a/include/linux/rcu_sync.h b/include/linux/rcu_sync.h
> > > > > > > > > > index 9b83865d24f9..0027d4c8087c 100644
> > > > > > > > > > --- a/include/linux/rcu_sync.h
> > > > > > > > > > +++ b/include/linux/rcu_sync.h
> > > > > > > > > > @@ -31,9 +31,7 @@ struct rcu_sync {
> > > > > > > > > > */
> > > > > > > > > > static inline bool rcu_sync_is_idle(struct rcu_sync *rsp)
> > > > > > > > > > {
> > > > > > > > > > - RCU_LOCKDEP_WARN(!rcu_read_lock_held() &&
> > > > > > > > > > - !rcu_read_lock_bh_held() &&
> > > > > > > > > > - !rcu_read_lock_sched_held(),
> > > > > > > > > > + RCU_LOCKDEP_WARN(!rcu_read_lock_any_held(),
> > > > > > > > >
> > > > > > > > > I believe that replacing rcu_read_lock_sched_held() with preemptible()
> > > > > > > > > in a CONFIG_PREEMPT=n kernel will give you false-positive splats here.
> > > > > > > > > If you have not already done so, could you please give it a try?
> > > > > > > >
> > > > > > > > Hi Paul,
> > > > > > > > I don't think it will cause splats for !CONFIG_PREEMPT.
> > > > > > > >
> > > > > > > > Currently, rcu_read_lock_any_held() introduced in this patch returns true if
> > > > > > > > !preemptible(). This means that:
> > > > > > > >
> > > > > > > > The following expression above:
> > > > > > > > RCU_LOCKDEP_WARN(!rcu_read_lock_any_held(),...)
> > > > > > > >
> > > > > > > > Becomes:
> > > > > > > > RCU_LOCKDEP_WARN(preemptible(), ...)
> > > > > > > >
> > > > > > > > For, CONFIG_PREEMPT=n kernels, this means:
> > > > > > > > RCU_LOCKDEP_WARN(0, ...)
> > > > > > > >
> > > > > > > > Which would mean no splats. Or, did I miss the point?
> > > > > > >
> > > > > > > I suggest trying it out on a CONFIG_PREEMPT=n kernel.
> > > > > >
> > > > > > Sure, will do, sorry did not try it out yet because was busy with weekend
> > > > > > chores but will do soon, thanks!
> > > > >
> > > > > I am not faulting you for taking the weekend off, actually. ;-)
> > > >
> > > > ;-)
> > > >
> > > > I tried doing RCU_LOCKDEP_WARN(preemptible(), ...) in this code path and I
> > > > don't get any splats. I also disassembled the code and it seems to me
> > > > RCU_LOCKDEP_WARN() becomes a NOOP which also the above reasoning confirms.
> > >
> > > OK, very good. Could you do the same thing for the RCU_LOCKDEP_WARN()
> > > in synchronize_rcu()? Why or why not?
> > >
> >
> > Hi Paul,
> >
> > Yes synchronize_rcu() can also make use of this technique since it is
> > strictly illegal to call synchronize_rcu() within a reader section.
> >
> > I will add this to the set of my patches as well and send them all out next
> > week, along with the rcu-sync and bh clean ups we discussed.
>
> After sending this email, it occurs to me it wont work in synchronize_rcu()
> for !CONFIG_PREEMPT kernels. This is because in a !CONFIG_PREEMPT kernel,
> executing in kernel mode itself looks like being in an RCU reader. So we
> should leave that as is. However it will work fine for rcu_sync_is_idle (for
> CONFIG_PREEMPT=n kernels) as I mentioned earlier.
>
> Were trying to throw me a Quick-Quiz ? ;-) In that case, hope I passed!
You did pass. This time. ;-)
Thanx, Paul
^ permalink raw reply
* Re: INFO: rcu detected stall in ext4_write_checks
From: Theodore Ts'o @ 2019-07-14 19:05 UTC (permalink / raw)
To: Dmitry Vyukov
Cc: Paul E. McKenney, syzbot, Andreas Dilger, David Miller, eladr,
Ido Schimmel, Jiri Pirko, John Stultz, linux-ext4, LKML, netdev,
syzkaller-bugs, Thomas Gleixner, Peter Zijlstra, Ingo Molnar
In-Reply-To: <CACT4Y+asYe-uH9OV5R0Nkb-JKP4erYUZ68S9gYNnGg6v+fD20w@mail.gmail.com>
On Sun, Jul 14, 2019 at 05:48:00PM +0300, Dmitry Vyukov wrote:
> But short term I don't see any other solution than stop testing
> sched_setattr because it does not check arguments enough to prevent
> system misbehavior. Which is a pity because syzkaller has found some
> bad misconfigurations that were oversight on checking side.
> Any other suggestions?
Or maybe syzkaller can put its own limitations on what parameters are
sent to sched_setattr? In practice, there are any number of ways a
root user can shoot themselves in the foot when using sched_setattr or
sched_setaffinity, for that matter. I imagine there must be some such
constraints already --- or else syzkaller might have set a kernel
thread to run with priority SCHED_BATCH, with similar catastrophic
effects --- or do similar configurations to make system threads
completely unschedulable.
Real time administrators who know what they are doing --- and who know
that their real-time threads are well behaved --- will always want to
be able to do things that will be catastrophic if the real-time thread
is *not* well behaved. I don't it is possible to add safety checks
which would allow the kernel to automatically detect and reject unsafe
configurations.
An apt analogy might be civilian versus military aircraft. Most
airplanes are designed to be "inherently stable"; that way, modulo
buggy/insane control systems like on the 737 Max, the airplane will
automatically return to straight and level flight. On the other hand,
some military planes (for example, the F-16, F-22, F-36, the
Eurofighter, etc.) are sometimes designed to be unstable, since that
way they can be more maneuverable.
There are use cases for real-time Linux where this flexibility/power
vs. stability tradeoff is going to argue for giving root the
flexibility to crash the system. Some of these systems might
literally involve using real-time Linux in military applications,
something for which Paul and I have had some experience. :-)
Speaking of sched_setaffinity, one thing which we can do is have
syzkaller move all of the system threads to they run on the "system
CPU's", and then move the syzkaller processes which are testing the
kernel to be on the "system under test CPU's". Then regardless of
what priority the syzkaller test programs try to run themselves at,
they can't crash the system.
Some real-time systems do actually run this way, and it's a
recommended configuration which is much safer than letting the
real-time threads take over the whole system:
http://linuxrealtime.org/index.php/Improving_the_Real-Time_Properties#Isolating_the_Application
- Ted
^ permalink raw reply
* [PATCH] mm/gup: Use put_user_page*() instead of put_page*()
From: Bharath Vedartham @ 2019-07-14 19:08 UTC (permalink / raw)
To: akpm, ira.weiny, jhubbard
Cc: Bharath Vedartham, Mauro Carvalho Chehab, Dimitri Sivanich,
Arnd Bergmann, Greg Kroah-Hartman, Alex Williamson, Cornelia Huck,
Jens Axboe, Alexander Viro, Björn Töpel,
Magnus Karlsson, David S. Miller, Alexei Starovoitov,
Daniel Borkmann, Jakub Kicinski, Jesper Dangaard Brouer,
John Fastabend, Enrico Weigelt, Thomas Gleixner, Alexios Zavras,
Dan Carpenter, Max Filippov, Matt Sickler, Kirill A. Shutemov,
Keith Busch, YueHaibing, linux-media, linux-kernel, devel, kvm,
linux-block, linux-fsdevel, linux-mm, netdev, bpf, xdp-newbies
This patch converts all call sites of get_user_pages
to use put_user_page*() instead of put_page*() functions to
release reference to gup pinned pages.
This is a bunch of trivial conversions which is a part of an effort
by John Hubbard to solve issues with gup pinned pages and
filesystem writeback.
The issue is more clearly described in John Hubbard's patch[1] where
put_user_page*() functions are introduced.
Currently put_user_page*() simply does put_page but future implementations
look to change that once treewide change of put_page callsites to
put_user_page*() is finished.
The lwn article describing the issue with gup pinned pages and filesystem
writeback [2].
This patch has been tested by building and booting the kernel as I don't
have the required hardware to test the device drivers.
I did not modify gpu/drm drivers which use release_pages instead of
put_page() to release reference of gup pinned pages as I am not clear
whether release_pages and put_page are interchangable.
[1] https://lkml.org/lkml/2019/3/26/1396
[2] https://lwn.net/Articles/784574/
Signed-off-by: Bharath Vedartham <linux.bhar@gmail.com>
---
drivers/media/v4l2-core/videobuf-dma-sg.c | 3 +--
drivers/misc/sgi-gru/grufault.c | 2 +-
drivers/staging/kpc2000/kpc_dma/fileops.c | 4 +---
drivers/vfio/vfio_iommu_type1.c | 2 +-
fs/io_uring.c | 7 +++----
mm/gup_benchmark.c | 6 +-----
net/xdp/xdp_umem.c | 7 +------
7 files changed, 9 insertions(+), 22 deletions(-)
diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
index 66a6c6c..d6eeb43 100644
--- a/drivers/media/v4l2-core/videobuf-dma-sg.c
+++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
@@ -349,8 +349,7 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
BUG_ON(dma->sglen);
if (dma->pages) {
- for (i = 0; i < dma->nr_pages; i++)
- put_page(dma->pages[i]);
+ put_user_pages(dma->pages, dma->nr_pages);
kfree(dma->pages);
dma->pages = NULL;
}
diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c
index 4b713a8..61b3447 100644
--- a/drivers/misc/sgi-gru/grufault.c
+++ b/drivers/misc/sgi-gru/grufault.c
@@ -188,7 +188,7 @@ static int non_atomic_pte_lookup(struct vm_area_struct *vma,
if (get_user_pages(vaddr, 1, write ? FOLL_WRITE : 0, &page, NULL) <= 0)
return -EFAULT;
*paddr = page_to_phys(page);
- put_page(page);
+ put_user_page(page);
return 0;
}
diff --git a/drivers/staging/kpc2000/kpc_dma/fileops.c b/drivers/staging/kpc2000/kpc_dma/fileops.c
index 6166587..26dceed 100644
--- a/drivers/staging/kpc2000/kpc_dma/fileops.c
+++ b/drivers/staging/kpc2000/kpc_dma/fileops.c
@@ -198,9 +198,7 @@ int kpc_dma_transfer(struct dev_private_data *priv, struct kiocb *kcb, unsigned
sg_free_table(&acd->sgt);
err_dma_map_sg:
err_alloc_sg_table:
- for (i = 0 ; i < acd->page_count ; i++){
- put_page(acd->user_pages[i]);
- }
+ put_user_pages(acd->user_pages, acd->page_count);
err_get_user_pages:
kfree(acd->user_pages);
err_alloc_userpages:
diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index add34ad..c491524 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -369,7 +369,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
*/
if (ret > 0 && vma_is_fsdax(vmas[0])) {
ret = -EOPNOTSUPP;
- put_page(page[0]);
+ put_user_page(page[0]);
}
}
up_read(&mm->mmap_sem);
diff --git a/fs/io_uring.c b/fs/io_uring.c
index 4ef62a4..b4a4549 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2694,10 +2694,9 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
* if we did partial map, or found file backed vmas,
* release any pages we did get
*/
- if (pret > 0) {
- for (j = 0; j < pret; j++)
- put_page(pages[j]);
- }
+ if (pret > 0)
+ put_user_pages(pages, pret);
+
if (ctx->account_mem)
io_unaccount_mem(ctx->user, nr_pages);
kvfree(imu->bvec);
diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
index 7dd602d..15fc7a2 100644
--- a/mm/gup_benchmark.c
+++ b/mm/gup_benchmark.c
@@ -76,11 +76,7 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
gup->size = addr - gup->addr;
start_time = ktime_get();
- for (i = 0; i < nr_pages; i++) {
- if (!pages[i])
- break;
- put_page(pages[i]);
- }
+ put_user_pages(pages, nr_pages);
end_time = ktime_get();
gup->put_delta_usec = ktime_us_delta(end_time, start_time);
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 9c6de4f..6103e19 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -173,12 +173,7 @@ static void xdp_umem_unpin_pages(struct xdp_umem *umem)
{
unsigned int i;
- for (i = 0; i < umem->npgs; i++) {
- struct page *page = umem->pgs[i];
-
- set_page_dirty_lock(page);
- put_page(page);
- }
+ put_user_pages_dirty_lock(umem->pgs, umem->npgs);
kfree(umem->pgs);
umem->pgs = NULL;
--
1.8.3.1
^ permalink raw reply related
* Re: [PATCH] gve: Remove the exporting of gve_probe
From: David Miller @ 2019-07-14 19:13 UTC (permalink / raw)
To: efremov; +Cc: csully, sagis, jonolson, netdev, linux-kernel
In-Reply-To: <20190714120225.15279-1-efremov@linux.com>
From: Denis Efremov <efremov@linux.com>
Date: Sun, 14 Jul 2019 15:02:25 +0300
> The function gve_probe is declared static and marked EXPORT_SYMBOL, which
> is at best an odd combination. Because the function is not used outside of
> the drivers/net/ethernet/google/gve/gve_main.c file it is defined in, this
> commit removes the EXPORT_SYMBOL() marking.
>
> Signed-off-by: Denis Efremov <efremov@linux.com>
Applied, thanks.
^ permalink raw reply
* Re: [PATCH] sit: use dst_cache in ipip6_tunnel_xmit
From: David Miller @ 2019-07-14 19:15 UTC (permalink / raw)
To: yanhaishuang; +Cc: kuznet, yoshfuji, netdev, linux-kernel
In-Reply-To: <1563111082-10721-1-git-send-email-yanhaishuang@cmss.chinamobile.com>
From: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Date: Sun, 14 Jul 2019 21:31:22 +0800
> Same as other ip tunnel, use dst_cache in xmit action to avoid
> unnecessary fib lookups.
>
> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Looks good, applied, thanks.
^ permalink raw reply
* Re: [PATCH net v2] net: neigh: fix multiple neigh timer scheduling
From: David Miller @ 2019-07-14 19:15 UTC (permalink / raw)
To: dsahern; +Cc: lorenzo.bianconi, netdev, marek
In-Reply-To: <26f58e35-f1f8-9543-819f-ef7f52da1e49@gmail.com>
From: David Ahern <dsahern@gmail.com>
Date: Sun, 14 Jul 2019 07:58:19 -0600
> On 7/14/19 2:45 AM, Lorenzo Bianconi wrote:
>> @@ -1124,7 +1125,9 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
>>
>> atomic_set(&neigh->probes,
>> NEIGH_VAR(neigh->parms, UCAST_PROBES));
>> - neigh->nud_state = NUD_INCOMPLETE;
>> + if (check_timer)
>> + neigh_del_timer(neigh);
>
> Why not just always call neigh_del_timer and avoid the check_timer flag?
> Let the NUD_IN_TIMER flag handle whether anything needs to be done.
Agreed.
^ permalink raw reply
* Re: [PATCH] sis900: correct a few typos
From: David Miller @ 2019-07-14 19:22 UTC (permalink / raw)
To: sergej.benilov; +Cc: venza, netdev, andrew
In-Reply-To: <20190714165627.32765-1-sergej.benilov@googlemail.com>
From: Sergej Benilov <sergej.benilov@googlemail.com>
Date: Sun, 14 Jul 2019 18:56:27 +0200
> Correct a few typos in comments and debug text.
>
> Signed-off-by: Sergej Benilov <sergej.benilov@googlemail.com>
Applied.
^ permalink raw reply
* Re: INFO: rcu detected stall in ext4_write_checks
From: Paul E. McKenney @ 2019-07-14 19:29 UTC (permalink / raw)
To: Theodore Ts'o, Dmitry Vyukov, syzbot, Andreas Dilger,
David Miller, eladr, Ido Schimmel, Jiri Pirko, John Stultz,
linux-ext4, LKML, netdev, syzkaller-bugs, Thomas Gleixner,
Peter Zijlstra, Ingo Molnar
In-Reply-To: <20190714190522.GA24049@mit.edu>
On Sun, Jul 14, 2019 at 03:05:22PM -0400, Theodore Ts'o wrote:
> On Sun, Jul 14, 2019 at 05:48:00PM +0300, Dmitry Vyukov wrote:
> > But short term I don't see any other solution than stop testing
> > sched_setattr because it does not check arguments enough to prevent
> > system misbehavior. Which is a pity because syzkaller has found some
> > bad misconfigurations that were oversight on checking side.
> > Any other suggestions?
>
> Or maybe syzkaller can put its own limitations on what parameters are
> sent to sched_setattr? In practice, there are any number of ways a
> root user can shoot themselves in the foot when using sched_setattr or
> sched_setaffinity, for that matter. I imagine there must be some such
> constraints already --- or else syzkaller might have set a kernel
> thread to run with priority SCHED_BATCH, with similar catastrophic
> effects --- or do similar configurations to make system threads
> completely unschedulable.
>
> Real time administrators who know what they are doing --- and who know
> that their real-time threads are well behaved --- will always want to
> be able to do things that will be catastrophic if the real-time thread
> is *not* well behaved. I don't it is possible to add safety checks
> which would allow the kernel to automatically detect and reject unsafe
> configurations.
>
> An apt analogy might be civilian versus military aircraft. Most
> airplanes are designed to be "inherently stable"; that way, modulo
> buggy/insane control systems like on the 737 Max, the airplane will
> automatically return to straight and level flight. On the other hand,
> some military planes (for example, the F-16, F-22, F-36, the
> Eurofighter, etc.) are sometimes designed to be unstable, since that
> way they can be more maneuverable.
>
> There are use cases for real-time Linux where this flexibility/power
> vs. stability tradeoff is going to argue for giving root the
> flexibility to crash the system. Some of these systems might
> literally involve using real-time Linux in military applications,
> something for which Paul and I have had some experience. :-)
>
> Speaking of sched_setaffinity, one thing which we can do is have
> syzkaller move all of the system threads to they run on the "system
> CPU's", and then move the syzkaller processes which are testing the
> kernel to be on the "system under test CPU's". Then regardless of
> what priority the syzkaller test programs try to run themselves at,
> they can't crash the system.
>
> Some real-time systems do actually run this way, and it's a
> recommended configuration which is much safer than letting the
> real-time threads take over the whole system:
>
> http://linuxrealtime.org/index.php/Improving_the_Real-Time_Properties#Isolating_the_Application
Good point! We might still have issues with some per-CPU kthreads,
but perhaps use of nohz_full would help at least reduce these sorts
of problems. (There could still be issues on CPUs with more than
one runnable threads.)
Thanx, Paul
^ permalink raw reply
* Re: [PATCH net v2] net: neigh: fix multiple neigh timer scheduling
From: Lorenzo Bianconi @ 2019-07-14 20:04 UTC (permalink / raw)
To: David Ahern; +Cc: davem, netdev, marek
In-Reply-To: <26f58e35-f1f8-9543-819f-ef7f52da1e49@gmail.com>
[-- Attachment #1: Type: text/plain, Size: 1136 bytes --]
> On 7/14/19 2:45 AM, Lorenzo Bianconi wrote:
> > @@ -1124,7 +1125,9 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
> >
> > atomic_set(&neigh->probes,
> > NEIGH_VAR(neigh->parms, UCAST_PROBES));
> > - neigh->nud_state = NUD_INCOMPLETE;
> > + if (check_timer)
> > + neigh_del_timer(neigh);
>
> Why not just always call neigh_del_timer and avoid the check_timer flag?
> Let the NUD_IN_TIMER flag handle whether anything needs to be done.
ack, I have been too paranoid here. I will post a v3 fixing it.
Regards,
Lorenzo
>
> > + neigh->nud_state = NUD_INCOMPLETE;
> > neigh->updated = now;
> > next = now + max(NEIGH_VAR(neigh->parms, RETRANS_TIME),
> > HZ/2);
> > @@ -1140,6 +1143,8 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
> > }
> > } else if (neigh->nud_state & NUD_STALE) {
> > neigh_dbg(2, "neigh %p is delayed\n", neigh);
> > + if (check_timer)
> > + neigh_del_timer(neigh);
> > neigh->nud_state = NUD_DELAY;
> > neigh->updated = jiffies;
> > neigh_add_timer(neigh, jiffies +
>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 228 bytes --]
^ permalink raw reply
* Re: [PATCH] sky2: Disable MSI on P5W DH Deluxe
From: David Miller @ 2019-07-14 20:46 UTC (permalink / raw)
To: tasos; +Cc: netdev, linux-kernel, stephen, mlindner
In-Reply-To: <14c872c3-09ac-7f51-dc3d-e68319459fcf@tasossah.com>
From: Tasos Sahanidis <tasos@tasossah.com>
Date: Sun, 14 Jul 2019 13:31:11 +0300
> The onboard sky2 NICs send IRQs after S3, resulting in ethernet not
> working after resume.
> Maskable MSI and MSI-X are also not supported, so fall back to INTx.
>
> Signed-off-by: Tasos Sahanidis <tasos@tasossah.com>
Applied, thanks.
^ permalink raw reply
* [PATCH net v3] net: neigh: fix multiple neigh timer scheduling
From: Lorenzo Bianconi @ 2019-07-14 21:36 UTC (permalink / raw)
To: davem; +Cc: netdev, dsahern, marek
Neigh timer can be scheduled multiple times from userspace adding
multiple neigh entries and forcing the neigh timer scheduling passing
NTF_USE in the netlink requests.
This will result in a refcount leak and in the following dump stack:
[ 32.465295] NEIGH: BUG, double timer add, state is 8
[ 32.465308] CPU: 0 PID: 416 Comm: double_timer_ad Not tainted 5.2.0+ #65
[ 32.465311] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.12.0-2.fc30 04/01/2014
[ 32.465313] Call Trace:
[ 32.465318] dump_stack+0x7c/0xc0
[ 32.465323] __neigh_event_send+0x20c/0x880
[ 32.465326] ? ___neigh_create+0x846/0xfb0
[ 32.465329] ? neigh_lookup+0x2a9/0x410
[ 32.465332] ? neightbl_fill_info.constprop.0+0x800/0x800
[ 32.465334] neigh_add+0x4f8/0x5e0
[ 32.465337] ? neigh_xmit+0x620/0x620
[ 32.465341] ? find_held_lock+0x85/0xa0
[ 32.465345] rtnetlink_rcv_msg+0x204/0x570
[ 32.465348] ? rtnl_dellink+0x450/0x450
[ 32.465351] ? mark_held_locks+0x90/0x90
[ 32.465354] ? match_held_lock+0x1b/0x230
[ 32.465357] netlink_rcv_skb+0xc4/0x1d0
[ 32.465360] ? rtnl_dellink+0x450/0x450
[ 32.465363] ? netlink_ack+0x420/0x420
[ 32.465366] ? netlink_deliver_tap+0x115/0x560
[ 32.465369] ? __alloc_skb+0xc9/0x2f0
[ 32.465372] netlink_unicast+0x270/0x330
[ 32.465375] ? netlink_attachskb+0x2f0/0x2f0
[ 32.465378] netlink_sendmsg+0x34f/0x5a0
[ 32.465381] ? netlink_unicast+0x330/0x330
[ 32.465385] ? move_addr_to_kernel.part.0+0x20/0x20
[ 32.465388] ? netlink_unicast+0x330/0x330
[ 32.465391] sock_sendmsg+0x91/0xa0
[ 32.465394] ___sys_sendmsg+0x407/0x480
[ 32.465397] ? copy_msghdr_from_user+0x200/0x200
[ 32.465401] ? _raw_spin_unlock_irqrestore+0x37/0x40
[ 32.465404] ? lockdep_hardirqs_on+0x17d/0x250
[ 32.465407] ? __wake_up_common_lock+0xcb/0x110
[ 32.465410] ? __wake_up_common+0x230/0x230
[ 32.465413] ? netlink_bind+0x3e1/0x490
[ 32.465416] ? netlink_setsockopt+0x540/0x540
[ 32.465420] ? __fget_light+0x9c/0xf0
[ 32.465423] ? sockfd_lookup_light+0x8c/0xb0
[ 32.465426] __sys_sendmsg+0xa5/0x110
[ 32.465429] ? __ia32_sys_shutdown+0x30/0x30
[ 32.465432] ? __fd_install+0xe1/0x2c0
[ 32.465435] ? lockdep_hardirqs_off+0xb5/0x100
[ 32.465438] ? mark_held_locks+0x24/0x90
[ 32.465441] ? do_syscall_64+0xf/0x270
[ 32.465444] do_syscall_64+0x63/0x270
[ 32.465448] entry_SYSCALL_64_after_hwframe+0x49/0xbe
Fix the issue unscheduling neigh_timer if selected entry is in 'IN_TIMER'
receiving a netlink request with NTF_USE flag set
Reported-by: Marek Majkowski <marek@cloudflare.com>
Fixes: 0c5c2d308906 ("neigh: Allow for user space users of the neighbour table")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
---
Changes since v2:
- remove check_timer flag and run neigh_del_timer directly
Changes since v1:
- fix compilation errors defining neigh_event_send_check_timer routine
---
net/core/neighbour.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 742cea4ce72e..0dfc97bc8760 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1124,6 +1124,7 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
atomic_set(&neigh->probes,
NEIGH_VAR(neigh->parms, UCAST_PROBES));
+ neigh_del_timer(neigh);
neigh->nud_state = NUD_INCOMPLETE;
neigh->updated = now;
next = now + max(NEIGH_VAR(neigh->parms, RETRANS_TIME),
@@ -1140,6 +1141,7 @@ int __neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
}
} else if (neigh->nud_state & NUD_STALE) {
neigh_dbg(2, "neigh %p is delayed\n", neigh);
+ neigh_del_timer(neigh);
neigh->nud_state = NUD_DELAY;
neigh->updated = jiffies;
neigh_add_timer(neigh, jiffies +
--
2.21.0
^ permalink raw reply related
* Re: [PATCH] mm/gup: Use put_user_page*() instead of put_page*()
From: John Hubbard @ 2019-07-14 23:33 UTC (permalink / raw)
To: Bharath Vedartham, akpm, ira.weiny
Cc: Mauro Carvalho Chehab, Dimitri Sivanich, Arnd Bergmann,
Greg Kroah-Hartman, Alex Williamson, Cornelia Huck, Jens Axboe,
Alexander Viro, Björn Töpel, Magnus Karlsson,
David S. Miller, Alexei Starovoitov, Daniel Borkmann,
Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
Enrico Weigelt, Thomas Gleixner, Alexios Zavras, Dan Carpenter,
Max Filippov, Matt Sickler, Kirill A. Shutemov, Keith Busch,
YueHaibing, linux-media, linux-kernel, devel, kvm, linux-block,
linux-fsdevel, linux-mm, netdev, bpf, xdp-newbies,
Jason Gunthorpe
In-Reply-To: <1563131456-11488-1-git-send-email-linux.bhar@gmail.com>
On 7/14/19 12:08 PM, Bharath Vedartham wrote:
> This patch converts all call sites of get_user_pages
> to use put_user_page*() instead of put_page*() functions to
> release reference to gup pinned pages.
Hi Bharath,
Thanks for jumping in to help, and welcome to the party!
You've caught everyone in the middle of a merge window, btw. As a
result, I'm busy rebasing and reworking the get_user_pages call sites,
and gup tracking, in the wake of some semi-traumatic changes to bio
and gup and such. I plan to re-post right after 5.3-rc1 shows up, from
here:
https://github.com/johnhubbard/linux/commits/gup_dma_core
...which you'll find already covers the changes you've posted, except for:
drivers/misc/sgi-gru/grufault.c
drivers/staging/kpc2000/kpc_dma/fileops.c
...and this one, which is undergoing to larger local changes, due to
bvec, so let's leave it out of the choices:
fs/io_uring.c
Therefore, until -rc1, if you'd like to help, I'd recommend one or more
of the following ideas:
1. Pull down https://github.com/johnhubbard/linux/commits/gup_dma_core
and find missing conversions: look for any additional missing
get_user_pages/put_page conversions. You've already found a couple missing
ones. I haven't re-run a search in a long time, so there's probably even more.
a) And find more, after I rebase to 5.3-rc1: people probably are adding
get_user_pages() calls as we speak. :)
2. Patches: Focus on just one subsystem at a time, and perfect the patch for
it. For example, I think this the staging driver would be perfect to start with:
drivers/staging/kpc2000/kpc_dma/fileops.c
a) verify that you've really, corrected converted the whole
driver. (Hint: I think you might be overlooking a put_page call.)
b) Attempt to test it if you can (I'm being hypocritical in
the extreme here, but one of my problems is that testing
has been light, so any help is very valuable). qemu...?
OTOH, maybe even qemu cannot easily test a kpc2000, but
perhaps `git blame` and talking to the authors would help
figure out a way to validate the changes.
Thinking about whether you can run a test that would prove or
disprove my claim in (a), above, could be useful in coming up
with tests to run.
In other words, a few very high quality conversions (even just one) that
we can really put our faith in, is what I value most here. Tested patches
are awesome.
3. Once I re-post, turn on the new CONFIG_DEBUG_GET_USER_PAGES_REFERENCES
and run things such as xfstest/fstest. (Again, doing so would be going
further than I have yet--very helpful). Help clarify what conversions have
actually been tested and work, and which ones remain unvalidated.
Other: Please note that this:
https://github.com/johnhubbard/linux/commits/gup_dma_core
a) gets rebased often, and
b) has a bunch of commits (iov_iter and related) that conflict
with the latest linux.git,
c) has some bugs in the bio area, that I'm fixing, so I don't trust
that's it's safely runnable, for a few more days.
One note below, for the future:
>
> This is a bunch of trivial conversions which is a part of an effort
> by John Hubbard to solve issues with gup pinned pages and
> filesystem writeback.
>
> The issue is more clearly described in John Hubbard's patch[1] where
> put_user_page*() functions are introduced.
>
> Currently put_user_page*() simply does put_page but future implementations
> look to change that once treewide change of put_page callsites to
> put_user_page*() is finished.
>
> The lwn article describing the issue with gup pinned pages and filesystem
> writeback [2].
>
> This patch has been tested by building and booting the kernel as I don't
> have the required hardware to test the device drivers.
>
> I did not modify gpu/drm drivers which use release_pages instead of
> put_page() to release reference of gup pinned pages as I am not clear
> whether release_pages and put_page are interchangable.
>
> [1] https://lkml.org/lkml/2019/3/26/1396
When referring to patches in a commit description, please use the
commit hash, not an external link. See Submitting Patches [1] for details.
Also, once you figure out the right maintainers and other involved people,
putting Cc: in the commit description is common practice, too.
[1] https://www.kernel.org/doc/html/latest/process/submitting-patches.html
thanks,
--
John Hubbard
NVIDIA
>
> [2] https://lwn.net/Articles/784574/
>
> Signed-off-by: Bharath Vedartham <linux.bhar@gmail.com>
> ---
> drivers/media/v4l2-core/videobuf-dma-sg.c | 3 +--
> drivers/misc/sgi-gru/grufault.c | 2 +-
> drivers/staging/kpc2000/kpc_dma/fileops.c | 4 +---
> drivers/vfio/vfio_iommu_type1.c | 2 +-
> fs/io_uring.c | 7 +++----
> mm/gup_benchmark.c | 6 +-----
> net/xdp/xdp_umem.c | 7 +------
> 7 files changed, 9 insertions(+), 22 deletions(-)
>
> diff --git a/drivers/media/v4l2-core/videobuf-dma-sg.c b/drivers/media/v4l2-core/videobuf-dma-sg.c
> index 66a6c6c..d6eeb43 100644
> --- a/drivers/media/v4l2-core/videobuf-dma-sg.c
> +++ b/drivers/media/v4l2-core/videobuf-dma-sg.c
> @@ -349,8 +349,7 @@ int videobuf_dma_free(struct videobuf_dmabuf *dma)
> BUG_ON(dma->sglen);
>
> if (dma->pages) {
> - for (i = 0; i < dma->nr_pages; i++)
> - put_page(dma->pages[i]);
> + put_user_pages(dma->pages, dma->nr_pages);
> kfree(dma->pages);
> dma->pages = NULL;
> }
> diff --git a/drivers/misc/sgi-gru/grufault.c b/drivers/misc/sgi-gru/grufault.c
> index 4b713a8..61b3447 100644
> --- a/drivers/misc/sgi-gru/grufault.c
> +++ b/drivers/misc/sgi-gru/grufault.c
> @@ -188,7 +188,7 @@ static int non_atomic_pte_lookup(struct vm_area_struct *vma,
> if (get_user_pages(vaddr, 1, write ? FOLL_WRITE : 0, &page, NULL) <= 0)
> return -EFAULT;
> *paddr = page_to_phys(page);
> - put_page(page);
> + put_user_page(page);
> return 0;
> }
>
> diff --git a/drivers/staging/kpc2000/kpc_dma/fileops.c b/drivers/staging/kpc2000/kpc_dma/fileops.c
> index 6166587..26dceed 100644
> --- a/drivers/staging/kpc2000/kpc_dma/fileops.c
> +++ b/drivers/staging/kpc2000/kpc_dma/fileops.c
> @@ -198,9 +198,7 @@ int kpc_dma_transfer(struct dev_private_data *priv, struct kiocb *kcb, unsigned
> sg_free_table(&acd->sgt);
> err_dma_map_sg:
> err_alloc_sg_table:
> - for (i = 0 ; i < acd->page_count ; i++){
> - put_page(acd->user_pages[i]);
> - }
> + put_user_pages(acd->user_pages, acd->page_count);
> err_get_user_pages:
> kfree(acd->user_pages);
> err_alloc_userpages:
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index add34ad..c491524 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -369,7 +369,7 @@ static int vaddr_get_pfn(struct mm_struct *mm, unsigned long vaddr,
> */
> if (ret > 0 && vma_is_fsdax(vmas[0])) {
> ret = -EOPNOTSUPP;
> - put_page(page[0]);
> + put_user_page(page[0]);
> }
> }
> up_read(&mm->mmap_sem);
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 4ef62a4..b4a4549 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2694,10 +2694,9 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
> * if we did partial map, or found file backed vmas,
> * release any pages we did get
> */
> - if (pret > 0) {
> - for (j = 0; j < pret; j++)
> - put_page(pages[j]);
> - }
> + if (pret > 0)
> + put_user_pages(pages, pret);
> +
> if (ctx->account_mem)
> io_unaccount_mem(ctx->user, nr_pages);
> kvfree(imu->bvec);
> diff --git a/mm/gup_benchmark.c b/mm/gup_benchmark.c
> index 7dd602d..15fc7a2 100644
> --- a/mm/gup_benchmark.c
> +++ b/mm/gup_benchmark.c
> @@ -76,11 +76,7 @@ static int __gup_benchmark_ioctl(unsigned int cmd,
> gup->size = addr - gup->addr;
>
> start_time = ktime_get();
> - for (i = 0; i < nr_pages; i++) {
> - if (!pages[i])
> - break;
> - put_page(pages[i]);
> - }
> + put_user_pages(pages, nr_pages);
> end_time = ktime_get();
> gup->put_delta_usec = ktime_us_delta(end_time, start_time);
>
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> index 9c6de4f..6103e19 100644
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -173,12 +173,7 @@ static void xdp_umem_unpin_pages(struct xdp_umem *umem)
> {
> unsigned int i;
>
> - for (i = 0; i < umem->npgs; i++) {
> - struct page *page = umem->pgs[i];
> -
> - set_page_dirty_lock(page);
> - put_page(page);
> - }
> + put_user_pages_dirty_lock(umem->pgs, umem->npgs);
>
> kfree(umem->pgs);
> umem->pgs = NULL;
>
^ permalink raw reply
* Re: [PATCH net-next 01/16] qlge: Remove irq_cnt
From: Benjamin Poirier @ 2019-07-15 1:40 UTC (permalink / raw)
To: David Miller, Greg Kroah-Hartman; +Cc: Manish Chopra, GR-Linux-NIC-Dev, netdev
In-Reply-To: <20190617074858.32467-1-bpoirier@suse.com>
On 2019/06/17 16:48, Benjamin Poirier wrote:
> qlge uses an irq enable/disable refcounting scheme that is:
> * poorly implemented
> Uses a spin_lock to protect accesses to the irq_cnt atomic variable
> * buggy
> Breaks when there is not a 1:1 sequence of irq - napi_poll, such as
> when using SO_BUSY_POLL.
> * unnecessary
> The purpose or irq_cnt is to reduce irq control writes when
> multiple work items result from one irq: the irq is re-enabled
> after all work is done.
> Analysis of the irq handler shows that there is only one case where
> there might be two workers scheduled at once, and those have
> separate irq masking bits.
>
> Therefore, remove irq_cnt.
>
> Additionally, we get a performance improvement:
> perf stat -e cycles -a -r5 super_netperf 100 -H 192.168.33.1 -t TCP_RR
>
> Before:
> 628560
> 628056
> 622103
> 622744
> 627202
> [...]
> 268,803,947,669 cycles ( +- 0.09% )
>
> After:
> 636300
> 634106
> 634984
> 638555
> 634188
> [...]
> 259,237,291,449 cycles ( +- 0.19% )
>
> Signed-off-by: Benjamin Poirier <bpoirier@suse.com>
David, Greg,
Before I send v2 of this patchset, how about moving this driver out to
staging? The hardware has been declared EOL by the vendor but what's
more relevant to the Linux kernel is that the quality of this driver is
not on par with many other mainline drivers, IMO. Over in staging, the
code might benefit from the attention of interested parties (drive-by
fixers) or fade away into obscurity.
Now that I check, the code has >500 basic checkpatch issues. While
working on the rx processing code for the current patchset, I noticed
the following more structural issues:
* commit 7c734359d350 ("qlge: Size RX buffers based on MTU.",
v2.6.33-rc1) introduced dead code in the receive routines, which
should be rewritten anyways by the admission of the author himself
(see the comment above ql_build_rx_skb())
* truesize accounting is incorrect (ex: a 9000B frame has truesize 10280
while containing two frags of order-1 allocations, ie. >16K)
* in some cases the driver allocates skbs with head room but only puts
data in the frags
* there is an inordinate amount of disparate debugging code, most of
which is of questionable value. In particular, qlge_dbg.c has hundreds
of lines of code bitrotting away in ifdef land (doesn't compile since
commit 18c49b91777c ("qlge: do vlan cleanup", v3.1-rc1), 8 years ago).
* triggering an ethtool regdump will instead hexdump a 176k struct to
dmesg depending on some module parameters
* many structures are defined full of holes, as we found in this
thread already; are used inefficiently (struct rx_ring is used for rx
and tx completions, with some members relevant to one case only); are
initialized redundantly (ex. memset 0 after alloc_etherdev). The
driver also has a habit of using runtime checks where compile time
checks are possible.
* in terms of namespace, the driver uses either qlge_, ql_ (used by
other qlogic drivers, with clashes, ex: ql_sem_spinlock) or nothing (with
clashes, ex: struct ob_mac_iocb_req)
I'm guessing other parts of the driver have more issues of the same
nature. The driver also has many smaller issues like deprecated apis,
duplicate or useless comments, weird code structure (some "while" loops
that could be rewritten with simple "for", ex. ql_wait_reg_rdy()) and
weird line wrapping (all over, ex. the ql_set_routing_reg() calls in
qlge_set_multicast_list()).
I'm aware of some precedents for code moving from mainline to staging:
1bb8155080c6 ncpfs: move net/ncpfs to drivers/staging/ncpfs (v4.16-rc1)
6c391ff758eb irda: move drivers/net/irda to drivers/staging/irda/drivers
(v4.14-rc1)
What do you think is the best course of action in this case?
Thanks,
-Benjamin
^ permalink raw reply
* Re: [PATCH] mm/gup: Use put_user_page*() instead of put_page*()
From: Jens Axboe @ 2019-07-15 2:33 UTC (permalink / raw)
To: Bharath Vedartham, akpm, ira.weiny, jhubbard
Cc: Mauro Carvalho Chehab, Dimitri Sivanich, Arnd Bergmann,
Greg Kroah-Hartman, Alex Williamson, Cornelia Huck,
Alexander Viro, Björn Töpel, Magnus Karlsson,
David S. Miller, Alexei Starovoitov, Daniel Borkmann,
Jakub Kicinski, Jesper Dangaard Brouer, John Fastabend,
Enrico Weigelt, Thomas Gleixner, Alexios Zavras, Dan Carpenter,
Max Filippov, Matt Sickler, Kirill A. Shutemov, Keith Busch,
YueHaibing, linux-media, linux-kernel, devel, kvm, linux-block,
linux-fsdevel, linux-mm, netdev, bpf, xdp-newbies
In-Reply-To: <1563131456-11488-1-git-send-email-linux.bhar@gmail.com>
On 7/14/19 1:08 PM, Bharath Vedartham wrote:
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 4ef62a4..b4a4549 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2694,10 +2694,9 @@ static int io_sqe_buffer_register(struct io_ring_ctx *ctx, void __user *arg,
> * if we did partial map, or found file backed vmas,
> * release any pages we did get
> */
> - if (pret > 0) {
> - for (j = 0; j < pret; j++)
> - put_page(pages[j]);
> - }
> + if (pret > 0)
> + put_user_pages(pages, pret);
> +
> if (ctx->account_mem)
> io_unaccount_mem(ctx->user, nr_pages);
> kvfree(imu->bvec);
You handled just the failure case of the buffer registration, but not
the actual free in io_sqe_buffer_unregister().
--
Jens Axboe
^ permalink raw reply
* Re: INFO: rcu detected stall in ext4_write_checks
From: Paul E. McKenney @ 2019-07-15 3:10 UTC (permalink / raw)
To: Theodore Ts'o, Dmitry Vyukov, syzbot, Andreas Dilger,
David Miller, eladr, Ido Schimmel, Jiri Pirko, John Stultz,
linux-ext4, LKML, netdev, syzkaller-bugs, Thomas Gleixner,
Peter Zijlstra, Ingo Molnar
In-Reply-To: <20190714192951.GM26519@linux.ibm.com>
On Sun, Jul 14, 2019 at 12:29:51PM -0700, Paul E. McKenney wrote:
> On Sun, Jul 14, 2019 at 03:05:22PM -0400, Theodore Ts'o wrote:
> > On Sun, Jul 14, 2019 at 05:48:00PM +0300, Dmitry Vyukov wrote:
> > > But short term I don't see any other solution than stop testing
> > > sched_setattr because it does not check arguments enough to prevent
> > > system misbehavior. Which is a pity because syzkaller has found some
> > > bad misconfigurations that were oversight on checking side.
> > > Any other suggestions?
> >
> > Or maybe syzkaller can put its own limitations on what parameters are
> > sent to sched_setattr? In practice, there are any number of ways a
> > root user can shoot themselves in the foot when using sched_setattr or
> > sched_setaffinity, for that matter. I imagine there must be some such
> > constraints already --- or else syzkaller might have set a kernel
> > thread to run with priority SCHED_BATCH, with similar catastrophic
> > effects --- or do similar configurations to make system threads
> > completely unschedulable.
> >
> > Real time administrators who know what they are doing --- and who know
> > that their real-time threads are well behaved --- will always want to
> > be able to do things that will be catastrophic if the real-time thread
> > is *not* well behaved. I don't it is possible to add safety checks
> > which would allow the kernel to automatically detect and reject unsafe
> > configurations.
> >
> > An apt analogy might be civilian versus military aircraft. Most
> > airplanes are designed to be "inherently stable"; that way, modulo
> > buggy/insane control systems like on the 737 Max, the airplane will
> > automatically return to straight and level flight. On the other hand,
> > some military planes (for example, the F-16, F-22, F-36, the
> > Eurofighter, etc.) are sometimes designed to be unstable, since that
> > way they can be more maneuverable.
> >
> > There are use cases for real-time Linux where this flexibility/power
> > vs. stability tradeoff is going to argue for giving root the
> > flexibility to crash the system. Some of these systems might
> > literally involve using real-time Linux in military applications,
> > something for which Paul and I have had some experience. :-)
> >
> > Speaking of sched_setaffinity, one thing which we can do is have
> > syzkaller move all of the system threads to they run on the "system
> > CPU's", and then move the syzkaller processes which are testing the
> > kernel to be on the "system under test CPU's". Then regardless of
> > what priority the syzkaller test programs try to run themselves at,
> > they can't crash the system.
> >
> > Some real-time systems do actually run this way, and it's a
> > recommended configuration which is much safer than letting the
> > real-time threads take over the whole system:
> >
> > http://linuxrealtime.org/index.php/Improving_the_Real-Time_Properties#Isolating_the_Application
>
> Good point! We might still have issues with some per-CPU kthreads,
> but perhaps use of nohz_full would help at least reduce these sorts
> of problems. (There could still be issues on CPUs with more than
> one runnable threads.)
I looked at testing limitations in a bit more detail from an RCU
viewpoint, and came up with the following rough rule of thumb (which of
course might or might not survive actual testing experience, but should at
least be a good place to start). I believe that the sched_setaffinity()
testing rule should be that the SCHED_DEADLINE cycle be no more than
two-thirds of the RCU CPU stall warning timeout, which defaults to 21
seconds in mainline and 60 seconds in many distro kernels.
That is, the SCHED_DEADLINE cycle should never exceed 14 seconds when
testing mainline on the one hand or 40 seconds when testing enterprise
distros on the other.
This assumes quite a bit, though:
o The system has ample memory to spare, and isn't running a
callback-hungry workload. For example, if you "only" have 100MB
of spare memory and you are also repeatedly and concurrently
expanding (say) large source trees from tarballs and then deleting
those source trees, the system might OOM. The reason OOM might
happen is that each close() of a file generates an RCU callback,
and 40 seconds worth of waiting-for-a-grace-period structures
takes up a surprisingly large amount of memory.
So please be careful when combining tests. ;-)
o There are no aggressive real-time workloads on the system.
The reason for this is that RCU is going to start sending IPIs
halfway to the RCU CPU stall timeout, and, in certain situations
on CONFIG_NO_HZ_FULL kernels, much earlier. (These situations
constitute abuse of CONFIG_NO_HZ_FULL, but then again carefully
calibrated abuse is what stress testing is all about.)
o The various RCU kthreads will get a chance to run at least once
during the SCHED_DEADLINE cycle. If in real life, they only
get a chance to run once per two SCHED_DEADLINE cycles, then of
course the 14 seconds becomes 7 and the 40 seconds becomes 20.
o The current RCU CPU stall warning defaults remain in
place. These are set by the CONFIG_RCU_CPU_STALL_TIMEOUT
Kconfig parameter, which may in turn be overridden by the
rcupdate.rcu_cpu_stall_timeout kernel boot parameter.
o The current SCHED_DEADLINE default for providing spare cycles
for other uses remains in place.
o Other kthreads might have other constraints, but given that you
were seeing RCU CPU stall warnings instead of other failures,
the needs of RCU's kthreads seem to be a good place to start.
Again, the candidate rough rule of thumb is that the the SCHED_DEADLINE
cycle be no more than 14 seconds when testing mainline kernels on the one
hand and 40 seconds when testing enterprise distro kernels on the other.
Dmitry, does that help?
Thanx, Paul
^ permalink raw reply
* [PATCH v3 02/24] atm: idt77252: Remove call to memset after dma_alloc_coherent
From: Fuqian Huang @ 2019-07-15 3:17 UTC (permalink / raw)
Cc: Chas Williams, linux-atm-general, netdev, linux-kernel,
Fuqian Huang
In commit 518a2f1925c3
("dma-mapping: zero memory returned from dma_alloc_*"),
dma_alloc_coherent has already zeroed the memory.
So memset is not needed.
Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
---
Changes in v3:
- Use actual commit rather than the merge commit in the commit message
drivers/atm/idt77252.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/drivers/atm/idt77252.c b/drivers/atm/idt77252.c
index 43a14579e80e..df51680e8931 100644
--- a/drivers/atm/idt77252.c
+++ b/drivers/atm/idt77252.c
@@ -1379,7 +1379,6 @@ init_tsq(struct idt77252_dev *card)
printk("%s: can't allocate TSQ.\n", card->name);
return -1;
}
- memset(card->tsq.base, 0, TSQSIZE);
card->tsq.last = card->tsq.base + TSQ_NUM_ENTRIES - 1;
card->tsq.next = card->tsq.last;
--
2.11.0
^ permalink raw reply related
* [PATCH v3 17/24] ethernet: remove redundant memset
From: Fuqian Huang @ 2019-07-15 3:19 UTC (permalink / raw)
Cc: Jay Cliburn, Chris Snook, David S . Miller, Michael Chan,
Vishal Kulkarni, Fugang Duan, Guo-Fu Tseng, Mirko Lindner,
Stephen Hemminger, Tariq Toukan, Saeed Mahameed, Leon Romanovsky,
Jiri Pirko, Ido Schimmel, Jon Mason, Manish Chopra, Rahul Verma,
GR-Linux-NIC-Dev, Samuel Chessman, netdev, linux-kernel,
linux-rdma, Fuqian Huang
kvzalloc already zeroes the memory during the allocation.
pci_alloc_consistent calls dma_alloc_coherent directly.
In commit 518a2f1925c3
("dma-mapping: zero memory returned from dma_alloc_*"),
dma_alloc_coherent has already zeroed the memory.
So the memset after these function is not needed.
Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
---
Changes in v3:
- Use actual commit rather than the merge commit in the commit message
drivers/net/ethernet/atheros/atlx/atl1.c | 2 --
drivers/net/ethernet/atheros/atlx/atl2.c | 1 -
drivers/net/ethernet/broadcom/bnxt/bnxt.c | 2 --
drivers/net/ethernet/chelsio/cxgb4/sched.c | 1 -
drivers/net/ethernet/freescale/fec_main.c | 2 --
drivers/net/ethernet/jme.c | 5 -----
drivers/net/ethernet/marvell/skge.c | 2 --
drivers/net/ethernet/mellanox/mlx4/eq.c | 2 --
drivers/net/ethernet/mellanox/mlx5/core/eswitch.c | 1 -
drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c | 3 ---
drivers/net/ethernet/mellanox/mlxsw/pci.c | 1 -
drivers/net/ethernet/neterion/s2io.c | 1 -
drivers/net/ethernet/qlogic/netxen/netxen_nic_ctx.c | 3 ---
drivers/net/ethernet/ti/tlan.c | 1 -
14 files changed, 27 deletions(-)
diff --git a/drivers/net/ethernet/atheros/atlx/atl1.c b/drivers/net/ethernet/atheros/atlx/atl1.c
index 7c767ce9aafa..b5c6dc914720 100644
--- a/drivers/net/ethernet/atheros/atlx/atl1.c
+++ b/drivers/net/ethernet/atheros/atlx/atl1.c
@@ -1060,8 +1060,6 @@ static s32 atl1_setup_ring_resources(struct atl1_adapter *adapter)
goto err_nomem;
}
- memset(ring_header->desc, 0, ring_header->size);
-
/* init TPD ring */
tpd_ring->dma = ring_header->dma;
offset = (tpd_ring->dma & 0x7) ? (8 - (ring_header->dma & 0x7)) : 0;
diff --git a/drivers/net/ethernet/atheros/atlx/atl2.c b/drivers/net/ethernet/atheros/atlx/atl2.c
index 3a3fb5ce0fee..3aba38322717 100644
--- a/drivers/net/ethernet/atheros/atlx/atl2.c
+++ b/drivers/net/ethernet/atheros/atlx/atl2.c
@@ -291,7 +291,6 @@ static s32 atl2_setup_ring_resources(struct atl2_adapter *adapter)
&adapter->ring_dma);
if (!adapter->ring_vir_addr)
return -ENOMEM;
- memset(adapter->ring_vir_addr, 0, adapter->ring_size);
/* Init TXD Ring */
adapter->txd_dma = adapter->ring_dma ;
diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 3f632028eff0..1069eb01cc55 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -2677,8 +2677,6 @@ static int bnxt_alloc_tx_rings(struct bnxt *bp)
mapping = txr->tx_push_mapping +
sizeof(struct tx_push_bd);
txr->data_mapping = cpu_to_le64(mapping);
-
- memset(txr->tx_push, 0, sizeof(struct tx_push_bd));
}
qidx = bp->tc_to_qidx[j];
ring->queue_id = bp->q_info[qidx].queue_id;
diff --git a/drivers/net/ethernet/chelsio/cxgb4/sched.c b/drivers/net/ethernet/chelsio/cxgb4/sched.c
index ba6c153ee45c..60218dc676a8 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sched.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sched.c
@@ -207,7 +207,6 @@ static int t4_sched_queue_bind(struct port_info *pi, struct ch_sched_queue *p)
goto out_err;
/* Bind queue to specified class */
- memset(qe, 0, sizeof(*qe));
qe->cntxt_id = qid;
memcpy(&qe->param, p, sizeof(qe->param));
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 9d459ccf251d..e5610a4da539 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -3144,8 +3144,6 @@ static int fec_enet_init(struct net_device *ndev)
return -ENOMEM;
}
- memset(cbd_base, 0, bd_size);
-
/* Get the Ethernet address */
fec_get_mac(ndev);
/* make sure MAC we just acquired is programmed into the hw */
diff --git a/drivers/net/ethernet/jme.c b/drivers/net/ethernet/jme.c
index 76b7b7b85e35..0b668357db4d 100644
--- a/drivers/net/ethernet/jme.c
+++ b/drivers/net/ethernet/jme.c
@@ -582,11 +582,6 @@ jme_setup_tx_resources(struct jme_adapter *jme)
if (unlikely(!(txring->bufinf)))
goto err_free_txring;
- /*
- * Initialize Transmit Descriptors
- */
- memset(txring->alloc, 0, TX_RING_ALLOC_SIZE(jme->tx_ring_size));
-
return 0;
err_free_txring:
diff --git a/drivers/net/ethernet/marvell/skge.c b/drivers/net/ethernet/marvell/skge.c
index 35a92fd2cf39..9ac854c2b371 100644
--- a/drivers/net/ethernet/marvell/skge.c
+++ b/drivers/net/ethernet/marvell/skge.c
@@ -2558,8 +2558,6 @@ static int skge_up(struct net_device *dev)
goto free_pci_mem;
}
- memset(skge->mem, 0, skge->mem_size);
-
err = skge_ring_alloc(&skge->rx_ring, skge->mem, skge->dma);
if (err)
goto free_pci_mem;
diff --git a/drivers/net/ethernet/mellanox/mlx4/eq.c b/drivers/net/ethernet/mellanox/mlx4/eq.c
index a5be27772b8e..c790a5fcea73 100644
--- a/drivers/net/ethernet/mellanox/mlx4/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx4/eq.c
@@ -1013,8 +1013,6 @@ static int mlx4_create_eq(struct mlx4_dev *dev, int nent,
dma_list[i] = t;
eq->page_list[i].map = t;
-
- memset(eq->page_list[i].buf, 0, PAGE_SIZE);
}
eq->eqn = mlx4_bitmap_alloc(&priv->eq_table.bitmap);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
index 3b04d8927fb1..1f3891fde2eb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch.c
@@ -2450,7 +2450,6 @@ int mlx5_eswitch_get_vport_stats(struct mlx5_eswitch *esw,
MLX5_SET(query_vport_counter_in, in, vport_number, vport->vport);
MLX5_SET(query_vport_counter_in, in, other_vport, 1);
- memset(out, 0, outlen);
err = mlx5_cmd_exec(esw->dev, in, sizeof(in), out, outlen);
if (err)
goto free_out;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
index 957d9b09dc3f..089ae4d48a82 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eswitch_offloads.c
@@ -1134,7 +1134,6 @@ static int esw_create_offloads_fdb_tables(struct mlx5_eswitch *esw, int nvports)
}
/* create send-to-vport group */
- memset(flow_group_in, 0, inlen);
MLX5_SET(create_flow_group_in, flow_group_in, match_criteria_enable,
MLX5_MATCH_MISC_PARAMETERS);
@@ -1293,8 +1292,6 @@ static int esw_create_vport_rx_group(struct mlx5_eswitch *esw, int nvports)
return -ENOMEM;
/* create vport rx group */
- memset(flow_group_in, 0, inlen);
-
esw_set_flow_group_source_port(esw, flow_group_in);
MLX5_SET(create_flow_group_in, flow_group_in, start_flow_index, 0);
diff --git a/drivers/net/ethernet/mellanox/mlxsw/pci.c b/drivers/net/ethernet/mellanox/mlxsw/pci.c
index 051b19388a81..615455a21567 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/pci.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/pci.c
@@ -847,7 +847,6 @@ static int mlxsw_pci_queue_init(struct mlxsw_pci *mlxsw_pci, char *mbox,
&mem_item->mapaddr);
if (!mem_item->buf)
return -ENOMEM;
- memset(mem_item->buf, 0, mem_item->size);
q->elem_info = kcalloc(q->count, sizeof(*q->elem_info), GFP_KERNEL);
if (!q->elem_info) {
diff --git a/drivers/net/ethernet/neterion/s2io.c b/drivers/net/ethernet/neterion/s2io.c
index 3b2ae1a21678..e0b2bf327905 100644
--- a/drivers/net/ethernet/neterion/s2io.c
+++ b/drivers/net/ethernet/neterion/s2io.c
@@ -747,7 +747,6 @@ static int init_shared_mem(struct s2io_nic *nic)
return -ENOMEM;
}
mem_allocated += size;
- memset(tmp_v_addr, 0, size);
size = sizeof(struct rxd_info) *
rxd_count[nic->rxd_mode];
diff --git a/drivers/net/ethernet/qlogic/netxen/netxen_nic_ctx.c b/drivers/net/ethernet/qlogic/netxen/netxen_nic_ctx.c
index 433052f734ed..5e9f8ee99800 100644
--- a/drivers/net/ethernet/qlogic/netxen/netxen_nic_ctx.c
+++ b/drivers/net/ethernet/qlogic/netxen/netxen_nic_ctx.c
@@ -442,10 +442,8 @@ nx_fw_cmd_create_tx_ctx(struct netxen_adapter *adapter)
goto out_free_rq;
}
- memset(rq_addr, 0, rq_size);
prq = rq_addr;
- memset(rsp_addr, 0, rsp_size);
prsp = rsp_addr;
prq->host_rsp_dma_addr = cpu_to_le64(rsp_phys_addr);
@@ -755,7 +753,6 @@ int netxen_alloc_hw_resources(struct netxen_adapter *adapter)
return -ENOMEM;
}
- memset(addr, 0, sizeof(struct netxen_ring_ctx));
recv_ctx->hwctx = addr;
recv_ctx->hwctx->ctx_id = cpu_to_le32(port);
recv_ctx->hwctx->cmd_consumer_offset =
diff --git a/drivers/net/ethernet/ti/tlan.c b/drivers/net/ethernet/ti/tlan.c
index b4ab1a5f6cd0..78f0f2d59e22 100644
--- a/drivers/net/ethernet/ti/tlan.c
+++ b/drivers/net/ethernet/ti/tlan.c
@@ -855,7 +855,6 @@ static int tlan_init(struct net_device *dev)
dev->name);
return -ENOMEM;
}
- memset(priv->dma_storage, 0, dma_size);
priv->rx_list = (struct tlan_list *)
ALIGN((unsigned long)priv->dma_storage, 8);
priv->rx_list_dma = ALIGN(priv->dma_storage_dma, 8);
--
2.11.0
^ permalink raw reply related
* [PATCH v3 18/24] hippi: Remove call to memset after pci_alloc_consistent
From: Fuqian Huang @ 2019-07-15 3:19 UTC (permalink / raw)
Cc: Jes Sorensen, David S . Miller, linux-hippi, netdev, linux-kernel,
Fuqian Huang
pci_alloc_consistent calls dma_alloc_coherent directly.
In commit 518a2f1925c3
("dma-mapping: zero memory returned from dma_alloc_*"),
dma_alloc_coherent has already zeroed the memory.
So memset is not needed.
Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
---
Changes in v3:
- Use actual commit rather than the merge commit in the commit message
drivers/net/hippi/rrunner.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/drivers/net/hippi/rrunner.c b/drivers/net/hippi/rrunner.c
index 7b9350dbebdd..2a6ec5394966 100644
--- a/drivers/net/hippi/rrunner.c
+++ b/drivers/net/hippi/rrunner.c
@@ -1196,7 +1196,6 @@ static int rr_open(struct net_device *dev)
goto error;
}
rrpriv->rx_ctrl_dma = dma_addr;
- memset(rrpriv->rx_ctrl, 0, 256*sizeof(struct ring_ctrl));
rrpriv->info = pci_alloc_consistent(pdev, sizeof(struct rr_info),
&dma_addr);
@@ -1205,7 +1204,6 @@ static int rr_open(struct net_device *dev)
goto error;
}
rrpriv->info_dma = dma_addr;
- memset(rrpriv->info, 0, sizeof(struct rr_info));
wmb();
spin_lock_irqsave(&rrpriv->lock, flags);
--
2.11.0
^ permalink raw reply related
* [PATCH v3 20/24] wireless: Remove call to memset after dma_alloc_coherent
From: Fuqian Huang @ 2019-07-15 3:19 UTC (permalink / raw)
Cc: Kalle Valo, David S . Miller, Arend van Spriel, Franky Lin,
Hante Meuleman, Chi-Hsien Lin, Wright Feng, Igor Mitsyanko,
Avinash Patil, Sergey Matyukevich, ath10k, linux-wireless, netdev,
linux-kernel, brcm80211-dev-list.pdl, brcm80211-dev-list,
Fuqian Huang
In commit 518a2f1925c3
("dma-mapping: zero memory returned from dma_alloc_*"),
dma_alloc_coherent has already zeroed the memory.
So memset is not needed.
Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
---
Changes in v3:
- Use actual commit rather than the merge commit in the commit message
drivers/net/wireless/ath/ath10k/ce.c | 5 -----
drivers/net/wireless/broadcom/brcm80211/brcmfmac/pcie.c | 2 --
drivers/net/wireless/quantenna/qtnfmac/pcie/pearl_pcie.c | 2 --
drivers/net/wireless/quantenna/qtnfmac/pcie/topaz_pcie.c | 2 --
4 files changed, 11 deletions(-)
diff --git a/drivers/net/wireless/ath/ath10k/ce.c b/drivers/net/wireless/ath/ath10k/ce.c
index eca87f7c5b6c..294fbc1e89ab 100644
--- a/drivers/net/wireless/ath/ath10k/ce.c
+++ b/drivers/net/wireless/ath/ath10k/ce.c
@@ -1704,9 +1704,6 @@ ath10k_ce_alloc_dest_ring_64(struct ath10k *ar, unsigned int ce_id,
/* Correctly initialize memory to 0 to prevent garbage
* data crashing system when download firmware
*/
- memset(dest_ring->base_addr_owner_space_unaligned, 0,
- nentries * sizeof(struct ce_desc_64) + CE_DESC_RING_ALIGN);
-
dest_ring->base_addr_owner_space =
PTR_ALIGN(dest_ring->base_addr_owner_space_unaligned,
CE_DESC_RING_ALIGN);
@@ -2019,8 +2016,6 @@ void ath10k_ce_alloc_rri(struct ath10k *ar)
value |= ar->hw_ce_regs->upd->mask;
ath10k_ce_write32(ar, ce_base_addr + ctrl1_regs, value);
}
-
- memset(ce->vaddr_rri, 0, CE_COUNT * sizeof(u32));
}
EXPORT_SYMBOL(ath10k_ce_alloc_rri);
diff --git a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/pcie.c b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/pcie.c
index 4ea5401c4d6b..5f0437a3fd82 100644
--- a/drivers/net/wireless/broadcom/brcm80211/brcmfmac/pcie.c
+++ b/drivers/net/wireless/broadcom/brcm80211/brcmfmac/pcie.c
@@ -1023,8 +1023,6 @@ brcmf_pcie_init_dmabuffer_for_device(struct brcmf_pciedev_info *devinfo,
address & 0xffffffff);
brcmf_pcie_write_tcm32(devinfo, tcm_dma_phys_addr + 4, address >> 32);
- memset(ring, 0, size);
-
return (ring);
}
diff --git a/drivers/net/wireless/quantenna/qtnfmac/pcie/pearl_pcie.c b/drivers/net/wireless/quantenna/qtnfmac/pcie/pearl_pcie.c
index 3aa3714d4dfd..5ec1c9bc1612 100644
--- a/drivers/net/wireless/quantenna/qtnfmac/pcie/pearl_pcie.c
+++ b/drivers/net/wireless/quantenna/qtnfmac/pcie/pearl_pcie.c
@@ -244,8 +244,6 @@ static int pearl_alloc_bd_table(struct qtnf_pcie_pearl_state *ps)
/* tx bd */
- memset(vaddr, 0, len);
-
ps->bd_table_vaddr = vaddr;
ps->bd_table_paddr = paddr;
ps->bd_table_len = len;
diff --git a/drivers/net/wireless/quantenna/qtnfmac/pcie/topaz_pcie.c b/drivers/net/wireless/quantenna/qtnfmac/pcie/topaz_pcie.c
index 9a4380ed7f1b..1f91088e3dff 100644
--- a/drivers/net/wireless/quantenna/qtnfmac/pcie/topaz_pcie.c
+++ b/drivers/net/wireless/quantenna/qtnfmac/pcie/topaz_pcie.c
@@ -199,8 +199,6 @@ static int topaz_alloc_bd_table(struct qtnf_pcie_topaz_state *ts,
if (!vaddr)
return -ENOMEM;
- memset(vaddr, 0, len);
-
/* tx bd */
ts->tx_bd_vbase = vaddr;
--
2.11.0
^ permalink raw reply related
* [PATCH v3 19/24] vmxnet3: Remove call to memset after dma_alloc_coherent
From: Fuqian Huang @ 2019-07-15 3:21 UTC (permalink / raw)
Cc: Ronak Doshi, VMware Inc ., David S . Miller, netdev, linux-kernel,
Fuqian Huang
In commit 518a2f1925c3
("dma-mapping: zero memory returned from dma_alloc_*"),
dma_alloc_coherent has already zeroed the memory.
So memset is not needed.
Signed-off-by: Fuqian Huang <huangfq.daxian@gmail.com>
---
Changes in v3:
- Use actual commit rather than the merge commit in the commit message
drivers/net/vmxnet3/vmxnet3_drv.c | 1 -
1 file changed, 1 deletion(-)
diff --git a/drivers/net/vmxnet3/vmxnet3_drv.c b/drivers/net/vmxnet3/vmxnet3_drv.c
index 3f48f05dd2a6..2a1918f25e47 100644
--- a/drivers/net/vmxnet3/vmxnet3_drv.c
+++ b/drivers/net/vmxnet3/vmxnet3_drv.c
@@ -3430,7 +3430,6 @@ vmxnet3_probe_device(struct pci_dev *pdev,
err = -ENOMEM;
goto err_ver;
}
- memset(adapter->coal_conf, 0, sizeof(*adapter->coal_conf));
adapter->coal_conf->coalMode = VMXNET3_COALESCE_DISABLED;
adapter->default_coal_mode = true;
}
--
2.11.0
^ permalink raw reply related
* [PATCH net] caif-hsi: fix possible deadlock in cfhsi_exit_module()
From: Taehee Yoo @ 2019-07-15 5:10 UTC (permalink / raw)
To: davem, netdev; +Cc: ap420073
cfhsi_exit_module() calls unregister_netdev() under rtnl_lock().
but unregister_netdev() internally calls rtnl_lock().
So deadlock would occur.
Fixes: c41254006377 ("caif-hsi: Add rtnl support")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
---
I did only compile test because I don't have the testbed.
Could anyone test this patch?
drivers/net/caif/caif_hsi.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/drivers/net/caif/caif_hsi.c b/drivers/net/caif/caif_hsi.c
index b2f10b6ad6e5..bbb2575d4728 100644
--- a/drivers/net/caif/caif_hsi.c
+++ b/drivers/net/caif/caif_hsi.c
@@ -1455,7 +1455,7 @@ static void __exit cfhsi_exit_module(void)
rtnl_lock();
list_for_each_safe(list_node, n, &cfhsi_list) {
cfhsi = list_entry(list_node, struct cfhsi, list);
- unregister_netdev(cfhsi->ndev);
+ unregister_netdevice(cfhsi->ndev);
}
rtnl_unlock();
}
--
2.17.1
^ permalink raw reply related
* Re: [PATCH v3 0/3] kernel/notifier.c: avoid duplicate registration
From: Vasily Averin @ 2019-07-15 5:38 UTC (permalink / raw)
To: Xiaoming Ni, gregkh@linuxfoundation.org
Cc: adobriyan@gmail.com, akpm@linux-foundation.org,
anna.schumaker@netapp.com, arjan@linux.intel.com,
bfields@fieldses.org, chuck.lever@oracle.com, davem@davemloft.net,
jlayton@kernel.org, luto@kernel.org, mingo@kernel.org,
Nadia.Derbey@bull.net, paulmck@linux.vnet.ibm.com,
semen.protsenko@linaro.org, stable@kernel.org,
stern@rowland.harvard.edu, tglx@linutronix.de,
torvalds@linux-foundation.org, trond.myklebust@hammerspace.com,
viresh.kumar@linaro.org, Huangjianhui (Alex), Dailei,
linux-kernel@vger.kernel.org, linux-nfs@vger.kernel.org,
netdev@vger.kernel.org
In-Reply-To: <65f50cf2-3051-ab55-078f-30930fe0c9bc@huawei.com>
On 7/14/19 5:45 AM, Xiaoming Ni wrote:
> On 2019/7/12 22:07, gregkh@linuxfoundation.org wrote:
>> On Fri, Jul 12, 2019 at 09:11:57PM +0800, Xiaoming Ni wrote:
>>> On 2019/7/11 21:57, Vasily Averin wrote:
>>>> On 7/11/19 4:55 AM, Nixiaoming wrote:
>>>>> On Wed, July 10, 2019 1:49 PM Vasily Averin wrote:
>>>>>> On 7/10/19 6:09 AM, Xiaoming Ni wrote:
>>>>>>> Registering the same notifier to a hook repeatedly can cause the hook
>>>>>>> list to form a ring or lose other members of the list.
>>>>>>
>>>>>> I think is not enough to _prevent_ 2nd register attempt,
>>>>>> it's enough to detect just attempt and generate warning to mark host in bad state.
>>>>>>
>>>>>
>>>>> Duplicate registration is prevented in my patch, not just "mark host in bad state"
>>>>>
>>>>> Duplicate registration is checked and exited in notifier_chain_cond_register()
>>>>>
>>>>> Duplicate registration was checked in notifier_chain_register() but only
>>>>> the alarm was triggered without exiting. added by commit 831246570d34692e
>>>>> ("kernel/notifier.c: double register detection")
>>>>>
>>>>> My patch is like a combination of 831246570d34692e and notifier_chain_cond_register(),
>>>>> which triggers an alarm and exits when a duplicate registration is detected.
>>>>>
>>>>>> Unexpected 2nd register of the same hook most likely will lead to 2nd unregister,
>>>>>> and it can lead to host crash in any time:
>>>>>> you can unregister notifier on first attempt it can be too early, it can be still in use.
>>>>>> on the other hand you can never call 2nd unregister at all.
>>>>>
>>>>> Since the member was not added to the linked list at the time of the second registration,
>>>>> no linked list ring was formed.
>>>>> The member is released on the first unregistration and -ENOENT on the second unregistration.
>>>>> After patching, the fault has been alleviated
>>>>
>>>> You are wrong here.
>>>> 2nd notifier's registration is a pure bug, this should never happen.
>>>> If you know the way to reproduce this situation -- you need to fix it.
>>>>
>>>> 2nd registration can happen in 2 cases:
>>>> 1) missed rollback, when someone forget to call unregister after successfull registration,
>>>> and then tried to call register again. It can lead to crash for example when according module will be unloaded.
>>>> 2) some subsystem is registered twice, for example from different namespaces.
>>>> in this case unregister called during sybsystem cleanup in first namespace will incorrectly remove notifier used
>>>> in second namespace, it also can lead to unexpacted behaviour.
>>>>
>>> So in these two cases, is it more reasonable to trigger BUG() directly when checking for duplicate registration ?
>>> But why does current notifier_chain_register() just trigger WARN() without exiting ?
>>> notifier_chain_cond_register() direct exit without triggering WARN() ?
>>
>> It should recover from this, if it can be detected. The main point is
>> that not all apis have to be this "robust" when used within the kernel
>> as we do allow for the callers to know what they are doing :)
>>
> In the notifier_chain_register(), the condition ( (*nl) == n) is the same registration of the same hook.
> We can intercept this situation and avoid forming a linked list ring to make the API more rob
Once again -- yes, you CAN prevent list corruption, but you CANNOT recover the host and return it back to safe state.
If double register event was detected -- it means you have bug in kernel.
Yes, you can add BUG here and crash the host immediately, but I prefer to use warning in such situations.
>> If this does not cause any additional problems or slow downs, it's
>> probably fine to add.
>>
> Notifier_chain_register() is not a system hotspot function.
> At the same time, there is already a WARN_ONCE judgment. There is no new judgment in the new patch.
> It only changes the processing under the condition of (*nl) == n, which will not cause performance problems.
> At the same time, avoiding the formation of a link ring can make the system more robust.
I disagree,
yes, node will have correct list, but anyway node will work wrong and can crash the host in any time.
>> thanks,
>>
>> greg k-h
>>
>> .
>>
> Thanks
>
> Xiaoming Ni
>
>
>
^ permalink raw reply
* Re: [PATCH 8/8] docs: remove extra conf.py files
From: Markus Heiser @ 2019-07-15 6:16 UTC (permalink / raw)
To: Mauro Carvalho Chehab, Linux Doc Mailing List
Cc: Mauro Carvalho Chehab, linux-kernel, Jonathan Corbet, Herbert Xu,
David S. Miller, David Airlie, Daniel Vetter, Maarten Lankhorst,
Maxime Ripard, Sean Paul, Dmitry Torokhov, Yoshinori Sato,
Rich Felker, Jaroslav Kysela, Takashi Iwai, Thomas Gleixner,
Ingo Molnar, Borislav Petkov, H. Peter Anvin, x86, linux-crypto,
dri-devel, linux-input, netdev, linux-sh, alsa-devel
In-Reply-To: <12a160afc9e70156f671010bd4ccff9311acdc5e.1563115732.git.mchehab+samsung@kernel.org>
Hi Mauro,
sorry, I havn't tested your patch, but one question ...
Am 14.07.19 um 17:10 schrieb Mauro Carvalho Chehab:
> Now that the latex_documents are handled automatically, we can
> remove those extra conf.py files.
We need these conf.py also for compiling books' into HTML. For this
the tags.add("subproject") is needed. Should we realy drop this feature?
-- Markus --
^ permalink raw reply
* [PATCH] ethtool: igb: dump RR2DCDELAY register
From: Artem Bityutskiy @ 2019-07-15 6:52 UTC (permalink / raw)
To: John W . Linville; +Cc: netdev
Decode 'RR2DCDELAY' register which Linux kernel provides starting from version
5.3. The corresponding commit in the Linux kernel is:
cd502a7f7c9c igb: add RR2DCDELAY to ethtool registers dump
The RR2DCDELAY register is present in I210 and I211 Intel Gigabit Ethernet
chips and it stands for "Read Request To Data Completion Delay". Here is how
this register is described in the I210 datasheet:
"This field captures the maximum PCIe split time in 16 ns units, which is the
maximum delay between the read request to the first data completion. This is
giving an estimation of the PCIe round trip time."
In practice, this register can be used to measure the time it takes the NIC to
read data from the host memory.
Signed-off-by: Artem Bityutskiy <artem.bityutskiy@linux.intel.com>
---
igb.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/igb.c b/igb.c
index e0ccef9..ab0896f 100644
--- a/igb.c
+++ b/igb.c
@@ -859,6 +859,18 @@ igb_dump_regs(struct ethtool_drvinfo *info, struct ethtool_regs *regs)
"0x03430: TDFPC (Tx data FIFO packet count) 0x%08X\n",
regs_buff[550]);
+ /*
+ * Starting from kernel version 5.3 the registers dump buffer grew from
+ * 739 4-byte words to 740 words, and word 740 contains the RR2DCDLAY
+ * register.
+ */
+ if (regs->len < 740)
+ return 0;
+
+ fprintf(stdout,
+ "0x05BF4: RR2DCDELAY (Max. DMA read delay) 0x%08X\n",
+ regs_buff[739]);
+
return 0;
}
--
2.20.1
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox