* Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics
From: Paolo Bonzini @ 2020-05-27 21:07 UTC (permalink / raw)
To: Jakub Kicinski, Emanuele Giuseppe Esposito
Cc: linux-s390, kvm, linux-doc, netdev, Emanuele Giuseppe Esposito,
linux-kernel, kvm-ppc, Jonathan Adams, Christian Borntraeger,
Andrew Lunn, Alexander Viro, David Rientjes, linux-fsdevel,
linux-mips, linuxppc-dev, linux-arm-kernel, Jim Mattson
In-Reply-To: <20200527132321.54bcdf04@kicinski-fedora-PC1C0HJN.hsd1.ca.comcast.net>
On 27/05/20 22:23, Jakub Kicinski wrote:
> On Wed, 27 May 2020 15:14:41 +0200 Emanuele Giuseppe Esposito wrote:
>> Regarding the config, as I said the idea is to gather multiple
>> subsystems' statistics, therefore there wouldn't be a single
>> configuration method like in netlink.
>> For example in kvm there are file descriptors for configuration, and
>> creating them requires no privilege, contrary to the network interfaces.
>
> Enumerating networking interfaces, addresses, and almost all of the
> configuration requires no extra privilege. In fact I'd hope that
> whatever daemon collects network stats doesn't run as root :)
>
> I think enumerating objects is of primary importance, and statistics
> of those objects are subordinate.
I see what you meant now. statsfs can also be used to enumerate objects
if one is so inclined (with the prototype in patch 7, for example, each
network interface becomes a directory).
> Again, I have little KVM knowledge, but BPF also uses a fd-based API,
> and carries stats over the same syscall interface.
Can BPF stats (for BPF scripts created by whatever process is running in
the system) be collected by an external daemon that does not have access
to the file descriptor? For KVM it's of secondary importance to gather
stats in the program; it can be nice to have and we are thinking of a
way to export the stats over the fd-based API, but it's less useful than
system-wide monitoring. Perhaps this is a difference between the two.
Another case where stats and configuration are separate is CPUs, where
CPU enumeration is done in sysfs but statistics are exposed in various
procfs files such as /proc/interrupts and /proc/stats.
Thanks,
Paolo
^ permalink raw reply
* Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics
From: Andrew Lunn @ 2020-05-27 21:17 UTC (permalink / raw)
To: Emanuele Giuseppe Esposito
Cc: linux-s390, kvm, linux-doc, netdev, Emanuele Giuseppe Esposito,
linux-kernel, kvm-ppc, Jonathan Adams, Christian Borntraeger,
Alexander Viro, David Rientjes, linux-fsdevel, Paolo Bonzini,
linux-mips, linuxppc-dev, linux-arm-kernel, Jim Mattson
In-Reply-To: <20200526110318.69006-1-eesposit@redhat.com>
On Tue, May 26, 2020 at 01:03:10PM +0200, Emanuele Giuseppe Esposito wrote:
> There is currently no common way for Linux kernel subsystems to expose
> statistics to userspace shared throughout the Linux kernel; subsystems have
> to take care of gathering and displaying statistics by themselves, for
> example in the form of files in debugfs. For example KVM has its own code
> section that takes care of this in virt/kvm/kvm_main.c, where it sets up
> debugfs handlers for displaying values and aggregating them from various
> subfolders to obtain information about the system state (i.e. displaying
> the total number of exits, calculated by summing all exits of all cpus of
> all running virtual machines).
>
> Allowing each section of the kernel to do so has two disadvantages. First,
> it will introduce redundant code. Second, debugfs is anyway not the right
> place for statistics (for example it is affected by lockdown)
>
> In this patch series I introduce statsfs, a synthetic ram-based virtual
> filesystem that takes care of gathering and displaying statistics for the
> Linux kernel subsystems.
>
> The file system is mounted on /sys/kernel/stats and would be already used
> by kvm. Statsfs was initially introduced by Paolo Bonzini [1].
>
> Statsfs offers a generic and stable API, allowing any kind of
> directory/file organization and supporting multiple kind of aggregations
> (not only sum, but also average, max, min and count_zero) and data types
> (boolean, unsigned/signed and custom types). The implementation, which is
> a generalization of KVM’s debugfs statistics code, takes care of gathering
> and displaying information at run time; users only need to specify the
> values to be included in each source.
>
> Statsfs would also be a different mountpoint from debugfs, and would not
> suffer from limited access due to the security lock down patches. Its main
> function is to display each statistics as a file in the desired folder
> hierarchy defined through the API. Statsfs files can be read, and possibly
> cleared if their file mode allows it.
>
> Statsfs has two main components: the public API defined by
> include/linux/statsfs.h, and the virtual file system which should end up in
> /sys/kernel/stats.
>
Hi Emanuele
> The API has two main elements, values and sources. Kernel subsystems like
> KVM can use the API to create a source, add child sources/values/aggregates
> and register it to the root source (that on the virtual fs would be
> /sys/kernel/statsfs).
Another issue i see with networking is that statistic counters can be
dynamic. They can come and go. One of the drivers i work on has extra
statistics available when a fibre interface is used, compared to a
copper interface. And this happens at run time. The netlink API has no
problems with this. It is a snapshot of what counters are currently
available. There is no state in the API.
In my humble opinion, networking is unlikely to adopt your approach.
You probably want to look around for other subsystems which have
statistics, and see if you can cover their requirements, and get them
on board.
Andrew
^ permalink raw reply
* Re: [PATCH v3 5/8] mm: don't track number of pages during deferred initialization
From: Alexander Duyck @ 2020-05-27 21:18 UTC (permalink / raw)
To: Daniel Jordan
Cc: David Hildenbrand, Peter Zijlstra, Dave Hansen, Michal Hocko,
linux-mm, Steven Sistare, Pavel Machek, Alexander Duyck,
Steffen Klassert, linux-s390, Herbert Xu, Jonathan Corbet,
Jason Gunthorpe, Zi Yan, Robert Elliott, Pavel Tatashin,
Shile Zhang, Josh Triplett, Alex Williamson, Kirill Tkhai,
Dan Williams, Randy Dunlap, LKML, linux-crypto, Tejun Heo,
Andrew Morton, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)
In-Reply-To: <20200527173608.2885243-6-daniel.m.jordan@oracle.com>
On Wed, May 27, 2020 at 10:37 AM Daniel Jordan
<daniel.m.jordan@oracle.com> wrote:
>
> Deferred page init used to report the number of pages initialized:
>
> node 0 initialised, 32439114 pages in 97ms
>
> Tracking this makes the code more complicated when using multiple
> threads. Given that the statistic probably has limited value,
> especially since a zone grows on demand so that the page count can vary,
> just remove it.
>
> The boot message now looks like
>
> node 0 deferred pages initialised in 97ms
>
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Suggested-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
This looks good to me.
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
> ---
> mm/page_alloc.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d0c0d9364aa6d..d64f3027fdfa6 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1819,7 +1819,7 @@ static int __init deferred_init_memmap(void *data)
> {
> pg_data_t *pgdat = data;
> const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
> - unsigned long spfn = 0, epfn = 0, nr_pages = 0;
> + unsigned long spfn = 0, epfn = 0;
> unsigned long first_init_pfn, flags;
> unsigned long start = jiffies;
> struct zone *zone;
> @@ -1868,15 +1868,15 @@ static int __init deferred_init_memmap(void *data)
> * allocator.
> */
> while (spfn < epfn) {
> - nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> + deferred_init_maxorder(&i, zone, &spfn, &epfn);
> cond_resched();
> }
> zone_empty:
> /* Sanity check that the next zone really is unpopulated */
> WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
>
> - pr_info("node %d initialised, %lu pages in %ums\n",
> - pgdat->node_id, nr_pages, jiffies_to_msecs(jiffies - start));
> + pr_info("node %d deferred pages initialised in %ums\n",
> + pgdat->node_id, jiffies_to_msecs(jiffies - start));
>
> pgdat_init_report_one_done();
> return 0;
> --
> 2.26.2
>
>
^ permalink raw reply
* Re: [PATCH v3 6/8] mm: parallelize deferred_init_memmap()
From: Alexander Duyck @ 2020-05-27 21:27 UTC (permalink / raw)
To: Daniel Jordan
Cc: David Hildenbrand, Peter Zijlstra, Dave Hansen, Michal Hocko,
linux-mm, Steven Sistare, Pavel Machek, Alexander Duyck,
Steffen Klassert, linux-s390, Herbert Xu, Jonathan Corbet,
Jason Gunthorpe, Zi Yan, Robert Elliott, Pavel Tatashin,
Shile Zhang, Josh Triplett, Alex Williamson, Kirill Tkhai,
Dan Williams, Randy Dunlap, LKML, linux-crypto, Tejun Heo,
Andrew Morton, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)
In-Reply-To: <20200527173608.2885243-7-daniel.m.jordan@oracle.com>
On Wed, May 27, 2020 at 10:37 AM Daniel Jordan
<daniel.m.jordan@oracle.com> wrote:
>
> Deferred struct page init is a significant bottleneck in kernel boot.
> Optimizing it maximizes availability for large-memory systems and allows
> spinning up short-lived VMs as needed without having to leave them
> running. It also benefits bare metal machines hosting VMs that are
> sensitive to downtime. In projects such as VMM Fast Restart[1], where
> guest state is preserved across kexec reboot, it helps prevent
> application and network timeouts in the guests.
>
> Multithread to take full advantage of system memory bandwidth.
>
> The maximum number of threads is capped at the number of CPUs on the
> node because speedups always improve with additional threads on every
> system tested, and at this phase of boot, the system is otherwise idle
> and waiting on page init to finish.
>
> Helper threads operate on section-aligned ranges to both avoid false
> sharing when setting the pageblock's migrate type and to avoid accessing
> uninitialized buddy pages, though max order alignment is enough for the
> latter.
>
> The minimum chunk size is also a section. There was benefit to using
> multiple threads even on relatively small memory (1G) systems, and this
> is the smallest size that the alignment allows.
>
> The time (milliseconds) is the slowest node to initialize since boot
> blocks until all nodes finish. intel_pstate is loaded in active mode
> without hwp and with turbo enabled, and intel_idle is active as well.
>
> Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal)
> 2 nodes * 26 cores * 2 threads = 104 CPUs
> 384G/node = 768G memory
>
> kernel boot deferred init
> ------------------------ ------------------------
> node% (thr) speedup time_ms (stdev) speedup time_ms (stdev)
> ( 0) -- 4089.7 ( 8.1) -- 1785.7 ( 7.6)
> 2% ( 1) 1.7% 4019.3 ( 1.5) 3.8% 1717.7 ( 11.8)
> 12% ( 6) 34.9% 2662.7 ( 2.9) 79.9% 359.3 ( 0.6)
> 25% ( 13) 39.9% 2459.0 ( 3.6) 91.2% 157.0 ( 0.0)
> 37% ( 19) 39.2% 2485.0 ( 29.7) 90.4% 172.0 ( 28.6)
> 50% ( 26) 39.3% 2482.7 ( 25.7) 90.3% 173.7 ( 30.0)
> 75% ( 39) 39.0% 2495.7 ( 5.5) 89.4% 190.0 ( 1.0)
> 100% ( 52) 40.2% 2443.7 ( 3.8) 92.3% 138.0 ( 1.0)
>
> Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, kvm guest)
> 1 node * 16 cores * 2 threads = 32 CPUs
> 192G/node = 192G memory
>
> kernel boot deferred init
> ------------------------ ------------------------
> node% (thr) speedup time_ms (stdev) speedup time_ms (stdev)
> ( 0) -- 1988.7 ( 9.6) -- 1096.0 ( 11.5)
> 3% ( 1) 1.1% 1967.0 ( 17.6) 0.3% 1092.7 ( 11.0)
> 12% ( 4) 41.1% 1170.3 ( 14.2) 73.8% 287.0 ( 3.6)
> 25% ( 8) 47.1% 1052.7 ( 21.9) 83.9% 177.0 ( 13.5)
> 38% ( 12) 48.9% 1016.3 ( 12.1) 86.8% 144.7 ( 1.5)
> 50% ( 16) 48.9% 1015.7 ( 8.1) 87.8% 134.0 ( 4.4)
> 75% ( 24) 49.1% 1012.3 ( 3.1) 88.1% 130.3 ( 2.3)
> 100% ( 32) 49.5% 1004.0 ( 5.3) 88.5% 125.7 ( 2.1)
>
> Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal)
> 2 nodes * 18 cores * 2 threads = 72 CPUs
> 128G/node = 256G memory
>
> kernel boot deferred init
> ------------------------ ------------------------
> node% (thr) speedup time_ms (stdev) speedup time_ms (stdev)
> ( 0) -- 1680.0 ( 4.6) -- 627.0 ( 4.0)
> 3% ( 1) 0.3% 1675.7 ( 4.5) -0.2% 628.0 ( 3.6)
> 11% ( 4) 25.6% 1250.7 ( 2.1) 67.9% 201.0 ( 0.0)
> 25% ( 9) 30.7% 1164.0 ( 17.3) 81.8% 114.3 ( 17.7)
> 36% ( 13) 31.4% 1152.7 ( 10.8) 84.0% 100.3 ( 17.9)
> 50% ( 18) 31.5% 1150.7 ( 9.3) 83.9% 101.0 ( 14.1)
> 75% ( 27) 31.7% 1148.0 ( 5.6) 84.5% 97.3 ( 6.4)
> 100% ( 36) 32.0% 1142.3 ( 4.0) 85.6% 90.0 ( 1.0)
>
> AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
> 1 node * 8 cores * 2 threads = 16 CPUs
> 64G/node = 64G memory
>
> kernel boot deferred init
> ------------------------ ------------------------
> node% (thr) speedup time_ms (stdev) speedup time_ms (stdev)
> ( 0) -- 1029.3 ( 25.1) -- 240.7 ( 1.5)
> 6% ( 1) -0.6% 1036.0 ( 7.8) -2.2% 246.0 ( 0.0)
> 12% ( 2) 11.8% 907.7 ( 8.6) 44.7% 133.0 ( 1.0)
> 25% ( 4) 13.9% 886.0 ( 10.6) 62.6% 90.0 ( 6.0)
> 38% ( 6) 17.8% 845.7 ( 14.2) 69.1% 74.3 ( 3.8)
> 50% ( 8) 16.8% 856.0 ( 22.1) 72.9% 65.3 ( 5.7)
> 75% ( 12) 15.4% 871.0 ( 29.2) 79.8% 48.7 ( 7.4)
> 100% ( 16) 21.0% 813.7 ( 21.0) 80.5% 47.0 ( 5.2)
>
> Server-oriented distros that enable deferred page init sometimes run in
> small VMs, and they still benefit even though the fraction of boot time
> saved is smaller:
>
> AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
> 1 node * 2 cores * 2 threads = 4 CPUs
> 16G/node = 16G memory
>
> kernel boot deferred init
> ------------------------ ------------------------
> node% (thr) speedup time_ms (stdev) speedup time_ms (stdev)
> ( 0) -- 716.0 ( 14.0) -- 49.7 ( 0.6)
> 25% ( 1) 1.8% 703.0 ( 5.3) -4.0% 51.7 ( 0.6)
> 50% ( 2) 1.6% 704.7 ( 1.2) 43.0% 28.3 ( 0.6)
> 75% ( 3) 2.7% 696.7 ( 13.1) 49.7% 25.0 ( 0.0)
> 100% ( 4) 4.1% 687.0 ( 10.4) 55.7% 22.0 ( 0.0)
>
> Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, kvm guest)
> 1 node * 2 cores * 2 threads = 4 CPUs
> 14G/node = 14G memory
>
> kernel boot deferred init
> ------------------------ ------------------------
> node% (thr) speedup time_ms (stdev) speedup time_ms (stdev)
> ( 0) -- 787.7 ( 6.4) -- 122.3 ( 0.6)
> 25% ( 1) 0.2% 786.3 ( 10.8) -2.5% 125.3 ( 2.1)
> 50% ( 2) 5.9% 741.0 ( 13.9) 37.6% 76.3 ( 19.7)
> 75% ( 3) 8.3% 722.0 ( 19.0) 49.9% 61.3 ( 3.2)
> 100% ( 4) 9.3% 714.7 ( 9.5) 56.4% 53.3 ( 1.5)
>
> On Josh's 96-CPU and 192G memory system:
>
> Without this patch series:
> [ 0.487132] node 0 initialised, 23398907 pages in 292ms
> [ 0.499132] node 1 initialised, 24189223 pages in 304ms
> ...
> [ 0.629376] Run /sbin/init as init process
>
> With this patch series:
> [ 0.231435] node 1 initialised, 24189223 pages in 32ms
> [ 0.236718] node 0 initialised, 23398907 pages in 36ms
>
> [1] https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf
>
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> Tested-by: Josh Triplett <josh@joshtriplett.org>
> ---
> mm/Kconfig | 6 +++---
> mm/page_alloc.c | 46 ++++++++++++++++++++++++++++++++++++++++------
> 2 files changed, 43 insertions(+), 9 deletions(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index c1acc34c1c358..04c1da3f9f44c 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -750,13 +750,13 @@ config DEFERRED_STRUCT_PAGE_INIT
> depends on SPARSEMEM
> depends on !NEED_PER_CPU_KM
> depends on 64BIT
> + select PADATA
> help
> Ordinarily all struct pages are initialised during early boot in a
> single thread. On very large machines this can take a considerable
> amount of time. If this option is set, large machines will bring up
> - a subset of memmap at boot and then initialise the rest in parallel
> - by starting one-off "pgdatinitX" kernel thread for each node X. This
> - has a potential performance impact on processes running early in the
> + a subset of memmap at boot and then initialise the rest in parallel.
> + This has a potential performance impact on tasks running early in the
> lifetime of the system until these kthreads finish the
> initialisation.
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d64f3027fdfa6..1d47016849531 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -68,6 +68,7 @@
> #include <linux/lockdep.h>
> #include <linux/nmi.h>
> #include <linux/psi.h>
> +#include <linux/padata.h>
>
> #include <asm/sections.h>
> #include <asm/tlbflush.h>
> @@ -1814,6 +1815,26 @@ deferred_init_maxorder(u64 *i, struct zone *zone, unsigned long *start_pfn,
> return nr_pages;
> }
>
> +static void __init
> +deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
> + void *arg)
> +{
> + unsigned long spfn, epfn;
> + struct zone *zone = arg;
> + u64 i;
> +
> + deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn, start_pfn);
> +
> + /*
> + * Initialize and free pages in MAX_ORDER sized increments so that we
> + * can avoid introducing any issues with the buddy allocator.
> + */
> + while (spfn < end_pfn) {
> + deferred_init_maxorder(&i, zone, &spfn, &epfn);
> + cond_resched();
> + }
> +}
> +
> /* Initialise remaining memory on a node */
> static int __init deferred_init_memmap(void *data)
> {
> @@ -1823,7 +1844,7 @@ static int __init deferred_init_memmap(void *data)
> unsigned long first_init_pfn, flags;
> unsigned long start = jiffies;
> struct zone *zone;
> - int zid;
> + int zid, max_threads;
> u64 i;
>
> /* Bind memory initialisation thread to a local node if possible */
> @@ -1863,13 +1884,26 @@ static int __init deferred_init_memmap(void *data)
> goto zone_empty;
>
> /*
> - * Initialize and free pages in MAX_ORDER sized increments so
> - * that we can avoid introducing any issues with the buddy
> - * allocator.
> + * More CPUs always led to greater speedups on tested systems, up to
> + * all the nodes' CPUs. Use all since the system is otherwise idle now.
> */
> + max_threads = max(cpumask_weight(cpumask), 1u);
> +
> while (spfn < epfn) {
> - deferred_init_maxorder(&i, zone, &spfn, &epfn);
> - cond_resched();
> + unsigned long epfn_align = ALIGN(epfn, PAGES_PER_SECTION);
> + struct padata_mt_job job = {
> + .thread_fn = deferred_init_memmap_chunk,
> + .fn_arg = zone,
> + .start = spfn,
> + .size = epfn_align - spfn,
> + .align = PAGES_PER_SECTION,
> + .min_chunk = PAGES_PER_SECTION,
> + .max_threads = max_threads,
> + };
> +
> + padata_do_multithreaded(&job);
> + deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
> + epfn_align);
> }
> zone_empty:
> /* Sanity check that the next zone really is unpopulated */
So I am not a huge fan of using deferred_init_mem_pfn_range_in zone
simply for the fact that we end up essentially discarding the i value
and will have to walk the list repeatedly. However I don't think the
overhead will be that great as I suspect there aren't going to be
systems with that many ranges. So this is probably fine.
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
^ permalink raw reply
* Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics
From: Jakub Kicinski @ 2020-05-27 21:27 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Emanuele Giuseppe Esposito, linux-s390, kvm, linux-doc, netdev,
Emanuele Giuseppe Esposito, linux-kernel, kvm-ppc, Jonathan Adams,
Christian Borntraeger, Andrew Lunn, Alexander Viro,
David Rientjes, linux-fsdevel, linux-mips, linuxppc-dev,
linux-arm-kernel, Jim Mattson
In-Reply-To: <af2ba926-73bc-26c3-7ce7-bd45f657fd85@redhat.com>
On Wed, 27 May 2020 23:07:53 +0200 Paolo Bonzini wrote:
> > Again, I have little KVM knowledge, but BPF also uses a fd-based API,
> > and carries stats over the same syscall interface.
>
> Can BPF stats (for BPF scripts created by whatever process is running in
> the system) be collected by an external daemon that does not have access
> to the file descriptor? For KVM it's of secondary importance to gather
> stats in the program; it can be nice to have and we are thinking of a
> way to export the stats over the fd-based API, but it's less useful than
> system-wide monitoring. Perhaps this is a difference between the two.
Yes, check out bpftool prog list (bpftool code is under tools/bpf/ in
the kernel tree). BPF statistics are under a static key, so you may not
see any on your system. My system shows e.g.:
81: kprobe name abc tag cefaa9376bdaae75 gpl run_time_ns 80941 run_cnt 152
loaded_at 2020-05-26T13:00:24-0700 uid 0
xlated 512B jited 307B memlock 4096B map_ids 66,64
btf_id 16
In this example run_time_ns and run_cnt are stats.
The first number on the left is the program ID. BPF has an IDA, and
each object gets an integer id. So admin (or CAP_BPF, I think) can
iterate over the ids and open fds to objects of interest.
> Another case where stats and configuration are separate is CPUs, where
> CPU enumeration is done in sysfs but statistics are exposed in various
> procfs files such as /proc/interrupts and /proc/stats.
True, but I'm guessing everyone is just okay living with the legacy
procfs format there. Otherwise I'd guess the stats would had been added
to sysfs. I'd be curious to hear the full story there.
^ permalink raw reply
* Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics
From: Paolo Bonzini @ 2020-05-27 21:44 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Emanuele Giuseppe Esposito, linux-s390, kvm, linux-doc, netdev,
Emanuele Giuseppe Esposito, linux-kernel, kvm-ppc, Jonathan Adams,
Christian Borntraeger, Andrew Lunn, Alexander Viro,
David Rientjes, linux-fsdevel, linux-mips, linuxppc-dev,
linux-arm-kernel, Jim Mattson
In-Reply-To: <20200527142741.77e7de37@kicinski-fedora-PC1C0HJN.hsd1.ca.comcast.net>
On 27/05/20 23:27, Jakub Kicinski wrote:
> On Wed, 27 May 2020 23:07:53 +0200 Paolo Bonzini wrote:
>>> Again, I have little KVM knowledge, but BPF also uses a fd-based API,
>>> and carries stats over the same syscall interface.
>>
>> Can BPF stats (for BPF scripts created by whatever process is running in
>> the system) be collected by an external daemon that does not have access
>> to the file descriptor? For KVM it's of secondary importance to gather
>> stats in the program; it can be nice to have and we are thinking of a
>> way to export the stats over the fd-based API, but it's less useful than
>> system-wide monitoring. Perhaps this is a difference between the two.
>
> Yes, check out bpftool prog list (bpftool code is under tools/bpf/ in
> the kernel tree). BPF statistics are under a static key, so you may not
> see any on your system. My system shows e.g.:
>
> 81: kprobe name abc tag cefaa9376bdaae75 gpl run_time_ns 80941 run_cnt 152
> loaded_at 2020-05-26T13:00:24-0700 uid 0
> xlated 512B jited 307B memlock 4096B map_ids 66,64
> btf_id 16
>
> In this example run_time_ns and run_cnt are stats.
>
> The first number on the left is the program ID. BPF has an IDA, and
> each object gets an integer id. So admin (or CAP_BPF, I think) can
> iterate over the ids and open fds to objects of interest.
Got it, thanks. But then "I'd hope that whatever daemon collects [BPF]
stats doesn't run as root". :)
>> Another case where stats and configuration are separate is CPUs, where
>> CPU enumeration is done in sysfs but statistics are exposed in various
>> procfs files such as /proc/interrupts and /proc/stats.
>
> True, but I'm guessing everyone is just okay living with the legacy
> procfs format there. Otherwise I'd guess the stats would had been added
> to sysfs. I'd be curious to hear the full story there.
Yeah, it's a chicken-and-egg problem in that there's no good place in
sysfs to put statistics right now, which is part of what this filesystem
is trying to solve (the other part is the API).
You can read more about Google's usecase at
http://lkml.iu.edu/hypermail/linux/kernel/2005.0/08056.html, it does
include both network and interrupt stats and it's something that they've
been using in production for quite some time. We'd like the statsfs API
to be the basis for including something akin to that in Linux.
To be honest, it's unlikely that Emanuele (who has just finished his
internship at Red Hat) and I will pursue the networking stats further
than the demo patch at the end of this series. However, we're trying to
make sure that the API is at least ready for that, and to probe whether
any developers from other subsystems would be interested in using
statsfs. So thanks for bringing your point of view!
Thanks,
Paolo
^ permalink raw reply
* Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics
From: David Ahern @ 2020-05-27 22:21 UTC (permalink / raw)
To: Paolo Bonzini, Jakub Kicinski, Emanuele Giuseppe Esposito
Cc: linux-s390, kvm, linux-doc, netdev, Emanuele Giuseppe Esposito,
linux-kernel, kvm-ppc, Jonathan Adams, Christian Borntraeger,
Andrew Lunn, Alexander Viro, David Rientjes, linux-fsdevel,
linux-mips, linuxppc-dev, linux-arm-kernel, Jim Mattson
In-Reply-To: <af2ba926-73bc-26c3-7ce7-bd45f657fd85@redhat.com>
On 5/27/20 3:07 PM, Paolo Bonzini wrote:
> I see what you meant now. statsfs can also be used to enumerate objects
> if one is so inclined (with the prototype in patch 7, for example, each
> network interface becomes a directory).
there are many use cases that have 100's to 1000's have network devices.
Having a sysfs entry per device already bloats memory usage for these
use cases; another filesystem with an entry per device makes that worse.
Really the wrong direction for large scale systems.
^ permalink raw reply
* Re: [v3 1/2] dts: ppc: t4240rdb: remove interrupts property
From: Li Yang @ 2020-05-27 23:02 UTC (permalink / raw)
To: Biwen Li
Cc: linux-rtc, a.zummo, Alexandre Belloni,
open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS,
Biwen Li, lkml, Rob Herring, linuxppc-dev
In-Reply-To: <20200527034228.23793-1-biwen.li@oss.nxp.com>
On Tue, May 26, 2020 at 10:49 PM Biwen Li <biwen.li@oss.nxp.com> wrote:
>
> From: Biwen Li <biwen.li@nxp.com>
>
> Since the interrupt pin for RTC DS1374 is not connected
> to the CPU on T4240RDB, remove the interrupt property
> from the device tree.
>
> This also fix the following warning for hwclock.util-linux:
> $ hwclock.util-linux
> hwclock.util-linux: select() to /dev/rtc0
> to wait for clock tick timed out
>
> Signed-off-by: Biwen Li <biwen.li@nxp.com>
Acked-by: Li Yang <leoyang.li@nxp.com>
> ---
> arch/powerpc/boot/dts/fsl/t4240rdb.dts | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/arch/powerpc/boot/dts/fsl/t4240rdb.dts b/arch/powerpc/boot/dts/fsl/t4240rdb.dts
> index a56a705d41f7..145896f2eef6 100644
> --- a/arch/powerpc/boot/dts/fsl/t4240rdb.dts
> +++ b/arch/powerpc/boot/dts/fsl/t4240rdb.dts
> @@ -144,7 +144,6 @@
> rtc@68 {
> compatible = "dallas,ds1374";
> reg = <0x68>;
> - interrupts = <0x1 0x1 0 0>;
> };
> };
>
> --
> 2.17.1
>
^ permalink raw reply
* Re: [v3 2/2] dts: ppc: t1024rdb: remove interrupts property
From: Li Yang @ 2020-05-27 23:03 UTC (permalink / raw)
To: Biwen Li
Cc: linux-rtc, a.zummo, Alexandre Belloni,
open list:OPEN FIRMWARE AND FLATTENED DEVICE TREE BINDINGS,
Biwen Li, lkml, Rob Herring, linuxppc-dev
In-Reply-To: <20200527034228.23793-2-biwen.li@oss.nxp.com>
On Tue, May 26, 2020 at 10:52 PM Biwen Li <biwen.li@oss.nxp.com> wrote:
>
> From: Biwen Li <biwen.li@nxp.com>
>
> Since the interrupt pin for RTC DS1339 is not connected
> to the CPU on T1024RDB, remove the interrupt property
> from the device tree.
>
> This also fix the following warning for hwclock.util-linux:
> $ hwclock.util-linux
> hwclock.util-linux: select() to /dev/rtc0
> to wait for clock tick timed out
>
> Signed-off-by: Biwen Li <biwen.li@nxp.com>
Acked-by: Li Yang <leoyang.li@nxp.com>
> ---
> arch/powerpc/boot/dts/fsl/t1024rdb.dts | 1 -
> 1 file changed, 1 deletion(-)
>
> diff --git a/arch/powerpc/boot/dts/fsl/t1024rdb.dts b/arch/powerpc/boot/dts/fsl/t1024rdb.dts
> index 645caff98ed1..605ceec66af3 100644
> --- a/arch/powerpc/boot/dts/fsl/t1024rdb.dts
> +++ b/arch/powerpc/boot/dts/fsl/t1024rdb.dts
> @@ -161,7 +161,6 @@
> rtc@68 {
> compatible = "dallas,ds1339";
> reg = <0x68>;
> - interrupts = <0x1 0x1 0 0>;
> };
> };
>
> --
> 2.17.1
>
^ permalink raw reply
* Re: [PATCH] powerpc/64: Remove unused generic_secondary_thread_init()
From: Jordan Niethe @ 2020-05-28 0:49 UTC (permalink / raw)
To: Michael Ellerman; +Cc: linuxppc-dev
In-Reply-To: <20200526063446.2476336-1-mpe@ellerman.id.au>
On Tue, May 26, 2020 at 4:36 PM Michael Ellerman <mpe@ellerman.id.au> wrote:
>
> The last caller was removed in 2014 in commit fb5a515704d7 ("powerpc:
> Remove platforms/wsp and associated pieces").
>
> Once generic_secondary_thread_init() is removed there are no longer
> any uses of book3e_secondary_thread_init() or
> generic_secondary_common_init so remove them too.
>
> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
> ---
> arch/powerpc/include/asm/smp.h | 1 -
> arch/powerpc/kernel/exceptions-64e.S | 4 ----
> arch/powerpc/kernel/head_64.S | 18 ------------------
> 3 files changed, 23 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
> index 49a25e2400f2..81a49566ccd8 100644
> --- a/arch/powerpc/include/asm/smp.h
> +++ b/arch/powerpc/include/asm/smp.h
> @@ -243,7 +243,6 @@ extern void arch_send_call_function_ipi_mask(const struct cpumask *mask);
> * 64-bit but defining them all here doesn't harm
> */
> extern void generic_secondary_smp_init(void);
> -extern void generic_secondary_thread_init(void);
> extern unsigned long __secondary_hold_spinloop;
> extern unsigned long __secondary_hold_acknowledge;
> extern char __secondary_hold;
> diff --git a/arch/powerpc/kernel/exceptions-64e.S b/arch/powerpc/kernel/exceptions-64e.S
> index d9ed79415100..9f9e8686798b 100644
> --- a/arch/powerpc/kernel/exceptions-64e.S
> +++ b/arch/powerpc/kernel/exceptions-64e.S
> @@ -1814,10 +1814,6 @@ _GLOBAL(book3e_secondary_core_init)
> 1: mtlr r28
> blr
>
> -_GLOBAL(book3e_secondary_thread_init)
> - mflr r28
> - b 3b
> -
> .globl init_core_book3e
> init_core_book3e:
> /* Establish the interrupt vector base */
> diff --git a/arch/powerpc/kernel/head_64.S b/arch/powerpc/kernel/head_64.S
> index 0e05a9a47a4b..4ae2c18c5fc6 100644
> --- a/arch/powerpc/kernel/head_64.S
> +++ b/arch/powerpc/kernel/head_64.S
> @@ -302,23 +302,6 @@ _GLOBAL(fsl_secondary_thread_init)
> 1:
> #endif
Nothing directly calls generic_secondary_thread_init() but I think
fsl_secondary_thread_init() which is directly above "falls through"
into it. fsl_secondary_thread_init() still has callers.
>
> -_GLOBAL(generic_secondary_thread_init)
> - mr r24,r3
> -
> - /* turn on 64-bit mode */
> - bl enable_64b_mode
> -
> - /* get a valid TOC pointer, wherever we're mapped at */
> - bl relative_toc
> - tovirt(r2,r2)
> -
> -#ifdef CONFIG_PPC_BOOK3E
> - /* Book3E initialization */
> - mr r3,r24
> - bl book3e_secondary_thread_init
> -#endif
> - b generic_secondary_common_init
> -
> /*
> * On pSeries and most other platforms, secondary processors spin
> * in the following code.
> @@ -385,7 +368,6 @@ _GLOBAL(generic_secondary_smp_init)
> 20:
> #endif
>
> -generic_secondary_common_init:
> /* Set up a paca value for this processor. Since we have the
> * physical cpu id in r24, we need to search the pacas to find
> * which logical id maps to our physical one.
> --
> 2.25.1
>
^ permalink raw reply
* Re: [PATCH] powerpc/bpf: Enable bpf_probe_read{, str}() on powerpc again
From: Michael Ellerman @ 2020-05-28 1:03 UTC (permalink / raw)
To: Petr Mladek
Cc: Petr Mladek, Daniel Borkmann, linux-kernel, Alexei Starovoitov,
Paul Mackerras, Masami Hiramatsu, Brendan Gregg, Miroslav Benes,
linuxppc-dev, Christoph Hellwig
In-Reply-To: <20200527122844.19524-1-pmladek@suse.com>
Petr Mladek <pmladek@suse.com> writes:
> The commit 0ebeea8ca8a4d1d453a ("bpf: Restrict bpf_probe_read{, str}() only
> to archs where they work") caused that bpf_probe_read{, str}() functions
> were not longer available on architectures where the same logical address
> might have different content in kernel and user memory mapping. These
> architectures should use probe_read_{user,kernel}_str helpers.
>
> For backward compatibility, the problematic functions are still available
> on architectures where the user and kernel address spaces are not
> overlapping. This is defined CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE.
>
> At the moment, these backward compatible functions are enabled only
> on x86_64, arm, and arm64. Let's do it also on powerpc that has
> the non overlapping address space as well.
>
> Signed-off-by: Petr Mladek <pmladek@suse.com>
This seems like it should have a Fixes: tag and go into v5.7?
cheers
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index d13b5328ca10..b29d7cb38368 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -126,6 +126,7 @@ config PPC
> select ARCH_HAS_MMIOWB if PPC64
> select ARCH_HAS_PHYS_TO_DMA
> select ARCH_HAS_PMEM_API
> + select ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE
> select ARCH_HAS_PTE_DEVMAP if PPC_BOOK3S_64
> select ARCH_HAS_PTE_SPECIAL
> select ARCH_HAS_MEMBARRIER_CALLBACKS
> --
> 2.26.2
^ permalink raw reply
* Re: [PATCH v4 09/22] powerpc/kvm/book3s: Add helper to walk partition scoped linux page table.
From: Paul Mackerras @ 2020-05-28 1:43 UTC (permalink / raw)
To: Aneesh Kumar K.V; +Cc: linuxppc-dev, npiggin
In-Reply-To: <20200505071729.54912-10-aneesh.kumar@linux.ibm.com>
On Tue, May 05, 2020 at 12:47:16PM +0530, Aneesh Kumar K.V wrote:
> The locking rules for walking partition scoped table is different from process
> scoped table. Hence add a helper for secondary linux page table walk and also
> add check whether we are holding the right locks.
This patch is causing new warnings to appear when testing migration,
like this:
[ 142.090159] ------------[ cut here ]------------
[ 142.090160] find_kvm_secondary_pte called with kvm mmu_lock not held
[ 142.090176] WARNING: CPU: 23 PID: 5341 at arch/powerpc/include/asm/kvm_book3s_64.h:644 kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 [kvm_hv]
[ 142.090177] Modules linked in: xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables bpfilter overlay binfmt_misc input_leds raid_class scsi_transport_sas sch_fq_codel sunrpc kvm_hv kvm
[ 142.090188] CPU: 23 PID: 5341 Comm: qemu-system-ppc Tainted: G W 5.7.0-rc5-kvm-00211-g9ccf10d6d088 #432
[ 142.090189] NIP: c008000000fe848c LR: c008000000fe8488 CTR: 0000000000000000
[ 142.090190] REGS: c000001e19f077e0 TRAP: 0700 Tainted: G W (5.7.0-rc5-kvm-00211-g9ccf10d6d088)
[ 142.090191] MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 42222422 XER: 20040000
[ 142.090196] CFAR: c00000000012f5ac IRQMASK: 0
GPR00: c008000000fe8488 c000001e19f07a70 c008000000ffe200 0000000000000039
GPR04: 0000000000000001 c000001ffc8b4900 0000000000018840 0000000000000007
GPR08: 0000000000000003 0000000000000001 0000000000000007 0000000000000001
GPR12: 0000000000002000 c000001fff6d9400 000000011f884678 00007fff70b70000
GPR16: 00007fff7137cb90 00007fff7dcb4410 0000000000000001 0000000000000000
GPR20: 000000000ffe0000 0000000000000000 0000000000000001 0000000000000000
GPR24: 8000000000000000 0000000000000001 c000001e1f67e600 c000001e1fd82410
GPR28: 0000000000001000 c000001e2e410000 0000000000000fff 0000000000000ffe
[ 142.090217] NIP [c008000000fe848c] kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 [kvm_hv]
[ 142.090223] LR [c008000000fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv]
[ 142.090224] Call Trace:
[ 142.090230] [c000001e19f07a70] [c008000000fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv] (unreliable)
[ 142.090236] [c000001e19f07b50] [c008000000fd42e4] kvm_vm_ioctl_get_dirty_log_hv+0x33c/0x3c0 [kvm_hv]
[ 142.090292] [c000001e19f07be0] [c008000000eea878] kvm_vm_ioctl_get_dirty_log+0x30/0x50 [kvm]
[ 142.090300] [c000001e19f07c00] [c008000000edc818] kvm_vm_ioctl+0x2b0/0xc00 [kvm]
[ 142.090302] [c000001e19f07d50] [c00000000046e148] ksys_ioctl+0xf8/0x150
[ 142.090305] [c000001e19f07da0] [c00000000046e1c8] sys_ioctl+0x28/0x80
[ 142.090307] [c000001e19f07dc0] [c00000000003652c] system_call_exception+0x16c/0x240
[ 142.090309] [c000001e19f07e20] [c00000000000d070] system_call_common+0xf0/0x278
[ 142.090310] Instruction dump:
[ 142.090312] 7d3a512a 4200ffd0 7ffefb78 4bfffdc4 60000000 3c820000 e8848468 3c620000
[ 142.090317] e86384a8 38840010 4800673d e8410018 <0fe00000> 4bfffdd4 60000000 60000000
[ 142.090322] ---[ end trace 619d45057b6919e0 ]---
Indeed, kvm_radix_test_clear_dirty() tests the PTE dirty bit
locklessly, and only takes the kvm->mmu_lock once it finds a dirty
PTE. I think that is important for performance, since on any given
scan of the guest real address space we may only find a small
proportion of the guest pages to be dirty.
Are you now relying on the kvm->mmu_lock to protect the existence of
the PTEs, or just their content?
Paul.
^ permalink raw reply
* Re: [PATCH v3 0/7] Statsfs: a new ram-based file system for Linux kernel statistics
From: Paolo Bonzini @ 2020-05-28 5:22 UTC (permalink / raw)
To: David Ahern, Jakub Kicinski, Emanuele Giuseppe Esposito
Cc: linux-s390, kvm, linux-doc, netdev, Emanuele Giuseppe Esposito,
linux-kernel, kvm-ppc, Jonathan Adams, Christian Borntraeger,
Andrew Lunn, Alexander Viro, David Rientjes, linux-fsdevel,
linux-mips, linuxppc-dev, linux-arm-kernel, Jim Mattson
In-Reply-To: <b6fa4439-c6b8-63a4-84fd-fbac3d4f10fd@gmail.com>
On 28/05/20 00:21, David Ahern wrote:
> On 5/27/20 3:07 PM, Paolo Bonzini wrote:
>> I see what you meant now. statsfs can also be used to enumerate objects
>> if one is so inclined (with the prototype in patch 7, for example, each
>> network interface becomes a directory).
>
> there are many use cases that have 100's to 1000's have network devices.
> Having a sysfs entry per device already bloats memory usage for these
> use cases; another filesystem with an entry per device makes that worse.
> Really the wrong direction for large scale systems.
Hi David,
IMO the important part for now is having a flexible kernel API for
exposing statistics across multiple subsystems, so that they can be
harvested in an efficient way. The userspace API is secondary, and
multiple APIs can be added to cater for different usecases.
For example, as of the first five patches the memory usage is the same
as what is now in the mainline kernel, since all the patchset does is
take existing debugfs inodes and move them to statsfs. I agree that, if
the concept is extended to the whole kernel, scalability and memory
usage becomes an issue; and indeed, the long-term plan is to support a
binary format that is actually _more_ efficient than the status quo for
large scale systems.
In the meanwhile, the new filesystem can be disabled (see the difference
between "STATS_FS" and "STATS_FS_API") if it imposes undesirable overhead.
Thanks,
Paolo
^ permalink raw reply
* Re: [PATCH v4 09/22] powerpc/kvm/book3s: Add helper to walk partition scoped linux page table.
From: Aneesh Kumar K.V @ 2020-05-28 6:01 UTC (permalink / raw)
To: Paul Mackerras; +Cc: linuxppc-dev, npiggin
In-Reply-To: <20200528014338.GC307798@thinks.paulus.ozlabs.org>
On 5/28/20 7:13 AM, Paul Mackerras wrote:
> On Tue, May 05, 2020 at 12:47:16PM +0530, Aneesh Kumar K.V wrote:
>> The locking rules for walking partition scoped table is different from process
>> scoped table. Hence add a helper for secondary linux page table walk and also
>> add check whether we are holding the right locks.
>
> This patch is causing new warnings to appear when testing migration,
> like this:
>
> [ 142.090159] ------------[ cut here ]------------
> [ 142.090160] find_kvm_secondary_pte called with kvm mmu_lock not held
> [ 142.090176] WARNING: CPU: 23 PID: 5341 at arch/powerpc/include/asm/kvm_book3s_64.h:644 kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 [kvm_hv]
> [ 142.090177] Modules linked in: xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables bpfilter overlay binfmt_misc input_leds raid_class scsi_transport_sas sch_fq_codel sunrpc kvm_hv kvm
> [ 142.090188] CPU: 23 PID: 5341 Comm: qemu-system-ppc Tainted: G W 5.7.0-rc5-kvm-00211-g9ccf10d6d088 #432
> [ 142.090189] NIP: c008000000fe848c LR: c008000000fe8488 CTR: 0000000000000000
> [ 142.090190] REGS: c000001e19f077e0 TRAP: 0700 Tainted: G W (5.7.0-rc5-kvm-00211-g9ccf10d6d088)
> [ 142.090191] MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 42222422 XER: 20040000
> [ 142.090196] CFAR: c00000000012f5ac IRQMASK: 0
> GPR00: c008000000fe8488 c000001e19f07a70 c008000000ffe200 0000000000000039
> GPR04: 0000000000000001 c000001ffc8b4900 0000000000018840 0000000000000007
> GPR08: 0000000000000003 0000000000000001 0000000000000007 0000000000000001
> GPR12: 0000000000002000 c000001fff6d9400 000000011f884678 00007fff70b70000
> GPR16: 00007fff7137cb90 00007fff7dcb4410 0000000000000001 0000000000000000
> GPR20: 000000000ffe0000 0000000000000000 0000000000000001 0000000000000000
> GPR24: 8000000000000000 0000000000000001 c000001e1f67e600 c000001e1fd82410
> GPR28: 0000000000001000 c000001e2e410000 0000000000000fff 0000000000000ffe
> [ 142.090217] NIP [c008000000fe848c] kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 [kvm_hv]
> [ 142.090223] LR [c008000000fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv]
> [ 142.090224] Call Trace:
> [ 142.090230] [c000001e19f07a70] [c008000000fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv] (unreliable)
> [ 142.090236] [c000001e19f07b50] [c008000000fd42e4] kvm_vm_ioctl_get_dirty_log_hv+0x33c/0x3c0 [kvm_hv]
> [ 142.090292] [c000001e19f07be0] [c008000000eea878] kvm_vm_ioctl_get_dirty_log+0x30/0x50 [kvm]
> [ 142.090300] [c000001e19f07c00] [c008000000edc818] kvm_vm_ioctl+0x2b0/0xc00 [kvm]
> [ 142.090302] [c000001e19f07d50] [c00000000046e148] ksys_ioctl+0xf8/0x150
> [ 142.090305] [c000001e19f07da0] [c00000000046e1c8] sys_ioctl+0x28/0x80
> [ 142.090307] [c000001e19f07dc0] [c00000000003652c] system_call_exception+0x16c/0x240
> [ 142.090309] [c000001e19f07e20] [c00000000000d070] system_call_common+0xf0/0x278
> [ 142.090310] Instruction dump:
> [ 142.090312] 7d3a512a 4200ffd0 7ffefb78 4bfffdc4 60000000 3c820000 e8848468 3c620000
> [ 142.090317] e86384a8 38840010 4800673d e8410018 <0fe00000> 4bfffdd4 60000000 60000000
> [ 142.090322] ---[ end trace 619d45057b6919e0 ]---
>
> Indeed, kvm_radix_test_clear_dirty() tests the PTE dirty bit
> locklessly, and only takes the kvm->mmu_lock once it finds a dirty
> PTE. I think that is important for performance, since on any given
> scan of the guest real address space we may only find a small
> proportion of the guest pages to be dirty.
>
> Are you now relying on the kvm->mmu_lock to protect the existence of
> the PTEs, or just their content?
>
The patch series should not change any rules w.r.t kvm partition scoped
page table walk. We only added helpers to make it explicit that this is
different from regular linux page table walk. And kvm->mmu_lock is what
was used to protect the partition scoped table walk earlier.
In this specific case, what we need probably is an open coded kvm
partition scoped walk with a comment around explaining why is it ok to
walk that partition scoped table without taking kvm->mmu_lock.
What happens when a parallel invalidate happens to Qemu address space?
Since we are not holding kvm->mmu_lock mmu notifier will complete and we
will go ahead with unmapping partition scoped table.
Do we need a change like below?
@@ -1040,7 +1040,7 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
{
unsigned long gfn = memslot->base_gfn + pagenum;
unsigned long gpa = gfn << PAGE_SHIFT;
- pte_t *ptep;
+ pte_t *ptep, pte;
unsigned int shift;
int ret = 0;
unsigned long old, *rmapp;
@@ -1049,11 +1049,23 @@ static int kvm_radix_test_clear_dirty(struct kvm
*kvm,
return ret;
ptep = find_kvm_secondary_pte(kvm, gpa, &shift);
- if (ptep && pte_present(*ptep) && pte_dirty(*ptep)) {
+ if (!ptep)
+ return 0;
+
+ pte = READ_ONCE(*ptep);
+ if (pte_present(pte) && pte_dirty(pte)) {
ret = 1;
if (shift)
ret = 1 << (shift - PAGE_SHIFT);
spin_lock(&kvm->mmu_lock);
+ /*
+ * Recheck the pte again
+ */
+ if (pte_val(pte) != pte_val(*ptep)) {
+ spin_unlock(&kvm->mmu_lock);
+ return 0;
+ }
+
old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_DIRTY, 0,
gpa, shift);
kvmppc_radix_tlbie_page(kvm, gpa, shift, kvm->arch.lpid);
^ permalink raw reply
* [PATCH v5] KVM: PPC: clean up redundant kvm_run parameters in assembly
From: Tianjia Zhang @ 2020-05-28 6:24 UTC (permalink / raw)
To: paulus, mpe, benh; +Cc: tianjia.zhang, linuxppc-dev, linux-kernel, kvm-ppc
In the current kvm version, 'kvm_run' has been included in the 'kvm_vcpu'
structure. For historical reasons, many kvm-related function parameters
retain the 'kvm_run' and 'kvm_vcpu' parameters at the same time. This
patch does a unified cleanup of these remaining redundant parameters.
Signed-off-by: Tianjia Zhang <tianjia.zhang@linux.alibaba.com>
---
arch/powerpc/include/asm/kvm_ppc.h | 2 +-
arch/powerpc/kvm/book3s_interrupts.S | 22 ++++++++++------------
arch/powerpc/kvm/book3s_pr.c | 9 ++++-----
arch/powerpc/kvm/booke.c | 9 ++++-----
arch/powerpc/kvm/booke_interrupts.S | 9 ++++-----
arch/powerpc/kvm/bookehv_interrupts.S | 10 +++++-----
6 files changed, 28 insertions(+), 33 deletions(-)
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index ccf66b3a4c1d..0a056c64c317 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -59,7 +59,7 @@ enum xlate_readwrite {
};
extern int kvmppc_vcpu_run(struct kvm_vcpu *vcpu);
-extern int __kvmppc_vcpu_run(struct kvm_run *run, struct kvm_vcpu *vcpu);
+extern int __kvmppc_vcpu_run(struct kvm_vcpu *vcpu);
extern void kvmppc_handler_highmem(void);
extern void kvmppc_dump_vcpu(struct kvm_vcpu *vcpu);
diff --git a/arch/powerpc/kvm/book3s_interrupts.S b/arch/powerpc/kvm/book3s_interrupts.S
index f7ad99d972ce..a3674f6b8d3d 100644
--- a/arch/powerpc/kvm/book3s_interrupts.S
+++ b/arch/powerpc/kvm/book3s_interrupts.S
@@ -55,8 +55,7 @@
****************************************************************************/
/* Registers:
- * r3: kvm_run pointer
- * r4: vcpu pointer
+ * r3: vcpu pointer
*/
_GLOBAL(__kvmppc_vcpu_run)
@@ -68,8 +67,8 @@ kvm_start_entry:
/* Save host state to the stack */
PPC_STLU r1, -SWITCH_FRAME_SIZE(r1)
- /* Save r3 (kvm_run) and r4 (vcpu) */
- SAVE_2GPRS(3, r1)
+ /* Save r3 (vcpu) */
+ SAVE_GPR(3, r1)
/* Save non-volatile registers (r14 - r31) */
SAVE_NVGPRS(r1)
@@ -82,14 +81,13 @@ kvm_start_entry:
PPC_STL r0, _LINK(r1)
/* Load non-volatile guest state from the vcpu */
- VCPU_LOAD_NVGPRS(r4)
+ VCPU_LOAD_NVGPRS(r3)
kvm_start_lightweight:
/* Copy registers into shadow vcpu so we can access them in real mode */
- mr r3, r4
bl FUNC(kvmppc_copy_to_svcpu)
nop
- REST_GPR(4, r1)
+ REST_GPR(3, r1)
#ifdef CONFIG_PPC_BOOK3S_64
/* Get the dcbz32 flag */
@@ -146,7 +144,7 @@ after_sprg3_load:
*
*/
- PPC_LL r3, GPR4(r1) /* vcpu pointer */
+ PPC_LL r3, GPR3(r1) /* vcpu pointer */
/*
* kvmppc_copy_from_svcpu can clobber volatile registers, save
@@ -190,11 +188,11 @@ after_sprg3_load:
PPC_STL r30, VCPU_GPR(R30)(r7)
PPC_STL r31, VCPU_GPR(R31)(r7)
- /* Pass the exit number as 3rd argument to kvmppc_handle_exit */
- lwz r5, VCPU_TRAP(r7)
+ /* Pass the exit number as 2nd argument to kvmppc_handle_exit */
+ lwz r4, VCPU_TRAP(r7)
- /* Restore r3 (kvm_run) and r4 (vcpu) */
- REST_2GPRS(3, r1)
+ /* Restore r3 (vcpu) */
+ REST_GPR(3, r1)
bl FUNC(kvmppc_handle_exit_pr)
/* If RESUME_GUEST, get back in the loop */
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index ef54f917bdaf..01c8fe5abe0d 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -1151,9 +1151,9 @@ static int kvmppc_exit_pr_progint(struct kvm_vcpu *vcpu, unsigned int exit_nr)
return r;
}
-int kvmppc_handle_exit_pr(struct kvm_run *run, struct kvm_vcpu *vcpu,
- unsigned int exit_nr)
+int kvmppc_handle_exit_pr(struct kvm_vcpu *vcpu, unsigned int exit_nr)
{
+ struct kvm_run *run = vcpu->run;
int r = RESUME_HOST;
int s;
@@ -1826,7 +1826,6 @@ static void kvmppc_core_vcpu_free_pr(struct kvm_vcpu *vcpu)
static int kvmppc_vcpu_run_pr(struct kvm_vcpu *vcpu)
{
- struct kvm_run *run = vcpu->run;
int ret;
#ifdef CONFIG_ALTIVEC
unsigned long uninitialized_var(vrsave);
@@ -1834,7 +1833,7 @@ static int kvmppc_vcpu_run_pr(struct kvm_vcpu *vcpu)
/* Check if we can run the vcpu at all */
if (!vcpu->arch.sane) {
- run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+ vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
ret = -EINVAL;
goto out;
}
@@ -1861,7 +1860,7 @@ static int kvmppc_vcpu_run_pr(struct kvm_vcpu *vcpu)
kvmppc_fix_ee_before_entry();
- ret = __kvmppc_vcpu_run(run, vcpu);
+ ret = __kvmppc_vcpu_run(vcpu);
kvmppc_clear_debug(vcpu);
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index c0d62a917e20..3e1c9f08e302 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -731,12 +731,11 @@ int kvmppc_core_check_requests(struct kvm_vcpu *vcpu)
int kvmppc_vcpu_run(struct kvm_vcpu *vcpu)
{
- struct kvm_run *run = vcpu->run;
int ret, s;
struct debug_reg debug;
if (!vcpu->arch.sane) {
- run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+ vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
return -EINVAL;
}
@@ -778,7 +777,7 @@ int kvmppc_vcpu_run(struct kvm_vcpu *vcpu)
vcpu->arch.pgdir = vcpu->kvm->mm->pgd;
kvmppc_fix_ee_before_entry();
- ret = __kvmppc_vcpu_run(run, vcpu);
+ ret = __kvmppc_vcpu_run(vcpu);
/* No need for guest_exit. It's done in handle_exit.
We also get here with interrupts enabled. */
@@ -982,9 +981,9 @@ static int kvmppc_resume_inst_load(struct kvm_vcpu *vcpu,
*
* Return value is in the form (errcode<<2 | RESUME_FLAG_HOST | RESUME_FLAG_NV)
*/
-int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
- unsigned int exit_nr)
+int kvmppc_handle_exit(struct kvm_vcpu *vcpu, unsigned int exit_nr)
{
+ struct kvm_run *run = vcpu->run;
int r = RESUME_HOST;
int s;
int idx;
diff --git a/arch/powerpc/kvm/booke_interrupts.S b/arch/powerpc/kvm/booke_interrupts.S
index 2e56ab5a5f55..6fa82efe833b 100644
--- a/arch/powerpc/kvm/booke_interrupts.S
+++ b/arch/powerpc/kvm/booke_interrupts.S
@@ -237,7 +237,7 @@ _GLOBAL(kvmppc_resume_host)
/* Switch to kernel stack and jump to handler. */
LOAD_REG_ADDR(r3, kvmppc_handle_exit)
mtctr r3
- lwz r3, HOST_RUN(r1)
+ mr r3, r4
lwz r2, HOST_R2(r1)
mr r14, r4 /* Save vcpu pointer. */
@@ -337,15 +337,14 @@ heavyweight_exit:
/* Registers:
- * r3: kvm_run pointer
- * r4: vcpu pointer
+ * r3: vcpu pointer
*/
_GLOBAL(__kvmppc_vcpu_run)
stwu r1, -HOST_STACK_SIZE(r1)
- stw r1, VCPU_HOST_STACK(r4) /* Save stack pointer to vcpu. */
+ stw r1, VCPU_HOST_STACK(r3) /* Save stack pointer to vcpu. */
/* Save host state to stack. */
- stw r3, HOST_RUN(r1)
+ mr r4, r3
mflr r3
stw r3, HOST_STACK_LR(r1)
mfcr r5
diff --git a/arch/powerpc/kvm/bookehv_interrupts.S b/arch/powerpc/kvm/bookehv_interrupts.S
index c577ba4b3169..8262c14fc9e6 100644
--- a/arch/powerpc/kvm/bookehv_interrupts.S
+++ b/arch/powerpc/kvm/bookehv_interrupts.S
@@ -434,9 +434,10 @@ _GLOBAL(kvmppc_resume_host)
#endif
/* Switch to kernel stack and jump to handler. */
- PPC_LL r3, HOST_RUN(r1)
+ mr r3, r4
mr r5, r14 /* intno */
mr r14, r4 /* Save vcpu pointer. */
+ mr r4, r5
bl kvmppc_handle_exit
/* Restore vcpu pointer and the nonvolatiles we used. */
@@ -525,15 +526,14 @@ heavyweight_exit:
blr
/* Registers:
- * r3: kvm_run pointer
- * r4: vcpu pointer
+ * r3: vcpu pointer
*/
_GLOBAL(__kvmppc_vcpu_run)
stwu r1, -HOST_STACK_SIZE(r1)
- PPC_STL r1, VCPU_HOST_STACK(r4) /* Save stack pointer to vcpu. */
+ PPC_STL r1, VCPU_HOST_STACK(r3) /* Save stack pointer to vcpu. */
/* Save host state to stack. */
- PPC_STL r3, HOST_RUN(r1)
+ mr r4, r3
mflr r3
mfcr r5
PPC_STL r3, HOST_STACK_LR(r1)
--
2.17.1
^ permalink raw reply related
* Re: [PATCH v4 09/22] powerpc/kvm/book3s: Add helper to walk partition scoped linux page table.
From: Paul Mackerras @ 2020-05-28 7:25 UTC (permalink / raw)
To: Aneesh Kumar K.V; +Cc: linuxppc-dev, npiggin
In-Reply-To: <e732e386-4a8c-2a7d-220c-e22e85b7a6c3@linux.ibm.com>
On Thu, May 28, 2020 at 11:31:04AM +0530, Aneesh Kumar K.V wrote:
> On 5/28/20 7:13 AM, Paul Mackerras wrote:
> > On Tue, May 05, 2020 at 12:47:16PM +0530, Aneesh Kumar K.V wrote:
> > > The locking rules for walking partition scoped table is different from process
> > > scoped table. Hence add a helper for secondary linux page table walk and also
> > > add check whether we are holding the right locks.
> >
> > This patch is causing new warnings to appear when testing migration,
> > like this:
> >
> > [ 142.090159] ------------[ cut here ]------------
> > [ 142.090160] find_kvm_secondary_pte called with kvm mmu_lock not held
> > [ 142.090176] WARNING: CPU: 23 PID: 5341 at arch/powerpc/include/asm/kvm_book3s_64.h:644 kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 [kvm_hv]
> > [ 142.090177] Modules linked in: xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables bpfilter overlay binfmt_misc input_leds raid_class scsi_transport_sas sch_fq_codel sunrpc kvm_hv kvm
> > [ 142.090188] CPU: 23 PID: 5341 Comm: qemu-system-ppc Tainted: G W 5.7.0-rc5-kvm-00211-g9ccf10d6d088 #432
> > [ 142.090189] NIP: c008000000fe848c LR: c008000000fe8488 CTR: 0000000000000000
> > [ 142.090190] REGS: c000001e19f077e0 TRAP: 0700 Tainted: G W (5.7.0-rc5-kvm-00211-g9ccf10d6d088)
> > [ 142.090191] MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 42222422 XER: 20040000
> > [ 142.090196] CFAR: c00000000012f5ac IRQMASK: 0
> > GPR00: c008000000fe8488 c000001e19f07a70 c008000000ffe200 0000000000000039
> > GPR04: 0000000000000001 c000001ffc8b4900 0000000000018840 0000000000000007
> > GPR08: 0000000000000003 0000000000000001 0000000000000007 0000000000000001
> > GPR12: 0000000000002000 c000001fff6d9400 000000011f884678 00007fff70b70000
> > GPR16: 00007fff7137cb90 00007fff7dcb4410 0000000000000001 0000000000000000
> > GPR20: 000000000ffe0000 0000000000000000 0000000000000001 0000000000000000
> > GPR24: 8000000000000000 0000000000000001 c000001e1f67e600 c000001e1fd82410
> > GPR28: 0000000000001000 c000001e2e410000 0000000000000fff 0000000000000ffe
> > [ 142.090217] NIP [c008000000fe848c] kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 [kvm_hv]
> > [ 142.090223] LR [c008000000fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv]
> > [ 142.090224] Call Trace:
> > [ 142.090230] [c000001e19f07a70] [c008000000fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv] (unreliable)
> > [ 142.090236] [c000001e19f07b50] [c008000000fd42e4] kvm_vm_ioctl_get_dirty_log_hv+0x33c/0x3c0 [kvm_hv]
> > [ 142.090292] [c000001e19f07be0] [c008000000eea878] kvm_vm_ioctl_get_dirty_log+0x30/0x50 [kvm]
> > [ 142.090300] [c000001e19f07c00] [c008000000edc818] kvm_vm_ioctl+0x2b0/0xc00 [kvm]
> > [ 142.090302] [c000001e19f07d50] [c00000000046e148] ksys_ioctl+0xf8/0x150
> > [ 142.090305] [c000001e19f07da0] [c00000000046e1c8] sys_ioctl+0x28/0x80
> > [ 142.090307] [c000001e19f07dc0] [c00000000003652c] system_call_exception+0x16c/0x240
> > [ 142.090309] [c000001e19f07e20] [c00000000000d070] system_call_common+0xf0/0x278
> > [ 142.090310] Instruction dump:
> > [ 142.090312] 7d3a512a 4200ffd0 7ffefb78 4bfffdc4 60000000 3c820000 e8848468 3c620000
> > [ 142.090317] e86384a8 38840010 4800673d e8410018 <0fe00000> 4bfffdd4 60000000 60000000
> > [ 142.090322] ---[ end trace 619d45057b6919e0 ]---
> >
> > Indeed, kvm_radix_test_clear_dirty() tests the PTE dirty bit
> > locklessly, and only takes the kvm->mmu_lock once it finds a dirty
> > PTE. I think that is important for performance, since on any given
> > scan of the guest real address space we may only find a small
> > proportion of the guest pages to be dirty.
> >
> > Are you now relying on the kvm->mmu_lock to protect the existence of
> > the PTEs, or just their content?
> >
>
> The patch series should not change any rules w.r.t kvm partition scoped page
> table walk. We only added helpers to make it explicit that this is different
> from regular linux page table walk. And kvm->mmu_lock is what was used to
> protect the partition scoped table walk earlier.
>
> In this specific case, what we need probably is an open coded kvm partition
> scoped walk with a comment around explaining why is it ok to walk that
> partition scoped table without taking kvm->mmu_lock.
>
> What happens when a parallel invalidate happens to Qemu address space? Since
> we are not holding kvm->mmu_lock mmu notifier will complete and we will go
> ahead with unmapping partition scoped table.
>
> Do we need a change like below?
>
> @@ -1040,7 +1040,7 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
> {
> unsigned long gfn = memslot->base_gfn + pagenum;
> unsigned long gpa = gfn << PAGE_SHIFT;
> - pte_t *ptep;
> + pte_t *ptep, pte;
> unsigned int shift;
> int ret = 0;
> unsigned long old, *rmapp;
> @@ -1049,11 +1049,23 @@ static int kvm_radix_test_clear_dirty(struct kvm
> *kvm,
> return ret;
>
> ptep = find_kvm_secondary_pte(kvm, gpa, &shift);
We need something different from find_kvm_secondary_pte here, since
that is what is generating the warning.
> - if (ptep && pte_present(*ptep) && pte_dirty(*ptep)) {
> + if (!ptep)
> + return 0;
> +
> + pte = READ_ONCE(*ptep);
> + if (pte_present(pte) && pte_dirty(pte)) {
> ret = 1;
> if (shift)
> ret = 1 << (shift - PAGE_SHIFT);
> spin_lock(&kvm->mmu_lock);
> + /*
> + * Recheck the pte again
> + */
> + if (pte_val(pte) != pte_val(*ptep)) {
I don't think this is quite right. I think it should be something
like:
pte = *ptep;
if (!(pte_present(pte) && pte_dirty(pte))) {
> + spin_unlock(&kvm->mmu_lock);
> + return 0;
> + }
> +
> old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_DIRTY, 0,
> gpa, shift);
> kvmppc_radix_tlbie_page(kvm, gpa, shift, kvm->arch.lpid);
Paul.
^ permalink raw reply
* [PATCH] powerpc/book3s64/kvm: Fix secondary page table walk warning during migration
From: Aneesh Kumar K.V @ 2020-05-28 8:04 UTC (permalink / raw)
To: kvm-ppc, paulus; +Cc: linuxppc-dev, Aneesh Kumar K.V
This patch fix the below warning reported during migration.
find_kvm_secondary_pte called with kvm mmu_lock not held
CPU: 23 PID: 5341 Comm: qemu-system-ppc Tainted: G W 5.7.0-rc5-kvm-00211-g9ccf10d6d088 #432
NIP: c008000000fe848c LR: c008000000fe8488 CTR: 0000000000000000
REGS: c000001e19f077e0 TRAP: 0700 Tainted: G W (5.7.0-rc5-kvm-00211-g9ccf10d6d088)
MSR: 9000000000029033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 42222422 XER: 20040000
CFAR: c00000000012f5ac IRQMASK: 0
GPR00: c008000000fe8488 c000001e19f07a70 c008000000ffe200 0000000000000039
GPR04: 0000000000000001 c000001ffc8b4900 0000000000018840 0000000000000007
GPR08: 0000000000000003 0000000000000001 0000000000000007 0000000000000001
GPR12: 0000000000002000 c000001fff6d9400 000000011f884678 00007fff70b70000
GPR16: 00007fff7137cb90 00007fff7dcb4410 0000000000000001 0000000000000000
GPR20: 000000000ffe0000 0000000000000000 0000000000000001 0000000000000000
GPR24: 8000000000000000 0000000000000001 c000001e1f67e600 c000001e1fd82410
GPR28: 0000000000001000 c000001e2e410000 0000000000000fff 0000000000000ffe
NIP [c008000000fe848c] kvmppc_hv_get_dirty_log_radix+0x2e4/0x340 [kvm_hv]
LR [c008000000fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv]
Call Trace:
[c000001e19f07a70] [c008000000fe8488] kvmppc_hv_get_dirty_log_radix+0x2e0/0x340 [kvm_hv] (unreliable)
[c000001e19f07b50] [c008000000fd42e4] kvm_vm_ioctl_get_dirty_log_hv+0x33c/0x3c0 [kvm_hv]
[c000001e19f07be0] [c008000000eea878] kvm_vm_ioctl_get_dirty_log+0x30/0x50 [kvm]
[c000001e19f07c00] [c008000000edc818] kvm_vm_ioctl+0x2b0/0xc00 [kvm]
[c000001e19f07d50] [c00000000046e148] ksys_ioctl+0xf8/0x150
[c000001e19f07da0] [c00000000046e1c8] sys_ioctl+0x28/0x80
[c000001e19f07dc0] [c00000000003652c] system_call_exception+0x16c/0x240
[c000001e19f07e20] [c00000000000d070] system_call_common+0xf0/0x278
Instruction dump:
7d3a512a 4200ffd0 7ffefb78 4bfffdc4 60000000 3c820000 e8848468 3c620000
e86384a8 38840010 4800673d e8410018 <0fe00000> 4bfffdd4 60000000 60000000
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
arch/powerpc/include/asm/kvm_book3s_64.h | 9 ++++++
arch/powerpc/kvm/book3s_64_mmu_radix.c | 35 ++++++++++++++++++++----
2 files changed, 38 insertions(+), 6 deletions(-)
diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h b/arch/powerpc/include/asm/kvm_book3s_64.h
index c58e64a0a74f..cd5bad08b8d3 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -635,6 +635,15 @@ extern void kvmhv_remove_nest_rmap_range(struct kvm *kvm,
unsigned long gpa, unsigned long hpa,
unsigned long nbytes);
+static inline pte_t *__find_kvm_secondary_pte(struct kvm *kvm, unsigned long ea,
+ unsigned *hshift)
+{
+ pte_t *pte;
+
+ pte = __find_linux_pte(kvm->arch.pgtable, ea, NULL, hshift);
+ return pte;
+}
+
static inline pte_t *find_kvm_secondary_pte(struct kvm *kvm, unsigned long ea,
unsigned *hshift)
{
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 271f1c3d8443..a328db6a5ef8 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -1040,7 +1040,7 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
{
unsigned long gfn = memslot->base_gfn + pagenum;
unsigned long gpa = gfn << PAGE_SHIFT;
- pte_t *ptep;
+ pte_t *ptep, pte;
unsigned int shift;
int ret = 0;
unsigned long old, *rmapp;
@@ -1048,12 +1048,35 @@ static int kvm_radix_test_clear_dirty(struct kvm *kvm,
if (kvm->arch.secure_guest & KVMPPC_SECURE_INIT_DONE)
return ret;
- ptep = find_kvm_secondary_pte(kvm, gpa, &shift);
- if (ptep && pte_present(*ptep) && pte_dirty(*ptep)) {
- ret = 1;
- if (shift)
- ret = 1 << (shift - PAGE_SHIFT);
+ /*
+ * For performance reason we don't hold kvm->mmu_lock
+ * while walking the partition scoped table.
+ */
+ ptep = __find_kvm_secondary_pte(kvm, gpa, &shift);
+ if (!ptep)
+ return 0;
+
+ pte = READ_ONCE(*ptep);
+ if (pte_present(pte) && pte_dirty(pte)) {
spin_lock(&kvm->mmu_lock);
+ /*
+ * Recheck the pte again
+ */
+ if (pte_val(pte) != pte_val(*ptep)) {
+ /*
+ * We have KVM_MEM_LOG_DIRTY_PAGES enabled. Hence we
+ * can only find PAGE_SIZE pte entries here. We can
+ * continue to use the pte addr returned by above
+ * page table walk.
+ */
+ if (!pte_present(*ptep) || !pte_dirty(*ptep)) {
+ spin_unlock(&kvm->mmu_lock);
+ return 0;
+ }
+ }
+
+ ret = 1;
+ VM_BUG_ON(shift);
old = kvmppc_radix_update_pte(kvm, ptep, _PAGE_DIRTY, 0,
gpa, shift);
kvmppc_radix_tlbie_page(kvm, gpa, shift, kvm->arch.lpid);
--
2.26.2
^ permalink raw reply related
* Re: [RFC PATCH 1/4] powerpc/64s: Don't init FSCR_DSCR in __init_FSCR()
From: Nicholas Piggin @ 2020-05-28 8:09 UTC (permalink / raw)
To: linuxppc-dev, Michael Ellerman; +Cc: alistair, mikey, jniethe5
In-Reply-To: <20200527145843.2761782-1-mpe@ellerman.id.au>
Excerpts from Michael Ellerman's message of May 28, 2020 12:58 am:
> __init_FSCR() was added originally in commit 2468dcf641e4 ("powerpc:
> Add support for context switching the TAR register") (Feb 2013), and
> only set FSCR_TAR.
>
> At that point FSCR (Facility Status and Control Register) was not
> context switched, so the setting was permanent after boot.
>
> Later we added initialisation of FSCR_DSCR to __init_FSCR(), in commit
> 54c9b2253d34 ("powerpc: Set DSCR bit in FSCR setup") (Mar 2013), again
> that was permanent after boot.
>
> Then commit 2517617e0de6 ("powerpc: Fix context switch DSCR on
> POWER8") (Aug 2013) added a limited context switch of FSCR, just the
> FSCR_DSCR bit was context switched based on thread.dscr_inherit. That
> commit said "This clears the H/FSCR DSCR bit initially", but it
> didn't, it left the initialisation of FSCR_DSCR in __init_FSCR().
> However the initial context switch from init_task to pid 1 would clear
> FSCR_DSCR because thread.dscr_inherit was 0.
>
> That commit also introduced the requirement that FSCR_DSCR be clear
> for user processes, so that we can take the facility unavailable
> interrupt in order to manage dscr_inherit.
>
> Then in commit 152d523e6307 ("powerpc: Create context switch helpers
> save_sprs() and restore_sprs()") (Dec 2015) FSCR was added to
> thread_struct. However it still wasn't fully context switched, we just
> took the existing value and set FSCR_DSCR if the new thread had
> dscr_inherit set. FSCR was still initialised at boot to FSCR_DSCR |
> FSCR_TAR, but that value was not propagated into the thread_struct, so
> the initial context switch set FSCR_DSCR back to 0.
>
> Finally commit b57bd2de8c6c ("powerpc: Improve FSCR init and context
> switching") (Jun 2016) added a full context switch of the FSCR, and
> added an initialisation of init_task.thread.fscr to FSCR_TAR |
> FSCR_EBB, but omitted FSCR_DSCR.
>
> The end result is that swapper runs with FSCR_DSCR set because of the
> initialisation in __init_FSCR(), but no other processes do, they use
> the value from init_task.thread.fscr.
>
> Having FSCR_DSCR set for swapper allows it to access SPR 3 from
> userspace, but swapper never runs userspace, so it has no useful
> effect. It's also confusing to have the value initialised in two
> places to two different values.
>
> So remove FSCR_DSCR from __init_FSCR(), this at least gets us to the
> point where there's a single value of FSCR, even if it's still set in
> two places.
Thanks for sorting out this mess, everything looks good to me.
Thanks,
Nick
^ permalink raw reply
* Re: [PATCH] powerpc/bpf: Enable bpf_probe_read{, str}() on powerpc again
From: Petr Mladek @ 2020-05-28 9:13 UTC (permalink / raw)
To: Michael Ellerman
Cc: Daniel Borkmann, linux-kernel, Alexei Starovoitov, Paul Mackerras,
Masami Hiramatsu, Brendan Gregg, Miroslav Benes, linuxppc-dev,
Christoph Hellwig
In-Reply-To: <87ftbkkh00.fsf@mpe.ellerman.id.au>
On Thu 2020-05-28 11:03:43, Michael Ellerman wrote:
> Petr Mladek <pmladek@suse.com> writes:
> > The commit 0ebeea8ca8a4d1d453a ("bpf: Restrict bpf_probe_read{, str}() only
> > to archs where they work") caused that bpf_probe_read{, str}() functions
> > were not longer available on architectures where the same logical address
> > might have different content in kernel and user memory mapping. These
> > architectures should use probe_read_{user,kernel}_str helpers.
> >
> > For backward compatibility, the problematic functions are still available
> > on architectures where the user and kernel address spaces are not
> > overlapping. This is defined CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE.
> >
> > At the moment, these backward compatible functions are enabled only
> > on x86_64, arm, and arm64. Let's do it also on powerpc that has
> > the non overlapping address space as well.
> >
> > Signed-off-by: Petr Mladek <pmladek@suse.com>
>
> This seems like it should have a Fixes: tag and go into v5.7?
Good point:
Fixes: commit 0ebeea8ca8a4d1d4 ("bpf: Restrict bpf_probe_read{, str}() only to archs where they work")
And yes, it should ideally go into v5.7 either directly or via stable.
Should I resend the patch with Fixes and
Cc: stable@vger.kernel.org #v45.7 lines, please?
Best Regards,
Petr
^ permalink raw reply
* [PATCH] powerpc/32: disable KASAN with pages bigger than 16k
From: Christophe Leroy @ 2020-05-28 10:17 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
Cc: linuxppc-dev, linux-kernel
Mapping of early shadow area is implemented by using a single static
page table having all entries pointing to the same early shadow page.
The shadow area must therefore occupy full PGD entries.
The shadow area has a size of 128Mbytes starting at 0xf8000000.
With 4k pages, a PGD entry is 4Mbytes
With 16k pages, a PGD entry is 64Mbytes
With 64k pages, a PGD entry is 256Mbytes which is too big.
Until we rework the early shadow mapping, disable KASAN when the
page size is too big.
Reported-by: kbuild test robot <lkp@intel.com>
Fixes: 2edb16efc899 ("powerpc/32: Add KASAN support")
Cc: stable@vger.kernel.org
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
arch/powerpc/Kconfig | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 1dfa59126fcf..066b36bf9efa 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -169,8 +169,8 @@ config PPC
select HAVE_ARCH_AUDITSYSCALL
select HAVE_ARCH_HUGE_VMAP if PPC_BOOK3S_64 && PPC_RADIX_MMU
select HAVE_ARCH_JUMP_LABEL
- select HAVE_ARCH_KASAN if PPC32
- select HAVE_ARCH_KASAN_VMALLOC if PPC32
+ select HAVE_ARCH_KASAN if PPC32 && PPC_PAGE_SHIFT <= 14
+ select HAVE_ARCH_KASAN_VMALLOC if PPC32 && PPC_PAGE_SHIFT <= 14
select HAVE_ARCH_KGDB
select HAVE_ARCH_MMAP_RND_BITS
select HAVE_ARCH_MMAP_RND_COMPAT_BITS if COMPAT
--
2.25.0
^ permalink raw reply related
* [PATCH AUTOSEL 5.6 30/47] net/ethernet/freescale: rework quiesce/activate for ucc_geth
From: Sasha Levin @ 2020-05-28 11:55 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Matteo Ghidoni, Sasha Levin, netdev, Valentin Longchamp,
linuxppc-dev, David S . Miller
In-Reply-To: <20200528115600.1405808-1-sashal@kernel.org>
From: Valentin Longchamp <valentin@longchamp.me>
[ Upstream commit 79dde73cf9bcf1dd317a2667f78b758e9fe139ed ]
ugeth_quiesce/activate are used to halt the controller when there is a
link change that requires to reconfigure the mac.
The previous implementation called netif_device_detach(). This however
causes the initial activation of the netdevice to fail precisely because
it's detached. For details, see [1].
A possible workaround was the revert of commit
net: linkwatch: add check for netdevice being present to linkwatch_do_dev
However, the check introduced in the above commit is correct and shall be
kept.
The netif_device_detach() is thus replaced with
netif_tx_stop_all_queues() that prevents any tranmission. This allows to
perform mac config change required by the link change, without detaching
the corresponding netdevice and thus not preventing its initial
activation.
[1] https://lists.openwall.net/netdev/2020/01/08/201
Signed-off-by: Valentin Longchamp <valentin@longchamp.me>
Acked-by: Matteo Ghidoni <matteo.ghidoni@ch.abb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/net/ethernet/freescale/ucc_geth.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/freescale/ucc_geth.c b/drivers/net/ethernet/freescale/ucc_geth.c
index 0d101c00286f..ab1b4a77b4a3 100644
--- a/drivers/net/ethernet/freescale/ucc_geth.c
+++ b/drivers/net/ethernet/freescale/ucc_geth.c
@@ -42,6 +42,7 @@
#include <soc/fsl/qe/ucc.h>
#include <soc/fsl/qe/ucc_fast.h>
#include <asm/machdep.h>
+#include <net/sch_generic.h>
#include "ucc_geth.h"
@@ -1548,11 +1549,8 @@ static int ugeth_disable(struct ucc_geth_private *ugeth, enum comm_dir mode)
static void ugeth_quiesce(struct ucc_geth_private *ugeth)
{
- /* Prevent any further xmits, plus detach the device. */
- netif_device_detach(ugeth->ndev);
-
- /* Wait for any current xmits to finish. */
- netif_tx_disable(ugeth->ndev);
+ /* Prevent any further xmits */
+ netif_tx_stop_all_queues(ugeth->ndev);
/* Disable the interrupt to avoid NAPI rescheduling. */
disable_irq(ugeth->ug_info->uf_info.irq);
@@ -1565,7 +1563,10 @@ static void ugeth_activate(struct ucc_geth_private *ugeth)
{
napi_enable(&ugeth->napi);
enable_irq(ugeth->ug_info->uf_info.irq);
- netif_device_attach(ugeth->ndev);
+
+ /* allow to xmit again */
+ netif_tx_wake_all_queues(ugeth->ndev);
+ __netdev_watchdog_up(ugeth->ndev);
}
/* Called every time the controller might need to be made
--
2.25.1
^ permalink raw reply related
* [PATCH AUTOSEL 5.4 15/26] net/ethernet/freescale: rework quiesce/activate for ucc_geth
From: Sasha Levin @ 2020-05-28 11:56 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Matteo Ghidoni, Sasha Levin, netdev, Valentin Longchamp,
linuxppc-dev, David S . Miller
In-Reply-To: <20200528115654.1406165-1-sashal@kernel.org>
From: Valentin Longchamp <valentin@longchamp.me>
[ Upstream commit 79dde73cf9bcf1dd317a2667f78b758e9fe139ed ]
ugeth_quiesce/activate are used to halt the controller when there is a
link change that requires to reconfigure the mac.
The previous implementation called netif_device_detach(). This however
causes the initial activation of the netdevice to fail precisely because
it's detached. For details, see [1].
A possible workaround was the revert of commit
net: linkwatch: add check for netdevice being present to linkwatch_do_dev
However, the check introduced in the above commit is correct and shall be
kept.
The netif_device_detach() is thus replaced with
netif_tx_stop_all_queues() that prevents any tranmission. This allows to
perform mac config change required by the link change, without detaching
the corresponding netdevice and thus not preventing its initial
activation.
[1] https://lists.openwall.net/netdev/2020/01/08/201
Signed-off-by: Valentin Longchamp <valentin@longchamp.me>
Acked-by: Matteo Ghidoni <matteo.ghidoni@ch.abb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/net/ethernet/freescale/ucc_geth.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/freescale/ucc_geth.c b/drivers/net/ethernet/freescale/ucc_geth.c
index f839fa94ebdd..d3b8ce734c1b 100644
--- a/drivers/net/ethernet/freescale/ucc_geth.c
+++ b/drivers/net/ethernet/freescale/ucc_geth.c
@@ -42,6 +42,7 @@
#include <soc/fsl/qe/ucc.h>
#include <soc/fsl/qe/ucc_fast.h>
#include <asm/machdep.h>
+#include <net/sch_generic.h>
#include "ucc_geth.h"
@@ -1548,11 +1549,8 @@ static int ugeth_disable(struct ucc_geth_private *ugeth, enum comm_dir mode)
static void ugeth_quiesce(struct ucc_geth_private *ugeth)
{
- /* Prevent any further xmits, plus detach the device. */
- netif_device_detach(ugeth->ndev);
-
- /* Wait for any current xmits to finish. */
- netif_tx_disable(ugeth->ndev);
+ /* Prevent any further xmits */
+ netif_tx_stop_all_queues(ugeth->ndev);
/* Disable the interrupt to avoid NAPI rescheduling. */
disable_irq(ugeth->ug_info->uf_info.irq);
@@ -1565,7 +1563,10 @@ static void ugeth_activate(struct ucc_geth_private *ugeth)
{
napi_enable(&ugeth->napi);
enable_irq(ugeth->ug_info->uf_info.irq);
- netif_device_attach(ugeth->ndev);
+
+ /* allow to xmit again */
+ netif_tx_wake_all_queues(ugeth->ndev);
+ __netdev_watchdog_up(ugeth->ndev);
}
/* Called every time the controller might need to be made
--
2.25.1
^ permalink raw reply related
* [PATCH AUTOSEL 4.19 12/17] net/ethernet/freescale: rework quiesce/activate for ucc_geth
From: Sasha Levin @ 2020-05-28 11:57 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Matteo Ghidoni, Sasha Levin, netdev, Valentin Longchamp,
linuxppc-dev, David S . Miller
In-Reply-To: <20200528115724.1406376-1-sashal@kernel.org>
From: Valentin Longchamp <valentin@longchamp.me>
[ Upstream commit 79dde73cf9bcf1dd317a2667f78b758e9fe139ed ]
ugeth_quiesce/activate are used to halt the controller when there is a
link change that requires to reconfigure the mac.
The previous implementation called netif_device_detach(). This however
causes the initial activation of the netdevice to fail precisely because
it's detached. For details, see [1].
A possible workaround was the revert of commit
net: linkwatch: add check for netdevice being present to linkwatch_do_dev
However, the check introduced in the above commit is correct and shall be
kept.
The netif_device_detach() is thus replaced with
netif_tx_stop_all_queues() that prevents any tranmission. This allows to
perform mac config change required by the link change, without detaching
the corresponding netdevice and thus not preventing its initial
activation.
[1] https://lists.openwall.net/netdev/2020/01/08/201
Signed-off-by: Valentin Longchamp <valentin@longchamp.me>
Acked-by: Matteo Ghidoni <matteo.ghidoni@ch.abb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/net/ethernet/freescale/ucc_geth.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/freescale/ucc_geth.c b/drivers/net/ethernet/freescale/ucc_geth.c
index a5bf02ae4bc5..5de6f7c73c1f 100644
--- a/drivers/net/ethernet/freescale/ucc_geth.c
+++ b/drivers/net/ethernet/freescale/ucc_geth.c
@@ -45,6 +45,7 @@
#include <soc/fsl/qe/ucc.h>
#include <soc/fsl/qe/ucc_fast.h>
#include <asm/machdep.h>
+#include <net/sch_generic.h>
#include "ucc_geth.h"
@@ -1551,11 +1552,8 @@ static int ugeth_disable(struct ucc_geth_private *ugeth, enum comm_dir mode)
static void ugeth_quiesce(struct ucc_geth_private *ugeth)
{
- /* Prevent any further xmits, plus detach the device. */
- netif_device_detach(ugeth->ndev);
-
- /* Wait for any current xmits to finish. */
- netif_tx_disable(ugeth->ndev);
+ /* Prevent any further xmits */
+ netif_tx_stop_all_queues(ugeth->ndev);
/* Disable the interrupt to avoid NAPI rescheduling. */
disable_irq(ugeth->ug_info->uf_info.irq);
@@ -1568,7 +1566,10 @@ static void ugeth_activate(struct ucc_geth_private *ugeth)
{
napi_enable(&ugeth->napi);
enable_irq(ugeth->ug_info->uf_info.irq);
- netif_device_attach(ugeth->ndev);
+
+ /* allow to xmit again */
+ netif_tx_wake_all_queues(ugeth->ndev);
+ __netdev_watchdog_up(ugeth->ndev);
}
/* Called every time the controller might need to be made
--
2.25.1
^ permalink raw reply related
* [PATCH AUTOSEL 4.14 08/13] net/ethernet/freescale: rework quiesce/activate for ucc_geth
From: Sasha Levin @ 2020-05-28 11:57 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Matteo Ghidoni, Sasha Levin, netdev, Valentin Longchamp,
linuxppc-dev, David S . Miller
In-Reply-To: <20200528115744.1406533-1-sashal@kernel.org>
From: Valentin Longchamp <valentin@longchamp.me>
[ Upstream commit 79dde73cf9bcf1dd317a2667f78b758e9fe139ed ]
ugeth_quiesce/activate are used to halt the controller when there is a
link change that requires to reconfigure the mac.
The previous implementation called netif_device_detach(). This however
causes the initial activation of the netdevice to fail precisely because
it's detached. For details, see [1].
A possible workaround was the revert of commit
net: linkwatch: add check for netdevice being present to linkwatch_do_dev
However, the check introduced in the above commit is correct and shall be
kept.
The netif_device_detach() is thus replaced with
netif_tx_stop_all_queues() that prevents any tranmission. This allows to
perform mac config change required by the link change, without detaching
the corresponding netdevice and thus not preventing its initial
activation.
[1] https://lists.openwall.net/netdev/2020/01/08/201
Signed-off-by: Valentin Longchamp <valentin@longchamp.me>
Acked-by: Matteo Ghidoni <matteo.ghidoni@ch.abb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/net/ethernet/freescale/ucc_geth.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/freescale/ucc_geth.c b/drivers/net/ethernet/freescale/ucc_geth.c
index bddf4c25ee6e..7c2a9fd4dc1a 100644
--- a/drivers/net/ethernet/freescale/ucc_geth.c
+++ b/drivers/net/ethernet/freescale/ucc_geth.c
@@ -45,6 +45,7 @@
#include <soc/fsl/qe/ucc.h>
#include <soc/fsl/qe/ucc_fast.h>
#include <asm/machdep.h>
+#include <net/sch_generic.h>
#include "ucc_geth.h"
@@ -1551,11 +1552,8 @@ static int ugeth_disable(struct ucc_geth_private *ugeth, enum comm_dir mode)
static void ugeth_quiesce(struct ucc_geth_private *ugeth)
{
- /* Prevent any further xmits, plus detach the device. */
- netif_device_detach(ugeth->ndev);
-
- /* Wait for any current xmits to finish. */
- netif_tx_disable(ugeth->ndev);
+ /* Prevent any further xmits */
+ netif_tx_stop_all_queues(ugeth->ndev);
/* Disable the interrupt to avoid NAPI rescheduling. */
disable_irq(ugeth->ug_info->uf_info.irq);
@@ -1568,7 +1566,10 @@ static void ugeth_activate(struct ucc_geth_private *ugeth)
{
napi_enable(&ugeth->napi);
enable_irq(ugeth->ug_info->uf_info.irq);
- netif_device_attach(ugeth->ndev);
+
+ /* allow to xmit again */
+ netif_tx_wake_all_queues(ugeth->ndev);
+ __netdev_watchdog_up(ugeth->ndev);
}
/* Called every time the controller might need to be made
--
2.25.1
^ permalink raw reply related
* [PATCH AUTOSEL 4.9 5/9] net/ethernet/freescale: rework quiesce/activate for ucc_geth
From: Sasha Levin @ 2020-05-28 11:57 UTC (permalink / raw)
To: linux-kernel, stable
Cc: Matteo Ghidoni, Sasha Levin, netdev, Valentin Longchamp,
linuxppc-dev, David S . Miller
In-Reply-To: <20200528115800.1406703-1-sashal@kernel.org>
From: Valentin Longchamp <valentin@longchamp.me>
[ Upstream commit 79dde73cf9bcf1dd317a2667f78b758e9fe139ed ]
ugeth_quiesce/activate are used to halt the controller when there is a
link change that requires to reconfigure the mac.
The previous implementation called netif_device_detach(). This however
causes the initial activation of the netdevice to fail precisely because
it's detached. For details, see [1].
A possible workaround was the revert of commit
net: linkwatch: add check for netdevice being present to linkwatch_do_dev
However, the check introduced in the above commit is correct and shall be
kept.
The netif_device_detach() is thus replaced with
netif_tx_stop_all_queues() that prevents any tranmission. This allows to
perform mac config change required by the link change, without detaching
the corresponding netdevice and thus not preventing its initial
activation.
[1] https://lists.openwall.net/netdev/2020/01/08/201
Signed-off-by: Valentin Longchamp <valentin@longchamp.me>
Acked-by: Matteo Ghidoni <matteo.ghidoni@ch.abb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
drivers/net/ethernet/freescale/ucc_geth.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/freescale/ucc_geth.c b/drivers/net/ethernet/freescale/ucc_geth.c
index 714593023bbc..af922bac19ae 100644
--- a/drivers/net/ethernet/freescale/ucc_geth.c
+++ b/drivers/net/ethernet/freescale/ucc_geth.c
@@ -45,6 +45,7 @@
#include <soc/fsl/qe/ucc.h>
#include <soc/fsl/qe/ucc_fast.h>
#include <asm/machdep.h>
+#include <net/sch_generic.h>
#include "ucc_geth.h"
@@ -1551,11 +1552,8 @@ static int ugeth_disable(struct ucc_geth_private *ugeth, enum comm_dir mode)
static void ugeth_quiesce(struct ucc_geth_private *ugeth)
{
- /* Prevent any further xmits, plus detach the device. */
- netif_device_detach(ugeth->ndev);
-
- /* Wait for any current xmits to finish. */
- netif_tx_disable(ugeth->ndev);
+ /* Prevent any further xmits */
+ netif_tx_stop_all_queues(ugeth->ndev);
/* Disable the interrupt to avoid NAPI rescheduling. */
disable_irq(ugeth->ug_info->uf_info.irq);
@@ -1568,7 +1566,10 @@ static void ugeth_activate(struct ucc_geth_private *ugeth)
{
napi_enable(&ugeth->napi);
enable_irq(ugeth->ug_info->uf_info.irq);
- netif_device_attach(ugeth->ndev);
+
+ /* allow to xmit again */
+ netif_tx_wake_all_queues(ugeth->ndev);
+ __netdev_watchdog_up(ugeth->ndev);
}
/* Called every time the controller might need to be made
--
2.25.1
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox