LinuxPPC-Dev Archive on lore.kernel.org


* [PATCH v3 2/7] powerpc: Add support for ISA v3.1
From: Alistair Popple @ 2020-05-21  1:43 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mikey, npiggin, aneesh.kumar, Alistair Popple
In-Reply-To: <20200521014341.29095-1-alistair@popple.id.au>

Newer ISA versions are enabled by clearing all bits in the PCR
associated with previous versions of the ISA. Enable ISA v3.1 support
by updating the PCR mask to include ISA v3.0. This ensures all PCR
bits corresponding to earlier architecture versions get cleared,
thereby enabling ISA v3.1 if supported by the hardware.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
---
 arch/powerpc/include/asm/cputable.h | 1 +
 arch/powerpc/include/asm/reg.h      | 3 ++-
 arch/powerpc/kvm/book3s_hv.c        | 3 ---
 3 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/cputable.h b/arch/powerpc/include/asm/cputable.h
index c67b94f3334c..1559dbf72842 100644
--- a/arch/powerpc/include/asm/cputable.h
+++ b/arch/powerpc/include/asm/cputable.h
@@ -213,6 +213,7 @@ static inline void cpu_feature_keys_init(void) { }
 #define CPU_FTR_P9_TIDR			LONG_ASM_CONST(0x0000800000000000)
 #define CPU_FTR_P9_TLBIE_ERAT_BUG	LONG_ASM_CONST(0x0001000000000000)
 #define CPU_FTR_P9_RADIX_PREFETCH_BUG	LONG_ASM_CONST(0x0002000000000000)
+#define CPU_FTR_ARCH_31			LONG_ASM_CONST(0x0004000000000000)
 
 #ifndef __ASSEMBLY__
 
diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 054f8a71d686..dd20af367b57 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -487,10 +487,11 @@
  * determine both the compatibility level which we want to emulate and the
  * compatibility level which the host is capable of emulating.
  */
+#define   PCR_ARCH_300	0x10		/* Architecture 3.00 */
 #define   PCR_ARCH_207	0x8		/* Architecture 2.07 */
 #define   PCR_ARCH_206	0x4		/* Architecture 2.06 */
 #define   PCR_ARCH_205	0x2		/* Architecture 2.05 */
-#define   PCR_LOW_BITS	(PCR_ARCH_207 | PCR_ARCH_206 | PCR_ARCH_205)
+#define   PCR_LOW_BITS	(PCR_ARCH_207 | PCR_ARCH_206 | PCR_ARCH_205 | PCR_ARCH_300)
 #define   PCR_MASK	~(PCR_HIGH_BITS | PCR_LOW_BITS)	/* PCR Reserved Bits */
 #define	SPRN_HEIR	0x153	/* Hypervisor Emulated Instruction Register */
 #define SPRN_TLBINDEXR	0x154	/* P7 TLB control register */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index db07199f0977..a0cf17597838 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -344,9 +344,6 @@ static void kvmppc_set_pvr_hv(struct kvm_vcpu *vcpu, u32 pvr)
 	vcpu->arch.pvr = pvr;
 }
 
-/* Dummy value used in computing PCR value below */
-#define PCR_ARCH_300	(PCR_ARCH_207 << 1)
-
 static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, u32 arch_compat)
 {
 	unsigned long host_pcr_bit = 0, guest_pcr_bit = 0;
-- 
2.20.1
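
For illustration only, the effect of the widened mask can be checked in a
small userspace sketch. The PCR_ARCH_* values are copied from the reg.h hunk
above; pcr_enable_latest() and OLD_DUMMY_PCR_ARCH_300 are hypothetical names
that do not exist in the kernel, and the sketch models only the low
(architecture level) bits, not PCR_HIGH_BITS.

```c
/* Values copied from the reg.h hunk in this patch. */
#define PCR_ARCH_300	0x10UL	/* Architecture 3.00 */
#define PCR_ARCH_207	0x8UL	/* Architecture 2.07 */
#define PCR_ARCH_206	0x4UL	/* Architecture 2.06 */
#define PCR_ARCH_205	0x2UL	/* Architecture 2.05 */
#define PCR_LOW_BITS	(PCR_ARCH_207 | PCR_ARCH_206 | PCR_ARCH_205 | PCR_ARCH_300)

/* The dummy definition this patch removes from book3s_hv.c. */
#define OLD_DUMMY_PCR_ARCH_300	(PCR_ARCH_207 << 1)

/*
 * Hypothetical helper: clear every architecture-compatibility bit so that
 * no earlier ISA level is being emulated. Other PCR bits are left alone.
 */
static unsigned long pcr_enable_latest(unsigned long pcr)
{
	return pcr & ~PCR_LOW_BITS;
}
```

The old dummy value and the new definition are both 0x10, so moving the
define into reg.h does not change what book3s_hv.c computes. With the
pre-patch mask (0xe), the 0x10 bit would not have been covered by
PCR_LOW_BITS.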



* [PATCH v3 1/7] powerpc: Add new HWCAP bits
From: Alistair Popple @ 2020-05-21  1:43 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mikey, npiggin, aneesh.kumar, Alistair Popple
In-Reply-To: <20200521014341.29095-1-alistair@popple.id.au>

POWER10 introduces two new architectural features: ISA v3.1 and matrix
multiply assist (MMA) instructions. Userspace detects the presence
of these features via two HWCAP bits introduced in this patch. These
bits have been agreed to by the compiler and binutils teams.

According to ISA v3.1, MMA is an optional feature, and software that makes
use of it should first check for availability via this HWCAP bit and use
alternate code paths if unavailable.

Signed-off-by: Alistair Popple <alistair@popple.id.au>
Tested-by: Michael Neuling <mikey@neuling.org>
---
 arch/powerpc/include/uapi/asm/cputable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/cputable.h b/arch/powerpc/include/uapi/asm/cputable.h
index 540592034740..731b97dc2d15 100644
--- a/arch/powerpc/include/uapi/asm/cputable.h
+++ b/arch/powerpc/include/uapi/asm/cputable.h
@@ -50,6 +50,8 @@
 #define PPC_FEATURE2_DARN		0x00200000 /* darn random number insn */
 #define PPC_FEATURE2_SCV		0x00100000 /* scv syscall */
 #define PPC_FEATURE2_HTM_NO_SUSPEND	0x00080000 /* TM w/out suspended state */
+#define PPC_FEATURE2_ARCH_3_1		0x00040000 /* ISA 3.1 */
+#define PPC_FEATURE2_MMA		0x00020000 /* Matrix Multiply Assist */
 
 /*
  * IMPORTANT!
-- 
2.20.1
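
As a minimal sketch of the userspace check described in the commit message,
assuming a Linux libc that provides getauxval() and AT_HWCAP2: the two bit
values are copied from the hunk above, while hwcap2_has() and cpu_has_mma()
are hypothetical helper names chosen for this example.

```c
#include <stdbool.h>
#include <sys/auxv.h>	/* getauxval(), AT_HWCAP2 */

/* Bit values copied from the uapi/asm/cputable.h hunk above. */
#define PPC_FEATURE2_ARCH_3_1	0x00040000UL	/* ISA 3.1 */
#define PPC_FEATURE2_MMA	0x00020000UL	/* Matrix Multiply Assist */

/* Pure helper so the bit test can be exercised without POWER10 hardware. */
static bool hwcap2_has(unsigned long hwcap2, unsigned long feature)
{
	return (hwcap2 & feature) != 0;
}

/*
 * Hypothetical wrapper: the result is only meaningful on powerpc, where
 * these AT_HWCAP2 bits carry the meaning given in this patch.
 */
static bool cpu_has_mma(void)
{
	return hwcap2_has(getauxval(AT_HWCAP2), PPC_FEATURE2_MMA);
}
```

A caller would branch on cpu_has_mma() once at startup and select either
the MMA code path or the alternate one, as the commit message recommends.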



* [PATCH v3 0/7] Base support for POWER10
From: Alistair Popple @ 2020-05-21  1:43 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mikey, npiggin, aneesh.kumar, Alistair Popple

This series brings together several previously posted patches required for
POWER10 support and introduces a new patch adding POWER10 architected
mode, which allows booting as a POWER10 pseries guest.

It includes support for enabling facilities related to MMA and prefix
instructions.

Changes from v2:
  - s/accumulate/assist/
  - Updated commit messages

Changes from v1:
  - Two bug-fixes to enable prefix and MMA on pseries
  - Minor updates to commit message wording
  - Fixes a build error when CONFIG_KVM_BOOK3S_64_HV is enabled

Alistair Popple (7):
  powerpc: Add new HWCAP bits
  powerpc: Add support for ISA v3.1
  powerpc/dt_cpu_ftrs: Advertise support for ISA v3.1 if selected
  powerpc/dt_cpu_ftrs: Set current thread fscr bits
  powerpc/dt_cpu_ftrs: Enable Prefixed Instructions
  powerpc/dt_cpu_ftrs: Add MMA feature
  powerpc: Add POWER10 architected mode

 arch/powerpc/include/asm/cputable.h      | 16 +++++++++++--
 arch/powerpc/include/asm/mmu.h           |  1 +
 arch/powerpc/include/asm/prom.h          |  1 +
 arch/powerpc/include/asm/reg.h           |  6 +++--
 arch/powerpc/include/uapi/asm/cputable.h |  2 ++
 arch/powerpc/kernel/cpu_setup_power.S    | 20 ++++++++++++++--
 arch/powerpc/kernel/cputable.c           | 30 ++++++++++++++++++++++++
 arch/powerpc/kernel/dt_cpu_ftrs.c        | 26 +++++++++++++++++++-
 arch/powerpc/kernel/prom_init.c          | 12 ++++++++--
 arch/powerpc/kvm/book3s_hv.c             |  3 ---
 10 files changed, 105 insertions(+), 12 deletions(-)

-- 
2.20.1



* Re: [PATCH v2 5/7] mm: parallelize deferred_init_memmap()
From: Alexander Duyck @ 2020-05-21  1:29 UTC (permalink / raw)
  To: Daniel Jordan
  Cc: David Hildenbrand, Peter Zijlstra, Dave Hansen, Michal Hocko,
	linux-mm, Steven Sistare, Pavel Machek, Alexander Duyck,
	Steffen Klassert, linux-s390, Herbert Xu, Jonathan Corbet,
	Jason Gunthorpe, Zi Yan, Robert Elliott, Pavel Tatashin,
	Shile Zhang, Josh Triplett, Alex Williamson, Kirill Tkhai,
	Dan Williams, Randy Dunlap, LKML, linux-crypto, Tejun Heo,
	Andrew Morton, open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)
In-Reply-To: <20200520182645.1658949-6-daniel.m.jordan@oracle.com>

On Wed, May 20, 2020 at 11:27 AM Daniel Jordan
<daniel.m.jordan@oracle.com> wrote:
>
> Deferred struct page init is a significant bottleneck in kernel boot.
> Optimizing it maximizes availability for large-memory systems and allows
> spinning up short-lived VMs as needed without having to leave them
> running.  It also benefits bare metal machines hosting VMs that are
> sensitive to downtime.  In projects such as VMM Fast Restart[1], where
> guest state is preserved across kexec reboot, it helps prevent
> application and network timeouts in the guests.
>
> Multithread to take full advantage of system memory bandwidth.
>
> The maximum number of threads is capped at the number of CPUs on the
> node because speedups always improve with additional threads on every
> system tested, and at this phase of boot, the system is otherwise idle
> and waiting on page init to finish.
>
> Helper threads operate on section-aligned ranges to both avoid false
> sharing when setting the pageblock's migrate type and to avoid accessing
> uninitialized buddy pages, though max order alignment is enough for the
> latter.
>
> The minimum chunk size is also a section.  There was benefit to using
> multiple threads even on relatively small memory (1G) systems, and this
> is the smallest size that the alignment allows.
>
> The time (milliseconds) is the slowest node to initialize since boot
> blocks until all nodes finish.  intel_pstate is loaded in active mode
> without hwp and with turbo enabled, and intel_idle is active as well.
>
>     Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal)
>       2 nodes * 26 cores * 2 threads = 104 CPUs
>       384G/node = 768G memory
>
>                    kernel boot                 deferred init
>                    ------------------------    ------------------------
>     node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
>           (  0)         --   4078.0 (  9.0)         --   1779.0 (  8.7)
>        2% (  1)       1.4%   4021.3 (  2.9)       3.4%   1717.7 (  7.8)
>       12% (  6)      35.1%   2644.7 ( 35.3)      80.8%    341.0 ( 35.5)
>       25% ( 13)      38.7%   2498.0 ( 34.2)      89.1%    193.3 ( 32.3)
>       37% ( 19)      39.1%   2482.0 ( 25.2)      90.1%    175.3 ( 31.7)
>       50% ( 26)      38.8%   2495.0 (  8.7)      89.1%    193.7 (  3.5)
>       75% ( 39)      39.2%   2478.0 ( 21.0)      90.3%    172.7 ( 26.7)
>      100% ( 52)      40.0%   2448.0 (  2.0)      91.9%    143.3 (  1.5)
>
>     Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, bare metal)
>       1 node * 16 cores * 2 threads = 32 CPUs
>       192G/node = 192G memory
>
>                    kernel boot                 deferred init
>                    ------------------------    ------------------------
>     node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
>           (  0)         --   1996.0 ( 18.0)         --   1104.3 (  6.7)
>        3% (  1)       1.4%   1968.0 (  3.0)       2.7%   1074.7 (  9.0)
>       12% (  4)      40.1%   1196.0 ( 22.7)      72.4%    305.3 ( 16.8)
>       25% (  8)      47.4%   1049.3 ( 17.2)      84.2%    174.0 ( 10.6)
>       37% ( 12)      48.3%   1032.0 ( 14.9)      86.8%    145.3 (  2.5)
>       50% ( 16)      48.9%   1020.3 (  2.5)      88.0%    133.0 (  1.7)
>       75% ( 24)      49.1%   1016.3 (  8.1)      88.4%    128.0 (  1.7)
>      100% ( 32)      49.4%   1009.0 (  8.5)      88.6%    126.3 (  0.6)
>
>     Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal)
>       2 nodes * 18 cores * 2 threads = 72 CPUs
>       128G/node = 256G memory
>
>                    kernel boot                 deferred init
>                    ------------------------    ------------------------
>     node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
>           (  0)         --   1682.7 (  6.7)         --    630.0 (  4.6)
>        3% (  1)       0.4%   1676.0 (  2.0)       0.7%    625.3 (  3.2)
>       12% (  4)      25.8%   1249.0 (  1.0)      68.2%    200.3 (  1.2)
>       25% (  9)      30.0%   1178.0 (  5.2)      79.7%    128.0 (  3.5)
>       37% ( 13)      30.6%   1167.7 (  3.1)      81.3%    117.7 (  1.2)
>       50% ( 18)      30.6%   1167.3 (  2.3)      81.4%    117.0 (  1.0)
>       75% ( 27)      31.0%   1161.3 (  4.6)      82.5%    110.0 (  6.9)
>      100% ( 36)      32.1%   1142.0 (  3.6)      85.7%     90.0 (  1.0)
>
>     AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
>       1 node * 8 cores * 2 threads = 16 CPUs
>       64G/node = 64G memory
>
>                    kernel boot                 deferred init
>                    ------------------------    ------------------------
>     node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
>           (  0)         --   1003.7 ( 16.6)         --    243.3 (  8.1)
>        6% (  1)       1.4%    990.0 (  4.6)       1.2%    240.3 (  1.5)
>       12% (  2)      11.4%    889.3 ( 16.7)      44.5%    135.0 (  3.0)
>       25% (  4)      16.8%    835.3 (  9.0)      65.8%     83.3 (  2.5)
>       37% (  6)      18.6%    816.7 ( 17.6)      70.4%     72.0 (  1.0)
>       50% (  8)      18.2%    821.0 (  5.0)      70.7%     71.3 (  1.2)
>       75% ( 12)      19.0%    813.3 (  5.0)      71.8%     68.7 (  2.1)
>      100% ( 16)      19.8%    805.3 ( 10.8)      76.4%     57.3 ( 15.9)
>
> Server-oriented distros that enable deferred page init sometimes run in
> small VMs, and they still benefit even though the fraction of boot time
> saved is smaller:
>
>     AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
>       1 node * 2 cores * 2 threads = 4 CPUs
>       16G/node = 16G memory
>
>                    kernel boot                 deferred init
>                    ------------------------    ------------------------
>     node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
>           (  0)         --    722.3 (  9.5)         --     50.7 (  0.6)
>       25% (  1)      -3.3%    746.3 (  4.7)      -2.0%     51.7 (  1.2)
>       50% (  2)       0.2%    721.0 ( 11.3)      29.6%     35.7 (  4.9)
>       75% (  3)      -0.3%    724.3 ( 11.2)      48.7%     26.0 (  0.0)
>      100% (  4)       3.0%    700.3 ( 13.6)      55.9%     22.3 (  0.6)
>
>     Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, kvm guest)
>       1 node * 2 cores * 2 threads = 4 CPUs
>       14G/node = 14G memory
>
>                    kernel boot                 deferred init
>                    ------------------------    ------------------------
>     node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
>           (  0)         --    673.0 (  6.9)         --     57.0 (  1.0)
>       25% (  1)      -0.6%    677.3 ( 19.8)       1.8%     56.0 (  1.0)
>       50% (  2)       3.4%    650.0 (  3.6)      36.8%     36.0 (  5.2)
>       75% (  3)       4.2%    644.7 (  7.6)      56.1%     25.0 (  1.0)
>      100% (  4)       5.3%    637.0 (  5.6)      63.2%     21.0 (  0.0)
>
> On Josh's 96-CPU and 192G memory system:
>
>     Without this patch series:
>     [    0.487132] node 0 initialised, 23398907 pages in 292ms
>     [    0.499132] node 1 initialised, 24189223 pages in 304ms
>     ...
>     [    0.629376] Run /sbin/init as init process
>
>     With this patch series:
>     [    0.227868] node 0 initialised, 23398907 pages in 28ms
>     [    0.230019] node 1 initialised, 24189223 pages in 28ms
>     ...
>     [    0.361069] Run /sbin/init as init process
>
> [1] https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf
>
> Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
> ---
>  mm/Kconfig      |  6 ++---
>  mm/page_alloc.c | 60 ++++++++++++++++++++++++++++++++++++++++++++-----
>  2 files changed, 58 insertions(+), 8 deletions(-)
>
> diff --git a/mm/Kconfig b/mm/Kconfig
> index c1acc34c1c358..04c1da3f9f44c 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -750,13 +750,13 @@ config DEFERRED_STRUCT_PAGE_INIT
>         depends on SPARSEMEM
>         depends on !NEED_PER_CPU_KM
>         depends on 64BIT
> +       select PADATA
>         help
>           Ordinarily all struct pages are initialised during early boot in a
>           single thread. On very large machines this can take a considerable
>           amount of time. If this option is set, large machines will bring up
> -         a subset of memmap at boot and then initialise the rest in parallel
> -         by starting one-off "pgdatinitX" kernel thread for each node X. This
> -         has a potential performance impact on processes running early in the
> +         a subset of memmap at boot and then initialise the rest in parallel.
> +         This has a potential performance impact on tasks running early in the
>           lifetime of the system until these kthreads finish the
>           initialisation.
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index d0c0d9364aa6d..9cb780e8dec78 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -68,6 +68,7 @@
>  #include <linux/lockdep.h>
>  #include <linux/nmi.h>
>  #include <linux/psi.h>
> +#include <linux/padata.h>
>
>  #include <asm/sections.h>
>  #include <asm/tlbflush.h>
> @@ -1814,16 +1815,44 @@ deferred_init_maxorder(u64 *i, struct zone *zone, unsigned long *start_pfn,
>         return nr_pages;
>  }
>
> +struct definit_args {
> +       struct zone *zone;
> +       atomic_long_t nr_pages;
> +};
> +
> +static void __init
> +deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
> +                          void *arg)
> +{
> +       unsigned long spfn, epfn, nr_pages = 0;
> +       struct definit_args *args = arg;
> +       struct zone *zone = args->zone;
> +       u64 i;
> +
> +       deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn, start_pfn);
> +
> +       /*
> +        * Initialize and free pages in MAX_ORDER sized increments so that we
> +        * can avoid introducing any issues with the buddy allocator.
> +        */
> +       while (spfn < end_pfn) {
> +               nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> +               cond_resched();
> +       }
> +
> +       atomic_long_add(nr_pages, &args->nr_pages);
> +}
> +

Personally I would get rid of nr_pages entirely. It isn't worth the
cache thrash to have this atomic variable bouncing around. You could
probably just have this function return void, since all nr_pages is
used for is a pr_info statement at the end of the initialization,
which will be completely useless now anyway since the threads are
running in parallel.

We only really need the nr_pages logic in deferred_grow_zone in order
to track if we have freed enough pages to allow us to go back to what
we were doing.

>  /* Initialise remaining memory on a node */
>  static int __init deferred_init_memmap(void *data)
>  {
>         pg_data_t *pgdat = data;
>         const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
>         unsigned long spfn = 0, epfn = 0, nr_pages = 0;
> -       unsigned long first_init_pfn, flags;
> +       unsigned long first_init_pfn, flags, epfn_align;
>         unsigned long start = jiffies;
>         struct zone *zone;
> -       int zid;
> +       int zid, max_threads;
>         u64 i;
>
>         /* Bind memory initialisation thread to a local node if possible */
> @@ -1863,11 +1892,32 @@ static int __init deferred_init_memmap(void *data)
>                 goto zone_empty;
>
>         /*
> -        * Initialize and free pages in MAX_ORDER sized increments so
> -        * that we can avoid introducing any issues with the buddy
> -        * allocator.
> +        * More CPUs always led to greater speedups on tested systems, up to
> +        * all the nodes' CPUs.  Use all since the system is otherwise idle now.
>          */
> +       max_threads = max(cpumask_weight(cpumask), 1u);
> +
>         while (spfn < epfn) {
> +               epfn_align = ALIGN_DOWN(epfn, PAGES_PER_SECTION);
> +
> +               if (IS_ALIGNED(spfn, PAGES_PER_SECTION) &&
> +                   epfn_align - spfn >= PAGES_PER_SECTION) {
> +                       struct definit_args arg = { zone, ATOMIC_LONG_INIT(0) };
> +                       struct padata_mt_job job = {
> +                               .thread_fn   = deferred_init_memmap_chunk,
> +                               .fn_arg      = &arg,
> +                               .start       = spfn,
> +                               .size        = epfn_align - spfn,
> +                               .align       = PAGES_PER_SECTION,
> +                               .min_chunk   = PAGES_PER_SECTION,
> +                               .max_threads = max_threads,
> +                       };
> +
> +                       padata_do_multithreaded(&job);
> +                       nr_pages += atomic_long_read(&arg.nr_pages);
> +                       spfn = epfn_align;
> +               }
> +
>                 nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
>                 cond_resched();
>         }

This doesn't look right. You are basically adding threads in addition
to calls to deferred_init_maxorder. In addition you are spawning one
job per section instead of per range. Really you should be going for
something more along the lines of:

        while (spfn < epfn) {
                unsigned long epfn_align = ALIGN(epfn, PAGES_PER_SECTION);
                struct definit_args arg = { zone, ATOMIC_LONG_INIT(0) };
                struct padata_mt_job job = {
                        .thread_fn   = deferred_init_memmap_chunk,
                        .fn_arg      = &arg,
                        .start       = spfn,
                        .size        = epfn_align - spfn,
                        .align       = PAGES_PER_SECTION,
                        .min_chunk   = PAGES_PER_SECTION,
                        .max_threads = max_threads,
                };

                padata_do_multithreaded(&job);

                for_each_free_mem_pfn_range_in_zone_from(i, zone, spfn, epfn) {
                        if (epfn_align <= spfn)
                                break;
                }
        }

This should accomplish the same thing, but much more efficiently. The
only thing you really lose is the tracking of nr_pages, which really
doesn't add anything since the value could shift around depending on
how many times deferred_grow_zone got called.

Also the spfn should already be section aligned, or at least be in a
new section unrelated to the one we just scheduled, so there is no
need for the extra checks you had.
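
As a rough illustration of the alignment arithmetic being discussed, here
is a userspace sketch. PAGES_PER_SECTION is set to 32768 purely as an
assumption (the real value depends on architecture and configuration), the
macros are simplified power-of-two versions rather than the kernel's
definitions, and range_is_multithreadable() is a hypothetical name for the
condition used in the original patch.

```c
#include <stdbool.h>

#define PAGES_PER_SECTION	32768UL	/* assumed value, illustration only */

/* Simplified power-of-two alignment helpers (not the kernel's macros). */
#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))
#define ALIGN(x, a)		(((x) + (a) - 1) & ~((a) - 1))
#define IS_ALIGNED(x, a)	(((x) & ((a) - 1)) == 0)

/*
 * Condition from the original patch: hand a range to the helper threads
 * only if it starts on a section boundary and covers at least one whole
 * section. Assumes spfn < epfn, as guaranteed by the enclosing loop.
 */
static bool range_is_multithreadable(unsigned long spfn, unsigned long epfn)
{
	unsigned long epfn_align = ALIGN_DOWN(epfn, PAGES_PER_SECTION);

	return IS_ALIGNED(spfn, PAGES_PER_SECTION) &&
	       epfn_align - spfn >= PAGES_PER_SECTION;
}
```

ALIGN_DOWN() leaves any partial trailing section to the single-threaded
path, whereas the ALIGN() used in the suggested loop rounds the end of the
job up to the next section boundary.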


* Re: Endless soft-lockups for compiling workload since next-20200519
From: Frederic Weisbecker @ 2020-05-21  0:40 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Paul E. McKenney, Linux Kernel Mailing List, Qian Cai,
	Borislav Petkov, Thomas Gleixner, linuxppc-dev
In-Reply-To: <20200520125056.GC325280@hirez.programming.kicks-ass.net>

On Wed, May 20, 2020 at 02:50:56PM +0200, Peter Zijlstra wrote:
> On Tue, May 19, 2020 at 11:58:17PM -0400, Qian Cai wrote:
> > Just a head up. Repeatedly compiling kernels for a while would trigger
> > endless soft-lockups since next-20200519 on both x86_64 and powerpc.
> > .config are in,
> 
> Could be 90b5363acd47 ("sched: Clean up scheduler_ipi()"), although I've
> not seen anything like that myself. Let me go have a look.
> 
> 
> In as far as the logs are readable (they're a wrapped mess, please don't
> do that!), they contain very little useful, as is typical with IPIs :/
> 
> > [ 1167.993773][    C1] WARNING: CPU: 1 PID: 0 at kernel/smp.c:127
> > flush_smp_call_function_queue+0x1fa/0x2e0

So I've tried to think of a race that could produce that and here is
the only thing I could come up with. It's a bit complicated unfortunately:

CPU 0                                              CPU 1
-----                                              -----

tick {
    trigger_load_balance() {
        raise_softirq(SCHED_SOFTIRQ);
        //but nohz_flags(0) = 0
    }
                                                   kick_ilb() {
                                                       atomic_fetch_or(...., nohz_flags(0))
    softirq() {                                        #VMEXIT or anything that could stop a CPU for a while
        run_rebalance_domain() {
            nohz_idle_balance() {
                atomic_andnot(NOHZ_KICK_MASK, nohz_flags(0))
            }
        }
    }
}

// schedule
nohz_newidle_balance() {
    kick_ilb() { // pick current CPU
        atomic_fetch_or(...., nohz_flags(0))           #VMENTER
        smp_call_function_single_async() {             smp_call_function_single_async() {
            // verified csd->flags != CSD_LOCK             // verified csd->flags != CSD_LOCK
            csd->flags = CSD_LOCK                          csd->flags = CSD_LOCK
            //execute in place                             //queue and send IPI
            csd->flags = 0
            nohz_csd_func()
	}
    }
}


IPI {
    flush_smp_call_function_queue() {
        csd_unlock() {
            WARN_ON(csd->flags != CSD_LOCK) <---------!!!!!



The root cause here would be that trigger_load_balance() unconditionally raises
the softirq. And I have to confess I'm not clear why, since the softirq is
essentially a no-op when nohz_flags() is 0.
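
To make the interleaving easier to follow, here is a single-threaded
userspace replay of it. This is only a sketch: struct csd, the CSD_LOCK
value and all helper names are hypothetical stand-ins, not the kernel's
definitions, and the two "CPUs" are simply statements executed in the
order shown in the diagram.

```c
#include <stdbool.h>

#define CSD_LOCK	0x01u	/* assumed value, illustration only */

struct csd {
	unsigned int flags;
};

static bool csd_is_free(const struct csd *c)
{
	return c->flags != CSD_LOCK;
}

static void csd_lock(struct csd *c)
{
	c->flags = CSD_LOCK;
}

/* Returns true where the WARN_ON(csd->flags != CSD_LOCK) would fire. */
static bool csd_unlock(struct csd *c)
{
	bool warn = c->flags != CSD_LOCK;

	c->flags = 0;
	return warn;
}

/* Replays the diagram; returns true if the final unlock would warn. */
static bool replay_race(void)
{
	struct csd csd = { 0 };

	/* Both CPUs verify the csd is unlocked before either locks it. */
	bool cpu0_free = csd_is_free(&csd);
	bool cpu1_free = csd_is_free(&csd);

	if (!cpu0_free || !cpu1_free)
		return false;

	csd_lock(&csd);		/* CPU 0: kick_ilb() picked itself */
	csd_lock(&csd);		/* CPU 1: resumes after the stall */
	csd.flags = 0;		/* CPU 0: executes in place, releases */

	/* CPU 1's IPI now arrives on CPU 0 and tries to unlock again. */
	return csd_unlock(&csd);
}
```

In this replay the second unlock finds flags already cleared, which is the
condition the warning in flush_smp_call_function_queue() checks for.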

Thanks.


* Re: [PATCH] soc: fsl: qe: Replace one-element array and use struct_size() helper
From: Gustavo A. R. Silva @ 2020-05-21  0:01 UTC (permalink / raw)
  To: Li Yang
  Cc: Kees Cook, Gustavo A. R. Silva, lkml, linuxppc-dev,
	moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE,
	Qiang Zhao
In-Reply-To: <CADRPPNR-Croux9FgnrQJJmdF2jNnuAmC+2xMJSgSbkbRv9u8Mw@mail.gmail.com>

On Wed, May 20, 2020 at 06:52:21PM -0500, Li Yang wrote:
> On Mon, May 18, 2020 at 5:57 PM Kees Cook <keescook@chromium.org> wrote:
> >
> > On Mon, May 18, 2020 at 05:19:04PM -0500, Gustavo A. R. Silva wrote:
> > > The current codebase makes use of one-element arrays in the following
> > > form:
> > >
> > > struct something {
> > >     int length;
> > >     u8 data[1];
> > > };
> > >
> > > struct something *instance;
> > >
> > > instance = kmalloc(sizeof(*instance) + size, GFP_KERNEL);
> > > instance->length = size;
> > > memcpy(instance->data, source, size);
> > >
> > > but the preferred mechanism to declare variable-length types such as
> > > these ones is a flexible array member[1][2], introduced in C99:
> > >
> > > struct foo {
> > >         int stuff;
> > >         struct boo array[];
> > > };
> > >
> > > By making use of the mechanism above, we will get a compiler warning
> > > in case the flexible array does not occur last in the structure, which
> > > will help us prevent some kind of undefined behavior bugs from being
> > > inadvertently introduced[3] to the codebase from now on. So, replace
> > > the one-element array with a flexible-array member.
> > >
> > > Also, make use of the new struct_size() helper to properly calculate the
> > > size of struct qe_firmware.
> > >
> > > This issue was found with the help of Coccinelle and, audited and fixed
> > > _manually_.
> > >
> > > [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
> > > [2] https://github.com/KSPP/linux/issues/21
> > > [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")
> > >
> > > Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
> > > ---
> > >  drivers/soc/fsl/qe/qe.c | 4 ++--
> > >  include/soc/fsl/qe/qe.h | 2 +-
> > >  2 files changed, 3 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/soc/fsl/qe/qe.c b/drivers/soc/fsl/qe/qe.c
> > > index 447146861c2c1..2df20d6f85fa4 100644
> > > --- a/drivers/soc/fsl/qe/qe.c
> > > +++ b/drivers/soc/fsl/qe/qe.c
> > > @@ -448,7 +448,7 @@ int qe_upload_firmware(const struct qe_firmware *firmware)
> > >       unsigned int i;
> > >       unsigned int j;
> > >       u32 crc;
> > > -     size_t calc_size = sizeof(struct qe_firmware);
> > > +     size_t calc_size;
> > >       size_t length;
> > >       const struct qe_header *hdr;
> > >
> > > @@ -480,7 +480,7 @@ int qe_upload_firmware(const struct qe_firmware *firmware)
> > >       }
> > >
> > >       /* Validate the length and check if there's a CRC */
> > > -     calc_size += (firmware->count - 1) * sizeof(struct qe_microcode);
> > > +     calc_size = struct_size(firmware, microcode, firmware->count);
> > >
> > >       for (i = 0; i < firmware->count; i++)
> > >               /*
> > > diff --git a/include/soc/fsl/qe/qe.h b/include/soc/fsl/qe/qe.h
> > > index e282ac01ec081..3feddfec9f87d 100644
> > > --- a/include/soc/fsl/qe/qe.h
> > > +++ b/include/soc/fsl/qe/qe.h
> > > @@ -307,7 +307,7 @@ struct qe_firmware {
> > >               u8 revision;            /* The microcode version revision */
> > >               u8 padding;             /* Reserved, for alignment */
> > >               u8 reserved[4];         /* Reserved, for future expansion */
> > > -     } __attribute__ ((packed)) microcode[1];
> > > +     } __packed microcode[];
> > >       /* All microcode binaries should be located here */
> > >       /* CRC32 should be located here, after the microcode binaries */
> > >  } __attribute__ ((packed));
> > > --
> > > 2.26.2
> > >
> >
> > Hm, looking at this code, I see a few other things that need to be
> > fixed:
> >
> > 1) drivers/tty/serial/ucc_uart.c does not do a be32_to_cpu() conversion
> >    on the length test (understandably, a little-endian system has never run
> >    this code since it's ppc specific), but it's still wrong:
> >
> >         if (firmware->header.length != fw->size) {
> >
> >    compare to the firmware loader:
> >
> >         length = be32_to_cpu(hdr->length);
> >
> > 2) drivers/soc/fsl/qe/qe.c does not perform bounds checking on the
> >    per-microcode offsets, so the uploader might send data outside the
> >    firmware buffer. Perhaps:
> 
> We do validate the CRC for each microcode, it is unlikely the CRC
> check can pass if the offset or length is not correct.  But you are
> probably right that it will be safer to check the boundary and fail
> quicker before we actually start the CRC check.  Will you come up with
> a formal patch or you want us to deal with it?
> 

Li,

I will send a proper patch for this.

Thanks
--
Gustavo

> >
> >
> > diff --git a/drivers/soc/fsl/qe/qe.c b/drivers/soc/fsl/qe/qe.c
> > index 447146861c2c..c4e0bc452f03 100644
> > --- a/drivers/soc/fsl/qe/qe.c
> > +++ b/drivers/soc/fsl/qe/qe.c
> > @@ -451,6 +451,7 @@ int qe_upload_firmware(const struct qe_firmware *firmware)
> >         size_t calc_size = sizeof(struct qe_firmware);
> >         size_t length;
> >         const struct qe_header *hdr;
> > +       void *firmware_end;
> >
> >         if (!firmware) {
> >                 printk(KERN_ERR "qe-firmware: invalid pointer\n");
> > @@ -491,19 +492,39 @@ int qe_upload_firmware(const struct qe_firmware *firmware)
> >                 calc_size += sizeof(__be32) *
> >                         be32_to_cpu(firmware->microcode[i].count);
> >
> > -       /* Validate the length */
> > +       /* Validate total length */
> >         if (length != calc_size + sizeof(__be32)) {
> >                 printk(KERN_ERR "qe-firmware: invalid length\n");
> >                 return -EPERM;
> >         }
> >
> >         /* Validate the CRC */
> > -       crc = be32_to_cpu(*(__be32 *)((void *)firmware + calc_size));
> > +       firmware_end = (void *)firmware + calc_size;
> > +       crc = be32_to_cpu(*(__be32 *)firmware_end);
> >         if (crc != crc32(0, firmware, calc_size)) {
> >                 printk(KERN_ERR "qe-firmware: firmware CRC is invalid\n");
> >                 return -EIO;
> >         }
> >
> > +       /* Validate ucode lengths and offsets */
> > +       for (i = 0; i < firmware->count; i++) {
> > +               const struct qe_microcode *ucode = &firmware->microcode[i];
> > +               __be32 *code;
> > +               size_t count;
> > +
> > +               if (!ucode->code_offset)
> > +                       continue;
> > +
> > +               code = (void *)firmware + be32_to_cpu(ucode->code_offset);
> > +               count = be32_to_cpu(ucode->count) * sizeof(*code);
> > +
> > +               if (code < firmware || code >= firmware_end ||
> > +                   code + count < firmware || code + count >= firmware_end) {
> > +                       printk(KERN_ERR "qe-firmware: invalid ucode offset\n");
> > +                       return -EIO;
> > +               }
> > +       }
> > +
> >         /*
> >          * If the microcode calls for it, split the I-RAM.
> >          */
> >
> >
> > I haven't tested this.
> >
> >
> > --
> > Kees Cook


* Re: [PATCH] soc: fsl: qe: Replace one-element array and use struct_size() helper
From: Li Yang @ 2020-05-20 23:52 UTC (permalink / raw)
  To: Kees Cook
  Cc: Gustavo A. R. Silva, lkml, Gustavo A. R. Silva, linuxppc-dev,
	moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE,
	Qiang Zhao
In-Reply-To: <202005181529.C0CB448FBB@keescook>

On Mon, May 18, 2020 at 5:57 PM Kees Cook <keescook@chromium.org> wrote:
>
> On Mon, May 18, 2020 at 05:19:04PM -0500, Gustavo A. R. Silva wrote:
> > The current codebase makes use of one-element arrays in the following
> > form:
> >
> > struct something {
> >     int length;
> >     u8 data[1];
> > };
> >
> > struct something *instance;
> >
> > instance = kmalloc(sizeof(*instance) + size, GFP_KERNEL);
> > instance->length = size;
> > memcpy(instance->data, source, size);
> >
> > but the preferred mechanism to declare variable-length types such as
> > these ones is a flexible array member[1][2], introduced in C99:
> >
> > struct foo {
> >         int stuff;
> >         struct boo array[];
> > };
> >
> > By making use of the mechanism above, we will get a compiler warning
> > in case the flexible array does not occur last in the structure, which
> > will help us prevent some kind of undefined behavior bugs from being
> > inadvertently introduced[3] to the codebase from now on. So, replace
> > the one-element array with a flexible-array member.
> >
> > Also, make use of the new struct_size() helper to properly calculate the
> > size of struct qe_firmware.
> >
> > This issue was found with the help of Coccinelle and, audited and fixed
> > _manually_.
> >
> > [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
> > [2] https://github.com/KSPP/linux/issues/21
> > [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")
> >
> > Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
> > ---
> >  drivers/soc/fsl/qe/qe.c | 4 ++--
> >  include/soc/fsl/qe/qe.h | 2 +-
> >  2 files changed, 3 insertions(+), 3 deletions(-)
> >
> > diff --git a/drivers/soc/fsl/qe/qe.c b/drivers/soc/fsl/qe/qe.c
> > index 447146861c2c1..2df20d6f85fa4 100644
> > --- a/drivers/soc/fsl/qe/qe.c
> > +++ b/drivers/soc/fsl/qe/qe.c
> > @@ -448,7 +448,7 @@ int qe_upload_firmware(const struct qe_firmware *firmware)
> >       unsigned int i;
> >       unsigned int j;
> >       u32 crc;
> > -     size_t calc_size = sizeof(struct qe_firmware);
> > +     size_t calc_size;
> >       size_t length;
> >       const struct qe_header *hdr;
> >
> > @@ -480,7 +480,7 @@ int qe_upload_firmware(const struct qe_firmware *firmware)
> >       }
> >
> >       /* Validate the length and check if there's a CRC */
> > -     calc_size += (firmware->count - 1) * sizeof(struct qe_microcode);
> > +     calc_size = struct_size(firmware, microcode, firmware->count);
> >
> >       for (i = 0; i < firmware->count; i++)
> >               /*
> > diff --git a/include/soc/fsl/qe/qe.h b/include/soc/fsl/qe/qe.h
> > index e282ac01ec081..3feddfec9f87d 100644
> > --- a/include/soc/fsl/qe/qe.h
> > +++ b/include/soc/fsl/qe/qe.h
> > @@ -307,7 +307,7 @@ struct qe_firmware {
> >               u8 revision;            /* The microcode version revision */
> >               u8 padding;             /* Reserved, for alignment */
> >               u8 reserved[4];         /* Reserved, for future expansion */
> > -     } __attribute__ ((packed)) microcode[1];
> > +     } __packed microcode[];
> >       /* All microcode binaries should be located here */
> >       /* CRC32 should be located here, after the microcode binaries */
> >  } __attribute__ ((packed));
> > --
> > 2.26.2
> >
>
> Hm, looking at this code, I see a few other things that need to be
> fixed:
>
> 1) drivers/tty/serial/ucc_uart.c does not do a be32_to_cpu() conversion
>    on the length test (understandably, a little-endian system has never run
>    this code since it's ppc specific), but it's still wrong:
>
>         if (firmware->header.length != fw->size) {
>
>    compare to the firmware loader:
>
>         length = be32_to_cpu(hdr->length);
>
> 2) drivers/soc/fsl/qe/qe.c does not perform bounds checking on the
>    per-microcode offsets, so the uploader might send data outside the
>    firmware buffer. Perhaps:

We do validate the CRC for each microcode, so it is unlikely that the CRC
check can pass if the offset or length is not correct.  But you are
probably right that it would be safer to check the bounds and fail
early, before we actually start the CRC check.  Will you come up with
a formal patch, or do you want us to deal with it?

>
>
> diff --git a/drivers/soc/fsl/qe/qe.c b/drivers/soc/fsl/qe/qe.c
> index 447146861c2c..c4e0bc452f03 100644
> --- a/drivers/soc/fsl/qe/qe.c
> +++ b/drivers/soc/fsl/qe/qe.c
> @@ -451,6 +451,7 @@ int qe_upload_firmware(const struct qe_firmware *firmware)
>         size_t calc_size = sizeof(struct qe_firmware);
>         size_t length;
>         const struct qe_header *hdr;
> +       void *firmware_end;
>
>         if (!firmware) {
>                 printk(KERN_ERR "qe-firmware: invalid pointer\n");
> @@ -491,19 +492,39 @@ int qe_upload_firmware(const struct qe_firmware *firmware)
>                 calc_size += sizeof(__be32) *
>                         be32_to_cpu(firmware->microcode[i].count);
>
> -       /* Validate the length */
> +       /* Validate total length */
>         if (length != calc_size + sizeof(__be32)) {
>                 printk(KERN_ERR "qe-firmware: invalid length\n");
>                 return -EPERM;
>         }
>
>         /* Validate the CRC */
> -       crc = be32_to_cpu(*(__be32 *)((void *)firmware + calc_size));
> +       firmware_end = (void *)firmware + calc_size;
> +       crc = be32_to_cpu(*(__be32 *)firmware_end);
>         if (crc != crc32(0, firmware, calc_size)) {
>                 printk(KERN_ERR "qe-firmware: firmware CRC is invalid\n");
>                 return -EIO;
>         }
>
> +       /* Validate ucode lengths and offsets */
> +       for (i = 0; i < firmware->count; i++) {
> +               const struct qe_microcode *ucode = &firmware->microcode[i];
> +               __be32 *code;
> +               size_t count;
> +
> +               if (!ucode->code_offset)
> +                       continue;
> +
> +               code = (void *)firmware + be32_to_cpu(ucode->code_offset);
> +               count = be32_to_cpu(ucode->count) * sizeof(*code);
> +
> +               if (code < firmware || code >= firmware_end ||
> +                   code + count < firmware || code + count >= firmware_end) {
> +                       printk(KERN_ERR "qe-firmware: invalid ucode offset\n");
> +                       return -EIO;
> +               }
> +       }
> +
>         /*
>          * If the microcode calls for it, split the I-RAM.
>          */
>
>
> I haven't tested this.
>
>
> --
> Kees Cook
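
A minimal sketch of the kind of per-microcode bounds check discussed
above, written as standalone C rather than driver code. The helper name
ucode_region_ok() and the buf_len parameter are hypothetical and do not
exist in the qe driver. It works on byte offsets instead of pointers, so
the word count is scaled by the word size exactly once and none of the
sums can wrap:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the qe driver: returns true when the
 * region [offset, offset + words * 4) lies entirely inside a buffer of
 * buf_len bytes. Only byte offsets are compared, so no pointer is ever
 * formed outside the buffer.
 */
static bool ucode_region_ok(size_t buf_len, uint32_t offset, uint32_t words)
{
	/* Widen first so the multiplication cannot overflow 32 bits. */
	uint64_t bytes = (uint64_t)words * sizeof(uint32_t);

	/* The start of the region must itself be inside the buffer. */
	if (offset > buf_len)
		return false;

	/* Compare the length against the space remaining after offset. */
	return bytes <= (uint64_t)(buf_len - offset);
}
```

In a loop like the one in the draft, this would presumably be called
with the host-endian values of code_offset and count and with the
already validated total length; a region that ends exactly at the end
of the buffer is accepted.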


* Re: [PATCH v6 09/11] PCI: layerscape: Add EP mode support for ls1088a and ls2088a
From: Rob Herring @ 2020-05-20 20:50 UTC (permalink / raw)
  To: Xiaowei Bao
  Cc: devicetree, andrew.murray, lorenzo.pieralisi, jingoohan1,
	linux-pci, Zhiqiang.Hou, linux-kernel, kishon, Minghuan.Lian,
	robh+dt, mingkai.hu, gustavo.pimentel, bhelgaas, shawnguo,
	roy.zang, linuxppc-dev, leoyang.li, linux-arm-kernel, amurray
In-Reply-To: <20200314033038.24844-10-xiaowei.bao@nxp.com>

On Sat, 14 Mar 2020 11:30:36 +0800, Xiaowei Bao wrote:
> Add PCIe EP mode support for ls1088a and ls2088a, there are some
> difference between LS1 and LS2 platform, so refactor the code of
> the EP driver.
> 
> Signed-off-by: Xiaowei Bao <xiaowei.bao@nxp.com>
> ---
> v2:
>  - This is a new patch for supporting the ls1088a and ls2088a platform.
> v3:
>  - Adjust the some struct assignment order in probe function.
> v4:
>  - No change.
> v5:
>  - No change.
> v6:
>  - No change.
> 
>  drivers/pci/controller/dwc/pci-layerscape-ep.c | 72 +++++++++++++++++++-------
>  1 file changed, 53 insertions(+), 19 deletions(-)
> 

Reviewed-by: Rob Herring <robh@kernel.org>


* Re: [PATCH v6 07/11] PCI: layerscape: Modify the way of getting capability with different PEX
From: Rob Herring @ 2020-05-20 20:45 UTC (permalink / raw)
  To: Xiaowei Bao
  Cc: devicetree, andrew.murray, lorenzo.pieralisi, roy.zang,
	gustavo.pimentel, Zhiqiang.Hou, linux-kernel, kishon,
	Minghuan.Lian, jingoohan1, robh+dt, mingkai.hu, linux-pci,
	bhelgaas, shawnguo, leoyang.li, linuxppc-dev, linux-arm-kernel,
	amurray
In-Reply-To: <20200314033038.24844-8-xiaowei.bao@nxp.com>

On Sat, 14 Mar 2020 11:30:34 +0800, Xiaowei Bao wrote:
> The different PCIe controller in one board may be have different
> capability of MSI or MSIX, so change the way of getting the MSI
> capability, make it more flexible.
> 
> Signed-off-by: Xiaowei Bao <xiaowei.bao@nxp.com>
> ---
> v2:
>  - Remove the repeated assignment code.
> v3:
>  - Use ep_func msi_cap and msix_cap to decide the msi_capable and
>    msix_capable of pci_epc_features struct.
> v4:
>  - No change.
> v5:
>  - No change.
> v6:
>  - No change.
> 
>  drivers/pci/controller/dwc/pci-layerscape-ep.c | 31 +++++++++++++++++++-------
>  1 file changed, 23 insertions(+), 8 deletions(-)
> 

Reviewed-by: Rob Herring <robh@kernel.org>


* Re: [PATCH v6 04/11] PCI: designware-ep: Modify MSI and MSIX CAP way of finding
From: Rob Herring @ 2020-05-20 20:45 UTC (permalink / raw)
  To: Xiaowei Bao
  Cc: roy.zang, lorenzo.pieralisi, devicetree, jingoohan1, Zhiqiang.Hou,
	linuxppc-dev, linux-pci, linux-kernel, leoyang.li, Minghuan.Lian,
	linux-arm-kernel, gustavo.pimentel, bhelgaas, andrew.murray,
	kishon, shawnguo, mingkai.hu, amurray
In-Reply-To: <20200314033038.24844-5-xiaowei.bao@nxp.com>

On Sat, Mar 14, 2020 at 11:30:31AM +0800, Xiaowei Bao wrote:
> Each PF of EP device should have it's own MSI or MSIX capabitily

s/it's/its/

> struct, so create a dw_pcie_ep_func struct and remove the msi_cap
> and msix_cap to this struct from dw_pcie_ep, and manage the PFs
> with a list.
> 
> Signed-off-by: Xiaowei Bao <xiaowei.bao@nxp.com>
> ---
> v3:
>  - This is a new patch, to fix the issue of MSI and MSIX CAP way of
>    finding.
> v4:
>  - Correct some word of commit message.
> v5:
>  - No change.
> v6:
>  - Fix up the compile error.
> 
>  drivers/pci/controller/dwc/pcie-designware-ep.c | 135 +++++++++++++++++++++---
>  drivers/pci/controller/dwc/pcie-designware.h    |  18 +++-
>  2 files changed, 134 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/pci/controller/dwc/pcie-designware-ep.c b/drivers/pci/controller/dwc/pcie-designware-ep.c
> index 933bb89..fb915f2 100644
> --- a/drivers/pci/controller/dwc/pcie-designware-ep.c
> +++ b/drivers/pci/controller/dwc/pcie-designware-ep.c
> @@ -19,6 +19,19 @@ void dw_pcie_ep_linkup(struct dw_pcie_ep *ep)
>  	pci_epc_linkup(epc);
>  }
>  
> +struct dw_pcie_ep_func *
> +dw_pcie_ep_get_func_from_ep(struct dw_pcie_ep *ep, u8 func_no)
> +{
> +	struct dw_pcie_ep_func *ep_func;
> +
> +	list_for_each_entry(ep_func, &ep->func_list, list) {
> +		if (ep_func->func_no == func_no)
> +			return ep_func;
> +	}
> +
> +	return NULL;
> +}
> +
>  static unsigned int dw_pcie_ep_func_select(struct dw_pcie_ep *ep, u8 func_no)
>  {
>  	unsigned int func_offset = 0;
> @@ -59,6 +72,47 @@ void dw_pcie_ep_reset_bar(struct dw_pcie *pci, enum pci_barno bar)
>  		__dw_pcie_ep_reset_bar(pci, func_no, bar, 0);
>  }
>  
> +static u8 __dw_pcie_ep_find_next_cap(struct dw_pcie_ep *ep, u8 func_no,
> +		u8 cap_ptr, u8 cap)
> +{
> +	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> +	unsigned int func_offset = 0;
> +	u8 cap_id, next_cap_ptr;
> +	u16 reg;
> +
> +	if (!cap_ptr)
> +		return 0;
> +
> +	func_offset = dw_pcie_ep_func_select(ep, func_no);
> +
> +	reg = dw_pcie_readw_dbi(pci, func_offset + cap_ptr);
> +	cap_id = (reg & 0x00ff);
> +
> +	if (cap_id > PCI_CAP_ID_MAX)
> +		return 0;
> +
> +	if (cap_id == cap)
> +		return cap_ptr;
> +
> +	next_cap_ptr = (reg & 0xff00) >> 8;
> +	return __dw_pcie_ep_find_next_cap(ep, func_no, next_cap_ptr, cap);
> +}
> +
> +static u8 dw_pcie_ep_find_capability(struct dw_pcie_ep *ep, u8 func_no, u8 cap)
> +{
> +	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> +	unsigned int func_offset = 0;
> +	u8 next_cap_ptr;
> +	u16 reg;
> +
> +	func_offset = dw_pcie_ep_func_select(ep, func_no);
> +
> +	reg = dw_pcie_readw_dbi(pci, func_offset + PCI_CAPABILITY_LIST);
> +	next_cap_ptr = (reg & 0x00ff);
> +
> +	return __dw_pcie_ep_find_next_cap(ep, func_no, next_cap_ptr, cap);
> +}
> +
>  static int dw_pcie_ep_write_header(struct pci_epc *epc, u8 func_no,
>  				   struct pci_epf_header *hdr)
>  {
> @@ -246,13 +300,18 @@ static int dw_pcie_ep_get_msi(struct pci_epc *epc, u8 func_no)
>  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
>  	u32 val, reg;
>  	unsigned int func_offset = 0;
> +	struct dw_pcie_ep_func *ep_func;
>  
> -	if (!ep->msi_cap)
> +	ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> +	if (!ep_func)
> +		return -EINVAL;
> +
> +	if (!ep_func->msi_cap)
>  		return -EINVAL;

if (!ep_func || !ep_func->msi_cap)
  return -EINVAL;

>  
>  	func_offset = dw_pcie_ep_func_select(ep, func_no);
>  
> -	reg = ep->msi_cap + func_offset + PCI_MSI_FLAGS;
> +	reg = ep_func->msi_cap + func_offset + PCI_MSI_FLAGS;
>  	val = dw_pcie_readw_dbi(pci, reg);
>  	if (!(val & PCI_MSI_FLAGS_ENABLE))
>  		return -EINVAL;
> @@ -268,13 +327,18 @@ static int dw_pcie_ep_set_msi(struct pci_epc *epc, u8 func_no, u8 interrupts)
>  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
>  	u32 val, reg;
>  	unsigned int func_offset = 0;
> +	struct dw_pcie_ep_func *ep_func;
> +
> +	ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> +	if (!ep_func)
> +		return -EINVAL;
>  
> -	if (!ep->msi_cap)
> +	if (!ep_func->msi_cap)
>  		return -EINVAL;

Same here.

>  
>  	func_offset = dw_pcie_ep_func_select(ep, func_no);
>  
> -	reg = ep->msi_cap + func_offset + PCI_MSI_FLAGS;
> +	reg = ep_func->msi_cap + func_offset + PCI_MSI_FLAGS;
>  	val = dw_pcie_readw_dbi(pci, reg);
>  	val &= ~PCI_MSI_FLAGS_QMASK;
>  	val |= (interrupts << 1) & PCI_MSI_FLAGS_QMASK;
> @@ -291,13 +355,18 @@ static int dw_pcie_ep_get_msix(struct pci_epc *epc, u8 func_no)
>  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
>  	u32 val, reg;
>  	unsigned int func_offset = 0;
> +	struct dw_pcie_ep_func *ep_func;
> +
> +	ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> +	if (!ep_func)
> +		return -EINVAL;
>  
> -	if (!ep->msix_cap)
> +	if (!ep_func->msix_cap)
>  		return -EINVAL;

Same here for msix.

>  
>  	func_offset = dw_pcie_ep_func_select(ep, func_no);
>  
> -	reg = ep->msix_cap + func_offset + PCI_MSIX_FLAGS;
> +	reg = ep_func->msix_cap + func_offset + PCI_MSIX_FLAGS;
>  	val = dw_pcie_readw_dbi(pci, reg);
>  	if (!(val & PCI_MSIX_FLAGS_ENABLE))
>  		return -EINVAL;
> @@ -313,13 +382,18 @@ static int dw_pcie_ep_set_msix(struct pci_epc *epc, u8 func_no, u16 interrupts)
>  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
>  	u32 val, reg;
>  	unsigned int func_offset = 0;
> +	struct dw_pcie_ep_func *ep_func;
>  
> -	if (!ep->msix_cap)
> +	ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> +	if (!ep_func)
> +		return -EINVAL;
> +
> +	if (!ep_func->msix_cap)
>  		return -EINVAL;

And here.

>  
>  	func_offset = dw_pcie_ep_func_select(ep, func_no);
>  
> -	reg = ep->msix_cap + func_offset + PCI_MSIX_FLAGS;
> +	reg = ep_func->msix_cap + func_offset + PCI_MSIX_FLAGS;
>  	val = dw_pcie_readw_dbi(pci, reg);
>  	val &= ~PCI_MSIX_FLAGS_QSIZE;
>  	val |= interrupts;
> @@ -404,6 +478,7 @@ int dw_pcie_ep_raise_msi_irq(struct dw_pcie_ep *ep, u8 func_no,
>  			     u8 interrupt_num)
>  {
>  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> +	struct dw_pcie_ep_func *ep_func;
>  	struct pci_epc *epc = ep->epc;
>  	unsigned int aligned_offset;
>  	unsigned int func_offset = 0;
> @@ -413,25 +488,29 @@ int dw_pcie_ep_raise_msi_irq(struct dw_pcie_ep *ep, u8 func_no,
>  	bool has_upper;
>  	int ret;
>  
> -	if (!ep->msi_cap)
> +	ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> +	if (!ep_func)
> +		return -EINVAL;
> +
> +	if (!ep_func->msi_cap)
>  		return -EINVAL;

And here.

>  
>  	func_offset = dw_pcie_ep_func_select(ep, func_no);
>  
>  	/* Raise MSI per the PCI Local Bus Specification Revision 3.0, 6.8.1. */
> -	reg = ep->msi_cap + func_offset + PCI_MSI_FLAGS;
> +	reg = ep_func->msi_cap + func_offset + PCI_MSI_FLAGS;
>  	msg_ctrl = dw_pcie_readw_dbi(pci, reg);
>  	has_upper = !!(msg_ctrl & PCI_MSI_FLAGS_64BIT);
> -	reg = ep->msi_cap + func_offset + PCI_MSI_ADDRESS_LO;
> +	reg = ep_func->msi_cap + func_offset + PCI_MSI_ADDRESS_LO;
>  	msg_addr_lower = dw_pcie_readl_dbi(pci, reg);
>  	if (has_upper) {
> -		reg = ep->msi_cap + func_offset + PCI_MSI_ADDRESS_HI;
> +		reg = ep_func->msi_cap + func_offset + PCI_MSI_ADDRESS_HI;
>  		msg_addr_upper = dw_pcie_readl_dbi(pci, reg);
> -		reg = ep->msi_cap + func_offset + PCI_MSI_DATA_64;
> +		reg = ep_func->msi_cap + func_offset + PCI_MSI_DATA_64;
>  		msg_data = dw_pcie_readw_dbi(pci, reg);
>  	} else {
>  		msg_addr_upper = 0;
> -		reg = ep->msi_cap + func_offset + PCI_MSI_DATA_32;
> +		reg = ep_func->msi_cap + func_offset + PCI_MSI_DATA_32;
>  		msg_data = dw_pcie_readw_dbi(pci, reg);
>  	}
>  	aligned_offset = msg_addr_lower & (epc->mem->page_size - 1);
> @@ -467,6 +546,7 @@ int dw_pcie_ep_raise_msix_irq(struct dw_pcie_ep *ep, u8 func_no,
>  			      u16 interrupt_num)
>  {
>  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
> +	struct dw_pcie_ep_func *ep_func;
>  	struct pci_epc *epc = ep->epc;
>  	u16 tbl_offset, bir;
>  	unsigned int func_offset = 0;
> @@ -477,9 +557,16 @@ int dw_pcie_ep_raise_msix_irq(struct dw_pcie_ep *ep, u8 func_no,
>  	void __iomem *msix_tbl;
>  	int ret;
>  
> +	ep_func = dw_pcie_ep_get_func_from_ep(ep, func_no);
> +	if (!ep_func)
> +		return -EINVAL;
> +
> +	if (!ep_func->msix_cap)
> +		return -EINVAL;

And here.

> +
>  	func_offset = dw_pcie_ep_func_select(ep, func_no);
>  
> -	reg = ep->msix_cap + func_offset + PCI_MSIX_TABLE;
> +	reg = ep_func->msix_cap + func_offset + PCI_MSIX_TABLE;
>  	tbl_offset = dw_pcie_readl_dbi(pci, reg);
>  	bir = (tbl_offset & PCI_MSIX_TABLE_BIR);
>  	tbl_offset &= PCI_MSIX_TABLE_OFFSET;
> @@ -558,6 +645,7 @@ int dw_pcie_ep_init(struct dw_pcie_ep *ep)
>  	int i;
>  	int ret;
>  	u32 reg;
> +	u8 func_no;
>  	void *addr;
>  	u8 hdr_type;
>  	unsigned int nbars;
> @@ -566,6 +654,9 @@ int dw_pcie_ep_init(struct dw_pcie_ep *ep)
>  	struct dw_pcie *pci = to_dw_pcie_from_ep(ep);
>  	struct device *dev = pci->dev;
>  	struct device_node *np = dev->of_node;
> +	struct dw_pcie_ep_func *ep_func;
> +
> +	INIT_LIST_HEAD(&ep->func_list);
>  
>  	if (!pci->dbi_base || !pci->dbi_base2) {
>  		dev_err(dev, "dbi_base/dbi_base2 is not populated\n");
> @@ -632,9 +723,19 @@ int dw_pcie_ep_init(struct dw_pcie_ep *ep)
>  	if (ret < 0)
>  		epc->max_functions = 1;
>  
> -	ep->msi_cap = dw_pcie_find_capability(pci, PCI_CAP_ID_MSI);
> +	for (func_no = 0; func_no < epc->max_functions; func_no++) {
> +		ep_func = devm_kzalloc(dev, sizeof(*ep_func), GFP_KERNEL);

Why do you need a list if you allocate all the functions at once? You 
could just do an array. Or do the allocations as needed and keep the 
list. I assume all functions aren't always used.
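
A rough sketch of the array alternative, assuming every function is
allocated up front. The struct and function names (ep, ep_func,
ep_get_func, ep_msi_cap) are illustrative stand-ins, not the actual
dw_pcie_ep code, and -1 stands in for -EINVAL to keep the example
self-contained. It also shows the combined NULL/capability check
suggested earlier in this review:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative stand-ins for the kernel structures; names are hypothetical. */
struct ep_func {
	uint8_t msi_cap;	/* MSI capability offset, 0 if absent */
	uint8_t msix_cap;	/* MSI-X capability offset, 0 if absent */
};

struct ep {
	struct ep_func *funcs;	/* max_functions entries, indexed by func_no */
	uint8_t max_functions;
};

/* Constant-time lookup: the function number is the array index. */
static struct ep_func *ep_get_func(struct ep *ep, uint8_t func_no)
{
	if (!ep->funcs || func_no >= ep->max_functions)
		return NULL;
	return &ep->funcs[func_no];
}

/* Returns the MSI capability offset, or -1 if the function has none. */
static int ep_msi_cap(struct ep *ep, uint8_t func_no)
{
	struct ep_func *f = ep_get_func(ep, func_no);

	/* || short-circuits, so f is only dereferenced when it is non-NULL. */
	if (!f || !f->msi_cap)
		return -1;
	return f->msi_cap;
}
```

With an array there is no list walk and no per-entry func_no field to
keep in sync; the trade-off is that storage for all max_functions
entries exists even if some functions are never used.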

> +		if (!ep_func)
> +			return -ENOMEM;
>  
> -	ep->msix_cap = dw_pcie_find_capability(pci, PCI_CAP_ID_MSIX);
> +		ep_func->func_no = func_no;
> +		ep_func->msi_cap = dw_pcie_ep_find_capability(ep, func_no,
> +							      PCI_CAP_ID_MSI);
> +		ep_func->msix_cap = dw_pcie_ep_find_capability(ep, func_no,
> +							       PCI_CAP_ID_MSIX);
> +
> +		list_add_tail(&ep_func->list, &ep->func_list);
> +	}
>  
>  	if (ep->ops->ep_init)
>  		ep->ops->ep_init(ep);
> diff --git a/drivers/pci/controller/dwc/pcie-designware.h b/drivers/pci/controller/dwc/pcie-designware.h
> index cb32afa..dd9b7b4 100644
> --- a/drivers/pci/controller/dwc/pcie-designware.h
> +++ b/drivers/pci/controller/dwc/pcie-designware.h
> @@ -230,8 +230,16 @@ struct dw_pcie_ep_ops {
>  	unsigned int (*func_conf_select)(struct dw_pcie_ep *ep, u8 func_no);
>  };
>  
> +struct dw_pcie_ep_func {
> +	struct list_head	list;
> +	u8			func_no;
> +	u8			msi_cap;	/* MSI capability offset */
> +	u8			msix_cap;	/* MSI-X capability offset */
> +};
> +
>  struct dw_pcie_ep {
>  	struct pci_epc		*epc;
> +	struct list_head	func_list;
>  	const struct dw_pcie_ep_ops *ops;
>  	phys_addr_t		phys_base;
>  	size_t			addr_size;
> @@ -244,8 +252,6 @@ struct dw_pcie_ep {
>  	u32			num_ob_windows;
>  	void __iomem		*msi_mem;
>  	phys_addr_t		msi_mem_phys;
> -	u8			msi_cap;	/* MSI capability offset */
> -	u8			msix_cap;	/* MSI-X capability offset */
>  };
>  
>  struct dw_pcie_ops {
> @@ -437,6 +443,8 @@ int dw_pcie_ep_raise_msix_irq(struct dw_pcie_ep *ep, u8 func_no,
>  int dw_pcie_ep_raise_msix_irq_doorbell(struct dw_pcie_ep *ep, u8 func_no,
>  				       u16 interrupt_num);
>  void dw_pcie_ep_reset_bar(struct dw_pcie *pci, enum pci_barno bar);
> +struct dw_pcie_ep_func *
> +dw_pcie_ep_get_func_from_ep(struct dw_pcie_ep *ep, u8 func_no);
>  #else
>  static inline void dw_pcie_ep_linkup(struct dw_pcie_ep *ep)
>  {
> @@ -478,5 +486,11 @@ static inline int dw_pcie_ep_raise_msix_irq_doorbell(struct dw_pcie_ep *ep,
>  static inline void dw_pcie_ep_reset_bar(struct dw_pcie *pci, enum pci_barno bar)
>  {
>  }
> +
> +static inline struct dw_pcie_ep_func *
> +dw_pcie_ep_get_func_from_ep(struct dw_pcie_ep *ep, u8 func_no)
> +{
> +	return NULL;
> +}
>  #endif
>  #endif /* _PCIE_DESIGNWARE_H */
> -- 
> 2.9.5
> 


* Re: [PATCH v6 01/11] PCI: designware-ep: Add multiple PFs support for DWC
From: Rob Herring @ 2020-05-20 20:32 UTC (permalink / raw)
  To: Xiaowei Bao
  Cc: roy.zang, lorenzo.pieralisi, devicetree, jingoohan1, Zhiqiang.Hou,
	linuxppc-dev, linux-pci, linux-kernel, leoyang.li, Minghuan.Lian,
	robh+dt, linux-arm-kernel, gustavo.pimentel, bhelgaas,
	andrew.murray, kishon, shawnguo, mingkai.hu, amurray
In-Reply-To: <20200314033038.24844-2-xiaowei.bao@nxp.com>

On Sat, 14 Mar 2020 11:30:28 +0800, Xiaowei Bao wrote:
> Add multiple PFs support for DWC, due to different PF have different
> config space, we use func_conf_select callback function to access
> the different PF's config space, the different chip company need to
> implement this callback function when use the DWC IP core and intend
> to support multiple PFs feature.
> 
> Signed-off-by: Xiaowei Bao <xiaowei.bao@nxp.com>
> Acked-by: Gustavo Pimentel <gustavo.pimentel@synopsys.com>
> ---
> v2:
>  - Remove duplicate redundant code.
>  - Reimplement the PF config space access way.
> v3:
>  - Integrate duplicate code for func_select.
>  - Move PCIE_ATU_FUNC_NUM(pf) (pf << 20) to ((pf) << 20).
>  - Add the comments for func_conf_select function.
> v4:
>  - Correct the commit message.
> v5:
>  - No change.
> v6:
>  - No change.
> 
>  drivers/pci/controller/dwc/pcie-designware-ep.c | 123 ++++++++++++++++--------
>  drivers/pci/controller/dwc/pcie-designware.c    |  59 ++++++++----
>  drivers/pci/controller/dwc/pcie-designware.h    |  18 +++-
>  3 files changed, 142 insertions(+), 58 deletions(-)
> 

Reviewed-by: Rob Herring <robh@kernel.org>


* Re: [PATCH] KVM: PPC: Book3S HV: relax check on H_SVM_INIT_ABORT
From: Greg Kurz @ 2020-05-20 17:32 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: linuxram, linux-kernel, kvm-ppc, paulus, sukadev, linuxppc-dev
In-Reply-To: <20200520165110.71020-1-ldufour@linux.ibm.com>

On Wed, 20 May 2020 18:51:10 +0200
Laurent Dufour <ldufour@linux.ibm.com> wrote:

> The commit 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_*
> Hcalls") added checks of secure bit of SRR1 to filter out the Hcall
> reserved to the Ultravisor.
> 
> However, the Hcall H_SVM_INIT_ABORT is made by the Ultravisor passing the
> context of the VM calling UV_ESM. This allows the Hypervisor to return to
> the guest without going through the Ultravisor. Thus the Secure bit of SRR1
> is not set in that particular case.
> 
> In the case a regular VM is calling H_SVM_INIT_ABORT, this hcall will be
> filtered out in kvmppc_h_svm_init_abort() because kvm->arch.secure_guest is
> not set in that case.
> 

Why not check vcpu->kvm->arch.secure_guest then?

> Fixes: 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_* Hcalls")
> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
> ---
>  arch/powerpc/kvm/book3s_hv.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 93493f0cbfe8..eb1f96cb7b72 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -1099,9 +1099,7 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
>  			ret = kvmppc_h_svm_init_done(vcpu->kvm);
>  		break;
>  	case H_SVM_INIT_ABORT:
> -		ret = H_UNSUPPORTED;
> -		if (kvmppc_get_srr1(vcpu) & MSR_S)
> -			ret = kvmppc_h_svm_init_abort(vcpu->kvm);

or at least put a comment to explain why H_SVM_INIT_ABORT
doesn't have the same sanity check as the other SVM hcalls.

> +		ret = kvmppc_h_svm_init_abort(vcpu->kvm);
>  		break;
>  
>  	default:



* Re: [RESEND PATCH v7 4/5] ndctl/papr_scm, uapi: Add support for PAPR nvdimm specific methods
From: Vaibhav Jain @ 2020-05-20 18:37 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-nvdimm, linux-kernel, Steven Rostedt, Aneesh Kumar K . V,
	linuxppc-dev
In-Reply-To: <20200520153209.GC3660833@iweiny-DESK2.sc.intel.com>


Thanks for reviewing this patch, Ira. My responses are below:

Ira Weiny <ira.weiny@intel.com> writes:

> On Wed, May 20, 2020 at 12:30:57AM +0530, Vaibhav Jain wrote:
>> Introduce support for Papr nvDimm Specific Methods (PDSM) in papr_scm
>> modules and add the command family to the white list of NVDIMM command
>> sets. Also advertise support for ND_CMD_CALL for the dimm
>> command mask and implement necessary scaffolding in the module to
>> handle ND_CMD_CALL ioctl and PDSM requests that we receive.
>> 
>> The layout of the PDSM request as we expect from libnvdimm/libndctl is
>> described in newly introduced uapi header 'papr_scm_pdsm.h' which
>> defines a new 'struct nd_pdsm_cmd_pkg' header. This header is used
>> to communicate the PDSM request via member
>> 'nd_pkg_papr_scm->nd_command' and size of payload that need to be
>> sent/received for servicing the PDSM.
>> 
>> A new function is_cmd_valid() is implemented that reads the args to
>> papr_scm_ndctl() and performs sanity tests on them. A new function
>> papr_scm_service_pdsm() is introduced and is called from
>> papr_scm_ndctl() in case of a PDSM request is received via ND_CMD_CALL
>> command from libnvdimm.
>> 
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Michael Ellerman <mpe@ellerman.id.au>
>> Cc: "Aneesh Kumar K . V" <aneesh.kumar@linux.ibm.com>
>> Signed-off-by: Vaibhav Jain <vaibhav@linux.ibm.com>
>> ---
>> Changelog:
>> 
>> Resend:
>> * None
>> 
>> v6..v7 :
>> * Removed the re-definitions of __packed macro from papr_scm_pdsm.h
>>   [Mpe].
>> * Removed the usage of __KERNEL__ macros in papr_scm_pdsm.h [Mpe].
>> * Removed macros that were unused in papr_scm.c from papr_scm_pdsm.h
>>   [Mpe].
>> * Made functions defined in papr_scm_pdsm.h as static inline. [Mpe]
>> 
>> v5..v6 :
>> * Changed the usage of the term DSM to PDSM to distinguish it from the
>>   ACPI term [ Dan Williams ]
>> * Renamed papr_scm_dsm.h to papr_scm_pdsm.h and updated various struct
>>   to reflect the new terminology.
>> * Updated the patch description and title to reflect the new terminology.
>> * Squashed patch to introduce new command family in 'ndctl.h' with
>>   this patch [ Dan Williams ]
>> * Updated the papr_scm_pdsm method starting index from 0x10000 to 0x0
>>   [ Dan Williams ]
>> * Removed redundant license text from the papr_scm_psdm.h file.
>>   [ Dan Williams ]
>> * s/envelop/envelope/ at various places [ Dan Williams ]
>> * Added '__packed' attribute to command package header to gaurd
>>   against different compiler adding paddings between the fields.
>>   [ Dan Williams]
>> * Converted various pr_debug to dev_debug [ Dan Williams ]
>> 
>> v4..v5 :
>> * None
>> 
>> v3..v4 :
>> * None
>> 
>> v2..v3 :
>> * Updated the patch prefix to 'ndctl/uapi' [Aneesh]
>> 
>> v1..v2 :
>> * None
>> ---
>>  arch/powerpc/include/uapi/asm/papr_scm_pdsm.h | 134 ++++++++++++++++++
>>  arch/powerpc/platforms/pseries/papr_scm.c     | 101 ++++++++++++-
>>  include/uapi/linux/ndctl.h                    |   1 +
>>  3 files changed, 230 insertions(+), 6 deletions(-)
>>  create mode 100644 arch/powerpc/include/uapi/asm/papr_scm_pdsm.h
>> 
>> diff --git a/arch/powerpc/include/uapi/asm/papr_scm_pdsm.h b/arch/powerpc/include/uapi/asm/papr_scm_pdsm.h
>> new file mode 100644
>> index 000000000000..671693439c1c
>> --- /dev/null
>> +++ b/arch/powerpc/include/uapi/asm/papr_scm_pdsm.h
>> @@ -0,0 +1,134 @@
>> +/* SPDX-License-Identifier: GPL-2.0+ WITH Linux-syscall-note */
>> +/*
>> + * PAPR-SCM Dimm specific methods (PDSM) and structs for libndctl
>> + *
>> + * (C) Copyright IBM 2020
>> + *
>> + * Author: Vaibhav Jain <vaibhav at linux.ibm.com>
>> + */
>> +
>> +#ifndef _UAPI_ASM_POWERPC_PAPR_SCM_PDSM_H_
>> +#define _UAPI_ASM_POWERPC_PAPR_SCM_PDSM_H_
>> +
>> +#include <linux/types.h>
>> +
>> +/*
>> + * PDSM Envelope:
>> + *
>> + * The ioctl ND_CMD_CALL transfers data between user-space and kernel via
>> + * 'envelopes' which consists of a header and user-defined payload sections.
>> + * The header is described by 'struct nd_pdsm_cmd_pkg' which expects a
>> + * payload following it and offset of which relative to the struct is provided
>> + * by 'nd_pdsm_cmd_pkg.payload_offset'. *
>> + *
>> + *  +-------------+---------------------+---------------------------+
>> + *  |   64-Bytes  |       8-Bytes       |       Max 184-Bytes       |
>> + *  +-------------+---------------------+---------------------------+
>> + *  |               nd_pdsm_cmd_pkg |                           |
>> + *  |-------------+                     |                           |
>> + *  |  nd_cmd_pkg |                     |                           |
>> + *  +-------------+---------------------+---------------------------+
>> + *  | nd_family   |			|			    |
>> + *  | nd_size_out | cmd_status          |			    |
>> + *  | nd_size_in  | payload_version     |      PAYLOAD		    |
>> + *  | nd_command  | payload_offset ----->			    |
>> + *  | nd_fw_size  |                     |			    |
>> + *  +-------------+---------------------+---------------------------+
>> + *
>> + * PDSM Header:
>> + *
>> + * The header is defined as 'struct nd_pdsm_cmd_pkg' which embeds a
>> + * 'struct nd_cmd_pkg' instance. The PDSM command is assigned to member
>> + * 'nd_cmd_pkg.nd_command'. Apart from size information of the envelope which is
>> + * contained in 'struct nd_cmd_pkg', the header also has members following
>> + * members:
>> + *
>> + * 'cmd_status'		: (Out) Errors if any encountered while servicing PDSM.
>> + * 'payload_version'	: (In/Out) Version number associated with the payload.
>> + * 'payload_offset'	: (In)Relative offset of payload from start of envelope.
>> + *
>> + * PDSM Payload:
>> + *
>> + * The layout of the PDSM Payload is defined by various structs shared between
>> + * papr_scm and libndctl so that contents of payload can be interpreted. During
>> + * servicing of a PDSM the papr_scm module will read input args from the payload
>> + * field by casting its contents to an appropriate struct pointer based on the
>> + * PDSM command. Similarly the output of servicing the PDSM command will be
>> + * copied to the payload field using the same struct.
>> + *
>> + * 'libnvdimm' enforces a hard limit of 256 bytes on the envelope size, which
>> + * leaves around 184 bytes for the envelope payload (ignoring any padding that
>> + * the compiler may silently introduce).
>> + *
>> + * Payload Version:
>> + *
>> + * A 'payload_version' field is present in PDSM header that indicates a specific
>> + * version of the structure present in PDSM Payload for a given PDSM command.
>> + * This provides backward compatibility in case the PDSM Payload structure
>> + * evolves and different structures are supported by 'papr_scm' and 'libndctl'.
>> + *
>> + * When sending a PDSM Payload to 'papr_scm', 'libndctl' should send the version
>> + * of the payload struct it supports via 'payload_version' field. The 'papr_scm'
>> + * module when servicing the PDSM envelope checks the 'payload_version' and then
>> + * uses 'payload struct version' == MIN('payload_version field',
>> + * 'max payload-struct-version supported by papr_scm') to service the PDSM.
>> + * After servicing the PDSM, 'papr_scm' puts the negotiated version of payload
>> + * struct in returned 'payload_version' field.
>
> FWIW many people believe using a size rather than version is more sustainable.
> It is expected that new payload structures are larger (more features) than the
> previous payload structure.
>
> I can't find references at the moment though.
>
> What does payload_version provide us that the command size in/out does
> not?

Taking constraints 1 & 2 mentioned below in section "Backward
Compatibility" into account, there should be a 1:1 mapping between the set of
valid 'payload_versions' and the set of corresponding in/out sizes.

However, it's much more intuitive and convenient to figure out the struct
type to use from a version number rather than from sizes. For example:

struct v1 { u64 a; };    sizeof(v1) == 8
struct v2 { u64 a, b; }; sizeof(v2) == 16

With version numbers it's easy to figure out which type to use. However,
with in/out sizes an extra lookup is needed to identify the type to be used.
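
To make the comparison concrete, here is a rough user-space sketch. All names
in it (pdsm_v1, pdsm_v2, PDSM_MAX_VERSION and the helper functions) are made
up for illustration and are not part of the patch. It shows the MIN()-style
negotiation, the direct version-to-struct mapping, and the extra search that a
size-only scheme would need:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical payload layouts, mirroring the v1/v2 example above. */
struct pdsm_v1 { uint64_t a; };
struct pdsm_v2 { uint64_t a, b; };

/* Assumed: the highest payload version this side understands. */
#define PDSM_MAX_VERSION 2

/* Negotiated version = MIN(requested version, max supported version). */
static uint16_t pdsm_negotiate_version(uint16_t requested)
{
	return requested < PDSM_MAX_VERSION ? requested : PDSM_MAX_VERSION;
}

/* With a version number, the struct to use follows directly. */
static size_t pdsm_size_for_version(uint16_t version)
{
	switch (version) {
	case 1:
		return sizeof(struct pdsm_v1);
	case 2:
		return sizeof(struct pdsm_v2);
	default:
		return 0;	/* unknown version */
	}
}

/* With only a size, every known layout has to be searched. */
static uint16_t pdsm_version_for_size(size_t size)
{
	uint16_t v;

	for (v = 1; v <= PDSM_MAX_VERSION; v++)
		if (pdsm_size_for_version(v) == size)
			return v;
	return 0;	/* no layout of this size */
}

/* Check the 1:1 mapping between versions and sizes. */
static void pdsm_check_mapping(void)
{
	uint16_t v;

	for (v = 1; v <= PDSM_MAX_VERSION; v++)
		assert(pdsm_version_for_size(pdsm_size_for_version(v)) == v);
}
```

The size-based lookup only works while every version has a distinct size,
which is what constraints 1 & 2 are meant to guarantee.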

>
>> + *
>> + * Libndctl on receiving the envelope back from papr_scm again checks the
>> + * 'payload_version' field and, based on it, uses the appropriate version of
>> + * the dsm struct to parse the results.
>> + *
>> + * Backward Compatibility:
>> + *
>> + * The above scheme of exchanging differently versioned PDSM structs between
>> + * libndctl and papr_scm should provide backward compatibility as long as the
>> + * following two assumptions/conditions hold when defining new PDSM structs:
>> + *
>> + * Let T(X) = { set of attributes in PDSM struct 'T' versioned X }
>> + *
>> + * 1. T(X) is a proper subset of T(Y) if X > Y.
>
> Proper superset?  Or Y > X?
>
Good catch. This should be Y > X. Will get this fixed.

> Ira
>
>> + *    i.e. each new version of a PDSM struct should retain the existing struct
>> + *    attributes from the previous version.
>> + *
>> + * 2. If an entity (libndctl or papr_scm) supports a PDSM struct T(X) then
>> + *    it should also support T(1), T(2)...T(X - 1).
>> + *    i.e. when adding support for a new version of a PDSM struct, libndctl
>> + *    and papr_scm should retain support for the existing PDSM struct
>> + *    versions they support.
>> + */
>> +
>> +/* Papr-scm-header + payload expected with ND_CMD_CALL ioctl from libnvdimm */
>> +struct nd_pdsm_cmd_pkg {
>> +	struct nd_cmd_pkg hdr;	/* Package header containing sub-cmd */
>> +	__s32 cmd_status;	/* Out: Sub-cmd status returned back */
>> +	__u16 payload_offset;	/* In: offset from start of struct */
>> +	__u16 payload_version;	/* In/Out: version of the payload */
>> +	__u8 payload[];		/* In/Out: Sub-cmd data buffer */
>> +} __packed;
>> +
>> +/*
>> + * Methods to be embedded in ND_CMD_CALL request. These are sent to the kernel
>> + * via 'nd_pdsm_cmd_pkg.hdr.nd_command' member of the ioctl struct
>> + */
>> +enum papr_scm_pdsm {
>> +	PAPR_SCM_PDSM_MIN = 0x0,
>> +	PAPR_SCM_PDSM_MAX,
>> +};
>> +
>> +/* Convert a libnvdimm nd_cmd_pkg to pdsm specific pkg */
>> +static inline struct nd_pdsm_cmd_pkg *nd_to_pdsm_cmd_pkg(struct nd_cmd_pkg *cmd)
>> +{
>> +	return (struct nd_pdsm_cmd_pkg *) cmd;
>> +}
>> +
>> +/* Return the payload pointer for a given pcmd */
>> +static inline void *pdsm_cmd_to_payload(struct nd_pdsm_cmd_pkg *pcmd)
>> +{
>> +	if (pcmd->hdr.nd_size_in == 0 && pcmd->hdr.nd_size_out == 0)
>> +		return NULL;
>> +	else
>> +		return (void *)((__u8 *) pcmd + pcmd->payload_offset);
>> +}
>> +
>> +#endif /* _UAPI_ASM_POWERPC_PAPR_SCM_PDSM_H_ */
>> diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
>> index 142636e1a59f..ed4b49a6f1e1 100644
>> --- a/arch/powerpc/platforms/pseries/papr_scm.c
>> +++ b/arch/powerpc/platforms/pseries/papr_scm.c
>> @@ -15,13 +15,15 @@
>>  #include <linux/seq_buf.h>
>>  
>>  #include <asm/plpar_wrappers.h>
>> +#include <asm/papr_scm_pdsm.h>
>>  
>>  #define BIND_ANY_ADDR (~0ul)
>>  
>>  #define PAPR_SCM_DIMM_CMD_MASK \
>>  	((1ul << ND_CMD_GET_CONFIG_SIZE) | \
>>  	 (1ul << ND_CMD_GET_CONFIG_DATA) | \
>> -	 (1ul << ND_CMD_SET_CONFIG_DATA))
>> +	 (1ul << ND_CMD_SET_CONFIG_DATA) | \
>> +	 (1ul << ND_CMD_CALL))
>>  
>>  /* DIMM health bitmap bitmap indicators */
>>  /* SCM device is unable to persist memory contents */
>> @@ -350,16 +352,97 @@ static int papr_scm_meta_set(struct papr_scm_priv *p,
>>  	return 0;
>>  }
>>  
>> +/*
>> + * Validate the input args to dimm-control function and return '0' if valid.
>> + * This also does initial sanity validation to ND_CMD_CALL sub-command packages.
>> + */
>> +static int is_cmd_valid(struct nvdimm *nvdimm, unsigned int cmd, void *buf,
>> +		       unsigned int buf_len)
>> +{
>> +	unsigned long cmd_mask = PAPR_SCM_DIMM_CMD_MASK;
>> +	struct nd_pdsm_cmd_pkg *pkg = nd_to_pdsm_cmd_pkg(buf);
>> +	struct papr_scm_priv *p;
>> +
>> +	/* Only dimm-specific calls are supported atm */
>> +	if (!nvdimm)
>> +		return -EINVAL;
>> +
>> +	/* get the provider data from struct nvdimm */
>> +	p = nvdimm_provider_data(nvdimm);
>> +
>> +	if (!test_bit(cmd, &cmd_mask)) {
>> +		dev_dbg(&p->pdev->dev, "Unsupported cmd=%u\n", cmd);
>> +		return -EINVAL;
>> +	} else if (cmd == ND_CMD_CALL) {
>> +
>> +		/* Verify the envelope package */
>> +		if (!buf || buf_len < sizeof(struct nd_pdsm_cmd_pkg)) {
>> +			dev_dbg(&p->pdev->dev, "Invalid pkg size=%u\n",
>> +				buf_len);
>> +			return -EINVAL;
>> +		}
>> +
>> +		/* Verify that the PDSM family is valid */
>> +		if (pkg->hdr.nd_family != NVDIMM_FAMILY_PAPR_SCM) {
>> +			dev_dbg(&p->pdev->dev, "Invalid pkg family=0x%llx\n",
>> +				pkg->hdr.nd_family);
>> +			return -EINVAL;
>> +
>> +		}
>> +
>> +		/* We expect a payload with all PDSM commands */
>> +		if (pdsm_cmd_to_payload(pkg) == NULL) {
>> +			dev_dbg(&p->pdev->dev,
>> +				"Empty payload for sub-command=0x%llx\n",
>> +				pkg->hdr.nd_command);
>> +			return -EINVAL;
>> +		}
>> +	}
>> +
>> +	/* Command looks valid */
>> +	return 0;
>> +}
>> +
>> +static int papr_scm_service_pdsm(struct papr_scm_priv *p,
>> +				struct nd_pdsm_cmd_pkg *call_pkg)
>> +{
>> +	/* unknown subcommands return error in packages */
>> +	if (call_pkg->hdr.nd_command <= PAPR_SCM_PDSM_MIN ||
>> +	    call_pkg->hdr.nd_command >= PAPR_SCM_PDSM_MAX) {
>> +		dev_dbg(&p->pdev->dev, "Invalid PDSM request 0x%llx\n",
>> +			call_pkg->hdr.nd_command);
>> +		call_pkg->cmd_status = -EINVAL;
>> +		return 0;
>> +	}
>> +
>> +	/* Depending on the DSM command call appropriate service routine */
>> +	switch (call_pkg->hdr.nd_command) {
>> +	default:
>> +		dev_dbg(&p->pdev->dev, "Unsupported PDSM request 0x%llx\n",
>> +			call_pkg->hdr.nd_command);
>> +		call_pkg->cmd_status = -ENOENT;
>> +		return 0;
>> +	}
>> +}
>> +
>>  static int papr_scm_ndctl(struct nvdimm_bus_descriptor *nd_desc,
>>  			  struct nvdimm *nvdimm, unsigned int cmd, void *buf,
>>  			  unsigned int buf_len, int *cmd_rc)
>>  {
>>  	struct nd_cmd_get_config_size *get_size_hdr;
>>  	struct papr_scm_priv *p;
>> +	struct nd_pdsm_cmd_pkg *call_pkg = NULL;
>> +	int rc;
>>  
>> -	/* Only dimm-specific calls are supported atm */
>> -	if (!nvdimm)
>> -		return -EINVAL;
>> +	/* Use a local variable in case cmd_rc pointer is NULL */
>> +	if (cmd_rc == NULL)
>> +		cmd_rc = &rc;
>> +
>> +	*cmd_rc = is_cmd_valid(nvdimm, cmd, buf, buf_len);
>> +	if (*cmd_rc) {
>> +		pr_debug("Invalid cmd=0x%x. Err=%d\n", cmd, *cmd_rc);
>> +		return *cmd_rc;
>> +	}
>>  
>>  	p = nvdimm_provider_data(nvdimm);
>>  
>> @@ -381,13 +464,19 @@ static int papr_scm_ndctl(struct nvdimm_bus_descriptor *nd_desc,
>>  		*cmd_rc = papr_scm_meta_set(p, buf);
>>  		break;
>>  
>> +	case ND_CMD_CALL:
>> +		call_pkg = nd_to_pdsm_cmd_pkg(buf);
>> +		*cmd_rc = papr_scm_service_pdsm(p, call_pkg);
>> +		break;
>> +
>>  	default:
>> -		return -EINVAL;
>> +		dev_dbg(&p->pdev->dev, "Unknown command = %d\n", cmd);
>> +		*cmd_rc = -EINVAL;
>>  	}
>>  
>>  	dev_dbg(&p->pdev->dev, "returned with cmd_rc = %d\n", *cmd_rc);
>>  
>> -	return 0;
>> +	return *cmd_rc;
>>  }
>>  
>>  static ssize_t flags_show(struct device *dev,
>> diff --git a/include/uapi/linux/ndctl.h b/include/uapi/linux/ndctl.h
>> index de5d90212409..99fb60600ef8 100644
>> --- a/include/uapi/linux/ndctl.h
>> +++ b/include/uapi/linux/ndctl.h
>> @@ -244,6 +244,7 @@ struct nd_cmd_pkg {
>>  #define NVDIMM_FAMILY_HPE2 2
>>  #define NVDIMM_FAMILY_MSFT 3
>>  #define NVDIMM_FAMILY_HYPERV 4
>> +#define NVDIMM_FAMILY_PAPR_SCM 5
>>  
>>  #define ND_IOCTL_CALL			_IOWR(ND_IOCTL, ND_CMD_CALL,\
>>  					struct nd_cmd_pkg)
>> -- 
>> 2.26.2

-- 
Cheers
~ Vaibhav


* Re: [PATCH v2 3/5] libnvdimm/nvdimm/flush: Allow architecture to override the flush barrier
From: Aneesh Kumar K.V @ 2020-05-20 18:43 UTC (permalink / raw)
  To: Dan Williams; +Cc: alistair, linuxppc-dev, linux-nvdimm
In-Reply-To: <CAPcyv4jZhYXEmYGzqGPjPtq9ZWJNtQyszN0V0Xcv0qtByK_KCw@mail.gmail.com>

Dan Williams <dan.j.williams@intel.com> writes:

> On Tue, May 19, 2020 at 6:53 AM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> Dan Williams <dan.j.williams@intel.com> writes:
>>
>> > On Mon, May 18, 2020 at 10:30 PM Aneesh Kumar K.V
>> > <aneesh.kumar@linux.ibm.com> wrote:
>>
>> ...
>>
>> >> Applications using new instructions will behave as expected when running
>> >> on P8 and P9. Only future hardware will differentiate between 'dcbf' and
>> >> 'dcbfps'
>> >
>> > Right, this is the problem. Applications using new instructions behave
>> > as expected, the kernel has been shipping of_pmem and papr_scm for
>> > several cycles now, you're saying that the DAX applications written
>> > against those platforms are going to be broken on P8 and P9?
>>
>> The expectation is that both kernel and userspace would get upgraded to
>> use the new instruction before actual persistent memory devices are
>> made available.
>>
>> >
>> >> > I'm thinking the kernel
>> >> > should go as far as to disable DAX operation by default on new
>> >> > hardware until userspace asserts that it is prepared to switch to the
>> >> > new implementation. Is there any other way to ensure the forward
>> >> > compatibility of deployed ppc64 DAX applications?
>> >>
>> >> AFAIU there is no released persistent memory hardware on ppc64 platform
>> >> and we need to make sure before applications get enabled to use these
>> >> persistent memory devices, they should switch to use the new
>> >> instruction?
>> >
>> > Right, I want the kernel to offer some level of safety here because
>> > everything you are describing sounds like a flag day conversion. Am I
>> > misreading? Is there some other gate that prevents existing users of
>> > of_pmem and papr_scm from having their expectations violated when
>> > running on P8 / P9 hardware? Maybe there's tighter ecosystem control
>> > that I'm just not familiar with, I'm only going off the fact that the
>> > kernel has shipped a non-zero number of NVDIMM drivers that build with
>> > ARCH=ppc64 for several cycles.
>>
>> If we are looking at adding changes to kernel that will prevent a kernel
>> from running on newer hardware in a specific case, we could as well take
>> the changes to get the kernel use the newer instructions right?
>
> Oh, no, I'm not talking about stopping the kernel from running. I'm
> simply recommending that support for MAP_SYNC mappings (userspace
> managed flushing) be disabled by default on PPC with either a
> compile-time or run-time default to assert that userspace has been
> audited for legacy applications or that the platform owner is
> otherwise willing to take the risk.
>
>> But I agree with your concern that if we have older kernel/applications
>> that continue to use `dcbf` on future hardware we will end up
>> having issues w.r.t powerfail consistency. The plan is what you outlined
>> above as tighter ecosystem control. Considering we don't have a pmem
>> device generally available, we get both kernel and userspace upgraded
>> to use these new instructions before such a device is made available.
>
> Ok, I think a compile time kernel option with a runtime override
> satisfies my concern. Does that work for you?

something like below? But this still won't handle devdax mmap right?

diff --git a/arch/arm64/include/asm/libnvdimm.h b/arch/arm64/include/asm/libnvdimm.h
new file mode 100644
index 000000000000..aee697a72537
--- /dev/null
+++ b/arch/arm64/include/asm/libnvdimm.h
@@ -0,0 +1,6 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __ARCH_LIBNVDIMM_H__
+#define __ARCH_LIBNVDIMM_H__
+
+#endif /* __ARCH_LIBNVDIMM_H__  */
diff --git a/arch/powerpc/include/asm/libnvdimm.h b/arch/powerpc/include/asm/libnvdimm.h
new file mode 100644
index 000000000000..da479200bfb8
--- /dev/null
+++ b/arch/powerpc/include/asm/libnvdimm.h
@@ -0,0 +1,9 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __ARCH_LIBNVDIMM_H__
+#define __ARCH_LIBNVDIMM_H__
+
+#define arch_disable_sync_nvdimm arch_disable_sync_nvdimm
+extern bool arch_disable_sync_nvdimm(void);
+
+#endif /* __ARCH_LIBNVDIMM_H__  */
diff --git a/arch/powerpc/lib/pmem.c b/arch/powerpc/lib/pmem.c
index 0666a8d29596..3ce4fb4f167b 100644
--- a/arch/powerpc/lib/pmem.c
+++ b/arch/powerpc/lib/pmem.c
@@ -9,6 +9,8 @@
 
 #include <asm/cacheflush.h>
 
+
+static bool sync_fault = IS_ENABLED(CONFIG_PPC_NVDIMM_SYNC_FAULT);
 /*
  * CONFIG_ARCH_HAS_PMEM_API symbols
  */
@@ -57,3 +59,16 @@ void memcpy_page_flushcache(char *to, struct page *page, size_t offset,
 	memcpy_flushcache(to, page_to_virt(page) + offset, len);
 }
 EXPORT_SYMBOL(memcpy_page_flushcache);
+
+bool arch_disable_sync_nvdimm(void)
+{
+	return !sync_fault;
+}
+
+static int __init parse_sync_fault(char *p)
+{
+	sync_fault = true;
+	return 0;
+}
+early_param("enable_sync_fault", parse_sync_fault);
+
diff --git a/arch/powerpc/platforms/Kconfig.cputype b/arch/powerpc/platforms/Kconfig.cputype
index 27a81c291be8..dde11d75a746 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -383,6 +383,15 @@ config PPC_KUEP
 
 	  If you're unsure, say Y.
 
+config PPC_NVDIMM_SYNC_FAULT
+	bool "Synchronous fault support (MAP_SYNC)"
+	default n
+	help
+	  Enable support for synchronous fault with nvdimm namespaces.
+
+	  If you're unsure, say N.
+
+
 config PPC_HAVE_KUAP
 	bool
 
diff --git a/arch/x86/include/asm/libnvdimm.h b/arch/x86/include/asm/libnvdimm.h
new file mode 100644
index 000000000000..aee697a72537
--- /dev/null
+++ b/arch/x86/include/asm/libnvdimm.h
@@ -0,0 +1,6 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef __ARCH_LIBNVDIMM_H__
+#define __ARCH_LIBNVDIMM_H__
+
+#endif /* __ARCH_LIBNVDIMM_H__  */
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index ccbb5b43b8b2..74a0809491af 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1278,6 +1278,13 @@ bool is_nvdimm_sync(struct nd_region *nd_region)
 	if (is_nd_volatile(&nd_region->dev))
 		return true;
 
+	/*
+	 * If the arch is forcing synchronous faults to be
+	 * disabled, report the region as not sync-capable.
+	 */
+	if (arch_disable_sync_nvdimm())
+		return false;
+
 	return is_nd_pmem(&nd_region->dev) &&
 		!test_bit(ND_REGION_ASYNC, &nd_region->flags);
 }
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 18da4059be09..891449aebe91 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -13,6 +13,8 @@
 #include <linux/spinlock.h>
 #include <linux/bio.h>
 
+#include <asm/libnvdimm.h>
+
 struct badrange_entry {
 	u64 start;
 	u64 length;
@@ -286,4 +288,12 @@ static inline void arch_invalidate_pmem(void *addr, size_t size)
 }
 #endif
 
+#ifndef arch_disable_sync_nvdimm
+#define arch_disable_sync_nvdimm arch_disable_sync_nvdimm
+static inline bool arch_disable_sync_nvdimm(void)
+{
+	return false;
+}
+#endif
+
 #endif /* __LIBNVDIMM_H__ */
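
Reduced to a user-space sketch, the intent of the diff above is a compile-time
default plus a boot-time override that can only switch the feature on. The
demo_* names and the DEMO_SYNC_FAULT_DEFAULT macro below are stand-ins chosen
for illustration; they are assumptions and not kernel interfaces:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for IS_ENABLED(CONFIG_PPC_NVDIMM_SYNC_FAULT); assumed off. */
#ifndef DEMO_SYNC_FAULT_DEFAULT
#define DEMO_SYNC_FAULT_DEFAULT 0
#endif

static bool demo_sync_fault = DEMO_SYNC_FAULT_DEFAULT;

/* Stand-in for the early_param() hook: the option can only turn it on. */
static void demo_parse_cmdline(const char *cmdline)
{
	assert(cmdline != NULL);
	if (strstr(cmdline, "enable_sync_fault"))
		demo_sync_fault = true;
}

/* Mirrors arch_disable_sync_nvdimm(): true means MAP_SYNC stays disabled. */
static bool demo_disable_sync_nvdimm(void)
{
	return !demo_sync_fault;
}
```

With the default left off, demo_disable_sync_nvdimm() keeps returning true
until "enable_sync_fault" shows up on the command line.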


* Re: [PATCH v2] KVM: PPC: Book3S HV: relax check on H_SVM_INIT_ABORT
From: Greg Kurz @ 2020-05-20 18:23 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: linuxram, linux-kernel, kvm-ppc, paulus, sukadev, linuxppc-dev
In-Reply-To: <20200520174308.77820-1-ldufour@linux.ibm.com>

On Wed, 20 May 2020 19:43:08 +0200
Laurent Dufour <ldufour@linux.ibm.com> wrote:

> The commit 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_*
> Hcalls") added checks of secure bit of SRR1 to filter out the Hcall
> reserved to the Ultravisor.
> 
> However, the Hcall H_SVM_INIT_ABORT is made by the Ultravisor passing the
> context of the VM calling UV_ESM. This allows the Hypervisor to return to
> the guest without going through the Ultravisor. Thus the Secure bit of SRR1
> is not set in that particular case.
> 
> In the case a regular VM is calling H_SVM_INIT_ABORT, this hcall will be
> filtered out in kvmppc_h_svm_init_abort() because kvm->arch.secure_guest is
> not set in that case.
> 
> Fixes: 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_* Hcalls")
> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
> ---

Reviewed-by: Greg Kurz <groug@kaod.org>

>  arch/powerpc/kvm/book3s_hv.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 93493f0cbfe8..6ad1a3b14300 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -1099,9 +1099,12 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
>  			ret = kvmppc_h_svm_init_done(vcpu->kvm);
>  		break;
>  	case H_SVM_INIT_ABORT:
> -		ret = H_UNSUPPORTED;
> -		if (kvmppc_get_srr1(vcpu) & MSR_S)
> -			ret = kvmppc_h_svm_init_abort(vcpu->kvm);
> +		/*
> +		 * Even if that call is made by the Ultravisor, the SRR1 value
> +		 * is the guest context one, with the secure bit clear as it has
> +		 * not yet been secured. So we can't check it here.
> +		 */
> +		ret = kvmppc_h_svm_init_abort(vcpu->kvm);
>  		break;
>  
>  	default:



* [PATCH v2 0/7] padata: parallelize deferred page init
From: Daniel Jordan @ 2020-05-20 18:26 UTC (permalink / raw)
  To: Andrew Morton, Herbert Xu, Steffen Klassert
  Cc: David Hildenbrand, Peter Zijlstra, Dave Hansen, Michal Hocko,
	linux-mm, Steven Sistare, Pavel Machek, Alexander Duyck,
	linux-s390, Jonathan Corbet, Daniel Jordan, Jason Gunthorpe,
	Zi Yan, Robert Elliott, Pavel Tatashin, Shile Zhang,
	Josh Triplett, Alex Williamson, Kirill Tkhai, Dan Williams,
	Randy Dunlap, linux-kernel, linux-crypto, Tejun Heo, linuxppc-dev

Deferred struct page init is a bottleneck in kernel boot--the biggest
for us and probably others.  Optimizing it maximizes availability for
large-memory systems and allows spinning up short-lived VMs as needed
without having to leave them running.  It also benefits bare metal
machines hosting VMs that are sensitive to downtime.  In projects such
as VMM Fast Restart[1], where guest state is preserved across kexec
reboot, it helps prevent application and network timeouts in the guests.

So, multithread deferred init to take full advantage of system memory
bandwidth.

Extend padata, a framework that handles many parallel singlethreaded
jobs, to handle multithreaded jobs as well by adding support for
splitting up the work evenly, specifying a minimum amount of work that's
appropriate for one helper thread to do, load balancing between helpers,
and coordinating them.  More documentation in patches 4 and 7.
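
As a rough illustration of the even split and the minimum amount of work per
helper, here is a simplified user-space sketch. It is not the padata
implementation: struct mt_plan and plan_job() are hypothetical names, and real
load balancing would hand out smaller chunks than this one-chunk-per-thread
split.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type: how many threads to use and how much each gets. */
struct mt_plan {
	unsigned int nthreads;
	size_t chunk_size;
};

static struct mt_plan plan_job(size_t size, size_t min_chunk,
			       unsigned int max_threads)
{
	struct mt_plan plan;
	size_t nworks;

	assert(min_chunk > 0 && max_threads > 0);

	/* Never use more threads than there are min_chunk-sized pieces. */
	nworks = size / min_chunk;
	if (nworks < 1)
		nworks = 1;
	if (nworks > max_threads)
		nworks = max_threads;

	plan.nthreads = (unsigned int)nworks;
	/* Split evenly, rounding up so the chunks cover the whole range. */
	plan.chunk_size = (size + nworks - 1) / nworks;
	return plan;
}
```

For example, 1000 units of work with a minimum chunk of 100 and at most 4
threads gives 4 threads with 250 units each, while 150 units with the same
minimum stays on a single thread.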

This series is the first step in a project to address other memory
proportional bottlenecks in the kernel such as pmem struct page init,
vfio page pinning, hugetlb fallocate, and munmap.  Deferred page init
doesn't require concurrency limits, resource control, or priority
adjustments like these other users will because it happens during boot
when the system is otherwise idle and waiting for page init to finish.

This has been run on a variety of x86 systems and speeds up kernel boot
by 3% to 49%, saving up to 1.6 out of 4 seconds.  Patch 5 has more
numbers.

Please review and test, and thanks to Alex, Andrew, Josh, and Pavel for
their feedback in the last version.

The powerpc and s390 lists are included in case they want to give this a
try; they had enabled this feature when it was configured per arch.

Series based on 5.7-rc6 plus these three from mmotm

  mm-call-touch_nmi_watchdog-on-max-order-boundaries-in-deferred-init.patch
  mm-initialize-deferred-pages-with-interrupts-enabled.patch
  mm-call-cond_resched-from-deferred_init_memmap.patch

and it's available here:

  git://oss.oracle.com/git/linux-dmjordan.git padata-mt-definit-v2
  https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-definit-v2

and the future users and related features are available as
work-in-progress:

  git://oss.oracle.com/git/linux-dmjordan.git padata-mt-wip-v0.4
  https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-wip-v0.4

v2:
 - Improve the problem statement (Andrew, Josh, Pavel)
 - Add T-b's to unchanged patches (Josh)
 - Fully initialize max-order blocks to avoid buddy issues (Alex)
 - Parallelize on section-aligned boundaries to avoid potential
   false sharing (Alex)
 - Return the maximum thread count from a function that architectures
   can override, with the generic version returning 1 (current
   behavior).  Override for x86 since that's the only arch this series
   has been tested on so far.  Other archs can test with more threads
   by dropping patch 6.
 - Rebase to v5.7-rc6, rerun tests

RFC v4 [2] -> v1:
 - merged with padata (Peter)
 - got rid of the 'task' nomenclature (Peter, Jon)

future work branch:
 - made lockdep-aware (Jason, Peter)
 - adjust workqueue worker priority with renice_or_cancel() (Tejun)
 - fixed undo problem in VFIO (Alex)

The remaining feedback, mainly resource control awareness (cgroup etc),
is TODO for later series.

[1] https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf
    https://www.youtube.com/watch?v=pBsHnf93tcQ
    https://lore.kernel.org/linux-mm/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/

[2] https://lore.kernel.org/linux-mm/20181105165558.11698-1-daniel.m.jordan@oracle.com/

Daniel Jordan (7):
  padata: remove exit routine
  padata: initialize earlier
  padata: allocate work structures for parallel jobs from a pool
  padata: add basic support for multithreaded jobs
  mm: parallelize deferred_init_memmap()
  mm: make deferred init's max threads arch-specific
  padata: document multithreaded jobs

 Documentation/core-api/padata.rst |  41 +++--
 arch/x86/mm/init_64.c             |  12 ++
 include/linux/memblock.h          |   3 +
 include/linux/padata.h            |  43 ++++-
 init/main.c                       |   2 +
 kernel/padata.c                   | 277 ++++++++++++++++++++++++------
 mm/Kconfig                        |   6 +-
 mm/page_alloc.c                   |  67 +++++++-
 8 files changed, 373 insertions(+), 78 deletions(-)

base-commit: b9bbe6ed63b2b9f2c9ee5cbd0f2c946a2723f4ce
prerequisite-patch-id: 4ad522141e1119a325a9799dad2bd982fbac8b7c
prerequisite-patch-id: 169273327e56f5461101a71dfbd6b4cfd4570cf0
prerequisite-patch-id: 0f34692c8a9673d4c4f6a3545cf8ec3a2abf8620
-- 
2.26.2


* [PATCH v2 1/7] padata: remove exit routine
From: Daniel Jordan @ 2020-05-20 18:26 UTC (permalink / raw)
  To: Andrew Morton, Herbert Xu, Steffen Klassert
  Cc: David Hildenbrand, Peter Zijlstra, Dave Hansen, Michal Hocko,
	linux-mm, Steven Sistare, Pavel Machek, Alexander Duyck,
	linux-s390, Jonathan Corbet, Daniel Jordan, Jason Gunthorpe,
	Zi Yan, Robert Elliott, Pavel Tatashin, Shile Zhang,
	Josh Triplett, Alex Williamson, Kirill Tkhai, Dan Williams,
	Randy Dunlap, linux-kernel, linux-crypto, Tejun Heo, linuxppc-dev
In-Reply-To: <20200520182645.1658949-1-daniel.m.jordan@oracle.com>

padata_driver_exit() is unnecessary because padata isn't built as a
module and doesn't exit.

padata's init routine will soon allocate memory, so getting rid of the
exit function now avoids pointless code to free it.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 kernel/padata.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/kernel/padata.c b/kernel/padata.c
index a6afa12fb75ee..835919c745266 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -1072,10 +1072,4 @@ static __init int padata_driver_init(void)
 }
 module_init(padata_driver_init);
 
-static __exit void padata_driver_exit(void)
-{
-	cpuhp_remove_multi_state(CPUHP_PADATA_DEAD);
-	cpuhp_remove_multi_state(hp_online);
-}
-module_exit(padata_driver_exit);
 #endif
-- 
2.26.2



* [PATCH v2 7/7] padata: document multithreaded jobs
From: Daniel Jordan @ 2020-05-20 18:26 UTC (permalink / raw)
  To: Andrew Morton, Herbert Xu, Steffen Klassert
  Cc: David Hildenbrand, Peter Zijlstra, Dave Hansen, Michal Hocko,
	linux-mm, Steven Sistare, Pavel Machek, Alexander Duyck,
	linux-s390, Jonathan Corbet, Daniel Jordan, Jason Gunthorpe,
	Zi Yan, Robert Elliott, Pavel Tatashin, Shile Zhang,
	Josh Triplett, Alex Williamson, Kirill Tkhai, Dan Williams,
	Randy Dunlap, linux-kernel, linux-crypto, Tejun Heo, linuxppc-dev
In-Reply-To: <20200520182645.1658949-1-daniel.m.jordan@oracle.com>

Add documentation for multithreaded jobs.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 Documentation/core-api/padata.rst | 41 +++++++++++++++++++++++--------
 1 file changed, 31 insertions(+), 10 deletions(-)

diff --git a/Documentation/core-api/padata.rst b/Documentation/core-api/padata.rst
index 9a24c111781d9..b7e047af993e8 100644
--- a/Documentation/core-api/padata.rst
+++ b/Documentation/core-api/padata.rst
@@ -4,23 +4,26 @@
 The padata parallel execution mechanism
 =======================================
 
-:Date: December 2019
+:Date: April 2020
 
 Padata is a mechanism by which the kernel can farm jobs out to be done in
-parallel on multiple CPUs while retaining their ordering.  It was developed for
-use with the IPsec code, which needs to be able to perform encryption and
-decryption on large numbers of packets without reordering those packets.  The
-crypto developers made a point of writing padata in a sufficiently general
-fashion that it could be put to other uses as well.
+parallel on multiple CPUs while optionally retaining their ordering.
 
-Usage
-=====
+It was originally developed for IPsec, which needs to perform encryption and
+decryption on large numbers of packets without reordering those packets.  This
+is currently the sole consumer of padata's serialized job support.
+
+Padata also supports multithreaded jobs, splitting up the job evenly while load
+balancing and coordinating between threads.
+
+Running Serialized Jobs
+=======================
 
 Initializing
 ------------
 
-The first step in using padata is to set up a padata_instance structure for
-overall control of how jobs are to be run::
+The first step in using padata to run parallel jobs is to set up a
+padata_instance structure for overall control of how jobs are to be run::
 
     #include <linux/padata.h>
 
@@ -162,6 +165,24 @@ functions that correspond to the allocation in reverse::
 It is the user's responsibility to ensure all outstanding jobs are complete
 before any of the above are called.
 
+Running Multithreaded Jobs
+==========================
+
+A multithreaded job has a main thread and zero or more helper threads, with the
+main thread participating in the job and then waiting until all helpers have
+finished.  padata splits the job into units called chunks, where a chunk is a
+piece of the job that one thread completes in one call to the thread function.
+
+A user has to do three things to run a multithreaded job.  First, describe the
+job by defining a padata_mt_job structure, which is explained in the Interface
+section.  This includes a pointer to the thread function, which padata will
+call each time it assigns a job chunk to a thread.  Then, define the thread
+function, which accepts three arguments, ``start``, ``end``, and ``arg``, where
+the first two delimit the range that the thread operates on and the last is a
+pointer to the job's shared state, if any.  Prepare the shared state, which is
+typically a stack-allocated structure that wraps the required data.  Last, call
+padata_do_multithreaded(), which will return once the job is finished.
+
 Interface
 =========
 
-- 
2.26.2



* [PATCH v2 3/7] padata: allocate work structures for parallel jobs from a pool
From: Daniel Jordan @ 2020-05-20 18:26 UTC (permalink / raw)
  To: Andrew Morton, Herbert Xu, Steffen Klassert
  Cc: David Hildenbrand, Peter Zijlstra, Dave Hansen, Michal Hocko,
	linux-mm, Steven Sistare, Pavel Machek, Alexander Duyck,
	linux-s390, Jonathan Corbet, Daniel Jordan, Jason Gunthorpe,
	Zi Yan, Robert Elliott, Pavel Tatashin, Shile Zhang,
	Josh Triplett, Alex Williamson, Kirill Tkhai, Dan Williams,
	Randy Dunlap, linux-kernel, linux-crypto, Tejun Heo, linuxppc-dev
In-Reply-To: <20200520182645.1658949-1-daniel.m.jordan@oracle.com>

padata allocates per-CPU, per-instance work structs for parallel jobs.
A do_parallel call assigns a job to a sequence number and hashes the
number to a CPU, where the job will eventually run using the
corresponding work.

This approach fit with how padata used to bind a job to each CPU
round-robin, but makes less sense after commit bfde23ce200e6 ("padata:
unbind parallel jobs from specific CPUs") because a work isn't bound to
a particular CPU anymore, and isn't needed at all for multithreaded jobs
because they don't have sequence numbers.

Replace the per-CPU works with a preallocated pool, which allows sharing
them between existing padata users and the upcoming multithreaded user.
The pool will also facilitate setting NUMA-aware concurrency limits with
later users.

The pool is sized according to the number of possible CPUs.  With this
limit, MAX_OBJ_NUM no longer makes sense, so remove it.

If the global pool is exhausted, a parallel job is run in the current
task instead to throttle a system trying to do too much in parallel.
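
The fallback behavior can be pictured with the following single-threaded
user-space sketch. The demo_* names and the pool size of 2 are assumptions made
for illustration; the patch itself protects its free list with
padata_works_lock and sizes the pool by the number of possible CPUs.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_POOL_SIZE 2	/* assumed; small so exhaustion is easy to see */

struct demo_work {
	struct demo_work *next;		/* free-list linkage */
	void (*fn)(void *data);
	void *data;
};

static struct demo_work demo_pool[DEMO_POOL_SIZE];
static struct demo_work *demo_free_list;

static void demo_pool_init(void)
{
	size_t i;

	demo_free_list = NULL;
	for (i = 0; i < DEMO_POOL_SIZE; i++) {
		demo_pool[i].next = demo_free_list;
		demo_free_list = &demo_pool[i];
	}
}

static struct demo_work *demo_work_alloc(void)
{
	struct demo_work *w = demo_free_list;

	if (w)
		demo_free_list = w->next;
	return w;	/* NULL when the pool is exhausted */
}

static void demo_work_free(struct demo_work *w)
{
	assert(w != NULL);
	w->next = demo_free_list;
	demo_free_list = w;
}

/* Example job: bump the counter that data points to. */
static void demo_count(void *data)
{
	(*(int *)data)++;
}

/*
 * Returns the work item to be queued, or NULL if the pool was empty and
 * the job was run synchronously in the caller instead.
 */
static struct demo_work *demo_submit(void (*fn)(void *data), void *data)
{
	struct demo_work *w = demo_work_alloc();

	if (!w) {
		fn(data);
		return NULL;
	}
	w->fn = fn;
	w->data = data;
	return w;
}
```

Once both items are handed out, a third demo_submit() runs the job in the
caller, which is the throttling effect described above.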

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 include/linux/padata.h |   8 +--
 kernel/padata.c        | 118 +++++++++++++++++++++++++++--------------
 2 files changed, 78 insertions(+), 48 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 476ecfa41f363..3bfa503503ac5 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -24,7 +24,6 @@
  * @list: List entry, to attach to the padata lists.
  * @pd: Pointer to the internal control structure.
  * @cb_cpu: Callback cpu for serializatioon.
- * @cpu: Cpu for parallelization.
  * @seq_nr: Sequence number of the parallelized data object.
  * @info: Used to pass information from the parallel to the serial function.
  * @parallel: Parallel execution function.
@@ -34,7 +33,6 @@ struct padata_priv {
 	struct list_head	list;
 	struct parallel_data	*pd;
 	int			cb_cpu;
-	int			cpu;
 	unsigned int		seq_nr;
 	int			info;
 	void                    (*parallel)(struct padata_priv *padata);
@@ -68,15 +66,11 @@ struct padata_serial_queue {
 /**
  * struct padata_parallel_queue - The percpu padata parallel queue
  *
- * @parallel: List to wait for parallelization.
  * @reorder: List to wait for reordering after parallel processing.
- * @work: work struct for parallelization.
  * @num_obj: Number of objects that are processed by this cpu.
  */
 struct padata_parallel_queue {
-       struct padata_list    parallel;
        struct padata_list    reorder;
-       struct work_struct    work;
        atomic_t              num_obj;
 };
 
@@ -111,7 +105,7 @@ struct parallel_data {
 	struct padata_parallel_queue	__percpu *pqueue;
 	struct padata_serial_queue	__percpu *squeue;
 	atomic_t			refcnt;
-	atomic_t			seq_nr;
+	unsigned int			seq_nr;
 	unsigned int			processed;
 	int				cpu;
 	struct padata_cpumask		cpumask;
diff --git a/kernel/padata.c b/kernel/padata.c
index 6f709bc0fc413..78ff9aa529204 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -32,7 +32,15 @@
 #include <linux/sysfs.h>
 #include <linux/rcupdate.h>
 
-#define MAX_OBJ_NUM 1000
+struct padata_work {
+	struct work_struct	pw_work;
+	struct list_head	pw_list;  /* padata_free_works linkage */
+	void			*pw_data;
+};
+
+static DEFINE_SPINLOCK(padata_works_lock);
+static struct padata_work *padata_works;
+static LIST_HEAD(padata_free_works);
 
 static void padata_free_pd(struct parallel_data *pd);
 
@@ -58,30 +66,44 @@ static int padata_cpu_hash(struct parallel_data *pd, unsigned int seq_nr)
 	return padata_index_to_cpu(pd, cpu_index);
 }
 
-static void padata_parallel_worker(struct work_struct *parallel_work)
+static struct padata_work *padata_work_alloc(void)
 {
-	struct padata_parallel_queue *pqueue;
-	LIST_HEAD(local_list);
+	struct padata_work *pw;
 
-	local_bh_disable();
-	pqueue = container_of(parallel_work,
-			      struct padata_parallel_queue, work);
+	lockdep_assert_held(&padata_works_lock);
 
-	spin_lock(&pqueue->parallel.lock);
-	list_replace_init(&pqueue->parallel.list, &local_list);
-	spin_unlock(&pqueue->parallel.lock);
+	if (list_empty(&padata_free_works))
+		return NULL;	/* No more work items allowed to be queued. */
 
-	while (!list_empty(&local_list)) {
-		struct padata_priv *padata;
+	pw = list_first_entry(&padata_free_works, struct padata_work, pw_list);
+	list_del(&pw->pw_list);
+	return pw;
+}
 
-		padata = list_entry(local_list.next,
-				    struct padata_priv, list);
+static void padata_work_init(struct padata_work *pw, work_func_t work_fn,
+			     void *data)
+{
+	INIT_WORK(&pw->pw_work, work_fn);
+	pw->pw_data = data;
+}
 
-		list_del_init(&padata->list);
+static void padata_work_free(struct padata_work *pw)
+{
+	lockdep_assert_held(&padata_works_lock);
+	list_add(&pw->pw_list, &padata_free_works);
+}
 
-		padata->parallel(padata);
-	}
+static void padata_parallel_worker(struct work_struct *parallel_work)
+{
+	struct padata_work *pw = container_of(parallel_work, struct padata_work,
+					      pw_work);
+	struct padata_priv *padata = pw->pw_data;
 
+	local_bh_disable();
+	padata->parallel(padata);
+	spin_lock(&padata_works_lock);
+	padata_work_free(pw);
+	spin_unlock(&padata_works_lock);
 	local_bh_enable();
 }
 
@@ -105,9 +127,9 @@ int padata_do_parallel(struct padata_shell *ps,
 		       struct padata_priv *padata, int *cb_cpu)
 {
 	struct padata_instance *pinst = ps->pinst;
-	int i, cpu, cpu_index, target_cpu, err;
-	struct padata_parallel_queue *queue;
+	int i, cpu, cpu_index, err;
 	struct parallel_data *pd;
+	struct padata_work *pw;
 
 	rcu_read_lock_bh();
 
@@ -135,25 +157,25 @@ int padata_do_parallel(struct padata_shell *ps,
 	if ((pinst->flags & PADATA_RESET))
 		goto out;
 
-	if (atomic_read(&pd->refcnt) >= MAX_OBJ_NUM)
-		goto out;
-
-	err = 0;
 	atomic_inc(&pd->refcnt);
 	padata->pd = pd;
 	padata->cb_cpu = *cb_cpu;
 
-	padata->seq_nr = atomic_inc_return(&pd->seq_nr);
-	target_cpu = padata_cpu_hash(pd, padata->seq_nr);
-	padata->cpu = target_cpu;
-	queue = per_cpu_ptr(pd->pqueue, target_cpu);
-
-	spin_lock(&queue->parallel.lock);
-	list_add_tail(&padata->list, &queue->parallel.list);
-	spin_unlock(&queue->parallel.lock);
+	rcu_read_unlock_bh();
 
-	queue_work(pinst->parallel_wq, &queue->work);
+	spin_lock(&padata_works_lock);
+	padata->seq_nr = ++pd->seq_nr;
+	pw = padata_work_alloc();
+	spin_unlock(&padata_works_lock);
+	if (pw) {
+		padata_work_init(pw, padata_parallel_worker, padata);
+		queue_work(pinst->parallel_wq, &pw->pw_work);
+	} else {
+		/* Maximum works limit exceeded, run in the current task. */
+		padata->parallel(padata);
+	}
 
+	return 0;
 out:
 	rcu_read_unlock_bh();
 
@@ -324,8 +346,9 @@ static void padata_serial_worker(struct work_struct *serial_work)
 void padata_do_serial(struct padata_priv *padata)
 {
 	struct parallel_data *pd = padata->pd;
+	int hashed_cpu = padata_cpu_hash(pd, padata->seq_nr);
 	struct padata_parallel_queue *pqueue = per_cpu_ptr(pd->pqueue,
-							   padata->cpu);
+							   hashed_cpu);
 	struct padata_priv *cur;
 
 	spin_lock(&pqueue->reorder.lock);
@@ -416,8 +439,6 @@ static void padata_init_pqueues(struct parallel_data *pd)
 		pqueue = per_cpu_ptr(pd->pqueue, cpu);
 
 		__padata_list_init(&pqueue->reorder);
-		__padata_list_init(&pqueue->parallel);
-		INIT_WORK(&pqueue->work, padata_parallel_worker);
 		atomic_set(&pqueue->num_obj, 0);
 	}
 }
@@ -451,7 +472,7 @@ static struct parallel_data *padata_alloc_pd(struct padata_shell *ps)
 
 	padata_init_pqueues(pd);
 	padata_init_squeues(pd);
-	atomic_set(&pd->seq_nr, -1);
+	pd->seq_nr = -1;
 	atomic_set(&pd->refcnt, 1);
 	spin_lock_init(&pd->lock);
 	pd->cpu = cpumask_first(pd->cpumask.pcpu);
@@ -1051,6 +1072,7 @@ EXPORT_SYMBOL(padata_free_shell);
 
 void __init padata_init(void)
 {
+	unsigned int i, possible_cpus;
 #ifdef CONFIG_HOTPLUG_CPU
 	int ret;
 
@@ -1062,13 +1084,27 @@ void __init padata_init(void)
 
 	ret = cpuhp_setup_state_multi(CPUHP_PADATA_DEAD, "padata:dead",
 				      NULL, padata_cpu_dead);
-	if (ret < 0) {
-		cpuhp_remove_multi_state(hp_online);
-		goto err;
-	}
+	if (ret < 0)
+		goto remove_online_state;
+#endif
+
+	possible_cpus = num_possible_cpus();
+	padata_works = kmalloc_array(possible_cpus, sizeof(struct padata_work),
+				     GFP_KERNEL);
+	if (!padata_works)
+		goto remove_dead_state;
+
+	for (i = 0; i < possible_cpus; ++i)
+		list_add(&padata_works[i].pw_list, &padata_free_works);
 
 	return;
+
+remove_dead_state:
+#ifdef CONFIG_HOTPLUG_CPU
+	cpuhp_remove_multi_state(CPUHP_PADATA_DEAD);
+remove_online_state:
+	cpuhp_remove_multi_state(hp_online);
 err:
-	pr_warn("padata: initialization failed\n");
 #endif
+	pr_warn("padata: initialization failed\n");
 }
-- 
2.26.2



* [PATCH v2 2/7] padata: initialize earlier
From: Daniel Jordan @ 2020-05-20 18:26 UTC (permalink / raw)
  To: Andrew Morton, Herbert Xu, Steffen Klassert
  Cc: David Hildenbrand, Peter Zijlstra, Dave Hansen, Michal Hocko,
	linux-mm, Steven Sistare, Pavel Machek, Alexander Duyck,
	linux-s390, Jonathan Corbet, Daniel Jordan, Jason Gunthorpe,
	Zi Yan, Robert Elliott, Pavel Tatashin, Shile Zhang,
	Josh Triplett, Alex Williamson, Kirill Tkhai, Dan Williams,
	Randy Dunlap, linux-kernel, linux-crypto, Tejun Heo, linuxppc-dev
In-Reply-To: <20200520182645.1658949-1-daniel.m.jordan@oracle.com>

padata will soon initialize the system's struct pages in parallel, so it
needs to be ready by page_alloc_init_late().

The error return from padata_driver_init() triggers an initcall warning,
so add a warning to padata_init() to avoid silent failure.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 include/linux/padata.h |  6 ++++++
 init/main.c            |  2 ++
 kernel/padata.c        | 17 ++++++++---------
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index a0d8b41850b25..476ecfa41f363 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -164,6 +164,12 @@ struct padata_instance {
 #define	PADATA_INVALID	4
 };
 
+#ifdef CONFIG_PADATA
+extern void __init padata_init(void);
+#else
+static inline void __init padata_init(void) {}
+#endif
+
 extern struct padata_instance *padata_alloc_possible(const char *name);
 extern void padata_free(struct padata_instance *pinst);
 extern struct padata_shell *padata_alloc_shell(struct padata_instance *pinst);
diff --git a/init/main.c b/init/main.c
index 03371976d3872..8ab521f7af5d2 100644
--- a/init/main.c
+++ b/init/main.c
@@ -94,6 +94,7 @@
 #include <linux/rodata_test.h>
 #include <linux/jump_label.h>
 #include <linux/mem_encrypt.h>
+#include <linux/padata.h>
 
 #include <asm/io.h>
 #include <asm/bugs.h>
@@ -1482,6 +1483,7 @@ static noinline void __init kernel_init_freeable(void)
 	smp_init();
 	sched_init_smp();
 
+	padata_init();
 	page_alloc_init_late();
 	/* Initialize page ext after all struct pages are initialized. */
 	page_ext_init();
diff --git a/kernel/padata.c b/kernel/padata.c
index 835919c745266..6f709bc0fc413 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -31,7 +31,6 @@
 #include <linux/slab.h>
 #include <linux/sysfs.h>
 #include <linux/rcupdate.h>
-#include <linux/module.h>
 
 #define MAX_OBJ_NUM 1000
 
@@ -1050,26 +1049,26 @@ void padata_free_shell(struct padata_shell *ps)
 }
 EXPORT_SYMBOL(padata_free_shell);
 
-#ifdef CONFIG_HOTPLUG_CPU
-
-static __init int padata_driver_init(void)
+void __init padata_init(void)
 {
+#ifdef CONFIG_HOTPLUG_CPU
 	int ret;
 
 	ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "padata:online",
 				      padata_cpu_online, NULL);
 	if (ret < 0)
-		return ret;
+		goto err;
 	hp_online = ret;
 
 	ret = cpuhp_setup_state_multi(CPUHP_PADATA_DEAD, "padata:dead",
 				      NULL, padata_cpu_dead);
 	if (ret < 0) {
 		cpuhp_remove_multi_state(hp_online);
-		return ret;
+		goto err;
 	}
-	return 0;
-}
-module_init(padata_driver_init);
 
+	return;
+err:
+	pr_warn("padata: initialization failed\n");
 #endif
+}
-- 
2.26.2



* [PATCH v2 6/7] mm: make deferred init's max threads arch-specific
From: Daniel Jordan @ 2020-05-20 18:26 UTC (permalink / raw)
  To: Andrew Morton, Herbert Xu, Steffen Klassert
  Cc: David Hildenbrand, Peter Zijlstra, Dave Hansen, Michal Hocko,
	linux-mm, Steven Sistare, Pavel Machek, Alexander Duyck,
	linux-s390, Jonathan Corbet, Daniel Jordan, Jason Gunthorpe,
	Zi Yan, Robert Elliott, Pavel Tatashin, Shile Zhang,
	Josh Triplett, Alex Williamson, Kirill Tkhai, Dan Williams,
	Randy Dunlap, linux-kernel, linux-crypto, Tejun Heo, linuxppc-dev
In-Reply-To: <20200520182645.1658949-1-daniel.m.jordan@oracle.com>

Using padata during deferred init has only been tested on x86, so for
now limit it to this architecture.

If another arch wants this, it can find the max thread limit that's best
for it and override deferred_page_init_max_threads().
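
As a rough illustration of the override mechanism, here is a userspace
sketch, not kernel code.  demo_max_threads() and demo_arch_max_threads()
are hypothetical names, and __attribute__((weak)) is the GCC/Clang
attribute behind the kernel's __weak:

```c
/*
 * Generic fallback.  Marked weak, so a non-weak definition of the same
 * symbol in another object file replaces it at link time.
 */
__attribute__((weak)) int demo_max_threads(unsigned int node_cpus)
{
	(void)node_cpus;
	return 1;	/* stay single-threaded unless overridden */
}

/*
 * What an arch might supply.  In a real override this would be named
 * demo_max_threads() and live in its own file; it has a different
 * name here only so the sketch builds as one translation unit.
 */
int demo_arch_max_threads(unsigned int node_cpus)
{
	return node_cpus > 1 ? (int)node_cpus : 1;
}
```

With only the weak definition linked in, callers get 1 thread, which
matches the generic default in this patch.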

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 arch/x86/mm/init_64.c    | 12 ++++++++++++
 include/linux/memblock.h |  3 +++
 mm/page_alloc.c          | 13 ++++++++-----
 3 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 8b5f73f5e207c..2d749ec12ea8a 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1260,6 +1260,18 @@ void __init mem_init(void)
 	mem_init_print_info(NULL);
 }
 
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask)
+{
+	/*
+	 * More CPUs always led to greater speedups on tested systems, up to
+	 * all the nodes' CPUs.  Use all since the system is otherwise idle
+	 * now.
+	 */
+	return max_t(int, cpumask_weight(node_cpumask), 1);
+}
+#endif
+
 int kernel_set_to_readonly;
 
 void mark_rodata_ro(void)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 6bc37a731d27b..2b289df44194f 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -275,6 +275,9 @@ void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
 #define for_each_free_mem_pfn_range_in_zone_from(i, zone, p_start, p_end) \
 	for (; i != U64_MAX;					  \
 	     __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
+
+int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask);
+
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
 /**
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9cb780e8dec78..0d7d805f98b2d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1843,6 +1843,13 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
 	atomic_long_add(nr_pages, &args->nr_pages);
 }
 
+/* An arch may override for more concurrency. */
+__weak int __init
+deferred_page_init_max_threads(const struct cpumask *node_cpumask)
+{
+	return 1;
+}
+
 /* Initialise remaining memory on a node */
 static int __init deferred_init_memmap(void *data)
 {
@@ -1891,11 +1898,7 @@ static int __init deferred_init_memmap(void *data)
 						 first_init_pfn))
 		goto zone_empty;
 
-	/*
-	 * More CPUs always led to greater speedups on tested systems, up to
-	 * all the nodes' CPUs.  Use all since the system is otherwise idle now.
-	 */
-	max_threads = max(cpumask_weight(cpumask), 1u);
+	max_threads = deferred_page_init_max_threads(cpumask);
 
 	while (spfn < epfn) {
 		epfn_align = ALIGN_DOWN(epfn, PAGES_PER_SECTION);
-- 
2.26.2



* [PATCH v2 5/7] mm: parallelize deferred_init_memmap()
From: Daniel Jordan @ 2020-05-20 18:26 UTC (permalink / raw)
  To: Andrew Morton, Herbert Xu, Steffen Klassert
  Cc: David Hildenbrand, Peter Zijlstra, Dave Hansen, Michal Hocko,
	linux-mm, Steven Sistare, Pavel Machek, Alexander Duyck,
	linux-s390, Jonathan Corbet, Daniel Jordan, Jason Gunthorpe,
	Zi Yan, Robert Elliott, Pavel Tatashin, Shile Zhang,
	Josh Triplett, Alex Williamson, Kirill Tkhai, Dan Williams,
	Randy Dunlap, linux-kernel, linux-crypto, Tejun Heo, linuxppc-dev
In-Reply-To: <20200520182645.1658949-1-daniel.m.jordan@oracle.com>

Deferred struct page init is a significant bottleneck in kernel boot.
Optimizing it maximizes availability for large-memory systems and allows
spinning up short-lived VMs as needed without having to leave them
running.  It also benefits bare metal machines hosting VMs that are
sensitive to downtime.  In projects such as VMM Fast Restart[1], where
guest state is preserved across kexec reboot, it helps prevent
application and network timeouts in the guests.

Multithread deferred init to take full advantage of system memory
bandwidth.

The maximum number of threads is capped at the number of CPUs on the
node because speedups always improve with additional threads on every
system tested, and at this phase of boot, the system is otherwise idle
and waiting on page init to finish.

Helper threads operate on section-aligned ranges to both avoid false
sharing when setting the pageblock's migrate type and to avoid accessing
uninitialized buddy pages, though max order alignment is enough for the
latter.

The minimum chunk size is also a section.  There was benefit to using
multiple threads even on relatively small memory (1G) systems, and this
is the smallest size that the alignment allows.
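
A simplified model of the section alignment described above, as a
userspace sketch.  split_range() and DEMO_PAGES_PER_SECTION are
hypothetical, and the value 32768 assumes 128 MiB sections with 4 KiB
pages.  The real loop also walks memblock ranges and handles the
unaligned head and tail serially via deferred_init_maxorder():

```c
#include <assert.h>

/* Assumption: 128 MiB sections and 4 KiB pages. */
#define DEMO_PAGES_PER_SECTION 32768UL

struct pfn_split {
	unsigned long mt_start;	/* first pfn handed to helper threads */
	unsigned long mt_end;	/* one past the last pfn handed to helpers */
};

/* Carve the section-aligned middle out of [spfn, epfn). */
static struct pfn_split split_range(unsigned long spfn, unsigned long epfn)
{
	const unsigned long sec = DEMO_PAGES_PER_SECTION;
	struct pfn_split s;

	assert(spfn < epfn);

	s.mt_start = ((spfn + sec - 1) / sec) * sec;	/* round start up */
	s.mt_end = (epfn / sec) * sec;			/* round end down */

	/* No whole section in the range: nothing to multithread. */
	if (s.mt_end <= s.mt_start)
		s.mt_start = s.mt_end = spfn;

	return s;
}
```

Pages outside [mt_start, mt_end) would be initialized by the calling
thread alone in this model.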

The time (milliseconds) reported is that of the slowest node to
initialize, since boot blocks until all nodes finish.  intel_pstate is
loaded in active mode without hwp and with turbo enabled, and intel_idle
is active as well.

    Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal)
      2 nodes * 26 cores * 2 threads = 104 CPUs
      384G/node = 768G memory

                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --   4078.0 (  9.0)         --   1779.0 (  8.7)
       2% (  1)       1.4%   4021.3 (  2.9)       3.4%   1717.7 (  7.8)
      12% (  6)      35.1%   2644.7 ( 35.3)      80.8%    341.0 ( 35.5)
      25% ( 13)      38.7%   2498.0 ( 34.2)      89.1%    193.3 ( 32.3)
      37% ( 19)      39.1%   2482.0 ( 25.2)      90.1%    175.3 ( 31.7)
      50% ( 26)      38.8%   2495.0 (  8.7)      89.1%    193.7 (  3.5)
      75% ( 39)      39.2%   2478.0 ( 21.0)      90.3%    172.7 ( 26.7)
     100% ( 52)      40.0%   2448.0 (  2.0)      91.9%    143.3 (  1.5)

    Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, bare metal)
      1 node * 16 cores * 2 threads = 32 CPUs
      192G/node = 192G memory

                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --   1996.0 ( 18.0)         --   1104.3 (  6.7)
       3% (  1)       1.4%   1968.0 (  3.0)       2.7%   1074.7 (  9.0)
      12% (  4)      40.1%   1196.0 ( 22.7)      72.4%    305.3 ( 16.8)
      25% (  8)      47.4%   1049.3 ( 17.2)      84.2%    174.0 ( 10.6)
      37% ( 12)      48.3%   1032.0 ( 14.9)      86.8%    145.3 (  2.5)
      50% ( 16)      48.9%   1020.3 (  2.5)      88.0%    133.0 (  1.7)
      75% ( 24)      49.1%   1016.3 (  8.1)      88.4%    128.0 (  1.7)
     100% ( 32)      49.4%   1009.0 (  8.5)      88.6%    126.3 (  0.6)

    Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal)
      2 nodes * 18 cores * 2 threads = 72 CPUs
      128G/node = 256G memory

                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --   1682.7 (  6.7)         --    630.0 (  4.6)
       3% (  1)       0.4%   1676.0 (  2.0)       0.7%    625.3 (  3.2)
      12% (  4)      25.8%   1249.0 (  1.0)      68.2%    200.3 (  1.2)
      25% (  9)      30.0%   1178.0 (  5.2)      79.7%    128.0 (  3.5)
      37% ( 13)      30.6%   1167.7 (  3.1)      81.3%    117.7 (  1.2)
      50% ( 18)      30.6%   1167.3 (  2.3)      81.4%    117.0 (  1.0)
      75% ( 27)      31.0%   1161.3 (  4.6)      82.5%    110.0 (  6.9)
     100% ( 36)      32.1%   1142.0 (  3.6)      85.7%     90.0 (  1.0)

    AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
      1 node * 8 cores * 2 threads = 16 CPUs
      64G/node = 64G memory

                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --   1003.7 ( 16.6)         --    243.3 (  8.1)
       6% (  1)       1.4%    990.0 (  4.6)       1.2%    240.3 (  1.5)
      12% (  2)      11.4%    889.3 ( 16.7)      44.5%    135.0 (  3.0)
      25% (  4)      16.8%    835.3 (  9.0)      65.8%     83.3 (  2.5)
      37% (  6)      18.6%    816.7 ( 17.6)      70.4%     72.0 (  1.0)
      50% (  8)      18.2%    821.0 (  5.0)      70.7%     71.3 (  1.2)
      75% ( 12)      19.0%    813.3 (  5.0)      71.8%     68.7 (  2.1)
     100% ( 16)      19.8%    805.3 ( 10.8)      76.4%     57.3 ( 15.9)

Server-oriented distros that enable deferred page init sometimes run in
small VMs, and they still benefit even though the fraction of boot time
saved is smaller:

    AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
      1 node * 2 cores * 2 threads = 4 CPUs
      16G/node = 16G memory

                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --    722.3 (  9.5)         --     50.7 (  0.6)
      25% (  1)      -3.3%    746.3 (  4.7)      -2.0%     51.7 (  1.2)
      50% (  2)       0.2%    721.0 ( 11.3)      29.6%     35.7 (  4.9)
      75% (  3)      -0.3%    724.3 ( 11.2)      48.7%     26.0 (  0.0)
     100% (  4)       3.0%    700.3 ( 13.6)      55.9%     22.3 (  0.6)

    Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, kvm guest)
      1 node * 2 cores * 2 threads = 4 CPUs
      14G/node = 14G memory

                   kernel boot                 deferred init
                   ------------------------    ------------------------
    node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
          (  0)         --    673.0 (  6.9)         --     57.0 (  1.0)
      25% (  1)      -0.6%    677.3 ( 19.8)       1.8%     56.0 (  1.0)
      50% (  2)       3.4%    650.0 (  3.6)      36.8%     36.0 (  5.2)
      75% (  3)       4.2%    644.7 (  7.6)      56.1%     25.0 (  1.0)
     100% (  4)       5.3%    637.0 (  5.6)      63.2%     21.0 (  0.0)
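
The tables do not state how the speedup column is derived.  The figures
are consistent with the percent reduction in time relative to the
thr = 0 row, which can be checked with a small sketch (speedup_pct() is
a hypothetical helper, not part of the series):

```c
#include <assert.h>

/* Percent reduction in time relative to the baseline (thr = 0) row. */
static double speedup_pct(double base_ms, double time_ms)
{
	assert(base_ms > 0.0);
	return (base_ms - time_ms) / base_ms * 100.0;
}
```

For the Skylake kernel boot row at 52 threads, speedup_pct(4078.0,
2448.0) comes to about 40.0, and a run slower than the baseline gives a
negative value, as in the 25% rows of the small-VM tables.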

On Josh's 96-CPU and 192G memory system:

    Without this patch series:
    [    0.487132] node 0 initialised, 23398907 pages in 292ms
    [    0.499132] node 1 initialised, 24189223 pages in 304ms
    ...
    [    0.629376] Run /sbin/init as init process

    With this patch series:
    [    0.227868] node 0 initialised, 23398907 pages in 28ms
    [    0.230019] node 1 initialised, 24189223 pages in 28ms
    ...
    [    0.361069] Run /sbin/init as init process

[1] https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 mm/Kconfig      |  6 ++---
 mm/page_alloc.c | 60 ++++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 58 insertions(+), 8 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index c1acc34c1c358..04c1da3f9f44c 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -750,13 +750,13 @@ config DEFERRED_STRUCT_PAGE_INIT
 	depends on SPARSEMEM
 	depends on !NEED_PER_CPU_KM
 	depends on 64BIT
+	select PADATA
 	help
 	  Ordinarily all struct pages are initialised during early boot in a
 	  single thread. On very large machines this can take a considerable
 	  amount of time. If this option is set, large machines will bring up
-	  a subset of memmap at boot and then initialise the rest in parallel
-	  by starting one-off "pgdatinitX" kernel thread for each node X. This
-	  has a potential performance impact on processes running early in the
+	  a subset of memmap at boot and then initialise the rest in parallel.
+	  This has a potential performance impact on tasks running early in the
 	  lifetime of the system until these kthreads finish the
 	  initialisation.
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d0c0d9364aa6d..9cb780e8dec78 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -68,6 +68,7 @@
 #include <linux/lockdep.h>
 #include <linux/nmi.h>
 #include <linux/psi.h>
+#include <linux/padata.h>
 
 #include <asm/sections.h>
 #include <asm/tlbflush.h>
@@ -1814,16 +1815,44 @@ deferred_init_maxorder(u64 *i, struct zone *zone, unsigned long *start_pfn,
 	return nr_pages;
 }
 
+struct definit_args {
+	struct zone *zone;
+	atomic_long_t nr_pages;
+};
+
+static void __init
+deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
+			   void *arg)
+{
+	unsigned long spfn, epfn, nr_pages = 0;
+	struct definit_args *args = arg;
+	struct zone *zone = args->zone;
+	u64 i;
+
+	deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn, start_pfn);
+
+	/*
+	 * Initialize and free pages in MAX_ORDER sized increments so that we
+	 * can avoid introducing any issues with the buddy allocator.
+	 */
+	while (spfn < end_pfn) {
+		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+		cond_resched();
+	}
+
+	atomic_long_add(nr_pages, &args->nr_pages);
+}
+
 /* Initialise remaining memory on a node */
 static int __init deferred_init_memmap(void *data)
 {
 	pg_data_t *pgdat = data;
 	const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
 	unsigned long spfn = 0, epfn = 0, nr_pages = 0;
-	unsigned long first_init_pfn, flags;
+	unsigned long first_init_pfn, flags, epfn_align;
 	unsigned long start = jiffies;
 	struct zone *zone;
-	int zid;
+	int zid, max_threads;
 	u64 i;
 
 	/* Bind memory initialisation thread to a local node if possible */
@@ -1863,11 +1892,32 @@ static int __init deferred_init_memmap(void *data)
 		goto zone_empty;
 
 	/*
-	 * Initialize and free pages in MAX_ORDER sized increments so
-	 * that we can avoid introducing any issues with the buddy
-	 * allocator.
+	 * More CPUs always led to greater speedups on tested systems, up to
+	 * all the nodes' CPUs.  Use all since the system is otherwise idle now.
 	 */
+	max_threads = max(cpumask_weight(cpumask), 1u);
+
 	while (spfn < epfn) {
+		epfn_align = ALIGN_DOWN(epfn, PAGES_PER_SECTION);
+
+		if (IS_ALIGNED(spfn, PAGES_PER_SECTION) &&
+		    epfn_align - spfn >= PAGES_PER_SECTION) {
+			struct definit_args arg = { zone, ATOMIC_LONG_INIT(0) };
+			struct padata_mt_job job = {
+				.thread_fn   = deferred_init_memmap_chunk,
+				.fn_arg      = &arg,
+				.start       = spfn,
+				.size        = epfn_align - spfn,
+				.align       = PAGES_PER_SECTION,
+				.min_chunk   = PAGES_PER_SECTION,
+				.max_threads = max_threads,
+			};
+
+			padata_do_multithreaded(&job);
+			nr_pages += atomic_long_read(&arg.nr_pages);
+			spfn = epfn_align;
+		}
+
 		nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
 		cond_resched();
 	}
-- 
2.26.2



* [PATCH v2 4/7] padata: add basic support for multithreaded jobs
From: Daniel Jordan @ 2020-05-20 18:26 UTC (permalink / raw)
  To: Andrew Morton, Herbert Xu, Steffen Klassert
  Cc: David Hildenbrand, Peter Zijlstra, Dave Hansen, Michal Hocko,
	linux-mm, Steven Sistare, Pavel Machek, Alexander Duyck,
	linux-s390, Jonathan Corbet, Daniel Jordan, Jason Gunthorpe,
	Zi Yan, Robert Elliott, Pavel Tatashin, Shile Zhang,
	Josh Triplett, Alex Williamson, Kirill Tkhai, Dan Williams,
	Randy Dunlap, linux-kernel, linux-crypto, Tejun Heo, linuxppc-dev
In-Reply-To: <20200520182645.1658949-1-daniel.m.jordan@oracle.com>

Sometimes the kernel doesn't take full advantage of system memory
bandwidth, leading to a single CPU spending excessive time in
initialization paths where the data scales with memory size.

Multithreading naturally addresses this problem.

Extend padata, a framework that handles many parallel yet single-threaded
jobs, to also handle multithreaded jobs by adding support for splitting
up the work evenly, specifying a minimum amount of work that's
appropriate for one helper thread to do, load balancing between helpers,
and coordinating them.
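
The sizing arithmetic can be modelled in userspace roughly as follows.
This is a sketch only: struct mt_plan, round_up_to() and plan_job() are
hypothetical names, and it ignores that the patch skips coordination
when one thread suffices and may get fewer work items than requested if
the free pool runs dry:

```c
#include <assert.h>

struct mt_plan {
	unsigned long nworks;		/* threads taking part, caller included */
	unsigned long chunk_size;	/* units handed to one thread_fn call */
};

static unsigned long round_up_to(unsigned long x, unsigned long align)
{
	return ((x + align - 1) / align) * align;
}

static struct mt_plan plan_job(unsigned long size, unsigned long min_chunk,
			       unsigned long align, unsigned long max_threads)
{
	const unsigned long load_balance_factor = 4;
	struct mt_plan p;

	assert(size > 0 && min_chunk > 0 && align > 0 && max_threads > 0);

	/* At least one thread, at most one per min_chunk of work. */
	p.nworks = size / min_chunk;
	if (p.nworks < 1)
		p.nworks = 1;
	if (p.nworks > max_threads)
		p.nworks = max_threads;

	/* Several chunks per thread so early finishers can take more work. */
	p.chunk_size = size / (p.nworks * load_balance_factor);
	if (p.chunk_size < min_chunk)
		p.chunk_size = min_chunk;
	p.chunk_size = round_up_to(p.chunk_size, align);

	return p;
}
```

For example, a job of size 100 with min_chunk 10, align 8 and
max_threads 4 yields 4 threads and a chunk size of 16 in this model.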

This is inspired by work from Pavel Tatashin and Steve Sistare.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
---
 include/linux/padata.h |  29 ++++++++
 kernel/padata.c        | 152 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 178 insertions(+), 3 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 3bfa503503ac5..b0affa466a841 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -4,6 +4,9 @@
  *
  * Copyright (C) 2008, 2009 secunet Security Networks AG
  * Copyright (C) 2008, 2009 Steffen Klassert <steffen.klassert@secunet.com>
+ *
+ * Copyright (c) 2020 Oracle and/or its affiliates.
+ * Author: Daniel Jordan <daniel.m.jordan@oracle.com>
  */
 
 #ifndef PADATA_H
@@ -130,6 +133,31 @@ struct padata_shell {
 	struct list_head		list;
 };
 
+/**
+ * struct padata_mt_job - represents one multithreaded job
+ *
+ * @thread_fn: Called for each chunk of work that a padata thread does.
+ * @fn_arg: The thread function argument.
+ * @start: The start of the job (units are job-specific).
+ * @size: size of this node's work (units are job-specific).
+ * @align: Ranges passed to the thread function fall on this boundary, with the
+ *         possible exceptions of the beginning and end of the job.
+ * @min_chunk: The minimum chunk size in job-specific units.  This allows
+ *             the client to communicate the minimum amount of work that's
+ *             appropriate for one worker thread to do at once.
+ * @max_threads: Max threads to use for the job, actual number may be less
+ *               depending on task size and minimum chunk size.
+ */
+struct padata_mt_job {
+	void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
+	void			*fn_arg;
+	unsigned long		start;
+	unsigned long		size;
+	unsigned long		align;
+	unsigned long		min_chunk;
+	int			max_threads;
+};
+
 /**
  * struct padata_instance - The overall control structure.
  *
@@ -171,6 +199,7 @@ extern void padata_free_shell(struct padata_shell *ps);
 extern int padata_do_parallel(struct padata_shell *ps,
 			      struct padata_priv *padata, int *cb_cpu);
 extern void padata_do_serial(struct padata_priv *padata);
+extern void __init padata_do_multithreaded(struct padata_mt_job *job);
 extern int padata_set_cpumask(struct padata_instance *pinst, int cpumask_type,
 			      cpumask_var_t cpumask);
 extern int padata_start(struct padata_instance *pinst);
diff --git a/kernel/padata.c b/kernel/padata.c
index 78ff9aa529204..e78f57d9aef90 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -7,6 +7,9 @@
  * Copyright (C) 2008, 2009 secunet Security Networks AG
  * Copyright (C) 2008, 2009 Steffen Klassert <steffen.klassert@secunet.com>
  *
+ * Copyright (c) 2020 Oracle and/or its affiliates.
+ * Author: Daniel Jordan <daniel.m.jordan@oracle.com>
+ *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms and conditions of the GNU General Public License,
  * version 2, as published by the Free Software Foundation.
@@ -21,6 +24,7 @@
  * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
  */
 
+#include <linux/completion.h>
 #include <linux/export.h>
 #include <linux/cpumask.h>
 #include <linux/err.h>
@@ -32,6 +36,8 @@
 #include <linux/sysfs.h>
 #include <linux/rcupdate.h>
 
+#define	PADATA_WORK_ONSTACK	1	/* Work's memory is on stack */
+
 struct padata_work {
 	struct work_struct	pw_work;
 	struct list_head	pw_list;  /* padata_free_works linkage */
@@ -42,7 +48,17 @@ static DEFINE_SPINLOCK(padata_works_lock);
 static struct padata_work *padata_works;
 static LIST_HEAD(padata_free_works);
 
+struct padata_mt_job_state {
+	spinlock_t		lock;
+	struct completion	completion;
+	struct padata_mt_job	*job;
+	int			nworks;
+	int			nworks_fini;
+	unsigned long		chunk_size;
+};
+
 static void padata_free_pd(struct parallel_data *pd);
+static void __init padata_mt_helper(struct work_struct *work);
 
 static int padata_index_to_cpu(struct parallel_data *pd, int cpu_index)
 {
@@ -81,18 +97,56 @@ static struct padata_work *padata_work_alloc(void)
 }
 
 static void padata_work_init(struct padata_work *pw, work_func_t work_fn,
-			     void *data)
+			     void *data, int flags)
 {
-	INIT_WORK(&pw->pw_work, work_fn);
+	if (flags & PADATA_WORK_ONSTACK)
+		INIT_WORK_ONSTACK(&pw->pw_work, work_fn);
+	else
+		INIT_WORK(&pw->pw_work, work_fn);
 	pw->pw_data = data;
 }
 
+static int __init padata_work_alloc_mt(int nworks, void *data,
+				       struct list_head *head)
+{
+	int i;
+
+	spin_lock(&padata_works_lock);
+	/* Start at 1 because the current task participates in the job. */
+	for (i = 1; i < nworks; ++i) {
+		struct padata_work *pw = padata_work_alloc();
+
+		if (!pw)
+			break;
+		padata_work_init(pw, padata_mt_helper, data, 0);
+		list_add(&pw->pw_list, head);
+	}
+	spin_unlock(&padata_works_lock);
+
+	return i;
+}
+
 static void padata_work_free(struct padata_work *pw)
 {
 	lockdep_assert_held(&padata_works_lock);
 	list_add(&pw->pw_list, &padata_free_works);
 }
 
+static void __init padata_works_free(struct list_head *works)
+{
+	struct padata_work *cur, *next;
+
+	if (list_empty(works))
+		return;
+
+	spin_lock(&padata_works_lock);
+	list_for_each_entry_safe(cur, next, works, pw_list) {
+		list_del(&cur->pw_list);
+		padata_work_free(cur);
+	}
+	spin_unlock(&padata_works_lock);
+}
+
 static void padata_parallel_worker(struct work_struct *parallel_work)
 {
 	struct padata_work *pw = container_of(parallel_work, struct padata_work,
@@ -168,7 +222,7 @@ int padata_do_parallel(struct padata_shell *ps,
 	pw = padata_work_alloc();
 	spin_unlock(&padata_works_lock);
 	if (pw) {
-		padata_work_init(pw, padata_parallel_worker, padata);
+		padata_work_init(pw, padata_parallel_worker, padata, 0);
 		queue_work(pinst->parallel_wq, &pw->pw_work);
 	} else {
 		/* Maximum works limit exceeded, run in the current task. */
@@ -409,6 +463,98 @@ static int pd_setup_cpumasks(struct parallel_data *pd,
 	return err;
 }
 
+static void __init padata_mt_helper(struct work_struct *w)
+{
+	struct padata_work *pw = container_of(w, struct padata_work, pw_work);
+	struct padata_mt_job_state *ps = pw->pw_data;
+	struct padata_mt_job *job = ps->job;
+	bool done;
+
+	spin_lock(&ps->lock);
+
+	while (job->size > 0) {
+		unsigned long start, size, end;
+
+		start = job->start;
+		/* So end is chunk size aligned if enough work remains. */
+		size = roundup(start + 1, ps->chunk_size) - start;
+		size = min(size, job->size);
+		end = start + size;
+
+		job->start = end;
+		job->size -= size;
+
+		spin_unlock(&ps->lock);
+		job->thread_fn(start, end, job->fn_arg);
+		spin_lock(&ps->lock);
+	}
+
+	++ps->nworks_fini;
+	done = (ps->nworks_fini == ps->nworks);
+	spin_unlock(&ps->lock);
+
+	if (done)
+		complete(&ps->completion);
+}
+
+/**
+ * padata_do_multithreaded - run a multithreaded job
+ * @job: Description of the job.
+ *
+ * See the definition of struct padata_mt_job for more details.
+ */
+void __init padata_do_multithreaded(struct padata_mt_job *job)
+{
+	/* In case threads finish at different times. */
+	static const unsigned long load_balance_factor = 4;
+	struct padata_work my_work, *pw;
+	struct padata_mt_job_state ps;
+	LIST_HEAD(works);
+	int nworks;
+
+	if (job->size == 0)
+		return;
+
+	/* Ensure at least one thread when size < min_chunk. */
+	nworks = max(job->size / job->min_chunk, 1ul);
+	nworks = min(nworks, job->max_threads);
+
+	if (nworks == 1) {
+		/* Single thread, no coordination needed, cut to the chase. */
+		job->thread_fn(job->start, job->start + job->size, job->fn_arg);
+		return;
+	}
+
+	spin_lock_init(&ps.lock);
+	init_completion(&ps.completion);
+	ps.job	       = job;
+	ps.nworks      = padata_work_alloc_mt(nworks, &ps, &works);
+	ps.nworks_fini = 0;
+
+	/*
+	 * Chunk size is the amount of work a helper does per call to the
+	 * thread function.  Load balance large jobs between threads by
+	 * increasing the number of chunks, guarantee at least the minimum
+	 * chunk size from the caller, and honor the caller's alignment.
+	 */
+	ps.chunk_size = job->size / (ps.nworks * load_balance_factor);
+	ps.chunk_size = max(ps.chunk_size, job->min_chunk);
+	ps.chunk_size = roundup(ps.chunk_size, job->align);
+
+	list_for_each_entry(pw, &works, pw_list)
+		queue_work(system_unbound_wq, &pw->pw_work);
+
+	/* Use the current thread, which saves starting a workqueue worker. */
+	padata_work_init(&my_work, padata_mt_helper, &ps, PADATA_WORK_ONSTACK);
+	padata_mt_helper(&my_work.pw_work);
+
+	/* Wait for all the helpers to finish. */
+	wait_for_completion(&ps.completion);
+
+	destroy_work_on_stack(&my_work.pw_work);
+	padata_works_free(&works);
+}
+
 static void __padata_list_init(struct padata_list *pd_list)
 {
 	INIT_LIST_HEAD(&pd_list->list);
-- 
2.26.2
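
The chunking arithmetic in padata_do_multithreaded() and padata_mt_helper()
above can be tried out in user space. The following is a minimal sketch, not
the kernel code: the helper names (round_up_to, mt_nworks, mt_chunk_size,
mt_next_size, mt_count_chunks) are made up for illustration, locking and
workqueues are left out, and it assumes the number of works actually
allocated equals the number requested, which the patch does not guarantee.

```c
#include <assert.h>

/* Round x up to the next multiple of y, like the kernel's roundup(). */
static unsigned long round_up_to(unsigned long x, unsigned long y)
{
	assert(y > 0);
	return ((x + y - 1) / y) * y;
}

static unsigned long max_ul(unsigned long a, unsigned long b)
{
	return a > b ? a : b;
}

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Number of works requested: at least one, at most max_threads. */
static unsigned long mt_nworks(unsigned long size, unsigned long min_chunk,
			       unsigned long max_threads)
{
	return min_ul(max_ul(size / min_chunk, 1ul), max_threads);
}

/*
 * Chunk size: split the job into roughly 4 chunks per work for load
 * balancing, never go below min_chunk, then round up to the alignment.
 * ASSUMPTION: nworks here is the requested count; in the patch it is
 * the count returned by padata_work_alloc_mt(), which may be smaller.
 */
static unsigned long mt_chunk_size(unsigned long size, unsigned long nworks,
				   unsigned long min_chunk, unsigned long align)
{
	const unsigned long load_balance_factor = 4;
	unsigned long chunk = size / (nworks * load_balance_factor);

	chunk = max_ul(chunk, min_chunk);
	return round_up_to(chunk, align);
}

/* Size of the next piece a helper would take, as in padata_mt_helper(). */
static unsigned long mt_next_size(unsigned long start, unsigned long remaining,
				  unsigned long chunk_size)
{
	/* Makes the end chunk-size aligned if enough work remains. */
	unsigned long size = round_up_to(start + 1, chunk_size) - start;

	return min_ul(size, remaining);
}

/* Count how many thread_fn calls a job would be split into. */
static unsigned long mt_count_chunks(unsigned long start, unsigned long size,
				     unsigned long chunk_size)
{
	unsigned long n = 0;

	while (size > 0) {
		unsigned long step = mt_next_size(start, size, chunk_size);

		start += step;
		size -= step;
		n++;
	}
	return n;
}
```

For example, with size = 1000, min_chunk = 100, max_threads = 4 and
align = 1, this gives 4 works and a chunk size of 100. A job starting at
offset 50 then gets a short first piece of 50, so that every later chunk
boundary falls on a multiple of the chunk size.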



* [PATCH v2] KVM: PPC: Book3S HV: relax check on H_SVM_INIT_ABORT
From: Laurent Dufour @ 2020-05-20 17:43 UTC (permalink / raw)
  To: kvm-ppc, linuxppc-dev, linux-kernel, paulus; +Cc: sukadev, linuxram, groug
In-Reply-To: <20200520193259.0b66db32@bahia.lan>

Commit 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_*
Hcalls") added checks of the secure bit of SRR1 to filter out the Hcalls
reserved for the Ultravisor.

However, the Hcall H_SVM_INIT_ABORT is made by the Ultravisor passing the
context of the VM calling UV_ESM. This allows the Hypervisor to return to
the guest without going through the Ultravisor. Thus the Secure bit of SRR1
is not set in that particular case.

If a regular VM calls H_SVM_INIT_ABORT, the hcall is filtered out in
kvmppc_h_svm_init_abort() because kvm->arch.secure_guest is not set in
that case.

Fixes: 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_* Hcalls")
Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
---
 arch/powerpc/kvm/book3s_hv.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 93493f0cbfe8..6ad1a3b14300 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1099,9 +1099,12 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 			ret = kvmppc_h_svm_init_done(vcpu->kvm);
 		break;
 	case H_SVM_INIT_ABORT:
-		ret = H_UNSUPPORTED;
-		if (kvmppc_get_srr1(vcpu) & MSR_S)
-			ret = kvmppc_h_svm_init_abort(vcpu->kvm);
+		/*
+		 * Even if that call is made by the Ultravisor, the SRR1 value
+		 * is the guest context one, with the secure bit clear as it has
+		 * not yet been secured. So we can't check it here.
+		 */
+		ret = kvmppc_h_svm_init_abort(vcpu->kvm);
 		break;

 	default:
-- 
2.26.2
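
The effect of this change can be modelled in a few lines of user-space C.
This is a sketch under stated assumptions, not the kernel implementation:
the SKETCH_* return codes and the sketch_* functions are hypothetical
stand-ins, srr1_secure stands for (kvmppc_get_srr1(vcpu) & MSR_S), and
secure_guest stands for a non-zero kvm->arch.secure_guest. The real
kvmppc_h_svm_init_abort() does more than this single check.

```c
#include <stdbool.h>

/* Hypothetical return codes, not the kernel's H_* values. */
enum sketch_ret {
	SKETCH_H_SUCCESS = 0,
	SKETCH_H_UNSUPPORTED = 1,
};

/*
 * Stand-in for kvmppc_h_svm_init_abort(): refuse the hcall when the
 * VM never started the transition to secure mode.
 */
static enum sketch_ret sketch_svm_init_abort(bool secure_guest)
{
	if (!secure_guest)
		return SKETCH_H_UNSUPPORTED;
	return SKETCH_H_SUCCESS;
}

/* Before the patch: the SRR1 secure bit gates the call. */
static enum sketch_ret sketch_abort_before(bool srr1_secure, bool secure_guest)
{
	if (!srr1_secure)
		return SKETCH_H_UNSUPPORTED;
	return sketch_svm_init_abort(secure_guest);
}

/* After the patch: only the per-VM state is checked, in the callee. */
static enum sketch_ret sketch_abort_after(bool srr1_secure, bool secure_guest)
{
	(void)srr1_secure;	/* guest context, secure bit not set yet */
	return sketch_svm_init_abort(secure_guest);
}
```

With srr1_secure = false and secure_guest = true (the Ultravisor aborting
a VM that called UV_ESM), the old path rejects the hcall and the new path
accepts it. With both false (a regular VM), the new path still rejects it.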


* Re: [PATCH] KVM: PPC: Book3S HV: relax check on H_SVM_INIT_ABORT
From: Laurent Dufour @ 2020-05-20 17:35 UTC (permalink / raw)
  To: Greg Kurz; +Cc: linuxram, linux-kernel, kvm-ppc, paulus, sukadev, linuxppc-dev
In-Reply-To: <20200520193259.0b66db32@bahia.lan>

On 20/05/2020 at 19:32, Greg Kurz wrote:
> On Wed, 20 May 2020 18:51:10 +0200
> Laurent Dufour <ldufour@linux.ibm.com> wrote:
> 
>> Commit 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_*
>> Hcalls") added checks of the secure bit of SRR1 to filter out the Hcalls
>> reserved for the Ultravisor.
>>
>> However, the Hcall H_SVM_INIT_ABORT is made by the Ultravisor passing the
>> context of the VM calling UV_ESM. This allows the Hypervisor to return to
>> the guest without going through the Ultravisor. Thus the Secure bit of SRR1
>> is not set in that particular case.
>>
>> If a regular VM calls H_SVM_INIT_ABORT, the hcall is filtered out in
>> kvmppc_h_svm_init_abort() because kvm->arch.secure_guest is not set in
>> that case.
>>
> 
> Why not check vcpu->kvm->arch.secure_guest then?

I don't think that's the right place.
> 
>> Fixes: 8c47b6ff29e3 ("KVM: PPC: Book3S HV: Check caller of H_SVM_* Hcalls")
>> Signed-off-by: Laurent Dufour <ldufour@linux.ibm.com>
>> ---
>>   arch/powerpc/kvm/book3s_hv.c | 4 +---
>>   1 file changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
>> index 93493f0cbfe8..eb1f96cb7b72 100644
>> --- a/arch/powerpc/kvm/book3s_hv.c
>> +++ b/arch/powerpc/kvm/book3s_hv.c
>> @@ -1099,9 +1099,7 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
>>   			ret = kvmppc_h_svm_init_done(vcpu->kvm);
>>   		break;
>>   	case H_SVM_INIT_ABORT:
>> -		ret = H_UNSUPPORTED;
>> -		if (kvmppc_get_srr1(vcpu) & MSR_S)
>> -			ret = kvmppc_h_svm_init_abort(vcpu->kvm);
> 
> or at least put a comment to explain why H_SVM_INIT_ABORT
> doesn't have the same sanity check as the other SVM hcalls.

I agree that might help. I'll send a v2 with a comment there.

> 
>> +		ret = kvmppc_h_svm_init_abort(vcpu->kvm);
>>   		break;
>>   
>>   	default:
> 


