LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH -next] powerpc/spider-pci: Remove set but not used variable 'val'
From: Arnd Bergmann @ 2021-06-15  8:18 UTC (permalink / raw)
  To: Baokun Li
  Cc: YueHaibing, Linux Kernel Mailing List, yangjihong1, Wei Yongjun,
	Paul Mackerras, linuxppc-dev, Yu Kuai
In-Reply-To: <20210601085319.140461-1-libaokun1@huawei.com>

On Tue, Jun 1, 2021 at 10:53 AM Baokun Li <libaokun1@huawei.com> wrote:
>
> Fixes gcc '-Wunused-but-set-variable' warning:
>
> arch/powerpc/platforms/cell/spider-pci.c: In function 'spiderpci_io_flush':
> arch/powerpc/platforms/cell/spider-pci.c:28:6: warning:
> variable ‘val’ set but not used [-Wunused-but-set-variable]
>
> It never used since introduction.
>
> Signed-off-by: Baokun Li <libaokun1@huawei.com>

LGTM

Acked-by: Arnd Bergmann <arnd@arndb.de>

^ permalink raw reply

* Re: [PATCH 0/5] cpufreq: cppc: Fix suspend/resume specific races with FIE code
From: Viresh Kumar @ 2021-06-15  7:50 UTC (permalink / raw)
  To: Qian Cai
  Cc: linuxppc-dev, Vincent Guittot, linux-doc, Jonathan Corbet,
	Dirk Brandewie, linux-pm, Srinivas Pandruvada, Rafael Wysocki,
	linux-kernel, Paul Mackerras, Ionela Voinescu, Len Brown
In-Reply-To: <eaaaf171-5937-e0f2-8447-c1b20b474c62@quicinc.com>

Hi Qian,

First of all thanks for testing this, I need more of your help to test
this out :)

FWIW, I did test this on my Hikey board today, with some hacks, and
tried multiple insmod/rmmod operations for the driver, and I wasn't
able to reproduce the issue you reported. I did enable the list-debug
config option.

On 14-06-21, 09:48, Qian Cai wrote:
> Unfortunately, this series looks like needing more works.
> 
> [  487.773586][    T0] CPU17: Booted secondary processor 0x0000000801 [0x503f0002]
> [  487.976495][  T670] list_del corruption. next->prev should be ffff009b66e9ec70, but was ffff009b66dfec70
> [  487.987037][  T670] ------------[ cut here ]------------
> [  487.992351][  T670] kernel BUG at lib/list_debug.c:54!
> [  487.997810][  T670] Internal error: Oops - BUG: 0 [#1] SMP
> [  488.003295][  T670] Modules linked in: cpufreq_userspace xfs loop cppc_cpufreq processor efivarfs ip_tables x_tables ext4 mbcache jbd2 dm_mod igb i2c_algo_bit nvme mlx5_core i2c_core nvme_core firmware_class
> [  488.021759][  T670] CPU: 1 PID: 670 Comm: cppc_fie Not tainted 5.13.0-rc5-next-20210611+ #46
> [  488.030190][  T670] Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
> [  488.038705][  T670] pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO BTYPE=--)
> [  488.045398][  T670] pc : __list_del_entry_valid+0x154/0x158
> [  488.050969][  T670] lr : __list_del_entry_valid+0x154/0x158
> [  488.056534][  T670] sp : ffff8000229afd70
> [  488.060534][  T670] x29: ffff8000229afd70 x28: ffff0008c8f4f340 x27: dfff800000000000
> [  488.068361][  T670] x26: ffff009b66e9ec70 x25: ffff800011c8b4d0 x24: ffff0008d4bfe488
> [  488.076188][  T670] x23: ffff0008c8f4f340 x22: ffff0008c8f4f340 x21: ffff009b6789ec70
> [  488.084015][  T670] x20: ffff0008d4bfe4c8 x19: ffff009b66e9ec70 x18: ffff0008c8f4fd70
> [  488.091842][  T670] x17: 20747562202c3037 x16: 6365396536366239 x15: 0000000000000028
> [  488.099669][  T670] x14: 0000000000000000 x13: 0000000000000001 x12: ffff60136cdd3447
> [  488.107495][  T670] x11: 1fffe0136cdd3446 x10: ffff60136cdd3446 x9 : ffff8000103ee444
> [  488.115322][  T670] x8 : ffff009b66e9a237 x7 : 0000000000000001 x6 : ffff009b66e9a230
> [  488.123149][  T670] x5 : 00009fec9322cbba x4 : ffff60136cdd3447 x3 : 1fffe001191e9e69
> [  488.130975][  T670] x2 : 0000000000000000 x1 : 0000000000000000 x0 : 0000000000000054
> [  488.138803][  T670] Call trace:
> [  488.141935][  T670]  __list_del_entry_valid+0x154/0x158
> [  488.147153][  T670]  kthread_worker_fn+0x15c/0xda0

This is a strange place to get the issue from. And this is a new
issue.

> [  488.151939][  T670]  kthread+0x3ac/0x460
> [  488.155854][  T670]  ret_from_fork+0x10/0x18
> [  488.160120][  T670] Code: 911e8000 aa1303e1 910a0000 941b595b (d4210000)
> [  488.166901][  T670] ---[ end trace e637e2d38b2cc087 ]---
> [  488.172206][  T670] Kernel panic - not syncing: Oops - BUG: Fatal exception
> [  488.179182][  T670] SMP: stopping secondary CPUs
> [  489.209347][  T670] SMP: failed to stop secondary CPUs 0-1,10-11,16-17,31
> [  489.216128][  T][  T670] Memoryn ]---

Can you give details on what exactly did you try to do, to get this ?
Normal boot or something more ?

I have made some changes to the way calls were happening, may get this
thing sorted. Can you please try this branch ?

https://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm.git/log/?h=cpufreq/cppc

I can see one place where race can happen, i.e. between
topology_clear_scale_freq_source() and topology_scale_freq_tick(). It
is possible that sfd->set_freq_scale() may get called for a previously
set handler as there is no protection there.

I will see how to fix that. But I am not sure if the issue reported
above comes from there.

Anyway, please give my branch a try, lets see.

-- 
viresh

^ permalink raw reply

* Re: [PATCH v12 1/6] kasan: allow an architecture to disable inline instrumentation
From: Marco Elver @ 2021-06-15  7:46 UTC (permalink / raw)
  To: Daniel Axtens
  Cc: aneesh.kumar, LKML, Linux Memory Management List, kasan-dev,
	linuxppc-dev
In-Reply-To: <20210615014705.2234866-2-dja@axtens.net>

On Tue, 15 Jun 2021 at 03:47, Daniel Axtens <dja@axtens.net> wrote:
>
> For annoying architectural reasons, it's very difficult to support inline
> instrumentation on powerpc64.
>
> Add a Kconfig flag to allow an arch to disable inline. (It's a bit
> annoying to be 'backwards', but I'm not aware of any way to have
> an arch force a symbol to be 'n', rather than 'y'.)
>
> We also disable stack instrumentation in this case as it does things that
> are functionally equivalent to inline instrumentation, namely adding
> code that touches the shadow directly without going through a C helper.
>
> Signed-off-by: Daniel Axtens <dja@axtens.net>
> ---
>  lib/Kconfig.kasan | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
>
> diff --git a/lib/Kconfig.kasan b/lib/Kconfig.kasan
> index cffc2ebbf185..935814f332a7 100644
> --- a/lib/Kconfig.kasan
> +++ b/lib/Kconfig.kasan
> @@ -12,6 +12,15 @@ config HAVE_ARCH_KASAN_HW_TAGS
>  config HAVE_ARCH_KASAN_VMALLOC
>         bool
>
> +# Sometimes an architecture might not be able to support inline instrumentation
> +# but might be able to support outline instrumentation. This option allows an
> +# arch to prevent inline and stack instrumentation from being enabled.

This comment could be moved into 'help' of this new config option.

> +# ppc64 turns on virtual memory late in boot, after calling into generic code
> +# like the device-tree parser, so it uses this in conjuntion with a hook in
> +# outline mode to avoid invalid access early in boot.

I think the ppc64-related comment isn't necessary and can be moved to
arch/ppc64 somewhere, if there isn't one already.

> +config ARCH_DISABLE_KASAN_INLINE
> +       bool
> +
>  config CC_HAS_KASAN_GENERIC
>         def_bool $(cc-option, -fsanitize=kernel-address)
>
> @@ -130,6 +139,7 @@ config KASAN_OUTLINE
>
>  config KASAN_INLINE
>         bool "Inline instrumentation"
> +       depends on !ARCH_DISABLE_KASAN_INLINE
>         help
>           Compiler directly inserts code checking shadow memory before
>           memory accesses. This is faster than outline (in some workloads
> @@ -141,6 +151,7 @@ endchoice
>  config KASAN_STACK
>         bool "Enable stack instrumentation (unsafe)" if CC_IS_CLANG && !COMPILE_TEST
>         depends on KASAN_GENERIC || KASAN_SW_TAGS
> +       depends on !ARCH_DISABLE_KASAN_INLINE
>         default y if CC_IS_GCC
>         help
>           The LLVM stack address sanitizer has a know problem that
> @@ -154,6 +165,9 @@ config KASAN_STACK
>           but clang users can still enable it for builds without
>           CONFIG_COMPILE_TEST.  On gcc it is assumed to always be safe
>           to use and enabled by default.
> +         If the architecture disables inline instrumentation, this is
> +         also disabled as it adds inline-style instrumentation that
> +         is run unconditionally.
>
>  config KASAN_SW_TAGS_IDENTIFY
>         bool "Enable memory corruption identification"
> --
> 2.27.0
>
> --
> You received this message because you are subscribed to the Google Groups "kasan-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to kasan-dev+unsubscribe@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/kasan-dev/20210615014705.2234866-2-dja%40axtens.net.

^ permalink raw reply

* Re: [RFC PATCH 7/8] powerpc/pseries: Add support for FORM2 associativity
From: Aneesh Kumar K.V @ 2021-06-15  7:40 UTC (permalink / raw)
  To: David Gibson; +Cc: Nathan Lynch, Daniel Henrique Barboza, linuxppc-dev
In-Reply-To: <YMhH7X9Tfq95gY6r@yekko>

David Gibson <david@gibson.dropbear.id.au> writes:

> On Tue, Jun 15, 2021 at 10:58:42AM +0530, Aneesh Kumar K.V wrote:
>> David Gibson <david@gibson.dropbear.id.au> writes:
>> 
>> > On Mon, Jun 14, 2021 at 10:10:02PM +0530, Aneesh Kumar K.V wrote:
>> >> Signed-off-by: Daniel Henrique Barboza <danielhb413@gmail.com>
>> >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> >> ---
>> >>  Documentation/powerpc/associativity.rst   | 139 ++++++++++++++++++++
>> >>  arch/powerpc/include/asm/firmware.h       |   3 +-
>> >>  arch/powerpc/include/asm/prom.h           |   1 +
>> >>  arch/powerpc/kernel/prom_init.c           |   3 +-
>> >>  arch/powerpc/mm/numa.c                    | 149 +++++++++++++++++++++-
>> >>  arch/powerpc/platforms/pseries/firmware.c |   1 +
>> >>  6 files changed, 290 insertions(+), 6 deletions(-)
>> >>  create mode 100644 Documentation/powerpc/associativity.rst
>> >> 
>> >> diff --git a/Documentation/powerpc/associativity.rst b/Documentation/powerpc/associativity.rst
>> >> new file mode 100644
>> >> index 000000000000..58abedea81d7
>> >> --- /dev/null
>> >> +++ b/Documentation/powerpc/associativity.rst
>> >> @@ -0,0 +1,139 @@
>> >> +============================
>> >> +NUMA resource associativity
>> >> +=============================
>> >> +
>> >> +Associativity represents the groupings of the various platform resources into
>> >> +domains of substantially similar mean performance relative to resources outside
>> >> +of that domain. Resources subsets of a given domain that exhibit better
>> >> +performance relative to each other than relative to other resources subsets
>> >> +are represented as being members of a sub-grouping domain. This performance
>> >> +characteristic is presented in terms of NUMA node distance within the Linux kernel.
>> >> +From the platform view, these groups are also referred to as domains.
>> >> +
>> >> +PAPR interface currently supports two different ways of communicating these resource
>> >
>> > You describe form 2 below as well, which contradicts this.
>> 
>> Fixed as below.
>> 
>> PAPR interface currently supports different ways of communicating these resource
>> grouping details to the OS. These are referred to as Form 0, Form 1 and Form2
>> associativity grouping. Form 0 is the older format and is now considered deprecated.
>> 
>> Hypervisor indicates the type/form of associativity used via "ibm,arcitecture-vec-5 property".
>> Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
>> A value of 1 indicates the usage of Form 1 associativity. For Form 2 associativity
>> bit 2 of byte 5 in the "ibm,architecture-vec-5" property is used.
>
> LGTM.
>
>> >> +grouping details to the OS. These are referred to as Form 0 and Form 1 associativity grouping.
>> >> +Form 0 is the older format and is now considered deprecated.
>> >> +
>> >> +Hypervisor indicates the type/form of associativity used via "ibm,arcitecture-vec-5 property".
>> >> +Bit 0 of byte 5 in the "ibm,architecture-vec-5" property indicates usage of Form 0 or Form 1.
>> >> +A value of 1 indicates the usage of Form 1 associativity.
>> >> +
>> >> +Form 0
>> >> +-----
>> >> +Form 0 associativity supports only two NUMA distance (LOCAL and REMOTE).
>> >> +
>> >> +Form 1
>> >> +-----
>> >> +With Form 1 a combination of ibm,associativity-reference-points and ibm,associativity
>> >> +device tree properties are used to determine the NUMA distance between resource groups/domains. 
>> >> +
>> >> +The “ibm,associativity” property contains one or more lists of numbers (domainID)
>> >> +representing the resource’s platform grouping domains.
>> >> +
>> >> +The “ibm,associativity-reference-points” property contains one or more list of numbers
>> >> +(domain index) that represents the 1 based ordinal in the associativity lists of the most
>> >> +significant boundary, with subsequent entries indicating progressively less significant boundaries.
>> >> +
>> >> +Linux kernel uses the domain id of the most significant boundary (aka primary domain)
>> >
>> > I thought we used the *least* significant boundary (the smallest
>> > grouping, not the largest).  That is, the last index, not the first.
>> >
>> > Actually... come to think of it, I'm not even sure how to interpret
>> > "most significant".  Does that mean a change in grouping at that "most
>> > significant" level results in the largest perfomance difference?
>> 
>> PAPR defines "most significant" as below
>> 
>> When the “ibm,architecture-vec-5” property byte 5 bit 0 has the value of one, the “ibm,associativ-
>> ity-reference-points” property indicates boundaries between associativity domains presented by the
>> “ibm,associativity” property containing “near” and “far” resources. The
>> first such boundary in the list represents the 1 based ordinal in the
>> associativity lists of the most significant boundary, with subsequent
>> entries indicating progressively less significant boundaries
>
> No... that's not a definition.  Like your draft PAPR uses the term
> while entirely failing to define it.  From what I can tell about how
> it is used the "most significant" boundary corresponds to what Linux
> simply thinks of as the node id.  But intuitively, I'd think of that
> as the "least significant" boundary, since that's basically the
> smallest granularity at which we care about NUMA distances.
>
>
>> I would interpret it as the boundary where we start defining NUMA
>> nodes.
>
> That isn't any clearer to me.

How about calling it least significant boundary then? 

The “ibm,associativity-reference-points” property contains one or more list of numbers
(domainID index) that represents the 1 based ordinal in the associativity lists of the
least significant boundary, with subsequent entries indicating progressively higher
significant boundaries.

ex:
{ primary domainID index, secondary domainID index, tertiary domainID index.. }

Linux kernel uses the domainID of the least significant boundary (aka primary domain)
as the NUMA node id. Linux kernel computes NUMA distance between two domains by
recursively comparing if they belong to the same higher-level domains. For mismatch
at every higher level of the resource group, the kernel doubles the NUMA distance between
the comparing domains.

>
>> >> +as the NUMA node id. Linux kernel computes NUMA distance between two domains by
>> >> +recursively comparing if they belong to the same higher-level domains. For mismatch
>> >> +at every higher level of the resource group, the kernel doubles the NUMA distance between
>> >> +the comparing domains.
>> >> +
>> >> +Form 2
>> >> +-------
>> >> +Form 2 associativity format adds separate device tree properties representing NUMA node distance
>> >> +thereby making the node distance computation flexible. Form 2 also allows flexible primary
>> >> +domain numbering. With numa distance computation now detached from the index value of
>> >> +"ibm,associativity" property, Form 2 allows a large number of primary domain ids at the
>> >> +same domain index representing resource groups of different
>> >> performance/latency characteristics.
>> >
>> > The meaning of "domain index" is not clear to me here.
>> 
>> Sorry for the confusion there. domain index is the index where domainID
>> is appearing. W.r.t "ibm,associativity"  we have
>
> Ok, I think I eventually deduced that.  We should start out clearly
> defining both domainID and index here.
>
> Also.. I think we need to find more distinct terms, because "index" is
> being used for both where the ID appears in an associativity array,
> and also when an ID appears in the Form2 "lookup-index-table" and the
> two usages are totally unconnected.
>
>> The “ibm,associativity” property contains one or more lists of numbers (domainID)
>> representing the resource’s platform grouping domains. If we can look at
>> an example property.
>> 
>> { 4, 6, 7, 0, 0}
>> { 4, 6, 7, 0, 40}
>> 
>> With Form 1 both NUMA node 0 and 40 will appear at the same distance.
>> They both are at domain index 4. With Form 2 we can represent them with
>> different NUMA distance values.
>
> Ok.  Note that PAPR was never clear about what space domain IDs need
> to be unique within: do they need to be (a) globally unique (not true
> in practice), (b) unique at their index level or (c) unique only
> within their "parent" node at the previous index level.
>
> We should take the opportunity with Form2 to make that clearer.
>
> My understanding is that with Form2 it should be entirely feasible to
> built a dt have associativity arrays that are always of length 1.  Is
> that correct?

Correct, unless you have persistent memory device attached in which case
you need two entries.

>
>> >> +
>> >> +Hypervisor indicates the usage of FORM2 associativity using bit 2 of byte 5 in the
>> >> +"ibm,architecture-vec-5" property.
>> >> +
>> >> +"ibm,numa-lookup-index-table" property contains one or more list numbers representing
>> >> +the domainIDs present in the system. The offset of the domainID in this property is considered
>> >> +the domainID index.
>> >
>> > You haven't really introduced the term "domainID".  Is "domainID
>> > index" the same as "domain index" above?  It's not clear to me.
>> 
>> The earlier part of the documented said 
>> 
>> The “ibm,associativity” property contains one or more lists of numbers (domainID)
>> representing the resource’s platform grouping domains.
>> 
>> I will update domain index to domainID index. 
>> 
>> >
>> > The distinction between "domain index" and "primary domain id" is also
>> > not clear to me.
>> 
>> primary domain id is the domainID appearing in the primary domainID
>> index. Linux kenrel also use that as the NUMA node number.
>
> nit s/kenrel/kernel/
>
>> Primary domainID index is defined by ibm,associativity-reference-points
>> and we consider that as the most significant resource group boundary.
>> 
>> ibm,associativity-reference-points can be looked at as
>> { primary domainID index, secondary domainID index, tertiary domainID index.. }
>
> Ok, explicitly stating that in the doc would help a lot.
>
>> >
>> >> +prop-encoded-array: The number N of the domainIDs encoded as with encode-int, followed by
>> >> +N domainID encoded as with encode-int
>> >> +
>> >> +For ex:
>> >> +ibm,numa-lookup-index-table =  {4, 0, 8, 250, 252}, domainID index for domainID 8 is 1.
>> >
>> > Above you say "Form 2 allows a large number of primary domain ids at
>> > the same domain index", but this encoding doesn't appear to permit
>> > that.
>> 
>> I didn't follow that question.
>
> Ah, that's because I was thinking of index here as the index within
> the lookup-index-table, not the index within the associativity
> arrays.
>
>> >
>> >> +"ibm,numa-distance-table" property contains one or more list of numbers representing the NUMA
>> >> +distance between resource groups/domains present in the system.
>> >> +
>> >> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
>> >> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
>> >> +
>> >> +For ex:
>> >> +ibm,numa-lookup-index-table =  {3, 0, 8, 40}
>> >> +ibm,numa-distance-table     =  {9, 1, 2, 8, 2, 1, 16, 8, 16, 1}
>> >> +
>> >> +  | 0    8   40
>> >> +--|------------
>> >> +  |
>> >> +0 | 10   20  80
>> >> +  |
>> >> +8 | 20   10  160
>> >> +  |
>> >> +40| 80   160  10
>> >
>> > What's the reason for multiplying the values by 10 in the expanded
>> > table version?
>> 
>> That was me missing a document update. Since we used 8 bits to encode
>> distance at some point we were looking at a SCALE factor. But later
>> realized other architectures also restrict distance to 8 bits. I will
>> update ibm,numa-distance-table in the document.
>
> Ok.
>
>> >> +
>> >> +With Form2 "ibm,associativity" for resources is listed as below:
>> >> +
>> >> +"ibm,associativity" property for resources in node 0, 8 and 40
>> >> +{ 4, 6, 7, 0, 0}
>> >> +{ 4, 6, 9, 8, 8}
>> >> +{ 4, 6, 7, 0, 40}
>> >> +
>> >> +With "ibm,associativity-reference-points"  { 0x4, 0x3, 0x2 }
>> >> +
>> >> +With Form2 the primary domainID and secondary domainID are used to identify the NUMA nodes
>> >
>> > What the heck is the secondary domainID
>> 
>> domainID appearing the secondary domainID index.
>
> I understand that from the clarifications you've made about, but
> second domainID index wasn't any more defined in the original draft.
>
>> ibm,associativity-reference-points gives an indication of different
>> hierachy of resource grouping as below.
>> 
>> ibm,associativity-reference-points can be looked at as
>> { primary domainID index, secondary domainID index, tertiary domainID index.. }
>> 
>> >
>> >> +the kernel should use when using persistent memory devices. Persistent memory devices
>> >> +can also be used as regular memory using DAX KMEM driver and primary domainID indicates
>> >> +the numa node number OS should use when using these devices as regular memory. Secondary
>> >> +domainID is the numa node number that should be used when using this device as
>> >> +persistent memory. In the later case, we are interested in the locality of the
>> >> +device to an established numa node. In the above example, if the last row represents a
>> >> +persistent memory device/resource, NUMA node number 40 will be used when using the device
>> >> +as regular memory and NUMA node number 0 will be the device numa node when using it as
>> >> +a persistent memory device.
>> >> +
>> >> +Each resource (drcIndex) now also supports additional optional device tree properties.
>> >> +These properties are marked optional because the platform can choose not to export
>> >> +them and provide the system topology details using the earlier defined device tree
>> >> +properties alone. The optional device tree properties are used when adding new resources
>> >> +(DLPAR) and when the platform didn't provide the topology details of the domain which
>> >> +contains the newly added resource during boot.
>> >> +
>> >> +"ibm,numa-lookup-index" property contains a number representing the domainID index to be used
>> >> +when building the NUMA distance of the numa node to which this resource belongs. The domain id
>> >> +of the new resource can be obtained from the existing "ibm,associativity" property. This
>> >> +can be used to build distance information of a newly onlined NUMA node via DLPAR operation.
>> >> +The value is 1 based array index value.
>> >
>> > Am I correct in thinking that if we have an entirely form2 world, we'd
>> > only need this and the ibm,associativity properties could be dropped?
>> 
>> Not really. ibm,numa-lookup-index-table was added to have a concise
>> representation of numa distance via ibm,numa-distance-table. 
>> 
>> For ex: With domainID 0, 4, 5 we could do a 5x5 matrix to represent the
>> numa distance. Instead ibm,numa-lookup-index-table allows us to present
>> the same in a 3x3 matrix  distance[index0][index1] is the  distance
>> between NUMA node 0 and 4 and distance[index0][index2] is the distance
>> between NUMA node 0 and 5
>
> Right, I get the purpose of it, and I realized I misphrashed my
> question.  My point is that in a Form2 world, the *only* thing the
> associativity array is used for is to deduce its position in
> lookup-index-table.  Once you have have that for each resource, you
> have everything you need, yes?


ibm,associativity is used find the domainID/NUMA node id of the
resource.

ibm,lookup-index-table is used compute the distance information between
NUMA nodes using ibm,numa-distance-table.


>
>> 
>> 
>> >
>> >> +
>> >> +prop-encoded-array: An integer encoded as with encode-int specifying the domainID index
>> >> +
>> >> +"ibm,numa-distance" property contains one or more list of numbers presenting the NUMA distance
>> >> +from this resource domain to other resources.
>> >
>> > IIUC this is about extending the global distance table with
>> > information for a new node.  Is that correct?
>> 
>> correct.
>> 
>> >
>> > The global distance table allows for the possibility of asymmetric
>> > distances between nodes, but this does not.  Is that intentional?
>> 
>> This also does, For example with 3 nodes currently present and 4 node
>> getting added ibm,numa-distance have 8 elements enabling us to have
>> distance[Node0][Node50] being different from distance[Node50][Node0]
>> as shown below.
>
> Ok that's not clear from the above.  Rather than "one or more lists of
> numbers" I think you want to explicitly give two options.  Either one
> list, which gives symmetric distances, or two which gives distances
> to, then distance from.
>
>> 
>> >
>> >> +prop-encoded-array: The number N of the distance values encoded as with encode-int, followed by
>> >> +N distance values encoded as with encode-bytes. The max distance value we could encode is 255.
>> >> +
>> >> +For ex:
>> >> +ibm,associativity     = { 4, 5, 6, 7, 50}
>> >> +ibm,numa-lookup-index = { 4 }
>> >> +ibm,numa-distance   =  {8, 16, 32, 8, 1, 16, 32, 8, 1}
>> >> +
>> >> +resulting in a new toplogy as below.
>> >> +  | 0    8   40   50
>> >> +--|------------------
>> >> +  |
>> >> +0 | 10   20  80   160
>> >> +  |
>> >> +8 | 20   10  160  320
>> >> +  |
>> >> +40| 80   160  10  80
>> >> +  |
>> >> +50| 160  320  80  10
>> 
>

-aneesh

^ permalink raw reply

* Re: [PATCH 5/7] signal: Add unsafe_copy_siginfo_to_user()
From: Christophe Leroy @ 2021-06-15  7:28 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-arch, linux-kernel, Paul Mackerras, linuxppc-dev
In-Reply-To: <YMhU3Df7foVo9BaM@infradead.org>



Le 15/06/2021 à 09:21, Christoph Hellwig a écrit :
> On Tue, Jun 15, 2021 at 09:03:42AM +0200, Christophe Leroy wrote:
>>
>>
>> Le 15/06/2021 ?? 08:52, Christoph Hellwig a ??crit??:
>>> On Tue, Jun 15, 2021 at 06:41:01AM +0000, Christophe Leroy wrote:
>>>> +	unsafe_copy_to_user(__ucs_to, __ucs_from,			\
>>>> +			    sizeof(struct kernel_siginfo), label);	\
>>>> +	unsafe_clear_user(__ucs_expansion, SI_EXPANSION_SIZE, label);	\
>>>> +} while (0)
>>>
>>> unsafe_clear_user does not exist at this point, and even your later
>>> patch only adds it for powerpc.
>>>
>>
>> You missed below chunck I guess:
>>
>>> diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
>>> index c05e903cef02..37073caac474 100644
>>> --- a/include/linux/uaccess.h
>>> +++ b/include/linux/uaccess.h
>>> @@ -398,6 +398,7 @@ long strnlen_user_nofault(const void __user *unsafe_addr, long count);
>>>    #define unsafe_put_user(x,p,e) unsafe_op_wrap(__put_user(x,p),e)
>>>    #define unsafe_copy_to_user(d,s,l,e) unsafe_op_wrap(__copy_to_user(d,s,l),e)
>>>    #define unsafe_copy_from_user(d,s,l,e) unsafe_op_wrap(__copy_from_user(d,s,l),e)
>>> +#define unsafe_clear_user(d, l, e) unsafe_op_wrap(__clear_user(d, l), e)
> 
> That doesn't help with architectures that define user_access_begin but
> do not define unsafe_clear_user. (i.e. x86).
> 

Yes, the day they want to use unsafe_copy_siginfo_to_user() they'll have to implement 
unsafe_clear_user().

Until that day, they don't need unsafe_clear_user() and I'm sure the result would be disastrous if a 
poor powerpc guy like me was trying to implement some low level x86 code.

Similar to unsafe_get_compat_sigset(), an arch wanting to use it has to implement 
unsafe_copy_from_user().

^ permalink raw reply

* Re: [PATCH v5 12/17] powerpc/pseries/vas: Integrate API with open/close windows
From: Haren Myneni @ 2021-06-15  7:26 UTC (permalink / raw)
  To: Nicholas Piggin, herbert, linux-crypto, linuxppc-dev, mpe
In-Reply-To: <1623638159.6pp87imz6a.astroid@bobo.none>

On Mon, 2021-06-14 at 12:55 +1000, Nicholas Piggin wrote:
> Excerpts from Haren Myneni's message of June 13, 2021 9:02 pm:
> > This patch adds VAS window allocatioa/close with the corresponding
> > hcalls. Also changes to integrate with the existing user space VAS
> > API and provide register/unregister functions to NX pseries driver.
> > 
> > The driver register function is used to create the user space
> > interface (/dev/crypto/nx-gzip) and unregister to remove this
> > entry.
> > 
> > The user space process opens this device node and makes an ioctl
> > to allocate VAS window. The close interface is used to deallocate
> > window.
> > 
> > Signed-off-by: Haren Myneni <haren@linux.ibm.com>
> > ---
> >  arch/powerpc/include/asm/vas.h          |   4 +
> >  arch/powerpc/platforms/pseries/Makefile |   1 +
> >  arch/powerpc/platforms/pseries/vas.c    | 223
> > ++++++++++++++++++++++++
> >  3 files changed, 228 insertions(+)
> > 
> > diff --git a/arch/powerpc/include/asm/vas.h
> > b/arch/powerpc/include/asm/vas.h
> > index eefc758d8cd4..9d5646d721c4 100644
> > --- a/arch/powerpc/include/asm/vas.h
> > +++ b/arch/powerpc/include/asm/vas.h
> > @@ -254,6 +254,10 @@ struct vas_all_caps {
> >  	u64     feat_type;
> >  };
> >  
> > +int h_query_vas_capabilities(const u64 hcall, u8 query_type, u64
> > result);
> > +int vas_register_api_pseries(struct module *mod,
> > +			     enum vas_cop_type cop_type, const char
> > *name);
> > +void vas_unregister_api_pseries(void);
> >  #endif
> >  
> >  /*
> > diff --git a/arch/powerpc/platforms/pseries/Makefile
> > b/arch/powerpc/platforms/pseries/Makefile
> > index c8a2b0b05ac0..4cda0ef87be0 100644
> > --- a/arch/powerpc/platforms/pseries/Makefile
> > +++ b/arch/powerpc/platforms/pseries/Makefile
> > @@ -30,3 +30,4 @@ obj-$(CONFIG_PPC_SVM)		+= svm.o
> >  obj-$(CONFIG_FA_DUMP)		+= rtas-fadump.o
> >  
> >  obj-$(CONFIG_SUSPEND)		+= suspend.o
> > +obj-$(CONFIG_PPC_VAS)		+= vas.o
> > diff --git a/arch/powerpc/platforms/pseries/vas.c
> > b/arch/powerpc/platforms/pseries/vas.c
> > index 98109a13f1c2..fe375f7a7029 100644
> > --- a/arch/powerpc/platforms/pseries/vas.c
> > +++ b/arch/powerpc/platforms/pseries/vas.c
> > @@ -10,6 +10,7 @@
> >  #include <linux/export.h>
> >  #include <linux/types.h>
> >  #include <linux/delay.h>
> > +#include <linux/slab.h>
> >  #include <asm/machdep.h>
> >  #include <asm/hvcall.h>
> >  #include <asm/plpar_wrappers.h>
> > @@ -187,6 +188,228 @@ int h_query_vas_capabilities(const u64 hcall,
> > u8 query_type, u64 result)
> >  		return -EIO;
> >  	}
> >  }
> > +EXPORT_SYMBOL_GPL(h_query_vas_capabilities);
> > +
> > +/*
> > + * Allocate window and setup IRQ mapping.
> > + */
> > +static int allocate_setup_window(struct pseries_vas_window *txwin,
> > +				 u64 *domain, u8 wintype)
> > +{
> > +	int rc;
> > +
> > +	rc = h_allocate_vas_window(txwin, domain, wintype,
> > DEF_WIN_CREDS);
> > +	if (rc)
> > +		return rc;
> > +
> > +	txwin->vas_win.wcreds_max = DEF_WIN_CREDS;
> > +
> > +	return 0;
> > +}
> > +
> > +static struct vas_window *vas_allocate_window(struct
> > vas_tx_win_open_attr *uattr,
> > +					      enum vas_cop_type
> > cop_type)
> > +{
> > +	long domain[PLPAR_HCALL9_BUFSIZE] = {VAS_DEFAULT_DOMAIN_ID};
> > +	struct vas_ct_caps *ct_caps;
> > +	struct vas_caps *caps;
> > +	struct pseries_vas_window *txwin;
> > +	int rc;
> > +
> > +	txwin = kzalloc(sizeof(*txwin), GFP_KERNEL);
> > +	if (!txwin)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	/*
> > +	 * A VAS window can have many credits which means that many
> > +	 * requests can be issued simultaneously. But phyp restricts
> > +	 * one credit per window.
> > +	 * phyp introduces 2 different types of credits:
> > +	 * Default credit type (Uses normal priority FIFO):
> > +	 *	A limited number of credits are assigned to partitions
> > +	 *	based on processor entitlement. But these credits may be
> > +	 *	over-committed on a system depends on whether the CPUs
> > +	 *	are in shared or dedicated modes - that is, more requests
> > +	 *	may be issued across the system than NX can service at
> > +	 *	once which can result in paste command failure (RMA_busy).
> > +	 *	Then the process has to resend requests or fall-back to
> > +	 *	SW compression.
> > +	 * Quality of Service (QoS) credit type (Uses high priority
> > FIFO):
> > +	 *	To avoid NX HW contention, the system admins can assign
> > +	 *	QoS credits for each LPAR so that this partition is
> > +	 *	guaranteed access to NX resources. These credits are
> > +	 *	assigned to partitions via the HMC.
> > +	 *	Refer PAPR for more information.
> > +	 *
> > +	 * Allocate window with QoS credits if user requested.
> > Otherwise
> > +	 * default credits are used.
> > +	 */
> > +	if (uattr->flags & VAS_TX_WIN_FLAG_QOS_CREDIT)
> > +		caps = &vascaps[VAS_GZIP_QOS_FEAT_TYPE];
> > +	else
> > +		caps = &vascaps[VAS_GZIP_DEF_FEAT_TYPE];
> > +
> > +	ct_caps = &caps->caps;
> > +
> > +	if (atomic_inc_return(&ct_caps->used_lpar_creds) >
> > +			atomic_read(&ct_caps->target_lpar_creds)) {
> > +		pr_err("Credits are not available to allocate
> > window\n");
> > +		rc = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * The user space is requesting to allocate a window on a VAS
> > +	 * instance (or chip) where the process is executing.
> > +	 * On powerVM, domain values are passed to pHyp to select chip
> > /
> > +	 * VAS instance. Useful if the process is affinity to NUMA
> > node.
> > +	 * pHyp selects VAS instance if VAS_DEFAULT_DOMAIN_ID (-1) is
> > +	 * passed for domain values.
> > +	 */
> 
> s/powerVM/PowerVM
> s/pHyp/PowerVM
> 
> You could also just call it the hypervisor. KVM may not implement
> the 
> hcalls now, but in future it could.
> 
> > +	if (uattr->vas_id == -1) {
> 
> Should the above comment fit under here? vas_id == -1 means
> userspace 
> asks for any VAS but preferably a local one?
> 
> > +		/*
> > +		 * To allocate VAS window, pass same domain values
> > returned
> > +		 * from this HCALL.
> > +		 */
> 
> Then you could merge it with this comment and make it a bit clearer:
> the h_allocate_vas_window hcall is defined to take a domain as
> specified by h_home_node_associativity, so no conversions or
> unpacking
> needs to be done.
> 
> > +		rc = plpar_hcall9(H_HOME_NODE_ASSOCIATIVITY, domain,
> > +				  VPHN_FLAG_VCPU, smp_processor_id());
> > +		if (rc != H_SUCCESS) {
> > +			pr_err("HCALL(%x): failed with ret(%d)\n",
> > +			       H_HOME_NODE_ASSOCIATIVITY, rc);
> > +			goto out;
> > +		}
> > +	}
> > +
> > +	/*
> > +	 * Allocate / Deallocate window HCALLs and setup / free IRQs
> > +	 * have to be protected with mutex.
> > +	 * Open VAS window: Allocate window HCALL and setup IRQ
> > +	 * Close VAS window: Deallocate window HCALL and free IRQ
> > +	 *	The hypervisor waits until all NX requests are
> > +	 *	completed before closing the window. So expects OS
> > +	 *	to handle NX faults, means IRQ can be freed only
> > +	 *	after the deallocate window HCALL is returned.
> > +	 * So once the window is closed with deallocate HCALL before
> > +	 * the IRQ is freed, it can be assigned to new allocate
> > +	 * HCALL with the same fault IRQ by the hypervisor. It can
> > +	 * result in setup IRQ fail for the new window since the
> > +	 * same fault IRQ is not freed by the OS.
> > +	 */
> > +	mutex_lock(&vas_pseries_mutex);
> 
> Why? What's the mutex protecting here?
This mutex is protecting Allocate/ Deallocate HCAlls and setup/free
IRQ. Basically once the window is opened and setup IRQ successfully,
makes sure same IRQ is not assigned to otherwindow before the IRQ is
freed in OS. 

Allocate window: 
	Allocate window HCALL
	setup IRQ

Deallocate window:
	Deallocate window HCALL - The hypervisor waits all credits are
returned including any faults. Disable window on this VAS so that do
not take new requests and wait for all pending requests (faults are
processed) 
	free IRQ (We can not free IRQ before the HCALL so that pending
faults can be processed and credits returned)

So if we do not jave mutex, possible that same fault IRQ may be
assigned to different window open HCALL before it is actually freed in
OS.

Process A:
Allocate window HCALL (winID 123)
Setup IRQ (IRQ# 321)
Process requests
Deallocate HCALL(winID 123)       Process B:
                                    Allocate window HCALL (winID 123)
                                    setup IRQ (IRQ# 321) -- will fail.
Free IRQ (IRQ#321)

                        
> 
> > +	rc = allocate_setup_window(txwin, (u64 *)&domain[0],
> > +				   ct_caps->win_type);
> 
> If you define the types to be the same, can you avoid this casting?
> allocate_setup_window specifically needs an array of 
> PLPAR_HCALL9_BUFSIZE longs.

Yes H_HOME_NODE_ASSOCIATIVITY returns longs (PLPAR_HCALL9_BUFSIZE) but
H_ALLOCATE_VAS_WINDOW expects u64.

Syntax:
int64 /* H_Success, H_Parameter, H_Resource, H_Constrained */
/* H_Cop_Hw, H_Unsupported, H_Busy */
/* H_LongBusyOrder1mSec, LongBusyOrder10mSec */
hcall(const uint64 H_ALLOCATE_VAS_WINDOW /* Allocate a VAS window */
uint8 vasWindowType, /* The type of VAS window to allocate */
uint16 numCredits, /* Number of credits to assign to the window */
uint64 desiredAssociativityDomainIdentifier1, /* Associativity Domain
Identifier Doubleword 1 */
uint64 desiredAssociativityDomainIdentifier2, /* Associativity Domain
Identifier Doubleword 2 */
uint64 desiredAssociativityDomainIdentifier3, /* Associativity Domain
Identifier Doubleword 3 */
uint64 desiredAssociativityDomainIdentifier4, /* Associativity Domain
Identifier Doubleword 4 */
uint64 desiredAssociativityDomainIdentifier5, /* Associativity Domain
Identifier Doubleword 5 */
uint64 desiredAssociativityDomainIdentifier6) /* Associativity Domain
Identifier Doubleword 6 */

> 
> > +	mutex_unlock(&vas_pseries_mutex);
> > +	if (rc)
> > +		goto out;
> > +
> > +	/*
> > +	 * Modify window and it is ready to use.
> > +	 */
> > +	rc = h_modify_vas_window(txwin);
> > +	if (!rc)
> > +		rc = get_vas_user_win_ref(&txwin->vas_win.task_ref);
> > +	if (rc)
> > +		goto out_free;
> > +
> > +	vas_user_win_add_mm_context(&txwin->vas_win.task_ref);
> > +	txwin->win_type = ct_caps->win_type;
> > +	mutex_lock(&vas_pseries_mutex);
> > +	list_add(&txwin->win_list, &caps->list);
> > +	mutex_unlock(&vas_pseries_mutex);
> > +
> > +	return &txwin->vas_win;
> > +
> > +out_free:
> > +	h_deallocate_vas_window(txwin->vas_win.winid);
> 
> No mutex here in this deallocate hcall.

We do not need mutex just for this hcall which is executing during
failure. Need mutex only for the code patch as explained above. 

If the window open is not returned to user space successfully, window
close path will be:

 free_irq_setup(txwin);
 h_deallocate_vas_window(txwin->vas_win.winid);


> 
> I suspect you don't actually need the mutex for the hcalls
> themselves, 
> but the list manipulations. I would possibly consider putting 
> used_lpar_creds under that same lock rather than making it atomic and
> playing lock free games, unless you really need to.
> 
> Also... "creds". credentials? credits, right? Don't go through and 
> change everything now, but not skimping on naming helps a lot with
> reading code that you're not familiar with. All the vas/nx stuff
> could probably do with a pass to make the names a bit easier.

Yes creds is for credits. We have names like credits and capabilities
for VAS/NX. Using creds for credits and caps for capabilities. creds is
already used in the existing code.

> 
> (creds isn't so bad, "ct" for "coprocessor type" is pretty obscure 
> though).

So can I use 'copt' coprocessor type?

Thanks
Haren


> 
> Thanks,
> Nick
> 
> > +out:
> > +	atomic_dec(&ct_caps->used_lpar_creds);
> > +	kfree(txwin);
> > +	return ERR_PTR(rc);
> > +}
> > +
> > +static u64 vas_paste_address(struct vas_window *vwin)
> > +{
> > +	struct pseries_vas_window *win;
> > +
> > +	win = container_of(vwin, struct pseries_vas_window, vas_win);
> > +	return win->win_addr;
> > +}
> > +
> > +static int deallocate_free_window(struct pseries_vas_window *win)
> > +{
> > +	int rc = 0;
> > +
> > +	rc = h_deallocate_vas_window(win->vas_win.winid);
> > +
> > +	return rc;
> > +}
> > +
> > +static int vas_deallocate_window(struct vas_window *vwin)
> > +{
> > +	struct pseries_vas_window *win;
> > +	struct vas_ct_caps *caps;
> > +	int rc = 0;
> > +
> > +	if (!vwin)
> > +		return -EINVAL;
> > +
> > +	win = container_of(vwin, struct pseries_vas_window, vas_win);
> > +
> > +	/* Should not happen */
> > +	if (win->win_type >= VAS_MAX_FEAT_TYPE) {
> > +		pr_err("Window (%u): Invalid window type %u\n",
> > +				vwin->winid, win->win_type);
> > +		return -EINVAL;
> > +	}
> > +
> > +	caps = &vascaps[win->win_type].caps;
> > +	mutex_lock(&vas_pseries_mutex);
> > +	rc = deallocate_free_window(win);
> > +	if (rc) {
> > +		mutex_unlock(&vas_pseries_mutex);
> > +		return rc;
> > +	}
> > +
> > +	list_del(&win->win_list);
> > +	atomic_dec(&caps->used_lpar_creds);
> > +	mutex_unlock(&vas_pseries_mutex);
> > +
> > +	put_vas_user_win_ref(&vwin->task_ref);
> > +	mm_context_remove_vas_window(vwin->task_ref.mm);
> > +
> > +	kfree(win);
> > +	return 0;
> > +}
> > +
> > +static const struct vas_user_win_ops vops_pseries = {
> > +	.open_win	= vas_allocate_window,	/* Open and configure
> > window */
> > +	.paste_addr	= vas_paste_address,	/* To do copy/paste */
> > +	.close_win	= vas_deallocate_window, /* Close window */
> > +};
> > +
> > +/*
> > + * Supporting only nx-gzip coprocessor type now, but this API code
> > + * extended to other coprocessor types later.
> > + */
> > +int vas_register_api_pseries(struct module *mod, enum vas_cop_type
> > cop_type,
> > +			     const char *name)
> > +{
> > +	int rc;
> > +
> > +	if (!copypaste_feat)
> > +		return -ENOTSUPP;
> > +
> > +	rc = vas_register_coproc_api(mod, cop_type, name,
> > &vops_pseries);
> > +
> > +	return rc;
> > +}
> > +EXPORT_SYMBOL_GPL(vas_register_api_pseries);
> > +
> > +void vas_unregister_api_pseries(void)
> > +{
> > +	vas_unregister_coproc_api();
> > +}
> > +EXPORT_SYMBOL_GPL(vas_unregister_api_pseries);
> >  
> >  /*
> >   * Get the specific capabilities based on the feature type.
> > -- 
> > 2.18.2
> > 
> > 
> > 


^ permalink raw reply

* Re: [PATCH v2 00/12] powerpc: Cleanup use of 'struct ppc_inst'
From: Christophe Leroy @ 2021-06-15  7:24 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
	naveen.n.rao, jniethe5
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <87r1h3tx3a.fsf@mpe.ellerman.id.au>



Le 15/06/2021 à 09:18, Michael Ellerman a écrit :
> Christophe Leroy <christophe.leroy@csgroup.eu> writes:
>> This series is a cleanup of the use of 'struct ppc_inst'.
>>
>> A confusion is made between internal representation of powerpc
>> instructions with 'struct ppc_inst' and in-memory code which is
>> and will always be an array of 'unsigned int'.
> 
> Why don't we use u32 *, to make it even more explicit what the expected
> size is?
> 

I guess that's historical, we could use u32 *

We can convert it incrementaly maybe ?

^ permalink raw reply

* Re: [PATCH 5/7] signal: Add unsafe_copy_siginfo_to_user()
From: Christoph Hellwig @ 2021-06-15  7:21 UTC (permalink / raw)
  To: Christophe Leroy
  Cc: linux-arch, linux-kernel, Christoph Hellwig, Paul Mackerras,
	linuxppc-dev
In-Reply-To: <7061fbee-cc82-2699-cf12-e5a4ae46940f@csgroup.eu>

On Tue, Jun 15, 2021 at 09:03:42AM +0200, Christophe Leroy wrote:
> 
> 
> Le 15/06/2021 ?? 08:52, Christoph Hellwig a ??crit??:
> > On Tue, Jun 15, 2021 at 06:41:01AM +0000, Christophe Leroy wrote:
> > > +	unsafe_copy_to_user(__ucs_to, __ucs_from,			\
> > > +			    sizeof(struct kernel_siginfo), label);	\
> > > +	unsafe_clear_user(__ucs_expansion, SI_EXPANSION_SIZE, label);	\
> > > +} while (0)
> > 
> > unsafe_clear_user does not exist at this point, and even your later
> > patch only adds it for powerpc.
> > 
> 
> You missed below chunck I guess:
> 
> > diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
> > index c05e903cef02..37073caac474 100644
> > --- a/include/linux/uaccess.h
> > +++ b/include/linux/uaccess.h
> > @@ -398,6 +398,7 @@ long strnlen_user_nofault(const void __user *unsafe_addr, long count);
> >   #define unsafe_put_user(x,p,e) unsafe_op_wrap(__put_user(x,p),e)
> >   #define unsafe_copy_to_user(d,s,l,e) unsafe_op_wrap(__copy_to_user(d,s,l),e)
> >   #define unsafe_copy_from_user(d,s,l,e) unsafe_op_wrap(__copy_from_user(d,s,l),e)
> > +#define unsafe_clear_user(d, l, e) unsafe_op_wrap(__clear_user(d, l), e)

That doesn't help with architectures that define user_access_begin but
do not define unsafe_clear_user. (i.e. x86).

^ permalink raw reply

* Re: [PATCH v2 00/12] powerpc: Cleanup use of 'struct ppc_inst'
From: Michael Ellerman @ 2021-06-15  7:18 UTC (permalink / raw)
  To: Christophe Leroy, Benjamin Herrenschmidt, Paul Mackerras,
	naveen.n.rao, jniethe5
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <cover.1621516826.git.christophe.leroy@csgroup.eu>

Christophe Leroy <christophe.leroy@csgroup.eu> writes:
> This series is a cleanup of the use of 'struct ppc_inst'.
>
> A confusion is made between internal representation of powerpc
> instructions with 'struct ppc_inst' and in-memory code which is
> and will always be an array of 'unsigned int'.

Why don't we use u32 *, to make it even more explicit what the expected
size is?

cheers

^ permalink raw reply

* Re: [PATCH 6/7] powerpc/uaccess: Add unsafe_clear_user()
From: Christophe Leroy @ 2021-06-15  7:10 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Paul Mackerras, linuxppc-dev, linux-kernel
In-Reply-To: <YMhOYKM5+s0wUoeP@infradead.org>



Le 15/06/2021 à 08:53, Christoph Hellwig a écrit :
> On Tue, Jun 15, 2021 at 06:41:02AM +0000, Christophe Leroy wrote:
>> Implement unsafe_clear_user() for powerpc.
>> It's a copy/paste of unsafe_copy_to_user() with value 0 as source.
>>
>> It may be improved in a later patch by using 'dcbz' instruction
>> to zeroize full cache lines at once.
> 
> Please add this to common code insted of making it powerpc specific.
> 

A common version is added in previous patch.

Just like unsafe_copy_to_user(), unsafe_clear_user() needs to be arch defined.

unsafe_copy_to_user() has both an x86 implementation and a powerpc implementation, why do different ?

I can't see how it could be not powerpc specific. At the end we want to use 'dcbz' to zeroize full 
cachelines at once, even if at the time being that's a simple write of 0.

^ permalink raw reply

* [PATCH 3/3] powerpc/smp: Use existing L2 cache_map cpumask to find L3 cache siblings
From: Parth Shah @ 2021-06-15  7:08 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mikey, svaidy, srikar, ego
In-Reply-To: <20210615070804.390341-1-parth@linux.ibm.com>

On POWER10 systems, the "ibm,thread-groups" property "2" indicates the cpus
in thread-group share both L2 and L3 caches. Hence, use cache_property = 2
itself to find both the L2 and L3 cache siblings.
Hence, rename existing macros to detect if the cache property is for L2 or
L3 and use the L2 cache map itself to find the presence of L3 siblings.

Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 arch/powerpc/include/asm/smp.h  |  2 ++
 arch/powerpc/kernel/cacheinfo.c |  3 +++
 arch/powerpc/kernel/smp.c       | 20 +++++++++++++++-----
 3 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 1259040cc3a4..55082d343bd2 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -144,6 +144,7 @@ extern int cpu_to_core_id(int cpu);
 
 extern bool has_big_cores;
 extern bool thread_group_shares_l2;
+extern bool thread_group_shares_l3;
 
 #define cpu_smt_mask cpu_smt_mask
 #ifdef CONFIG_SCHED_SMT
@@ -198,6 +199,7 @@ extern void __cpu_die(unsigned int cpu);
 #define hard_smp_processor_id()		get_hard_smp_processor_id(0)
 #define smp_setup_cpu_maps()
 #define thread_group_shares_l2  0
+#define thread_group_shares_l3	0
 static inline void inhibit_secondary_onlining(void) {}
 static inline void uninhibit_secondary_onlining(void) {}
 static inline const struct cpumask *cpu_sibling_mask(int cpu)
diff --git a/arch/powerpc/kernel/cacheinfo.c b/arch/powerpc/kernel/cacheinfo.c
index 20d91693eac1..378ae20d05a9 100644
--- a/arch/powerpc/kernel/cacheinfo.c
+++ b/arch/powerpc/kernel/cacheinfo.c
@@ -469,6 +469,9 @@ static int get_group_id(unsigned int cpu_id, int level)
 	else if (thread_group_shares_l2 && level == 2)
 		return cpumask_first(per_cpu(thread_group_l2_cache_map,
 					     cpu_id));
+	else if (thread_group_shares_l3 && level == 3)
+		return cpumask_first(per_cpu(thread_group_l2_cache_map,
+					     cpu_id));
 	return -1;
 }
 
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index a34877257f2d..d0c70fcd0068 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -78,6 +78,7 @@ struct task_struct *secondary_current;
 bool has_big_cores;
 bool coregroup_enabled;
 bool thread_group_shares_l2;
+bool thread_group_shares_l3;
 
 DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
@@ -101,7 +102,7 @@ enum {
 
 #define MAX_THREAD_LIST_SIZE	8
 #define THREAD_GROUP_SHARE_L1   1
-#define THREAD_GROUP_SHARE_L2   2
+#define THREAD_GROUP_SHARE_L2_L3 2
 struct thread_groups {
 	unsigned int property;
 	unsigned int nr_groups;
@@ -887,9 +888,16 @@ static int __init init_thread_group_cache_map(int cpu, int cache_property)
 	cpumask_var_t *mask = NULL;
 
 	if (cache_property != THREAD_GROUP_SHARE_L1 &&
-	    cache_property != THREAD_GROUP_SHARE_L2)
+	    cache_property != THREAD_GROUP_SHARE_L2_L3)
 		return -EINVAL;
 
+	/*
+	 * On P10 fused-core system, the L3 cache is shared between threads of a
+	 * small core only, but the "ibm,thread-groups" property is indicated as
+	 * "2" only which is interpreted as the thread-groups sharing both L2
+	 * and L3 caches. Hence cache_property of THREAD_GROUP_SHARE_L2_L3 is
+	 * used for both L2 and L3 cache sibling detection.
+	 */
 	tg = get_thread_groups(cpu, cache_property, &err);
 	if (!tg)
 		return err;
@@ -903,7 +911,7 @@ static int __init init_thread_group_cache_map(int cpu, int cache_property)
 
 	if (cache_property == THREAD_GROUP_SHARE_L1)
 		mask = &per_cpu(thread_group_l1_cache_map, cpu);
-	else if (cache_property == THREAD_GROUP_SHARE_L2)
+	else if (cache_property == THREAD_GROUP_SHARE_L2_L3)
 		mask = &per_cpu(thread_group_l2_cache_map, cpu);
 
 	zalloc_cpumask_var_node(mask, GFP_KERNEL, cpu_to_node(cpu));
@@ -1009,14 +1017,16 @@ static int __init init_big_cores(void)
 	has_big_cores = true;
 
 	for_each_possible_cpu(cpu) {
-		int err = init_thread_group_cache_map(cpu, THREAD_GROUP_SHARE_L2);
+		int err = init_thread_group_cache_map(cpu, THREAD_GROUP_SHARE_L2_L3);
 
 		if (err)
 			return err;
 	}
 
 	thread_group_shares_l2 = true;
-	pr_debug("L2 cache only shared by the threads in the small core\n");
+	thread_group_shares_l3 = true;
+	pr_debug("L2/L3 cache only shared by the threads in the small core\n");
+
 	return 0;
 }
 
-- 
2.26.3


^ permalink raw reply related

* [PATCH 2/3] powerpc/cacheinfo: Remove the redundant get_shared_cpu_map()
From: Parth Shah @ 2021-06-15  7:08 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mikey, svaidy, srikar, ego
In-Reply-To: <20210615070804.390341-1-parth@linux.ibm.com>

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

The helper function get_shared_cpu_map() was added in

'commit 500fe5f550ec ("powerpc/cacheinfo: Report the correct
shared_cpu_map on big-cores")'

and subsequently expanded upon in

'commit 0be47634db0b ("powerpc/cacheinfo: Print correct cache-sibling
map/list for L2 cache")'

in order to help report the correct groups of threads sharing these caches
on big-core systems where groups of threads within a core can share
different sets of caches.

Now that powerpc/cacheinfo is aware of "ibm,thread-groups" property,
cache->shared_cpu_map contains the correct set of thread-siblings
sharing the cache. Hence we no longer need the functions
get_shared_cpu_map(). This patch removes this function. We also remove
the helper function index_dir_to_cpu() which was only called by
get_shared_cpu_map().

With these functions removed, we can still see the correct
cache-sibling map/list for L1 and L2 caches on systems with L1 and L2
caches distributed among groups of threads in a core.

With this patch, on a SMT8 POWER10 system where the L1 and L2 caches
are split between the two groups of threads in a core, for CPUs 8,9,
the L1-Data, L1-Instruction, L2, L3 cache CPU sibling list is as
follows:

$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-15
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-15

$ ppc64_cpu --smt=4
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-11
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-11

$ ppc64_cpu --smt=2
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-9
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-9

$ ppc64_cpu --smt=1
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/cacheinfo.c | 41 +--------------------------------
 1 file changed, 1 insertion(+), 40 deletions(-)

diff --git a/arch/powerpc/kernel/cacheinfo.c b/arch/powerpc/kernel/cacheinfo.c
index 5a6925d87424..20d91693eac1 100644
--- a/arch/powerpc/kernel/cacheinfo.c
+++ b/arch/powerpc/kernel/cacheinfo.c
@@ -675,45 +675,6 @@ static ssize_t level_show(struct kobject *k, struct kobj_attribute *attr, char *
 static struct kobj_attribute cache_level_attr =
 	__ATTR(level, 0444, level_show, NULL);
 
-static unsigned int index_dir_to_cpu(struct cache_index_dir *index)
-{
-	struct kobject *index_dir_kobj = &index->kobj;
-	struct kobject *cache_dir_kobj = index_dir_kobj->parent;
-	struct kobject *cpu_dev_kobj = cache_dir_kobj->parent;
-	struct device *dev = kobj_to_dev(cpu_dev_kobj);
-
-	return dev->id;
-}
-
-/*
- * On big-core systems, each core has two groups of CPUs each of which
- * has its own L1-cache. The thread-siblings which share l1-cache with
- * @cpu can be obtained via cpu_smallcore_mask().
- *
- * On some big-core systems, the L2 cache is shared only between some
- * groups of siblings. This is already parsed and encoded in
- * cpu_l2_cache_mask().
- *
- * TODO: cache_lookup_or_instantiate() needs to be made aware of the
- *       "ibm,thread-groups" property so that cache->shared_cpu_map
- *       reflects the correct siblings on platforms that have this
- *       device-tree property. This helper function is only a stop-gap
- *       solution so that we report the correct siblings to the
- *       userspace via sysfs.
- */
-static const struct cpumask *get_shared_cpu_map(struct cache_index_dir *index, struct cache *cache)
-{
-	if (has_big_cores) {
-		int cpu = index_dir_to_cpu(index);
-		if (cache->level == 1)
-			return cpu_smallcore_mask(cpu);
-		if (cache->level == 2 && thread_group_shares_l2)
-			return cpu_l2_cache_mask(cpu);
-	}
-
-	return &cache->shared_cpu_map;
-}
-
 static ssize_t
 show_shared_cpumap(struct kobject *k, struct kobj_attribute *attr, char *buf, bool list)
 {
@@ -724,7 +685,7 @@ show_shared_cpumap(struct kobject *k, struct kobj_attribute *attr, char *buf, bo
 	index = kobj_to_cache_index_dir(k);
 	cache = index->cache;
 
-	mask = get_shared_cpu_map(index, cache);
+	mask = &cache->shared_cpu_map;
 
 	return cpumap_print_to_pagebuf(list, buf, mask);
 }
-- 
2.26.3


^ permalink raw reply related

* [PATCH 1/3] powerpc/cacheinfo: Lookup cache by dt node and thread-group id
From: Parth Shah @ 2021-06-15  7:08 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mikey, svaidy, srikar, ego
In-Reply-To: <20210615070804.390341-1-parth@linux.ibm.com>

From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>

Currently the cacheinfo code on powerpc indexes the "cache" objects
(modelling the L1/L2/L3 caches) where the key is device-tree node
corresponding to that cache. On some of the POWER server platforms
thread-groups within the core share different sets of caches (Eg: On
SMT8 POWER9 systems, threads 0,2,4,6 of a core share L1 cache and
threads 1,3,5,7 of the same core share another L1 cache). On such
platforms, there is a single device-tree node corresponding to that
cache and the cache-configuration within the threads of the core is
indicated via "ibm,thread-groups" device-tree property.

Since the current code is not aware of the "ibm,thread-groups"
property, on the aforementoined systems, cacheinfo code still treats
all the threads in the core to be sharing the cache because of the
single device-tree node (In the earlier example, the cacheinfo code
would says CPUs 0-7 share L1 cache).

In this patch, we make the powerpc cacheinfo code aware of the
"ibm,thread-groups" property. We indexe the "cache" objects by the
key-pair (device-tree node, thread-group id). For any CPUX, for a
given level of cache, the thread-group id is defined to be the first
CPU in the "ibm,thread-groups" cache-group containing CPUX. For levels
of cache which are not represented in "ibm,thread-groups" property,
the thread-group id is -1.

Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
[parth: Remove "static" keyword for the definition of "thread_group_l1_cache_map"
and "thread_group_l2_cache_map" to get rid of the compile error.]
Signed-off-by: Parth Shah <parth@linux.ibm.com>
---
 arch/powerpc/include/asm/smp.h  |  3 ++
 arch/powerpc/kernel/cacheinfo.c | 80 ++++++++++++++++++++++++---------
 arch/powerpc/kernel/smp.c       |  4 +-
 3 files changed, 63 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 03b3d010cbab..1259040cc3a4 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -33,6 +33,9 @@ extern bool coregroup_enabled;
 extern int cpu_to_chip_id(int cpu);
 extern int *chip_id_lookup_table;
 
+DECLARE_PER_CPU(cpumask_var_t, thread_group_l1_cache_map);
+DECLARE_PER_CPU(cpumask_var_t, thread_group_l2_cache_map);
+
 #ifdef CONFIG_SMP
 
 struct smp_ops_t {
diff --git a/arch/powerpc/kernel/cacheinfo.c b/arch/powerpc/kernel/cacheinfo.c
index 6f903e9aa20b..5a6925d87424 100644
--- a/arch/powerpc/kernel/cacheinfo.c
+++ b/arch/powerpc/kernel/cacheinfo.c
@@ -120,6 +120,7 @@ struct cache {
 	struct cpumask shared_cpu_map; /* online CPUs using this cache */
 	int type;                      /* split cache disambiguation */
 	int level;                     /* level not explicit in device tree */
+	int group_id;                  /* id of the group of threads that share this cache */
 	struct list_head list;         /* global list of cache objects */
 	struct cache *next_local;      /* next cache of >= level */
 };
@@ -142,22 +143,24 @@ static const char *cache_type_string(const struct cache *cache)
 }
 
 static void cache_init(struct cache *cache, int type, int level,
-		       struct device_node *ofnode)
+		       struct device_node *ofnode, int group_id)
 {
 	cache->type = type;
 	cache->level = level;
 	cache->ofnode = of_node_get(ofnode);
+	cache->group_id = group_id;
 	INIT_LIST_HEAD(&cache->list);
 	list_add(&cache->list, &cache_list);
 }
 
-static struct cache *new_cache(int type, int level, struct device_node *ofnode)
+static struct cache *new_cache(int type, int level,
+			       struct device_node *ofnode, int group_id)
 {
 	struct cache *cache;
 
 	cache = kzalloc(sizeof(*cache), GFP_KERNEL);
 	if (cache)
-		cache_init(cache, type, level, ofnode);
+		cache_init(cache, type, level, ofnode, group_id);
 
 	return cache;
 }
@@ -309,20 +312,24 @@ static struct cache *cache_find_first_sibling(struct cache *cache)
 		return cache;
 
 	list_for_each_entry(iter, &cache_list, list)
-		if (iter->ofnode == cache->ofnode && iter->next_local == cache)
+		if (iter->ofnode == cache->ofnode &&
+		    iter->group_id == cache->group_id &&
+		    iter->next_local == cache)
 			return iter;
 
 	return cache;
 }
 
-/* return the first cache on a local list matching node */
-static struct cache *cache_lookup_by_node(const struct device_node *node)
+/* return the first cache on a local list matching node and thread-group id */
+static struct cache *cache_lookup_by_node_group(const struct device_node *node,
+						int group_id)
 {
 	struct cache *cache = NULL;
 	struct cache *iter;
 
 	list_for_each_entry(iter, &cache_list, list) {
-		if (iter->ofnode != node)
+		if (iter->ofnode != node ||
+		    iter->group_id != group_id)
 			continue;
 		cache = cache_find_first_sibling(iter);
 		break;
@@ -352,14 +359,15 @@ static int cache_is_unified_d(const struct device_node *np)
 		CACHE_TYPE_UNIFIED_D : CACHE_TYPE_UNIFIED;
 }
 
-static struct cache *cache_do_one_devnode_unified(struct device_node *node, int level)
+static struct cache *cache_do_one_devnode_unified(struct device_node *node, int group_id,
+						  int level)
 {
 	pr_debug("creating L%d ucache for %pOFP\n", level, node);
 
-	return new_cache(cache_is_unified_d(node), level, node);
+	return new_cache(cache_is_unified_d(node), level, node, group_id);
 }
 
-static struct cache *cache_do_one_devnode_split(struct device_node *node,
+static struct cache *cache_do_one_devnode_split(struct device_node *node, int group_id,
 						int level)
 {
 	struct cache *dcache, *icache;
@@ -367,8 +375,8 @@ static struct cache *cache_do_one_devnode_split(struct device_node *node,
 	pr_debug("creating L%d dcache and icache for %pOFP\n", level,
 		 node);
 
-	dcache = new_cache(CACHE_TYPE_DATA, level, node);
-	icache = new_cache(CACHE_TYPE_INSTRUCTION, level, node);
+	dcache = new_cache(CACHE_TYPE_DATA, level, node, group_id);
+	icache = new_cache(CACHE_TYPE_INSTRUCTION, level, node, group_id);
 
 	if (!dcache || !icache)
 		goto err;
@@ -382,31 +390,32 @@ static struct cache *cache_do_one_devnode_split(struct device_node *node,
 	return NULL;
 }
 
-static struct cache *cache_do_one_devnode(struct device_node *node, int level)
+static struct cache *cache_do_one_devnode(struct device_node *node, int group_id, int level)
 {
 	struct cache *cache;
 
 	if (cache_node_is_unified(node))
-		cache = cache_do_one_devnode_unified(node, level);
+		cache = cache_do_one_devnode_unified(node, group_id, level);
 	else
-		cache = cache_do_one_devnode_split(node, level);
+		cache = cache_do_one_devnode_split(node, group_id, level);
 
 	return cache;
 }
 
 static struct cache *cache_lookup_or_instantiate(struct device_node *node,
+						 int group_id,
 						 int level)
 {
 	struct cache *cache;
 
-	cache = cache_lookup_by_node(node);
+	cache = cache_lookup_by_node_group(node, group_id);
 
 	WARN_ONCE(cache && cache->level != level,
 		  "cache level mismatch on lookup (got %d, expected %d)\n",
 		  cache->level, level);
 
 	if (!cache)
-		cache = cache_do_one_devnode(node, level);
+		cache = cache_do_one_devnode(node, group_id, level);
 
 	return cache;
 }
@@ -443,7 +452,27 @@ static void do_subsidiary_caches_debugcheck(struct cache *cache)
 		  of_node_get_device_type(cache->ofnode));
 }
 
-static void do_subsidiary_caches(struct cache *cache)
+/*
+ * If sub-groups of threads in a core containing @cpu_id share the
+ * L@level-cache (information obtained via "ibm,thread-groups"
+ * device-tree property), then we identify the group by the first
+ * thread-sibling in the group. We define this to be the group-id.
+ *
+ * In the absence of any thread-group information for L@level-cache,
+ * this function returns -1.
+ */
+static int get_group_id(unsigned int cpu_id, int level)
+{
+	if (has_big_cores && level == 1)
+		return cpumask_first(per_cpu(thread_group_l1_cache_map,
+					     cpu_id));
+	else if (thread_group_shares_l2 && level == 2)
+		return cpumask_first(per_cpu(thread_group_l2_cache_map,
+					     cpu_id));
+	return -1;
+}
+
+static void do_subsidiary_caches(struct cache *cache, unsigned int cpu_id)
 {
 	struct device_node *subcache_node;
 	int level = cache->level;
@@ -452,9 +481,11 @@ static void do_subsidiary_caches(struct cache *cache)
 
 	while ((subcache_node = of_find_next_cache_node(cache->ofnode))) {
 		struct cache *subcache;
+		int group_id;
 
 		level++;
-		subcache = cache_lookup_or_instantiate(subcache_node, level);
+		group_id = get_group_id(cpu_id, level);
+		subcache = cache_lookup_or_instantiate(subcache_node, group_id, level);
 		of_node_put(subcache_node);
 		if (!subcache)
 			break;
@@ -468,6 +499,7 @@ static struct cache *cache_chain_instantiate(unsigned int cpu_id)
 {
 	struct device_node *cpu_node;
 	struct cache *cpu_cache = NULL;
+	int group_id;
 
 	pr_debug("creating cache object(s) for CPU %i\n", cpu_id);
 
@@ -476,11 +508,13 @@ static struct cache *cache_chain_instantiate(unsigned int cpu_id)
 	if (!cpu_node)
 		goto out;
 
-	cpu_cache = cache_lookup_or_instantiate(cpu_node, 1);
+	group_id = get_group_id(cpu_id, 1);
+
+	cpu_cache = cache_lookup_or_instantiate(cpu_node, group_id, 1);
 	if (!cpu_cache)
 		goto out;
 
-	do_subsidiary_caches(cpu_cache);
+	do_subsidiary_caches(cpu_cache, cpu_id);
 
 	cache_cpu_set(cpu_cache, cpu_id);
 out:
@@ -848,13 +882,15 @@ static struct cache *cache_lookup_by_cpu(unsigned int cpu_id)
 {
 	struct device_node *cpu_node;
 	struct cache *cache;
+	int group_id;
 
 	cpu_node = of_get_cpu_node(cpu_id, NULL);
 	WARN_ONCE(!cpu_node, "no OF node found for CPU %i\n", cpu_id);
 	if (!cpu_node)
 		return NULL;
 
-	cache = cache_lookup_by_node(cpu_node);
+	group_id = get_group_id(cpu_id, 1);
+	cache = cache_lookup_by_node_group(cpu_node, group_id);
 	of_node_put(cpu_node);
 
 	return cache;
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 2e05c783440a..a34877257f2d 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -122,14 +122,14 @@ static struct thread_groups_list tgl[NR_CPUS] __initdata;
  * On big-cores system, thread_group_l1_cache_map for each CPU corresponds to
  * the set its siblings that share the L1-cache.
  */
-static DEFINE_PER_CPU(cpumask_var_t, thread_group_l1_cache_map);
+DEFINE_PER_CPU(cpumask_var_t, thread_group_l1_cache_map);
 
 /*
  * On some big-cores system, thread_group_l2_cache_map for each CPU
  * corresponds to the set its siblings within the core that share the
  * L2-cache.
  */
-static DEFINE_PER_CPU(cpumask_var_t, thread_group_l2_cache_map);
+DEFINE_PER_CPU(cpumask_var_t, thread_group_l2_cache_map);
 
 /* SMP operations for this machine */
 struct smp_ops_t *smp_ops;
-- 
2.26.3


^ permalink raw reply related

* [PATCH 0/3] Make cache-object aware of L3 siblings by parsing "ibm, thread-groups" property
From: Parth Shah @ 2021-06-15  7:08 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: mikey, svaidy, srikar, ego

On POWER10 big-core system, the L3 cache reflected by sysfs contains all
the CPUs in the big-core.

grep . /sys/devices/system/cpu/cpu0/cache/index*/shared_cpu_list
/sys/devices/system/cpu/cpu0/cache/index0/shared_cpu_list:0,2,4,6
/sys/devices/system/cpu/cpu0/cache/index1/shared_cpu_list:0,2,4,6
/sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list:0,2,4,6
/sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list:0-7

In the above example, CPU-0 observes CPU 0-7 in L3 (index3) cache, which
is not correct as only the CPUs in small core share the L3 cache.

The "ibm,thread-groups" contains property "2" to indicate that the CPUs
share both the L2 and L3 caches. This patch-set uses this property to
reflect correct L3 topology to a cache-object.

After applying this patch-set, the topology looks like:
$> ppc64_cpu --smt=8
$> grep . /sys/devices/system/cpu/cpu[89]/cache/*/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:9,11,13,15


$> ppc64_cpu --smt=4
$> grep . /sys/devices/system/cpu/cpu[89]/cache/*/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:9,11

$> ppc64_cpu --smt=2
$> grep . /sys/devices/system/cpu/cpu[89]/cache/*/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:9

$> ppc64_cpu --smt=1
grep . /sys/devices/system/cpu/cpu[89]/cache/*/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8

Patches Organization:
=====================
This patch-set series is based on top of v5.13-rc2

- Patch 1-2: Add functionality to introduce awareness for
"ibm,thread-groups". Original (not merged) posted version can be found at:
https://lore.kernel.org/linuxppc-dev/1611041780-8640-1-git-send-email-ego@linux.vnet.ibm.co
- Patch 3: Use existing L2 cache_map to detect L3 cache siblings

Gautham R. Shenoy (2):
  powerpc/cacheinfo: Lookup cache by dt node and thread-group id
  powerpc/cacheinfo: Remove the redundant get_shared_cpu_map()

Parth Shah (1):
  powerpc/smp: Use existing L2 cache_map cpumask to find L3 cache
    siblings

 arch/powerpc/include/asm/smp.h  |   5 ++
 arch/powerpc/kernel/cacheinfo.c | 124 ++++++++++++++++----------------
 arch/powerpc/kernel/smp.c       |  24 +++++--
 3 files changed, 84 insertions(+), 69 deletions(-)

-- 
2.26.3


^ permalink raw reply

* Re: [RFC PATCH 8/8] powerpc/papr_scm: Use FORM2 associativity details
From: Aneesh Kumar K.V @ 2021-06-15  7:05 UTC (permalink / raw)
  To: David Gibson; +Cc: Nathan Lynch, Daniel Henrique Barboza, linuxppc-dev
In-Reply-To: <YMhKEJ9WSlapuSE6@yekko>

David Gibson <david@gibson.dropbear.id.au> writes:

> On Tue, Jun 15, 2021 at 11:27:50AM +0530, Aneesh Kumar K.V wrote:
>> David Gibson <david@gibson.dropbear.id.au> writes:
>> 
>> > On Mon, Jun 14, 2021 at 10:10:03PM +0530, Aneesh Kumar K.V wrote:
>> >> FORM2 introduce a concept of secondary domain which is identical to the
>> >> conceept of FORM1 primary domain. Use secondary domain as the numa node
>> >> when using persistent memory device. For DAX kmem use the logical domain
>> >> id introduced in FORM2. This new numa node
>> >> 
>> >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> >> ---
>> >>  arch/powerpc/mm/numa.c                    | 28 +++++++++++++++++++++++
>> >>  arch/powerpc/platforms/pseries/papr_scm.c | 26 +++++++++++++--------
>> >>  arch/powerpc/platforms/pseries/pseries.h  |  1 +
>> >>  3 files changed, 45 insertions(+), 10 deletions(-)
>> >> 
>> >> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
>> >> index 86cd2af014f7..b9ac6d02e944 100644
>> >> --- a/arch/powerpc/mm/numa.c
>> >> +++ b/arch/powerpc/mm/numa.c
>> >> @@ -265,6 +265,34 @@ static int associativity_to_nid(const __be32 *associativity)
>> >>  	return nid;
>> >>  }
>> >>  
>> >> +int get_primary_and_secondary_domain(struct device_node *node, int *primary, int *secondary)
>> >> +{
>> >> +	int secondary_index;
>> >> +	const __be32 *associativity;
>> >> +
>> >> +	if (!numa_enabled) {
>> >> +		*primary = NUMA_NO_NODE;
>> >> +		*secondary = NUMA_NO_NODE;
>> >> +		return 0;
>> >> +	}
>> >> +
>> >> +	associativity = of_get_associativity(node);
>> >> +	if (!associativity)
>> >> +		return -ENODEV;
>> >> +
>> >> +	if (of_read_number(associativity, 1) >= primary_domain_index) {
>> >> +		*primary = of_read_number(&associativity[primary_domain_index], 1);
>> >> +		secondary_index = of_read_number(&distance_ref_points[1], 1);
>> >
>> > Secondary ID is always the second reference point, but primary depends
>> > on the length of resources?  That seems very weird.
>> 
>> primary_domain_index is distance_ref_point[0]. With Form2 we would find
>> both primary and secondary domain ID same for all resources other than
>> persistent memory device. The usage w.r.t. persistent memory is
>> explained in patch 7.
>
> Right, I misunderstood
>
>> 
>> With Form2 the primary domainID and secondary domainID are used to identify the NUMA nodes
>> the kernel should use when using persistent memory devices.
>
> This seems kind of bogus.  With Form1, the primary/secondary ID are a
> sort of heirarchy of distance (things with same primary ID are very
> close, things with same secondary are kinda-close, etc.).  With Form2,
> it's referring to their effective node for different purposes.
>
> Using the same terms for different meanings seems unnecessarily
> confusing.

They are essentially domainIDs. The interpretation of them are different
between Form1 and Form2. Hence I kept referring to them as primary and
secondary domainID. Any suggestion on what to name them with Form2?

>
>> Persistent memory devices
>> can also be used as regular memory using DAX KMEM driver and primary domainID indicates
>> the numa node number OS should use when using these devices as regular memory. Secondary
>> domainID is the numa node number that should be used when using this device as
>> persistent memory.
>
> It's weird to me that you'd want to consider them in different nodes
> for those different purposes.


   --------------------------------------
  |                            NUMA node0 |
  |    ProcA -----> MEMA                  |
  |     |                                 |
  |	|                                 |
  |	-------------------> PMEMB        |
  |                                       |
   ---------------------------------------

   ---------------------------------------
  |                            NUMA node1 |
  |                                       |
  |    ProcB -------> MEMC                |
  |	|                                 |
  |	-------------------> PMEMD        |
  |                                       |
  |                                       |
   ---------------------------------------
 

For a topology like the above application running of ProcA wants to find out
persistent memory mount local to its NUMA node. Hence when using it as
pmem fsdax mount or devdax device we want PMEMB to have associativity
of NUMA node0 and PMEMD to have associativity of NUMA node 1. But when
we want to use it as memory using dax kmem driver, we want both PMEMB
and PMEMD to appear as memory only NUMA node at a distance that is
derived based on the latency of the media. 

>
>> In the later case, we are interested in the locality of the
>> device to an established numa node. In the above example, if the last row represents a
>> persistent memory device/resource, NUMA node number 40 will be used when using the device
>> as regular memory and NUMA node number 0 will be the device numa node when using it as
>> a persistent memory device.
>
> I don't really get what you mean by "locality of the device to an
> established numa node".  Or at least how that's different from
> anything else we're handling here.


-aneesh

^ permalink raw reply

* Re: [PATCH 5/7] signal: Add unsafe_copy_siginfo_to_user()
From: Christophe Leroy @ 2021-06-15  7:03 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: linux-arch, linux-kernel, Paul Mackerras, linuxppc-dev
In-Reply-To: <YMhOMoKKvew0YYCt@infradead.org>



Le 15/06/2021 à 08:52, Christoph Hellwig a écrit :
> On Tue, Jun 15, 2021 at 06:41:01AM +0000, Christophe Leroy wrote:
>> +	unsafe_copy_to_user(__ucs_to, __ucs_from,			\
>> +			    sizeof(struct kernel_siginfo), label);	\
>> +	unsafe_clear_user(__ucs_expansion, SI_EXPANSION_SIZE, label);	\
>> +} while (0)
> 
> unsafe_clear_user does not exist at this point, and even your later
> patch only adds it for powerpc.
> 

You missed below chunck I guess:

 > diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
 > index c05e903cef02..37073caac474 100644
 > --- a/include/linux/uaccess.h
 > +++ b/include/linux/uaccess.h
 > @@ -398,6 +398,7 @@ long strnlen_user_nofault(const void __user *unsafe_addr, long count);
 >   #define unsafe_put_user(x,p,e) unsafe_op_wrap(__put_user(x,p),e)
 >   #define unsafe_copy_to_user(d,s,l,e) unsafe_op_wrap(__copy_to_user(d,s,l),e)
 >   #define unsafe_copy_from_user(d,s,l,e) unsafe_op_wrap(__copy_from_user(d,s,l),e)
 > +#define unsafe_clear_user(d, l, e) unsafe_op_wrap(__clear_user(d, l), e)
 >   static inline unsigned long user_access_save(void) { return 0UL; }
 >   static inline void user_access_restore(unsigned long flags) { }
 >   #endif

^ permalink raw reply

* Re: [PATCH 7/7] powerpc/signal: Use unsafe_copy_siginfo_to_user()
From: Christophe Leroy @ 2021-06-15  7:00 UTC (permalink / raw)
  To: Christoph Hellwig; +Cc: Paul Mackerras, linuxppc-dev, linux-kernel
In-Reply-To: <YMhO4oiqh4F+ZEW8@infradead.org>



Le 15/06/2021 à 08:55, Christoph Hellwig a écrit :
>> @@ -836,14 +830,19 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset,
>>   		asm("dcbst %y0; sync; icbi %y0; sync" :: "Z" (mctx->mc_pad[0]));
>>   	}
>>   	unsafe_put_sigset_t(&frame->uc.uc_sigmask, oldset, failed);
>> +#ifndef CONFIG_COMPAT
>> +	unsafe_copy_siginfo_to_user(&frame->info, &ksig->info, failed);
>> +#endif
>>   
>>   	/* create a stack frame for the caller of the handler */
>>   	unsafe_put_user(regs->gpr[1], newsp, failed);
>>   
>>   	user_access_end();
>>   
>> -	if (copy_siginfo_to_user(&frame->info, &ksig->info))
>> +#ifdef CONFIG_COMPAT
>> +	if (copy_siginfo_to_user32(&frame->info, &ksig->info))
>>   		goto badframe;
>> +#endif
> 
> Shouldn't the compat case be handled the same way?
> 

It would be best, but it is not that easy to convert. So for the time being it is left aside, anyway 
compat is for compatibility, so performance doesn't matter so much.

^ permalink raw reply

* Re: [PATCH 7/7] powerpc/signal: Use unsafe_copy_siginfo_to_user()
From: Christoph Hellwig @ 2021-06-15  6:55 UTC (permalink / raw)
  To: Christophe Leroy; +Cc: Paul Mackerras, linuxppc-dev, linux-kernel
In-Reply-To: <8b4b640746523f5efb1c9a9fd97465bac4f00cae.1623739212.git.christophe.leroy@csgroup.eu>

> @@ -836,14 +830,19 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset,
>  		asm("dcbst %y0; sync; icbi %y0; sync" :: "Z" (mctx->mc_pad[0]));
>  	}
>  	unsafe_put_sigset_t(&frame->uc.uc_sigmask, oldset, failed);
> +#ifndef CONFIG_COMPAT
> +	unsafe_copy_siginfo_to_user(&frame->info, &ksig->info, failed);
> +#endif
>  
>  	/* create a stack frame for the caller of the handler */
>  	unsafe_put_user(regs->gpr[1], newsp, failed);
>  
>  	user_access_end();
>  
> -	if (copy_siginfo_to_user(&frame->info, &ksig->info))
> +#ifdef CONFIG_COMPAT
> +	if (copy_siginfo_to_user32(&frame->info, &ksig->info))
>  		goto badframe;
> +#endif

Shouldn't the compat case be handled the same way?

^ permalink raw reply

* Re: [PATCH 6/7] powerpc/uaccess: Add unsafe_clear_user()
From: Christoph Hellwig @ 2021-06-15  6:53 UTC (permalink / raw)
  To: Christophe Leroy; +Cc: Paul Mackerras, linuxppc-dev, linux-kernel
In-Reply-To: <67eedb69ca81e5a4b16459a4c61f99e64cb42675.1623739212.git.christophe.leroy@csgroup.eu>

On Tue, Jun 15, 2021 at 06:41:02AM +0000, Christophe Leroy wrote:
> Implement unsafe_clear_user() for powerpc.
> It's a copy/paste of unsafe_copy_to_user() with value 0 as source.
> 
> It may be improved in a later patch by using 'dcbz' instruction
> to zeroize full cache lines at once.

Please add this to common code insted of making it powerpc specific.


^ permalink raw reply

* Re: [PATCH 5/7] signal: Add unsafe_copy_siginfo_to_user()
From: Christoph Hellwig @ 2021-06-15  6:52 UTC (permalink / raw)
  To: Christophe Leroy; +Cc: linux-arch, linux-kernel, Paul Mackerras, linuxppc-dev
In-Reply-To: <684939dcfef612fac573d1b983a977215b71f64d.1623739212.git.christophe.leroy@csgroup.eu>

On Tue, Jun 15, 2021 at 06:41:01AM +0000, Christophe Leroy wrote:
> +	unsafe_copy_to_user(__ucs_to, __ucs_from,			\
> +			    sizeof(struct kernel_siginfo), label);	\
> +	unsafe_clear_user(__ucs_expansion, SI_EXPANSION_SIZE, label);	\
> +} while (0)

unsafe_clear_user does not exist at this point, and even your later
patch only adds it for powerpc.

^ permalink raw reply

* Re: [PATCH v5 12/17] powerpc/pseries/vas: Integrate API with open/close windows
From: Haren Myneni @ 2021-06-15  6:51 UTC (permalink / raw)
  To: Nicholas Piggin, herbert, linux-crypto, linuxppc-dev, mpe
In-Reply-To: <1623638159.6pp87imz6a.astroid@bobo.none>

On Mon, 2021-06-14 at 12:55 +1000, Nicholas Piggin wrote:
> Excerpts from Haren Myneni's message of June 13, 2021 9:02 pm:
> > This patch adds VAS window allocatioa/close with the corresponding
> > hcalls. Also changes to integrate with the existing user space VAS
> > API and provide register/unregister functions to NX pseries driver.
> > 
> > The driver register function is used to create the user space
> > interface (/dev/crypto/nx-gzip) and unregister to remove this
> > entry.
> > 
> > The user space process opens this device node and makes an ioctl
> > to allocate VAS window. The close interface is used to deallocate
> > window.
> > 
> > Signed-off-by: Haren Myneni <haren@linux.ibm.com>
> > ---
> >  arch/powerpc/include/asm/vas.h          |   4 +
> >  arch/powerpc/platforms/pseries/Makefile |   1 +
> >  arch/powerpc/platforms/pseries/vas.c    | 223
> > ++++++++++++++++++++++++
> >  3 files changed, 228 insertions(+)
> > 
> > diff --git a/arch/powerpc/include/asm/vas.h
> > b/arch/powerpc/include/asm/vas.h
> > index eefc758d8cd4..9d5646d721c4 100644
> > --- a/arch/powerpc/include/asm/vas.h
> > +++ b/arch/powerpc/include/asm/vas.h
> > @@ -254,6 +254,10 @@ struct vas_all_caps {
> >  	u64     feat_type;
> >  };
> >  
> > +int h_query_vas_capabilities(const u64 hcall, u8 query_type, u64
> > result);
> > +int vas_register_api_pseries(struct module *mod,
> > +			     enum vas_cop_type cop_type, const char
> > *name);
> > +void vas_unregister_api_pseries(void);
> >  #endif
> >  
> >  /*
> > diff --git a/arch/powerpc/platforms/pseries/Makefile
> > b/arch/powerpc/platforms/pseries/Makefile
> > index c8a2b0b05ac0..4cda0ef87be0 100644
> > --- a/arch/powerpc/platforms/pseries/Makefile
> > +++ b/arch/powerpc/platforms/pseries/Makefile
> > @@ -30,3 +30,4 @@ obj-$(CONFIG_PPC_SVM)		+= svm.o
> >  obj-$(CONFIG_FA_DUMP)		+= rtas-fadump.o
> >  
> >  obj-$(CONFIG_SUSPEND)		+= suspend.o
> > +obj-$(CONFIG_PPC_VAS)		+= vas.o
> > diff --git a/arch/powerpc/platforms/pseries/vas.c
> > b/arch/powerpc/platforms/pseries/vas.c
> > index 98109a13f1c2..fe375f7a7029 100644
> > --- a/arch/powerpc/platforms/pseries/vas.c
> > +++ b/arch/powerpc/platforms/pseries/vas.c
> > @@ -10,6 +10,7 @@
> >  #include <linux/export.h>
> >  #include <linux/types.h>
> >  #include <linux/delay.h>
> > +#include <linux/slab.h>
> >  #include <asm/machdep.h>
> >  #include <asm/hvcall.h>
> >  #include <asm/plpar_wrappers.h>
> > @@ -187,6 +188,228 @@ int h_query_vas_capabilities(const u64 hcall,
> > u8 query_type, u64 result)
> >  		return -EIO;
> >  	}
> >  }
> > +EXPORT_SYMBOL_GPL(h_query_vas_capabilities);
> > +
> > +/*
> > + * Allocate window and setup IRQ mapping.
> > + */
> > +static int allocate_setup_window(struct pseries_vas_window *txwin,
> > +				 u64 *domain, u8 wintype)
> > +{
> > +	int rc;
> > +
> > +	rc = h_allocate_vas_window(txwin, domain, wintype,
> > DEF_WIN_CREDS);
> > +	if (rc)
> > +		return rc;
> > +
> > +	txwin->vas_win.wcreds_max = DEF_WIN_CREDS;
> > +
> > +	return 0;
> > +}
> > +
> > +static struct vas_window *vas_allocate_window(struct
> > vas_tx_win_open_attr *uattr,
> > +					      enum vas_cop_type
> > cop_type)
> > +{
> > +	long domain[PLPAR_HCALL9_BUFSIZE] = {VAS_DEFAULT_DOMAIN_ID};
> > +	struct vas_ct_caps *ct_caps;
> > +	struct vas_caps *caps;
> > +	struct pseries_vas_window *txwin;
> > +	int rc;
> > +
> > +	txwin = kzalloc(sizeof(*txwin), GFP_KERNEL);
> > +	if (!txwin)
> > +		return ERR_PTR(-ENOMEM);
> > +
> > +	/*
> > +	 * A VAS window can have many credits which means that many
> > +	 * requests can be issued simultaneously. But phyp restricts
> > +	 * one credit per window.
> > +	 * phyp introduces 2 different types of credits:
> > +	 * Default credit type (Uses normal priority FIFO):
> > +	 *	A limited number of credits are assigned to partitions
> > +	 *	based on processor entitlement. But these credits may be
> > +	 *	over-committed on a system depends on whether the CPUs
> > +	 *	are in shared or dedicated modes - that is, more requests
> > +	 *	may be issued across the system than NX can service at
> > +	 *	once which can result in paste command failure (RMA_busy).
> > +	 *	Then the process has to resend requests or fall-back to
> > +	 *	SW compression.
> > +	 * Quality of Service (QoS) credit type (Uses high priority
> > FIFO):
> > +	 *	To avoid NX HW contention, the system admins can assign
> > +	 *	QoS credits for each LPAR so that this partition is
> > +	 *	guaranteed access to NX resources. These credits are
> > +	 *	assigned to partitions via the HMC.
> > +	 *	Refer PAPR for more information.
> > +	 *
> > +	 * Allocate window with QoS credits if user requested.
> > Otherwise
> > +	 * default credits are used.
> > +	 */
> > +	if (uattr->flags & VAS_TX_WIN_FLAG_QOS_CREDIT)
> > +		caps = &vascaps[VAS_GZIP_QOS_FEAT_TYPE];
> > +	else
> > +		caps = &vascaps[VAS_GZIP_DEF_FEAT_TYPE];
> > +
> > +	ct_caps = &caps->caps;
> > +
> > +	if (atomic_inc_return(&ct_caps->used_lpar_creds) >
> > +			atomic_read(&ct_caps->target_lpar_creds)) {
> > +		pr_err("Credits are not available to allocate
> > window\n");
> > +		rc = -EINVAL;
> > +		goto out;
> > +	}
> > +
> > +	/*
> > +	 * The user space is requesting to allocate a window on a VAS
> > +	 * instance (or chip) where the process is executing.
> > +	 * On powerVM, domain values are passed to pHyp to select chip
> > /
> > +	 * VAS instance. Useful if the process is affinity to NUMA
> > node.
> > +	 * pHyp selects VAS instance if VAS_DEFAULT_DOMAIN_ID (-1) is
> > +	 * passed for domain values.
> > +	 */
> 
> s/powerVM/PowerVM
> s/pHyp/PowerVM
> 
> You could also just call it the hypervisor. KVM may not implement
> the 
> hcalls now, but in future it could.
> 
> > +	if (uattr->vas_id == -1) {
> 
> Should the above comment fit under here? vas_id == -1 means
> userspace 
> asks for any VAS but preferably a local one?

yes, the user space is requesting window to be allocated on the local
VAS. Added the first comment to explain about vas_id = -1 and the
second comment on why same domain values from H_HOME_NODE_ASSOCIATIVITY
are passed to allocate window HCALL. 

I will remove the first comment and merge it later after "if (uattr-
>vas_id == -1) {"

Thanks
Haren
> 
> > +		/*
> > +		 * To allocate VAS window, pass same domain values
> > returned
> > +		 * from this HCALL.
> > +		 */
> 
> Then you could merge it with this comment and make it a bit clearer:
> the h_allocate_vas_window hcall is defined to take a domain as
> specified by h_home_node_associativity, so no conversions or
> unpacking
> needs to be done.
> 
> > +		rc = plpar_hcall9(H_HOME_NODE_ASSOCIATIVITY, domain,
> > +				  VPHN_FLAG_VCPU, smp_processor_id());
> > +		if (rc != H_SUCCESS) {
> > +			pr_err("HCALL(%x): failed with ret(%d)\n",
> > +			       H_HOME_NODE_ASSOCIATIVITY, rc);
> > +			goto out;
> > +		}
> > +	}
> > +
> > +	/*
> > +	 * Allocate / Deallocate window HCALLs and setup / free IRQs
> > +	 * have to be protected with mutex.
> > +	 * Open VAS window: Allocate window HCALL and setup IRQ
> > +	 * Close VAS window: Deallocate window HCALL and free IRQ
> > +	 *	The hypervisor waits until all NX requests are
> > +	 *	completed before closing the window. So expects OS
> > +	 *	to handle NX faults, means IRQ can be freed only
> > +	 *	after the deallocate window HCALL is returned.
> > +	 * So once the window is closed with deallocate HCALL before
> > +	 * the IRQ is freed, it can be assigned to new allocate
> > +	 * HCALL with the same fault IRQ by the hypervisor. It can
> > +	 * result in setup IRQ fail for the new window since the
> > +	 * same fault IRQ is not freed by the OS.
> > +	 */
> > +	mutex_lock(&vas_pseries_mutex);
> 
> Why? What's the mutex protecting here?
> 
> > +	rc = allocate_setup_window(txwin, (u64 *)&domain[0],
> > +				   ct_caps->win_type);
> 
> If you define the types to be the same, can you avoid this casting?
> allocate_setup_window specifically needs an array of 
> PLPAR_HCALL9_BUFSIZE longs.
> 
> > +	mutex_unlock(&vas_pseries_mutex);
> > +	if (rc)
> > +		goto out;
> > +
> > +	/*
> > +	 * Modify window and it is ready to use.
> > +	 */
> > +	rc = h_modify_vas_window(txwin);
> > +	if (!rc)
> > +		rc = get_vas_user_win_ref(&txwin->vas_win.task_ref);
> > +	if (rc)
> > +		goto out_free;
> > +
> > +	vas_user_win_add_mm_context(&txwin->vas_win.task_ref);
> > +	txwin->win_type = ct_caps->win_type;
> > +	mutex_lock(&vas_pseries_mutex);
> > +	list_add(&txwin->win_list, &caps->list);
> > +	mutex_unlock(&vas_pseries_mutex);
> > +
> > +	return &txwin->vas_win;
> > +
> > +out_free:
> > +	h_deallocate_vas_window(txwin->vas_win.winid);
> 
> No mutex here in this deallocate hcall.
> 
> I suspect you don't actually need the mutex for the hcalls
> themselves, 
> but the list manipulations. I would possibly consider putting 
> used_lpar_creds under that same lock rather than making it atomic and
> playing lock free games, unless you really need to.
> 
> Also... "creds". credentials? credits, right? Don't go through and 
> change everything now, but not skimping on naming helps a lot with
> reading code that you're not familiar with. All the vas/nx stuff
> could probably do with a pass to make the names a bit easier.
> 
> (creds isn't so bad, "ct" for "coprocessor type" is pretty obscure 
> though).
> 
> Thanks,
> Nick
> 
> > +out:
> > +	atomic_dec(&ct_caps->used_lpar_creds);
> > +	kfree(txwin);
> > +	return ERR_PTR(rc);
> > +}
> > +
> > +static u64 vas_paste_address(struct vas_window *vwin)
> > +{
> > +	struct pseries_vas_window *win;
> > +
> > +	win = container_of(vwin, struct pseries_vas_window, vas_win);
> > +	return win->win_addr;
> > +}
> > +
> > +static int deallocate_free_window(struct pseries_vas_window *win)
> > +{
> > +	int rc = 0;
> > +
> > +	rc = h_deallocate_vas_window(win->vas_win.winid);
> > +
> > +	return rc;
> > +}
> > +
> > +static int vas_deallocate_window(struct vas_window *vwin)
> > +{
> > +	struct pseries_vas_window *win;
> > +	struct vas_ct_caps *caps;
> > +	int rc = 0;
> > +
> > +	if (!vwin)
> > +		return -EINVAL;
> > +
> > +	win = container_of(vwin, struct pseries_vas_window, vas_win);
> > +
> > +	/* Should not happen */
> > +	if (win->win_type >= VAS_MAX_FEAT_TYPE) {
> > +		pr_err("Window (%u): Invalid window type %u\n",
> > +				vwin->winid, win->win_type);
> > +		return -EINVAL;
> > +	}
> > +
> > +	caps = &vascaps[win->win_type].caps;
> > +	mutex_lock(&vas_pseries_mutex);
> > +	rc = deallocate_free_window(win);
> > +	if (rc) {
> > +		mutex_unlock(&vas_pseries_mutex);
> > +		return rc;
> > +	}
> > +
> > +	list_del(&win->win_list);
> > +	atomic_dec(&caps->used_lpar_creds);
> > +	mutex_unlock(&vas_pseries_mutex);
> > +
> > +	put_vas_user_win_ref(&vwin->task_ref);
> > +	mm_context_remove_vas_window(vwin->task_ref.mm);
> > +
> > +	kfree(win);
> > +	return 0;
> > +}
> > +
> > +static const struct vas_user_win_ops vops_pseries = {
> > +	.open_win	= vas_allocate_window,	/* Open and configure
> > window */
> > +	.paste_addr	= vas_paste_address,	/* To do copy/paste */
> > +	.close_win	= vas_deallocate_window, /* Close window */
> > +};
> > +
> > +/*
> > + * Supporting only nx-gzip coprocessor type now, but this API code
> > + * extended to other coprocessor types later.
> > + */
> > +int vas_register_api_pseries(struct module *mod, enum vas_cop_type
> > cop_type,
> > +			     const char *name)
> > +{
> > +	int rc;
> > +
> > +	if (!copypaste_feat)
> > +		return -ENOTSUPP;
> > +
> > +	rc = vas_register_coproc_api(mod, cop_type, name,
> > &vops_pseries);
> > +
> > +	return rc;
> > +}
> > +EXPORT_SYMBOL_GPL(vas_register_api_pseries);
> > +
> > +void vas_unregister_api_pseries(void)
> > +{
> > +	vas_unregister_coproc_api();
> > +}
> > +EXPORT_SYMBOL_GPL(vas_unregister_api_pseries);
> >  
> >  /*
> >   * Get the specific capabilities based on the feature type.
> > -- 
> > 2.18.2
> > 
> > 
> > 


^ permalink raw reply

* [PATCH 7/7] powerpc/signal: Use unsafe_copy_siginfo_to_user()
From: Christophe Leroy @ 2021-06-15  6:41 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <b813c1f4d3dab2f51300eac44d99029aa8e57830.1623739212.git.christophe.leroy@csgroup.eu>

Use unsafe_copy_siginfo_to_user() in order to do the copy
within the user access block.

On an mpc 8321 (book3s/32) the improvment is about 5% on a process
sending a signal to itself.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/kernel/signal_32.c | 13 ++++++-------
 arch/powerpc/kernel/signal_64.c |  5 +----
 2 files changed, 7 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index 621de6e457b3..f3276cf05c8a 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -765,12 +765,6 @@ static long restore_tm_user_regs(struct pt_regs *regs, struct mcontext __user *s
 }
 #endif
 
-#ifdef CONFIG_PPC64
-
-#define copy_siginfo_to_user	copy_siginfo_to_user32
-
-#endif /* CONFIG_PPC64 */
-
 /*
  * Set up a signal frame for a "real-time" signal handler
  * (one which gets siginfo).
@@ -836,14 +830,19 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset,
 		asm("dcbst %y0; sync; icbi %y0; sync" :: "Z" (mctx->mc_pad[0]));
 	}
 	unsafe_put_sigset_t(&frame->uc.uc_sigmask, oldset, failed);
+#ifndef CONFIG_COMPAT
+	unsafe_copy_siginfo_to_user(&frame->info, &ksig->info, failed);
+#endif
 
 	/* create a stack frame for the caller of the handler */
 	unsafe_put_user(regs->gpr[1], newsp, failed);
 
 	user_access_end();
 
-	if (copy_siginfo_to_user(&frame->info, &ksig->info))
+#ifdef CONFIG_COMPAT
+	if (copy_siginfo_to_user32(&frame->info, &ksig->info))
 		goto badframe;
+#endif
 
 	regs->link = tramp;
 
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 35c301457fbf..47cf7462e0d6 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -901,15 +901,12 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
 	}
 
 	unsafe_copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set), badframe_block);
+	unsafe_copy_siginfo_to_user(&frame->info, &ksig->info, badframe_block);
 	/* Allocate a dummy caller frame for the signal handler. */
 	unsafe_put_user(regs->gpr[1], newsp, badframe_block);
 
 	user_write_access_end();
 
-	/* Save the siginfo outside of the unsafe block. */
-	if (copy_siginfo_to_user(&frame->info, &ksig->info))
-		goto badframe;
-
 	/* Make sure signal handler doesn't get spurious FP exceptions */
 	tsk->thread.fp_state.fpscr = 0;
 
-- 
2.25.0


^ permalink raw reply related

* [PATCH 6/7] powerpc/uaccess: Add unsafe_clear_user()
From: Christophe Leroy @ 2021-06-15  6:41 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <b813c1f4d3dab2f51300eac44d99029aa8e57830.1623739212.git.christophe.leroy@csgroup.eu>

Implement unsafe_clear_user() for powerpc.
It's a copy/paste of unsafe_copy_to_user() with value 0 as source.

It may be improved in a later patch by using 'dcbz' instruction
to zeroize full cache lines at once.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/include/asm/uaccess.h | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
index 22c79ab40006..962b675485ff 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -467,6 +467,26 @@ do {									\
 		unsafe_put_user(*(u8*)(_src + _i), (u8 __user *)(_dst + _i), e); \
 } while (0)
 
+#define unsafe_clear_user(d, l, e)					\
+do {									\
+	u8 __user *_dst = (u8 __user *)(d);				\
+	size_t _len = (l);						\
+	int _i;								\
+									\
+	for (_i = 0; _i < (_len & ~(sizeof(u64) - 1)); _i += sizeof(u64)) \
+		unsafe_put_user(0, (u64 __user *)(_dst + _i), e);	\
+	if (_len & 4) {							\
+		unsafe_put_user(0, (u32 __user *)(_dst + _i), e);	\
+		_i += 4;						\
+	}								\
+	if (_len & 2) {							\
+		unsafe_put_user(0, (u16 __user *)(_dst + _i), e);	\
+		_i += 2;						\
+	}								\
+	if (_len & 1)							\
+		unsafe_put_user(0, (u8 __user *)(_dst + _i), e);	\
+} while (0)
+
 #define HAVE_GET_KERNEL_NOFAULT
 
 #define __get_kernel_nofault(dst, src, type, err_label)			\
-- 
2.25.0


^ permalink raw reply related

* [PATCH 5/7] signal: Add unsafe_copy_siginfo_to_user()
From: Christophe Leroy @ 2021-06-15  6:41 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linux-arch, linuxppc-dev, linux-kernel
In-Reply-To: <b813c1f4d3dab2f51300eac44d99029aa8e57830.1623739212.git.christophe.leroy@csgroup.eu>

In the same spirit as commit fb05121fd6a2 ("signal: Add
unsafe_get_compat_sigset()"), implement an 'unsafe' version of
copy_siginfo_to_user() in order to use it within user access blocks.

For that, also add an 'unsafe' version of clear_user().

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 include/linux/signal.h  | 15 +++++++++++++++
 include/linux/uaccess.h |  1 +
 kernel/signal.c         |  5 -----
 3 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/include/linux/signal.h b/include/linux/signal.h
index 201f88e3738b..beac7b5e4acc 100644
--- a/include/linux/signal.h
+++ b/include/linux/signal.h
@@ -35,6 +35,21 @@ static inline void copy_siginfo_to_external(siginfo_t *to,
 int copy_siginfo_to_user(siginfo_t __user *to, const kernel_siginfo_t *from);
 int copy_siginfo_from_user(kernel_siginfo_t *to, const siginfo_t __user *from);
 
+static __always_inline char __user *si_expansion(const siginfo_t __user *info)
+{
+	return ((char __user *)info) + sizeof(struct kernel_siginfo);
+}
+
+#define unsafe_copy_siginfo_to_user(to, from, label) do {		\
+	siginfo_t __user *__ucs_to = to;				\
+	const kernel_siginfo_t *__ucs_from = from;			\
+	char __user *__ucs_expansion = si_expansion(__ucs_to);		\
+									\
+	unsafe_copy_to_user(__ucs_to, __ucs_from,			\
+			    sizeof(struct kernel_siginfo), label);	\
+	unsafe_clear_user(__ucs_expansion, SI_EXPANSION_SIZE, label);	\
+} while (0)
+
 enum siginfo_layout {
 	SIL_KILL,
 	SIL_TIMER,
diff --git a/include/linux/uaccess.h b/include/linux/uaccess.h
index c05e903cef02..37073caac474 100644
--- a/include/linux/uaccess.h
+++ b/include/linux/uaccess.h
@@ -398,6 +398,7 @@ long strnlen_user_nofault(const void __user *unsafe_addr, long count);
 #define unsafe_put_user(x,p,e) unsafe_op_wrap(__put_user(x,p),e)
 #define unsafe_copy_to_user(d,s,l,e) unsafe_op_wrap(__copy_to_user(d,s,l),e)
 #define unsafe_copy_from_user(d,s,l,e) unsafe_op_wrap(__copy_from_user(d,s,l),e)
+#define unsafe_clear_user(d, l, e) unsafe_op_wrap(__clear_user(d, l), e)
 static inline unsigned long user_access_save(void) { return 0UL; }
 static inline void user_access_restore(unsigned long flags) { }
 #endif
diff --git a/kernel/signal.c b/kernel/signal.c
index f7c6ffcbd044..7a366331d2b7 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -3286,11 +3286,6 @@ enum siginfo_layout siginfo_layout(unsigned sig, int si_code)
 	return layout;
 }
 
-static inline char __user *si_expansion(const siginfo_t __user *info)
-{
-	return ((char __user *)info) + sizeof(struct kernel_siginfo);
-}
-
 int copy_siginfo_to_user(siginfo_t __user *to, const kernel_siginfo_t *from)
 {
 	char __user *expansion = si_expansion(to);
-- 
2.25.0


^ permalink raw reply related

* [PATCH 4/7] powerpc/signal: Include the new stack frame inside the user access block
From: Christophe Leroy @ 2021-06-15  6:41 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <b813c1f4d3dab2f51300eac44d99029aa8e57830.1623739212.git.christophe.leroy@csgroup.eu>

Include the new stack frame inside the user access block and set it up
using unsafe_put_user().

On an mpc 8321 (book3s/32) the improvment is about 4% on a process
sending a signal to itself.

Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
 arch/powerpc/kernel/signal_32.c | 29 +++++++++++++----------------
 arch/powerpc/kernel/signal_64.c | 14 +++++++-------
 2 files changed, 20 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index 8f05ed0da292..621de6e457b3 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -781,7 +781,7 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset,
 	struct rt_sigframe __user *frame;
 	struct mcontext __user *mctx;
 	struct mcontext __user *tm_mctx = NULL;
-	unsigned long newsp = 0;
+	unsigned long __user *newsp;
 	unsigned long tramp;
 	struct pt_regs *regs = tsk->thread.regs;
 	/* Save the thread's msr before get_tm_stackpointer() changes it */
@@ -789,6 +789,7 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset,
 
 	/* Set up Signal Frame */
 	frame = get_sigframe(ksig, tsk, sizeof(*frame), 1);
+	newsp = (unsigned long __user *)((unsigned long)frame - (__SIGNAL_FRAMESIZE + 16));
 	mctx = &frame->uc.uc_mcontext;
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
 	tm_mctx = &frame->uc_transact.uc_mcontext;
@@ -798,7 +799,7 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset,
 	else
 		prepare_save_user_regs(1);
 
-	if (!user_access_begin(frame, sizeof(*frame)))
+	if (!user_access_begin(newsp, __SIGNAL_FRAMESIZE + 16 + sizeof(*frame)))
 		goto badframe;
 
 	/* Put the siginfo & fill in most of the ucontext */
@@ -836,6 +837,9 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset,
 	}
 	unsafe_put_sigset_t(&frame->uc.uc_sigmask, oldset, failed);
 
+	/* create a stack frame for the caller of the handler */
+	unsafe_put_user(regs->gpr[1], newsp, failed);
+
 	user_access_end();
 
 	if (copy_siginfo_to_user(&frame->info, &ksig->info))
@@ -847,13 +851,8 @@ int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset,
 	tsk->thread.fp_state.fpscr = 0;	/* turn off all fp exceptions */
 #endif
 
-	/* create a stack frame for the caller of the handler */
-	newsp = ((unsigned long)frame) - (__SIGNAL_FRAMESIZE + 16);
-	if (put_user(regs->gpr[1], (u32 __user *)newsp))
-		goto badframe;
-
 	/* Fill registers for signal handler */
-	regs->gpr[1] = newsp;
+	regs->gpr[1] = (unsigned long)newsp;
 	regs->gpr[3] = ksig->sig;
 	regs->gpr[4] = (unsigned long)&frame->info;
 	regs->gpr[5] = (unsigned long)&frame->uc;
@@ -883,7 +882,7 @@ int handle_signal32(struct ksignal *ksig, sigset_t *oldset,
 	struct sigframe __user *frame;
 	struct mcontext __user *mctx;
 	struct mcontext __user *tm_mctx = NULL;
-	unsigned long newsp = 0;
+	unsigned long __user *newsp;
 	unsigned long tramp;
 	struct pt_regs *regs = tsk->thread.regs;
 	/* Save the thread's msr before get_tm_stackpointer() changes it */
@@ -891,6 +890,7 @@ int handle_signal32(struct ksignal *ksig, sigset_t *oldset,
 
 	/* Set up Signal Frame */
 	frame = get_sigframe(ksig, tsk, sizeof(*frame), 1);
+	newsp = (unsigned long __user *)((unsigned long)frame - __SIGNAL_FRAMESIZE);
 	mctx = &frame->mctx;
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
 	tm_mctx = &frame->mctx_transact;
@@ -900,7 +900,7 @@ int handle_signal32(struct ksignal *ksig, sigset_t *oldset,
 	else
 		prepare_save_user_regs(1);
 
-	if (!user_access_begin(frame, sizeof(*frame)))
+	if (!user_access_begin(newsp, __SIGNAL_FRAMESIZE + sizeof(*frame)))
 		goto badframe;
 	sc = (struct sigcontext __user *) &frame->sctx;
 
@@ -931,6 +931,8 @@ int handle_signal32(struct ksignal *ksig, sigset_t *oldset,
 		unsafe_put_user(PPC_INST_SC, &mctx->mc_pad[1], failed);
 		asm("dcbst %y0; sync; icbi %y0; sync" :: "Z" (mctx->mc_pad[0]));
 	}
+	/* create a stack frame for the caller of the handler */
+	unsafe_put_user(regs->gpr[1], newsp, failed);
 	user_access_end();
 
 	regs->link = tramp;
@@ -939,12 +941,7 @@ int handle_signal32(struct ksignal *ksig, sigset_t *oldset,
 	tsk->thread.fp_state.fpscr = 0;	/* turn off all fp exceptions */
 #endif
 
-	/* create a stack frame for the caller of the handler */
-	newsp = ((unsigned long)frame) - __SIGNAL_FRAMESIZE;
-	if (put_user(regs->gpr[1], (u32 __user *)newsp))
-		goto badframe;
-
-	regs->gpr[1] = newsp;
+	regs->gpr[1] = (unsigned long)newsp;
 	regs->gpr[3] = ksig->sig;
 	regs->gpr[4] = (unsigned long) sc;
 	regs->nip = (unsigned long)ksig->ka.sa.sa_handler;
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 9ca97b4366df..35c301457fbf 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -847,13 +847,14 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
 		struct task_struct *tsk)
 {
 	struct rt_sigframe __user *frame;
-	unsigned long newsp = 0;
+	unsigned long __user *newsp;
 	long err = 0;
 	struct pt_regs *regs = tsk->thread.regs;
 	/* Save the thread's msr before get_tm_stackpointer() changes it */
 	unsigned long msr = regs->msr;
 
 	frame = get_sigframe(ksig, tsk, sizeof(*frame), 0);
+	newsp = (unsigned long __user *)((unsigned long)frame - __SIGNAL_FRAMESIZE);
 
 	/*
 	 * This only applies when calling unsafe_setup_sigcontext() and must be
@@ -862,7 +863,7 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
 	if (!MSR_TM_ACTIVE(msr))
 		prepare_setup_sigcontext(tsk);
 
-	if (!user_write_access_begin(frame, sizeof(*frame)))
+	if (!user_write_access_begin(newsp, __SIGNAL_FRAMESIZE + sizeof(*frame)))
 		goto badframe;
 
 	unsafe_put_user(&frame->info, &frame->pinfo, badframe_block);
@@ -900,6 +901,9 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
 	}
 
 	unsafe_copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set), badframe_block);
+	/* Allocate a dummy caller frame for the signal handler. */
+	unsafe_put_user(regs->gpr[1], newsp, badframe_block);
+
 	user_write_access_end();
 
 	/* Save the siginfo outside of the unsafe block. */
@@ -919,10 +923,6 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
 		regs->nip = (unsigned long) &frame->tramp[0];
 	}
 
-	/* Allocate a dummy caller frame for the signal handler. */
-	newsp = ((unsigned long)frame) - __SIGNAL_FRAMESIZE;
-	err |= put_user(regs->gpr[1], (unsigned long __user *)newsp);
-
 	/* Set up "regs" so we "return" to the signal handler. */
 	if (is_elf2_task()) {
 		regs->ctr = (unsigned long) ksig->ka.sa.sa_handler;
@@ -953,7 +953,7 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
 	/* enter the signal handler in native-endian mode */
 	regs->msr &= ~MSR_LE;
 	regs->msr |= (MSR_KERNEL & MSR_LE);
-	regs->gpr[1] = newsp;
+	regs->gpr[1] = (unsigned long)newsp;
 	regs->gpr[3] = ksig->sig;
 	regs->result = 0;
 	if (ksig->ka.sa.sa_flags & SA_SIGINFO) {
-- 
2.25.0


^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox