LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH v7 0/5] powerpc/64: memcmp() optimization
From: Simon Guo @ 2018-06-04 10:27 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev, Paul Mackerras, Naveen N.  Rao, Cyril Bur
In-Reply-To: <877eneasg9.fsf@concordia.ellerman.id.au>

Hi Michael,
On Tue, Jun 05, 2018 at 12:16:22PM +1000, Michael Ellerman wrote:
> Hi Simon,
> 
> wei.guo.simon@gmail.com writes:
> > From: Simon Guo <wei.guo.simon@gmail.com>
> >
> > There is some room to optimize memcmp() in powerpc 64 bits version for
> > following 2 cases:
> > (1) Even src/dst addresses are not aligned with 8 bytes at the beginning,
> > memcmp() can align them and go with .Llong comparision mode without
> > fallback to .Lshort comparision mode do compare buffer byte by byte.
> > (2) VMX instructions can be used to speed up for large size comparision,
> > currently the threshold is set for 4K bytes. Notes the VMX instructions
> > will lead to VMX regs save/load penalty. This patch set includes a
> > patch to add a 32 bytes pre-checking to minimize the penalty.
> >
> > It did the similar with glibc commit dec4a7105e (powerpc: Improve memcmp 
> > performance for POWER8). Thanks Cyril Bur's information.
> > This patch set also updates memcmp selftest case to make it compiled and
> > incorporate large size comparison case.
> 
> I'm seeing a few crashes with this applied, I haven't had time to look
> into what is happening yet, sorry.
Sorry I didn't catch this in my testing. I will check the root cause
and update later.

Thanks,
- Simon

> 
> [ 2471.300595] kselftest: Running tests in user
> [ 2471.302785] calling  test_user_copy_init+0x0/0xd14 [test_user_copy] @ 44883
> [ 2471.302892] Unable to handle kernel paging request for data at address 0xc008000018553005
> [ 2471.303014] Faulting instruction address: 0xc00000000001f29c
> [ 2471.303119] Oops: Kernel access of bad area, sig: 11 [#1]
> [ 2471.303193] LE SMP NR_CPUS=2048 NUMA PowerNV


> [ 2471.303256] Modules linked in: test_user_copy(+) vxlan ip6_udp_tunnel udp_tunnel 8021q bridge stp llc dummy test_printf test_firmware vmx_crypto crct10dif_vpmsum crct10dif_common crc32c_vpmsum veth [last unloaded: test_static_key_base]
> [ 2471.303532] CPU: 4 PID: 44883 Comm: modprobe Tainted: G        W         4.17.0-rc3-gcc7x-g7204012 #1
> [ 2471.303644] NIP:  c00000000001f29c LR: c00000000001f6e4 CTR: 0000000000000000
> [ 2471.303754] REGS: c000001fddc2b560 TRAP: 0300   Tainted: G        W          (4.17.0-rc3-gcc7x-g7204012)
> [ 2471.303873] MSR:  9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE>  CR: 24222844  XER: 00000000
> [ 2471.303996] CFAR: c00000000001f6e0 DAR: c008000018553005 DSISR: 40000000 IRQMASK: 0 
> [ 2471.303996] GPR00: c00000000001f6e4 c000001fddc2b7e0 c008000018529900 0000000002000000 
> [ 2471.303996] GPR04: c000001fe4b90020 000000000000ffe0 0000000000000000 03fffffe01b48000 
> [ 2471.303996] GPR08: 0000000080000000 c008000018553005 c000001fddc28000 c008000018520df0 
> [ 2471.303996] GPR12: c00000000009c430 c000001fffffbc00 0000000020000000 0000000000000000 
> [ 2471.303996] GPR16: c000001fddc2bc20 0000000000000030 c0000000001f7ba0 0000000000000001 
> [ 2471.303996] GPR20: 0000000000000000 c000000000c772b0 c0000000010b4018 0000000000000000 
> [ 2471.303996] GPR24: 0000000000000000 c008000018521c98 0000000000000000 c000001fe4b90000 
> [ 2471.303996] GPR28: fffffffffffffff4 0000000002000000 9000000002009033 9000000002009033 
> [ 2471.304930] NIP [c00000000001f29c] msr_check_and_set+0x3c/0xc0
> [ 2471.305008] LR [c00000000001f6e4] enable_kernel_altivec+0x44/0x100
> [ 2471.305084] Call Trace:
> [ 2471.305122] [c000001fddc2b7e0] [c00000000009baa8] __copy_tofrom_user_base+0x9c/0x574 (unreliable)
> [ 2471.305240] [c000001fddc2b860] [c00000000001f6e4] enable_kernel_altivec+0x44/0x100
> [ 2471.305336] [c000001fddc2b890] [c00000000009ce40] enter_vmx_ops+0x50/0x70
> [ 2471.305418] [c000001fddc2b8b0] [c00000000009c768] memcmp+0x338/0x680
> [ 2471.305501] [c000001fddc2b9b0] [c008000018520190] test_user_copy_init+0x188/0xd14 [test_user_copy]
> [ 2471.305617] [c000001fddc2ba60] [c00000000000de20] do_one_initcall+0x90/0x560
> [ 2471.305710] [c000001fddc2bb30] [c000000000200630] do_init_module+0x90/0x260
> [ 2471.305795] [c000001fddc2bbc0] [c0000000001fec88] load_module+0x1a28/0x1ce0
> [ 2471.305875] [c000001fddc2bd70] [c0000000001ff1e8] sys_finit_module+0xc8/0x110
> [ 2471.305983] [c000001fddc2be30] [c00000000000b528] system_call+0x58/0x6c
> [ 2471.306066] Instruction dump:
> [ 2471.306112] fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ff81 7c7d1b78 60000000 60000000 
> [ 2471.306216] 7fe000a6 3d220003 39299705 7ffeeb78 <89290000> 2f890000 419e0044 60000000 
> [ 2471.306326] ---[ end trace daf8d409e65b9841 ]---
> 
> And:
> 
> [   19.096709] test_bpf: test_skb_segment: success in skb_segment!
> [   19.096799] initcall test_bpf_init+0x0/0xae0 [test_bpf] returned 0 after 591217 usecs
> [   19.115869] calling  test_user_copy_init+0x0/0xd14 [test_user_copy] @ 3159
> [   19.116165] Unable to handle kernel paging request for data at address 0xd000000003852805
> [   19.116352] Faulting instruction address: 0xc00000000001f44c
> [   19.116483] Oops: Kernel access of bad area, sig: 11 [#1]
> [   19.116583] LE SMP NR_CPUS=2048 NUMA pSeries
> [   19.116684] Modules linked in: test_user_copy(+) lzo_compress crc_itu_t zstd_compress zstd_decompress test_bpf test_static_keys test_static_key_base xxhash test_firmware af_key cls_bpf act_bpf bridge nf_nat_irc xt_NFLOG nfnetlink_log xt_policy nf_conntrack_netlink nfnetlink xt_nat nf_conntrack_irc xt_mark xt_tcpudp nf_nat_sip xt_TCPMSS xt_LOG nf_nat_ftp nf_conntrack_ftp xt_conntrack nf_conntrack_sip xt_addrtype xt_state 8021q iptable_filter ipt_MASQUERADE nf_log_ipv4 iptable_mangle nf_nat_masquerade_ipv4 ipt_REJECT nf_reject_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables nf_log_arp nf_log_common ah4 ipcomp xfrm4_tunnel esp4 rpcrdma stp p8022 psnap llc xfrm_ipcomp xfrm_user xfrm_algo platform_lcd lcd ocxl virtio_balloon virtio_crypto crypto_engine
> [   19.118040]  vmx_crypto nbd zram zsmalloc virtio_blk st be2iscsi cxgb3i cxgb4i libcxgbi bnx2i ibmvfc sym53c8xx scsi_transport_spi scsi_dh_alua scsi_dh_rdac qla4xxx mpt3sas scsi_transport_sas cxlflash cxl libiscsi_tcp lpfc crc_t10dif crct10dif_generic crct10dif_common qla2xxx iscsi_boot_sysfs raid_class parport_pc parport powernv_op_panel powernv_rng pseries_rng rng_core virtio_console pcspkr input_leds evdev dm_round_robin dm_mirror dm_region_hash dm_log raid10 dm_service_time multipath dm_queue_length dm_multipath dm_thin_pool faulty dm_persistent_data dm_zero dm_crypt dm_bio_prison dm_snapshot dm_bufio raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq rpadlpar_io rpaphp jsm icom hvcs ib_ipoib ib_srp ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_ucm ib_ucm ib_uverbs
> [   19.119505]  rdma_cm iw_cm ib_cm mlx4_ib iw_cxgb3 iw_cxgb4 ib_mthca ib_core leds_powernv led_class vhost_net vhost macvtap macvlan dummy bsd_comp ppp_async crc_ccitt pppoe ppp_synctty pppox ppp_deflate ppp_generic 3c59x s2io bnx2 cnic uio bnx2x libcrc32c i40e ixgbe ixgb cxgb3 libcxgb cxgb cxgb4 pcnet32 netxen_nic qlge be2net acenic mlx4_en mlx4_core myri10ge bonding slhc tap mdio veth vxlan udp_tunnel tun usb_storage usbmon oprofile sha1_powerpc md5_ppc crc32c_vpmsum kvm hvcserver
> [   19.120358] CPU: 4 PID: 3159 Comm: modprobe Not tainted 4.17.0-rc3-gcc7x-g7204012 #1
> [   19.120508] NIP:  c00000000001f44c LR: c00000000001f894 CTR: 0000000000000000
> [   19.120666] REGS: c0000000f8d9f570 TRAP: 0300   Not tainted  (4.17.0-rc3-gcc7x-g7204012)
> [   19.120817] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24222844  XER: 00000000
> [   19.120984] CFAR: c00000000000c03c DAR: d000000003852805 DSISR: 40000000 IRQMASK: 0 
>                GPR00: c00000000001f894 c0000000f8d9f7f0 d000000003829900 0000000002000000 
>                GPR04: c0000000f9a30048 000000000000ffe0 0000000000000000 03fffffff065dffd 
>                GPR08: 0000000080000000 d000000003852805 c0000000f8d9c000 d000000003820df0 
>                GPR12: c00000000009ebb0 c00000003fffb300 c0000000f8d9fd90 d000000003840000 
>                GPR16: d000000003840000 0000000000000000 c0000000011d6900 d000000003821ad0 
>                GPR20: c000000000bd7860 0000000000000000 c000000000ff9060 00000000014000c0 
>                GPR24: 0000000000000000 0000000000000000 0000000000000100 c0000000f9a30028 
>                GPR28: fffffffffffffff4 0000000002000000 8000000002009033 8000000000009033 
> [   19.122454] NIP [c00000000001f44c] msr_check_and_set+0x3c/0xc0
> [   19.122580] LR [c00000000001f894] enable_kernel_altivec+0x44/0x100
> [   19.122707] Call Trace:
> [   19.122789] [c0000000f8d9f7f0] [c00000000009e228] __copy_tofrom_user_base+0x9c/0x574 (unreliable)
> [   19.122962] [c0000000f8d9f870] [c00000000001f894] enable_kernel_altivec+0x44/0x100
> [   19.123344] [c0000000f8d9f8a0] [c00000000009f740] enter_vmx_ops+0x50/0x70
> [   19.123583] [c0000000f8d9f8c0] [c00000000009eee8] memcmp+0x338/0x680
> [   19.123728] [c0000000f8d9f9c0] [d000000003820190] test_user_copy_init+0x188/0xd14 [test_user_copy]
> [   19.123909] [c0000000f8d9fa70] [c00000000000e37c] do_one_initcall+0x5c/0x2d0
> [   19.124094] [c0000000f8d9fb30] [c00000000020066c] do_init_module+0x90/0x264
> [   19.124234] [c0000000f8d9fbc0] [c0000000001ff084] load_module+0x2f64/0x3600
> [   19.124371] [c0000000f8d9fd70] [c0000000001ff9c8] sys_finit_module+0xc8/0x110
> [   19.124530] [c0000000f8d9fe30] [c00000000000b868] system_call+0x58/0x6c
> [   19.124648] Instruction dump:
> [   19.124721] fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ff81 7c7d1b78 60000000 60000000 
> [   19.124869] 7fe000a6 3d220003 39298f05 7ffeeb78 <89290000> 2f890000 419e0044 60000000 
> [   19.125034] ---[ end trace 7c08acedd4b4e6aa ]---
> 
> 
> cheers

^ permalink raw reply

* Re: [PATCH] cpuidle:powernv: Make the snooze timeout dynamic.
From: Stewart Smith @ 2018-06-05  3:47 UTC (permalink / raw)
  To: Michael Ellerman, Gautham R. Shenoy, Rafael J. Wysocki,
	Daniel Lezcano, Michael Neuling, Vaidyanathan Srinivasan,
	Shilpasri G Bhat, Akshay Adiga, Nicholas Piggin
  Cc: linuxppc-dev, linux-kernel, linux-pm, Gautham R. Shenoy
In-Reply-To: <87fu22bxlf.fsf@concordia.ellerman.id.au>

Michael Ellerman <mpe@ellerman.id.au> writes:
> "Gautham R. Shenoy" <ego@linux.vnet.ibm.com> writes:
>
>> From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
>>
>> The commit 78eaa10f027c ("cpuidle: powernv/pseries: Auto-promotion of
>> snooze to deeper idle state") introduced a timeout for the snooze idle
>> state so that it could be eventually be promoted to a deeper idle
>> state. The snooze timeout value is static and set to the target
>> residency of the next idle state, which would train the cpuidle
>> governor to pick the next idle state eventually.
>>
>> The unfortunate side-effect of this is that if the next idle state(s)
>> is disabled, the CPU will forever remain in snooze, despite the fact
>> that the system is completely idle, and other deeper idle states are
>> available.
>
> That sounds like a bug, I'll add?
>
> Fixes: 78eaa10f027c ("cpuidle: powernv/pseries: Auto-promotion of snooze to deeper idle state")
> Cc: stable@vger.kernel.org # v4.2+

Yes, it's a bug - we had a customer bug because we lacked this that
meant we had to do firmware changes rather than just tweaking what stop
states were used.

-- 
Stewart Smith
OPAL Architect, IBM.

^ permalink raw reply

* RE: [PATCH 09/10] dpaa_eth: add support for hardware timestamping
From: Y.b. Lu @ 2018-06-05  3:35 UTC (permalink / raw)
  To: Richard Cochran
  Cc: netdev@vger.kernel.org, Madalin-cristian Bucur, Rob Herring,
	Shawn Guo, David S . Miller, devicetree@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org,
	linux-arm-kernel@lists.infradead.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <20180604134920.ezhe6jz5ntpnqyzj@localhost>

Hi Richard,

> -----Original Message-----
> From: Richard Cochran [mailto:richardcochran@gmail.com]
> Sent: Monday, June 4, 2018 9:49 PM
> To: Y.b. Lu <yangbo.lu@nxp.com>
> Cc: netdev@vger.kernel.org; Madalin-cristian Bucur
> <madalin.bucur@nxp.com>; Rob Herring <robh+dt@kernel.org>; Shawn Guo
> <shawnguo@kernel.org>; David S . Miller <davem@davemloft.net>;
> devicetree@vger.kernel.org; linuxppc-dev@lists.ozlabs.org;
> linux-arm-kernel@lists.infradead.org; linux-kernel@vger.kernel.org
> Subject: Re: [PATCH 09/10] dpaa_eth: add support for hardware timestampin=
g
>=20
> On Mon, Jun 04, 2018 at 03:08:36PM +0800, Yangbo Lu wrote:
>=20
> > +if FSL_DPAA_ETH
> > +config FSL_DPAA_ETH_TS
> > +	bool "DPAA hardware timestamping support"
> > +	select PTP_1588_CLOCK_QORIQ
> > +	default n
> > +	help
> > +	  Enable DPAA hardware timestamping support.
> > +	  This option is useful for applications to get
> > +	  hardware time stamps on the Ethernet packets
> > +	  using the SO_TIMESTAMPING API.
> > +endif
>=20
> You should drop this #ifdef.  In general, if a MAC supports time stamping=
 and
> PHC, then the driver support should simply be compiled in.
>=20
> [ When time stamping incurs a large run time performance penalty to
>   non-PTP users, then it might make sense to have a Kconfig option to
>   disable it, but that doesn't appear to be the case here. ]

[Y.b. Lu] Actually these timestamping codes affected DPAA networking perfor=
mance in our previous performance test.
That's why we used ifdef for it.

>=20
> > @@ -1615,6 +1635,24 @@ static int dpaa_eth_refill_bpools(struct
> dpaa_priv *priv)
> >  	skbh =3D (struct sk_buff **)phys_to_virt(addr);
> >  	skb =3D *skbh;
> >
> > +#ifdef CONFIG_FSL_DPAA_ETH_TS
> > +	if (priv->tx_tstamp &&
> > +	    skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) {
>=20
> This condition fits on one line easily.

[Y.b. Lu] Right. I will use one line in next version.

>=20
> > +		struct skb_shared_hwtstamps shhwtstamps;
> > +		u64 ns;
>=20
> Local variables belong at the top of the function.

[Y.b. Lu] Ok, will move them to the top in next verison.

>=20
> > +		memset(&shhwtstamps, 0, sizeof(shhwtstamps));
> > +
> > +		if (!dpaa_get_tstamp_ns(priv->net_dev, &ns,
> > +					priv->mac_dev->port[TX],
> > +					(void *)skbh)) {
> > +			shhwtstamps.hwtstamp =3D ns_to_ktime(ns);
> > +			skb_tstamp_tx(skb, &shhwtstamps);
> > +		} else {
> > +			dev_warn(dev, "dpaa_get_tstamp_ns failed!\n");
> > +		}
> > +	}
> > +#endif
> >  	if (unlikely(qm_fd_get_format(fd) =3D=3D qm_fd_sg)) {
> >  		nr_frags =3D skb_shinfo(skb)->nr_frags;
> >  		dma_unmap_single(dev, addr, qm_fd_get_offset(fd) + @@ -2086,6
> > +2124,14 @@ static int dpaa_start_xmit(struct sk_buff *skb, struct
> net_device *net_dev)
> >  	if (unlikely(err < 0))
> >  		goto skb_to_fd_failed;
> >
> > +#ifdef CONFIG_FSL_DPAA_ETH_TS
> > +	if (priv->tx_tstamp &&
> > +	    skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) {
>=20
> One line please.

[Y.b. Lu] No problem.

>=20
> > +		fd.cmd |=3D FM_FD_CMD_UPD;
> > +		skb_shinfo(skb)->tx_flags |=3D SKBTX_IN_PROGRESS;
> > +	}
> > +#endif
> > +
> >  	if (likely(dpaa_xmit(priv, percpu_stats, queue_mapping, &fd) =3D=3D 0=
))
> >  		return NETDEV_TX_OK;
> >
>=20
> Thanks,
> Richard

^ permalink raw reply

* Re: [PATCH v7 0/5] powerpc/64: memcmp() optimization
From: Michael Ellerman @ 2018-06-05  2:16 UTC (permalink / raw)
  To: wei.guo.simon, linuxppc-dev
  Cc: Paul Mackerras, Naveen N.  Rao, Cyril Bur, Simon Guo
In-Reply-To: <1527672063-6953-1-git-send-email-wei.guo.simon@gmail.com>

Hi Simon,

wei.guo.simon@gmail.com writes:
> From: Simon Guo <wei.guo.simon@gmail.com>
>
> There is some room to optimize memcmp() in powerpc 64 bits version for
> following 2 cases:
> (1) Even src/dst addresses are not aligned with 8 bytes at the beginning,
> memcmp() can align them and go with .Llong comparision mode without
> fallback to .Lshort comparision mode do compare buffer byte by byte.
> (2) VMX instructions can be used to speed up for large size comparision,
> currently the threshold is set for 4K bytes. Notes the VMX instructions
> will lead to VMX regs save/load penalty. This patch set includes a
> patch to add a 32 bytes pre-checking to minimize the penalty.
>
> It did the similar with glibc commit dec4a7105e (powerpc: Improve memcmp 
> performance for POWER8). Thanks Cyril Bur's information.
> This patch set also updates memcmp selftest case to make it compiled and
> incorporate large size comparison case.

I'm seeing a few crashes with this applied, I haven't had time to look
into what is happening yet, sorry.

[ 2471.300595] kselftest: Running tests in user
[ 2471.302785] calling  test_user_copy_init+0x0/0xd14 [test_user_copy] @ 44883
[ 2471.302892] Unable to handle kernel paging request for data at address 0xc008000018553005
[ 2471.303014] Faulting instruction address: 0xc00000000001f29c
[ 2471.303119] Oops: Kernel access of bad area, sig: 11 [#1]
[ 2471.303193] LE SMP NR_CPUS=2048 NUMA PowerNV
[ 2471.303256] Modules linked in: test_user_copy(+) vxlan ip6_udp_tunnel udp_tunnel 8021q bridge stp llc dummy test_printf test_firmware vmx_crypto crct10dif_vpmsum crct10dif_common crc32c_vpmsum veth [last unloaded: test_static_key_base]
[ 2471.303532] CPU: 4 PID: 44883 Comm: modprobe Tainted: G        W         4.17.0-rc3-gcc7x-g7204012 #1
[ 2471.303644] NIP:  c00000000001f29c LR: c00000000001f6e4 CTR: 0000000000000000
[ 2471.303754] REGS: c000001fddc2b560 TRAP: 0300   Tainted: G        W          (4.17.0-rc3-gcc7x-g7204012)
[ 2471.303873] MSR:  9000000002009033 <SF,HV,VEC,EE,ME,IR,DR,RI,LE>  CR: 24222844  XER: 00000000
[ 2471.303996] CFAR: c00000000001f6e0 DAR: c008000018553005 DSISR: 40000000 IRQMASK: 0 
[ 2471.303996] GPR00: c00000000001f6e4 c000001fddc2b7e0 c008000018529900 0000000002000000 
[ 2471.303996] GPR04: c000001fe4b90020 000000000000ffe0 0000000000000000 03fffffe01b48000 
[ 2471.303996] GPR08: 0000000080000000 c008000018553005 c000001fddc28000 c008000018520df0 
[ 2471.303996] GPR12: c00000000009c430 c000001fffffbc00 0000000020000000 0000000000000000 
[ 2471.303996] GPR16: c000001fddc2bc20 0000000000000030 c0000000001f7ba0 0000000000000001 
[ 2471.303996] GPR20: 0000000000000000 c000000000c772b0 c0000000010b4018 0000000000000000 
[ 2471.303996] GPR24: 0000000000000000 c008000018521c98 0000000000000000 c000001fe4b90000 
[ 2471.303996] GPR28: fffffffffffffff4 0000000002000000 9000000002009033 9000000002009033 
[ 2471.304930] NIP [c00000000001f29c] msr_check_and_set+0x3c/0xc0
[ 2471.305008] LR [c00000000001f6e4] enable_kernel_altivec+0x44/0x100
[ 2471.305084] Call Trace:
[ 2471.305122] [c000001fddc2b7e0] [c00000000009baa8] __copy_tofrom_user_base+0x9c/0x574 (unreliable)
[ 2471.305240] [c000001fddc2b860] [c00000000001f6e4] enable_kernel_altivec+0x44/0x100
[ 2471.305336] [c000001fddc2b890] [c00000000009ce40] enter_vmx_ops+0x50/0x70
[ 2471.305418] [c000001fddc2b8b0] [c00000000009c768] memcmp+0x338/0x680
[ 2471.305501] [c000001fddc2b9b0] [c008000018520190] test_user_copy_init+0x188/0xd14 [test_user_copy]
[ 2471.305617] [c000001fddc2ba60] [c00000000000de20] do_one_initcall+0x90/0x560
[ 2471.305710] [c000001fddc2bb30] [c000000000200630] do_init_module+0x90/0x260
[ 2471.305795] [c000001fddc2bbc0] [c0000000001fec88] load_module+0x1a28/0x1ce0
[ 2471.305875] [c000001fddc2bd70] [c0000000001ff1e8] sys_finit_module+0xc8/0x110
[ 2471.305983] [c000001fddc2be30] [c00000000000b528] system_call+0x58/0x6c
[ 2471.306066] Instruction dump:
[ 2471.306112] fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ff81 7c7d1b78 60000000 60000000 
[ 2471.306216] 7fe000a6 3d220003 39299705 7ffeeb78 <89290000> 2f890000 419e0044 60000000 
[ 2471.306326] ---[ end trace daf8d409e65b9841 ]---

And:

[   19.096709] test_bpf: test_skb_segment: success in skb_segment!
[   19.096799] initcall test_bpf_init+0x0/0xae0 [test_bpf] returned 0 after 591217 usecs
[   19.115869] calling  test_user_copy_init+0x0/0xd14 [test_user_copy] @ 3159
[   19.116165] Unable to handle kernel paging request for data at address 0xd000000003852805
[   19.116352] Faulting instruction address: 0xc00000000001f44c
[   19.116483] Oops: Kernel access of bad area, sig: 11 [#1]
[   19.116583] LE SMP NR_CPUS=2048 NUMA pSeries
[   19.116684] Modules linked in: test_user_copy(+) lzo_compress crc_itu_t zstd_compress zstd_decompress test_bpf test_static_keys test_static_key_base xxhash test_firmware af_key cls_bpf act_bpf bridge nf_nat_irc xt_NFLOG nfnetlink_log xt_policy nf_conntrack_netlink nfnetlink xt_nat nf_conntrack_irc xt_mark xt_tcpudp nf_nat_sip xt_TCPMSS xt_LOG nf_nat_ftp nf_conntrack_ftp xt_conntrack nf_conntrack_sip xt_addrtype xt_state 8021q iptable_filter ipt_MASQUERADE nf_log_ipv4 iptable_mangle nf_nat_masquerade_ipv4 ipt_REJECT nf_reject_ipv4 iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables nf_log_arp nf_log_common ah4 ipcomp xfrm4_tunnel esp4 rpcrdma stp p8022 psnap llc xfrm_ipcomp xfrm_user xfrm_algo platform_lcd lcd ocxl virtio_balloon virtio_crypto crypto_engine
[   19.118040]  vmx_crypto nbd zram zsmalloc virtio_blk st be2iscsi cxgb3i cxgb4i libcxgbi bnx2i ibmvfc sym53c8xx scsi_transport_spi scsi_dh_alua scsi_dh_rdac qla4xxx mpt3sas scsi_transport_sas cxlflash cxl libiscsi_tcp lpfc crc_t10dif crct10dif_generic crct10dif_common qla2xxx iscsi_boot_sysfs raid_class parport_pc parport powernv_op_panel powernv_rng pseries_rng rng_core virtio_console pcspkr input_leds evdev dm_round_robin dm_mirror dm_region_hash dm_log raid10 dm_service_time multipath dm_queue_length dm_multipath dm_thin_pool faulty dm_persistent_data dm_zero dm_crypt dm_bio_prison dm_snapshot dm_bufio raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq rpadlpar_io rpaphp jsm icom hvcs ib_ipoib ib_srp ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_ucm ib_ucm ib_uverbs
[   19.119505]  rdma_cm iw_cm ib_cm mlx4_ib iw_cxgb3 iw_cxgb4 ib_mthca ib_core leds_powernv led_class vhost_net vhost macvtap macvlan dummy bsd_comp ppp_async crc_ccitt pppoe ppp_synctty pppox ppp_deflate ppp_generic 3c59x s2io bnx2 cnic uio bnx2x libcrc32c i40e ixgbe ixgb cxgb3 libcxgb cxgb cxgb4 pcnet32 netxen_nic qlge be2net acenic mlx4_en mlx4_core myri10ge bonding slhc tap mdio veth vxlan udp_tunnel tun usb_storage usbmon oprofile sha1_powerpc md5_ppc crc32c_vpmsum kvm hvcserver
[   19.120358] CPU: 4 PID: 3159 Comm: modprobe Not tainted 4.17.0-rc3-gcc7x-g7204012 #1
[   19.120508] NIP:  c00000000001f44c LR: c00000000001f894 CTR: 0000000000000000
[   19.120666] REGS: c0000000f8d9f570 TRAP: 0300   Not tainted  (4.17.0-rc3-gcc7x-g7204012)
[   19.120817] MSR:  8000000000009033 <SF,EE,ME,IR,DR,RI,LE>  CR: 24222844  XER: 00000000
[   19.120984] CFAR: c00000000000c03c DAR: d000000003852805 DSISR: 40000000 IRQMASK: 0 
               GPR00: c00000000001f894 c0000000f8d9f7f0 d000000003829900 0000000002000000 
               GPR04: c0000000f9a30048 000000000000ffe0 0000000000000000 03fffffff065dffd 
               GPR08: 0000000080000000 d000000003852805 c0000000f8d9c000 d000000003820df0 
               GPR12: c00000000009ebb0 c00000003fffb300 c0000000f8d9fd90 d000000003840000 
               GPR16: d000000003840000 0000000000000000 c0000000011d6900 d000000003821ad0 
               GPR20: c000000000bd7860 0000000000000000 c000000000ff9060 00000000014000c0 
               GPR24: 0000000000000000 0000000000000000 0000000000000100 c0000000f9a30028 
               GPR28: fffffffffffffff4 0000000002000000 8000000002009033 8000000000009033 
[   19.122454] NIP [c00000000001f44c] msr_check_and_set+0x3c/0xc0
[   19.122580] LR [c00000000001f894] enable_kernel_altivec+0x44/0x100
[   19.122707] Call Trace:
[   19.122789] [c0000000f8d9f7f0] [c00000000009e228] __copy_tofrom_user_base+0x9c/0x574 (unreliable)
[   19.122962] [c0000000f8d9f870] [c00000000001f894] enable_kernel_altivec+0x44/0x100
[   19.123344] [c0000000f8d9f8a0] [c00000000009f740] enter_vmx_ops+0x50/0x70
[   19.123583] [c0000000f8d9f8c0] [c00000000009eee8] memcmp+0x338/0x680
[   19.123728] [c0000000f8d9f9c0] [d000000003820190] test_user_copy_init+0x188/0xd14 [test_user_copy]
[   19.123909] [c0000000f8d9fa70] [c00000000000e37c] do_one_initcall+0x5c/0x2d0
[   19.124094] [c0000000f8d9fb30] [c00000000020066c] do_init_module+0x90/0x264
[   19.124234] [c0000000f8d9fbc0] [c0000000001ff084] load_module+0x2f64/0x3600
[   19.124371] [c0000000f8d9fd70] [c0000000001ff9c8] sys_finit_module+0xc8/0x110
[   19.124530] [c0000000f8d9fe30] [c00000000000b868] system_call+0x58/0x6c
[   19.124648] Instruction dump:
[   19.124721] fba1ffe8 fbc1fff0 fbe1fff8 f8010010 f821ff81 7c7d1b78 60000000 60000000 
[   19.124869] 7fe000a6 3d220003 39298f05 7ffeeb78 <89290000> 2f890000 419e0044 60000000 
[   19.125034] ---[ end trace 7c08acedd4b4e6aa ]---


cheers

^ permalink raw reply

* [PATCH  5/5] powerpc/pkeys: make protection key 0 less special
From: Ram Pai @ 2018-06-05  2:09 UTC (permalink / raw)
  To: mpe
  Cc: linuxppc-dev, dave.hansen, aneesh.kumar, bsingharora, hbabu,
	mhocko, bauerman, linuxram, Ulrich.Weigand, fweimer, luto
In-Reply-To: <1528164594-23823-1-git-send-email-linuxram@us.ibm.com>

Applications need the ability to associate an address-range with some
key and latter revert to its initial default key. Pkey-0 comes close to
providing this function but falls short, because the current
implementation disallows applications to explicitly associate pkey-0 to
the address range.

Lets make pkey-0 less special and treat it almost like any other key.
Thus it can be explicitly associated with any address range, and can be
freed. This gives the application more flexibility and power.  The
ability to free pkey-0 must be used responsibily, since pkey-0 is
associated with almost all address-range by default.

Even with this change pkey-0 continues to be slightly more special
from the following point of view.
(a) it is implicitly allocated.
(b) it is the default key assigned to any address-range.
(c) its permissions cannot be modified by userspace.

NOTE: (c) is specific to powerpc only. pkey-0 is associated by default
with all pages including kernel pages, and pkeys are also active in
kernel mode. If any permission is denied on pkey-0, the kernel running
in the context of the application will be unable to operate.

Tested on powerpc.

cc: Thomas Gleixner <tglx@linutronix.de>
cc: Dave Hansen <dave.hansen@intel.com>
cc: Michael Ellermen <mpe@ellerman.id.au>
cc: Ingo Molnar <mingo@kernel.org>
cc: Andrew Morton <akpm@linux-foundation.org>
cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
cc: Michal Such谩nek <msuchanek@suse.de
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
-----------------------------------------------------------------------
History:
	v4: . introduced PKEY_0 macro.  No bug fixes. Code
		re-arrangement to save a few cycles.

	v3: . Corrected a comment in arch_set_user_pkey_access().  .
	Clarified the header, to capture the notion that pkey-0
	permissions cannot be modified by userspace on powerpc.
      		-- comment from Thiago

	v2: . mm_pkey_is_allocated() continued to treat pkey-0 special.
		fixed it.
---
 arch/powerpc/include/asm/pkeys.h |   29 +++++++++++++++++++++++------
 arch/powerpc/mm/pkeys.c          |   13 ++++++-------
 2 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/pkeys.h b/arch/powerpc/include/asm/pkeys.h
index 0409c80..d349e22 100644
--- a/arch/powerpc/include/asm/pkeys.h
+++ b/arch/powerpc/include/asm/pkeys.h
@@ -13,7 +13,10 @@
 
 DECLARE_STATIC_KEY_TRUE(pkey_disabled);
 extern int pkeys_total; /* total pkeys as per device tree */
-extern u32 initial_allocation_mask; /* bits set for reserved keys */
+extern u32 initial_allocation_mask; /*  bits set for the initially allocated keys */
+extern u32 reserved_allocation_mask; /* bits set for reserved keys */
+
+#define PKEY_0	0
 
 /*
  * Define these here temporarily so we're not dependent on patching linux/mm.h.
@@ -96,15 +99,19 @@ static inline u16 pte_to_pkey_bits(u64 pteflags)
 #define __mm_pkey_is_allocated(mm, pkey)	\
 	(mm_pkey_allocation_map(mm) & pkey_alloc_mask(pkey))
 
-#define __mm_pkey_is_reserved(pkey) (initial_allocation_mask & \
+#define __mm_pkey_is_reserved(pkey) (reserved_allocation_mask & \
 				       pkey_alloc_mask(pkey))
 
 static inline bool mm_pkey_is_allocated(struct mm_struct *mm, int pkey)
 {
-	/* A reserved key is never considered as 'explicitly allocated' */
-	return ((pkey < arch_max_pkey()) &&
-		!__mm_pkey_is_reserved(pkey) &&
-		__mm_pkey_is_allocated(mm, pkey));
+	if (pkey < 0 || pkey >= arch_max_pkey())
+		return false;
+
+	/* Reserved keys are never allocated. */
+	if (__mm_pkey_is_reserved(pkey))
+		return false;
+
+	return __mm_pkey_is_allocated(mm, pkey);
 }
 
 extern void __arch_activate_pkey(int pkey);
@@ -200,6 +207,16 @@ static inline int arch_set_user_pkey_access(struct task_struct *tsk, int pkey,
 {
 	if (static_branch_likely(&pkey_disabled))
 		return -EINVAL;
+
+	/*
+	 * userspace should not change pkey-0 permissions.
+	 * pkey-0 is associated with every page in the kernel.
+	 * If userspace denies any permission on pkey-0, the
+	 * kernel cannot operate.
+	 */
+	if (pkey == PKEY_0)
+		return init_val ? -EINVAL : 0;
+
 	return __arch_set_user_pkey_access(tsk, pkey, init_val);
 }
 
diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index 0b98db6..1ebb21b 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -14,7 +14,8 @@
 bool pkey_execute_disable_supported;
 int  pkeys_total;		/* Total pkeys as per device tree */
 bool pkeys_devtree_defined;	/* pkey property exported by device tree */
-u32  initial_allocation_mask;	/* Bits set for reserved keys */
+u32  initial_allocation_mask;   /* Bits set for the initially allocated keys */
+u32  reserved_allocation_mask;  /* Bits set for reserved keys */
 u64  pkey_amr_mask;		/* Bits in AMR not to be touched */
 u64  pkey_iamr_mask;		/* Bits in AMR not to be touched */
 u64  pkey_uamor_mask;		/* Bits in UMOR not to be touched */
@@ -121,8 +122,9 @@ int pkey_initialize(void)
 #else
 	os_reserved = 0;
 #endif
-	initial_allocation_mask  = (0x1 << 0) | (0x1 << 1) |
-					(0x1 << EXECUTE_ONLY_KEY);
+	/* Bits are in LE format. */
+	reserved_allocation_mask = (0x1 << 1) | (0x1 << EXECUTE_ONLY_KEY);
+	initial_allocation_mask  = reserved_allocation_mask | (0x1 << PKEY_0);
 
 	/* register mask is in BE format */
 	pkey_amr_mask = ~0x0ul;
@@ -135,7 +137,7 @@ int pkey_initialize(void)
 	pkey_amr_mask |= (AMR_RD_BIT|AMR_WR_BIT) << pkeyshift(EXECUTE_ONLY_KEY);
 
 	pkey_uamor_mask = ~0x0ul;
-	pkey_uamor_mask &= ~(0x3ul << pkeyshift(0));
+	pkey_uamor_mask &= ~(0x3ul << pkeyshift(PKEY_0));
 	/*
 	 * key 1 is recommended not to be used.
 	 * PowerISA(3.0) page 1015,
@@ -381,9 +383,6 @@ static bool pkey_access_permitted(int pkey, bool write, bool execute)
 	int pkey_shift;
 	u64 amr;
 
-	if (!pkey)
-		return true;
-
 	if (!is_pkey_enabled(pkey))
 		return true;
 
-- 
1.7.1

^ permalink raw reply related

* [PATCH  4/5] powerpc/pkeys: Preallocate execute-only key
From: Ram Pai @ 2018-06-05  2:09 UTC (permalink / raw)
  To: mpe
  Cc: linuxppc-dev, dave.hansen, aneesh.kumar, bsingharora, hbabu,
	mhocko, bauerman, linuxram, Ulrich.Weigand, fweimer, luto
In-Reply-To: <1528164594-23823-1-git-send-email-linuxram@us.ibm.com>

execute-only key is allocated dynamically. This is a problem.  When a
thread implicitly creates a execute-only key, and resets UAMOR for that
key, the UAMOR value does not percolate to all the other threads.  Any
other thread may ignorantly change the permissions on the key.  This can
cause the key to be not execute-only for that thread.

Preallocate the execute-only key and ensure that no thread can change
the permission of the key, by resetting the corresponding bit in UAMOR.

CC: Andy Lutomirski <luto@kernel.org>
CC: Florian Weimer <fweimer@redhat.com>
CC: Thiago Jung Bauermann <bauerman@linux.ibm.com>
CC: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 arch/powerpc/mm/pkeys.c |   53 +++++++---------------------------------------
 1 files changed, 8 insertions(+), 45 deletions(-)

diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index 90ab793..0b98db6 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -25,6 +25,7 @@
 #define IAMR_EX_BIT 0x1UL
 #define PKEY_REG_BITS (sizeof(u64)*8)
 #define pkeyshift(pkey) (PKEY_REG_BITS - ((pkey+1) * AMR_BITS_PER_PKEY))
+#define EXECUTE_ONLY_KEY 2
 
 static void scan_pkey_feature(void)
 {
@@ -120,7 +121,8 @@ int pkey_initialize(void)
 #else
 	os_reserved = 0;
 #endif
-	initial_allocation_mask  = (0x1 << 0) | (0x1 << 1);
+	initial_allocation_mask  = (0x1 << 0) | (0x1 << 1) |
+					(0x1 << EXECUTE_ONLY_KEY);
 
 	/* register mask is in BE format */
 	pkey_amr_mask = ~0x0ul;
@@ -130,6 +132,7 @@ int pkey_initialize(void)
 		pkey_amr_mask &= ~(0x3ul << pkeyshift(i));
 		pkey_iamr_mask &= ~(0x1ul << pkeyshift(i));
 	}
+	pkey_amr_mask |= (AMR_RD_BIT|AMR_WR_BIT) << pkeyshift(EXECUTE_ONLY_KEY);
 
 	pkey_uamor_mask = ~0x0ul;
 	pkey_uamor_mask &= ~(0x3ul << pkeyshift(0));
@@ -140,6 +143,8 @@ int pkey_initialize(void)
 	 * pseries kernel running on powerVM.
 	 */
 	pkey_uamor_mask &= ~(0x3ul << pkeyshift(1));
+	pkey_uamor_mask &= ~(0x3ul << pkeyshift(EXECUTE_ONLY_KEY));
+
 	for (i = (pkeys_total - os_reserved); i < pkeys_total; i++)
 		pkey_uamor_mask &= ~(0x3ul << pkeyshift(i));
 
@@ -153,8 +158,7 @@ void pkey_mm_init(struct mm_struct *mm)
 	if (static_branch_likely(&pkey_disabled))
 		return;
 	mm_pkey_allocation_map(mm) = initial_allocation_mask;
-	/* -1 means unallocated or invalid */
-	mm->context.execute_only_pkey = -1;
+	mm->context.execute_only_pkey = EXECUTE_ONLY_KEY;
 }
 
 static inline u64 read_amr(void)
@@ -333,48 +337,7 @@ static inline bool pkey_allows_readwrite(int pkey)
 
 int __execute_only_pkey(struct mm_struct *mm)
 {
-	bool need_to_set_mm_pkey = false;
-	int execute_only_pkey = mm->context.execute_only_pkey;
-	int ret;
-
-	/* Do we need to assign a pkey for mm's execute-only maps? */
-	if (execute_only_pkey == -1) {
-		/* Go allocate one to use, which might fail */
-		execute_only_pkey = mm_pkey_alloc(mm);
-		if (execute_only_pkey < 0)
-			return -1;
-		need_to_set_mm_pkey = true;
-	}
-
-	/*
-	 * We do not want to go through the relatively costly dance to set AMR
-	 * if we do not need to. Check it first and assume that if the
-	 * execute-only pkey is readwrite-disabled than we do not have to set it
-	 * ourselves.
-	 */
-	if (!need_to_set_mm_pkey && !pkey_allows_readwrite(execute_only_pkey))
-		return execute_only_pkey;
-
-	/*
-	 * Set up AMR so that it denies access for everything other than
-	 * execution.
-	 */
-	ret = __arch_set_user_pkey_access(current, execute_only_pkey,
-					  PKEY_DISABLE_ACCESS |
-					  PKEY_DISABLE_WRITE);
-	/*
-	 * If the AMR-set operation failed somehow, just return 0 and
-	 * effectively disable execute-only support.
-	 */
-	if (ret) {
-		mm_pkey_free(mm, execute_only_pkey);
-		return -1;
-	}
-
-	/* We got one, store it and use it from here on out */
-	if (need_to_set_mm_pkey)
-		mm->context.execute_only_pkey = execute_only_pkey;
-	return execute_only_pkey;
+	return mm->context.execute_only_pkey;
 }
 
 static inline bool vma_is_pkey_exec_only(struct vm_area_struct *vma)
-- 
1.7.1

^ permalink raw reply related

* [PATCH  3/5] powerpc/pkeys: fix calculation of total pkeys.
From: Ram Pai @ 2018-06-05  2:09 UTC (permalink / raw)
  To: mpe
  Cc: linuxppc-dev, dave.hansen, aneesh.kumar, bsingharora, hbabu,
	mhocko, bauerman, linuxram, Ulrich.Weigand, fweimer, luto
In-Reply-To: <1528164594-23823-1-git-send-email-linuxram@us.ibm.com>

Total number of pkeys calculation is off by 1. Fix it.

CC: Florian Weimer <fweimer@redhat.com>
CC: Michael Ellerman <mpe@ellerman.id.au>
CC: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 arch/powerpc/mm/pkeys.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index 6fc56f4..90ab793 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -92,7 +92,7 @@ int pkey_initialize(void)
 	 * arch-neutral code.
 	 */
 	pkeys_total = min_t(int, pkeys_total,
-			(ARCH_VM_PKEY_FLAGS >> VM_PKEY_SHIFT));
+			((ARCH_VM_PKEY_FLAGS >> VM_PKEY_SHIFT)+1));
 
 	if (!pkey_mmu_enabled() || radix_enabled() || !pkeys_total)
 		static_branch_enable(&pkey_disabled);
-- 
1.7.1

^ permalink raw reply related

* [PATCH  2/5] powerpc/pkeys: Save the pkey registers before fork
From: Ram Pai @ 2018-06-05  2:09 UTC (permalink / raw)
  To: mpe
  Cc: linuxppc-dev, dave.hansen, aneesh.kumar, bsingharora, hbabu,
	mhocko, bauerman, linuxram, Ulrich.Weigand, fweimer, luto
In-Reply-To: <1528164594-23823-1-git-send-email-linuxram@us.ibm.com>

When a thread forks the contents of AMR, IAMR, UAMOR registers in the
newly forked thread are not inherited.

Save the registers before forking, for content of those
registers to be automatically copied into the new thread.

CC: Michael Ellerman <mpe@ellerman.id.au>
CC: Florian Weimer <fweimer@redhat.com>
CC: Andy Lutomirski <luto@kernel.org>
CC: Thiago Jung Bauermann <bauerman@linux.ibm.com>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 arch/powerpc/kernel/process.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 1237f13..999dd08 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -582,6 +582,7 @@ static void save_all(struct task_struct *tsk)
 		__giveup_spe(tsk);
 
 	msr_check_and_clear(msr_all_available);
+	thread_pkey_regs_save(&tsk->thread);
 }
 
 void flush_all_to_thread(struct task_struct *tsk)
-- 
1.7.1

^ permalink raw reply related

* [PATCH 1/5] powerpc/pkeys: Enable all user-allocatable pkeys at init.
From: Ram Pai @ 2018-06-05  2:09 UTC (permalink / raw)
  To: mpe
  Cc: linuxppc-dev, dave.hansen, aneesh.kumar, bsingharora, hbabu,
	mhocko, bauerman, linuxram, Ulrich.Weigand, fweimer, luto
In-Reply-To: <1528164594-23823-1-git-send-email-linuxram@us.ibm.com>

In a multithreaded application, a key allocated by one thread must
be activate and usable on all threads.

Currently this is not the case, because the UAMOR bits for all keys are
disabled by default. When a new key is allocated in one thread, though
the corresponding UAMOR bits for that thread get enabled, the UAMOR bits
for all other existing threads continue to have their bits disabled.
Other threads have no way to set permissions on the key, effectively
making the key useless.

Enable the UAMOR bits for all keys, at process creation. Since the
contents of UAMOR are inherited at fork, all threads are capable of
modifying the permissions on any key.

BTW: changing the permission on unallocated keys has no effect, till
those keys are not associated with any PTEs. The kernel will anyway
disallow to association of unallocated keys with PTEs.

CC: Andy Lutomirski <luto@kernel.org>
CC: Florian Weimer <fweimer@redhat.com>
CC: Thiago Jung Bauermann <bauerman@linux.ibm.com>
CC: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Ram Pai <linuxram@us.ibm.com>
---
 arch/powerpc/mm/pkeys.c |   47 +++++++++++++++++++++++++++++------------------
 1 files changed, 29 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index 0eafdf0..6fc56f4 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -15,8 +15,9 @@
 int  pkeys_total;		/* Total pkeys as per device tree */
 bool pkeys_devtree_defined;	/* pkey property exported by device tree */
 u32  initial_allocation_mask;	/* Bits set for reserved keys */
-u64  pkey_amr_uamor_mask;	/* Bits in AMR/UMOR not to be touched */
+u64  pkey_amr_mask;		/* Bits in AMR not to be touched */
 u64  pkey_iamr_mask;		/* Bits in AMR not to be touched */
+u64  pkey_uamor_mask;		/* Bits in UMOR not to be touched */
 
 #define AMR_BITS_PER_PKEY 2
 #define AMR_RD_BIT 0x1UL
@@ -119,20 +120,29 @@ int pkey_initialize(void)
 #else
 	os_reserved = 0;
 #endif
-	initial_allocation_mask = ~0x0;
-	pkey_amr_uamor_mask = ~0x0ul;
+	initial_allocation_mask  = (0x1 << 0) | (0x1 << 1);
+
+	/* register mask is in BE format */
+	pkey_amr_mask = ~0x0ul;
 	pkey_iamr_mask = ~0x0ul;
-	/*
-	 * key 0, 1 are reserved.
-	 * key 0 is the default key, which allows read/write/execute.
-	 * key 1 is recommended not to be used. PowerISA(3.0) page 1015,
-	 * programming note.
-	 */
-	for (i = 2; i < (pkeys_total - os_reserved); i++) {
-		initial_allocation_mask &= ~(0x1 << i);
-		pkey_amr_uamor_mask &= ~(0x3ul << pkeyshift(i));
+
+	for (i = 0; i < (pkeys_total - os_reserved); i++) {
+		pkey_amr_mask &= ~(0x3ul << pkeyshift(i));
 		pkey_iamr_mask &= ~(0x1ul << pkeyshift(i));
 	}
+
+	pkey_uamor_mask = ~0x0ul;
+	pkey_uamor_mask &= ~(0x3ul << pkeyshift(0));
+	/*
+	 * key 1 is recommended not to be used.
+	 * PowerISA(3.0) page 1015,
+	 * @TODO: Revisit this. This is only applicable on
+	 * pseries kernel running on powerVM.
+	 */
+	pkey_uamor_mask &= ~(0x3ul << pkeyshift(1));
+	for (i = (pkeys_total - os_reserved); i < pkeys_total; i++)
+		pkey_uamor_mask &= ~(0x3ul << pkeyshift(i));
+
 	return 0;
 }
 
@@ -289,9 +299,6 @@ void thread_pkey_regs_restore(struct thread_struct *new_thread,
 	if (static_branch_likely(&pkey_disabled))
 		return;
 
-	/*
-	 * TODO: Just set UAMOR to zero if @new_thread hasn't used any keys yet.
-	 */
 	if (old_thread->amr != new_thread->amr)
 		write_amr(new_thread->amr);
 	if (old_thread->iamr != new_thread->iamr)
@@ -305,9 +312,13 @@ void thread_pkey_regs_init(struct thread_struct *thread)
 	if (static_branch_likely(&pkey_disabled))
 		return;
 
-	thread->amr = read_amr() & pkey_amr_uamor_mask;
-	thread->iamr = read_iamr() & pkey_iamr_mask;
-	thread->uamor = read_uamor() & pkey_amr_uamor_mask;
+	thread->amr = pkey_amr_mask;
+	thread->iamr = pkey_iamr_mask;
+ 	thread->uamor = pkey_uamor_mask;
+
+	write_uamor(pkey_uamor_mask);
+	write_amr(pkey_amr_mask);
+	write_iamr(pkey_iamr_mask);
 }
 
 static inline bool pkey_allows_readwrite(int pkey)
-- 
1.7.1

^ permalink raw reply related

* [PATCH  0/5] powerpc/pkeys: fixes to pkeys
From: Ram Pai @ 2018-06-05  2:09 UTC (permalink / raw)
  To: mpe
  Cc: linuxppc-dev, dave.hansen, aneesh.kumar, bsingharora, hbabu,
	mhocko, bauerman, linuxram, Ulrich.Weigand, fweimer, luto

Assortment of fixes to pkey.

Patch 1  makes pkey consumable in multithreaded applications.

Patch 2  fixes fork behavior to inherit the key attributes.

Patch 3  A off-by-one bug made one key unusable. Fixes it.

Patch 4  Makes pkey-0 less special.

Ram Pai (5):
  powerpc/pkeys: Enable all user-allocatable pkeys at init.
  powerpc/pkeys: Save the pkey registers before fork
  powerpc/pkeys: fix calculation of total pkeys.
  powerpc/pkeys: Preallocate execute-only key
  powerpc/pkeys: make protection key 0 less special

 arch/powerpc/include/asm/pkeys.h |   29 ++++++++--
 arch/powerpc/kernel/process.c    |    1 +
 arch/powerpc/mm/pkeys.c          |  107 ++++++++++++++------------------------
 3 files changed, 64 insertions(+), 73 deletions(-)

^ permalink raw reply

* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: David Gibson @ 2018-06-05  1:52 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Michael S. Tsirkin, Anshuman Khandual, virtualization,
	linux-kernel, linuxppc-dev, aik, robh, joe, elfring, jasowang,
	mpe, hch
In-Reply-To: <d5df613d6347fe2f9bb6ea65bc6f6be05650ca6f.camel@kernel.crashing.org>

[-- Attachment #1: Type: text/plain, Size: 1720 bytes --]

On Mon, Jun 04, 2018 at 07:48:54PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2018-06-04 at 18:57 +1000, David Gibson wrote:
> > 
> > > - First qemu doesn't know that the guest will switch to "secure mode"
> > > in advance. There is no difference between a normal and a secure
> > > partition until the partition does the magic UV call to "enter secure
> > > mode" and qemu doesn't see any of it. So who can set the flag here ?
> > 
> > This seems weird to me.  As a rule HV calls should go through qemu -
> > or be allowed to go directly to KVM *by* qemu.
> 
> It's not an HV call, it's a UV call, qemu won't see it, qemu isn't
> trusted. Now the UV *will* reflect that to the HV via some synthetized
> HV calls, and we *could* have those do a pass by qemu, however, so far,
> our entire design doesn't rely on *any* qemu knowledge whatsoever and
> it would be sad to add it just for that purpose.
> 
> Additionally, this is rather orthogonal, see my other email, the
> problem we are trying to solve is *not* a qemu problem and it doesn't
> make sense to leak that into qemu.
> 
> >   We generally reserve
> > the latter for hot path things.  Since this isn't a hot path, having
> > the call handled directly by the kernel seems wrong.
> >
> > Unless a "UV call" is something different I don't know about.
> 
> Yes, a UV call goes to the Ultravisor, not the Hypervisor. The
> Hypervisor isn't trusted.

Ah, right.  Is that implemented in the host kernel, or in something
further above?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Michael S. Tsirkin @ 2018-06-05  1:25 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Anshuman Khandual, virtualization, linux-kernel, linuxppc-dev,
	aik, robh, joe, elfring, david, jasowang, mpe, hch
In-Reply-To: <6164442a718645a754879eac5c4c5fad9283c211.camel@kernel.crashing.org>

On Tue, Jun 05, 2018 at 09:26:56AM +1000, Benjamin Herrenschmidt wrote:
> I would like to keep however the ability to bypass the iommu for
> performance reasons

So that's easy, clear the IOMMU flag and this means "bypass the IOMMU".

-- 
MST

^ permalink raw reply

* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Benjamin Herrenschmidt @ 2018-06-04 23:26 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Anshuman Khandual, virtualization, linux-kernel, linuxppc-dev,
	aik, robh, joe, elfring, david, jasowang, mpe, hch
In-Reply-To: <20180604184035-mutt-send-email-mst@kernel.org>

On Mon, 2018-06-04 at 19:21 +0300, Michael S. Tsirkin wrote:
> 
> > > > - First qemu doesn't know that the guest will switch to "secure mode"
> > > > in advance. There is no difference between a normal and a secure
> > > > partition until the partition does the magic UV call to "enter secure
> > > > mode" and qemu doesn't see any of it. So who can set the flag here ?
> > > 
> > > The user should set it. You just tell user "to be able to use with
> > > feature X, enable IOMMU".
> > 
> > That's completely backwards. The user has no idea what that stuff is.
> > And it would have to percolate all the way up the management stack,
> > libvirt, kimchi, whatever else ... that's just nonsense.
> > 
> > Especially since, as I explained in my other email, this is *not* a
> > qemu problem and thus the solution shouldn't be messing around with
> > qemu.
> 
> virtio is implemented in qemu though. If you prefer to stick
> all your code in either guest or the UV that's your decision
> but it looks like qemu could be helpful here.

Sorry Michael, that doesn't click. Yes of course virtio is implemented
in qemu, but the problem we are trying to solve is *not* a qemu problem
(the fact that the Linux drivers bypass the DMA API is wrong, needs
fixing, and isnt a qemu problem). The fact that the secure guests need
bounce buffering is not a qemu problem either.

Whether qemu chose to use an iommu or not is, and should remain an
orthogonal problem.

Forcing qemu to use the iommu to work around a linux side lack of
proper use of the DMA API is not only just papering over the problem,
it's also forcing changes up 3 or 4 levels of the SW stack to create
that new option that no user will understand the meaning of and that
would otherwise be unnecessary.

> For example what if you have a guest that passes physical addresses
> to qemu bypassing swiotlb? Don't you want to detect
> that and fail gracefully rather than crash the guest?

A guest bug then ? Well it wouldn't so much crash as force the pages to
become encrypted and cause horrible ping/pong between qemu and the
guest (the secure pages aren't accessible to qemu directly).

> That's what VIRTIO_F_IOMMU_PLATFORM will do for you.

Again this is orthogonal. Using an iommu will indeed provide a modicum
of protection against buggy drivers, like it does on HW PCI platforms,
whether those guests are secure or not.

Note however that in practice, we tend to disable the iommu even on
real HW whenever we want performance (of course we can't for guests but
for bare metal systems we do, the added RAS isn't worth the performance
lost for very fast networking for example).

> Still that's hypervisor's decision. What isn't up to the hypervisor is
> the way we structure code. We made an early decision to merge a hack
> with xen, among discussion about how with time DMA API will learn to
> support per-device quirks and we'll be able to switch to that.
> So let's do that now?

The DMA API itself isn't the one that needs to learn "per-device
quirks", it's just plumbing into arch backends. The "quirk" is at the
point of establishing the backend for a given device.

We can go a good way down that path simply by having virtio in Linux
start with putting *itself* its own direct ops in there when
VIRTIO_F_IOMMU_PLATFORM is not set, and removing all the special casing
in the rest of the driver.

Once that's done, we have a single point of establishing the dma ops,
we can quirk in there if needed, that's rather nicely contained, or put
an arch hook, or whatever is necessary.

I would like to keep however the ability to bypass the iommu for
performance reasons, and also because it's qemu default mode of
operation and my secure guest has no clean way to force qemu to turn
the iommu on. The hypervisor *could* return something to qemu when the
guest switch to secure as we do know that, and qemu could walk all of
it's virtio devices as a result and "switch" them over but that's
almost grosser from a qemu perspective.

.../...

> > The point is that requiring specific qemu command line arguments isn't
> > going to fly. We have additional problems due to the fact that our
> > firmware (SLOF) inside qemu doesn't currently deal with iommu's etc...
> > though those can be fixed.
> > 
> > Overall, however, this seems to be the most convoluted way of achieving
> > things, require user interventions where none should be needed etc...
> > 
> > Again, what's wrong with a 2 lines hook instead that solves it all and
> > completely avoids involving qemu ?
> > 
> > Ben.
> 
> That each platform wants to add hacks in this data path function.

Sure, then add a single platform hook and the platforms can do what
they want here.

But as I said, it should all be done at initialization time rather than
in the data path, this we absolutely agree. We should just chose the
right set of dma_ops, and have the data path always use the DMA API.

Cheers,
Ben.


> > > 
> > > > > 
> > > > > 
> > > > > >  arch/powerpc/include/asm/dma-mapping.h |  6 ++++++
> > > > > >  arch/powerpc/platforms/pseries/iommu.c | 11 +++++++++++
> > > > > >  drivers/virtio/virtio_ring.c           | 10 ++++++++++
> > > > > >  3 files changed, 27 insertions(+)
> > > > > > 
> > > > > > diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
> > > > > > index 8fa3945..056e578 100644
> > > > > > --- a/arch/powerpc/include/asm/dma-mapping.h
> > > > > > +++ b/arch/powerpc/include/asm/dma-mapping.h
> > > > > > @@ -115,4 +115,10 @@ extern u64 __dma_get_required_mask(struct device *dev);
> > > > > >  #define ARCH_HAS_DMA_MMAP_COHERENT
> > > > > >  
> > > > > >  #endif /* __KERNEL__ */
> > > > > > +
> > > > > > +#define platform_forces_virtio_dma platform_forces_virtio_dma
> > > > > > +
> > > > > > +struct virtio_device;
> > > > > > +
> > > > > > +extern bool platform_forces_virtio_dma(struct virtio_device *vdev);
> > > > > >  #endif	/* _ASM_DMA_MAPPING_H */
> > > > > > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> > > > > > index 06f0296..a2ec15a 100644
> > > > > > --- a/arch/powerpc/platforms/pseries/iommu.c
> > > > > > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > > > > > @@ -38,6 +38,7 @@
> > > > > >  #include <linux/of.h>
> > > > > >  #include <linux/iommu.h>
> > > > > >  #include <linux/rculist.h>
> > > > > > +#include <linux/virtio.h>
> > > > > >  #include <asm/io.h>
> > > > > >  #include <asm/prom.h>
> > > > > >  #include <asm/rtas.h>
> > > > > > @@ -1396,3 +1397,13 @@ static int __init disable_multitce(char *str)
> > > > > >  __setup("multitce=", disable_multitce);
> > > > > >  
> > > > > >  machine_subsys_initcall_sync(pseries, tce_iommu_bus_notifier_init);
> > > > > > +
> > > > > > +bool platform_forces_virtio_dma(struct virtio_device *vdev)
> > > > > > +{
> > > > > > +	/*
> > > > > > +	 * On protected guest platforms, force virtio core to use DMA
> > > > > > +	 * MAP API for all virtio devices. But there can also be some
> > > > > > +	 * exceptions for individual devices like virtio balloon.
> > > > > > +	 */
> > > > > > +	return (of_find_compatible_node(NULL, NULL, "ibm,ultravisor") != NULL);
> > > > > > +}
> > > > > 
> > > > > Isn't this kind of slow?  vring_use_dma_api is on
> > > > > data path and supposed to be very fast.
> > > > > 
> > > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > > index 21d464a..47ea6c3 100644
> > > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > > @@ -141,8 +141,18 @@ struct vring_virtqueue {
> > > > > >   * unconditionally on data path.
> > > > > >   */
> > > > > >  
> > > > > > +#ifndef platform_forces_virtio_dma
> > > > > > +static inline bool platform_forces_virtio_dma(struct virtio_device *vdev)
> > > > > > +{
> > > > > > +	return false;
> > > > > > +}
> > > > > > +#endif
> > > > > > +
> > > > > >  static bool vring_use_dma_api(struct virtio_device *vdev)
> > > > > >  {
> > > > > > +	if (platform_forces_virtio_dma(vdev))
> > > > > > +		return true;
> > > > > > +
> > > > > >  	if (!virtio_has_iommu_quirk(vdev))
> > > > > >  		return true;
> > > > > >  
> > > > > > -- 
> > > > > > 2.9.3

^ permalink raw reply

* Re: [PATCH net-next] wan/fsl_ucc_hdlc: use dma_zalloc_coherent instead of allocator/memset
From: David Miller @ 2018-06-04 21:28 UTC (permalink / raw)
  To: yuehaibing; +Cc: qiang.zhao, netdev, linux-kernel, linuxppc-dev
In-Reply-To: <20180604130759.25024-1-yuehaibing@huawei.com>

From: YueHaibing <yuehaibing@huawei.com>
Date: Mon, 4 Jun 2018 21:07:59 +0800

> Use dma_zalloc_coherent instead of dma_alloc_coherent
> followed by memset 0.
> 
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>

Applied.

^ permalink raw reply

* Re: pkeys on POWER: Access rights not reset on execve
From: Florian Weimer @ 2018-06-04 21:00 UTC (permalink / raw)
  To: Ram Pai; +Cc: Andy Lutomirski, Linux-MM, linuxppc-dev, Dave Hansen
In-Reply-To: <20180604190229.GB10088@ram.oc3035372033.ibm.com>

On 06/04/2018 09:02 PM, Ram Pai wrote:
> On Mon, Jun 04, 2018 at 07:57:46PM +0200, Florian Weimer wrote:
>> On 06/04/2018 04:01 PM, Ram Pai wrote:
>>> On Mon, Jun 04, 2018 at 12:12:07PM +0200, Florian Weimer wrote:
>>>> On 06/03/2018 10:18 PM, Ram Pai wrote:
>>>>> On Mon, May 21, 2018 at 01:29:11PM +0200, Florian Weimer wrote:
>>>>>> On 05/20/2018 09:11 PM, Ram Pai wrote:
>>>>>>> Florian,
>>>>>>>
>>>>>>> 	Does the following patch fix the problem for you?  Just like x86
>>>>>>> 	I am enabling all keys in the UAMOR register during
>>>>>>> 	initialization itself. Hence any key created by any thread at
>>>>>>> 	any time, will get activated on all threads. So any thread
>>>>>>> 	can change the permission on that key. Smoke tested it
>>>>>>> 	with your test program.
>>>>>>
>>>>>> I think this goes in the right direction, but the AMR value after
>>>>>> fork is still strange:
>>>>>>
>>>>>> AMR (PID 34912): 0x0000000000000000
>>>>>> AMR after fork (PID 34913): 0x0000000000000000
>>>>>> AMR (PID 34913): 0x0000000000000000
>>>>>> Allocated key in subprocess (PID 34913): 2
>>>>>> Allocated key (PID 34912): 2
>>>>>> Setting AMR: 0xffffffffffffffff
>>>>>> New AMR value (PID 34912): 0x0fffffffffffffff
>>>>>> About to call execl (PID 34912) ...
>>>>>> AMR (PID 34912): 0x0fffffffffffffff
>>>>>> AMR after fork (PID 34914): 0x0000000000000003
>>>>>> AMR (PID 34914): 0x0000000000000003
>>>>>> Allocated key in subprocess (PID 34914): 2
>>>>>> Allocated key (PID 34912): 2
>>>>>> Setting AMR: 0xffffffffffffffff
>>>>>> New AMR value (PID 34912): 0x0fffffffffffffff
>>>>>>
>>>>>> I mean this line:
>>>>>>
>>>>>> AMR after fork (PID 34914): 0x0000000000000003
>>>>>>
>>>>>> Shouldn't it be the same as in the parent process?
>>>>>
>>>>> Fixed it. Please try this patch. If it all works to your satisfaction, I
>>>>> will clean it up further and send to Michael Ellermen(ppc maintainer).
>>>>>
>>>>>
>>>>> commit 51f4208ed5baeab1edb9b0f8b68d7144449b3527
>>>>> Author: Ram Pai <linuxram@us.ibm.com>
>>>>> Date:   Sun Jun 3 14:44:32 2018 -0500
>>>>>
>>>>>      Fix for the fork bug.
>>>>>      Signed-off-by: Ram Pai <linuxram@us.ibm.com>
>>>>
>>>> Is this on top of the previous patch, or a separate fix?
>>>
>>> top of previous patch.
>>
>> Thanks.  With this patch, I get this on an LPAR:
>>
>> AMR (PID 1876): 0x0000000000000003
>> AMR after fork (PID 1877): 0x0000000000000003
>> AMR (PID 1877): 0x0000000000000003
>> Allocated key in subprocess (PID 1877): 2
>> Allocated key (PID 1876): 2
>> Setting AMR: 0xffffffffffffffff
>> New AMR value (PID 1876): 0x0fffffffffffffff
>> About to call execl (PID 1876) ...
>> AMR (PID 1876): 0x0000000000000003
>> AMR after fork (PID 1878): 0x0000000000000003
>> AMR (PID 1878): 0x0000000000000003
>> Allocated key in subprocess (PID 1878): 2
>> Allocated key (PID 1876): 2
>> Setting AMR: 0xffffffffffffffff
>> New AMR value (PID 1876): 0x0fffffffffffffff
>>
>> Test program is still this one:
>>
>> <https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-May/173198.html>
>>
>> So the process starts out with a different AMR value for some
>> reason. That could be a pre-existing bug that was just hidden by the
>> reset-to-zero on fork, or it could be intentional.  But the kernel
> 
> yes it is a bug, a patch for which is lined up for submission..
> 
> The fix is
> 
> 
> commit eaf5b2ac002ad2f5bca118d7ce075ce28311aa8e
> Author: Ram Pai <linuxram@us.ibm.com>
> Date:   Mon Jun 4 10:58:44 2018 -0500
> 
>      powerpc/pkeys: fix total pkeys calculation
>      
>      Total number of pkeys calculation is off by 1. Fix it.
>      
>      Signed-off-by: Ram Pai <linuxram@us.ibm.com>

Looks good to me now.  Initial AMR value is zero, as is currently intended.

So the remaining question at this point is whether the Intel behavior 
(default-deny instead of default-allow) is preferable.

But if you can get the existing fixes into 4.18 and perhaps the relevant 
stable kernels, that would already be a great help for my glibc work.

Thanks,
Florian

^ permalink raw reply

* Re: [1/5] powerpc/embedded6xx: Remove C2K board support
From: Mark Greer @ 2018-06-04 20:50 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Mark Greer, Benjamin Herrenschmidt, Paul Mackerras, Remi Machet,
	Dale Farnsworth, linuxppc-dev, linux-kernel
In-Reply-To: <40zxfb5H0xz9s2S@ozlabs.org>

On Tue, Jun 05, 2018 at 12:10:31AM +1000, Michael Ellerman wrote:
> On Fri, 2018-04-06 at 01:17:16 UTC, Mark Greer wrote:
> > The C2K platform appears to be orphaned so remove code supporting it.
> > 
> > CC: Remi Machet <rmachet@nvidia.com>
> > Signed-off-by: Mark Greer <mgreer@animalcreek.com>
> > Acked-by: Remi Machet <remi@machet.us>
> > Signed-off-by: Mark Greer <mgreer@animalcreek.com>
> 
> Series applied to powerpc next, thanks.
> 
> https://git.kernel.org/powerpc/c/92c8c16f345759e87c5d5b771d438f

Thanks Michael.

Mark
--

^ permalink raw reply

* Re: pkeys on POWER: Access rights not reset on execve
From: Ram Pai @ 2018-06-04 19:02 UTC (permalink / raw)
  To: Florian Weimer; +Cc: Andy Lutomirski, Linux-MM, linuxppc-dev, Dave Hansen
In-Reply-To: <f2f61c24-8e8f-0d36-4e22-196a2a3f7ca7@redhat.com>

On Mon, Jun 04, 2018 at 07:57:46PM +0200, Florian Weimer wrote:
> On 06/04/2018 04:01 PM, Ram Pai wrote:
> >On Mon, Jun 04, 2018 at 12:12:07PM +0200, Florian Weimer wrote:
> >>On 06/03/2018 10:18 PM, Ram Pai wrote:
> >>>On Mon, May 21, 2018 at 01:29:11PM +0200, Florian Weimer wrote:
> >>>>On 05/20/2018 09:11 PM, Ram Pai wrote:
> >>>>>Florian,
> >>>>>
> >>>>>	Does the following patch fix the problem for you?  Just like x86
> >>>>>	I am enabling all keys in the UAMOR register during
> >>>>>	initialization itself. Hence any key created by any thread at
> >>>>>	any time, will get activated on all threads. So any thread
> >>>>>	can change the permission on that key. Smoke tested it
> >>>>>	with your test program.
> >>>>
> >>>>I think this goes in the right direction, but the AMR value after
> >>>>fork is still strange:
> >>>>
> >>>>AMR (PID 34912): 0x0000000000000000
> >>>>AMR after fork (PID 34913): 0x0000000000000000
> >>>>AMR (PID 34913): 0x0000000000000000
> >>>>Allocated key in subprocess (PID 34913): 2
> >>>>Allocated key (PID 34912): 2
> >>>>Setting AMR: 0xffffffffffffffff
> >>>>New AMR value (PID 34912): 0x0fffffffffffffff
> >>>>About to call execl (PID 34912) ...
> >>>>AMR (PID 34912): 0x0fffffffffffffff
> >>>>AMR after fork (PID 34914): 0x0000000000000003
> >>>>AMR (PID 34914): 0x0000000000000003
> >>>>Allocated key in subprocess (PID 34914): 2
> >>>>Allocated key (PID 34912): 2
> >>>>Setting AMR: 0xffffffffffffffff
> >>>>New AMR value (PID 34912): 0x0fffffffffffffff
> >>>>
> >>>>I mean this line:
> >>>>
> >>>>AMR after fork (PID 34914): 0x0000000000000003
> >>>>
> >>>>Shouldn't it be the same as in the parent process?
> >>>
> >>>Fixed it. Please try this patch. If it all works to your satisfaction,=
 I
> >>>will clean it up further and send to Michael Ellermen(ppc maintainer).
> >>>
> >>>
> >>>commit 51f4208ed5baeab1edb9b0f8b68d7144449b3527
> >>>Author: Ram Pai <linuxram@us.ibm.com>
> >>>Date:   Sun Jun 3 14:44:32 2018 -0500
> >>>
> >>>     Fix for the fork bug.
> >>>     Signed-off-by: Ram Pai <linuxram@us.ibm.com>
> >>
> >>Is this on top of the previous patch, or a separate fix?
> >
> >top of previous patch.
>=20
> Thanks.  With this patch, I get this on an LPAR:
>=20
> AMR (PID 1876): 0x0000000000000003
> AMR after fork (PID 1877): 0x0000000000000003
> AMR (PID 1877): 0x0000000000000003
> Allocated key in subprocess (PID 1877): 2
> Allocated key (PID 1876): 2
> Setting AMR: 0xffffffffffffffff
> New AMR value (PID 1876): 0x0fffffffffffffff
> About to call execl (PID 1876) ...
> AMR (PID 1876): 0x0000000000000003
> AMR after fork (PID 1878): 0x0000000000000003
> AMR (PID 1878): 0x0000000000000003
> Allocated key in subprocess (PID 1878): 2
> Allocated key (PID 1876): 2
> Setting AMR: 0xffffffffffffffff
> New AMR value (PID 1876): 0x0fffffffffffffff
>=20
> Test program is still this one:
>=20
> <https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-May/173198.html>
>=20
> So the process starts out with a different AMR value for some
> reason. That could be a pre-existing bug that was just hidden by the
> reset-to-zero on fork, or it could be intentional.  But the kernel

yes it is a bug, a patch for which is lined up for submission..

The fix is


commit eaf5b2ac002ad2f5bca118d7ce075ce28311aa8e
Author: Ram Pai <linuxram@us.ibm.com>
Date:   Mon Jun 4 10:58:44 2018 -0500

    powerpc/pkeys: fix total pkeys calculation
=20=20=20=20
    Total number of pkeys calculation is off by 1. Fix it.
=20=20=20=20
    Signed-off-by: Ram Pai <linuxram@us.ibm.com>

diff --git a/arch/powerpc/mm/pkeys.c b/arch/powerpc/mm/pkeys.c
index 4530cdf..3384c4e 100644
--- a/arch/powerpc/mm/pkeys.c
+++ b/arch/powerpc/mm/pkeys.c
@@ -93,7 +93,7 @@ int pkey_initialize(void)
 	 * arch-neutral code.
 	 */
 	pkeys_total =3D min_t(int, pkeys_total,
-			(ARCH_VM_PKEY_FLAGS >> VM_PKEY_SHIFT));
+			((ARCH_VM_PKEY_FLAGS >> VM_PKEY_SHIFT)+1));
=20
 	if (!pkey_mmu_enabled() || radix_enabled() || !pkeys_total)
 		static_branch_enable(&pkey_disabled);

^ permalink raw reply related

* Re: pkeys on POWER: Access rights not reset on execve
From: Florian Weimer @ 2018-06-04 17:57 UTC (permalink / raw)
  To: Ram Pai; +Cc: Andy Lutomirski, Linux-MM, linuxppc-dev, Dave Hansen
In-Reply-To: <20180604140135.GA10088@ram.oc3035372033.ibm.com>

On 06/04/2018 04:01 PM, Ram Pai wrote:
> On Mon, Jun 04, 2018 at 12:12:07PM +0200, Florian Weimer wrote:
>> On 06/03/2018 10:18 PM, Ram Pai wrote:
>>> On Mon, May 21, 2018 at 01:29:11PM +0200, Florian Weimer wrote:
>>>> On 05/20/2018 09:11 PM, Ram Pai wrote:
>>>>> Florian,
>>>>>
>>>>> 	Does the following patch fix the problem for you?  Just like x86
>>>>> 	I am enabling all keys in the UAMOR register during
>>>>> 	initialization itself. Hence any key created by any thread at
>>>>> 	any time, will get activated on all threads. So any thread
>>>>> 	can change the permission on that key. Smoke tested it
>>>>> 	with your test program.
>>>>
>>>> I think this goes in the right direction, but the AMR value after
>>>> fork is still strange:
>>>>
>>>> AMR (PID 34912): 0x0000000000000000
>>>> AMR after fork (PID 34913): 0x0000000000000000
>>>> AMR (PID 34913): 0x0000000000000000
>>>> Allocated key in subprocess (PID 34913): 2
>>>> Allocated key (PID 34912): 2
>>>> Setting AMR: 0xffffffffffffffff
>>>> New AMR value (PID 34912): 0x0fffffffffffffff
>>>> About to call execl (PID 34912) ...
>>>> AMR (PID 34912): 0x0fffffffffffffff
>>>> AMR after fork (PID 34914): 0x0000000000000003
>>>> AMR (PID 34914): 0x0000000000000003
>>>> Allocated key in subprocess (PID 34914): 2
>>>> Allocated key (PID 34912): 2
>>>> Setting AMR: 0xffffffffffffffff
>>>> New AMR value (PID 34912): 0x0fffffffffffffff
>>>>
>>>> I mean this line:
>>>>
>>>> AMR after fork (PID 34914): 0x0000000000000003
>>>>
>>>> Shouldn't it be the same as in the parent process?
>>>
>>> Fixed it. Please try this patch. If it all works to your satisfaction, I
>>> will clean it up further and send to Michael Ellermen(ppc maintainer).
>>>
>>>
>>> commit 51f4208ed5baeab1edb9b0f8b68d7144449b3527
>>> Author: Ram Pai <linuxram@us.ibm.com>
>>> Date:   Sun Jun 3 14:44:32 2018 -0500
>>>
>>>      Fix for the fork bug.
>>>      Signed-off-by: Ram Pai <linuxram@us.ibm.com>
>>
>> Is this on top of the previous patch, or a separate fix?
> 
> top of previous patch.

Thanks.  With this patch, I get this on an LPAR:

AMR (PID 1876): 0x0000000000000003
AMR after fork (PID 1877): 0x0000000000000003
AMR (PID 1877): 0x0000000000000003
Allocated key in subprocess (PID 1877): 2
Allocated key (PID 1876): 2
Setting AMR: 0xffffffffffffffff
New AMR value (PID 1876): 0x0fffffffffffffff
About to call execl (PID 1876) ...
AMR (PID 1876): 0x0000000000000003
AMR after fork (PID 1878): 0x0000000000000003
AMR (PID 1878): 0x0000000000000003
Allocated key in subprocess (PID 1878): 2
Allocated key (PID 1876): 2
Setting AMR: 0xffffffffffffffff
New AMR value (PID 1876): 0x0fffffffffffffff

Test program is still this one:

<https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-May/173198.html>

So the process starts out with a different AMR value for some reason. 
That could be a pre-existing bug that was just hidden by the 
reset-to-zero on fork, or it could be intentional.  But the kernel code 
does not indicate that key 63 is reserved (POWER numbers keys from the 
MSB to the LSB).

But it looks like we are finally getting somewhere. 8-)

Thanks,
Florian

^ permalink raw reply

* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Michael S. Tsirkin @ 2018-06-04 16:34 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Christoph Hellwig, Anshuman Khandual, virtualization,
	linux-kernel, linuxppc-dev, aik, robh, joe, elfring, david,
	jasowang, mpe
In-Reply-To: <acdfef1327f73f6ac67645d9f1a8e9204a0f22fb.camel@kernel.crashing.org>

On Mon, Jun 04, 2018 at 11:14:36PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2018-06-04 at 05:55 -0700, Christoph Hellwig wrote:
> > On Mon, Jun 04, 2018 at 03:43:09PM +0300, Michael S. Tsirkin wrote:
> > > Another is that given the basic functionality is in there, optimizations
> > > can possibly wait until per-device quirks in DMA API are supported.
> > 
> > We have had per-device dma_ops for quite a while.
> 
> I've asked Ansuman to start with a patch that converts virtio to use
> DMA ops always, along with an init quirk to hookup "direct" ops when
> the IOMMU flag isn't set.
> 
> This will at least remove that horrid duplication of code path we have
> in there.
> 
> Then we can just involve the arch in that init quirk so we can chose an
> alternate set of ops when running a secure VM.
> 
> This is completely orthogonal to whether an iommu exist qemu side or
> not, and should be entirely solved on the Linux side.
> 
> Cheers,
> Ben.

Sounds good to me.

-- 
MST

^ permalink raw reply

* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Michael S. Tsirkin @ 2018-06-04 16:21 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Anshuman Khandual, virtualization, linux-kernel, linuxppc-dev,
	aik, robh, joe, elfring, david, jasowang, mpe, hch
In-Reply-To: <e7ceddbec11711a89282e9b70b7fd3c8af10b030.camel@kernel.crashing.org>

On Mon, Jun 04, 2018 at 11:11:52PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2018-06-04 at 15:43 +0300, Michael S. Tsirkin wrote:
> > On Thu, May 24, 2018 at 08:27:04AM +1000, Benjamin Herrenschmidt wrote:
> > > On Wed, 2018-05-23 at 21:50 +0300, Michael S. Tsirkin wrote:
> > > 
> > > > I re-read that discussion and I'm still unclear on the
> > > > original question, since I got several apparently
> > > > conflicting answers.
> > > > 
> > > > I asked:
> > > > 
> > > > 	Why isn't setting VIRTIO_F_IOMMU_PLATFORM on the
> > > > 	hypervisor side sufficient?
> > > 
> > > I thought I had replied to this...
> > > 
> > > There are a couple of reasons:
> > > 
> > > - First qemu doesn't know that the guest will switch to "secure mode"
> > > in advance. There is no difference between a normal and a secure
> > > partition until the partition does the magic UV call to "enter secure
> > > mode" and qemu doesn't see any of it. So who can set the flag here ?
> > 
> > The user should set it. You just tell user "to be able to use with
> > feature X, enable IOMMU".
> 
> That's completely backwards. The user has no idea what that stuff is.
> And it would have to percolate all the way up the management stack,
> libvirt, kimchi, whatever else ... that's just nonsense.
> 
> Especially since, as I explained in my other email, this is *not* a
> qemu problem and thus the solution shouldn't be messing around with
> qemu.

virtio is implemented in qemu though. If you prefer to stick
all your code in either guest or the UV that's your decision
but it looks like qemu could be helpful here.

For example what if you have a guest that passes physical addresses
to qemu bypassing swiotlb? Don't you want to detect
that and fail gracefully rather than crash the guest?
That's what VIRTIO_F_IOMMU_PLATFORM will do for you.

Still that's hypervisor's decision. What isn't up to the hypervisor is
the way we structure code. We made an early decision to merge a hack
with xen, among discussion about how with time DMA API will learn to
support per-device quirks and we'll be able to switch to that.
So let's do that now?

> > 
> > > - Second, when using VIRTIO_F_IOMMU_PLATFORM, we also make qemu (or
> > > vhost) go through the emulated MMIO for every access to the guest,
> > > which adds additional overhead.
> > > 
> > > Cheers,
> > > Ben.
> > 
> > There are several answers to this.  One is that we are working hard to
> > make overhead small when the mappings are static (which they would be if
> > there's no actual IOMMU). So maybe especially given you are using
> > a bounce buffer on top it's not so bad - did you try to
> > benchmark?
> > 
> > Another is that given the basic functionality is in there, optimizations
> > can possibly wait until per-device quirks in DMA API are supported.
> 
> The point is that requiring specific qemu command line arguments isn't
> going to fly. We have additional problems due to the fact that our
> firmware (SLOF) inside qemu doesn't currently deal with iommu's etc...
> though those can be fixed.
> 
> Overall, however, this seems to be the most convoluted way of achieving
> things, require user interventions where none should be needed etc...
> 
> Again, what's wrong with a 2 lines hook instead that solves it all and
> completely avoids involving qemu ?
> 
> Ben.

That each platform wants to add hacks in this data path function.

> > 
> > > > 
> > > > 
> > > > >  arch/powerpc/include/asm/dma-mapping.h |  6 ++++++
> > > > >  arch/powerpc/platforms/pseries/iommu.c | 11 +++++++++++
> > > > >  drivers/virtio/virtio_ring.c           | 10 ++++++++++
> > > > >  3 files changed, 27 insertions(+)
> > > > > 
> > > > > diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
> > > > > index 8fa3945..056e578 100644
> > > > > --- a/arch/powerpc/include/asm/dma-mapping.h
> > > > > +++ b/arch/powerpc/include/asm/dma-mapping.h
> > > > > @@ -115,4 +115,10 @@ extern u64 __dma_get_required_mask(struct device *dev);
> > > > >  #define ARCH_HAS_DMA_MMAP_COHERENT
> > > > >  
> > > > >  #endif /* __KERNEL__ */
> > > > > +
> > > > > +#define platform_forces_virtio_dma platform_forces_virtio_dma
> > > > > +
> > > > > +struct virtio_device;
> > > > > +
> > > > > +extern bool platform_forces_virtio_dma(struct virtio_device *vdev);
> > > > >  #endif	/* _ASM_DMA_MAPPING_H */
> > > > > diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> > > > > index 06f0296..a2ec15a 100644
> > > > > --- a/arch/powerpc/platforms/pseries/iommu.c
> > > > > +++ b/arch/powerpc/platforms/pseries/iommu.c
> > > > > @@ -38,6 +38,7 @@
> > > > >  #include <linux/of.h>
> > > > >  #include <linux/iommu.h>
> > > > >  #include <linux/rculist.h>
> > > > > +#include <linux/virtio.h>
> > > > >  #include <asm/io.h>
> > > > >  #include <asm/prom.h>
> > > > >  #include <asm/rtas.h>
> > > > > @@ -1396,3 +1397,13 @@ static int __init disable_multitce(char *str)
> > > > >  __setup("multitce=", disable_multitce);
> > > > >  
> > > > >  machine_subsys_initcall_sync(pseries, tce_iommu_bus_notifier_init);
> > > > > +
> > > > > +bool platform_forces_virtio_dma(struct virtio_device *vdev)
> > > > > +{
> > > > > +	/*
> > > > > +	 * On protected guest platforms, force virtio core to use DMA
> > > > > +	 * MAP API for all virtio devices. But there can also be some
> > > > > +	 * exceptions for individual devices like virtio balloon.
> > > > > +	 */
> > > > > +	return (of_find_compatible_node(NULL, NULL, "ibm,ultravisor") != NULL);
> > > > > +}
> > > > 
> > > > Isn't this kind of slow?  vring_use_dma_api is on
> > > > data path and supposed to be very fast.
> > > > 
> > > > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > > > index 21d464a..47ea6c3 100644
> > > > > --- a/drivers/virtio/virtio_ring.c
> > > > > +++ b/drivers/virtio/virtio_ring.c
> > > > > @@ -141,8 +141,18 @@ struct vring_virtqueue {
> > > > >   * unconditionally on data path.
> > > > >   */
> > > > >  
> > > > > +#ifndef platform_forces_virtio_dma
> > > > > +static inline bool platform_forces_virtio_dma(struct virtio_device *vdev)
> > > > > +{
> > > > > +	return false;
> > > > > +}
> > > > > +#endif
> > > > > +
> > > > >  static bool vring_use_dma_api(struct virtio_device *vdev)
> > > > >  {
> > > > > +	if (platform_forces_virtio_dma(vdev))
> > > > > +		return true;
> > > > > +
> > > > >  	if (!virtio_has_iommu_quirk(vdev))
> > > > >  		return true;
> > > > >  
> > > > > -- 
> > > > > 2.9.3

^ permalink raw reply

* [RFC PATCH -tip v5 24/27] bpf: error-inject: kprobes: Clear current_kprobe and enable preempt in kprobe
From: Masami Hiramatsu @ 2018-06-04 16:00 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar
  Cc: Masami Hiramatsu, Ingo Molnar, H . Peter Anvin, linux-kernel,
	Ananth N Mavinakayanahalli, Andrew Morton, Steven Rostedt,
	linux-arch, Vineet Gupta, Russell King, Catalin Marinas,
	Will Deacon, Tony Luck, Fenghua Yu, Ralf Baechle, James Hogan,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman,
	Martin Schwidefsky, Heiko Carstens, Yoshinori Sato, Rich Felker,
	David S. Miller, Naveen N. Rao, Josef Bacik, Alexei Starovoitov,
	x86, linux-snps-arc, linux-arm-kernel, linux-ia64, linux-mips,
	linuxppc-dev, linux-s390, linux-sh, sparclinux
In-Reply-To: <152812730943.10068.5166429445118734697.stgit@devbox>

Clear current_kprobe and enable preemption in kprobe
even if pre_handler returns !0.

This simplifies function override using kprobes.

Jprobe used to require to keep the preemption disabled and
keep current_kprobe until it returned to original function
entry. For this reason kprobe_int3_handler() and similar
arch dependent kprobe handers checks pre_handler result
and exit without enabling preemption if the result is !0.

After removing the jprobe, Kprobes does not need to
keep preempt disabled even if user handler returns !0
anymore.

But since the function override handler in error-inject
and bpf is also returns !0 if it overrides a function,
to balancing the preempt count, it enables preemption
and reset current kprobe by itself.

That is a bad design that is very buggy. This fixes
such unbalanced preempt-count and current_kprobes setting
in kprobes, bpf and error-inject.

Note: for powerpc and x86, this removes all preempt_disable
from kprobe_ftrace_handler because ftrace callbacks are
called under preempt disabled.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Vineet Gupta <vgupta@synopsys.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Tony Luck <tony.luck@intel.com>
Cc: Fenghua Yu <fenghua.yu@intel.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Cc: James Hogan <jhogan@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Cc: Rich Felker <dalias@libc.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
Cc: Josef Bacik <jbacik@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: x86@kernel.org
Cc: linux-snps-arc@lists.infradead.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: sparclinux@vger.kernel.org
---
 Changes in v5:
  - Fix kprobe_ftrace_handler in arch/powerpc too.
---
 arch/arc/kernel/kprobes.c            |    5 +++--
 arch/arm/probes/kprobes/core.c       |   10 +++++-----
 arch/arm64/kernel/probes/kprobes.c   |   10 +++++-----
 arch/ia64/kernel/kprobes.c           |   13 ++++---------
 arch/mips/kernel/kprobes.c           |    4 ++--
 arch/powerpc/kernel/kprobes-ftrace.c |   15 ++++++---------
 arch/powerpc/kernel/kprobes.c        |    7 +++++--
 arch/s390/kernel/kprobes.c           |    7 ++++---
 arch/sh/kernel/kprobes.c             |    7 ++++---
 arch/sparc/kernel/kprobes.c          |    7 ++++---
 arch/x86/kernel/kprobes/core.c       |    4 ++++
 arch/x86/kernel/kprobes/ftrace.c     |   15 ++++++++-------
 kernel/fail_function.c               |    3 ---
 kernel/trace/trace_kprobe.c          |   11 +++--------
 14 files changed, 57 insertions(+), 61 deletions(-)

diff --git a/arch/arc/kernel/kprobes.c b/arch/arc/kernel/kprobes.c
index 465365696c91..df35d4c0b0b8 100644
--- a/arch/arc/kernel/kprobes.c
+++ b/arch/arc/kernel/kprobes.c
@@ -231,6 +231,9 @@ int __kprobes arc_kprobe_handler(unsigned long addr, struct pt_regs *regs)
 		if (!p->pre_handler || !p->pre_handler(p, regs)) {
 			setup_singlestep(p, regs);
 			kcb->kprobe_status = KPROBE_HIT_SS;
+		} else {
+			reset_current_kprobe();
+			preempt_enable_no_resched();
 		}
 
 		return 1;
@@ -442,9 +445,7 @@ static int __kprobes trampoline_probe_handler(struct kprobe *p,
 	kretprobe_assert(ri, orig_ret_address, trampoline_address);
 	regs->ret = orig_ret_address;
 
-	reset_current_kprobe();
 	kretprobe_hash_unlock(current, &flags);
-	preempt_enable_no_resched();
 
 	hlist_for_each_entry_safe(ri, tmp, &empty_rp, hlist) {
 		hlist_del(&ri->hlist);
diff --git a/arch/arm/probes/kprobes/core.c b/arch/arm/probes/kprobes/core.c
index 3192350f389d..8d37601fdb20 100644
--- a/arch/arm/probes/kprobes/core.c
+++ b/arch/arm/probes/kprobes/core.c
@@ -300,10 +300,10 @@ void __kprobes kprobe_handler(struct pt_regs *regs)
 
 			/*
 			 * If we have no pre-handler or it returned 0, we
-			 * continue with normal processing.  If we have a
-			 * pre-handler and it returned non-zero, it prepped
-			 * for calling the break_handler below on re-entry,
-			 * so get out doing nothing more here.
+			 * continue with normal processing. If we have a
+			 * pre-handler and it returned non-zero, it will
+			 * modify the execution path and no need to single
+			 * stepping. Let's just reset current kprobe and exit.
 			 */
 			if (!p->pre_handler || !p->pre_handler(p, regs)) {
 				kcb->kprobe_status = KPROBE_HIT_SS;
@@ -312,8 +312,8 @@ void __kprobes kprobe_handler(struct pt_regs *regs)
 					kcb->kprobe_status = KPROBE_HIT_SSDONE;
 					p->post_handler(p, regs, 0);
 				}
-				reset_current_kprobe();
 			}
+			reset_current_kprobe();
 		}
 	} else {
 		/*
diff --git a/arch/arm64/kernel/probes/kprobes.c b/arch/arm64/kernel/probes/kprobes.c
index 076c3c0775a6..5daf3d721cb7 100644
--- a/arch/arm64/kernel/probes/kprobes.c
+++ b/arch/arm64/kernel/probes/kprobes.c
@@ -395,9 +395,9 @@ static void __kprobes kprobe_handler(struct pt_regs *regs)
 			/*
 			 * If we have no pre-handler or it returned 0, we
 			 * continue with normal processing.  If we have a
-			 * pre-handler and it returned non-zero, it prepped
-			 * for calling the break_handler below on re-entry,
-			 * so get out doing nothing more here.
+			 * pre-handler and it returned non-zero, it will
+			 * modify the execution path and no need to single
+			 * stepping. Let's just reset current kprobe and exit.
 			 *
 			 * pre_handler can hit a breakpoint and can step thru
 			 * before return, keep PSTATE D-flag enabled until
@@ -405,8 +405,8 @@ static void __kprobes kprobe_handler(struct pt_regs *regs)
 			 */
 			if (!p->pre_handler || !p->pre_handler(p, regs)) {
 				setup_singlestep(p, regs, kcb, 0);
-				return;
-			}
+			} else
+				reset_current_kprobe();
 		}
 	}
 	/*
diff --git a/arch/ia64/kernel/kprobes.c b/arch/ia64/kernel/kprobes.c
index 74c8524e6309..aa41bd5cf9b7 100644
--- a/arch/ia64/kernel/kprobes.c
+++ b/arch/ia64/kernel/kprobes.c
@@ -478,12 +478,9 @@ int __kprobes trampoline_probe_handler(struct kprobe *p, struct pt_regs *regs)
 			 */
 			break;
 	}
-
 	kretprobe_assert(ri, orig_ret_address, trampoline_address);
 
-	reset_current_kprobe();
 	kretprobe_hash_unlock(current, &flags);
-	preempt_enable_no_resched();
 
 	hlist_for_each_entry_safe(ri, tmp, &empty_rp, hlist) {
 		hlist_del(&ri->hlist);
@@ -851,13 +848,11 @@ static int __kprobes pre_kprobes_handler(struct die_args *args)
 	set_current_kprobe(p, kcb);
 	kcb->kprobe_status = KPROBE_HIT_ACTIVE;
 
-	if (p->pre_handler && p->pre_handler(p, regs))
-		/*
-		 * Our pre-handler is specifically requesting that we just
-		 * do a return.  This is used for both the jprobe pre-handler
-		 * and the kretprobe trampoline
-		 */
+	if (p->pre_handler && p->pre_handler(p, regs)) {
+		reset_current_kprobe();
+		preempt_enable_no_resched();
 		return 1;
+	}
 
 #if !defined(CONFIG_PREEMPT)
 	if (p->ainsn.inst_flag == INST_FLAG_BOOSTABLE && !p->post_handler) {
diff --git a/arch/mips/kernel/kprobes.c b/arch/mips/kernel/kprobes.c
index 7fd277bc59b9..54cd675c5d1d 100644
--- a/arch/mips/kernel/kprobes.c
+++ b/arch/mips/kernel/kprobes.c
@@ -358,6 +358,8 @@ static int __kprobes kprobe_handler(struct pt_regs *regs)
 
 	if (p->pre_handler && p->pre_handler(p, regs)) {
 		/* handler has already set things up, so skip ss setup */
+		reset_current_kprobe();
+		preempt_enable_no_resched();
 		return 1;
 	}
 
@@ -543,9 +545,7 @@ static int __kprobes trampoline_probe_handler(struct kprobe *p,
 	kretprobe_assert(ri, orig_ret_address, trampoline_address);
 	instruction_pointer(regs) = orig_ret_address;
 
-	reset_current_kprobe();
 	kretprobe_hash_unlock(current, &flags);
-	preempt_enable_no_resched();
 
 	hlist_for_each_entry_safe(ri, tmp, &empty_rp, hlist) {
 		hlist_del(&ri->hlist);
diff --git a/arch/powerpc/kernel/kprobes-ftrace.c b/arch/powerpc/kernel/kprobes-ftrace.c
index 3869b0e5d5c7..c80c35d1e26e 100644
--- a/arch/powerpc/kernel/kprobes-ftrace.c
+++ b/arch/powerpc/kernel/kprobes-ftrace.c
@@ -51,11 +51,9 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned long parent_nip,
 	struct kprobe *p;
 	struct kprobe_ctlblk *kcb;
 
-	preempt_disable();
-
 	p = get_kprobe((kprobe_opcode_t *)nip);
 	if (unlikely(!p) || kprobe_disabled(p))
-		goto end;
+		return;
 
 	kcb = get_kprobe_ctlblk();
 	if (kprobe_running()) {
@@ -75,15 +73,14 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned long parent_nip,
 			skip_singlestep(p, regs, kcb, orig_nip);
 		else {
 			/*
-			 * If pre_handler returns !0, it sets regs->nip and
-			 * resets current kprobe. In this case, we should not
-			 * re-enable preemption.
+			 * If pre_handler returns !0, this handler
+			 * modifies regs->ip and goes back to there
+			 * directly without single stepping.
+			 * So let's just clear current kprobe.
 			 */
-			return;
+			__this_cpu_write(current_kprobe, NULL);
 		}
 	}
-end:
-	preempt_enable_no_resched();
 }
 NOKPROBE_SYMBOL(kprobe_ftrace_handler);
 
diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index f06747e2e70d..5c60bb0f927f 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -358,9 +358,12 @@ int kprobe_handler(struct pt_regs *regs)
 
 	kcb->kprobe_status = KPROBE_HIT_ACTIVE;
 	set_current_kprobe(p, regs, kcb);
-	if (p->pre_handler && p->pre_handler(p, regs))
-		/* handler has already set things up, so skip ss setup */
+	if (p->pre_handler && p->pre_handler(p, regs)) {
+		/* handler changed execution path, so skip ss setup */
+		reset_current_kprobe();
+		preempt_enable_no_resched();
 		return 1;
+	}
 
 	if (p->ainsn.boostable >= 0) {
 		ret = try_to_emulate(p, regs);
diff --git a/arch/s390/kernel/kprobes.c b/arch/s390/kernel/kprobes.c
index 3e34018960b5..7c0a095e9c5f 100644
--- a/arch/s390/kernel/kprobes.c
+++ b/arch/s390/kernel/kprobes.c
@@ -326,8 +326,11 @@ static int kprobe_handler(struct pt_regs *regs)
 			 */
 			push_kprobe(kcb, p);
 			kcb->kprobe_status = KPROBE_HIT_ACTIVE;
-			if (p->pre_handler && p->pre_handler(p, regs))
+			if (p->pre_handler && p->pre_handler(p, regs)) {
+				pop_kprobe(kcb);
+				preempt_enable_no_resched();
 				return 1;
+			}
 			kcb->kprobe_status = KPROBE_HIT_SS;
 		}
 		enable_singlestep(kcb, regs, (unsigned long) p->ainsn.insn);
@@ -431,9 +434,7 @@ static int trampoline_probe_handler(struct kprobe *p, struct pt_regs *regs)
 
 	regs->psw.addr = orig_ret_address;
 
-	pop_kprobe(get_kprobe_ctlblk());
 	kretprobe_hash_unlock(current, &flags);
-	preempt_enable_no_resched();
 
 	hlist_for_each_entry_safe(ri, tmp, &empty_rp, hlist) {
 		hlist_del(&ri->hlist);
diff --git a/arch/sh/kernel/kprobes.c b/arch/sh/kernel/kprobes.c
index 4fafe0cd12c6..241e903dd3ee 100644
--- a/arch/sh/kernel/kprobes.c
+++ b/arch/sh/kernel/kprobes.c
@@ -272,9 +272,12 @@ static int __kprobes kprobe_handler(struct pt_regs *regs)
 	set_current_kprobe(p, regs, kcb);
 	kcb->kprobe_status = KPROBE_HIT_ACTIVE;
 
-	if (p->pre_handler && p->pre_handler(p, regs))
+	if (p->pre_handler && p->pre_handler(p, regs)) {
 		/* handler has already set things up, so skip ss setup */
+		reset_current_kprobe();
+		preempt_enable_no_resched();
 		return 1;
+	}
 
 	prepare_singlestep(p, regs);
 	kcb->kprobe_status = KPROBE_HIT_SS;
@@ -352,8 +355,6 @@ int __kprobes trampoline_probe_handler(struct kprobe *p, struct pt_regs *regs)
 	regs->pc = orig_ret_address;
 	kretprobe_hash_unlock(current, &flags);
 
-	preempt_enable_no_resched();
-
 	hlist_for_each_entry_safe(ri, tmp, &empty_rp, hlist) {
 		hlist_del(&ri->hlist);
 		kfree(ri);
diff --git a/arch/sparc/kernel/kprobes.c b/arch/sparc/kernel/kprobes.c
index c684c96ef2e9..dfbca2470536 100644
--- a/arch/sparc/kernel/kprobes.c
+++ b/arch/sparc/kernel/kprobes.c
@@ -175,8 +175,11 @@ static int __kprobes kprobe_handler(struct pt_regs *regs)
 
 	set_current_kprobe(p, regs, kcb);
 	kcb->kprobe_status = KPROBE_HIT_ACTIVE;
-	if (p->pre_handler && p->pre_handler(p, regs))
+	if (p->pre_handler && p->pre_handler(p, regs)) {
+		reset_current_kprobe();
+		preempt_enable_no_resched();
 		return 1;
+	}
 
 	prepare_singlestep(p, regs, kcb);
 	kcb->kprobe_status = KPROBE_HIT_SS;
@@ -508,9 +511,7 @@ static int __kprobes trampoline_probe_handler(struct kprobe *p,
 	regs->tpc = orig_ret_address;
 	regs->tnpc = orig_ret_address + 4;
 
-	reset_current_kprobe();
 	kretprobe_hash_unlock(current, &flags);
-	preempt_enable_no_resched();
 
 	hlist_for_each_entry_safe(ri, tmp, &empty_rp, hlist) {
 		hlist_del(&ri->hlist);
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index 0ac16a0d93e5..814e26b7c8a2 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -694,6 +694,10 @@ int kprobe_int3_handler(struct pt_regs *regs)
 			 */
 			if (!p->pre_handler || !p->pre_handler(p, regs))
 				setup_singlestep(p, regs, kcb, 0);
+			else {
+				reset_current_kprobe();
+				preempt_enable_no_resched();
+			}
 			return 1;
 		}
 	} else if (*addr != BREAKPOINT_INSTRUCTION) {
diff --git a/arch/x86/kernel/kprobes/ftrace.c b/arch/x86/kernel/kprobes/ftrace.c
index c8696f2a583f..310ef737b9d4 100644
--- a/arch/x86/kernel/kprobes/ftrace.c
+++ b/arch/x86/kernel/kprobes/ftrace.c
@@ -63,18 +63,19 @@ void kprobe_ftrace_handler(unsigned long ip, unsigned long parent_ip,
 		/* Kprobe handler expects regs->ip = ip + 1 as breakpoint hit */
 		regs->ip = ip + sizeof(kprobe_opcode_t);
 
-		/* To emulate trap based kprobes, preempt_disable here */
-		preempt_disable();
 		__this_cpu_write(current_kprobe, p);
 		kcb->kprobe_status = KPROBE_HIT_ACTIVE;
 		if (!p->pre_handler || !p->pre_handler(p, regs)) {
 			skip_singlestep(p, regs, kcb, orig_ip);
-			preempt_enable_no_resched();
+		} else {
+			/*
+			 * If pre_handler returns !0, this handler
+			 * modifies regs->ip and goes back to there
+			 * directly without single stepping.
+			 * So let's just clear current kprobe.
+			 */
+			__this_cpu_write(current_kprobe, NULL);
 		}
-		/*
-		 * If pre_handler returns !0, it sets regs->ip and
-		 * resets current kprobe, and keep preempt count +1.
-		 */
 	}
 }
 NOKPROBE_SYMBOL(kprobe_ftrace_handler);
diff --git a/kernel/fail_function.c b/kernel/fail_function.c
index 1d5632d8bbcc..b090688df94f 100644
--- a/kernel/fail_function.c
+++ b/kernel/fail_function.c
@@ -184,9 +184,6 @@ static int fei_kprobe_handler(struct kprobe *kp, struct pt_regs *regs)
 	if (should_fail(&fei_fault_attr, 1)) {
 		regs_set_return_value(regs, attr->retval);
 		override_function_with_return(regs);
-		/* Kprobe specific fixup */
-		reset_current_kprobe();
-		preempt_enable_no_resched();
 		return 1;
 	}
 
diff --git a/kernel/trace/trace_kprobe.c b/kernel/trace/trace_kprobe.c
index 02aed76e0978..b65cd6834450 100644
--- a/kernel/trace/trace_kprobe.c
+++ b/kernel/trace/trace_kprobe.c
@@ -1217,16 +1217,11 @@ kprobe_perf_func(struct trace_kprobe *tk, struct pt_regs *regs)
 
 		/*
 		 * We need to check and see if we modified the pc of the
-		 * pt_regs, and if so clear the kprobe and return 1 so that we
-		 * don't do the single stepping.
-		 * The ftrace kprobe handler leaves it up to us to re-enable
-		 * preemption here before returning if we've modified the ip.
+		 * pt_regs, and if so return 1 so that we don't do the
+		 * single stepping.
 		 */
-		if (orig_ip != instruction_pointer(regs)) {
-			reset_current_kprobe();
-			preempt_enable_no_resched();
+		if (orig_ip != instruction_pointer(regs))
 			return 1;
-		}
 		if (!ret)
 			return 0;
 	}

^ permalink raw reply related

* [RFC PATCH -tip v5 18/27] powerpc/kprobes: Don't call the ->break_handler() in arm kprobes code
From: Masami Hiramatsu @ 2018-06-04 15:57 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar
  Cc: Masami Hiramatsu, Ingo Molnar, H . Peter Anvin, linux-kernel,
	Ananth N Mavinakayanahalli, Andrew Morton, Steven Rostedt,
	linux-arch, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Naveen N. Rao, linuxppc-dev
In-Reply-To: <152812730943.10068.5166429445118734697.stgit@devbox>

Don't call the ->break_handler() from the arm kprobes code,
because it was only used by jprobes which got removed.

This also makes skip_singlestep() a static function since
only ftrace-kprobe.c is using this function.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/kprobes.h   |   10 ----------
 arch/powerpc/kernel/kprobes-ftrace.c |   16 +++-------------
 arch/powerpc/kernel/kprobes.c        |   31 +++++++++++--------------------
 3 files changed, 14 insertions(+), 43 deletions(-)

diff --git a/arch/powerpc/include/asm/kprobes.h b/arch/powerpc/include/asm/kprobes.h
index 674036db558b..785c464b6588 100644
--- a/arch/powerpc/include/asm/kprobes.h
+++ b/arch/powerpc/include/asm/kprobes.h
@@ -102,16 +102,6 @@ extern int kprobe_exceptions_notify(struct notifier_block *self,
 extern int kprobe_fault_handler(struct pt_regs *regs, int trapnr);
 extern int kprobe_handler(struct pt_regs *regs);
 extern int kprobe_post_handler(struct pt_regs *regs);
-#ifdef CONFIG_KPROBES_ON_FTRACE
-extern int skip_singlestep(struct kprobe *p, struct pt_regs *regs,
-			   struct kprobe_ctlblk *kcb);
-#else
-static inline int skip_singlestep(struct kprobe *p, struct pt_regs *regs,
-				  struct kprobe_ctlblk *kcb)
-{
-	return 0;
-}
-#endif
 #else
 static inline int kprobe_handler(struct pt_regs *regs) { return 0; }
 static inline int kprobe_post_handler(struct pt_regs *regs) { return 0; }
diff --git a/arch/powerpc/kernel/kprobes-ftrace.c b/arch/powerpc/kernel/kprobes-ftrace.c
index 1b316331c2d9..3869b0e5d5c7 100644
--- a/arch/powerpc/kernel/kprobes-ftrace.c
+++ b/arch/powerpc/kernel/kprobes-ftrace.c
@@ -26,8 +26,8 @@
 #include <linux/ftrace.h>
 
 static nokprobe_inline
-int __skip_singlestep(struct kprobe *p, struct pt_regs *regs,
-		      struct kprobe_ctlblk *kcb, unsigned long orig_nip)
+int skip_singlestep(struct kprobe *p, struct pt_regs *regs,
+		    struct kprobe_ctlblk *kcb, unsigned long orig_nip)
 {
 	/*
 	 * Emulate singlestep (and also recover regs->nip)
@@ -44,16 +44,6 @@ int __skip_singlestep(struct kprobe *p, struct pt_regs *regs,
 	return 1;
 }
 
-int skip_singlestep(struct kprobe *p, struct pt_regs *regs,
-		    struct kprobe_ctlblk *kcb)
-{
-	if (kprobe_ftrace(p))
-		return __skip_singlestep(p, regs, kcb, 0);
-	else
-		return 0;
-}
-NOKPROBE_SYMBOL(skip_singlestep);
-
 /* Ftrace callback handler for kprobes */
 void kprobe_ftrace_handler(unsigned long nip, unsigned long parent_nip,
 			   struct ftrace_ops *ops, struct pt_regs *regs)
@@ -82,7 +72,7 @@ void kprobe_ftrace_handler(unsigned long nip, unsigned long parent_nip,
 		__this_cpu_write(current_kprobe, p);
 		kcb->kprobe_status = KPROBE_HIT_ACTIVE;
 		if (!p->pre_handler || !p->pre_handler(p, regs))
-			__skip_singlestep(p, regs, kcb, orig_nip);
+			skip_singlestep(p, regs, kcb, orig_nip);
 		else {
 			/*
 			 * If pre_handler returns !0, it sets regs->nip and
diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index 600678fce0a8..f06747e2e70d 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -317,25 +317,17 @@ int kprobe_handler(struct pt_regs *regs)
 			}
 			prepare_singlestep(p, regs);
 			return 1;
-		} else {
-			if (*addr != BREAKPOINT_INSTRUCTION) {
-				/* If trap variant, then it belongs not to us */
-				kprobe_opcode_t cur_insn = *addr;
-				if (is_trap(cur_insn))
-		       			goto no_kprobe;
-				/* The breakpoint instruction was removed by
-				 * another cpu right after we hit, no further
-				 * handling of this interrupt is appropriate
-				 */
-				ret = 1;
+		} else if (*addr != BREAKPOINT_INSTRUCTION) {
+			/* If trap variant, then it belongs not to us */
+			kprobe_opcode_t cur_insn = *addr;
+
+			if (is_trap(cur_insn))
 				goto no_kprobe;
-			}
-			p = __this_cpu_read(current_kprobe);
-			if (p->break_handler && p->break_handler(p, regs)) {
-				if (!skip_singlestep(p, regs, kcb))
-					goto ss_probe;
-				ret = 1;
-			}
+			/* The breakpoint instruction was removed by
+			 * another cpu right after we hit, no further
+			 * handling of this interrupt is appropriate
+			 */
+			ret = 1;
 		}
 		goto no_kprobe;
 	}
@@ -350,7 +342,7 @@ int kprobe_handler(struct pt_regs *regs)
 			 */
 			kprobe_opcode_t cur_insn = *addr;
 			if (is_trap(cur_insn))
-		       		goto no_kprobe;
+				goto no_kprobe;
 			/*
 			 * The breakpoint instruction was removed right
 			 * after we hit it.  Another cpu has removed
@@ -370,7 +362,6 @@ int kprobe_handler(struct pt_regs *regs)
 		/* handler has already set things up, so skip ss setup */
 		return 1;
 
-ss_probe:
 	if (p->ainsn.boostable >= 0) {
 		ret = try_to_emulate(p, regs);
 

^ permalink raw reply related

* [RFC PATCH -tip v5 07/27] powerpc/kprobes: Remove jprobe powerpc implementation
From: Masami Hiramatsu @ 2018-06-04 15:51 UTC (permalink / raw)
  To: Thomas Gleixner, Ingo Molnar
  Cc: Masami Hiramatsu, Ingo Molnar, H . Peter Anvin, linux-kernel,
	Ananth N Mavinakayanahalli, Andrew Morton, Steven Rostedt,
	linux-arch, Benjamin Herrenschmidt, Paul Mackerras,
	Michael Ellerman, Naveen N. Rao, linuxppc-dev
In-Reply-To: <152812730943.10068.5166429445118734697.stgit@devbox>

Remove arch dependent setjump/longjump functions
and unused fields in kprobe_ctlblk for jprobes
from arch/powerpc. This also reverts commits
related __is_active_jprobe() function.

Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.vnet.ibm.com>
Cc: linuxppc-dev@lists.ozlabs.org
---
 arch/powerpc/include/asm/kprobes.h             |    2 -
 arch/powerpc/kernel/kprobes-ftrace.c           |   15 -------
 arch/powerpc/kernel/kprobes.c                  |   54 ------------------------
 arch/powerpc/kernel/trace/ftrace_64_mprofile.S |   39 ++---------------
 4 files changed, 5 insertions(+), 105 deletions(-)

diff --git a/arch/powerpc/include/asm/kprobes.h b/arch/powerpc/include/asm/kprobes.h
index 9f3be5c8a4a3..674036db558b 100644
--- a/arch/powerpc/include/asm/kprobes.h
+++ b/arch/powerpc/include/asm/kprobes.h
@@ -88,7 +88,6 @@ struct prev_kprobe {
 struct kprobe_ctlblk {
 	unsigned long kprobe_status;
 	unsigned long kprobe_saved_msr;
-	struct pt_regs jprobe_saved_regs;
 	struct prev_kprobe prev_kprobe;
 };
 
@@ -104,7 +103,6 @@ extern int kprobe_fault_handler(struct pt_regs *regs, int trapnr);
 extern int kprobe_handler(struct pt_regs *regs);
 extern int kprobe_post_handler(struct pt_regs *regs);
 #ifdef CONFIG_KPROBES_ON_FTRACE
-extern int __is_active_jprobe(unsigned long addr);
 extern int skip_singlestep(struct kprobe *p, struct pt_regs *regs,
 			   struct kprobe_ctlblk *kcb);
 #else
diff --git a/arch/powerpc/kernel/kprobes-ftrace.c b/arch/powerpc/kernel/kprobes-ftrace.c
index 7a1f99f1b47f..1b316331c2d9 100644
--- a/arch/powerpc/kernel/kprobes-ftrace.c
+++ b/arch/powerpc/kernel/kprobes-ftrace.c
@@ -25,21 +25,6 @@
 #include <linux/preempt.h>
 #include <linux/ftrace.h>
 
-/*
- * This is called from ftrace code after invoking registered handlers to
- * disambiguate regs->nip changes done by jprobes and livepatch. We check if
- * there is an active jprobe at the provided address (mcount location).
- */
-int __is_active_jprobe(unsigned long addr)
-{
-	if (!preemptible()) {
-		struct kprobe *p = raw_cpu_read(current_kprobe);
-		return (p && (unsigned long)p->addr == addr) ? 1 : 0;
-	}
-
-	return 0;
-}
-
 static nokprobe_inline
 int __skip_singlestep(struct kprobe *p, struct pt_regs *regs,
 		      struct kprobe_ctlblk *kcb, unsigned long orig_nip)
diff --git a/arch/powerpc/kernel/kprobes.c b/arch/powerpc/kernel/kprobes.c
index e4c5bf33970b..600678fce0a8 100644
--- a/arch/powerpc/kernel/kprobes.c
+++ b/arch/powerpc/kernel/kprobes.c
@@ -611,60 +611,6 @@ unsigned long arch_deref_entry_point(void *entry)
 }
 NOKPROBE_SYMBOL(arch_deref_entry_point);
 
-int setjmp_pre_handler(struct kprobe *p, struct pt_regs *regs)
-{
-	struct jprobe *jp = container_of(p, struct jprobe, kp);
-	struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
-
-	memcpy(&kcb->jprobe_saved_regs, regs, sizeof(struct pt_regs));
-
-	/* setup return addr to the jprobe handler routine */
-	regs->nip = arch_deref_entry_point(jp->entry);
-#ifdef PPC64_ELF_ABI_v2
-	regs->gpr[12] = (unsigned long)jp->entry;
-#elif defined(PPC64_ELF_ABI_v1)
-	regs->gpr[2] = (unsigned long)(((func_descr_t *)jp->entry)->toc);
-#endif
-
-	/*
-	 * jprobes use jprobe_return() which skips the normal return
-	 * path of the function, and this messes up the accounting of the
-	 * function graph tracer.
-	 *
-	 * Pause function graph tracing while performing the jprobe function.
-	 */
-	pause_graph_tracing();
-
-	return 1;
-}
-NOKPROBE_SYMBOL(setjmp_pre_handler);
-
-void __used jprobe_return(void)
-{
-	asm volatile("jprobe_return_trap:\n"
-		     "trap\n"
-		     ::: "memory");
-}
-NOKPROBE_SYMBOL(jprobe_return);
-
-int longjmp_break_handler(struct kprobe *p, struct pt_regs *regs)
-{
-	struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
-
-	if (regs->nip != ppc_kallsyms_lookup_name("jprobe_return_trap")) {
-		pr_debug("longjmp_break_handler NIP (0x%lx) does not match jprobe_return_trap (0x%lx)\n",
-				regs->nip, ppc_kallsyms_lookup_name("jprobe_return_trap"));
-		return 0;
-	}
-
-	memcpy(regs, &kcb->jprobe_saved_regs, sizeof(struct pt_regs));
-	/* It's OK to start function graph tracing again */
-	unpause_graph_tracing();
-	preempt_enable_no_resched();
-	return 1;
-}
-NOKPROBE_SYMBOL(longjmp_break_handler);
-
 static struct kprobe trampoline_p = {
 	.addr = (kprobe_opcode_t *) &kretprobe_trampoline,
 	.pre_handler = trampoline_probe_handler
diff --git a/arch/powerpc/kernel/trace/ftrace_64_mprofile.S b/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
index 3f3e81852422..4e84a713e80a 100644
--- a/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
+++ b/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
@@ -99,39 +99,13 @@ ftrace_call:
 	bl	ftrace_stub
 	nop
 
-	/* Load the possibly modified NIP */
-	ld	r15, _NIP(r1)
-
+	/* Load ctr with the possibly modified NIP */
+	ld	r3, _NIP(r1)
+	mtctr	r3
 #ifdef CONFIG_LIVEPATCH
-	cmpd	r14, r15	/* has NIP been altered? */
+	cmpd	r14, r3		/* has NIP been altered? */
 #endif
 
-#if defined(CONFIG_LIVEPATCH) && defined(CONFIG_KPROBES_ON_FTRACE)
-	/* NIP has not been altered, skip over further checks */
-	beq	1f
-
-	/* Check if there is an active jprobe on us */
-	subi	r3, r14, 4
-	bl	__is_active_jprobe
-	nop
-
-	/*
-	 * If r3 == 1, then this is a kprobe/jprobe.
-	 * else, this is livepatched function.
-	 *
-	 * The conditional branch for livepatch_handler below will use the
-	 * result of this comparison. For kprobe/jprobe, we just need to branch to
-	 * the new NIP, not call livepatch_handler. The branch below is bne, so we
-	 * want CR0[EQ] to be true if this is a kprobe/jprobe. Which means we want
-	 * CR0[EQ] = (r3 == 1).
-	 */
-	cmpdi	r3, 1
-1:
-#endif
-
-	/* Load CTR with the possibly modified NIP */
-	mtctr	r15
-
 	/* Restore gprs */
 	REST_GPR(0,r1)
 	REST_10GPRS(2,r1)
@@ -149,10 +123,7 @@ ftrace_call:
 	addi r1, r1, SWITCH_FRAME_SIZE
 
 #ifdef CONFIG_LIVEPATCH
-        /*
-	 * Based on the cmpd or cmpdi above, if the NIP was altered and we're
-	 * not on a kprobe/jprobe, then handle livepatch.
-	 */
+        /* Based on the cmpd above, if the NIP was altered handle livepatch */
 	bne-	livepatch_handler
 #endif
 

^ permalink raw reply related

* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Benjamin Herrenschmidt @ 2018-06-04 13:14 UTC (permalink / raw)
  To: Christoph Hellwig, Michael S. Tsirkin
  Cc: Anshuman Khandual, virtualization, linux-kernel, linuxppc-dev,
	aik, robh, joe, elfring, david, jasowang, mpe
In-Reply-To: <20180604125530.GA16378@infradead.org>

On Mon, 2018-06-04 at 05:55 -0700, Christoph Hellwig wrote:
> On Mon, Jun 04, 2018 at 03:43:09PM +0300, Michael S. Tsirkin wrote:
> > Another is that given the basic functionality is in there, optimizations
> > can possibly wait until per-device quirks in DMA API are supported.
> 
> We have had per-device dma_ops for quite a while.

I've asked Ansuman to start with a patch that converts virtio to use
DMA ops always, along with an init quirk to hookup "direct" ops when
the IOMMU flag isn't set.

This will at least remove that horrid duplication of code path we have
in there.

Then we can just involve the arch in that init quirk so we can chose an
alternate set of ops when running a secure VM.

This is completely orthogonal to whether an iommu exist qemu side or
not, and should be entirely solved on the Linux side.

Cheers,
Ben.

^ permalink raw reply

* Re: [v4, 1/7] powerpc/64s/radix: do not flush TLB when relaxing access
From: Michael Ellerman @ 2018-06-04 14:11 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: Aneesh Kumar K . V, Nicholas Piggin
In-Reply-To: <20180601100121.393-2-npiggin@gmail.com>

On Fri, 2018-06-01 at 10:01:15 UTC, Nicholas Piggin wrote:
> Radix flushes the TLB when updating ptes to increase permissiveness
> of protection (increase access authority). Book3S does not require
> TLB flushing in this case, and it is not done on hash. This patch
> avoids the flush for radix.
> 
> >From Power ISA v3.0B, p.1090:
> 
>     Setting a Reference or Change Bit or Upgrading Access Authority
>     (PTE Subject to Atomic Hardware Updates)
> 
>     If the only change being made to a valid PTE that is subject to
>     atomic hardware updates is to set the Reference or Change bit to 1
>     or to add access authorities, a simpler sequence suffices because
>     the translation hardware will refetch the PTE if an access is
>     attempted for which the only problems were reference and/or change
>     bits needing to be set or insufficient access authority.
> 
> The nest MMU on POWER9 does not re-fetch the PTE after such an access
> attempt before faulting, so address spaces with a coprocessor
> attached will continue to flush in these cases.
> 
> This reduces tlbies for a kernel compile workload from 1.28M to 0.95M,
> tlbiels from 20.17M 19.68M.
> 
> fork --fork --exec benchmark improved 2.77% (12000->12300).
> 
> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>

Series applied to powerpc next, thanks.

https://git.kernel.org/powerpc/c/e5f7cb58c2b77a0249c2028b6d1ec4

cheers

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox