LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v7 0/2] hwmon/powernv: Add attributes to enable/disable sensors
From: Shilpasri G Bhat @ 2018-07-20  6:12 UTC (permalink / raw)
  To: mpe, linux
  Cc: linuxppc-dev, linux-hwmon, linux-kernel, ego, stewart,
	Shilpasri G Bhat

This patch series adds new attribute to enable or disable a sensor at
runtime.

Changes from v6:
- Updated the commit message as per Stewart's suggestion
- Use the compatible DT property instead of the path to find the node

v6 : https://lkml.org/lkml/2018/7/18/806
v5 : https://lkml.org/lkml/2018/7/15/15
v4 : https://lkml.org/lkml/2018/7/6/379
v3 : https://lkml.org/lkml/2018/7/5/476
v2 : https://lkml.org/lkml/2018/7/4/263
v1 : https://lkml.org/lkml/2018/3/22/214

Shilpasri G Bhat (2):
  powernv:opal-sensor-groups: Add support to enable sensor groups
  hwmon: ibmpowernv: Add attributes to enable/disable sensor groups

 Documentation/hwmon/ibmpowernv                     |  43 +++-
 arch/powerpc/include/asm/opal-api.h                |   1 +
 arch/powerpc/include/asm/opal.h                    |   2 +
 .../powerpc/platforms/powernv/opal-sensor-groups.c |  28 +++
 arch/powerpc/platforms/powernv/opal-wrappers.S     |   1 +
 drivers/hwmon/ibmpowernv.c                         | 249 ++++++++++++++++++---
 6 files changed, 290 insertions(+), 34 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

* Re: [powerpc/powervm]Oops: Kernel access of bad area, sig: 11 [#1] while running stress-ng
From: vrbagal1 @ 2018-07-20  6:02 UTC (permalink / raw)
  To: Brian Foster
  Cc: sachinp, Linuxppc-dev, Nicholas Piggin, linux-xfs, linux-fsdevel,
	linuxppc-dev
In-Reply-To: <20180719133310.GA29404@bfoster>

On 2018-07-19 19:03, Brian Foster wrote:
> cc linux-xfs
> 
> On Thu, Jul 19, 2018 at 11:47:59AM +0530, vrbagal1 wrote:
>> On 2018-07-10 19:12, Michael Ellerman wrote:
>> > vrbagal1 <vrbagal1@linux.vnet.ibm.com> writes:
>> >
>> > > On 2018-07-10 13:37, Nicholas Piggin wrote:
>> > > > On Tue, 10 Jul 2018 11:58:40 +0530
>> > > > vrbagal1 <vrbagal1@linux.vnet.ibm.com> wrote:
>> > > >
>> > > > > Hi,
>> > > > >
>> > > > > Observing kernel oops on Power9(ZZ) box, running on PowerVM, while
>> > > > > running stress-ng.
>> > > > >
>> > > > >
>> > > > > Kernel: 4.18.0-rc4
>> > > > > Machine: Power9 ZZ (PowerVM)
>> > > > > Test: Stress-ng
>> > > > >
>> > > > > Attached is .config file
>> > > > >
>> > > > > Traces:
>> > > > >
>> > > > >   [12251.245209] Oops: Kernel access of bad area, sig: 11 [#1]
>> > > >
>> > > > Can you post the lines above this? Otherwise we don't know what
>> > > > address
>> > > > it tried to access (without decoding the instructions and
>> > > > reconstructing
>> > > > it from registers at least, which the XFS devs wouldn't be
>> > > > inclined to
>> > > > do).
>> > > >
>> > >
>> > > ah my bad.
>> > >
>> > >   [12251.245179] Unable to handle kernel paging request for data at
>> > > address 0x6000000060000000
>> > >   [12251.245199] Faulting instruction address: 0xc000000000319e2c
>> >
>> > Which matches the regs & disassembly:
>> >
>> > r4 = 6000000060000000
>> > r9 = 0
>> > ldx     r9,r4,r9	<- pop
>> >
>> > So object was 0x6000000060000000.
>> >
>> > That looks like two nops, ie. we got some code?
>> >
>> > And there's only one caller of prefetch_freepointer() in
>> > slab_alloc_node():
>> >
>> > 	prefetch_freepointer(s, next_object);
>> >
>> >
>> > So slab corruption is looking likely.
>> >
>> > Do you have slub_debug=FZP on the kernel command line?
>> 
>> I tried to reproduce this, but didnt succeed. But instead I got xfs
>> assertion.
>> 
>> Kernel: 4.18.0-rc5
>> Tree Head: 30b06abfb92bfd5f9b63ea6a2ffb0bd905d1a6da
>> 
>> Traces:
>> 
>>  [12290.993612] XFS: Assertion failed: flags & XFS_BMAPI_COWFORK, 
>> file:
>> fs/xfs/libxfs/xfs_bmap.c, line: 4370
> 
> This usually means we have a writeback of a page with delayed 
> allocation
> buffers over an extent map that reflects a hole in the file. This 
> should
> never happen for normal (non-reflink) data fork writes, hence the 
> assert
> failure and -EIO return.
> 
> We used to actually (accidentally) allocate a new block in this case,
> but more recent kernels generate the error. More interestingly, I think
> the pending iomap + writeback rework in XFS may simply drop this error
> on the floor because we refer to extent state to process the page 
> rather
> than the other way around (and buffer_delay() isn't a state in the 
> iomap
> bits). This of course depends on which state is actually valid, the
> buffer or extent map, which we don't know for sure.
> 
> Can you reliably reproduce this problem? If so, can you describe the
> reproducer? Also, what is the filesystem geometry ('xfs_info <mnt>')?
> And since this is a power kernel, I'm guessing you have 64k pages as
> well..?
> 
> Brian


I tried, but couldnt hit the issue. But this issue was hit running 
stress-ng test case, last time(saw some xfs traces) and this time.
More specific I was running: /bin/avocado run 
avocado-misc-tests/generic/stress-ng.py -m 
/root/avocado-fvt-wrapper/tests/avocado-misc-tests/generic/stress-ng.py.data/stress-ng-cpu.yaml

xfs_info o/p:

# xfs_info /dev/sdi2
meta-data=/dev/sdi2              isize=512    agcount=4, agsize=65536 
blks
          =                       sectsz=512   attr=2, projid32bit=1
          =                       crc=1        finobt=0 spinodes=0
data     =                       bsize=4096   blocks=262144, imaxpct=25
          =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=2560, version=2
          =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0


Yes, 64K pagesize.

Regards,
Venkat.


> 
>>  [12290.993668] ------------[ cut here ]------------
>>  [12290.993672] kernel BUG at fs/xfs/xfs_message.c:102!
>>  [12290.993676] Oops: Exception in kernel mode, sig: 5 [#1]
>>  [12290.993678] LE SMP NR_CPUS=2048 NUMA pSeries
>>  [12290.993697] Dumping ftrace buffer:
>>  [12290.993730]    (ftrace buffer empty)
>>  [12290.993735] Modules linked in: camellia_generic(E) 
>> cast6_generic(E)
>> cast_common(E) ppp_generic(E) serpent_generic(E) slhc(E) kvm_pr(E) 
>> kvm(E)
>> snd_seq(E) snd_seq_device(E) twofish_generic(E) snd_timer(E) snd(E)
>> soundcore(E) twofish_common(E) lrw(E) cuse(E) tgr192(E) hci_vhci(E)
>> vhost_net(E) bluetooth(E) ecdh_generic(E) wp512(E) rfkill(E)
>> vfio_iommu_spapr_tce(E) vfio_spapr_eeh(E) vfio(E) rmd320(E) uhid(E) 
>> tun(E)
>> tap(E) rmd256(E) rmd160(E) vhost_vsock(E) rmd128(E)
>> vmw_vsock_virtio_transport_common(E) vsock(E) vhost(E) uinput(E) 
>> md4(E)
>> unix_diag(E) dccp_ipv4(E) dccp(E) sha512_generic(E) binfmt_misc(E) 
>> fuse(E)
>> vfat(E) fat(E) btrfs(E) xor(E) zstd_decompress(E) zstd_compress(E) 
>> xxhash(E)
>> raid6_pq(E) ext4(E) mbcache(E) jbd2(E) fscrypto(E) loop(E) 
>> ip6t_rpfilter(E)
>> ipt_REJECT(E) nf_reject_ipv4(E) ip6t_REJECT(E)
>>  [12290.993784]  nf_reject_ipv6(E) xt_conntrack(E) ip_set(E) 
>> nfnetlink(E)
>> ebtable_nat(E) ebtable_broute(E) bridge(E) stp(E) llc(E) 
>> ip6table_nat(E)
>> nf_conntrack_ipv6(E) nf_defrag_ipv6(E) nf_nat_ipv6(E) 
>> ip6table_mangle(E)
>> ip6table_security(E) ip6table_raw(E) iptable_nat(E) 
>> nf_conntrack_ipv4(E)
>> nf_defrag_ipv4(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack(E) 
>> iptable_mangle(E)
>> iptable_security(E) iptable_raw(E) ebtable_filter(E) ebtables(E)
>> ip6table_filter(E) ip6_tables(E) iptable_filter(E) xts(E) sg(E)
>> vmx_crypto(E) pseries_rng(E) ip_tables(E) xfs(E) libcrc32c(E) 
>> sd_mod(E)
>> ibmvscsi(E) scsi_transport_srp(E) ibmveth(E) lpfc(E) mlx5_core(E)
>> nvmet_fc(E) nvmet(E) nvme_fc(E) nvme_fabrics(E) nvme_core(E)
>> scsi_transport_fc(E) mlxfw(E) devlink(E) dm_mirror(E) 
>> dm_region_hash(E)
>> dm_log(E) dm_mod(E) [last unloaded: torture]
>>  [12290.993834] CPU: 2 PID: 433 Comm: kswapd1 Tainted: G            E
>> 4.18.0-rc5-autotest-autotest #1
>>  [12290.993838] NIP:  d00000000ebcfb8c LR: d00000000ebcfb64 CTR:
>> c0000000005260a0
>>  [12290.993841] REGS: c000000feee7ef40 TRAP: 0700   Tainted: G         
>>    E
>> (4.18.0-rc5-autotest-autotest)
>>  [12290.993845] MSR:  800000010282b033
>> <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]>  CR: 22224044  XER: 0000000d
>>  [12290.993854] CFAR: d00000000ebcfb74 IRQMASK: 0
>>  [12290.993854] GPR00: d00000000ebcfb64 c000000feee7f1c0 
>> d00000000ec58700
>> ffffffffffffffea
>>  [12290.993854] GPR04: 000000000000000a c000000feee7f0c0 
>> ffffffffffffffc0
>> d00000000ec41620
>>  [12290.993854] GPR08: ffffffffffffffca 0000000000000001 
>> 0000000000000000
>> 0000000000000000
>>  [12290.993854] GPR12: c0000000005260a0 c00000001ec9d600 
>> c000000feee7f510
>> 0000000040000000
>>  [12290.993854] GPR16: 0000000020000000 0000000000000000 
>> c000000fc6308000
>> c000000feee7f4c0
>>  [12290.993854] GPR20: c000000feee7f520 c000000feee7f520 
>> 000ffffffffe0000
>> 0000000000000000
>>  [12290.993854] GPR24: 00000000001fffff 0000000000000001 
>> c000000f689c8800
>> 0000000000000000
>>  [12290.993854] GPR28: c000000f689c8848 c000000feee7f520 
>> 0000000000000001
>> 0000000000000400
>>  [12290.993907] NIP [d00000000ebcfb8c] assfail+0x5c/0x60 [xfs]
>>  [12290.993921] LR [d00000000ebcfb64] assfail+0x34/0x60 [xfs]
>>  [12290.993923] Call Trace:
>>  [12290.993936] [c000000feee7f1c0] [d00000000ebcfb64] 
>> assfail+0x34/0x60
>> [xfs] (unreliable)
>>  [12290.993947] [c000000feee7f220] [d00000000eb542f0]
>> xfs_bmapi_write+0x10f0/0x1260 [xfs]
>>  [12290.993961] [c000000feee7f450] [d00000000ebc34e8]
>> xfs_iomap_write_allocate+0x1b8/0x430 [xfs]
>>  [12290.993975] [c000000feee7f5c0] [d00000000eba0de4]
>> xfs_map_blocks+0x234/0x470 [xfs]
>>  [12290.993988] [c000000feee7f630] [d00000000eba2d4c]
>> xfs_do_writepage+0x27c/0x880 [xfs]
>>  [12290.994001] [c000000feee7f710] [d00000000eba3314]
>> xfs_do_writepage+0x844/0x880 [xfs]
>>  [12290.994007] [c000000feee7f780] [c00000000029df24]
>> pageout.isra.37+0x294/0x4a0
>>  [12290.994011] [c000000feee7f860] [c0000000002a0a28]
>> shrink_page_list+0x948/0x1090
>>  [12290.994015] [c000000feee7f980] [c0000000002a1c8c]
>> shrink_inactive_list+0x2ec/0x720
>>  [12290.994019] [c000000feee7fa40] [c0000000002a28d8]
>> shrink_node_memcg+0x238/0x7d0
>>  [12290.994024] [c000000feee7fb40] [c0000000002a2f94]
>> shrink_node+0x124/0x610
>>  [12290.994027] [c000000feee7fc10] [c0000000002a46d0]
>> balance_pgdat+0x200/0x460
>>  [12290.994031] [c000000feee7fcf0] [c0000000002a4afc] 
>> kswapd+0x1cc/0x5f0
>>  [12290.994036] [c000000feee7fdc0] [c00000000012ca6c] 
>> kthread+0x15c/0x1a0
>>  [12290.994040] [c000000feee7fe30] [c00000000000b65c]
>> ret_from_kernel_thread+0x5c/0x80
>>  [12290.994043] Instruction dump:
>>  [12290.994046] f821ffa1 4bfff8a9 3d220000 e929cb40 89290008 2f890000
>> 40de0018 0fe00000
>>  [12290.994053] 38210060 e8010010 7c0803a6 4e800020 <0fe00000> 
>> 3c4c0009
>> 38428b70 7c0802a6
>>  [12290.994061] ---[ end trace 84e4cce770c4d4e2 ]---
>>  [12290.996233]
>>  [12291.996258] Kernel panic - not syncing: Fatal exception
>>  [12292.006592] Dumping ftrace buffer:
>>  [12292.006637]    (ftrace buffer empty)
>>  [12292.013222] WARNING: CPU: 2 PID: 433 at drivers/tty/vt/vt.c:3885
>> do_unblank_screen+0x278/0x280
>>  [12292.013228] Modules linked in: camellia_generic(E) 
>> cast6_generic(E)
>> cast_common(E) ppp_generic(E) serpent_generic(E) slhc(E) kvm_pr(E) 
>> kvm(E)
>> snd_seq(E) snd_seq_device(E) twofish_generic(E) snd_timer(E) snd(E)
>> soundcore(E) twofish_common(E) lrw(E) cuse(E) tgr192(E) hci_vhci(E)
>> vhost_net(E) bluetooth(E) ecdh_generic(E) wp512(E) rfkill(E)
>> vfio_iommu_spapr_tce(E) vfio_spapr_eeh(E) vfio(E) rmd320(E) uhid(E) 
>> tun(E)
>> tap(E) rmd256(E) rmd160(E) vhost_vsock(E) rmd128(E)
>> vmw_vsock_virtio_transport_common(E) vsock(E) vhost(E) uinput(E) 
>> md4(E)
>> unix_diag(E) dccp_ipv4(E) dccp(E) sha512_generic(E) binfmt_misc(E) 
>> fuse(E)
>> vfat(E) fat(E) btrfs(E) xor(E) zstd_decompress(E) zstd_compress(E) 
>> xxhash(E)
>> raid6_pq(E) ext4(E) mbcache(E) jbd2(E) fscrypto(E) loop(E) 
>> ip6t_rpfilter(E)
>> ipt_REJECT(E) nf_reject_ipv4(E) ip6t_REJECT(E)
>>  [12292.013315]  nf_reject_ipv6(E) xt_conntrack(E) ip_set(E) 
>> nfnetlink(E)
>> ebtable_nat(E) ebtable_broute(E) bridge(E) stp(E) llc(E) 
>> ip6table_nat(E)
>> nf_conntrack_ipv6(E) nf_defrag_ipv6(E) nf_nat_ipv6(E) 
>> ip6table_mangle(E)
>> ip6table_security(E) ip6table_raw(E) iptable_nat(E) 
>> nf_conntrack_ipv4(E)
>> nf_defrag_ipv4(E) nf_nat_ipv4(E) nf_nat(E) nf_conntrack(E) 
>> iptable_mangle(E)
>> iptable_security(E) iptable_raw(E) ebtable_filter(E) ebtables(E)
>> ip6table_filter(E) ip6_tables(E) iptable_filter(E) xts(E) sg(E)
>> vmx_crypto(E) pseries_rng(E) ip_tables(E) xfs(E) libcrc32c(E) 
>> sd_mod(E)
>> ibmvscsi(E) scsi_transport_srp(E) ibmveth(E) lpfc(E) mlx5_core(E)
>> nvmet_fc(E) nvmet(E) nvme_fc(E) nvme_fabrics(E) nvme_core(E)
>> scsi_transport_fc(E) mlxfw(E) devlink(E) dm_mirror(E) 
>> dm_region_hash(E)
>> dm_log(E) dm_mod(E) [last unloaded: torture]
>>  [12292.013394] CPU: 2 PID: 433 Comm: kswapd1 Tainted: G      D     E
>> 4.18.0-rc5-autotest-autotest #1
>>  [12292.013400] NIP:  c0000000005e9cb8 LR: c0000000005e9a7c CTR:
>> c00000000002bcc0
>>  [12292.013406] REGS: c000000feee7e880 TRAP: 0700   Tainted: G      D  
>>    E
>> (4.18.0-rc5-autotest-autotest)
>>  [12292.013411] MSR:  8000000000021033 <SF,ME,IR,DR,RI,LE>  CR: 
>> 28222042
>> XER: 20040009
>>  [12292.013422] CFAR: c0000000005e9a90 IRQMASK: 3
>>  [12292.013422] GPR00: c0000000005e9c74 c000000feee7eb00 
>> c000000001150200
>> 0000000000000000
>>  [12292.013422] GPR04: 0000000000000003 000000000000000b 
>> 00000000646f6320
>> 3030303020726c20
>>  [12292.013422] GPR08: 0000000000000000 0000000000000000 
>> 0000000000000000
>> 3830303031303030
>>  [12292.013422] GPR12: c00000000002bcc0 c00000001ec9d600 
>> c000000feee7f510
>> 0000000040000000
>>  [12292.013422] GPR16: 0000000020000000 0000000000000000 
>> c000000fc6308000
>> c000000feee7f4c0
>>  [12292.013422] GPR20: c000000feee7f520 c000000feee7f520 
>> 000ffffffffe0000
>> 0000000000000000
>>  [12292.013422] GPR24: 00000000001fffff 0000000000000001 
>> c000000000ff9e60
>> c0000000013067e8
>>  [12292.013422] GPR28: c0000000013067c0 0000000000000000 
>> c000000001414f60
>> c0000000013067e8
>>  [12292.013483] NIP [c0000000005e9cb8] do_unblank_screen+0x278/0x280
>>  [12292.013488] LR [c0000000005e9a7c] do_unblank_screen+0x3c/0x280
>>  [12292.013492] Call Trace:
>>  [12292.013496] [c000000feee7eb00] [c0000000005e9c74]
>> do_unblank_screen+0x234/0x280 (unreliable)
>>  [12292.013505] [c000000feee7eb80] [c000000000513300]
>> bust_spinlocks+0x40/0x80
>>  [12292.013511] [c000000feee7eba0] [c0000000001003b0] 
>> panic+0x1b8/0x308
>>  [12292.013518] [c000000feee7ec30] [c000000000023a3c] 
>> oops_end+0x1ec/0x1f0
>>  [12292.013524] [c000000feee7ecb0] [c000000000023f54]
>> _exception_pkey+0x1c4/0x1f0
>>  [12292.013530] [c000000feee7ee50] [c000000000026020]
>> program_check_exception+0x2c0/0x3e0
>>  [12292.013537] [c000000feee7eed0] [c000000000008e94]
>> program_check_common+0x134/0x140
>>  [12292.013579] --- interrupt: 700 at assfail+0x5c/0x60 [xfs]
>>  [12292.013579]     LR = assfail+0x34/0x60 [xfs]
>>  [12292.013597] [c000000feee7f220] [d00000000eb542f0]
>> xfs_bmapi_write+0x10f0/0x1260 [xfs]
>>  [12292.013619] [c000000feee7f450] [d00000000ebc34e8]
>> xfs_iomap_write_allocate+0x1b8/0x430 [xfs]
>>  [12292.013641] [c000000feee7f5c0] [d00000000eba0de4]
>> xfs_map_blocks+0x234/0x470 [xfs]
>>  [12292.013662] [c000000feee7f630] [d00000000eba2d4c]
>> xfs_do_writepage+0x27c/0x880 [xfs]
>>  [12292.013682] [c000000feee7f710] [d00000000eba3314]
>> xfs_do_writepage+0x844/0x880 [xfs]
>>  [12292.013689] [c000000feee7f780] [c00000000029df24]
>> pageout.isra.37+0x294/0x4a0
>>  [12292.013696] [c000000feee7f860] [c0000000002a0a28]
>> shrink_page_list+0x948/0x1090
>>  [12292.013702] [c000000feee7f980] [c0000000002a1c8c]
>> shrink_inactive_list+0x2ec/0x720
>>  [12292.013708] [c000000feee7fa40] [c0000000002a28d8]
>> shrink_node_memcg+0x238/0x7d0
>>  [12292.013714] [c000000feee7fb40] [c0000000002a2f94]
>> shrink_node+0x124/0x610
>>  [12292.013720] [c000000feee7fc10] [c0000000002a46d0]
>> balance_pgdat+0x200/0x460
>>  [12292.013726] [c000000feee7fcf0] [c0000000002a4afc] 
>> kswapd+0x1cc/0x5f0
>>  [12292.013732] [c000000feee7fdc0] [c00000000012ca6c] 
>> kthread+0x15c/0x1a0
>>  [12292.013738] [c000000feee7fe30] [c00000000000b65c]
>> ret_from_kernel_thread+0x5c/0x80
>>  [12292.013743] Instruction dump:
>>  [12292.013748] 1d290064 3c62ffef 3863ab50 e94a0000 7d2407b4 7c845214
>> 4bbbacc9 60000000
>>  [12292.013758] 39200001 3d42003e 912adff8 4bfffe68 <0fe00000> 
>> 4bfffdd8
>> 3c4c00b6 38426540
>>  [12292.013769] ---[ end trace 84e4cce770c4d4e3 ]---
>>  [12292.013774] Rebooting in 10 seconds..
>> 
>> Regards,
>> Venkat.
>> >
>> > cheers
>> 

^ permalink raw reply

* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier
From: Benjamin Herrenschmidt @ 2018-07-20  4:57 UTC (permalink / raw)
  To: Andy Lutomirski, Rik van Riel, Peter Zijlstra, Vitaly Kuznetsov,
	Juergen Gross, Boris Ostrovsky, linux-arch, Will Deacon,
	Catalin Marinas, linux-s390, linuxppc-dev
  Cc: LKML, X86 ML, Mike Galbraith, kernel-team, Ingo Molnar,
	Dave Hansen, Nick Piggin, Aneesh Kumar K.V
In-Reply-To: <CALCETrXLMsSBChDvrms-omwYV4LHT30GenDjbnD-+LTg55yPow@mail.gmail.com>

On Thu, 2018-07-19 at 10:04 -0700, Andy Lutomirski wrote:
> On Thu, Jul 19, 2018 at 9:45 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> > [I added PeterZ and Vitaly -- can you see any way in which this would
> > break something obscure?  I don't.]

Added Nick and Aneesh. We do have HW remote flushes on powerpc.

> > On Thu, Jul 19, 2018 at 7:14 AM, Rik van Riel <riel@surriel.com> wrote:
> > > I guess we can skip both switch_ldt and load_mm_cr4 if real_prev equals
> > > next?
> > 
> > Yes, AFAICS.
> > 
> > > 
> > > On to the lazy TLB mm_struct refcounting stuff :)
> > > 
> > > > 
> > > > Which refcount?  mm_users shouldn’t be hot, so I assume you’re talking about
> > > > mm_count. My suggestion is to get rid of mm_count instead of trying to
> > > > optimize it.
> > > 
> > > 
> > > Do you have any suggestions on how? :)
> > > 
> > > The TLB shootdown sent at __exit_mm time does not get rid of the
> > > kernelthread->active_mm
> > > pointer pointing at the mm that is exiting.
> > > 
> > 
> > Ah, but that's conceptually very easy to fix.  Add a #define like
> > ARCH_NO_TASK_ACTIVE_MM.  Then just get rid of active_mm if that
> > #define is set.  After some grepping, there are very few users.  The
> > only nontrivial ones are the ones in kernel/ and mm/mmu_context.c that
> > are involved in the rather complicated dance of refcounting active_mm.
> > If that field goes away, it doesn't need to be refcounted.  Instead, I
> > think the refcounting can get replaced with something like:
> > 
> > /*
> >  * Release any arch-internal references to mm.  Only called when
> > mm_users is zero
> >  * and all tasks using mm have either been switch_mm()'d away or have had
> >  * enter_lazy_tlb() called.
> >  */
> > extern void arch_shoot_down_dead_mm(struct mm_struct *mm);
> > 
> > which the kernel calls in __mmput() after tearing down all the page
> > tables.  The body can be something like:
> > 
> > if (WARN_ON(cpumask_any_but(mm_cpumask(...), ...)) {
> >   /* send an IPI.  Maybe just call tlb_flush_remove_tables() */
> > }
> > 
> > (You'll also have to fix up the highly questionable users in
> > arch/x86/platform/efi/efi_64.c, but that's easy.)
> > 
> > Does all that make sense?  Basically, as I understand it, the
> > expensive atomic ops you're seeing are all pointless because they're
> > enabling an optimization that hasn't actually worked for a long time,
> > if ever.
> 
> Hmm.  Xen PV has a big hack in xen_exit_mmap(), which is called from
> arch_exit_mmap(), I think.  It's a heavier weight version of more or
> less the same thing that arch_shoot_down_dead_mm() would be, except
> that it happens before exit_mmap().  But maybe Xen actually has the
> right idea.  In other words, rather doing the big pagetable free in
> exit_mmap() while there may still be other CPUs pointing at the page
> tables, the other order might make more sense.  So maybe, if
> ARCH_NO_TASK_ACTIVE_MM is set, arch_exit_mmap() should be responsible
> for getting rid of all secret arch references to the mm.
> 
> Hmm.  ARCH_FREE_UNUSED_MM_IMMEDIATELY might be a better name.
> 
> I added some more arch maintainers.  The idea here is that, on x86 at
> least, task->active_mm and all its refcounting is pure overhead.  When
> a process exits, __mmput() gets called, but the core kernel has a
> longstanding "optimization" in which other tasks (kernel threads and
> idle tasks) may have ->active_mm pointing at this mm.  This is nasty,
> complicated, and hurts performance on large systems, since it requires
> extra atomic operations whenever a CPU switches between real users
> threads and idle/kernel threads.
> 
> It's also almost completely worthless on x86 at least, since __mmput()
> frees pagetables, and that operation *already* forces a remote TLB
> flush, so we might as well zap all the active_mm references at the
> same time.
> 
> But arm64 has real HW remote flushes.  Does arm64 actually benefit
> from the active_mm optimization?  What happens on arm64 when a process
> exits?  How about s390?  I suspect that x390 has rather larger systems
> than arm64, where the cost of the reference counting can be much
> higher.
> 
> (Also, Rik, x86 on Hyper-V has remote flushes, too. How does that
> interact with your previous patch set?)

^ permalink raw reply

* Re: [PATCH v3] PCI: Data corruption happening due to race condition
From: Benjamin Herrenschmidt @ 2018-07-20  4:27 UTC (permalink / raw)
  To: Lukas Wunner, Hari Vyas
  Cc: Bjorn Helgaas, linux-pci, Ray Jui, Paul Mackerras,
	Michael Ellerman, linuxppc-dev
In-Reply-To: <20180719185538.GA7731@wunner.de>

On Thu, 2018-07-19 at 20:55 +0200, Lukas Wunner wrote:
> On Thu, Jul 19, 2018 at 9:48 AM, Benjamin Herrenschmidt <benh@kernel.crashing.org wrote:
> > Indeed. However I'm not fan of the solution. Shouldn't we instead have
> > some locking for the content of pci_dev? I've always been wary of us
> > having other similar races in there.
> 
> The solution presented is perfectly fine as it uses atomic bitops which
> obviate the need for locking.  Why do you want to add unnecessary locking
> on top?

Atomic bitops tend to be *more* expensive than a lock.

My concern is that the PCIe code historically had no locking and I
worry we may have other fields in there with similar issues. But maybe
I'm wrong.

> Certain other parts of struct pci_dev use their own locking, e.g.
> pci_bus_sem to protect bus_list.  Most elements can and should
> be accessed lockless for performance.
> 
> 
> > > The powerpc PCI code contains a lot of cruft coming from the depth of
> > > history, including rather nasty assumptions. We want to progressively
> > > clean it up, starting with EEH, but it will take time.
> 
> Then I suggest using the #include "../../../drivers/pci/pci.h" for now
> until the powerpc arch code has been consolidated.

There's also the need both in powerpc and sparc to access the guts of
pci_dev because those archs will "fabricate" as pci_dev from the
device-tree rather than probing it under some circumstances.

Cheers,
Ben.

^ permalink raw reply

* [RFC 4/4] virtio: Add platform specific DMA API translation for virito devices
From: Anshuman Khandual @ 2018-07-20  3:59 UTC (permalink / raw)
  To: virtualization, linux-kernel
  Cc: linuxppc-dev, aik, robh, joe, elfring, david, jasowang, benh, mpe,
	mst, hch, khandual, linuxram, haren, paulus, srikar
In-Reply-To: <20180720035941.6844-1-khandual@linux.vnet.ibm.com>

This adds a hook which a platform can define in order to allow it to
override virtio device's DMA OPS irrespective of whether it has the
flag VIRTIO_F_IOMMU_PLATFORM set or not. We want to use this to do
bounce-buffering of data on the new secure pSeries platform, currently
under development, where a KVM host cannot access all of the memory
space of a secure KVM guest.  The host can only access the pages which
the guest has explicitly requested to be shared with the host, thus
the virtio implementation in the guest has to copy data to and from
shared pages.

With this hook, the platform code in the secure guest can force the
use of swiotlb for virtio buffers, with a back-end for swiotlb which
will use a pool of pre-allocated shared pages.  Thus all data being
sent or received by virtio devices will be copied through pages which
the host has access to.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/dma-mapping.h | 6 ++++++
 arch/powerpc/platforms/pseries/iommu.c | 6 ++++++
 drivers/virtio/virtio.c                | 7 +++++++
 3 files changed, 19 insertions(+)

diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
index 8fa3945..bc5a9d3 100644
--- a/arch/powerpc/include/asm/dma-mapping.h
+++ b/arch/powerpc/include/asm/dma-mapping.h
@@ -116,3 +116,9 @@ extern u64 __dma_get_required_mask(struct device *dev);
 
 #endif /* __KERNEL__ */
 #endif	/* _ASM_DMA_MAPPING_H */
+
+#define platform_override_dma_ops platform_override_dma_ops
+
+struct virtio_device;
+
+extern void platform_override_dma_ops(struct virtio_device *vdev);
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 06f0296..5773bc7 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -38,6 +38,7 @@
 #include <linux/of.h>
 #include <linux/iommu.h>
 #include <linux/rculist.h>
+#include <linux/virtio.h>
 #include <asm/io.h>
 #include <asm/prom.h>
 #include <asm/rtas.h>
@@ -1396,3 +1397,8 @@ static int __init disable_multitce(char *str)
 __setup("multitce=", disable_multitce);
 
 machine_subsys_initcall_sync(pseries, tce_iommu_bus_notifier_init);
+
+void platform_override_dma_ops(struct virtio_device *vdev)
+{
+	/* Override vdev->parent.dma_ops if required */
+}
diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 6b13987..432c332 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -168,6 +168,12 @@ EXPORT_SYMBOL_GPL(virtio_add_status);
 
 const struct dma_map_ops virtio_direct_dma_ops;
 
+#ifndef platform_override_dma_ops
+static inline void platform_override_dma_ops(struct virtio_device *vdev)
+{
+}
+#endif
+
 int virtio_finalize_features(struct virtio_device *dev)
 {
 	int ret = dev->config->finalize_features(dev);
@@ -179,6 +185,7 @@ int virtio_finalize_features(struct virtio_device *dev)
 	if (virtio_has_iommu_quirk(dev))
 		set_dma_ops(dev->dev.parent, &virtio_direct_dma_ops);
 
+	platform_override_dma_ops(dev);
 	if (!virtio_has_feature(dev, VIRTIO_F_VERSION_1))
 		return 0;
 
-- 
2.9.3

^ permalink raw reply related

* [RFC 3/4] virtio: Force virtio core to use DMA API callbacks for all virtio devices
From: Anshuman Khandual @ 2018-07-20  3:59 UTC (permalink / raw)
  To: virtualization, linux-kernel
  Cc: linuxppc-dev, aik, robh, joe, elfring, david, jasowang, benh, mpe,
	mst, hch, khandual, linuxram, haren, paulus, srikar
In-Reply-To: <20180720035941.6844-1-khandual@linux.vnet.ibm.com>

Virtio core should use DMA API callbacks for all virtio devices which may
generate either GAP or IOVA depending on VIRTIO_F_IOMMU_PLATFORM flag and
resulting QEMU expectations. This implies that every virtio device needs
to have a DMA OPS structure. This removes previous GPA fallback code paths.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 drivers/virtio/virtio_ring.c | 65 ++------------------------------------------
 1 file changed, 2 insertions(+), 63 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index 814b395..c265964 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -141,26 +141,6 @@ struct vring_virtqueue {
  * unconditionally on data path.
  */
 
-static bool vring_use_dma_api(struct virtio_device *vdev)
-{
-	if (!virtio_has_iommu_quirk(vdev))
-		return true;
-
-	/* Otherwise, we are left to guess. */
-	/*
-	 * In theory, it's possible to have a buggy QEMU-supposed
-	 * emulated Q35 IOMMU and Xen enabled at the same time.  On
-	 * such a configuration, virtio has never worked and will
-	 * not work without an even larger kludge.  Instead, enable
-	 * the DMA API if we're a Xen guest, which at least allows
-	 * all of the sensible Xen configurations to work correctly.
-	 */
-	if (xen_domain())
-		return true;
-
-	return false;
-}
-
 /*
  * The DMA ops on various arches are rather gnarly right now, and
  * making all of the arch DMA ops work on the vring device itself
@@ -176,9 +156,6 @@ static dma_addr_t vring_map_one_sg(const struct vring_virtqueue *vq,
 				   struct scatterlist *sg,
 				   enum dma_data_direction direction)
 {
-	if (!vring_use_dma_api(vq->vq.vdev))
-		return (dma_addr_t)sg_phys(sg);
-
 	/*
 	 * We can't use dma_map_sg, because we don't use scatterlists in
 	 * the way it expects (we don't guarantee that the scatterlist
@@ -193,9 +170,6 @@ static dma_addr_t vring_map_single(const struct vring_virtqueue *vq,
 				   void *cpu_addr, size_t size,
 				   enum dma_data_direction direction)
 {
-	if (!vring_use_dma_api(vq->vq.vdev))
-		return (dma_addr_t)virt_to_phys(cpu_addr);
-
 	return dma_map_single(vring_dma_dev(vq),
 			      cpu_addr, size, direction);
 }
@@ -205,9 +179,6 @@ static void vring_unmap_one(const struct vring_virtqueue *vq,
 {
 	u16 flags;
 
-	if (!vring_use_dma_api(vq->vq.vdev))
-		return;
-
 	flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
 
 	if (flags & VRING_DESC_F_INDIRECT) {
@@ -228,9 +199,6 @@ static void vring_unmap_one(const struct vring_virtqueue *vq,
 static int vring_mapping_error(const struct vring_virtqueue *vq,
 			       dma_addr_t addr)
 {
-	if (!vring_use_dma_api(vq->vq.vdev))
-		return 0;
-
 	return dma_mapping_error(vring_dma_dev(vq), addr);
 }
 
@@ -1016,43 +984,14 @@ EXPORT_SYMBOL_GPL(__vring_new_virtqueue);
 static void *vring_alloc_queue(struct virtio_device *vdev, size_t size,
 			      dma_addr_t *dma_handle, gfp_t flag)
 {
-	if (vring_use_dma_api(vdev)) {
-		return dma_alloc_coherent(vdev->dev.parent, size,
+	return dma_alloc_coherent(vdev->dev.parent, size,
 					  dma_handle, flag);
-	} else {
-		void *queue = alloc_pages_exact(PAGE_ALIGN(size), flag);
-		if (queue) {
-			phys_addr_t phys_addr = virt_to_phys(queue);
-			*dma_handle = (dma_addr_t)phys_addr;
-
-			/*
-			 * Sanity check: make sure we dind't truncate
-			 * the address.  The only arches I can find that
-			 * have 64-bit phys_addr_t but 32-bit dma_addr_t
-			 * are certain non-highmem MIPS and x86
-			 * configurations, but these configurations
-			 * should never allocate physical pages above 32
-			 * bits, so this is fine.  Just in case, throw a
-			 * warning and abort if we end up with an
-			 * unrepresentable address.
-			 */
-			if (WARN_ON_ONCE(*dma_handle != phys_addr)) {
-				free_pages_exact(queue, PAGE_ALIGN(size));
-				return NULL;
-			}
-		}
-		return queue;
-	}
 }
 
 static void vring_free_queue(struct virtio_device *vdev, size_t size,
 			     void *queue, dma_addr_t dma_handle)
 {
-	if (vring_use_dma_api(vdev)) {
-		dma_free_coherent(vdev->dev.parent, size, queue, dma_handle);
-	} else {
-		free_pages_exact(queue, PAGE_ALIGN(size));
-	}
+	dma_free_coherent(vdev->dev.parent, size, queue, dma_handle);
 }
 
 struct virtqueue *vring_create_virtqueue(
-- 
2.9.3

^ permalink raw reply related

* [RFC 2/4] virtio: Override device's DMA OPS with virtio_direct_dma_ops selectively
From: Anshuman Khandual @ 2018-07-20  3:59 UTC (permalink / raw)
  To: virtualization, linux-kernel
  Cc: linuxppc-dev, aik, robh, joe, elfring, david, jasowang, benh, mpe,
	mst, hch, khandual, linuxram, haren, paulus, srikar
In-Reply-To: <20180720035941.6844-1-khandual@linux.vnet.ibm.com>

Now that virtio core always needs all virtio devices to have DMA OPS, we
need to make sure that the structure it points is the right one. In the
absence of VIRTIO_F_IOMMU_PLATFORM flag QEMU expects GPA from guest kernel.
In such case, virtio device must use default virtio_direct_dma_ops DMA OPS
structure which transforms scatter gather buffer addresses as GPA. This
DMA OPS override must happen as early as possible during virtio device
initializatin sequence before virtio core starts using given device's DMA
OPS callbacks for I/O transactions. This change detects device's IOMMU flag
and does the override in case the flag is cleared.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 drivers/virtio/virtio.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 7907ad3..6b13987 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -166,6 +166,8 @@ void virtio_add_status(struct virtio_device *dev, unsigned int status)
 }
 EXPORT_SYMBOL_GPL(virtio_add_status);
 
+const struct dma_map_ops virtio_direct_dma_ops;
+
 int virtio_finalize_features(struct virtio_device *dev)
 {
 	int ret = dev->config->finalize_features(dev);
@@ -174,6 +176,9 @@ int virtio_finalize_features(struct virtio_device *dev)
 	if (ret)
 		return ret;
 
+	if (virtio_has_iommu_quirk(dev))
+		set_dma_ops(dev->dev.parent, &virtio_direct_dma_ops);
+
 	if (!virtio_has_feature(dev, VIRTIO_F_VERSION_1))
 		return 0;
 
-- 
2.9.3

^ permalink raw reply related

* [RFC 1/4] virtio: Define virtio_direct_dma_ops structure
From: Anshuman Khandual @ 2018-07-20  3:59 UTC (permalink / raw)
  To: virtualization, linux-kernel
  Cc: linuxppc-dev, aik, robh, joe, elfring, david, jasowang, benh, mpe,
	mst, hch, khandual, linuxram, haren, paulus, srikar
In-Reply-To: <20180720035941.6844-1-khandual@linux.vnet.ibm.com>

Current implementation of DMA API inside virtio core calls device's DMA OPS
callback functions when the flag VIRTIO_F_IOMMU_PLATFORM flag is set. But
in absence of the flag, virtio core falls back calling basic transformation
of the incoming SG addresses as GPA. Going forward virtio should only call
DMA API based transformations generating either GPA or IOVA depending on
QEMU expectations again based on VIRTIO_F_IOMMU_PLATFORM flag. It requires
removing existing fallback code path for GPA transformation and replacing
that with a direct map DMA OPS structure. This adds that direct mapping DMA
OPS structure to be used in later patches which will make virtio core call
DMA API all the time for all virtio devices.

Signed-off-by: Anshuman Khandual <khandual@linux.vnet.ibm.com>
---
 drivers/virtio/virtio.c            | 60 ++++++++++++++++++++++++++++++++++++++
 drivers/virtio/virtio_pci_common.h |  3 ++
 2 files changed, 63 insertions(+)

diff --git a/drivers/virtio/virtio.c b/drivers/virtio/virtio.c
index 59e36ef..7907ad3 100644
--- a/drivers/virtio/virtio.c
+++ b/drivers/virtio/virtio.c
@@ -3,6 +3,7 @@
 #include <linux/virtio_config.h>
 #include <linux/module.h>
 #include <linux/idr.h>
+#include <linux/dma-mapping.h>
 #include <uapi/linux/virtio_ids.h>
 
 /* Unique numbering for virtio devices. */
@@ -442,3 +443,62 @@ core_initcall(virtio_init);
 module_exit(virtio_exit);
 
 MODULE_LICENSE("GPL");
+
+/*
+ * Virtio direct mapping DMA API operations structure
+ *
+ * This defines DMA API structure for all virtio devices which would not
+ * either bring in their own DMA OPS from architecture or they would not
+ * like to use architecture specific IOMMU based DMA OPS because QEMU
+ * expects GPA instead of an IOVA in absence of VIRTIO_F_IOMMU_PLATFORM.
+ */
+dma_addr_t virtio_direct_map_page(struct device *dev, struct page *page,
+			    unsigned long offset, size_t size,
+			    enum dma_data_direction dir,
+			    unsigned long attrs)
+{
+	return page_to_phys(page) + offset;
+}
+
+void virtio_direct_unmap_page(struct device *hwdev, dma_addr_t dev_addr,
+			size_t size, enum dma_data_direction dir,
+			unsigned long attrs)
+{
+}
+
+int virtio_direct_mapping_error(struct device *hwdev, dma_addr_t dma_addr)
+{
+	return 0;
+}
+
+void *virtio_direct_alloc(struct device *dev, size_t size, dma_addr_t *dma_handle,
+		gfp_t gfp, unsigned long attrs)
+{
+	void *queue = alloc_pages_exact(PAGE_ALIGN(size), gfp);
+
+	if (queue) {
+		phys_addr_t phys_addr = virt_to_phys(queue);
+		*dma_handle = (dma_addr_t)phys_addr;
+
+		if (WARN_ON_ONCE(*dma_handle != phys_addr)) {
+			free_pages_exact(queue, PAGE_ALIGN(size));
+			return NULL;
+		}
+	}
+	return queue;
+}
+
+void virtio_direct_free(struct device *dev, size_t size, void *vaddr,
+		dma_addr_t dma_addr, unsigned long attrs)
+{
+	free_pages_exact(vaddr, PAGE_ALIGN(size));
+}
+
+const struct dma_map_ops virtio_direct_dma_ops = {
+	.alloc			= virtio_direct_alloc,
+	.free			= virtio_direct_free,
+	.map_page		= virtio_direct_map_page,
+	.unmap_page		= virtio_direct_unmap_page,
+	.mapping_error		= virtio_direct_mapping_error,
+};
+EXPORT_SYMBOL(virtio_direct_dma_ops);
diff --git a/drivers/virtio/virtio_pci_common.h b/drivers/virtio/virtio_pci_common.h
index 135ee3c..ec44d2f 100644
--- a/drivers/virtio/virtio_pci_common.h
+++ b/drivers/virtio/virtio_pci_common.h
@@ -31,6 +31,9 @@
 #include <linux/highmem.h>
 #include <linux/spinlock.h>
 
+extern struct dma_map_ops virtio_direct_dma_ops;
+
+
 struct virtio_pci_vq_info {
 	/* the actual virtqueue */
 	struct virtqueue *vq;
-- 
2.9.3

^ permalink raw reply related

* [RFC 0/4] Virtio uses DMA API for all devices
From: Anshuman Khandual @ 2018-07-20  3:59 UTC (permalink / raw)
  To: virtualization, linux-kernel
  Cc: linuxppc-dev, aik, robh, joe, elfring, david, jasowang, benh, mpe,
	mst, hch, khandual, linuxram, haren, paulus, srikar

This patch series is the follow up on the discussions we had before about
the RFC titled [RFC,V2] virtio: Add platform specific DMA API translation
for virito devices (https://patchwork.kernel.org/patch/10417371/). There
were suggestions about doing away with two different paths of transactions
with the host/QEMU, first being the direct GPA and the other being the DMA
API based translations.

First patch attempts to create a direct GPA mapping based DMA operations
structure called 'virtio_direct_dma_ops' with exact same implementation
of the direct GPA path which virtio core currently has but just wrapped in
a DMA API format. Virtio core must use 'virtio_direct_dma_ops' instead of
the arch default in absence of VIRTIO_F_IOMMU_PLATFORM flag to preserve the
existing semantics. The second patch does exactly that inside the function
virtio_finalize_features(). The third patch removes the default direct GPA
path from virtio core forcing it to use DMA API callbacks for all devices.
Now with that change, every device must have a DMA operations structure
associated with it. The fourth patch adds an additional hook which gives
the platform an opportunity to do yet another override if required. This
platform hook can be used on POWER Ultravisor based protected guests to
load up SWIOTLB DMA callbacks to do the required (as discussed previously
in the above mentioned thread how host is allowed to access only parts of
the guest GPA range) bounce buffering into the shared memory for all I/O
scatter gather buffers to be consumed on the host side.

Please go through these patches and review whether this approach broadly
makes sense. I will appreciate suggestions, inputs, comments regarding
the patches or the approach in general. Thank you.

Anshuman Khandual (4):
  virtio: Define virtio_direct_dma_ops structure
  virtio: Override device's DMA OPS with virtio_direct_dma_ops selectively
  virtio: Force virtio core to use DMA API callbacks for all virtio devices
  virtio: Add platform specific DMA API translation for virito devices

 arch/powerpc/include/asm/dma-mapping.h |  6 +++
 arch/powerpc/platforms/pseries/iommu.c |  6 +++
 drivers/virtio/virtio.c                | 72 ++++++++++++++++++++++++++++++++++
 drivers/virtio/virtio_pci_common.h     |  3 ++
 drivers/virtio/virtio_ring.c           | 65 +-----------------------------
 5 files changed, 89 insertions(+), 63 deletions(-)

-- 
2.9.3

^ permalink raw reply

* Re: [kernel,v7,1/2] vfio/spapr: Use IOMMU pageshift rather than pagesize
From: Paul Mackerras @ 2018-07-20  3:06 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Alexey Kardashevskiy, linuxppc-dev, Nicholas Piggin, kvm-ppc,
	Alex Williamson, Aneesh Kumar K.V, David Gibson
In-Reply-To: <41WNmy1B2zz9s1R@ozlabs.org>

On Thu, Jul 19, 2018 at 04:06:10PM +1000, Michael Ellerman wrote:
> On Tue, 2018-07-17 at 07:19:12 UTC, Alexey Kardashevskiy wrote:
> > The size is always equal to 1 page so let's use this. Later on this will
> > be used for other checks which use page shifts to check the granularity
> > of access.
> > 
> > This should cause no behavioral change.
> > 
> > Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> > Acked-by: Alex Williamson <alex.williamson@redhat.com>
> > Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> 
> Applied to powerpc fixes, thanks.
> 
> https://git.kernel.org/powerpc/c/1463edca6734d42ab4406fa2896e20

Ah.  I have put these two patches in my kvm-ppc-next branch and I was
about to send a pull request to Paolo.

Paul.

^ permalink raw reply

* Re: [RESEND][PATCH] powerpc/powernv : Save/Restore SPRG3 on entry/exit from stop.
From: Michael Neuling @ 2018-07-20  2:41 UTC (permalink / raw)
  To: Michael Ellerman, ego
  Cc: Benjamin Herrenschmidt, Vaidyanathan Srinivasan, linuxppc-dev,
	linux-kernel, Florian Weimer, Oleg Nesterov
In-Reply-To: <87tvoubpro.fsf@concordia.ellerman.id.au>

On Fri, 2018-07-20 at 12:32 +1000, Michael Ellerman wrote:
> Michael Neuling <mikey@neuling.org> writes:
> > On Wed, 2018-07-18 at 13:42 +0530, Gautham R Shenoy wrote:
> > > On Wed, Jul 18, 2018 at 09:24:19AM +1000, Michael Neuling wrote:
> > > >=20
> > > > >  	DEFINE(PPC_DBELL_SERVER, PPC_DBELL_SERVER);
> > > > > diff --git a/arch/powerpc/kernel/idle_book3s.S
> > > > > b/arch/powerpc/kernel/idle_book3s.S
> > > > > index d85d551..5069d42 100644
> > > > > --- a/arch/powerpc/kernel/idle_book3s.S
> > > > > +++ b/arch/powerpc/kernel/idle_book3s.S
> > > > > @@ -120,6 +120,9 @@ power9_save_additional_sprs:
> > > > >  	mfspr	r4, SPRN_MMCR2
> > > > >  	std	r3, STOP_MMCR1(r13)
> > > > >  	std	r4, STOP_MMCR2(r13)
> > > > > +
> > > > > +	mfspr	r3, SPRN_SPRG3
> > > > > +	std	r3, STOP_SPRG3(r13)
> > > >=20
> > > > We don't need to save it.  Just restore it from paca->sprg_vdso whi=
ch
> > > > should
> > > > never change.
> > >=20
> > > Ok. I will respin a patch to restore SPRG3 from paca->sprg_vdso.
> > >=20
> > > >=20
> > > > How can we do better at catching these missing SPRGs?
> > >=20
> > > We can go through the list of SPRs from the POWER9 User Manual and
> > > document explicitly why we don't have to save/restore certain SPRs
> > > during the execution of the stop instruction. Does this sound ok ?
> > >=20
> > > (Ref: Table 4-8, Section 4.7.3.4 from the POWER9 User Manual
> > > accessible from
> > > https://openpowerfoundation.org/?resource_lib=3Dpower9-processor-user=
s-manua
> > > l)
> >=20
> > I was thinking of a boot time test case built into linux. linux has som=
e
> > boot
> > time test cases which you can enable via CONFIG options.
> >=20
> > Firstly you could see if an SPR exists using the same trick xmon does i=
n
> > dump_one_spr(). Then once you have a list of usable SPRs, you could wri=
te
> > all
> > the known ones (I assume you'd have to leave out some, like the PSSCR),=
 then
> > set
>=20
> Write what value?
>=20
> Ideally you want to write a random bit pattern to reduce the chance
> that only some bits are being restored.

The xmon dump_one_spr() trick tries to work around that by writing one rand=
om
value and then a different one to see if it really is a nop.

> But you can't do that because writing a value to an SPRs has an effect.

Sure that's a concern but xmon seems to get away with it.

> Some of them might even need to be zero, in which case you can't really
> distinguish that from a non-restored zero.

It doesn't need to be perfect. It just needs to catch more than we have now=
.

> > the appropriate stop level, make sure you got into that stop level, and=
 then
> > see
> > if that register was changed. Then you'd have an automated list of regi=
sters
> > you
> > need to make sure you save/restore at each stop level.
> >=20
> > Could something like that work?
>=20
> Maybe.
>=20
> Ignoring the problem of whether you can write a meaningful value to some
> of the SPRs, I'm not entirely convinced it's going to work. But maybe
> I'm wrong.

Yeah, I'm not convinced it'll work either but it would be a nice piece of t=
est
infrastructure to have if it does work.

We'd still need to marry up the SPR numbers we get from the test to what's
actually being restored in Linux.

> But there's a much simpler solution, we should 1) have a selftest for
> getcpu() and 2) we should be running the glibc (I think?) test suite
> that found this in the first place. It's frankly embarrassing that we
> didn't find this.

Yeah, we should do that also, but how do we catch the next SPR we are missi=
ng.
I'd like some systematic way of doing that rather than wack-a-mole.

Mikey

^ permalink raw reply

* Re: linux-next: manual merge of the powerpc tree with the powerpc-fixes tree
From: Michael Ellerman @ 2018-07-20  2:34 UTC (permalink / raw)
  To: Stephen Rothwell, Benjamin Herrenschmidt, PowerPC
  Cc: Linux-Next Mailing List, Linux Kernel Mailing List,
	Alexey Kardashevskiy, David Gibson, Paul Mackerras,
	Alex Williamson
In-Reply-To: <20180720095138.275cc9aa@canb.auug.org.au>

Stephen Rothwell <sfr@canb.auug.org.au> writes:

> Hi all,
>
> Today's linux-next merge of the powerpc tree got a conflict in:
>
>   drivers/vfio/vfio_iommu_spapr_tce.c
>
> between commit:
>
>   1463edca6734 ("vfio/spapr: Use IOMMU pageshift rather than pagesize")
>
> from the powerpc-fixes tree and commit:
>
>   00a5c58d9499 ("KVM: PPC: Make iommu_table::it_userspace big endian")
>
> from the powerpc tree.

Thanks.

That has turned into a real mess, with conflicting code in next, fixes
and topic/ppc-kvm.

I'll fix it all up before the merge window.

cheers

^ permalink raw reply

* Re: [RESEND][PATCH] powerpc/powernv : Save/Restore SPRG3 on entry/exit from stop.
From: Michael Ellerman @ 2018-07-20  2:32 UTC (permalink / raw)
  To: Michael Neuling, ego
  Cc: Benjamin Herrenschmidt, Vaidyanathan Srinivasan, linuxppc-dev,
	linux-kernel, Florian Weimer, Oleg Nesterov
In-Reply-To: <1205bfc10c62986b4345fa258cf37e820c08226b.camel@neuling.org>

Michael Neuling <mikey@neuling.org> writes:
> On Wed, 2018-07-18 at 13:42 +0530, Gautham R Shenoy wrote:
>> On Wed, Jul 18, 2018 at 09:24:19AM +1000, Michael Neuling wrote:
>> > 
>> > >  	DEFINE(PPC_DBELL_SERVER, PPC_DBELL_SERVER);
>> > > diff --git a/arch/powerpc/kernel/idle_book3s.S
>> > > b/arch/powerpc/kernel/idle_book3s.S
>> > > index d85d551..5069d42 100644
>> > > --- a/arch/powerpc/kernel/idle_book3s.S
>> > > +++ b/arch/powerpc/kernel/idle_book3s.S
>> > > @@ -120,6 +120,9 @@ power9_save_additional_sprs:
>> > >  	mfspr	r4, SPRN_MMCR2
>> > >  	std	r3, STOP_MMCR1(r13)
>> > >  	std	r4, STOP_MMCR2(r13)
>> > > +
>> > > +	mfspr	r3, SPRN_SPRG3
>> > > +	std	r3, STOP_SPRG3(r13)
>> > 
>> > We don't need to save it.  Just restore it from paca->sprg_vdso which should
>> > never change.
>> 
>> Ok. I will respin a patch to restore SPRG3 from paca->sprg_vdso.
>> 
>> > 
>> > How can we do better at catching these missing SPRGs?
>> 
>> We can go through the list of SPRs from the POWER9 User Manual and
>> document explicitly why we don't have to save/restore certain SPRs
>> during the execution of the stop instruction. Does this sound ok ?
>> 
>> (Ref: Table 4-8, Section 4.7.3.4 from the POWER9 User Manual
>> accessible from
>> https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual)
>
> I was thinking of a boot time test case built into linux. linux has some boot
> time test cases which you can enable via CONFIG options.
>
> Firstly you could see if an SPR exists using the same trick xmon does in
> dump_one_spr(). Then once you have a list of usable SPRs, you could write all
> the known ones (I assume you'd have to leave out some, like the PSSCR), then set

Write what value?

Ideally you want to write a random bit pattern to reduce the chance
that only some bits are being restored.

But you can't do that because writing a value to an SPRs has an effect.

Some of them might even need to be zero, in which case you can't really
distinguish that from a non-restored zero.

> the appropriate stop level, make sure you got into that stop level, and then see
> if that register was changed. Then you'd have an automated list of registers you
> need to make sure you save/restore at each stop level.
>
> Could something like that work?

Maybe.

Ignoring the problem of whether you can write a meaningful value to some
of the SPRs, I'm not entirely convinced it's going to work. But maybe
I'm wrong.

But there's a much simpler solution, we should 1) have a selftest for
getcpu() and 2) we should be running the glibc (I think?) test suite
that found this in the first place. It's frankly embarrassing that we
didn't find this.

cheers

^ permalink raw reply

* Re: [RESEND][PATCH] powerpc/powernv : Save/Restore SPRG3 on entry/exit from stop.
From: Michael Neuling @ 2018-07-20  0:27 UTC (permalink / raw)
  To: ego
  Cc: Michael Ellerman, Benjamin Herrenschmidt, Vaidyanathan Srinivasan,
	linuxppc-dev, linux-kernel, Florian Weimer, Oleg Nesterov
In-Reply-To: <20180718081249.GA17700@in.ibm.com>

On Wed, 2018-07-18 at 13:42 +0530, Gautham R Shenoy wrote:
> Hello Mikey,
>=20
> On Wed, Jul 18, 2018 at 09:24:19AM +1000, Michael Neuling wrote:
> >=20
> > >  	DEFINE(PPC_DBELL_SERVER, PPC_DBELL_SERVER);
> > > diff --git a/arch/powerpc/kernel/idle_book3s.S
> > > b/arch/powerpc/kernel/idle_book3s.S
> > > index d85d551..5069d42 100644
> > > --- a/arch/powerpc/kernel/idle_book3s.S
> > > +++ b/arch/powerpc/kernel/idle_book3s.S
> > > @@ -120,6 +120,9 @@ power9_save_additional_sprs:
> > >  	mfspr	r4, SPRN_MMCR2
> > >  	std	r3, STOP_MMCR1(r13)
> > >  	std	r4, STOP_MMCR2(r13)
> > > +
> > > +	mfspr	r3, SPRN_SPRG3
> > > +	std	r3, STOP_SPRG3(r13)
> >=20
> > We don't need to save it.  Just restore it from paca->sprg_vdso which s=
hould
> > never change.
>=20
> Ok. I will respin a patch to restore SPRG3 from paca->sprg_vdso.
>=20
> >=20
> > How can we do better at catching these missing SPRGs?
>=20
> We can go through the list of SPRs from the POWER9 User Manual and
> document explicitly why we don't have to save/restore certain SPRs
> during the execution of the stop instruction. Does this sound ok ?
>=20
> (Ref: Table 4-8, Section 4.7.3.4 from the POWER9 User Manual
> accessible from
> https://openpowerfoundation.org/?resource_lib=3Dpower9-processor-users-ma=
nual)

I was thinking of a boot time test case built into linux. linux has some bo=
ot
time test cases which you can enable via CONFIG options.

Firstly you could see if an SPR exists using the same trick xmon does in
dump_one_spr(). Then once you have a list of usable SPRs, you could write a=
ll
the known ones (I assume you'd have to leave out some, like the PSSCR), the=
n set
the appropriate stop level, make sure you got into that stop level, and the=
n see
if that register was changed. Then you'd have an automated list of register=
s you
need to make sure you save/restore at each stop level.

Could something like that work?

Mikey

^ permalink raw reply

* linux-next: manual merge of the powerpc tree with the powerpc-fixes tree
From: Stephen Rothwell @ 2018-07-19 23:51 UTC (permalink / raw)
  To: Michael Ellerman, Benjamin Herrenschmidt, PowerPC
  Cc: Linux-Next Mailing List, Linux Kernel Mailing List,
	Alexey Kardashevskiy, David Gibson, Paul Mackerras,
	Alex Williamson

[-- Attachment #1: Type: text/plain, Size: 2237 bytes --]

Hi all,

Today's linux-next merge of the powerpc tree got a conflict in:

  drivers/vfio/vfio_iommu_spapr_tce.c

between commit:

  1463edca6734 ("vfio/spapr: Use IOMMU pageshift rather than pagesize")

from the powerpc-fixes tree and commit:

  00a5c58d9499 ("KVM: PPC: Make iommu_table::it_userspace big endian")

from the powerpc tree.

I fixed it up (see below) and can carry the fix as necessary. This
is now fixed as far as linux-next is concerned, but any non trivial
conflicts should be mentioned to your upstream maintainer when your tree
is submitted for merging.  You may also want to consider cooperating
with the maintainer of the conflicting tree to minimise any particularly
complex conflicts.

-- 
Cheers,
Stephen Rothwell

diff --cc drivers/vfio/vfio_iommu_spapr_tce.c
index 7cd63b0c1a46,11a4c194d6e3..000000000000
--- a/drivers/vfio/vfio_iommu_spapr_tce.c
+++ b/drivers/vfio/vfio_iommu_spapr_tce.c
@@@ -487,11 -449,11 +449,11 @@@ static void tce_iommu_unuse_page_v2(str
  	if (!pua)
  		return;
  
- 	ret = tce_iommu_prereg_ua_to_hpa(container, *pua, tbl->it_page_shift,
- 			&hpa, &mem);
+ 	ret = tce_iommu_prereg_ua_to_hpa(container, be64_to_cpu(*pua),
 -			IOMMU_PAGE_SIZE(tbl), &hpa, &mem);
++			tbl->it_page_shift, &hpa, &mem);
  	if (ret)
- 		pr_debug("%s: tce %lx at #%lx was not cached, ret=%d\n",
- 				__func__, *pua, entry, ret);
+ 		pr_debug("%s: tce %llx at #%lx was not cached, ret=%d\n",
+ 				__func__, be64_to_cpu(*pua), entry, ret);
  	if (mem)
  		mm_iommu_mapped_dec(mem);
  
@@@ -599,19 -561,12 +561,12 @@@ static long tce_iommu_build_v2(struct t
  	unsigned long hpa;
  	enum dma_data_direction dirtmp;
  
- 	if (!tbl->it_userspace) {
- 		ret = tce_iommu_userspace_view_alloc(tbl, container->mm);
- 		if (ret)
- 			return ret;
- 	}
- 
  	for (i = 0; i < pages; ++i) {
  		struct mm_iommu_table_group_mem_t *mem = NULL;
- 		unsigned long *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl,
- 				entry + i);
+ 		__be64 *pua = IOMMU_TABLE_USERSPACE_ENTRY(tbl, entry + i);
  
  		ret = tce_iommu_prereg_ua_to_hpa(container,
 -				tce, IOMMU_PAGE_SIZE(tbl), &hpa, &mem);
 +				tce, tbl->it_page_shift, &hpa, &mem);
  		if (ret)
  			break;
  

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: Improvements for the PS3
From: Fredrik Noring @ 2018-07-19 20:14 UTC (permalink / raw)
  To: Geert Uytterhoeven, Geoff Levand; +Cc: linuxppc-dev, Maciej W. Rozycki
In-Reply-To: <CAMuHMdX3xiRYYdhQftHRDYFqob7xMHm-xxvh3NyoC1TowH4x8g@mail.gmail.com>

Hi Geert, Geoff,

> > > so I added a sleep with
> > >
> > > +     msleep(10000);
> 
> I can't see where you added the sleep, but 10s seems excessive.
> If the real reason is the need to wait for an interrupt for ps3fb_sync_image(),
> then waiting for 40 ms should be sufficient? Or am I missing something?

It's at the end of ps3fb_probe, as shown in the original post:

https://lists.ozlabs.org/pipermail/linuxppc-dev/2018-July/175771.html

I thought 100 ms or so would work, but evidently it didn't. In fact, 1 s
for even 5 s didn't seem to work either. In any case, I would like to
develop a solution that does not need to sleep at all, so that will be my
first approach for a proper implementation.

> > > I suppose the problem is that it relies on interrupts for ps3fb_sync_image
> > > to regularly copy the image, hence without them the screen isn't updated to
> > > show kernel panics, etc. Perhaps one way to fix that is to implement the
> > > struct fb_tile_ops API, so that the console is synchronously updated? Would
> > > that be acceptable?
> >
> > I'm not sure if that would work or not.   Maybe Geert is more familiar with it.
> 
> That sounds like a complex solution, slowing down the console a lot.

Why would that be slow? I have implemented a similar technique for the
PlayStation 2 frame buffer, and (without any measurements at hand now) it
appears to be about as fast as is possible, and reasonably easy too. :)

It works like this:

The PS2 frame buffer can operate in two distinct modes: virtual mode or
console mode. Virtual mode is very similar to the PS3 in that the whole
visible kernel memory frame buffer is copied to the Graphics Synthesizer
(GS) via DMA regularly at vertical blank intervals.

Console mode is different. No kernel memory is allocated for the frame
buffer at all (which is a significant improvement in itself given that the
PS2 is limited to 32 MiB of main memory) and mmap is disabled. Some struct
fb_ops such as fb_fillrect and fb_copyarea are directly accelerated by GS
hardware. The GS has two CRT merge circuits made for picture-in-picture that
are used to accelerate YWRAP in hardware, which is particularly effective
for console text scrolling.

Additionally, the tiled API is implemented, and it turned out to be a rather
good fit. The GS has 4 MiB of local frame buffer video memory (not directly
accessible by the kernel). The maximum practical resolution 1920x1080p at 16
bits per pixel is 4147200 bytes, which leaves 47104 bytes for a tiled font
which at 8x8 pixels and a minimum 4 bits indexed texture palette is at most
1464 characters. The indexed palette makes it easy to switch colors. struct
fb_tile_ops such as fb_tileblit are accelerated by GS texture sprites which
are fast (GS local copy) for the kernel via simple DMA (GIF) GS commands.

Console text is always synchronous and therefore works properly in interrupt
contexts and with kernel panics, etc. which is essential for debuggability.
A buffered version could be faster, possibly, but I think that might as well
be implemented by a user space console driver using a /dev/gs interface that
can do zero-copy GS commands. The PS2 frame buffer implementation is nearly
complete:

https://github.com/frno7/linux/blob/ps2-v4.17-n3/drivers/video/fbdev/ps2fb.c

Some adjustments and feature reductions seem to be needed for a PS3 version
of anything similar. The simplest implementation is probably to just mirror
characters as they are printed synchronously. I don't know the overhead of
the hypervisor calls for copying graphics though, but the typical areas are
quite small. Perhaps one could avoid allocating the kernel frame buffer as
well when it's not needed. I have to investigate these things to be sure.

> What about letting ps3fb register a panic notifier to sync the screen, like
> hyperv_fb does?

That would not work with kernel freezes unfortunately. Debugging those with
nondeterministicly invisible kernel prints would be painful, I believe.

Fredrik

^ permalink raw reply

* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required
From: Andrew Morton @ 2018-07-19 19:44 UTC (permalink / raw)
  To: Baoquan He
  Cc: linux-kernel, robh+dt, dan.j.williams, nicolas.pitre, josh,
	fengguang.wu, bp, andy.shevchenko, patrik.r.jakobsson, airlied,
	kys, haiyangz, sthemmin, dmitry.torokhov, frowand.list,
	keith.busch, jonathan.derrick, lorenzo.pieralisi, bhelgaas, tglx,
	brijesh.singh, jglisse, thomas.lendacky, gregkh, baiyaowei,
	richard.weiyang, devel, linux-input, linux-nvdimm, devicetree,
	linux-pci, ebiederm, vgoyal, dyoung, yinghai, monstr, davem,
	chris, jcmvbkbc, gustavo, maarten.lankhorst, seanpaul,
	linux-parisc, linuxppc-dev, kexec
In-Reply-To: <20180719151753.GB7147@localhost.localdomain>

On Thu, 19 Jul 2018 23:17:53 +0800 Baoquan He <bhe@redhat.com> wrote:

> Hi Andrew,
> 
> On 07/18/18 at 03:33pm, Andrew Morton wrote:
> > On Wed, 18 Jul 2018 10:49:44 +0800 Baoquan He <bhe@redhat.com> wrote:
> > 
> > > For kexec_file loading, if kexec_buf.top_down is 'true', the memory which
> > > is used to load kernel/initrd/purgatory is supposed to be allocated from
> > > top to down. This is what we have been doing all along in the old kexec
> > > loading interface and the kexec loading is still default setting in some
> > > distributions. However, the current kexec_file loading interface doesn't
> > > do like this. The function arch_kexec_walk_mem() it calls ignores checking
> > > kexec_buf.top_down, but calls walk_system_ram_res() directly to go through
> > > all resources of System RAM from bottom to up, to try to find memory region
> > > which can contain the specific kexec buffer, then call locate_mem_hole_callback()
> > > to allocate memory in that found memory region from top to down. This brings
> > > confusion especially when KASLR is widely supported , users have to make clear
> > > why kexec/kdump kernel loading position is different between these two
> > > interfaces in order to exclude unnecessary noises. Hence these two interfaces
> > > need be unified on behaviour.
> > 
> > As far as I can tell, the above is the whole reason for the patchset,
> > yes?  To avoid confusing users.
> 
> 
> In fact, it's not just trying to avoid confusing users. Kexec loading
> and kexec_file loading are just do the same thing in essence. Just we
> need do kernel image verification on uefi system, have to port kexec
> loading code to kernel. 
> 
> Kexec has been a formal feature in our distro, and customers owning
> those kind of very large machine can make use of this feature to speed
> up the reboot process. On uefi machine, the kexec_file loading will
> search place to put kernel under 4G from top to down. As we know, the
> 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume
> it. It may have possibility to not be able to find a usable space for
> kernel/initrd. From the top down of the whole memory space, we don't
> have this worry. 
> 
> And at the first post, I just posted below with AKASHI's
> walk_system_ram_res_rev() version. Later you suggested to use
> list_head to link child sibling of resource, see what the code change
> looks like.
> http://lkml.kernel.org/r/20180322033722.9279-1-bhe@redhat.com
> 
> Then I posted v2
> http://lkml.kernel.org/r/20180408024724.16812-1-bhe@redhat.com
> Rob Herring mentioned that other components which has this tree struct
> have planned to do the same thing, replacing the singly linked list with
> list_head to link resource child sibling. Just quote Rob's words as
> below. I think this could be another reason.
> 
> ~~~~~ From Rob
> The DT struct device_node also has the same tree structure with
> parent, child, sibling pointers and converting to list_head had been
> on the todo list for a while. ACPI also has some tree walking
> functions (drivers/acpi/acpica/pstree.c). Perhaps there should be a
> common tree struct and helpers defined either on top of list_head or a
> ~~~~~
> new struct if that saves some size.

Please let's get all this into the changelogs?

> > 
> > Is that sufficient?  Can we instead simplify their lives by providing
> > better documentation or informative printks or better Kconfig text,
> > etc?
> > 
> > And who *are* the people who are performing this configuration?  Random
> > system administrators?  Linux distro engineers?  If the latter then
> > they presumably aren't easily confused!
> 
> Kexec was invented for kernel developer to speed up their kernel
> rebooting. Now high end sever admin, kernel developer and QE are also
> keen to use it to reboot large box for faster feature testing, bug
> debugging. Kernel dev could know this well, about kernel loading
> position, admin or QE might not be aware of it very well. 
> 
> > 
> > In other words, I'm trying to understand how much benefit this patchset
> > will provide to our users as a whole.
> 
> Understood. The list_head replacing patch truly involes too many code
> changes, it's risky. I am willing to try any idea from reviewers, won't
> persuit they have to be accepted finally. If don't have a try, we don't
> know what it looks like, and what impact it may have. I am fine to take
> AKASHI's simple version of walk_system_ram_res_rev() to lower risk, even
> though it could be a little bit low efficient.

The larger patch produces a better result.  We can handle it ;)

^ permalink raw reply

* Re: [PATCH] powerpc/msi: Remove VLA usage
From: Kees Cook @ 2018-07-19 18:33 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Benjamin Herrenschmidt, Paul Mackerras, Andrew Morton,
	Randy Dunlap, Tyrel Datwyler, Rob Herring, Ingo Molnar, PowerPC,
	LKML
In-Reply-To: <8736wfctdl.fsf@concordia.ellerman.id.au>

On Thu, Jul 19, 2018 at 5:17 AM, Michael Ellerman <mpe@ellerman.id.au> wrote:
> Kees Cook <keescook@chromium.org> writes:
>
>> On Fri, Jun 29, 2018 at 11:52 AM, Kees Cook <keescook@chromium.org> wrote:
>>> In the quest to remove all stack VLA usage from the kernel[1], this
>>> switches from an unchanging variable to a constant expression to eliminate
>>> the VLA generation.
>>>
>>> [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com
>>>
>>> Signed-off-by: Kees Cook <keescook@chromium.org>
>>
>> Friendly ping! Michael, can you take this?
>
> Yep, sorry. I actually applied it weeks ago but hadn't pushed.

No worries! Thanks! :)

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply

* Re: [PATCH 4/7] x86,tlb: make lazy TLB mode lazier
From: Andy Lutomirski @ 2018-07-19 17:04 UTC (permalink / raw)
  To: Rik van Riel, Peter Zijlstra, Vitaly Kuznetsov, Juergen Gross,
	Boris Ostrovsky, linux-arch, Will Deacon, Catalin Marinas,
	linux-s390, Benjamin Herrenschmidt, linuxppc-dev
  Cc: Andy Lutomirski, LKML, X86 ML, Mike Galbraith, kernel-team,
	Ingo Molnar, Dave Hansen
In-Reply-To: <CALCETrWDiMhgiR3f8n0jdWcW31EDJ+Waq0wh5sMDutfigANGnA@mail.gmail.com>

On Thu, Jul 19, 2018 at 9:45 AM, Andy Lutomirski <luto@amacapital.net> wrot=
e:
> [I added PeterZ and Vitaly -- can you see any way in which this would
> break something obscure?  I don't.]
>
> On Thu, Jul 19, 2018 at 7:14 AM, Rik van Riel <riel@surriel.com> wrote:
>> I guess we can skip both switch_ldt and load_mm_cr4 if real_prev equals
>> next?
>
> Yes, AFAICS.
>
>>
>> On to the lazy TLB mm_struct refcounting stuff :)
>>
>>>
>>> Which refcount?  mm_users shouldn=E2=80=99t be hot, so I assume you=E2=
=80=99re talking about
>>> mm_count. My suggestion is to get rid of mm_count instead of trying to
>>> optimize it.
>>
>>
>> Do you have any suggestions on how? :)
>>
>> The TLB shootdown sent at __exit_mm time does not get rid of the
>> kernelthread->active_mm
>> pointer pointing at the mm that is exiting.
>>
>
> Ah, but that's conceptually very easy to fix.  Add a #define like
> ARCH_NO_TASK_ACTIVE_MM.  Then just get rid of active_mm if that
> #define is set.  After some grepping, there are very few users.  The
> only nontrivial ones are the ones in kernel/ and mm/mmu_context.c that
> are involved in the rather complicated dance of refcounting active_mm.
> If that field goes away, it doesn't need to be refcounted.  Instead, I
> think the refcounting can get replaced with something like:
>
> /*
>  * Release any arch-internal references to mm.  Only called when
> mm_users is zero
>  * and all tasks using mm have either been switch_mm()'d away or have had
>  * enter_lazy_tlb() called.
>  */
> extern void arch_shoot_down_dead_mm(struct mm_struct *mm);
>
> which the kernel calls in __mmput() after tearing down all the page
> tables.  The body can be something like:
>
> if (WARN_ON(cpumask_any_but(mm_cpumask(...), ...)) {
>   /* send an IPI.  Maybe just call tlb_flush_remove_tables() */
> }
>
> (You'll also have to fix up the highly questionable users in
> arch/x86/platform/efi/efi_64.c, but that's easy.)
>
> Does all that make sense?  Basically, as I understand it, the
> expensive atomic ops you're seeing are all pointless because they're
> enabling an optimization that hasn't actually worked for a long time,
> if ever.

Hmm.  Xen PV has a big hack in xen_exit_mmap(), which is called from
arch_exit_mmap(), I think.  It's a heavier weight version of more or
less the same thing that arch_shoot_down_dead_mm() would be, except
that it happens before exit_mmap().  But maybe Xen actually has the
right idea.  In other words, rather doing the big pagetable free in
exit_mmap() while there may still be other CPUs pointing at the page
tables, the other order might make more sense.  So maybe, if
ARCH_NO_TASK_ACTIVE_MM is set, arch_exit_mmap() should be responsible
for getting rid of all secret arch references to the mm.

Hmm.  ARCH_FREE_UNUSED_MM_IMMEDIATELY might be a better name.

I added some more arch maintainers.  The idea here is that, on x86 at
least, task->active_mm and all its refcounting is pure overhead.  When
a process exits, __mmput() gets called, but the core kernel has a
longstanding "optimization" in which other tasks (kernel threads and
idle tasks) may have ->active_mm pointing at this mm.  This is nasty,
complicated, and hurts performance on large systems, since it requires
extra atomic operations whenever a CPU switches between real users
threads and idle/kernel threads.

It's also almost completely worthless on x86 at least, since __mmput()
frees pagetables, and that operation *already* forces a remote TLB
flush, so we might as well zap all the active_mm references at the
same time.

But arm64 has real HW remote flushes.  Does arm64 actually benefit
from the active_mm optimization?  What happens on arm64 when a process
exits?  How about s390?  I suspect that x390 has rather larger systems
than arm64, where the cost of the reference counting can be much
higher.

(Also, Rik, x86 on Hyper-V has remote flushes, too. How does that
interact with your previous patch set?)

^ permalink raw reply

* Re: [PATCH v7 1/4] resource: Move reparent_resources() to kernel/resource.c and make it public
From: Baoquan He @ 2018-07-19 15:18 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Linux Kernel Mailing List, Andrew Morton, Rob Herring,
	Dan Williams, Nicolas Pitre, Josh Triplett, kbuild test robot,
	Borislav Petkov, Patrik Jakobsson, David Airlie, KY Srinivasan,
	Haiyang Zhang, Stephen Hemminger, Dmitry Torokhov, Frank Rowand,
	Keith Busch, Jon Derrick, Lorenzo Pieralisi, Bjorn Helgaas,
	Thomas Gleixner, brijesh.singh, Jérôme Glisse,
	Tom Lendacky, Greg Kroah-Hartman, baiyaowei, richard.weiyang,
	devel, linux-input, linux-nvdimm, devicetree, linux-pci,
	Eric Biederman, Vivek Goyal, Dave Young, Yinghai Lu, Michal Simek,
	David S. Miller, Chris Zankel, Max Filippov, Gustavo Padovan,
	Maarten Lankhorst, Sean Paul, linux-parisc,
	open list:LINUX FOR POWERPC PA SEMI PWRFICIENT,
	Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
In-Reply-To: <CAHp75Vf2yEwHhEhhQH2XN+pOQ=-skiAHZ=FgLnfVV8vcm59qeQ@mail.gmail.com>

On 07/18/18 at 07:37pm, Andy Shevchenko wrote:
> On Wed, Jul 18, 2018 at 7:36 PM, Andy Shevchenko
> <andy.shevchenko@gmail.com> wrote:
> > On Wed, Jul 18, 2018 at 5:49 AM, Baoquan He <bhe@redhat.com> wrote:
> >> reparent_resources() is duplicated in arch/microblaze/pci/pci-common.c
> >> and arch/powerpc/kernel/pci-common.c, so move it to kernel/resource.c
> >> so that it's shared.
> 
> >> + * Returns 0 on success, -ENOTSUPP if child resource is not completely
> >> + * contained by 'res', -ECANCELED if no any conflicting entry found.
> 
> You also can refer to constants by prefixing them with %, e.g. %-ENOTSUPP.
> But this is up to you completely.

Thanks, will fix when repost. 

^ permalink raw reply

* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required
From: Baoquan He @ 2018-07-19 15:17 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, robh+dt, dan.j.williams, nicolas.pitre, josh,
	fengguang.wu, bp, andy.shevchenko, patrik.r.jakobsson, airlied,
	kys, haiyangz, sthemmin, dmitry.torokhov, frowand.list,
	keith.busch, jonathan.derrick, lorenzo.pieralisi, bhelgaas, tglx,
	brijesh.singh, jglisse, thomas.lendacky, gregkh, baiyaowei,
	richard.weiyang, devel, linux-input, linux-nvdimm, devicetree,
	linux-pci, ebiederm, vgoyal, dyoung, yinghai, monstr, davem,
	chris, jcmvbkbc, gustavo, maarten.lankhorst, seanpaul,
	linux-parisc, linuxppc-dev, kexec
In-Reply-To: <20180718153326.b795e9ea7835432a56cd7011@linux-foundation.org>

Hi Andrew,

On 07/18/18 at 03:33pm, Andrew Morton wrote:
> On Wed, 18 Jul 2018 10:49:44 +0800 Baoquan He <bhe@redhat.com> wrote:
> 
> > For kexec_file loading, if kexec_buf.top_down is 'true', the memory which
> > is used to load kernel/initrd/purgatory is supposed to be allocated from
> > top to down. This is what we have been doing all along in the old kexec
> > loading interface and the kexec loading is still default setting in some
> > distributions. However, the current kexec_file loading interface doesn't
> > do like this. The function arch_kexec_walk_mem() it calls ignores checking
> > kexec_buf.top_down, but calls walk_system_ram_res() directly to go through
> > all resources of System RAM from bottom to up, to try to find memory region
> > which can contain the specific kexec buffer, then call locate_mem_hole_callback()
> > to allocate memory in that found memory region from top to down. This brings
> > confusion especially when KASLR is widely supported , users have to make clear
> > why kexec/kdump kernel loading position is different between these two
> > interfaces in order to exclude unnecessary noises. Hence these two interfaces
> > need be unified on behaviour.
> 
> As far as I can tell, the above is the whole reason for the patchset,
> yes?  To avoid confusing users.


In fact, it's not just trying to avoid confusing users. Kexec loading
and kexec_file loading are just do the same thing in essence. Just we
need do kernel image verification on uefi system, have to port kexec
loading code to kernel. 

Kexec has been a formal feature in our distro, and customers owning
those kind of very large machine can make use of this feature to speed
up the reboot process. On uefi machine, the kexec_file loading will
search place to put kernel under 4G from top to down. As we know, the
1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume
it. It may have possibility to not be able to find a usable space for
kernel/initrd. From the top down of the whole memory space, we don't
have this worry. 

And at the first post, I just posted below with AKASHI's
walk_system_ram_res_rev() version. Later you suggested to use
list_head to link child sibling of resource, see what the code change
looks like.
http://lkml.kernel.org/r/20180322033722.9279-1-bhe@redhat.com

Then I posted v2
http://lkml.kernel.org/r/20180408024724.16812-1-bhe@redhat.com
Rob Herring mentioned that other components which has this tree struct
have planned to do the same thing, replacing the singly linked list with
list_head to link resource child sibling. Just quote Rob's words as
below. I think this could be another reason.

~~~~~ From Rob
The DT struct device_node also has the same tree structure with
parent, child, sibling pointers and converting to list_head had been
on the todo list for a while. ACPI also has some tree walking
functions (drivers/acpi/acpica/pstree.c). Perhaps there should be a
common tree struct and helpers defined either on top of list_head or a
~~~~~
new struct if that saves some size.

> 
> Is that sufficient?  Can we instead simplify their lives by providing
> better documentation or informative printks or better Kconfig text,
> etc?
> 
> And who *are* the people who are performing this configuration?  Random
> system administrators?  Linux distro engineers?  If the latter then
> they presumably aren't easily confused!

Kexec was invented for kernel developer to speed up their kernel
rebooting. Now high end sever admin, kernel developer and QE are also
keen to use it to reboot large box for faster feature testing, bug
debugging. Kernel dev could know this well, about kernel loading
position, admin or QE might not be aware of it very well. 

> 
> In other words, I'm trying to understand how much benefit this patchset
> will provide to our users as a whole.

Understood. The list_head replacing patch truly involes too many code
changes, it's risky. I am willing to try any idea from reviewers, won't
persuit they have to be accepted finally. If don't have a try, we don't
know what it looks like, and what impact it may have. I am fine to take
AKASHI's simple version of walk_system_ram_res_rev() to lower risk, even
though it could be a little bit low efficient.

Thanks
Baoquan

^ permalink raw reply

* Re: Improvements for the PS3
From: Geoff Levand @ 2018-07-19 15:15 UTC (permalink / raw)
  To: Geert Uytterhoeven, noring; +Cc: linuxppc-dev
In-Reply-To: <CAMuHMdX3xiRYYdhQftHRDYFqob7xMHm-xxvh3NyoC1TowH4x8g@mail.gmail.com>

Hi Geert, Fredrik,

On 07/19/2018 12:45 AM, Geert Uytterhoeven wrote:
>> On 07/14/2018 09:49 AM, Fredrik Noring wrote:
>>>
>>> et voilà, the screen came alive and the kernel panic was revealed! It seems
>>> the kernel panics so fast that the PS3 frame buffer is unprepared. This is,
>>> of course, very unfortunate because trying to debug the boot process without
>>> a screen or any other means of obtaining console text is quite difficult.
>>
> What about letting ps3fb register a panic notifier to sync the screen, like
> hyperv_fb does?

Seems like the thing to do.  Fredrik, do you want to try it?  Otherwise, I'll
work on it when I have some time.

-Geoff

^ permalink raw reply

* Re: [PATCH] powerpc/ps3: Set driver coherent_dma_mask
From: Alan Stern @ 2018-07-19 15:00 UTC (permalink / raw)
  To: Geoff Levand
  Cc: Takashi Iwai, Jaroslav Kysela, Michael Ellerman, linux-usb,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <3264269e-9a6e-da8e-3857-c0b55d7ee776@infradead.org>

On Thu, 19 Jul 2018, Geoff Levand wrote:

> Hi Alan,
> 
> On 07/19/2018 07:33 AM, Alan Stern wrote:
> > On Wed, 18 Jul 2018, Geoff Levand wrote:
> > 
> >> diff --git a/drivers/usb/host/ehci-ps3.c b/drivers/usb/host/ehci-ps3.c
> >> index 8c733492d8fe..454d8c624a3f 100644
> >> --- a/drivers/usb/host/ehci-ps3.c
> >> +++ b/drivers/usb/host/ehci-ps3.c
> >> @@ -86,7 +86,7 @@ static int ps3_ehci_probe(struct ps3_system_bus_device *dev)
> >>  	int result;
> >>  	struct usb_hcd *hcd;
> >>  	unsigned int virq;
> >> -	static u64 dummy_mask = DMA_BIT_MASK(32);
> >> +	static u64 dummy_mask;
> >>  
> >>  	if (usb_disabled()) {
> >>  		result = -ENODEV;
> >> @@ -131,7 +131,9 @@ static int ps3_ehci_probe(struct ps3_system_bus_device *dev)
> >>  		goto fail_irq;
> >>  	}
> >>  
> >> -	dev->core.dma_mask = &dummy_mask; /* FIXME: for improper usb code */
> >> +	dummy_mask = DMA_BIT_MASK(32);
> >> +	dev->core.dma_mask = &dummy_mask;
> >> +	dma_set_coherent_mask(&dev->core, dummy_mask);
> > 
> > What is the reason for changing a static initialization to a dynamic
> > one?  As far as I can see, the patch touches four lines of code but the
> > only real difference is addition of a single line (and removal of a 
> > comment).
> 
> I thought it would be better if all the setting was done in
> one place, that's the only reason.

All right; in that case (for the EHCI and OHCI pieces):

Acked-by: Alan Stern <stern@rowland.harvard.edu>

Alan Stern

^ permalink raw reply

* Re: [PATCH] powerpc/ps3: Set driver coherent_dma_mask
From: Geoff Levand @ 2018-07-19 14:53 UTC (permalink / raw)
  To: Alan Stern
  Cc: Takashi Iwai, Jaroslav Kysela, Michael Ellerman, linux-usb,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <Pine.LNX.4.44L0.1807191029570.1491-100000@iolanthe.rowland.org>

Hi Alan,

On 07/19/2018 07:33 AM, Alan Stern wrote:
> On Wed, 18 Jul 2018, Geoff Levand wrote:
> 
>> diff --git a/drivers/usb/host/ehci-ps3.c b/drivers/usb/host/ehci-ps3.c
>> index 8c733492d8fe..454d8c624a3f 100644
>> --- a/drivers/usb/host/ehci-ps3.c
>> +++ b/drivers/usb/host/ehci-ps3.c
>> @@ -86,7 +86,7 @@ static int ps3_ehci_probe(struct ps3_system_bus_device *dev)
>>  	int result;
>>  	struct usb_hcd *hcd;
>>  	unsigned int virq;
>> -	static u64 dummy_mask = DMA_BIT_MASK(32);
>> +	static u64 dummy_mask;
>>  
>>  	if (usb_disabled()) {
>>  		result = -ENODEV;
>> @@ -131,7 +131,9 @@ static int ps3_ehci_probe(struct ps3_system_bus_device *dev)
>>  		goto fail_irq;
>>  	}
>>  
>> -	dev->core.dma_mask = &dummy_mask; /* FIXME: for improper usb code */
>> +	dummy_mask = DMA_BIT_MASK(32);
>> +	dev->core.dma_mask = &dummy_mask;
>> +	dma_set_coherent_mask(&dev->core, dummy_mask);
> 
> What is the reason for changing a static initialization to a dynamic
> one?  As far as I can see, the patch touches four lines of code but the
> only real difference is addition of a single line (and removal of a 
> comment).

I thought it would be better if all the setting was done in
one place, that's the only reason.

-Geoff

^ permalink raw reply

* Re: [PATCH] powerpc/ps3: Set driver coherent_dma_mask
From: Alan Stern @ 2018-07-19 14:33 UTC (permalink / raw)
  To: Geoff Levand
  Cc: Takashi Iwai, Jaroslav Kysela, Michael Ellerman, linux-usb,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <068ebcfa-7cd0-bd06-42e7-577a4624f0b0@infradead.org>

On Wed, 18 Jul 2018, Geoff Levand wrote:

> Set the coherent_dma_mask for the PS3 ehci, ohci, and snd devices.
> 
> Silences WARN_ON_ONCE messages emitted by the dma_alloc_attrs() routine.
> 
> Reported-by: Fredrik Noring <noring@nocrew.org>
> Signed-off-by: Geoff Levand <geoff@infradead.org>
> ---
> Hi Michael,
> 
> This just silences some warnings.  Can you take it through the powerpc
> tree?
> 
> -Geoff
> 
> 
>  drivers/usb/host/ehci-ps3.c | 6 ++++--
>  drivers/usb/host/ohci-ps3.c | 6 ++++--
>  sound/ppc/snd_ps3.c         | 5 +++++
>  3 files changed, 13 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/usb/host/ehci-ps3.c b/drivers/usb/host/ehci-ps3.c
> index 8c733492d8fe..454d8c624a3f 100644
> --- a/drivers/usb/host/ehci-ps3.c
> +++ b/drivers/usb/host/ehci-ps3.c
> @@ -86,7 +86,7 @@ static int ps3_ehci_probe(struct ps3_system_bus_device *dev)
>  	int result;
>  	struct usb_hcd *hcd;
>  	unsigned int virq;
> -	static u64 dummy_mask = DMA_BIT_MASK(32);
> +	static u64 dummy_mask;
>  
>  	if (usb_disabled()) {
>  		result = -ENODEV;
> @@ -131,7 +131,9 @@ static int ps3_ehci_probe(struct ps3_system_bus_device *dev)
>  		goto fail_irq;
>  	}
>  
> -	dev->core.dma_mask = &dummy_mask; /* FIXME: for improper usb code */
> +	dummy_mask = DMA_BIT_MASK(32);
> +	dev->core.dma_mask = &dummy_mask;
> +	dma_set_coherent_mask(&dev->core, dummy_mask);

What is the reason for changing a static initialization to a dynamic
one?  As far as I can see, the patch touches four lines of code but the
only real difference is addition of a single line (and removal of a 
comment).

> diff --git a/drivers/usb/host/ohci-ps3.c b/drivers/usb/host/ohci-ps3.c
> index 20a23d795adf..395f9d3bc849 100644
> --- a/drivers/usb/host/ohci-ps3.c
> +++ b/drivers/usb/host/ohci-ps3.c
> @@ -69,7 +69,7 @@ static int ps3_ohci_probe(struct ps3_system_bus_device *dev)
>  	int result;
>  	struct usb_hcd *hcd;
>  	unsigned int virq;
> -	static u64 dummy_mask = DMA_BIT_MASK(32);
> +	static u64 dummy_mask;
>  
>  	if (usb_disabled()) {
>  		result = -ENODEV;
> @@ -115,7 +115,9 @@ static int ps3_ohci_probe(struct ps3_system_bus_device *dev)
>  		goto fail_irq;
>  	}
>  
> -	dev->core.dma_mask = &dummy_mask; /* FIXME: for improper usb code */
> +	dummy_mask = DMA_BIT_MASK(32);
> +	dev->core.dma_mask = &dummy_mask;
> +	dma_set_coherent_mask(&dev->core, dummy_mask);

Same here.

Alan Stern

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox