* Re: [PATCH v11 39/40] virtio_net: support tx queue resize
From: Xuan Zhuo @ 2022-07-15 8:28 UTC (permalink / raw)
To: Jason Wang
Cc: Richard Weinberger, Anton Ivanov, Johannes Berg,
Michael S. Tsirkin, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Hans de Goede, Mark Gross, Vadim Pasternak,
Bjorn Andersson, Mathieu Poirier, Cornelia Huck, Halil Pasic,
Eric Farman, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
Christian Borntraeger, Sven Schnelle, Alexei Starovoitov,
Daniel Borkmann, Jesper Dangaard Brouer, John Fastabend,
Vincent Whitchurch, linux-um, netdev, platform-driver-x86,
linux-remoteproc, linux-s390, kvm,
open list:XDP (eXpress Data Path), kangjie.xu, virtualization
In-Reply-To: <CACGkMEt8MSS=tcn=Hd6WF9+btT0ccocxEd1ighRgK-V1uiWmCQ@mail.gmail.com>
On Fri, 8 Jul 2022 14:23:57 +0800, Jason Wang <jasowang@redhat.com> wrote:
> On Tue, Jul 5, 2022 at 10:01 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote:
> >
> > On Mon, 4 Jul 2022 11:45:52 +0800, Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On 2022/6/29 14:56, Xuan Zhuo wrote:
> > > > This patch implements the resize function of the tx queues.
> > > > Based on this function, it is possible to modify the ring num of the
> > > > queue.
> > > >
> > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> > > > ---
> > > > drivers/net/virtio_net.c | 48 ++++++++++++++++++++++++++++++++++++++++
> > > > 1 file changed, 48 insertions(+)
> > > >
> > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > index 6ab16fd193e5..fd358462f802 100644
> > > > --- a/drivers/net/virtio_net.c
> > > > +++ b/drivers/net/virtio_net.c
> > > > @@ -135,6 +135,9 @@ struct send_queue {
> > > > struct virtnet_sq_stats stats;
> > > >
> > > > struct napi_struct napi;
> > > > +
> > > > + /* Record whether sq is in reset state. */
> > > > + bool reset;
> > > > };
> > > >
> > > > /* Internal representation of a receive virtqueue */
> > > > @@ -279,6 +282,7 @@ struct padded_vnet_hdr {
> > > > };
> > > >
> > > > static void virtnet_rq_free_unused_buf(struct virtqueue *vq, void *buf);
> > > > +static void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf);
> > > >
> > > > static bool is_xdp_frame(void *ptr)
> > > > {
> > > > @@ -1603,6 +1607,11 @@ static void virtnet_poll_cleantx(struct receive_queue *rq)
> > > > return;
> > > >
> > > > if (__netif_tx_trylock(txq)) {
> > > > + if (READ_ONCE(sq->reset)) {
> > > > + __netif_tx_unlock(txq);
> > > > + return;
> > > > + }
> > > > +
> > > > do {
> > > > virtqueue_disable_cb(sq->vq);
> > > > free_old_xmit_skbs(sq, true);
> > > > @@ -1868,6 +1877,45 @@ static int virtnet_rx_resize(struct virtnet_info *vi,
> > > > return err;
> > > > }
> > > >
> > > > +static int virtnet_tx_resize(struct virtnet_info *vi,
> > > > + struct send_queue *sq, u32 ring_num)
> > > > +{
> > > > + struct netdev_queue *txq;
> > > > + int err, qindex;
> > > > +
> > > > + qindex = sq - vi->sq;
> > > > +
> > > > + virtnet_napi_tx_disable(&sq->napi);
> > > > +
> > > > + txq = netdev_get_tx_queue(vi->dev, qindex);
> > > > +
> > > > + /* 1. wait for all xmit to complete
> > > > + * 2. fix the race of netif_stop_subqueue() vs netif_start_subqueue()
> > > > + */
> > > > + __netif_tx_lock_bh(txq);
> > > > +
> > > > + /* Prevent rx poll from accessing sq. */
> > > > + WRITE_ONCE(sq->reset, true);
> > >
> > >
> > > Can we simply disable RX NAPI here?
> >
> > Disable rx napi is indeed a simple solution. But I hope that when dealing with
> > tx, it will not affect rx.
>
> Ok, but I think we've already synchronized with tx lock here, isn't it?
Yes. Is your question about WRITE_ONCE()? The flag is set back to false
later, and I do not take the lock there, so I used WRITE_ONCE()/READ_ONCE()
uniformly.
Thanks.
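
For readers following the thread, the flag handshake discussed above can be
sketched in user space. This is only an illustration, not driver code: it
assumes C11 relaxed atomics as a stand-in for READ_ONCE()/WRITE_ONCE(), an
atomic_flag as a stand-in for the tx queue lock, and a seq_cst fence as a
stand-in for smp_mb(). struct demo_sq and all demo_* helpers are hypothetical
names.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct send_queue. */
struct demo_sq {
	atomic_flag tx_lock;	/* plays the role of the tx queue lock */
	atomic_bool reset;	/* plays the role of sq->reset */
	int cleaned;		/* counts successful clean passes */
};

static void demo_sq_init(struct demo_sq *sq)
{
	atomic_flag_clear(&sq->tx_lock);
	atomic_init(&sq->reset, false);
	sq->cleaned = 0;
}

static bool demo_tx_trylock(struct demo_sq *sq)
{
	/* test_and_set returns the previous value: false means we got it */
	return !atomic_flag_test_and_set_explicit(&sq->tx_lock,
						  memory_order_acquire);
}

static void demo_tx_unlock(struct demo_sq *sq)
{
	atomic_flag_clear_explicit(&sq->tx_lock, memory_order_release);
}

/* Reader side: skip the clean pass while the queue is being reset. */
static bool demo_poll_cleantx(struct demo_sq *sq)
{
	if (!demo_tx_trylock(sq))
		return false;

	if (atomic_load_explicit(&sq->reset, memory_order_relaxed)) {
		demo_tx_unlock(sq);
		return false;
	}

	sq->cleaned++;
	demo_tx_unlock(sq);
	return true;
}

/* Writer side, step 1: the flag is set while holding the lock. */
static void demo_resize_begin(struct demo_sq *sq)
{
	while (!demo_tx_trylock(sq))
		;	/* spin until the lock is free */
	atomic_store_explicit(&sq->reset, true, memory_order_relaxed);
	demo_tx_unlock(sq);
}

/* Writer side, step 2: the flag is cleared without the lock. */
static void demo_resize_end(struct demo_sq *sq)
{
	atomic_thread_fence(memory_order_seq_cst);
	atomic_store_explicit(&sq->reset, false, memory_order_relaxed);
}
```

In this sketch the store in demo_resize_begin() is already ordered by the
lock, which is the point Jason raises; only the unlocked store in
demo_resize_end() relies on the marked access and the fence.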
>
> Thanks
>
> >
> > Thanks.
> >
> >
> > >
> > > Thanks
> > >
> > >
> > > > +
> > > > + /* Prevent the upper layer from trying to send packets. */
> > > > + netif_stop_subqueue(vi->dev, qindex);
> > > > +
> > > > + __netif_tx_unlock_bh(txq);
> > > > +
> > > > + err = virtqueue_resize(sq->vq, ring_num, virtnet_sq_free_unused_buf);
> > > > + if (err)
> > > > + netdev_err(vi->dev, "resize tx fail: tx queue index: %d err: %d\n", qindex, err);
> > > > +
> > > > + /* Memory barrier before set reset and start subqueue. */
> > > > + smp_mb();
> > > > +
> > > > + WRITE_ONCE(sq->reset, false);
> > > > + netif_tx_wake_queue(txq);
> > > > +
> > > > + virtnet_napi_tx_enable(vi, sq->vq, &sq->napi);
> > > > + return err;
> > > > +}
> > > > +
> > > > /*
> > > > * Send command via the control virtqueue and check status. Commands
> > > > * supported by the hypervisor, as indicated by feature bits, should
> > >
> >
>
* Re: [PATCH 1/3] s390/cpufeature: rework to allow more than only hwcap bits
From: Hendrik Brueckner @ 2022-07-15 8:03 UTC (permalink / raw)
To: Heiko Carstens
Cc: Steffen Eiden, Alexander Gordeev, Christian Borntraeger,
Janosch Frank, Claudio Imbrenda, Vasily Gorbik, linux-s390,
linux-kernel, linux-mm, nrb
In-Reply-To: <Ys/1ab1BXPw1RWuy@osiris>
On Thu, Jul 14, 2022 at 12:52:25PM +0200, Heiko Carstens wrote:
> > > > +static struct s390_cpu_feature s390_cpu_features[MAX_CPU_FEATURES] = {
> > > > + [S390_CPU_FEATURE_ESAN3] = {.type = TYPE_HWCAP, .num = HWCAP_NR_ESAN3},
> > > > + [S390_CPU_FEATURE_ZARCH] = {.type = TYPE_HWCAP, .num = HWCAP_NR_ZARCH},
> ...
> > > I only realized now that you added all HWCAP bits here. It was
> > > intentional that I added only the two bits which are currently used
> > > for several reasons:
> > >
> > > - Keep the array as small as possible.
> > > - No need to keep this array in sync with HWCAPs, if new ones are added.
> > > - There is a for loop in print_cpu_modalias() which iterates over all
> > > MAX_CPU_FEATURES entries; this should be as fast as possible. Adding
> > > extra entries burns cycles for no added value.
> > The loop in print_cpu_modalias() was the reason why I added all
> > current HWCAPs. The current implementation runs through all HWCAPs
> > using cpu_have_feature() and I feared that reducing to just MSA and
> > VXRS has effects in the reporting of CPU-features to userspace.
> >
> > I double checked the output of 'grep features /proc/cpuinfo' and it
> > stays the same, for 5.19-rc6, 5.19-rc6+this series, 5.19-rc6+this series
> > with just the two S390_CPU_FEATUREs. I might have misunderstood what happens
> > in that loop in print_cpu_modalias().
>
> It is used on cpu hotplug to generate a MODALIAS environment
> variable. You can check that by running "udevadm monitor -p"
> and then switching a cpu off/on.
>
> This environment variable is then used by systemd/udev to load
> feature matching modules via kmod.
See also some notes on the cpu feature in KRN1305 spec (introduced w/ VX
support).
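
To illustrate why the array size matters for that loop, here is a rough
user-space sketch of a loop in the style of print_cpu_modalias(). The
DEMO_MAX_CPU_FEATURES constant, the demo_have_feature() helper and the exact
string layout are assumptions made for illustration; they are not copied from
the kernel.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical upper bound; every slot is visited on each call. */
#define DEMO_MAX_CPU_FEATURES 64

/* Hypothetical feature test: pretend only features 3 and 11 are present. */
static bool demo_have_feature(unsigned int num)
{
	return num == 3 || num == 11;
}

/*
 * Build a modalias-like string listing every available feature number.
 * Returns the string length, or -1 if the buffer is too small.
 */
static int demo_print_modalias(char *buf, size_t size)
{
	int len = snprintf(buf, size, "cpu:type:demo:feature:");
	unsigned int i;

	if (len < 0 || (size_t)len >= size)
		return -1;

	for (i = 0; i < DEMO_MAX_CPU_FEATURES; i++) {
		int n;

		if (!demo_have_feature(i))
			continue;

		n = snprintf(buf + len, size - (size_t)len, ",%04X", i);
		if (n < 0 || (size_t)n >= size - (size_t)len)
			return -1;
		len += n;
	}

	return len;
}
```

The loop cost grows with DEMO_MAX_CPU_FEATURES, not with the number of
features that are actually present, which is why a small table is preferable.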
* Re: [PATCH v2 1/3] s390/cpufeature: rework to allow more than only hwcap bits
From: Hendrik Brueckner @ 2022-07-15 7:59 UTC (permalink / raw)
To: Steffen Eiden
Cc: Heiko Carstens, Alexander Gordeev, Christian Borntraeger,
Janosch Frank, Claudio Imbrenda, Vasily Gorbik, linux-s390,
linux-kernel, linux-mm, nrb
In-Reply-To: <20220713125644.16121-2-seiden@linux.ibm.com>
On Wed, Jul 13, 2022 at 02:56:42PM +0200, Steffen Eiden wrote:
> From: Heiko Carstens <hca@linux.ibm.com>
>
> Rework cpufeature implementation to allow for various cpu feature
> indications, which is not only limited to hwcap bits. This is achieved
> by adding a sequential list of cpu feature numbers, where each of them
> is mapped to an entry which indicates what this number is about.
>
> Each entry contains a type member, which indicates what feature
> name space to look into (e.g. hwcap, or cpu facility). If wanted this
> allows also to automatically load modules only in e.g. z/VM
> configurations.
>
> Signed-off-by: Heiko Carstens <hca@linux.ibm.com>
> Signed-off-by: Steffen Eiden <seiden@linux.ibm.com>
> ---
> arch/s390/crypto/aes_s390.c | 2 +-
> arch/s390/crypto/chacha-glue.c | 2 +-
> arch/s390/crypto/crc32-vx.c | 2 +-
> arch/s390/crypto/des_s390.c | 2 +-
> arch/s390/crypto/ghash_s390.c | 2 +-
> arch/s390/crypto/prng.c | 2 +-
> arch/s390/crypto/sha1_s390.c | 2 +-
> arch/s390/crypto/sha256_s390.c | 2 +-
> arch/s390/crypto/sha3_256_s390.c | 2 +-
> arch/s390/crypto/sha3_512_s390.c | 2 +-
> arch/s390/crypto/sha512_s390.c | 2 +-
Regarding facility bits for cpu features: Initially, I used the
MSA hwcap to cover all ciphers across all hw generations. With facility bit
checks, it makes more sense to fine-tune and load based on the respective
MSA level or CPACF functions that are required for the ciphers/hashes.
E.g. like
> diff --git a/arch/s390/crypto/sha512_s390.c b/arch/s390/crypto/sha512_s390.c
> index 43ce4956df73..04f11c407763 100644
> --- a/arch/s390/crypto/sha512_s390.c
> +++ b/arch/s390/crypto/sha512_s390.c
> @@ -142,7 +142,7 @@ static void __exit fini(void)
> crypto_unregister_shash(&sha384_alg);
> }
>
> -module_cpu_feature_match(MSA, init);
> +module_cpu_feature_match(S390_CPU_FEATURE_MSA, init);
> module_exit(fini);
which is loaded automatically if (any) MSA is available and then
performs this check:
cpacf_query_func(CPACF_KIMD, CPACF_KIMD_SHA_512)
which in the worst case would fail.
A very useful follow-up patch might move those module init checks
into the cpu feature.
Other than that,
Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com>
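
As an illustration of the typed lookup described in the commit message, and of
how a finer-grained entry could sit next to the MSA hwcap, here is a minimal
user-space sketch. All demo_* and DEMO_* names, the bit numbers and the two
bitmaps are hypothetical; only the idea of a table entry carrying a type and a
number is taken from the patch.

```c
#include <stdbool.h>

/* Which name space a feature number refers to. */
enum demo_feature_type {
	DEMO_TYPE_HWCAP,
	DEMO_TYPE_FACILITY,
};

/* Sequential list of feature numbers, as in the reworked cpufeature code. */
enum demo_feature {
	DEMO_FEATURE_MSA,
	DEMO_FEATURE_VXRS,
	DEMO_FEATURE_SHA512,	/* hypothetical fine-grained entry */
	DEMO_MAX_FEATURES,
};

struct demo_cpu_feature {
	enum demo_feature_type type;
	unsigned int num;	/* bit number inside the chosen name space */
};

static const struct demo_cpu_feature demo_features[DEMO_MAX_FEATURES] = {
	[DEMO_FEATURE_MSA]    = { .type = DEMO_TYPE_HWCAP,    .num = 3 },
	[DEMO_FEATURE_VXRS]   = { .type = DEMO_TYPE_HWCAP,    .num = 11 },
	[DEMO_FEATURE_SHA512] = { .type = DEMO_TYPE_FACILITY, .num = 5 },
};

/* Hypothetical bitmaps standing in for elf_hwcap and the facility list. */
static unsigned long demo_hwcap;
static unsigned long demo_facilities;

/* Dispatch on the entry type to pick the bitmap that gets tested. */
static bool demo_cpu_have_feature(unsigned int nr)
{
	const struct demo_cpu_feature *f;

	if (nr >= DEMO_MAX_FEATURES)
		return false;

	f = &demo_features[nr];
	switch (f->type) {
	case DEMO_TYPE_HWCAP:
		return (demo_hwcap & (1UL << f->num)) != 0;
	case DEMO_TYPE_FACILITY:
		return (demo_facilities & (1UL << f->num)) != 0;
	}

	return false;
}
```

With an entry like DEMO_FEATURE_SHA512, a module would only be loaded when the
specific function is present, so the query in its init function could no
longer fail on MSA-capable machines that lack it.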
* Re: [PATCH v2 0/3] s390/cpufeature: rework to allow different types of cpufeatures
From: Hendrik Brueckner @ 2022-07-15 7:49 UTC (permalink / raw)
To: Steffen Eiden
Cc: Heiko Carstens, Alexander Gordeev, Christian Borntraeger,
Janosch Frank, Claudio Imbrenda, Vasily Gorbik, linux-s390,
linux-kernel, linux-mm, nrb
In-Reply-To: <20220713125644.16121-1-seiden@linux.ibm.com>
Maybe I am a bit late ...
On Wed, Jul 13, 2022 at 02:56:41PM +0200, Steffen Eiden wrote:
> Currently the s390 implementation of cpufeature is limited to elf_hwcap
> bits. Using these to automatically load modules also exposes this
> cpufeature to userspace, which is sometimes not intended.
Those features are (always) exposed to user space, as module loading is
actually done by udev rules. However, we had some pseudo-hwcaps (e.g. sie64a)
in the past... but I very much appreciate this change!
Thanks.
* Re: [RFC PATCH 1/2] asm-generic: Remove pci.h copying code out to architectures
From: Arnd Bergmann @ 2022-07-15 7:40 UTC (permalink / raw)
To: Max Filippov
Cc: Stafford Horne, LKML, Arnd Bergmann, Richard Henderson,
Ivan Kokshaysky, Matt Turner, Geert Uytterhoeven,
Michael Ellerman, Benjamin Herrenschmidt, Paul Mackerras,
Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
Christian Borntraeger, Sven Schnelle, David S. Miller,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
maintainer:X86 ARCHITECTURE..., H. Peter Anvin, Chris Zankel,
Bjorn Helgaas, Paul Walmsley, Palmer Dabbelt, Albert Ou,
Nick Child, Niklas Schnelle, Matthew Rosato, Pierre Morel,
Kees Cook, Gustavo A. R. Silva, open list:ALPHA PORT,
open list:IA64 (Itanium) PL..., open list:M68K ARCHITECTURE,
linuxppc-dev, linux-s390, open list:SPARC + UltraSPAR...,
open list:TENSILICA XTENSA PORT (xtensa), linux-pci, Linux-Arch,
linux-riscv
In-Reply-To: <CAMo8BfKkGRHiFq1vu1ZKkURkUqC+Ee7D42yuKrCeDF+578s9cw@mail.gmail.com>
On Fri, Jul 15, 2022 at 3:45 AM Max Filippov <jcmvbkbc@gmail.com> wrote:
> On Thu, Jul 14, 2022 at 2:47 PM Stafford Horne <shorne@gmail.com> wrote:
>
> > +static inline int pci_get_legacy_ide_irq(struct pci_dev *dev, int channel)
> > +{
> > + return channel ? 15 : 14;
> > +}
>
> This addition does not make sense for the xtensa as it isn't even possible
> to enable PNP support (the only user of this function) on xtensa.
Nice catch! I had looked at this function earlier and only tried to infer
which architectures might have this based on who has those interrupt
numbers reserved for ISA devices, but looking at CONFIG_PNP is clearly
better here.
PNP depends on "ISA || ACPI", and this already rules out most
architectures. The remaining ones are:
* x86, ia64, alpha: These clearly use PNP based on-board devices on
common machines, and use PC-style interrupts
* arm64, loongarch: These select PNP when ACPI is enabled. I don't
think they actually use PNP, but for the moment the function needs to
be defined, probably returning 0. Loongarch still lacks PCI support
though, so asm/pci.h is not yet there.
* arm, mips, powerpc: Only a few older machines in each of these
support ISA devices, and the function is probably machine specific.
These all have a custom pci.h already and don't use the asm-generic
version.
* m68k: there are two that enable CONFIG_ISA and one that enables
CONFIG_PCI, but nothing that has both, so we don't need this
function.
In summary, I think only x86 actually uses this function, and it is
correct there, everything else either has its own implementation
or does not need it, so the existing asm-generic/pci.h file can
just be folded into the x86 asm/pci.h. That is a great cleanup.
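
For reference, the two behaviours discussed above can be sketched as follows.
struct demo_pci_dev and both demo_* functions are placeholders so that the
snippet builds outside the kernel; the stub returning 0 is only the
"probably returning 0" guess from the arm64/loongarch bullet, not an existing
implementation.

```c
/* Placeholder for struct pci_dev; the helper never looks inside it. */
struct demo_pci_dev {
	int unused;
};

/* PC-style mapping: primary channel uses IRQ 14, secondary uses IRQ 15. */
static inline int demo_pc_legacy_ide_irq(struct demo_pci_dev *dev, int channel)
{
	(void)dev;
	return channel ? 15 : 14;
}

/* Stub for platforms that must define the symbol but have no legacy IDE. */
static inline int demo_stub_legacy_ide_irq(struct demo_pci_dev *dev, int channel)
{
	(void)dev;
	(void)channel;
	return 0;
}
```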
Arnd
* Re: [PATCH V12 01/20] uapi: simplify __ARCH_FLOCK{,64}_PAD a little
From: Florian Fainelli @ 2022-07-15 3:13 UTC (permalink / raw)
To: guoren, palmer, arnd, gregkh, hch, nathan, naresh.kamboju
Cc: linux-arch, linux-kernel, linux-riscv, linux-s390, sparclinux,
linuxppc-dev, linux-parisc, linux-mips, linux-arm-kernel, x86,
heiko
In-Reply-To: <20220405071314.3225832-2-guoren@kernel.org>
On 4/5/2022 12:12 AM, guoren@kernel.org wrote:
> From: Christoph Hellwig <hch@lst.de>
>
> Don't bother to define the symbols empty, just don't use them.
> That makes the intent a little more clear.
>
> Remove the unused HAVE_ARCH_STRUCT_FLOCK64 define and merge the
> 32-bit mips struct flock into the generic one.
>
> Add a new __ARCH_FLOCK_EXTRA_SYSID macro following the style of
> __ARCH_FLOCK_PAD to avoid having a separate definition just for
> one architecture.
>
> Signed-off-by: Christoph Hellwig <hch@lst.de>
> Signed-off-by: Guo Ren <guoren@kernel.org>
> Reviewed-by: Arnd Bergmann <arnd@arndb.de>
> Tested-by: Heiko Stuebner <heiko@sntech.de>
I am late to this, but this breaks the perf build for me using a MIPS
toolchain, with the following error:
  CC      /home/fainelli/work/buildroot/output/bmips/build/linux-custom/tools/perf/trace/beauty/fcntl.o
In file included from ../../../../host/mipsel-buildroot-linux-gnu/sysroot/usr/include/asm/fcntl.h:77,
                 from ../include/uapi/linux/fcntl.h:5,
                 from trace/beauty/fcntl.c:10:
../include/uapi/asm-generic/fcntl.h:188:8: error: redefinition of 'struct flock'
 struct flock {
        ^~~~~
In file included from ../include/uapi/linux/fcntl.h:5,
                 from trace/beauty/fcntl.c:10:
../../../../host/mipsel-buildroot-linux-gnu/sysroot/usr/include/asm/fcntl.h:63:8: note: originally defined here
 struct flock {
        ^~~~~
make[6]: *** [/home/fainelli/work/buildroot/output/bmips/build/linux-custom/tools/build/Makefile.build:97: /home/fainelli/work/buildroot/output/bmips/build/linux-custom/tools/perf/trace/beauty/fcntl.o] Error 1
The toolchain's kernel headers are set to 4.1.31, which is arguably old, but
toolchains using newer kernel headers unfortunately do not fare much better:
I also tried a toolchain with 4.9.x kernel headers.
I will start doing more regular MIPS builds of the perf tools since that
seems to escape our testing.
Thanks!
--
Florian
* Re: [PATCH v2 0/4] mm: arm64: bring up BATCHED_UNMAP_TLB_FLUSH
From: Yicong Yang @ 2022-07-15 2:47 UTC (permalink / raw)
To: Barry Song, xhao
Cc: yangyicong, Andrew Morton, Linux-MM, LAK, x86, Catalin Marinas,
Will Deacon, Linux Doc Mailing List, Jonathan Corbet,
Arnd Bergmann, LKML, Darren Hart, huzhanyuan,
李培锋(wink),
张诗明(Simon Zhang), 郭健, real mz,
linux-mips, openrisc, linuxppc-dev, linux-riscv, linux-s390,
tiantao (H)
In-Reply-To: <CAGsJ_4zjnmQV6LT3yo--K-qD-92=hBmgfK121=n-Y0oEFX8RnQ@mail.gmail.com>
On 2022/7/14 12:51, Barry Song wrote:
> On Thu, Jul 14, 2022 at 3:29 PM Xin Hao <xhao@linux.alibaba.com> wrote:
>>
>> Hi Barry.
>>
>> I ran some tests on a Kunpeng arm64 machine using Unixbench.
>>
>> The test results are below.
>>
>> With one core, we can see a performance improvement above +30%.
>
> I am really pleased to see the 30%+ improvement on unixbench on single core.
>
>> ./Run -c 1 -i 1 shell1
>> w/o
>> System Benchmarks Partial Index BASELINE RESULT INDEX
>> Shell Scripts (1 concurrent) 42.4 5481.0 1292.7
>> ========
>> System Benchmarks Index Score (Partial Only) 1292.7
>>
>> w/
>> System Benchmarks Partial Index BASELINE RESULT INDEX
>> Shell Scripts (1 concurrent) 42.4 6974.6 1645.0
>> ========
>> System Benchmarks Index Score (Partial Only) 1645.0
>>
>>
>> But with all cores, there is a small performance degradation, above -5%.
>
> That is sad as we might get more concurrency between mprotect(), madvise(),
> mremap(), zap_pte_range() and the deferred tlbi.
>
>>
>> ./Run -c 96 -i 1 shell1
>> w/o
>> Shell Scripts (1 concurrent) 80765.5 lpm (60.0 s, 1
>> samples)
>> System Benchmarks Partial Index BASELINE RESULT INDEX
>> Shell Scripts (1 concurrent) 42.4 80765.5 19048.5
>> ========
>> System Benchmarks Index Score (Partial Only) 19048.5
>>
>> w/
>> Shell Scripts (1 concurrent) 76333.6 lpm (60.0 s, 1
>> samples)
>> System Benchmarks Partial Index BASELINE RESULT INDEX
>> Shell Scripts (1 concurrent) 42.4 76333.6 18003.2
>> ========
>> System Benchmarks Index Score (Partial Only) 18003.2
>>
>> ----------------------------------------------------------------------------------------------
>>
>>
>> After discussing this with you, I made some changes in the patch.
>>
>> index a52381a680db..1ecba81f1277 100644
>> --- a/mm/rmap.c
>> +++ b/mm/rmap.c
>> @@ -727,7 +727,11 @@ void flush_tlb_batched_pending(struct mm_struct *mm)
>> int flushed = batch >> TLB_FLUSH_BATCH_FLUSHED_SHIFT;
>>
>> if (pending != flushed) {
>> +#ifdef CONFIG_ARCH_HAS_MM_CPUMASK
>> flush_tlb_mm(mm);
>> +#else
>> + dsb(ish);
>> +#endif
>>
>
> I was guessing the problem might be flush_tlb_batched_pending(),
> so I asked you to change this to verify my guess.
>
> /*
>> * If the new TLB flushing is pending during flushing, leave
>> * mm->tlb_flush_batched as is, to avoid losing flushing.
>>
>> There is a performance improvement with all cores, above +30%.
>
> But I don't think it is a proper patch. There is no guarantee that the cpu calling
> flush_tlb_batched_pending() is exactly the cpu that sent the deferred
> tlbi, so the solution is unsafe. But since this temporary code can bring the
> 30%+ performance improvement back for high concurrency, we have huge
> potential to finally make it work.
>
> Unfortunately I don't have an arm64 server to debug this on. I only have
> 8 cores, which are unlikely to reproduce a regression that happens under
> high concurrency with 96 parallel tasks.
>
> So I'd ask if @yicong or someone else working on kunpeng or other
> arm64 servers is able to actually debug and figure out a proper
> patch for this, then add the patch as 5/5 into this series?
>
Sure, Tiantao and I will look into this on Kunpeng 920.
>>
>> ./Run -c 96 -i 1 shell1
>> 96 CPUs in system; running 96 parallel copies of tests
>>
>> Shell Scripts (1 concurrent) 109229.0 lpm (60.0 s, 1 samples)
>> System Benchmarks Partial Index BASELINE RESULT INDEX
>> Shell Scripts (1 concurrent) 42.4 109229.0 25761.6
>> ========
>> System Benchmarks Index Score (Partial Only) 25761.6
>>
>>
>> Tested-by: Xin Hao<xhao@linux.alibaba.com>
>
> Thanks for your testing!
>
>>
>> Looking forward to your next version patch.
>>
>> On 7/11/22 11:46 AM, Barry Song wrote:
>>> Though ARM64 has the hardware to do tlb shootdown, the hardware
>>> broadcasting is not free.
>>> A simplest micro benchmark shows even on snapdragon 888 with only
>>> 8 cores, the overhead for ptep_clear_flush is huge even for paging
>>> out one page mapped by only one process:
>>> 5.36% a.out [kernel.kallsyms] [k] ptep_clear_flush
>>>
>>> While pages are mapped by multiple processes or HW has more CPUs,
>>> the cost should become even higher due to the bad scalability of
>>> tlb shootdown.
>>>
>>> The same benchmark can result in 16.99% CPU consumption on ARM64
>>> server with around 100 cores according to Yicong's test on patch
>>> 4/4.
>>>
>>> This patchset leverages the existing BATCHED_UNMAP_TLB_FLUSH by
>>> 1. only send tlbi instructions in the first stage -
>>> arch_tlbbatch_add_mm()
>>> 2. wait for the completion of tlbi by dsb while doing tlbbatch
>>> sync in arch_tlbbatch_flush()
>>> My testing on snapdragon shows the overhead of ptep_clear_flush
>>> is removed by the patchset. The micro benchmark becomes 5% faster
>>> even for one page mapped by single process on snapdragon 888.
>>>
>>>
>>> -v2:
>>> 1. Collected Yicong's test result on kunpeng920 ARM64 server;
>>> 2. Removed the redundant vma parameter in arch_tlbbatch_add_mm()
>>> according to the comments of Peter Zijlstra and Dave Hansen
>>> 3. Added ARCH_HAS_MM_CPUMASK rather than checking if mm_cpumask
>>> is empty according to the comments of Nadav Amit
>>>
>>> Thanks, Yicong, Peter, Dave and Nadav for your testing or reviewing
>>> , and comments.
>>>
>>> -v1:
>>> https://lore.kernel.org/lkml/20220707125242.425242-1-21cnbao@gmail.com/
>>>
>>> Barry Song (4):
>>> Revert "Documentation/features: mark BATCHED_UNMAP_TLB_FLUSH doesn't
>>> apply to ARM64"
>>> mm: rmap: Allow platforms without mm_cpumask to defer TLB flush
>>> mm: rmap: Extend tlbbatch APIs to fit new platforms
>>> arm64: support batched/deferred tlb shootdown during page reclamation
>>>
>>> Documentation/features/arch-support.txt | 1 -
>>> .../features/vm/TLB/arch-support.txt | 2 +-
>>> arch/arm/Kconfig | 1 +
>>> arch/arm64/Kconfig | 1 +
>>> arch/arm64/include/asm/tlbbatch.h | 12 ++++++++++
>>> arch/arm64/include/asm/tlbflush.h | 23 +++++++++++++++++--
>>> arch/loongarch/Kconfig | 1 +
>>> arch/mips/Kconfig | 1 +
>>> arch/openrisc/Kconfig | 1 +
>>> arch/powerpc/Kconfig | 1 +
>>> arch/riscv/Kconfig | 1 +
>>> arch/s390/Kconfig | 1 +
>>> arch/um/Kconfig | 1 +
>>> arch/x86/Kconfig | 1 +
>>> arch/x86/include/asm/tlbflush.h | 3 ++-
>>> mm/Kconfig | 3 +++
>>> mm/rmap.c | 14 +++++++----
>>> 17 files changed, 59 insertions(+), 9 deletions(-)
>>> create mode 100644 arch/arm64/include/asm/tlbbatch.h
>>>
>> --
>> Best Regards!
>> Xin Hao
>>
>
> Thanks
> Barry
> .
>
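
The two-stage scheme from the cover letter quoted above can be mimicked in
user space to show where the saving comes from. This is a counting model only:
the tlbi and dsb fields stand in for the instructions of the same name,
struct demo_batch and the demo_* helpers are hypothetical, and nothing here is
arm64 code or reflects real instruction costs.

```c
#include <stddef.h>

/* Counts how many invalidates and barrier waits a strategy would issue. */
struct demo_counters {
	unsigned int tlbi;
	unsigned int dsb;
};

/* Hypothetical batch state: number of invalidates not yet waited for. */
struct demo_batch {
	size_t pending;
};

/* Unbatched path: every page pays for an invalidate plus a barrier wait. */
static void demo_flush_each(struct demo_counters *c, size_t pages)
{
	size_t i;

	for (i = 0; i < pages; i++) {
		c->tlbi++;
		c->dsb++;
	}
}

/* Stage 1: only send the invalidate and remember that one is outstanding. */
static void demo_batch_add(struct demo_counters *c, struct demo_batch *b)
{
	c->tlbi++;
	b->pending++;
}

/* Stage 2: a single barrier waits for all outstanding invalidates. */
static void demo_batch_flush(struct demo_counters *c, struct demo_batch *b)
{
	if (!b->pending)
		return;

	c->dsb++;
	b->pending = 0;
}
```

For N pages the unbatched path waits N times, while the batched path waits
once. The open question raised in this thread is which cpu issues that final
wait, which this model does not capture.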
* Re: [RFC PATCH 1/2] asm-generic: Remove pci.h copying code out to architectures
From: Stafford Horne @ 2022-07-15 2:27 UTC (permalink / raw)
To: Max Filippov
Cc: LKML, Arnd Bergmann, Richard Henderson, Ivan Kokshaysky,
Matt Turner, Geert Uytterhoeven, Michael Ellerman,
Benjamin Herrenschmidt, Paul Mackerras, Heiko Carstens,
Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
Sven Schnelle, David S. Miller, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, maintainer:X86 ARCHITECTURE...,
H. Peter Anvin, Chris Zankel, Bjorn Helgaas, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Nick Child, Niklas Schnelle,
Matthew Rosato, Pierre Morel, Kees Cook, Gustavo A. R. Silva,
open list:ALPHA PORT, open list:IA64 (Itanium) PL...,
open list:M68K ARCHITECTURE, linuxppc-dev, linux-s390,
open list:SPARC + UltraSPAR...,
open list:TENSILICA XTENSA PORT (xtensa), linux-pci, Linux-Arch,
linux-riscv
In-Reply-To: <CAMo8BfKkGRHiFq1vu1ZKkURkUqC+Ee7D42yuKrCeDF+578s9cw@mail.gmail.com>
On Thu, Jul 14, 2022 at 06:45:27PM -0700, Max Filippov wrote:
> On Thu, Jul 14, 2022 at 2:47 PM Stafford Horne <shorne@gmail.com> wrote:
> >
> > The generic pci.h header provides a definition of pci_get_legacy_ide_irq
> > which is used by architectures that use PC-style interrupt numbers.
> >
> > This patch removes the old pci.h in order to make room for a new
> > pci.h to be used by arm64, riscv, openrisc, etc.
> >
> > The existing code in pci.h is moved out to architectures.
> >
> > Suggested-by: Arnd Bergmann <arnd@arndb.de>
> > Link: https://lore.kernel.org/lkml/CAK8P3a0JmPeczfmMBE__vn=Jbvf=nkbpVaZCycyv40pZNCJJXQ@mail.gmail.com/
> > Signed-off-by: Stafford Horne <shorne@gmail.com>
> > ---
> > arch/alpha/include/asm/pci.h | 1 -
> > arch/ia64/include/asm/pci.h | 1 -
> > arch/m68k/include/asm/pci.h | 7 +++++--
> > arch/powerpc/include/asm/pci.h | 1 -
> > arch/s390/include/asm/pci.h | 6 +++++-
> > arch/sparc/include/asm/pci.h | 5 ++++-
> > arch/x86/include/asm/pci.h | 6 ++++--
> > arch/xtensa/include/asm/pci.h | 6 ++++--
> > include/asm-generic/pci.h | 17 -----------------
> > 9 files changed, 22 insertions(+), 28 deletions(-)
> > delete mode 100644 include/asm-generic/pci.h
>
> [...]
>
> > diff --git a/arch/xtensa/include/asm/pci.h b/arch/xtensa/include/asm/pci.h
> > index 8e2b48a268db..f57ede61f5db 100644
> > --- a/arch/xtensa/include/asm/pci.h
> > +++ b/arch/xtensa/include/asm/pci.h
> > @@ -43,7 +43,9 @@
> > #define ARCH_GENERIC_PCI_MMAP_RESOURCE 1
> > #define arch_can_pci_mmap_io() 1
> >
> > -/* Generic PCI */
> > -#include <asm-generic/pci.h>
>
> Ok.
>
> > +static inline int pci_get_legacy_ide_irq(struct pci_dev *dev, int channel)
> > +{
> > + return channel ? 15 : 14;
> > +}
>
> This addition does not make sense for the xtensa as it isn't even possible
> to enable PNP support (the only user of this function) on xtensa.
Thanks for your feedback, this is the kind of feedback I was hoping to fish out
with this patch. I will look into completely removing this then.
-Stafford
* [PATCH] s390/hmcdrv: fix Kconfig "its" grammar
From: Randy Dunlap @ 2022-07-15 2:00 UTC (permalink / raw)
To: linux-kernel
Cc: Randy Dunlap, Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
linux-s390
Use the possessive "its" instead of the contraction "it's"
where appropriate.
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: linux-s390@vger.kernel.org
---
drivers/s390/char/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
--- a/drivers/s390/char/Kconfig
+++ b/drivers/s390/char/Kconfig
@@ -89,7 +89,7 @@ config HMC_DRV
Management Console (HMC) drive CD/DVD-ROM. It is available as a
module, called 'hmcdrv', and also as kernel built-in. There is one
optional parameter for this module: cachesize=N, which modifies the
- transfer cache size from it's default value 0.5MB to N bytes. If N
+ transfer cache size from its default value 0.5MB to N bytes. If N
is zero, then no caching is performed.
config SCLP_OFB
* Re: [RFC PATCH 1/2] asm-generic: Remove pci.h copying code out to architectures
From: Max Filippov @ 2022-07-15 1:45 UTC (permalink / raw)
To: Stafford Horne
Cc: LKML, Arnd Bergmann, Richard Henderson, Ivan Kokshaysky,
Matt Turner, Geert Uytterhoeven, Michael Ellerman,
Benjamin Herrenschmidt, Paul Mackerras, Heiko Carstens,
Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
Sven Schnelle, David S. Miller, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, maintainer:X86 ARCHITECTURE...,
H. Peter Anvin, Chris Zankel, Bjorn Helgaas, Paul Walmsley,
Palmer Dabbelt, Albert Ou, Nick Child, Niklas Schnelle,
Matthew Rosato, Pierre Morel, Kees Cook, Gustavo A. R. Silva,
open list:ALPHA PORT, open list:IA64 (Itanium) PL...,
open list:M68K ARCHITECTURE, linuxppc-dev, linux-s390,
open list:SPARC + UltraSPAR...,
open list:TENSILICA XTENSA PORT (xtensa), linux-pci, Linux-Arch,
linux-riscv
In-Reply-To: <20220714214657.2402250-2-shorne@gmail.com>
On Thu, Jul 14, 2022 at 2:47 PM Stafford Horne <shorne@gmail.com> wrote:
>
> The generic pci.h header provides a definition of pci_get_legacy_ide_irq
> which is used by architectures that use PC-style interrupt numbers.
>
> This patch removes the old pci.h in order to make room for a new
> pci.h to be used by arm64, riscv, openrisc, etc.
>
> The existing code in pci.h is moved out to architectures.
>
> Suggested-by: Arnd Bergmann <arnd@arndb.de>
> Link: https://lore.kernel.org/lkml/CAK8P3a0JmPeczfmMBE__vn=Jbvf=nkbpVaZCycyv40pZNCJJXQ@mail.gmail.com/
> Signed-off-by: Stafford Horne <shorne@gmail.com>
> ---
> arch/alpha/include/asm/pci.h | 1 -
> arch/ia64/include/asm/pci.h | 1 -
> arch/m68k/include/asm/pci.h | 7 +++++--
> arch/powerpc/include/asm/pci.h | 1 -
> arch/s390/include/asm/pci.h | 6 +++++-
> arch/sparc/include/asm/pci.h | 5 ++++-
> arch/x86/include/asm/pci.h | 6 ++++--
> arch/xtensa/include/asm/pci.h | 6 ++++--
> include/asm-generic/pci.h | 17 -----------------
> 9 files changed, 22 insertions(+), 28 deletions(-)
> delete mode 100644 include/asm-generic/pci.h
[...]
> diff --git a/arch/xtensa/include/asm/pci.h b/arch/xtensa/include/asm/pci.h
> index 8e2b48a268db..f57ede61f5db 100644
> --- a/arch/xtensa/include/asm/pci.h
> +++ b/arch/xtensa/include/asm/pci.h
> @@ -43,7 +43,9 @@
> #define ARCH_GENERIC_PCI_MMAP_RESOURCE 1
> #define arch_can_pci_mmap_io() 1
>
> -/* Generic PCI */
> -#include <asm-generic/pci.h>
Ok.
> +static inline int pci_get_legacy_ide_irq(struct pci_dev *dev, int channel)
> +{
> + return channel ? 15 : 14;
> +}
This addition does not make sense for the xtensa as it isn't even possible
to enable PNP support (the only user of this function) on xtensa.
--
Thanks.
-- Max
* [RFC PATCH 1/2] asm-generic: Remove pci.h copying code out to architectures
From: Stafford Horne @ 2022-07-14 21:46 UTC (permalink / raw)
To: LKML
Cc: Arnd Bergmann, Stafford Horne, Richard Henderson, Ivan Kokshaysky,
Matt Turner, Geert Uytterhoeven, Michael Ellerman,
Benjamin Herrenschmidt, Paul Mackerras, Heiko Carstens,
Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
Sven Schnelle, David S. Miller, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Chris Zankel,
Max Filippov, Bjorn Helgaas, Paul Walmsley, Palmer Dabbelt,
Albert Ou, Nick Child, Niklas Schnelle, Matthew Rosato,
Pierre Morel, Kees Cook, Gustavo A. R. Silva, linux-alpha,
linux-ia64, linux-m68k, linuxppc-dev, linux-s390, sparclinux,
linux-xtensa, linux-pci, linux-arch, linux-riscv
In-Reply-To: <20220714214657.2402250-1-shorne@gmail.com>
The generic pci.h header provides a definition of pci_get_legacy_ide_irq
which is used by architectures that use PC-style interrupt numbers.
This patch removes the old pci.h in order to make room for a new
pci.h to be used by arm64, riscv, openrisc, etc.
The existing code in pci.h is moved out to architectures.
Suggested-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/lkml/CAK8P3a0JmPeczfmMBE__vn=Jbvf=nkbpVaZCycyv40pZNCJJXQ@mail.gmail.com/
Signed-off-by: Stafford Horne <shorne@gmail.com>
---
arch/alpha/include/asm/pci.h | 1 -
arch/ia64/include/asm/pci.h | 1 -
arch/m68k/include/asm/pci.h | 7 +++++--
arch/powerpc/include/asm/pci.h | 1 -
arch/s390/include/asm/pci.h | 6 +++++-
arch/sparc/include/asm/pci.h | 5 ++++-
arch/x86/include/asm/pci.h | 6 ++++--
arch/xtensa/include/asm/pci.h | 6 ++++--
include/asm-generic/pci.h | 17 -----------------
9 files changed, 22 insertions(+), 28 deletions(-)
delete mode 100644 include/asm-generic/pci.h
diff --git a/arch/alpha/include/asm/pci.h b/arch/alpha/include/asm/pci.h
index cf6bc1e64d66..8ac5af0fc4da 100644
--- a/arch/alpha/include/asm/pci.h
+++ b/arch/alpha/include/asm/pci.h
@@ -56,7 +56,6 @@ struct pci_controller {
/* IOMMU controls. */
-/* TODO: integrate with include/asm-generic/pci.h ? */
static inline int pci_get_legacy_ide_irq(struct pci_dev *dev, int channel)
{
return channel ? 15 : 14;
diff --git a/arch/ia64/include/asm/pci.h b/arch/ia64/include/asm/pci.h
index 8c163d1d0189..218412d963c2 100644
--- a/arch/ia64/include/asm/pci.h
+++ b/arch/ia64/include/asm/pci.h
@@ -63,7 +63,6 @@ static inline int pci_proc_domain(struct pci_bus *bus)
return (pci_domain_nr(bus) != 0);
}
-#define HAVE_ARCH_PCI_GET_LEGACY_IDE_IRQ
static inline int pci_get_legacy_ide_irq(struct pci_dev *dev, int channel)
{
return channel ? isa_irq_to_vector(15) : isa_irq_to_vector(14);
diff --git a/arch/m68k/include/asm/pci.h b/arch/m68k/include/asm/pci.h
index 5a4bc223743b..0c272ff515cc 100644
--- a/arch/m68k/include/asm/pci.h
+++ b/arch/m68k/include/asm/pci.h
@@ -2,11 +2,14 @@
#ifndef _ASM_M68K_PCI_H
#define _ASM_M68K_PCI_H
-#include <asm-generic/pci.h>
-
#define pcibios_assign_all_busses() 1
#define PCIBIOS_MIN_IO 0x00000100
#define PCIBIOS_MIN_MEM 0x02000000
+static inline int pci_get_legacy_ide_irq(struct pci_dev *dev, int channel)
+{
+ return channel ? 15 : 14;
+}
+
#endif /* _ASM_M68K_PCI_H */
diff --git a/arch/powerpc/include/asm/pci.h b/arch/powerpc/include/asm/pci.h
index 915d6ee4b40a..f9da506751bb 100644
--- a/arch/powerpc/include/asm/pci.h
+++ b/arch/powerpc/include/asm/pci.h
@@ -39,7 +39,6 @@
#define pcibios_assign_all_busses() \
(pci_has_flag(PCI_REASSIGN_ALL_BUS))
-#define HAVE_ARCH_PCI_GET_LEGACY_IDE_IRQ
static inline int pci_get_legacy_ide_irq(struct pci_dev *dev, int channel)
{
if (ppc_md.pci_get_legacy_ide_irq)
diff --git a/arch/s390/include/asm/pci.h b/arch/s390/include/asm/pci.h
index fdb9745ee998..93cd0167f8aa 100644
--- a/arch/s390/include/asm/pci.h
+++ b/arch/s390/include/asm/pci.h
@@ -6,7 +6,6 @@
#include <linux/mutex.h>
#include <linux/iommu.h>
#include <linux/pci_hotplug.h>
-#include <asm-generic/pci.h>
#include <asm/pci_clp.h>
#include <asm/pci_debug.h>
#include <asm/sclp.h>
@@ -233,6 +232,11 @@ int zpci_init_iommu(struct zpci_dev *zdev);
void zpci_destroy_iommu(struct zpci_dev *zdev);
#ifdef CONFIG_PCI
+static inline int pci_get_legacy_ide_irq(struct pci_dev *dev, int channel)
+{
+ return channel ? 15 : 14;
+}
+
static inline bool zpci_use_mio(struct zpci_dev *zdev)
{
return static_branch_likely(&have_mio) && zdev->mio_capable;
diff --git a/arch/sparc/include/asm/pci.h b/arch/sparc/include/asm/pci.h
index 4deddf430e5d..6d283fc7b55b 100644
--- a/arch/sparc/include/asm/pci.h
+++ b/arch/sparc/include/asm/pci.h
@@ -46,7 +46,10 @@ static inline int pci_get_legacy_ide_irq(struct pci_dev *dev, int channel)
return PCI_IRQ_NONE;
}
#else
-#include <asm-generic/pci.h>
+static inline int pci_get_legacy_ide_irq(struct pci_dev *dev, int channel)
+{
+ return channel ? 15 : 14;
+}
#endif
#endif /* ___ASM_SPARC_PCI_H */
diff --git a/arch/x86/include/asm/pci.h b/arch/x86/include/asm/pci.h
index f3fd5928bcbb..7da27f665cfe 100644
--- a/arch/x86/include/asm/pci.h
+++ b/arch/x86/include/asm/pci.h
@@ -105,8 +105,10 @@ static inline void early_quirks(void) { }
extern void pci_iommu_alloc(void);
-/* generic pci stuff */
-#include <asm-generic/pci.h>
+static inline int pci_get_legacy_ide_irq(struct pci_dev *dev, int channel)
+{
+ return channel ? 15 : 14;
+}
#ifdef CONFIG_NUMA
/* Returns the node based on pci bus */
diff --git a/arch/xtensa/include/asm/pci.h b/arch/xtensa/include/asm/pci.h
index 8e2b48a268db..f57ede61f5db 100644
--- a/arch/xtensa/include/asm/pci.h
+++ b/arch/xtensa/include/asm/pci.h
@@ -43,7 +43,9 @@
#define ARCH_GENERIC_PCI_MMAP_RESOURCE 1
#define arch_can_pci_mmap_io() 1
-/* Generic PCI */
-#include <asm-generic/pci.h>
+static inline int pci_get_legacy_ide_irq(struct pci_dev *dev, int channel)
+{
+ return channel ? 15 : 14;
+}
#endif /* _XTENSA_PCI_H */
diff --git a/include/asm-generic/pci.h b/include/asm-generic/pci.h
deleted file mode 100644
index 6bb3cd3d695a..000000000000
--- a/include/asm-generic/pci.h
+++ /dev/null
@@ -1,17 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-/*
- * linux/include/asm-generic/pci.h
- *
- * Copyright (C) 2003 Russell King
- */
-#ifndef _ASM_GENERIC_PCI_H
-#define _ASM_GENERIC_PCI_H
-
-#ifndef HAVE_ARCH_PCI_GET_LEGACY_IDE_IRQ
-static inline int pci_get_legacy_ide_irq(struct pci_dev *dev, int channel)
-{
- return channel ? 15 : 14;
-}
-#endif /* HAVE_ARCH_PCI_GET_LEGACY_IDE_IRQ */
-
-#endif /* _ASM_GENERIC_PCI_H */
--
2.36.1
^ permalink raw reply related
* [PATCH v13 2/2] KVM: s390: resetting the Topology-Change-Report
From: Pierre Morel @ 2022-07-14 19:43 UTC (permalink / raw)
To: kvm
Cc: linux-s390, linux-kernel, borntraeger, frankja, cohuck, david,
thuth, imbrenda, hca, gor, pmorel, wintera, seiden, nrb, scgl
In-Reply-To: <541d85d3-4864-583c-ff33-d0f566770c9f@linux.ibm.com>
During a subsystem reset the Topology-Change-Report is cleared.
Let's give userland the possibility to clear the MTCR in the case
of a subsystem reset.
To migrate the MTCR, we give userland the possibility to
query the MTCR state.
We indicate KVM support for the CPU topology facility with a new
KVM capability: KVM_CAP_S390_CPU_TOPOLOGY.
Signed-off-by: Pierre Morel <pmorel@linux.ibm.com>
Reviewed-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
---
Documentation/virt/kvm/api.rst | 25 ++++++++++++++++
arch/s390/include/uapi/asm/kvm.h | 1 +
arch/s390/kvm/kvm-s390.c | 51 ++++++++++++++++++++++++++++++++
include/uapi/linux/kvm.h | 1 +
4 files changed, 78 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 82baa7682829..892fc2e470d7 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8159,6 +8159,31 @@ PV guests. The `KVM_PV_DUMP` command is available for the
dump related UV data. Also the vcpu ioctl `KVM_S390_PV_CPU_COMMAND` is
available and supports the `KVM_PV_DUMP_CPU` subcommand.
+8.38 KVM_CAP_S390_CPU_TOPOLOGY
+------------------------------
+
+:Capability: KVM_CAP_S390_CPU_TOPOLOGY
+:Architectures: s390
+:Type: vm
+
+This capability indicates that KVM will provide the S390 CPU Topology
+facility which consist of the interpretation of the PTF instruction for
+the function code 2 along with interception and forwarding of both the
+PTF instruction with function codes 0 or 1 and the STSI(15,1,x)
+instruction to the userland hypervisor.
+
+The stfle facility 11, CPU Topology facility, should not be indicated
+to the guest without this capability.
+
+When this capability is present, KVM provides a new attribute group
+on vm fd, KVM_S390_VM_CPU_TOPOLOGY.
+This new attribute allows to get, set or clear the Modified Change
+Topology Report (MTCR) bit of the SCA through the kvm_device_attr
+structure.
+
+When getting the Modified Change Topology Report value, the attr->addr
+must point to a byte where the value will be stored or retrieved from.
+
9. Known KVM API problems
=========================
diff --git a/arch/s390/include/uapi/asm/kvm.h b/arch/s390/include/uapi/asm/kvm.h
index 7a6b14874d65..a73cf01a1606 100644
--- a/arch/s390/include/uapi/asm/kvm.h
+++ b/arch/s390/include/uapi/asm/kvm.h
@@ -74,6 +74,7 @@ struct kvm_s390_io_adapter_req {
#define KVM_S390_VM_CRYPTO 2
#define KVM_S390_VM_CPU_MODEL 3
#define KVM_S390_VM_MIGRATION 4
+#define KVM_S390_VM_CPU_TOPOLOGY 5
/* kvm attributes for mem_ctrl */
#define KVM_S390_VM_MEM_ENABLE_CMMA 0
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 330a0cd4b8c8..b2cb13d1c0cd 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -647,6 +647,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_S390_ZPCI_OP:
r = kvm_s390_pci_interp_allowed();
break;
+ case KVM_CAP_S390_CPU_TOPOLOGY:
+ r = test_facility(11);
+ break;
default:
r = 0;
}
@@ -858,6 +861,20 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
icpt_operexc_on_all_vcpus(kvm);
r = 0;
break;
+ case KVM_CAP_S390_CPU_TOPOLOGY:
+ r = -EINVAL;
+ mutex_lock(&kvm->lock);
+ if (kvm->created_vcpus) {
+ r = -EBUSY;
+ } else if (test_facility(11)) {
+ set_kvm_facility(kvm->arch.model.fac_mask, 11);
+ set_kvm_facility(kvm->arch.model.fac_list, 11);
+ r = 0;
+ }
+ mutex_unlock(&kvm->lock);
+ VM_EVENT(kvm, 3, "ENABLE: CAP_S390_CPU_TOPOLOGY %s",
+ r ? "(not available)" : "(success)");
+ break;
default:
r = -EINVAL;
break;
@@ -1794,6 +1811,31 @@ static void kvm_s390_update_topology_change_report(struct kvm *kvm, bool val)
read_unlock(&kvm->arch.sca_lock);
}
+static int kvm_s390_set_topo_change_indication(struct kvm *kvm,
+ struct kvm_device_attr *attr)
+{
+ if (!test_kvm_facility(kvm, 11))
+ return -ENXIO;
+
+ kvm_s390_update_topology_change_report(kvm, !!attr->attr);
+ return 0;
+}
+
+static int kvm_s390_get_topo_change_indication(struct kvm *kvm,
+ struct kvm_device_attr *attr)
+{
+ u8 topo;
+
+ if (!test_kvm_facility(kvm, 11))
+ return -ENXIO;
+
+ read_lock(&kvm->arch.sca_lock);
+ topo = ((struct bsca_block *)kvm->arch.sca)->utility.mtcr;
+ read_unlock(&kvm->arch.sca_lock);
+
+ return put_user(topo, (u8 __user *)attr->addr);
+}
+
static int kvm_s390_vm_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
{
int ret;
@@ -1814,6 +1856,9 @@ static int kvm_s390_vm_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
case KVM_S390_VM_MIGRATION:
ret = kvm_s390_vm_set_migration(kvm, attr);
break;
+ case KVM_S390_VM_CPU_TOPOLOGY:
+ ret = kvm_s390_set_topo_change_indication(kvm, attr);
+ break;
default:
ret = -ENXIO;
break;
@@ -1839,6 +1884,9 @@ static int kvm_s390_vm_get_attr(struct kvm *kvm, struct kvm_device_attr *attr)
case KVM_S390_VM_MIGRATION:
ret = kvm_s390_vm_get_migration(kvm, attr);
break;
+ case KVM_S390_VM_CPU_TOPOLOGY:
+ ret = kvm_s390_get_topo_change_indication(kvm, attr);
+ break;
default:
ret = -ENXIO;
break;
@@ -1912,6 +1960,9 @@ static int kvm_s390_vm_has_attr(struct kvm *kvm, struct kvm_device_attr *attr)
case KVM_S390_VM_MIGRATION:
ret = 0;
break;
+ case KVM_S390_VM_CPU_TOPOLOGY:
+ ret = test_kvm_facility(kvm, 11) ? 0 : -ENXIO;
+ break;
default:
ret = -ENXIO;
break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 80216c6cece1..08206212fd36 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1159,6 +1159,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_ARM_SYSTEM_SUSPEND 216
#define KVM_CAP_S390_PROTECTED_DUMP 217
#define KVM_CAP_S390_ZPCI_OP 221
+#define KVM_CAP_S390_CPU_TOPOLOGY 222
#ifdef KVM_CAP_IRQ_ROUTING
--
2.31.1
^ permalink raw reply related
* Re: [PATCH net-next v2 0/6] net/smc: Introduce virtually contiguous buffers for SMC-R
From: Wenjia Zhang @ 2022-07-14 15:16 UTC (permalink / raw)
To: Wen Gu, kgraul, davem, edumazet, kuba, pabeni
Cc: linux-s390, netdev, linux-rdma, linux-kernel
In-Reply-To: <1657791845-1060-1-git-send-email-guwen@linux.alibaba.com>
On 14.07.22 11:43, Wen Gu wrote:
> On long-running enterprise production servers, high-order contiguous
> memory pages are usually very rare and in most cases we can only get
> fragmented pages.
>
> When replacing TCP with SMC-R in such production scenarios, attempting
> to allocate high-order physically contiguous sndbufs and RMBs may result
> in frequent memory compaction, which will cause unexpected hangs
> and further stability risks.
>
> So this patch set is aimed to allow SMC-R link group to use virtually
> contiguous sndbufs and RMBs to avoid potential issues mentioned above.
> Whether to use physically or virtually contiguous buffers can be set
> by sysctl smcr_buf_type.
>
> Note that using virtually contiguous buffers will bring an acceptable
> performance regression, which can be mainly divided into two parts:
>
> 1) regression in data path, which is brought by additional address
> translation of sndbuf by RNIC in Tx. But in general, translating
> address through MTT is fast. According to qperf test, this part
> regression is basically less than 10% in latency and bandwidth.
> (see patch 5/6 for details)
>
> 2) regression in buffer initialization and destruction path, which is
> brought by additional MR operations of sndbufs. But thanks to link
> group buffer reuse mechanism, the impact of this kind of regression
> decreases as times of buffer reuse increases.
>
> Patch set overview:
> - Patch 1/6 and 2/6 mainly about simplifying and optimizing DMA sync
> operation, which will reduce overhead on the data path, especially
> when using virtually contiguous buffers;
> - Patch 3/6 and 4/6 introduce a sysctl smcr_buf_type to set the type
> of buffers in new created link group;
> - Patch 5/6 allows SMC-R to use virtually contiguous sndbufs and RMBs,
> including buffer creation, destruction, MR operation and access;
> - patch 6/6 extends netlink attribute for buffer type of SMC-R link group;
>
> v1->v2:
> - Patch 5/6 fixes build issue on 32bit;
> - Patch 3/6 adds description of new sysctl in smc-sysctl.rst;
>
> Guangguan Wang (2):
> net/smc: remove redundant dma sync ops
> net/smc: optimize for smc_sndbuf_sync_sg_for_device and
> smc_rmb_sync_sg_for_cpu
>
> Wen Gu (4):
> net/smc: Introduce a sysctl for setting SMC-R buffer type
> net/smc: Use sysctl-specified types of buffers in new link group
> net/smc: Allow virtually contiguous sndbufs or RMBs for SMC-R
> net/smc: Extend SMC-R link group netlink attribute
>
> Documentation/networking/smc-sysctl.rst | 13 ++
> include/net/netns/smc.h | 1 +
> include/uapi/linux/smc.h | 1 +
> net/smc/af_smc.c | 68 +++++++--
> net/smc/smc_clc.c | 8 +-
> net/smc/smc_clc.h | 2 +-
> net/smc/smc_core.c | 246 +++++++++++++++++++++-----------
> net/smc/smc_core.h | 20 ++-
> net/smc/smc_ib.c | 44 +++++-
> net/smc/smc_ib.h | 2 +
> net/smc/smc_llc.c | 33 +++--
> net/smc/smc_rx.c | 92 +++++++++---
> net/smc/smc_sysctl.c | 11 ++
> net/smc/smc_tx.c | 10 +-
> 14 files changed, 404 insertions(+), 147 deletions(-)
>
This idea is very cool! Thank you for your effort! But we still need to
verify whether this solution runs well on our system. I'll get back to you soon.
^ permalink raw reply
* Re: [PATCH v13 2/2] KVM: s390: resetting the Topology-Change-Report
From: Janosch Frank @ 2022-07-14 14:12 UTC (permalink / raw)
To: Pierre Morel, kvm
Cc: linux-s390, linux-kernel, borntraeger, cohuck, david, thuth,
imbrenda, hca, gor, wintera, seiden, nrb, scgl
In-Reply-To: <20220714101824.101601-3-pmorel@linux.ibm.com>
On 7/14/22 12:18, Pierre Morel wrote:
> During a subsystem reset the Topology-Change-Report is cleared.
>
> Let's give userland the possibility to clear the MTCR in the case
> of a subsystem reset.
>
> To migrate the MTCR, we give userland the possibility to
> query the MTCR state.
>
> We indicate KVM support for the CPU topology facility with a new
> KVM capability: KVM_CAP_S390_CPU_TOPOLOGY.
>
> Signed-off-by: Pierre Morel <pmorel@linux.ibm.com>
> Reviewed-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com>
Nit below, but:
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1158,6 +1158,7 @@ struct kvm_ppc_resize_hpt {
> #define KVM_CAP_SYSTEM_EVENT_DATA 215
> #define KVM_CAP_ARM_SYSTEM_SUSPEND 216
> #define KVM_CAP_S390_PROTECTED_DUMP 217
> +#define KVM_CAP_S390_CPU_TOPOLOGY 218
> #define KVM_CAP_S390_ZPCI_OP 221
Using 222 and moving it a line down might make more sense as 218 is
KVM_CAP_X86_TRIPLE_FAULT_EVENT.
Can you fix this and push both patches to devel?
Also send the fixed patch as a reply to this message so I can pick it
from the list.
next and devel have diverged a bit, so I will need to fix this up for
next, same for the Documentation entry which will be 6.39 instead of 6.38.
>
> #ifdef KVM_CAP_IRQ_ROUTING
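
To illustrate why the number matters, here is a minimal userspace sketch, not kernel code. The three values are the ones mentioned in this thread, redefined locally so the snippet does not depend on any kernel header; cap_name() is a hypothetical helper. Two capabilities sharing a number would trip the static assertion, and the switch would fail to compile on the duplicate case label.

```c
#include <assert.h>
#include <stddef.h>

/* Local copies of the values discussed in this thread. */
#define KVM_CAP_X86_TRIPLE_FAULT_EVENT 218
#define KVM_CAP_S390_ZPCI_OP 221
#define KVM_CAP_S390_CPU_TOPOLOGY 222

/* With the originally proposed 218 this assertion would fail at build time. */
static_assert(KVM_CAP_S390_CPU_TOPOLOGY != KVM_CAP_X86_TRIPLE_FAULT_EVENT,
	      "capability numbers must be unique");

/* Hypothetical helper: map a capability number to its name. */
static const char *cap_name(long ext)
{
	switch (ext) {
	case KVM_CAP_X86_TRIPLE_FAULT_EVENT:
		return "KVM_CAP_X86_TRIPLE_FAULT_EVENT";
	case KVM_CAP_S390_ZPCI_OP:
		return "KVM_CAP_S390_ZPCI_OP";
	case KVM_CAP_S390_CPU_TOPOLOGY:
		return "KVM_CAP_S390_CPU_TOPOLOGY";
	default:
		return NULL;
	}
}
```

Userspace dispatches on these numbers in the same way, which is why a collision would make one capability indistinguishable from the other.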
^ permalink raw reply
* [PATCH v10 4/4] kexec, KEYS, s390: Make use of built-in and secondary keyring for signature verification
From: Coiby Xu @ 2022-07-14 13:40 UTC (permalink / raw)
To: kexec, linux-integrity
Cc: Mimi Zohar, linux-arm-kernel, Michal Suchanek, Baoquan He,
Dave Young, Will Deacon, Eric W . Biederman, Chun-Yi Lee, stable,
Philipp Rudo, keyrings, linux-security-module, Heiko Carstens,
Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
Sven Schnelle, Martin Schwidefsky, open list:S390, open list
In-Reply-To: <20220714134027.394370-1-coxu@redhat.com>
From: Michal Suchanek <msuchanek@suse.de>
commit e23a8020ce4e ("s390/kexec_file: Signature verification prototype")
adds support for KEXEC_SIG verification with keys from platform keyring
but the built-in keys and secondary keyring are not used.
Add support for the built-in keys and secondary keyring as x86 does.
Fixes: e23a8020ce4e ("s390/kexec_file: Signature verification prototype")
Cc: stable@vger.kernel.org
Cc: Philipp Rudo <prudo@linux.ibm.com>
Cc: kexec@lists.infradead.org
Cc: keyrings@vger.kernel.org
Cc: linux-security-module@vger.kernel.org
Signed-off-by: Michal Suchanek <msuchanek@suse.de>
Reviewed-by: "Lee, Chun-Yi" <jlee@suse.com>
Acked-by: Baoquan He <bhe@redhat.com>
Signed-off-by: Coiby Xu <coxu@redhat.com>
---
arch/s390/kernel/machine_kexec_file.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
diff --git a/arch/s390/kernel/machine_kexec_file.c b/arch/s390/kernel/machine_kexec_file.c
index 8f43575a4dd3..fc6d5f58debe 100644
--- a/arch/s390/kernel/machine_kexec_file.c
+++ b/arch/s390/kernel/machine_kexec_file.c
@@ -31,6 +31,7 @@ int s390_verify_sig(const char *kernel, unsigned long kernel_len)
const unsigned long marker_len = sizeof(MODULE_SIG_STRING) - 1;
struct module_signature *ms;
unsigned long sig_len;
+ int ret;
/* Skip signature verification when not secure IPLed. */
if (!ipl_secure_flag)
@@ -65,11 +66,18 @@ int s390_verify_sig(const char *kernel, unsigned long kernel_len)
return -EBADMSG;
}
- return verify_pkcs7_signature(kernel, kernel_len,
- kernel + kernel_len, sig_len,
- VERIFY_USE_PLATFORM_KEYRING,
- VERIFYING_MODULE_SIGNATURE,
- NULL, NULL);
+ ret = verify_pkcs7_signature(kernel, kernel_len,
+ kernel + kernel_len, sig_len,
+ VERIFY_USE_SECONDARY_KEYRING,
+ VERIFYING_MODULE_SIGNATURE,
+ NULL, NULL);
+ if (ret == -ENOKEY && IS_ENABLED(CONFIG_INTEGRITY_PLATFORM_KEYRING))
+ ret = verify_pkcs7_signature(kernel, kernel_len,
+ kernel + kernel_len, sig_len,
+ VERIFY_USE_PLATFORM_KEYRING,
+ VERIFYING_MODULE_SIGNATURE,
+ NULL, NULL);
+ return ret;
}
#endif /* CONFIG_KEXEC_SIG */
--
2.35.3
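
The fallback order introduced by the hunk above can be modeled in plain C. This is only a sketch under stated assumptions: struct toy_sig, toy_verify() and toy_verify_sig() are hypothetical stand-ins, not the kernel's verify_pkcs7_signature(), and the boolean parameter stands in for IS_ENABLED(CONFIG_INTEGRITY_PLATFORM_KEYRING).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#ifndef ENOKEY
#define ENOKEY 126	/* Linux value, defined here for other platforms */
#endif

enum toy_keyring { TOY_SECONDARY_KEYRING, TOY_PLATFORM_KEYRING };

/* Hypothetical description of where the signer's key can be found. */
struct toy_sig {
	bool in_secondary;	/* built-in or secondary keyring */
	bool in_platform;	/* platform keyring */
	bool corrupt;		/* signature does not match the image */
};

/* Stand-in verifier: -ENOKEY means "no matching key in this keyring". */
static int toy_verify(const struct toy_sig *sig, enum toy_keyring ring)
{
	assert(sig);
	if (sig->corrupt)
		return -EBADMSG;
	if (ring == TOY_SECONDARY_KEYRING)
		return sig->in_secondary ? 0 : -ENOKEY;
	return sig->in_platform ? 0 : -ENOKEY;
}

/* Try the secondary keyring first; fall back only on -ENOKEY. */
static int toy_verify_sig(const struct toy_sig *sig, bool platform_enabled)
{
	int ret = toy_verify(sig, TOY_SECONDARY_KEYRING);

	if (ret == -ENOKEY && platform_enabled)
		ret = toy_verify(sig, TOY_PLATFORM_KEYRING);
	return ret;
}
```

Note that only a missing key triggers the second attempt; any other error, such as a bad signature, is returned as is.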
^ permalink raw reply related
* Re: [PATCH v2 0/3] s390/cpufeature: rework to allow different types of cpufeatures
From: Heiko Carstens @ 2022-07-14 10:53 UTC (permalink / raw)
To: Steffen Eiden
Cc: Alexander Gordeev, Christian Borntraeger, Janosch Frank,
Claudio Imbrenda, Vasily Gorbik, linux-s390, linux-kernel,
linux-mm, nrb
In-Reply-To: <20220713125644.16121-1-seiden@linux.ibm.com>
On Wed, Jul 13, 2022 at 02:56:41PM +0200, Steffen Eiden wrote:
> Currently the s390 implementation of cpufeature is limited to elf_hwcap
> bits. Using these to automatically load modules also exposes this
> cpufeature to userspace, which is sometimes not intended.
>
> Therefore, rework the s390-cpufeature implementation to allow for
> various cpu feature indications, which is not only limited to hwcap bits.
>
> Add a new type to allow facilities to be a cpufeature without using
> hwcap bits that expose this feature to userspace.
>
> Load uvdevice when facility 158 is present.
>
> since v1:
> * add r-bs from Claudio
> * worked in comments
>
> Heiko Carstens (2):
> s390/cpufeature: rework to allow more than only hwcap bits
> s390/cpufeature: allow for facility bits
>
> Steffen Eiden (1):
> s390/uvdevice: autoload module based on CPU facility
Series applied, thanks!
^ permalink raw reply
* Re: [PATCH 1/3] s390/cpufeature: rework to allow more than only hwcap bits
From: Heiko Carstens @ 2022-07-14 10:52 UTC (permalink / raw)
To: Steffen Eiden
Cc: Alexander Gordeev, Christian Borntraeger, Janosch Frank,
Claudio Imbrenda, Vasily Gorbik, linux-s390, linux-kernel,
linux-mm, nrb
In-Reply-To: <4132ba2a-f5ad-25ba-7f74-72369b8a140b@linux.ibm.com>
> > > +static struct s390_cpu_feature s390_cpu_features[MAX_CPU_FEATURES] = {
> > > + [S390_CPU_FEATURE_ESAN3] = {.type = TYPE_HWCAP, .num = HWCAP_NR_ESAN3},
> > > + [S390_CPU_FEATURE_ZARCH] = {.type = TYPE_HWCAP, .num = HWCAP_NR_ZARCH},
...
> > I only realized now that you added all HWCAP bits here. It was
> > intentional that I added only the two bits which are currently used
> > for several reasons:
> >
> > - Keep the array as small as possible.
> > - No need to keep this array in sync with HWCAPs, if new ones are added.
> > - There is a for loop in print_cpu_modalias() which iterates over all
> > MAX_CPU_FEATURES entries; this should be as fast as possible. Adding
> > extra entries burns cycles for no added value.
> The loop in print_cpu_modalias() was the reason why I added all
> current HWCAPs. The current implementation runs through all HWCAPs
> using cpu_have_feature() and I feared that reducing to just MSA and
> VXRS has effects in the reporting of CPU-features to userspace.
>
> I double checked the output of 'grep features /proc/cpuinfo' and it
> stays the same, for 5.19-rc6, 5.19-rc6+this series, 5.19-rc6+this series
> with just the two S390_CPU_FEATUREs. I might have misunderstood what happens
> in that loop in print_cpu_modalias().
It is used on cpu hotplug to generate a MODALIAS environment
variable. You can check that by running "udevadm monitor -p"
and then switching a cpu off/on.
This environment variable is then used by systemd/udev to load
feature matching modules via kmod.
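
As a rough illustration of why the table should stay small, here is a simplified userspace model of such a loop. Everything in it is an assumption for illustration: the struct, the table contents, machine_has() and the exact string layout are hypothetical and only loosely modeled on the kernel's CPU modalias.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-ins; not the kernel definitions. */
#define MAX_CPU_FEATURES 8

enum feature_type { TYPE_NONE, TYPE_HWCAP, TYPE_FACILITY };

struct cpu_feature {
	enum feature_type type;
	unsigned int num;
};

/* Only features that modules actually match on are listed. */
static const struct cpu_feature features[MAX_CPU_FEATURES] = {
	[0] = { .type = TYPE_HWCAP, .num = 3 },
	[1] = { .type = TYPE_FACILITY, .num = 158 },
};

/* Assumed machine state: every listed feature is present. */
static bool machine_has(const struct cpu_feature *f)
{
	return f->type != TYPE_NONE;
}

/* Build a MODALIAS-like string; returns the number of characters written. */
static size_t build_modalias(char *buf, size_t size)
{
	int n = snprintf(buf, size, "cpu:type:s390:feature:");
	size_t len;

	assert(n >= 0 && (size_t)n < size);
	len = (size_t)n;
	/* Every table slot is visited, so unused entries cost cycles. */
	for (unsigned int i = 0; i < MAX_CPU_FEATURES; i++) {
		if (!machine_has(&features[i]))
			continue;
		n = snprintf(buf + len, size - len, ",%04X", i);
		assert(n >= 0 && (size_t)n < size - len);
		len += (size_t)n;
	}
	return len;
}
```

In this model the loop cost grows with MAX_CPU_FEATURES, not with the number of features a module actually matches on.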
^ permalink raw reply
* [PATCH v13 2/2] KVM: s390: resetting the Topology-Change-Report
From: Pierre Morel @ 2022-07-14 10:18 UTC (permalink / raw)
To: kvm
Cc: linux-s390, linux-kernel, borntraeger, frankja, cohuck, david,
thuth, imbrenda, hca, gor, pmorel, wintera, seiden, nrb, scgl
In-Reply-To: <20220714101824.101601-1-pmorel@linux.ibm.com>
During a subsystem reset the Topology-Change-Report is cleared.
Let's give userland the possibility to clear the MTCR in the case
of a subsystem reset.
To migrate the MTCR, we give userland the possibility to
query the MTCR state.
We indicate KVM support for the CPU topology facility with a new
KVM capability: KVM_CAP_S390_CPU_TOPOLOGY.
Signed-off-by: Pierre Morel <pmorel@linux.ibm.com>
Reviewed-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com>
---
Documentation/virt/kvm/api.rst | 25 ++++++++++++++++
arch/s390/include/uapi/asm/kvm.h | 1 +
arch/s390/kvm/kvm-s390.c | 51 ++++++++++++++++++++++++++++++++
include/uapi/linux/kvm.h | 1 +
4 files changed, 78 insertions(+)
diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 82baa7682829..892fc2e470d7 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8159,6 +8159,31 @@ PV guests. The `KVM_PV_DUMP` command is available for the
dump related UV data. Also the vcpu ioctl `KVM_S390_PV_CPU_COMMAND` is
available and supports the `KVM_PV_DUMP_CPU` subcommand.
+8.38 KVM_CAP_S390_CPU_TOPOLOGY
+------------------------------
+
+:Capability: KVM_CAP_S390_CPU_TOPOLOGY
+:Architectures: s390
+:Type: vm
+
+This capability indicates that KVM will provide the S390 CPU Topology
+facility which consist of the interpretation of the PTF instruction for
+the function code 2 along with interception and forwarding of both the
+PTF instruction with function codes 0 or 1 and the STSI(15,1,x)
+instruction to the userland hypervisor.
+
+The stfle facility 11, CPU Topology facility, should not be indicated
+to the guest without this capability.
+
+When this capability is present, KVM provides a new attribute group
+on vm fd, KVM_S390_VM_CPU_TOPOLOGY.
+This new attribute allows to get, set or clear the Modified Change
+Topology Report (MTCR) bit of the SCA through the kvm_device_attr
+structure.
+
+When getting the Modified Change Topology Report value, the attr->addr
+must point to a byte where the value will be stored or retrieved from.
+
9. Known KVM API problems
=========================
diff --git a/arch/s390/include/uapi/asm/kvm.h b/arch/s390/include/uapi/asm/kvm.h
index 7a6b14874d65..a73cf01a1606 100644
--- a/arch/s390/include/uapi/asm/kvm.h
+++ b/arch/s390/include/uapi/asm/kvm.h
@@ -74,6 +74,7 @@ struct kvm_s390_io_adapter_req {
#define KVM_S390_VM_CRYPTO 2
#define KVM_S390_VM_CPU_MODEL 3
#define KVM_S390_VM_MIGRATION 4
+#define KVM_S390_VM_CPU_TOPOLOGY 5
/* kvm attributes for mem_ctrl */
#define KVM_S390_VM_MEM_ENABLE_CMMA 0
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index 330a0cd4b8c8..b2cb13d1c0cd 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -647,6 +647,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
case KVM_CAP_S390_ZPCI_OP:
r = kvm_s390_pci_interp_allowed();
break;
+ case KVM_CAP_S390_CPU_TOPOLOGY:
+ r = test_facility(11);
+ break;
default:
r = 0;
}
@@ -858,6 +861,20 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm, struct kvm_enable_cap *cap)
icpt_operexc_on_all_vcpus(kvm);
r = 0;
break;
+ case KVM_CAP_S390_CPU_TOPOLOGY:
+ r = -EINVAL;
+ mutex_lock(&kvm->lock);
+ if (kvm->created_vcpus) {
+ r = -EBUSY;
+ } else if (test_facility(11)) {
+ set_kvm_facility(kvm->arch.model.fac_mask, 11);
+ set_kvm_facility(kvm->arch.model.fac_list, 11);
+ r = 0;
+ }
+ mutex_unlock(&kvm->lock);
+ VM_EVENT(kvm, 3, "ENABLE: CAP_S390_CPU_TOPOLOGY %s",
+ r ? "(not available)" : "(success)");
+ break;
default:
r = -EINVAL;
break;
@@ -1794,6 +1811,31 @@ static void kvm_s390_update_topology_change_report(struct kvm *kvm, bool val)
read_unlock(&kvm->arch.sca_lock);
}
+static int kvm_s390_set_topo_change_indication(struct kvm *kvm,
+ struct kvm_device_attr *attr)
+{
+ if (!test_kvm_facility(kvm, 11))
+ return -ENXIO;
+
+ kvm_s390_update_topology_change_report(kvm, !!attr->attr);
+ return 0;
+}
+
+static int kvm_s390_get_topo_change_indication(struct kvm *kvm,
+ struct kvm_device_attr *attr)
+{
+ u8 topo;
+
+ if (!test_kvm_facility(kvm, 11))
+ return -ENXIO;
+
+ read_lock(&kvm->arch.sca_lock);
+ topo = ((struct bsca_block *)kvm->arch.sca)->utility.mtcr;
+ read_unlock(&kvm->arch.sca_lock);
+
+ return put_user(topo, (u8 __user *)attr->addr);
+}
+
static int kvm_s390_vm_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
{
int ret;
@@ -1814,6 +1856,9 @@ static int kvm_s390_vm_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
case KVM_S390_VM_MIGRATION:
ret = kvm_s390_vm_set_migration(kvm, attr);
break;
+ case KVM_S390_VM_CPU_TOPOLOGY:
+ ret = kvm_s390_set_topo_change_indication(kvm, attr);
+ break;
default:
ret = -ENXIO;
break;
@@ -1839,6 +1884,9 @@ static int kvm_s390_vm_get_attr(struct kvm *kvm, struct kvm_device_attr *attr)
case KVM_S390_VM_MIGRATION:
ret = kvm_s390_vm_get_migration(kvm, attr);
break;
+ case KVM_S390_VM_CPU_TOPOLOGY:
+ ret = kvm_s390_get_topo_change_indication(kvm, attr);
+ break;
default:
ret = -ENXIO;
break;
@@ -1912,6 +1960,9 @@ static int kvm_s390_vm_has_attr(struct kvm *kvm, struct kvm_device_attr *attr)
case KVM_S390_VM_MIGRATION:
ret = 0;
break;
+ case KVM_S390_VM_CPU_TOPOLOGY:
+ ret = test_kvm_facility(kvm, 11) ? 0 : -ENXIO;
+ break;
default:
ret = -ENXIO;
break;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 80216c6cece1..16ce54d6a868 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1158,6 +1158,7 @@ struct kvm_ppc_resize_hpt {
#define KVM_CAP_SYSTEM_EVENT_DATA 215
#define KVM_CAP_ARM_SYSTEM_SUSPEND 216
#define KVM_CAP_S390_PROTECTED_DUMP 217
+#define KVM_CAP_S390_CPU_TOPOLOGY 218
#define KVM_CAP_S390_ZPCI_OP 221
#ifdef KVM_CAP_IRQ_ROUTING
--
2.31.1
^ permalink raw reply related
* [PATCH v13 0/2] s390x: KVM: CPU Topology
From: Pierre Morel @ 2022-07-14 10:18 UTC (permalink / raw)
To: kvm
Cc: linux-s390, linux-kernel, borntraeger, frankja, cohuck, david,
thuth, imbrenda, hca, gor, pmorel, wintera, seiden, nrb, scgl
Hi all,
The series provides:
1- interception of the STSI instruction forwarding the CPU topology
2- interpretation of the PTF instruction
3- a KVM capability for the userland hypervisor to ask KVM to
setup PTF interpretation.
4- KVM ioctl to get and set the MTCR bit of the SCA in order to
migrate this bit during a migration.
0- Foreword
The S390 CPU topology is reported using two instructions:
- PTF, to get information if the CPU topology did change since last
PTF instruction or a subsystem reset.
- STSI, to get the topology information, consisting of the topology
of the CPU inside the sockets, of the sockets inside the books etc.
The PTF(2) instruction reports a change if the STSI(15.1.2) instruction
will report a difference with the last STSI(15.1.2) instruction*.
With the SIE interpretation, the PTF(2) instruction will report a
change to the guest if the host sets the SCA.MTCR bit.
*The STSI(15.1.2) instruction reports:
- The cores address within a socket
- The polarization of the cores
- The CPU type of the cores
- If the cores are dedicated or not
We decided to implement the CPU topology for S390 in several steps:
- first we report CPU hotplug
In future development we will provide:
- modification of the CPU mask inside sockets
- handling of shared CPUs
- reporting of the CPU Type
- reporting of the polarization
1- Interception of STSI
To provide Topology information to the guest through the STSI
instruction, we forward STSI with Function Code 15 to the
userland hypervisor which will take care to provide the right
information to the guest.
To let the guest use both the PTF instruction to check if a topology
change occurred and the STSI_15.x.x instruction, we add a new KVM
capability to enable the topology facility.
2- Interpretation of PTF with FC(2)
The PTF instruction reports a topology change if there is any change
with a previous STSI(15.1.2) SYSIB.
Changes inside a STSI(15.1.2) SYSIB occur if CPU bits are set or cleared
inside the CPU Topology List Entry CPU mask field, which happens with
changes in CPU polarization, dedication, CPU types and adding or
removing CPUs in a socket.
Considering that KVM guests currently only support:
- horizontal polarization
- type 3 (Linux) CPU
And that we decide to support only:
- dedicated CPUs on the host
- pinned vCPUs on the guest
the creation of a vCPU is the only trigger to set the MTCR bit for
a guest.
The reporting to the guest is done using the Multiprocessor
Topology-Change-Report (MTCR) bit of the utility entry of the guest's
SCA which will be cleared during the interpretation of PTF.
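
The set-on-creation, clear-on-PTF behaviour described above can be sketched as a small userspace model. This is not the kernel implementation: struct toy_sca and the toy_* helpers are hypothetical, there is no locking, and the bit position (most significant bit of a 16-bit utility field) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed position of the MTCR bit inside the 16-bit utility field. */
#define TOY_SCA_UTILITY_MTCR 0x8000u

/* Hypothetical stand-in for the relevant part of the guest's SCA. */
struct toy_sca {
	uint16_t utility;
};

/* Single helper to set or clear the bit, as userland would on reset or migration. */
static void toy_set_mtcr(struct toy_sca *sca, bool val)
{
	assert(sca);
	if (val)
		sca->utility |= TOY_SCA_UTILITY_MTCR;
	else
		sca->utility &= (uint16_t)~TOY_SCA_UTILITY_MTCR;
}

/* Creating a vCPU is the only event that reports a topology change. */
static void toy_vcpu_create(struct toy_sca *sca)
{
	toy_set_mtcr(sca, true);
}

/* Model of PTF with function code 2: report the change once, then clear it. */
static bool toy_ptf_check(struct toy_sca *sca)
{
	bool changed = (sca->utility & TOY_SCA_UTILITY_MTCR) != 0;

	toy_set_mtcr(sca, false);
	return changed;
}
```

A guest polling with PTF(2) therefore sees a change exactly once per report, after which it would be expected to re-read the topology with STSI.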
Regards,
Pierre
Pierre Morel (2):
KVM: s390: guest support for topology function
KVM: s390: resetting the Topology-Change-Report
Documentation/virt/kvm/api.rst | 25 ++++++++++
arch/s390/include/asm/kvm_host.h | 18 +++++--
arch/s390/include/uapi/asm/kvm.h | 1 +
arch/s390/kvm/kvm-s390.c | 82 ++++++++++++++++++++++++++++++++
arch/s390/kvm/priv.c | 20 ++++++--
arch/s390/kvm/vsie.c | 8 ++++
include/uapi/linux/kvm.h | 1 +
7 files changed, 148 insertions(+), 7 deletions(-)
--
2.31.1
Changelog:
from v12 to v13
- remove check for protected virtualization
(Janosch)
- move setting of sca out of the loop
(Janis)
- Change some function names and typos
(Janis, Janosch)
from v11 to v12
- protect sca pointer
(Janis)
- check for user_stsi before returning information
to userland
(Janis)
- check for protected virtualization
(Pierre)
from v10 to v11
- access mtcr with interlocked access instead of ipte_lock
(Janis)
- set mtcr in kvm_arch_vcpu_destroy
(Nico)
- better function documentation
(Claudio)
- use a single function to set and clear
(Janosch)
- Use u8 as API data
(David, Janis)
- Check KVM_CAP_S390_USER_STSI before returning
data to userspace
(Nico)
from v9 to v10
- Suppression of the check on real CPU migration
(Christian)
- Changed the check on fc in handle_stsi
(David)
from v8 to v9
- bug correction in kvm_s390_topology_changed
(Heiko)
- simplification for ipte_lock/unlock to use kvm
as arg instead of vcpu and test on sclp.has_siif
instead of the SIE ECA_SII.
(David)
- use of a single value for reporting if the
topology changed instead of a structure
(David)
from v7 to v8
- implement reset handling
(Janosch)
- change the way to check if the topology changed
(Nico, Heiko)
from v6 to v7
- rebase
from v5 to v6
- make the subject more accurate
(Claudio)
- Change the kvm_s390_set_mtcr() function to have vcpu in the name
(Janosch)
- Replace the checks on ECB_PTF with the check of facility 11
(Janosch)
- modify kvm_arch_vcpu_load, move the check in a function in
the header file
(Janosch)
- No magical number replace the "new cpu value" of -1 with a define
(Janosch)
- Make the checks for STSI validity clearer
(Janosch)
from v4 to v5
- modify the way KVM_CAP is tested to be OK with vsie
(David)
from v3 to v4
- squash both patches
(David)
- Added Documentation
(David)
- Modified the detection for new vCPUs
(Pierre)
from v2 to v3
- use PTF interpretation
(Christian)
- optimize arch_update_cpu_topology using PTF
(Pierre)
from v1 to v2:
- Add a KVM capability to let QEMU know we support PTF and STSI 15
(David)
- check KVM facility 11 before accepting STSI fc 15
(David)
- handle all we can in userland
(David)
- add tracing to STSI fc 15
(Connie)
^ permalink raw reply
* [PATCH v13 1/2] KVM: s390: guest support for topology function
From: Pierre Morel @ 2022-07-14 10:18 UTC (permalink / raw)
To: kvm
Cc: linux-s390, linux-kernel, borntraeger, frankja, cohuck, david,
thuth, imbrenda, hca, gor, pmorel, wintera, seiden, nrb, scgl
In-Reply-To: <20220714101824.101601-1-pmorel@linux.ibm.com>
We report a topology change to the guest for any CPU hotplug.
The reporting to the guest is done using the Multiprocessor
Topology-Change-Report (MTCR) bit of the utility entry in the guest's
SCA which will be cleared during the interpretation of PTF.
On every vCPU creation we set the MTCR bit to let the guest know, the
next time it uses the PTF instruction with command 2, that the
topology changed and that it should use the STSI(15.1.x) instruction
to get the topology details.
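As an illustration only (not the kernel code), the set/clear handshake on a single flag inside a shared 16-bit word can be sketched in portable C11. The mask value `UTILITY_MTCR` and both function names are assumptions made for this sketch; the real bit lives in the SCA utility field and is cleared by PTF interpretation, which the second helper merely models.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption for this sketch: the flag is the most significant bit of the
 * 16-bit utility word. */
#define UTILITY_MTCR 0x8000u

/* Set or clear the flag without disturbing the other 15 bits. */
static void utility_update_mtcr(_Atomic uint16_t *utility, bool val)
{
	uint16_t old, desired;

	assert(utility != NULL);
	old = atomic_load(utility);
	do {
		desired = val ? (uint16_t)(old | UTILITY_MTCR)
			      : (uint16_t)(old & ~UTILITY_MTCR);
		/* On failure, old is refreshed with the current value. */
	} while (!atomic_compare_exchange_weak(utility, &old, desired));
}

/* Model of the consumer side: report the flag and clear it in one step. */
static bool utility_test_and_clear_mtcr(_Atomic uint16_t *utility)
{
	uint16_t old;

	assert(utility != NULL);
	old = atomic_fetch_and(utility, (uint16_t)~UTILITY_MTCR);
	return (old & UTILITY_MTCR) != 0;
}
```

The compare-and-swap loop is what allows the word to be updated without taking a lock that would also serialize unrelated fields.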
STSI(15.1.x) gives information on the CPU configuration topology.
Let's accept the interception of STSI with the function code 15 and
let the userland part of the hypervisor handle it when userland
supports the CPU Topology facility.
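The resulting decision for a given STSI function code can be summarized by the following stand-alone sketch. The struct, enum and function names are hypothetical and only mirror the two conditions described above; they are not kernel interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two conditions named in the text. */
struct stsi_caps {
	bool topology_facility;	/* CPU topology facility offered to the guest */
	bool user_stsi;		/* userland asked to handle STSI itself */
};

enum stsi_action {
	STSI_NO_DATA,		/* function code rejected, no data returned */
	STSI_EXISTING_PATH,	/* fc 0..3: handling that existed before */
	STSI_TO_USERLAND,	/* fc 15: forwarded to userland */
};

/* Classify a function code according to the rules described above. */
static enum stsi_action stsi_classify(const struct stsi_caps *caps,
				      unsigned int fc)
{
	assert(caps != NULL);
	if (fc <= 3)
		return STSI_EXISTING_PATH;
	if (fc == 15 && caps->topology_facility && caps->user_stsi)
		return STSI_TO_USERLAND;
	return STSI_NO_DATA;
}
```

Function code 15 is only forwarded when both conditions hold; every other code above 3 is rejected as before.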
Signed-off-by: Pierre Morel <pmorel@linux.ibm.com>
Reviewed-by: Nico Boehr <nrb@linux.ibm.com>
Reviewed-by: Janis Schoetterl-Glausch <scgl@linux.ibm.com>
Reviewed-by: Janosch Frank <frankja@linux.ibm.com>
---
arch/s390/include/asm/kvm_host.h | 18 +++++++++++++++---
arch/s390/kvm/kvm-s390.c | 31 +++++++++++++++++++++++++++++++
arch/s390/kvm/priv.c | 20 ++++++++++++++++----
arch/s390/kvm/vsie.c | 8 ++++++++
4 files changed, 70 insertions(+), 7 deletions(-)
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index 78d5dbd0c65b..6287a843e8bc 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -95,19 +95,30 @@ union ipte_control {
};
};
+union sca_utility {
+ __u16 val;
+ struct {
+ __u16 mtcr : 1;
+ __u16 reserved : 15;
+ };
+};
+
struct bsca_block {
union ipte_control ipte_control;
__u64 reserved[5];
__u64 mcn;
- __u64 reserved2;
+ union sca_utility utility;
+ __u8 reserved2[6];
struct bsca_entry cpu[KVM_S390_BSCA_CPU_SLOTS];
};
struct esca_block {
union ipte_control ipte_control;
- __u64 reserved1[7];
+ __u64 reserved1[6];
+ union sca_utility utility;
+ __u8 reserved2[6];
__u64 mcn[4];
- __u64 reserved2[20];
+ __u64 reserved3[20];
struct esca_entry cpu[KVM_S390_ESCA_CPU_SLOTS];
};
@@ -251,6 +262,7 @@ struct kvm_s390_sie_block {
#define ECB_SPECI 0x08
#define ECB_SRSI 0x04
#define ECB_HOSTPROTINT 0x02
+#define ECB_PTF 0x01
__u8 ecb; /* 0x0061 */
#define ECB2_CMMA 0x80
#define ECB2_IEP 0x20
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index ae08491ddb0c..330a0cd4b8c8 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -1768,6 +1768,32 @@ static int kvm_s390_get_cpu_model(struct kvm *kvm, struct kvm_device_attr *attr)
return ret;
}
+/**
+ * kvm_s390_update_topology_change_report - update CPU topology change report
+ * @kvm: guest KVM description
+ * @val: set or clear the MTCR bit
+ *
+ * Updates the Multiprocessor Topology-Change-Report bit to signal
+ * the guest with a topology change.
+ * This is only relevant if the topology facility is present.
+ *
+ * The SCA version, bsca or esca, doesn't matter as offset is the same.
+ */
+static void kvm_s390_update_topology_change_report(struct kvm *kvm, bool val)
+{
+ union sca_utility new, old;
+ struct bsca_block *sca;
+
+ read_lock(&kvm->arch.sca_lock);
+ sca = kvm->arch.sca;
+ do {
+ old = READ_ONCE(sca->utility);
+ new = old;
+ new.mtcr = val;
+ } while (cmpxchg(&sca->utility.val, old.val, new.val) != old.val);
+ read_unlock(&kvm->arch.sca_lock);
+}
+
static int kvm_s390_vm_set_attr(struct kvm *kvm, struct kvm_device_attr *attr)
{
int ret;
@@ -3177,6 +3203,7 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
kvm_clear_async_pf_completion_queue(vcpu);
if (!kvm_is_ucontrol(vcpu->kvm))
sca_del_vcpu(vcpu);
+ kvm_s390_update_topology_change_report(vcpu->kvm, 1);
if (kvm_is_ucontrol(vcpu->kvm))
gmap_remove(vcpu->arch.gmap);
@@ -3581,6 +3608,8 @@ static int kvm_s390_vcpu_setup(struct kvm_vcpu *vcpu)
vcpu->arch.sie_block->ecb |= ECB_HOSTPROTINT;
if (test_kvm_facility(vcpu->kvm, 9))
vcpu->arch.sie_block->ecb |= ECB_SRSI;
+ if (test_kvm_facility(vcpu->kvm, 11))
+ vcpu->arch.sie_block->ecb |= ECB_PTF;
if (test_kvm_facility(vcpu->kvm, 73))
vcpu->arch.sie_block->ecb |= ECB_TE;
if (!kvm_is_ucontrol(vcpu->kvm))
@@ -3714,6 +3743,8 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
rc = kvm_s390_vcpu_setup(vcpu);
if (rc)
goto out_ucontrol_uninit;
+
+ kvm_s390_update_topology_change_report(vcpu->kvm, 1);
return 0;
out_ucontrol_uninit:
diff --git a/arch/s390/kvm/priv.c b/arch/s390/kvm/priv.c
index 12c464c7cddf..3335fa09b6f1 100644
--- a/arch/s390/kvm/priv.c
+++ b/arch/s390/kvm/priv.c
@@ -873,10 +873,18 @@ static int handle_stsi(struct kvm_vcpu *vcpu)
if (vcpu->arch.sie_block->gpsw.mask & PSW_MASK_PSTATE)
return kvm_s390_inject_program_int(vcpu, PGM_PRIVILEGED_OP);
- if (fc > 3) {
- kvm_s390_set_psw_cc(vcpu, 3);
- return 0;
- }
+ /* Bailout forbidden function codes */
+ if (fc > 3 && fc != 15)
+ goto out_no_data;
+
+ /*
+ * fc 15 is provided only with
+ * - PTF/CPU topology support through facility 11
+ * - KVM_CAP_S390_USER_STSI
+ */
+ if (fc == 15 && (!test_kvm_facility(vcpu->kvm, 11) ||
+ !vcpu->kvm->arch.user_stsi))
+ goto out_no_data;
if (vcpu->run->s.regs.gprs[0] & 0x0fffff00
|| vcpu->run->s.regs.gprs[1] & 0xffff0000)
@@ -910,6 +918,10 @@ static int handle_stsi(struct kvm_vcpu *vcpu)
goto out_no_data;
handle_stsi_3_2_2(vcpu, (void *) mem);
break;
+ case 15: /* fc 15 is fully handled in userspace */
+ insert_stsi_usr_data(vcpu, operand2, ar, fc, sel1, sel2);
+ trace_kvm_s390_handle_stsi(vcpu, fc, sel1, sel2, operand2);
+ return -EREMOTE;
}
if (kvm_s390_pv_cpu_is_protected(vcpu)) {
memcpy((void *)sida_origin(vcpu->arch.sie_block), (void *)mem,
diff --git a/arch/s390/kvm/vsie.c b/arch/s390/kvm/vsie.c
index dada78b92691..94138f8f0c1c 100644
--- a/arch/s390/kvm/vsie.c
+++ b/arch/s390/kvm/vsie.c
@@ -503,6 +503,14 @@ static int shadow_scb(struct kvm_vcpu *vcpu, struct vsie_page *vsie_page)
/* Host-protection-interruption introduced with ESOP */
if (test_kvm_cpu_feat(vcpu->kvm, KVM_S390_VM_CPU_FEAT_ESOP))
scb_s->ecb |= scb_o->ecb & ECB_HOSTPROTINT;
+ /*
+ * CPU Topology
+ * This facility only uses the utility field of the SCA and none of
+ * the cpu entries that are problematic with the other interpretation
+ * facilities so we can pass it through
+ */
+ if (test_kvm_facility(vcpu->kvm, 11))
+ scb_s->ecb |= scb_o->ecb & ECB_PTF;
/* transactional execution */
if (test_kvm_facility(vcpu->kvm, 73) && wants_tx) {
/* remap the prefix is tx is toggled on */
--
2.31.1
* [PATCH net-next v2 5/6] net/smc: Allow virtually contiguous sndbufs or RMBs for SMC-R
From: Wen Gu @ 2022-07-14 9:44 UTC (permalink / raw)
To: kgraul, wenjia, davem, edumazet, kuba, pabeni
Cc: linux-s390, netdev, linux-rdma, linux-kernel
In-Reply-To: <1657791845-1060-1-git-send-email-guwen@linux.alibaba.com>
On long-running enterprise production servers, high-order contiguous
memory pages are usually very rare and in most cases we can only get
fragmented pages.
When replacing TCP with SMC-R in such production scenarios, attempting
to allocate high-order physically contiguous sndbufs and RMBs may result
in frequent memory compaction, which can cause unexpected hangs and
further stability risks.
So this patch allows SMC-R link groups to use virtually contiguous
sndbufs and RMBs to avoid the potential issues mentioned above.
Whether to use physically or virtually contiguous buffers can be set
by sysctl smcr_buf_type.
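A virtually contiguous buffer is generally backed by scattered physical pages, so it has to be described to the device one page-sized piece at a time. A rough user-space sketch of that splitting follows; the 4 KiB page size and all names in it are assumptions of this example, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

struct chunk {
	size_t offset;	/* offset of the piece inside its page */
	size_t len;	/* number of bytes of the buffer in that page */
};

/* Number of pages touched by a buffer starting at addr with length len. */
static size_t count_chunks(uintptr_t addr, size_t len)
{
	size_t offset = addr % SKETCH_PAGE_SIZE;

	return (offset + len + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* Fill out[] with one entry per page; returns the number of entries used. */
static size_t split_into_chunks(uintptr_t addr, size_t len,
				struct chunk *out, size_t max)
{
	size_t offset = addr % SKETCH_PAGE_SIZE;
	size_t n = 0;

	assert(out != NULL);
	while (len > 0 && n < max) {
		size_t room = SKETCH_PAGE_SIZE - offset;
		size_t size = len < room ? len : room;

		out[n].offset = offset;
		out[n].len = size;
		n++;
		len -= size;
		offset = 0;	/* only the first page can start mid-page */
	}
	return n;
}
```

With these assumptions a page-aligned 256KB buffer needs 64 entries, whereas a physically contiguous buffer of the same size can be described by a single one.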
Note that using virtually contiguous buffers brings an acceptable
performance regression, which can be divided into two main parts:
1) regression in the data path, caused by the additional address
translation of the sndbuf by the RNIC in Tx. In general, though,
translating addresses through the MTT is fast.
Taking 256KB sndbuf and RMB as an example, the comparisons in qperf
latency and bandwidth test with physically and virtually contiguous
buffers are as follows:
- client:
smc_run taskset -c <cpu> qperf <server> -oo msg_size:1:64K:*2\
-t 5 -vu tcp_{bw|lat}
- server:
smc_run taskset -c <cpu> qperf
[latency]
msgsize tcp smcr smcr-use-virt-buf
1 11.17 us 7.56 us 7.51 us (-0.67%)
2 10.65 us 7.74 us 7.56 us (-2.31%)
4 11.11 us 7.52 us 7.59 us ( 0.84%)
8 10.83 us 7.55 us 7.51 us (-0.48%)
16 11.21 us 7.46 us 7.51 us ( 0.71%)
32 10.65 us 7.53 us 7.58 us ( 0.61%)
64 10.95 us 7.74 us 7.80 us ( 0.76%)
128 11.14 us 7.83 us 7.87 us ( 0.47%)
256 10.97 us 7.94 us 7.92 us (-0.28%)
512 11.23 us 7.94 us 8.20 us ( 3.25%)
1024 11.60 us 8.12 us 8.20 us ( 0.96%)
2048 14.04 us 8.30 us 8.51 us ( 2.49%)
4096 16.88 us 9.13 us 9.07 us (-0.64%)
8192 22.50 us 10.56 us 11.22 us ( 6.26%)
16384 28.99 us 12.88 us 13.83 us ( 7.37%)
32768 40.13 us 16.76 us 16.95 us ( 1.16%)
65536 68.70 us 24.68 us 24.85 us ( 0.68%)
[bandwidth]
msgsize tcp smcr smcr-use-virt-buf
1 1.65 MB/s 1.59 MB/s 1.53 MB/s (-3.88%)
2 3.32 MB/s 3.17 MB/s 3.08 MB/s (-2.67%)
4 6.66 MB/s 6.33 MB/s 6.09 MB/s (-3.85%)
8 13.67 MB/s 13.45 MB/s 11.97 MB/s (-10.99%)
16 25.36 MB/s 27.15 MB/s 24.16 MB/s (-11.01%)
32 48.22 MB/s 54.24 MB/s 49.41 MB/s (-8.89%)
64 106.79 MB/s 107.32 MB/s 99.05 MB/s (-7.71%)
128 210.21 MB/s 202.46 MB/s 201.02 MB/s (-0.71%)
256 400.81 MB/s 416.81 MB/s 393.52 MB/s (-5.59%)
512 746.49 MB/s 834.12 MB/s 809.99 MB/s (-2.89%)
1024 1292.33 MB/s 1641.96 MB/s 1571.82 MB/s (-4.27%)
2048 2007.64 MB/s 2760.44 MB/s 2717.68 MB/s (-1.55%)
4096 2665.17 MB/s 4157.44 MB/s 4070.76 MB/s (-2.09%)
8192 3159.72 MB/s 4361.57 MB/s 4270.65 MB/s (-2.08%)
16384 4186.70 MB/s 4574.13 MB/s 4501.17 MB/s (-1.60%)
32768 4093.21 MB/s 4487.42 MB/s 4322.43 MB/s (-3.68%)
65536 4057.14 MB/s 4735.61 MB/s 4555.17 MB/s (-3.81%)
2) regression in the buffer initialization and destruction paths, caused
by the additional MR operations on sndbufs. Thanks to the link group
buffer reuse mechanism, the impact of this kind of regression
decreases as the number of buffer reuses increases.
Taking 256KB sndbuf and RMB as an example, the latencies of some key SMC-R
buffer-related functions, obtained with bpftrace, are as follows:
Function Phys-bufs Virt-bufs
smcr_new_buf_create() 67154 ns 79164 ns
smc_ib_buf_map_sg() 525 ns 928 ns
smc_ib_get_memory_region() 162294 ns 161191 ns
smc_wr_reg_send() 9957 ns 9635 ns
smc_ib_put_memory_region() 203548 ns 198374 ns
smc_ib_buf_unmap_sg() 508 ns 1158 ns
------------
Test environment notes:
1. The above tests run on 2 VMs within the same host.
2. The NIC is a ConnectX-4Lx, using SRIOV and passing through 2 VFs to
each VM respectively.
3. The VMs' vCPUs are bound to different physical CPUs, and the bound
physical CPUs are isolated by the `isolcpus=xxx` cmdline.
4. The NICs' queue number is set to 1.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
---
net/smc/af_smc.c | 66 +++++++++++++++--
net/smc/smc_clc.c | 8 +-
net/smc/smc_clc.h | 2 +-
net/smc/smc_core.c | 213 +++++++++++++++++++++++++++++++++++++----------------
net/smc/smc_core.h | 10 ++-
net/smc/smc_ib.c | 15 ++--
net/smc/smc_llc.c | 33 +++++----
net/smc/smc_rx.c | 90 ++++++++++++++++++----
net/smc/smc_tx.c | 9 ++-
9 files changed, 328 insertions(+), 118 deletions(-)
diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 9497a3b..6e70d9c 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -487,6 +487,29 @@ static void smc_copy_sock_settings_to_smc(struct smc_sock *smc)
smc_copy_sock_settings(&smc->sk, smc->clcsock->sk, SK_FLAGS_CLC_TO_SMC);
}
+/* register the new vzalloced sndbuf on all links */
+static int smcr_lgr_reg_sndbufs(struct smc_link *link,
+ struct smc_buf_desc *snd_desc)
+{
+ struct smc_link_group *lgr = link->lgr;
+ int i, rc = 0;
+
+ if (!snd_desc->is_vm)
+ return -EINVAL;
+
+ /* protect against parallel smcr_link_reg_buf() */
+ mutex_lock(&lgr->llc_conf_mutex);
+ for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
+ if (!smc_link_active(&lgr->lnk[i]))
+ continue;
+ rc = smcr_link_reg_buf(&lgr->lnk[i], snd_desc);
+ if (rc)
+ break;
+ }
+ mutex_unlock(&lgr->llc_conf_mutex);
+ return rc;
+}
+
/* register the new rmb on all links */
static int smcr_lgr_reg_rmbs(struct smc_link *link,
struct smc_buf_desc *rmb_desc)
@@ -498,13 +521,13 @@ static int smcr_lgr_reg_rmbs(struct smc_link *link,
if (rc)
return rc;
/* protect against parallel smc_llc_cli_rkey_exchange() and
- * parallel smcr_link_reg_rmb()
+ * parallel smcr_link_reg_buf()
*/
mutex_lock(&lgr->llc_conf_mutex);
for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
if (!smc_link_active(&lgr->lnk[i]))
continue;
- rc = smcr_link_reg_rmb(&lgr->lnk[i], rmb_desc);
+ rc = smcr_link_reg_buf(&lgr->lnk[i], rmb_desc);
if (rc)
goto out;
}
@@ -550,8 +573,15 @@ static int smcr_clnt_conf_first_link(struct smc_sock *smc)
smc_wr_remember_qp_attr(link);
- if (smcr_link_reg_rmb(link, smc->conn.rmb_desc))
- return SMC_CLC_DECL_ERR_REGRMB;
+ /* reg the sndbuf if it was vzalloced */
+ if (smc->conn.sndbuf_desc->is_vm) {
+ if (smcr_link_reg_buf(link, smc->conn.sndbuf_desc))
+ return SMC_CLC_DECL_ERR_REGBUF;
+ }
+
+ /* reg the rmb */
+ if (smcr_link_reg_buf(link, smc->conn.rmb_desc))
+ return SMC_CLC_DECL_ERR_REGBUF;
/* confirm_rkey is implicit on 1st contact */
smc->conn.rmb_desc->is_conf_rkey = true;
@@ -1221,8 +1251,15 @@ static int smc_connect_rdma(struct smc_sock *smc,
goto connect_abort;
}
} else {
+ /* reg sendbufs if they were vzalloced */
+ if (smc->conn.sndbuf_desc->is_vm) {
+ if (smcr_lgr_reg_sndbufs(link, smc->conn.sndbuf_desc)) {
+ reason_code = SMC_CLC_DECL_ERR_REGBUF;
+ goto connect_abort;
+ }
+ }
if (smcr_lgr_reg_rmbs(link, smc->conn.rmb_desc)) {
- reason_code = SMC_CLC_DECL_ERR_REGRMB;
+ reason_code = SMC_CLC_DECL_ERR_REGBUF;
goto connect_abort;
}
}
@@ -1749,8 +1786,15 @@ static int smcr_serv_conf_first_link(struct smc_sock *smc)
struct smc_llc_qentry *qentry;
int rc;
- if (smcr_link_reg_rmb(link, smc->conn.rmb_desc))
- return SMC_CLC_DECL_ERR_REGRMB;
+ /* reg the sndbuf if it was vzalloced */
+ if (smc->conn.sndbuf_desc->is_vm) {
+ if (smcr_link_reg_buf(link, smc->conn.sndbuf_desc))
+ return SMC_CLC_DECL_ERR_REGBUF;
+ }
+
+ /* reg the rmb */
+ if (smcr_link_reg_buf(link, smc->conn.rmb_desc))
+ return SMC_CLC_DECL_ERR_REGBUF;
/* send CONFIRM LINK request to client over the RoCE fabric */
rc = smc_llc_send_confirm_link(link, SMC_LLC_REQ);
@@ -2109,8 +2153,14 @@ static int smc_listen_rdma_reg(struct smc_sock *new_smc, bool local_first)
struct smc_connection *conn = &new_smc->conn;
if (!local_first) {
+ /* reg sendbufs if they were vzalloced */
+ if (conn->sndbuf_desc->is_vm) {
+ if (smcr_lgr_reg_sndbufs(conn->lnk,
+ conn->sndbuf_desc))
+ return SMC_CLC_DECL_ERR_REGBUF;
+ }
if (smcr_lgr_reg_rmbs(conn->lnk, conn->rmb_desc))
- return SMC_CLC_DECL_ERR_REGRMB;
+ return SMC_CLC_DECL_ERR_REGBUF;
}
return 0;
diff --git a/net/smc/smc_clc.c b/net/smc/smc_clc.c
index f9f3f59..1472f31 100644
--- a/net/smc/smc_clc.c
+++ b/net/smc/smc_clc.c
@@ -1034,7 +1034,7 @@ static int smc_clc_send_confirm_accept(struct smc_sock *smc,
ETH_ALEN);
hton24(clc->r0.qpn, link->roce_qp->qp_num);
clc->r0.rmb_rkey =
- htonl(conn->rmb_desc->mr_rx[link->link_idx]->rkey);
+ htonl(conn->rmb_desc->mr[link->link_idx]->rkey);
clc->r0.rmbe_idx = 1; /* for now: 1 RMB = 1 RMBE */
clc->r0.rmbe_alert_token = htonl(conn->alert_token_local);
switch (clc->hdr.type) {
@@ -1046,8 +1046,10 @@ static int smc_clc_send_confirm_accept(struct smc_sock *smc,
break;
}
clc->r0.rmbe_size = conn->rmbe_size_short;
- clc->r0.rmb_dma_addr = cpu_to_be64((u64)sg_dma_address
- (conn->rmb_desc->sgt[link->link_idx].sgl));
+ clc->r0.rmb_dma_addr = conn->rmb_desc->is_vm ?
+ cpu_to_be64((uintptr_t)conn->rmb_desc->cpu_addr) :
+ cpu_to_be64((u64)sg_dma_address
+ (conn->rmb_desc->sgt[link->link_idx].sgl));
hton24(clc->r0.psn, link->psn_initial);
if (version == SMC_V1) {
clc->hdr.length = htons(SMCR_CLC_ACCEPT_CONFIRM_LEN);
diff --git a/net/smc/smc_clc.h b/net/smc/smc_clc.h
index 83f02f1..5fee545 100644
--- a/net/smc/smc_clc.h
+++ b/net/smc/smc_clc.h
@@ -62,7 +62,7 @@
#define SMC_CLC_DECL_INTERR 0x09990000 /* internal error */
#define SMC_CLC_DECL_ERR_RTOK 0x09990001 /* rtoken handling failed */
#define SMC_CLC_DECL_ERR_RDYLNK 0x09990002 /* ib ready link failed */
-#define SMC_CLC_DECL_ERR_REGRMB 0x09990003 /* reg rmb failed */
+#define SMC_CLC_DECL_ERR_REGBUF 0x09990003 /* reg rdma bufs failed */
#define SMC_FIRST_CONTACT_MASK 0b10 /* first contact bit within typev2 */
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 86afbbc..f26770c 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -1087,34 +1087,37 @@ struct smc_link *smc_switch_conns(struct smc_link_group *lgr,
return NULL;
}
-static void smcr_buf_unuse(struct smc_buf_desc *rmb_desc,
+static void smcr_buf_unuse(struct smc_buf_desc *buf_desc, bool is_rmb,
struct smc_link_group *lgr)
{
+ struct mutex *lock; /* lock buffer list */
int rc;
- if (rmb_desc->is_conf_rkey && !list_empty(&lgr->list)) {
+ if (is_rmb && buf_desc->is_conf_rkey && !list_empty(&lgr->list)) {
/* unregister rmb with peer */
rc = smc_llc_flow_initiate(lgr, SMC_LLC_FLOW_RKEY);
if (!rc) {
/* protect against smc_llc_cli_rkey_exchange() */
mutex_lock(&lgr->llc_conf_mutex);
- smc_llc_do_delete_rkey(lgr, rmb_desc);
- rmb_desc->is_conf_rkey = false;
+ smc_llc_do_delete_rkey(lgr, buf_desc);
+ buf_desc->is_conf_rkey = false;
mutex_unlock(&lgr->llc_conf_mutex);
smc_llc_flow_stop(lgr, &lgr->llc_flow_lcl);
}
}
- if (rmb_desc->is_reg_err) {
+ if (buf_desc->is_reg_err) {
/* buf registration failed, reuse not possible */
- mutex_lock(&lgr->rmbs_lock);
- list_del(&rmb_desc->list);
- mutex_unlock(&lgr->rmbs_lock);
+ lock = is_rmb ? &lgr->rmbs_lock :
+ &lgr->sndbufs_lock;
+ mutex_lock(lock);
+ list_del(&buf_desc->list);
+ mutex_unlock(lock);
- smc_buf_free(lgr, true, rmb_desc);
+ smc_buf_free(lgr, is_rmb, buf_desc);
} else {
- rmb_desc->used = 0;
- memset(rmb_desc->cpu_addr, 0, rmb_desc->len);
+ buf_desc->used = 0;
+ memset(buf_desc->cpu_addr, 0, buf_desc->len);
}
}
@@ -1122,15 +1125,23 @@ static void smc_buf_unuse(struct smc_connection *conn,
struct smc_link_group *lgr)
{
if (conn->sndbuf_desc) {
- conn->sndbuf_desc->used = 0;
- memset(conn->sndbuf_desc->cpu_addr, 0, conn->sndbuf_desc->len);
+ if (!lgr->is_smcd && conn->sndbuf_desc->is_vm) {
+ smcr_buf_unuse(conn->sndbuf_desc, false, lgr);
+ } else {
+ conn->sndbuf_desc->used = 0;
+ memset(conn->sndbuf_desc->cpu_addr, 0,
+ conn->sndbuf_desc->len);
+ }
}
- if (conn->rmb_desc && lgr->is_smcd) {
- conn->rmb_desc->used = 0;
- memset(conn->rmb_desc->cpu_addr, 0, conn->rmb_desc->len +
- sizeof(struct smcd_cdc_msg));
- } else if (conn->rmb_desc) {
- smcr_buf_unuse(conn->rmb_desc, lgr);
+ if (conn->rmb_desc) {
+ if (!lgr->is_smcd) {
+ smcr_buf_unuse(conn->rmb_desc, true, lgr);
+ } else {
+ conn->rmb_desc->used = 0;
+ memset(conn->rmb_desc->cpu_addr, 0,
+ conn->rmb_desc->len +
+ sizeof(struct smcd_cdc_msg));
+ }
}
}
@@ -1178,20 +1189,21 @@ void smc_conn_free(struct smc_connection *conn)
static void smcr_buf_unmap_link(struct smc_buf_desc *buf_desc, bool is_rmb,
struct smc_link *lnk)
{
- if (is_rmb)
+ if (is_rmb || buf_desc->is_vm)
buf_desc->is_reg_mr[lnk->link_idx] = false;
if (!buf_desc->is_map_ib[lnk->link_idx])
return;
- if (is_rmb) {
- if (buf_desc->mr_rx[lnk->link_idx]) {
- smc_ib_put_memory_region(
- buf_desc->mr_rx[lnk->link_idx]);
- buf_desc->mr_rx[lnk->link_idx] = NULL;
- }
+
+ if ((is_rmb || buf_desc->is_vm) &&
+ buf_desc->mr[lnk->link_idx]) {
+ smc_ib_put_memory_region(buf_desc->mr[lnk->link_idx]);
+ buf_desc->mr[lnk->link_idx] = NULL;
+ }
+ if (is_rmb)
smc_ib_buf_unmap_sg(lnk, buf_desc, DMA_FROM_DEVICE);
- } else {
+ else
smc_ib_buf_unmap_sg(lnk, buf_desc, DMA_TO_DEVICE);
- }
+
sg_free_table(&buf_desc->sgt[lnk->link_idx]);
buf_desc->is_map_ib[lnk->link_idx] = false;
}
@@ -1280,8 +1292,10 @@ static void smcr_buf_free(struct smc_link_group *lgr, bool is_rmb,
for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++)
smcr_buf_unmap_link(buf_desc, is_rmb, &lgr->lnk[i]);
- if (buf_desc->pages)
+ if (!buf_desc->is_vm && buf_desc->pages)
__free_pages(buf_desc->pages, buf_desc->order);
+ else if (buf_desc->is_vm && buf_desc->cpu_addr)
+ vfree(buf_desc->cpu_addr);
kfree(buf_desc);
}
@@ -1993,26 +2007,50 @@ static inline int smc_rmb_wnd_update_limit(int rmbe_size)
return max_t(int, rmbe_size / 10, SOCK_MIN_SNDBUF / 2);
}
-/* map an rmb buf to a link */
+/* map a buf to a link */
static int smcr_buf_map_link(struct smc_buf_desc *buf_desc, bool is_rmb,
struct smc_link *lnk)
{
- int rc;
+ int rc, i, nents, offset, buf_size, size, access_flags;
+ struct scatterlist *sg;
+ void *buf;
if (buf_desc->is_map_ib[lnk->link_idx])
return 0;
- rc = sg_alloc_table(&buf_desc->sgt[lnk->link_idx], 1, GFP_KERNEL);
+ if (buf_desc->is_vm) {
+ buf = buf_desc->cpu_addr;
+ buf_size = buf_desc->len;
+ offset = offset_in_page(buf_desc->cpu_addr);
+ nents = PAGE_ALIGN(buf_size + offset) / PAGE_SIZE;
+ } else {
+ nents = 1;
+ }
+
+ rc = sg_alloc_table(&buf_desc->sgt[lnk->link_idx], nents, GFP_KERNEL);
if (rc)
return rc;
- sg_set_buf(buf_desc->sgt[lnk->link_idx].sgl,
- buf_desc->cpu_addr, buf_desc->len);
+
+ if (buf_desc->is_vm) {
+ /* virtually contiguous buffer */
+ for_each_sg(buf_desc->sgt[lnk->link_idx].sgl, sg, nents, i) {
+ size = min_t(int, PAGE_SIZE - offset, buf_size);
+ sg_set_page(sg, vmalloc_to_page(buf), size, offset);
+ buf += size / sizeof(*buf);
+ buf_size -= size;
+ offset = 0;
+ }
+ } else {
+ /* physically contiguous buffer */
+ sg_set_buf(buf_desc->sgt[lnk->link_idx].sgl,
+ buf_desc->cpu_addr, buf_desc->len);
+ }
/* map sg table to DMA address */
rc = smc_ib_buf_map_sg(lnk, buf_desc,
is_rmb ? DMA_FROM_DEVICE : DMA_TO_DEVICE);
/* SMC protocol depends on mapping to one DMA address only */
- if (rc != 1) {
+ if (rc != nents) {
rc = -EAGAIN;
goto free_table;
}
@@ -2020,15 +2058,18 @@ static int smcr_buf_map_link(struct smc_buf_desc *buf_desc, bool is_rmb,
buf_desc->is_dma_need_sync |=
smc_ib_is_sg_need_sync(lnk, buf_desc) << lnk->link_idx;
- /* create a new memory region for the RMB */
- if (is_rmb) {
- rc = smc_ib_get_memory_region(lnk->roce_pd,
- IB_ACCESS_REMOTE_WRITE |
- IB_ACCESS_LOCAL_WRITE,
+ if (is_rmb || buf_desc->is_vm) {
+ /* create a new memory region for the RMB or vzalloced sndbuf */
+ access_flags = is_rmb ?
+ IB_ACCESS_REMOTE_WRITE | IB_ACCESS_LOCAL_WRITE :
+ IB_ACCESS_LOCAL_WRITE;
+
+ rc = smc_ib_get_memory_region(lnk->roce_pd, access_flags,
buf_desc, lnk->link_idx);
if (rc)
goto buf_unmap;
- smc_ib_sync_sg_for_device(lnk, buf_desc, DMA_FROM_DEVICE);
+ smc_ib_sync_sg_for_device(lnk, buf_desc,
+ is_rmb ? DMA_FROM_DEVICE : DMA_TO_DEVICE);
}
buf_desc->is_map_ib[lnk->link_idx] = true;
return 0;
@@ -2041,20 +2082,23 @@ static int smcr_buf_map_link(struct smc_buf_desc *buf_desc, bool is_rmb,
return rc;
}
-/* register a new rmb on IB device,
+/* register a new buf on IB device, rmb or vzalloced sndbuf
* must be called under lgr->llc_conf_mutex lock
*/
-int smcr_link_reg_rmb(struct smc_link *link, struct smc_buf_desc *rmb_desc)
+int smcr_link_reg_buf(struct smc_link *link, struct smc_buf_desc *buf_desc)
{
if (list_empty(&link->lgr->list))
return -ENOLINK;
- if (!rmb_desc->is_reg_mr[link->link_idx]) {
- /* register memory region for new rmb */
- if (smc_wr_reg_send(link, rmb_desc->mr_rx[link->link_idx])) {
- rmb_desc->is_reg_err = true;
+ if (!buf_desc->is_reg_mr[link->link_idx]) {
+ /* register memory region for new buf */
+ if (buf_desc->is_vm)
+ buf_desc->mr[link->link_idx]->iova =
+ (uintptr_t)buf_desc->cpu_addr;
+ if (smc_wr_reg_send(link, buf_desc->mr[link->link_idx])) {
+ buf_desc->is_reg_err = true;
return -EFAULT;
}
- rmb_desc->is_reg_mr[link->link_idx] = true;
+ buf_desc->is_reg_mr[link->link_idx] = true;
}
return 0;
}
@@ -2106,18 +2150,38 @@ int smcr_buf_reg_lgr(struct smc_link *lnk)
struct smc_buf_desc *buf_desc, *bf;
int i, rc = 0;
+ /* reg all RMBs for a new link */
mutex_lock(&lgr->rmbs_lock);
for (i = 0; i < SMC_RMBE_SIZES; i++) {
list_for_each_entry_safe(buf_desc, bf, &lgr->rmbs[i], list) {
if (!buf_desc->used)
continue;
- rc = smcr_link_reg_rmb(lnk, buf_desc);
- if (rc)
- goto out;
+ rc = smcr_link_reg_buf(lnk, buf_desc);
+ if (rc) {
+ mutex_unlock(&lgr->rmbs_lock);
+ return rc;
+ }
}
}
-out:
mutex_unlock(&lgr->rmbs_lock);
+
+ if (lgr->buf_type == SMCR_PHYS_CONT_BUFS)
+ return rc;
+
+ /* reg all vzalloced sndbufs for a new link */
+ mutex_lock(&lgr->sndbufs_lock);
+ for (i = 0; i < SMC_RMBE_SIZES; i++) {
+ list_for_each_entry_safe(buf_desc, bf, &lgr->sndbufs[i], list) {
+ if (!buf_desc->used || !buf_desc->is_vm)
+ continue;
+ rc = smcr_link_reg_buf(lnk, buf_desc);
+ if (rc) {
+ mutex_unlock(&lgr->sndbufs_lock);
+ return rc;
+ }
+ }
+ }
+ mutex_unlock(&lgr->sndbufs_lock);
return rc;
}
@@ -2131,18 +2195,39 @@ static struct smc_buf_desc *smcr_new_buf_create(struct smc_link_group *lgr,
if (!buf_desc)
return ERR_PTR(-ENOMEM);
- buf_desc->order = get_order(bufsize);
- buf_desc->pages = alloc_pages(GFP_KERNEL | __GFP_NOWARN |
- __GFP_NOMEMALLOC | __GFP_COMP |
- __GFP_NORETRY | __GFP_ZERO,
- buf_desc->order);
- if (!buf_desc->pages) {
- kfree(buf_desc);
- return ERR_PTR(-EAGAIN);
- }
- buf_desc->cpu_addr = (void *)page_address(buf_desc->pages);
- buf_desc->len = bufsize;
+ switch (lgr->buf_type) {
+ case SMCR_PHYS_CONT_BUFS:
+ case SMCR_MIXED_BUFS:
+ buf_desc->order = get_order(bufsize);
+ buf_desc->pages = alloc_pages(GFP_KERNEL | __GFP_NOWARN |
+ __GFP_NOMEMALLOC | __GFP_COMP |
+ __GFP_NORETRY | __GFP_ZERO,
+ buf_desc->order);
+ if (buf_desc->pages) {
+ buf_desc->cpu_addr =
+ (void *)page_address(buf_desc->pages);
+ buf_desc->len = bufsize;
+ buf_desc->is_vm = false;
+ break;
+ }
+ if (lgr->buf_type == SMCR_PHYS_CONT_BUFS)
+ goto out;
+ fallthrough; // try virtually contiguous buf
+ case SMCR_VIRT_CONT_BUFS:
+ buf_desc->order = get_order(bufsize);
+ buf_desc->cpu_addr = vzalloc(PAGE_SIZE << buf_desc->order);
+ if (!buf_desc->cpu_addr)
+ goto out;
+ buf_desc->pages = NULL;
+ buf_desc->len = bufsize;
+ buf_desc->is_vm = true;
+ break;
+ }
return buf_desc;
+
+out:
+ kfree(buf_desc);
+ return ERR_PTR(-EAGAIN);
}
/* map buf_desc on all usable links,
@@ -2273,7 +2358,7 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
if (!is_smcd) {
if (smcr_buf_map_usable_links(lgr, buf_desc, is_rmb)) {
- smcr_buf_unuse(buf_desc, lgr);
+ smcr_buf_unuse(buf_desc, is_rmb, lgr);
return -ENOMEM;
}
}
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 0261124..fe8b524 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -168,9 +168,11 @@ struct smc_buf_desc {
struct { /* SMC-R */
struct sg_table sgt[SMC_LINKS_PER_LGR_MAX];
/* virtual buffer */
- struct ib_mr *mr_rx[SMC_LINKS_PER_LGR_MAX];
- /* for rmb only: memory region
+ struct ib_mr *mr[SMC_LINKS_PER_LGR_MAX];
+ /* memory region: for rmb and
+ * vzalloced sndbuf
* incl. rkey provided to peer
+ * and lkey provided to local
*/
u32 order; /* allocation order */
@@ -183,6 +185,8 @@ struct smc_buf_desc {
u8 is_dma_need_sync;
u8 is_reg_err;
/* buffer registration err */
+ u8 is_vm;
+ /* virtually contiguous */
};
struct { /* SMC-D */
unsigned short sba_idx;
@@ -543,7 +547,7 @@ void smc_switch_link_and_count(struct smc_connection *conn,
void smcr_lgr_set_type(struct smc_link_group *lgr, enum smc_lgr_type new_type);
void smcr_lgr_set_type_asym(struct smc_link_group *lgr,
enum smc_lgr_type new_type, int asym_lnk_idx);
-int smcr_link_reg_rmb(struct smc_link *link, struct smc_buf_desc *rmb_desc);
+int smcr_link_reg_buf(struct smc_link *link, struct smc_buf_desc *buf_desc);
struct smc_link *smc_switch_conns(struct smc_link_group *lgr,
struct smc_link *from_lnk, bool is_dev_err);
void smcr_link_down_cond(struct smc_link *lnk);
diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index 60e5095..854772d 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -698,7 +698,7 @@ static int smc_ib_map_mr_sg(struct smc_buf_desc *buf_slot, u8 link_idx)
int sg_num;
/* map the largest prefix of a dma mapped SG list */
- sg_num = ib_map_mr_sg(buf_slot->mr_rx[link_idx],
+ sg_num = ib_map_mr_sg(buf_slot->mr[link_idx],
buf_slot->sgt[link_idx].sgl,
buf_slot->sgt[link_idx].orig_nents,
&offset, PAGE_SIZE);
@@ -710,20 +710,21 @@ static int smc_ib_map_mr_sg(struct smc_buf_desc *buf_slot, u8 link_idx)
int smc_ib_get_memory_region(struct ib_pd *pd, int access_flags,
struct smc_buf_desc *buf_slot, u8 link_idx)
{
- if (buf_slot->mr_rx[link_idx])
+ if (buf_slot->mr[link_idx])
return 0; /* already done */
- buf_slot->mr_rx[link_idx] =
+ buf_slot->mr[link_idx] =
ib_alloc_mr(pd, IB_MR_TYPE_MEM_REG, 1 << buf_slot->order);
- if (IS_ERR(buf_slot->mr_rx[link_idx])) {
+ if (IS_ERR(buf_slot->mr[link_idx])) {
int rc;
- rc = PTR_ERR(buf_slot->mr_rx[link_idx]);
- buf_slot->mr_rx[link_idx] = NULL;
+ rc = PTR_ERR(buf_slot->mr[link_idx]);
+ buf_slot->mr[link_idx] = NULL;
return rc;
}
- if (smc_ib_map_mr_sg(buf_slot, link_idx) != 1)
+ if (smc_ib_map_mr_sg(buf_slot, link_idx) !=
+ buf_slot->sgt[link_idx].orig_nents)
return -EINVAL;
return 0;
diff --git a/net/smc/smc_llc.c b/net/smc/smc_llc.c
index c4d057b..1d83fa9 100644
--- a/net/smc/smc_llc.c
+++ b/net/smc/smc_llc.c
@@ -505,19 +505,22 @@ static int smc_llc_send_confirm_rkey(struct smc_link *send_link,
if (smc_link_active(link) && link != send_link) {
rkeyllc->rtoken[rtok_ix].link_id = link->link_id;
rkeyllc->rtoken[rtok_ix].rmb_key =
- htonl(rmb_desc->mr_rx[link->link_idx]->rkey);
- rkeyllc->rtoken[rtok_ix].rmb_vaddr = cpu_to_be64(
- (u64)sg_dma_address(
- rmb_desc->sgt[link->link_idx].sgl));
+ htonl(rmb_desc->mr[link->link_idx]->rkey);
+ rkeyllc->rtoken[rtok_ix].rmb_vaddr = rmb_desc->is_vm ?
+ cpu_to_be64((uintptr_t)rmb_desc->cpu_addr) :
+ cpu_to_be64((u64)sg_dma_address
+ (rmb_desc->sgt[link->link_idx].sgl));
rtok_ix++;
}
}
/* rkey of send_link is in rtoken[0] */
rkeyllc->rtoken[0].num_rkeys = rtok_ix - 1;
rkeyllc->rtoken[0].rmb_key =
- htonl(rmb_desc->mr_rx[send_link->link_idx]->rkey);
- rkeyllc->rtoken[0].rmb_vaddr = cpu_to_be64(
- (u64)sg_dma_address(rmb_desc->sgt[send_link->link_idx].sgl));
+ htonl(rmb_desc->mr[send_link->link_idx]->rkey);
+ rkeyllc->rtoken[0].rmb_vaddr = rmb_desc->is_vm ?
+ cpu_to_be64((uintptr_t)rmb_desc->cpu_addr) :
+ cpu_to_be64((u64)sg_dma_address
+ (rmb_desc->sgt[send_link->link_idx].sgl));
/* send llc message */
rc = smc_wr_tx_send(send_link, pend);
put_out:
@@ -544,7 +547,7 @@ static int smc_llc_send_delete_rkey(struct smc_link *link,
rkeyllc->hd.common.llc_type = SMC_LLC_DELETE_RKEY;
smc_llc_init_msg_hdr(&rkeyllc->hd, link->lgr, sizeof(*rkeyllc));
rkeyllc->num_rkeys = 1;
- rkeyllc->rkey[0] = htonl(rmb_desc->mr_rx[link->link_idx]->rkey);
+ rkeyllc->rkey[0] = htonl(rmb_desc->mr[link->link_idx]->rkey);
/* send llc message */
rc = smc_wr_tx_send(link, pend);
put_out:
@@ -614,9 +617,10 @@ static int smc_llc_fill_ext_v2(struct smc_llc_msg_add_link_v2_ext *ext,
if (!buf_pos)
break;
rmb = buf_pos;
- ext->rt[i].rmb_key = htonl(rmb->mr_rx[prim_lnk_idx]->rkey);
- ext->rt[i].rmb_key_new = htonl(rmb->mr_rx[lnk_idx]->rkey);
- ext->rt[i].rmb_vaddr_new =
+ ext->rt[i].rmb_key = htonl(rmb->mr[prim_lnk_idx]->rkey);
+ ext->rt[i].rmb_key_new = htonl(rmb->mr[lnk_idx]->rkey);
+ ext->rt[i].rmb_vaddr_new = rmb->is_vm ?
+ cpu_to_be64((uintptr_t)rmb->cpu_addr) :
cpu_to_be64((u64)sg_dma_address(rmb->sgt[lnk_idx].sgl));
buf_pos = smc_llc_get_next_rmb(lgr, &buf_lst, buf_pos);
while (buf_pos && !(buf_pos)->used)
@@ -852,9 +856,10 @@ static int smc_llc_add_link_cont(struct smc_link *link,
}
rmb = *buf_pos;
- addc_llc->rt[i].rmb_key = htonl(rmb->mr_rx[prim_lnk_idx]->rkey);
- addc_llc->rt[i].rmb_key_new = htonl(rmb->mr_rx[lnk_idx]->rkey);
- addc_llc->rt[i].rmb_vaddr_new =
+ addc_llc->rt[i].rmb_key = htonl(rmb->mr[prim_lnk_idx]->rkey);
+ addc_llc->rt[i].rmb_key_new = htonl(rmb->mr[lnk_idx]->rkey);
+ addc_llc->rt[i].rmb_vaddr_new = rmb->is_vm ?
+ cpu_to_be64((uintptr_t)rmb->cpu_addr) :
cpu_to_be64((u64)sg_dma_address(rmb->sgt[lnk_idx].sgl));
(*num_rkeys_todo)--;
diff --git a/net/smc/smc_rx.c b/net/smc/smc_rx.c
index 00ad004..17c5aee 100644
--- a/net/smc/smc_rx.c
+++ b/net/smc/smc_rx.c
@@ -145,35 +145,93 @@ static void smc_rx_spd_release(struct splice_pipe_desc *spd,
static int smc_rx_splice(struct pipe_inode_info *pipe, char *src, size_t len,
struct smc_sock *smc)
{
+ struct smc_link_group *lgr = smc->conn.lgr;
+ int offset = offset_in_page(src);
+ struct partial_page *partial;
struct splice_pipe_desc spd;
- struct partial_page partial;
- struct smc_spd_priv *priv;
- int bytes;
+ struct smc_spd_priv **priv;
+ struct page **pages;
+ int bytes, nr_pages;
+ int i;
- priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+ nr_pages = !lgr->is_smcd && smc->conn.rmb_desc->is_vm ?
+ PAGE_ALIGN(len + offset) / PAGE_SIZE : 1;
+
+ pages = kcalloc(nr_pages, sizeof(*pages), GFP_KERNEL);
+ if (!pages)
+ goto out;
+ partial = kcalloc(nr_pages, sizeof(*partial), GFP_KERNEL);
+ if (!partial)
+ goto out_page;
+ priv = kcalloc(nr_pages, sizeof(*priv), GFP_KERNEL);
if (!priv)
- return -ENOMEM;
- priv->len = len;
- priv->smc = smc;
- partial.offset = src - (char *)smc->conn.rmb_desc->cpu_addr;
- partial.len = len;
- partial.private = (unsigned long)priv;
-
- spd.nr_pages_max = 1;
- spd.nr_pages = 1;
- spd.pages = &smc->conn.rmb_desc->pages;
- spd.partial = &partial;
+ goto out_part;
+ for (i = 0; i < nr_pages; i++) {
+ priv[i] = kzalloc(sizeof(**priv), GFP_KERNEL);
+ if (!priv[i])
+ goto out_priv;
+ }
+
+ if (lgr->is_smcd ||
+ (!lgr->is_smcd && !smc->conn.rmb_desc->is_vm)) {
+ /* smcd or smcr that uses physically contiguous RMBs */
+ priv[0]->len = len;
+ priv[0]->smc = smc;
+ partial[0].offset = src - (char *)smc->conn.rmb_desc->cpu_addr;
+ partial[0].len = len;
+ partial[0].private = (unsigned long)priv[0];
+ pages[0] = smc->conn.rmb_desc->pages;
+ } else {
+ int size, left = len;
+ void *buf = src;
+ /* smcr that uses virtually contiguous RMBs*/
+ for (i = 0; i < nr_pages; i++) {
+ size = min_t(int, PAGE_SIZE - offset, left);
+ priv[i]->len = size;
+ priv[i]->smc = smc;
+ pages[i] = vmalloc_to_page(buf);
+ partial[i].offset = offset;
+ partial[i].len = size;
+ partial[i].private = (unsigned long)priv[i];
+ buf += size / sizeof(*buf);
+ left -= size;
+ offset = 0;
+ }
+ }
+ spd.nr_pages_max = nr_pages;
+ spd.nr_pages = nr_pages;
+ spd.pages = pages;
+ spd.partial = partial;
spd.ops = &smc_pipe_ops;
spd.spd_release = smc_rx_spd_release;
bytes = splice_to_pipe(pipe, &spd);
if (bytes > 0) {
sock_hold(&smc->sk);
- get_page(smc->conn.rmb_desc->pages);
+ if (!lgr->is_smcd && smc->conn.rmb_desc->is_vm) {
+ for (i = 0; i < PAGE_ALIGN(bytes + offset) / PAGE_SIZE; i++)
+ get_page(pages[i]);
+ } else {
+ get_page(smc->conn.rmb_desc->pages);
+ }
atomic_add(bytes, &smc->conn.splice_pending);
}
+ kfree(priv);
+ kfree(partial);
+ kfree(pages);
return bytes;
+
+out_priv:
+ for (i = (i - 1); i >= 0; i--)
+ kfree(priv[i]);
+ kfree(priv);
+out_part:
+ kfree(partial);
+out_page:
+ kfree(pages);
+out:
+ return -ENOMEM;
}
static int smc_rx_data_available_and_no_splice_pend(struct smc_connection *conn)
diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
index ca0d5f5..4e83776 100644
--- a/net/smc/smc_tx.c
+++ b/net/smc/smc_tx.c
@@ -383,6 +383,7 @@ static int smcr_tx_rdma_writes(struct smc_connection *conn, size_t len,
dma_addr_t dma_addr =
sg_dma_address(conn->sndbuf_desc->sgt[link->link_idx].sgl);
+ u64 virt_addr = (uintptr_t)conn->sndbuf_desc->cpu_addr;
int src_len_sum = src_len, dst_len_sum = dst_len;
int sent_count = src_off;
int srcchunk, dstchunk;
@@ -395,7 +396,7 @@ static int smcr_tx_rdma_writes(struct smc_connection *conn, size_t len,
u64 base_addr = dma_addr;
if (dst_len < link->qp_attr.cap.max_inline_data) {
- base_addr = (uintptr_t)conn->sndbuf_desc->cpu_addr;
+ base_addr = virt_addr;
wr->wr.send_flags |= IB_SEND_INLINE;
} else {
wr->wr.send_flags &= ~IB_SEND_INLINE;
@@ -403,8 +404,12 @@ static int smcr_tx_rdma_writes(struct smc_connection *conn, size_t len,
num_sges = 0;
for (srcchunk = 0; srcchunk < 2; srcchunk++) {
- sge[srcchunk].addr = base_addr + src_off;
+ sge[srcchunk].addr = conn->sndbuf_desc->is_vm ?
+ (virt_addr + src_off) : (base_addr + src_off);
sge[srcchunk].length = src_len;
+ if (conn->sndbuf_desc->is_vm)
+ sge[srcchunk].lkey =
+ conn->sndbuf_desc->mr[link->link_idx]->lkey;
num_sges++;
src_off += src_len;
--
1.8.3.1
* [PATCH net-next v2 6/6] net/smc: Extend SMC-R link group netlink attribute
From: Wen Gu @ 2022-07-14 9:44 UTC (permalink / raw)
To: kgraul, wenjia, davem, edumazet, kuba, pabeni
Cc: linux-s390, netdev, linux-rdma, linux-kernel
In-Reply-To: <1657791845-1060-1-git-send-email-guwen@linux.alibaba.com>
Extend the SMC-R link group netlink attribute SMC_GEN_LGR_SMCR.
Introduce SMC_NLA_LGR_R_BUF_TYPE to show the buffer type of an
SMC-R link group.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
---
include/uapi/linux/smc.h | 1 +
net/smc/smc_core.c | 2 ++
2 files changed, 3 insertions(+)
diff --git a/include/uapi/linux/smc.h b/include/uapi/linux/smc.h
index 693f549..bb4dacc 100644
--- a/include/uapi/linux/smc.h
+++ b/include/uapi/linux/smc.h
@@ -124,6 +124,7 @@ enum {
SMC_NLA_LGR_R_V2, /* nest */
SMC_NLA_LGR_R_NET_COOKIE, /* u64 */
SMC_NLA_LGR_R_PAD, /* flag */
+ SMC_NLA_LGR_R_BUF_TYPE, /* u8 */
__SMC_NLA_LGR_R_MAX,
SMC_NLA_LGR_R_MAX = __SMC_NLA_LGR_R_MAX - 1
};
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index f26770c..ff49a11 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -347,6 +347,8 @@ static int smc_nl_fill_lgr(struct smc_link_group *lgr,
goto errattr;
if (nla_put_u8(skb, SMC_NLA_LGR_R_TYPE, lgr->type))
goto errattr;
+ if (nla_put_u8(skb, SMC_NLA_LGR_R_BUF_TYPE, lgr->buf_type))
+ goto errattr;
if (nla_put_u8(skb, SMC_NLA_LGR_R_VLAN_ID, lgr->vlan_id))
goto errattr;
if (nla_put_u64_64bit(skb, SMC_NLA_LGR_R_NET_COOKIE,
--
1.8.3.1
* [PATCH net-next v2 4/6] net/smc: Use sysctl-specified types of buffers in new link group
From: Wen Gu @ 2022-07-14 9:44 UTC (permalink / raw)
To: kgraul, wenjia, davem, edumazet, kuba, pabeni
Cc: linux-s390, netdev, linux-rdma, linux-kernel
In-Reply-To: <1657791845-1060-1-git-send-email-guwen@linux.alibaba.com>
This patch introduces a new SMC-R specific element, buf_type,
in struct smc_link_group. It records the value of the sysctl
smcr_buf_type at the time the link group is created.
A newly created link group creates and reuses buffers of the
type specified by buf_type.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
---
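The snapshot behaviour described in the commit message can be pictured
with a small userspace sketch. It is not kernel code: the *_sketch
structs and both helpers are hypothetical stand-ins for struct netns_smc,
struct smc_link_group and smc_lgr_create(), and only the single
assignment mirrors the hunk below.

```c
/* Hypothetical stand-in for the per-netns sysctl storage. */
struct netns_sketch {
	unsigned int sysctl_smcr_buf_type;
};

/* Hypothetical stand-in for struct smc_link_group. */
struct lgr_sketch {
	unsigned int buf_type;
};

/* Copy the sysctl value once, at link group creation time. */
static void lgr_sketch_create(struct lgr_sketch *lgr,
			      const struct netns_sketch *net)
{
	lgr->buf_type = net->sysctl_smcr_buf_type;
}

/*
 * Illustration helper (assumption, not in the patch): create a link
 * group, then rewrite the sysctl, and report which type the existing
 * link group still uses.
 */
static unsigned int lgr_type_after_sysctl_write(unsigned int at_create,
						unsigned int written_later)
{
	struct netns_sketch net = { .sysctl_smcr_buf_type = at_create };
	struct lgr_sketch lgr;

	lgr_sketch_create(&lgr, &net);
	net.sysctl_smcr_buf_type = written_later; /* later sysctl write */
	return lgr.buf_type; /* unchanged: the lgr keeps its snapshot */
}
```

Under these assumptions, a later write to the sysctl only affects link
groups created afterwards; an existing link group keeps the value it
recorded.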
net/smc/smc_core.c | 1 +
net/smc/smc_core.h | 1 +
2 files changed, 2 insertions(+)
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index fa3a7a8..86afbbc 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -907,6 +907,7 @@ static int smc_lgr_create(struct smc_sock *smc, struct smc_init_info *ini)
lgr->net = smc_ib_net(lnk->smcibdev);
lgr_list = &smc_lgr_list.list;
lgr_lock = &smc_lgr_list.lock;
+ lgr->buf_type = lgr->net->smc.sysctl_smcr_buf_type;
atomic_inc(&lgr_cnt);
}
smc->conn.lgr = lgr;
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 7652dfa..0261124 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -284,6 +284,7 @@ struct smc_link_group {
/* used rtoken elements */
u8 next_link_id;
enum smc_lgr_type type;
+ enum smcr_buf_type buf_type;
/* redundancy state */
u8 pnet_id[SMC_MAX_PNETID_LEN + 1];
/* pnet id of this lgr */
--
1.8.3.1
* [PATCH net-next v2 3/6] net/smc: Introduce a sysctl for setting SMC-R buffer type
From: Wen Gu @ 2022-07-14 9:44 UTC (permalink / raw)
To: kgraul, wenjia, davem, edumazet, kuba, pabeni
Cc: linux-s390, netdev, linux-rdma, linux-kernel
In-Reply-To: <1657791845-1060-1-git-send-email-guwen@linux.alibaba.com>
This patch introduces the sysctl smcr_buf_type for setting
the type of SMC-R sndbufs and RMBs.
Valid values include:
- SMCR_PHYS_CONT_BUFS, which means use physically contiguous
buffers for better performance. This is the default value.
- SMCR_VIRT_CONT_BUFS, which means use virtually contiguous
buffers in case physically contiguous memory is scarce.
- SMCR_MIXED_BUFS, which means first try to use physically
contiguous buffers. If they are not available, use virtually
contiguous buffers.
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
---
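As a reading aid for the three values listed in the commit message, here
is a minimal userspace sketch of the selection policy. The enum values
are copied from the smc_core.h hunk below; enum buf_kind, pick_buf_kind()
and the phys_ok/virt_ok flags are hypothetical and only stand in for
"would this kind of allocation succeed". The buffer allocation itself is
not part of this patch.

```c
#include <stdbool.h>

/* Values as added to net/smc/smc_core.h by this patch. */
enum smcr_buf_type {
	SMCR_PHYS_CONT_BUFS = 0,
	SMCR_VIRT_CONT_BUFS = 1,
	SMCR_MIXED_BUFS = 2,
};

/* Hypothetical outcome of one buffer allocation, for illustration. */
enum buf_kind { BUF_NONE, BUF_PHYS, BUF_VIRT };

/*
 * Hypothetical helper: decide which kind of buffer ends up being used.
 * phys_ok / virt_ok model whether the respective allocation succeeds.
 */
static enum buf_kind pick_buf_kind(enum smcr_buf_type type,
				   bool phys_ok, bool virt_ok)
{
	switch (type) {
	case SMCR_PHYS_CONT_BUFS:
		return phys_ok ? BUF_PHYS : BUF_NONE;
	case SMCR_VIRT_CONT_BUFS:
		return virt_ok ? BUF_VIRT : BUF_NONE;
	case SMCR_MIXED_BUFS:
		/* try physically contiguous first, then fall back */
		if (phys_ok)
			return BUF_PHYS;
		return virt_ok ? BUF_VIRT : BUF_NONE;
	}
	return BUF_NONE;
}
```

In this sketch only the mixed mode has a fallback path; the other two
modes fail when their own kind of memory is not available.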
Documentation/networking/smc-sysctl.rst | 13 +++++++++++++
include/net/netns/smc.h | 1 +
net/smc/smc_core.h | 6 ++++++
net/smc/smc_sysctl.c | 11 +++++++++++
4 files changed, 31 insertions(+)
diff --git a/Documentation/networking/smc-sysctl.rst b/Documentation/networking/smc-sysctl.rst
index 0987fd1..742e90e 100644
--- a/Documentation/networking/smc-sysctl.rst
+++ b/Documentation/networking/smc-sysctl.rst
@@ -21,3 +21,16 @@ autocorking_size - INTEGER
know how/when to uncork their sockets.
Default: 64K
+
+smcr_buf_type - INTEGER
+ Controls which type of sndbufs and RMBs to use in later newly created
+ SMC-R link group. Only for SMC-R.
+
+ Default: 0 (physically contiguous sndbufs and RMBs)
+
+ Possible values:
+
+ - 0 - Use physically contiguous buffers
+ - 1 - Use virtually contiguous buffers
+ - 2 - Mixed use of the two types. Try physically contiguous buffers first.
+ If not available, use virtually contiguous buffers then.
diff --git a/include/net/netns/smc.h b/include/net/netns/smc.h
index e5389ee..2adbe2b 100644
--- a/include/net/netns/smc.h
+++ b/include/net/netns/smc.h
@@ -18,5 +18,6 @@ struct netns_smc {
struct ctl_table_header *smc_hdr;
#endif
unsigned int sysctl_autocorking_size;
+ unsigned int sysctl_smcr_buf_type;
};
#endif
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index 46ddec5..7652dfa 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -217,6 +217,12 @@ enum smc_lgr_type { /* redundancy state of lgr */
SMC_LGR_ASYMMETRIC_LOCAL, /* local has 1, peer 2 active RNICs */
};
+enum smcr_buf_type { /* types of SMC-R sndbufs and RMBs */
+ SMCR_PHYS_CONT_BUFS = 0,
+ SMCR_VIRT_CONT_BUFS = 1,
+ SMCR_MIXED_BUFS = 2,
+};
+
enum smc_llc_flowtype {
SMC_LLC_FLOW_NONE = 0,
SMC_LLC_FLOW_ADD_LINK = 2,
diff --git a/net/smc/smc_sysctl.c b/net/smc/smc_sysctl.c
index cf3ab13..0613868 100644
--- a/net/smc/smc_sysctl.c
+++ b/net/smc/smc_sysctl.c
@@ -15,6 +15,7 @@
#include <net/net_namespace.h>
#include "smc.h"
+#include "smc_core.h"
#include "smc_sysctl.h"
static struct ctl_table smc_table[] = {
@@ -25,6 +26,15 @@
.mode = 0644,
.proc_handler = proc_douintvec,
},
+ {
+ .procname = "smcr_buf_type",
+ .data = &init_net.smc.sysctl_smcr_buf_type,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_douintvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_TWO,
+ },
{ }
};
@@ -49,6 +59,7 @@ int __net_init smc_sysctl_net_init(struct net *net)
goto err_reg;
net->smc.sysctl_autocorking_size = SMC_AUTOCORKING_DEFAULT_SIZE;
+ net->smc.sysctl_smcr_buf_type = SMCR_PHYS_CONT_BUFS;
return 0;
--
1.8.3.1
* [PATCH net-next v2 2/6] net/smc: optimize for smc_sndbuf_sync_sg_for_device and smc_rmb_sync_sg_for_cpu
From: Wen Gu @ 2022-07-14 9:44 UTC (permalink / raw)
To: kgraul, wenjia, davem, edumazet, kuba, pabeni
Cc: linux-s390, netdev, linux-rdma, linux-kernel
In-Reply-To: <1657791845-1060-1-git-send-email-guwen@linux.alibaba.com>
From: Guangguan Wang <guangguan.wang@linux.alibaba.com>
Some CPUs, such as Xeon, guarantee DMA cache coherency, so there
is no need to call the DMA sync APIs to flush the cache on such CPUs.
To avoid calling the DMA sync APIs on the I/O path, use
dma_need_sync() to check whether an smc_buf_desc needs DMA sync
when the smc_buf_desc is created.
Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
---
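The per-link bookkeeping used by this patch can be sketched in plain C.
The field name is_dma_need_sync and the shift by link_idx follow the
hunks below; struct buf_desc_sketch and the three helpers are
hypothetical, and the need_sync argument stands in for the result of
smc_ib_is_sg_need_sync().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the new field in struct smc_buf_desc. */
struct buf_desc_sketch {
	uint8_t is_dma_need_sync; /* one bit per link index */
};

/* Record at map time whether this link's mapping needs DMA sync. */
static void buf_map_link(struct buf_desc_sketch *buf,
			 unsigned int link_idx, bool need_sync)
{
	buf->is_dma_need_sync |= (uint8_t)((unsigned int)need_sync << link_idx);
}

/* Per-link check, as done at the top of the sync helpers in smc_ib.c. */
static bool link_needs_sync(const struct buf_desc_sketch *buf,
			    unsigned int link_idx)
{
	return (buf->is_dma_need_sync & (1U << link_idx)) != 0;
}

/* Whole-buffer check, as done on the I/O path in smc_core.c. */
static bool any_link_needs_sync(const struct buf_desc_sketch *buf)
{
	return buf->is_dma_need_sync != 0;
}
```

With all bits clear, both the whole-buffer check and the per-link check
return early, which is the fast path the commit message is aiming for on
cache-coherent systems.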
net/smc/smc_core.c | 8 ++++++++
net/smc/smc_core.h | 1 +
net/smc/smc_ib.c | 29 +++++++++++++++++++++++++++++
net/smc/smc_ib.h | 2 ++
4 files changed, 40 insertions(+)
diff --git a/net/smc/smc_core.c b/net/smc/smc_core.c
index 1faa0cb..fa3a7a8 100644
--- a/net/smc/smc_core.c
+++ b/net/smc/smc_core.c
@@ -2016,6 +2016,9 @@ static int smcr_buf_map_link(struct smc_buf_desc *buf_desc, bool is_rmb,
goto free_table;
}
+ buf_desc->is_dma_need_sync |=
+ smc_ib_is_sg_need_sync(lnk, buf_desc) << lnk->link_idx;
+
/* create a new memory region for the RMB */
if (is_rmb) {
rc = smc_ib_get_memory_region(lnk->roce_pd,
@@ -2234,6 +2237,7 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
/* check for reusable slot in the link group */
buf_desc = smc_buf_get_slot(bufsize_short, lock, buf_list);
if (buf_desc) {
+ buf_desc->is_dma_need_sync = 0;
SMC_STAT_RMB_SIZE(smc, is_smcd, is_rmb, bufsize);
SMC_STAT_BUF_REUSE(smc, is_smcd, is_rmb);
break; /* found reusable slot */
@@ -2292,6 +2296,8 @@ static int __smc_buf_create(struct smc_sock *smc, bool is_smcd, bool is_rmb)
void smc_sndbuf_sync_sg_for_device(struct smc_connection *conn)
{
+ if (!conn->sndbuf_desc->is_dma_need_sync)
+ return;
if (!smc_conn_lgr_valid(conn) || conn->lgr->is_smcd ||
!smc_link_active(conn->lnk))
return;
@@ -2302,6 +2308,8 @@ void smc_rmb_sync_sg_for_cpu(struct smc_connection *conn)
{
int i;
+ if (!conn->rmb_desc->is_dma_need_sync)
+ return;
if (!smc_conn_lgr_valid(conn) || conn->lgr->is_smcd)
return;
for (i = 0; i < SMC_LINKS_PER_LGR_MAX; i++) {
diff --git a/net/smc/smc_core.h b/net/smc/smc_core.h
index c441dfe..46ddec5 100644
--- a/net/smc/smc_core.h
+++ b/net/smc/smc_core.h
@@ -180,6 +180,7 @@ struct smc_buf_desc {
/* mem region registered */
u8 is_map_ib[SMC_LINKS_PER_LGR_MAX];
/* mem region mapped to lnk */
+ u8 is_dma_need_sync;
u8 is_reg_err;
/* buffer registration err */
};
diff --git a/net/smc/smc_ib.c b/net/smc/smc_ib.c
index dcda416..60e5095 100644
--- a/net/smc/smc_ib.c
+++ b/net/smc/smc_ib.c
@@ -729,6 +729,29 @@ int smc_ib_get_memory_region(struct ib_pd *pd, int access_flags,
return 0;
}
+bool smc_ib_is_sg_need_sync(struct smc_link *lnk,
+ struct smc_buf_desc *buf_slot)
+{
+ struct scatterlist *sg;
+ unsigned int i;
+ bool ret = false;
+
+ /* for now there is just one DMA address */
+ for_each_sg(buf_slot->sgt[lnk->link_idx].sgl, sg,
+ buf_slot->sgt[lnk->link_idx].nents, i) {
+ if (!sg_dma_len(sg))
+ break;
+ if (dma_need_sync(lnk->smcibdev->ibdev->dma_device,
+ sg_dma_address(sg))) {
+ ret = true;
+ goto out;
+ }
+ }
+
+out:
+ return ret;
+}
+
/* synchronize buffer usage for cpu access */
void smc_ib_sync_sg_for_cpu(struct smc_link *lnk,
struct smc_buf_desc *buf_slot,
@@ -737,6 +760,9 @@ void smc_ib_sync_sg_for_cpu(struct smc_link *lnk,
struct scatterlist *sg;
unsigned int i;
+ if (!(buf_slot->is_dma_need_sync & (1U << lnk->link_idx)))
+ return;
+
/* for now there is just one DMA address */
for_each_sg(buf_slot->sgt[lnk->link_idx].sgl, sg,
buf_slot->sgt[lnk->link_idx].nents, i) {
@@ -757,6 +783,9 @@ void smc_ib_sync_sg_for_device(struct smc_link *lnk,
struct scatterlist *sg;
unsigned int i;
+ if (!(buf_slot->is_dma_need_sync & (1U << lnk->link_idx)))
+ return;
+
/* for now there is just one DMA address */
for_each_sg(buf_slot->sgt[lnk->link_idx].sgl, sg,
buf_slot->sgt[lnk->link_idx].nents, i) {
diff --git a/net/smc/smc_ib.h b/net/smc/smc_ib.h
index 5d8b49c..03429567 100644
--- a/net/smc/smc_ib.h
+++ b/net/smc/smc_ib.h
@@ -102,6 +102,8 @@ void smc_ib_buf_unmap_sg(struct smc_link *lnk,
int smc_ib_get_memory_region(struct ib_pd *pd, int access_flags,
struct smc_buf_desc *buf_slot, u8 link_idx);
void smc_ib_put_memory_region(struct ib_mr *mr);
+bool smc_ib_is_sg_need_sync(struct smc_link *lnk,
+ struct smc_buf_desc *buf_slot);
void smc_ib_sync_sg_for_cpu(struct smc_link *lnk,
struct smc_buf_desc *buf_slot,
enum dma_data_direction data_direction);
--
1.8.3.1