Linux-HyperV List
 help / color / mirror / Atom feed
* Re: [PATCH v1 1/3] mm: pass meminit_context to __free_pages_core()
From: David Hildenbrand @ 2024-06-12 18:38 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, linux-hyperv, virtualization, xen-devel,
	kasan-dev, Mike Rapoport, Oscar Salvador, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Juergen Gross,
	Stefano Stabellini, Oleksandr Tyshchenko, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov
In-Reply-To: <20240611121942.050a2215143af0ecb576122f@linux-foundation.org>

On 11.06.24 21:19, Andrew Morton wrote:
> On Tue, 11 Jun 2024 12:06:56 +0200 David Hildenbrand <david@redhat.com> wrote:
> 
>> On 07.06.24 11:09, David Hildenbrand wrote:
>>> In preparation for further changes, let's teach __free_pages_core()
>>> about the differences of memory hotplug handling.
>>>
>>> Move the memory hotplug specific handling from generic_online_page() to
>>> __free_pages_core(), use adjust_managed_page_count() on the memory
>>> hotplug path, and spell out why memory freed via memblock
>>> cannot currently use adjust_managed_page_count().
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>
>> @Andrew, can you squash the following?
> 
> Sure.
> 
> I queued it against "mm: pass meminit_context to __free_pages_core()",
> not against
> 
>> Subject: [PATCH] fixup: mm/highmem: make nr_free_highpages() return "unsigned
>>    long"
> 

Can you squash the following as well? (hopefully the last fixup, otherwise I
might just resend a v2)


 From 53c8c5834e638b2ae5e2a34fa7d49ce0dcf25192 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Wed, 12 Jun 2024 20:31:07 +0200
Subject: [PATCH] fixup: mm: pass meminit_context to __free_pages_core()

Let's add the parameter name also in the declaration.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
  mm/internal.h | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/internal.h b/mm/internal.h
index 14bab8a41baf6..254dd907bf9a2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -605,7 +605,7 @@ extern void __putback_isolated_page(struct page *page, unsigned int order,
  extern void memblock_free_pages(struct page *page, unsigned long pfn,
  					unsigned int order);
  extern void __free_pages_core(struct page *page, unsigned int order,
-		enum meminit_context);
+		enum meminit_context context);
  
  /*
   * This will have no effect, other than possibly generating a warning, if the
-- 
2.45.2


-- 
Cheers,

David / dhildenb


^ permalink raw reply related

* Re: [PATCH net-next v4] net: mana: Allow variable size indirection table
From: Leon Romanovsky @ 2024-06-12 18:30 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Shradha Gupta, linux-hardening, netdev, linux-hyperv,
	linux-kernel, linux-rdma, Colin Ian King, Ahmed Zaki,
	Pavan Chebbi, Souradeep Chakrabarti, Konstantin Taranov,
	Kees Cook, Paolo Abeni, Jakub Kicinski, Eric Dumazet,
	David S. Miller, Dexuan Cui, Wei Liu, Haiyang Zhang,
	K. Y. Srinivasan, Jason Gunthorpe, Long Li, Shradha Gupta
In-Reply-To: <1718015319-9609-1-git-send-email-shradhagupta@linux.microsoft.com>

On Mon, Jun 10, 2024 at 03:28:39AM -0700, Shradha Gupta wrote:
> Allow variable size indirection table allocation in MANA instead
> of using a constant value MANA_INDIRECT_TABLE_SIZE.
> The size is now derived from the MANA_QUERY_VPORT_CONFIG and the
> indirection table is allocated dynamically.
> 
> Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> Reviewed-by: Dexuan Cui <decui@microsoft.com>
> Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> ---
>  Changes in v4:
>  * Skip NULLify after free
>  * Log proper errors in mana_probe() if mana_attach(), mana_probe_port()
>    fails
>  * Implement mana_cleanup_indir_table() to avoid code duplication.
> 
>  Changes in v3:
>  * Fixed the memory leak(save_table) in mana_set_rxfh()
> 
>  Changes in v2:
>  * Rebased to latest net-next tree
>  * Rearranged cleanup code in mana_probe_port to avoid extra operations
> ---
>  drivers/infiniband/hw/mana/qp.c               | 10 +--
>  drivers/net/ethernet/microsoft/mana/mana_en.c | 85 ++++++++++++++++---
>  .../ethernet/microsoft/mana/mana_ethtool.c    | 27 ++++--
>  include/net/mana/gdma.h                       |  4 +-
>  include/net/mana/mana.h                       |  9 +-
>  5 files changed, 104 insertions(+), 31 deletions(-)

Hi Jakub,

Like we talked, I created new shared branch for this patch:
https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/log/?h=mana-shared

Because it took time, the base of this branch is v6.10-rc3 as I'm eager
to get the commit c9d52fb313d3 ("PCI: Revert the cfg_access_lock lockdep mechanism")
from Linus's master, and this shared patch gives me the reason to pull
the fix.

I see that your net-next didn't get -rc3 tag yet, so please pull it after you
advance net-next. After that I'll do the same in RDMA tree.

Thanks

^ permalink raw reply

* Re: [PATCH net-next v4] net: mana: Allow variable size indirection table
From: Leon Romanovsky @ 2024-06-12 18:20 UTC (permalink / raw)
  To: linux-hardening, netdev, linux-hyperv, linux-kernel, linux-rdma,
	Shradha Gupta
  Cc: Colin Ian King, Ahmed Zaki, Pavan Chebbi, Souradeep Chakrabarti,
	Konstantin Taranov, Paolo Abeni, Jakub Kicinski, Eric Dumazet,
	David S. Miller, Dexuan Cui, Wei Liu, Haiyang Zhang,
	K. Y. Srinivasan, Jason Gunthorpe, Long Li, Shradha Gupta,
	Kees Cook
In-Reply-To: <1718015319-9609-1-git-send-email-shradhagupta@linux.microsoft.com>


On Mon, 10 Jun 2024 03:28:39 -0700, Shradha Gupta wrote:
> Allow variable size indirection table allocation in MANA instead
> of using a constant value MANA_INDIRECT_TABLE_SIZE.
> The size is now derived from the MANA_QUERY_VPORT_CONFIG and the
> indirection table is allocated dynamically.
> 
> 

Applied, thanks!

[1/1] net: mana: Allow variable size indirection table
      https://git.kernel.org/rdma/rdma/c/7fc45cb68696c7

Best regards,
-- 
Leon Romanovsky <leon@kernel.org>


^ permalink raw reply

* Re: [PATCH 1/1] Documentation: hyperv: Add overview of Confidential Computing VM support
From: Easwar Hariharan @ 2024-06-12 16:10 UTC (permalink / raw)
  To: mhklinux, kys, haiyangz, wei.liu, decui, corbet, linux-kernel,
	linux-hyperv, linux-doc, linux-coco
  Cc: eahariha
In-Reply-To: <20240610202810.193452-1-mhklinux@outlook.com>

Thank you for adding much needed documentation throughout the tree!

On 6/10/2024 1:28 PM, mhkelley58@gmail.com wrote:
> From: Michael Kelley <mhklinux@outlook.com>
> 
> Add documentation topic for Confidential Computing (CoCo) VM support
> in Linux guests on Hyper-V.
> 
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>
> ---
>  Documentation/virt/hyperv/coco.rst  | 258 ++++++++++++++++++++++++++++
>  Documentation/virt/hyperv/index.rst |   1 +
>  2 files changed, 259 insertions(+)
>  create mode 100644 Documentation/virt/hyperv/coco.rst
> 
> diff --git a/Documentation/virt/hyperv/coco.rst b/Documentation/virt/hyperv/coco.rst
> new file mode 100644
> index 000000000000..ffd6ba7a1d64
> --- /dev/null
> +++ b/Documentation/virt/hyperv/coco.rst
> @@ -0,0 +1,258 @@
> +.. SPDX-License-Identifier: GPL-2.0
> +
> +Confidential Computing VMs
> +==========================
> +Hyper-V can create and run Linux guests that are Confidential Computing
> +(CoCo) VMs. Such VMs cooperate with the physical processor to better protect
> +the confidentiality and integrity of data in the VM's memory, even in the
> +face of a hypervisor/VMM that has been compromised and may behave maliciously.
> +CoCo VMs on Hyper-V share the generic CoCo VM threat model and security
> +objectives described in Documentation/security/snp-tdx-threat-model.rst. Note
> +that Hyper-V specific code in Linux refers to CoCo VMs as "isolated VMs" or
> +"isolation VMs".

Thanks for incorporating the link to the threat model!

> +
> +A Linux CoCo VM on Hyper-V requires the cooperation and interaction of the
> +following:
> +
> +* Physical hardware with a processor that supports CoCo VMs
> +
> +* The hardware runs a version of Windows/Hyper-V with support for CoCo VMs
> +
> +* The VM runs a version of Linux that supports being a CoCo VM
> +
> +The physical hardware requirements are as follows:
> +
> +* AMD processor with SEV-SNP. Hyper-V does not run guest VMs with AMD SME,
> +  SEV, or SEV-ES encryption, and such encryption is not sufficient for a CoCo
> +  VM on Hyper-V.
> +
> +* Intel processor with TDX
> +
> +To create a CoCo VM, the "Isolated VM" attribute must be specified to Hyper-V
> +when the VM is created. A VM cannot be changed from a CoCo VM to a normal VM,
> +or vice versa, after it is created.
> +
> +Operational Modes
> +-----------------
> +Hyper-V CoCo VMs can run in two modes. The mode is selected when the VM is
> +created and cannot be changed during the life of the VM.
> +
> +* Fully-enlightened mode. In this mode, the guest operating system is
> +  enlightened to understand and manage all aspects of running as a CoCo VM.
> +
> +* Paravisor mode. In this mode, a paravisor layer between the guest and the
> +  host provides some operations needed to run as a CoCo VM. The guest operating
> +  system can have fewer CoCo enlightenments than is required in the
> +  fully-enlightened case.
> +
> +Conceptually, fully-enlightened mode and paravisor mode may be treated as
> +points on a spectrum spanning the degree of guest enlightenment needed to run
> +as a CoCo VM. Fully-enlightened mode is one end of the spectrum. A full
> +implementation of paravisor mode is the other end of the spectrum, where all
> +aspects of running as a CoCo VM are handled by the paravisor, and a normal
> +guest OS with no knowledge of memory encryption or other aspects of CoCo VMs
> +can run successfully. However, the Hyper-V implementation of paravisor mode
> +does not go this far, and is somewhere in the middle of the spectrum. Some
> +aspects of CoCo VMs are handled by the Hyper-V paravisor while the guest OS
> +must be enlightened for other aspects. Unfortunately, there is no
> +standardized enumeration of feature/functions that might be provided in the
> +paravisor, and there is no standardized mechanism for a guest OS to query the
> +paravisor for the feature/functions it provides. The understanding of what
> +the paravisor provides is hard-coded in the guest OS.
> +
> +Paravisor mode has similarities to the Coconut project, which aims to provide
> +a limited paravisor to provide services to the guest such as a virtual TPM.

Would it be useful to add an external link to the Coconut project here?
https://github.com/coconut-svsm/svsm

> +However, the Hyper-V paravisor generally handles more aspects of CoCo VMs
> +than is currently envisioned for Coconut, and so is further toward the "no
> +guest enlightenments required" end of the spectrum.
> +
> +In the CoCo VM threat model, the paravisor is in the guest security domain
> +and must be trusted by the guest OS. By implication, the hypervisor/VMM must
> +protect itself against a potentially malicious paravisor just like it
> +protects against a potentially malicious guest.

Tangential to this patch, can the guest provide its own paravisor since
it needs to be trusted and apparently the only way to find out if a
paravisor will be used is to rely on the (possibly) malicious
hypervisor/VMM to provide a synthetic MSR?

> +
> +The hardware architectural approach to fully-enlightened vs. paravisor mode
> +varies depending on the underlying processor.
> +
> +* With AMD SEV-SNP processors, in fully-enlightened mode the guest OS runs in
> +  VMPL 0 and has full control of the guest context. In paravisor mode, the
> +  guest OS runs in VMPL 2 and the paravisor runs in VMPL 0. The paravisor
> +  running in VMPL 0 has privileges that the guest OS in VMPL 2 does not have.
> +  Certain operations require the guest to invoke the paravisor. Furthermore, in
> +  paravisor mode the guest OS operates in "virtual Top Of Memory" (vTOM) mode
> +  as defined by the SEV-SNP architecture. This mode simplifies guest management
> +  of memory encryption when a paravisor is used.
> +
> +* With Intel TDX processor, in fully-enlightened mode the guest OS runs in an
> +  L1 VM. In paravisor mode, TD partitioning is used. The paravisor runs in the
> +  L1 VM, and the guest OS runs in a nested L2 VM.
> +
> +Hyper-V exposes a synthetic MSR to guests that describes the CoCo mode. This
> +MSR indicates if the underlying processor uses AMD SEV-SNP or Intel TDX, and
> +whether a paravisor is being used. It is straightforward to build a single
> +kernel image that can boot and run properly on either architecture, and in
> +either mode.
> +

^ permalink raw reply

* Re: [f2fs-dev] [PATCH] tracing/treewide: Remove second parameter of __assign_str()
From: patchwork-bot+f2fs @ 2024-06-12 15:55 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-kernel, linux-trace-kernel, linux-hyperv, kvm, dri-devel,
	ath10k, Julia.Lawall, linux-s390, dev, linux-cifs, linux-bcachefs,
	linux-rdma, amd-gfx, io-uring, torvalds, iommu, ath11k,
	linux-media, linux-wpan, linux-pm, selinux, linux-arm-msm,
	intel-gfx, linux-erofs, virtualization, linux-sound, linux-block,
	ocfs2-devel, mathieu.desnoyers, linux-cxl, linux-tegra, intel-xe,
	linux-edac, linux-hwmon, brcm80211-dev-list.pdl, linuxppc-dev,
	linux-usb, linux-wireless, brcm80211, linux-f2fs-devel, linux-xfs,
	ath12k, tipc-discussion, mhiramat, netdev, bpf, freedreno,
	linux-nfs, linux-btrfs
In-Reply-To: <20240516133454.681ba6a0@rorschach.local.home>

Hello:

This patch was applied to jaegeuk/f2fs.git (dev)
by Steven Rostedt (Google) <rostedt@goodmis.org>:

On Thu, 16 May 2024 13:34:54 -0400 you wrote:
> From: "Steven Rostedt (Google)" <rostedt@goodmis.org>
> 
> [
>    This is a treewide change. I will likely re-create this patch again in
>    the second week of the merge window of v6.10 and submit it then. Hoping
>    to keep the conflicts that it will cause to a minimum.
> ]
> 
> [...]

Here is the summary with links:
  - [f2fs-dev] tracing/treewide: Remove second parameter of __assign_str()
    https://git.kernel.org/jaegeuk/f2fs/c/2c92ca849fcc

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



^ permalink raw reply

* RE: [PATCH net-next] net: mana: Add support for variable page sizes of ARM64
From: Haiyang Zhang @ 2024-06-12 14:21 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv@vger.kernel.org,
	netdev@vger.kernel.org, Paul Rosswurm
  Cc: Dexuan Cui, stephen@networkplumber.org, KY Srinivasan,
	olaf@aepfle.de, vkuznets, davem@davemloft.net, wei.liu@kernel.org,
	edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
	leon@kernel.org, Long Li, ssengar@linux.microsoft.com,
	linux-rdma@vger.kernel.org, daniel@iogearbox.net,
	john.fastabend@gmail.com, bpf@vger.kernel.org, ast@kernel.org,
	hawk@kernel.org, tglx@linutronix.de,
	shradhagupta@linux.microsoft.com, linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB415781A68B43463F34A884E2D4C02@SN6PR02MB4157.namprd02.prod.outlook.com>



> -----Original Message-----
> From: Michael Kelley <mhklinux@outlook.com>
> Sent: Tuesday, June 11, 2024 10:42 PM
> To: Haiyang Zhang <haiyangz@microsoft.com>; linux-hyperv@vger.kernel.org;
> netdev@vger.kernel.org; Paul Rosswurm <paulros@microsoft.com>
> Cc: Dexuan Cui <decui@microsoft.com>; stephen@networkplumber.org; KY
> Srinivasan <kys@microsoft.com>; olaf@aepfle.de; vkuznets
> <vkuznets@redhat.com>; davem@davemloft.net; wei.liu@kernel.org;
> edumazet@google.com; kuba@kernel.org; pabeni@redhat.com; leon@kernel.org;
> Long Li <longli@microsoft.com>; ssengar@linux.microsoft.com; linux-
> rdma@vger.kernel.org; daniel@iogearbox.net; john.fastabend@gmail.com;
> bpf@vger.kernel.org; ast@kernel.org; hawk@kernel.org; tglx@linutronix.de;
> shradhagupta@linux.microsoft.com; linux-kernel@vger.kernel.org
> Subject: RE: [PATCH net-next] net: mana: Add support for variable page
> sizes of ARM64

> > > > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > index 1332db9a08eb..c9df942d0d02 100644
> > > > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > @@ -182,7 +182,7 @@ int mana_gd_alloc_memory(struct gdma_context
> *gc,
> > > > unsigned int length,
> > > >  dma_addr_t dma_handle;
> > > >  void *buf;
> > > >
> > > > - if (length < PAGE_SIZE || !is_power_of_2(length))
> > > > + if (length < MANA_MIN_QSIZE || !is_power_of_2(length))
> > > >         return -EINVAL;
> > > >
> > > >  gmi->dev = gc->dev;
> > > > @@ -717,7 +717,7 @@ EXPORT_SYMBOL_NS(mana_gd_destroy_dma_region,
> NET_MANA);
> > > >  static int mana_gd_create_dma_region(struct gdma_dev *gd,
> > > >                          struct gdma_mem_info *gmi)
> > > >  {
> > > > - unsigned int num_page = gmi->length / PAGE_SIZE;
> > > > + unsigned int num_page = gmi->length / MANA_MIN_QSIZE;
> > >
> > > This calculation seems a bit weird when using MANA_MIN_QSIZE. The
> > > number of pages, and the construction of the page_addr_list array
> > > a few lines later, seem unrelated to the concept of a minimum queue
> > > size. Is the right concept really a "mapping chunk", and num_page
> > > would conceptually be "num_chunks", or something like that?  Then
> > > a queue must be at least one chunk in size, but that's derived from
> the
> > > chunk size, and is not the core concept.
> >
> > I think calling it "num_chunks" is fine.
> > May I use "num_chunks" in next version?
> >
> 
> I think first is the decision on what to use for MANA_MIN_QSIZE. I'm
> admittedly not familiar with mana and gdma, but the function
> mana_gd_create_dma_region() seems fairly typical in defining a
> logical region that spans multiple 4K chunks that may or may not
> be physically contiguous.  So you set up an array of physical
> addresses that identify the physical memory location of each chunk.
> It seems very similar to a Hyper-V GPADL. I typically *do* see such
> chunks called "pages", but a "mapping chunk" or "mapping unit"
> is probably OK too.  So mana_gd_create_dma_region() would use
> MANA_CHUNK_SIZE instead of MANA_MIN_QSIZE.  Then you could
> also define MANA_MIN_QSIZE to be MANA_CHUNK_SIZE, for use
> specifically when checking the size of a queue.
> 
> Then if you are using MANA_CHUNK_SIZE, the local variable
> would be "num_chunks".  The use of "page_count", "page_addr_list",
> and "offset_in_page" field names in struct
> gdma_create_dma_region_req should then be changed as well.

I'm fine with these names. I will check with Paul too.

I'd define just one macro, with a code comment like this. It 
will be used for chunk calculation and q len checking:
/* Chunk size of MANA DMA, which is also the min queue size */
#define MANA_CHUNK_SIZE 4096

 
> Looking further at the function mana_gd_create_dma_region(),
> there's also the use of offset_in_page(), which is based on the
> guest PAGE_SIZE.  Wouldn't this be problematic if PAGE_SIZE
> is 64K?

As in my other email - I confirmed with Hostnet team that the 
alignment requirement is just 4k.

So I will relax the following check to be 4k alignment too:
if (offset_in_page(gmi->virt_addr) != 0)
                return -EINVAL;

Thanks,
- Haiyang


^ permalink raw reply

* Re: [PATCHv11 05/19] x86/relocate_kernel: Use named labels for less confusion
From: Nikolay Borisov @ 2024-06-12 12:10 UTC (permalink / raw)
  To: H. Peter Anvin, Kirill A. Shutemov, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86
  Cc: Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Ard Biesheuvel, Baoquan He, K. Y. Srinivasan,
	Haiyang Zhang, kexec, linux-hyperv, linux-acpi, linux-coco,
	linux-kernel
In-Reply-To: <5d7dca1e-5889-4440-b3a0-48610f11200e@zytor.com>



On 3.06.24 г. 17:43 ч., H. Peter Anvin wrote:
> On 5/29/24 03:47, Nikolay Borisov wrote:
>>>
>>> diff --git a/arch/x86/kernel/relocate_kernel_64.S 
>>> b/arch/x86/kernel/relocate_kernel_64.S
>>> index 56cab1bb25f5..085eef5c3904 100644
>>> --- a/arch/x86/kernel/relocate_kernel_64.S
>>> +++ b/arch/x86/kernel/relocate_kernel_64.S
>>> @@ -148,9 +148,10 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>>>        */
>>>       movl    $X86_CR4_PAE, %eax
>>>       testq    $X86_CR4_LA57, %r13
>>> -    jz    1f
>>> +    jz    .Lno_la57
>>>       orl    $X86_CR4_LA57, %eax
>>> -1:
>>> +.Lno_la57:
>>> +
>>>       movq    %rax, %cr4
>>>       jmp 1f
>>
>> That jmp 1f becomes redundant now as it simply jumps 1 line below.
>>
> 
> Uh... am I the only person to notice that ALL that is needed here is:
> 
>      andl $(X86_CR4_PAE|X86_CR4_LA57), %r13d
>      movq %r13, %rax
> 
> ... since %r13 is dead afterwards, and PAE *will* have been set in %r13 
> already?
> 
> I don't believe that this specific jmp is actually needed -- there are 
> several more synchronizing jumps later -- but it doesn't hurt.
> 
> However, if the effort is for improving the readability, it might be 
> worthwhile to encapsulate the "jmp 1f; 1:" as a macro, e.g. "SYNC_CODE".


The preceding move to CR4 is itself a serializing instruction, no?

> 
>      -hpa

^ permalink raw reply

* Re: [PATCHv11 18/19] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
From: Borislav Petkov @ 2024-06-12  9:29 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Dave Hansen, x86, Rafael J. Wysocki,
	Peter Zijlstra, Adrian Hunter, Kuppuswamy Sathyanarayanan,
	Elena Reshetova, Jun Nakajima, Rick Edgecombe, Tom Lendacky,
	Kalra, Ashish, Sean Christopherson, Huang, Kai, Ard Biesheuvel,
	Baoquan He, H. Peter Anvin, K. Y. Srinivasan, Haiyang Zhang,
	kexec, linux-hyperv, linux-acpi, linux-coco, linux-kernel,
	Tao Liu
In-Reply-To: <2kc27uzrsvpevtvos2harqj3bgfkizi5dhhxkigswlylpnogr5@lk6fi2okv53i>

On Wed, Jun 12, 2024 at 12:24:30PM +0300, Kirill A. Shutemov wrote:
> I will try to deliver it in timely manner.

:-P

> > Yeah, we have a bunch of different pagetable manipulating things, all
> > with their peculiarities and unifying them and having a good set of APIs
> > which everything else uses, is always a good thing.
> 
> Will give it a try.
> 
> > And since we're talking cleanups, there's another thing I've been
> > looking at critically: CONFIG_X86_5LEVEL. Maybe it is time to get rid of
> > it and make the 5level stuff unconditional. And get rid of a bunch of
> > code since both vendors support 5level now...
> 
> Can do.

Much appreciated, thanks!

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply

* Re: [PATCHv11 18/19] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
From: Kirill A. Shutemov @ 2024-06-12  9:24 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Thomas Gleixner, Ingo Molnar, Dave Hansen, x86, Rafael J. Wysocki,
	Peter Zijlstra, Adrian Hunter, Kuppuswamy Sathyanarayanan,
	Elena Reshetova, Jun Nakajima, Rick Edgecombe, Tom Lendacky,
	Kalra, Ashish, Sean Christopherson, Huang, Kai, Ard Biesheuvel,
	Baoquan He, H. Peter Anvin, K. Y. Srinivasan, Haiyang Zhang,
	kexec, linux-hyperv, linux-acpi, linux-coco, linux-kernel,
	Tao Liu
In-Reply-To: <20240611194653.GGZmiprSNzK0JSJL17@fat_crate.local>

On Tue, Jun 11, 2024 at 09:46:53PM +0200, Borislav Petkov wrote:
> On Tue, Jun 11, 2024 at 06:47:05PM +0300, Kirill A. Shutemov wrote:
> > Borislav, given this code deduplication effort is not trivial, maybe we
> > can do it as a separate patchset on top of this one?
> 
> Sure, as long as it gets done and doesn't get delayed indefinitely by
> new and more important features enablement.

I will try to deliver it in timely manner.

> Usually, we do unifications and cleanups first - then new features but
> this kexec pile has been long in the making already...
> 
> > I also wounder if it makes sense to combine ident_map.c and
> > set_memory.c.  There's some overlap between the two.
> 
> Yeah, we have a bunch of different pagetable manipulating things, all
> with their peculiarities and unifying them and having a good set of APIs
> which everything else uses, is always a good thing.

Will give it a try.

> And since we're talking cleanups, there's another thing I've been
> looking at critically: CONFIG_X86_5LEVEL. Maybe it is time to get rid of
> it and make the 5level stuff unconditional. And get rid of a bunch of
> code since both vendors support 5level now...

Can do.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply

* Re: [PATCHv11 05/19] x86/relocate_kernel: Use named labels for less confusion
From: Kirill A. Shutemov @ 2024-06-12  9:22 UTC (permalink / raw)
  To: H. Peter Anvin, Andrew Cooper
  Cc: Borislav Petkov, Nikolay Borisov, Thomas Gleixner, Ingo Molnar,
	Dave Hansen, x86, Rafael J. Wysocki, Peter Zijlstra,
	Adrian Hunter, Kuppuswamy Sathyanarayanan, Elena Reshetova,
	Jun Nakajima, Rick Edgecombe, Tom Lendacky, Kalra, Ashish,
	Sean Christopherson, Huang, Kai, Ard Biesheuvel, Baoquan He,
	K. Y. Srinivasan, Haiyang Zhang, kexec, linux-hyperv, linux-acpi,
	linux-coco, linux-kernel
In-Reply-To: <5c8b3ee9-64c2-4ff3-9cca-ba2672b9635e@zytor.com>

On Tue, Jun 11, 2024 at 11:26:17AM -0700, H. Peter Anvin wrote:
> On 6/4/24 08:21, Kirill A. Shutemov wrote:
> > 
> >  From b45fe48092abad2612c2bafbb199e4de80c99545 Mon Sep 17 00:00:00 2001
> > From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> > Date: Fri, 10 Feb 2023 12:53:11 +0300
> > Subject: [PATCHv11.1 06/19] x86/kexec: Keep CR4.MCE set during kexec for TDX guest
> > 
> > TDX guests run with MCA enabled (CR4.MCE=1b) from the very start. If
> > that bit is cleared during CR4 register reprogramming during boot or
> > kexec flows, a #VE exception will be raised which the guest kernel
> > cannot handle it.
> > 
> > Therefore, make sure the CR4.MCE setting is preserved over kexec too and
> > avoid raising any #VEs.
> > 
> > The change doesn't affect non-TDX-guest environments.
> > 
> > Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > ---
> >   arch/x86/kernel/relocate_kernel_64.S | 17 ++++++++++-------
> >   1 file changed, 10 insertions(+), 7 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> > index 085eef5c3904..9c2cf70c5f54 100644
> > --- a/arch/x86/kernel/relocate_kernel_64.S
> > +++ b/arch/x86/kernel/relocate_kernel_64.S
> > @@ -5,6 +5,8 @@
> >    */
> >   #include <linux/linkage.h>
> > +#include <linux/stringify.h>
> > +#include <asm/alternative.h>
> >   #include <asm/page_types.h>
> >   #include <asm/kexec.h>
> >   #include <asm/processor-flags.h>
> > @@ -145,14 +147,15 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
> >   	 * Set cr4 to a known state:
> >   	 *  - physical address extension enabled
> >   	 *  - 5-level paging, if it was enabled before
> > +	 *  - Machine check exception on TDX guest, if it was enabled before.
> > +	 *    Clearing MCE might not be allowed in TDX guests, depending on setup.
> > +	 *
> > +	 * Use R13 that contains the original CR4 value, read in relocate_kernel().
> > +	 * PAE is always set in the original CR4.
> >   	 */
> > -	movl	$X86_CR4_PAE, %eax
> > -	testq	$X86_CR4_LA57, %r13
> > -	jz	.Lno_la57
> > -	orl	$X86_CR4_LA57, %eax
> > -.Lno_la57:
> > -
> > -	movq	%rax, %cr4
> > +	andl	$(X86_CR4_PAE | X86_CR4_LA57), %r13d
> > +	ALTERNATIVE "", __stringify(orl $X86_CR4_MCE, %r13d), X86_FEATURE_TDX_GUEST
> > +	movq	%r13, %cr4
> 
> If this is the case, I don't really see a reason to clear MCE per se as I'm
> guessing a machine check here will be fatal anyway? It just changes the
> method of death.

Andrew had a strong opinion on method of death here.

https://lore.kernel.org/all/1144340e-dd95-ee3b-dabb-579f9a65b3c7@citrix.com

> Also, is there a reason to save %cr4, run code, and *then* clear the
> relevant bits? Wouldn't it be better to sanitize %cr4 as soon as possible?

You mean set new CR4 directly in relocate_kernel() before switching CR3?
I guess it is possible.

But I can say I see huge benefit of changing it. Such change would have
own risks.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply

* RE: [PATCH net-next] net: mana: Add support for variable page sizes of ARM64
From: Michael Kelley @ 2024-06-12  2:42 UTC (permalink / raw)
  To: Haiyang Zhang, linux-hyperv@vger.kernel.org,
	netdev@vger.kernel.org, Paul Rosswurm
  Cc: Dexuan Cui, stephen@networkplumber.org, KY Srinivasan,
	olaf@aepfle.de, vkuznets, davem@davemloft.net, wei.liu@kernel.org,
	edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
	leon@kernel.org, Long Li, ssengar@linux.microsoft.com,
	linux-rdma@vger.kernel.org, daniel@iogearbox.net,
	john.fastabend@gmail.com, bpf@vger.kernel.org, ast@kernel.org,
	hawk@kernel.org, tglx@linutronix.de,
	shradhagupta@linux.microsoft.com, linux-kernel@vger.kernel.org
In-Reply-To: <DM6PR21MB1481E3F4E4E26765CCF6114DCAC72@DM6PR21MB1481.namprd21.prod.outlook.com>

From: Haiyang Zhang <haiyangz@microsoft.com> Sent: Tuesday, June 11, 2024 10:44 AM
> 
> > -----Original Message-----
> > From: Michael Kelley <mailto:mhklinux@outlook.com>
> > Sent: Tuesday, June 11, 2024 12:35 PM
> > To: Haiyang Zhang <mailto:haiyangz@microsoft.com>; mailto:linux-hyperv@vger.kernel.org;
> > mailto:netdev@vger.kernel.org
> > Cc: Dexuan Cui <mailto:decui@microsoft.com>; mailto:stephen@networkplumber.org; KY
> > Srinivasan <mailto:kys@microsoft.com>; Paul Rosswurm <mailto:paulros@microsoft.com>;
> > mailto:olaf@aepfle.de; vkuznets <mailto:vkuznets@redhat.com>; mailto:davem@davemloft.net;
> > mailto:wei.liu@kernel.org; mailto:edumazet@google.com; mailto:kuba@kernel.org;
> > mailto:pabeni@redhat.com; mailto:leon@kernel.org; Long Li <mailto:longli@microsoft.com>;
> > mailto:ssengar@linux.microsoft.com; mailto:linux-rdma@vger.kernel.org;
> > mailto:daniel@iogearbox.net; mailto:john.fastabend@gmail.com; mailto:bpf@vger.kernel.org;
> > mailto:ast@kernel.org; mailto:hawk@kernel.org; mailto:tglx@linutronix.de;
> > mailto:shradhagupta@linux.microsoft.com; mailto:linux-kernel@vger.kernel.org
> > Subject: RE: [PATCH net-next] net: mana: Add support for variable page
> > sizes of ARM64
> >
> > From: LKML haiyangz <mailto:lkmlhyz@microsoft.com> On Behalf Of Haiyang Zhang
> > Sent: Monday, June 10, 2024 2:23 PM
> > >
> > > As defined by the MANA Hardware spec, the queue size for DMA is 4KB
> > > minimal, and power of 2.
> >
> > You say the hardware requires 4K "minimal". But the definitions in this
> > patch hardcode to 4K, as if that's the only choice. Is the hardcoding to
> > 4K a design decision made to simplify the MANA driver?
> 
> The HWC q size has to be exactly 4k, which is by HW design.
> Other "regular" queues can be 2^n >= 4k.
> 
> >
> > > To support variable page sizes (4KB, 16KB, 64KB) of ARM64, define
> >
> > A minor nit, but "variable" page size doesn't seem like quite the right
> > description -- both here and in the Subject line.  On ARM64, the page size
> > is a choice among a few fixed options.  Perhaps call it support for "page sizes
> > other than 4K"?
> 
> "page sizes other than 4K" sounds good.
> 
> >
> > > the minimal queue size as a macro separate from the PAGE_SIZE, which
> > > we always assumed it to be 4KB before supporting ARM64.
> > > Also, update the relevant code related to size alignment, DMA region
> > > calculations, etc.
> > >
> > > Signed-off-by: Haiyang Zhang <mailto:haiyangz@microsoft.com>
> > > ---
> > >  drivers/net/ethernet/microsoft/Kconfig        |  2 +-
> > >  .../net/ethernet/microsoft/mana/gdma_main.c   |  8 +++----
> > >  .../net/ethernet/microsoft/mana/hw_channel.c  | 22 +++++++++----------
> > >  drivers/net/ethernet/microsoft/mana/mana_en.c |  8 +++----
> > >  .../net/ethernet/microsoft/mana/shm_channel.c |  9 ++++----
> > >  include/net/mana/gdma.h                       |  7 +++++-
> > >  include/net/mana/mana.h                       |  3 ++-
> > >  7 files changed, 33 insertions(+), 26 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/microsoft/Kconfig
> > > b/drivers/net/ethernet/microsoft/Kconfig
> > > index 286f0d5697a1..901fbffbf718 100644
> > > --- a/drivers/net/ethernet/microsoft/Kconfig
> > > +++ b/drivers/net/ethernet/microsoft/Kconfig
> > > @@ -18,7 +18,7 @@ if NET_VENDOR_MICROSOFT
> > >  config MICROSOFT_MANA
> > >  tristate "Microsoft Azure Network Adapter (MANA) support"
> > >  depends on PCI_MSI
> > > - depends on X86_64 || (ARM64 && !CPU_BIG_ENDIAN && ARM64_4K_PAGES)
> > > + depends on X86_64 || (ARM64 && !CPU_BIG_ENDIAN)
> > >  depends on PCI_HYPERV
> > >  select AUXILIARY_BUS
> > >  select PAGE_POOL
> > > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > index 1332db9a08eb..c9df942d0d02 100644
> > > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > @@ -182,7 +182,7 @@ int mana_gd_alloc_memory(struct gdma_context *gc,
> > > unsigned int length,
> > >  dma_addr_t dma_handle;
> > >  void *buf;
> > >
> > > - if (length < PAGE_SIZE || !is_power_of_2(length))
> > > + if (length < MANA_MIN_QSIZE || !is_power_of_2(length))
> > >         return -EINVAL;
> > >
> > >  gmi->dev = gc->dev;
> > > @@ -717,7 +717,7 @@ EXPORT_SYMBOL_NS(mana_gd_destroy_dma_region, NET_MANA);
> > >  static int mana_gd_create_dma_region(struct gdma_dev *gd,
> > >                          struct gdma_mem_info *gmi)
> > >  {
> > > - unsigned int num_page = gmi->length / PAGE_SIZE;
> > > + unsigned int num_page = gmi->length / MANA_MIN_QSIZE;
> >
> > This calculation seems a bit weird when using MANA_MIN_QSIZE. The
> > number of pages, and the construction of the page_addr_list array
> > a few lines later, seem unrelated to the concept of a minimum queue
> > size. Is the right concept really a "mapping chunk", and num_page
> > would conceptually be "num_chunks", or something like that?  Then
> > a queue must be at least one chunk in size, but that's derived from the
> > chunk size, and is not the core concept.
> 
> I think calling it "num_chunks" is fine.
> May I use "num_chunks" in next version?
> 

I think first is the decision on what to use for MANA_MIN_QSIZE. I'm
admittedly not familiar with mana and gdma, but the function
mana_gd_create_dma_region() seems fairly typical in defining a
logical region that spans multiple 4K chunks that may or may not
be physically contiguous.  So you set up an array of physical
addresses that identify the physical memory location of each chunk.
It seems very similar to a Hyper-V GPADL. I typically *do* see such
chunks called "pages", but a "mapping chunk" or "mapping unit"
is probably OK too.  So mana_gd_create_dma_region() would use
MANA_CHUNK_SIZE instead of MANA_MIN_QSIZE.  Then you could
also define MANA_MIN_QSIZE to be MANA_CHUNK_SIZE, for use
specifically when checking the size of a queue.

Then if you are using MANA_CHUNK_SIZE, the local variable
would be "num_chunks".  The use of "page_count", "page_addr_list",
and "offset_in_page" field names in struct
gdma_create_dma_region_req should then be changed as well.

Looking further at the function mana_gd_create_dma_region(),
there's also the use of offset_in_page(), which is based on the
guest PAGE_SIZE.  Wouldn't this be problematic if PAGE_SIZE
is 64K?

But perhaps Paul would weigh in further on his thoughts.

> >
> > Another approach might be to just call it "MANA_PAGE_SIZE", like
> > has been done with HV_HYP_PAGE_SIZE.  HV_HYP_PAGE_SIZE exists to
> > handle exactly the same issue of the guest PAGE_SIZE potentially
> > being different from the fixed 4K size that must be used in host-guest
> > communication on Hyper-V.  Same thing here with MANA.
> 
> I actually called it "MANA_PAGE_SIZE" in my previous internal patch.
> But Paul from Hostnet team opposed using that name, because
> 4kB is the min q size. MANA doesn't have "page" at HW level.
> 
> 
> > >  struct gdma_create_dma_region_req *req = NULL;
> > >  struct gdma_create_dma_region_resp resp = {};
> > >  struct gdma_context *gc = gd->gdma_context;
> > > @@ -727,7 +727,7 @@ static int mana_gd_create_dma_region(struct gdma_dev *gd,
> > >  int err;
> > >  int i;
> > >
> > > - if (length < PAGE_SIZE || !is_power_of_2(length))
> > > + if (length < MANA_MIN_QSIZE || !is_power_of_2(length))
> > >         return -EINVAL;
> > >
> > >  if (offset_in_page(gmi->virt_addr) != 0)
> > > @@ -751,7 +751,7 @@ static int mana_gd_create_dma_region(struct gdma_dev *gd,
> > >  req->page_addr_list_len = num_page;
> > >
> > >  for (i = 0; i < num_page; i++)
> > > -       req->page_addr_list[i] = gmi->dma_handle +  i * PAGE_SIZE;
> > > +       req->page_addr_list[i] = gmi->dma_handle +  i * MANA_MIN_QSIZE;
> > >
> > >  err = mana_gd_send_request(gc, req_msg_size, req, sizeof(resp), &resp);
> > >  if (err)
> > > diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c
> > > b/drivers/net/ethernet/microsoft/mana/hw_channel.c
> > > index bbc4f9e16c98..038dc31e09cd 100644
> > > --- a/drivers/net/ethernet/microsoft/mana/hw_channel.c
> > > +++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c
> > > @@ -362,12 +362,12 @@ static int mana_hwc_create_cq(struct hw_channel_context
> > > *hwc, u16 q_depth,
> > >  int err;
> > >
> > >  eq_size = roundup_pow_of_two(GDMA_EQE_SIZE * q_depth);
> > > - if (eq_size < MINIMUM_SUPPORTED_PAGE_SIZE)
> > > -       eq_size = MINIMUM_SUPPORTED_PAGE_SIZE;
> > > + if (eq_size < MANA_MIN_QSIZE)
> > > +       eq_size = MANA_MIN_QSIZE;
> > >
> > >  cq_size = roundup_pow_of_two(GDMA_CQE_SIZE * q_depth);
> > > - if (cq_size < MINIMUM_SUPPORTED_PAGE_SIZE)
> > > -       cq_size = MINIMUM_SUPPORTED_PAGE_SIZE;
> > > + if (cq_size < MANA_MIN_QSIZE)
> > > +       cq_size = MANA_MIN_QSIZE;
> > >
> > >  hwc_cq = kzalloc(sizeof(*hwc_cq), GFP_KERNEL);
> > >  if (!hwc_cq)
> > > @@ -429,7 +429,7 @@ static int mana_hwc_alloc_dma_buf(struct hw_channel_context *hwc, u16 q_depth,
> > >
> > >  dma_buf->num_reqs = q_depth;
> > >
> > > - buf_size = PAGE_ALIGN(q_depth * max_msg_size);
> > > + buf_size = MANA_MIN_QALIGN(q_depth * max_msg_size);
> > >
> > >  gmi = &dma_buf->mem_info;
> > >  err = mana_gd_alloc_memory(gc, buf_size, gmi);
> > > @@ -497,8 +497,8 @@ static int mana_hwc_create_wq(struct hw_channel_context
> > > *hwc,
> > >  else
> > >         queue_size = roundup_pow_of_two(GDMA_MAX_SQE_SIZE * q_depth);
> > >
> > > - if (queue_size < MINIMUM_SUPPORTED_PAGE_SIZE)
> > > -       queue_size = MINIMUM_SUPPORTED_PAGE_SIZE;
> > > + if (queue_size < MANA_MIN_QSIZE)
> > > +       queue_size = MANA_MIN_QSIZE;
> > >
> > >  hwc_wq = kzalloc(sizeof(*hwc_wq), GFP_KERNEL);
> > >  if (!hwc_wq)
> > > @@ -628,10 +628,10 @@ static int mana_hwc_establish_channel(struct
> > > gdma_context *gc, u16 *q_depth,
> > >  init_completion(&hwc->hwc_init_eqe_comp);
> > >
> > >  err = mana_smc_setup_hwc(&gc->shm_channel, false,
> > > -                   eq->mem_info.dma_handle,
> > > -                   cq->mem_info.dma_handle,
> > > -                   rq->mem_info.dma_handle,
> > > -                   sq->mem_info.dma_handle,
> > > +                   virt_to_phys(eq->mem_info.virt_addr),
> > > +                   virt_to_phys(cq->mem_info.virt_addr),
> > > +                   virt_to_phys(rq->mem_info.virt_addr),
> > > +                   virt_to_phys(sq->mem_info.virt_addr),
> >
> > This change seems unrelated to handling guest PAGE_SIZE values
> > other than 4K.  Does it belong in a separate patch?  Or maybe it just
> > needs an explanation in the commit message of this patch?
> 
> I know dma_handle is usually just the phys adr. But this is not always
> True if IOMMU is used...
> I have no problem to put it to a separate patch if desired.

Yes, that would seem like a separate patch.  It's not related to handling
page size other than 4K.

> 
> >
> > >                      eq->eq.msix_index);
> > >  if (err)
> > >         return err;
> > > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > > b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > > index d087cf954f75..6a891dbce686 100644
> > > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > > @@ -1889,10 +1889,10 @@ static int mana_create_txq(struct mana_port_context
> > > *apc,
> > >   *  to prevent overflow.
> > >   */
> > >  txq_size = MAX_SEND_BUFFERS_PER_QUEUE * 32;
> > > - BUILD_BUG_ON(!PAGE_ALIGNED(txq_size));
> > > + BUILD_BUG_ON(!MANA_MIN_QALIGNED(txq_size));
> > >
> > >  cq_size = MAX_SEND_BUFFERS_PER_QUEUE * COMP_ENTRY_SIZE;
> > > - cq_size = PAGE_ALIGN(cq_size);
> > > + cq_size = MANA_MIN_QALIGN(cq_size);
> > >
> > >  gc = gd->gdma_context;
> > >
> > > @@ -2189,8 +2189,8 @@ static struct mana_rxq *mana_create_rxq(struct mana_port_context *apc,
> > >  if (err)
> > >         goto out;
> > >
> > > - rq_size = PAGE_ALIGN(rq_size);
> > > - cq_size = PAGE_ALIGN(cq_size);
> > > + rq_size = MANA_MIN_QALIGN(rq_size);
> > > + cq_size = MANA_MIN_QALIGN(cq_size);
> > >
> > >  /* Create RQ */
> > >  memset(&spec, 0, sizeof(spec));
> > > diff --git a/drivers/net/ethernet/microsoft/mana/shm_channel.c
> > > b/drivers/net/ethernet/microsoft/mana/shm_channel.c
> > > index 5553af9c8085..9a54a163d8d1 100644
> > > --- a/drivers/net/ethernet/microsoft/mana/shm_channel.c
> > > +++ b/drivers/net/ethernet/microsoft/mana/shm_channel.c
> > > @@ -6,6 +6,7 @@
> > >  #include <linux/io.h>
> > >  #include <linux/mm.h>
> > >
> > > +#include <net/mana/gdma.h>
> > >  #include <net/mana/shm_channel.h>
> > >
> > >  #define PAGE_FRAME_L48_WIDTH_BYTES 6
> > > @@ -183,7 +184,7 @@ int mana_smc_setup_hwc(struct shm_channel *sc, bool reset_vf, u64 eq_addr,
> > >
> > >  /* EQ addr: low 48 bits of frame address */
> > >  shmem = (u64 *)ptr;
> > > - frame_addr = PHYS_PFN(eq_addr);
> > > + frame_addr = MANA_PFN(eq_addr);
> > >  *shmem = frame_addr & PAGE_FRAME_L48_MASK;
> > >  all_addr_h4bits |= (frame_addr >> PAGE_FRAME_L48_WIDTH_BITS) <<
> > >         (frame_addr_seq++ * PAGE_FRAME_H4_WIDTH_BITS);
> >
> > In mana_smc_setup_hwc() a few lines above this change, code using
> > PAGE_ALIGNED() is unchanged.  Is it correct that the eq/cq/rq/sq
> > addresses
> > must be aligned to 64K if PAGE_SIZE is 64K?
> 
> Since we still using PHYS_PFN on them, if not aligned to PAGE_SIZE,
> the lower bits may be lost. (You said the same below.)
> 
> >
> > Related, I wonder about how MANA_PFN() is defined. If PAGE_SIZE is 64K,
> > MANA_PFN() will first right-shift 16, then left shift 4. The net is
> > right-shift 12,
> > corresponding to the 4K chunks that MANA expects. But that approach
> > guarantees
> > that the rightmost 4 bits of the MANA PFN will always be zero. That's
> > consistent
> > with requiring the addresses to be PAGE_ALIGNED() to 64K, but I'm unclear
> > whether
> > that is really the requirement. You might compare with the definition of
> > HVPFN_DOWN(), which has a similar goal for Linux guests communicating
> > with
> > Hyper-V.
> 
> @Paul Rosswurm You said MANA HW has "no page concept". So the "frame_addr"
> In the mana_smc_setup_hwc() is NOT related to physical page number, correct?
> Can we just use phys_adr >> 12 like below?
> 
> #define MANA_MIN_QSHIFT 12
> #define MANA_PFN(a) ((a) >> MANA_MIN_QSHIFT)
> 
>       /* EQ addr: low 48 bits of frame address */
>      shmem = (u64 *)ptr;
> -     frame_addr = PHYS_PFN(eq_addr);
> +     frame_addr = MANA_PFN(eq_addr);
> 
> >
> > > @@ -191,7 +192,7 @@ int mana_smc_setup_hwc(struct shm_channel *sc, bool
> > > reset_vf, u64 eq_addr,
> > >
> > >  /* CQ addr: low 48 bits of frame address */
> > >  shmem = (u64 *)ptr;
> > > - frame_addr = PHYS_PFN(cq_addr);
> > > + frame_addr = MANA_PFN(cq_addr);
> > >  *shmem = frame_addr & PAGE_FRAME_L48_MASK;
> > >  all_addr_h4bits |= (frame_addr >> PAGE_FRAME_L48_WIDTH_BITS) <<
> > >         (frame_addr_seq++ * PAGE_FRAME_H4_WIDTH_BITS);
> > > @@ -199,7 +200,7 @@ int mana_smc_setup_hwc(struct shm_channel *sc, bool
> > > reset_vf, u64 eq_addr,
> > >
> > >  /* RQ addr: low 48 bits of frame address */
> > >  shmem = (u64 *)ptr;
> > > - frame_addr = PHYS_PFN(rq_addr);
> > > + frame_addr = MANA_PFN(rq_addr);
> > >  *shmem = frame_addr & PAGE_FRAME_L48_MASK;
> > >  all_addr_h4bits |= (frame_addr >> PAGE_FRAME_L48_WIDTH_BITS) <<
> > >         (frame_addr_seq++ * PAGE_FRAME_H4_WIDTH_BITS);
> > > @@ -207,7 +208,7 @@ int mana_smc_setup_hwc(struct shm_channel *sc, bool
> > > reset_vf, u64 eq_addr,
> > >
> > >  /* SQ addr: low 48 bits of frame address */
> > >  shmem = (u64 *)ptr;
> > > - frame_addr = PHYS_PFN(sq_addr);
> > > + frame_addr = MANA_PFN(sq_addr);
> > >  *shmem = frame_addr & PAGE_FRAME_L48_MASK;
> > >  all_addr_h4bits |= (frame_addr >> PAGE_FRAME_L48_WIDTH_BITS) <<
> > >         (frame_addr_seq++ * PAGE_FRAME_H4_WIDTH_BITS);
> > > diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
> > > index 27684135bb4d..b392559c33e9 100644
> > > --- a/include/net/mana/gdma.h
> > > +++ b/include/net/mana/gdma.h
> > > @@ -224,7 +224,12 @@ struct gdma_dev {
> > >  struct auxiliary_device *adev;
> > >  };
> > >
> > > -#define MINIMUM_SUPPORTED_PAGE_SIZE PAGE_SIZE
> > > +/* These are defined by HW */
> > > +#define MANA_MIN_QSHIFT 12
> > > +#define MANA_MIN_QSIZE (1 << MANA_MIN_QSHIFT)
> > > +#define MANA_MIN_QALIGN(x) ALIGN((x), MANA_MIN_QSIZE)
> > > +#define MANA_MIN_QALIGNED(addr) IS_ALIGNED((unsigned long)(addr),
> > MANA_MIN_QSIZE)
> > > +#define MANA_PFN(a) (PHYS_PFN(a) << (PAGE_SHIFT - MANA_MIN_QSHIFT))
> >
> > See comments above about how this is defined.
> 
> Replied above.
> Thank you for all the detailed comments!
> 
> - Haiyang


^ permalink raw reply

* Re: [PATCH v1 1/3] mm: pass meminit_context to __free_pages_core()
From: David Hildenbrand @ 2024-06-11 19:50 UTC (permalink / raw)
  To: Tim Chen, linux-kernel
  Cc: linux-mm, linux-hyperv, virtualization, xen-devel, kasan-dev,
	Andrew Morton, Mike Rapoport, Oscar Salvador, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Juergen Gross,
	Stefano Stabellini, Oleksandr Tyshchenko, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov
In-Reply-To: <80532f73e52e2c21fdc9aac7bce24aefb76d11b0.camel@linux.intel.com>

On 11.06.24 21:41, Tim Chen wrote:
> On Fri, 2024-06-07 at 11:09 +0200, David Hildenbrand wrote:
>> In preparation for further changes, let's teach __free_pages_core()
>> about the differences of memory hotplug handling.
>>
>> Move the memory hotplug specific handling from generic_online_page() to
>> __free_pages_core(), use adjust_managed_page_count() on the memory
>> hotplug path, and spell out why memory freed via memblock
>> cannot currently use adjust_managed_page_count().
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   mm/internal.h       |  3 ++-
>>   mm/kmsan/init.c     |  2 +-
>>   mm/memory_hotplug.c |  9 +--------
>>   mm/mm_init.c        |  4 ++--
>>   mm/page_alloc.c     | 17 +++++++++++++++--
>>   5 files changed, 21 insertions(+), 14 deletions(-)
>>
>> diff --git a/mm/internal.h b/mm/internal.h
>> index 12e95fdf61e90..3fdee779205ab 100644
>> --- a/mm/internal.h
>> +++ b/mm/internal.h
>> @@ -604,7 +604,8 @@ extern void __putback_isolated_page(struct page *page, unsigned int order,
>>   				    int mt);
>>   extern void memblock_free_pages(struct page *page, unsigned long pfn,
>>   					unsigned int order);
>> -extern void __free_pages_core(struct page *page, unsigned int order);
>> +extern void __free_pages_core(struct page *page, unsigned int order,
>> +		enum meminit_context);
> 
> Shouldn't the above be
> 		enum meminit_context context);

Although C allows parameters without names in declarations, this was 
unintended.

Thanks!

-- 
Cheers,

David / dhildenb


^ permalink raw reply

* Re: [PATCHv11 18/19] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
From: Borislav Petkov @ 2024-06-11 19:46 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Thomas Gleixner, Ingo Molnar, Dave Hansen, x86, Rafael J. Wysocki,
	Peter Zijlstra, Adrian Hunter, Kuppuswamy Sathyanarayanan,
	Elena Reshetova, Jun Nakajima, Rick Edgecombe, Tom Lendacky,
	Kalra, Ashish, Sean Christopherson, Huang, Kai, Ard Biesheuvel,
	Baoquan He, H. Peter Anvin, K. Y. Srinivasan, Haiyang Zhang,
	kexec, linux-hyperv, linux-acpi, linux-coco, linux-kernel,
	Tao Liu
In-Reply-To: <nh7cihzlsjtoddtec6m62biqdn62k3ka5svs6m64qekhpebu5z@dkplwad2urgp>

On Tue, Jun 11, 2024 at 06:47:05PM +0300, Kirill A. Shutemov wrote:
> Borislav, given this code deduplication effort is not trivial, maybe we
> can do it as a separate patchset on top of this one?

Sure, as long as it gets done and doesn't get delayed indefinitely by
new and more important features enablement.

Usually, we do unifications and cleanups first - then new features but
this kexec pile has been long in the making already...

> I also wounder if it makes sense to combine ident_map.c and
> set_memory.c.  There's some overlap between the two.

Yeah, we have a bunch of different pagetable manipulating things, all
with their peculiarities and unifying them and having a good set of APIs
which everything else uses, is always a good thing.

And since we're talking cleanups, there's another thing I've been
looking at critically: CONFIG_X86_5LEVEL. Maybe it is time to get rid of
it and make the 5level stuff unconditional. And get rid of a bunch of
code since both vendors support 5level now...

Thx.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply

* Re: [PATCH v1 1/3] mm: pass meminit_context to __free_pages_core()
From: Tim Chen @ 2024-06-11 19:41 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, linux-hyperv, virtualization, xen-devel, kasan-dev,
	Andrew Morton, Mike Rapoport, Oscar Salvador, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Juergen Gross,
	Stefano Stabellini, Oleksandr Tyshchenko, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov
In-Reply-To: <20240607090939.89524-2-david@redhat.com>

On Fri, 2024-06-07 at 11:09 +0200, David Hildenbrand wrote:
> In preparation for further changes, let's teach __free_pages_core()
> about the differences of memory hotplug handling.
> 
> Move the memory hotplug specific handling from generic_online_page() to
> __free_pages_core(), use adjust_managed_page_count() on the memory
> hotplug path, and spell out why memory freed via memblock
> cannot currently use adjust_managed_page_count().
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/internal.h       |  3 ++-
>  mm/kmsan/init.c     |  2 +-
>  mm/memory_hotplug.c |  9 +--------
>  mm/mm_init.c        |  4 ++--
>  mm/page_alloc.c     | 17 +++++++++++++++--
>  5 files changed, 21 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/internal.h b/mm/internal.h
> index 12e95fdf61e90..3fdee779205ab 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -604,7 +604,8 @@ extern void __putback_isolated_page(struct page *page, unsigned int order,
>  				    int mt);
>  extern void memblock_free_pages(struct page *page, unsigned long pfn,
>  					unsigned int order);
> -extern void __free_pages_core(struct page *page, unsigned int order);
> +extern void __free_pages_core(struct page *page, unsigned int order,
> +		enum meminit_context);

Shouldn't the above be 
		enum meminit_context context);
>  
>  /*
>   * This will have no effect, other than possibly generating a warning, if the

Thanks.

Tim

^ permalink raw reply

* Re: [PATCH v1 2/3] mm/memory_hotplug: initialize memmap of !ZONE_DEVICE with PageOffline() instead of PageReserved()
From: Andrew Morton @ 2024-06-11 19:36 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-hyperv, virtualization, xen-devel,
	kasan-dev, Mike Rapoport, Oscar Salvador, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Juergen Gross,
	Stefano Stabellini, Oleksandr Tyshchenko, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov
In-Reply-To: <824c319a-530e-4153-80f5-20e2c463fa81@redhat.com>

On Tue, 11 Jun 2024 11:42:56 +0200 David Hildenbrand <david@redhat.com> wrote:

> > We'll leave the ZONE_DEVICE case alone for now.
> > 
> 
> @Andrew, can we add here:
> 
> "Note that self-hosted vmemmap pages will no longer be marked as 
> reserved. This matches ordinary vmemmap pages allocated from the buddy 
> during memory hotplug. Now, really only vmemmap pages allocated from 
> memblock during early boot will be marked reserved. Existing 
> PageReserved() checks seem to be handling all relevant cases correctly 
> even after this change."

Done, thanks.

^ permalink raw reply

* Re: [PATCH v1 1/3] mm: pass meminit_context to __free_pages_core()
From: David Hildenbrand @ 2024-06-11 19:20 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, linux-hyperv, virtualization, xen-devel,
	kasan-dev, Mike Rapoport, Oscar Salvador, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Juergen Gross,
	Stefano Stabellini, Oleksandr Tyshchenko, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov
In-Reply-To: <20240611121942.050a2215143af0ecb576122f@linux-foundation.org>

On 11.06.24 21:19, Andrew Morton wrote:
> On Tue, 11 Jun 2024 12:06:56 +0200 David Hildenbrand <david@redhat.com> wrote:
> 
>> On 07.06.24 11:09, David Hildenbrand wrote:
>>> In preparation for further changes, let's teach __free_pages_core()
>>> about the differences of memory hotplug handling.
>>>
>>> Move the memory hotplug specific handling from generic_online_page() to
>>> __free_pages_core(), use adjust_managed_page_count() on the memory
>>> hotplug path, and spell out why memory freed via memblock
>>> cannot currently use adjust_managed_page_count().
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>
>> @Andrew, can you squash the following?
> 
> Sure.
> 
> I queued it against "mm: pass meminit_context to __free_pages_core()",
> not against

Ah yes, sorry. Thanks!

-- 
Cheers,

David / dhildenb


^ permalink raw reply

* Re: [PATCH v1 1/3] mm: pass meminit_context to __free_pages_core()
From: Andrew Morton @ 2024-06-11 19:19 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, linux-hyperv, virtualization, xen-devel,
	kasan-dev, Mike Rapoport, Oscar Salvador, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Juergen Gross,
	Stefano Stabellini, Oleksandr Tyshchenko, Alexander Potapenko,
	Marco Elver, Dmitry Vyukov
In-Reply-To: <2ed64218-7f3b-4302-a5dc-27f060654fe2@redhat.com>

On Tue, 11 Jun 2024 12:06:56 +0200 David Hildenbrand <david@redhat.com> wrote:

> On 07.06.24 11:09, David Hildenbrand wrote:
> > In preparation for further changes, let's teach __free_pages_core()
> > about the differences of memory hotplug handling.
> > 
> > Move the memory hotplug specific handling from generic_online_page() to
> > __free_pages_core(), use adjust_managed_page_count() on the memory
> > hotplug path, and spell out why memory freed via memblock
> > cannot currently use adjust_managed_page_count().
> > 
> > Signed-off-by: David Hildenbrand <david@redhat.com>
> > ---
> 
> @Andrew, can you squash the following?

Sure.

I queued it against "mm: pass meminit_context to __free_pages_core()",
not against

> Subject: [PATCH] fixup: mm/highmem: make nr_free_highpages() return "unsigned
>   long"


^ permalink raw reply

* Re: [PATCHv11 05/19] x86/relocate_kernel: Use named labels for less confusion
From: H. Peter Anvin @ 2024-06-11 18:26 UTC (permalink / raw)
  To: Kirill A. Shutemov, Borislav Petkov
  Cc: Nikolay Borisov, Thomas Gleixner, Ingo Molnar, Dave Hansen, x86,
	Rafael J. Wysocki, Peter Zijlstra, Adrian Hunter,
	Kuppuswamy Sathyanarayanan, Elena Reshetova, Jun Nakajima,
	Rick Edgecombe, Tom Lendacky, Kalra, Ashish, Sean Christopherson,
	Huang, Kai, Ard Biesheuvel, Baoquan He, K. Y. Srinivasan,
	Haiyang Zhang, kexec, linux-hyperv, linux-acpi, linux-coco,
	linux-kernel
In-Reply-To: <ehttxqgg7zhbgty5m5uxkduj3xf7soonrzfu4rfw7hccqgdydl@afki66pnree5>

On 6/4/24 08:21, Kirill A. Shutemov wrote:
> 
>  From b45fe48092abad2612c2bafbb199e4de80c99545 Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
> Date: Fri, 10 Feb 2023 12:53:11 +0300
> Subject: [PATCHv11.1 06/19] x86/kexec: Keep CR4.MCE set during kexec for TDX guest
> 
> TDX guests run with MCA enabled (CR4.MCE=1b) from the very start. If
> that bit is cleared during CR4 register reprogramming during boot or
> kexec flows, a #VE exception will be raised which the guest kernel
> cannot handle it.
> 
> Therefore, make sure the CR4.MCE setting is preserved over kexec too and
> avoid raising any #VEs.
> 
> The change doesn't affect non-TDX-guest environments.
> 
> Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> ---
>   arch/x86/kernel/relocate_kernel_64.S | 17 ++++++++++-------
>   1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kernel/relocate_kernel_64.S b/arch/x86/kernel/relocate_kernel_64.S
> index 085eef5c3904..9c2cf70c5f54 100644
> --- a/arch/x86/kernel/relocate_kernel_64.S
> +++ b/arch/x86/kernel/relocate_kernel_64.S
> @@ -5,6 +5,8 @@
>    */
>   
>   #include <linux/linkage.h>
> +#include <linux/stringify.h>
> +#include <asm/alternative.h>
>   #include <asm/page_types.h>
>   #include <asm/kexec.h>
>   #include <asm/processor-flags.h>
> @@ -145,14 +147,15 @@ SYM_CODE_START_LOCAL_NOALIGN(identity_mapped)
>   	 * Set cr4 to a known state:
>   	 *  - physical address extension enabled
>   	 *  - 5-level paging, if it was enabled before
> +	 *  - Machine check exception on TDX guest, if it was enabled before.
> +	 *    Clearing MCE might not be allowed in TDX guests, depending on setup.
> +	 *
> +	 * Use R13 that contains the original CR4 value, read in relocate_kernel().
> +	 * PAE is always set in the original CR4.
>   	 */
> -	movl	$X86_CR4_PAE, %eax
> -	testq	$X86_CR4_LA57, %r13
> -	jz	.Lno_la57
> -	orl	$X86_CR4_LA57, %eax
> -.Lno_la57:
> -
> -	movq	%rax, %cr4
> +	andl	$(X86_CR4_PAE | X86_CR4_LA57), %r13d
> +	ALTERNATIVE "", __stringify(orl $X86_CR4_MCE, %r13d), X86_FEATURE_TDX_GUEST
> +	movq	%r13, %cr4
>   

If this is the case, I don't really see a reason to clear MCE per se as 
I'm guessing a machine check here will be fatal anyway? It just changes 
the method of death.

Also, is there a reason to save %cr4, run code, and *then* clear the 
relevant bits? Wouldn't it be better to sanitize %cr4 as soon as possible?

	-hpa

^ permalink raw reply

* RE: [PATCH net-next] net: mana: Add support for variable page sizes of ARM64
From: Haiyang Zhang @ 2024-06-11 18:03 UTC (permalink / raw)
  To: Haiyang Zhang, Michael Kelley, linux-hyperv@vger.kernel.org,
	netdev@vger.kernel.org, Paul Rosswurm
  Cc: Dexuan Cui, stephen@networkplumber.org, KY Srinivasan,
	olaf@aepfle.de, vkuznets, davem@davemloft.net, wei.liu@kernel.org,
	edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
	leon@kernel.org, Long Li, ssengar@linux.microsoft.com,
	linux-rdma@vger.kernel.org, daniel@iogearbox.net,
	john.fastabend@gmail.com, bpf@vger.kernel.org, ast@kernel.org,
	hawk@kernel.org, tglx@linutronix.de,
	shradhagupta@linux.microsoft.com, linux-kernel@vger.kernel.org
In-Reply-To: <DM6PR21MB1481E3F4E4E26765CCF6114DCAC72@DM6PR21MB1481.namprd21.prod.outlook.com>



> -----Original Message-----
> From: Haiyang Zhang <haiyangz@microsoft.com>
> Sent: Tuesday, June 11, 2024 1:44 PM
> To: Michael Kelley <mhklinux@outlook.com>; linux-hyperv@vger.kernel.org;
> netdev@vger.kernel.org; Paul Rosswurm <paulros@microsoft.com>
> Cc: Dexuan Cui <decui@microsoft.com>; stephen@networkplumber.org; KY
> Srinivasan <kys@microsoft.com>; olaf@aepfle.de; vkuznets
> <vkuznets@redhat.com>; davem@davemloft.net; wei.liu@kernel.org;
> edumazet@google.com; kuba@kernel.org; pabeni@redhat.com; leon@kernel.org;
> Long Li <longli@microsoft.com>; ssengar@linux.microsoft.com; linux-
> rdma@vger.kernel.org; daniel@iogearbox.net; john.fastabend@gmail.com;
> bpf@vger.kernel.org; ast@kernel.org; hawk@kernel.org; tglx@linutronix.de;
> shradhagupta@linux.microsoft.com; linux-kernel@vger.kernel.org
> Subject: RE: [PATCH net-next] net: mana: Add support for variable page
> sizes of ARM64


> > > @@ -183,7 +184,7 @@ int mana_smc_setup_hwc(struct shm_channel *sc,
> bool
> > > reset_vf, u64 eq_addr,
> > >
> > >  /* EQ addr: low 48 bits of frame address */
> > >  shmem = (u64 *)ptr;
> > > - frame_addr = PHYS_PFN(eq_addr);
> > > + frame_addr = MANA_PFN(eq_addr);
> > >  *shmem = frame_addr & PAGE_FRAME_L48_MASK;
> > >  all_addr_h4bits |= (frame_addr >> PAGE_FRAME_L48_WIDTH_BITS) <<
> > >         (frame_addr_seq++ * PAGE_FRAME_H4_WIDTH_BITS);
> >
> > In mana_smc_setup_hwc() a few lines above this change, code using
> > PAGE_ALIGNED() is unchanged.  Is it correct that the eq/cq/rq/sq
> > addresses
> > must be aligned to 64K if PAGE_SIZE is 64K?
> 
> Since we still using PHYS_PFN on them, if not aligned to PAGE_SIZE,
> the lower bits may be lost. (You said the same below.)
> 
> >
> > Related, I wonder about how MANA_PFN() is defined. If PAGE_SIZE is 64K,
> > MANA_PFN() will first right-shift 16, then left shift 4. The net is
> > right-shift 12,
> > corresponding to the 4K chunks that MANA expects. But that approach
> > guarantees
> > that the rightmost 4 bits of the MANA PFN will always be zero. That's
> > consistent
> > with requiring the addresses to be PAGE_ALIGNED() to 64K, but I'm
> unclear
> > whether
> > that is really the requirement. You might compare with the definition
> of
> > HVPFN_DOWN(), which has a similar goal for Linux guests communicating
> > with
> > Hyper-V.
> 
> @Paul Rosswurm You said MANA HW has "no page concept". So the
> "frame_addr"
> In the mana_smc_setup_hwc() is NOT related to physical page number,
> correct?
> Can we just use phys_adr >> 12 like below?
> 
> #define MANA_MIN_QSHIFT 12
> #define MANA_PFN(a) ((a) >> MANA_MIN_QSHIFT)
> 
>       /* EQ addr: low 48 bits of frame address */
>      shmem = (u64 *)ptr;
> -     frame_addr = PHYS_PFN(eq_addr);
> +     frame_addr = MANA_PFN(eq_addr);
> 

I just confirmed with Paul, we can use phys_adr >> 12.
And I will change the alignment requirements to be 4k.

Thanks,
- Haiyang


^ permalink raw reply

* RE: [PATCH net-next] net: mana: Add support for variable page sizes of ARM64
From: Haiyang Zhang @ 2024-06-11 17:43 UTC (permalink / raw)
  To: Michael Kelley, linux-hyperv@vger.kernel.org,
	netdev@vger.kernel.org, Paul Rosswurm
  Cc: Dexuan Cui, stephen@networkplumber.org, KY Srinivasan,
	olaf@aepfle.de, vkuznets, davem@davemloft.net, wei.liu@kernel.org,
	edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
	leon@kernel.org, Long Li, ssengar@linux.microsoft.com,
	linux-rdma@vger.kernel.org, daniel@iogearbox.net,
	john.fastabend@gmail.com, bpf@vger.kernel.org, ast@kernel.org,
	hawk@kernel.org, tglx@linutronix.de,
	shradhagupta@linux.microsoft.com, linux-kernel@vger.kernel.org
In-Reply-To: <DM6PR21MB14818F4519381967A9FEE8B8CAC72@DM6PR21MB1481.namprd21.prod.outlook.com>

(resending in plain text)

> -----Original Message-----
> From: Michael Kelley <mailto:mhklinux@outlook.com>
> Sent: Tuesday, June 11, 2024 12:35 PM
> To: Haiyang Zhang <mailto:haiyangz@microsoft.com>; mailto:linux-hyperv@vger.kernel.org;
> mailto:netdev@vger.kernel.org
> Cc: Dexuan Cui <mailto:decui@microsoft.com>; mailto:stephen@networkplumber.org; KY
> Srinivasan <mailto:kys@microsoft.com>; Paul Rosswurm <mailto:paulros@microsoft.com>;
> mailto:olaf@aepfle.de; vkuznets <mailto:vkuznets@redhat.com>; mailto:davem@davemloft.net;
> mailto:wei.liu@kernel.org; mailto:edumazet@google.com; mailto:kuba@kernel.org;
> mailto:pabeni@redhat.com; mailto:leon@kernel.org; Long Li <mailto:longli@microsoft.com>;
> mailto:ssengar@linux.microsoft.com; mailto:linux-rdma@vger.kernel.org;
> mailto:daniel@iogearbox.net; mailto:john.fastabend@gmail.com; mailto:bpf@vger.kernel.org;
> mailto:ast@kernel.org; mailto:hawk@kernel.org; mailto:tglx@linutronix.de;
> mailto:shradhagupta@linux.microsoft.com; mailto:linux-kernel@vger.kernel.org
> Subject: RE: [PATCH net-next] net: mana: Add support for variable page
> sizes of ARM64
> 
> From: LKML haiyangz <mailto:lkmlhyz@microsoft.com> On Behalf Of Haiyang Zhang
> Sent: Monday, June 10, 2024 2:23 PM
> >
> > As defined by the MANA Hardware spec, the queue size for DMA is 4KB
> > minimal, and power of 2.
> 
> You say the hardware requires 4K "minimal". But the definitions in this
> patch hardcode to 4K, as if that's the only choice. Is the hardcoding to
> 4K a design decision made to simplify the MANA driver?

The HWC q size has to be exactly 4k, which is by HW design. 
Other "regular" queues can be 2^n >= 4k.

> 
> > To support variable page sizes (4KB, 16KB, 64KB) of ARM64, define
> 
> A minor nit, but "variable" page size doesn't seem like quite the right
> description -- both here and in the Subject line.  On ARM64, the page
> size
> is a choice among a few fixed options.  Perhaps call it support for "page
> sizes
> other than 4K"?

"page sizes other than 4K" sounds good.

> 
> > the minimal queue size as a macro separate from the PAGE_SIZE, which
> > we always assumed it to be 4KB before supporting ARM64.
> > Also, update the relevant code related to size alignment, DMA region
> > calculations, etc.
> >
> > Signed-off-by: Haiyang Zhang <mailto:haiyangz@microsoft.com>
> > ---
> >  drivers/net/ethernet/microsoft/Kconfig        |  2 +-
> >  .../net/ethernet/microsoft/mana/gdma_main.c   |  8 +++----
> >  .../net/ethernet/microsoft/mana/hw_channel.c  | 22 +++++++++----------
> >  drivers/net/ethernet/microsoft/mana/mana_en.c |  8 +++----
> >  .../net/ethernet/microsoft/mana/shm_channel.c |  9 ++++----
> >  include/net/mana/gdma.h                       |  7 +++++-
> >  include/net/mana/mana.h                       |  3 ++-
> >  7 files changed, 33 insertions(+), 26 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/microsoft/Kconfig
> > b/drivers/net/ethernet/microsoft/Kconfig
> > index 286f0d5697a1..901fbffbf718 100644
> > --- a/drivers/net/ethernet/microsoft/Kconfig
> > +++ b/drivers/net/ethernet/microsoft/Kconfig
> > @@ -18,7 +18,7 @@ if NET_VENDOR_MICROSOFT
> >  config MICROSOFT_MANA
> >  tristate "Microsoft Azure Network Adapter (MANA) support"
> >  depends on PCI_MSI
> > - depends on X86_64 || (ARM64 && !CPU_BIG_ENDIAN && ARM64_4K_PAGES)
> > + depends on X86_64 || (ARM64 && !CPU_BIG_ENDIAN)
> >  depends on PCI_HYPERV
> >  select AUXILIARY_BUS
> >  select PAGE_POOL
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > index 1332db9a08eb..c9df942d0d02 100644
> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > @@ -182,7 +182,7 @@ int mana_gd_alloc_memory(struct gdma_context *gc,
> > unsigned int length,
> >  dma_addr_t dma_handle;
> >  void *buf;
> >
> > - if (length < PAGE_SIZE || !is_power_of_2(length))
> > + if (length < MANA_MIN_QSIZE || !is_power_of_2(length))
> >         return -EINVAL;
> >
> >  gmi->dev = gc->dev;
> > @@ -717,7 +717,7 @@ EXPORT_SYMBOL_NS(mana_gd_destroy_dma_region,
> > NET_MANA);
> >  static int mana_gd_create_dma_region(struct gdma_dev *gd,
> >                          struct gdma_mem_info *gmi)
> >  {
> > - unsigned int num_page = gmi->length / PAGE_SIZE;
> > + unsigned int num_page = gmi->length / MANA_MIN_QSIZE;
> 
> This calculation seems a bit weird when using MANA_MIN_QSIZE. The
> number of pages, and the construction of the page_addr_list array
> a few lines later, seem unrelated to the concept of a minimum queue
> size. Is the right concept really a "mapping chunk", and num_page
> would conceptually be "num_chunks", or something like that?  Then
> a queue must be at least one chunk in size, but that's derived from the
> chunk size, and is not the core concept.

I think calling it "num_chunks" is fine. 
May I use "num_chunks" in next version?

>
> Another approach might be to just call it "MANA_PAGE_SIZE", like
> has been done with HV_HYP_PAGE_SIZE.  HV_HYP_PAGE_SIZE exists to
> handle exactly the same issue of the guest PAGE_SIZE potentially
> being different from the fixed 4K size that must be used in host-guest
> communication on Hyper-V.  Same thing here with MANA.

I actually called it "MANA_PAGE_SIZE" in my previous internal patch.
But Paul from Hostnet team opposed using that name, because
4kB is the min q size. MANA doesn't have "page" at HW level.


> >  struct gdma_create_dma_region_req *req = NULL;
> >  struct gdma_create_dma_region_resp resp = {};
> >  struct gdma_context *gc = gd->gdma_context;
> > @@ -727,7 +727,7 @@ static int mana_gd_create_dma_region(struct
> gdma_dev *gd,
> >  int err;
> >  int i;
> >
> > - if (length < PAGE_SIZE || !is_power_of_2(length))
> > + if (length < MANA_MIN_QSIZE || !is_power_of_2(length))
> >         return -EINVAL;
> >
> >  if (offset_in_page(gmi->virt_addr) != 0)
> > @@ -751,7 +751,7 @@ static int mana_gd_create_dma_region(struct
> gdma_dev *gd,
> >  req->page_addr_list_len = num_page;
> >
> >  for (i = 0; i < num_page; i++)
> > -       req->page_addr_list[i] = gmi->dma_handle +  i * PAGE_SIZE;
> > +       req->page_addr_list[i] = gmi->dma_handle +  i *
> MANA_MIN_QSIZE;
> >
> >  err = mana_gd_send_request(gc, req_msg_size, req, sizeof(resp),
> &resp);
> >  if (err)
> > diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c
> > b/drivers/net/ethernet/microsoft/mana/hw_channel.c
> > index bbc4f9e16c98..038dc31e09cd 100644
> > --- a/drivers/net/ethernet/microsoft/mana/hw_channel.c
> > +++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c
> > @@ -362,12 +362,12 @@ static int mana_hwc_create_cq(struct
> hw_channel_context
> > *hwc, u16 q_depth,
> >  int err;
> >
> >  eq_size = roundup_pow_of_two(GDMA_EQE_SIZE * q_depth);
> > - if (eq_size < MINIMUM_SUPPORTED_PAGE_SIZE)
> > -       eq_size = MINIMUM_SUPPORTED_PAGE_SIZE;
> > + if (eq_size < MANA_MIN_QSIZE)
> > +       eq_size = MANA_MIN_QSIZE;
> >
> >  cq_size = roundup_pow_of_two(GDMA_CQE_SIZE * q_depth);
> > - if (cq_size < MINIMUM_SUPPORTED_PAGE_SIZE)
> > -       cq_size = MINIMUM_SUPPORTED_PAGE_SIZE;
> > + if (cq_size < MANA_MIN_QSIZE)
> > +       cq_size = MANA_MIN_QSIZE;
> >
> >  hwc_cq = kzalloc(sizeof(*hwc_cq), GFP_KERNEL);
> >  if (!hwc_cq)
> > @@ -429,7 +429,7 @@ static int mana_hwc_alloc_dma_buf(struct
> > hw_channel_context *hwc, u16 q_depth,
> >
> >  dma_buf->num_reqs = q_depth;
> >
> > - buf_size = PAGE_ALIGN(q_depth * max_msg_size);
> > + buf_size = MANA_MIN_QALIGN(q_depth * max_msg_size);
> >
> >  gmi = &dma_buf->mem_info;
> >  err = mana_gd_alloc_memory(gc, buf_size, gmi);
> > @@ -497,8 +497,8 @@ static int mana_hwc_create_wq(struct
> hw_channel_context
> > *hwc,
> >  else
> >         queue_size = roundup_pow_of_two(GDMA_MAX_SQE_SIZE *
> > q_depth);
> >
> > - if (queue_size < MINIMUM_SUPPORTED_PAGE_SIZE)
> > -       queue_size = MINIMUM_SUPPORTED_PAGE_SIZE;
> > + if (queue_size < MANA_MIN_QSIZE)
> > +       queue_size = MANA_MIN_QSIZE;
> >
> >  hwc_wq = kzalloc(sizeof(*hwc_wq), GFP_KERNEL);
> >  if (!hwc_wq)
> > @@ -628,10 +628,10 @@ static int mana_hwc_establish_channel(struct
> > gdma_context *gc, u16 *q_depth,
> >  init_completion(&hwc->hwc_init_eqe_comp);
> >
> >  err = mana_smc_setup_hwc(&gc->shm_channel, false,
> > -                   eq->mem_info.dma_handle,
> > -                   cq->mem_info.dma_handle,
> > -                   rq->mem_info.dma_handle,
> > -                   sq->mem_info.dma_handle,
> > +                   virt_to_phys(eq->mem_info.virt_addr),
> > +                   virt_to_phys(cq->mem_info.virt_addr),
> > +                   virt_to_phys(rq->mem_info.virt_addr),
> > +                   virt_to_phys(sq->mem_info.virt_addr),
> 
> This change seems unrelated to handling guest PAGE_SIZE values
> other than 4K.  Does it belong in a separate patch?  Or maybe it just
> needs an explanation in the commit message of this patch?

I know dma_handle is usually just the phys adr. But this is not always 
True if IOMMU is used... 
I have no problem to put it to a separate patch if desired.

> 
> >                      eq->eq.msix_index);
> >  if (err)
> >         return err;
> > diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > index d087cf954f75..6a891dbce686 100644
> > --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> > +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> > @@ -1889,10 +1889,10 @@ static int mana_create_txq(struct
> mana_port_context
> > *apc,
> >   *  to prevent overflow.
> >   */
> >  txq_size = MAX_SEND_BUFFERS_PER_QUEUE * 32;
> > - BUILD_BUG_ON(!PAGE_ALIGNED(txq_size));
> > + BUILD_BUG_ON(!MANA_MIN_QALIGNED(txq_size));
> >
> >  cq_size = MAX_SEND_BUFFERS_PER_QUEUE * COMP_ENTRY_SIZE;
> > - cq_size = PAGE_ALIGN(cq_size);
> > + cq_size = MANA_MIN_QALIGN(cq_size);
> >
> >  gc = gd->gdma_context;
> >
> > @@ -2189,8 +2189,8 @@ static struct mana_rxq *mana_create_rxq(struct
> > mana_port_context *apc,
> >  if (err)
> >         goto out;
> >
> > - rq_size = PAGE_ALIGN(rq_size);
> > - cq_size = PAGE_ALIGN(cq_size);
> > + rq_size = MANA_MIN_QALIGN(rq_size);
> > + cq_size = MANA_MIN_QALIGN(cq_size);
> >
> >  /* Create RQ */
> >  memset(&spec, 0, sizeof(spec));
> > diff --git a/drivers/net/ethernet/microsoft/mana/shm_channel.c
> > b/drivers/net/ethernet/microsoft/mana/shm_channel.c
> > index 5553af9c8085..9a54a163d8d1 100644
> > --- a/drivers/net/ethernet/microsoft/mana/shm_channel.c
> > +++ b/drivers/net/ethernet/microsoft/mana/shm_channel.c
> > @@ -6,6 +6,7 @@
> >  #include <linux/io.h>
> >  #include <linux/mm.h>
> >
> > +#include <net/mana/gdma.h>
> >  #include <net/mana/shm_channel.h>
> >
> >  #define PAGE_FRAME_L48_WIDTH_BYTES 6
> > @@ -183,7 +184,7 @@ int mana_smc_setup_hwc(struct shm_channel *sc, bool
> > reset_vf, u64 eq_addr,
> >
> >  /* EQ addr: low 48 bits of frame address */
> >  shmem = (u64 *)ptr;
> > - frame_addr = PHYS_PFN(eq_addr);
> > + frame_addr = MANA_PFN(eq_addr);
> >  *shmem = frame_addr & PAGE_FRAME_L48_MASK;
> >  all_addr_h4bits |= (frame_addr >> PAGE_FRAME_L48_WIDTH_BITS) <<
> >         (frame_addr_seq++ * PAGE_FRAME_H4_WIDTH_BITS);
> 
> In mana_smc_setup_hwc() a few lines above this change, code using
> PAGE_ALIGNED() is unchanged.  Is it correct that the eq/cq/rq/sq
> addresses
> must be aligned to 64K if PAGE_SIZE is 64K?

Since we still using PHYS_PFN on them, if not aligned to PAGE_SIZE, 
the lower bits may be lost. (You said the same below.)

> 
> Related, I wonder about how MANA_PFN() is defined. If PAGE_SIZE is 64K,
> MANA_PFN() will first right-shift 16, then left shift 4. The net is
> right-shift 12,
> corresponding to the 4K chunks that MANA expects. But that approach
> guarantees
> that the rightmost 4 bits of the MANA PFN will always be zero. That's
> consistent
> with requiring the addresses to be PAGE_ALIGNED() to 64K, but I'm unclear
> whether
> that is really the requirement. You might compare with the definition of
> HVPFN_DOWN(), which has a similar goal for Linux guests communicating
> with
> Hyper-V.

@Paul Rosswurm You said MANA HW has "no page concept". So the "frame_addr"
In the mana_smc_setup_hwc() is NOT related to physical page number, correct?
Can we just use phys_adr >> 12 like below?

#define MANA_MIN_QSHIFT 12
#define MANA_PFN(a) ((a) >> MANA_MIN_QSHIFT)

      /* EQ addr: low 48 bits of frame address */
     shmem = (u64 *)ptr;
-     frame_addr = PHYS_PFN(eq_addr);
+     frame_addr = MANA_PFN(eq_addr);

> 
> > @@ -191,7 +192,7 @@ int mana_smc_setup_hwc(struct shm_channel *sc, bool
> > reset_vf, u64 eq_addr,
> >
> >  /* CQ addr: low 48 bits of frame address */
> >  shmem = (u64 *)ptr;
> > - frame_addr = PHYS_PFN(cq_addr);
> > + frame_addr = MANA_PFN(cq_addr);
> >  *shmem = frame_addr & PAGE_FRAME_L48_MASK;
> >  all_addr_h4bits |= (frame_addr >> PAGE_FRAME_L48_WIDTH_BITS) <<
> >         (frame_addr_seq++ * PAGE_FRAME_H4_WIDTH_BITS);
> > @@ -199,7 +200,7 @@ int mana_smc_setup_hwc(struct shm_channel *sc, bool
> > reset_vf, u64 eq_addr,
> >
> >  /* RQ addr: low 48 bits of frame address */
> >  shmem = (u64 *)ptr;
> > - frame_addr = PHYS_PFN(rq_addr);
> > + frame_addr = MANA_PFN(rq_addr);
> >  *shmem = frame_addr & PAGE_FRAME_L48_MASK;
> >  all_addr_h4bits |= (frame_addr >> PAGE_FRAME_L48_WIDTH_BITS) <<
> >         (frame_addr_seq++ * PAGE_FRAME_H4_WIDTH_BITS);
> > @@ -207,7 +208,7 @@ int mana_smc_setup_hwc(struct shm_channel *sc, bool
> > reset_vf, u64 eq_addr,
> >
> >  /* SQ addr: low 48 bits of frame address */
> >  shmem = (u64 *)ptr;
> > - frame_addr = PHYS_PFN(sq_addr);
> > + frame_addr = MANA_PFN(sq_addr);
> >  *shmem = frame_addr & PAGE_FRAME_L48_MASK;
> >  all_addr_h4bits |= (frame_addr >> PAGE_FRAME_L48_WIDTH_BITS) <<
> >         (frame_addr_seq++ * PAGE_FRAME_H4_WIDTH_BITS);
> > diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
> > index 27684135bb4d..b392559c33e9 100644
> > --- a/include/net/mana/gdma.h
> > +++ b/include/net/mana/gdma.h
> > @@ -224,7 +224,12 @@ struct gdma_dev {
> >  struct auxiliary_device *adev;
> >  };
> >
> > -#define MINIMUM_SUPPORTED_PAGE_SIZE PAGE_SIZE
> > +/* These are defined by HW */
> > +#define MANA_MIN_QSHIFT 12
> > +#define MANA_MIN_QSIZE (1 << MANA_MIN_QSHIFT)
> > +#define MANA_MIN_QALIGN(x) ALIGN((x), MANA_MIN_QSIZE)
> > +#define MANA_MIN_QALIGNED(addr) IS_ALIGNED((unsigned long)(addr),
> MANA_MIN_QSIZE)
> > +#define MANA_PFN(a) (PHYS_PFN(a) << (PAGE_SHIFT - MANA_MIN_QSHIFT))
> 
> See comments above about how this is defined.

Replied above.
Thank you for all the detailed comments!

- Haiyang


^ permalink raw reply

* RE: [PATCH net-next] net: mana: Add support for variable page sizes of ARM64
From: Michael Kelley @ 2024-06-11 16:34 UTC (permalink / raw)
  To: Haiyang Zhang, linux-hyperv@vger.kernel.org,
	netdev@vger.kernel.org
  Cc: decui@microsoft.com, stephen@networkplumber.org,
	kys@microsoft.com, paulros@microsoft.com, olaf@aepfle.de,
	vkuznets@redhat.com, davem@davemloft.net, wei.liu@kernel.org,
	edumazet@google.com, kuba@kernel.org, pabeni@redhat.com,
	leon@kernel.org, longli@microsoft.com,
	ssengar@linux.microsoft.com, linux-rdma@vger.kernel.org,
	daniel@iogearbox.net, john.fastabend@gmail.com,
	bpf@vger.kernel.org, ast@kernel.org, hawk@kernel.org,
	tglx@linutronix.de, shradhagupta@linux.microsoft.com,
	linux-kernel@vger.kernel.org
In-Reply-To: <1718054553-6588-1-git-send-email-haiyangz@microsoft.com>

From: LKML haiyangz <lkmlhyz@microsoft.com> On Behalf Of Haiyang Zhang Sent: Monday, June 10, 2024 2:23 PM
> 
> As defined by the MANA Hardware spec, the queue size for DMA is 4KB
> minimal, and power of 2.

You say the hardware requires 4K "minimal". But the definitions in this
patch hardcode to 4K, as if that's the only choice. Is the hardcoding to
4K a design decision made to simplify the MANA driver?

> To support variable page sizes (4KB, 16KB, 64KB) of ARM64, define

A minor nit, but "variable" page size doesn't seem like quite the right
description -- both here and in the Subject line.  On ARM64, the page size
is a choice among a few fixed options.  Perhaps call it support for "page sizes
other than 4K"?

> the minimal queue size as a macro separate from the PAGE_SIZE, which
> we always assumed it to be 4KB before supporting ARM64.
> Also, update the relevant code related to size alignment, DMA region
> calculations, etc.
> 
> Signed-off-by: Haiyang Zhang <haiyangz@microsoft.com>
> ---
>  drivers/net/ethernet/microsoft/Kconfig        |  2 +-
>  .../net/ethernet/microsoft/mana/gdma_main.c   |  8 +++----
>  .../net/ethernet/microsoft/mana/hw_channel.c  | 22 +++++++++----------
>  drivers/net/ethernet/microsoft/mana/mana_en.c |  8 +++----
>  .../net/ethernet/microsoft/mana/shm_channel.c |  9 ++++----
>  include/net/mana/gdma.h                       |  7 +++++-
>  include/net/mana/mana.h                       |  3 ++-
>  7 files changed, 33 insertions(+), 26 deletions(-)
> 
> diff --git a/drivers/net/ethernet/microsoft/Kconfig
> b/drivers/net/ethernet/microsoft/Kconfig
> index 286f0d5697a1..901fbffbf718 100644
> --- a/drivers/net/ethernet/microsoft/Kconfig
> +++ b/drivers/net/ethernet/microsoft/Kconfig
> @@ -18,7 +18,7 @@ if NET_VENDOR_MICROSOFT
>  config MICROSOFT_MANA
>  	tristate "Microsoft Azure Network Adapter (MANA) support"
>  	depends on PCI_MSI
> -	depends on X86_64 || (ARM64 && !CPU_BIG_ENDIAN && ARM64_4K_PAGES)
> +	depends on X86_64 || (ARM64 && !CPU_BIG_ENDIAN)
>  	depends on PCI_HYPERV
>  	select AUXILIARY_BUS
>  	select PAGE_POOL
> diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> index 1332db9a08eb..c9df942d0d02 100644
> --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> @@ -182,7 +182,7 @@ int mana_gd_alloc_memory(struct gdma_context *gc,
> unsigned int length,
>  	dma_addr_t dma_handle;
>  	void *buf;
> 
> -	if (length < PAGE_SIZE || !is_power_of_2(length))
> +	if (length < MANA_MIN_QSIZE || !is_power_of_2(length))
>  		return -EINVAL;
> 
>  	gmi->dev = gc->dev;
> @@ -717,7 +717,7 @@ EXPORT_SYMBOL_NS(mana_gd_destroy_dma_region,
> NET_MANA);
>  static int mana_gd_create_dma_region(struct gdma_dev *gd,
>  				     struct gdma_mem_info *gmi)
>  {
> -	unsigned int num_page = gmi->length / PAGE_SIZE;
> +	unsigned int num_page = gmi->length / MANA_MIN_QSIZE;

This calculation seems a bit weird when using MANA_MIN_QSIZE. The
number of pages, and the construction of the page_addr_list array
a few lines later, seem unrelated to the concept of a minimum queue
size. Is the right concept really a "mapping chunk", and num_page
would conceptually be "num_chunks", or something like that?  Then
a queue must be at least one chunk in size, but that's derived from the
chunk size, and is not the core concept.

Another approach might be to just call it "MANA_PAGE_SIZE", like
has been done with HV_HYP_PAGE_SIZE.  HV_HYP_PAGE_SIZE exists to
handle exactly the same issue of the guest PAGE_SIZE potentially
being different from the fixed 4K size that must be used in host-guest
communication on Hyper-V.  Same thing here with MANA.

>  	struct gdma_create_dma_region_req *req = NULL;
>  	struct gdma_create_dma_region_resp resp = {};
>  	struct gdma_context *gc = gd->gdma_context;
> @@ -727,7 +727,7 @@ static int mana_gd_create_dma_region(struct gdma_dev *gd,
>  	int err;
>  	int i;
> 
> -	if (length < PAGE_SIZE || !is_power_of_2(length))
> +	if (length < MANA_MIN_QSIZE || !is_power_of_2(length))
>  		return -EINVAL;
> 
>  	if (offset_in_page(gmi->virt_addr) != 0)
> @@ -751,7 +751,7 @@ static int mana_gd_create_dma_region(struct gdma_dev *gd,
>  	req->page_addr_list_len = num_page;
> 
>  	for (i = 0; i < num_page; i++)
> -		req->page_addr_list[i] = gmi->dma_handle +  i * PAGE_SIZE;
> +		req->page_addr_list[i] = gmi->dma_handle +  i * MANA_MIN_QSIZE;
> 
>  	err = mana_gd_send_request(gc, req_msg_size, req, sizeof(resp), &resp);
>  	if (err)
> diff --git a/drivers/net/ethernet/microsoft/mana/hw_channel.c
> b/drivers/net/ethernet/microsoft/mana/hw_channel.c
> index bbc4f9e16c98..038dc31e09cd 100644
> --- a/drivers/net/ethernet/microsoft/mana/hw_channel.c
> +++ b/drivers/net/ethernet/microsoft/mana/hw_channel.c
> @@ -362,12 +362,12 @@ static int mana_hwc_create_cq(struct hw_channel_context
> *hwc, u16 q_depth,
>  	int err;
> 
>  	eq_size = roundup_pow_of_two(GDMA_EQE_SIZE * q_depth);
> -	if (eq_size < MINIMUM_SUPPORTED_PAGE_SIZE)
> -		eq_size = MINIMUM_SUPPORTED_PAGE_SIZE;
> +	if (eq_size < MANA_MIN_QSIZE)
> +		eq_size = MANA_MIN_QSIZE;
> 
>  	cq_size = roundup_pow_of_two(GDMA_CQE_SIZE * q_depth);
> -	if (cq_size < MINIMUM_SUPPORTED_PAGE_SIZE)
> -		cq_size = MINIMUM_SUPPORTED_PAGE_SIZE;
> +	if (cq_size < MANA_MIN_QSIZE)
> +		cq_size = MANA_MIN_QSIZE;
> 
>  	hwc_cq = kzalloc(sizeof(*hwc_cq), GFP_KERNEL);
>  	if (!hwc_cq)
> @@ -429,7 +429,7 @@ static int mana_hwc_alloc_dma_buf(struct
> hw_channel_context *hwc, u16 q_depth,
> 
>  	dma_buf->num_reqs = q_depth;
> 
> -	buf_size = PAGE_ALIGN(q_depth * max_msg_size);
> +	buf_size = MANA_MIN_QALIGN(q_depth * max_msg_size);
> 
>  	gmi = &dma_buf->mem_info;
>  	err = mana_gd_alloc_memory(gc, buf_size, gmi);
> @@ -497,8 +497,8 @@ static int mana_hwc_create_wq(struct hw_channel_context
> *hwc,
>  	else
>  		queue_size = roundup_pow_of_two(GDMA_MAX_SQE_SIZE *
> q_depth);
> 
> -	if (queue_size < MINIMUM_SUPPORTED_PAGE_SIZE)
> -		queue_size = MINIMUM_SUPPORTED_PAGE_SIZE;
> +	if (queue_size < MANA_MIN_QSIZE)
> +		queue_size = MANA_MIN_QSIZE;
> 
>  	hwc_wq = kzalloc(sizeof(*hwc_wq), GFP_KERNEL);
>  	if (!hwc_wq)
> @@ -628,10 +628,10 @@ static int mana_hwc_establish_channel(struct
> gdma_context *gc, u16 *q_depth,
>  	init_completion(&hwc->hwc_init_eqe_comp);
> 
>  	err = mana_smc_setup_hwc(&gc->shm_channel, false,
> -				 eq->mem_info.dma_handle,
> -				 cq->mem_info.dma_handle,
> -				 rq->mem_info.dma_handle,
> -				 sq->mem_info.dma_handle,
> +				 virt_to_phys(eq->mem_info.virt_addr),
> +				 virt_to_phys(cq->mem_info.virt_addr),
> +				 virt_to_phys(rq->mem_info.virt_addr),
> +				 virt_to_phys(sq->mem_info.virt_addr),

This change seems unrelated to handling guest PAGE_SIZE values
other than 4K.  Does it belong in a separate patch?  Or maybe it just
needs an explanation in the commit message of this patch?

>  				 eq->eq.msix_index);
>  	if (err)
>  		return err;
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c
> b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index d087cf954f75..6a891dbce686 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -1889,10 +1889,10 @@ static int mana_create_txq(struct mana_port_context
> *apc,
>  	 *  to prevent overflow.
>  	 */
>  	txq_size = MAX_SEND_BUFFERS_PER_QUEUE * 32;
> -	BUILD_BUG_ON(!PAGE_ALIGNED(txq_size));
> +	BUILD_BUG_ON(!MANA_MIN_QALIGNED(txq_size));
> 
>  	cq_size = MAX_SEND_BUFFERS_PER_QUEUE * COMP_ENTRY_SIZE;
> -	cq_size = PAGE_ALIGN(cq_size);
> +	cq_size = MANA_MIN_QALIGN(cq_size);
> 
>  	gc = gd->gdma_context;
> 
> @@ -2189,8 +2189,8 @@ static struct mana_rxq *mana_create_rxq(struct
> mana_port_context *apc,
>  	if (err)
>  		goto out;
> 
> -	rq_size = PAGE_ALIGN(rq_size);
> -	cq_size = PAGE_ALIGN(cq_size);
> +	rq_size = MANA_MIN_QALIGN(rq_size);
> +	cq_size = MANA_MIN_QALIGN(cq_size);
> 
>  	/* Create RQ */
>  	memset(&spec, 0, sizeof(spec));
> diff --git a/drivers/net/ethernet/microsoft/mana/shm_channel.c
> b/drivers/net/ethernet/microsoft/mana/shm_channel.c
> index 5553af9c8085..9a54a163d8d1 100644
> --- a/drivers/net/ethernet/microsoft/mana/shm_channel.c
> +++ b/drivers/net/ethernet/microsoft/mana/shm_channel.c
> @@ -6,6 +6,7 @@
>  #include <linux/io.h>
>  #include <linux/mm.h>
> 
> +#include <net/mana/gdma.h>
>  #include <net/mana/shm_channel.h>
> 
>  #define PAGE_FRAME_L48_WIDTH_BYTES 6
> @@ -183,7 +184,7 @@ int mana_smc_setup_hwc(struct shm_channel *sc, bool
> reset_vf, u64 eq_addr,
> 
>  	/* EQ addr: low 48 bits of frame address */
>  	shmem = (u64 *)ptr;
> -	frame_addr = PHYS_PFN(eq_addr);
> +	frame_addr = MANA_PFN(eq_addr);
>  	*shmem = frame_addr & PAGE_FRAME_L48_MASK;
>  	all_addr_h4bits |= (frame_addr >> PAGE_FRAME_L48_WIDTH_BITS) <<
>  		(frame_addr_seq++ * PAGE_FRAME_H4_WIDTH_BITS);

In mana_smc_setup_hwc() a few lines above this change, code using
PAGE_ALIGNED() is unchanged.  Is it correct that the eq/cq/rq/sq addresses
must be aligned to 64K if PAGE_SIZE is 64K?

Related, I wonder about how MANA_PFN() is defined. If PAGE_SIZE is 64K,
MANA_PFN() will first right-shift 16, then left shift 4. The net is right-shift 12,
corresponding to the 4K chunks that MANA expects. But that approach guarantees
that the rightmost 4 bits of the MANA PFN will always be zero. That's consistent
with requiring the addresses to be PAGE_ALIGNED() to 64K, but I'm unclear whether
that is really the requirement. You might compare with the definition of
HVPFN_DOWN(), which has a similar goal for Linux guests communicating with
Hyper-V.

> @@ -191,7 +192,7 @@ int mana_smc_setup_hwc(struct shm_channel *sc, bool
> reset_vf, u64 eq_addr,
> 
>  	/* CQ addr: low 48 bits of frame address */
>  	shmem = (u64 *)ptr;
> -	frame_addr = PHYS_PFN(cq_addr);
> +	frame_addr = MANA_PFN(cq_addr);
>  	*shmem = frame_addr & PAGE_FRAME_L48_MASK;
>  	all_addr_h4bits |= (frame_addr >> PAGE_FRAME_L48_WIDTH_BITS) <<
>  		(frame_addr_seq++ * PAGE_FRAME_H4_WIDTH_BITS);
> @@ -199,7 +200,7 @@ int mana_smc_setup_hwc(struct shm_channel *sc, bool
> reset_vf, u64 eq_addr,
> 
>  	/* RQ addr: low 48 bits of frame address */
>  	shmem = (u64 *)ptr;
> -	frame_addr = PHYS_PFN(rq_addr);
> +	frame_addr = MANA_PFN(rq_addr);
>  	*shmem = frame_addr & PAGE_FRAME_L48_MASK;
>  	all_addr_h4bits |= (frame_addr >> PAGE_FRAME_L48_WIDTH_BITS) <<
>  		(frame_addr_seq++ * PAGE_FRAME_H4_WIDTH_BITS);
> @@ -207,7 +208,7 @@ int mana_smc_setup_hwc(struct shm_channel *sc, bool
> reset_vf, u64 eq_addr,
> 
>  	/* SQ addr: low 48 bits of frame address */
>  	shmem = (u64 *)ptr;
> -	frame_addr = PHYS_PFN(sq_addr);
> +	frame_addr = MANA_PFN(sq_addr);
>  	*shmem = frame_addr & PAGE_FRAME_L48_MASK;
>  	all_addr_h4bits |= (frame_addr >> PAGE_FRAME_L48_WIDTH_BITS) <<
>  		(frame_addr_seq++ * PAGE_FRAME_H4_WIDTH_BITS);
> diff --git a/include/net/mana/gdma.h b/include/net/mana/gdma.h
> index 27684135bb4d..b392559c33e9 100644
> --- a/include/net/mana/gdma.h
> +++ b/include/net/mana/gdma.h
> @@ -224,7 +224,12 @@ struct gdma_dev {
>  	struct auxiliary_device *adev;
>  };
> 
> -#define MINIMUM_SUPPORTED_PAGE_SIZE PAGE_SIZE
> +/* These are defined by HW */
> +#define MANA_MIN_QSHIFT 12
> +#define MANA_MIN_QSIZE (1 << MANA_MIN_QSHIFT)
> +#define MANA_MIN_QALIGN(x) ALIGN((x), MANA_MIN_QSIZE)
> +#define MANA_MIN_QALIGNED(addr) IS_ALIGNED((unsigned long)(addr), MANA_MIN_QSIZE)
> +#define MANA_PFN(a) (PHYS_PFN(a) << (PAGE_SHIFT - MANA_MIN_QSHIFT))

See comments above about how this is defined.

Michael

> 
>  #define GDMA_CQE_SIZE 64
>  #define GDMA_EQE_SIZE 16
> diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
> index 561f6719fb4e..43e8fc574354 100644
> --- a/include/net/mana/mana.h
> +++ b/include/net/mana/mana.h
> @@ -42,7 +42,8 @@ enum TRI_STATE {
> 
>  #define MAX_SEND_BUFFERS_PER_QUEUE 256
> 
> -#define EQ_SIZE (8 * PAGE_SIZE)
> +#define EQ_SIZE (8 * MANA_MIN_QSIZE)
> +
>  #define LOG2_EQ_THROTTLE 3
> 
>  #define MAX_PORTS_IN_MANA_DEV 256
> --
> 2.34.1
> 


^ permalink raw reply

* Re: [PATCHv11 18/19] x86/acpi: Add support for CPU offlining for ACPI MADT wakeup method
From: Kirill A. Shutemov @ 2024-06-11 15:47 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Thomas Gleixner, Ingo Molnar, Dave Hansen, x86, Rafael J. Wysocki,
	Peter Zijlstra, Adrian Hunter, Kuppuswamy Sathyanarayanan,
	Elena Reshetova, Jun Nakajima, Rick Edgecombe, Tom Lendacky,
	Kalra, Ashish, Sean Christopherson, Huang, Kai, Ard Biesheuvel,
	Baoquan He, H. Peter Anvin, K. Y. Srinivasan, Haiyang Zhang,
	kexec, linux-hyperv, linux-acpi, linux-coco, linux-kernel,
	Tao Liu
In-Reply-To: <hidvykk3yan5rtlhum6go7j3lwgrcfcgxlwyjug3osfakw2x6f@4ohvo23zaesv>

On Mon, Jun 10, 2024 at 05:01:55PM +0300, Kirill A. Shutemov wrote:
> On Mon, Jun 10, 2024 at 03:40:20PM +0200, Borislav Petkov wrote:
> > On Fri, Jun 07, 2024 at 06:14:28PM +0300, Kirill A. Shutemov wrote:
> > >   I was able to address this issue by switching cpa_lock to a mutex.
> > >   However, this solution will only work if the callers for set_memory
> > >   interfaces are not called from an atomic context. I need to verify if
> > >   this is the case.
> > 
> > Dunno, I'd be nervous about this. Althouth from looking at
> > 
> >    ad5ca55f6bdb ("x86, cpa: srlz cpa(), global flush tlb after splitting big page and before doing cpa")
> > 
> > I don't see how "So that we don't allow any other cpu" can't be done
> > with a mutex. Perhaps the set_memory* interfaces should be usable in as
> > many contexts as possible.
> > 
> > Have you run this with lockdep enabled?
> 
> Yes, it booted to the shell just fine. However, that doesn't prove
> anything. The set_memory_* function has many obscured cases.
> 
> > > - The function __flush_tlb_all() in kernel_(un)map_pages_in_pgd() must be
> > >   called with preemption disabled. Once again, I am unsure why this has
> > >   not caused issues in the EFI case.
> > 
> > It could be because EFI does all that setup on the BSP only before the
> > others have arrived but I don't remember anymore... It is more than
> > a decade ago when I did this...
> 
> Are you okay with this? Disabling preemption looks strange, but I don't
> see a better option.

Borislav, given this code deduplication effort is not trivial, maybe we
can do it as a separate patchset on top of this one?

I also wounder if it makes sense to combine ident_map.c and set_memory.c.
There's some overlap between the two.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply

* Re: [PATCH v2 1/6] arm64/hyperv: Support DeviceTree
From: Roman Kisel @ 2024-06-11 14:55 UTC (permalink / raw)
  To: Elliot Berman
  Cc: arnd, bhelgaas, bp, catalin.marinas, dave.hansen, decui, haiyangz,
	hpa, kw, kys, lenb, lpieralisi, mingo, mhklinux, rafael, robh,
	tglx, wei.liu, will, linux-acpi, linux-arch, linux-arm-kernel,
	linux-hyperv, linux-kernel, linux-pci, x86, ssengar, sunilmut,
	vdso
In-Reply-To: <9b216f16-a2ea-48d7-8986-f0c2e3f3d009@linux.microsoft.com>



On 5/16/2024 8:27 AM, Roman Kisel wrote:
> 
> 
> On 5/15/2024 3:02 PM, Elliot Berman wrote:
>> On Tue, May 14, 2024 at 03:43:48PM -0700, Roman Kisel wrote:
>>> The Virtual Trust Level platforms rely on DeviceTree, and the
>>> arm64/hyperv code supports ACPI only. Update the logic to
>>> support DeviceTree on boot as well as ACPI.
>>
>> Could you use Call UID query from SMCCC? KVM [1] and Gunyah [2] have
>> been using this to identify if guest is running under those respective
>> hypervisors. This works in both DT and ACPI cases.
>>
>> [1]: https://lore.kernel.org/all/20210330145430.996981-2-maz@kernel.org/
>> [2]: 
>> https://lore.kernel.org/all/20240222-gunyah-v17-4-1e9da6763d38@quicinc.com/
> 
> That would be very neat indeed, thanks! Talking to the hypervisor folks.
> 
We have that now. Will send out the revised patches sometime during the 
next week most likely.

>>>
>>> Signed-off-by: Roman Kisel <romank@linux.microsoft.com>
>>> ---
>>>   arch/arm64/hyperv/mshyperv.c | 34 +++++++++++++++++++++++++++++-----
>>>   1 file changed, 29 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/arch/arm64/hyperv/mshyperv.c b/arch/arm64/hyperv/mshyperv.c
>>> index b1a4de4eee29..208a3bcb9686 100644
>>> --- a/arch/arm64/hyperv/mshyperv.c
>>> +++ b/arch/arm64/hyperv/mshyperv.c
>>> @@ -15,6 +15,9 @@
>>>   #include <linux/errno.h>
>>>   #include <linux/version.h>
>>>   #include <linux/cpuhotplug.h>
>>> +#include <linux/libfdt.h>
>>> +#include <linux/of.h>
>>> +#include <linux/of_fdt.h>
>>>   #include <asm/mshyperv.h>
>>>   static bool hyperv_initialized;
>>> @@ -27,6 +30,29 @@ int hv_get_hypervisor_version(union 
>>> hv_hypervisor_version_info *info)
>>>       return 0;
>>>   }
>>> +static bool hyperv_detect_fdt(void)
>>> +{
>>> +#ifdef CONFIG_OF
>>> +    const unsigned long hyp_node = of_get_flat_dt_subnode_by_name(
>>> +            of_get_flat_dt_root(), "hypervisor");
>>> +
>>> +    return (hyp_node != -FDT_ERR_NOTFOUND) &&
>>> +            of_flat_dt_is_compatible(hyp_node, "microsoft,hyperv");
>>> +#else
>>> +    return false;
>>> +#endif
>>> +}
>>> +
>>> +static bool hyperv_detect_acpi(void)
>>> +{
>>> +#ifdef CONFIG_ACPI
>>> +    return !acpi_disabled &&
>>> +            !strncmp((char *)&acpi_gbl_FADT.hypervisor_id, 
>>> "MsHyperV", 8);
>>> +#else
>>> +    return false;
>>> +#endif
>>> +}
>>> +
>>>   static int __init hyperv_init(void)
>>>   {
>>>       struct hv_get_vp_registers_output    result;
>>> @@ -35,13 +61,11 @@ static int __init hyperv_init(void)
>>>       /*
>>>        * Allow for a kernel built with CONFIG_HYPERV to be running in
>>> -     * a non-Hyper-V environment, including on DT instead of ACPI.
>>> +     * a non-Hyper-V environment.
>>> +     *
>>>        * In such cases, do nothing and return success.
>>>        */
>>> -    if (acpi_disabled)
>>> -        return 0;
>>> -
>>> -    if (strncmp((char *)&acpi_gbl_FADT.hypervisor_id, "MsHyperV", 8))
>>> +    if (!hyperv_detect_fdt() && !hyperv_detect_acpi())
>>>           return 0;
>>>       /* Setup the guest ID */
>>> -- 
>>> 2.45.0
>>>
>>>
>>> _______________________________________________
>>> linux-arm-kernel mailing list
>>> linux-arm-kernel@lists.infradead.org
>>> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
> 

-- 
Thank you,
Roman

^ permalink raw reply

* Re: [PATCH 1/1] x86/hyperv: Set X86_FEATURE_TSC_KNOWN_FREQ when Hyper-V provides frequency
From: Roman Kisel @ 2024-06-11 14:51 UTC (permalink / raw)
  To: mhklinux, kys, haiyangz, wei.liu, decui, tglx, mingo, bp,
	dave.hansen, x86, hpa, linux-hyperv, linux-kernel
In-Reply-To: <20240606025559.1631-1-mhklinux@outlook.com>



On 6/5/2024 7:55 PM, mhkelley58@gmail.com wrote:
> From: Michael Kelley <mhklinux@outlook.com>
> 
> A Linux guest on Hyper-V gets the TSC frequency from a synthetic MSR, if
> available. In this case, set X86_FEATURE_TSC_KNOWN_FREQ so that Linux
> doesn't unnecessarily do refined TSC calibration when setting up the TSC
> clocksource.
> 
> With this change, a message such as this is no longer output during boot
> when the TSC is used as the clocksource:
> 
> [    1.115141] tsc: Refined TSC clocksource calibration: 2918.408 MHz
> 
> Furthermore, the guest and host will have exactly the same view of the
> TSC frequency, which is important for features such as the TSC deadline
> timer that are emulated by the Hyper-V host.
> 
> Signed-off-by: Michael Kelley <mhklinux@outlook.com>
> ---
>   arch/x86/kernel/cpu/mshyperv.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index e0fd57a8ba84..c3e38eaf6d2f 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -424,6 +424,7 @@ static void __init ms_hyperv_init_platform(void)
>   	    ms_hyperv.misc_features & HV_FEATURE_FREQUENCY_MSRS_AVAILABLE) {
>   		x86_platform.calibrate_tsc = hv_get_tsc_khz;
>   		x86_platform.calibrate_cpu = hv_get_tsc_khz;
> +		setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
>   	}
>   
>   	if (ms_hyperv.priv_high & HV_ISOLATION) {

LGTM

Reviewed-by: Roman Kisel <romank@linux.microsoft.com>

-- 
Thank you,
Roman

^ permalink raw reply

* Re: [PATCH v2 6/6] drivers/pci/hyperv/arm64: vPCI MSI IRQ domain from DT
From: Roman Kisel @ 2024-06-11 14:40 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Saurabh Singh Sengar, arnd, bhelgaas, bp, catalin.marinas,
	dave.hansen, decui, haiyangz, hpa, kw, kys, lenb, lpieralisi,
	mingo, mhklinux, rafael, robh, tglx, wei.liu, will, linux-acpi,
	linux-arch, linux-arm-kernel, linux-hyperv, linux-kernel,
	linux-pci, x86, ssengar, sunilmut, vdso
In-Reply-To: <20240607195501.GA858122@bhelgaas>



On 6/7/2024 12:55 PM, Bjorn Helgaas wrote:
> On Wed, May 15, 2024 at 01:12:38PM -0500, Bjorn Helgaas wrote:
>> On Wed, May 15, 2024 at 09:34:09AM -0700, Roman Kisel wrote:
>>>
>>>
>>> On 5/15/2024 2:48 AM, Saurabh Singh Sengar wrote:
>>>> On Tue, May 14, 2024 at 03:43:53PM -0700, Roman Kisel wrote:
>>>>> The hyperv-pci driver uses ACPI for MSI IRQ domain configuration
>>>>> on arm64 thereby it won't be able to do that in the VTL mode where
>>>>> only DeviceTree can be used.
>>>>>
>>>>> Update the hyperv-pci driver to discover interrupt configuration
>>>>> via DeviceTree.
>>>>
>>>> Subject prefix should be "PCI: hv:"
> 
> I forgot to also suggest that the subject line begin with a verb,
> e.g., "Get vPCI MSI IRQ domain from DT" or similar, again so it reads
> consistently with previous commits.
> 
> Oh, I see patch 5/6, "Get the irq number from DeviceTree" is also very
> similar.  It would be nice if they matched, e.g., both used "IRQ" and
> "DT".
> 
> Bjorn

Will update, thanks! Going to send another version during the next week 
most likely.

-- 
Thank you,
Roman

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox