Linux-HyperV List

* Re: [PATCH net-next v12 1/6] net: mana: Create separate EQs for each vPort
From: patchwork-bot+netdevbpf @ 2026-06-10  0:40 UTC (permalink / raw)
  To: Long Li
  Cc: kotaranov, kuba, davem, pabeni, edumazet, andrew+netdev, jgg,
	leon, haiyangz, kys, wei.liu, decui, shradhagupta, horms, netdev,
	linux-rdma, linux-hyperv, linux-kernel
In-Reply-To: <20260605005717.2059954-2-longli@microsoft.com>

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Thu,  4 Jun 2026 17:57:10 -0700 you wrote:
> To prepare for assigning vPorts to dedicated MSI-X vectors, remove EQ
> sharing among the vPorts and create dedicated EQs for each vPort.
> 
> Move the EQ definition from struct mana_context to struct mana_port_context
> and update related support functions. Export mana_create_eq() and
> mana_destroy_eq() for use by the MANA RDMA driver.
> 
> [...]

Here is the summary with links:
  - [net-next,v12,1/6] net: mana: Create separate EQs for each vPort
    https://git.kernel.org/netdev/net-next/c/fa1a3b7bcd16
  - [net-next,v12,2/6] net: mana: Query device capabilities and configure MSI-X sharing for EQs
    https://git.kernel.org/netdev/net-next/c/d7c253d61488
  - [net-next,v12,3/6] net: mana: Introduce GIC context with refcounting for interrupt management
    https://git.kernel.org/netdev/net-next/c/d478457fc1b7
  - [net-next,v12,4/6] net: mana: Use GIC functions to allocate global EQs
    https://git.kernel.org/netdev/net-next/c/346c277d1db8
  - [net-next,v12,5/6] net: mana: Allocate interrupt context for each EQ when creating vPort
    https://git.kernel.org/netdev/net-next/c/487af6f5391e
  - [net-next,v12,6/6] RDMA/mana_ib: Allocate interrupt contexts on EQs
    https://git.kernel.org/netdev/net-next/c/062b2b051f14

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



* Re: [PATCH v4 02/47] x86/tsc: Add a standalone helpers for getting TSC info from CPUID.0x15
From: Sean Christopherson @ 2026-06-09 19:28 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Dave Hansen, x86,
	Kiryl Shutsemau, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov, Jan Kiszka,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	John Stultz, H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner
In-Reply-To: <20260602034916.GGah5SvARd77mkvxe3@fat_crate.local>

On Mon, Jun 01, 2026, Borislav Petkov wrote:
> On Fri, May 29, 2026 at 07:43:49AM -0700, Sean Christopherson wrote:
> > +static int cpuid_get_tsc_info(struct cpuid_tsc_info *info)
> > +{
> > +	unsigned int ecx_hz, edx;
> > +
> > +	memset(info, 0, sizeof(*info));
> 
> Let's not clear this unnecessarily...
> 
> > +
> > +	if (boot_cpu_data.cpuid_level < CPUID_LEAF_TSC)
> > +		return -ENOENT;
> 
> ... just to return here...
> 
> > +
> > +	/* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */
> > +	cpuid(CPUID_LEAF_TSC, &info->denominator, &info->numerator, &ecx_hz, &edx);
> > +
> > +	if (!info->denominator || !info->numerator)
> > +		return -ENOENT;
> 
> ... or here.
> 
> We wanna clear it here, when we'll return success.

Actually, if we take the approach of relying on the user to check the return
code, then there's no need to zero the struct since all fields will be explicitly
written, especially if we drop the "tsc_khz" field.  I was zeroing the struct
purely as defense in depth.
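
A minimal user-space sketch of the "no memset, caller checks the return code" approach discussed above. It is not the patch itself: the raw_leaf15 struct is a hypothetical stand-in for the cpuid() call, crystal_khz is a hypothetical field name, and the sketch writes the output struct only on success so that every field is explicitly set whenever 0 is returned.

```c
#include <errno.h>

/* Hypothetical user-space stand-in for the struct under discussion. */
struct cpuid_tsc_info {
	unsigned int denominator;
	unsigned int numerator;
	unsigned int crystal_khz;
};

/* Hypothetical container for the raw CPUID.0x15 outputs (EAX, EBX, ECX). */
struct raw_leaf15 {
	unsigned int eax;	/* TSC/crystal ratio denominator */
	unsigned int ebx;	/* TSC/crystal ratio numerator */
	unsigned int ecx;	/* crystal frequency in Hz, may be 0 */
};

int tsc_info_from_leaf15(const struct raw_leaf15 *raw,
			 struct cpuid_tsc_info *info)
{
	/* A zero ratio term means the leaf carries no usable data. */
	if (!raw->eax || !raw->ebx)
		return -ENOENT;

	/* All fields are written on success, so no memset() is needed. */
	info->denominator = raw->eax;
	info->numerator = raw->ebx;
	info->crystal_khz = raw->ecx / 1000;
	return 0;
}
```

The contract is that *info is meaningful only when the function returns 0, which is the assumption that makes the up-front zeroing redundant.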

* Re: [PATCH v4 01/47] x86/tsc: Never re-calibrate TSC frequency if its exact timing is known
From: Thomas Gleixner @ 2026-06-09 19:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Kiryl Shutsemau, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov, Jan Kiszka,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	John Stultz, H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley
In-Reply-To: <aihKj-0nP7bUbNHH@google.com>

On Tue, Jun 09 2026 at 10:17, Sean Christopherson wrote:

> On Fri, Jun 05, 2026, Thomas Gleixner wrote:
>> On Fri, Jun 05 2026 at 11:04, Sean Christopherson wrote:
>> But we also should have a check in the TSC init code somewhere which
>> validates that X86_FEATURE_CONSTANT_TSC is set when
>> X86_FEATURE_TSC_KNOWN_FREQ is set. X86_FEATURE_TSC_KNOWN_FREQ is useless
>> w/o X86_FEATURE_CONSTANT_TSC.
>
> Ugh, any objection to punting on this for now?  KVM and Xen guests will trigger
> TSC_KNOWN_FREQ without CONSTANT_TSC, thanks to commits:
>
>   e10f78050323 ("kvmclock: fix TSC calibration for nested guests")
>   898ec52d2ba0 ("x86/xen/time: Set the X86_FEATURE_TSC_KNOWN_FREQ flag in xen_tsc_khz()")
>
> Hyper-V guests might as well?  Hyper-V's handling of TSC is weird, even for a
> hypervisor.

Hypervisors are ranked by weirdness? I ranked them by insanity so far.

> Even when the frequency is provided in CPUID by the hypervisor, QEMU at least
> requires a fairly explicit opt-in to advertise CONSTANT_TSC, presumably to try
> to prevent users from shooting themselves in the foot.

Bah. We really should have enforced the dependency when we introduced
KNOWN_FREQ. But that ship has sailed.

Though for correctness' sake this should be fixed at some point in the
foreseeable future.

Thanks,

        tglx
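
The dependency described in this message (KNOWN_FREQ is only meaningful together with CONSTANT_TSC) can be written as a simple implication. The sketch below is a user-space illustration only; the flag names and bit positions are hypothetical and do not correspond to the kernel's X86_FEATURE_* encoding.

```c
#include <stdbool.h>

/* Hypothetical bit positions, for illustration only. */
#define FEAT_CONSTANT_TSC	(1u << 0)
#define FEAT_TSC_KNOWN_FREQ	(1u << 1)

/*
 * Returns true when the feature set respects the rule
 * "KNOWN_FREQ implies CONSTANT_TSC".
 */
bool tsc_features_consistent(unsigned int features)
{
	bool known = features & FEAT_TSC_KNOWN_FREQ;
	bool constant = features & FEAT_CONSTANT_TSC;

	return !known || constant;
}
```

Per the thread, KVM and Xen guests currently set KNOWN_FREQ without CONSTANT_TSC, so a check like this would flag them today; that is why the enforcement is being deferred.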

* Re: [RFC PATCH 3/6] arm64: hyperv: Add per-CPU RSI host call infrastructure for CCA Realms
From: sashiko-bot @ 2026-06-09 18:51 UTC (permalink / raw)
  To: Kameron Carr; +Cc: linux-hyperv
In-Reply-To: <20260609181030.2378391-4-kameroncarr@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] The generic hypercall wrappers (`hv_do_hypercall`, `hv_do_fast_hypercall8`, `hv_do_fast_hypercall16`) were not updated to use the RSI_HOST_CALL interface, breaking hypercalls for CCA Realms.
- [High] Missing NULL pointer check for the dynamically allocated per-CPU `hostcall` buffer in `hv_set_vpreg()` and `hv_get_vpreg_128()` can cause a recursive crash during a panic.
- [Low] Complex error handling and cleanup logic for early boot memory allocation failures violates subsystem guidelines.
--

commit 286a7465cb0cf6433c1d0e69d74a2bf4a0c1a935
Author: Kameron Carr <kameroncarr@linux.microsoft.com>

arm64: hyperv: Add per-CPU RSI host call infrastructure for CCA Realms

This patch introduces the hyperv_pcpu_hostcall_struct to support Arm CCA
Realms, which must route hypercalls through the RSI_HOST_CALL interface.
It allocates this per-CPU structure for the boot CPU and secondary CPUs.

> Arm CCA Realms cannot issue Hyper-V hypercalls via HVC; the guest must
> route them through the RSI_HOST_CALL interface, which takes the IPA of a
> per-CPU rsi_host_call structure as its argument.

[Severity: High]
Does this patch miss updating the generic hypercall wrappers?

Looking at hv_do_hypercall() in arch/arm64/hyperv/hv_core.c, it still issues
an HVC instruction unconditionally via arm_smccc_1_1_hvc().

Since these generic functions are heavily used throughout the Hyper-V stack
(including early initialization like hv_get_partition_id()), could a CCA
Realm guest crash when it inadvertently executes the trapped HVC instruction?

> diff --git a/arch/arm64/hyperv/mshyperv.c b/arch/arm64/hyperv/mshyperv.c
> index 4fdc26ade1d74..08fec82691683 100644
> --- a/arch/arm64/hyperv/mshyperv.c
> +++ b/arch/arm64/hyperv/mshyperv.c

[ ... ]

> @@ -60,6 +66,46 @@ static bool __init hyperv_detect_via_acpi(void)
>  
>  #endif
>  
> +static void hv_hostcall_free(void)
> +{
> +	int cpu;
> +
> +	if (!hyperv_pcpu_hostcall_struct)
> +		return;
> +
> +	for_each_possible_cpu(cpu)
> +		kfree(*per_cpu_ptr(hyperv_pcpu_hostcall_struct, cpu));
> +	free_percpu(hyperv_pcpu_hostcall_struct);
> +	hyperv_pcpu_hostcall_struct = NULL;
> +}
> +
> +static int hv_cpu_init(unsigned int cpu)
> +{
> +	void **hostcall_struct;
> +	gfp_t flags;
> +	void *mem;
> +
> +	if (hyperv_pcpu_hostcall_struct) {
> +		/* hv_cpu_init() can be called with IRQs disabled from hv_resume() */
> +		flags = irqs_disabled() ? GFP_ATOMIC : GFP_KERNEL;
> +
> +		hostcall_struct = (void **)this_cpu_ptr(hyperv_pcpu_hostcall_struct);
> +		/*
> +		 * The hostcall_struct memory is not freed when the CPU
> +		 * goes offline. If a previously offlined CPU is brought
> +		 * back online, the memory is reused here.
> +		 */
> +		if (!*hostcall_struct) {
> +			mem = kzalloc_obj(struct rsi_host_call, flags);
> +			if (!mem)
> +				return -ENOMEM;
> +			*hostcall_struct = mem;
> +		}

[Severity: High]
Is it safe to assume this pointer is always initialized during a panic?

If a secondary CPU panics before or during its bringup phase, its per-CPU
hostcall_struct pointer remains NULL.

The globally registered Hyper-V panic notifier runs on the panicking CPU and
calls hv_set_vpreg() and hv_get_vpreg_128(). These functions unconditionally
fetch the local pointer and perform an unprotected memset(hostcall, 0,
sizeof(*hostcall)).

Additionally, if cpuhp_setup_state() fails later during initialization,
hv_hostcall_free() clears hyperv_pcpu_hostcall_struct to NULL. Since
hv_common_free() does not unregister the panic notifiers, a subsequent
panic would cause this_cpu_ptr(NULL) to evaluate to __my_cpu_offset.

Could doing a memset() on these uninitialized pointers trigger a double
fault in the panic handler or memory corruption, losing the crash report
entirely?
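
One way to picture the guard being asked for is the user-space sketch below. It is an illustration under stated assumptions, not the driver code: percpu_hostcall, NR_CPUS_SKETCH, host_call_buf and prepare_hostcall() are hypothetical names, and a plain array stands in for the kernel's per-CPU allocation.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define NR_CPUS_SKETCH 4	/* hypothetical CPU count */

/* Hypothetical stand-in for struct rsi_host_call. */
struct host_call_buf {
	unsigned long gprs[31];
};

/* Stand-in for the per-CPU pointer; slots stay NULL until "CPU init". */
static struct host_call_buf *percpu_hostcall[NR_CPUS_SKETCH];

/* Fetch and clear this CPU's buffer, refusing to touch a NULL slot. */
int prepare_hostcall(unsigned int cpu, struct host_call_buf **out)
{
	struct host_call_buf *hc;

	if (cpu >= NR_CPUS_SKETCH)
		return -EINVAL;

	hc = percpu_hostcall[cpu];
	if (!hc)
		return -ENODEV;	/* not allocated yet: do not memset() */

	memset(hc, 0, sizeof(*hc));
	*out = hc;
	return 0;
}
```

In a panic path the caller would presumably skip the host call on error rather than fault again, but how the real code should react is left open here.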

> +	}
> +
> +	return hv_common_cpu_init(cpu);
> +}

[ ... ]

> @@ -85,6 +133,27 @@ static int __init hyperv_init(void)
>  	if (!hyperv_detect_via_acpi() && !hyperv_detect_via_smccc())
>  		return 0;
>  
> +	/*
> +	 * The RSI host-call buffer is only ever used when
> +	 * is_realm_world() is true. Skip the per-CPU allocation on
> +	 * non-Realm guests.
> +	 */
> +	if (is_realm_world()) {
> +		hyperv_pcpu_hostcall_struct = alloc_percpu(void *);
> +		if (!hyperv_pcpu_hostcall_struct)
> +			return -ENOMEM;
> +
> +		hostcall_struct = (void **)this_cpu_ptr(hyperv_pcpu_hostcall_struct);
> +		if (!*hostcall_struct) {
> +			mem = kzalloc_obj(struct rsi_host_call);
> +			if (!mem) {
> +				ret = -ENOMEM;
> +				goto free_hostcall_mem;

[Severity: Low]
Is this complex error handling and cleanup logic necessary here?

Since hyperv_init() is an early_initcall, physical memory should be
available. A failure at this stage typically implies the system cannot
boot anyway, making this teardown logic unnecessary dead code.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260609181030.2378391-1-kameroncarr@linux.microsoft.com?part=3

* Re: [RFC PATCH 5/6] arm64: hyperv: Route hypercalls through RSI host call in CCA Realms
From: sashiko-bot @ 2026-06-09 18:50 UTC (permalink / raw)
  To: Kameron Carr; +Cc: linux-hyperv
In-Reply-To: <20260609181030.2378391-6-kameroncarr@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [High] NMI/FIQ Reentrancy Data Corruption in Realm Hypercalls
- [High] Uninitialized Access / Missing NULL Check on Per-CPU Hostcall Pointer
- [High] Hardware ABI Violation / Unsafe Memory Allocation for rsi_host_call
--

commit 17e33243e24b87deba9b6417fd0671d6c8f9140b
Author: Kameron Carr <kameroncarr@linux.microsoft.com>

arm64: hyperv: Route hypercalls through RSI host call in CCA Realms

Modify the five hypercall wrapper functions to check is_realm_world()
and use the per-CPU rsi_host_call structure when inside a Realm.

> diff --git a/arch/arm64/hyperv/hv_core.c b/arch/arm64/hyperv/hv_core.c
> index e33a9e3c366a1..1759998ef2667 100644
> --- a/arch/arm64/hyperv/hv_core.c
> +++ b/arch/arm64/hyperv/hv_core.c
[ ... ]
> @@ -25,12 +26,32 @@ u64 hv_do_hypercall(u64 control, void *input, void *output)
>  	struct arm_smccc_res	res;
>  	u64			input_address;
>  	u64			output_address;
> +	struct rsi_host_call *hostcall;
> +	unsigned long flags;
> +	u64 ret;
>  
>  	input_address = input ? virt_to_phys(input) : 0;
>  	output_address = output ? virt_to_phys(output) : 0;
>  
> -	arm_smccc_1_1_hvc(HV_FUNC_ID, control,
> -			  input_address, output_address, &res);
> +	if (is_realm_world()) {
> +		local_irq_save(flags);
> +		hostcall = *this_cpu_ptr(hyperv_pcpu_hostcall_struct);
> +		memset(hostcall, 0, sizeof(*hostcall));

[Severity: High]
Could hostcall be NULL here?

If a hypercall is issued before the per-CPU structure is allocated for the
local CPU (such as during early boot, early AP bring-up, or broadcast TLB
flushes), won't passing this to memset() cause a NULL pointer dereference?

[Severity: High]
Since hypercall arguments are now passed using a shared per-CPU memory
structure instead of registers, is there a risk of reentrancy corruption?

local_irq_save() disables IRQs, but it does not mask NMIs or synchronous
exceptions. If an NMI or exception interrupts this code after the memset()
and issues another hypercall (like a crash IPI via hv_do_fast_hypercall8()),
won't the nested hypercall overwrite this shared hostcall buffer?

When the exception returns, the interrupted hypercall would then execute
using the corrupted arguments or read the wrong return status.

> +		hostcall->gprs[0] = HV_FUNC_ID;
> +		hostcall->gprs[1] = control;
> +		hostcall->gprs[2] = input_address;
> +		hostcall->gprs[3] = output_address;
> +
> +		if (rsi_host_call(virt_to_phys(hostcall)) == RSI_SUCCESS)

[Severity: High]
Is the 256-byte alignment required by the RMM guaranteed for this physical
address?

While the struct declaration might include an alignment attribute, dynamic
allocations don't always guarantee 256-byte physical alignment. For
instance, standard allocations only guarantee alignment up to
ARCH_KMALLOC_MINALIGN, and debugging features like CONFIG_SLUB_DEBUG can
shift object payloads, breaking natural power-of-two alignment.

Would passing an unaligned physical address here cause the RMM to reject
the call or inject a synchronous alignment fault into the Realm guest?
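
The alignment condition in question can be illustrated in user space as below. This is only an analogue: it checks virtual addresses returned by C11 aligned_alloc(), whereas the concern above is about the physical address of a kernel allocation. HOST_CALL_ALIGN and both helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define HOST_CALL_ALIGN 256u	/* alignment discussed in the review */

/* True when the low 8 address bits are clear, i.e. 256-byte aligned. */
bool is_host_call_aligned(const void *p)
{
	return ((uintptr_t)p & (HOST_CALL_ALIGN - 1)) == 0;
}

/*
 * Request the alignment explicitly instead of relying on the
 * allocator's default. The size is a multiple of the alignment,
 * as C11 aligned_alloc() expects.
 */
void *alloc_host_call_buf(void)
{
	return aligned_alloc(HOST_CALL_ALIGN, 256);
}
```

A runtime check of this kind before issuing the call would at least turn a silent misalignment into a detectable error.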

> +			ret = hostcall->gprs[0];
> +		else
> +			ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
> +		local_irq_restore(flags);
> +		return ret;
> +	}
> +
>  	arm_smccc_1_1_hvc(HV_FUNC_ID, control, input_address,
>  			  output_address, &res);
>  	return res.a0;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260609181030.2378391-1-kameroncarr@linux.microsoft.com?part=5

* Re: [RFC PATCH 4/6] Drivers: hv: Mark shared memory as decrypted for CCA Realms
From: sashiko-bot @ 2026-06-09 18:27 UTC (permalink / raw)
  To: Kameron Carr; +Cc: linux-hyperv
In-Reply-To: <20260609181030.2378391-5-kameroncarr@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Critical] Calling `set_memory_decrypted()` on a `HV_HYP_PAGE_SIZE` kmalloc buffer causes out-of-bounds memory decryption or silent failure when `PAGE_SIZE > 4096`.
- [High] Passing an MTE-tagged `kmalloc` pointer directly to `set_memory_decrypted()` without untagging it causes address validation failures and breaks CPU initialization.
--

commit 3df67c547c1f3625047cb1aee50d949f7179dd45
Author: Kameron Carr <kameroncarr@linux.microsoft.com>

Drivers: hv: Mark shared memory as decrypted for CCA Realms

In hv_common_cpu_init(), the per-CPU hypercall input/output pages need
to be marked as decrypted (shared) for confidential VM isolation types.
This is already done for SNP and TDX isolation; extend the same handling
to Arm CCA Realm guests so that the host hypervisor can access the
shared hypercall buffers.

[ ... ]

> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index 6b67ac6167891..010c7d98b5de1 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -499,7 +499,8 @@ int hv_common_cpu_init(unsigned int cpu)
>  		}
>  
>  		if (!ms_hyperv.paravisor_present &&
> -		    (hv_isolation_type_snp() || hv_isolation_type_tdx())) {
> +		    (hv_isolation_type_snp() || hv_isolation_type_tdx() ||
> +		     hv_isolation_type_cca())) {
>  			ret = set_memory_decrypted((unsigned long)mem, pgcount);

[Severity: Critical]
Will this decrypt out-of-bounds memory or fail if PAGE_SIZE is larger than
4096 bytes on ARM64?

The buffer mem is allocated in hv_common_cpu_init() as 1 or 2 Hyper-V
4KB pages:

    mem = kmalloc_array(pgcount, HV_HYP_PAGE_SIZE, flags);

However, set_memory_decrypted() expects pgcount to be the number of system
pages. On ARM64 kernels configured with 16KB or 64KB pages, passing pgcount
will decrypt 16KB-128KB of memory, potentially overrunning the 4KB or
8KB allocation and exposing unrelated adjacent slab memory to the hypervisor.

Additionally, if the kmalloc buffer is not aligned to the system PAGE_SIZE,
set_memory_decrypted() will trigger a WARN_ON(!PAGE_ALIGNED(addr)) and silently
return 0 without decrypting anything, leaving the hypercall page encrypted.
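
The size mismatch can be checked with a little arithmetic. The sketch below is a stand-alone illustration; the three helper names are hypothetical, and it assumes only what is stated above: the buffer is sized in 4KB Hyper-V pages, while the decrypt call counts system pages.

```c
#define HV_HYP_PAGE_SIZE 4096UL

/* Bytes covered when numpages system pages are decrypted. */
unsigned long decrypted_span_bytes(unsigned long numpages,
				   unsigned long sys_page_size)
{
	return numpages * sys_page_size;
}

/* Bytes actually allocated for hv_pgcount Hyper-V pages. */
unsigned long hv_buf_bytes(unsigned long hv_pgcount)
{
	return hv_pgcount * HV_HYP_PAGE_SIZE;
}

/* System pages needed to cover the buffer, rounded up. */
unsigned long sys_pages_for_hv_buf(unsigned long hv_pgcount,
				   unsigned long sys_page_size)
{
	return (hv_buf_bytes(hv_pgcount) + sys_page_size - 1) / sys_page_size;
}
```

With 64KB system pages and pgcount = 2, the buffer is 8KB but the decrypted span is 128KB. Even the rounded-up count of one system page still covers 64KB, so converting the count alone does not help unless the buffer itself is allocated at system-page granularity.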

[Severity: High]
Does this code handle MTE tags correctly when passing mem to
set_memory_decrypted()?

On ARM64 with MTE enabled (CONFIG_KASAN_HW_TAGS), kmalloc_array() returns
an MTE-tagged pointer. The ARM64 set_memory_decrypted() implementation
validates the address using __is_lm_address(addr):

arch/arm64/mm/pageattr.c:__set_memory_enc_dec() {
    ...
    if (!__is_lm_address(addr))
        return -EINVAL;
    ...
}

Because __is_lm_address() performs software-side arithmetic without stripping
the tag, the MTE tag bits cause an arithmetic wrap-around. This evaluates
to false, returning -EINVAL and failing the CPU initialization process.

>  			if (ret) {
>  				/* It may be unsafe to free 'mem' */
>  				return ret;

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260609181030.2378391-1-kameroncarr@linux.microsoft.com?part=4

* Re: [RFC PATCH 2/6] firmware: smccc: Detect hypervisor via RSI host call in CCA Realms
From: sashiko-bot @ 2026-06-09 18:24 UTC (permalink / raw)
  To: Kameron Carr; +Cc: linux-hyperv
In-Reply-To: <20260609181030.2378391-3-kameroncarr@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue to consider:
- [High] Host call structure `uuid_hc` is allocated in protected memory, which will cause the RMM to reject the RSI_HOST_CALL.
--

commit ec503a8b7982b956b2d7e06ff213250087e22275
Author: Kameron Carr <kameroncarr@linux.microsoft.com>

firmware: smccc: Detect hypervisor via RSI host call in CCA Realms

Modify arm_smccc_hypervisor_has_uuid() to check is_realm_world() and use
rsi_host_call() to query the hypervisor vendor UUID when inside a Realm.
This introduces a file-static rsi_host_call buffer serialized by a spinlock
to handle the request.

> diff --git a/drivers/firmware/smccc/smccc.c b/drivers/firmware/smccc/smccc.c
> index bdee057db2fd3..6b465e65472b0 100644
> --- a/drivers/firmware/smccc/smccc.c
> +++ b/drivers/firmware/smccc/smccc.c

[ ... ]

> @@ -67,12 +73,45 @@ s32 arm_smccc_get_soc_id_revision(void)
>  }
>  EXPORT_SYMBOL_GPL(arm_smccc_get_soc_id_revision);
>  
> +#ifdef CONFIG_ARM64
> +static struct rsi_host_call uuid_hc;
> +static DEFINE_SPINLOCK(uuid_hc_lock);
> +
> +/*
> + * Helper function to get the hypervisor UUID via an RsiHostCall.
> + */
> +static bool arm_smccc_realm_get_hypervisor_uuid(struct arm_smccc_res *res)
> +{
> +	guard(spinlock_irqsave)(&uuid_hc_lock);
> +
> +	memset(&uuid_hc, 0, sizeof(uuid_hc));
> +	uuid_hc.gprs[0] = ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID;
> +
> +	if (rsi_host_call(__pa_symbol(&uuid_hc)) != RSI_SUCCESS)

[Severity: High]
Will the RMM reject this host call because uuid_hc is located in protected
memory?

In ARM CCA (Realms), the memory passed to the RMM for an RSI_HOST_CALL
needs to be in the unprotected (shared) state, as the untrusted host
hypervisor must be able to read and write to it. 

Since uuid_hc is statically allocated, it resides in the kernel's protected
BSS section. Calling rsi_host_call() with this physical address will likely
result in RSI_ERROR_INPUT and break hypervisor UUID discovery.

Additionally, because memory decryption operates at page granularity, applying
set_memory_decrypted() to a static variable would decrypt adjacent, unrelated
kernel data and expose it to the untrusted host.

Should this instead use a dedicated, dynamically allocated page that is safely
marked as decrypted?

> +		return false;
> +
> +	res->a0 = uuid_hc.gprs[0];
> +	res->a1 = uuid_hc.gprs[1];
> +	res->a2 = uuid_hc.gprs[2];
> +	res->a3 = uuid_hc.gprs[3];
> +	return true;
> +}
> +#endif

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260609181030.2378391-1-kameroncarr@linux.microsoft.com?part=2

* Re: [RFC PATCH 1/6] arm64: rsi: Add RSI host call structure and helper function
From: sashiko-bot @ 2026-06-09 18:20 UTC (permalink / raw)
  To: Kameron Carr; +Cc: linux-hyperv
In-Reply-To: <20260609181030.2378391-2-kameroncarr@linux.microsoft.com>

Thank you for your contribution! Sashiko AI review found 1 potential issue to consider:
- [Low] The `struct rsi_host_call` ABI definition relies on implicit compiler padding, failing to explicitly represent the RES0 field required by the CCA RMM specification.
--

commit 8a89052b48f09110c7f3fe9050a30fbf2d54c237
Author: Kameron Carr <kameroncarr@linux.microsoft.com>

arm64: rsi: Add RSI host call structure and helper function

This commit introduces the rsi_host_call structure and a helper function to
facilitate hypercall routing through the RSI interface for Hyper-V guests in
an Arm CCA Realm.

> diff --git a/arch/arm64/include/asm/rsi_smc.h b/arch/arm64/include/asm/rsi_smc.h
> index e19253f96c940..ffea93340ed7f 100644
> --- a/arch/arm64/include/asm/rsi_smc.h
> +++ b/arch/arm64/include/asm/rsi_smc.h

[ ... ]

> @@ -142,6 +142,12 @@ struct realm_config {
>  	 */
>  } __aligned(0x1000);
>  
> +struct rsi_host_call {
> +	u16 immediate;
> +	u64 gprs[31];
> +} __aligned(256);

[Severity: Low]
This isn't a bug, but does this structure definition leave a 6-byte implicit
padding hole between immediate and gprs?

The Arm CCA RMM specification requires this 48-bit region to be reserved and
zeroed. While current allocations might safely zero this memory, could a
future stack allocation of this structure leave the padding bytes
uninitialized?

Would it be safer to add an explicit padding field to enforce the correct
layout and prevent potential leaks of uninitialized Realm memory to the
untrusted Host?
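
The padding hole and the explicit-field alternative can be compared in user space as sketched below. The struct names and the reserved field are hypothetical, the fixed-width types stand in for the kernel's u16/u64, and the asserted offsets assume an LP64 ABI where uint64_t is 8-byte aligned (as on arm64 and x86-64) and a GCC/Clang-style aligned attribute.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Layout as quoted in the patch: 6 bytes of implicit padding. */
struct host_call_implicit {
	uint16_t immediate;
	uint64_t gprs[31];
} __attribute__((aligned(256)));

/* Same layout with the reserved region spelled out. */
struct host_call_explicit {
	uint16_t immediate;
	uint8_t  reserved[6];	/* hypothetical name for the RES0 bytes */
	uint64_t gprs[31];
} __attribute__((aligned(256)));

/* Both variants place gprs at offset 8 and occupy 256 bytes. */
static_assert(offsetof(struct host_call_implicit, gprs) == 8,
	      "implicit padding puts gprs at offset 8");
static_assert(offsetof(struct host_call_explicit, gprs) == 8,
	      "explicit padding keeps gprs at offset 8");
static_assert(sizeof(struct host_call_explicit) == 256,
	      "explicit layout keeps the structure at 256 bytes");
```

The two layouts are binary-compatible. The difference is that a designated initializer such as `= { .immediate = 0 }` is guaranteed to zero the named reserved field, while the contents of unnamed padding are not guaranteed in the same way.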

> +static_assert(sizeof(struct rsi_host_call) == 256);
> +
>  #endif /* __ASSEMBLER__ */

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260609181030.2378391-1-kameroncarr@linux.microsoft.com?part=1

* [RFC PATCH 6/6] arm64: hyperv: Implement hv_is_isolation_supported() for CCA Realms
From: Kameron Carr @ 2026-06-09 18:10 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli
  Cc: catalin.marinas, will, mark.rutland, lpieralisi, sudeep.holla,
	arnd, thuth, linux-hyperv, linux-arm-kernel, linux-kernel,
	linux-arch, mhklinux
In-Reply-To: <20260609181030.2378391-1-kameroncarr@linux.microsoft.com>

Provide an arm64 implementation of hv_is_isolation_supported() that
overrides the __weak default in drivers/hv/hv_common.c.

The implementation deliberately does not depend on
hv_is_hyperv_initialized() because hv_common_init() consults
hv_is_isolation_supported() before hyperv_initialized is set.

Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
---
 arch/arm64/hyperv/mshyperv.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/arm64/hyperv/mshyperv.c b/arch/arm64/hyperv/mshyperv.c
index b595b2b9bdbbb..b9b1c2f8e3ec7 100644
--- a/arch/arm64/hyperv/mshyperv.c
+++ b/arch/arm64/hyperv/mshyperv.c
@@ -213,3 +213,8 @@ bool hv_isolation_type_cca(void)
 {
 	return is_realm_world();
 }
+
+bool hv_is_isolation_supported(void)
+{
+	return is_realm_world();
+}
-- 
2.45.4


* [RFC PATCH 5/6] arm64: hyperv: Route hypercalls through RSI host call in CCA Realms
From: Kameron Carr @ 2026-06-09 18:10 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli
  Cc: catalin.marinas, will, mark.rutland, lpieralisi, sudeep.holla,
	arnd, thuth, linux-hyperv, linux-arm-kernel, linux-kernel,
	linux-arch, mhklinux
In-Reply-To: <20260609181030.2378391-1-kameroncarr@linux.microsoft.com>

Modify the five hypercall wrapper functions to check is_realm_world()
and use the per-CPU rsi_host_call structure when inside a Realm.

Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
---
 arch/arm64/hyperv/hv_core.c | 175 +++++++++++++++++++++++++++++-------
 1 file changed, 141 insertions(+), 34 deletions(-)

diff --git a/arch/arm64/hyperv/hv_core.c b/arch/arm64/hyperv/hv_core.c
index e33a9e3c366a1..1759998ef2667 100644
--- a/arch/arm64/hyperv/hv_core.c
+++ b/arch/arm64/hyperv/hv_core.c
@@ -16,6 +16,7 @@
 #include <asm-generic/bug.h>
 #include <hyperv/hvhdk.h>
 #include <asm/mshyperv.h>
+#include <asm/rsi.h>
 
 /*
  * hv_do_hypercall- Invoke the specified hypercall
@@ -25,12 +26,32 @@ u64 hv_do_hypercall(u64 control, void *input, void *output)
 	struct arm_smccc_res	res;
 	u64			input_address;
 	u64			output_address;
+	struct rsi_host_call *hostcall;
+	unsigned long flags;
+	u64 ret;
 
 	input_address = input ? virt_to_phys(input) : 0;
 	output_address = output ? virt_to_phys(output) : 0;
 
-	arm_smccc_1_1_hvc(HV_FUNC_ID, control,
-			  input_address, output_address, &res);
+	if (is_realm_world()) {
+		local_irq_save(flags);
+		hostcall = *this_cpu_ptr(hyperv_pcpu_hostcall_struct);
+		memset(hostcall, 0, sizeof(*hostcall));
+		hostcall->gprs[0] = HV_FUNC_ID;
+		hostcall->gprs[1] = control;
+		hostcall->gprs[2] = input_address;
+		hostcall->gprs[3] = output_address;
+
+		if (rsi_host_call(virt_to_phys(hostcall)) == RSI_SUCCESS)
+			ret = hostcall->gprs[0];
+		else
+			ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
+		local_irq_restore(flags);
+		return ret;
+	}
+
+	arm_smccc_1_1_hvc(HV_FUNC_ID, control, input_address,
+			  output_address, &res);
 	return res.a0;
 }
 EXPORT_SYMBOL_GPL(hv_do_hypercall);
@@ -45,9 +66,28 @@ u64 hv_do_fast_hypercall8(u16 code, u64 input)
 {
 	struct arm_smccc_res	res;
 	u64			control;
+	struct rsi_host_call *hostcall;
+	unsigned long flags;
+	u64 ret;
 
 	control = (u64)code | HV_HYPERCALL_FAST_BIT;
 
+	if (is_realm_world()) {
+		local_irq_save(flags);
+		hostcall = *this_cpu_ptr(hyperv_pcpu_hostcall_struct);
+		memset(hostcall, 0, sizeof(*hostcall));
+		hostcall->gprs[0] = HV_FUNC_ID;
+		hostcall->gprs[1] = control;
+		hostcall->gprs[2] = input;
+
+		if (rsi_host_call(virt_to_phys(hostcall)) == RSI_SUCCESS)
+			ret = hostcall->gprs[0];
+		else
+			ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
+		local_irq_restore(flags);
+		return ret;
+	}
+
 	arm_smccc_1_1_hvc(HV_FUNC_ID, control, input, &res);
 	return res.a0;
 }
@@ -62,9 +102,29 @@ u64 hv_do_fast_hypercall16(u16 code, u64 input1, u64 input2)
 {
 	struct arm_smccc_res	res;
 	u64			control;
+	struct rsi_host_call *hostcall;
+	unsigned long flags;
+	u64 ret;
 
 	control = (u64)code | HV_HYPERCALL_FAST_BIT;
 
+	if (is_realm_world()) {
+		local_irq_save(flags);
+		hostcall = *this_cpu_ptr(hyperv_pcpu_hostcall_struct);
+		memset(hostcall, 0, sizeof(*hostcall));
+		hostcall->gprs[0] = HV_FUNC_ID;
+		hostcall->gprs[1] = control;
+		hostcall->gprs[2] = input1;
+		hostcall->gprs[3] = input2;
+
+		if (rsi_host_call(virt_to_phys(hostcall)) == RSI_SUCCESS)
+			ret = hostcall->gprs[0];
+		else
+			ret = HV_STATUS_INVALID_HYPERCALL_INPUT;
+		local_irq_restore(flags);
+		return ret;
+	}
+
 	arm_smccc_1_1_hvc(HV_FUNC_ID, control, input1, input2, &res);
 	return res.a0;
 }
@@ -76,24 +136,44 @@ EXPORT_SYMBOL_GPL(hv_do_fast_hypercall16);
 void hv_set_vpreg(u32 msr, u64 value)
 {
 	struct arm_smccc_res res;
+	struct rsi_host_call *hostcall;
+	unsigned long flags;
+	u64 status;
+
+	if (is_realm_world()) {
+		local_irq_save(flags);
+		hostcall = *this_cpu_ptr(hyperv_pcpu_hostcall_struct);
+		memset(hostcall, 0, sizeof(*hostcall));
+		hostcall->gprs[0] = HV_FUNC_ID;
+		hostcall->gprs[1] = HVCALL_SET_VP_REGISTERS |
+				    HV_HYPERCALL_FAST_BIT |
+				    HV_HYPERCALL_REP_COMP_1;
+		hostcall->gprs[2] = HV_PARTITION_ID_SELF;
+		hostcall->gprs[3] = HV_VP_INDEX_SELF;
+		hostcall->gprs[4] = msr;
+		hostcall->gprs[6] = value;
 
-	arm_smccc_1_1_hvc(HV_FUNC_ID,
-		HVCALL_SET_VP_REGISTERS | HV_HYPERCALL_FAST_BIT |
-			HV_HYPERCALL_REP_COMP_1,
-		HV_PARTITION_ID_SELF,
-		HV_VP_INDEX_SELF,
-		msr,
-		0,
-		value,
-		0,
-		&res);
+		if (rsi_host_call(virt_to_phys(hostcall)) == RSI_SUCCESS)
+			status = hostcall->gprs[0];
+		else
+			status = HV_STATUS_INVALID_HYPERCALL_INPUT;
+		local_irq_restore(flags);
+	} else {
+		arm_smccc_1_1_hvc(HV_FUNC_ID,
+				  HVCALL_SET_VP_REGISTERS |
+					  HV_HYPERCALL_FAST_BIT |
+					  HV_HYPERCALL_REP_COMP_1,
+				  HV_PARTITION_ID_SELF, HV_VP_INDEX_SELF, msr,
+				  0, value, 0, &res);
+		status = res.a0;
+	}
 
 	/*
-	 * Something is fundamentally broken in the hypervisor if
-	 * setting a VP register fails. There's really no way to
-	 * continue as a guest VM, so panic.
+	 * Something is fundamentally broken in the hypervisor (or, in a
+	 * Realm, the RMM denied the host call) if setting a VP register
+	 * fails. There's really no way to continue as a guest VM, so panic.
 	 */
-	BUG_ON(!hv_result_success(res.a0));
+	BUG_ON(!hv_result_success(status));
 }
 EXPORT_SYMBOL_GPL(hv_set_vpreg);
 
@@ -108,29 +188,56 @@ void hv_get_vpreg_128(u32 msr, struct hv_get_vp_registers_output *result)
 {
 	struct arm_smccc_1_2_regs args;
 	struct arm_smccc_1_2_regs res;
+	struct rsi_host_call *hostcall;
+	u64 status;
 
-	args.a0 = HV_FUNC_ID;
-	args.a1 = HVCALL_GET_VP_REGISTERS | HV_HYPERCALL_FAST_BIT |
-			HV_HYPERCALL_REP_COMP_1;
-	args.a2 = HV_PARTITION_ID_SELF;
-	args.a3 = HV_VP_INDEX_SELF;
-	args.a4 = msr;
+	if (is_realm_world()) {
+		unsigned long flags;
 
-	/*
-	 * Use the SMCCC 1.2 interface because the results are in registers
-	 * beyond X0-X3.
-	 */
-	arm_smccc_1_2_hvc(&args, &res);
+		local_irq_save(flags);
+		hostcall = *this_cpu_ptr(hyperv_pcpu_hostcall_struct);
+		memset(hostcall, 0, sizeof(*hostcall));
+
+		hostcall->gprs[0] = HV_FUNC_ID;
+		hostcall->gprs[1] = HVCALL_GET_VP_REGISTERS |
+				    HV_HYPERCALL_FAST_BIT |
+				    HV_HYPERCALL_REP_COMP_1;
+		hostcall->gprs[2] = HV_PARTITION_ID_SELF;
+		hostcall->gprs[3] = HV_VP_INDEX_SELF;
+		hostcall->gprs[4] = msr;
+
+		if (rsi_host_call(virt_to_phys(hostcall)) == RSI_SUCCESS) {
+			status = hostcall->gprs[0];
+			result->as64.low = hostcall->gprs[6];
+			result->as64.high = hostcall->gprs[7];
+		} else {
+			status = HV_STATUS_INVALID_HYPERCALL_INPUT;
+		}
+		local_irq_restore(flags);
+	} else {
+		args.a0 = HV_FUNC_ID;
+		args.a1 = HVCALL_GET_VP_REGISTERS | HV_HYPERCALL_FAST_BIT |
+			  HV_HYPERCALL_REP_COMP_1;
+		args.a2 = HV_PARTITION_ID_SELF;
+		args.a3 = HV_VP_INDEX_SELF;
+		args.a4 = msr;
+
+		/*
+		 * Use the SMCCC 1.2 interface because the results are in
+		 * registers beyond X0-X3.
+		 */
+		arm_smccc_1_2_hvc(&args, &res);
+		status = res.a0;
+		result->as64.low = res.a6;
+		result->as64.high = res.a7;
+	}
 
 	/*
-	 * Something is fundamentally broken in the hypervisor if
-	 * getting a VP register fails. There's really no way to
-	 * continue as a guest VM, so panic.
+	 * Something is fundamentally broken in the hypervisor (or, in a
+	 * Realm, the RMM denied the host call) if getting a VP register
+	 * fails. There's really no way to continue as a guest VM, so panic.
 	 */
-	BUG_ON(!hv_result_success(res.a0));
-
-	result->as64.low = res.a6;
-	result->as64.high = res.a7;
+	BUG_ON(!hv_result_success(status));
 }
 EXPORT_SYMBOL_GPL(hv_get_vpreg_128);
 
-- 
2.45.4


^ permalink raw reply related
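
The Realm branch of hv_get_vpreg_128() above loads the hypercall inputs
into gprs[0..4] of the host call buffer and reads the status from gprs[0]
and the 128-bit result from gprs[6] and gprs[7]. A minimal userspace
sketch of that marshalling follows. The names host_call_model,
host_call_fn and fake_host are hypothetical stand-ins and not kernel APIs,
and the sketch returns -1 on a failed call where the patch instead sets an
error status.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical userspace model of the host call buffer (not the kernel type). */
struct host_call_model {
	uint16_t immediate;
	uint64_t gprs[31];
};

struct vpreg_result {
	uint64_t low;
	uint64_t high;
};

/* Stand-in for rsi_host_call(): returns 0 on success. */
typedef int (*host_call_fn)(struct host_call_model *hc);

/* Load five inputs into gprs[0..4], call the host, unpack gprs[0], [6], [7]. */
static int get_vpreg_128_model(host_call_fn do_call, struct host_call_model *hc,
			       const uint64_t in[5], struct vpreg_result *out,
			       uint64_t *status)
{
	memset(hc, 0, sizeof(*hc));
	for (int i = 0; i < 5; i++)
		hc->gprs[i] = in[i];

	if (do_call(hc) != 0)
		return -1;

	*status = hc->gprs[0];
	out->low = hc->gprs[6];
	out->high = hc->gprs[7];
	return 0;
}

/* Fake host for illustration only: derives the result from gprs[4]. */
static int fake_host(struct host_call_model *hc)
{
	hc->gprs[6] = hc->gprs[4] + 1;
	hc->gprs[7] = hc->gprs[4] + 2;
	hc->gprs[0] = 0;
	return 0;
}
```

Passing fake_host as the callback exercises the register layout without
any hypervisor present.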

* [RFC PATCH 4/6] Drivers: hv: Mark shared memory as decrypted for CCA Realms
From: Kameron Carr @ 2026-06-09 18:10 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli
  Cc: catalin.marinas, will, mark.rutland, lpieralisi, sudeep.holla,
	arnd, thuth, linux-hyperv, linux-arm-kernel, linux-kernel,
	linux-arch, mhklinux
In-Reply-To: <20260609181030.2378391-1-kameroncarr@linux.microsoft.com>

In hv_common_cpu_init(), the per-CPU hypercall input/output pages need
to be marked as decrypted (shared) for confidential VM isolation types.
This is already done for SNP and TDX isolation; extend the same handling
to Arm CCA Realm guests so that the host hypervisor can access the
shared hypercall buffers.

is_realm_world() is only declared in arch/arm64/include/asm/rsi.h, so
using it directly in the arch-neutral drivers/hv/hv_common.c would
break the x86 build. Introduce a Hyper-V-specific helper following the
established hv_isolation_type_snp() / hv_isolation_type_tdx() pattern.

On architectures other than arm64 the weak default keeps the existing
behaviour.

Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
---
 arch/arm64/hyperv/mshyperv.c   | 5 +++++
 drivers/hv/hv_common.c         | 9 ++++++++-
 include/asm-generic/mshyperv.h | 1 +
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/hyperv/mshyperv.c b/arch/arm64/hyperv/mshyperv.c
index 08fec82691683..b595b2b9bdbbb 100644
--- a/arch/arm64/hyperv/mshyperv.c
+++ b/arch/arm64/hyperv/mshyperv.c
@@ -208,3 +208,8 @@ bool hv_is_hyperv_initialized(void)
 	return hyperv_initialized;
 }
 EXPORT_SYMBOL_GPL(hv_is_hyperv_initialized);
+
+bool hv_isolation_type_cca(void)
+{
+	return is_realm_world();
+}
diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
index 6b67ac6167891..010c7d98b5de1 100644
--- a/drivers/hv/hv_common.c
+++ b/drivers/hv/hv_common.c
@@ -499,7 +499,8 @@ int hv_common_cpu_init(unsigned int cpu)
 		}
 
 		if (!ms_hyperv.paravisor_present &&
-		    (hv_isolation_type_snp() || hv_isolation_type_tdx())) {
+		    (hv_isolation_type_snp() || hv_isolation_type_tdx() ||
+		     hv_isolation_type_cca())) {
 			ret = set_memory_decrypted((unsigned long)mem, pgcount);
 			if (ret) {
 				/* It may be unsafe to free 'mem' */
@@ -666,6 +667,12 @@ bool __weak hv_isolation_type_tdx(void)
 }
 EXPORT_SYMBOL_GPL(hv_isolation_type_tdx);
 
+bool __weak hv_isolation_type_cca(void)
+{
+	return false;
+}
+EXPORT_SYMBOL_GPL(hv_isolation_type_cca);
+
 void __weak hv_setup_vmbus_handler(void (*handler)(void))
 {
 }
diff --git a/include/asm-generic/mshyperv.h b/include/asm-generic/mshyperv.h
index bf601d67cecb9..1fa79abce743c 100644
--- a/include/asm-generic/mshyperv.h
+++ b/include/asm-generic/mshyperv.h
@@ -79,6 +79,7 @@ u64 hv_do_fast_hypercall16(u16 control, u64 input1, u64 input2);
 
 bool hv_isolation_type_snp(void);
 bool hv_isolation_type_tdx(void);
+bool hv_isolation_type_cca(void);
 
 /*
  * On architectures where Hyper-V doesn't support AEOI (e.g., ARM64),
-- 
2.45.4


^ permalink raw reply related
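
The patch above combines a weak default for hv_isolation_type_cca() with
an extra term in the decrypt condition of hv_common_cpu_init(). A minimal
userspace sketch of both ideas follows. The names isolation_model,
model_isolation_type_cca and hypercall_pages_need_decrypt are hypothetical,
and the weak attribute is GCC/Clang syntax, assumed to behave as the
kernel's __weak does.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical snapshot of the inputs to the decrypt decision. */
struct isolation_model {
	bool paravisor_present;
	bool snp;
	bool tdx;
	bool cca;
};

/*
 * Weak default, used when no other object file supplies a strong
 * definition of the same symbol (mirrors the arch-neutral fallback).
 */
__attribute__((weak)) bool model_isolation_type_cca(void)
{
	return false;
}

/* Pages are shared only without a paravisor and with some isolation type. */
static bool hypercall_pages_need_decrypt(const struct isolation_model *m)
{
	return !m->paravisor_present && (m->snp || m->tdx || m->cca);
}
```

With only this file linked, model_isolation_type_cca() returns false, which
corresponds to the "existing behaviour" the commit message describes for
non-arm64 builds.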

* [RFC PATCH 3/6] arm64: hyperv: Add per-CPU RSI host call infrastructure for CCA Realms
From: Kameron Carr @ 2026-06-09 18:10 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli
  Cc: catalin.marinas, will, mark.rutland, lpieralisi, sudeep.holla,
	arnd, thuth, linux-hyperv, linux-arm-kernel, linux-kernel,
	linux-arch, mhklinux
In-Reply-To: <20260609181030.2378391-1-kameroncarr@linux.microsoft.com>

Arm CCA Realms cannot issue Hyper-V hypercalls via HVC; the guest must
route them through the RSI_HOST_CALL interface, which takes the IPA of a
per-CPU rsi_host_call structure as its argument.

Add hyperv_pcpu_hostcall_struct as a per-CPU pointer to that buffer and
allocate it for the boot CPU during hyperv_init() and for each secondary
CPU in hv_cpu_init(). The allocation is gated on is_realm_world() so
non-Realm arm64 Hyper-V guests pay no memory cost.

Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
---
 arch/arm64/hyperv/mshyperv.c      | 78 ++++++++++++++++++++++++++++++-
 arch/arm64/include/asm/mshyperv.h |  3 ++
 2 files changed, 79 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/hyperv/mshyperv.c b/arch/arm64/hyperv/mshyperv.c
index 4fdc26ade1d74..08fec82691683 100644
--- a/arch/arm64/hyperv/mshyperv.c
+++ b/arch/arm64/hyperv/mshyperv.c
@@ -15,10 +15,16 @@
 #include <linux/errno.h>
 #include <linux/version.h>
 #include <linux/cpuhotplug.h>
+#include <linux/slab.h>
+#include <linux/percpu.h>
 #include <asm/mshyperv.h>
+#include <asm/rsi.h>
 
 static bool hyperv_initialized;
 
+void * __percpu *hyperv_pcpu_hostcall_struct;
+EXPORT_SYMBOL_GPL(hyperv_pcpu_hostcall_struct);
+
 int hv_get_hypervisor_version(union hv_hypervisor_version_info *info)
 {
 	hv_get_vpreg_128(HV_REGISTER_HYPERVISOR_VERSION,
@@ -60,6 +66,46 @@ static bool __init hyperv_detect_via_acpi(void)
 
 #endif
 
+static void hv_hostcall_free(void)
+{
+	int cpu;
+
+	if (!hyperv_pcpu_hostcall_struct)
+		return;
+
+	for_each_possible_cpu(cpu)
+		kfree(*per_cpu_ptr(hyperv_pcpu_hostcall_struct, cpu));
+	free_percpu(hyperv_pcpu_hostcall_struct);
+	hyperv_pcpu_hostcall_struct = NULL;
+}
+
+static int hv_cpu_init(unsigned int cpu)
+{
+	void **hostcall_struct;
+	gfp_t flags;
+	void *mem;
+
+	if (hyperv_pcpu_hostcall_struct) {
+		/* hv_cpu_init() can be called with IRQs disabled from hv_resume() */
+		flags = irqs_disabled() ? GFP_ATOMIC : GFP_KERNEL;
+
+		hostcall_struct = (void **)this_cpu_ptr(hyperv_pcpu_hostcall_struct);
+		/*
+		 * The hostcall_struct memory is not freed when the CPU
+		 * goes offline. If a previously offlined CPU is brought
+		 * back online, the memory is reused here.
+		 */
+		if (!*hostcall_struct) {
+			mem = kzalloc_obj(struct rsi_host_call, flags);
+			if (!mem)
+				return -ENOMEM;
+			*hostcall_struct = mem;
+		}
+	}
+
+	return hv_common_cpu_init(cpu);
+}
+
 static bool __init hyperv_detect_via_smccc(void)
 {
 	uuid_t hyperv_uuid = UUID_INIT(
@@ -73,6 +119,8 @@ static bool __init hyperv_detect_via_smccc(void)
 static int __init hyperv_init(void)
 {
 	struct hv_get_vp_registers_output	result;
+	void **hostcall_struct;
+	void *mem;
 	u64	guest_id;
 	int	ret;
 
@@ -85,6 +133,27 @@ static int __init hyperv_init(void)
 	if (!hyperv_detect_via_acpi() && !hyperv_detect_via_smccc())
 		return 0;
 
+	/*
+	 * The RSI host-call buffer is only ever used when
+	 * is_realm_world() is true. Skip the per-CPU allocation on
+	 * non-Realm guests.
+	 */
+	if (is_realm_world()) {
+		hyperv_pcpu_hostcall_struct = alloc_percpu(void *);
+		if (!hyperv_pcpu_hostcall_struct)
+			return -ENOMEM;
+
+		hostcall_struct = (void **)this_cpu_ptr(hyperv_pcpu_hostcall_struct);
+		if (!*hostcall_struct) {
+			mem = kzalloc_obj(struct rsi_host_call);
+			if (!mem) {
+				ret = -ENOMEM;
+				goto free_hostcall_mem;
+			}
+			*hostcall_struct = mem;
+		}
+	}
+
 	/* Setup the guest ID */
 	guest_id = hv_generate_guest_id(LINUX_VERSION_CODE);
 	hv_set_vpreg(HV_REGISTER_GUEST_OS_ID, guest_id);
@@ -106,12 +175,13 @@ static int __init hyperv_init(void)
 
 	ret = hv_common_init();
 	if (ret)
-		return ret;
+		goto free_hostcall_mem;
 
 	ret = cpuhp_setup_state(CPUHP_AP_HYPERV_ONLINE, "arm64/hyperv_init:online",
-				hv_common_cpu_init, hv_common_cpu_die);
+				hv_cpu_init, hv_common_cpu_die);
 	if (ret < 0) {
 		hv_common_free();
+		hv_hostcall_free();
 		return ret;
 	}
 
@@ -125,6 +195,10 @@ static int __init hyperv_init(void)
 
 	hyperv_initialized = true;
 	return 0;
+
+free_hostcall_mem:
+	hv_hostcall_free();
+	return ret;
 }
 
 early_initcall(hyperv_init);
diff --git a/arch/arm64/include/asm/mshyperv.h b/arch/arm64/include/asm/mshyperv.h
index b721d3134ab66..65a00bd14c6cb 100644
--- a/arch/arm64/include/asm/mshyperv.h
+++ b/arch/arm64/include/asm/mshyperv.h
@@ -63,4 +63,7 @@ static inline u64 hv_get_non_nested_msr(unsigned int reg)
 
 #include <asm-generic/mshyperv.h>
 
+/* Per-CPU RSI host call structure for CCA Realms */
+extern void *__percpu *hyperv_pcpu_hostcall_struct;
+
 #endif
-- 
2.45.4


^ permalink raw reply related
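
hv_cpu_init() in the patch above allocates the host call buffer only when
the per-CPU slot is still empty, so a CPU that goes offline and comes back
reuses its earlier buffer. A minimal userspace sketch of that pattern
follows, with a plain array standing in for the per-CPU pointer. The names
model_hostcall, model_cpu_init and model_hostcall_free, and the sizes
MODEL_NR_CPUS and MODEL_BUF_SIZE, are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define MODEL_NR_CPUS	4	/* hypothetical CPU count for the sketch */
#define MODEL_BUF_SIZE	256	/* size of one host call buffer */

/* One slot per CPU; NULL means "not allocated yet". */
static void *model_hostcall[MODEL_NR_CPUS];

/* Allocate the buffer on first use; later calls for the same CPU reuse it. */
static int model_cpu_init(unsigned int cpu)
{
	if (cpu >= MODEL_NR_CPUS)
		return -1;

	if (!model_hostcall[cpu]) {
		void *mem = calloc(1, MODEL_BUF_SIZE);

		if (!mem)
			return -1;
		model_hostcall[cpu] = mem;
	}
	return 0;
}

/* Free every slot, including those of CPUs that never came online. */
static void model_hostcall_free(void)
{
	for (unsigned int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		free(model_hostcall[cpu]);
		model_hostcall[cpu] = NULL;
	}
}
```

Calling model_cpu_init() twice for the same CPU leaves the pointer
unchanged, which is the reuse behaviour described in the patch comment.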

* [RFC PATCH 2/6] firmware: smccc: Detect hypervisor via RSI host call in CCA Realms
From: Kameron Carr @ 2026-06-09 18:10 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli
  Cc: catalin.marinas, will, mark.rutland, lpieralisi, sudeep.holla,
	arnd, thuth, linux-hyperv, linux-arm-kernel, linux-kernel,
	linux-arch, mhklinux
In-Reply-To: <20260609181030.2378391-1-kameroncarr@linux.microsoft.com>

Modify arm_smccc_hypervisor_has_uuid() to check is_realm_world() and
use rsi_host_call() to query the hypervisor vendor UUID when inside a
Realm. The realm path is factored into a helper,
arm_smccc_realm_get_hypervisor_uuid(), that owns a file-static
rsi_host_call buffer (uuid_hc) serialized by a spinlock.

The RSI-specific includes, file-static state and helper are guarded
with CONFIG_ARM64 because <asm/rsi.h> does not exist on 32-bit ARM.

For non-Realm environments, the existing arm_smccc_1_1_invoke() path
is unchanged.

Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
---
 drivers/firmware/smccc/smccc.c | 41 +++++++++++++++++++++++++++++++++-
 1 file changed, 40 insertions(+), 1 deletion(-)

diff --git a/drivers/firmware/smccc/smccc.c b/drivers/firmware/smccc/smccc.c
index bdee057db2fd3..6b465e65472b0 100644
--- a/drivers/firmware/smccc/smccc.c
+++ b/drivers/firmware/smccc/smccc.c
@@ -12,6 +12,12 @@
 #include <linux/platform_device.h>
 #include <asm/archrandom.h>
 
+#ifdef CONFIG_ARM64
+#include <linux/cleanup.h>
+#include <linux/spinlock.h>
+#include <asm/rsi.h>
+#endif
+
 static u32 smccc_version = ARM_SMCCC_VERSION_1_0;
 static enum arm_smccc_conduit smccc_conduit = SMCCC_CONDUIT_NONE;
 
@@ -67,12 +73,45 @@ s32 arm_smccc_get_soc_id_revision(void)
 }
 EXPORT_SYMBOL_GPL(arm_smccc_get_soc_id_revision);
 
+#ifdef CONFIG_ARM64
+static struct rsi_host_call uuid_hc;
+static DEFINE_SPINLOCK(uuid_hc_lock);
+
+/*
+ * Helper function to get the hypervisor UUID via an RsiHostCall.
+ */
+static bool arm_smccc_realm_get_hypervisor_uuid(struct arm_smccc_res *res)
+{
+	guard(spinlock_irqsave)(&uuid_hc_lock);
+
+	memset(&uuid_hc, 0, sizeof(uuid_hc));
+	uuid_hc.gprs[0] = ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID;
+
+	if (rsi_host_call(__pa_symbol(&uuid_hc)) != RSI_SUCCESS)
+		return false;
+
+	res->a0 = uuid_hc.gprs[0];
+	res->a1 = uuid_hc.gprs[1];
+	res->a2 = uuid_hc.gprs[2];
+	res->a3 = uuid_hc.gprs[3];
+	return true;
+}
+#endif
+
 bool arm_smccc_hypervisor_has_uuid(const uuid_t *hyp_uuid)
 {
 	struct arm_smccc_res res = {};
 	uuid_t uuid;
 
-	arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID, &res);
+#ifdef CONFIG_ARM64
+	if (is_realm_world()) {
+		if (!arm_smccc_realm_get_hypervisor_uuid(&res))
+			return false;
+	} else
+#endif
+		arm_smccc_1_1_invoke(ARM_SMCCC_VENDOR_HYP_CALL_UID_FUNC_ID,
+				     &res);
+
 	if (res.a0 == SMCCC_RET_NOT_SUPPORTED)
 		return false;
 
-- 
2.45.4


^ permalink raw reply related
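
The helper in the patch above copies gprs[0..3] into res->a0..a3, and the
caller then compares the returned value against the expected UUID. The
conversion itself is outside the quoted hunk, so the following is only a
sketch under a stated assumption: each register contributes its low 32
bits to the UUID, least significant byte first. The names regs_to_uuid_le
and uuid_matches are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Assumed packing (not confirmed by the patch): four 32-bit register
 * values laid out back to back, least significant byte first.
 */
static void regs_to_uuid_le(const uint32_t regs[4], uint8_t uuid[16])
{
	for (int r = 0; r < 4; r++)
		for (int b = 0; b < 4; b++)
			uuid[4 * r + b] = (uint8_t)(regs[r] >> (8 * b));
}

/* Returns nonzero when the packed registers equal the expected UUID bytes. */
static int uuid_matches(const uint32_t regs[4], const uint8_t want[16])
{
	uint8_t got[16];

	regs_to_uuid_le(regs, got);
	return memcmp(got, want, sizeof(got)) == 0;
}
```

Keeping the comparison byte-wise avoids any dependence on the host's own
endianness.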

* [RFC PATCH 1/6] arm64: rsi: Add RSI host call structure and helper function
From: Kameron Carr @ 2026-06-09 18:10 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli
  Cc: catalin.marinas, will, mark.rutland, lpieralisi, sudeep.holla,
	arnd, thuth, linux-hyperv, linux-arm-kernel, linux-kernel,
	linux-arch, mhklinux
In-Reply-To: <20260609181030.2378391-1-kameroncarr@linux.microsoft.com>

Add struct rsi_host_call to rsi_smc.h, which represents the host call
data structure used by the Realm Management Monitor (RMM) for the
RSI_HOST_CALL interface. The structure contains a 16-bit immediate field
and 31 general-purpose register values, aligned to 256 bytes as required
by the CCA RMM specification.

Add rsi_host_call() static inline wrapper in rsi_cmds.h that invokes
SMC_RSI_HOST_CALL with the physical address of the host call structure.
This will be used by Hyper-V guest code to route hypercalls through the
RSI interface when running inside an Arm CCA Realm.

Signed-off-by: Kameron Carr <kameroncarr@linux.microsoft.com>
---
 arch/arm64/include/asm/rsi_cmds.h | 9 +++++++++
 arch/arm64/include/asm/rsi_smc.h  | 6 ++++++
 2 files changed, 15 insertions(+)

diff --git a/arch/arm64/include/asm/rsi_cmds.h b/arch/arm64/include/asm/rsi_cmds.h
index 2c8763876dfb7..83b4b1f598454 100644
--- a/arch/arm64/include/asm/rsi_cmds.h
+++ b/arch/arm64/include/asm/rsi_cmds.h
@@ -159,4 +159,13 @@ static inline unsigned long rsi_attestation_token_continue(phys_addr_t granule,
 	return res.a0;
 }
 
+static inline long rsi_host_call(phys_addr_t host_call_struct)
+{
+	struct arm_smccc_res res;
+
+	arm_smccc_smc(SMC_RSI_HOST_CALL, host_call_struct, 0, 0, 0, 0, 0, 0,
+		      &res);
+	return res.a0;
+}
+
 #endif /* __ASM_RSI_CMDS_H */
diff --git a/arch/arm64/include/asm/rsi_smc.h b/arch/arm64/include/asm/rsi_smc.h
index e19253f96c940..ffea93340ed7f 100644
--- a/arch/arm64/include/asm/rsi_smc.h
+++ b/arch/arm64/include/asm/rsi_smc.h
@@ -142,6 +142,12 @@ struct realm_config {
 	 */
 } __aligned(0x1000);
 
+struct rsi_host_call {
+	u16 immediate;
+	u64 gprs[31];
+} __aligned(256);
+static_assert(sizeof(struct rsi_host_call) == 256);
+
 #endif /* __ASSEMBLER__ */
 
 /*
-- 
2.45.4


^ permalink raw reply related
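
The size check in the patch above works out as 2 bytes of immediate, 6
bytes of padding and 31 * 8 bytes of registers, for 256 bytes in total. A
minimal userspace sketch of the same layout follows, assuming an LP64
target where uint64_t is 8-byte aligned. rsi_host_call_model and
host_call_addr_ok are hypothetical names, and the alignment attribute is
GCC/Clang syntax.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the structure, with <stdint.h> types. */
struct rsi_host_call_model {
	uint16_t immediate;
	uint64_t gprs[31];
} __attribute__((aligned(256)));

/* 2 (immediate) + 6 (padding) + 31 * 8 (gprs) = 256 bytes. */
static_assert(offsetof(struct rsi_host_call_model, gprs) == 8, "gprs offset");
static_assert(sizeof(struct rsi_host_call_model) == 256, "struct size");
static_assert(_Alignof(struct rsi_host_call_model) == 256, "alignment");

/* Returns nonzero when an address satisfies the 256-byte alignment. */
static int host_call_addr_ok(uintptr_t addr)
{
	return (addr & 255u) == 0;
}
```

Because the size equals the alignment, an object of this type never
straddles a 256-byte boundary.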

* [RFC PATCH 0/6] arm64: hyperv: Add Realm support for Hyper-V
From: Kameron Carr @ 2026-06-09 18:10 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, longli
  Cc: catalin.marinas, will, mark.rutland, lpieralisi, sudeep.holla,
	arnd, thuth, linux-hyperv, linux-arm-kernel, linux-kernel,
	linux-arch, mhklinux

From: Kameron Carr <kameroncarr@microsoft.com>

Realms (CoCo VMs on ARM) require host calls to be routed through the RMM
(Realm Management Monitor) via the RSI (Realm Service Interface). This
series implements most of the necessary changes to support Realms on
Hyper-V.

One required change is not included in this series. The two buffers
allocated via vzalloc() in netvsc_init_buf() cannot be decrypted in
vmbus_establish_gpadl(). Currently only linearly mapped memory can be
decrypted. See my RFC patch [1]. I will implement the accompanying netvsc
changes based on the feedback I receive on that patch.

This patch series was tested by booting a Realm on Cobalt 200 running
Windows. I decreased the buffer size and used kzalloc() in
netvsc_init_buf() in my testing as a workaround for the issue mentioned
above.

[1] https://lore.kernel.org/all/20260521205834.1012925-1-kameroncarr@linux.microsoft.com/

Kameron Carr (6):
  arm64: rsi: Add RSI host call structure and helper function
  firmware: smccc: Detect hypervisor via RSI host call in CCA Realms
  arm64: hyperv: Add per-CPU RSI host call infrastructure for CCA Realms
  Drivers: hv: Mark shared memory as decrypted for CCA Realms
  arm64: hyperv: Route hypercalls through RSI host call in CCA Realms
  arm64: hyperv: Implement hv_is_isolation_supported() for CCA Realms

 arch/arm64/hyperv/hv_core.c       | 175 ++++++++++++++++++++++++------
 arch/arm64/hyperv/mshyperv.c      |  88 ++++++++++++++-
 arch/arm64/include/asm/mshyperv.h |   3 +
 arch/arm64/include/asm/rsi_cmds.h |   9 ++
 arch/arm64/include/asm/rsi_smc.h  |   6 +
 drivers/firmware/smccc/smccc.c    |  41 ++++++-
 drivers/hv/hv_common.c            |   9 +-
 include/asm-generic/mshyperv.h    |   1 +
 8 files changed, 294 insertions(+), 38 deletions(-)

base-commit: 7a035678fc2bdee81881170764ef08a91a076147
-- 
2.45.4

^ permalink raw reply

* Re: [PATCH v4 01/47] x86/tsc: Never re-calibrate TSC frequency if its exact timing is known
From: Sean Christopherson @ 2026-06-09 17:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Paolo Bonzini, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	Kiryl Shutsemau, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov, Jan Kiszka,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	John Stultz, H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley
In-Reply-To: <87a4t86a0l.ffs@fw13>

On Fri, Jun 05, 2026, Thomas Gleixner wrote:
> On Fri, Jun 05 2026 at 11:04, Sean Christopherson wrote:
> But we also should have a check in the TSC init code somewhere which
> validates that X86_FEATURE_CONSTANT_TSC is set when
> X86_FEATURE_TSC_KNOWN_FREQ is set. X86_FEATURE_TSC_KNOWN_FREQ is useless
> w/o X86_FEATURE_CONSTANT_TSC.

Ugh, any objection to punting on this for now?  KVM and Xen guests will trigger
TSC_KNOWN_FREQ without CONSTANT_TSC, thanks to commits:

  e10f78050323 ("kvmclock: fix TSC calibration for nested guests")
  898ec52d2ba0 ("x86/xen/time: Set the X86_FEATURE_TSC_KNOWN_FREQ flag in xen_tsc_khz()")

Hyper-V guests might as well?  Hyper-V's handling of TSC is weird, even for a
hypervisor.

Even when the frequency is provided in CPUID by the hypervisor, QEMU at least
requires a fairly explicit opt-in to advertise CONSTANT_TSC, presumably to try
to prevent users from shooting themselves in the foot.

^ permalink raw reply
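
The validation Thomas suggests above amounts to the implication
"TSC_KNOWN_FREQ requires CONSTANT_TSC". A minimal sketch of that predicate
follows, as a plain userspace function on two booleans. The name
tsc_flags_consistent is hypothetical, and nothing here reflects where or
whether such a check will land in the TSC init code.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of the proposed sanity check: a known frequency is
 * only meaningful when the TSC is also constant.
 */
static bool tsc_flags_consistent(bool known_freq, bool constant_tsc)
{
	return !known_freq || constant_tsc;
}
```

The KVM and Xen guest configurations Sean mentions correspond to the one
input that fails this predicate: known_freq set without constant_tsc.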

* Re: [PATCH net v3] net: mana: Optimize irq affinity for low vcpu configs
From: Shradha Gupta @ 2026-06-09 15:49 UTC (permalink / raw)
  To: Yury Norov
  Cc: Dexuan Cui, Wei Liu, Haiyang Zhang, K. Y. Srinivasan, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
	Dipayaan Roy, Shiraz Saleem, Michael Kelley, Long Li,
	linux-hyperv, linux-kernel, netdev, Paul Rosswurm, Shradha Gupta,
	Saurabh Singh Sengar, stable
In-Reply-To: <aidDwAQqnQRCNQP1@yury>

On Mon, Jun 08, 2026 at 06:35:44PM -0400, Yury Norov wrote:
> On Mon, Jun 01, 2026 at 03:27:46AM -0700, Shradha Gupta wrote:
> > In mana driver, the number of IRQs allocated is capped by the
> > min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> > than the vcpu count, we want to utilize all the vCPUs, irrespective of
> > their NUMA/core bindings.
> > 
> > This is important, especially in the envs where number of vCPUs are so
> > few that the softIRQ handling overhead on two IRQs on the same vCPU is
> > much more than their overheads if they were spread across sibling vCPUs.
> > 
> > This behaviour is more evident with dynamic IRQ allocation. Since MANA
> > IRQs are assigned at a later stage compared to static allocation, other
> > device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> > weights become imbalanced, causing multiple MANA IRQs to land on the
> > same vCPU, while some vCPUs have none.
> > 
> > In such cases when many parallel TCP connections are tested, the
> > throughput drops significantly.
> > 
> > Test envs:
> > =======================================================
> > Case 1: without this patch
> > =======================================================
> > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > 
> > 	TYPE		effective vCPU aff
> > =======================================================
> > IRQ0:	HWC		0
> > IRQ1:	mana_q1		0
> > IRQ2:	mana_q2		2
> > IRQ3:	mana_q3		0
> > IRQ4:	mana_q4		3
> > 
> > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > vCPU		0	1	2	3
> > =======================================================
> > pass 1:		38.85	0.03	24.89	24.65
> > pass 2:		39.15	0.03	24.57	25.28
> > pass 3:		40.36	0.03	23.20	23.17
> > 
> > =======================================================
> > Case 2: with this patch
> > =======================================================
> > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > 
> >         TYPE            effective vCPU aff
> > =======================================================
> > IRQ0:   HWC             0
> > IRQ1:   mana_q1         0
> > IRQ2:   mana_q2         1
> > IRQ3:   mana_q3         2
> > IRQ4:   mana_q4         3
> > 
> > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > vCPU            0       1       2       3
> > =======================================================
> > pass 1:         15.42	15.85	14.99	14.51
> > pass 2:         15.53	15.94	15.81	15.93
> > pass 3:         16.41	16.35	16.40	16.36
> > 
> > =======================================================
> > Throughput Impact(in Gbps, same env)
> > =======================================================
> > TCP conn	with patch	w/o patch
> > 20480		15.65		7.73
> > 10240		15.63		8.93
> > 8192		15.64		9.69
> > 6144		15.64		13.16
> > 4096		15.69		15.75
> > 2048		15.69		15.83
> > 1024		15.71		15.28
>  
> So, case 1 is irq_setup(), and case 2 is irq_setup_linear(). Is that
> correct?

That is correct.

> 
> On the previous round we've discussed a no-affinity case:
> 
>         irq_set_affinity_and_hint(irq, NULL);
> 
> My naive view is that the more freedom you give to the scheduler in
> balancing the IRQ handling load, the better results you've got. But
> your numbers show that the 'linear' distribution is still better. Can
> you add the results of that experiment as the 'case 3' please? Any
> ideas why the linear case wins over the no-affinity?

Sure, I will include them in the commit message for the next version.
The reason is that in some cases (reproducible in Azure environments),
existing non-MANA IRQs skew the IRQ weights on individual vCPUs, and
that causes the MANA queue IRQs to cluster again.

I had mentioned this possibility in the last thread, and we have since
reproduced it with additional test cases.

> 
> Thanks,
> Yury
> 
> > Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> > Cc: stable@vger.kernel.org
> > Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > Reviewed-by: Simon Horman <horms@kernel.org>
> > ---
> > Changes in v3
> >  * Optimize the comments in mana_gd_setup_dyn_irqs()
> >  * add more details in the dev_dbg for extra IRQs 
> > ---
> > Changes in v2
> >  * Removed the unused skip_first_cpu variable
> >  * fixed exit condition in irq_setup_linear() with len == 0
> >  * changed return type of irq_setup_linear() as it will always be 0
> >  * removed the unnecessary rcu_read_lock() in irq_setup_linear()
> >  * added appropriate comments to indicate expected behaviour when
> >    IRQs are more than or equal to num_online_cpus()
> > ---
> >  .../net/ethernet/microsoft/mana/gdma_main.c   | 60 ++++++++++++++++---
> >  1 file changed, 53 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > index 712a0881d720..00a28b3ca0a6 100644
> > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > @@ -197,6 +197,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
> >  	} else {
> >  		/* If dynamic allocation is enabled we have already allocated
> >  		 * hwc msi
> > +		 * Also, we make sure in this case the following is always true
> > +		 * (num_msix_usable - 1 HWC) <= num_online_cpus()
> >  		 */
> >  		gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
> >  	}
> > @@ -1717,11 +1719,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> >  	return 0;
> >  }
> >  
> > +/* should be called with cpus_read_lock() held */
> > +static void irq_setup_linear(unsigned int *irqs, unsigned int len)
> > +{
> > +	int cpu;
> > +
> > +	for_each_online_cpu(cpu) {
> > +		if (len == 0)
> > +			break;
> > +
> > +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> > +		len--;
> > +	}
> > +}
> > +
> >  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> >  {
> >  	struct gdma_context *gc = pci_get_drvdata(pdev);
> >  	struct gdma_irq_context *gic;
> > -	bool skip_first_cpu = false;
> >  	int *irqs, irq, err, i;
> >  
> >  	irqs = kmalloc_objs(int, nvec);
> > @@ -1729,6 +1744,8 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> >  		return -ENOMEM;
> >  
> >  	/*
> > +	 * In this function, num_msix_usable = HWC IRQ + Queue IRQ.
> > +	 * nvec is only Queue IRQ (HWC already setup).
> >  	 * While processing the next pci irq vector, we start with index 1,
> >  	 * as IRQ vector at index 0 is already processed for HWC.
> >  	 * However, the population of irqs array starts with index 0, to be
> > @@ -1767,13 +1784,42 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> >  	 * first CPU sibling group since they are already affinitized to HWC IRQ
> >  	 */
> >  	cpus_read_lock();
> > -	if (gc->num_msix_usable <= num_online_cpus())
> > -		skip_first_cpu = true;
> > +	if (gc->num_msix_usable <= num_online_cpus()) {
> > +		err = irq_setup(irqs, nvec, gc->numa_node, true);
> > +		if (err) {
> > +			cpus_read_unlock();
> > +			goto free_irq;
> > +		}
> > +	} else {
> > +		/*
> > +		 * When num_msix_usable are more than num_online_cpus, our
> > +		 * queue IRQs should be equal to num of online vCPUs.
> > +		 * We try to make sure queue IRQs spread across all vCPUs.
> > +		 * In such a case NUMA or CPU core affinity does not matter.
> > +		 * Note: in this case the total mana IRQ should always be
> > +		 * num_online_cpus + 1. The first HWC IRQ is already handled
> > +		 * in HWC setup calls
> > +		 * However, if CPUs went offline since num_msix_usable was
> > +		 * computed, queue IRQs will be more than num_online_cpus().
> > +		 * In such cases remaining extra IRQs will retain their default
> > +		 * affinity.
> > +		 */
> > +		int first_unassigned = num_online_cpus();
> > +		if (nvec > first_unassigned) {
> > +			char buf[32];
> > +
> > +			if (first_unassigned == nvec - 1)
> > +				snprintf(buf, sizeof(buf), "%d",
> > +					 first_unassigned);
> > +			else
> > +				snprintf(buf, sizeof(buf), "%d-%d",
> > +					 first_unassigned, nvec - 1);
> > +
> > +			dev_dbg(&pdev->dev,
> > +				"MANA IRQ indices #%s will retain the default CPU affinity\n", buf);
> > +		}
> >  
> > -	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> > -	if (err) {
> > -		cpus_read_unlock();
> > -		goto free_irq;
> > +		irq_setup_linear(irqs, nvec);
> >  	}
> >  
> >  	cpus_read_unlock();
> > 
> > base-commit: 8415598365503ced2e3d019491b0a2756c85c494
> > -- 
> > 2.34.1

^ permalink raw reply
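
The linear policy discussed above pins queue IRQ i to the i-th online CPU
and leaves any IRQs beyond the online CPU count at their default affinity.
A minimal userspace sketch of that assignment follows, with arrays
standing in for the IRQ and CPU state. spread_irqs_linear and NO_CPU are
hypothetical names, and no kernel IRQ API is called.

```c
#include <assert.h>
#include <stddef.h>

#define NO_CPU (-1)	/* marks an IRQ left at its default affinity */

/*
 * Pin IRQ i to online_cpus[i] for as long as both lists last.
 * Returns the number of IRQs that were pinned.
 */
static size_t spread_irqs_linear(const int *online_cpus, size_t nr_online,
				 int *irq_cpu, size_t nr_irqs)
{
	size_t i;

	for (i = 0; i < nr_irqs; i++)
		irq_cpu[i] = NO_CPU;

	for (i = 0; i < nr_irqs && i < nr_online; i++)
		irq_cpu[i] = online_cpus[i];

	return i;
}
```

With four online CPUs and four queue IRQs this yields the 0, 1, 2, 3
mapping shown for mana_q1 through mana_q4 in the "Case 2" table.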

* Re: [PATCH net v3] net: mana: Optimize irq affinity for low vcpu configs
From: Shradha Gupta @ 2026-06-09 15:37 UTC (permalink / raw)
  To: Yury Norov
  Cc: Jacob Keller, Dexuan Cui, Wei Liu, Haiyang Zhang,
	K. Y. Srinivasan, Andrew Lunn, David S. Miller, Eric Dumazet,
	Jakub Kicinski, Paolo Abeni, Konstantin Taranov, Simon Horman,
	Erni Sri Satya Vennela, Dipayaan Roy, Shiraz Saleem,
	Michael Kelley, Long Li, linux-hyperv, linux-kernel, netdev,
	Paul Rosswurm, Shradha Gupta, Saurabh Singh Sengar, stable
In-Reply-To: <aidB82s7A0Roh2dD@yury>

On Mon, Jun 08, 2026 at 06:28:03PM -0400, Yury Norov wrote:
> On Wed, Jun 03, 2026 at 09:39:18PM -0700, Shradha Gupta wrote:
> > On Wed, Jun 03, 2026 at 02:49:24PM -0700, Jacob Keller wrote:
> > > On 6/1/2026 3:27 AM, Shradha Gupta wrote:
> > > > In mana driver, the number of IRQs allocated is capped by the
> > > > min(num_cpu + 1, queue count). In cases, where the IRQ count is greater
> > > > than the vcpu count, we want to utilize all the vCPUs, irrespective of
> > > > their NUMA/core bindings.
> > > > 
> > > > This is important, especially in the envs where number of vCPUs are so
> > > > few that the softIRQ handling overhead on two IRQs on the same vCPU is
> > > > much more than their overheads if they were spread across sibling vCPUs.
> > > > 
> > > > This behaviour is more evident with dynamic IRQ allocation. Since MANA
> > > > IRQs are assigned at a later stage compared to static allocation, other
> > > > device IRQs may already be affinitized to the vCPUs. As a result, IRQ
> > > > weights become imbalanced, causing multiple MANA IRQs to land on the
> > > > same vCPU, while some vCPUs have none.
> > > > 
> > > > In such cases when many parallel TCP connections are tested, the
> > > > throughput drops significantly.
> > > > 
> > > > Test envs:
> > > > =======================================================
> > > > Case 1: without this patch
> > > > =======================================================
> > > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > > > 
> > > > 	TYPE		effective vCPU aff
> > > > =======================================================
> > > > IRQ0:	HWC		0
> > > > IRQ1:	mana_q1		0
> > > > IRQ2:	mana_q2		2
> > > > IRQ3:	mana_q3		0
> > > > IRQ4:	mana_q4		3
> > > > 
> > > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > > vCPU		0	1	2	3
> > > > =======================================================
> > > > pass 1:		38.85	0.03	24.89	24.65
> > > > pass 2:		39.15	0.03	24.57	25.28
> > > > pass 3:		40.36	0.03	23.20	23.17
> > > > 
> > > > =======================================================
> > > > Case 2: with this patch
> > > > =======================================================
> > > > 4 vcpu(2 cores), 5 MANA IRQs (1 HWC + 4 Queue)
> > > > 
> > > >         TYPE            effective vCPU aff
> > > > =======================================================
> > > > IRQ0:   HWC             0
> > > > IRQ1:   mana_q1         0
> > > > IRQ2:   mana_q2         1
> > > > IRQ3:   mana_q3         2
> > > > IRQ4:   mana_q4         3
> > > > 
> > > > %soft on each vCPU(mpstat -P ALL 1) on receiver
> > > > vCPU            0       1       2       3
> > > > =======================================================
> > > > pass 1:         15.42	15.85	14.99	14.51
> > > > pass 2:         15.53	15.94	15.81	15.93
> > > > pass 3:         16.41	16.35	16.40	16.36
> > > > 
> > > > =======================================================
> > > > Throughput Impact(in Gbps, same env)
> > > > =======================================================
> > > > TCP conn	with patch	w/o patch
> > > > 20480		15.65		7.73
> > > > 10240		15.63		8.93
> > > > 8192		15.64		9.69
> > > > 6144		15.64		13.16
> > > > 4096		15.69		15.75
> > > > 2048		15.69		15.83
> > > > 1024		15.71		15.28
> > > > 
> > > > Fixes: 755391121038 ("net: mana: Allocate MSI-X vectors dynamically")
> > > > Cc: stable@vger.kernel.org
> > > > Co-developed-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > > > Signed-off-by: Erni Sri Satya Vennela <ernis@linux.microsoft.com>
> > > > Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com>
> > > > Reviewed-by: Haiyang Zhang <haiyangz@microsoft.com>
> > > > Reviewed-by: Simon Horman <horms@kernel.org>
> > > > ---
> > > > Changes in v3
> > > >  * Optimize the comments in mana_gd_setup_dyn_irqs()
> > > >  * add more details in the dev_dbg for extra IRQs 
> > > > ---
> > > > Changes in v2
> > > >  * Removed the unused skip_first_cpu variable
> > > >  * fixed exit condition in irq_setup_linear() with len == 0
> > > >  * changed return type of irq_setup_linear() as it will always be 0
> > > >  * removed the unnecessary rcu_read_lock() in irq_setup_linear()
> > > >  * added appropriate comments to indicate expected behaviour when
> > > >    IRQs are more than or equal to num_online_cpus()
> > > > ---
> > > >  .../net/ethernet/microsoft/mana/gdma_main.c   | 60 ++++++++++++++++---
> > > >  1 file changed, 53 insertions(+), 7 deletions(-)
> > > > 
> > > > diff --git a/drivers/net/ethernet/microsoft/mana/gdma_main.c b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > index 712a0881d720..00a28b3ca0a6 100644
> > > > --- a/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > +++ b/drivers/net/ethernet/microsoft/mana/gdma_main.c
> > > > @@ -197,6 +197,8 @@ static int mana_gd_query_max_resources(struct pci_dev *pdev)
> > > >  	} else {
> > > >  		/* If dynamic allocation is enabled we have already allocated
> > > >  		 * hwc msi
> > > > +		 * Also, we make sure in this case the following is always true
> > > > +		 * (num_msix_usable - 1 HWC) <= num_online_cpus()
> > > >  		 */
> > > >  		gc->num_msix_usable = min(resp.max_msix, num_online_cpus() + 1);
> > > >  	}
> > > > @@ -1717,11 +1719,24 @@ static int irq_setup(unsigned int *irqs, unsigned int len, int node,
> > > >  	return 0;
> > > >  }
> > > >  
> > > > +/* should be called with cpus_read_lock() held */
> > > > +static void irq_setup_linear(unsigned int *irqs, unsigned int len)
> > > > +{
> > > > +	int cpu;
> > > > +
> > > > +	for_each_online_cpu(cpu) {
> > > > +		if (len == 0)
> > > > +			break;
> > > > +
> > > > +		irq_set_affinity_and_hint(*irqs++, cpumask_of(cpu));
> > > > +		len--;
> > > > +	}
> > > > +}
> > > 
> > > I would find all of this a bit easier to follow if irq_setup_linear()
> > > and irq_setup() had a mana prefix so it was more obvious these are
> > > > specific to the driver. Of course irq_setup is pre-existing, and it's not
> > > my driver so do as you will :)
> > > 
> > > > +
> > > >  static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > > >  {
> > > >  	struct gdma_context *gc = pci_get_drvdata(pdev);
> > > >  	struct gdma_irq_context *gic;
> > > > -	bool skip_first_cpu = false;
> > > >  	int *irqs, irq, err, i;
> > > >  
> > > >  	irqs = kmalloc_objs(int, nvec);
> > > > @@ -1729,6 +1744,8 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > > >  		return -ENOMEM;
> > > >  
> > > >  	/*
> > > > +	 * In this function, num_msix_usable = HWC IRQ + Queue IRQ.
> > > > +	 * nvec is only Queue IRQ (HWC already setup).
> > > >  	 * While processing the next pci irq vector, we start with index 1,
> > > >  	 * as IRQ vector at index 0 is already processed for HWC.
> > > >  	 * However, the population of irqs array starts with index 0, to be
> > > > @@ -1767,13 +1784,42 @@ static int mana_gd_setup_dyn_irqs(struct pci_dev *pdev, int nvec)
> > > >  	 * first CPU sibling group since they are already affinitized to HWC IRQ
> > > >  	 */
> > > >  	cpus_read_lock();
> > > > -	if (gc->num_msix_usable <= num_online_cpus())
> > > > -		skip_first_cpu = true;
> > > > +	if (gc->num_msix_usable <= num_online_cpus()) {
> > > > +		err = irq_setup(irqs, nvec, gc->numa_node, true);
> > > > +		if (err) {
> > > > +			cpus_read_unlock();
> > > > +			goto free_irq;
> > > > +		}
> > > > +	} else {
> > > > +		/*
> > > > +		 * When num_msix_usable are more than num_online_cpus, our
> > > > +		 * queue IRQs should be equal to num of online vCPUs.
> > > > +		 * We try to make sure queue IRQs spread across all vCPUs.
> > > > +		 * In such a case NUMA or CPU core affinity does not matter.
> > > > +		 * Note: in this case the total mana IRQ should always be
> > > > +		 * num_online_cpus + 1. The first HWC IRQ is already handled
> > > > +		 * in HWC setup calls
> > > > +		 * However, if CPUs went offline since num_msix_usable was
> > > > +		 * computed, queue IRQs will be more than num_online_cpus().
> > > > +		 * In such cases remaining extra IRQs will retain their default
> > > > +		 * affinity.
> > > > +		 */
> > > > +		int first_unassigned = num_online_cpus();
> > > > +		if (nvec > first_unassigned) {
> > > > +			char buf[32];
> > > > +
> > > > +			if (first_unassigned == nvec - 1)
> > > > +				snprintf(buf, sizeof(buf), "%d",
> > > > +					 first_unassigned);
> > > > +			else
> > > > +				snprintf(buf, sizeof(buf), "%d-%d",
> > > > +					 first_unassigned, nvec - 1);
> > > > +
> > > > +			dev_dbg(&pdev->dev,
> > > > +				"MANA IRQ indices #%s will retain the default CPU affinity\n", buf);
> > > > +		}
> > > >  
> > > > -	err = irq_setup(irqs, nvec, gc->numa_node, skip_first_cpu);
> > > > -	if (err) {
> > > > -		cpus_read_unlock();
> > > > -		goto free_irq;
> > > > +		irq_setup_linear(irqs, nvec);
> > > 
> > > irq_setup() doesn't have a driver prefix, but is actually a static
> > > function in gdma_main.c, so its implementation is specific to this
> > > driver despite its name.
> > > 
> > > So if I understand this change correctly, if the number of usable MSI-X
> > > > vectors is smaller than the number of CPUs, you continue to use the
> > > > current irq_setup logic; otherwise you switch to the simpler "linear"
> > > logic.
> > > 
> > > I guess this means the logic and heuristic used in irq_setup() breaks
> > > down when the number of vectors is large and number of vCPU is small?
> > > 
> > > Makes sense.
> > > 
> > 
> > Hi Jacob,
> > 
> > Yes, that's the right understanding. 
> > > Regarding the function names, let me take that up in a separate patch to
> > add prefixes to all such functions.
> 
> > I agree. Now that you've got more than one setup method, the short
> > 'irq_setup' looks confusing, if not misleading. You need names that
> > distinguish the NUMA-based and plain linear methods.
> 
> Thanks,
> Yury

Noted. Now that a v4 is needed, I will make these changes there. Thanks.

-Shradha
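
For readers of the archive, the "linear" assignment discussed above can be
modelled in userspace. This is only a minimal sketch, assuming IRQ i is
pinned to the i-th online CPU and any surplus IRQs keep their default
affinity; assign_linear() and IRQ_DEFAULT_AFFINITY are hypothetical names,
not part of the driver, and no kernel API is called.

```c
#include <stddef.h>

#define IRQ_DEFAULT_AFFINITY (-1)	/* hypothetical marker: affinity left untouched */

/*
 * Userspace model of a linear spread: IRQ i is pinned to the i-th online CPU.
 * If there are more IRQs than online CPUs, the extra IRQs keep their default
 * affinity. Returns the number of IRQs that were pinned.
 */
size_t assign_linear(const int *online_cpus, size_t ncpus,
		     int *irq_cpu, size_t nirqs)
{
	size_t pinned = nirqs < ncpus ? nirqs : ncpus;
	size_t i;

	for (i = 0; i < pinned; i++)
		irq_cpu[i] = online_cpus[i];
	for (; i < nirqs; i++)
		irq_cpu[i] = IRQ_DEFAULT_AFFINITY;

	return pinned;
}
```

With three online CPUs and five queue IRQs, the first three IRQs get one
CPU each and the last two stay at the default, which mirrors the case the
patch reports via dev_dbg().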

^ permalink raw reply

* Re: [PATCH net-next v2] net: mana: Add Interrupt Moderation support
From: Paolo Abeni @ 2026-06-09 13:49 UTC (permalink / raw)
  To: Haiyang Zhang, linux-hyperv, netdev, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Andrew Lunn,
	David S. Miller, Eric Dumazet, Jakub Kicinski, Konstantin Taranov,
	Simon Horman, Shradha Gupta, Erni Sri Satya Vennela, Dipayaan Roy,
	Aditya Garg, Kees Cook, Breno Leitao, linux-kernel, linux-rdma
  Cc: paulros
In-Reply-To: <20260604234211.2056341-1-haiyangz@linux.microsoft.com>

On 6/5/26 1:41 AM, Haiyang Zhang wrote:
> diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
> index db14357d3732..b1e0c444f414 100644
> --- a/drivers/net/ethernet/microsoft/mana/mana_en.c
> +++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
> @@ -1551,6 +1551,9 @@ int mana_create_wq_obj(struct mana_port_context *apc,
>  
>  	mana_gd_init_req_hdr(&req.hdr, MANA_CREATE_WQ_OBJ,
>  			     sizeof(req), sizeof(resp));
> +
> +	req.hdr.req.msg_version = GDMA_MESSAGE_V3;
> +	req.hdr.resp.msg_version = GDMA_MESSAGE_V2;

Sashiko noted that the above could break initialization on older firmware:

https://sashiko.dev/#/patchset/20260604234211.2056341-1-haiyangz%40linux.microsoft.com

[...]
> +static void mana_update_rx_dim(struct mana_cq *cq)
> +{
> +	struct mana_port_context *apc = netdev_priv(cq->rxq->ndev);
> +	struct mana_rxq *rxq = cq->rxq;
> +	struct dim_sample dim_sample = {};

Minor nit: please fix the variable declaration order above. Other
occurrences below.

[...]
> @@ -440,17 +474,94 @@ static int mana_set_coalesce(struct net_device *ndev,
>  		return -EINVAL;
>  	}
>  
> -	saved_cqe_coalescing_enable = apc->cqe_coalescing_enable;
> +	if (ec->rx_coalesce_usecs > MANA_INTR_MODR_USEC_MAX ||
> +	    ec->tx_coalesce_usecs > MANA_INTR_MODR_USEC_MAX) {
> +		NL_SET_ERR_MSG_FMT(extack,
> +				   "coalesce usecs must be <= %lu",
> +				   MANA_INTR_MODR_USEC_MAX);
> +		return -EINVAL;
> +	}
> +
> +	if (ec->rx_max_coalesced_frames > MANA_INTR_MODR_COMP_MAX ||
> +	    ec->tx_max_coalesced_frames > MANA_INTR_MODR_COMP_MAX) {
> +		NL_SET_ERR_MSG_FMT(extack,
> +				   "coalesce frames must be <= %lu",
> +				   MANA_INTR_MODR_COMP_MAX);
> +		return -EINVAL;
> +	}
> +
> +	if (ec->rx_coalesce_usecs != apc->intr_modr_rx_usec ||
> +	    ec->rx_max_coalesced_frames != apc->intr_modr_rx_comp ||
> +	    ec->tx_coalesce_usecs != apc->intr_modr_tx_usec ||
> +	    ec->tx_max_coalesced_frames != apc->intr_modr_tx_comp)
> +		modr_changed = true;
> +
> +	saved.intr_modr_rx_usec = apc->intr_modr_rx_usec;
> +	saved.intr_modr_rx_comp = apc->intr_modr_rx_comp;
> +	saved.intr_modr_tx_usec = apc->intr_modr_tx_usec;
> +	saved.intr_modr_tx_comp = apc->intr_modr_tx_comp;
> +
> +	apc->intr_modr_rx_usec = ec->rx_coalesce_usecs;
> +	apc->intr_modr_rx_comp = ec->rx_max_coalesced_frames;
> +	apc->intr_modr_tx_usec = ec->tx_coalesce_usecs;
> +	apc->intr_modr_tx_comp = ec->tx_max_coalesced_frames;
> +
> +	if (!!ec->use_adaptive_rx_coalesce != apc->rx_dim_enabled ||
> +	    !!ec->use_adaptive_tx_coalesce != apc->tx_dim_enabled)
> +		dim_changed = true;
> +
> +	saved.rx_dim_enabled = apc->rx_dim_enabled;
> +	saved.tx_dim_enabled = apc->tx_dim_enabled;
> +	apc->rx_dim_enabled = !!ec->use_adaptive_rx_coalesce;
> +	apc->tx_dim_enabled = !!ec->use_adaptive_tx_coalesce;
> +
> +	saved.cqe_coalescing_enable = apc->cqe_coalescing_enable;
>  	apc->cqe_coalescing_enable =
>  		kernel_coal->rx_cqe_frames == MANA_RXCOMP_OOB_NUM_PPI;
>  
>  	if (!apc->port_is_up)
>  		return 0;
>  
> -	err = mana_config_rss(apc, TRI_STATE_TRUE, false, false);
> -	if (err)
> -		apc->cqe_coalescing_enable = saved_cqe_coalescing_enable;
> +	if (apc->cqe_coalescing_enable != saved.cqe_coalescing_enable &&
> +	    !modr_changed && !dim_changed) {
> +		/* If only CQE coalescing setting is changed, we can just update
> +		 * RSS configuration.
> +		 */
> +		err = mana_config_rss(apc, TRI_STATE_TRUE, false, false);
> +		if (err) {
> +			netdev_err(ndev, "Change CQE coalescing failed: %d\n",
> +				   err);
> +			apc->cqe_coalescing_enable =
> +				saved.cqe_coalescing_enable;
> +			return err;
> +		}
> +		return 0;
> +	}
> +
> +	if (modr_changed || dim_changed) {
> +		err = mana_detach(ndev, false);
> +		if (err) {
> +			netdev_err(ndev, "mana_detach failed: %d\n", err);
> +			goto restore_modr;
> +		}
> +
> +		err = mana_attach(ndev);
> +		if (err) {
> +			netdev_err(ndev, "mana_attach failed: %d\n", err);
> +			goto restore_modr;
> +		}

You should try hard to avoid this sequence: if mana_attach fails,
mana_set_coalesce() will leave the NIC unexpectedly down.

/P
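
One way to picture the concern is a userspace model in which a failed
re-attach restores the previous setting and tries to bring the port back
up. This is a sketch under stated assumptions, not the fix requested
above and not driver code: struct fake_port, fake_attach(), fake_detach(),
fake_set_usecs() and FAKE_USEC_MAX are all hypothetical.

```c
#include <errno.h>
#include <stdbool.h>

#define FAKE_USEC_MAX 100	/* hypothetical limit enforced by the fake device */

struct fake_port {
	int usecs;	/* current moderation setting */
	bool up;	/* true while the port is attached */
};

/* Hypothetical attach: fails if the device rejects the current setting. */
int fake_attach(struct fake_port *p)
{
	if (p->usecs > FAKE_USEC_MAX)
		return -EINVAL;
	p->up = true;
	return 0;
}

void fake_detach(struct fake_port *p)
{
	p->up = false;
}

/* Apply a new setting; on failure, restore the old one and re-attach. */
int fake_set_usecs(struct fake_port *p, int new_usecs)
{
	int saved = p->usecs;
	int err;

	fake_detach(p);
	p->usecs = new_usecs;
	err = fake_attach(p);
	if (!err)
		return 0;

	/* Attach failed: restore the old setting and try to come back up. */
	p->usecs = saved;
	(void)fake_attach(p);
	return err;
}
```

The caller still sees the error, but the port ends up attached with its
previous setting rather than left down. A real driver would also have to
cope with the second attach failing, which this model ignores.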


^ permalink raw reply

* Re: [PATCH v3 2/4] scsi: host: allocate struct Scsi_Host on the NUMA node of the host adapter
From: John Garry @ 2026-06-09 13:03 UTC (permalink / raw)
  To: Sumit Saxena, Martin K . Petersen, Jens Axboe
  Cc: James E . J . Bottomley, linux-scsi, linux-block, Adam Radford,
	Khalid Aziz, Adaptec OEM Raid Solutions, Matthew Wilcox,
	Hannes Reinecke, Juergen E . Fischer, Russell King,
	linux-arm-kernel, Finn Thain, Michael Schmitz, Anil Gurumurthy,
	Sudarsana Kalluru, Oliver Neukum, Ali Akcaagac, Jamie Lenehan,
	Ram Vegesna, target-devel, Bradley Grove, Satish Kharat,
	Sesidhar Baddela, Karan Tilak Kumar, Yihang Li, Don Brace,
	storagedev, HighPoint Linux Team, Tyrel Datwyler,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, linuxppc-dev, Brian King, Lee Duncan,
	Chris Leech, Mike Christie, open-iscsi, Justin Tee, Paul Ely,
	Kashyap Desai, Shivasharan S, Chandrakanth Patil,
	megaraidlinux.pdl, Sathya Prakash Veerichetty, Sreekanth Reddy,
	mpi3mr-linuxdrv.pdl, Suganath Prabu Subramani, Ranjan Kumar,
	MPT-FusionLinux.pdl, Daniel Palmer, GOTO Masanori, YOKOTA Hiroshi,
	Jack Wang, Geoff Levand, Michael Reed, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Narsimhulu Musini, K . Y . Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, linux-hyperv,
	Michael S . Tsirkin, Jason Wang, Paolo Bonzini, Stefan Hajnoczi,
	Eugenio Perez, virtualization, Vishal Bhakta,
	bcm-kernel-feedback-list, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, xen-devel
In-Reply-To: <20260609121806.2121755-3-sumit.saxena@broadcom.com>

On 09/06/2026 13:18, Sumit Saxena wrote:
> scsi_host_alloc() used kzalloc(), which always picks an arbitrary node.
> Extend the function to accept a 'struct device *dev' parameter and use
> kzalloc_node() with dev_to_node(dev) so the Scsi_Host struct lands on
> the same NUMA node as the HBA, mirroring the treatment already applied
> to struct scsi_device, struct scsi_target, and shost_data.
> 
> When dev is NULL (legacy ISA/platform drivers without a dma_dev) the
> allocation falls back to NUMA_NO_NODE, preserving existing behaviour.
> 
> Update all in-tree callers:
>    - PCI-based HBA drivers pass &pdev->dev (or the equivalent struct
>      member such as &phba->pcidev->dev, &h->pdev->dev, &ha->pdev->dev)
>      so their host struct is placed on the adapter's node.
>    - Non-PCI drivers (ISA, Amiga, ARM PCMCIA, virtio, Hyper-V, PS3, …)
>      pass NULL.
>    - libfc's libfc_host_alloc() inline helper passes NULL; FC drivers
>      that want NUMA awareness can open-code the call with their pdev.
> 
> Suggested-by: John Garry <john.g.garry@oracle.com>
> Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com>

Wow ... I was not expecting such a large change, but admittedly I did 
not consider the implementation.

I did mention that pci-based adapters should already be effectively 
doing kzalloc_node() since the adapter driver is probed on the local 
NUMA node (and kmalloc first tries local NUMA allocations).

> ---

> diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
> index e047747d4ecf..e1f42be79729 100644
> --- a/drivers/scsi/hosts.c
> +++ b/drivers/scsi/hosts.c
> @@ -403,12 +403,14 @@ static const struct device_type scsi_host_type = {
>    * Return value:
>    * 	Pointer to a new Scsi_Host
>    **/
> -struct Scsi_Host *scsi_host_alloc(const struct scsi_host_template *sht, int privsize)
> +struct Scsi_Host *scsi_host_alloc(const struct scsi_host_template *sht, int privsize,
> +				  struct device *dev)
>   {
>   	struct Scsi_Host *shost;
>   	int index;
>   
> -	shost = kzalloc(sizeof(struct Scsi_Host) + privsize, GFP_KERNEL);
> +	shost = kzalloc_node(sizeof(struct Scsi_Host) + privsize, GFP_KERNEL,
> +			     dev ? dev_to_node(dev) : NUMA_NO_NODE);
>   	if (!shost)
>   		return NULL;
>   

> -extern struct Scsi_Host *scsi_host_alloc(const struct scsi_host_template *, int);
> +extern struct Scsi_Host *scsi_host_alloc(const struct scsi_host_template *sht,
> +					 int privsize, struct device *dev);
>   extern int __must_check scsi_add_host_with_dma(struct Scsi_Host *,
>   					       struct device *,
>   					       struct device *);


scsi_add_host_with_dma() and scsi_add_host() do assignment of 
shost->dma_dev, so I think that could be moved to scsi_host_alloc().

I can imagine that we always know dev and dma_dev at Scsi_Host alloc
time (and not just at scsi_add_host() time). However, those would be very
intrusive changes.

Let me consider this more. Maybe we can have a platform device version 
of shost alloc, as I can't imagine that we care about much more. Thanks!

^ permalink raw reply

* Re: [PATCH v4 10/47] x86/tsc: Consolidate forcing of X86_FEATURE_TSC_KNOWN_FREQ for PV code
From: Sean Christopherson @ 2026-06-09 12:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: David Woodhouse, Paolo Bonzini, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, Kiryl Shutsemau, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Ajay Kaher,
	Alexey Makhalov, Jan Kiszka, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, John Stultz, H. Peter Anvin,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, Tom Lendacky, Nikunj A Dadhania,
	Michael Kelley
In-Reply-To: <87a4t440js.ffs@fw13>

On Tue, Jun 09, 2026, Thomas Gleixner wrote:
> On Mon, Jun 08 2026 at 15:38, Sean Christopherson wrote:
> > On Sat, Jun 06, 2026, David Woodhouse wrote:
> >> > Along with:
> >> > 
> >> >    if (!hypervisor_is_type(X86_HYPER_NATIVE)) {
> >> >       if (tsc_khz_early)
> >> >          pr_warn("Ignoring non-sensical tsc_early_khz command line argument\n");
> >> > 
> >> > or something daft like that.
> >
> > Ya, I ended up in the same place once Sashiko pointed out that skipping the SNP/TDX
> > setup was hazardous[*], and also once I realized that tsc_khz_early *complemented*
> > the refinement instead of replacing it.
> >
> > This is what I have locally:
> >
> >         if (cc_platform_has(CC_ATTR_GUEST_SNP_SECURE_TSC))
> >                 known_tsc_khz = snp_secure_tsc_init();
> >         else if (boot_cpu_has(X86_FEATURE_TDX_GUEST))
> >                 known_tsc_khz = tdx_tsc_init();
> >
> >         /*
> >          * If the TSC frequency wasn't provided by trusted firmware, try to get
> >          * it from the hypervisor (which is untrusted when running as a CoCo guest).
> >          */
> >         if (!known_tsc_khz && x86_init.hyper.get_tsc_khz)
> >                 known_tsc_khz = x86_init.hyper.get_tsc_khz();
> >
> >         /*
> >          * Mark the TSC frequency as known if it was obtained from a hypervisor
> >          * or trusted firmware.  Don't mark the frequency as known if the user
> >          * specified the frequency, as the user-provided frequency is intended
> >          * as a "starting point", not a known, guaranteed frequency.
> >          */
> >         if (known_tsc_khz && !tsc_early_khz)
> >                 setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> 
> If the frequency is known via the above then you want to set the
> KNOWN_FREQ feature bit unconditionally. SNP/TDX/hypervisor override the
> command line argument as you print below.

Doh, forgot to remove that check when I shuffled things around.  Thank you!
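
The agreed-upon ordering can be modelled in plain C. This is only a
sketch, assuming the precedence discussed above (trusted firmware first,
then the hypervisor, with the command-line value never deciding whether
the frequency counts as known); struct tsc_sources and
pick_known_tsc_khz() are hypothetical names, not kernel symbols.

```c
#include <stdbool.h>

/* Hypothetical inputs; 0 means "this source did not provide a value". */
struct tsc_sources {
	unsigned int firmware_khz;	/* SNP/TDX trusted firmware */
	unsigned int hypervisor_khz;	/* hypervisor-provided value */
	unsigned int cmdline_khz;	/* user-supplied early value */
};

/*
 * Prefer trusted firmware, then the hypervisor. The frequency is marked
 * known whenever either source provided it; cmdline_khz is deliberately
 * not consulted here.
 */
unsigned int pick_known_tsc_khz(const struct tsc_sources *s, bool *known_freq)
{
	unsigned int khz = s->firmware_khz;

	if (!khz)
		khz = s->hypervisor_khz;

	*known_freq = khz != 0;
	return khz;
}
```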

^ permalink raw reply

* [PATCH v3 4/4] scsi: use percpu counters for iostat counters in struct scsi_device
From: Sumit Saxena @ 2026-06-09 12:18 UTC (permalink / raw)
  To: Martin K . Petersen, Jens Axboe
  Cc: James E . J . Bottomley, linux-scsi, linux-block, Adam Radford,
	Khalid Aziz, Adaptec OEM Raid Solutions, Matthew Wilcox,
	Hannes Reinecke, Juergen E . Fischer, Russell King,
	linux-arm-kernel, Finn Thain, Michael Schmitz, Anil Gurumurthy,
	Sudarsana Kalluru, Oliver Neukum, Ali Akcaagac, Jamie Lenehan,
	Ram Vegesna, target-devel, Bradley Grove, Satish Kharat,
	Sesidhar Baddela, Karan Tilak Kumar, Yihang Li, Don Brace,
	storagedev, HighPoint Linux Team, Tyrel Datwyler,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, linuxppc-dev, Brian King, Lee Duncan,
	Chris Leech, Mike Christie, open-iscsi, Justin Tee, Paul Ely,
	Kashyap Desai, Shivasharan S, Chandrakanth Patil,
	megaraidlinux.pdl, Sathya Prakash Veerichetty, Sreekanth Reddy,
	mpi3mr-linuxdrv.pdl, Suganath Prabu Subramani, Ranjan Kumar,
	MPT-FusionLinux.pdl, Daniel Palmer, GOTO Masanori, YOKOTA Hiroshi,
	Jack Wang, Geoff Levand, Michael Reed, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Narsimhulu Musini, K . Y . Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, linux-hyperv,
	Michael S . Tsirkin, Jason Wang, Paolo Bonzini, Stefan Hajnoczi,
	Eugenio Perez, virtualization, Vishal Bhakta,
	bcm-kernel-feedback-list, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, xen-devel, Sumit Saxena, John Garry
In-Reply-To: <20260609121806.2121755-1-sumit.saxena@broadcom.com>

iorequest_cnt and iodone_cnt are updated on every command dispatch and
completion, often from different CPUs on high queue depth workloads.
Using adjacent atomic_t fields causes cache line contention between the
submission and completion paths.

Extend the same treatment to ioerr_cnt and iotmo_cnt so all four iostat
counters in struct scsi_device use struct percpu_counter.

Suggested-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com>
---
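The idea behind the conversion can be illustrated with a userspace
model: each CPU updates its own cache-line-sized slot and a reader sums
all slots. This is a minimal sketch, not the kernel's struct
percpu_counter (which also keeps a shared count that per-CPU deltas are
folded into); struct sharded_counter, sharded_add(), sharded_sum(),
NR_SLOTS and CACHE_LINE are hypothetical, and the code is not
thread-safe as written.

```c
#include <stdint.h>

#define NR_SLOTS 4	/* hypothetical CPU count */
#define CACHE_LINE 64	/* assumed cache line size in bytes */

/* One slot per CPU, padded so neighbouring slots do not share a line. */
struct counter_slot {
	int64_t val;
	char pad[CACHE_LINE - sizeof(int64_t)];
};

struct sharded_counter {
	struct counter_slot slot[NR_SLOTS];
};

/* Writers touch only their own slot, so hot paths do not contend. */
void sharded_add(struct sharded_counter *c, unsigned int cpu, int64_t n)
{
	c->slot[cpu % NR_SLOTS].val += n;
}

/* Readers pay the cost: the total is the sum over every slot. */
int64_t sharded_sum(const struct sharded_counter *c)
{
	int64_t sum = 0;
	unsigned int i;

	for (i = 0; i < NR_SLOTS; i++)
		sum += c->slot[i].val;
	return sum;
}
```

The trade-off matches the patch: increments on dispatch and completion
become cheap, while the rare sysfs read does the summation.
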
 drivers/scsi/scsi_error.c  |  4 ++--
 drivers/scsi/scsi_lib.c    | 10 +++++-----
 drivers/scsi/scsi_scan.c   |  8 ++++++++
 drivers/scsi/scsi_sysfs.c  | 23 ++++++++++++++---------
 drivers/scsi/sd.c          |  2 +-
 include/scsi/scsi_device.h |  9 +++++----
 6 files changed, 35 insertions(+), 21 deletions(-)

diff --git a/drivers/scsi/scsi_error.c b/drivers/scsi/scsi_error.c
index 147127fb4db9..b1aa7da2ba7c 100644
--- a/drivers/scsi/scsi_error.c
+++ b/drivers/scsi/scsi_error.c
@@ -349,7 +349,7 @@ enum blk_eh_timer_return scsi_timeout(struct request *req)
 	trace_scsi_dispatch_cmd_timeout(scmd);
 	scsi_log_completion(scmd, TIMEOUT_ERROR);
 
-	atomic_inc(&scmd->device->iotmo_cnt);
+	percpu_counter_inc(&scmd->device->iotmo_cnt);
 	if (host->eh_deadline != -1 && !host->last_reset)
 		host->last_reset = jiffies;
 
@@ -370,7 +370,7 @@ enum blk_eh_timer_return scsi_timeout(struct request *req)
 	 */
 	if (test_and_set_bit(SCMD_STATE_COMPLETE, &scmd->state))
 		return BLK_EH_DONE;
-	atomic_inc(&scmd->device->iodone_cnt);
+	percpu_counter_inc(&scmd->device->iodone_cnt);
 	if (scsi_abort_command(scmd) != SUCCESS) {
 		set_host_byte(scmd, DID_TIME_OUT);
 		scsi_eh_scmd_add(scmd);
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 6e8c7a42603e..979fdace33ac 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1554,9 +1554,9 @@ static void scsi_complete(struct request *rq)
 
 	INIT_LIST_HEAD(&cmd->eh_entry);
 
-	atomic_inc(&cmd->device->iodone_cnt);
+	percpu_counter_inc(&cmd->device->iodone_cnt);
 	if (cmd->result)
-		atomic_inc(&cmd->device->ioerr_cnt);
+		percpu_counter_inc(&cmd->device->ioerr_cnt);
 
 	disposition = scsi_decide_disposition(cmd);
 	if (disposition != SUCCESS && scsi_cmd_runtime_exceeced(cmd))
@@ -1592,7 +1592,7 @@ static enum scsi_qc_status scsi_dispatch_cmd(struct scsi_cmnd *cmd)
 	struct Scsi_Host *host = cmd->device->host;
 	int rtn = 0;
 
-	atomic_inc(&cmd->device->iorequest_cnt);
+	percpu_counter_inc(&cmd->device->iorequest_cnt);
 
 	/* check if the device is still usable */
 	if (unlikely(cmd->device->sdev_state == SDEV_DEL)) {
@@ -1614,7 +1614,7 @@ static enum scsi_qc_status scsi_dispatch_cmd(struct scsi_cmnd *cmd)
 		 */
 		SCSI_LOG_MLQUEUE(3, scmd_printk(KERN_INFO, cmd,
 			"queuecommand : device blocked\n"));
-		atomic_dec(&cmd->device->iorequest_cnt);
+		percpu_counter_dec(&cmd->device->iorequest_cnt);
 		return SCSI_MLQUEUE_DEVICE_BUSY;
 	}
 
@@ -1647,7 +1647,7 @@ static enum scsi_qc_status scsi_dispatch_cmd(struct scsi_cmnd *cmd)
 	trace_scsi_dispatch_cmd_start(cmd);
 	rtn = host->hostt->queuecommand(host, cmd);
 	if (rtn) {
-		atomic_dec(&cmd->device->iorequest_cnt);
+		percpu_counter_dec(&cmd->device->iorequest_cnt);
 		trace_scsi_dispatch_cmd_error(cmd, rtn);
 		if (rtn != SCSI_MLQUEUE_DEVICE_BUSY &&
 		    rtn != SCSI_MLQUEUE_TARGET_BUSY)
diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index 121a14d5fdb8..bc885c72f01e 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -350,6 +350,14 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
 
 	scsi_sysfs_device_initialize(sdev);
 
+	if (percpu_counter_init(&sdev->iorequest_cnt, 0, GFP_KERNEL) ||
+	    percpu_counter_init(&sdev->iodone_cnt, 0, GFP_KERNEL) ||
+	    percpu_counter_init(&sdev->ioerr_cnt, 0, GFP_KERNEL) ||
+	    percpu_counter_init(&sdev->iotmo_cnt, 0, GFP_KERNEL)) {
+		ret = -ENOMEM;
+		goto out_device_destroy;
+	}
+
 	if (scsi_device_is_pseudo_dev(sdev))
 		return sdev;
 
diff --git a/drivers/scsi/scsi_sysfs.c b/drivers/scsi/scsi_sysfs.c
index dfc3559e7e04..f652edd16497 100644
--- a/drivers/scsi/scsi_sysfs.c
+++ b/drivers/scsi/scsi_sysfs.c
@@ -516,6 +516,10 @@ static void scsi_device_dev_release(struct device *dev)
 	if (vpd_pgb7)
 		kfree_rcu(vpd_pgb7, rcu);
 	kfree(sdev->inquiry);
+	percpu_counter_destroy(&sdev->iotmo_cnt);
+	percpu_counter_destroy(&sdev->ioerr_cnt);
+	percpu_counter_destroy(&sdev->iodone_cnt);
+	percpu_counter_destroy(&sdev->iorequest_cnt);
 	kfree(sdev);
 
 	if (parent)
@@ -936,26 +940,27 @@ static ssize_t
 show_iostat_counterbits(struct device *dev, struct device_attribute *attr,
 			char *buf)
 {
-	return snprintf(buf, 20, "%d\n", (int)sizeof(atomic_t) * 8);
+	/* iostat counters are per-CPU sums (s64).  Report width for tools. */
+	return sysfs_emit(buf, "%zu\n", sizeof(s64) * 8);
 }
 
 static DEVICE_ATTR(iocounterbits, S_IRUGO, show_iostat_counterbits, NULL);
 
-#define show_sdev_iostat(field)						\
+#define show_sdev_iostat_percpu(field)					\
 static ssize_t								\
 show_iostat_##field(struct device *dev, struct device_attribute *attr,	\
 		    char *buf)						\
 {									\
 	struct scsi_device *sdev = to_scsi_device(dev);			\
-	unsigned long long count = atomic_read(&sdev->field);		\
-	return snprintf(buf, 20, "0x%llx\n", count);			\
+	unsigned long long count = percpu_counter_sum(&sdev->field);	\
+	return sysfs_emit(buf, "0x%llx\n", count);			\
 }									\
-static DEVICE_ATTR(field, S_IRUGO, show_iostat_##field, NULL)
+static DEVICE_ATTR(field, 0444, show_iostat_##field, NULL)
 
-show_sdev_iostat(iorequest_cnt);
-show_sdev_iostat(iodone_cnt);
-show_sdev_iostat(ioerr_cnt);
-show_sdev_iostat(iotmo_cnt);
+show_sdev_iostat_percpu(iorequest_cnt);
+show_sdev_iostat_percpu(iodone_cnt);
+show_sdev_iostat_percpu(ioerr_cnt);
+show_sdev_iostat_percpu(iotmo_cnt);
 
 static ssize_t
 sdev_show_modalias(struct device *dev, struct device_attribute *attr, char *buf)
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index adc3fa55ca2c..b7ce01de17b3 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -4043,7 +4043,7 @@ static int sd_probe(struct scsi_device *sdp)
 	sdkp->index = index;
 	sdkp->max_retries = SD_MAX_RETRIES;
 	atomic_set(&sdkp->openers, 0);
-	atomic_set(&sdkp->device->ioerr_cnt, 0);
+	percpu_counter_set(&sdkp->device->ioerr_cnt, 0);
 
 	if (!sdp->request_queue->rq_timeout) {
 		if (sdp->type != TYPE_MOD)
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 029f5115b2ea..4be36bf2a475 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -9,6 +9,7 @@
 #include <scsi/scsi.h>
 #include <scsi/scsi_common.h>
 #include <linux/atomic.h>
+#include <linux/percpu_counter.h>
 #include <linux/sbitmap.h>
 
 struct bsg_device;
@@ -272,10 +273,10 @@ struct scsi_device {
 	unsigned int max_device_blocked; /* what device_blocked counts down from  */
 #define SCSI_DEFAULT_DEVICE_BLOCKED	3
 
-	atomic_t iorequest_cnt;
-	atomic_t iodone_cnt;
-	atomic_t ioerr_cnt;
-	atomic_t iotmo_cnt;
+	struct percpu_counter iorequest_cnt;
+	struct percpu_counter iodone_cnt;
+	struct percpu_counter ioerr_cnt;
+	struct percpu_counter iotmo_cnt;
 
 	struct device		sdev_gendev,
 				sdev_dev;
-- 
2.43.7


^ permalink raw reply related

* [PATCH v3 3/4] block: drop shared-tag fairness throttling
From: Sumit Saxena @ 2026-06-09 12:18 UTC (permalink / raw)
  To: Martin K . Petersen, Jens Axboe
  Cc: James E . J . Bottomley, linux-scsi, linux-block, Adam Radford,
	Khalid Aziz, Adaptec OEM Raid Solutions, Matthew Wilcox,
	Hannes Reinecke, Juergen E . Fischer, Russell King,
	linux-arm-kernel, Finn Thain, Michael Schmitz, Anil Gurumurthy,
	Sudarsana Kalluru, Oliver Neukum, Ali Akcaagac, Jamie Lenehan,
	Ram Vegesna, target-devel, Bradley Grove, Satish Kharat,
	Sesidhar Baddela, Karan Tilak Kumar, Yihang Li, Don Brace,
	storagedev, HighPoint Linux Team, Tyrel Datwyler,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, linuxppc-dev, Brian King, Lee Duncan,
	Chris Leech, Mike Christie, open-iscsi, Justin Tee, Paul Ely,
	Kashyap Desai, Shivasharan S, Chandrakanth Patil,
	megaraidlinux.pdl, Sathya Prakash Veerichetty, Sreekanth Reddy,
	mpi3mr-linuxdrv.pdl, Suganath Prabu Subramani, Ranjan Kumar,
	MPT-FusionLinux.pdl, Daniel Palmer, GOTO Masanori, YOKOTA Hiroshi,
	Jack Wang, Geoff Levand, Michael Reed, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Narsimhulu Musini, K . Y . Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, linux-hyperv,
	Michael S . Tsirkin, Jason Wang, Paolo Bonzini, Stefan Hajnoczi,
	Eugenio Perez, virtualization, Vishal Bhakta,
	bcm-kernel-feedback-list, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, xen-devel, Bart Van Assche, Sumit Saxena
In-Reply-To: <20260609121806.2121755-1-sumit.saxena@broadcom.com>

From: Bart Van Assche <bvanassche@acm.org>

Original patch [1] by Bart Van Assche; this version is rebased onto the
current tree.  In testing it improves IOPS by roughly 16-18% by removing
the fair-sharing throttle on shared tag queues.

This patch removes the following code and structure members:
- The function hctx_may_queue().
- blk_mq_hw_ctx.nr_active and request_queue.nr_active_requests_shared_tags
  and also all the code that modifies these two member variables.

[1]: https://lore.kernel.org/linux-block/20240529213921.3166462-1-bvanassche@acm.org/

Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com>
---
 block/blk-core.c       |   2 -
 block/blk-mq-debugfs.c |  22 ++++++++-
 block/blk-mq-tag.c     |   4 --
 block/blk-mq.c         |  17 +------
 block/blk-mq.h         | 100 -----------------------------------------
 include/linux/blk-mq.h |   6 ---
 include/linux/blkdev.h |   2 -
 7 files changed, 22 insertions(+), 131 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 17450058ea6d..129acc1b27e5 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -421,8 +421,6 @@ struct request_queue *blk_alloc_queue(struct queue_limits *lim, int node_id)
 
 	q->node = node_id;
 
-	atomic_set(&q->nr_active_requests_shared_tags, 0);
-
 	timer_setup(&q->timeout, blk_rq_timed_out_timer, 0);
 	INIT_WORK(&q->timeout_work, blk_timeout_work);
 	INIT_LIST_HEAD(&q->icq_list);
diff --git a/block/blk-mq-debugfs.c b/block/blk-mq-debugfs.c
index 047ec887456b..8b85a7f8e987 100644
--- a/block/blk-mq-debugfs.c
+++ b/block/blk-mq-debugfs.c
@@ -468,11 +468,31 @@ static int hctx_sched_tags_bitmap_show(void *data, struct seq_file *m)
 	return 0;
 }
 
+struct count_active_params {
+	struct blk_mq_hw_ctx	*hctx;
+	int			*active;
+};
+
+static bool hctx_count_active(struct request *rq, void *data)
+{
+	const struct count_active_params *params = data;
+
+	if (rq->mq_hctx == params->hctx)
+		(*params->active)++;
+
+	return true;
+}
+
 static int hctx_active_show(void *data, struct seq_file *m)
 {
 	struct blk_mq_hw_ctx *hctx = data;
+	int active = 0;
+	struct count_active_params params = { .hctx = hctx, .active = &active };
+
+	blk_mq_all_tag_iter(hctx->sched_tags ?: hctx->tags, hctx_count_active,
+			    &params);
 
-	seq_printf(m, "%d\n", __blk_mq_active_requests(hctx));
+	seq_printf(m, "%d\n", active);
 	return 0;
 }
 
diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
index 33946cdb5716..bfd27cc6249b 100644
--- a/block/blk-mq-tag.c
+++ b/block/blk-mq-tag.c
@@ -109,10 +109,6 @@ void __blk_mq_tag_idle(struct blk_mq_hw_ctx *hctx)
 static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
 			    struct sbitmap_queue *bt)
 {
-	if (!data->q->elevator && !(data->flags & BLK_MQ_REQ_RESERVED) &&
-			!hctx_may_queue(data->hctx, bt))
-		return BLK_MQ_NO_TAG;
-
 	if (data->shallow_depth)
 		return sbitmap_queue_get_shallow(bt, data->shallow_depth);
 	else
diff --git a/block/blk-mq.c b/block/blk-mq.c
index 4c5c16cce4f8..bbac59a06044 100644
--- a/block/blk-mq.c
+++ b/block/blk-mq.c
@@ -489,8 +489,6 @@ __blk_mq_alloc_requests_batch(struct blk_mq_alloc_data *data)
 		}
 	} while (data->nr_tags > nr);
 
-	if (!(data->rq_flags & RQF_SCHED_TAGS))
-		blk_mq_add_active_requests(data->hctx, nr);
 	/* caller already holds a reference, add for remainder */
 	percpu_ref_get_many(&data->q->q_usage_counter, nr - 1);
 	data->nr_tags -= nr;
@@ -587,8 +585,6 @@ static struct request *__blk_mq_alloc_requests(struct blk_mq_alloc_data *data)
 		goto retry;
 	}
 
-	if (!(data->rq_flags & RQF_SCHED_TAGS))
-		blk_mq_inc_active_requests(data->hctx);
 	rq = blk_mq_rq_ctx_init(data, blk_mq_tags_from_data(data), tag);
 	blk_mq_rq_time_init(rq, alloc_time_ns);
 	return rq;
@@ -763,8 +759,6 @@ struct request *blk_mq_alloc_request_hctx(struct request_queue *q,
 	tag = blk_mq_get_tag(&data);
 	if (tag == BLK_MQ_NO_TAG)
 		goto out_queue_exit;
-	if (!(data.rq_flags & RQF_SCHED_TAGS))
-		blk_mq_inc_active_requests(data.hctx);
 	rq = blk_mq_rq_ctx_init(&data, blk_mq_tags_from_data(&data), tag);
 	blk_mq_rq_time_init(rq, alloc_time_ns);
 	rq->__data_len = 0;
@@ -807,10 +801,8 @@ static void __blk_mq_free_request(struct request *rq)
 	blk_pm_mark_last_busy(rq);
 	rq->mq_hctx = NULL;
 
-	if (rq->tag != BLK_MQ_NO_TAG) {
-		blk_mq_dec_active_requests(hctx);
+	if (rq->tag != BLK_MQ_NO_TAG)
 		blk_mq_put_tag(hctx->tags, ctx, rq->tag);
-	}
 	if (sched_tag != BLK_MQ_NO_TAG)
 		blk_mq_put_tag(hctx->sched_tags, ctx, sched_tag);
 	blk_mq_sched_restart(hctx);
@@ -1188,8 +1180,6 @@ static inline void blk_mq_flush_tag_batch(struct blk_mq_hw_ctx *hctx,
 {
 	struct request_queue *q = hctx->queue;
 
-	blk_mq_sub_active_requests(hctx, nr_tags);
-
 	blk_mq_put_tags(hctx->tags, tag_array, nr_tags);
 	percpu_ref_put_many(&q->q_usage_counter, nr_tags);
 }
@@ -1875,9 +1865,6 @@ bool __blk_mq_alloc_driver_tag(struct request *rq)
 	if (blk_mq_tag_is_reserved(rq->mq_hctx->sched_tags, rq->internal_tag)) {
 		bt = &rq->mq_hctx->tags->breserved_tags;
 		tag_offset = 0;
-	} else {
-		if (!hctx_may_queue(rq->mq_hctx, bt))
-			return false;
 	}
 
 	tag = __sbitmap_queue_get(bt);
@@ -1885,7 +1872,6 @@ bool __blk_mq_alloc_driver_tag(struct request *rq)
 		return false;
 
 	rq->tag = tag + tag_offset;
-	blk_mq_inc_active_requests(rq->mq_hctx);
 	return true;
 }
 
@@ -4058,7 +4044,6 @@ blk_mq_alloc_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
 	if (!zalloc_cpumask_var_node(&hctx->cpumask, gfp, node))
 		goto free_hctx;
 
-	atomic_set(&hctx->nr_active, 0);
 	if (node == NUMA_NO_NODE)
 		node = set->numa_node;
 	hctx->numa_node = node;
diff --git a/block/blk-mq.h b/block/blk-mq.h
index aa15d31aaae9..8dfb67c55f5d 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -291,70 +291,9 @@ static inline int blk_mq_get_rq_budget_token(struct request *rq)
 	return -1;
 }
 
-static inline void __blk_mq_add_active_requests(struct blk_mq_hw_ctx *hctx,
-						int val)
-{
-	if (blk_mq_is_shared_tags(hctx->flags))
-		atomic_add(val, &hctx->queue->nr_active_requests_shared_tags);
-	else
-		atomic_add(val, &hctx->nr_active);
-}
-
-static inline void __blk_mq_inc_active_requests(struct blk_mq_hw_ctx *hctx)
-{
-	__blk_mq_add_active_requests(hctx, 1);
-}
-
-static inline void __blk_mq_sub_active_requests(struct blk_mq_hw_ctx *hctx,
-		int val)
-{
-	if (blk_mq_is_shared_tags(hctx->flags))
-		atomic_sub(val, &hctx->queue->nr_active_requests_shared_tags);
-	else
-		atomic_sub(val, &hctx->nr_active);
-}
-
-static inline void __blk_mq_dec_active_requests(struct blk_mq_hw_ctx *hctx)
-{
-	__blk_mq_sub_active_requests(hctx, 1);
-}
-
-static inline void blk_mq_add_active_requests(struct blk_mq_hw_ctx *hctx,
-					      int val)
-{
-	if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
-		__blk_mq_add_active_requests(hctx, val);
-}
-
-static inline void blk_mq_inc_active_requests(struct blk_mq_hw_ctx *hctx)
-{
-	if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
-		__blk_mq_inc_active_requests(hctx);
-}
-
-static inline void blk_mq_sub_active_requests(struct blk_mq_hw_ctx *hctx,
-					      int val)
-{
-	if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
-		__blk_mq_sub_active_requests(hctx, val);
-}
-
-static inline void blk_mq_dec_active_requests(struct blk_mq_hw_ctx *hctx)
-{
-	if (hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED)
-		__blk_mq_dec_active_requests(hctx);
-}
-
-static inline int __blk_mq_active_requests(struct blk_mq_hw_ctx *hctx)
-{
-	if (blk_mq_is_shared_tags(hctx->flags))
-		return atomic_read(&hctx->queue->nr_active_requests_shared_tags);
-	return atomic_read(&hctx->nr_active);
-}
 static inline void __blk_mq_put_driver_tag(struct blk_mq_hw_ctx *hctx,
 					   struct request *rq)
 {
-	blk_mq_dec_active_requests(hctx);
 	blk_mq_put_tag(hctx->tags, rq->mq_ctx, rq->tag);
 	rq->tag = BLK_MQ_NO_TAG;
 }
@@ -396,45 +335,6 @@ static inline void blk_mq_free_requests(struct list_head *list)
 	}
 }
 
-/*
- * For shared tag users, we track the number of currently active users
- * and attempt to provide a fair share of the tag depth for each of them.
- */
-static inline bool hctx_may_queue(struct blk_mq_hw_ctx *hctx,
-				  struct sbitmap_queue *bt)
-{
-	unsigned int depth, users;
-
-	if (!hctx || !(hctx->flags & BLK_MQ_F_TAG_QUEUE_SHARED))
-		return true;
-
-	/*
-	 * Don't try dividing an ant
-	 */
-	if (bt->sb.depth == 1)
-		return true;
-
-	if (blk_mq_is_shared_tags(hctx->flags)) {
-		struct request_queue *q = hctx->queue;
-
-		if (!test_bit(QUEUE_FLAG_HCTX_ACTIVE, &q->queue_flags))
-			return true;
-	} else {
-		if (!test_bit(BLK_MQ_S_TAG_ACTIVE, &hctx->state))
-			return true;
-	}
-
-	users = READ_ONCE(hctx->tags->active_queues);
-	if (!users)
-		return true;
-
-	/*
-	 * Allow at least some tags
-	 */
-	depth = max((bt->sb.depth + users - 1) / users, 4U);
-	return __blk_mq_active_requests(hctx) < depth;
-}
-
 /* run the code block in @dispatch_ops with rcu/srcu read lock held */
 #define __blk_mq_run_dispatch_ops(q, check_sleep, dispatch_ops)	\
 do {								\
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 18a2388ba581..ccbb07559402 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -432,12 +432,6 @@ struct blk_mq_hw_ctx {
 	/** @queue_num: Index of this hardware queue. */
 	unsigned int		queue_num;
 
-	/**
-	 * @nr_active: Number of active requests. Only used when a tag set is
-	 * shared across request queues.
-	 */
-	atomic_t		nr_active;
-
 	/** @cpuhp_online: List to store request if CPU is going to die */
 	struct hlist_node	cpuhp_online;
 	/** @cpuhp_dead: List to store request if some CPU die. */
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 890128cdea1c..95525b1d7b74 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -567,8 +567,6 @@ struct request_queue {
 	struct timer_list	timeout;
 	struct work_struct	timeout_work;
 
-	atomic_t		nr_active_requests_shared_tags;
-
 	struct blk_mq_tags	*sched_shared_tags;
 
 	struct list_head	icq_list;
-- 
2.43.7



* [PATCH v3 2/4] scsi: host: allocate struct Scsi_Host on the NUMA node of the host adapter
From: Sumit Saxena @ 2026-06-09 12:18 UTC (permalink / raw)
  To: Martin K . Petersen, Jens Axboe
  Cc: James E . J . Bottomley, linux-scsi, linux-block, Adam Radford,
	Khalid Aziz, Adaptec OEM Raid Solutions, Matthew Wilcox,
	Hannes Reinecke, Juergen E . Fischer, Russell King,
	linux-arm-kernel, Finn Thain, Michael Schmitz, Anil Gurumurthy,
	Sudarsana Kalluru, Oliver Neukum, Ali Akcaagac, Jamie Lenehan,
	Ram Vegesna, target-devel, Bradley Grove, Satish Kharat,
	Sesidhar Baddela, Karan Tilak Kumar, Yihang Li, Don Brace,
	storagedev, HighPoint Linux Team, Tyrel Datwyler,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, linuxppc-dev, Brian King, Lee Duncan,
	Chris Leech, Mike Christie, open-iscsi, Justin Tee, Paul Ely,
	Kashyap Desai, Shivasharan S, Chandrakanth Patil,
	megaraidlinux.pdl, Sathya Prakash Veerichetty, Sreekanth Reddy,
	mpi3mr-linuxdrv.pdl, Suganath Prabu Subramani, Ranjan Kumar,
	MPT-FusionLinux.pdl, Daniel Palmer, GOTO Masanori, YOKOTA Hiroshi,
	Jack Wang, Geoff Levand, Michael Reed, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Narsimhulu Musini, K . Y . Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, linux-hyperv,
	Michael S . Tsirkin, Jason Wang, Paolo Bonzini, Stefan Hajnoczi,
	Eugenio Perez, virtualization, Vishal Bhakta,
	bcm-kernel-feedback-list, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, xen-devel, Sumit Saxena, John Garry
In-Reply-To: <20260609121806.2121755-1-sumit.saxena@broadcom.com>

scsi_host_alloc() uses kzalloc(), which has no NUMA node preference.
Extend the function to accept a 'struct device *dev' parameter and use
kzalloc_node() with dev_to_node(dev) so the Scsi_Host struct lands on
the same NUMA node as the HBA, mirroring the treatment already applied
to struct scsi_device, struct scsi_target, and shost_data.

When dev is NULL (legacy ISA/platform drivers without a dma_dev) the
allocation falls back to NUMA_NO_NODE, preserving existing behaviour.

Update all in-tree callers:
  - PCI-based HBA drivers pass &pdev->dev (or the equivalent struct
    member such as &phba->pcidev->dev, &h->pdev->dev, &ha->pdev->dev)
    so their host struct is placed on the adapter's node.
  - Non-PCI drivers (ISA, Amiga, ARM PCMCIA, virtio, Hyper-V, PS3, …)
    pass NULL.
  - libfc's libfc_host_alloc() inline helper passes NULL; FC drivers
    that want NUMA awareness can open-code the call with their pdev.

Suggested-by: John Garry <john.g.garry@oracle.com>
Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com>
---
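Illustration only, not part of the patch: a minimal userspace sketch of
the node selection described above. It assumes the new code picks
dev_to_node(dev) when dev is non-NULL and NUMA_NO_NODE otherwise. The
struct device stand-in, the dev_to_node() stub, and the helper name
shost_alloc_node() are hypothetical and exist only so the snippet
builds outside the kernel; the real change is in drivers/scsi/hosts.c.

```c
#include <stddef.h>

#define NUMA_NO_NODE (-1)	/* same value the kernel uses */

/* Hypothetical stand-in for the kernel's struct device. */
struct device {
	int numa_node;
};

/* Userspace stub for dev_to_node(): report the device's node. */
static int dev_to_node(const struct device *dev)
{
	return dev->numa_node;
}

/*
 * Hypothetical helper: the node the Scsi_Host allocation would be
 * requested on. A NULL dev means "no preference".
 */
static int shost_alloc_node(const struct device *dev)
{
	return dev ? dev_to_node(dev) : NUMA_NO_NODE;
}
```

A caller that passes &pdev->dev takes the first branch and gets the
adapter's node; a caller that passes NULL gets NUMA_NO_NODE, which
leaves the placement to the allocator as before.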
 drivers/scsi/3w-9xxx.c                    | 2 +-
 drivers/scsi/3w-sas.c                     | 2 +-
 drivers/scsi/3w-xxxx.c                    | 2 +-
 drivers/scsi/53c700.c                     | 2 +-
 drivers/scsi/BusLogic.c                   | 2 +-
 drivers/scsi/a100u2w.c                    | 2 +-
 drivers/scsi/a2091.c                      | 2 +-
 drivers/scsi/a3000.c                      | 2 +-
 drivers/scsi/aacraid/linit.c              | 2 +-
 drivers/scsi/advansys.c                   | 6 +++---
 drivers/scsi/aha152x.c                    | 2 +-
 drivers/scsi/aha1542.c                    | 2 +-
 drivers/scsi/aha1740.c                    | 2 +-
 drivers/scsi/aic7xxx/aic79xx_osm.c        | 2 +-
 drivers/scsi/aic7xxx/aic7xxx_osm.c        | 2 +-
 drivers/scsi/aic94xx/aic94xx_init.c       | 2 +-
 drivers/scsi/am53c974.c                   | 2 +-
 drivers/scsi/arcmsr/arcmsr_hba.c          | 3 ++-
 drivers/scsi/arm/acornscsi.c              | 2 +-
 drivers/scsi/arm/arxescsi.c               | 2 +-
 drivers/scsi/arm/cumana_1.c               | 2 +-
 drivers/scsi/arm/cumana_2.c               | 2 +-
 drivers/scsi/arm/eesox.c                  | 2 +-
 drivers/scsi/arm/oak.c                    | 2 +-
 drivers/scsi/arm/powertec.c               | 2 +-
 drivers/scsi/atari_scsi.c                 | 2 +-
 drivers/scsi/atp870u.c                    | 2 +-
 drivers/scsi/bfa/bfad_im.c                | 2 +-
 drivers/scsi/csiostor/csio_init.c         | 4 ++--
 drivers/scsi/dc395x.c                     | 2 +-
 drivers/scsi/dmx3191d.c                   | 2 +-
 drivers/scsi/elx/efct/efct_xport.c        | 4 ++--
 drivers/scsi/esas2r/esas2r_main.c         | 2 +-
 drivers/scsi/fdomain.c                    | 2 +-
 drivers/scsi/fnic/fnic_main.c             | 2 +-
 drivers/scsi/g_NCR5380.c                  | 2 +-
 drivers/scsi/gvp11.c                      | 2 +-
 drivers/scsi/hisi_sas/hisi_sas_main.c     | 2 +-
 drivers/scsi/hisi_sas/hisi_sas_v3_hw.c    | 2 +-
 drivers/scsi/hosts.c                      | 6 ++++--
 drivers/scsi/hpsa.c                       | 2 +-
 drivers/scsi/hptiop.c                     | 2 +-
 drivers/scsi/ibmvscsi/ibmvfc.c            | 2 +-
 drivers/scsi/ibmvscsi/ibmvscsi.c          | 2 +-
 drivers/scsi/imm.c                        | 2 +-
 drivers/scsi/initio.c                     | 2 +-
 drivers/scsi/ipr.c                        | 2 +-
 drivers/scsi/ips.c                        | 2 +-
 drivers/scsi/isci/init.c                  | 2 +-
 drivers/scsi/jazz_esp.c                   | 2 +-
 drivers/scsi/libiscsi.c                   | 2 +-
 drivers/scsi/lpfc/lpfc_init.c             | 2 +-
 drivers/scsi/mac53c94.c                   | 2 +-
 drivers/scsi/mac_esp.c                    | 2 +-
 drivers/scsi/mac_scsi.c                   | 2 +-
 drivers/scsi/megaraid.c                   | 2 +-
 drivers/scsi/megaraid/megaraid_mbox.c     | 2 +-
 drivers/scsi/megaraid/megaraid_sas_base.c | 2 +-
 drivers/scsi/mesh.c                       | 2 +-
 drivers/scsi/mpi3mr/mpi3mr_os.c           | 2 +-
 drivers/scsi/mpt3sas/mpt3sas_scsih.c      | 4 ++--
 drivers/scsi/mvme147.c                    | 2 +-
 drivers/scsi/mvsas/mv_init.c              | 2 +-
 drivers/scsi/mvumi.c                      | 2 +-
 drivers/scsi/myrb.c                       | 2 +-
 drivers/scsi/myrs.c                       | 2 +-
 drivers/scsi/ncr53c8xx.c                  | 2 +-
 drivers/scsi/nsp32.c                      | 2 +-
 drivers/scsi/pcmcia/nsp_cs.c              | 2 +-
 drivers/scsi/pcmcia/qlogic_stub.c         | 2 +-
 drivers/scsi/pcmcia/sym53c500_cs.c        | 2 +-
 drivers/scsi/pm8001/pm8001_init.c         | 2 +-
 drivers/scsi/pmcraid.c                    | 2 +-
 drivers/scsi/ppa.c                        | 2 +-
 drivers/scsi/ps3rom.c                     | 2 +-
 drivers/scsi/qla1280.c                    | 2 +-
 drivers/scsi/qla2xxx/qla_mid.c            | 2 +-
 drivers/scsi/qla2xxx/qla_os.c             | 2 +-
 drivers/scsi/qlogicfas.c                  | 2 +-
 drivers/scsi/qlogicpti.c                  | 2 +-
 drivers/scsi/scsi_debug.c                 | 2 +-
 drivers/scsi/sgiwd93.c                    | 2 +-
 drivers/scsi/smartpqi/smartpqi_init.c     | 2 +-
 drivers/scsi/snic/snic_main.c             | 2 +-
 drivers/scsi/stex.c                       | 2 +-
 drivers/scsi/storvsc_drv.c                | 2 +-
 drivers/scsi/sun3_scsi.c                  | 2 +-
 drivers/scsi/sun3x_esp.c                  | 2 +-
 drivers/scsi/sun_esp.c                    | 2 +-
 drivers/scsi/sym53c8xx_2/sym_glue.c       | 2 +-
 drivers/scsi/virtio_scsi.c                | 2 +-
 drivers/scsi/vmw_pvscsi.c                 | 2 +-
 drivers/scsi/wd719x.c                     | 2 +-
 drivers/scsi/xen-scsifront.c              | 2 +-
 drivers/scsi/zorro_esp.c                  | 2 +-
 include/scsi/libfc.h                      | 2 +-
 include/scsi/scsi_host.h                  | 3 ++-
 97 files changed, 107 insertions(+), 103 deletions(-)

diff --git a/drivers/scsi/3w-9xxx.c b/drivers/scsi/3w-9xxx.c
index 9b93a2440af8..444578ee8070 100644
--- a/drivers/scsi/3w-9xxx.c
+++ b/drivers/scsi/3w-9xxx.c
@@ -2021,7 +2021,7 @@ static int twa_probe(struct pci_dev *pdev, const struct pci_device_id *dev_id)
 		goto out_disable_device;
 	}
 
-	host = scsi_host_alloc(&driver_template, sizeof(TW_Device_Extension));
+	host = scsi_host_alloc(&driver_template, sizeof(TW_Device_Extension), &pdev->dev);
 	if (!host) {
 		TW_PRINTK(host, TW_DRIVER, 0x24, "Failed to allocate memory for device extension");
 		retval = -ENOMEM;
diff --git a/drivers/scsi/3w-sas.c b/drivers/scsi/3w-sas.c
index 52dc1aa639f7..d063d39faf4f 100644
--- a/drivers/scsi/3w-sas.c
+++ b/drivers/scsi/3w-sas.c
@@ -1576,7 +1576,7 @@ static int twl_probe(struct pci_dev *pdev, const struct pci_device_id *dev_id)
 		goto out_disable_device;
 	}
 
-	host = scsi_host_alloc(&driver_template, sizeof(TW_Device_Extension));
+	host = scsi_host_alloc(&driver_template, sizeof(TW_Device_Extension), &pdev->dev);
 	if (!host) {
 		TW_PRINTK(host, TW_DRIVER, 0x19, "Failed to allocate memory for device extension");
 		retval = -ENOMEM;
diff --git a/drivers/scsi/3w-xxxx.c b/drivers/scsi/3w-xxxx.c
index c68678fa72c1..0ccb5f1f8805 100644
--- a/drivers/scsi/3w-xxxx.c
+++ b/drivers/scsi/3w-xxxx.c
@@ -2268,7 +2268,7 @@ static int tw_probe(struct pci_dev *pdev, const struct pci_device_id *dev_id)
 		goto out_disable_device;
 	}
 
-	host = scsi_host_alloc(&driver_template, sizeof(TW_Device_Extension));
+	host = scsi_host_alloc(&driver_template, sizeof(TW_Device_Extension), &pdev->dev);
 	if (!host) {
 		printk(KERN_WARNING "3w-xxxx: Failed to allocate memory for device extension.");
 		retval = -ENOMEM;
diff --git a/drivers/scsi/53c700.c b/drivers/scsi/53c700.c
index c78f74b8f45c..e30d55ab5dea 100644
--- a/drivers/scsi/53c700.c
+++ b/drivers/scsi/53c700.c
@@ -341,7 +341,7 @@ NCR_700_detect(struct scsi_host_template *tpnt,
 	if(tpnt->proc_name == NULL)
 		tpnt->proc_name = "53c700";
 
-	host = scsi_host_alloc(tpnt, 4);
+	host = scsi_host_alloc(tpnt, 4, NULL);
 	if (!host)
 		return NULL;
 	memset(hostdata->slots, 0, sizeof(struct NCR_700_command_slot)
diff --git a/drivers/scsi/BusLogic.c b/drivers/scsi/BusLogic.c
index 5304d2febd63..f865fdec4136 100644
--- a/drivers/scsi/BusLogic.c
+++ b/drivers/scsi/BusLogic.c
@@ -2302,7 +2302,7 @@ static int __init blogic_init(void)
 		 */
 
 		host = scsi_host_alloc(&blogic_template,
-				sizeof(struct blogic_adapter));
+				sizeof(struct blogic_adapter), NULL);
 		if (host == NULL) {
 			release_region(myadapter->io_addr,
 					myadapter->addr_count);
diff --git a/drivers/scsi/a100u2w.c b/drivers/scsi/a100u2w.c
index 4365b896f5c4..9124c6103902 100644
--- a/drivers/scsi/a100u2w.c
+++ b/drivers/scsi/a100u2w.c
@@ -1106,7 +1106,7 @@ static int inia100_probe_one(struct pci_dev *pdev,
 	bios = inw(port + 0x50);
 
 
-	shost = scsi_host_alloc(&inia100_template, sizeof(struct orc_host));
+	shost = scsi_host_alloc(&inia100_template, sizeof(struct orc_host), &pdev->dev);
 	if (!shost)
 		goto out_release_region;
 
diff --git a/drivers/scsi/a2091.c b/drivers/scsi/a2091.c
index 204448bfd04b..51effb2edefb 100644
--- a/drivers/scsi/a2091.c
+++ b/drivers/scsi/a2091.c
@@ -214,7 +214,7 @@ static int a2091_probe(struct zorro_dev *z, const struct zorro_device_id *ent)
 		return -EBUSY;
 
 	instance = scsi_host_alloc(&a2091_scsi_template,
-				   sizeof(struct a2091_hostdata));
+				   sizeof(struct a2091_hostdata), NULL);
 	if (!instance) {
 		error = -ENOMEM;
 		goto fail_alloc;
diff --git a/drivers/scsi/a3000.c b/drivers/scsi/a3000.c
index bf054dd7682b..5b3d25b8ad37 100644
--- a/drivers/scsi/a3000.c
+++ b/drivers/scsi/a3000.c
@@ -235,7 +235,7 @@ static int __init amiga_a3000_scsi_probe(struct platform_device *pdev)
 		return -EBUSY;
 
 	instance = scsi_host_alloc(&amiga_a3000_scsi_template,
-				   sizeof(struct a3000_hostdata));
+				   sizeof(struct a3000_hostdata), NULL);
 	if (!instance) {
 		error = -ENOMEM;
 		goto fail_alloc;
diff --git a/drivers/scsi/aacraid/linit.c b/drivers/scsi/aacraid/linit.c
index 2fa8f7ddb703..d003667007f7 100644
--- a/drivers/scsi/aacraid/linit.c
+++ b/drivers/scsi/aacraid/linit.c
@@ -1636,7 +1636,7 @@ static int aac_probe_one(struct pci_dev *pdev, const struct pci_device_id *id)
 
 	pci_set_master(pdev);
 
-	shost = scsi_host_alloc(&aac_driver_template, sizeof(struct aac_dev));
+	shost = scsi_host_alloc(&aac_driver_template, sizeof(struct aac_dev), &pdev->dev);
 	if (!shost) {
 		error = -ENOMEM;
 		goto out_disable_pdev;
diff --git a/drivers/scsi/advansys.c b/drivers/scsi/advansys.c
index 5cdbf2bdb13d..e7ef433778a1 100644
--- a/drivers/scsi/advansys.c
+++ b/drivers/scsi/advansys.c
@@ -11237,7 +11237,7 @@ static int advansys_vlb_probe(struct device *dev, unsigned int id)
 		goto release_region;
 
 	err = -ENOMEM;
-	shost = scsi_host_alloc(&advansys_template, sizeof(*board));
+	shost = scsi_host_alloc(&advansys_template, sizeof(*board), NULL);
 	if (!shost)
 		goto release_region;
 
@@ -11345,7 +11345,7 @@ static int advansys_eisa_probe(struct device *dev)
 			irq = advansys_eisa_irq_no(edev);
 
 		err = -ENOMEM;
-		shost = scsi_host_alloc(&advansys_template, sizeof(*board));
+		shost = scsi_host_alloc(&advansys_template, sizeof(*board), NULL);
 		if (!shost)
 			goto release_region;
 
@@ -11462,7 +11462,7 @@ static int advansys_pci_probe(struct pci_dev *pdev,
 	ioport = pci_resource_start(pdev, 0);
 
 	err = -ENOMEM;
-	shost = scsi_host_alloc(&advansys_template, sizeof(*board));
+	shost = scsi_host_alloc(&advansys_template, sizeof(*board), &pdev->dev);
 	if (!shost)
 		goto release_region;
 
diff --git a/drivers/scsi/aha152x.c b/drivers/scsi/aha152x.c
index e3ccb6bb62c0..d82ce80de098 100644
--- a/drivers/scsi/aha152x.c
+++ b/drivers/scsi/aha152x.c
@@ -734,7 +734,7 @@ struct Scsi_Host *aha152x_probe_one(struct aha152x_setup *setup)
 {
 	struct Scsi_Host *shpnt;
 
-	shpnt = scsi_host_alloc(&aha152x_driver_template, sizeof(struct aha152x_hostdata));
+	shpnt = scsi_host_alloc(&aha152x_driver_template, sizeof(struct aha152x_hostdata), NULL);
 	if (!shpnt) {
 		printk(KERN_ERR "aha152x: scsi_host_alloc failed\n");
 		return NULL;
diff --git a/drivers/scsi/aha1542.c b/drivers/scsi/aha1542.c
index fd766282d4a4..1a109c850785 100644
--- a/drivers/scsi/aha1542.c
+++ b/drivers/scsi/aha1542.c
@@ -752,7 +752,7 @@ static struct Scsi_Host *aha1542_hw_init(const struct scsi_host_template *tpnt,
 	if (!request_region(base_io, AHA1542_REGION_SIZE, "aha1542"))
 		return NULL;
 
-	sh = scsi_host_alloc(tpnt, sizeof(struct aha1542_hostdata));
+	sh = scsi_host_alloc(tpnt, sizeof(struct aha1542_hostdata), NULL);
 	if (!sh)
 		goto release;
 	aha1542 = shost_priv(sh);
diff --git a/drivers/scsi/aha1740.c b/drivers/scsi/aha1740.c
index c435769359f2..31a52edf0748 100644
--- a/drivers/scsi/aha1740.c
+++ b/drivers/scsi/aha1740.c
@@ -583,7 +583,7 @@ static int aha1740_probe (struct device *dev)
 	printk(KERN_INFO "aha174x: Extended translation %sabled.\n",
 	       translation ? "en" : "dis");
 	shpnt = scsi_host_alloc(&aha1740_template,
-			      sizeof(struct aha1740_hostdata));
+			      sizeof(struct aha1740_hostdata), NULL);
 	if(shpnt == NULL)
 		goto err_release_region;
 
diff --git a/drivers/scsi/aic7xxx/aic79xx_osm.c b/drivers/scsi/aic7xxx/aic79xx_osm.c
index feb1707feb7e..76e30b0784b9 100644
--- a/drivers/scsi/aic7xxx/aic79xx_osm.c
+++ b/drivers/scsi/aic7xxx/aic79xx_osm.c
@@ -1214,7 +1214,7 @@ ahd_linux_register_host(struct ahd_softc *ahd, struct scsi_host_template *templa
 	int	retval;
 
 	template->name = ahd->description;
-	host = scsi_host_alloc(template, sizeof(struct ahd_softc *));
+	host = scsi_host_alloc(template, sizeof(struct ahd_softc *), NULL);
 	if (host == NULL)
 		return (ENOMEM);
 
diff --git a/drivers/scsi/aic7xxx/aic7xxx_osm.c b/drivers/scsi/aic7xxx/aic7xxx_osm.c
index d93b522695eb..0169509abd76 100644
--- a/drivers/scsi/aic7xxx/aic7xxx_osm.c
+++ b/drivers/scsi/aic7xxx/aic7xxx_osm.c
@@ -1083,7 +1083,7 @@ ahc_linux_register_host(struct ahc_softc *ahc, struct scsi_host_template *templa
 	int	retval;
 
 	template->name = ahc->description;
-	host = scsi_host_alloc(template, sizeof(struct ahc_softc *));
+	host = scsi_host_alloc(template, sizeof(struct ahc_softc *), NULL);
 	if (host == NULL)
 		return -ENOMEM;
 
diff --git a/drivers/scsi/aic94xx/aic94xx_init.c b/drivers/scsi/aic94xx/aic94xx_init.c
index 4400a3661d90..1336e5e38f8d 100644
--- a/drivers/scsi/aic94xx/aic94xx_init.c
+++ b/drivers/scsi/aic94xx/aic94xx_init.c
@@ -704,7 +704,7 @@ static int asd_pci_probe(struct pci_dev *dev, const struct pci_device_id *id)
 
 	err = -ENOMEM;
 
-	shost = scsi_host_alloc(&aic94xx_sht, sizeof(void *));
+	shost = scsi_host_alloc(&aic94xx_sht, sizeof(void *), &dev->dev);
 	if (!shost)
 		goto Err;
 
diff --git a/drivers/scsi/am53c974.c b/drivers/scsi/am53c974.c
index f972a3c90a2f..4ca73e801232 100644
--- a/drivers/scsi/am53c974.c
+++ b/drivers/scsi/am53c974.c
@@ -388,7 +388,7 @@ static int pci_esp_probe_one(struct pci_dev *pdev,
 		goto fail_disable_device;
 	}
 
-	shost = scsi_host_alloc(hostt, sizeof(struct esp));
+	shost = scsi_host_alloc(hostt, sizeof(struct esp), &pdev->dev);
 	if (!shost) {
 		dev_printk(KERN_INFO, &pdev->dev,
 			   "failed to allocate scsi host\n");
diff --git a/drivers/scsi/arcmsr/arcmsr_hba.c b/drivers/scsi/arcmsr/arcmsr_hba.c
index 8aa948f06cac..f0cc59e756dc 100644
--- a/drivers/scsi/arcmsr/arcmsr_hba.c
+++ b/drivers/scsi/arcmsr/arcmsr_hba.c
@@ -1087,7 +1087,8 @@ static int arcmsr_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	if(error){
 		return -ENODEV;
 	}
-	host = scsi_host_alloc(&arcmsr_scsi_host_template, sizeof(struct AdapterControlBlock));
+	host = scsi_host_alloc(&arcmsr_scsi_host_template,
+			       sizeof(struct AdapterControlBlock), &pdev->dev);
 	if(!host){
     		goto pci_disable_dev;
 	}
diff --git a/drivers/scsi/arm/acornscsi.c b/drivers/scsi/arm/acornscsi.c
index 79d7d7336b6a..97e3db7e6a7c 100644
--- a/drivers/scsi/arm/acornscsi.c
+++ b/drivers/scsi/arm/acornscsi.c
@@ -2806,7 +2806,7 @@ static int acornscsi_probe(struct expansion_card *ec, const struct ecard_id *id)
 	if (ret)
 		goto out;
 
-	host = scsi_host_alloc(&acornscsi_template, sizeof(AS_Host));
+	host = scsi_host_alloc(&acornscsi_template, sizeof(AS_Host), NULL);
 	if (!host) {
 		ret = -ENOMEM;
 		goto out_release;
diff --git a/drivers/scsi/arm/arxescsi.c b/drivers/scsi/arm/arxescsi.c
index 925d0bd68aa5..32f0a3aefb44 100644
--- a/drivers/scsi/arm/arxescsi.c
+++ b/drivers/scsi/arm/arxescsi.c
@@ -272,7 +272,7 @@ static int arxescsi_probe(struct expansion_card *ec, const struct ecard_id *id)
 		goto out_region;
 	}
 
-	host = scsi_host_alloc(&arxescsi_template, sizeof(struct arxescsi_info));
+	host = scsi_host_alloc(&arxescsi_template, sizeof(struct arxescsi_info), NULL);
 	if (!host) {
 		ret = -ENOMEM;
 		goto out_region;
diff --git a/drivers/scsi/arm/cumana_1.c b/drivers/scsi/arm/cumana_1.c
index d1a2a22ffe8c..d47ff9353c1b 100644
--- a/drivers/scsi/arm/cumana_1.c
+++ b/drivers/scsi/arm/cumana_1.c
@@ -238,7 +238,7 @@ static int cumanascsi1_probe(struct expansion_card *ec,
 	if (ret)
 		goto out;
 
-	host = scsi_host_alloc(&cumanascsi_template, sizeof(struct NCR5380_hostdata));
+	host = scsi_host_alloc(&cumanascsi_template, sizeof(struct NCR5380_hostdata), NULL);
 	if (!host) {
 		ret = -ENOMEM;
 		goto out_release;
diff --git a/drivers/scsi/arm/cumana_2.c b/drivers/scsi/arm/cumana_2.c
index e460068f6834..e35afe3a1fe4 100644
--- a/drivers/scsi/arm/cumana_2.c
+++ b/drivers/scsi/arm/cumana_2.c
@@ -394,7 +394,7 @@ static int cumanascsi2_probe(struct expansion_card *ec,
 	}
 
 	host = scsi_host_alloc(&cumanascsi2_template,
-			       sizeof(struct cumanascsi2_info));
+			       sizeof(struct cumanascsi2_info), NULL);
 	if (!host) {
 		ret = -ENOMEM;
 		goto out_region;
diff --git a/drivers/scsi/arm/eesox.c b/drivers/scsi/arm/eesox.c
index 99be9da8757f..de4d457f8ce7 100644
--- a/drivers/scsi/arm/eesox.c
+++ b/drivers/scsi/arm/eesox.c
@@ -510,7 +510,7 @@ static int eesoxscsi_probe(struct expansion_card *ec, const struct ecard_id *id)
 	}
 
 	host = scsi_host_alloc(&eesox_template,
-			       sizeof(struct eesoxscsi_info));
+			       sizeof(struct eesoxscsi_info), NULL);
 	if (!host) {
 		ret = -ENOMEM;
 		goto out_region;
diff --git a/drivers/scsi/arm/oak.c b/drivers/scsi/arm/oak.c
index d69245007096..b2ff8616f963 100644
--- a/drivers/scsi/arm/oak.c
+++ b/drivers/scsi/arm/oak.c
@@ -126,7 +126,7 @@ static int oakscsi_probe(struct expansion_card *ec, const struct ecard_id *id)
 	if (ret)
 		goto out;
 
-	host = scsi_host_alloc(&oakscsi_template, sizeof(struct NCR5380_hostdata));
+	host = scsi_host_alloc(&oakscsi_template, sizeof(struct NCR5380_hostdata), NULL);
 	if (!host) {
 		ret = -ENOMEM;
 		goto release;
diff --git a/drivers/scsi/arm/powertec.c b/drivers/scsi/arm/powertec.c
index 823c65ff6c12..045f35e50eff 100644
--- a/drivers/scsi/arm/powertec.c
+++ b/drivers/scsi/arm/powertec.c
@@ -318,7 +318,7 @@ static int powertecscsi_probe(struct expansion_card *ec,
 	}
 
 	host = scsi_host_alloc(&powertecscsi_template,
-			       sizeof (struct powertec_info));
+			       sizeof(struct powertec_info), NULL);
 	if (!host) {
 		ret = -ENOMEM;
 		goto out_region;
diff --git a/drivers/scsi/atari_scsi.c b/drivers/scsi/atari_scsi.c
index 85055677666c..9a469cf3991f 100644
--- a/drivers/scsi/atari_scsi.c
+++ b/drivers/scsi/atari_scsi.c
@@ -785,7 +785,7 @@ static int __init atari_scsi_probe(struct platform_device *pdev)
 	}
 
 	instance = scsi_host_alloc(&atari_scsi_template,
-	                           sizeof(struct NCR5380_hostdata));
+				   sizeof(struct NCR5380_hostdata), NULL);
 	if (!instance) {
 		error = -ENOMEM;
 		goto fail_alloc;
diff --git a/drivers/scsi/atp870u.c b/drivers/scsi/atp870u.c
index 67459d81f479..57f0b4a11ba7 100644
--- a/drivers/scsi/atp870u.c
+++ b/drivers/scsi/atp870u.c
@@ -1579,7 +1579,7 @@ static int atp870u_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	pci_set_master(pdev);
 
 	err = -ENOMEM;
-	shpnt = scsi_host_alloc(&atp870u_template, sizeof(struct atp_unit));
+	shpnt = scsi_host_alloc(&atp870u_template, sizeof(struct atp_unit), &pdev->dev);
 	if (!shpnt)
 		goto release_region;
 
diff --git a/drivers/scsi/bfa/bfad_im.c b/drivers/scsi/bfa/bfad_im.c
index 97990b285e17..bd14aee64886 100644
--- a/drivers/scsi/bfa/bfad_im.c
+++ b/drivers/scsi/bfa/bfad_im.c
@@ -740,7 +740,7 @@ bfad_scsi_host_alloc(struct bfad_im_port_s *im_port, struct bfad_s *bfad)
 
 	sht->sg_tablesize = bfad->cfg_data.io_max_sge;
 
-	return scsi_host_alloc(sht, sizeof(struct bfad_im_port_pointer));
+	return scsi_host_alloc(sht, sizeof(struct bfad_im_port_pointer), NULL);
 }
 
 void
diff --git a/drivers/scsi/csiostor/csio_init.c b/drivers/scsi/csiostor/csio_init.c
index 238431524801..a4bf1ba03248 100644
--- a/drivers/scsi/csiostor/csio_init.c
+++ b/drivers/scsi/csiostor/csio_init.c
@@ -606,11 +606,11 @@ csio_shost_init(struct csio_hw *hw, struct device *dev,
 	if (dev == &hw->pdev->dev)
 		shost = scsi_host_alloc(
 				&csio_fcoe_shost_template,
-				sizeof(struct csio_lnode));
+				sizeof(struct csio_lnode), &hw->pdev->dev);
 	else
 		shost = scsi_host_alloc(
 				&csio_fcoe_shost_vport_template,
-				sizeof(struct csio_lnode));
+				sizeof(struct csio_lnode), &hw->pdev->dev);
 
 	if (!shost)
 		goto err;
diff --git a/drivers/scsi/dc395x.c b/drivers/scsi/dc395x.c
index 6183ce05d8cf..16adeac93aac 100644
--- a/drivers/scsi/dc395x.c
+++ b/drivers/scsi/dc395x.c
@@ -3984,7 +3984,7 @@ static int dc395x_init_one(struct pci_dev *dev, const struct pci_device_id *id)
 
 	/* allocate scsi host information (includes out adapter) */
 	scsi_host = scsi_host_alloc(&dc395x_driver_template,
-				    sizeof(struct AdapterCtlBlk));
+				    sizeof(struct AdapterCtlBlk), &dev->dev);
 	if (!scsi_host)
 		goto fail;
 
diff --git a/drivers/scsi/dmx3191d.c b/drivers/scsi/dmx3191d.c
index d6d091b2f3c7..8ba17e3eefe3 100644
--- a/drivers/scsi/dmx3191d.c
+++ b/drivers/scsi/dmx3191d.c
@@ -74,7 +74,7 @@ static int dmx3191d_probe_one(struct pci_dev *pdev,
 	}
 
 	shost = scsi_host_alloc(&dmx3191d_driver_template,
-			sizeof(struct NCR5380_hostdata));
+			sizeof(struct NCR5380_hostdata), &pdev->dev);
 	if (!shost)
 		goto out_release_region;       
 
diff --git a/drivers/scsi/elx/efct/efct_xport.c b/drivers/scsi/elx/efct/efct_xport.c
index 9dcaef6fc188..74ef76e00eb5 100644
--- a/drivers/scsi/elx/efct/efct_xport.c
+++ b/drivers/scsi/elx/efct/efct_xport.c
@@ -378,7 +378,7 @@ efct_scsi_new_device(struct efct *efct)
 	int error = 0;
 	struct efct_vport *vport = NULL;
 
-	shost = scsi_host_alloc(&efct_template, sizeof(*vport));
+	shost = scsi_host_alloc(&efct_template, sizeof(*vport), NULL);
 	if (!shost) {
 		efc_log_err(efct, "failed to allocate Scsi_Host struct\n");
 		return -ENOMEM;
@@ -902,7 +902,7 @@ efct_scsi_new_vport(struct efct *efct, struct device *dev)
 	int error = 0;
 	struct efct_vport *vport = NULL;
 
-	shost = scsi_host_alloc(&efct_template, sizeof(*vport));
+	shost = scsi_host_alloc(&efct_template, sizeof(*vport), NULL);
 	if (!shost) {
 		efc_log_err(efct, "failed to allocate Scsi_Host struct\n");
 		return NULL;
diff --git a/drivers/scsi/esas2r/esas2r_main.c b/drivers/scsi/esas2r/esas2r_main.c
index ada278c24c51..4aac1f6db5e9 100644
--- a/drivers/scsi/esas2r/esas2r_main.c
+++ b/drivers/scsi/esas2r/esas2r_main.c
@@ -382,7 +382,7 @@ static int esas2r_probe(struct pci_dev *pcid,
 		       "after pci_enable_device() enable_cnt: %d",
 		       pcid->enable_cnt.counter);
 
-	host = scsi_host_alloc(&driver_template, host_alloc_size);
+	host = scsi_host_alloc(&driver_template, host_alloc_size, &pcid->dev);
 	if (host == NULL) {
 		esas2r_log(ESAS2R_LOG_CRIT, "scsi_host_alloc() FAIL");
 		return -ENODEV;
diff --git a/drivers/scsi/fdomain.c b/drivers/scsi/fdomain.c
index 22fbb0222f07..66ba4551def8 100644
--- a/drivers/scsi/fdomain.c
+++ b/drivers/scsi/fdomain.c
@@ -537,7 +537,7 @@ struct Scsi_Host *fdomain_create(int base, int irq, int this_id,
 		return NULL;
 	}
 
-	sh = scsi_host_alloc(&fdomain_template, sizeof(struct fdomain));
+	sh = scsi_host_alloc(&fdomain_template, sizeof(struct fdomain), NULL);
 	if (!sh)
 		return NULL;
 
diff --git a/drivers/scsi/fnic/fnic_main.c b/drivers/scsi/fnic/fnic_main.c
index 24d62c0874ac..688d85bc3f01 100644
--- a/drivers/scsi/fnic/fnic_main.c
+++ b/drivers/scsi/fnic/fnic_main.c
@@ -847,7 +847,7 @@ static int fnic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		{
 			host =
 				scsi_host_alloc(&fnic_host_template,
-								sizeof(struct fnic *));
+								sizeof(struct fnic *), &pdev->dev);
 			if (!host) {
 				dev_err(&fnic->pdev->dev, "Unable to allocate scsi host\n");
 				err = -ENOMEM;
diff --git a/drivers/scsi/g_NCR5380.c b/drivers/scsi/g_NCR5380.c
index 270eae7ac427..8b9076d6a964 100644
--- a/drivers/scsi/g_NCR5380.c
+++ b/drivers/scsi/g_NCR5380.c
@@ -312,7 +312,7 @@ static int generic_NCR5380_init_one(const struct scsi_host_template *tpnt,
 		goto out_release;
 	}
 
-	instance = scsi_host_alloc(tpnt, sizeof(struct NCR5380_hostdata));
+	instance = scsi_host_alloc(tpnt, sizeof(struct NCR5380_hostdata), NULL);
 	if (instance == NULL) {
 		ret = -ENOMEM;
 		goto out_unmap;
diff --git a/drivers/scsi/gvp11.c b/drivers/scsi/gvp11.c
index 0420bfe9bd42..ad5052db5a2e 100644
--- a/drivers/scsi/gvp11.c
+++ b/drivers/scsi/gvp11.c
@@ -353,7 +353,7 @@ static int gvp11_probe(struct zorro_dev *z, const struct zorro_device_id *ent)
 		goto fail_check_or_alloc;
 
 	instance = scsi_host_alloc(&gvp11_scsi_template,
-				   sizeof(struct gvp11_hostdata));
+				   sizeof(struct gvp11_hostdata), NULL);
 	if (!instance) {
 		error = -ENOMEM;
 		goto fail_check_or_alloc;
diff --git a/drivers/scsi/hisi_sas/hisi_sas_main.c b/drivers/scsi/hisi_sas/hisi_sas_main.c
index 944ce19ae2fc..5696da8da6c7 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_main.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_main.c
@@ -2483,7 +2483,7 @@ static struct Scsi_Host *hisi_sas_shost_alloc(struct platform_device *pdev,
 	struct device *dev = &pdev->dev;
 	int error;
 
-	shost = scsi_host_alloc(hw->sht, sizeof(*hisi_hba));
+	shost = scsi_host_alloc(hw->sht, sizeof(*hisi_hba), NULL);
 	if (!shost) {
 		dev_err(dev, "scsi host alloc failed\n");
 		return NULL;
diff --git a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
index c7430f7c4048..44e584496ed5 100644
--- a/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
+++ b/drivers/scsi/hisi_sas/hisi_sas_v3_hw.c
@@ -3469,7 +3469,7 @@ hisi_sas_shost_alloc_pci(struct pci_dev *pdev)
 	struct hisi_hba *hisi_hba;
 	struct device *dev = &pdev->dev;
 
-	shost = scsi_host_alloc(&sht_v3_hw, sizeof(*hisi_hba));
+	shost = scsi_host_alloc(&sht_v3_hw, sizeof(*hisi_hba), &pdev->dev);
 	if (!shost) {
 		dev_err(dev, "shost alloc failed\n");
 		return NULL;
diff --git a/drivers/scsi/hosts.c b/drivers/scsi/hosts.c
index e047747d4ecf..e1f42be79729 100644
--- a/drivers/scsi/hosts.c
+++ b/drivers/scsi/hosts.c
@@ -403,12 +403,14 @@ static const struct device_type scsi_host_type = {
  * Return value:
  * 	Pointer to a new Scsi_Host
  **/
-struct Scsi_Host *scsi_host_alloc(const struct scsi_host_template *sht, int privsize)
+struct Scsi_Host *scsi_host_alloc(const struct scsi_host_template *sht, int privsize,
+				  struct device *dev)
 {
 	struct Scsi_Host *shost;
 	int index;
 
-	shost = kzalloc(sizeof(struct Scsi_Host) + privsize, GFP_KERNEL);
+	shost = kzalloc_node(sizeof(struct Scsi_Host) + privsize, GFP_KERNEL,
+			     dev ? dev_to_node(dev) : NUMA_NO_NODE);
 	if (!shost)
 		return NULL;
 
diff --git a/drivers/scsi/hpsa.c b/drivers/scsi/hpsa.c
index a1b116cd4723..b9f9f18bd985 100644
--- a/drivers/scsi/hpsa.c
+++ b/drivers/scsi/hpsa.c
@@ -5837,7 +5837,7 @@ static int hpsa_scsi_host_alloc(struct ctlr_info *h)
 {
 	struct Scsi_Host *sh;
 
-	sh = scsi_host_alloc(&hpsa_driver_template, sizeof(struct ctlr_info *));
+	sh = scsi_host_alloc(&hpsa_driver_template, sizeof(struct ctlr_info *), &h->pdev->dev);
 	if (sh == NULL) {
 		dev_err(&h->pdev->dev, "scsi_host_alloc failed\n");
 		return -ENOMEM;
diff --git a/drivers/scsi/hptiop.c b/drivers/scsi/hptiop.c
index 7083c14c5302..7d79357be265 100644
--- a/drivers/scsi/hptiop.c
+++ b/drivers/scsi/hptiop.c
@@ -1311,7 +1311,7 @@ static int hptiop_probe(struct pci_dev *pcidev, const struct pci_device_id *id)
 		goto disable_pci_device;
 	}
 
-	host = scsi_host_alloc(&driver_template, sizeof(struct hptiop_hba));
+	host = scsi_host_alloc(&driver_template, sizeof(struct hptiop_hba), &pcidev->dev);
 	if (!host) {
 		printk(KERN_ERR "hptiop: fail to alloc scsi host\n");
 		goto free_pci_regions;
diff --git a/drivers/scsi/ibmvscsi/ibmvfc.c b/drivers/scsi/ibmvscsi/ibmvfc.c
index 3dd2adda195e..b11d564a21d9 100644
--- a/drivers/scsi/ibmvscsi/ibmvfc.c
+++ b/drivers/scsi/ibmvscsi/ibmvfc.c
@@ -6325,7 +6325,7 @@ static int ibmvfc_probe(struct vio_dev *vdev, const struct vio_device_id *id)
 	unsigned int max_scsi_queues = min((unsigned int)IBMVFC_MAX_SCSI_QUEUES, online_cpus);
 
 	ENTER;
-	shost = scsi_host_alloc(&driver_template, sizeof(*vhost));
+	shost = scsi_host_alloc(&driver_template, sizeof(*vhost), NULL);
 	if (!shost) {
 		dev_err(dev, "Couldn't allocate host data\n");
 		goto out;
diff --git a/drivers/scsi/ibmvscsi/ibmvscsi.c b/drivers/scsi/ibmvscsi/ibmvscsi.c
index 609bda730b3a..e8342e581246 100644
--- a/drivers/scsi/ibmvscsi/ibmvscsi.c
+++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
@@ -2235,7 +2235,7 @@ static int ibmvscsi_probe(struct vio_dev *vdev, const struct vio_device_id *id)
 
 	dev_set_drvdata(&vdev->dev, NULL);
 
-	host = scsi_host_alloc(&driver_template, sizeof(*hostdata));
+	host = scsi_host_alloc(&driver_template, sizeof(*hostdata), NULL);
 	if (!host) {
 		dev_err(&vdev->dev, "couldn't allocate host data\n");
 		goto scsi_host_alloc_failed;
diff --git a/drivers/scsi/imm.c b/drivers/scsi/imm.c
index 0535252e77e3..a6131f87fcaf 100644
--- a/drivers/scsi/imm.c
+++ b/drivers/scsi/imm.c
@@ -1221,7 +1221,7 @@ static int __imm_attach(struct parport *pb)
 	INIT_DELAYED_WORK(&dev->imm_tq, imm_interrupt);
 
 	err = -ENOMEM;
-	host = scsi_host_alloc(&imm_template, sizeof(imm_struct *));
+	host = scsi_host_alloc(&imm_template, sizeof(imm_struct *), NULL);
 	if (!host)
 		goto out1;
 	host->io_port = pb->base;
diff --git a/drivers/scsi/initio.c b/drivers/scsi/initio.c
index 06fbe85dccfa..294f7f8d5dbb 100644
--- a/drivers/scsi/initio.c
+++ b/drivers/scsi/initio.c
@@ -2824,7 +2824,7 @@ static int initio_probe_one(struct pci_dev *pdev,
 		error = -ENODEV;
 		goto out_disable_device;
 	}
-	shost = scsi_host_alloc(&initio_template, sizeof(struct initio_host));
+	shost = scsi_host_alloc(&initio_template, sizeof(struct initio_host), &pdev->dev);
 	if (!shost) {
 		printk(KERN_WARNING "initio: Could not allocate host structure.\n");
 		error = -ENOMEM;
diff --git a/drivers/scsi/ipr.c b/drivers/scsi/ipr.c
index d207e5e81afe..85608804ff39 100644
--- a/drivers/scsi/ipr.c
+++ b/drivers/scsi/ipr.c
@@ -9379,7 +9379,7 @@ static int ipr_probe_ioa(struct pci_dev *pdev,
 	ENTER;
 
 	dev_info(&pdev->dev, "Found IOA with IRQ: %d\n", pdev->irq);
-	host = scsi_host_alloc(&driver_template, sizeof(*ioa_cfg));
+	host = scsi_host_alloc(&driver_template, sizeof(*ioa_cfg), &pdev->dev);
 
 	if (!host) {
 		dev_err(&pdev->dev, "call to scsi_host_alloc failed!\n");
diff --git a/drivers/scsi/ips.c b/drivers/scsi/ips.c
index 41ed73966a48..709a2a799f3e 100644
--- a/drivers/scsi/ips.c
+++ b/drivers/scsi/ips.c
@@ -6638,7 +6638,7 @@ ips_register_scsi(int index)
 {
 	struct Scsi_Host *sh;
 	ips_ha_t *ha, *oldha = ips_ha[index];
-	sh = scsi_host_alloc(&ips_driver_template, sizeof (ips_ha_t));
+	sh = scsi_host_alloc(&ips_driver_template, sizeof(ips_ha_t), &oldha->pcidev->dev);
 	if (!sh) {
 		IPS_PRINTK(KERN_WARNING, oldha->pcidev,
 			   "Unable to register controller with SCSI subsystem\n");
diff --git a/drivers/scsi/isci/init.c b/drivers/scsi/isci/init.c
index acf0c2038d20..7da06ace20ad 100644
--- a/drivers/scsi/isci/init.c
+++ b/drivers/scsi/isci/init.c
@@ -538,7 +538,7 @@ static struct isci_host *isci_host_alloc(struct pci_dev *pdev, int id)
 		INIT_LIST_HEAD(&idev->node);
 	}
 
-	shost = scsi_host_alloc(&isci_sht, sizeof(void *));
+	shost = scsi_host_alloc(&isci_sht, sizeof(void *), &pdev->dev);
 	if (!shost)
 		return NULL;
 
diff --git a/drivers/scsi/jazz_esp.c b/drivers/scsi/jazz_esp.c
index 35137f5cfb3a..1817246e4cc6 100644
--- a/drivers/scsi/jazz_esp.c
+++ b/drivers/scsi/jazz_esp.c
@@ -110,7 +110,7 @@ static int esp_jazz_probe(struct platform_device *dev)
 	struct resource *res;
 	int err;
 
-	host = scsi_host_alloc(tpnt, sizeof(struct esp));
+	host = scsi_host_alloc(tpnt, sizeof(struct esp), NULL);
 
 	err = -ENOMEM;
 	if (!host)
diff --git a/drivers/scsi/libiscsi.c b/drivers/scsi/libiscsi.c
index 160f02f2f51d..458955dfc0aa 100644
--- a/drivers/scsi/libiscsi.c
+++ b/drivers/scsi/libiscsi.c
@@ -2903,7 +2903,7 @@ struct Scsi_Host *iscsi_host_alloc(const struct scsi_host_template *sht,
 	struct Scsi_Host *shost;
 	struct iscsi_host *ihost;
 
-	shost = scsi_host_alloc(sht, sizeof(struct iscsi_host) + dd_data_size);
+	shost = scsi_host_alloc(sht, sizeof(struct iscsi_host) + dd_data_size, NULL);
 	if (!shost)
 		return NULL;
 	ihost = shost_priv(shost);
diff --git a/drivers/scsi/lpfc/lpfc_init.c b/drivers/scsi/lpfc/lpfc_init.c
index 82af59c913e9..25264866075f 100644
--- a/drivers/scsi/lpfc/lpfc_init.c
+++ b/drivers/scsi/lpfc/lpfc_init.c
@@ -4745,7 +4745,7 @@ lpfc_create_port(struct lpfc_hba *phba, int instance, struct device *dev)
 		template->sg_tablesize = lpfc_get_sg_tablesize(phba);
 	}
 
-	shost = scsi_host_alloc(template, sizeof(struct lpfc_vport));
+	shost = scsi_host_alloc(template, sizeof(struct lpfc_vport), &phba->pcidev->dev);
 	if (!shost)
 		goto out;
 
diff --git a/drivers/scsi/mac53c94.c b/drivers/scsi/mac53c94.c
index de2bd860b9d7..737e5f2fef6f 100644
--- a/drivers/scsi/mac53c94.c
+++ b/drivers/scsi/mac53c94.c
@@ -426,7 +426,7 @@ static int mac53c94_probe(struct macio_dev *mdev, const struct of_device_id *mat
 		return -EBUSY;
 	}
 
-       	host = scsi_host_alloc(&mac53c94_template, sizeof(struct fsc_state));
+	host = scsi_host_alloc(&mac53c94_template, sizeof(struct fsc_state), NULL);
 	if (host == NULL) {
 		printk(KERN_ERR "mac53c94: couldn't register host");
 		rc = -ENOMEM;
diff --git a/drivers/scsi/mac_esp.c b/drivers/scsi/mac_esp.c
index a0ceaa2428c2..c8652bfdb3b8 100644
--- a/drivers/scsi/mac_esp.c
+++ b/drivers/scsi/mac_esp.c
@@ -301,7 +301,7 @@ static int esp_mac_probe(struct platform_device *dev)
 	if (dev->id > 1)
 		return -ENODEV;
 
-	host = scsi_host_alloc(tpnt, sizeof(struct esp));
+	host = scsi_host_alloc(tpnt, sizeof(struct esp), NULL);
 
 	err = -ENOMEM;
 	if (!host)
diff --git a/drivers/scsi/mac_scsi.c b/drivers/scsi/mac_scsi.c
index a86bd839d08e..eeb00ee30aaa 100644
--- a/drivers/scsi/mac_scsi.c
+++ b/drivers/scsi/mac_scsi.c
@@ -474,7 +474,7 @@ static int __init mac_scsi_probe(struct platform_device *pdev)
 		mac_scsi_template.sg_tablesize = 1;
 
 	instance = scsi_host_alloc(&mac_scsi_template,
-	                           sizeof(struct NCR5380_hostdata));
+				   sizeof(struct NCR5380_hostdata), NULL);
 	if (!instance)
 		return -ENOMEM;
 
diff --git a/drivers/scsi/megaraid.c b/drivers/scsi/megaraid.c
index 9476a0d2c72d..701e54843193 100644
--- a/drivers/scsi/megaraid.c
+++ b/drivers/scsi/megaraid.c
@@ -4203,7 +4203,7 @@ megaraid_probe_one(struct pci_dev *pdev, const struct pci_device_id *id)
 	}
 
 	/* Initialize SCSI Host structure */
-	host = scsi_host_alloc(&megaraid_template, sizeof(adapter_t));
+	host = scsi_host_alloc(&megaraid_template, sizeof(adapter_t), &pdev->dev);
 	if (!host)
 		goto out_iounmap;
 
diff --git a/drivers/scsi/megaraid/megaraid_mbox.c b/drivers/scsi/megaraid/megaraid_mbox.c
index ce89032a5a74..17b015b3d35f 100644
--- a/drivers/scsi/megaraid/megaraid_mbox.c
+++ b/drivers/scsi/megaraid/megaraid_mbox.c
@@ -620,7 +620,7 @@ megaraid_io_attach(adapter_t *adapter)
 	struct Scsi_Host	*host;
 
 	// Initialize SCSI Host structure
-	host = scsi_host_alloc(&megaraid_template_g, 8);
+	host = scsi_host_alloc(&megaraid_template_g, 8, &adapter->pdev->dev);
 	if (!host) {
 		con_log(CL_ANN, (KERN_WARNING
 			"megaraid mbox: scsi_host_alloc failed\n"));
diff --git a/drivers/scsi/megaraid/megaraid_sas_base.c b/drivers/scsi/megaraid/megaraid_sas_base.c
index ecd365d78ae3..bae1070371d5 100644
--- a/drivers/scsi/megaraid/megaraid_sas_base.c
+++ b/drivers/scsi/megaraid/megaraid_sas_base.c
@@ -7512,7 +7512,7 @@ static int megasas_probe_one(struct pci_dev *pdev,
 	pci_set_master(pdev);
 
 	host = scsi_host_alloc(&megasas_template,
-			       sizeof(struct megasas_instance));
+			       sizeof(struct megasas_instance), &pdev->dev);
 
 	if (!host) {
 		dev_printk(KERN_DEBUG, &pdev->dev, "scsi_host_alloc failed\n");
diff --git a/drivers/scsi/mesh.c b/drivers/scsi/mesh.c
index dc1402b321da..a4ba6bc49d23 100644
--- a/drivers/scsi/mesh.c
+++ b/drivers/scsi/mesh.c
@@ -1877,7 +1877,7 @@ static int mesh_probe(struct macio_dev *mdev, const struct of_device_id *match)
        		printk(KERN_ERR "mesh: unable to request memory resources");
 		return -EBUSY;
 	}
-       	mesh_host = scsi_host_alloc(&mesh_template, sizeof(struct mesh_state));
+	mesh_host = scsi_host_alloc(&mesh_template, sizeof(struct mesh_state), NULL);
 	if (mesh_host == NULL) {
 		printk(KERN_ERR "mesh: couldn't register host");
 		goto out_release;
diff --git a/drivers/scsi/mpi3mr/mpi3mr_os.c b/drivers/scsi/mpi3mr/mpi3mr_os.c
index 402d1f35d214..c74e2addc77d 100644
--- a/drivers/scsi/mpi3mr/mpi3mr_os.c
+++ b/drivers/scsi/mpi3mr/mpi3mr_os.c
@@ -5468,7 +5468,7 @@ mpi3mr_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	}
 
 	shost = scsi_host_alloc(&mpi3mr_driver_template,
-	    sizeof(struct mpi3mr_ioc));
+	    sizeof(struct mpi3mr_ioc), &pdev->dev);
 	if (!shost) {
 		retval = -ENODEV;
 		goto shost_failed;
diff --git a/drivers/scsi/mpt3sas/mpt3sas_scsih.c b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
index 6ff788557294..06c8df6261d4 100644
--- a/drivers/scsi/mpt3sas/mpt3sas_scsih.c
+++ b/drivers/scsi/mpt3sas/mpt3sas_scsih.c
@@ -13367,7 +13367,7 @@ _scsih_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 			PCIE_LINK_STATE_L1 | PCIE_LINK_STATE_CLKPM);
 		/* Use mpt2sas driver host template for SAS 2.0 HBA's */
 		shost = scsi_host_alloc(&mpt2sas_driver_template,
-		  sizeof(struct MPT3SAS_ADAPTER));
+		  sizeof(struct MPT3SAS_ADAPTER), &pdev->dev);
 		if (!shost)
 			return -ENODEV;
 		ioc = shost_priv(shost);
@@ -13399,7 +13399,7 @@ _scsih_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	case MPI26_VERSION:
 		/* Use mpt3sas driver host template for SAS 3.0 HBA's */
 		shost = scsi_host_alloc(&mpt3sas_driver_template,
-		  sizeof(struct MPT3SAS_ADAPTER));
+		  sizeof(struct MPT3SAS_ADAPTER), &pdev->dev);
 		if (!shost)
 			return -ENODEV;
 		ioc = shost_priv(shost);
diff --git a/drivers/scsi/mvme147.c b/drivers/scsi/mvme147.c
index 98b99c0f5bc7..4d61e25de2cf 100644
--- a/drivers/scsi/mvme147.c
+++ b/drivers/scsi/mvme147.c
@@ -97,7 +97,7 @@ static int __init mvme147_init(void)
 		return 0;
 
 	mvme147_shost = scsi_host_alloc(&mvme147_host_template,
-			sizeof(struct WD33C93_hostdata));
+			sizeof(struct WD33C93_hostdata), NULL);
 	if (!mvme147_shost)
 		goto err_out;
 	mvme147_shost->base = 0xfffe4000;
diff --git a/drivers/scsi/mvsas/mv_init.c b/drivers/scsi/mvsas/mv_init.c
index 5abc17a2e261..fd90b5eec0b4 100644
--- a/drivers/scsi/mvsas/mv_init.c
+++ b/drivers/scsi/mvsas/mv_init.c
@@ -494,7 +494,7 @@ static int mvs_pci_init(struct pci_dev *pdev, const struct pci_device_id *ent)
 	if (rc)
 		goto err_out_regions;
 
-	shost = scsi_host_alloc(&mvs_sht, sizeof(void *));
+	shost = scsi_host_alloc(&mvs_sht, sizeof(void *), &pdev->dev);
 	if (!shost) {
 		rc = -ENOMEM;
 		goto err_out_regions;
diff --git a/drivers/scsi/mvumi.c b/drivers/scsi/mvumi.c
index e70d336b4ab3..d12b33a32a09 100644
--- a/drivers/scsi/mvumi.c
+++ b/drivers/scsi/mvumi.c
@@ -2468,7 +2468,7 @@ static int mvumi_probe_one(struct pci_dev *pdev, const struct pci_device_id *id)
 	if (ret)
 		goto fail_set_dma_mask;
 
-	host = scsi_host_alloc(&mvumi_template, sizeof(*mhba));
+	host = scsi_host_alloc(&mvumi_template, sizeof(*mhba), &pdev->dev);
 	if (!host) {
 		dev_err(&pdev->dev, "scsi_host_alloc failed\n");
 		ret = -ENOMEM;
diff --git a/drivers/scsi/myrb.c b/drivers/scsi/myrb.c
index 3678b66310ed..f28c29b41cf6 100644
--- a/drivers/scsi/myrb.c
+++ b/drivers/scsi/myrb.c
@@ -3401,7 +3401,7 @@ static struct myrb_hba *myrb_detect(struct pci_dev *pdev,
 	struct Scsi_Host *shost;
 	struct myrb_hba *cb = NULL;
 
-	shost = scsi_host_alloc(&myrb_template, sizeof(struct myrb_hba));
+	shost = scsi_host_alloc(&myrb_template, sizeof(struct myrb_hba), &pdev->dev);
 	if (!shost) {
 		dev_err(&pdev->dev, "Unable to allocate Controller\n");
 		return NULL;
diff --git a/drivers/scsi/myrs.c b/drivers/scsi/myrs.c
index afd68225221a..a8ce488e6520 100644
--- a/drivers/scsi/myrs.c
+++ b/drivers/scsi/myrs.c
@@ -1937,7 +1937,7 @@ static struct myrs_hba *myrs_alloc_host(struct pci_dev *pdev,
 	struct Scsi_Host *shost;
 	struct myrs_hba *cs;
 
-	shost = scsi_host_alloc(&myrs_template, sizeof(struct myrs_hba));
+	shost = scsi_host_alloc(&myrs_template, sizeof(struct myrs_hba), &pdev->dev);
 	if (!shost)
 		return NULL;
 
diff --git a/drivers/scsi/ncr53c8xx.c b/drivers/scsi/ncr53c8xx.c
index 5369ca3fe4fd..009d4c55054e 100644
--- a/drivers/scsi/ncr53c8xx.c
+++ b/drivers/scsi/ncr53c8xx.c
@@ -8108,7 +8108,7 @@ struct Scsi_Host * __init ncr_attach(struct scsi_host_template *tpnt,
 	printk(KERN_INFO "ncr53c720-%d: rev 0x%x irq %d\n",
 		unit, device->chip.revision_id, device->slot.irq);
 
-	instance = scsi_host_alloc(tpnt, sizeof(*host_data));
+	instance = scsi_host_alloc(tpnt, sizeof(*host_data), NULL);
 	if (!instance)
 	        goto attach_error;
 	host_data = (struct host_data *) instance->hostdata;
diff --git a/drivers/scsi/nsp32.c b/drivers/scsi/nsp32.c
index e893d5677241..681e1d554657 100644
--- a/drivers/scsi/nsp32.c
+++ b/drivers/scsi/nsp32.c
@@ -2556,7 +2556,7 @@ static int nsp32_detect(struct pci_dev *pdev)
 	/*
 	 * register this HBA as SCSI device
 	 */
-	host = scsi_host_alloc(&nsp32_template, sizeof(nsp32_hw_data));
+	host = scsi_host_alloc(&nsp32_template, sizeof(nsp32_hw_data), &pdev->dev);
 	if (host == NULL) {
 		nsp32_msg (KERN_ERR, "failed to scsi register");
 		goto err;
diff --git a/drivers/scsi/pcmcia/nsp_cs.c b/drivers/scsi/pcmcia/nsp_cs.c
index ae70fda96ae9..32ca7872b7f8 100644
--- a/drivers/scsi/pcmcia/nsp_cs.c
+++ b/drivers/scsi/pcmcia/nsp_cs.c
@@ -1326,7 +1326,7 @@ static struct Scsi_Host *nsp_detect(struct scsi_host_template *sht)
 	nsp_hw_data *data_b = &nsp_data_base, *data;
 
 	nsp_dbg(NSP_DEBUG_INIT, "this_id=%d", sht->this_id);
-	host = scsi_host_alloc(&nsp_driver_template, sizeof(nsp_hw_data));
+	host = scsi_host_alloc(&nsp_driver_template, sizeof(nsp_hw_data), NULL);
 	if (host == NULL) {
 		nsp_dbg(NSP_DEBUG_INIT, "host failed");
 		return NULL;
diff --git a/drivers/scsi/pcmcia/qlogic_stub.c b/drivers/scsi/pcmcia/qlogic_stub.c
index 5d8a434d3f66..b417b39ab723 100644
--- a/drivers/scsi/pcmcia/qlogic_stub.c
+++ b/drivers/scsi/pcmcia/qlogic_stub.c
@@ -106,7 +106,7 @@ static struct Scsi_Host *qlogic_detect(struct scsi_host_template *host,
 	qlogicfas408_setup(qbase, qinitid, INT_TYPE);
 
 	host->name = qlogic_name;
-	shost = scsi_host_alloc(host, sizeof(struct qlogicfas408_priv));
+	shost = scsi_host_alloc(host, sizeof(struct qlogicfas408_priv), NULL);
 	if (!shost)
 		goto err;
 	shost->io_port = qbase;
diff --git a/drivers/scsi/pcmcia/sym53c500_cs.c b/drivers/scsi/pcmcia/sym53c500_cs.c
index 1530c1ad5d36..83aab6c69a62 100644
--- a/drivers/scsi/pcmcia/sym53c500_cs.c
+++ b/drivers/scsi/pcmcia/sym53c500_cs.c
@@ -752,7 +752,7 @@ SYM53C500_config(struct pcmcia_device *link)
 
 	chip_init(port_base);
 
-	host = scsi_host_alloc(tpnt, sizeof(struct sym53c500_data));
+	host = scsi_host_alloc(tpnt, sizeof(struct sym53c500_data), NULL);
 	if (!host) {
 		printk("SYM53C500: Unable to register host, giving up.\n");
 		goto err_release;
diff --git a/drivers/scsi/pm8001/pm8001_init.c b/drivers/scsi/pm8001/pm8001_init.c
index e93ea76b565e..873810c6853c 100644
--- a/drivers/scsi/pm8001/pm8001_init.c
+++ b/drivers/scsi/pm8001/pm8001_init.c
@@ -1142,7 +1142,7 @@ static int pm8001_pci_probe(struct pci_dev *pdev,
 	if (rc)
 		goto err_out_regions;
 
-	shost = scsi_host_alloc(&pm8001_sht, sizeof(void *));
+	shost = scsi_host_alloc(&pm8001_sht, sizeof(void *), &pdev->dev);
 	if (!shost) {
 		rc = -ENOMEM;
 		goto err_out_regions;
diff --git a/drivers/scsi/pmcraid.c b/drivers/scsi/pmcraid.c
index 942a99393204..a26c747806ef 100644
--- a/drivers/scsi/pmcraid.c
+++ b/drivers/scsi/pmcraid.c
@@ -5236,7 +5236,7 @@ static int pmcraid_probe(struct pci_dev *pdev,
 	}
 
 	host = scsi_host_alloc(&pmcraid_host_template,
-				sizeof(struct pmcraid_instance));
+				sizeof(struct pmcraid_instance), &pdev->dev);
 
 	if (!host) {
 		dev_err(&pdev->dev, "scsi_host_alloc failed!\n");
diff --git a/drivers/scsi/ppa.c b/drivers/scsi/ppa.c
index 8a4e910d5758..40fe9c6acc3b 100644
--- a/drivers/scsi/ppa.c
+++ b/drivers/scsi/ppa.c
@@ -1101,7 +1101,7 @@ static int __ppa_attach(struct parport *pb)
 	INIT_DELAYED_WORK(&dev->ppa_tq, ppa_interrupt);
 
 	err = -ENOMEM;
-	host = scsi_host_alloc(&ppa_template, sizeof(ppa_struct *));
+	host = scsi_host_alloc(&ppa_template, sizeof(ppa_struct *), NULL);
 	if (!host)
 		goto out1;
 	host->io_port = pb->base;
diff --git a/drivers/scsi/ps3rom.c b/drivers/scsi/ps3rom.c
index a9c727d22931..3542a35b137e 100644
--- a/drivers/scsi/ps3rom.c
+++ b/drivers/scsi/ps3rom.c
@@ -361,7 +361,7 @@ static int ps3rom_probe(struct ps3_system_bus_device *_dev)
 		goto fail_free_bounce;
 
 	host = scsi_host_alloc(&ps3rom_host_template,
-			       sizeof(struct ps3rom_private));
+			       sizeof(struct ps3rom_private), NULL);
 	if (!host) {
 		dev_err(&dev->sbd.core, "%s:%u: scsi_host_alloc failed\n",
 			__func__, __LINE__);
diff --git a/drivers/scsi/qla1280.c b/drivers/scsi/qla1280.c
index cdd6fe002c32..f88f2e659baa 100644
--- a/drivers/scsi/qla1280.c
+++ b/drivers/scsi/qla1280.c
@@ -4142,7 +4142,7 @@ qla1280_probe_one(struct pci_dev *pdev, const struct pci_device_id *id)
 	pci_set_master(pdev);
 
 	error = -ENOMEM;
-	host = scsi_host_alloc(&qla1280_driver_template, sizeof(*ha));
+	host = scsi_host_alloc(&qla1280_driver_template, sizeof(*ha), &pdev->dev);
 	if (!host) {
 		printk(KERN_WARNING
 		       "qla1280: Failed to register host, aborting.\n");
diff --git a/drivers/scsi/qla2xxx/qla_mid.c b/drivers/scsi/qla2xxx/qla_mid.c
index c563133f751e..4bafc367e21d 100644
--- a/drivers/scsi/qla2xxx/qla_mid.c
+++ b/drivers/scsi/qla2xxx/qla_mid.c
@@ -502,7 +502,7 @@ qla24xx_create_vhost(struct fc_vport *fc_vport)
 	vha = qla2x00_create_host(sht, ha);
 	if (!vha) {
 		ql_log(ql_log_warn, vha, 0xa005,
 		    "scsi_host_alloc() failed for vport.\n");
 		return(NULL);
 	}
 
diff --git a/drivers/scsi/qla2xxx/qla_os.c b/drivers/scsi/qla2xxx/qla_os.c
index 72b1c28e4dae..ce0d097f3317 100644
--- a/drivers/scsi/qla2xxx/qla_os.c
+++ b/drivers/scsi/qla2xxx/qla_os.c
@@ -5046,7 +5046,7 @@ struct scsi_qla_host *qla2x00_create_host(const struct scsi_host_template *sht,
 	struct Scsi_Host *host;
 	struct scsi_qla_host *vha = NULL;
 
-	host = scsi_host_alloc(sht, sizeof(scsi_qla_host_t));
+	host = scsi_host_alloc(sht, sizeof(scsi_qla_host_t), &ha->pdev->dev);
 	if (!host) {
 		ql_log_pci(ql_log_fatal, ha->pdev, 0x0107,
 		    "Failed to allocate host from the scsi layer, aborting.\n");
diff --git a/drivers/scsi/qlogicfas.c b/drivers/scsi/qlogicfas.c
index 8f05e3707d69..b9ead7dc371c 100644
--- a/drivers/scsi/qlogicfas.c
+++ b/drivers/scsi/qlogicfas.c
@@ -95,7 +95,7 @@ static struct Scsi_Host *__qlogicfas_detect(struct scsi_host_template *host,
 
 	qlogicfas408_setup(qbase, qinitid, INT_TYPE);
 
-	hreg = scsi_host_alloc(host, sizeof(struct qlogicfas408_priv));
+	hreg = scsi_host_alloc(host, sizeof(struct qlogicfas408_priv), NULL);
 	if (!hreg)
 		goto err_release_mem;
 	priv = get_priv_by_host(hreg);
diff --git a/drivers/scsi/qlogicpti.c b/drivers/scsi/qlogicpti.c
index ea0a2b5a0a42..f67a9b400100 100644
--- a/drivers/scsi/qlogicpti.c
+++ b/drivers/scsi/qlogicpti.c
@@ -1316,7 +1316,7 @@ static int qpti_sbus_probe(struct platform_device *op)
 	if (op->archdata.irqs[0] == 0)
 		return -ENODEV;
 
-	host = scsi_host_alloc(&qpti_template, sizeof(struct qlogicpti));
+	host = scsi_host_alloc(&qpti_template, sizeof(struct qlogicpti), NULL);
 	if (!host)
 		return -ENOMEM;
 
diff --git a/drivers/scsi/scsi_debug.c b/drivers/scsi/scsi_debug.c
index bb6b0e7fb910..59488bf74ce0 100644
--- a/drivers/scsi/scsi_debug.c
+++ b/drivers/scsi/scsi_debug.c
@@ -9548,7 +9548,7 @@ static int sdebug_driver_probe(struct device *dev)
 
 	sdbg_host = dev_to_sdebug_host(dev);
 
-	hpnt = scsi_host_alloc(&sdebug_driver_template, 0);
+	hpnt = scsi_host_alloc(&sdebug_driver_template, 0, NULL);
 	if (NULL == hpnt) {
 		pr_err("scsi_host_alloc failed\n");
 		error = -ENODEV;
diff --git a/drivers/scsi/sgiwd93.c b/drivers/scsi/sgiwd93.c
index 6594661db5f4..07fbe6fda7c2 100644
--- a/drivers/scsi/sgiwd93.c
+++ b/drivers/scsi/sgiwd93.c
@@ -231,7 +231,7 @@ static int sgiwd93_probe(struct platform_device *pdev)
 	unsigned int irq = pd->irq;
 	int err;
 
-	host = scsi_host_alloc(&sgiwd93_template, sizeof(struct ip22_hostdata));
+	host = scsi_host_alloc(&sgiwd93_template, sizeof(struct ip22_hostdata), NULL);
 	if (!host) {
 		err = -ENOMEM;
 		goto out;
diff --git a/drivers/scsi/smartpqi/smartpqi_init.c b/drivers/scsi/smartpqi/smartpqi_init.c
index 65ff50982978..a3163c06b3f8 100644
--- a/drivers/scsi/smartpqi/smartpqi_init.c
+++ b/drivers/scsi/smartpqi/smartpqi_init.c
@@ -7619,7 +7619,7 @@ static int pqi_register_scsi(struct pqi_ctrl_info *ctrl_info)
 	int rc;
 	struct Scsi_Host *shost;
 
-	shost = scsi_host_alloc(&pqi_driver_template, sizeof(ctrl_info));
+	shost = scsi_host_alloc(&pqi_driver_template, sizeof(ctrl_info), &ctrl_info->pci_dev->dev);
 	if (!shost) {
 		dev_err(&ctrl_info->pci_dev->dev, "scsi_host_alloc failed\n");
 		return -ENOMEM;
diff --git a/drivers/scsi/snic/snic_main.c b/drivers/scsi/snic/snic_main.c
index 82953e6a0915..9edf6661e6f1 100644
--- a/drivers/scsi/snic/snic_main.c
+++ b/drivers/scsi/snic/snic_main.c
@@ -363,7 +363,7 @@ snic_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	/*
 	 * Allocate SCSI Host and setup association between host, and snic
 	 */
-	shost = scsi_host_alloc(&snic_host_template, sizeof(struct snic));
+	shost = scsi_host_alloc(&snic_host_template, sizeof(struct snic), &pdev->dev);
 	if (!shost) {
 		SNIC_ERR("Unable to alloc scsi_host\n");
 		ret = -ENOMEM;
diff --git a/drivers/scsi/stex.c b/drivers/scsi/stex.c
index 6aeeb338633d..7d6b851fef24 100644
--- a/drivers/scsi/stex.c
+++ b/drivers/scsi/stex.c
@@ -1667,7 +1667,7 @@ static int stex_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	S6flag = 0;
 	register_reboot_notifier(&stex_notifier);
 
-	host = scsi_host_alloc(&driver_template, sizeof(struct st_hba));
+	host = scsi_host_alloc(&driver_template, sizeof(struct st_hba), &pdev->dev);
 
 	if (!host) {
 		printk(KERN_ERR DRV_NAME "(%s): scsi_host_alloc failed\n",
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 571ea549152b..fc4c05127dc4 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1969,7 +1969,7 @@ static int storvsc_probe(struct hv_device *device,
 				(100 - ring_avail_percent_lowater) / 100;
 
 	host = scsi_host_alloc(&scsi_driver,
-			       sizeof(struct hv_host_device));
+			       sizeof(struct hv_host_device), NULL);
 	if (!host)
 		return -ENOMEM;
 
diff --git a/drivers/scsi/sun3_scsi.c b/drivers/scsi/sun3_scsi.c
index ca9cd691cc32..ed41b605328e 100644
--- a/drivers/scsi/sun3_scsi.c
+++ b/drivers/scsi/sun3_scsi.c
@@ -578,7 +578,7 @@ static int __init sun3_scsi_probe(struct platform_device *pdev)
 #endif
 
 	instance = scsi_host_alloc(&sun3_scsi_template,
-	                           sizeof(struct NCR5380_hostdata));
+				   sizeof(struct NCR5380_hostdata), NULL);
 	if (!instance) {
 		error = -ENOMEM;
 		goto fail_alloc;
diff --git a/drivers/scsi/sun3x_esp.c b/drivers/scsi/sun3x_esp.c
index 365406885b8e..f7e48f4c5444 100644
--- a/drivers/scsi/sun3x_esp.c
+++ b/drivers/scsi/sun3x_esp.c
@@ -175,7 +175,7 @@ static int esp_sun3x_probe(struct platform_device *dev)
 	struct resource *res;
 	int err = -ENOMEM;
 
-	host = scsi_host_alloc(tpnt, sizeof(struct esp));
+	host = scsi_host_alloc(tpnt, sizeof(struct esp), NULL);
 	if (!host)
 		goto fail;
 
diff --git a/drivers/scsi/sun_esp.c b/drivers/scsi/sun_esp.c
index aa430501f0c7..bc4e4030acb6 100644
--- a/drivers/scsi/sun_esp.c
+++ b/drivers/scsi/sun_esp.c
@@ -457,7 +457,7 @@ static int esp_sbus_probe_one(struct platform_device *op,
 	struct esp *esp;
 	int err;
 
-	host = scsi_host_alloc(tpnt, sizeof(struct esp));
+	host = scsi_host_alloc(tpnt, sizeof(struct esp), NULL);
 
 	err = -ENOMEM;
 	if (!host)
diff --git a/drivers/scsi/sym53c8xx_2/sym_glue.c b/drivers/scsi/sym53c8xx_2/sym_glue.c
index 27e22acaf1a7..16e821c3b59e 100644
--- a/drivers/scsi/sym53c8xx_2/sym_glue.c
+++ b/drivers/scsi/sym53c8xx_2/sym_glue.c
@@ -1300,7 +1300,7 @@ static struct Scsi_Host *sym_attach(const struct scsi_host_template *tpnt, int u
 	if (!fw)
 		goto attach_failed;
 
-	shost = scsi_host_alloc(tpnt, sizeof(*sym_data));
+	shost = scsi_host_alloc(tpnt, sizeof(*sym_data), &pdev->dev);
 	if (!shost)
 		goto attach_failed;
 	sym_data = shost_priv(shost);
diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 5fdaa71f0652..88375574cb18 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -929,7 +929,7 @@ static int virtscsi_probe(struct virtio_device *vdev)
 	num_targets = virtscsi_config_get(vdev, max_target) + 1;
 
 	shost = scsi_host_alloc(&virtscsi_host_template,
-				struct_size(vscsi, req_vqs, num_queues));
+				struct_size(vscsi, req_vqs, num_queues), NULL);
 	if (!shost)
 		return -ENOMEM;
 
diff --git a/drivers/scsi/vmw_pvscsi.c b/drivers/scsi/vmw_pvscsi.c
index 151cac9f9c2a..32c39c66c49b 100644
--- a/drivers/scsi/vmw_pvscsi.c
+++ b/drivers/scsi/vmw_pvscsi.c
@@ -1435,7 +1435,7 @@ static int pvscsi_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		PVSCSI_MAX_NUM_REQ_ENTRIES_PER_PAGE;
 	pvscsi_template.cmd_per_lun =
 		min(pvscsi_template.can_queue, pvscsi_cmd_per_lun);
-	host = scsi_host_alloc(&pvscsi_template, sizeof(struct pvscsi_adapter));
+	host = scsi_host_alloc(&pvscsi_template, sizeof(struct pvscsi_adapter), &pdev->dev);
 	if (!host) {
 		printk(KERN_ERR "vmw_pvscsi: failed to allocate host\n");
 		goto out_release_resources_and_disable;
diff --git a/drivers/scsi/wd719x.c b/drivers/scsi/wd719x.c
index 830d40f57f6a..0aa6bb093431 100644
--- a/drivers/scsi/wd719x.c
+++ b/drivers/scsi/wd719x.c
@@ -921,7 +921,7 @@ static int wd719x_pci_probe(struct pci_dev *pdev, const struct pci_device_id *d)
 		goto release_region;
 
 	err = -ENOMEM;
-	sh = scsi_host_alloc(&wd719x_template, sizeof(struct wd719x));
+	sh = scsi_host_alloc(&wd719x_template, sizeof(struct wd719x), &pdev->dev);
 	if (!sh)
 		goto release_region;
 
diff --git a/drivers/scsi/xen-scsifront.c b/drivers/scsi/xen-scsifront.c
index 989bcaee42ca..d4d57f33cc15 100644
--- a/drivers/scsi/xen-scsifront.c
+++ b/drivers/scsi/xen-scsifront.c
@@ -899,7 +899,7 @@ static int scsifront_probe(struct xenbus_device *dev,
 	int err = -ENOMEM;
 	char name[TASK_COMM_LEN];
 
-	host = scsi_host_alloc(&scsifront_sht, sizeof(*info));
+	host = scsi_host_alloc(&scsifront_sht, sizeof(*info), NULL);
 	if (!host) {
 		xenbus_dev_fatal(dev, err, "fail to allocate scsi host");
 		return err;
diff --git a/drivers/scsi/zorro_esp.c b/drivers/scsi/zorro_esp.c
index 1622285c9aec..5983015877a7 100644
--- a/drivers/scsi/zorro_esp.c
+++ b/drivers/scsi/zorro_esp.c
@@ -774,7 +774,7 @@ static int zorro_esp_probe(struct zorro_dev *z,
 		goto fail_free_zep;
 	}
 
-	host = scsi_host_alloc(tpnt, sizeof(struct esp));
+	host = scsi_host_alloc(tpnt, sizeof(struct esp), NULL);
 
 	if (!host) {
 		pr_err("No host detected; board configuration problem?\n");
diff --git a/include/scsi/libfc.h b/include/scsi/libfc.h
index be0ffe1e3395..17e545fa5c7e 100644
--- a/include/scsi/libfc.h
+++ b/include/scsi/libfc.h
@@ -883,7 +883,7 @@ libfc_host_alloc(const struct scsi_host_template *sht, int priv_size)
 	struct fc_lport *lport;
 	struct Scsi_Host *shost;
 
-	shost = scsi_host_alloc(sht, sizeof(*lport) + priv_size);
+	shost = scsi_host_alloc(sht, sizeof(*lport) + priv_size, NULL);
 	if (!shost)
 		return NULL;
 	lport = shost_priv(shost);
diff --git a/include/scsi/scsi_host.h b/include/scsi/scsi_host.h
index 7e2011830ba4..09c82a41b7a1 100644
--- a/include/scsi/scsi_host.h
+++ b/include/scsi/scsi_host.h
@@ -796,7 +796,8 @@ static inline int scsi_host_in_recovery(struct Scsi_Host *shost)
 extern int scsi_queue_work(struct Scsi_Host *, struct work_struct *);
 extern void scsi_flush_work(struct Scsi_Host *);
 
-extern struct Scsi_Host *scsi_host_alloc(const struct scsi_host_template *, int);
+extern struct Scsi_Host *scsi_host_alloc(const struct scsi_host_template *sht,
+					 int privsize, struct device *dev);
 extern int __must_check scsi_add_host_with_dma(struct Scsi_Host *,
 					       struct device *,
 					       struct device *);
-- 
2.43.7



* [PATCH v3 1/4] scsi: scan: allocate sdev and starget on the NUMA node of the host adapter
From: Sumit Saxena @ 2026-06-09 12:18 UTC (permalink / raw)
  To: Martin K . Petersen, Jens Axboe
  Cc: James E . J . Bottomley, linux-scsi, linux-block, Adam Radford,
	Khalid Aziz, Adaptec OEM Raid Solutions, Matthew Wilcox,
	Hannes Reinecke, Juergen E . Fischer, Russell King,
	linux-arm-kernel, Finn Thain, Michael Schmitz, Anil Gurumurthy,
	Sudarsana Kalluru, Oliver Neukum, Ali Akcaagac, Jamie Lenehan,
	Ram Vegesna, target-devel, Bradley Grove, Satish Kharat,
	Sesidhar Baddela, Karan Tilak Kumar, Yihang Li, Don Brace,
	storagedev, HighPoint Linux Team, Tyrel Datwyler,
	Madhavan Srinivasan, Michael Ellerman, Nicholas Piggin,
	Christophe Leroy, linuxppc-dev, Brian King, Lee Duncan,
	Chris Leech, Mike Christie, open-iscsi, Justin Tee, Paul Ely,
	Kashyap Desai, Shivasharan S, Chandrakanth Patil,
	megaraidlinux.pdl, Sathya Prakash Veerichetty, Sreekanth Reddy,
	mpi3mr-linuxdrv.pdl, Suganath Prabu Subramani, Ranjan Kumar,
	MPT-FusionLinux.pdl, Daniel Palmer, GOTO Masanori, YOKOTA Hiroshi,
	Jack Wang, Geoff Levand, Michael Reed, Nilesh Javali,
	GR-QLogic-Storage-Upstream, Narsimhulu Musini, K . Y . Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, linux-hyperv,
	Michael S . Tsirkin, Jason Wang, Paolo Bonzini, Stefan Hajnoczi,
	Eugenio Perez, virtualization, Vishal Bhakta,
	bcm-kernel-feedback-list, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, xen-devel, James Rizzo, Sumit Saxena
In-Reply-To: <20260609121806.2121755-1-sumit.saxena@broadcom.com>

From: James Rizzo <james.rizzo@broadcom.com>

When a host adapter is attached to a specific NUMA node, allocating
scsi_device and scsi_target via kzalloc() may place them on a remote
node.  All hot-path I/O accesses to these structures then cross the NUMA
interconnect, adding latency and consuming inter-node bandwidth.

Use kzalloc_node() with dev_to_node(shost->dma_dev) so allocations land
on the same node as the HBA, reducing cross-node traffic and improving
I/O performance on NUMA systems.

Signed-off-by: James Rizzo <james.rizzo@broadcom.com>
Signed-off-by: Sumit Saxena <sumit.saxena@broadcom.com>
---
 drivers/scsi/scsi_scan.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index e27da038603a..121a14d5fdb8 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -34,6 +34,7 @@
 #include <linux/kthread.h>
 #include <linux/spinlock.h>
 #include <linux/async.h>
+#include <linux/topology.h>
 #include <linux/slab.h>
 #include <linux/unaligned.h>
 
@@ -287,8 +288,8 @@ static struct scsi_device *scsi_alloc_sdev(struct scsi_target *starget,
 	struct Scsi_Host *shost = dev_to_shost(starget->dev.parent);
 	struct queue_limits lim;
 
-	sdev = kzalloc(sizeof(*sdev) + shost->transportt->device_size,
-		       GFP_KERNEL);
+	sdev = kzalloc_node(sizeof(*sdev) + shost->transportt->device_size,
+			    GFP_KERNEL, dev_to_node(shost->dma_dev));
 	if (!sdev)
 		goto out;
 
@@ -502,7 +503,7 @@ static struct scsi_target *scsi_alloc_target(struct device *parent,
 	struct scsi_target *found_target;
 	int error, ref_got;
 
-	starget = kzalloc(size, GFP_KERNEL);
+	starget = kzalloc_node(size, GFP_KERNEL, dev_to_node(shost->dma_dev));
 	if (!starget) {
 		printk(KERN_ERR "%s: allocation failure\n", __func__);
 		return NULL;
-- 
2.43.7


