Linux-HyperV List


* Re: (subset) [PATCH 00/12] treewide: Convert buses to use generic driver_override
From: Danilo Krummrich @ 2026-04-04 15:07 UTC (permalink / raw)
  To: Russell King, Greg Kroah-Hartman, Rafael J. Wysocki,
	Ioana Ciornei, Nipun Gupta, Nikhil Agarwal, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Bjorn Helgaas,
	Armin Wolf, Bjorn Andersson, Mathieu Poirier, Vineeth Vijayan,
	Peter Oberparleiter, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Harald Freudenberger, Holger Dengler, Mark Brown,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Alex Williamson, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Christophe Leroy (CS GROUP)
  Cc: linux-kernel, driver-core, linuxppc-dev, linux-hyperv, linux-pci,
	platform-driver-x86, linux-arm-msm, linux-remoteproc, linux-s390,
	linux-spi, virtualization, kvm, xen-devel, linux-arm-kernel,
	Danilo Krummrich
In-Reply-To: <20260324005919.2408620-1-dakr@kernel.org>

On Tue Mar 24, 2026 at 1:59 AM CET, Danilo Krummrich wrote:
> Danilo Krummrich (12):
>   PCI: use generic driver_override infrastructure
>   platform/wmi: use generic driver_override infrastructure
>   vdpa: use generic driver_override infrastructure
>   s390/cio: use generic driver_override infrastructure
>   s390/ap: use generic driver_override infrastructure

Applied to driver-core-testing, thanks!

>   amba: use generic driver_override infrastructure
>   cdx: use generic driver_override infrastructure
>   hv: vmbus: use generic driver_override infrastructure
>   rpmsg: use generic driver_override infrastructure

I have not picked these up, as they have not received ACKs from the
corresponding subsystem maintainers so far.

>   bus: fsl-mc: use generic driver_override infrastructure
>   spi: use generic driver_override infrastructure

These have already been picked up via the respective subsystem trees -- thanks!

Thanks,
Danilo


* Re: [PATCH 02/12] bus: fsl-mc: use generic driver_override infrastructure
From: Christophe Leroy (CS GROUP) @ 2026-04-04 16:56 UTC (permalink / raw)
  To: Ioana Ciornei, Danilo Krummrich
  Cc: Russell King, Greg Kroah-Hartman, Rafael J. Wysocki, Nipun Gupta,
	Nikhil Agarwal, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Bjorn Helgaas, Armin Wolf, Bjorn Andersson,
	Mathieu Poirier, Vineeth Vijayan, Peter Oberparleiter,
	Heiko Carstens, Vasily Gorbik, Alexander Gordeev,
	Christian Borntraeger, Sven Schnelle, Harald Freudenberger,
	Holger Dengler, Mark Brown, Michael S. Tsirkin, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Alex Williamson,
	Juergen Gross, Stefano Stabellini, Oleksandr Tyshchenko,
	linux-kernel, driver-core, linuxppc-dev, linux-hyperv, linux-pci,
	platform-driver-x86, linux-arm-msm, linux-remoteproc, linux-s390,
	linux-spi, virtualization, kvm, xen-devel, linux-arm-kernel,
	Gui-Dong Han
In-Reply-To: <4c5e9bad-82f0-4714-99c2-8ccd79a45043@kernel.org>



On 28/03/2026 at 13:10, Christophe Leroy (CS GROUP) wrote:
> 
> 
> On 25/03/2026 at 13:01, Ioana Ciornei wrote:
>> On Tue, Mar 24, 2026 at 01:59:06AM +0100, Danilo Krummrich wrote:
>>> When a driver is probed through __driver_attach(), the bus' match()
>>> callback is called without the device lock held, thus accessing the
>>> driver_override field without a lock, which can cause a UAF.
>>>
>>> Fix this by using the driver-core driver_override infrastructure taking
>>> care of proper locking internally.
>>>
>>> Note that calling match() from __driver_attach() without the device lock
>>> held is intentional. [1]
>>>
>>> Link: https://lore.kernel.org/driver-core/DGRGTIRHA62X.3RY09D9SOK77P@kernel.org/ [1]
>>> Reported-by: Gui-Dong Han <hanguidong02@gmail.com>
>>> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=220789
>>> Fixes: 1f86a00c1159 ("bus/fsl-mc: add support for 'driver_override' in the mc-bus")
>>> Signed-off-by: Danilo Krummrich <dakr@kernel.org>
>>
>> Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com>
>> Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
>>
> 
> 
> Applied, thanks

Have to drop it for now, build fails:

   CALL    scripts/checksyscalls.sh
   CC      drivers/bus/fsl-mc/fsl-mc-bus.o
drivers/bus/fsl-mc/fsl-mc-bus.c: In function 'fsl_mc_bus_match':
drivers/bus/fsl-mc/fsl-mc-bus.c:92:15: error: implicit declaration of function 'device_match_driver_override' [-Werror=implicit-function-declaration]
    92 |         ret = device_match_driver_override(dev, drv);
       |               ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
drivers/bus/fsl-mc/fsl-mc-bus.c: At top level:
drivers/bus/fsl-mc/fsl-mc-bus.c:321:10: error: 'const struct bus_type' has no member named 'driver_override'
   321 |         .driver_override = true,
       |          ^~~~~~~~~~~~~~~
drivers/bus/fsl-mc/fsl-mc-bus.c:321:28: warning: initialization of 'const char *' from 'int' makes pointer from integer without a cast [-Wint-conversion]
   321 |         .driver_override = true,
       |                            ^~~~
drivers/bus/fsl-mc/fsl-mc-bus.c:321:28: note: (near initialization for 'fsl_mc_bus_type.dev_name')
cc1: some warnings being treated as errors
make[5]: *** [scripts/Makefile.build:289: drivers/bus/fsl-mc/fsl-mc-bus.o] Error 1
make[4]: *** [scripts/Makefile.build:546: drivers/bus/fsl-mc] Error 2
make[3]: *** [scripts/Makefile.build:546: drivers/bus] Error 2
make[2]: *** [scripts/Makefile.build:546: drivers] Error 2
make[1]: *** [/home/chleroy/linux-powerpc/Makefile:2101: .] Error 2
make: *** [Makefile:248: __sub-make] Error 2

Christophe
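
For readers following the fix itself: the race described in the quoted
commit message (match() reading driver_override without the device lock
while another thread replaces the string) can be illustrated with a small
userspace sketch. Everything below is hypothetical: struct toy_device,
toy_set_driver_override() and toy_match_driver_override() are
illustration-only names, a pthread mutex stands in for whatever locking the
driver core uses internally, and the return convention is an assumption of
this sketch, not the kernel API.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a device carrying a driver_override string. */
struct toy_device {
	pthread_mutex_t override_lock;	/* protects driver_override */
	char *driver_override;		/* NULL when no override is set */
};

/* Replace the override string; passing NULL clears it. Returns 0 or -1. */
static int toy_set_driver_override(struct toy_device *dev, const char *name)
{
	char *fresh = NULL;
	char *old;

	if (name) {
		fresh = malloc(strlen(name) + 1);
		if (!fresh)
			return -1;
		strcpy(fresh, name);
	}

	pthread_mutex_lock(&dev->override_lock);
	old = dev->driver_override;
	dev->driver_override = fresh;
	pthread_mutex_unlock(&dev->override_lock);

	/* Readers only dereference the pointer under the lock, so the old
	 * string can be freed once it is no longer published.
	 */
	free(old);
	return 0;
}

/* Sketch convention (assumed): 1 = match, 0 = mismatch, -1 = no override. */
static int toy_match_driver_override(struct toy_device *dev,
				     const char *drv_name)
{
	int ret = -1;

	pthread_mutex_lock(&dev->override_lock);
	if (dev->driver_override)
		ret = strcmp(dev->driver_override, drv_name) == 0;
	pthread_mutex_unlock(&dev->override_lock);

	return ret;
}
```

The point of the sketch is only that the comparison and the replacement
take the same lock, so the reader never sees a freed string.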



* Re: (subset) [PATCH 00/12] treewide: Convert buses to use generic driver_override
From: Christophe Leroy (CS GROUP) @ 2026-04-04 16:58 UTC (permalink / raw)
  To: Danilo Krummrich, Russell King, Greg Kroah-Hartman,
	Rafael J. Wysocki, Ioana Ciornei, Nipun Gupta, Nikhil Agarwal,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Bjorn Helgaas, Armin Wolf, Bjorn Andersson, Mathieu Poirier,
	Vineeth Vijayan, Peter Oberparleiter, Heiko Carstens,
	Vasily Gorbik, Alexander Gordeev, Christian Borntraeger,
	Sven Schnelle, Harald Freudenberger, Holger Dengler, Mark Brown,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Alex Williamson, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, Christophe Leroy (CS GROUP)
  Cc: linux-kernel, driver-core, linuxppc-dev, linux-hyperv, linux-pci,
	platform-driver-x86, linux-arm-msm, linux-remoteproc, linux-s390,
	linux-spi, virtualization, kvm, xen-devel, linux-arm-kernel
In-Reply-To: <DHKGQN6D0ANO.2QYY3JTM5435O@kernel.org>



On 04/04/2026 at 17:07, Danilo Krummrich wrote:
> On Tue Mar 24, 2026 at 1:59 AM CET, Danilo Krummrich wrote:
>> Danilo Krummrich (12):
>>    PCI: use generic driver_override infrastructure
>>    platform/wmi: use generic driver_override infrastructure
>>    vdpa: use generic driver_override infrastructure
>>    s390/cio: use generic driver_override infrastructure
>>    s390/ap: use generic driver_override infrastructure
> 
> Applied to driver-core-testing, thanks!
> 
>>    amba: use generic driver_override infrastructure
>>    cdx: use generic driver_override infrastructure
>>    hv: vmbus: use generic driver_override infrastructure
>>    rpmsg: use generic driver_override infrastructure
> 
> I have not picked these up, as they have not received ACKs from the
> corresponding subsystem maintainers so far.
> 
>>    bus: fsl-mc: use generic driver_override infrastructure

I dropped it from the soc_fsl tree; some dependency must be missing.

Feel free to take it if you can; it is acked by Ioana.

>>    spi: use generic driver_override infrastructure
> 
> These have already been picked up via the respective subsystem trees -- thanks!
> 
> Thanks,
> Danilo



* Re: (subset) [PATCH 00/12] treewide: Convert buses to use generic driver_override
From: Danilo Krummrich @ 2026-04-04 17:04 UTC (permalink / raw)
  To: Christophe Leroy (CS GROUP)
  Cc: Russell King, Greg Kroah-Hartman, Rafael J. Wysocki,
	Ioana Ciornei, Nipun Gupta, Nikhil Agarwal, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Bjorn Helgaas,
	Armin Wolf, Bjorn Andersson, Mathieu Poirier, Vineeth Vijayan,
	Peter Oberparleiter, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Harald Freudenberger, Holger Dengler, Mark Brown,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Alex Williamson, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, linux-kernel, driver-core, linuxppc-dev,
	linux-hyperv, linux-pci, platform-driver-x86, linux-arm-msm,
	linux-remoteproc, linux-s390, linux-spi, virtualization, kvm,
	xen-devel, linux-arm-kernel
In-Reply-To: <76355cb5-0b5d-4a29-9702-8d020a79f4c0@kernel.org>

On Sat Apr 4, 2026 at 6:58 PM CEST, Christophe Leroy (CS GROUP) wrote:
>
>
> On 04/04/2026 at 17:07, Danilo Krummrich wrote:
>> On Tue Mar 24, 2026 at 1:59 AM CET, Danilo Krummrich wrote:
>>> Danilo Krummrich (12):
>>>    PCI: use generic driver_override infrastructure
>>>    platform/wmi: use generic driver_override infrastructure
>>>    vdpa: use generic driver_override infrastructure
>>>    s390/cio: use generic driver_override infrastructure
>>>    s390/ap: use generic driver_override infrastructure
>> 
>> Applied to driver-core-testing, thanks!
>> 
>>>    amba: use generic driver_override infrastructure
>>>    cdx: use generic driver_override infrastructure
>>>    hv: vmbus: use generic driver_override infrastructure
>>>    rpmsg: use generic driver_override infrastructure
>> 
>> I have not picked these up, as they have not received ACKs from the
>> corresponding subsystem maintainers so far.
>> 
>>>    bus: fsl-mc: use generic driver_override infrastructure
>
> I dropped it from the soc_fsl tree; some dependency must be missing.
>
> Feel free to take it if you can; it is acked by Ioana.

It is based on v7.0-rc5; if you want I can pick it up.

>>>    spi: use generic driver_override infrastructure
>> 
>> These have already been picked up via the respective subsystem trees -- thanks!
>> 
>> Thanks,
>> Danilo



* Re: (subset) [PATCH 00/12] treewide: Convert buses to use generic driver_override
From: Christophe Leroy (CS GROUP) @ 2026-04-04 17:09 UTC (permalink / raw)
  To: Danilo Krummrich
  Cc: Russell King, Greg Kroah-Hartman, Rafael J. Wysocki,
	Ioana Ciornei, Nipun Gupta, Nikhil Agarwal, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Bjorn Helgaas,
	Armin Wolf, Bjorn Andersson, Mathieu Poirier, Vineeth Vijayan,
	Peter Oberparleiter, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Harald Freudenberger, Holger Dengler, Mark Brown,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Alex Williamson, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, linux-kernel, driver-core, linuxppc-dev,
	linux-hyperv, linux-pci, platform-driver-x86, linux-arm-msm,
	linux-remoteproc, linux-s390, linux-spi, virtualization, kvm,
	xen-devel, linux-arm-kernel
In-Reply-To: <DHKJ7VWI1CHO.3ETHUGQVPFFDE@kernel.org>



On 04/04/2026 at 19:04, Danilo Krummrich wrote:
> On Sat Apr 4, 2026 at 6:58 PM CEST, Christophe Leroy (CS GROUP) wrote:
>>
>>
>> On 04/04/2026 at 17:07, Danilo Krummrich wrote:
>>> On Tue Mar 24, 2026 at 1:59 AM CET, Danilo Krummrich wrote:
>>>> Danilo Krummrich (12):
>>>>     PCI: use generic driver_override infrastructure
>>>>     platform/wmi: use generic driver_override infrastructure
>>>>     vdpa: use generic driver_override infrastructure
>>>>     s390/cio: use generic driver_override infrastructure
>>>>     s390/ap: use generic driver_override infrastructure
>>>
>>> Applied to driver-core-testing, thanks!
>>>
>>>>     amba: use generic driver_override infrastructure
>>>>     cdx: use generic driver_override infrastructure
>>>>     hv: vmbus: use generic driver_override infrastructure
>>>>     rpmsg: use generic driver_override infrastructure
>>>
>>> I have not picked these up, as they have not received ACKs from the
>>> corresponding subsystem maintainers so far.
>>>
>>>>     bus: fsl-mc: use generic driver_override infrastructure
>>
>> I dropped it from the soc_fsl tree; some dependency must be missing.
>>
>> Feel free to take it if you can; it is acked by Ioana.
> 
> It is based on v7.0-rc5; if you want I can pick it up.

Yes, please pick it up, as my tree is based on rc1.

Thanks
Christophe


> 
>>>>     spi: use generic driver_override infrastructure
>>>
>>> These have already been picked up via the respective subsystem trees -- thanks!
>>>
>>> Thanks,
>>> Danilo
> 



* Re: (subset) [PATCH 00/12] treewide: Convert buses to use generic driver_override
From: Danilo Krummrich @ 2026-04-04 19:20 UTC (permalink / raw)
  To: Christophe Leroy (CS GROUP)
  Cc: Russell King, Greg Kroah-Hartman, Rafael J. Wysocki,
	Ioana Ciornei, Nipun Gupta, Nikhil Agarwal, K. Y. Srinivasan,
	Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li, Bjorn Helgaas,
	Armin Wolf, Bjorn Andersson, Mathieu Poirier, Vineeth Vijayan,
	Peter Oberparleiter, Heiko Carstens, Vasily Gorbik,
	Alexander Gordeev, Christian Borntraeger, Sven Schnelle,
	Harald Freudenberger, Holger Dengler, Mark Brown,
	Michael S. Tsirkin, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Alex Williamson, Juergen Gross, Stefano Stabellini,
	Oleksandr Tyshchenko, linux-kernel, driver-core, linuxppc-dev,
	linux-hyperv, linux-pci, platform-driver-x86, linux-arm-msm,
	linux-remoteproc, linux-s390, linux-spi, virtualization, kvm,
	xen-devel, linux-arm-kernel
In-Reply-To: <a8c85884-e2ba-4a3a-a660-9715f0de2704@kernel.org>

On Sat Apr 4, 2026 at 7:09 PM CEST, Christophe Leroy (CS GROUP) wrote:
> Yes, please pick it up, as my tree is based on rc1.

Applied the patch to driver-core-testing, thanks!


* Re: [PATCH net-next,v4] net: mana: Force full-page RX buffers via ethtool private flag
From: Dipayaan Roy @ 2026-04-05  3:14 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	pabeni, leon, longli, kotaranov, horms, shradhagupta, ssengar,
	ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, leitao, kees, dipayanroy
In-Reply-To: <20260330154755.6a8c73a6@kernel.org>

On Mon, Mar 30, 2026 at 03:47:55PM -0700, Jakub Kicinski wrote:
> On Mon, 30 Mar 2026 14:01:54 -0700 Dipayaan Roy wrote:
> > On some ARM64 platforms with 4K PAGE_SIZE, page_pool fragment
> > allocation in the RX refill path can cause 15-20% throughput
> > regression under high connection counts (>16 TCP streams).
> 
> Did you investigate what makes such a difference exactly?
> As I said I suspect there are some improvements we could
> make in the page pool fragmentation logic that could yield
> similar wins without bothering the user.
>
I collected the perf numbers and have shared the analysis below.
> > Add an ethtool private flag "full-page-rx" that allows the user to
> > force one RX buffer per page, bypassing the page_pool fragment path.
> > This restores line-rate (180+ Gbps) performance on affected platforms.
> > 
> > Usage:
> >   ethtool --set-priv-flags eth0 full-page-rx on
> > 
> > There is no behavioral change by default. The flag must be explicitly
> > enabled by the user or udev rule.
> > 
> > The existing single-buffer-per-page logic for XDP and jumbo frames is
> > consolidated into a new helper mana_use_single_rxbuf_per_page().
> 
> ethtool -g rx-buf-len could also fit the bill but I guess this is more
> of a hack / workaround than legit config so no strong preference.
> 
OK, I would prefer to stay with the private flag.
> > -static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
> > +static void mana_get_strings_stats(struct mana_port_context *apc, u8 **data)
> >  {
> > -	struct mana_port_context *apc = netdev_priv(ndev);
> >  	unsigned int num_queues = apc->num_queues;
> >  	int i, j;
> >  
> > -	if (stringset != ETH_SS_STATS)
> > -		return;
> >  	for (i = 0; i < ARRAY_SIZE(mana_eth_stats); i++)
> > -		ethtool_puts(&data, mana_eth_stats[i].name);
> > +		ethtool_puts(data, mana_eth_stats[i].name);
> >  
> >  	for (i = 0; i < ARRAY_SIZE(mana_hc_stats); i++)
> > -		ethtool_puts(&data, mana_hc_stats[i].name);
> > +		ethtool_puts(data, mana_hc_stats[i].name);
> >  
> >  	for (i = 0; i < ARRAY_SIZE(mana_phy_stats); i++)
> > -		ethtool_puts(&data, mana_phy_stats[i].name);
> > +		ethtool_puts(data, mana_phy_stats[i].name);
> >  
> >  	for (i = 0; i < num_queues; i++) {
> > -		ethtool_sprintf(&data, "rx_%d_packets", i);
> > -		ethtool_sprintf(&data, "rx_%d_bytes", i);
> > -		ethtool_sprintf(&data, "rx_%d_xdp_drop", i);
> > -		ethtool_sprintf(&data, "rx_%d_xdp_tx", i);
> > -		ethtool_sprintf(&data, "rx_%d_xdp_redirect", i);
> > -		ethtool_sprintf(&data, "rx_%d_pkt_len0_err", i);
> > +		ethtool_sprintf(data, "rx_%d_packets", i);
> 
> Please factor out the noisy, no-op prep work into a separate patch for
> ease of review
Ack, I will split it out into 2 separate patches in v5.
> -- 
> pw-bot: cr

Hi Jakub,

I did some perf analysis on the ARM64 platform for which we want
this workaround of full-page RX buffers:

test: ntttcp with 48 tcp connections
perf: perf record -ag --call-graph dwarf -C 0-33 -- sleep 32

Page pool overhead summary
(fragment-based RX buffers vs. full-page RX buffers on the same ARM64
platform):

  Function                        Fragment   Full-page   Delta
  -----------------------------   --------   ---------   ------
  napi_pp_put_page                  3.93%      0.85%    +3.08%
  page_pool_alloc_frag_netmem       1.93%        n/a    +1.93%
  Total page_pool overhead          5.86%      0.85%    +5.01%

In fragment mode, napi_pp_put_page performs an atomic decrement of
the shared page refcount on every packet free. This single operation
accounts for ~3% more CPU than in full-page mode, where the page is
sole-owned and the atomic is skipped entirely. Additionally,
page_pool_alloc_frag_netmem adds ~2% overhead on the allocation
path for fragments.
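
To make the difference concrete, here is a minimal userspace sketch of the
idea, loosely modeled on the behaviour described above and not the actual
page_pool code. struct toy_page, toy_page_init() and toy_page_unref() are
hypothetical names, and the rmw_count field exists only so the sketch can
show when an atomic read-modify-write is issued.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical page with a shared fragment reference count. */
struct toy_page {
	atomic_long frag_refs;	/* outstanding users of this page */
	long rmw_count;		/* atomic RMW ops issued (illustration only) */
};

static void toy_page_init(struct toy_page *page, long nr_users)
{
	atomic_init(&page->frag_refs, nr_users);
	page->rmw_count = 0;
}

/* Drop one reference. Returns the number of users left; 0 means the
 * caller owns the page exclusively and may recycle it.
 */
static long toy_page_unref(struct toy_page *page)
{
	/* Sole owner: a plain load is enough, no atomic RMW on the
	 * shared counter.
	 */
	if (atomic_load(&page->frag_refs) == 1)
		return 0;

	/* Shared page: every release pays for an atomic decrement. */
	page->rmw_count++;
	return atomic_fetch_sub(&page->frag_refs, 1) - 1;
}
```

With one user per page (full-page mode in this sketch) toy_page_unref()
never reaches the atomic decrement; with two users per page the first
release does.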

Further annotation of the hot page pool functions in fragment mode
shows:

napi_pp_put_page:

    0.09 :   ffff80008117c240:       b       ffff80008117c268 <napi_pp_put_page+0x68>
         : 64               ATOMIC64_FETCH_OP(        , al, op, asm_op, "memory")
         :
         : 66               ATOMIC64_FETCH_OPS(andnot, ldclr)
         : 67               ATOMIC64_FETCH_OPS(or, ldset)
         : 68               ATOMIC64_FETCH_OPS(xor, ldeor)
         : 69               ATOMIC64_FETCH_OPS(add, ldadd)
    0.00 :   ffff80008117c244:       mov     x3, #0xffffffffffffffff // #-1
    0.08 :   ffff80008117c248:       add     x0, x2, #0x28
    0.06 :   ffff80008117c24c:       ldaddal x3, x3, [x0]
         : 73               }
         :
         : 75               ATOMIC64_OP_ADD_SUB_RETURN(_relaxed)
         : 76               ATOMIC64_OP_ADD_SUB_RETURN(_acquire)
         : 77               ATOMIC64_OP_ADD_SUB_RETURN(_release)
         : 78               ATOMIC64_OP_ADD_SUB_RETURN(        )
   88.09 :   ffff80008117c250:       sub     x3, x3, #0x1
         :
         : 81               return 0;
         : 82               }

88% of this function's cycles stall on the sub that depends on
ldaddal.


page_pool_alloc_frag_netmem:

         : 151              ATOMIC64_FETCH_OPS(add, ldadd)
    0.00 :   ffff8000811fd40c:       add     x1, x21, #0x28
    0.14 :   ffff8000811fd410:       ldaddal x0, x1, [x1]
         : 154              }
         :
         : 156              ATOMIC64_OP_ADD_SUB_RETURN(_relaxed)
         : 157              ATOMIC64_OP_ADD_SUB_RETURN(_acquire)
         : 158              ATOMIC64_OP_ADD_SUB_RETURN(_release)
         : 159              ATOMIC64_OP_ADD_SUB_RETURN(        )
   75.09 :   ffff8000811fd414:       add     x0, x0, x1
         : 161              WARN_ON(ret < 0);
    0.16 :   ffff8000811fd418:       cmp     x0, #0x0
    0.00 :   ffff8000811fd41c:       b.lt    ffff8000811fd394 <page_pool_alloc_frag_netmem+0xb4>  // b.tstop


75% of this function's cycles stall on the same pattern.
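
The pattern in both annotations, an atomic fetch-add whose result feeds the
very next arithmetic instruction, can be written in portable C11 as below.
This is only a sketch: toy_sub_return() is a hypothetical name, and whether
a given compiler and target turn the fetch-add into ldaddal is an
assumption that depends on the toolchain and flags.

```c
#include <assert.h>
#include <stdatomic.h>

/* Subtract n from *v atomically and return the new value. */
static long toy_sub_return(atomic_long *v, long n)
{
	/* Atomic read-modify-write on the shared counter; returns the
	 * value the counter held before the update.
	 */
	long old = atomic_fetch_add(v, -n);

	/* Dependent step: the new value cannot be computed until the
	 * atomic above has delivered its result.
	 */
	return old - n;
}
```

In this shape the instruction after the atomic has to wait for the atomic's
result, which matches where the samples land in the annotations above.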


Full comparison (top functions, >0.5%):

Fragment mode:                          Full-page mode:
-------------                           --------------
 15.88%  __wake_up_sync_key             13.66%  __wake_up_sync_key
  9.66%  default_idle_call              10.41%  default_idle_call
  8.38%  handle_softirqs                 8.89%  handle_softirqs
  3.93%  napi_pp_put_page       ←        0.85%  napi_pp_put_page
  3.18%  tcp_gro_receive                 3.43%  tcp_gro_receive
  1.93%  page_pool_alloc_frag   ←          n/a
    n/a                                  1.14%  page_pool_recycle_in_cache
    n/a                                  1.06%  page_pool_put_unrefed_netmem
  0.93%  napi_build_skb                  1.24%  napi_build_skb
  0.56%  __build_skb_around              1.46%  __build_skb_around

In full-page RX buffer mode, 'napi_pp_put_page' took just 0.85% on
the same ARM64 platform.

Comparing with another platform (x86):

To confirm that this behaviour is specific to this ARM64 platform, I
collected the same data on an x86 VM (Intel, 192 vCPUs, same 200 Gbps MANA NIC).
There, both the full-page and the fragment RX buffer modes achieve an
identical ~182 Gbps.

x86 fragment mode:                      x86 full-page mode:
------------------                      -------------------
 61.69%  pv_native_safe_halt            50.91%  pv_native_safe_halt
  4.17%  _raw_spin_unlock_irqrestore     6.19%  _raw_spin_unlock_irqrestore
  3.95%  handle_softirqs                 4.02%  handle_softirqs
  2.51%  _copy_to_iter                   2.53%  _copy_to_iter
  0.60%  napi_pp_put_page                  n/a  napi_pp_put_page (<0.5%)

On x86, napi_pp_put_page is only 0.60% in fragment mode (vs 3.93%
on the ARM64 platform data shared earlier).

Note: I did not have a different ARM64 platform available to run and compare
against.

From the above data, this seems to be an issue specific to this ARM64
platform.


Regards


* [PATCH net-next v5 0/2] net: mana: add ethtool private flag for full-page RX buffers
From: Dipayaan Roy @ 2026-04-05  3:42 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees

On some ARM64 platforms with 4K PAGE_SIZE, utilizing page_pool 
fragments for allocation in the RX refill path (~2kB buffer per fragment)
causes 15-20% throughput regression under high connection counts
(>16 TCP streams at 180+ Gbps). Using full-page buffers on these
platforms shows no regression and restores line-rate performance.

This behavior is observed on a single platform; other platforms
perform better with page_pool fragments, indicating this is not a
page_pool issue but platform-specific.

This series adds an ethtool private flag "full-page-rx" to let the
user opt in to one RX buffer per page:

  ethtool --set-priv-flags eth0 full-page-rx on

There is no behavioral change by default. The flag can be persisted
via udev rule for affected platforms.
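
As a rough illustration of the decision this series describes (a sketch
only, not the driver code): the flag forces one buffer per page, and
otherwise the existing MTU/XDP rule applies. struct toy_rx_cfg and
toy_use_single_rxbuf_per_page() are hypothetical names, and the padding
value is passed in as a parameter because the real padding constant is not
reproduced here.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-port RX configuration for the sketch. */
struct toy_rx_cfg {
	unsigned int page_size;	/* e.g. 4096 on a 4K PAGE_SIZE system */
	unsigned int rxbuf_pad;	/* per-buffer overhead (assumed value) */
	bool xdp_enabled;	/* XDP program attached */
	bool full_page_rx;	/* user opted in via the private flag */
};

/* Returns true when each RX buffer should occupy a whole page. */
static bool toy_use_single_rxbuf_per_page(const struct toy_rx_cfg *cfg,
					  unsigned int mtu)
{
	/* Explicit user opt-in wins. */
	if (cfg->full_page_rx)
		return true;

	/* Otherwise: XDP, or a buffer too large to fit twice in a page. */
	return mtu + cfg->rxbuf_pad > cfg->page_size / 2 || cfg->xdp_enabled;
}
```

With a 4096-byte page and an assumed padding of 320 bytes, an MTU of 1500
still allows two buffers per page unless the flag is set.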

Changes in v5:
  - Split the prep refactor into a separate patch (patch 1/2).
Changes in v4:
  - Dropped the SMBIOS string parsing and added an ethtool private flag
    to reconfigure the queues with full-page RX buffers.
Changes in v3:
  - Changed u8* to char*.
Changes in v2:
  - Separated reading the string index from reading the string; removed inline.

Dipayaan Roy (2):
  net: mana: refactor mana_get_strings() and mana_get_sset_count() to
    use switch
  net: mana: force full-page RX buffers via ethtool private flag

 drivers/net/ethernet/microsoft/mana/mana_en.c |  22 ++-
 .../ethernet/microsoft/mana/mana_ethtool.c    | 164 ++++++++++++++----
 include/net/mana/mana.h                       |   8 +
 3 files changed, 163 insertions(+), 31 deletions(-)

-- 
2.43.0


* [PATCH net-next v5 1/2] net: mana: refactor mana_get_strings() and mana_get_sset_count() to use switch
From: Dipayaan Roy @ 2026-04-05  3:44 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees

Refactor mana_get_strings() and mana_get_sset_count() from if/else to
switch statements in preparation for adding ethtool private flags
support which requires handling ETH_SS_PRIV_FLAGS.

No functional change.

Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
 .../ethernet/microsoft/mana/mana_ethtool.c    | 75 ++++++++++++-------
 1 file changed, 46 insertions(+), 29 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index 6a4b42fe0944..a28ca461c135 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -138,53 +138,70 @@ static int mana_get_sset_count(struct net_device *ndev, int stringset)
 	struct mana_port_context *apc = netdev_priv(ndev);
 	unsigned int num_queues = apc->num_queues;
 
-	if (stringset != ETH_SS_STATS)
+	switch (stringset) {
+	case ETH_SS_STATS:
+		return ARRAY_SIZE(mana_eth_stats) +
+		       ARRAY_SIZE(mana_phy_stats) +
+		       ARRAY_SIZE(mana_hc_stats)  +
+		       num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+	default:
 		return -EINVAL;
-
-	return ARRAY_SIZE(mana_eth_stats) + ARRAY_SIZE(mana_phy_stats) + ARRAY_SIZE(mana_hc_stats) +
-			num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+	}
 }
 
-static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
+static void mana_get_strings_stats(struct mana_port_context *apc, u8 **data)
 {
-	struct mana_port_context *apc = netdev_priv(ndev);
 	unsigned int num_queues = apc->num_queues;
 	int i, j;
 
-	if (stringset != ETH_SS_STATS)
-		return;
 	for (i = 0; i < ARRAY_SIZE(mana_eth_stats); i++)
-		ethtool_puts(&data, mana_eth_stats[i].name);
+		ethtool_puts(data, mana_eth_stats[i].name);
 
 	for (i = 0; i < ARRAY_SIZE(mana_hc_stats); i++)
-		ethtool_puts(&data, mana_hc_stats[i].name);
+		ethtool_puts(data, mana_hc_stats[i].name);
 
 	for (i = 0; i < ARRAY_SIZE(mana_phy_stats); i++)
-		ethtool_puts(&data, mana_phy_stats[i].name);
+		ethtool_puts(data, mana_phy_stats[i].name);
 
 	for (i = 0; i < num_queues; i++) {
-		ethtool_sprintf(&data, "rx_%d_packets", i);
-		ethtool_sprintf(&data, "rx_%d_bytes", i);
-		ethtool_sprintf(&data, "rx_%d_xdp_drop", i);
-		ethtool_sprintf(&data, "rx_%d_xdp_tx", i);
-		ethtool_sprintf(&data, "rx_%d_xdp_redirect", i);
-		ethtool_sprintf(&data, "rx_%d_pkt_len0_err", i);
+		ethtool_sprintf(data, "rx_%d_packets", i);
+		ethtool_sprintf(data, "rx_%d_bytes", i);
+		ethtool_sprintf(data, "rx_%d_xdp_drop", i);
+		ethtool_sprintf(data, "rx_%d_xdp_tx", i);
+		ethtool_sprintf(data, "rx_%d_xdp_redirect", i);
+		ethtool_sprintf(data, "rx_%d_pkt_len0_err", i);
 		for (j = 0; j < MANA_RXCOMP_OOB_NUM_PPI - 1; j++)
-			ethtool_sprintf(&data, "rx_%d_coalesced_cqe_%d", i, j + 2);
+			ethtool_sprintf(data,
+					"rx_%d_coalesced_cqe_%d",
+					i,
+					j + 2);
 	}
 
 	for (i = 0; i < num_queues; i++) {
-		ethtool_sprintf(&data, "tx_%d_packets", i);
-		ethtool_sprintf(&data, "tx_%d_bytes", i);
-		ethtool_sprintf(&data, "tx_%d_xdp_xmit", i);
-		ethtool_sprintf(&data, "tx_%d_tso_packets", i);
-		ethtool_sprintf(&data, "tx_%d_tso_bytes", i);
-		ethtool_sprintf(&data, "tx_%d_tso_inner_packets", i);
-		ethtool_sprintf(&data, "tx_%d_tso_inner_bytes", i);
-		ethtool_sprintf(&data, "tx_%d_long_pkt_fmt", i);
-		ethtool_sprintf(&data, "tx_%d_short_pkt_fmt", i);
-		ethtool_sprintf(&data, "tx_%d_csum_partial", i);
-		ethtool_sprintf(&data, "tx_%d_mana_map_err", i);
+		ethtool_sprintf(data, "tx_%d_packets", i);
+		ethtool_sprintf(data, "tx_%d_bytes", i);
+		ethtool_sprintf(data, "tx_%d_xdp_xmit", i);
+		ethtool_sprintf(data, "tx_%d_tso_packets", i);
+		ethtool_sprintf(data, "tx_%d_tso_bytes", i);
+		ethtool_sprintf(data, "tx_%d_tso_inner_packets", i);
+		ethtool_sprintf(data, "tx_%d_tso_inner_bytes", i);
+		ethtool_sprintf(data, "tx_%d_long_pkt_fmt", i);
+		ethtool_sprintf(data, "tx_%d_short_pkt_fmt", i);
+		ethtool_sprintf(data, "tx_%d_csum_partial", i);
+		ethtool_sprintf(data, "tx_%d_mana_map_err", i);
+	}
+}
+
+static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+
+	switch (stringset) {
+	case ETH_SS_STATS:
+		mana_get_strings_stats(apc, &data);
+		break;
+	default:
+		break;
 	}
 }
 
-- 
2.43.0



* [PATCH net-next v5 2/2] net: mana: force full-page RX buffers via ethtool private flag
From: Dipayaan Roy @ 2026-04-05  3:47 UTC (permalink / raw)
  To: kys, haiyangz, wei.liu, decui, andrew+netdev, davem, edumazet,
	kuba, pabeni, leon, longli, kotaranov, horms, shradhagupta,
	ssengar, ernis, shirazsaleem, linux-hyperv, netdev, linux-kernel,
	linux-rdma, stephen, jacob.e.keller, dipayanroy, leitao, kees

On some ARM64 platforms with 4K PAGE_SIZE, page_pool fragment
allocation in the RX refill path can cause 15-20% throughput
regression under high connection counts (>16 TCP streams).

Add an ethtool private flag "full-page-rx" that allows the user to
force one RX buffer per page, bypassing the page_pool fragment path.
This restores line-rate (180+ Gbps) performance on affected platforms.

Usage:
  ethtool --set-priv-flags ethx full-page-rx on

There is no behavioral change by default. The flag must be explicitly
enabled by the user or udev rule.

The existing single-buffer-per-page logic for XDP and jumbo frames is
consolidated into a new helper mana_use_single_rxbuf_per_page() which
is now the single decision point for both the automatic and
user-controlled paths.

Signed-off-by: Dipayaan Roy <dipayanroy@linux.microsoft.com>
---
 drivers/net/ethernet/microsoft/mana/mana_en.c | 22 ++++-
 .../ethernet/microsoft/mana/mana_ethtool.c    | 89 +++++++++++++++++++
 include/net/mana/mana.h                       |  8 ++
 3 files changed, 117 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/microsoft/mana/mana_en.c b/drivers/net/ethernet/microsoft/mana/mana_en.c
index 49c65cc1697c..59a1626c2be1 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_en.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_en.c
@@ -744,6 +744,25 @@ static void *mana_get_rxbuf_pre(struct mana_rxq *rxq, dma_addr_t *da)
 	return va;
 }
 
+static bool
+mana_use_single_rxbuf_per_page(struct mana_port_context *apc, u32 mtu)
+{
+	/* On some platforms with 4K PAGE_SIZE, page_pool fragment allocation
+	 * in the RX refill path (~2kB buffer) can cause significant throughput
+	 * regression under high connection counts. Allow user to force one RX
+	 * buffer per page via ethtool private flag to bypass the fragment
+	 * path.
+	 */
+	if (apc->priv_flags & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF))
+		return true;
+
+	/* For xdp and jumbo frames make sure only one packet fits per page. */
+	if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc))
+		return true;
+
+	return false;
+}
+
 /* Get RX buffer's data size, alloc size, XDP headroom based on MTU */
 static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
 			       int mtu, u32 *datasize, u32 *alloc_size,
@@ -754,8 +773,7 @@ static void mana_get_rxbuf_cfg(struct mana_port_context *apc,
 	/* Calculate datasize first (consistent across all cases) */
 	*datasize = mtu + ETH_HLEN;
 
-	/* For xdp and jumbo frames make sure only one packet fits per page */
-	if (mtu + MANA_RXBUF_PAD > PAGE_SIZE / 2 || mana_xdp_get(apc)) {
+	if (mana_use_single_rxbuf_per_page(apc, mtu)) {
 		if (mana_xdp_get(apc)) {
 			*headroom = XDP_PACKET_HEADROOM;
 			*alloc_size = PAGE_SIZE;
diff --git a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
index a28ca461c135..0547c903f613 100644
--- a/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
+++ b/drivers/net/ethernet/microsoft/mana/mana_ethtool.c
@@ -133,6 +133,10 @@ static const struct mana_stats_desc mana_phy_stats[] = {
 	{ "hc_tc7_tx_pause_phy", offsetof(struct mana_ethtool_phy_stats, tx_pause_tc7_phy) },
 };
 
+static const char mana_priv_flags[MANA_PRIV_FLAG_MAX][ETH_GSTRING_LEN] = {
+	[MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF] = "full-page-rx"
+};
+
 static int mana_get_sset_count(struct net_device *ndev, int stringset)
 {
 	struct mana_port_context *apc = netdev_priv(ndev);
@@ -144,6 +148,10 @@ static int mana_get_sset_count(struct net_device *ndev, int stringset)
 		       ARRAY_SIZE(mana_phy_stats) +
 		       ARRAY_SIZE(mana_hc_stats)  +
 		       num_queues * (MANA_STATS_RX_COUNT + MANA_STATS_TX_COUNT);
+
+	case ETH_SS_PRIV_FLAGS:
+		return MANA_PRIV_FLAG_MAX;
+
 	default:
 		return -EINVAL;
 	}
@@ -192,6 +200,14 @@ static void mana_get_strings_stats(struct mana_port_context *apc, u8 **data)
 	}
 }
 
+static void mana_get_strings_priv_flags(u8 **data)
+{
+	int i;
+
+	for (i = 0; i < MANA_PRIV_FLAG_MAX; i++)
+		ethtool_puts(data, mana_priv_flags[i]);
+}
+
 static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
 {
 	struct mana_port_context *apc = netdev_priv(ndev);
@@ -200,6 +216,9 @@ static void mana_get_strings(struct net_device *ndev, u32 stringset, u8 *data)
 	case ETH_SS_STATS:
 		mana_get_strings_stats(apc, &data);
 		break;
+	case ETH_SS_PRIV_FLAGS:
+		mana_get_strings_priv_flags(&data);
+		break;
 	default:
 		break;
 	}
@@ -590,6 +609,74 @@ static int mana_get_link_ksettings(struct net_device *ndev,
 	return 0;
 }
 
+static u32 mana_get_priv_flags(struct net_device *ndev)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+
+	return apc->priv_flags;
+}
+
+static int mana_set_priv_flags(struct net_device *ndev, u32 priv_flags)
+{
+	struct mana_port_context *apc = netdev_priv(ndev);
+	u32 changed = apc->priv_flags ^ priv_flags;
+	u32 old_priv_flags = apc->priv_flags;
+	bool schedule_port_reset = false;
+	int err = 0;
+
+	if (!changed)
+		return 0;
+
+	/* Reject unknown bits */
+	if (priv_flags & ~GENMASK(MANA_PRIV_FLAG_MAX - 1, 0))
+		return -EINVAL;
+
+	if (changed & BIT(MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF)) {
+		apc->priv_flags = priv_flags;
+
+		if (!apc->port_is_up) {
+			/* Port is down, flag updated to apply on next up
+			 * so just return.
+			 */
+			return 0;
+		}
+
+		/* Pre-allocate buffers to prevent failure in mana_attach
+		 * later
+		 */
+		err = mana_pre_alloc_rxbufs(apc, ndev->mtu, apc->num_queues);
+		if (err) {
+			netdev_err(ndev,
+				   "Insufficient memory for new allocations\n");
+			apc->priv_flags = old_priv_flags;
+			return err;
+		}
+
+		err = mana_detach(ndev, false);
+		if (err) {
+			netdev_err(ndev, "mana_detach failed: %d\n", err);
+			apc->priv_flags = old_priv_flags;
+			goto out;
+		}
+
+		err = mana_attach(ndev);
+		if (err) {
+			netdev_err(ndev, "mana_attach failed: %d\n", err);
+			apc->priv_flags = old_priv_flags;
+			schedule_port_reset = true;
+		}
+	}
+
+out:
+	mana_pre_dealloc_rxbufs(apc);
+
+	if (err && schedule_port_reset)
+		queue_work(apc->ac->per_port_queue_reset_wq,
+			   &apc->queue_reset_work);
+
+	return err;
+}
+
 const struct ethtool_ops mana_ethtool_ops = {
 	.supported_coalesce_params = ETHTOOL_COALESCE_RX_CQE_FRAMES,
 	.get_ethtool_stats	= mana_get_ethtool_stats,
@@ -608,4 +695,6 @@ const struct ethtool_ops mana_ethtool_ops = {
 	.set_ringparam          = mana_set_ringparam,
 	.get_link_ksettings	= mana_get_link_ksettings,
 	.get_link		= ethtool_op_get_link,
+	.get_priv_flags		= mana_get_priv_flags,
+	.set_priv_flags		= mana_set_priv_flags,
 };
diff --git a/include/net/mana/mana.h b/include/net/mana/mana.h
index 3336688fed5e..fd87e3d6c1f4 100644
--- a/include/net/mana/mana.h
+++ b/include/net/mana/mana.h
@@ -30,6 +30,12 @@ enum TRI_STATE {
 	TRI_STATE_TRUE = 1
 };
 
+/* MANA ethtool private flag bit positions */
+enum mana_priv_flag_bits {
+	MANA_PRIV_FLAG_USE_FULL_PAGE_RXBUF = 0,
+	MANA_PRIV_FLAG_MAX,
+};
+
 /* Number of entries for hardware indirection table must be in power of 2 */
 #define MANA_INDIRECT_TABLE_MAX_SIZE 512
 #define MANA_INDIRECT_TABLE_DEF_SIZE 64
@@ -531,6 +537,8 @@ struct mana_port_context {
 	u32 rxbpre_headroom;
 	u32 rxbpre_frag_count;
 
+	u32 priv_flags;
+
 	struct bpf_prog *bpf_prog;
 
 	/* Create num_queues EQs, SQs, SQ-CQs, RQs and RQ-CQs, respectively. */
-- 
2.43.0



* [BUG] KVM: x86: kvmclock jumps ~253 years on Hyper-V nested virt due to cross-CPU raw TSC inconsistency
From: Thomas Lefebvre @ 2026-04-05 22:10 UTC (permalink / raw)
  To: seanjc, pbonzini; +Cc: kvm, linux-kernel, linux-hyperv, vkuznets

Hi,

I'm seeing KVM_GET_CLOCK return values ~253 years in the future when
running KVM inside a Hyper-V VM (nested virtualization).  I tracked
it down to an unsigned wraparound in __get_kvmclock() and have
bpftrace data showing the exact failure.

Setup:
  - Intel i7-11800H laptop running Windows with Hyper-V
  - L1 guest: Ubuntu 24.04, kernel 6.8.0, 4 vCPUs
  - Clocksource: hyperv_clocksource_tsc_page (VDSO_CLOCKMODE_HVCLOCK)
  - KVM running inside L1, hosting L2 guests

Root cause:

__get_kvmclock() does:

    hv_clock.tsc_timestamp = ka->master_cycle_now;
    hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
    ...
    data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);

and __pvclock_read_cycles() does:

    delta = tsc - src->tsc_timestamp;    /* unsigned */

master_cycle_now is a raw RDTSC captured by
pvclock_update_vm_gtod_copy().  host_tsc is a raw RDTSC read by
__get_kvmclock() on the current CPU.  Both go through the vgettsc()
HVCLOCK path which calls hv_read_tsc_page_tsc() -- this computes a
cross-CPU-consistent reference counter via scale/offset, but stores
the *raw* RDTSC in tsc_timestamp as a side effect.

Under Hyper-V, raw RDTSC values are not consistent across vCPUs.
The hypervisor corrects them only through the TSC page scale/offset.
If pvclock_update_vm_gtod_copy() runs on CPU 0 and __get_kvmclock()
later runs on CPU 1 where the raw TSC is lower, the unsigned
subtraction wraps.
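
For reference, the scale/offset correction mentioned above has roughly
this shape.  This is only a user-space sketch: the struct and function
names are made up for illustration, only the two fields needed for the
arithmetic are modelled, and unsigned __int128 is a GCC/Clang
extension.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for the two TSC page fields used in the
 * conversion; the real page carries more state than this.
 */
struct tsc_page_sketch {
	uint64_t scale;		/* fixed-point multiplier */
	int64_t offset;		/* added after scaling */
};

/* reference time = ((raw_tsc * scale) >> 64) + offset */
static uint64_t ref_time_from_tsc(const struct tsc_page_sketch *pg,
				  uint64_t raw_tsc)
{
	/* Keep the upper 64 bits of the 128-bit product. */
	unsigned __int128 prod = (unsigned __int128)raw_tsc * pg->scale;

	return (uint64_t)(prod >> 64) + (uint64_t)pg->offset;
}
```

The raw TSC is only an input here and the corrected counter is the
return value, so a later subtraction of two raw inputs never sees the
correction.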

I wrote a bpftrace tracer (included below) to instrument both
functions and captured two corruption events:

  Event 1:

    [GTOD_COPY] pid=2117649 cpu=0->0 use_master=1
                mcn=598992030530137 mkn=259977082393200

    [GET_CLOCK] pid=2117649 entry_cpu=1 exit_cpu=1 use_master=1
      clock=8006399342167092479 host_tsc=598991848289183
      master_cycle_now=598992030530137
      system_time(mkn+off)=5175860260
      TSC DEFICIT: 182240954 cycles

    master_cycle_now captured on CPU 0, host_tsc read on CPU 1.
    CPU 1's raw RDTSC was 182M cycles lower.

      598991848289183 - 598992030530137 = 18446744073527310662 (u64)

    Returned clock: 8,006,399,342,167,092,479 ns (~253.7 years)
    Correct system_time: 5,175,860,260 ns (~5.2 seconds)

  Event 2:

    [GTOD_COPY] pid=2117953 cpu=0->0 use_master=1
                mcn=599040238416510

    [GET_CLOCK] pid=2117953 entry_cpu=3 exit_cpu=3 use_master=1
      clock=8006399342464295526 host_tsc=599040211994220
      master_cycle_now=599040238416510
      TSC DEFICIT: 26422290 cycles

    Same pattern, CPU 0 vs CPU 3, 26M cycle deficit.

kvm_get_wall_clock_epoch() has the same pattern -- fresh host_tsc
vs stale master_cycle_now passed to __pvclock_read_cycles().

The simplest fix I can think of is guarding the __pvclock_read_cycles
call in __get_kvmclock():

    if (data->host_tsc >= hv_clock.tsc_timestamp)
        data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
    else
        data->clock = hv_clock.system_time;

system_time (= master_kernel_ns + kvmclock_offset) was computed from
the TSC page's corrected reference counter and is accurate regardless
of CPU.  The fallback loses sub-us interpolation but avoids a 253-year
jump.  On systems with consistent cross-CPU TSC, the branch is never
taken.
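
To make the arithmetic concrete, here is a small user-space model of
the wrap and of the guard.  It is a sketch, not the kernel code: the
function names are invented, and the cycles-to-ns scaling is left out,
so the delta is simply added to system_time as if one cycle were one
nanosecond.

```c
#include <stdint.h>

/* Simplified stand-in for the delta in __pvclock_read_cycles(). */
static uint64_t tsc_delta_unguarded(uint64_t host_tsc, uint64_t tsc_timestamp)
{
	return host_tsc - tsc_timestamp;	/* wraps if host_tsc is lower */
}

/*
 * Guarded variant: interpolate only when host_tsc is not behind the
 * snapshot, otherwise return system_time alone.  Assumption: the
 * mul/shift scaling is omitted, so delta is treated as nanoseconds.
 */
static uint64_t clock_guarded(uint64_t system_time, uint64_t host_tsc,
			      uint64_t tsc_timestamp)
{
	if (host_tsc >= tsc_timestamp)
		return system_time + (host_tsc - tsc_timestamp);

	return system_time;
}
```

With the Event 1 values, tsc_delta_unguarded(598991848289183,
598992030530137) yields 18446744073527310662, while clock_guarded()
with system_time 5175860260 returns 5175860260 unchanged.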

One thing I wasn't sure about: when the fallback triggers,
KVM_CLOCK_TSC_STABLE is still set in data->flags.  I left it alone
since the returned value is still correct (just less precise), but
I could see an argument for clearing it.

Disabling master clock entirely for HVCLOCK would also work but
seemed heavy -- it sacrifices PVCLOCK_TSC_STABLE_BIT, forces the
guest pvclock read into the atomic64_cmpxchg monotonicity guard,
and triggers KVM_REQ_GLOBAL_CLOCK_UPDATE on vCPU migration.

Reproducer bpftrace script (run while exercising KVM on a Hyper-V
host):

  #!/usr/bin/env bpftrace
  /*
   * Detect host_tsc < master_cycle_now in __get_kvmclock.
   *
   * struct kvm_clock_data layout (for raw offset reads):
   *   offset 0:  u64 clock
   *   offset 24: u64 host_tsc
   */

  kprobe:__get_kvmclock
  {
      $kvm = (struct kvm *)arg0;
      @get_data[tid] = (uint64)arg1;
      @get_use_master[tid] = (uint64)$kvm->arch.use_master_clock;
      @get_mcn[tid] = (uint64)$kvm->arch.master_cycle_now;
      @get_cpu[tid] = cpu;
  }

  kretprobe:__get_kvmclock
  {
      $data_ptr = @get_data[tid];
      if ($data_ptr != 0) {
          $clock = *(uint64 *)($data_ptr);
          $host_tsc = *(uint64 *)($data_ptr + 24);
          $use_master = @get_use_master[tid];
          $mcn = @get_mcn[tid];

          if ($use_master && $host_tsc != 0 && $host_tsc < $mcn) {
              printf("BUG: pid=%d cpu=%d->%d host_tsc=%lu mcn=%lu "
                     "deficit=%lu clock=%lu\n",
                     pid, @get_cpu[tid], cpu, $host_tsc,
                     $mcn, $mcn - $host_tsc, $clock);
          }
      }
      delete(@get_data[tid]);
      delete(@get_use_master[tid]);
      delete(@get_mcn[tid]);
      delete(@get_cpu[tid]);
  }

  kprobe:pvclock_update_vm_gtod_copy {
      @gtod_kvm[tid] = (uint64)arg0;
      @gtod_cpu[tid] = cpu;
  }
  kretprobe:pvclock_update_vm_gtod_copy
  {
      $kvm = (struct kvm *)@gtod_kvm[tid];
      if ($kvm != 0) {
          printf("GTOD: pid=%d cpu=%d->%d mcn=%lu use_master=%d\n",
                 pid, @gtod_cpu[tid], cpu,
                 $kvm->arch.master_cycle_now,
                 $kvm->arch.use_master_clock);
      }
      delete(@gtod_kvm[tid]);
      delete(@gtod_cpu[tid]);
  }

Thanks,
Thomas

* RE: [PATCH] PCI: hv: Allocate MMIO from above 4GB for the config window
From: Michael Kelley @ 2026-04-05 23:11 UTC (permalink / raw)
  To: Dexuan Cui, Michael Kelley, KY Srinivasan, Haiyang Zhang,
	wei.liu@kernel.org, Long Li, lpieralisi@kernel.org,
	kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org,
	bhelgaas@google.com, Jake Oshins, linux-hyperv@vger.kernel.org,
	linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org
  Cc: stable@vger.kernel.org, Matthew Ruffell, Krister Johansen
In-Reply-To: <SA1PR21MB692176C1BC53BFC9EAE5CF8EBF51A@SA1PR21MB6921.namprd21.prod.outlook.com>

From: Dexuan Cui <DECUI@microsoft.com> Sent: Thursday, April 2, 2026 10:10 AM
> 
> > From: Michael Kelley <mhklinux@outlook.com>
> > Sent: Wednesday, January 21, 2026 11:11 PM
> > ...
> > From: Dexuan Cui <decui@microsoft.com> Sent: Wednesday, January 21,
> > 2026 6:04 PM
> > >
> > > There has been a longstanding MMIO conflict between the pci_hyperv
> > > driver's config_window (see hv_allocate_config_window()) and the
> > > hyperv_drm (or hyperv_fb) driver (see hyperv_setup_vram()): typically
> > > both get MMIO from the low MMIO range below 4GB; this is not an issue
> > > in the normal kernel since the VMBus driver reserves the framebuffer
> > > MMIO in vmbus_reserve_fb(), so the drm driver's hyperv_setup_vram()
> > > can always get the reserved framebuffer MMIO; however, a Gen2 VM's
> > > kdump kernel fails to reserve the framebuffer MMIO in vmbus_reserve_fb()
> > >  because the screen_info.lfb_base is zero in the kdump kernel:
> > > the screen_info is not initialized at all in the kdump kernel, because the
> > > EFI stub code, which initializes screen_info, doesn't run in the case of kdump.
> >
> > I don't think this is correct. Yes, the EFI stub doesn't run, but screen_info
> 
> Hi Michael, sorry for delaying the reply for so long! Now I think I should
> understand all the details.
> 
> My earlier statement "the screen_info is not initialized at all in the kdump
> kernel" is not correct on x86, but I believe it's correct on ARM64. Please see
> my explanation below.

Sadly, I must agree. It's surprising, because it affects kexec scenarios that
don't include Hyper-V. On arm64 bare metal, if you kexec to a kernel configured
to run the efifb frame buffer driver, the driver won't load.

> 
> > should be initialized in the kdump kernel by the code that loads the
> > kdump kernel into the reserved crash memory. See discussion in the commit
> > message for commit 304386373007.
> >
> > I wonder if commit a41e0ab394e4 broke the initialization of screen_info
> > in the kdump kernel. Or perhaps there is now a rev-lock between the kernel
> > with this commit and a new version of the user space kexec command.
> 
> The commit
> a41e0ab394e4 ("sysfb: Replace screen_info with sysfb_primary_display")
> should be unrelated here.

Agreed.

> 
> > There's a parameter to the kexec() command that governs whether it
> > uses the kexec_file_load() system call or the kexec_load() system call.
> > I wonder if that parameter makes a difference in the problem described
> > for this patch.
> >
> > I can't immediately remember if, when I was working on commit
> > 304386373007, I tested kdump in a Gen 2 VM with an NVMe OS disk to
> > ensure that MMIO space was properly allocated to the frame buffer
> > driver (either hyperv_fb or hyperv_drm). I'm thinking I did, but tomorrow
> > I'll check for any definitive notes on that.
> >
> > Michael

Evidently, I did not fully test an arm64 VM, or I would have seen that
screen_info wasn't being populated for the kdump kernel.

> 
> If vmbus_reserve_fb() in the kdump kernel fails to reserve the framebuffer
> MMIO range due to a Gen2 VM's screen_info.lfb_base being 0,  the MMIO
> conflict between hyperv_fb/hyperv_drm and hv_pci happens -- this is
> especially an issue if hv_pci is built-in and hyperv_fb/hyperv_drm is built
> as modules. vmbus_reserve_fb() should always succeed for a Gen1 VM, since
> it can always get the framebuffer MMIO base from the legacy PCI graphics
> device, so we only need to discuss Gen2 VMs here.

Agreed.

> 
> When kdump-tools loads the kdump kernel into memory, the tool can
> accept any of the 3 parameters (e.g. I got the below via "man kexec" in
> Ubuntu 24.04):
> 
>        -s (--kexec-file-syscall)
>               Specify that the new KEXEC_FILE_LOAD syscall should be used exclusively.
> 
>        -c (--kexec-syscall)
>               Specify that the old KEXEC_LOAD syscall should be used exclusively.
> 
>        -a (--kexec-syscall-auto)
>               Try the new KEXEC_FILE_LOAD syscall first and when it is not supported or the kernel does not understand the supplied
>               image fall back to the old KEXEC_LOAD interface.
> 
>               There is no one single interface that always works, so this is the default.
> 
>               KEXEC_FILE_LOAD is required on systems that use locked-down secure boot to verify the kernel signature.  KEXEC_LOAD may be
>               also disabled in the kernel configuration.
> 
>               KEXEC_LOAD is required for some kernel image formats and on architectures that do not implement KEXEC_FILE_LOAD.
> 
> If none of the parameters are specified, the default may be -c, or -s
> or -a, depending on the distro and the version in use.  We can run
>     strace -f kdump-config reload 2>&1 | egrep 'kexec_file_load|kexec_load'
> to tell which syscall is being used.
> 
> Old distro versions seem to use KEXEC_LOAD by default, and new distro
> versions tend to use KEXEC_FILE_LOAD by default, especially when
> Secure Boot is enabled (e.g. see /usr/sbin/kdump-config: kdump_load()
> in Ubuntu).

Agreed. I think I had seen that previously.

> 
> In Ubuntu, we can explicitly specify one of the parameters in
> "/etc/default/kdump-tools", e.g. KDUMP_KEXEC_ARGS="-c -d".
> 
> The -d is for debugging. I found it very useful: when we run
> "kdump-config show" or "kdump-config reload", we get very useful
> debug info with -d.
> 
> On x86-64, with -c:
> The kdump-tools gets the framebuffer's MMIO base using
> ioctl(fd, FBIOGET_FSCREENINFO, ....): see the end of the email for
> an example program; kdump-tools then uses the KEXEC_LOAD syscall
> to set up the screen_info.lfb_base for the kdump kernel.

Thanks. While redoing some experiments yesterday, I found the
similar program that I had written a year ago to dump the ioctl results.

> 
> The function in kdump-tools that gets the framebuffer MMIO base
> is kexec/arch/i386/x86-linux-setup.c: setup_linux_vesafb():
> https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/tree/kexec/arch/i386/x86-linux-setup.c?h=v2.0.32#n133
> 
> Unluckily, setup_linux_vesafb() only recognizes the vesafb
> driver in Linux kernel ("VESA VGA") and the efifb driver ("EFI VGA").
> It looks like normally arch_options.reuse_video_type is always 0.
> 
> This means the kdump kernel's screen_info.lfb_base is 0, if
> hyperv_fb or hyperv_drm loads. In the past, for an Ubuntu kernel
> with CONFIG_FB_EFI=y, our workaround was blacklisting
> hyperv_fb or hyperv_drm, so /dev/fb0 is backed by efifb, and
> the screen_info.lfb_base is correctly set for kdump.

Hmmm. This is worse than I thought for x86/x64. In fact, it means
a part of my commit message for 304386373007 is now wrong. I had
described everything as working when using the kexec_load() system
call because the FBIOGET_FSCREENINFO ioctl was returning a good
value for smem_start (at least with the hyperv_fb driver). But as you
point out further down, newer versions of the kexec user space program
are ignoring that smem_start value unless the driver is vesafb or efifb.
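
For anyone who wants to check their own VM, a dump program along these
lines shows what the ioctl reports.  This is a sketch rather than the
program either of us used: the function names are invented, and the
"recognized" check only mirrors the two driver ids named above ("VESA
VGA" and "EFI VGA"), not the full logic of setup_linux_vesafb().

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/fb.h>

/* Driver ids named in this thread as accepted by setup_linux_vesafb(). */
static int fb_id_recognized(const char *id)
{
	return strcmp(id, "VESA VGA") == 0 || strcmp(id, "EFI VGA") == 0;
}

/* Print id, smem_start and smem_len for a framebuffer node; -1 on failure. */
static int dump_fb_fix(const char *path)
{
	struct fb_fix_screeninfo fix;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;

	if (ioctl(fd, FBIOGET_FSCREENINFO, &fix) < 0) {
		close(fd);
		return -1;
	}
	close(fd);

	printf("id=%.16s smem_start=0x%lx smem_len=%u recognized=%d\n",
	       fix.id, fix.smem_start, fix.smem_len,
	       fb_id_recognized(fix.id));
	return 0;
}
```

Calling dump_fb_fix("/dev/fb0") in the first kernel prints one line.
A smem_start of 0, or recognized=0, would both be consistent with the
kdump kernel ending up with screen_info.lfb_base == 0 when the
kexec_load() path is used.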

Was blacklisting hyperv_fb or hyperv_drm in the kdump kernel
a workaround we had promulgated in the past? My recollection
is vague. But no matter.

> 
> However, now CONFIG_FB_EFI is not set in recent Ubuntu kernels:
> $ egrep 'CONFIG_FB_EFI|CONFIG_SYSFB|CONFIG_SYSFB_SIMPLEFB|CONFIG_DRM_SIMPLEDRM|CONFIG_DRM_HYPERV' /boot/config-6.8.0-1051-azure
> CONFIG_SYSFB=y
> CONFIG_SYSFB_SIMPLEFB=y
> CONFIG_DRM_SIMPLEDRM=y
> CONFIG_DRM_HYPERV=m
> # CONFIG_FB_EFI is not set
> 
> So, with Ubuntu 22.04/24.04,  -c can't avoid the MMIO conflict
> for Gen2 x86-64 VMs now, even if we blacklist hyperv_fb/hyperv_drm.
> Note: Ubuntu 20.04 uses an old version of the kdump-tools, so
> the statement is different there (see the later discussion below).
> 
> hyperv_fb has been removed in the mainline kernel: see
> commit 40227f2efcfb ("fbdev: hyperv_fb: Remove hyperv_fb driver")
> so we no longer need to worry about it.
> 
> Even if we modify setup_linux_vesafb() to support  hyperv_drm,
> it still won't work, because the MMIO base is hidden by commit
> da6c7707caf3 ("fbdev: Add FBINFO_HIDE_SMEM_START flag")

Agreed.

> 
> On x86-64, with -s:
> The KEXEC_FILE_LOAD syscall sets the kdump kernel's
> screen_info.lfb_base in the kernel: see
> 
> "arch/x86/kernel/kexec-bzimage64.c"
>     bzImage64_load
>         setup_boot_parameters
>             memcpy(&params->screen_info, &screen_info, sizeof(struct screen_info));
> 
> so, as long as the first kernel's hyperv_drm doesn't relocate the
> MMIO base, kdump should work fine; if the MMIO base is relocated,
> currently hyperv_drm doesn't update the screen_info.lfb_base,
> so the kdump's efifb driver and hv_pci driver won't work. Normally
> hyperv_drm doesn't relocate the MMIO base, unless the user
> specifies a very high resolution and the required MMIO size
> exceeds the default 8MB reserved by vmbus_reserve_fb() -- let's
> ignore that scenario for now.
> 

Agreed.

> 
> On ARM64, with -c:
> The kdump-tools doesn't even open /dev/fb0 (we can confirm this by using
> strace or bpftrace), so the kdump kernel's screen_info.lfb_base is always 0.

Agreed.

> 
> On ARM64, with -s:
> "arch/arm64/kernel/kexec_image.c": image_load() doesn't set the
> params->screen_info, so the kdump kernel's screen_info.lfb_base is always 0.

Agreed.

> 
> To recap, with a recent mainline kernel (or the linux-azure kernels) that
> has 304386373007, my observation on Ubuntu 22.04 and 24.04 is:
>     on x86-64, -c fails, but -s works.
>     on ARM64, -c fails, and -s also fails.
> 
> Note: the kdump-tools v2.0.18 in Ubuntu 20.04 doesn't have this commit:
> https://git.kernel.org/pub/scm/utils/kernel/kexec/kexec-tools.git/commit/?id=fb5a8792e6e4ee7de7ae3e06d193ea5beaaececc
> (Note the "return 0;" in setup_linux_vesafb())
> so, on x86-64, -c also works in Ubuntu 20.04, if hyperv_fb is used
> (-c still doesn't work if hyperv_drm is used due to da6c7707caf3).

Ah. That explains why I thought x86/x64 kdump was working with
hyperv_fb when working on commit 304386373007. I was testing with
kexec user space utility v2.0.18, which *does* propagate smem_start
from the ioctl to the loaded kdump image.

> 
> With this patch
> "PCI: hv: Allocate MMIO from above 4GB for the config window",
> both -c  and -s work on x86-64 and ARM64 due to no MMIO conflict,
> as long as there are no 32-bit PCI BARs (which should be true on
> Azure and on modern hosts.)
> 
> With the patch, even if hyperv_drm relocates the framebuffer MMIO
> base, there would still be no MMIO conflict because typically hyperv_drm
> gets its MMIO from below 4GB: it seems like vmbus_walk_resources()
> always finds the low MMIO range first and adds it to the beginning of the
> MMIO resources "hyperv_mmio", so presumably hyperv_drm would
> get MMIO from the low MMIO range.
> 
> I'll update the commit message, add Matthew's and Krister's
> Tested-by's and post v2.

See my comments on v2 of your patch.  I have a thought for a
slightly different approach to solve the problem.

Michael

> 
> Thanks,
> Dexuan

* RE: [PATCH] PCI: hv: Allocate MMIO from above 4GB for the config window
From: Michael Kelley @ 2026-04-05 23:13 UTC (permalink / raw)
  To: Dexuan Cui, Michael Kelley, Matthew Ruffell
  Cc: bhelgaas@google.com, Haiyang Zhang, Jake Oshins,
	kwilczynski@kernel.org, KY Srinivasan,
	linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-pci@vger.kernel.org, Long Li, lpieralisi@kernel.org,
	mani@kernel.org, robh@kernel.org, stable@vger.kernel.org,
	wei.liu@kernel.org
In-Reply-To: <SA1PR21MB69213185A6FE899BD2D4BFC3BF51A@SA1PR21MB6921.namprd21.prod.outlook.com>

From: Dexuan Cui <DECUI@microsoft.com> Sent: Thursday, April 2, 2026 12:24 PM
> 
> > From: Michael Kelley <mhklinux@outlook.com>
> > Sent: Friday, January 23, 2026 10:28 AM
> >  ...
> > One more thought here: Is commit 96959283a58d relevant? The
> > commit message describes a scenario where vmbus_reserve_fb()
> > doesn't do anything because CONFIG_SYSFB is not set. Looking at
> > the code for vmbus_reserve_fb(), it doing nothing might imply that
> > screen_info.lfb_base is 0. But when CONFIG_SYSFB is not set,
> > screen_info.lfb_base is just ignored, with the same result. This behavior
> > started with the 6.7 kernel due to commit a07b50d80ab6.
> >
> > Note that commit 96959283a58d has a follow-on to correct a
> > problem when CONFIG_EFI is not set.  See commit 7b89a44b2e8c.
> > If there's a reason to backport 96959283a58d, also get
> > 7b89a44b2e8c.
> >
> > Michael
> 
> In my opinion,
> 96959283a58d ("Drivers: hv: Always select CONFIG_SYSFB for Hyper-V guests")
> is not a good fix for a07b50d80ab6: the commit message of a07b50d80ab6
> says "the vmbus_drv code marks the original EFI framebuffer as reserved, but
> this is not required if there is no sysfb" -- IMO the message is incorrect.
> 
> Even if CONFIG_SYSFB is not set, we still need to reserve the framebuffer
> MMIO range, because we need to make sure that hv_pci doesn't allocate
> MMIO from there.
> 
> 96959283a58d adds "select SYSFB if !HYPERV_VTL_MODE", but we can
> still manually unset CONFIG_SYSFB (I happened to do this when debugging
> the kdump issue), and hv_pci won't work.

Just curious -- how would you manually unset CONFIG_SYSFB? The kernel
makefile always resync's .config against the Kconfig rules, which would add
CONFIG_SYSFB back again. The Kconfig files essentially say that removing
CONFIG_SYSFB is an invalid configuration.

> 
> IMO vmbus_reserve_fb() should unconditionally reserve the frame buffer
> MMIO range. I'll post a patch like this:
> 
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -2395,10 +2398,8 @@ static void __maybe_unused vmbus_reserve_fb(void)
> 
>         if (efi_enabled(EFI_BOOT)) {
>                 /* Gen2 VM: get FB base from EFI framebuffer */
> -               if (IS_ENABLED(CONFIG_SYSFB)) {
> -                       start = sysfb_primary_display.screen.lfb_base;
> -                       size = max_t(__u32, sysfb_primary_display.screen.lfb_size, 0x800000);
> -               }
> +               start = sysfb_primary_display.screen.lfb_base;
> +               size = max_t(__u32, sysfb_primary_display.screen.lfb_size, 0x800000);

On arm64 the existence of sysfb_primary_display is conditional on
several config variables, including CONFIG_SYSFB and CONFIG_EFI_EARLYCON.
(see drivers/firmware/efi/efi-init.c) If you can take away CONFIG_SYSFB, you
could also take away CONFIG_EFI_EARLYCON and end up with a build error on
arm64. So I'm not clear how this approach would be more robust against
invalid .config changes.

Also this recent patch set [1] submitted by Thomas Zimmermann is even more
explicit about sysfb_primary_display being conditional on CONFIG_SYSFB.

Michael

[1] https://lore.kernel.org/linux-hyperv/20260402092305.208728-1-tzimmermann@suse.de/

>         } else {
>                 /* Gen1 VM: get FB base from PCI */
>                 pdev = pci_get_device(PCI_VENDOR_ID_MICROSOFT,
> 
> 
> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> index 7937ac0cbd0f..78d7f8c66278 100644
> --- a/drivers/hv/Kconfig
> +++ b/drivers/hv/Kconfig
> @@ -9,7 +9,6 @@ config HYPERV
>         select PARAVIRT
>         select X86_HV_CALLBACK_VECTOR if X86
>         select OF_EARLY_FLATTREE if OF
> -       select SYSFB if EFI && !HYPERV_VTL_MODE
>         select IRQ_MSI_LIB if X86
>         help
>           Select this option to run Linux as a Hyper-V client operating
> 
> Thanks,
> Dexuan


* RE: [PATCH v2] PCI: hv: Allocate MMIO from above 4GB for the config window
From: Michael Kelley @ 2026-04-05 23:15 UTC (permalink / raw)
  To: Dexuan Cui, kys@microsoft.com, haiyangz@microsoft.com,
	wei.liu@kernel.org, longli@microsoft.com, lpieralisi@kernel.org,
	kwilczynski@kernel.org, mani@kernel.org, robh@kernel.org,
	bhelgaas@google.com, jakeo@microsoft.com,
	linux-hyperv@vger.kernel.org, linux-pci@vger.kernel.org,
	linux-kernel@vger.kernel.org, Michael Kelley,
	matthew.ruffell@canonical.com, kjlx@templeofstupid.com
  Cc: Krister Johansen, stable@vger.kernel.org
In-Reply-To: <20260402234313.2490779-1-decui@microsoft.com>

From: Dexuan Cui <decui@microsoft.com> Sent: Thursday, April 2, 2026 4:43 PM
> 
> There has been a longstanding MMIO conflict between the pci_hyperv
> driver's config_window (see hv_allocate_config_window()) and the
> hyperv_drm (or hyperv_fb) driver (see hyperv_setup_vram()): typically
> both get MMIO from the low MMIO range below 4GB; this is not an issue
> in the normal kernel since the VMBus driver reserves the framebuffer
> MMIO range in vmbus_reserve_fb(), so the drm driver's hyperv_setup_vram()
> can always get the reserved framebuffer MMIO; however, a Gen2 VM's
> kdump kernel can fail to reserve the framebuffer MMIO in
> vmbus_reserve_fb() because the screen_info.lfb_base is zero in the
> kdump kernel due to several possible reasons (see the Link below for
> more details):
> 
> 1) on ARM64, the two syscalls (KEXEC_LOAD, KEXEC_FILE_LOAD) don't
> initialize the screen_info.lfb_base for the kdump kernel;
> 
> 2) on x86-64, the KEXEC_FILE_LOAD syscall initializes kdump kernel's
> screen_info.lfb_base, but the KEXEC_LOAD syscall doesn't really do that
> when the hyperv_drm driver loads, because the user-space kexec-tools
> (i.e. the program 'kexec') doesn't recognize the hyperv_drm driver
> (let's ignore the behavior of kexec-tools of very old versions).
> 
> When vmbus_reserve_fb() fails to reserve the framebuffer MMIO in the
> kdump kernel, if pci_hyperv in the kdump kernel loads before hyperv_drm
> loads, pci_hyperv's vmbus_allocate_mmio() gets the framebuffer MMIO
> and tries to use it, but since the host thinks that the MMIO range is
> still in use by hyperv_drm, the host refuses to accept the MMIO range
> as the config window, and pci_hyperv's hv_pci_enter_d0() errors out,
> e.g. an error can be "PCI Pass-through VSP failed D0 Entry with status
> c0370048".
> 
> Typically, this pci_hyperv error in the kdump kernel was not fatal in
> the past because the kdump kernel typically doesn't rely on pci_hyperv,
> i.e. the root file system is on a VMBus SCSI device.
> 
> Now, a VM on Azure can boot from NVMe, i.e. the root file system can be
> on a NVMe device, which depends on pci_hyperv. When the error occurs,
> the kdump kernel fails to boot up since no root file system is detected.
> 
> Fix the MMIO conflict by allocating MMIO above 4GB for the config_window,
> so it won't conflict with hyperv_drm's MMIO, which should be below the
> 4GB boundary. The size of config_window is small: it's only 8KB per PCI
> device, so there should be sufficient MMIO space available above 4GB.
> 
> Note: we still need to figure out how to address the possible MMIO
> conflict between hyperv_drm and pci_hyperv in the case of 32-bit PCI
> MMIO BARs, but that's of low priority because all PCI devices available
> to a Linux VM on Azure or on a modern host should use 64-bit BARs and
> should not use 32-bit BARs -- I checked Mellanox VFs, MANA VFs, NVMe
> devices, and GPUs in Linux VMs on Azure, and found no 32-bit BARs.

Just to clarify, since this patch is predicated on all BARs being 64-bit,
hv_pci_alloc_bridge_windows() never encounters a non-zero
hbus->low_mmio_space, and hence also never allocates from low
MMIO space. So hv_pci_alloc_bridge_windows() does not need to be
patched. Is that correct?

Taking a broader view, fundamentally the current MMIO location of
the frame buffer may be unknown to the Linux guest. At the same time,
Linux must ensure that PCI devices don't get assigned to the MMIO space
where the frame buffer is located. While the current MMIO location of
the frame buffer may be unknown, we can assume it was placed in low
MMIO space by the host -- either Windows Hyper-V or Linux/VMM
in the root partition, and perhaps as mediated by a paravisor. Probably
need to confirm with the Linux-in-the-root partition team (and maybe
the OpenHCL team) that this assumption is true. Presumably the
hyperv_drm driver doesn't need to move the frame buffer, but if it
does, it must stay in the low MMIO space.

This patch depends on this assumption, and effectively reserves
the entire low MMIO space for the frame buffer. The low MMIO space
size defaults to 128 MiB on a local Hyper-V, and is set to 3 GiB in most
Azure VMs (or to 1 GiB in an Azure CVM), so that all gets reserved.

A slightly different approach to the whole problem is to change
vmbus_reserve_fb(). If it is unable to get a non-zero "start" value, then
it should use the same assumption as above, and reserve a frame buffer
area starting at the lowest address in low MMIO space. The reserved size
could be the max possible frame buffer size, which I think is 64 MiB (?).
This still leaves low MMIO space for subsequent PCI devices, and allows
32-bit BARs to continue to work. This approach requires one further
assumption, which is that the host, plus any movement by hyperv_drm,
has kept the frame buffer at the low end of the low MMIO space. From
what I've seen, that assumption is reality -- the frame buffer always
starts at the beginning of low MMIO space.

This approach could be taken one step further, where vmbus_reserve_fb()
*always* reserves 64 MiB starting at the low end of low MMIO space,
regardless of the value of "start". The messy code for getting "start"
could be dropped entirely, and the dependency on CONFIG_SYSFB goes
away. Or maybe still get the value of "start" and "size", and if non-zero
just do a sanity check that they are within the fixed 64 MiB reserved area.
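
To make the "always reserve" variant concrete, here is a rough user-space
sketch of the arithmetic only. It is not vmbus_reserve_fb() and has not
been run against the driver; struct mmio_range, fb_reservation(),
fb_within() and EXAMPLE_FB_MAX_SIZE are made-up names, and the 64 MiB
figure is the unconfirmed maximum mentioned above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed maximum frame buffer size (64 MiB); not confirmed. */
#define EXAMPLE_FB_MAX_SIZE (64ull << 20)

/* Hypothetical description of an MMIO range: first byte and length. */
struct mmio_range {
	uint64_t start;
	uint64_t size;
};

/*
 * Reserve a fixed-size area at the low end of low MMIO space,
 * shrinking it if low MMIO space itself is smaller than 64 MiB.
 */
static struct mmio_range fb_reservation(struct mmio_range low_mmio)
{
	struct mmio_range r = {
		.start = low_mmio.start,
		.size = EXAMPLE_FB_MAX_SIZE,
	};

	assert(low_mmio.size > 0);
	if (r.size > low_mmio.size)
		r.size = low_mmio.size;
	return r;
}

/*
 * Optional sanity check: if "start" and "size" were obtained and are
 * non-zero, they must fall inside the reserved area.
 */
static bool fb_within(struct mmio_range reserved, uint64_t start, uint64_t size)
{
	if (!start || !size)
		return true;	/* nothing reported, nothing to check */
	if (start < reserved.start || size > reserved.size)
		return false;
	/* Written this way to avoid overflow in start + size. */
	return start - reserved.start <= reserved.size - size;
}
```

With low MMIO space of 128 MiB, this leaves the upper 64 MiB available
for 32-bit BARs, which is the point of the alternative.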

Thoughts? To me tweaking vmbus_reserve_fb() is a more
straightforward and explicit way to do the reserving, vs. modifying
the requested range in the Hyper-V PCI driver. And FWIW, it avoids
introducing the 32-bit BAR limitation.

Michael

> 
> Fixes: 4daace0d8ce8 ("PCI: hv: Add paravirtual PCI front-end for Microsoft Hyper-V VMs")
> Link: https://lore.kernel.org/all/SA1PR21MB692176C1BC53BFC9EAE5CF8EBF51A@SA1PR21MB6921.namprd21.prod.outlook.com/
> Tested-by: Matthew Ruffell <matthew.ruffell@canonical.com>
> Tested-by: Krister Johansen <johansen@templeofstupid.com>
> Signed-off-by: Dexuan Cui <decui@microsoft.com>
> Cc: stable@vger.kernel.org
> ---
> 
> Changes since v1:
>     Updated the commit message and the comment to better explain
>     why screen_info.lfb_base can be 0 in the kdump kernel.
> 
>     No code change since v1.
> 
> 
>  drivers/pci/controller/pci-hyperv.c | 21 +++++++++++++++++++--
>  1 file changed, 19 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/pci/controller/pci-hyperv.c b/drivers/pci/controller/pci-hyperv.c
> index 2c7a406b4ba8..1a79334ea9f4 100644
> --- a/drivers/pci/controller/pci-hyperv.c
> +++ b/drivers/pci/controller/pci-hyperv.c
> @@ -3403,9 +3403,26 @@ static int hv_allocate_config_window(struct
> hv_pcibus_device *hbus)
> 
>  	/*
>  	 * Set up a region of MMIO space to use for accessing configuration
> -	 * space.
> +	 * space. Use the high MMIO range to not conflict with the hyperv_drm
> +	 * driver (which normally gets MMIO from the low MMIO range) in the
> +	 * kdump kernel of a Gen2 VM, which may fail to reserve the framebuffer
> +	 * MMIO range in vmbus_reserve_fb() due to screen_info.lfb_base being
> +	 * zero in the kdump kernel:
> +	 *
> +	 * on ARM64, the two syscalls (KEXEC_LOAD, KEXEC_FILE_LOAD) don't
> +	 * initialize the screen_info.lfb_base for the kdump kernel;
> +	 *
> +	 * on x86-64, the KEXEC_FILE_LOAD syscall initializes kdump kernel's
> +	 * screen_info.lfb_base (see bzImage64_load() -> setup_boot_parameters())
> +	 * but the KEXEC_LOAD syscall doesn't really do that when the hyperv_drm
> +	 * driver loads, because the user-space program 'kexec' doesn't
> +	 * recognize hyperv_drm: see the function setup_linux_vesafb() in the
> +	 * kexec-tools.git repo. Note: old versions of kexec-tools, e.g.
> +	 * v2.0.18, initialize screen_info.lfb_base if the hyperv_fb driver
> +	 * loads, but hyperv_fb is deprecated and has been removed from the
> +	 * mainline kernel.
>  	 */
> -	ret = vmbus_allocate_mmio(&hbus->mem_config, hbus->hdev, 0, -1,
> +	ret = vmbus_allocate_mmio(&hbus->mem_config, hbus->hdev, SZ_4G, -1,
>  				  PCI_CONFIG_MMIO_LENGTH, 0x1000, false);
>  	if (ret)
>  		return ret;
> --
> 2.43.0
> 

^ permalink raw reply

* [PATCH] scsi: storvsc: Handle PERSISTENT_RESERVE_IN truncation for Hyper-V vFC
From: Li Tian @ 2026-04-06  1:53 UTC (permalink / raw)
  To: linux-scsi
  Cc: Li Tian, K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
	Long Li, James E.J. Bottomley, Martin K. Petersen, linux-hyperv,
	linux-kernel

The storvsc driver has become stricter in handling
SRB status codes returned by the Hyper-V host. When using Virtual
Fibre Channel (vFC) passthrough, the host may return
SRB_STATUS_DATA_OVERRUN for PERSISTENT_RESERVE_IN commands if the
allocation length in the CDB does not match the host's expected
response size.

Currently, this status is treated as a fatal error, propagating
Host_status=0x07 [DID_ERROR] to the SCSI mid-layer. This causes
userspace storage utilities (such as sg_persist) to fail with
transport errors, even when the host has actually returned the
requested reservation data in the buffer.

Refactor the existing command-specific workarounds into a new helper
function, storvsc_host_mishandles_cmd(), and add
PERSISTENT_RESERVE_IN to the list of commands where SRB status
errors should be suppressed for vFC devices. This ensures that
the SCSI mid-layer processes the returned data buffer instead of
terminating the command.

Signed-off-by: Li Tian <litian@redhat.com>
---
 drivers/scsi/storvsc_drv.c | 32 +++++++++++++++++++++-----------
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index ae1abab97835..6977ca8a0658 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1131,6 +1131,26 @@ static void storvsc_command_completion(struct storvsc_cmd_request *cmd_request,
 		kfree(payload);
 }
 
+/*
+ * The current SCSI handling on the host side does not correctly handle:
+ * INQUIRY with page code 0x80, MODE_SENSE / MODE_SENSE_10 with cmd[2] == 0x1c,
+ * and (for FC) MAINTENANCE_IN / PERSISTENT_RESERVE_IN passthrough.
+ */
+static bool storvsc_host_mishandles_cmd(u8 opcode, struct hv_device *device)
+{
+	switch (opcode) {
+	case INQUIRY:
+	case MODE_SENSE:
+	case MODE_SENSE_10:
+		return true;
+	case MAINTENANCE_IN:
+	case PERSISTENT_RESERVE_IN:
+		return hv_dev_is_fc(device);
+	default:
+		return false;
+	}
+}
+
 static void storvsc_on_io_completion(struct storvsc_device *stor_device,
 				  struct vstor_packet *vstor_packet,
 				  struct storvsc_cmd_request *request)
@@ -1141,22 +1161,12 @@ static void storvsc_on_io_completion(struct storvsc_device *stor_device,
 	stor_pkt = &request->vstor_packet;
 
 	/*
-	 * The current SCSI handling on the host side does
-	 * not correctly handle:
-	 * INQUIRY command with page code parameter set to 0x80
-	 * MODE_SENSE and MODE_SENSE_10 command with cmd[2] == 0x1c
-	 * MAINTENANCE_IN is not supported by HyperV FC passthrough
-	 *
 	 * Setup srb and scsi status so this won't be fatal.
 	 * We do this so we can distinguish truly fatal failues
 	 * (srb status == 0x4) and off-line the device in that case.
 	 */
 
-	if ((stor_pkt->vm_srb.cdb[0] == INQUIRY) ||
-	   (stor_pkt->vm_srb.cdb[0] == MODE_SENSE) ||
-	   (stor_pkt->vm_srb.cdb[0] == MODE_SENSE_10) ||
-	   (stor_pkt->vm_srb.cdb[0] == MAINTENANCE_IN &&
-	   hv_dev_is_fc(device))) {
+	if (storvsc_host_mishandles_cmd(stor_pkt->vm_srb.cdb[0], device)) {
 		vstor_packet->vm_srb.scsi_status = 0;
 		vstor_packet->vm_srb.srb_status = SRB_STATUS_SUCCESS;
 	}
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH] Drivers: hv: mshv_vtl: Fix vmemmap_shift exceeding MAX_FOLIO_ORDER
From: Naman Jain @ 2026-04-06  4:56 UTC (permalink / raw)
  To: Michael Kelley, K . Y . Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li
  Cc: Saurabh Sengar, linux-hyperv@vger.kernel.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <SN6PR02MB4157550DA8F143F4DAFBEEE4D45EA@SN6PR02MB4157.namprd02.prod.outlook.com>



On 4/4/2026 12:07 AM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Friday, April 3, 2026 12:25 AM
>>
> 
> Nit: I wonder what's the best prefix to use in the patch Subject field.
> "Drivers: hv: mshv_vtl:" is rather long.  There was agreement to use
> just "mshv:" for the root partition code, and I probably misused that
> in commit 754cf84504ea. How about just "mshv_vtl:" as the prefix for this
> patch and other VTL patches going forward?
> 

If "mshv:" is OK for the root partition code, "mshv_vtl:" should be good 
for this. I'll change this subject, and for other patches in the future.

>> On 4/3/2026 9:05 AM, Michael Kelley wrote:
>>> From: Naman Jain <namjain@linux.microsoft.com> Sent: Tuesday, March 31, 2026 10:40 PM
>>>>
>>>> When registering VTL0 memory via MSHV_ADD_VTL0_MEMORY, the kernel
>>>> computes pgmap->vmemmap_shift as the number of trailing zeros in the
>>>> OR of start_pfn and last_pfn, intending to use the largest compound
>>>> page order both endpoints are aligned to.
>>>>
>>>> However, this value is not clamped to MAX_FOLIO_ORDER, so a
>>>> sufficiently aligned range (e.g. physical range 0x800000000000-
>>>> 0x800080000000, corresponding to start_pfn=0x800000000 with 35
>>>> trailing zeros) can produce a shift larger than what
>>>> memremap_pages() accepts, triggering a WARN and returning -EINVAL:
>>>>
>>>>     WARNING: ... memremap_pages+0x512/0x650
>>>>     requested folio size unsupported
>>>>
>>>> The MAX_FOLIO_ORDER check was added by
>>>> commit 646b67d57589 ("mm/memremap: reject unreasonable folio/compound
>>>> page sizes in memremap_pages()").
>>>> When CONFIG_HAVE_GIGANTIC_FOLIOS=y, CONFIG_SPARSEMEM_VMEMMAP=y, and
>>>> CONFIG_HUGETLB_PAGE is not set, MAX_FOLIO_ORDER resolves to
>>>> (PUD_SHIFT - PAGE_SHIFT) = 18. Any range whose PFN alignment exceeds
>>>> order 18 hits this path.
>>>
>>> I'm not clear on what point you are making with this specific
>>> configuration that results in MAX_FOLIO_ORDER being 18. Is it just
>>> an example? Is 18 the largest expected value for MAX_FOLIO_ORDER?
>>> And note that PUD_SHIFT and PAGE_SHIFT might have different values
>>> on arm64 with a page size other than 4K.
>>>
>>
>> Yes, this was just an example. It is not generalized enough, I will
>> remove it.
>> MAX_FOLIO_ORDER could go beyond 18.
>>
>>>>
>>>> Fix this by clamping vmemmap_shift to MAX_FOLIO_ORDER so we always
>>>> request the largest order the kernel supports, rather than an
>>>> out-of-range value.
>>>>
>>>> Also fix the error path to propagate the actual error code from
>>>> devm_memremap_pages() instead of hard-coding -EFAULT, which was
>>>> masking the real -EINVAL return.
>>>>
>>>> Fixes: 7bfe3b8ea6e3 ("Drivers: hv: Introduce mshv_vtl driver")
>>>> Cc: <stable@vger.kernel.org>
>>>> Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
>>>> ---
>>>>    drivers/hv/mshv_vtl_main.c | 8 ++++++--
>>>>    1 file changed, 6 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
>>>> index 5856975f32e12..255fed3a740c1 100644
>>>> --- a/drivers/hv/mshv_vtl_main.c
>>>> +++ b/drivers/hv/mshv_vtl_main.c
>>>> @@ -405,8 +405,12 @@ static int mshv_vtl_ioctl_add_vtl0_mem(struct mshv_vtl *vtl, void __user *arg)
>>>>    	/*
>>>>    	 * Determine the highest page order that can be used for the given memory range.
>>>>    	 * This works best when the range is aligned; i.e. both the start and the length.
>>>> +	 * Clamp to MAX_FOLIO_ORDER to avoid a WARN in memremap_pages() when the range
>>>> +	 * alignment exceeds the maximum supported folio order for this kernel config.
>>>>    	 */
>>>> -	pgmap->vmemmap_shift = count_trailing_zeros(vtl0_mem.start_pfn | vtl0_mem.last_pfn);
>>>> +	pgmap->vmemmap_shift = min_t(unsigned long,
>>>> +				     count_trailing_zeros(vtl0_mem.start_pfn | vtl0_mem.last_pfn),
>>>> +				     MAX_FOLIO_ORDER);
>>>
>>> Is it necessary to use min_t() here, or would min() work?  Neither count_trailing_zeros()
>>> nor MAX_FOLIO_ORDER is ever negative, so it seems like just min() would work with
>>> no potential for doing a bogus comparison or assignment.
>>>
>>
>> min could work, yes. I just felt min_t is safer for comparing these
>> two different types of values -
>> count_trailing_zeros() being 'int'
>> MAX_FOLIO_ORDER being a macro, calculated by applying bit operations.
>>
>> and destination being unsigned int.
>>
>>
>> include/linux/mmzone.h:#define MAX_FOLIO_ORDER          MAX_PAGE_ORDER
>> include/linux/mmzone.h:#define MAX_FOLIO_ORDER          PFN_SECTION_SHIFT
>> include/linux/mmzone.h:#define MAX_FOLIO_ORDER          (ilog2(SZ_16G) - PAGE_SHIFT)
>> include/linux/mmzone.h:#define MAX_FOLIO_ORDER          (ilog2(SZ_1G) - PAGE_SHIFT)
>> include/linux/mmzone.h:#define MAX_FOLIO_ORDER          (PUD_SHIFT - PAGE_SHIFT)
>>
>> I am fine with anything you suggest here.
> 
> There's a fair number of patches on LKML that are replacing min_t() with
> min().  At some point in the not-too-distant past, the implementation of
> min() was improved to deal with different but compatible integer types.
> My sense is that min() is the better choice for general integer comparisons,
> particularly when the values are known to be non-negative.
> 
>>
>>> The shift is calculated using the originally passed in start_pfn and last_pfn, while the
>>> "range" struct in pgmap has an "end" value that is one page less. So is the idea to
>>> go ahead and create the mapping with folios of a size that includes that last page,
>>> and then just waste the last page of the last folio?
>>
>> No, waste does not occur. The vmemmap_shift determines the folio order,
>> and memmap_init_zone_device() walks the range [start_pfn, last_pfn) in
>> steps of (1 << vmemmap_shift) pages, creating one folio per step. The OR
>> operation ensures both boundaries are aligned to multiples of (1 <<
>> vmemmap_shift), guaranteeing the range divides evenly into folios with
>> no partial folio at the end.
>> The intention is to find the highest folio order possible here, and if
>> it reaches the MAX_FOLIO_ORDER, restrict vmemmap_shift to it.
> 
> OK, I figured out what is confusing me. I had a misunderstanding
> when I reviewed this code during its original submission, and that
> misunderstanding has influenced my (incorrect) review of this change.
> 
> The struct mshv_vtl_ram_disposition that is passed from user space has
> two fields: start_pfn and last_pfn. But last_pfn is somewhat misnamed
> in my view. For example, an aligned 2 MiB of memory would consist of
> 512 PFNs. If the first PFN is 0x200, the last PFN would be 0x3FF.  But in
> the semantics of the struct, the last_pfn field should be 0x400.
> 
> In response to my comments in the original review, you added the comment
> about last_pfn being excluded in the pagemap range, which is true. But it's
> not because that page is somehow reserved or being wasted. It's because
> the range is being described by specifying the PFN *after* the last PFN.
> 
> With the start_pfn and last_pfn fields used to determine the highest
> page order that can be used, the slightly unorthodox semantics of
> last_pfn make that calculation easy. But then you must subtract 1
> from last_pfn when setting the range start and end for
> devm_memremap_pages() to use. And the code does that, so the code
> is all correct. The comment might be improved to speak about the
> semantics of the last_pfn field, not that a page of memory is
> intentionally being excluded/wasted.  And/or maybe the struct
> mshv_vtl_ram_disposition definition should get a comment to clarify
> the semantics of last_pfn.
> 
> Michael

Right, last_pfn could be interpreted as the actual last PFN. I'll add a
comment to avoid the confusion.
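
For reference, the half-open [start_pfn, last_pfn) semantics and the
clamped order can be checked with a small user-space sketch. This is
only an illustration, not the driver code: range_pages(), range_order(),
clamped_order() and EXAMPLE_MAX_FOLIO_ORDER are made-up names, and 18 is
just the (PUD_SHIFT - PAGE_SHIFT) value for x86-64 with 4 KiB pages, not
a general limit. __builtin_ctzll() is the GCC/Clang builtin.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for MAX_FOLIO_ORDER; the real value is config dependent. */
#define EXAMPLE_MAX_FOLIO_ORDER 18u

/* Number of pages in the half-open range [start_pfn, last_pfn). */
static uint64_t range_pages(uint64_t start_pfn, uint64_t last_pfn)
{
	assert(last_pfn > start_pfn);
	return last_pfn - start_pfn;
}

/* Largest order that both range boundaries are aligned to. */
static unsigned int range_order(uint64_t start_pfn, uint64_t last_pfn)
{
	/* last_pfn > start_pfn implies the OR is non-zero, so ctz is defined. */
	assert(last_pfn > start_pfn);
	return (unsigned int)__builtin_ctzll(start_pfn | last_pfn);
}

/* Same as range_order(), but limited to the example maximum order. */
static unsigned int clamped_order(uint64_t start_pfn, uint64_t last_pfn)
{
	unsigned int order = range_order(start_pfn, last_pfn);

	return order < EXAMPLE_MAX_FOLIO_ORDER ? order : EXAMPLE_MAX_FOLIO_ORDER;
}
```

For the aligned 2 MiB example (start_pfn 0x200, last_pfn 0x400) this
gives 512 pages and order 9, i.e. exactly one folio. For the range in
the commit message (start_pfn 0x800000000, last_pfn 0x800080000) the OR
has 19 trailing zeros, which is clamped to 18 with the example limit.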

Regards,
Naman


^ permalink raw reply

* [PATCH v2] mshv_vtl: Fix vmemmap_shift exceeding MAX_FOLIO_ORDER
From: Naman Jain @ 2026-04-06  9:24 UTC (permalink / raw)
  To: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Michael Kelley
  Cc: Saurabh Sengar, Naman Jain, linux-hyperv, linux-kernel

When registering VTL0 memory via MSHV_ADD_VTL0_MEMORY, the kernel
computes pgmap->vmemmap_shift as the number of trailing zeros in the
OR of start_pfn and last_pfn, intending to use the largest compound
page order both endpoints are aligned to.

However, this value is not clamped to MAX_FOLIO_ORDER, so a
sufficiently aligned range (e.g. physical range
[0x800000000000, 0x800080000000), corresponding to start_pfn=0x800000000
with 35 trailing zeros) can produce a shift larger than what
memremap_pages() accepts, triggering a WARN and returning -EINVAL:

  WARNING: ... memremap_pages+0x512/0x650
  requested folio size unsupported

The MAX_FOLIO_ORDER check was added by
commit 646b67d57589 ("mm/memremap: reject unreasonable folio/compound
page sizes in memremap_pages()").

Fix this by clamping vmemmap_shift to MAX_FOLIO_ORDER so we always
request the largest order the kernel supports, in those cases, rather
than an out-of-range value.

Also fix the error path to propagate the actual error code from
devm_memremap_pages() instead of hard-coding -EFAULT, which was
masking the real -EINVAL return.

Fixes: 7bfe3b8ea6e3 ("Drivers: hv: Introduce mshv_vtl driver")
Cc: stable@vger.kernel.org
Signed-off-by: Naman Jain <namjain@linux.microsoft.com>
---
Changes since v1:
https://lore.kernel.org/all/20260401054005.1532381-1-namjain@linux.microsoft.com/
Addressed Michael's comments:
* remove MAX_FOLIO_ORDER value related text in commit msg
* Change the patch subject to use the "mshv_vtl:" prefix
* Add comments regarding last_pfn to avoid confusion
* use min instead of min_t
---
 drivers/hv/mshv_vtl_main.c | 12 +++++++++---
 include/uapi/linux/mshv.h  |  2 +-
 2 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/hv/mshv_vtl_main.c b/drivers/hv/mshv_vtl_main.c
index 5856975f32e1..c19400701467 100644
--- a/drivers/hv/mshv_vtl_main.c
+++ b/drivers/hv/mshv_vtl_main.c
@@ -386,7 +386,6 @@ static int mshv_vtl_ioctl_add_vtl0_mem(struct mshv_vtl *vtl, void __user *arg)
 
 	if (copy_from_user(&vtl0_mem, arg, sizeof(vtl0_mem)))
 		return -EFAULT;
-	/* vtl0_mem.last_pfn is excluded in the pagemap range for VTL0 as per design */
 	if (vtl0_mem.last_pfn <= vtl0_mem.start_pfn) {
 		dev_err(vtl->module_dev, "range start pfn (%llx) > end pfn (%llx)\n",
 			vtl0_mem.start_pfn, vtl0_mem.last_pfn);
@@ -397,6 +396,10 @@ static int mshv_vtl_ioctl_add_vtl0_mem(struct mshv_vtl *vtl, void __user *arg)
 	if (!pgmap)
 		return -ENOMEM;
 
+	/*
+	 * vtl0_mem.last_pfn is excluded in the pagemap range for VTL0 as per design.
+	 * last_pfn is not reserved or wasted, and reflects 'start_pfn + size' of pagemap range.
+	 */
 	pgmap->ranges[0].start = PFN_PHYS(vtl0_mem.start_pfn);
 	pgmap->ranges[0].end = PFN_PHYS(vtl0_mem.last_pfn) - 1;
 	pgmap->nr_range = 1;
@@ -405,8 +408,11 @@ static int mshv_vtl_ioctl_add_vtl0_mem(struct mshv_vtl *vtl, void __user *arg)
 	/*
 	 * Determine the highest page order that can be used for the given memory range.
 	 * This works best when the range is aligned; i.e. both the start and the length.
+	 * Clamp to MAX_FOLIO_ORDER to avoid a WARN in memremap_pages() when the range
+	 * alignment exceeds the maximum supported folio order for this kernel config.
 	 */
-	pgmap->vmemmap_shift = count_trailing_zeros(vtl0_mem.start_pfn | vtl0_mem.last_pfn);
+	pgmap->vmemmap_shift = min(count_trailing_zeros(vtl0_mem.start_pfn | vtl0_mem.last_pfn),
+				   MAX_FOLIO_ORDER);
 	dev_dbg(vtl->module_dev,
 		"Add VTL0 memory: start: 0x%llx, end_pfn: 0x%llx, page order: %lu\n",
 		vtl0_mem.start_pfn, vtl0_mem.last_pfn, pgmap->vmemmap_shift);
@@ -415,7 +421,7 @@ static int mshv_vtl_ioctl_add_vtl0_mem(struct mshv_vtl *vtl, void __user *arg)
 	if (IS_ERR(addr)) {
 		dev_err(vtl->module_dev, "devm_memremap_pages error: %ld\n", PTR_ERR(addr));
 		kfree(pgmap);
-		return -EFAULT;
+		return PTR_ERR(addr);
 	}
 
 	/* Don't free pgmap, since it has to stick around until the memory
diff --git a/include/uapi/linux/mshv.h b/include/uapi/linux/mshv.h
index e0645a34b55b..32ff92b6342b 100644
--- a/include/uapi/linux/mshv.h
+++ b/include/uapi/linux/mshv.h
@@ -357,7 +357,7 @@ struct mshv_vtl_sint_post_msg {
 
 struct mshv_vtl_ram_disposition {
 	__u64 start_pfn;
-	__u64 last_pfn;
+	__u64 last_pfn; /* last_pfn is excluded from the range [start_pfn, last_pfn) */
 };
 
 struct mshv_vtl_set_poll_file {
-- 
2.43.0


^ permalink raw reply related

* [PATCH 00/10] Convert all drivers to the new udata response flow
From: Jason Gunthorpe @ 2026-04-06 12:11 UTC (permalink / raw)
  To: Abhijit Gangurde, Allen Hubbe,
	Broadcom internal kernel review list, Bernard Metzler,
	Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
	Gal Pressman, Junxian Huang, Kai Shen, Kalesh AP,
	Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
	linux-hyperv, linux-rdma, Long Li, Michal Kalderon,
	Michael Margolin, Nelson Escobar, Satish Kharat, Selvin Xavier,
	Yossi Leybovich, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
	Yishai Hadas
  Cc: patches

Go through the drivers and migrate them to use
ib_respond_udata(). Remove debugging prints on failure paths.
Ensure the error propagates from ib_respond_udata(). Use the = {}
pattern to initialize the uresp.

There are a couple of oddball cases which are fixed up in their own
commits, but otherwise this is fairly straightforward.
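
As a rough user-space model of the behaviour the conversions rely on
(zero-initialize the response, then copy no more than the user's buffer
can hold), consider the sketch below. It is not the kernel
implementation of ib_respond_udata(); struct example_uresp and
respond_capped() are hypothetical names, and {0} is used here for
portability where the series uses = {}.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical response layout; not a real driver ABI struct. */
struct example_uresp {
	uint32_t handle;
	uint32_t flags;
};

/*
 * Model of a capped response copy: write min(outlen, resp_len) bytes
 * to the destination and report how many bytes were written.
 */
static size_t respond_capped(void *out, size_t outlen,
			     const void *resp, size_t resp_len)
{
	size_t n = outlen < resp_len ? outlen : resp_len;

	assert(out != NULL && resp != NULL);
	memcpy(out, resp, n);
	return n;
}

/* Build a zero-initialized response so unset fields never leak stale data. */
static struct example_uresp make_uresp(uint32_t handle)
{
	struct example_uresp uresp = {0};

	uresp.handle = handle;
	return uresp;
}
```

A caller with a 4-byte output buffer would receive only the handle
field, while a caller with room for the full struct gets all 8 bytes,
with flags reading as zero unless the driver set it.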

Jason Gunthorpe (10):
  RDMA: Use ib_is_udata_in_empty() for places calling
    ib_is_udata_cleared()
  IB/rdmavt: Don't abuse udata and ib_respond_udata()
  RDMA: Convert drivers using min to ib_respond_udata()
  RDMA: Convert drivers using sizeof() to ib_respond_udata()
  RDMA/cxgb4: Convert to ib_respond_udata()
  RDMA/qedr: Replace qedr_ib_copy_to_udata() with ib_respond_udata()
  RDMA/mlx: Replace response_len with ib_respond_udata()
  RDMA: Use proper driver data response structs instead of open coding
  RDMA: Add missed = {} initialization to uresp structs
  RDMA: Replace memset with = {} pattern for ib_respond_udata()

 drivers/infiniband/hw/bnxt_re/ib_verbs.c      |  2 +-
 drivers/infiniband/hw/cxgb4/cq.c              | 11 +--
 drivers/infiniband/hw/cxgb4/provider.c        | 14 +--
 drivers/infiniband/hw/cxgb4/qp.c              | 10 +--
 drivers/infiniband/hw/efa/efa_verbs.c         | 87 ++++++-------------
 drivers/infiniband/hw/erdma/erdma_verbs.c     | 13 ++-
 drivers/infiniband/hw/hns/hns_roce_ah.c       |  4 +-
 drivers/infiniband/hw/hns/hns_roce_cq.c       |  3 +-
 drivers/infiniband/hw/hns/hns_roce_main.c     |  3 +-
 drivers/infiniband/hw/hns/hns_roce_pd.c       |  8 +-
 drivers/infiniband/hw/hns/hns_roce_qp.c       | 13 +--
 drivers/infiniband/hw/hns/hns_roce_srq.c      |  6 +-
 .../infiniband/hw/ionic/ionic_controlpath.c   |  8 +-
 drivers/infiniband/hw/irdma/verbs.c           | 48 ++++------
 drivers/infiniband/hw/mana/cq.c               |  6 +-
 drivers/infiniband/hw/mana/qp.c               | 22 ++---
 drivers/infiniband/hw/mlx4/cq.c               |  7 +-
 drivers/infiniband/hw/mlx4/main.c             | 31 ++++---
 drivers/infiniband/hw/mlx4/qp.c               |  9 +-
 drivers/infiniband/hw/mlx4/srq.c              | 12 ++-
 drivers/infiniband/hw/mlx5/ah.c               |  2 +-
 drivers/infiniband/hw/mlx5/cq.c               |  7 +-
 drivers/infiniband/hw/mlx5/main.c             | 16 ++--
 drivers/infiniband/hw/mlx5/mr.c               |  2 +-
 drivers/infiniband/hw/mlx5/qp.c               | 17 ++--
 drivers/infiniband/hw/mlx5/srq.c              |  7 +-
 drivers/infiniband/hw/mthca/mthca_provider.c  | 40 ++++++---
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c   | 31 +++----
 drivers/infiniband/hw/qedr/verbs.c            | 43 ++-------
 drivers/infiniband/hw/usnic/usnic_ib_verbs.c  | 13 +--
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_cq.c  |  7 +-
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_qp.c  |  8 +-
 drivers/infiniband/hw/vmw_pvrdma/pvrdma_srq.c |  6 +-
 .../infiniband/hw/vmw_pvrdma/pvrdma_verbs.c   | 11 ++-
 drivers/infiniband/sw/rdmavt/cq.c             |  2 +-
 drivers/infiniband/sw/rdmavt/qp.c             |  3 +-
 drivers/infiniband/sw/rdmavt/srq.c            | 19 ++--
 drivers/infiniband/sw/siw/siw_verbs.c         | 10 +--
 38 files changed, 223 insertions(+), 338 deletions(-)


base-commit: 69db255d5fafb5651013c79e54c1d535fc5015fb
-- 
2.43.0


^ permalink raw reply

* [PATCH 05/10] RDMA/cxgb4: Convert to ib_respond_udata()
From: Jason Gunthorpe @ 2026-04-06 12:11 UTC (permalink / raw)
  To: Abhijit Gangurde, Allen Hubbe,
	Broadcom internal kernel review list, Bernard Metzler,
	Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
	Gal Pressman, Junxian Huang, Kai Shen, Kalesh AP,
	Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
	linux-hyperv, linux-rdma, Long Li, Michal Kalderon,
	Michael Margolin, Nelson Escobar, Satish Kharat, Selvin Xavier,
	Yossi Leybovich, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
	Yishai Hadas
  Cc: patches
In-Reply-To: <0-v1-e911b76a94d1+65d95-rdma_udata_rep_jgg@nvidia.com>

These cases carefully work around 32-bit unpadded structures, but
the min integrated into ib_respond_udata() handles this
automatically. Zero-initialize data that would not have been copied.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/hw/cxgb4/cq.c       | 8 +++-----
 drivers/infiniband/hw/cxgb4/provider.c | 5 ++---
 2 files changed, 5 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
index e31fb9134aa818..47508df4cec023 100644
--- a/drivers/infiniband/hw/cxgb4/cq.c
+++ b/drivers/infiniband/hw/cxgb4/cq.c
@@ -1115,13 +1115,11 @@ int c4iw_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
 		/* communicate to the userspace that
 		 * kernel driver supports 64B CQE
 		 */
-		uresp.flags |= C4IW_64B_CQE;
+		if (!ucontext->is_32b_cqe)
+			uresp.flags |= C4IW_64B_CQE;
 
 		spin_unlock(&ucontext->mmap_lock);
-		ret = ib_copy_to_udata(udata, &uresp,
-				       ucontext->is_32b_cqe ?
-				       sizeof(uresp) - sizeof(uresp.flags) :
-				       sizeof(uresp));
+		ret = ib_respond_udata(udata, uresp);
 		if (ret)
 			goto err_free_mm2;
 
diff --git a/drivers/infiniband/hw/cxgb4/provider.c b/drivers/infiniband/hw/cxgb4/provider.c
index a119e8793aef40..0e3827022c63da 100644
--- a/drivers/infiniband/hw/cxgb4/provider.c
+++ b/drivers/infiniband/hw/cxgb4/provider.c
@@ -80,7 +80,7 @@ static int c4iw_alloc_ucontext(struct ib_ucontext *ucontext,
 	struct ib_device *ibdev = ucontext->device;
 	struct c4iw_ucontext *context = to_c4iw_ucontext(ucontext);
 	struct c4iw_dev *rhp = to_c4iw_dev(ibdev);
-	struct c4iw_alloc_ucontext_resp uresp;
+	struct c4iw_alloc_ucontext_resp uresp = {};
 	int ret = 0;
 	struct c4iw_mm_entry *mm = NULL;
 
@@ -106,8 +106,7 @@ static int c4iw_alloc_ucontext(struct ib_ucontext *ucontext,
 		context->key += PAGE_SIZE;
 		spin_unlock(&context->mmap_lock);
 
-		ret = ib_copy_to_udata(udata, &uresp,
-				       sizeof(uresp) - sizeof(uresp.reserved));
+		ret = ib_respond_udata(udata, uresp);
 		if (ret)
 			goto err_mm;
 
-- 
2.43.0


^ permalink raw reply related

* [PATCH 07/10] RDMA/mlx: Replace response_len with ib_respond_udata()
From: Jason Gunthorpe @ 2026-04-06 12:11 UTC (permalink / raw)
  To: Abhijit Gangurde, Allen Hubbe,
	Broadcom internal kernel review list, Bernard Metzler,
	Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
	Gal Pressman, Junxian Huang, Kai Shen, Kalesh AP,
	Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
	linux-hyperv, linux-rdma, Long Li, Michal Kalderon,
	Michael Margolin, Nelson Escobar, Satish Kharat, Selvin Xavier,
	Yossi Leybovich, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
	Yishai Hadas
  Cc: patches
In-Reply-To: <0-v1-e911b76a94d1+65d95-rdma_udata_rep_jgg@nvidia.com>

The Mellanox drivers have a pattern where they compute the response
length they think they need based on what the user asked for, then
blindly write that ignoring the provided size limit on the response
structure.

Drop this and just use ib_respond_udata() which caps the response
struct to the user's memory, which is fine for what mlx5 is doing.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/hw/mlx4/main.c |  2 +-
 drivers/infiniband/hw/mlx4/qp.c   |  2 +-
 drivers/infiniband/hw/mlx5/ah.c   |  2 +-
 drivers/infiniband/hw/mlx5/main.c |  4 ++--
 drivers/infiniband/hw/mlx5/mr.c   |  2 +-
 drivers/infiniband/hw/mlx5/qp.c   | 10 +++++-----
 6 files changed, 11 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index ce77e893065c92..4b187ec9e01738 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -626,7 +626,7 @@ static int mlx4_ib_query_device(struct ib_device *ibdev,
 	}
 
 	if (uhw->outlen) {
-		err = ib_copy_to_udata(uhw, &resp, resp.response_length);
+		err = ib_respond_udata(uhw, resp);
 		if (err)
 			goto out;
 	}
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index aca8a985ce33cd..8dc4196218bf05 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -4331,7 +4331,7 @@ int mlx4_ib_create_rwq_ind_table(struct ib_rwq_ind_table *rwq_ind_table,
 	if (udata->outlen) {
 		resp.response_length = offsetof(typeof(resp), response_length) +
 					sizeof(resp.response_length);
-		err = ib_copy_to_udata(udata, &resp, resp.response_length);
+		err = ib_respond_udata(udata, resp);
 	}
 
 	return err;
diff --git a/drivers/infiniband/hw/mlx5/ah.c b/drivers/infiniband/hw/mlx5/ah.c
index 531a57f9ee7e8b..a3aa700d08355d 100644
--- a/drivers/infiniband/hw/mlx5/ah.c
+++ b/drivers/infiniband/hw/mlx5/ah.c
@@ -121,7 +121,7 @@ int mlx5_ib_create_ah(struct ib_ah *ibah, struct rdma_ah_init_attr *init_attr,
 		resp.response_length = min_resp_len;
 
 		memcpy(resp.dmac, ah_attr->roce.dmac, ETH_ALEN);
-		err = ib_copy_to_udata(udata, &resp, resp.response_length);
+		err = ib_respond_udata(udata, resp);
 		if (err)
 			return err;
 	}
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 57d3b80e7550b6..84dddaded6fdef 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -1355,7 +1355,7 @@ static int mlx5_ib_query_device(struct ib_device *ibdev,
 	}
 
 	if (uhw_outlen) {
-		err = ib_copy_to_udata(uhw, &resp, resp.response_length);
+		err = ib_respond_udata(uhw, resp);
 
 		if (err)
 			return err;
@@ -2280,7 +2280,7 @@ static int mlx5_ib_alloc_ucontext(struct ib_ucontext *uctx,
 		goto out_mdev;
 
 	resp.response_length = min(udata->outlen, sizeof(resp));
-	err = ib_copy_to_udata(udata, &resp, resp.response_length);
+	err = ib_respond_udata(udata, resp);
 	if (err)
 		goto out_mdev;
 
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 3ef467ac9e3d15..8eb922bd3b663d 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1811,7 +1811,7 @@ int mlx5_ib_alloc_mw(struct ib_mw *ibmw, struct ib_udata *udata)
 	resp.response_length =
 		min(offsetofend(typeof(resp), response_length), udata->outlen);
 	if (resp.response_length) {
-		err = ib_copy_to_udata(udata, &resp, resp.response_length);
+		err = ib_respond_udata(udata, resp);
 		if (err)
 			goto free_mkey;
 	}
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 81d98b5010f1ca..4a7363327d2a8e 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -3327,7 +3327,7 @@ int mlx5_ib_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *attr,
 		 * including MLX5_IB_QPT_DCT, which doesn't need it.
 		 * In that case, resp will be filled with zeros.
 		 */
-		err = ib_copy_to_udata(udata, &params.resp, params.outlen);
+		err = ib_respond_udata(udata, params.resp);
 	if (err)
 		goto destroy_qp;
 
@@ -4626,7 +4626,7 @@ static int mlx5_ib_modify_dct(struct ib_qp *ibqp, struct ib_qp_attr *attr,
 		resp.dctn = qp->dct.mdct.mqp.qpn;
 		if (MLX5_CAP_GEN(dev->mdev, ece_support))
 			resp.ece_options = MLX5_GET(create_dct_out, out, ece);
-		err = ib_copy_to_udata(udata, &resp, resp.response_length);
+		err = ib_respond_udata(udata, resp);
 		if (err) {
 			mlx5_core_destroy_dct(dev, &qp->dct.mdct);
 			return err;
@@ -4785,7 +4785,7 @@ int mlx5_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
 	if (!err && resp.response_length &&
 	    udata->outlen >= resp.response_length)
 		/* Return -EFAULT to the user and expect him to destroy QP. */
-		err = ib_copy_to_udata(udata, &resp, resp.response_length);
+		err = ib_respond_udata(udata, resp);
 
 out:
 	mutex_unlock(&qp->mutex);
@@ -5485,7 +5485,7 @@ struct ib_wq *mlx5_ib_create_wq(struct ib_pd *pd,
 	if (udata->outlen) {
 		resp.response_length = offsetofend(
 			struct mlx5_ib_create_wq_resp, response_length);
-		err = ib_copy_to_udata(udata, &resp, resp.response_length);
+		err = ib_respond_udata(udata, resp);
 		if (err)
 			goto err_copy;
 	}
@@ -5576,7 +5576,7 @@ int mlx5_ib_create_rwq_ind_table(struct ib_rwq_ind_table *ib_rwq_ind_table,
 		resp.response_length =
 			offsetofend(struct mlx5_ib_create_rwq_ind_tbl_resp,
 				    response_length);
-		err = ib_copy_to_udata(udata, &resp, resp.response_length);
+		err = ib_respond_udata(udata, resp);
 		if (err)
 			goto err_copy;
 	}
-- 
2.43.0


* [PATCH 09/10] RDMA: Add missed = {} initialization to uresp structs
From: Jason Gunthorpe @ 2026-04-06 12:11 UTC (permalink / raw)
  To: Abhijit Gangurde, Allen Hubbe,
	Broadcom internal kernel review list, Bernard Metzler,
	Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
	Gal Pressman, Junxian Huang, Kai Shen, Kalesh AP,
	Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
	linux-hyperv, linux-rdma, Long Li, Michal Kalderon,
	Michael Margolin, Nelson Escobar, Satish Kharat, Selvin Xavier,
	Yossi Leybovich, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
	Yishai Hadas
  Cc: patches
In-Reply-To: <0-v1-e911b76a94d1+65d95-rdma_udata_rep_jgg@nvidia.com>

All of these are fully initialized so no bugs are being fixed. Add
the missing initializer as a precaution against future changes.
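As a rough illustration of the precaution, the sketch below uses a hypothetical response struct (demo_resp and demo_fill_resp() are made up for this example and are not a driver ABI). It shows a call site that only assigns the fields it knows about; with the "= {}" initializer, a field added later still goes out as zero rather than as whatever was on the stack. The empty initializer is a GNU extension that was standardized in C23.

```c
#include <stdint.h>

/* Hypothetical response layout; not a real driver ABI struct. */
struct demo_resp {
	uint32_t id;
	uint32_t rsvd;
	uint64_t added_later;	/* imagine a later patch added this field */
};

/*
 * Fill a response the way an unmodified call site would: it only knows
 * about 'id' and 'rsvd'. Because of the "= {}" initializer, the newer
 * field is zero instead of holding stale stack contents.
 */
static void demo_fill_resp(struct demo_resp *out, uint32_t id)
{
	struct demo_resp resp = {};

	resp.id = id;
	resp.rsvd = 0;
	*out = resp;
}
```

Calling demo_fill_resp(&r, 7) leaves r.added_later at zero even though the function never mentions that field.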

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/hw/bnxt_re/ib_verbs.c  | 2 +-
 drivers/infiniband/hw/erdma/erdma_verbs.c | 2 +-
 drivers/infiniband/hw/mlx4/main.c         | 4 ++--
 drivers/infiniband/hw/mlx5/main.c         | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/hw/bnxt_re/ib_verbs.c b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
index 7ed294516b7edb..ccb362d6d2e669 100644
--- a/drivers/infiniband/hw/bnxt_re/ib_verbs.c
+++ b/drivers/infiniband/hw/bnxt_re/ib_verbs.c
@@ -1884,7 +1884,7 @@ int bnxt_re_create_qp(struct ib_qp *ib_qp, struct ib_qp_init_attr *qp_init_attr,
 		}
 
 		if (udata) {
-			struct bnxt_re_qp_resp resp;
+			struct bnxt_re_qp_resp resp = {};
 
 			resp.qpid = qp->qplib_qp.id;
 			resp.rsvd = 0;
diff --git a/drivers/infiniband/hw/erdma/erdma_verbs.c b/drivers/infiniband/hw/erdma/erdma_verbs.c
index 92a65970ab6fa1..c8a35337ba51e8 100644
--- a/drivers/infiniband/hw/erdma/erdma_verbs.c
+++ b/drivers/infiniband/hw/erdma/erdma_verbs.c
@@ -1977,7 +1977,7 @@ int erdma_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
 
 	if (!rdma_is_kernel_res(&ibcq->res)) {
 		struct erdma_ureq_create_cq ureq;
-		struct erdma_uresp_create_cq uresp;
+		struct erdma_uresp_create_cq uresp = {};
 
 		ret = ib_copy_validate_udata_in(udata, ureq, rsvd0);
 		if (ret)
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 25f9738bd77223..d50743f090bf21 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -1090,8 +1090,8 @@ static int mlx4_ib_alloc_ucontext(struct ib_ucontext *uctx,
 	struct ib_device *ibdev = uctx->device;
 	struct mlx4_ib_dev *dev = to_mdev(ibdev);
 	struct mlx4_ib_ucontext *context = to_mucontext(uctx);
-	struct mlx4_ib_alloc_ucontext_resp_v3 resp_v3;
-	struct mlx4_ib_alloc_ucontext_resp resp;
+	struct mlx4_ib_alloc_ucontext_resp_v3 resp_v3 = {};
+	struct mlx4_ib_alloc_ucontext_resp resp = {};
 	int err;
 
 	if (!dev->ib_active)
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index 84dddaded6fdef..a6a696864f9e0a 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -2772,7 +2772,7 @@ static int mlx5_ib_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
 {
 	struct mlx5_ib_pd *pd = to_mpd(ibpd);
 	struct ib_device *ibdev = ibpd->device;
-	struct mlx5_ib_alloc_pd_resp resp;
+	struct mlx5_ib_alloc_pd_resp resp = {};
 	int err;
 	u32 out[MLX5_ST_SZ_DW(alloc_pd_out)] = {};
 	u32 in[MLX5_ST_SZ_DW(alloc_pd_in)] = {};
-- 
2.43.0


* [PATCH 06/10] RDMA/qedr: Replace qedr_ib_copy_to_udata() with ib_respond_udata()
From: Jason Gunthorpe @ 2026-04-06 12:11 UTC (permalink / raw)
  To: Abhijit Gangurde, Allen Hubbe,
	Broadcom internal kernel review list, Bernard Metzler,
	Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
	Gal Pressman, Junxian Huang, Kai Shen, Kalesh AP,
	Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
	linux-hyperv, linux-rdma, Long Li, Michal Kalderon,
	Michael Margolin, Nelson Escobar, Satish Kharat, Selvin Xavier,
	Yossi Leybovich, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
	Yishai Hadas
  Cc: patches
In-Reply-To: <0-v1-e911b76a94d1+65d95-rdma_udata_rep_jgg@nvidia.com>

qedr_ib_copy_to_udata() is another instance of the open-coded min()
pattern, so replace it with ib_respond_udata().
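The min() pattern in question can be sketched in user space as follows. The definition of ib_respond_udata() is not part of this message, so this is only a model of the removed wrapper, which clamped the copy length to udata->outlen. The names demo_udata and demo_respond() are hypothetical, and memcpy() stands in for the copy to user memory.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the output side of struct ib_udata. */
struct demo_udata {
	unsigned char *outbuf;	/* a plain buffer stands in for user memory */
	size_t outlen;		/* bytes the caller made available */
};

/*
 * Copy at most outlen bytes of the response, mirroring what the removed
 * qedr_ib_copy_to_udata() did with min_t(size_t, len, udata->outlen).
 * Returns the number of bytes copied.
 */
static size_t demo_respond(struct demo_udata *udata, const void *src,
			   size_t len)
{
	size_t n = len < udata->outlen ? len : udata->outlen;

	memcpy(udata->outbuf, src, n);	/* the kernel would copy to user space */
	return n;
}
```

With a 4-byte response and outlen set to 2, only the first two bytes are written, which is how an older user space with a shorter response struct keeps working.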

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/hw/qedr/verbs.c | 30 ++++--------------------------
 1 file changed, 4 insertions(+), 26 deletions(-)

diff --git a/drivers/infiniband/hw/qedr/verbs.c b/drivers/infiniband/hw/qedr/verbs.c
index 3b86ea1cf88883..72ee57dc85687e 100644
--- a/drivers/infiniband/hw/qedr/verbs.c
+++ b/drivers/infiniband/hw/qedr/verbs.c
@@ -64,14 +64,6 @@ enum {
 	QEDR_USER_MMAP_PHYS_PAGE,
 };
 
-static inline int qedr_ib_copy_to_udata(struct ib_udata *udata, void *src,
-					size_t len)
-{
-	size_t min_len = min_t(size_t, len, udata->outlen);
-
-	return ib_copy_to_udata(udata, src, min_len);
-}
-
 int qedr_query_pkey(struct ib_device *ibdev, u32 port, u16 index, u16 *pkey)
 {
 	if (index >= QEDR_ROCE_PKEY_TABLE_LEN)
@@ -340,7 +332,7 @@ int qedr_alloc_ucontext(struct ib_ucontext *uctx, struct ib_udata *udata)
 	uresp.sges_per_srq_wr = dev->attr.max_srq_sge;
 	uresp.max_cqes = QEDR_MAX_CQES;
 
-	rc = qedr_ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+	rc = ib_respond_udata(udata, uresp);
 	if (rc)
 		goto err;
 
@@ -459,9 +451,8 @@ int qedr_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
 		struct qedr_ucontext *context = rdma_udata_to_drv_context(
 			udata, struct qedr_ucontext, ibucontext);
 
-		rc = qedr_ib_copy_to_udata(udata, &uresp, sizeof(uresp));
+		rc = ib_respond_udata(udata, uresp);
 		if (rc) {
-			DP_ERR(dev, "copy error pd_id=0x%x.\n", pd_id);
 			dev->ops->rdma_dealloc_pd(dev->rdma_ctx, pd_id);
 			return rc;
 		}
@@ -701,7 +692,6 @@ static int qedr_copy_cq_uresp(struct qedr_dev *dev,
 			      u32 db_offset)
 {
 	struct qedr_create_cq_uresp uresp;
-	int rc;
 
 	memset(&uresp, 0, sizeof(uresp));
 
@@ -711,11 +701,7 @@ static int qedr_copy_cq_uresp(struct qedr_dev *dev,
 		uresp.db_rec_addr =
 			rdma_user_mmap_get_offset(cq->q.db_mmap_entry);
 
-	rc = qedr_ib_copy_to_udata(udata, &uresp, sizeof(uresp));
-	if (rc)
-		DP_ERR(dev, "copy error cqid=0x%x.\n", cq->icid);
-
-	return rc;
+	return ib_respond_udata(udata, uresp);
 }
 
 static void consume_cqe(struct qedr_cq *cq)
@@ -1298,8 +1284,6 @@ static int qedr_copy_qp_uresp(struct qedr_dev *dev,
 			      struct qedr_qp *qp, struct ib_udata *udata,
 			      struct qedr_create_qp_uresp *uresp)
 {
-	int rc;
-
 	memset(uresp, 0, sizeof(*uresp));
 
 	if (qedr_qp_has_sq(qp))
@@ -1311,13 +1295,7 @@ static int qedr_copy_qp_uresp(struct qedr_dev *dev,
 	uresp->atomic_supported = dev->atomic_cap != IB_ATOMIC_NONE;
 	uresp->qp_id = qp->qp_id;
 
-	rc = qedr_ib_copy_to_udata(udata, uresp, sizeof(*uresp));
-	if (rc)
-		DP_ERR(dev,
-		       "create qp: failed a copy to user space with qp icid=0x%x.\n",
-		       qp->icid);
-
-	return rc;
+	return ib_respond_udata(udata, *uresp);
 }
 
 static void qedr_reset_qp_hwq_info(struct qedr_qp_hwq_info *qph)
-- 
2.43.0


* [PATCH 01/10] RDMA: Use ib_is_udata_in_empty() for places calling ib_is_udata_cleared()
From: Jason Gunthorpe @ 2026-04-06 12:11 UTC (permalink / raw)
  To: Abhijit Gangurde, Allen Hubbe,
	Broadcom internal kernel review list, Bernard Metzler,
	Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
	Gal Pressman, Junxian Huang, Kai Shen, Kalesh AP,
	Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
	linux-hyperv, linux-rdma, Long Li, Michal Kalderon,
	Michael Margolin, Nelson Escobar, Satish Kharat, Selvin Xavier,
	Yossi Leybovich, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
	Yishai Hadas
  Cc: patches
In-Reply-To: <0-v1-e911b76a94d1+65d95-rdma_udata_rep_jgg@nvidia.com>

Convert the open-coded pattern:

  if (udata->inlen && !ib_is_udata_cleared(udata, 0, udata->inlen))

to ib_is_udata_in_empty() using Coccinelle:

virtual patch
virtual context
virtual report

@@
expression udata;
@@
(
- udata->inlen && !ib_is_udata_cleared(udata, 0, udata->inlen)
+ !ib_is_udata_in_empty(udata)
|
- udata->inlen > 0 && !ib_is_udata_cleared(udata, 0, udata->inlen)
+ !ib_is_udata_in_empty(udata)
)

@@
expression udata;
@@
- udata && udata->inlen && !ib_is_udata_cleared(udata, 0, udata->inlen)
+ !ib_is_udata_in_empty(udata)

These call sites already check that any input data the kernel does
not understand is zeroed.

Run another pass with AI to propagate the return code correctly and
remove redundant prints.
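For readers unfamiliar with the old check, here is a user-space model of the open-coded condition only. The definition and return convention of ib_is_udata_in_empty() are not shown in this message, so nothing below should be read as that helper. demo_input_not_cleared() is a hypothetical name.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical user-space model of the open-coded condition
 *   inlen && !ib_is_udata_cleared(udata, 0, inlen)
 * Returns true when the input carries at least one non-zero byte, which
 * is the case the drivers reject.
 */
static bool demo_input_not_cleared(const unsigned char *in, size_t inlen)
{
	for (size_t i = 0; i < inlen; i++)
		if (in[i])
			return true;
	return false;	/* empty input or all zeros: accepted */
}
```

An empty input and an all-zero input are both accepted; a single non-zero byte anywhere in the input triggers the rejection path.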

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/hw/efa/efa_verbs.c | 43 +++++++++------------------
 drivers/infiniband/hw/mlx4/main.c     |  6 ++--
 drivers/infiniband/hw/mlx4/qp.c       |  7 ++---
 drivers/infiniband/hw/mlx5/main.c     |  5 ++--
 drivers/infiniband/hw/mlx5/qp.c       |  7 ++---
 5 files changed, 26 insertions(+), 42 deletions(-)

diff --git a/drivers/infiniband/hw/efa/efa_verbs.c b/drivers/infiniband/hw/efa/efa_verbs.c
index 7bd0838ebc99e4..3ad5d6e27b1590 100644
--- a/drivers/infiniband/hw/efa/efa_verbs.c
+++ b/drivers/infiniband/hw/efa/efa_verbs.c
@@ -218,12 +218,9 @@ int efa_query_device(struct ib_device *ibdev,
 	struct efa_dev *dev = to_edev(ibdev);
 	int err;
 
-	if (udata && udata->inlen &&
-	    !ib_is_udata_cleared(udata, 0, udata->inlen)) {
-		ibdev_dbg(ibdev,
-			  "Incompatible ABI params, udata not cleared\n");
-		return -EINVAL;
-	}
+	err = ib_is_udata_in_empty(udata);
+	if (err)
+		return err;
 
 	dev_attr = &dev->dev_attr;
 
@@ -433,13 +430,9 @@ int efa_alloc_pd(struct ib_pd *ibpd, struct ib_udata *udata)
 	struct efa_pd *pd = to_epd(ibpd);
 	int err;
 
-	if (udata->inlen &&
-	    !ib_is_udata_cleared(udata, 0, udata->inlen)) {
-		ibdev_dbg(&dev->ibdev,
-			  "Incompatible ABI params, udata not cleared\n");
-		err = -EINVAL;
+	err = ib_is_udata_in_empty(udata);
+	if (err)
 		goto err_out;
-	}
 
 	err = efa_com_alloc_pd(&dev->edev, &result);
 	if (err)
@@ -982,12 +975,9 @@ int efa_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *qp_attr,
 	if (qp_attr_mask & ~IB_QP_ATTR_STANDARD_BITS)
 		return -EOPNOTSUPP;
 
-	if (udata->inlen &&
-	    !ib_is_udata_cleared(udata, 0, udata->inlen)) {
-		ibdev_dbg(&dev->ibdev,
-			  "Incompatible ABI params, udata not cleared\n");
-		return -EINVAL;
-	}
+	err = ib_is_udata_in_empty(udata);
+	if (err)
+		return err;
 
 	cur_state = qp_attr_mask & IB_QP_CUR_STATE ? qp_attr->cur_qp_state :
 						     qp->state;
@@ -1612,13 +1602,11 @@ static struct efa_mr *efa_alloc_mr(struct ib_pd *ibpd, int access_flags,
 	struct efa_dev *dev = to_edev(ibpd->device);
 	int supp_access_flags;
 	struct efa_mr *mr;
+	int ret;
 
-	if (udata && udata->inlen &&
-	    !ib_is_udata_cleared(udata, 0, udata->inlen)) {
-		ibdev_dbg(&dev->ibdev,
-			  "Incompatible ABI params, udata not cleared\n");
-		return ERR_PTR(-EINVAL);
-	}
+	ret = ib_is_udata_in_empty(udata);
+	if (ret)
+		return ERR_PTR(ret);
 
 	supp_access_flags =
 		IB_ACCESS_LOCAL_WRITE |
@@ -2082,12 +2070,9 @@ int efa_create_ah(struct ib_ah *ibah,
 		goto err_out;
 	}
 
-	if (udata->inlen &&
-	    !ib_is_udata_cleared(udata, 0, udata->inlen)) {
-		ibdev_dbg(&dev->ibdev, "Incompatible ABI params\n");
-		err = -EINVAL;
+	err = ib_is_udata_in_empty(udata);
+	if (err)
 		goto err_out;
-	}
 
 	memcpy(params.dest_addr, ah_attr->grh.dgid.raw,
 	       sizeof(params.dest_addr));
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 464c9ab4251636..16e9ce8138cb30 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -1696,9 +1696,9 @@ static struct ib_flow *mlx4_ib_create_flow(struct ib_qp *qp,
 	    (flow_attr->type != IB_FLOW_ATTR_NORMAL))
 		return ERR_PTR(-EOPNOTSUPP);
 
-	if (udata &&
-	    udata->inlen && !ib_is_udata_cleared(udata, 0, udata->inlen))
-		return ERR_PTR(-EOPNOTSUPP);
+	err = ib_is_udata_in_empty(udata);
+	if (err)
+		return ERR_PTR(err);
 
 	memset(type, 0, sizeof(type));
 
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index 790be09d985a1a..aca8a985ce33cd 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -4297,10 +4297,9 @@ int mlx4_ib_create_rwq_ind_table(struct ib_rwq_ind_table *rwq_ind_table,
 	size_t min_resp_len;
 	int i, err = 0;
 
-	if (udata->inlen > 0 &&
-	    !ib_is_udata_cleared(udata, 0,
-				 udata->inlen))
-		return -EOPNOTSUPP;
+	err = ib_is_udata_in_empty(udata);
+	if (err)
+		return err;
 
 	min_resp_len = offsetof(typeof(resp), reserved) + sizeof(resp.reserved);
 	if (udata->outlen && udata->outlen < min_resp_len)
diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c
index e02bfb1479f5c3..7d435cf5a2fdae 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -964,8 +964,9 @@ static int mlx5_ib_query_device(struct ib_device *ibdev,
 
 	resp.response_length = resp_len;
 
-	if (uhw && uhw->inlen && !ib_is_udata_cleared(uhw, 0, uhw->inlen))
-		return -EINVAL;
+	err = ib_is_udata_in_empty(uhw);
+	if (err)
+		return err;
 
 	memset(props, 0, sizeof(*props));
 	err = mlx5_query_system_image_guid(ibdev,
diff --git a/drivers/infiniband/hw/mlx5/qp.c b/drivers/infiniband/hw/mlx5/qp.c
index 8f50e7342a7694..81d98b5010f1ca 100644
--- a/drivers/infiniband/hw/mlx5/qp.c
+++ b/drivers/infiniband/hw/mlx5/qp.c
@@ -5533,10 +5533,9 @@ int mlx5_ib_create_rwq_ind_table(struct ib_rwq_ind_table *ib_rwq_ind_table,
 	u32 *in;
 	void *rqtc;
 
-	if (udata->inlen > 0 &&
-	    !ib_is_udata_cleared(udata, 0,
-				 udata->inlen))
-		return -EOPNOTSUPP;
+	err = ib_is_udata_in_empty(udata);
+	if (err)
+		return err;
 
 	if (init_attr->log_ind_tbl_size >
 	    MLX5_CAP_GEN(dev->mdev, log_max_rqt_size)) {
-- 
2.43.0


* [PATCH 10/10] RDMA: Replace memset with = {} pattern for ib_respond_udata()
From: Jason Gunthorpe @ 2026-04-06 12:11 UTC (permalink / raw)
  To: Abhijit Gangurde, Allen Hubbe,
	Broadcom internal kernel review list, Bernard Metzler,
	Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
	Gal Pressman, Junxian Huang, Kai Shen, Kalesh AP,
	Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
	linux-hyperv, linux-rdma, Long Li, Michal Kalderon,
	Michael Margolin, Nelson Escobar, Satish Kharat, Selvin Xavier,
	Yossi Leybovich, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
	Yishai Hadas
  Cc: patches
In-Reply-To: <0-v1-e911b76a94d1+65d95-rdma_udata_rep_jgg@nvidia.com>

Most drivers already zero their response struct with = {}, but some
open-code a memset(). Switch all instances found. qedr_copy_qp_uresp()
is already called with zeroed memory, so its memset() is redundant.

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/hw/cxgb4/cq.c             |  3 +--
 drivers/infiniband/hw/cxgb4/qp.c             |  6 ++----
 drivers/infiniband/hw/erdma/erdma_verbs.c    |  4 +---
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c  | 12 ++++--------
 drivers/infiniband/hw/qedr/verbs.c           |  6 +-----
 drivers/infiniband/hw/usnic/usnic_ib_verbs.c |  4 +---
 6 files changed, 10 insertions(+), 25 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cq.c b/drivers/infiniband/hw/cxgb4/cq.c
index 47508df4cec023..d1517f2560b981 100644
--- a/drivers/infiniband/hw/cxgb4/cq.c
+++ b/drivers/infiniband/hw/cxgb4/cq.c
@@ -1004,7 +1004,7 @@ int c4iw_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
 	struct c4iw_dev *rhp = to_c4iw_dev(ibcq->device);
 	struct c4iw_cq *chp = to_c4iw_cq(ibcq);
 	struct c4iw_create_cq ucmd;
-	struct c4iw_create_cq_resp uresp;
+	struct c4iw_create_cq_resp uresp = {};
 	int ret, wr_len;
 	size_t memsize, hwentries;
 	struct c4iw_mm_entry *mm, *mm2;
@@ -1102,7 +1102,6 @@ int c4iw_create_cq(struct ib_cq *ibcq, const struct ib_cq_init_attr *attr,
 		if (!mm2)
 			goto err_free_mm;
 
-		memset(&uresp, 0, sizeof(uresp));
 		uresp.qid_mask = rhp->rdev.cqmask;
 		uresp.cqid = chp->cq.cqid;
 		uresp.size = chp->cq.size;
diff --git a/drivers/infiniband/hw/cxgb4/qp.c b/drivers/infiniband/hw/cxgb4/qp.c
index f9c7030ac6bfd0..e295f79e0cd3e5 100644
--- a/drivers/infiniband/hw/cxgb4/qp.c
+++ b/drivers/infiniband/hw/cxgb4/qp.c
@@ -2120,7 +2120,7 @@ int c4iw_create_qp(struct ib_qp *qp, struct ib_qp_init_attr *attrs,
 	struct c4iw_pd *php;
 	struct c4iw_cq *schp;
 	struct c4iw_cq *rchp;
-	struct c4iw_create_qp_resp uresp;
+	struct c4iw_create_qp_resp uresp = {};
 	unsigned int sqsize, rqsize = 0;
 	struct c4iw_ucontext *ucontext = rdma_udata_to_drv_context(
 		udata, struct c4iw_ucontext, ibucontext);
@@ -2242,7 +2242,6 @@ int c4iw_create_qp(struct ib_qp *qp, struct ib_qp_init_attr *attrs,
 				goto err_free_sq_db_key;
 			}
 		}
-		memset(&uresp, 0, sizeof(uresp));
 		if (t4_sq_onchip(&qhp->wq.sq)) {
 			ma_sync_key_mm = kmalloc_obj(*ma_sync_key_mm);
 			if (!ma_sync_key_mm) {
@@ -2686,7 +2685,7 @@ int c4iw_create_srq(struct ib_srq *ib_srq, struct ib_srq_init_attr *attrs,
 	struct c4iw_dev *rhp;
 	struct c4iw_srq *srq = to_c4iw_srq(ib_srq);
 	struct c4iw_pd *php;
-	struct c4iw_create_srq_resp uresp;
+	struct c4iw_create_srq_resp uresp = {};
 	struct c4iw_ucontext *ucontext;
 	struct c4iw_mm_entry *srq_key_mm, *srq_db_key_mm;
 	int rqsize;
@@ -2764,7 +2763,6 @@ int c4iw_create_srq(struct ib_srq *ib_srq, struct ib_srq_init_attr *attrs,
 			ret = -ENOMEM;
 			goto err_free_srq_key_mm;
 		}
-		memset(&uresp, 0, sizeof(uresp));
 		uresp.flags = srq->flags;
 		uresp.qid_mask = rhp->rdev.qpmask;
 		uresp.srqid = srq->wq.qid;
diff --git a/drivers/infiniband/hw/erdma/erdma_verbs.c b/drivers/infiniband/hw/erdma/erdma_verbs.c
index c8a35337ba51e8..b59c2e3a5306d1 100644
--- a/drivers/infiniband/hw/erdma/erdma_verbs.c
+++ b/drivers/infiniband/hw/erdma/erdma_verbs.c
@@ -996,7 +996,7 @@ int erdma_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *attrs,
 	struct erdma_ucontext *uctx = rdma_udata_to_drv_context(
 		udata, struct erdma_ucontext, ibucontext);
 	struct erdma_ureq_create_qp ureq;
-	struct erdma_uresp_create_qp uresp;
+	struct erdma_uresp_create_qp uresp = {};
 	void *old_entry;
 	int ret = 0;
 
@@ -1048,8 +1048,6 @@ int erdma_create_qp(struct ib_qp *ibqp, struct ib_qp_init_attr *attrs,
 		if (ret)
 			goto err_out_xa;
 
-		memset(&uresp, 0, sizeof(uresp));
-
 		uresp.num_sqe = qp->attrs.sq_size;
 		uresp.num_rqe = qp->attrs.rq_size;
 		uresp.qp_id = QP_ID(qp);
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
index 083f23fc687b31..d5fdbd7c8dea26 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_verbs.c
@@ -586,11 +586,10 @@ static int ocrdma_copy_pd_uresp(struct ocrdma_dev *dev, struct ocrdma_pd *pd,
 	u64 db_page_addr;
 	u64 dpp_page_addr = 0;
 	u32 db_page_size;
-	struct ocrdma_alloc_pd_uresp rsp;
+	struct ocrdma_alloc_pd_uresp rsp = {};
 	struct ocrdma_ucontext *uctx = rdma_udata_to_drv_context(
 		udata, struct ocrdma_ucontext, ibucontext);
 
-	memset(&rsp, 0, sizeof(rsp));
 	rsp.id = pd->id;
 	rsp.dpp_enabled = pd->dpp_enabled;
 	db_page_addr = ocrdma_get_db_addr(dev, pd->id);
@@ -930,13 +929,12 @@ static int ocrdma_copy_cq_uresp(struct ocrdma_dev *dev, struct ocrdma_cq *cq,
 	int status;
 	struct ocrdma_ucontext *uctx = rdma_udata_to_drv_context(
 		udata, struct ocrdma_ucontext, ibucontext);
-	struct ocrdma_create_cq_uresp uresp;
+	struct ocrdma_create_cq_uresp uresp = {};
 
 	/* this must be user flow! */
 	if (!udata)
 		return -EINVAL;
 
-	memset(&uresp, 0, sizeof(uresp));
 	uresp.cq_id = cq->id;
 	uresp.page_size = PAGE_ALIGN(cq->len);
 	uresp.num_pages = 1;
@@ -1173,11 +1171,10 @@ static int ocrdma_copy_qp_uresp(struct ocrdma_qp *qp,
 {
 	int status;
 	u64 usr_db;
-	struct ocrdma_create_qp_uresp uresp;
+	struct ocrdma_create_qp_uresp uresp = {};
 	struct ocrdma_pd *pd = qp->pd;
 	struct ocrdma_dev *dev = get_ocrdma_dev(pd->ibpd.device);
 
-	memset(&uresp, 0, sizeof(uresp));
 	usr_db = dev->nic_info.unmapped_db +
 			(pd->id * dev->nic_info.db_page_size);
 	uresp.qp_id = qp->id;
@@ -1730,9 +1727,8 @@ static int ocrdma_copy_srq_uresp(struct ocrdma_dev *dev, struct ocrdma_srq *srq,
 				struct ib_udata *udata)
 {
 	int status;
-	struct ocrdma_create_srq_uresp uresp;
+	struct ocrdma_create_srq_uresp uresp = {};
 
-	memset(&uresp, 0, sizeof(uresp));
 	uresp.rq_dbid = srq->rq.dbid;
 	uresp.num_rq_pages = 1;
 	uresp.rq_page_addr[0] = virt_to_phys(srq->rq.va);
diff --git a/drivers/infiniband/hw/qedr/verbs.c b/drivers/infiniband/hw/qedr/verbs.c
index 72ee57dc85687e..c020f882d1875c 100644
--- a/drivers/infiniband/hw/qedr/verbs.c
+++ b/drivers/infiniband/hw/qedr/verbs.c
@@ -691,9 +691,7 @@ static int qedr_copy_cq_uresp(struct qedr_dev *dev,
 			      struct qedr_cq *cq, struct ib_udata *udata,
 			      u32 db_offset)
 {
-	struct qedr_create_cq_uresp uresp;
-
-	memset(&uresp, 0, sizeof(uresp));
+	struct qedr_create_cq_uresp uresp = {};
 
 	uresp.db_offset = db_offset;
 	uresp.icid = cq->icid;
@@ -1284,8 +1282,6 @@ static int qedr_copy_qp_uresp(struct qedr_dev *dev,
 			      struct qedr_qp *qp, struct ib_udata *udata,
 			      struct qedr_create_qp_uresp *uresp)
 {
-	memset(uresp, 0, sizeof(*uresp));
-
 	if (qedr_qp_has_sq(qp))
 		qedr_copy_sq_uresp(dev, uresp, qp);
 
diff --git a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
index e887f03a84d063..261f18a8368543 100644
--- a/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
+++ b/drivers/infiniband/hw/usnic/usnic_ib_verbs.c
@@ -82,15 +82,13 @@ static void usnic_ib_fw_string_to_u64(char *fw_ver_str, u64 *fw_ver)
 static int usnic_ib_fill_create_qp_resp(struct usnic_ib_qp_grp *qp_grp,
 					struct ib_udata *udata)
 {
-	struct usnic_ib_create_qp_resp resp;
+	struct usnic_ib_create_qp_resp resp = {};
 	struct pci_dev *pdev;
 	struct vnic_dev_bar *bar;
 	struct usnic_vnic_res_chunk *chunk;
 	struct usnic_ib_qp_grp_flow *default_flow;
 	int i, err;
 
-	memset(&resp, 0, sizeof(resp));
-
 	pdev = usnic_vnic_get_pdev(qp_grp->vf->vnic);
 	if (!pdev) {
 		usnic_err("Failed to get pdev of qp_grp %d\n",
-- 
2.43.0


* [PATCH 02/10] IB/rdmavt: Don't abuse udata and ib_respond_udata()
From: Jason Gunthorpe @ 2026-04-06 12:11 UTC (permalink / raw)
  To: Abhijit Gangurde, Allen Hubbe,
	Broadcom internal kernel review list, Bernard Metzler,
	Potnuri Bharat Teja, Bryan Tan, Cheng Xu, Dennis Dalessandro,
	Gal Pressman, Junxian Huang, Kai Shen, Kalesh AP,
	Konstantin Taranov, Krzysztof Czurylo, Leon Romanovsky,
	linux-hyperv, linux-rdma, Long Li, Michal Kalderon,
	Michael Margolin, Nelson Escobar, Satish Kharat, Selvin Xavier,
	Yossi Leybovich, Chengchang Tang, Tatyana Nikolova, Vishnu Dasa,
	Yishai Hadas
  Cc: patches
In-Reply-To: <0-v1-e911b76a94d1+65d95-rdma_udata_rep_jgg@nvidia.com>

Use copy_to_user() directly since the data is not being placed in the
udata response memory.

It is unclear why this is trying to do two copies, but leave it alone.
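The address handling can be illustrated with a user-space analogue. In the kernel, u64_to_user_ptr() turns a 64-bit integer supplied by user space into a user pointer, and copy_to_user() performs the checked write. The sketch below only mimics the integer-to-pointer step, with memcpy() standing in for the copy. demo_write_offset() is a hypothetical name.

```c
#include <stdint.h>
#include <string.h>

/*
 * User-space analogue of writing a value through an address that arrived
 * as a 64-bit integer. The kernel code uses u64_to_user_ptr() and
 * copy_to_user(); memcpy() stands in for the latter here.
 */
static void demo_write_offset(uint64_t addr, uint64_t offset)
{
	void *dst = (void *)(uintptr_t)addr;

	memcpy(dst, &offset, sizeof(offset));
}
```

Passing the address of a local uint64_t (cast through uintptr_t) and an offset value stores that value in the local, which is the same shape as the mmap offset being written back to the address user space supplied.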

Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
---
 drivers/infiniband/sw/rdmavt/srq.c | 19 +++++++++----------
 1 file changed, 9 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/sw/rdmavt/srq.c b/drivers/infiniband/sw/rdmavt/srq.c
index fe125bf85b2726..d022aa56c5bfd5 100644
--- a/drivers/infiniband/sw/rdmavt/srq.c
+++ b/drivers/infiniband/sw/rdmavt/srq.c
@@ -128,6 +128,7 @@ int rvt_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
 	struct rvt_srq *srq = ibsrq_to_rvtsrq(ibsrq);
 	struct rvt_dev_info *dev = ib_to_rvt(ibsrq->device);
 	struct rvt_rq tmp_rq = {};
+	__u64 offset_addr;
 	int ret = 0;
 
 	if (attr_mask & IB_SRQ_MAX_WR) {
@@ -149,19 +150,17 @@ int rvt_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
 			return -ENOMEM;
 		/* Check that we can write the offset to mmap. */
 		if (udata && udata->inlen >= sizeof(__u64)) {
-			__u64 offset_addr;
 			__u64 offset = 0;
 
 			ret = ib_copy_from_udata(&offset_addr, udata,
 						 sizeof(offset_addr));
 			if (ret)
 				goto bail_free;
-			udata->outbuf = (void __user *)
-					(unsigned long)offset_addr;
-			ret = ib_copy_to_udata(udata, &offset,
-					       sizeof(offset));
-			if (ret)
+			if (copy_to_user(u64_to_user_ptr(offset_addr), &offset,
+					 sizeof(offset))) {
+				ret = -EFAULT;
 				goto bail_free;
+			}
 		}
 
 		spin_lock_irq(&srq->rq.kwq->c_lock);
@@ -236,10 +235,10 @@ int rvt_modify_srq(struct ib_srq *ibsrq, struct ib_srq_attr *attr,
 			 * See rvt_mmap() for details.
 			 */
 			if (udata && udata->inlen >= sizeof(__u64)) {
-				ret = ib_copy_to_udata(udata, &ip->offset,
-						       sizeof(ip->offset));
-				if (ret)
-					return ret;
+				if (copy_to_user(u64_to_user_ptr(offset_addr),
+						 &ip->offset,
+						 sizeof(ip->offset)))
+					return -EFAULT;
 			}
 
 			/*
-- 
2.43.0

