* RE: [EXTERNAL] Re: [PATCH V2,net-next, 1/2] net: mana: Add support for coalesced RX packets on CQE
From: Haiyang Zhang @ 2026-01-17 18:01 UTC
To: Jakub Kicinski
Cc: Haiyang Zhang, linux-hyperv@vger.kernel.org,
netdev@vger.kernel.org, KY Srinivasan, Wei Liu, Dexuan Cui,
Long Li, Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
Shradha Gupta, Saurabh Sengar, Aditya Garg, Dipayaan Roy,
Shiraz Saleem, linux-kernel@vger.kernel.org,
linux-rdma@vger.kernel.org, Paul Rosswurm
In-Reply-To: <20260117085850.0ece5765@kernel.org>
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Saturday, January 17, 2026 11:59 AM
> To: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: Haiyang Zhang <haiyangz@linux.microsoft.com>; linux-
> hyperv@vger.kernel.org; netdev@vger.kernel.org; KY Srinivasan
> <kys@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> <DECUI@microsoft.com>; Long Li <longli@microsoft.com>; Andrew Lunn
> <andrew+netdev@lunn.ch>; David S. Miller <davem@davemloft.net>; Eric
> Dumazet <edumazet@google.com>; Paolo Abeni <pabeni@redhat.com>; Konstantin
> Taranov <kotaranov@microsoft.com>; Simon Horman <horms@kernel.org>; Erni
> Sri Satya Vennela <ernis@linux.microsoft.com>; Shradha Gupta
> <shradhagupta@linux.microsoft.com>; Saurabh Sengar
> <ssengar@linux.microsoft.com>; Aditya Garg
> <gargaditya@linux.microsoft.com>; Dipayaan Roy
> <dipayanroy@linux.microsoft.com>; Shiraz Saleem
> <shirazsaleem@microsoft.com>; linux-kernel@vger.kernel.org; linux-
> rdma@vger.kernel.org; Paul Rosswurm <paulros@microsoft.com>
> Subject: Re: [EXTERNAL] Re: [PATCH V2,net-next, 1/2] net: mana: Add
> support for coalesced RX packets on CQE
>
> On Fri, 16 Jan 2026 16:44:33 +0000 Haiyang Zhang wrote:
> > > You need to add a new param to the uAPI.
> >
> > Since this feature is not common to other NICs, can we use an
> > ethtool private flag instead?
>
> It's extremely common. Descriptor writeback at the granularity of one
> packet would kill PCIe performance. We just don't have uAPI so NICs
> either don't expose the knob or "reuse" another coalescing param.
I see. So how about adding a new param like below to "ethtool -C"?
ethtool -C|--coalesce devname [rx-cqe-coalesce on|off]
> > When the flag is set, the CQE coalescing will be enabled and put
> > up to 4 pkts in a CQE.
> >
> > > Please add both size and
> > > timeout. Expose the timeout as read only if your device doesn't support
> > > controlling it per queue.
> >
> > Does the "size" mean the max pks per CQE (1 or 4)?
>
> The definition of "size" is always a little funny when it comes to
> coalescing and ringparam. In Tx does one frame mean one wire frame
> or one TSO superframe? I wouldn't worry about the exact meaning of
> size too much. Important thing is that user knows what making this
> param smaller or larger will do.
In "ethtool -c" output, add a new value like this?
rx-cqe-frames: (1 or 4 frames/CQE for this NIC)
> > The timeout value is not even exposed to driver, and subject to change
> > in the future. Also the HW mechanism is proprietary... So, can we not
> > "expose" the timeout value in "ethtool -c" outputs, because it's not
> > available at driver level?
>
> Add it to the FW API and have FW send the current value to the driver?
I don't know where the timeout value lives in the HW / FW layers. Adding
new info to the HW/FW API needs another team's approval and their work,
which will be a complex process and take a long time.
> You were concerned (in the commit msg) that there's a latency cost,
> which is fair but I think for 99% of users 2usec is absolutely
> not detectable (it takes longer for the CPU to wake). So I think it'd
> be very valuable to the user to understand the order of magnitude of
> latency we're talking about here.
For now, may I document the 2us in the patch description? And add a
new item to the "ethtool -c" output, like "rx-cqe-usecs", and label it as
"n/a" for now, while we work with other teams on the timeout value
API at the HW/FW layers? That way, support for this CQE coalescing
feature won't be blocked for a long time by the "2usec" info API.
Thanks,
- Haiyang
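To make the size knob discussed above concrete, here is a minimal sketch of how the number of completion entries changes with the frames-per-CQE setting. It is a user-space model under stated assumptions, not the MANA hardware format: the helper name cqes_needed() and the MAX_PKTS_PER_CQE constant are hypothetical, and the hardware timeout is ignored (packets are assumed to arrive back to back).

```c
#include <stddef.h>

/* Hypothetical limit for this sketch: the thread mentions up to 4
 * packets per CQE for this NIC. */
#define MAX_PKTS_PER_CQE 4

/*
 * Number of CQEs written for npkts received packets when each CQE may
 * report up to pkts_per_cqe packets (1 means coalescing is off).
 * Returns 0 for a setting this sketch treats as invalid.
 */
static size_t cqes_needed(size_t npkts, unsigned int pkts_per_cqe)
{
	if (pkts_per_cqe == 0 || pkts_per_cqe > MAX_PKTS_PER_CQE)
		return 0;

	/* Round up: a partially filled CQE still costs one writeback. */
	return (npkts + pkts_per_cqe - 1) / pkts_per_cqe;
}
```

With this model, 1000 back-to-back packets cost 1000 completion writebacks with a setting of 1 and 250 with a setting of 4, which is the kind of order-of-magnitude effect a user would want to understand when changing the parameter.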
* Re: [EXTERNAL] Re: [PATCH V2,net-next, 1/2] net: mana: Add support for coalesced RX packets on CQE
From: Jakub Kicinski @ 2026-01-17 16:58 UTC
To: Haiyang Zhang
Cc: Haiyang Zhang, linux-hyperv@vger.kernel.org,
netdev@vger.kernel.org, KY Srinivasan, Wei Liu, Dexuan Cui,
Long Li, Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
Shradha Gupta, Saurabh Sengar, Aditya Garg, Dipayaan Roy,
Shiraz Saleem, linux-kernel@vger.kernel.org,
linux-rdma@vger.kernel.org, Paul Rosswurm
In-Reply-To: <SA3PR21MB3867B98BBA96FF3BA7F42F3FCA8DA@SA3PR21MB3867.namprd21.prod.outlook.com>
On Fri, 16 Jan 2026 16:44:33 +0000 Haiyang Zhang wrote:
> > You need to add a new param to the uAPI.
>
> Since this feature is not common to other NICs, can we use an
> ethtool private flag instead?
It's extremely common. Descriptor writeback at the granularity of one
packet would kill PCIe performance. We just don't have uAPI so NICs
either don't expose the knob or "reuse" another coalescing param.
> When the flag is set, the CQE coalescing will be enabled and put
> up to 4 pkts in a CQE.
>
> > Please add both size and
> > timeout. Expose the timeout as read only if your device doesn't support
> > controlling it per queue.
>
> Does the "size" mean the max pks per CQE (1 or 4)?
The definition of "size" is always a little funny when it comes to
coalescing and ringparam. In Tx does one frame mean one wire frame
or one TSO superframe? I wouldn't worry about the exact meaning of
size too much. Important thing is that user knows what making this
param smaller or larger will do.
> The timeout value is not even exposed to driver, and subject to change
> in the future. Also the HW mechanism is proprietary... So, can we not
> "expose" the timeout value in "ethtool -c" outputs, because it's not
> available at driver level?
Add it to the FW API and have FW send the current value to the driver?
You were concerned (in the commit msg) that there's a latency cost,
which is fair but I think for 99% of users 2usec is absolutely
not detectable (it takes longer for the CPU to wake). So I think it'd
be very valuable to the user to understand the order of magnitude of
latency we're talking about here.
* Re: [PATCH v1] mshv: make certain field names descriptive in a header struct
From: Anirudh Rayabharam @ 2026-01-17 10:54 UTC
To: Mukesh R; +Cc: linux-hyperv, wei.liu, nunodasneves
In-Reply-To: <1ac21e5b-5ff9-0ff9-c886-33997ad3f7da@linux.microsoft.com>
On Thu, Jan 15, 2026 at 11:03:36AM -0800, Mukesh R wrote:
> On 1/15/26 10:51, Anirudh Rayabharam wrote:
> > On Mon, Jan 12, 2026 at 11:49:43AM -0800, Mukesh Rathor wrote:
> > > When header struct fields use very common names like "pages" or "type",
> > > it makes it difficult to find uses of these fields with tools like grep
> > > and cscope. Add the prefix mreg_ to some fields in struct
> > > mshv_mem_region to make it easier to find them.
> > >
> > > There is no functional change.
> > >
> > > Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
> > > ---
> > > drivers/hv/mshv_regions.c | 44 ++++++++++++++++++-------------------
> > > drivers/hv/mshv_root.h | 6 ++---
> > > drivers/hv/mshv_root_main.c | 10 ++++-----
> > > 3 files changed, 30 insertions(+), 30 deletions(-)
> > >
> > > diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
> > > index 202b9d551e39..af81405f859b 100644
> > > --- a/drivers/hv/mshv_regions.c
> > > +++ b/drivers/hv/mshv_regions.c
> > > @@ -52,7 +52,7 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
> > > struct page *page;
> > > int ret;
> > > - page = region->pages[page_offset];
> > > + page = region->mreg_pages[page_offset];
> > > if (!page)
> > > return -EINVAL;
> > > @@ -65,7 +65,7 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
> > > /* Start at stride since the first page is validated */
> > > for (count = stride; count < page_count; count += stride) {
> > > - page = region->pages[page_offset + count];
> > > + page = region->mreg_pages[page_offset + count];
> > > /* Break if current page is not present */
> > > if (!page)
> > > @@ -117,7 +117,7 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
> > > while (page_count) {
> > > /* Skip non-present pages */
> > > - if (!region->pages[page_offset]) {
> > > + if (!region->mreg_pages[page_offset]) {
> > > page_offset++;
> > > page_count--;
> > > continue;
> > > @@ -164,13 +164,13 @@ static int mshv_region_chunk_share(struct mshv_mem_region *region,
> > > u32 flags,
> > > u64 page_offset, u64 page_count)
> > > {
> > > - struct page *page = region->pages[page_offset];
> > > + struct page *page = region->mreg_pages[page_offset];
> > > if (PageHuge(page) || PageTransCompound(page))
> > > flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
> > > return hv_call_modify_spa_host_access(region->partition->pt_id,
> > > - region->pages + page_offset,
> > > + region->mreg_pages + page_offset,
> > > page_count,
> > > HV_MAP_GPA_READABLE |
> > > HV_MAP_GPA_WRITABLE,
> > > @@ -190,13 +190,13 @@ static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
> > > u32 flags,
> > > u64 page_offset, u64 page_count)
> > > {
> > > - struct page *page = region->pages[page_offset];
> > > + struct page *page = region->mreg_pages[page_offset];
> > > if (PageHuge(page) || PageTransCompound(page))
> > > flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
> > > return hv_call_modify_spa_host_access(region->partition->pt_id,
> > > - region->pages + page_offset,
> > > + region->mreg_pages + page_offset,
> > > page_count, 0,
> > > flags, false);
> > > }
> > > @@ -214,7 +214,7 @@ static int mshv_region_chunk_remap(struct mshv_mem_region *region,
> > > u32 flags,
> > > u64 page_offset, u64 page_count)
> > > {
> > > - struct page *page = region->pages[page_offset];
> > > + struct page *page = region->mreg_pages[page_offset];
> > > if (PageHuge(page) || PageTransCompound(page))
> > > flags |= HV_MAP_GPA_LARGE_PAGE;
> > > @@ -222,7 +222,7 @@ static int mshv_region_chunk_remap(struct mshv_mem_region *region,
> > > return hv_call_map_gpa_pages(region->partition->pt_id,
> > > region->start_gfn + page_offset,
> > > page_count, flags,
> > > - region->pages + page_offset);
> > > + region->mreg_pages + page_offset);
> > > }
> > > static int mshv_region_remap_pages(struct mshv_mem_region *region,
> > > @@ -245,10 +245,10 @@ int mshv_region_map(struct mshv_mem_region *region)
> > > static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
> > > u64 page_offset, u64 page_count)
> > > {
> > > - if (region->type == MSHV_REGION_TYPE_MEM_PINNED)
> > > - unpin_user_pages(region->pages + page_offset, page_count);
> > > + if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
> > > + unpin_user_pages(region->mreg_pages + page_offset, page_count);
> > > - memset(region->pages + page_offset, 0,
> > > + memset(region->mreg_pages + page_offset, 0,
> > > page_count * sizeof(struct page *));
> > > }
> > > @@ -265,7 +265,7 @@ int mshv_region_pin(struct mshv_mem_region *region)
> > > int ret;
> > > for (done_count = 0; done_count < region->nr_pages; done_count += ret) {
> > > - pages = region->pages + done_count;
> > > + pages = region->mreg_pages + done_count;
> > > userspace_addr = region->start_uaddr +
> > > done_count * HV_HYP_PAGE_SIZE;
> > > nr_pages = min(region->nr_pages - done_count,
> > > @@ -297,7 +297,7 @@ static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
> > > u32 flags,
> > > u64 page_offset, u64 page_count)
> > > {
> > > - struct page *page = region->pages[page_offset];
> > > + struct page *page = region->mreg_pages[page_offset];
> > > if (PageHuge(page) || PageTransCompound(page))
> > > flags |= HV_UNMAP_GPA_LARGE_PAGE;
> > > @@ -321,7 +321,7 @@ static void mshv_region_destroy(struct kref *ref)
> > > struct mshv_partition *partition = region->partition;
> > > int ret;
> > > - if (region->type == MSHV_REGION_TYPE_MEM_MOVABLE)
> > > + if (region->mreg_type == MSHV_REGION_TYPE_MEM_MOVABLE)
> > > mshv_region_movable_fini(region);
> > > if (mshv_partition_encrypted(partition)) {
> > > @@ -374,9 +374,9 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
> > > int ret;
> > > range->notifier_seq = mmu_interval_read_begin(range->notifier);
> > > - mmap_read_lock(region->mni.mm);
> > > + mmap_read_lock(region->mreg_mni.mm);
> > > ret = hmm_range_fault(range);
> > > - mmap_read_unlock(region->mni.mm);
> > > + mmap_read_unlock(region->mreg_mni.mm);
> > > if (ret)
> > > return ret;
> > > @@ -407,7 +407,7 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
> > > u64 page_offset, u64 page_count)
> > > {
> > > struct hmm_range range = {
> > > - .notifier = &region->mni,
> > > + .notifier = &region->mreg_mni,
> > > .default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
> > > };
> > > unsigned long *pfns;
> > > @@ -430,7 +430,7 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
> > > goto out;
> > > for (i = 0; i < page_count; i++)
> > > - region->pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
> > > + region->mreg_pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
> > > ret = mshv_region_remap_pages(region, region->hv_map_flags,
> > > page_offset, page_count);
> > > @@ -489,7 +489,7 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
> > > {
> > > struct mshv_mem_region *region = container_of(mni,
> > > struct mshv_mem_region,
> > > - mni);
> > > + mreg_mni);
> > > u64 page_offset, page_count;
> > > unsigned long mstart, mend;
> > > int ret = -EPERM;
> > > @@ -535,14 +535,14 @@ static const struct mmu_interval_notifier_ops mshv_region_mni_ops = {
> > > void mshv_region_movable_fini(struct mshv_mem_region *region)
> > > {
> > > - mmu_interval_notifier_remove(&region->mni);
> > > + mmu_interval_notifier_remove(&region->mreg_mni);
> > > }
> > > bool mshv_region_movable_init(struct mshv_mem_region *region)
> > > {
> > > int ret;
> > > - ret = mmu_interval_notifier_insert(&region->mni, current->mm,
> > > + ret = mmu_interval_notifier_insert(&region->mreg_mni, current->mm,
> > > region->start_uaddr,
> > > region->nr_pages << HV_HYP_PAGE_SHIFT,
> > > &mshv_region_mni_ops);
> > > diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
> > > index 3c1d88b36741..f5b6d3979e5a 100644
> > > --- a/drivers/hv/mshv_root.h
> > > +++ b/drivers/hv/mshv_root.h
> > > @@ -85,10 +85,10 @@ struct mshv_mem_region {
> > > u64 start_uaddr;
> > > u32 hv_map_flags;
> > > struct mshv_partition *partition;
> > > - enum mshv_region_type type;
> > > - struct mmu_interval_notifier mni;
> > > + enum mshv_region_type mreg_type;
> > > + struct mmu_interval_notifier mreg_mni;
> > > struct mutex mutex; /* protects region pages remapping */
> > > - struct page *pages[];
> > > + struct page *mreg_pages[];
> > > };
> > > struct mshv_irq_ack_notifier {
> > > diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> > > index 1134a82c7881..eff1b21461dc 100644
> > > --- a/drivers/hv/mshv_root_main.c
> > > +++ b/drivers/hv/mshv_root_main.c
> > > @@ -657,7 +657,7 @@ static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
> > > return false;
> > > /* Only movable memory ranges are supported for GPA intercepts */
> > > - if (region->type == MSHV_REGION_TYPE_MEM_MOVABLE)
> > > + if (region->mreg_type == MSHV_REGION_TYPE_MEM_MOVABLE)
> > > ret = mshv_region_handle_gfn_fault(region, gfn);
> > > else
> > > ret = false;
> > > @@ -1175,12 +1175,12 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
> > > return PTR_ERR(rg);
> > > if (is_mmio)
> > > - rg->type = MSHV_REGION_TYPE_MMIO;
> > > + rg->mreg_type = MSHV_REGION_TYPE_MMIO;
> > > else if (mshv_partition_encrypted(partition) ||
> > > !mshv_region_movable_init(rg))
> > > - rg->type = MSHV_REGION_TYPE_MEM_PINNED;
> > > + rg->mreg_type = MSHV_REGION_TYPE_MEM_PINNED;
> > > else
> > > - rg->type = MSHV_REGION_TYPE_MEM_MOVABLE;
> > > + rg->mreg_type = MSHV_REGION_TYPE_MEM_MOVABLE;
> > > rg->partition = partition;
> > > @@ -1297,7 +1297,7 @@ mshv_map_user_memory(struct mshv_partition *partition,
> > > if (ret)
> > > return ret;
> > > - switch (region->type) {
> > > + switch (region->mreg_type) {
> > > case MSHV_REGION_TYPE_MEM_PINNED:
> > > ret = mshv_prepare_pinned_region(region);
> > > break;
> > > --
> > > 2.51.2.vfs.0.1
> > >
> >
> > TBH, all these new names look ugly to me. Moreover, they are redundant.
> > For example, region->type makes it clear that we're talking about the
> > type *of a region*. Calling it mreg_type adds no additional semantic
> > information; it's just visual noise.
> >
> > Coming to the part about finding it via grep/cscope. You could have
> > easily found these references by searching for "region->type",
> > "region->mni" etc. Perhaps we can change the variable naming convention
> > i.e. call a struct mshv_mem_region "mreg" everywhere and then one could
> > grep for "mreg->mni" and so on. Also, using more powerful tools such as
> > LSPs (clangd) can help find references more easily without tripping up
> > on common terms like "type", "pages" etc.
>
> Huh! There is no way to enforce that one use ptrs with only certain names,
> and that is unreasonable requirement. What if the field is accessed by
> struct.field reference? Are you suggesting that struct naming be enforced?
> Ability to read code is far far more important to make sure bug free code
> is written, it is a very small price for a large benefit. One gets used to
> it so easily. Why do we prefix function names with mshv_ or hv_, should
We prefix function names with mshv_ or hv_ for namespacing. It indicates
that those functions belong to the respective subsystem/module. There is
no need for such prefixes for these struct fields because they are
already namespaced inside struct mshv_mem_region.
> we get rid of that also? And it's not just cscope or grep, sometimes
> you're dealing with corrupt binary or coredump and you use "strings"
Dealing with corrupt binaries/coredumps sounds like a far-fetched
use case. Does it really make sense to gear our code towards that?
> to get some meaning out of it. So I totally disagree with you. If you
> don't like mreg_, please suggest alternates that are easy to find.
My point was that there is no need for a prefix in the first place.
Thanks,
Anirudh.
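As a small illustration of the namespacing point above: in C, function names share one global namespace, while field names are scoped to their struct, so two structs can both have a "type" field without clashing. This is only a sketch; the struct and function names below are hypothetical stand-ins and not the real mshv definitions.

```c
#include <stddef.h>

/* Hypothetical stand-ins, not the kernel's definitions. */
enum sketch_region_type { SKETCH_REGION_PINNED, SKETCH_REGION_MOVABLE };
enum sketch_irq_type { SKETCH_IRQ_EDGE, SKETCH_IRQ_LEVEL };

/* Both structs use the field name "type"; each lives in its own scope. */
struct sketch_mem_region {
	enum sketch_region_type type;
	size_t nr_pages;
};

struct sketch_irq {
	enum sketch_irq_type type;
};

/* Functions are global symbols, hence the subsystem-style prefix. */
static int sketch_region_is_movable(const struct sketch_mem_region *region)
{
	return region->type == SKETCH_REGION_MOVABLE;
}

static int sketch_irq_is_level(const struct sketch_irq *irq)
{
	return irq->type == SKETCH_IRQ_LEVEL;
}
```

The compiler resolves region->type and irq->type from the pointer's struct type, which is the sense in which the fields are already namespaced. Whether a textual prefix is still worth having for grep-style searches is the open question in this thread.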
* Re: [PATCH 00/12] Recover sysfb after DRM probe failure
From: Zack Rusin @ 2026-01-17 6:02 UTC
To: Thomas Zimmermann
Cc: dri-devel, Alex Deucher, amd-gfx, Ard Biesheuvel, Ce Sun,
Chia-I Wu, Christian König, Danilo Krummrich, Dave Airlie,
Deepak Rawat, Dmitry Osipenko, Gerd Hoffmann, Gurchetan Singh,
Hans de Goede, Hawking Zhang, Helge Deller, intel-gfx, intel-xe,
Jani Nikula, Javier Martinez Canillas, Jocelyn Falempe,
Joonas Lahtinen, Lijo Lazar, linux-efi, linux-fbdev, linux-hyperv,
linux-kernel, Lucas De Marchi, Lyude Paul, Maarten Lankhorst,
Mario Limonciello (AMD), Mario Limonciello, Maxime Ripard,
nouveau, Rodrigo Vivi, Simona Vetter, spice-devel,
Thomas Hellström, Timur Kristóf, Tvrtko Ursulin,
virtualization, Vitaly Prosyak
In-Reply-To: <f3643c19-c250-4927-b39d-37d2494c7c84@suse.de>
On Fri, Jan 16, 2026 at 2:58 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
>
> Hi
>
> Am 16.01.26 um 04:59 schrieb Zack Rusin:
> > On Thu, Jan 15, 2026 at 6:02 AM Thomas Zimmermann <tzimmermann@suse.de> wrote:
> >> That's really not going to work. For example, in the current series, you
> >> invoke devm_aperture_remove_conflicting_pci_devices_done() after
> >> drm_mode_reset(), drm_dev_register() and drm_client_setup().
> > That's perfectly fine,
> > devm_aperture_remove_conflicting_pci_devices_done is removing the
> > reload behavior not doing anything.
> >
> > This series, essentially, just adds a "defer" statement to
> > aperture_remove_conflicting_pci_devices that says
> >
> > "reload sysfb if this driver unloads".
> >
> > devm_aperture_remove_conflicting_pci_devices_done just cancels that defer.
>
> Exactly. And if that reload happens after the hardware state has been
> changed, the result is undefined.
This is all predicated on drivers actually cleaning up after
themselves. I don't think any amount of good will or api design is
going to fix device specific state mismatches.
> The current recovery/reload is not reliable in any case. A number of
> high-profile devs have also said that it doesn't work with their driver.
> The same is true for ast. So the current approach is not going to happen.
>
> > There also might be the case of some crazy behavior, e.g. pci bar
> > resize in the driver makes the vga hardware crash or something, in
> > which case, yea, we should definitely skip this patch, at least until
> > those drivers properly cleanup on exit.
>
> There's nothing crazy here. It's standard probing code.
>
> If you want to to move forward, my suggestion is to look at the proposal
> with the aperture_funcs callbacks that control sysfb device access. And
> from there, build a full prototype with one or two drivers.
I don't think that approach is going to work. I don't think there's
anything that can be done if drivers didn't cleanup everything they've
done that might have broken sysfb on unload. I'm going to drop it
then, it's obviously a shame because it works fine with virtualized
drivers and they're ones that would likely profit from this the most
but I'm sceptical that I could do full system state set reset in a
generalized fashion for hw drivers or that the work required would be
worth the payoff.
z
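For readers following the "defer" description in this thread, the idea can be modelled as a cancellable cleanup action: arm it when the conflicting device is removed, run it if the driver goes away, and cancel it once probing has succeeded. The sketch below is a user-space model only; struct sketch_deferred and its helpers are hypothetical and do not match the aperture helpers in the series.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a cancellable "run on unload" action. */
struct sketch_deferred {
	void (*fn)(void *ctx);
	void *ctx;
	bool armed;
};

/* Arm the action, e.g. "reload sysfb if this driver unloads". */
static void sketch_defer_arm(struct sketch_deferred *d,
			     void (*fn)(void *ctx), void *ctx)
{
	d->fn = fn;
	d->ctx = ctx;
	d->armed = true;
}

/* Cancel the action, the role the "_done" call plays in the description. */
static void sketch_defer_cancel(struct sketch_deferred *d)
{
	d->armed = false;
}

/* Run on unload or probe failure; returns true if the action ran. */
static bool sketch_defer_run(struct sketch_deferred *d)
{
	if (!d->armed || !d->fn)
		return false;
	d->armed = false;	/* run at most once */
	d->fn(d->ctx);
	return true;
}

/* Example action for the sketch: count how often a reload was requested. */
static void sketch_count_reload(void *ctx)
{
	++*(int *)ctx;
}
```

The model says nothing about hardware state. The objection raised in the thread is exactly that running the action after the device has been reprogrammed may not be safe, which a bookkeeping structure like this cannot detect.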
* [Patch v2] scsi: storvsc: Process unsupported MODE_SENSE_10
From: longli @ 2026-01-17 1:03 UTC
To: K . Y . Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui,
James E . J . Bottomley, Martin K . Petersen, James Bottomley,
linux-hyperv, linux-scsi, linux-kernel
Cc: Long Li, stable, Michael Kelley
From: Long Li <longli@microsoft.com>
The Hyper-V host does not support MODE_SENSE_10 and MODE_SENSE.
The driver handles MODE_SENSE as an unsupported command, but not
MODE_SENSE_10. Add MODE_SENSE_10 to the same handling logic and
return the correct code to the SCSI layer.
Fixes: 89ae7d709357 ("Staging: hv: storvsc: Move the storage driver out of the staging area")
Cc: stable@kernel.org
Signed-off-by: Long Li <longli@microsoft.com>
Reviewed-by: Michael Kelley <mhklinux@outlook.com>
---
Change in v2:
Added MODE_SENSE_10 to the code comment
drivers/scsi/storvsc_drv.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/drivers/scsi/storvsc_drv.c b/drivers/scsi/storvsc_drv.c
index 6e4112143c76..b43d876747b7 100644
--- a/drivers/scsi/storvsc_drv.c
+++ b/drivers/scsi/storvsc_drv.c
@@ -1144,7 +1144,7 @@ static void storvsc_on_io_completion(struct storvsc_device *stor_device,
* The current SCSI handling on the host side does
* not correctly handle:
* INQUIRY command with page code parameter set to 0x80
- * MODE_SENSE command with cmd[2] == 0x1c
+ * MODE_SENSE and MODE_SENSE_10 command with cmd[2] == 0x1c
* MAINTENANCE_IN is not supported by HyperV FC passthrough
*
* Setup srb and scsi status so this won't be fatal.
@@ -1154,6 +1154,7 @@ static void storvsc_on_io_completion(struct storvsc_device *stor_device,
if ((stor_pkt->vm_srb.cdb[0] == INQUIRY) ||
(stor_pkt->vm_srb.cdb[0] == MODE_SENSE) ||
+ (stor_pkt->vm_srb.cdb[0] == MODE_SENSE_10) ||
(stor_pkt->vm_srb.cdb[0] == MAINTENANCE_IN &&
hv_dev_is_fc(device))) {
vstor_packet->vm_srb.scsi_status = 0;
--
2.34.1
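The opcode check extended by this patch can be exercised in isolation with a small user-space sketch. The opcode values are the standard SCSI ones. The helper name sketch_cdb_is_unsupported() and the is_fc parameter (standing in for hv_dev_is_fc(device)) are assumptions for illustration, and the sketch covers only the predicate, not the status fixup that follows it in the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard SCSI operation codes. */
#define INQUIRY		0x12
#define MODE_SENSE	0x1a
#define MODE_SENSE_10	0x5a
#define MAINTENANCE_IN	0xa3

/*
 * Hypothetical helper mirroring the condition in the hunk above:
 * true if a failed command with this CDB should be treated as
 * "unsupported by the host" rather than as a fatal error.
 */
static bool sketch_cdb_is_unsupported(const uint8_t *cdb, bool is_fc)
{
	return cdb[0] == INQUIRY ||
	       cdb[0] == MODE_SENSE ||
	       cdb[0] == MODE_SENSE_10 ||
	       (cdb[0] == MAINTENANCE_IN && is_fc);
}
```

Before the patch, a CDB starting with 0x5a would not match the equivalent condition; with MODE_SENSE_10 added it is handled the same way as MODE_SENSE.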
* [PATCH net-next v15 01/12] vsock: add netns to vsock core
From: Bobby Eshleman @ 2026-01-16 21:28 UTC
To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Shuah Khan, Long Li,
Jonathan Corbet
Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
linux-kselftest, berrange, Sargun Dhillon, linux-doc,
Bobby Eshleman, Bobby Eshleman
In-Reply-To: <20260116-vsock-vmtest-v15-0-bbfd1a668548@meta.com>
From: Bobby Eshleman <bobbyeshleman@meta.com>
Add netns logic to vsock core. Additionally, modify transport hook
prototypes to be used by later transport-specific patches (e.g.,
*_seqpacket_allow()).
Namespaces are supported primarily by changing socket lookup functions
(e.g., vsock_find_connected_socket()) to take into account the socket
namespace and the namespace mode before considering a candidate socket a
"match".
This patch also introduces the sysctl /proc/sys/net/vsock/ns_mode to
report the mode and /proc/sys/net/vsock/child_ns_mode to set the mode
for new namespaces.
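The namespace-aware matching described above can be sketched in user space as follows. This is an assumption-laden model, not the code in this patch: the types and the sketch_net_match() helper are hypothetical, and the rule is taken from the mode descriptions in this series (global namespaces can talk to each other, local namespaces only to themselves).

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the per-namespace mode state. */
enum sketch_ns_mode { SKETCH_NS_GLOBAL, SKETCH_NS_LOCAL };

struct sketch_net {
	int id;
	enum sketch_ns_mode mode;
};

/*
 * Assumed matching rule for a lookup: a candidate socket in namespace
 * "sk_net" matches a request from namespace "req_net" if both are the
 * same namespace, or if both namespaces are in global mode.
 */
static bool sketch_net_match(const struct sketch_net *sk_net,
			     const struct sketch_net *req_net)
{
	if (sk_net == req_net)
		return true;

	return sk_net->mode == SKETCH_NS_GLOBAL &&
	       req_net->mode == SKETCH_NS_GLOBAL;
}
```

A lookup function would apply such a check in addition to the usual address comparison before treating a candidate socket as a match.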
Add netns functionality (initialization, passing to transports, procfs,
etc...) to the af_vsock socket layer. Later patches that add netns
support to transports depend on this patch.
dgram_allow(), stream_allow(), and seqpacket_allow() callbacks are
modified to take a vsk in order to perform logic on namespace modes. In
future patches, the net will also be used for socket
lookups in these functions.
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v15:
- make static port in __vsock_bind_connectible per-netns
- remove __net_initdata because we want the ops beyond just boot
- add vsock_init_ns_mode kernel cmdline parameter to set init ns mode
- use if (ret || !write) in __vsock_net_mode_string() (Stefano)
- add vsock_net_mode_global() (Stefano)
- hide !net == VSOCK_NET_MODE_GLOBAL inside vsock_net_mode() (Stefano)
- clarify af_vsock.c comments on ns_mode/child_ns_mode (Stefano)
Changes in v14:
- include linux/sysctl.h in af_vsock.c
- squash patch 'vsock: add per-net vsock NS mode state' into this patch
(prior version can be found here:
https://lore.kernel.org/all/20251223-vsock-vmtest-v13-1-9d6db8e7c80b@meta.com/)
Changes in v13:
- remove net_mode and replace with direct accesses to net->vsock.mode,
since this is now immutable.
- update comments about mode behavior and mutability, and sysctl API
- only pass NULL for net when wanting global, instead of net_mode ==
VSOCK_NET_MODE_GLOBAL. This reflects the new logic
of vsock_net_check_mode() that only requires net pointers (not
net_mode).
- refactor sysctl string code into a re-usable function, because
child_ns_mode and ns_mode both handle the same strings.
- remove redundant vsock_net_init(&init_net) call in module init because
pernet registration calls the callback on the init_net too
Changes in v12:
- return true in dgram_allow(), stream_allow(), and seqpacket_allow()
only if net_mode == VSOCK_NET_MODE_GLOBAL (Stefano)
- document bind(VMADDR_CID_ANY) case in af_vsock.c (Stefano)
- change order of stream_allow() call in vmci so we can pass vsk
to it
Changes in v10:
- add file-level comment about what happens to sockets/devices
when the namespace mode changes (Stefano)
- change the 'if (write)' boolean in vsock_net_mode_string() to
if (!write), this simplifies a later patch which adds "goto"
for mutex unlocking on function exit.
Changes in v9:
- remove virtio_vsock_alloc_rx_skb() (Stefano)
- remove vsock_global_dummy_net, not needed as net=NULL +
net_mode=VSOCK_NET_MODE_GLOBAL achieves identical result
Changes in v7:
- hv_sock: fix hyperv build error
- explain why vhost does not use the dummy
- explain usage of __vsock_global_dummy_net
- explain why VSOCK_NET_MODE_STR_MAX is 8 characters
- use switch-case in vsock_net_mode_string()
- avoid changing transports as much as possible
- add vsock_find_{bound,connected}_socket_net()
- rename `vsock_hdr` to `sysctl_hdr`
- add virtio_vsock_alloc_linear_skb() wrapper for setting dummy net and
global mode for virtio-vsock, move skb->cb zero-ing into wrapper
- explain seqpacket_allow() change
- move net setting to __vsock_create() instead of vsock_create() so
that child sockets also have their net assigned upon accept()
Changes in v6:
- unregister sysctl ops in vsock_exit()
- af_vsock: clarify description of CID behavior
- af_vsock: fix buf vs buffer naming, and length checking
- af_vsock: fix length checking w/ correct ctl_table->maxlen
Changes in v5:
- vsock_global_net() -> vsock_global_dummy_net()
- update comments for new uAPI
- use /proc/sys/net/vsock/ns_mode instead of /proc/net/vsock_ns_mode
- add prototype changes so patch remains compilable
---
Documentation/admin-guide/kernel-parameters.txt | 14 +
MAINTAINERS | 1 +
drivers/vhost/vsock.c | 6 +-
include/linux/virtio_vsock.h | 4 +-
include/net/af_vsock.h | 61 ++++-
include/net/net_namespace.h | 4 +
include/net/netns/vsock.h | 21 ++
net/vmw_vsock/af_vsock.c | 328 ++++++++++++++++++++++--
net/vmw_vsock/hyperv_transport.c | 7 +-
net/vmw_vsock/virtio_transport.c | 9 +-
net/vmw_vsock/virtio_transport_common.c | 6 +-
net/vmw_vsock/vmci_transport.c | 26 +-
net/vmw_vsock/vsock_loopback.c | 8 +-
13 files changed, 444 insertions(+), 51 deletions(-)
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a8d0afde7f85..b6e3bfe365a1 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -8253,6 +8253,20 @@ Kernel parameters
them quite hard to use for exploits but
might break your system.
+ vsock_init_ns_mode=
+ [KNL,NET] Set the vsock namespace mode for the init
+ (root) network namespace.
+
+ global [default] The init namespace operates in
+ global mode where CIDs are system-wide and
+ sockets can communicate across global
+ namespaces.
+
+ local The init namespace operates in local mode
+ where CIDs are private to the namespace and
+ sockets can only communicate within the same
+ namespace.
+
vt.color= [VT] Default text color.
Format: 0xYX, X = foreground, Y = background.
Default: 0x07 = light gray on black.
diff --git a/MAINTAINERS b/MAINTAINERS
index afc71089ba09..c48a2e047686 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -27557,6 +27557,7 @@ L: netdev@vger.kernel.org
S: Maintained
F: drivers/vhost/vsock.c
F: include/linux/virtio_vsock.h
+F: include/net/netns/vsock.h
F: include/uapi/linux/virtio_vsock.h
F: net/vmw_vsock/virtio_transport.c
F: net/vmw_vsock/virtio_transport_common.c
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 552cfb53498a..647ded6f6ea5 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -407,7 +407,8 @@ static bool vhost_transport_msgzerocopy_allow(void)
return true;
}
-static bool vhost_transport_seqpacket_allow(u32 remote_cid);
+static bool vhost_transport_seqpacket_allow(struct vsock_sock *vsk,
+ u32 remote_cid);
static struct virtio_transport vhost_transport = {
.transport = {
@@ -463,7 +464,8 @@ static struct virtio_transport vhost_transport = {
.send_pkt = vhost_transport_send_pkt,
};
-static bool vhost_transport_seqpacket_allow(u32 remote_cid)
+static bool vhost_transport_seqpacket_allow(struct vsock_sock *vsk,
+ u32 remote_cid)
{
struct vhost_vsock *vsock;
bool seqpacket_allow = false;
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 0c67543a45c8..1845e8d4f78d 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -256,10 +256,10 @@ void virtio_transport_notify_buffer_size(struct vsock_sock *vsk, u64 *val);
u64 virtio_transport_stream_rcvhiwat(struct vsock_sock *vsk);
bool virtio_transport_stream_is_active(struct vsock_sock *vsk);
-bool virtio_transport_stream_allow(u32 cid, u32 port);
+bool virtio_transport_stream_allow(struct vsock_sock *vsk, u32 cid, u32 port);
int virtio_transport_dgram_bind(struct vsock_sock *vsk,
struct sockaddr_vm *addr);
-bool virtio_transport_dgram_allow(u32 cid, u32 port);
+bool virtio_transport_dgram_allow(struct vsock_sock *vsk, u32 cid, u32 port);
int virtio_transport_connect(struct vsock_sock *vsk);
diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index d40e978126e3..d3ff48a2fbe0 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -10,6 +10,7 @@
#include <linux/kernel.h>
#include <linux/workqueue.h>
+#include <net/netns/vsock.h>
#include <net/sock.h>
#include <uapi/linux/vm_sockets.h>
@@ -124,7 +125,7 @@ struct vsock_transport {
size_t len, int flags);
int (*dgram_enqueue)(struct vsock_sock *, struct sockaddr_vm *,
struct msghdr *, size_t len);
- bool (*dgram_allow)(u32 cid, u32 port);
+ bool (*dgram_allow)(struct vsock_sock *vsk, u32 cid, u32 port);
/* STREAM. */
/* TODO: stream_bind() */
@@ -136,14 +137,14 @@ struct vsock_transport {
s64 (*stream_has_space)(struct vsock_sock *);
u64 (*stream_rcvhiwat)(struct vsock_sock *);
bool (*stream_is_active)(struct vsock_sock *);
- bool (*stream_allow)(u32 cid, u32 port);
+ bool (*stream_allow)(struct vsock_sock *vsk, u32 cid, u32 port);
/* SEQ_PACKET. */
ssize_t (*seqpacket_dequeue)(struct vsock_sock *vsk, struct msghdr *msg,
int flags);
int (*seqpacket_enqueue)(struct vsock_sock *vsk, struct msghdr *msg,
size_t len);
- bool (*seqpacket_allow)(u32 remote_cid);
+ bool (*seqpacket_allow)(struct vsock_sock *vsk, u32 remote_cid);
u32 (*seqpacket_has_data)(struct vsock_sock *vsk);
/* Notification. */
@@ -216,6 +217,11 @@ void vsock_remove_connected(struct vsock_sock *vsk);
struct sock *vsock_find_bound_socket(struct sockaddr_vm *addr);
struct sock *vsock_find_connected_socket(struct sockaddr_vm *src,
struct sockaddr_vm *dst);
+struct sock *vsock_find_bound_socket_net(struct sockaddr_vm *addr,
+ struct net *net);
+struct sock *vsock_find_connected_socket_net(struct sockaddr_vm *src,
+ struct sockaddr_vm *dst,
+ struct net *net);
void vsock_remove_sock(struct vsock_sock *vsk);
void vsock_for_each_connected_socket(struct vsock_transport *transport,
void (*fn)(struct sock *sk));
@@ -256,4 +262,53 @@ static inline bool vsock_msgzerocopy_allow(const struct vsock_transport *t)
{
return t->msgzerocopy_allow && t->msgzerocopy_allow();
}
+
+static inline enum vsock_net_mode vsock_net_mode(struct net *net)
+{
+ if (!net)
+ return VSOCK_NET_MODE_GLOBAL;
+
+ return READ_ONCE(net->vsock.mode);
+}
+
+static inline bool vsock_net_mode_global(struct vsock_sock *vsk)
+{
+ return vsock_net_mode(sock_net(sk_vsock(vsk))) == VSOCK_NET_MODE_GLOBAL;
+}
+
+static inline void vsock_net_set_child_mode(struct net *net,
+ enum vsock_net_mode mode)
+{
+ WRITE_ONCE(net->vsock.child_ns_mode, mode);
+}
+
+static inline enum vsock_net_mode vsock_net_child_mode(struct net *net)
+{
+ return READ_ONCE(net->vsock.child_ns_mode);
+}
+
+/* Return true if two namespaces pass the mode rules. Otherwise, return false.
+ *
+ * A NULL namespace is treated as VSOCK_NET_MODE_GLOBAL.
+ *
+ * Read more about modes in the comment header of net/vmw_vsock/af_vsock.c.
+ */
+static inline bool vsock_net_check_mode(struct net *ns0, struct net *ns1)
+{
+ enum vsock_net_mode mode0, mode1;
+
+ /* Any vsocks within the same network namespace are always reachable,
+ * regardless of the mode.
+ */
+ if (net_eq(ns0, ns1))
+ return true;
+
+ mode0 = vsock_net_mode(ns0);
+ mode1 = vsock_net_mode(ns1);
+
+ /* Different namespaces are only reachable if they are both
+ * global mode.
+ */
+ return mode0 == VSOCK_NET_MODE_GLOBAL && mode0 == mode1;
+}
#endif /* __AF_VSOCK_H__ */
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index cb664f6e3558..66d3de1d935f 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -37,6 +37,7 @@
#include <net/netns/smc.h>
#include <net/netns/bpf.h>
#include <net/netns/mctp.h>
+#include <net/netns/vsock.h>
#include <net/net_trackers.h>
#include <linux/ns_common.h>
#include <linux/idr.h>
@@ -196,6 +197,9 @@ struct net {
/* Move to a better place when the config guard is removed. */
struct mutex rtnl_mutex;
#endif
+#if IS_ENABLED(CONFIG_VSOCKETS)
+ struct netns_vsock vsock;
+#endif
} __randomize_layout;
#include <linux/seq_file_net.h>
diff --git a/include/net/netns/vsock.h b/include/net/netns/vsock.h
new file mode 100644
index 000000000000..b34d69a22fa8
--- /dev/null
+++ b/include/net/netns/vsock.h
@@ -0,0 +1,21 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __NET_NET_NAMESPACE_VSOCK_H
+#define __NET_NET_NAMESPACE_VSOCK_H
+
+#include <linux/types.h>
+
+enum vsock_net_mode {
+ VSOCK_NET_MODE_GLOBAL,
+ VSOCK_NET_MODE_LOCAL,
+};
+
+struct netns_vsock {
+ struct ctl_table_header *sysctl_hdr;
+
+ /* protected by the vsock_table_lock in af_vsock.c */
+ u32 port;
+
+ enum vsock_net_mode mode;
+ enum vsock_net_mode child_ns_mode;
+};
+#endif /* __NET_NET_NAMESPACE_VSOCK_H */
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index a3505a4dcee0..3fc8160d51df 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -83,6 +83,48 @@
* TCP_ESTABLISHED - connected
* TCP_CLOSING - disconnecting
* TCP_LISTEN - listening
+ *
+ * - Namespaces in vsock support two different modes: "local" and "global".
+ * Each mode defines how the namespace interacts with CIDs.
+ * Each namespace exposes two sysctl files:
+ *
+ * - /proc/sys/net/vsock/ns_mode (read-only) reports the current namespace's
+ * mode, which is set at namespace creation and immutable thereafter.
+ * - /proc/sys/net/vsock/child_ns_mode (writable) controls what mode future
+ * child namespaces will inherit when created. The default is "global".
+ *
+ * Changing child_ns_mode only affects newly created namespaces, not the
+ * current namespace or existing children. At namespace creation, ns_mode
+ * is inherited from the parent's child_ns_mode.
+ *
+ * The modes affect the allocation and accessibility of CIDs as follows:
+ *
+ * - global - access and allocation are all system-wide
+ * - all CID allocations from global namespaces draw from the same
+ * system-wide pool.
+ * - if one global namespace has already allocated some CID, another
+ * global namespace will not be able to allocate the same CID.
+ * - global mode AF_VSOCK sockets can reach any VM or socket in any global
+ * namespace; they are not confined to their own namespace.
+ * - AF_VSOCK sockets in a global mode namespace cannot reach VMs or
+ * sockets in any local mode namespace.
+ * - local - access and allocation are contained within the namespace
+ * - CID allocation draws only from a private pool local to the
+ * namespace, and does not affect the CIDs available for allocation in any
+ * other namespace (global or local).
+ * - VMs in a local namespace do not collide with CIDs in any other local
+ * namespace or any global namespace. For example, if a VM in a local mode
+ * namespace is given CID 10, then CID 10 is still available for
+ * allocation in any other namespace, but not in the same namespace.
+ * - AF_VSOCK sockets in a local mode namespace can connect only to VMs or
+ * other sockets within their own namespace.
+ * - sockets bound to VMADDR_CID_ANY in local namespaces will never resolve
+ * to any transport that is not compatible with local mode. There is no
+ * error that propagates to the user (as there is for connection attempts)
+ * because it is possible for some packet to reach this socket from
+ * a different transport that *does* support local mode. For
+ * example, virtio-vsock may not support local mode, but the socket
+ * may still accept a connection from vhost-vsock which does.
*/
#include <linux/compat.h>
@@ -100,20 +142,31 @@
#include <linux/module.h>
#include <linux/mutex.h>
#include <linux/net.h>
+#include <linux/proc_fs.h>
#include <linux/poll.h>
#include <linux/random.h>
#include <linux/skbuff.h>
#include <linux/smp.h>
#include <linux/socket.h>
#include <linux/stddef.h>
+#include <linux/sysctl.h>
#include <linux/unistd.h>
#include <linux/wait.h>
#include <linux/workqueue.h>
#include <net/sock.h>
#include <net/af_vsock.h>
+#include <net/netns/vsock.h>
#include <uapi/linux/vm_sockets.h>
#include <uapi/asm-generic/ioctls.h>
+#define VSOCK_NET_MODE_STR_GLOBAL "global"
+#define VSOCK_NET_MODE_STR_LOCAL "local"
+
+/* 6 chars for "global", 1 for null-terminator, and 1 more for '\n'.
+ * The newline is added by proc_dostring() for read operations.
+ */
+#define VSOCK_NET_MODE_STR_MAX 8
+
static int __vsock_bind(struct sock *sk, struct sockaddr_vm *addr);
static void vsock_sk_destruct(struct sock *sk);
static int vsock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
@@ -149,6 +202,21 @@ static const struct vsock_transport *transport_dgram;
static const struct vsock_transport *transport_local;
static DEFINE_MUTEX(vsock_register_mutex);
+static enum vsock_net_mode vsock_init_ns_mode = VSOCK_NET_MODE_GLOBAL;
+
+#ifndef MODULE
+static int __init vsock_init_ns_mode_setup(char *str)
+{
+ if (!strcmp(str, VSOCK_NET_MODE_STR_LOCAL))
+ vsock_init_ns_mode = VSOCK_NET_MODE_LOCAL;
+ else if (!strcmp(str, VSOCK_NET_MODE_STR_GLOBAL))
+ vsock_init_ns_mode = VSOCK_NET_MODE_GLOBAL;
+
+ return 1;
+}
+__setup("vsock_init_ns_mode=", vsock_init_ns_mode_setup);
+#endif
+
/**** UTILS ****/
/* Each bound VSocket is stored in the bind hash table and each connected
@@ -235,33 +303,42 @@ static void __vsock_remove_connected(struct vsock_sock *vsk)
sock_put(&vsk->sk);
}
-static struct sock *__vsock_find_bound_socket(struct sockaddr_vm *addr)
+static struct sock *__vsock_find_bound_socket_net(struct sockaddr_vm *addr,
+ struct net *net)
{
struct vsock_sock *vsk;
list_for_each_entry(vsk, vsock_bound_sockets(addr), bound_table) {
- if (vsock_addr_equals_addr(addr, &vsk->local_addr))
- return sk_vsock(vsk);
+ struct sock *sk = sk_vsock(vsk);
+
+ if (vsock_addr_equals_addr(addr, &vsk->local_addr) &&
+ vsock_net_check_mode(sock_net(sk), net))
+ return sk;
if (addr->svm_port == vsk->local_addr.svm_port &&
(vsk->local_addr.svm_cid == VMADDR_CID_ANY ||
- addr->svm_cid == VMADDR_CID_ANY))
- return sk_vsock(vsk);
+ addr->svm_cid == VMADDR_CID_ANY) &&
+ vsock_net_check_mode(sock_net(sk), net))
+ return sk;
}
return NULL;
}
-static struct sock *__vsock_find_connected_socket(struct sockaddr_vm *src,
- struct sockaddr_vm *dst)
+static struct sock *
+__vsock_find_connected_socket_net(struct sockaddr_vm *src,
+ struct sockaddr_vm *dst, struct net *net)
{
struct vsock_sock *vsk;
list_for_each_entry(vsk, vsock_connected_sockets(src, dst),
connected_table) {
+ struct sock *sk = sk_vsock(vsk);
+
if (vsock_addr_equals_addr(src, &vsk->remote_addr) &&
- dst->svm_port == vsk->local_addr.svm_port) {
- return sk_vsock(vsk);
+ dst->svm_port == vsk->local_addr.svm_port &&
+ vsock_net_check_mode(sock_net(sk), net)) {
+ return sk;
}
}
@@ -304,12 +381,13 @@ void vsock_remove_connected(struct vsock_sock *vsk)
}
EXPORT_SYMBOL_GPL(vsock_remove_connected);
-struct sock *vsock_find_bound_socket(struct sockaddr_vm *addr)
+struct sock *vsock_find_bound_socket_net(struct sockaddr_vm *addr,
+ struct net *net)
{
struct sock *sk;
spin_lock_bh(&vsock_table_lock);
- sk = __vsock_find_bound_socket(addr);
+ sk = __vsock_find_bound_socket_net(addr, net);
if (sk)
sock_hold(sk);
@@ -317,15 +395,22 @@ struct sock *vsock_find_bound_socket(struct sockaddr_vm *addr)
return sk;
}
+EXPORT_SYMBOL_GPL(vsock_find_bound_socket_net);
+
+struct sock *vsock_find_bound_socket(struct sockaddr_vm *addr)
+{
+ return vsock_find_bound_socket_net(addr, NULL);
+}
EXPORT_SYMBOL_GPL(vsock_find_bound_socket);
-struct sock *vsock_find_connected_socket(struct sockaddr_vm *src,
- struct sockaddr_vm *dst)
+struct sock *vsock_find_connected_socket_net(struct sockaddr_vm *src,
+ struct sockaddr_vm *dst,
+ struct net *net)
{
struct sock *sk;
spin_lock_bh(&vsock_table_lock);
- sk = __vsock_find_connected_socket(src, dst);
+ sk = __vsock_find_connected_socket_net(src, dst, net);
if (sk)
sock_hold(sk);
@@ -333,6 +418,13 @@ struct sock *vsock_find_connected_socket(struct sockaddr_vm *src,
return sk;
}
+EXPORT_SYMBOL_GPL(vsock_find_connected_socket_net);
+
+struct sock *vsock_find_connected_socket(struct sockaddr_vm *src,
+ struct sockaddr_vm *dst)
+{
+ return vsock_find_connected_socket_net(src, dst, NULL);
+}
EXPORT_SYMBOL_GPL(vsock_find_connected_socket);
void vsock_remove_sock(struct vsock_sock *vsk)
@@ -528,7 +620,7 @@ int vsock_assign_transport(struct vsock_sock *vsk, struct vsock_sock *psk)
if (sk->sk_type == SOCK_SEQPACKET) {
if (!new_transport->seqpacket_allow ||
- !new_transport->seqpacket_allow(remote_cid)) {
+ !new_transport->seqpacket_allow(vsk, remote_cid)) {
module_put(new_transport->module);
return -ESOCKTNOSUPPORT;
}
@@ -676,11 +768,11 @@ static void vsock_pending_work(struct work_struct *work)
static int __vsock_bind_connectible(struct vsock_sock *vsk,
struct sockaddr_vm *addr)
{
- static u32 port;
+ struct net *net = sock_net(sk_vsock(vsk));
struct sockaddr_vm new_addr;
- if (!port)
- port = get_random_u32_above(LAST_RESERVED_PORT);
+ if (!net->vsock.port)
+ net->vsock.port = get_random_u32_above(LAST_RESERVED_PORT);
vsock_addr_init(&new_addr, addr->svm_cid, addr->svm_port);
@@ -689,13 +781,13 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
unsigned int i;
for (i = 0; i < MAX_PORT_RETRIES; i++) {
- if (port == VMADDR_PORT_ANY ||
- port <= LAST_RESERVED_PORT)
- port = LAST_RESERVED_PORT + 1;
+ if (net->vsock.port == VMADDR_PORT_ANY ||
+ net->vsock.port <= LAST_RESERVED_PORT)
+ net->vsock.port = LAST_RESERVED_PORT + 1;
- new_addr.svm_port = port++;
+ new_addr.svm_port = net->vsock.port++;
- if (!__vsock_find_bound_socket(&new_addr)) {
+ if (!__vsock_find_bound_socket_net(&new_addr, net)) {
found = true;
break;
}
@@ -712,7 +804,7 @@ static int __vsock_bind_connectible(struct vsock_sock *vsk,
return -EACCES;
}
- if (__vsock_find_bound_socket(&new_addr))
+ if (__vsock_find_bound_socket_net(&new_addr, net))
return -EADDRINUSE;
}
@@ -1314,7 +1406,7 @@ static int vsock_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
goto out;
}
- if (!transport->dgram_allow(remote_addr->svm_cid,
+ if (!transport->dgram_allow(vsk, remote_addr->svm_cid,
remote_addr->svm_port)) {
err = -EINVAL;
goto out;
@@ -1355,7 +1447,7 @@ static int vsock_dgram_connect(struct socket *sock,
if (err)
goto out;
- if (!vsk->transport->dgram_allow(remote_addr->svm_cid,
+ if (!vsk->transport->dgram_allow(vsk, remote_addr->svm_cid,
remote_addr->svm_port)) {
err = -EINVAL;
goto out;
@@ -1585,7 +1677,7 @@ static int vsock_connect(struct socket *sock, struct sockaddr_unsized *addr,
* endpoints.
*/
if (!transport ||
- !transport->stream_allow(remote_addr->svm_cid,
+ !transport->stream_allow(vsk, remote_addr->svm_cid,
remote_addr->svm_port)) {
err = -ENETUNREACH;
goto out;
@@ -2662,6 +2754,180 @@ static struct miscdevice vsock_device = {
.fops = &vsock_device_ops,
};
+static int __vsock_net_mode_string(const struct ctl_table *table, int write,
+ void *buffer, size_t *lenp, loff_t *ppos,
+ enum vsock_net_mode mode,
+ enum vsock_net_mode *new_mode)
+{
+ char data[VSOCK_NET_MODE_STR_MAX] = {0};
+ struct ctl_table tmp;
+ int ret;
+
+ if (!table->data || !table->maxlen || !*lenp) {
+ *lenp = 0;
+ return 0;
+ }
+
+ tmp = *table;
+ tmp.data = data;
+
+ if (!write) {
+ const char *p;
+
+ switch (mode) {
+ case VSOCK_NET_MODE_GLOBAL:
+ p = VSOCK_NET_MODE_STR_GLOBAL;
+ break;
+ case VSOCK_NET_MODE_LOCAL:
+ p = VSOCK_NET_MODE_STR_LOCAL;
+ break;
+ default:
+ WARN_ONCE(true, "netns has invalid vsock mode");
+ *lenp = 0;
+ return 0;
+ }
+
+ strscpy(data, p, sizeof(data));
+ tmp.maxlen = strlen(p);
+ }
+
+ ret = proc_dostring(&tmp, write, buffer, lenp, ppos);
+ if (ret || !write)
+ return ret;
+
+ if (*lenp >= sizeof(data))
+ return -EINVAL;
+
+ if (!strncmp(data, VSOCK_NET_MODE_STR_GLOBAL, sizeof(data)))
+ *new_mode = VSOCK_NET_MODE_GLOBAL;
+ else if (!strncmp(data, VSOCK_NET_MODE_STR_LOCAL, sizeof(data)))
+ *new_mode = VSOCK_NET_MODE_LOCAL;
+ else
+ return -EINVAL;
+
+ return 0;
+}
+
+static int vsock_net_mode_string(const struct ctl_table *table, int write,
+ void *buffer, size_t *lenp, loff_t *ppos)
+{
+ struct net *net;
+
+ if (write)
+ return -EPERM;
+
+ net = current->nsproxy->net_ns;
+
+ return __vsock_net_mode_string(table, write, buffer, lenp, ppos,
+ vsock_net_mode(net), NULL);
+}
+
+static int vsock_net_child_mode_string(const struct ctl_table *table, int write,
+ void *buffer, size_t *lenp, loff_t *ppos)
+{
+ enum vsock_net_mode new_mode;
+ struct net *net;
+ int ret;
+
+ net = current->nsproxy->net_ns;
+
+ ret = __vsock_net_mode_string(table, write, buffer, lenp, ppos,
+ vsock_net_child_mode(net), &new_mode);
+ if (ret)
+ return ret;
+
+ if (write)
+ vsock_net_set_child_mode(net, new_mode);
+
+ return 0;
+}
+
+static struct ctl_table vsock_table[] = {
+ {
+ .procname = "ns_mode",
+ .data = &init_net.vsock.mode,
+ .maxlen = VSOCK_NET_MODE_STR_MAX,
+ .mode = 0444,
+ .proc_handler = vsock_net_mode_string
+ },
+ {
+ .procname = "child_ns_mode",
+ .data = &init_net.vsock.child_ns_mode,
+ .maxlen = VSOCK_NET_MODE_STR_MAX,
+ .mode = 0644,
+ .proc_handler = vsock_net_child_mode_string
+ },
+};
+
+static int __net_init vsock_sysctl_register(struct net *net)
+{
+ struct ctl_table *table;
+
+ if (net_eq(net, &init_net)) {
+ table = vsock_table;
+ } else {
+ table = kmemdup(vsock_table, sizeof(vsock_table), GFP_KERNEL);
+ if (!table)
+ goto err_alloc;
+
+ table[0].data = &net->vsock.mode;
+ table[1].data = &net->vsock.child_ns_mode;
+ }
+
+ net->vsock.sysctl_hdr = register_net_sysctl_sz(net, "net/vsock", table,
+ ARRAY_SIZE(vsock_table));
+ if (!net->vsock.sysctl_hdr)
+ goto err_reg;
+
+ return 0;
+
+err_reg:
+ if (!net_eq(net, &init_net))
+ kfree(table);
+err_alloc:
+ return -ENOMEM;
+}
+
+static void vsock_sysctl_unregister(struct net *net)
+{
+ const struct ctl_table *table;
+
+ table = net->vsock.sysctl_hdr->ctl_table_arg;
+ unregister_net_sysctl_table(net->vsock.sysctl_hdr);
+ if (!net_eq(net, &init_net))
+ kfree(table);
+}
+
+static void vsock_net_init(struct net *net)
+{
+ if (net_eq(net, &init_net))
+ net->vsock.mode = vsock_init_ns_mode;
+ else
+ net->vsock.mode = vsock_net_child_mode(current->nsproxy->net_ns);
+
+ net->vsock.child_ns_mode = VSOCK_NET_MODE_GLOBAL;
+}
+
+static __net_init int vsock_sysctl_init_net(struct net *net)
+{
+ vsock_net_init(net);
+
+ if (vsock_sysctl_register(net))
+ return -ENOMEM;
+
+ return 0;
+}
+
+static __net_exit void vsock_sysctl_exit_net(struct net *net)
+{
+ vsock_sysctl_unregister(net);
+}
+
+static struct pernet_operations vsock_sysctl_ops = {
+ .init = vsock_sysctl_init_net,
+ .exit = vsock_sysctl_exit_net,
+};
+
static int __init vsock_init(void)
{
int err = 0;
@@ -2689,10 +2955,17 @@ static int __init vsock_init(void)
goto err_unregister_proto;
}
+ if (register_pernet_subsys(&vsock_sysctl_ops)) {
+ err = -ENOMEM;
+ goto err_unregister_sock;
+ }
+
vsock_bpf_build_proto();
return 0;
+err_unregister_sock:
+ sock_unregister(AF_VSOCK);
err_unregister_proto:
proto_unregister(&vsock_proto);
err_deregister_misc:
@@ -2706,6 +2979,7 @@ static void __exit vsock_exit(void)
misc_deregister(&vsock_device);
sock_unregister(AF_VSOCK);
proto_unregister(&vsock_proto);
+ unregister_pernet_subsys(&vsock_sysctl_ops);
}
const struct vsock_transport *vsock_core_get_transport(struct vsock_sock *vsk)
diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index 432fcbbd14d4..c3010c874308 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -570,7 +570,7 @@ static int hvs_dgram_enqueue(struct vsock_sock *vsk,
return -EOPNOTSUPP;
}
-static bool hvs_dgram_allow(u32 cid, u32 port)
+static bool hvs_dgram_allow(struct vsock_sock *vsk, u32 cid, u32 port)
{
return false;
}
@@ -745,8 +745,11 @@ static bool hvs_stream_is_active(struct vsock_sock *vsk)
return hvs->chan != NULL;
}
-static bool hvs_stream_allow(u32 cid, u32 port)
+static bool hvs_stream_allow(struct vsock_sock *vsk, u32 cid, u32 port)
{
+ if (!vsock_net_mode_global(vsk))
+ return false;
+
if (cid == VMADDR_CID_HOST)
return true;
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 8c867023a2e5..f0a9e51118f3 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -536,7 +536,8 @@ static bool virtio_transport_msgzerocopy_allow(void)
return true;
}
-static bool virtio_transport_seqpacket_allow(u32 remote_cid);
+static bool virtio_transport_seqpacket_allow(struct vsock_sock *vsk,
+ u32 remote_cid);
static struct virtio_transport virtio_transport = {
.transport = {
@@ -593,11 +594,15 @@ static struct virtio_transport virtio_transport = {
.can_msgzerocopy = virtio_transport_can_msgzerocopy,
};
-static bool virtio_transport_seqpacket_allow(u32 remote_cid)
+static bool
+virtio_transport_seqpacket_allow(struct vsock_sock *vsk, u32 remote_cid)
{
struct virtio_vsock *vsock;
bool seqpacket_allow;
+ if (!vsock_net_mode_global(vsk))
+ return false;
+
seqpacket_allow = false;
rcu_read_lock();
vsock = rcu_dereference(the_virtio_vsock);
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index dcc8a1d5851e..fdb8f5b3fa60 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1043,9 +1043,9 @@ bool virtio_transport_stream_is_active(struct vsock_sock *vsk)
}
EXPORT_SYMBOL_GPL(virtio_transport_stream_is_active);
-bool virtio_transport_stream_allow(u32 cid, u32 port)
+bool virtio_transport_stream_allow(struct vsock_sock *vsk, u32 cid, u32 port)
{
- return true;
+ return vsock_net_mode(sock_net(sk_vsock(vsk))) == VSOCK_NET_MODE_GLOBAL;
}
EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
@@ -1056,7 +1056,7 @@ int virtio_transport_dgram_bind(struct vsock_sock *vsk,
}
EXPORT_SYMBOL_GPL(virtio_transport_dgram_bind);
-bool virtio_transport_dgram_allow(u32 cid, u32 port)
+bool virtio_transport_dgram_allow(struct vsock_sock *vsk, u32 cid, u32 port)
{
return false;
}
diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
index 7eccd6708d66..00f6bbdb035a 100644
--- a/net/vmw_vsock/vmci_transport.c
+++ b/net/vmw_vsock/vmci_transport.c
@@ -646,13 +646,17 @@ static int vmci_transport_recv_dgram_cb(void *data, struct vmci_datagram *dg)
return VMCI_SUCCESS;
}
-static bool vmci_transport_stream_allow(u32 cid, u32 port)
+static bool vmci_transport_stream_allow(struct vsock_sock *vsk, u32 cid,
+ u32 port)
{
static const u32 non_socket_contexts[] = {
VMADDR_CID_LOCAL,
};
int i;
+ if (!vsock_net_mode_global(vsk))
+ return false;
+
BUILD_BUG_ON(sizeof(cid) != sizeof(*non_socket_contexts));
for (i = 0; i < ARRAY_SIZE(non_socket_contexts); i++) {
@@ -682,12 +686,10 @@ static int vmci_transport_recv_stream_cb(void *data, struct vmci_datagram *dg)
err = VMCI_SUCCESS;
bh_process_pkt = false;
- /* Ignore incoming packets from contexts without sockets, or resources
- * that aren't vsock implementations.
+ /* Ignore incoming packets from resources that aren't vsock
+ * implementations.
*/
-
- if (!vmci_transport_stream_allow(dg->src.context, -1)
- || vmci_transport_peer_rid(dg->src.context) != dg->src.resource)
+ if (vmci_transport_peer_rid(dg->src.context) != dg->src.resource)
return VMCI_ERROR_NO_ACCESS;
if (VMCI_DG_SIZE(dg) < sizeof(*pkt))
@@ -749,6 +751,12 @@ static int vmci_transport_recv_stream_cb(void *data, struct vmci_datagram *dg)
goto out;
}
+ /* Ignore incoming packets from contexts without sockets. */
+ if (!vmci_transport_stream_allow(vsk, dg->src.context, -1)) {
+ err = VMCI_ERROR_NO_ACCESS;
+ goto out;
+ }
+
/* We do most everything in a work queue, but let's fast path the
* notification of reads and writes to help data transfer performance.
* We can only do this if there is no process context code executing
@@ -1784,8 +1792,12 @@ static int vmci_transport_dgram_dequeue(struct vsock_sock *vsk,
return err;
}
-static bool vmci_transport_dgram_allow(u32 cid, u32 port)
+static bool vmci_transport_dgram_allow(struct vsock_sock *vsk, u32 cid,
+ u32 port)
{
+ if (!vsock_net_mode_global(vsk))
+ return false;
+
if (cid == VMADDR_CID_HYPERVISOR) {
/* Registrations of PBRPC Servers do not modify VMX/Hypervisor
* state and are allowed.
diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
index bc2ff918b315..deff68c64a09 100644
--- a/net/vmw_vsock/vsock_loopback.c
+++ b/net/vmw_vsock/vsock_loopback.c
@@ -46,7 +46,8 @@ static int vsock_loopback_cancel_pkt(struct vsock_sock *vsk)
return 0;
}
-static bool vsock_loopback_seqpacket_allow(u32 remote_cid);
+static bool vsock_loopback_seqpacket_allow(struct vsock_sock *vsk,
+ u32 remote_cid);
static bool vsock_loopback_msgzerocopy_allow(void)
{
return true;
@@ -106,9 +107,10 @@ static struct virtio_transport loopback_transport = {
.send_pkt = vsock_loopback_send_pkt,
};
-static bool vsock_loopback_seqpacket_allow(u32 remote_cid)
+static bool
+vsock_loopback_seqpacket_allow(struct vsock_sock *vsk, u32 remote_cid)
{
- return true;
+ return vsock_net_mode_global(vsk);
}
static void vsock_loopback_work(struct work_struct *work)
--
2.47.3
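
For readers following the mode rules in the af_vsock.c comment header, here is a minimal userspace sketch of the reachability check and the child-mode inheritance. It is not code from the patch: struct demo_ns and the demo_* helpers are hypothetical stand-ins for struct net, vsock_net_mode(), vsock_net_check_mode() and vsock_net_init(), and pointer equality is assumed to play the role of net_eq().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Mirrors enum vsock_net_mode from include/net/netns/vsock.h in the patch. */
enum vsock_net_mode {
	VSOCK_NET_MODE_GLOBAL,
	VSOCK_NET_MODE_LOCAL,
};

/* Hypothetical stand-in for struct net: only the two mode fields. */
struct demo_ns {
	enum vsock_net_mode mode;          /* fixed at namespace creation */
	enum vsock_net_mode child_ns_mode; /* mode handed to future children */
};

/* A NULL namespace is treated as global, as vsock_net_mode() does. */
static enum vsock_net_mode demo_ns_mode(const struct demo_ns *ns)
{
	return ns ? ns->mode : VSOCK_NET_MODE_GLOBAL;
}

/* Same decision as vsock_net_check_mode(): sockets in one namespace always
 * reach each other; different namespaces are reachable only when both are
 * in global mode.
 */
static bool demo_ns_check_mode(const struct demo_ns *a, const struct demo_ns *b)
{
	if (a == b) /* assumption: pointer equality stands in for net_eq() */
		return true;

	return demo_ns_mode(a) == VSOCK_NET_MODE_GLOBAL &&
	       demo_ns_mode(b) == VSOCK_NET_MODE_GLOBAL;
}

/* A child takes its mode from the parent's child_ns_mode, and its own
 * child_ns_mode starts out as global, following vsock_net_init().
 */
static struct demo_ns demo_ns_create_child(const struct demo_ns *parent)
{
	struct demo_ns child;

	assert(parent != NULL);
	child.mode = parent->child_ns_mode;
	child.child_ns_mode = VSOCK_NET_MODE_GLOBAL;
	return child;
}

/* Accepts exactly "global" or "local". Returns 0 and stores the mode on
 * success, -1 otherwise (the kernel handler returns -EINVAL instead).
 */
static int demo_parse_mode(const char *s, enum vsock_net_mode *mode)
{
	if (!strcmp(s, "global"))
		*mode = VSOCK_NET_MODE_GLOBAL;
	else if (!strcmp(s, "local"))
		*mode = VSOCK_NET_MODE_LOCAL;
	else
		return -1;

	return 0;
}
```

In this sketch, setting the parent's child_ns_mode to local and creating two children yields two namespaces that can reach neither each other nor the parent, while a global namespace and a NULL namespace remain mutually reachable.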
^ permalink raw reply related
* [PATCH v2] mshv: make certain field names descriptive in a header struct
From: Mukesh Rathor @ 2026-01-16 22:49 UTC (permalink / raw)
To: linux-hyperv; +Cc: wei.liu, nunodasneves
When struct fields use very common names like "pages" or "type", it makes
it difficult to find uses of these fields with tools like grep, cscope,
etc. when the struct is in a header file included in many places. Add the
prefix mreg_ to some fields in struct mshv_mem_region to make it easier
to find them.
There is no functional change.
Signed-off-by: Mukesh Rathor <mrathor@linux.microsoft.com>
---
v2: make mutex and refcount descriptive also.
drivers/hv/mshv_regions.c | 66 ++++++++++++++++++-------------------
drivers/hv/mshv_root.h | 10 +++---
drivers/hv/mshv_root_main.c | 10 +++---
3 files changed, 43 insertions(+), 43 deletions(-)
diff --git a/drivers/hv/mshv_regions.c b/drivers/hv/mshv_regions.c
index 202b9d551e39..fec8ae9b2069 100644
--- a/drivers/hv/mshv_regions.c
+++ b/drivers/hv/mshv_regions.c
@@ -52,7 +52,7 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
struct page *page;
int ret;
- page = region->pages[page_offset];
+ page = region->mreg_pages[page_offset];
if (!page)
return -EINVAL;
@@ -65,7 +65,7 @@ static long mshv_region_process_chunk(struct mshv_mem_region *region,
/* Start at stride since the first page is validated */
for (count = stride; count < page_count; count += stride) {
- page = region->pages[page_offset + count];
+ page = region->mreg_pages[page_offset + count];
/* Break if current page is not present */
if (!page)
@@ -117,7 +117,7 @@ static int mshv_region_process_range(struct mshv_mem_region *region,
while (page_count) {
/* Skip non-present pages */
- if (!region->pages[page_offset]) {
+ if (!region->mreg_pages[page_offset]) {
page_offset++;
page_count--;
continue;
@@ -155,7 +155,7 @@ struct mshv_mem_region *mshv_region_create(u64 guest_pfn, u64 nr_pages,
if (flags & BIT(MSHV_SET_MEM_BIT_EXECUTABLE))
region->hv_map_flags |= HV_MAP_GPA_EXECUTABLE;
- kref_init(&region->refcount);
+ kref_init(&region->mreg_refcount);
return region;
}
@@ -164,13 +164,13 @@ static int mshv_region_chunk_share(struct mshv_mem_region *region,
u32 flags,
u64 page_offset, u64 page_count)
{
- struct page *page = region->pages[page_offset];
+ struct page *page = region->mreg_pages[page_offset];
if (PageHuge(page) || PageTransCompound(page))
flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
return hv_call_modify_spa_host_access(region->partition->pt_id,
- region->pages + page_offset,
+ region->mreg_pages + page_offset,
page_count,
HV_MAP_GPA_READABLE |
HV_MAP_GPA_WRITABLE,
@@ -190,13 +190,13 @@ static int mshv_region_chunk_unshare(struct mshv_mem_region *region,
u32 flags,
u64 page_offset, u64 page_count)
{
- struct page *page = region->pages[page_offset];
+ struct page *page = region->mreg_pages[page_offset];
if (PageHuge(page) || PageTransCompound(page))
flags |= HV_MODIFY_SPA_PAGE_HOST_ACCESS_LARGE_PAGE;
return hv_call_modify_spa_host_access(region->partition->pt_id,
- region->pages + page_offset,
+ region->mreg_pages + page_offset,
page_count, 0,
flags, false);
}
@@ -214,7 +214,7 @@ static int mshv_region_chunk_remap(struct mshv_mem_region *region,
u32 flags,
u64 page_offset, u64 page_count)
{
- struct page *page = region->pages[page_offset];
+ struct page *page = region->mreg_pages[page_offset];
if (PageHuge(page) || PageTransCompound(page))
flags |= HV_MAP_GPA_LARGE_PAGE;
@@ -222,7 +222,7 @@ static int mshv_region_chunk_remap(struct mshv_mem_region *region,
return hv_call_map_gpa_pages(region->partition->pt_id,
region->start_gfn + page_offset,
page_count, flags,
- region->pages + page_offset);
+ region->mreg_pages + page_offset);
}
static int mshv_region_remap_pages(struct mshv_mem_region *region,
@@ -245,10 +245,10 @@ int mshv_region_map(struct mshv_mem_region *region)
static void mshv_region_invalidate_pages(struct mshv_mem_region *region,
u64 page_offset, u64 page_count)
{
- if (region->type == MSHV_REGION_TYPE_MEM_PINNED)
- unpin_user_pages(region->pages + page_offset, page_count);
+ if (region->mreg_type == MSHV_REGION_TYPE_MEM_PINNED)
+ unpin_user_pages(region->mreg_pages + page_offset, page_count);
- memset(region->pages + page_offset, 0,
+ memset(region->mreg_pages + page_offset, 0,
page_count * sizeof(struct page *));
}
@@ -265,7 +265,7 @@ int mshv_region_pin(struct mshv_mem_region *region)
int ret;
for (done_count = 0; done_count < region->nr_pages; done_count += ret) {
- pages = region->pages + done_count;
+ pages = region->mreg_pages + done_count;
userspace_addr = region->start_uaddr +
done_count * HV_HYP_PAGE_SIZE;
nr_pages = min(region->nr_pages - done_count,
@@ -297,7 +297,7 @@ static int mshv_region_chunk_unmap(struct mshv_mem_region *region,
u32 flags,
u64 page_offset, u64 page_count)
{
- struct page *page = region->pages[page_offset];
+ struct page *page = region->mreg_pages[page_offset];
if (PageHuge(page) || PageTransCompound(page))
flags |= HV_UNMAP_GPA_LARGE_PAGE;
@@ -317,11 +317,11 @@ static int mshv_region_unmap(struct mshv_mem_region *region)
static void mshv_region_destroy(struct kref *ref)
{
struct mshv_mem_region *region =
- container_of(ref, struct mshv_mem_region, refcount);
+ container_of(ref, struct mshv_mem_region, mreg_refcount);
struct mshv_partition *partition = region->partition;
int ret;
- if (region->type == MSHV_REGION_TYPE_MEM_MOVABLE)
+ if (region->mreg_type == MSHV_REGION_TYPE_MEM_MOVABLE)
mshv_region_movable_fini(region);
if (mshv_partition_encrypted(partition)) {
@@ -343,12 +343,12 @@ static void mshv_region_destroy(struct kref *ref)
void mshv_region_put(struct mshv_mem_region *region)
{
- kref_put(&region->refcount, mshv_region_destroy);
+ kref_put(&region->mreg_refcount, mshv_region_destroy);
}
int mshv_region_get(struct mshv_mem_region *region)
{
- return kref_get_unless_zero(&region->refcount);
+ return kref_get_unless_zero(&region->mreg_refcount);
}
/**
@@ -374,16 +374,16 @@ static int mshv_region_hmm_fault_and_lock(struct mshv_mem_region *region,
int ret;
range->notifier_seq = mmu_interval_read_begin(range->notifier);
- mmap_read_lock(region->mni.mm);
+ mmap_read_lock(region->mreg_mni.mm);
ret = hmm_range_fault(range);
- mmap_read_unlock(region->mni.mm);
+ mmap_read_unlock(region->mreg_mni.mm);
if (ret)
return ret;
- mutex_lock(&region->mutex);
+ mutex_lock(&region->mreg_mutex);
if (mmu_interval_read_retry(range->notifier, range->notifier_seq)) {
- mutex_unlock(&region->mutex);
+ mutex_unlock(&region->mreg_mutex);
cond_resched();
return -EBUSY;
}
@@ -407,7 +407,7 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
u64 page_offset, u64 page_count)
{
struct hmm_range range = {
- .notifier = &region->mni,
+ .notifier = &region->mreg_mni,
.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
};
unsigned long *pfns;
@@ -430,12 +430,12 @@ static int mshv_region_range_fault(struct mshv_mem_region *region,
goto out;
for (i = 0; i < page_count; i++)
- region->pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
+ region->mreg_pages[page_offset + i] = hmm_pfn_to_page(pfns[i]);
ret = mshv_region_remap_pages(region, region->hv_map_flags,
page_offset, page_count);
- mutex_unlock(&region->mutex);
+ mutex_unlock(&region->mreg_mutex);
out:
kfree(pfns);
return ret;
@@ -489,14 +489,14 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
{
struct mshv_mem_region *region = container_of(mni,
struct mshv_mem_region,
- mni);
+ mreg_mni);
u64 page_offset, page_count;
unsigned long mstart, mend;
int ret = -EPERM;
if (mmu_notifier_range_blockable(range))
- mutex_lock(&region->mutex);
- else if (!mutex_trylock(&region->mutex))
+ mutex_lock(&region->mreg_mutex);
+ else if (!mutex_trylock(&region->mreg_mutex))
goto out_fail;
mmu_interval_set_seq(mni, cur_seq);
@@ -515,7 +515,7 @@ static bool mshv_region_interval_invalidate(struct mmu_interval_notifier *mni,
mshv_region_invalidate_pages(region, page_offset, page_count);
- mutex_unlock(&region->mutex);
+ mutex_unlock(&region->mreg_mutex);
return true;
@@ -535,21 +535,21 @@ static const struct mmu_interval_notifier_ops mshv_region_mni_ops = {
void mshv_region_movable_fini(struct mshv_mem_region *region)
{
- mmu_interval_notifier_remove(&region->mni);
+ mmu_interval_notifier_remove(&region->mreg_mni);
}
bool mshv_region_movable_init(struct mshv_mem_region *region)
{
int ret;
- ret = mmu_interval_notifier_insert(&region->mni, current->mm,
+ ret = mmu_interval_notifier_insert(&region->mreg_mni, current->mm,
region->start_uaddr,
region->nr_pages << HV_HYP_PAGE_SHIFT,
&mshv_region_mni_ops);
if (ret)
return false;
- mutex_init(&region->mutex);
+ mutex_init(&region->mreg_mutex);
return true;
}
diff --git a/drivers/hv/mshv_root.h b/drivers/hv/mshv_root.h
index 3c1d88b36741..2a03ad3dc574 100644
--- a/drivers/hv/mshv_root.h
+++ b/drivers/hv/mshv_root.h
@@ -79,16 +79,16 @@ enum mshv_region_type {
struct mshv_mem_region {
struct hlist_node hnode;
- struct kref refcount;
+ struct kref mreg_refcount;
u64 nr_pages;
u64 start_gfn;
u64 start_uaddr;
u32 hv_map_flags;
struct mshv_partition *partition;
- enum mshv_region_type type;
- struct mmu_interval_notifier mni;
- struct mutex mutex; /* protects region pages remapping */
- struct page *pages[];
+ enum mshv_region_type mreg_type;
+ struct mmu_interval_notifier mreg_mni;
+ struct mutex mreg_mutex; /* protects region pages remapping */
+ struct page *mreg_pages[];
};
struct mshv_irq_ack_notifier {
diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
index 1134a82c7881..eff1b21461dc 100644
--- a/drivers/hv/mshv_root_main.c
+++ b/drivers/hv/mshv_root_main.c
@@ -657,7 +657,7 @@ static bool mshv_handle_gpa_intercept(struct mshv_vp *vp)
return false;
/* Only movable memory ranges are supported for GPA intercepts */
- if (region->type == MSHV_REGION_TYPE_MEM_MOVABLE)
+ if (region->mreg_type == MSHV_REGION_TYPE_MEM_MOVABLE)
ret = mshv_region_handle_gfn_fault(region, gfn);
else
ret = false;
@@ -1175,12 +1175,12 @@ static int mshv_partition_create_region(struct mshv_partition *partition,
return PTR_ERR(rg);
if (is_mmio)
- rg->type = MSHV_REGION_TYPE_MMIO;
+ rg->mreg_type = MSHV_REGION_TYPE_MMIO;
else if (mshv_partition_encrypted(partition) ||
!mshv_region_movable_init(rg))
- rg->type = MSHV_REGION_TYPE_MEM_PINNED;
+ rg->mreg_type = MSHV_REGION_TYPE_MEM_PINNED;
else
- rg->type = MSHV_REGION_TYPE_MEM_MOVABLE;
+ rg->mreg_type = MSHV_REGION_TYPE_MEM_MOVABLE;
rg->partition = partition;
@@ -1297,7 +1297,7 @@ mshv_map_user_memory(struct mshv_partition *partition,
if (ret)
return ret;
- switch (region->type) {
+ switch (region->mreg_type) {
case MSHV_REGION_TYPE_MEM_PINNED:
ret = mshv_prepare_pinned_region(region);
break;
--
2.51.2.vfs.0.1
^ permalink raw reply related
* Re: [PATCH 0/2] kbuild, uapi: Mark inner unions in packed structs as packed
From: Nathan Chancellor @ 2026-01-16 22:06 UTC (permalink / raw)
To: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Nathan Chancellor, Nick Desaulniers, Bill Wendling, Justin Stitt,
Hans de Goede, Arnd Bergmann, Greg Kroah-Hartman,
Thomas Weißschuh
Cc: linux-hyperv, linux-kernel, llvm, kernel test robot
In-Reply-To: <20260115-kbuild-alignment-vbox-v1-0-076aed1623ff@linutronix.de>
On Thu, 15 Jan 2026 08:35:43 +0100, Thomas Weißschuh wrote:
> The unpacked unions within a packed struct generate alignment warnings
> on clang for 32-bit ARM.
>
> With the recent changes to compile-test the UAPI headers in more cases,
> these warnings in combination with CONFIG_WERROR break the build.
>
> Fix the warnings.
>
> [...]
Applied to
https://git.kernel.org/pub/scm/linux/kernel/git/kbuild/linux.git kbuild-next
Thanks!
[1/2] hyper-v: Mark inner union in hv_kvp_exchg_msg_value as packed
https://git.kernel.org/kbuild/c/1e5271393d777
[2/2] virt: vbox: uapi: Mark inner unions in packed structs as packed
https://git.kernel.org/kbuild/c/c25d01e1c4f2d
Please look out for regression or issue reports or other follow up
comments, as they may result in the patch/series getting dropped or
reverted. Patches applied to an "unstable" branch are accepted pending
wider testing in -next and any post-commit review; they will generally
be moved to the main branch in a week if no issues are found.
Best regards,
--
Nathan Chancellor <nathan@kernel.org>
^ permalink raw reply
* [PATCH net-next v15 12/12] selftests/vsock: add tests for namespace deletion
From: Bobby Eshleman @ 2026-01-16 21:28 UTC (permalink / raw)
To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Shuah Khan, Long Li,
Jonathan Corbet
Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
linux-kselftest, berrange, Sargun Dhillon, linux-doc,
Bobby Eshleman, Bobby Eshleman
In-Reply-To: <20260116-vsock-vmtest-v15-0-bbfd1a668548@meta.com>
From: Bobby Eshleman <bobbyeshleman@meta.com>
Add tests that validate vsock sockets are resilient to deleting
namespaces. The vsock sockets should still function normally.
The function check_ns_delete_doesnt_break_connection() is added to
re-use the step-by-step logic of 1) setup connections, 2) delete ns,
3) check that the connections are still ok.
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v13:
- remove tests that change the mode after socket creation (this is not
supported behavior now and the immutability property is tested in other
tests)
- remove "change_mode" behavior of
check_ns_changes_dont_break_connection() and rename to
check_ns_delete_doesnt_break_connection() because we only need to test
namespace deletion (other tests confirm that the mode cannot change)
Changes in v11:
- remove pipefile (Stefano)
Changes in v9:
- more consistent shell style
- clarify -u usage comment for pipefile
---
tools/testing/selftests/vsock/vmtest.sh | 84 +++++++++++++++++++++++++++++++++
1 file changed, 84 insertions(+)
diff --git a/tools/testing/selftests/vsock/vmtest.sh b/tools/testing/selftests/vsock/vmtest.sh
index a9eaf37bc31b..dc8dbe74a6d0 100755
--- a/tools/testing/selftests/vsock/vmtest.sh
+++ b/tools/testing/selftests/vsock/vmtest.sh
@@ -68,6 +68,9 @@ readonly TEST_NAMES=(
ns_same_local_loopback_ok
ns_same_local_host_connect_to_local_vm_ok
ns_same_local_vm_connect_to_local_host_ok
+ ns_delete_vm_ok
+ ns_delete_host_ok
+ ns_delete_both_ok
)
readonly TEST_DESCS=(
# vm_server_host_client
@@ -135,6 +138,15 @@ readonly TEST_DESCS=(
# ns_same_local_vm_connect_to_local_host_ok
"Run vsock_test client in VM in a local ns with server in same ns."
+
+ # ns_delete_vm_ok
+ "Check that deleting the VM's namespace does not break the socket connection"
+
+ # ns_delete_host_ok
+ "Check that deleting the host's namespace does not break the socket connection"
+
+ # ns_delete_both_ok
+ "Check that deleting the VM and host's namespaces does not break the socket connection"
)
readonly USE_SHARED_VM=(
@@ -1287,6 +1299,78 @@ test_vm_loopback() {
return "${KSFT_PASS}"
}
+check_ns_delete_doesnt_break_connection() {
+ local pipefile pidfile outfile
+ local ns0="global0"
+ local ns1="global1"
+ local port=12345
+ local pids=()
+ local rc=0
+
+ init_namespaces
+
+ pidfile="$(create_pidfile)"
+ if ! vm_start "${pidfile}" "${ns0}"; then
+ return "${KSFT_FAIL}"
+ fi
+ vm_wait_for_ssh "${ns0}"
+
+ outfile=$(mktemp)
+ vm_ssh "${ns0}" -- \
+ socat VSOCK-LISTEN:"${port}",fork STDOUT > "${outfile}" 2>/dev/null &
+ pids+=($!)
+ vm_wait_for_listener "${ns0}" "${port}" "vsock"
+
+ # We use a pipe here so that we can echo into the pipe instead of using
+ # socat and a unix socket file. We just need a name for the pipe (not a
+ # regular file) so use -u.
+ pipefile=$(mktemp -u /tmp/vmtest_pipe_XXXX)
+ ip netns exec "${ns1}" \
+ socat PIPE:"${pipefile}" VSOCK-CONNECT:"${VSOCK_CID}":"${port}" &
+ pids+=($!)
+
+ timeout "${WAIT_PERIOD}" \
+ bash -c 'while [[ ! -e '"${pipefile}"' ]]; do sleep 1; done; exit 0'
+
+ if [[ "$1" == "vm" ]]; then
+ ip netns del "${ns0}"
+ elif [[ "$1" == "host" ]]; then
+ ip netns del "${ns1}"
+ elif [[ "$1" == "both" ]]; then
+ ip netns del "${ns0}"
+ ip netns del "${ns1}"
+ fi
+
+ echo "TEST" > "${pipefile}"
+
+ timeout "${WAIT_PERIOD}" \
+ bash -c 'while [[ ! -s '"${outfile}"' ]]; do sleep 1; done; exit 0'
+
+ if grep -q "TEST" "${outfile}"; then
+ rc="${KSFT_PASS}"
+ else
+ rc="${KSFT_FAIL}"
+ fi
+
+ terminate_pidfiles "${pidfile}"
+ terminate_pids "${pids[@]}"
+ rm -f "${outfile}" "${pipefile}"
+
+ return "${rc}"
+}
+
+test_ns_delete_vm_ok() {
+ check_ns_delete_doesnt_break_connection "vm"
+}
+
+test_ns_delete_host_ok() {
+ check_ns_delete_doesnt_break_connection "host"
+}
+
+test_ns_delete_both_ok() {
+ check_ns_delete_doesnt_break_connection "both"
+}
+
shared_vm_test() {
local tname
--
2.47.3
^ permalink raw reply related
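
For readers following the patch above: check_ns_delete_doesnt_break_connection()
waits for the pipe and the output file by polling under timeout(1). Below is a
minimal standalone sketch of that polling pattern. The name wait_for_nonempty()
is hypothetical and is not a helper in vmtest.sh; the sketch assumes timeout(1)
from coreutils is installed, and it passes the path as a positional parameter
instead of splicing it into the quoted command string.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of vmtest.sh): block until FILE exists and
# is non-empty, or until SECS seconds have elapsed.
# Returns 0 once the file has content, 124 (from timeout) otherwise.
wait_for_nonempty() {
	local file=$1
	local secs=$2

	# "_" fills $0 of the inner shell; the path arrives as "$1", so paths
	# containing spaces or quotes need no extra escaping.
	timeout "${secs}" \
		bash -c 'while [[ ! -s "$1" ]]; do sleep 1; done' _ "${file}"
}
```

A caller would typically treat a non-zero return as a test failure, for example
"wait_for_nonempty "${outfile}" "${WAIT_PERIOD}" || rc=${KSFT_FAIL}".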
* [PATCH net-next v15 11/12] selftests/vsock: add tests for host <-> vm connectivity with namespaces
From: Bobby Eshleman @ 2026-01-16 21:28 UTC (permalink / raw)
To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Shuah Khan, Long Li,
Jonathan Corbet
Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
linux-kselftest, berrange, Sargun Dhillon, linux-doc,
Bobby Eshleman, Bobby Eshleman
In-Reply-To: <20260116-vsock-vmtest-v15-0-bbfd1a668548@meta.com>
From: Bobby Eshleman <bobbyeshleman@meta.com>
Add tests to validate namespace correctness using vsock_test and socat.
The vsock_test tool is used to validate expected success tests, but
socat is used for expected failure tests. socat is used to ensure that
connections are rejected outright instead of failing due to some other
socket behavior (as tested in vsock_test). Additionally, socat is
already required for tunneling TCP traffic from vsock_test. Using only
one of the vsock_test tests like 'test_stream_client_close_client' would
have yielded a similar result, but doing so wouldn't remove the socat
dependency.
Additionally, check for the dependency socat. socat needs special
handling beyond just checking if it is on the path because it must be
compiled with support for both vsock and unix. The function
check_socat() checks that this support exists.
Add more padding to test name printf strings because the tests added in
this patch would otherwise overflow.
Add vm_dmesg_* helpers to encapsulate checking dmesg
for oops and warnings.
Add ability to pass extra args to host-side vsock_test so that tests
that cause false positives may be skipped with arg --skip.
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v12:
- add test skip (vsock_test test 29) when host_vsock_test() uses client
mode in a local namespace. Test 29 causes a false positive to trigger.
Changes in v11:
- add 'sleep "${WAIT_PERIOD}"' after any non-TCP socat LISTEN cmd
(Stefano)
- add host_wait_for_listener() after any socat TCP-LISTEN (Stefano)
- reuse vm_dmesg_{oops,warn}_count() inside vm_dmesg_check()
- fix copy-paste in test_ns_same_local_vm_connect_to_local_host_ok()
(Stefano)
Changes in v10:
- add vm_dmesg_start() and vm_dmesg_check()
Changes in v9:
- consistent variable quoting
---
tools/testing/selftests/vsock/vmtest.sh | 572 +++++++++++++++++++++++++++++++-
1 file changed, 568 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/vsock/vmtest.sh b/tools/testing/selftests/vsock/vmtest.sh
index 1bf537410ea6..a9eaf37bc31b 100755
--- a/tools/testing/selftests/vsock/vmtest.sh
+++ b/tools/testing/selftests/vsock/vmtest.sh
@@ -7,6 +7,7 @@
# * virtme-ng
# * busybox-static (used by virtme-ng)
# * qemu (used by virtme-ng)
+# * socat
#
# shellcheck disable=SC2317,SC2119
@@ -54,6 +55,19 @@ readonly TEST_NAMES=(
ns_local_same_cid_ok
ns_global_local_same_cid_ok
ns_local_global_same_cid_ok
+ ns_diff_global_host_connect_to_global_vm_ok
+ ns_diff_global_host_connect_to_local_vm_fails
+ ns_diff_global_vm_connect_to_global_host_ok
+ ns_diff_global_vm_connect_to_local_host_fails
+ ns_diff_local_host_connect_to_local_vm_fails
+ ns_diff_local_vm_connect_to_local_host_fails
+ ns_diff_global_to_local_loopback_local_fails
+ ns_diff_local_to_global_loopback_fails
+ ns_diff_local_to_local_loopback_fails
+ ns_diff_global_to_global_loopback_ok
+ ns_same_local_loopback_ok
+ ns_same_local_host_connect_to_local_vm_ok
+ ns_same_local_vm_connect_to_local_host_ok
)
readonly TEST_DESCS=(
# vm_server_host_client
@@ -82,6 +96,45 @@ readonly TEST_DESCS=(
# ns_local_global_same_cid_ok
"Check QEMU successfully starts one VM in a local ns and then another VM in a global ns with the same CID."
+
+ # ns_diff_global_host_connect_to_global_vm_ok
+ "Run vsock_test client in global ns with server in VM in another global ns."
+
+ # ns_diff_global_host_connect_to_local_vm_fails
+ "Run socat to test a process in a global ns fails to connect to a VM in a local ns."
+
+ # ns_diff_global_vm_connect_to_global_host_ok
+ "Run vsock_test client in VM in a global ns with server in another global ns."
+
+ # ns_diff_global_vm_connect_to_local_host_fails
+ "Run socat to test a VM in a global ns fails to connect to a host process in a local ns."
+
+ # ns_diff_local_host_connect_to_local_vm_fails
+ "Run socat to test a host process in a local ns fails to connect to a VM in another local ns."
+
+ # ns_diff_local_vm_connect_to_local_host_fails
+ "Run socat to test a VM in a local ns fails to connect to a host process in another local ns."
+
+ # ns_diff_global_to_local_loopback_local_fails
+ "Run socat to test a loopback vsock in a global ns fails to connect to a vsock in a local ns."
+
+ # ns_diff_local_to_global_loopback_fails
+ "Run socat to test a loopback vsock in a local ns fails to connect to a vsock in a global ns."
+
+ # ns_diff_local_to_local_loopback_fails
+ "Run socat to test a loopback vsock in a local ns fails to connect to a vsock in another local ns."
+
+ # ns_diff_global_to_global_loopback_ok
+ "Run socat to test a loopback vsock in a global ns successfully connects to a vsock in another global ns."
+
+ # ns_same_local_loopback_ok
+ "Run socat to test a loopback vsock in a local ns successfully connects to a vsock in the same ns."
+
+ # ns_same_local_host_connect_to_local_vm_ok
+ "Run vsock_test client in a local ns with server in VM in same ns."
+
+ # ns_same_local_vm_connect_to_local_host_ok
+ "Run vsock_test client in VM in a local ns with server in same ns."
)
readonly USE_SHARED_VM=(
@@ -112,7 +165,7 @@ usage() {
for ((i = 0; i < ${#TEST_NAMES[@]}; i++)); do
name=${TEST_NAMES[${i}]}
desc=${TEST_DESCS[${i}]}
- printf "\t%-35s%-35s\n" "${name}" "${desc}"
+ printf "\t%-55s%-35s\n" "${name}" "${desc}"
done
echo
@@ -222,7 +275,7 @@ check_args() {
}
check_deps() {
- for dep in vng ${QEMU} busybox pkill ssh ss; do
+ for dep in vng ${QEMU} busybox pkill ssh ss socat; do
if [[ ! -x $(command -v "${dep}") ]]; then
echo -e "skip: dependency ${dep} not found!\n"
exit "${KSFT_SKIP}"
@@ -273,6 +326,20 @@ check_vng() {
fi
}
+check_socat() {
+ local support_string
+
+ support_string="$(socat -V)"
+
+ if [[ "${support_string}" != *"WITH_VSOCK 1"* ]]; then
+ die "err: socat is missing vsock support"
+ fi
+
+ if [[ "${support_string}" != *"WITH_UNIX 1"* ]]; then
+ die "err: socat is missing unix support"
+ fi
+}
+
handle_build() {
if [[ ! "${BUILD}" -eq 1 ]]; then
return
@@ -321,6 +388,14 @@ terminate_pidfiles() {
done
}
+terminate_pids() {
+ local pid
+
+ for pid in "$@"; do
+ kill -SIGTERM "${pid}" &>/dev/null || :
+ done
+}
+
vm_start() {
local pidfile=$1
local ns=$2
@@ -459,6 +534,28 @@ vm_dmesg_warn_count() {
vm_ssh "${ns}" -- dmesg --level=warn 2>/dev/null | grep -c -i 'vsock'
}
+vm_dmesg_check() {
+ local pidfile=$1
+ local ns=$2
+ local oops_before=$3
+ local warn_before=$4
+ local oops_after warn_after
+
+ oops_after=$(vm_dmesg_oops_count "${ns}")
+ if [[ "${oops_after}" -gt "${oops_before}" ]]; then
+ echo "FAIL: kernel oops detected on vm in ns ${ns}" | log_host
+ return 1
+ fi
+
+ warn_after=$(vm_dmesg_warn_count "${ns}")
+ if [[ "${warn_after}" -gt "${warn_before}" ]]; then
+ echo "FAIL: kernel warning detected on vm in ns ${ns}" | log_host
+ return 1
+ fi
+
+ return 0
+}
+
vm_vsock_test() {
local ns=$1
local host=$2
@@ -502,6 +599,8 @@ host_vsock_test() {
local host=$2
local cid=$3
local port=$4
+ shift 4
+ local extra_args=("$@")
local rc
local cmd="${VSOCK_TEST}"
@@ -516,13 +615,15 @@ host_vsock_test() {
--mode=client \
--peer-cid="${cid}" \
--control-host="${host}" \
- --control-port="${port}" 2>&1 | log_host
+ --control-port="${port}" \
+ "${extra_args[@]}" 2>&1 | log_host
rc=$?
else
${cmd} \
--mode=server \
--peer-cid="${cid}" \
- --control-port="${port}" 2>&1 | log_host &
+ --control-port="${port}" \
+ "${extra_args[@]}" 2>&1 | log_host &
rc=$?
if [[ $rc -ne 0 ]]; then
@@ -593,6 +694,468 @@ test_ns_host_vsock_ns_mode_ok() {
return "${KSFT_PASS}"
}
+test_ns_diff_global_host_connect_to_global_vm_ok() {
+ local oops_before warn_before
+ local pids pid pidfile
+ local ns0 ns1 port
+ declare -a pids
+ local unixfile
+ ns0="global0"
+ ns1="global1"
+ port=1234
+ local rc
+
+ init_namespaces
+
+ pidfile="$(create_pidfile)"
+
+ if ! vm_start "${pidfile}" "${ns0}"; then
+ return "${KSFT_FAIL}"
+ fi
+
+ vm_wait_for_ssh "${ns0}"
+ oops_before=$(vm_dmesg_oops_count "${ns0}")
+ warn_before=$(vm_dmesg_warn_count "${ns0}")
+
+ unixfile=$(mktemp -u /tmp/XXXX.sock)
+ ip netns exec "${ns1}" \
+ socat TCP-LISTEN:"${TEST_HOST_PORT}",fork \
+ UNIX-CONNECT:"${unixfile}" &
+ pids+=($!)
+ host_wait_for_listener "${ns1}" "${TEST_HOST_PORT}" "tcp"
+
+ ip netns exec "${ns0}" socat UNIX-LISTEN:"${unixfile}",fork \
+ TCP-CONNECT:localhost:"${TEST_HOST_PORT}" &
+ pids+=($!)
+ host_wait_for_listener "${ns0}" "${unixfile}" "unix"
+
+ vm_vsock_test "${ns0}" "server" 2 "${TEST_GUEST_PORT}"
+ vm_wait_for_listener "${ns0}" "${TEST_GUEST_PORT}" "tcp"
+ host_vsock_test "${ns1}" "127.0.0.1" "${VSOCK_CID}" "${TEST_HOST_PORT}"
+ rc=$?
+
+ vm_dmesg_check "${pidfile}" "${ns0}" "${oops_before}" "${warn_before}"
+ dmesg_rc=$?
+
+ terminate_pids "${pids[@]}"
+ terminate_pidfiles "${pidfile}"
+
+ if [[ "${rc}" -ne 0 ]] || [[ "${dmesg_rc}" -ne 0 ]]; then
+ return "${KSFT_FAIL}"
+ fi
+
+ return "${KSFT_PASS}"
+}
+
+test_ns_diff_global_host_connect_to_local_vm_fails() {
+ local oops_before warn_before
+ local ns0="global0"
+ local ns1="local0"
+ local port=12345
+ local dmesg_rc
+ local pidfile
+ local result
+ local pid
+
+ init_namespaces
+
+ outfile=$(mktemp)
+
+ pidfile="$(create_pidfile)"
+ if ! vm_start "${pidfile}" "${ns1}"; then
+ log_host "failed to start vm (cid=${VSOCK_CID}, ns=${ns1})"
+ return "${KSFT_FAIL}"
+ fi
+
+ vm_wait_for_ssh "${ns1}"
+ oops_before=$(vm_dmesg_oops_count "${ns1}")
+ warn_before=$(vm_dmesg_warn_count "${ns1}")
+
+ vm_ssh "${ns1}" -- socat VSOCK-LISTEN:"${port}" STDOUT > "${outfile}" &
+ vm_wait_for_listener "${ns1}" "${port}" "vsock"
+ echo TEST | ip netns exec "${ns0}" \
+ socat STDIN VSOCK-CONNECT:"${VSOCK_CID}":"${port}" 2>/dev/null
+
+ vm_dmesg_check "${pidfile}" "${ns1}" "${oops_before}" "${warn_before}"
+ dmesg_rc=$?
+
+ terminate_pidfiles "${pidfile}"
+ result=$(cat "${outfile}")
+ rm -f "${outfile}"
+
+ if [[ "${result}" == "TEST" ]] || [[ "${dmesg_rc}" -ne 0 ]]; then
+ return "${KSFT_FAIL}"
+ fi
+
+ return "${KSFT_PASS}"
+}
+
+test_ns_diff_global_vm_connect_to_global_host_ok() {
+ local oops_before warn_before
+ local ns0="global0"
+ local ns1="global1"
+ local port=12345
+ local unixfile
+ local dmesg_rc
+ local pidfile
+ local pids
+ local rc
+
+ init_namespaces
+
+ declare -a pids
+
+ log_host "Setup socat bridge from ns ${ns0} to ns ${ns1} over port ${port}"
+
+ unixfile=$(mktemp -u /tmp/XXXX.sock)
+
+ ip netns exec "${ns0}" \
+ socat TCP-LISTEN:"${port}" UNIX-CONNECT:"${unixfile}" &
+ pids+=($!)
+ host_wait_for_listener "${ns0}" "${port}" "tcp"
+
+ ip netns exec "${ns1}" \
+ socat UNIX-LISTEN:"${unixfile}" TCP-CONNECT:127.0.0.1:"${port}" &
+ pids+=($!)
+ host_wait_for_listener "${ns1}" "${unixfile}" "unix"
+
+ log_host "Launching ${VSOCK_TEST} in ns ${ns1}"
+ host_vsock_test "${ns1}" "server" "${VSOCK_CID}" "${port}"
+
+ pidfile="$(create_pidfile)"
+ if ! vm_start "${pidfile}" "${ns0}"; then
+ log_host "failed to start vm (cid=${VSOCK_CID}, ns=${ns0})"
+ terminate_pids "${pids[@]}"
+ rm -f "${unixfile}"
+ return "${KSFT_FAIL}"
+ fi
+
+ vm_wait_for_ssh "${ns0}"
+
+ oops_before=$(vm_dmesg_oops_count "${ns0}")
+ warn_before=$(vm_dmesg_warn_count "${ns0}")
+
+ vm_vsock_test "${ns0}" "10.0.2.2" 2 "${port}"
+ rc=$?
+
+ vm_dmesg_check "${pidfile}" "${ns0}" "${oops_before}" "${warn_before}"
+ dmesg_rc=$?
+
+ terminate_pidfiles "${pidfile}"
+ terminate_pids "${pids[@]}"
+ rm -f "${unixfile}"
+
+ if [[ "${rc}" -ne 0 ]] || [[ "${dmesg_rc}" -ne 0 ]]; then
+ return "${KSFT_FAIL}"
+ fi
+
+ return "${KSFT_PASS}"
+
+}
+
+test_ns_diff_global_vm_connect_to_local_host_fails() {
+ local ns0="global0"
+ local ns1="local0"
+ local port=12345
+ local oops_before warn_before
+ local dmesg_rc
+ local pidfile
+ local result
+ local pid
+
+ init_namespaces
+
+ log_host "Launching socat in ns ${ns1}"
+ outfile=$(mktemp)
+
+ ip netns exec "${ns1}" socat VSOCK-LISTEN:"${port}" STDOUT &> "${outfile}" &
+ pid=$!
+ host_wait_for_listener "${ns1}" "${port}" "vsock"
+
+ pidfile="$(create_pidfile)"
+ if ! vm_start "${pidfile}" "${ns0}"; then
+ log_host "failed to start vm (cid=${VSOCK_CID}, ns=${ns0})"
+ terminate_pids "${pid}"
+ rm -f "${outfile}"
+ return "${KSFT_FAIL}"
+ fi
+
+ vm_wait_for_ssh "${ns0}"
+
+ oops_before=$(vm_dmesg_oops_count "${ns0}")
+ warn_before=$(vm_dmesg_warn_count "${ns0}")
+
+ vm_ssh "${ns0}" -- \
+ bash -c "echo TEST | socat STDIN VSOCK-CONNECT:2:${port}" 2>&1 | log_guest
+
+ vm_dmesg_check "${pidfile}" "${ns0}" "${oops_before}" "${warn_before}"
+ dmesg_rc=$?
+
+ terminate_pidfiles "${pidfile}"
+ terminate_pids "${pid}"
+
+ result=$(cat "${outfile}")
+ rm -f "${outfile}"
+
+ if [[ "${result}" != TEST ]] && [[ "${dmesg_rc}" -eq 0 ]]; then
+ return "${KSFT_PASS}"
+ fi
+
+ return "${KSFT_FAIL}"
+}
+
+test_ns_diff_local_host_connect_to_local_vm_fails() {
+ local ns0="local0"
+ local ns1="local1"
+ local port=12345
+ local oops_before warn_before
+ local dmesg_rc
+ local pidfile
+ local result
+ local pid
+
+ init_namespaces
+
+ outfile=$(mktemp)
+
+ pidfile="$(create_pidfile)"
+ if ! vm_start "${pidfile}" "${ns1}"; then
+ log_host "failed to start vm (cid=${VSOCK_CID}, ns=${ns1})"
+ return "${KSFT_FAIL}"
+ fi
+
+ vm_wait_for_ssh "${ns1}"
+ oops_before=$(vm_dmesg_oops_count "${ns1}")
+ warn_before=$(vm_dmesg_warn_count "${ns1}")
+
+ vm_ssh "${ns1}" -- socat VSOCK-LISTEN:"${port}" STDOUT > "${outfile}" &
+ vm_wait_for_listener "${ns1}" "${port}" "vsock"
+
+ echo TEST | ip netns exec "${ns0}" \
+ socat STDIN VSOCK-CONNECT:"${VSOCK_CID}":"${port}" 2>/dev/null
+
+ vm_dmesg_check "${pidfile}" "${ns1}" "${oops_before}" "${warn_before}"
+ dmesg_rc=$?
+
+ terminate_pidfiles "${pidfile}"
+
+ result=$(cat "${outfile}")
+ rm -f "${outfile}"
+
+ if [[ "${result}" != TEST ]] && [[ "${dmesg_rc}" -eq 0 ]]; then
+ return "${KSFT_PASS}"
+ fi
+
+ return "${KSFT_FAIL}"
+}
+
+test_ns_diff_local_vm_connect_to_local_host_fails() {
+ local oops_before warn_before
+ local ns0="local0"
+ local ns1="local1"
+ local port=12345
+ local dmesg_rc
+ local pidfile
+ local result
+ local pid
+
+ init_namespaces
+
+ log_host "Launching socat in ns ${ns1}"
+ outfile=$(mktemp)
+ ip netns exec "${ns1}" socat VSOCK-LISTEN:"${port}" STDOUT &> "${outfile}" &
+ pid=$!
+ host_wait_for_listener "${ns1}" "${port}" "vsock"
+
+ pidfile="$(create_pidfile)"
+ if ! vm_start "${pidfile}" "${ns0}"; then
+ log_host "failed to start vm (cid=${VSOCK_CID}, ns=${ns0})"
+ rm -f "${outfile}"
+ return "${KSFT_FAIL}"
+ fi
+
+ vm_wait_for_ssh "${ns0}"
+ oops_before=$(vm_dmesg_oops_count "${ns0}")
+ warn_before=$(vm_dmesg_warn_count "${ns0}")
+
+ vm_ssh "${ns0}" -- \
+ bash -c "echo TEST | socat STDIN VSOCK-CONNECT:2:${port}" 2>&1 | log_guest
+
+ vm_dmesg_check "${pidfile}" "${ns0}" "${oops_before}" "${warn_before}"
+ dmesg_rc=$?
+
+ terminate_pidfiles "${pidfile}"
+ terminate_pids "${pid}"
+
+ result=$(cat "${outfile}")
+ rm -f "${outfile}"
+
+ if [[ "${result}" != TEST ]] && [[ "${dmesg_rc}" -eq 0 ]]; then
+ return "${KSFT_PASS}"
+ fi
+
+ return "${KSFT_FAIL}"
+}
+
+__test_loopback_two_netns() {
+ local ns0=$1
+ local ns1=$2
+ local port=12345
+ local result
+ local pid
+
+ modprobe vsock_loopback &> /dev/null || :
+
+ log_host "Launching socat in ns ${ns1}"
+ outfile=$(mktemp)
+
+ ip netns exec "${ns1}" socat VSOCK-LISTEN:"${port}" STDOUT > "${outfile}" 2>/dev/null &
+ pid=$!
+ host_wait_for_listener "${ns1}" "${port}" "vsock"
+
+ log_host "Launching socat in ns ${ns0}"
+ echo TEST | ip netns exec "${ns0}" socat STDIN VSOCK-CONNECT:1:"${port}" 2>/dev/null
+ terminate_pids "${pid}"
+
+ result=$(cat "${outfile}")
+ rm -f "${outfile}"
+
+ if [[ "${result}" == TEST ]]; then
+ return 0
+ fi
+
+ return 1
+}
+
+test_ns_diff_global_to_local_loopback_local_fails() {
+ init_namespaces
+
+ if ! __test_loopback_two_netns "global0" "local0"; then
+ return "${KSFT_PASS}"
+ fi
+
+ return "${KSFT_FAIL}"
+}
+
+test_ns_diff_local_to_global_loopback_fails() {
+ init_namespaces
+
+ if ! __test_loopback_two_netns "local0" "global0"; then
+ return "${KSFT_PASS}"
+ fi
+
+ return "${KSFT_FAIL}"
+}
+
+test_ns_diff_local_to_local_loopback_fails() {
+ init_namespaces
+
+ if ! __test_loopback_two_netns "local0" "local1"; then
+ return "${KSFT_PASS}"
+ fi
+
+ return "${KSFT_FAIL}"
+}
+
+test_ns_diff_global_to_global_loopback_ok() {
+ init_namespaces
+
+ if __test_loopback_two_netns "global0" "global1"; then
+ return "${KSFT_PASS}"
+ fi
+
+ return "${KSFT_FAIL}"
+}
+
+test_ns_same_local_loopback_ok() {
+ init_namespaces
+
+ if __test_loopback_two_netns "local0" "local0"; then
+ return "${KSFT_PASS}"
+ fi
+
+ return "${KSFT_FAIL}"
+}
+
+test_ns_same_local_host_connect_to_local_vm_ok() {
+ local oops_before warn_before
+ local ns="local0"
+ local port=1234
+ local dmesg_rc
+ local pidfile
+ local rc
+
+ init_namespaces
+
+ pidfile="$(create_pidfile)"
+
+ if ! vm_start "${pidfile}" "${ns}"; then
+ return "${KSFT_FAIL}"
+ fi
+
+ vm_wait_for_ssh "${ns}"
+ oops_before=$(vm_dmesg_oops_count "${ns}")
+ warn_before=$(vm_dmesg_warn_count "${ns}")
+
+ vm_vsock_test "${ns}" "server" 2 "${TEST_GUEST_PORT}"
+
+ # Skip test 29 (transport release use-after-free): This test attempts
+ # binding both G2H and H2G CIDs. Because virtio-vsock (G2H) doesn't
+ # support local namespaces the test will fail when
+ # transport_g2h->stream_allow() returns false. This edge case only
+ # happens for vsock_test in client mode on the host in a local
+ # namespace. This is a false positive.
+ host_vsock_test "${ns}" "127.0.0.1" "${VSOCK_CID}" "${TEST_HOST_PORT}" --skip=29
+ rc=$?
+
+ vm_dmesg_check "${pidfile}" "${ns}" "${oops_before}" "${warn_before}"
+ dmesg_rc=$?
+
+ terminate_pidfiles "${pidfile}"
+
+ if [[ "${rc}" -ne 0 ]] || [[ "${dmesg_rc}" -ne 0 ]]; then
+ return "${KSFT_FAIL}"
+ fi
+
+ return "${KSFT_PASS}"
+}
+
+test_ns_same_local_vm_connect_to_local_host_ok() {
+ local oops_before warn_before
+ local ns="local0"
+ local port=1234
+ local dmesg_rc
+ local pidfile
+ local rc
+
+ init_namespaces
+
+ pidfile="$(create_pidfile)"
+
+ if ! vm_start "${pidfile}" "${ns}"; then
+ return "${KSFT_FAIL}"
+ fi
+
+ vm_wait_for_ssh "${ns}"
+ oops_before=$(vm_dmesg_oops_count "${ns}")
+ warn_before=$(vm_dmesg_warn_count "${ns}")
+
+ host_vsock_test "${ns}" "server" "${VSOCK_CID}" "${port}"
+ vm_vsock_test "${ns}" "10.0.2.2" 2 "${port}"
+ rc=$?
+
+ vm_dmesg_check "${pidfile}" "${ns}" "${oops_before}" "${warn_before}"
+ dmesg_rc=$?
+
+ terminate_pidfiles "${pidfile}"
+
+ if [[ "${rc}" -ne 0 ]] || [[ "${dmesg_rc}" -ne 0 ]]; then
+ return "${KSFT_FAIL}"
+ fi
+
+ return "${KSFT_PASS}"
+}
+
namespaces_can_boot_same_cid() {
local ns0=$1
local ns1=$2
@@ -882,6 +1445,7 @@ fi
check_args "${ARGS[@]}"
check_deps
check_vng
+check_socat
handle_build
echo "1..${#ARGS[@]}"
--
2.47.3
^ permalink raw reply related
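
As a side note on check_socat() in the patch above: the check is a substring
match against the build-feature text printed by "socat -V". A minimal sketch of
the same idea follows, with the text passed in as an argument so that it can be
tried without socat installed. The name socat_has_feature() is hypothetical, and
the sample string used with it is only an assumed shape of the "socat -V" output.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of vmtest.sh): succeed if the given
# feature macro is reported as enabled in the supplied version text.
socat_has_feature() {
	local support_string=$1
	local feature=$2

	# Same substring test as check_socat(): look for "<FEATURE> 1".
	[[ "${support_string}" == *"${feature} 1"* ]]
}

# Assumed shape of the relevant lines; real output contains many more.
sample=$'#define WITH_UNIX 1\n#define WITH_VSOCK 1'

if socat_has_feature "${sample}" "WITH_VSOCK"; then
	echo "vsock support present"
fi
```

In the script itself the first argument would be "$(socat -V)", as in
check_socat().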
* [PATCH net-next v15 10/12] selftests/vsock: add namespace tests for CID collisions
From: Bobby Eshleman @ 2026-01-16 21:28 UTC (permalink / raw)
To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Shuah Khan, Long Li,
Jonathan Corbet
Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
linux-kselftest, berrange, Sargun Dhillon, linux-doc,
Bobby Eshleman, Bobby Eshleman
In-Reply-To: <20260116-vsock-vmtest-v15-0-bbfd1a668548@meta.com>
From: Bobby Eshleman <bobbyeshleman@meta.com>
Add tests to verify CID collision rules across different vsock namespace
modes.
1. Two VMs with the same CID cannot start in different global namespaces
(ns_global_same_cid_fails)
2. Two VMs with the same CID can start in different local namespaces
(ns_local_same_cid_ok)
3. VMs with the same CID can coexist when one is in a global namespace
and another is in a local namespace (ns_global_local_same_cid_ok and
ns_local_global_same_cid_ok)
The tests ns_global_local_same_cid_ok and ns_local_global_same_cid_ok
make sure that ordering does not matter.
The tests use a shared helper function namespaces_can_boot_same_cid()
that attempts to start two VMs with identical CIDs in the specified
namespaces and verifies whether VM initialization failed or succeeded.
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v11:
- check vm_start() rc in namespaces_can_boot_same_cid() (Stefano)
- fix ns_local_same_cid_ok() to use local0 and local1 instead of reusing
local0 twice. This check should pass, ensuring local namespaces do not
collide (Stefano)
---
tools/testing/selftests/vsock/vmtest.sh | 78 +++++++++++++++++++++++++++++++++
1 file changed, 78 insertions(+)
diff --git a/tools/testing/selftests/vsock/vmtest.sh b/tools/testing/selftests/vsock/vmtest.sh
index 38785a102236..1bf537410ea6 100755
--- a/tools/testing/selftests/vsock/vmtest.sh
+++ b/tools/testing/selftests/vsock/vmtest.sh
@@ -50,6 +50,10 @@ readonly TEST_NAMES=(
vm_loopback
ns_host_vsock_ns_mode_ok
ns_host_vsock_child_ns_mode_ok
+ ns_global_same_cid_fails
+ ns_local_same_cid_ok
+ ns_global_local_same_cid_ok
+ ns_local_global_same_cid_ok
)
readonly TEST_DESCS=(
# vm_server_host_client
@@ -66,6 +70,18 @@ readonly TEST_DESCS=(
# ns_host_vsock_child_ns_mode_ok
"Check /proc/sys/net/vsock/ns_mode is read-only and child_ns_mode is writable."
+
+ # ns_global_same_cid_fails
+ "Check QEMU fails to start two VMs with same CID in two different global namespaces."
+
+ # ns_local_same_cid_ok
+ "Check QEMU successfully starts two VMs with same CID in two different local namespaces."
+
+ # ns_global_local_same_cid_ok
+ "Check QEMU successfully starts one VM in a global ns and then another VM in a local ns with the same CID."
+
+ # ns_local_global_same_cid_ok
+ "Check QEMU successfully starts one VM in a local ns and then another VM in a global ns with the same CID."
)
readonly USE_SHARED_VM=(
@@ -577,6 +593,68 @@ test_ns_host_vsock_ns_mode_ok() {
return "${KSFT_PASS}"
}
+namespaces_can_boot_same_cid() {
+ local ns0=$1
+ local ns1=$2
+ local pidfile1 pidfile2
+ local rc
+
+ pidfile1="$(create_pidfile)"
+
+ # The first VM should be able to start. If it can't then we have
+ # problems and need to return non-zero.
+ if ! vm_start "${pidfile1}" "${ns0}"; then
+ return 1
+ fi
+
+ pidfile2="$(create_pidfile)"
+ vm_start "${pidfile2}" "${ns1}"
+ rc=$?
+ terminate_pidfiles "${pidfile1}" "${pidfile2}"
+
+ return "${rc}"
+}
+
+test_ns_global_same_cid_fails() {
+ init_namespaces
+
+ if namespaces_can_boot_same_cid "global0" "global1"; then
+ return "${KSFT_FAIL}"
+ fi
+
+ return "${KSFT_PASS}"
+}
+
+test_ns_local_global_same_cid_ok() {
+ init_namespaces
+
+ if namespaces_can_boot_same_cid "local0" "global0"; then
+ return "${KSFT_PASS}"
+ fi
+
+ return "${KSFT_FAIL}"
+}
+
+test_ns_global_local_same_cid_ok() {
+ init_namespaces
+
+ if namespaces_can_boot_same_cid "global0" "local0"; then
+ return "${KSFT_PASS}"
+ fi
+
+ return "${KSFT_FAIL}"
+}
+
+test_ns_local_same_cid_ok() {
+ init_namespaces
+
+ if namespaces_can_boot_same_cid "local0" "local1"; then
+ return "${KSFT_PASS}"
+ fi
+
+ return "${KSFT_FAIL}"
+}
+
test_ns_host_vsock_child_ns_mode_ok() {
local orig_mode
local rc
--
2.47.3
* [PATCH net-next v15 09/12] selftests/vsock: add tests for proc sys vsock ns_mode
From: Bobby Eshleman @ 2026-01-16 21:28 UTC (permalink / raw)
To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Shuah Khan, Long Li,
Jonathan Corbet
Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
linux-kselftest, berrange, Sargun Dhillon, linux-doc,
Bobby Eshleman, Bobby Eshleman
In-Reply-To: <20260116-vsock-vmtest-v15-0-bbfd1a668548@meta.com>
From: Bobby Eshleman <bobbyeshleman@meta.com>
Add tests for the /proc/sys/net/vsock/{ns_mode,child_ns_mode}
interfaces. Namely, that they accept/report "global" and "local" strings
and enforce their access policies.
Start a convention of commenting the test name over the test
description. Add test name comments over test descriptions that existed
before this convention.
Add a check_netns() function that checks if the test requires namespaces
and if the current kernel supports namespaces. Skip tests that require
namespaces if the system does not have namespace support.
This patch is the first to add tests that do *not* re-use the same
shared VM. For that reason, it adds a run_ns_tests() function to run
these tests and filter out the shared VM tests.
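
As a rough sketch of the two checks described above (the ns_ prefix test and the filtering of shared VM tests), assuming only the test names from this patch. All function and variable names with the _sketch suffix are hypothetical and do not exist in vmtest.sh; the real check_netns() additionally looks at /proc/self/ns.

```bash
#!/usr/bin/env bash
# Illustrative only. Mirrors USE_SHARED_VM from the patch.
USE_SHARED_VM_SKETCH=(vm_server_host_client vm_client_host_server vm_loopback)

# Succeeds if the test name carries the ns_ prefix.
needs_netns_sketch() {
	[[ "$1" =~ ^ns_ ]]
}

# Succeeds if the test is one of the shared VM tests.
is_shared_vm_test_sketch() {
	local name

	for name in "${USE_SHARED_VM_SKETCH[@]}"; do
		if [[ "${name}" == "$1" ]]; then
			return 0
		fi
	done

	return 1
}

# Prints the subset of the given tests that an ns-only runner would execute.
select_ns_tests_sketch() {
	local arg

	for arg in "$@"; do
		if is_shared_vm_test_sketch "${arg}"; then
			continue
		fi
		echo "${arg}"
	done
}
```

For example, select_ns_tests_sketch vm_loopback ns_host_vsock_ns_mode_ok prints only ns_host_vsock_ns_mode_ok.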
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v13:
- remove write-once test ns_host_vsock_ns_mode_write_once_ok to reflect
removing the write-once policy
- add child_ns_mode test test_ns_host_vsock_child_ns_mode_ok
- modify test_ns_host_vsock_ns_mode_ok() to check that the correct mode
was inherited from child_ns_mode
Changes in v12:
- remove ns_vm_local_mode_rejected test, due to dropping that constraint
Changes in v11:
- Document ns_ prefix above TEST_NAMES (Stefano)
Changes in v10:
- Remove extraneous add_namespaces/del_namespaces calls.
- Rename run_tests() to run_ns_tests() since it is designed to only
run ns tests.
Changes in v9:
- add test ns_vm_local_mode_rejected to check that guests cannot use
local mode
---
tools/testing/selftests/vsock/vmtest.sh | 140 +++++++++++++++++++++++++++++++-
1 file changed, 138 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/vsock/vmtest.sh b/tools/testing/selftests/vsock/vmtest.sh
index 0e681d4c3a15..38785a102236 100755
--- a/tools/testing/selftests/vsock/vmtest.sh
+++ b/tools/testing/selftests/vsock/vmtest.sh
@@ -41,14 +41,38 @@ readonly KERNEL_CMDLINE="\
virtme.ssh virtme_ssh_channel=tcp virtme_ssh_user=$USER \
"
readonly LOG=$(mktemp /tmp/vsock_vmtest_XXXX.log)
-readonly TEST_NAMES=(vm_server_host_client vm_client_host_server vm_loopback)
+
+# Namespace tests must use the ns_ prefix. This is checked in check_netns() and
+# is used to determine if a test needs namespace setup before test execution.
+readonly TEST_NAMES=(
+ vm_server_host_client
+ vm_client_host_server
+ vm_loopback
+ ns_host_vsock_ns_mode_ok
+ ns_host_vsock_child_ns_mode_ok
+)
readonly TEST_DESCS=(
+ # vm_server_host_client
"Run vsock_test in server mode on the VM and in client mode on the host."
+
+ # vm_client_host_server
"Run vsock_test in client mode on the VM and in server mode on the host."
+
+ # vm_loopback
"Run vsock_test using the loopback transport in the VM."
+
+ # ns_host_vsock_ns_mode_ok
+ "Check /proc/sys/net/vsock/ns_mode strings on the host."
+
+ # ns_host_vsock_child_ns_mode_ok
+ "Check /proc/sys/net/vsock/ns_mode is read-only and child_ns_mode is writable."
)
-readonly USE_SHARED_VM=(vm_server_host_client vm_client_host_server vm_loopback)
+readonly USE_SHARED_VM=(
+ vm_server_host_client
+ vm_client_host_server
+ vm_loopback
+)
readonly NS_MODES=("local" "global")
VERBOSE=0
@@ -196,6 +220,20 @@ check_deps() {
fi
}
+check_netns() {
+ local tname=$1
+
+ # If the test requires NS support, check if NS support exists
+ # using /proc/self/ns
+ if [[ "${tname}" =~ ^ns_ ]] &&
+ [[ ! -e /proc/self/ns ]]; then
+ log_host "No NS support detected for test ${tname}"
+ return 1
+ fi
+
+ return 0
+}
+
check_vng() {
local tested_versions
local version
@@ -519,6 +557,54 @@ log_guest() {
LOG_PREFIX=guest log "$@"
}
+ns_get_mode() {
+ local ns=$1
+
+ ip netns exec "${ns}" cat /proc/sys/net/vsock/ns_mode 2>/dev/null
+}
+
+test_ns_host_vsock_ns_mode_ok() {
+ for mode in "${NS_MODES[@]}"; do
+ local actual
+
+ actual=$(ns_get_mode "${mode}0")
+ if [[ "${actual}" != "${mode}" ]]; then
+ log_host "expected mode ${mode}, got ${actual}"
+ return "${KSFT_FAIL}"
+ fi
+ done
+
+ return "${KSFT_PASS}"
+}
+
+test_ns_host_vsock_child_ns_mode_ok() {
+ local orig_mode
+ local rc
+
+ orig_mode=$(cat /proc/sys/net/vsock/child_ns_mode)
+
+ rc="${KSFT_PASS}"
+ for mode in "${NS_MODES[@]}"; do
+ local ns="${mode}0"
+
+ if echo "${mode}" 2>/dev/null > /proc/sys/net/vsock/ns_mode; then
+ log_host "ns_mode should be read-only but write succeeded"
+ rc="${KSFT_FAIL}"
+ continue
+ fi
+
+ if ! echo "${mode}" > /proc/sys/net/vsock/child_ns_mode; then
+ log_host "child_ns_mode should be writable to ${mode}"
+ rc="${KSFT_FAIL}"
+ continue
+ fi
+ done
+
+ echo "${orig_mode}" > /proc/sys/net/vsock/child_ns_mode
+
+ return "${rc}"
+}
+
test_vm_server_host_client() {
if ! vm_vsock_test "init_ns" "server" 2 "${TEST_GUEST_PORT}"; then
return "${KSFT_FAIL}"
@@ -592,6 +678,11 @@ run_shared_vm_tests() {
continue
fi
+ if ! check_netns "${arg}"; then
+ check_result "${KSFT_SKIP}" "${arg}"
+ continue
+ fi
+
run_shared_vm_test "${arg}"
check_result "$?" "${arg}"
done
@@ -645,6 +736,49 @@ run_shared_vm_test() {
return "${rc}"
}
+run_ns_tests() {
+ for arg in "${ARGS[@]}"; do
+ if shared_vm_test "${arg}"; then
+ continue
+ fi
+
+ if ! check_netns "${arg}"; then
+ check_result "${KSFT_SKIP}" "${arg}"
+ continue
+ fi
+
+ add_namespaces
+
+ name=$(echo "${arg}" | awk '{ print $1 }')
+ log_host "Executing test_${name}"
+
+ host_oops_before=$(dmesg 2>/dev/null | grep -c -i 'Oops')
+ host_warn_before=$(dmesg --level=warn 2>/dev/null | grep -c -i 'vsock')
+ eval test_"${name}"
+ rc=$?
+
+ host_oops_after=$(dmesg 2>/dev/null | grep -c -i 'Oops')
+ if [[ "${host_oops_after}" -gt "${host_oops_before}" ]]; then
+ echo "FAIL: kernel oops detected on host" | log_host
+ check_result "${KSFT_FAIL}" "${name}"
+ del_namespaces
+ continue
+ fi
+
+ host_warn_after=$(dmesg --level=warn 2>/dev/null | grep -c -i 'vsock')
+ if [[ "${host_warn_after}" -gt "${host_warn_before}" ]]; then
+ echo "FAIL: kernel warning detected on host" | log_host
+ check_result "${KSFT_FAIL}" "${name}"
+ del_namespaces
+ continue
+ fi
+
+ check_result "${rc}" "${name}"
+
+ del_namespaces
+ done
+}
+
BUILD=0
QEMU="qemu-system-$(uname -m)"
@@ -690,6 +824,8 @@ if shared_vm_tests_requested "${ARGS[@]}"; then
terminate_pidfiles "${pidfile}"
fi
+run_ns_tests "${ARGS[@]}"
+
echo "SUMMARY: PASS=${cnt_pass} SKIP=${cnt_skip} FAIL=${cnt_fail}"
echo "Log: ${LOG}"
--
2.47.3
* [PATCH net-next v15 08/12] selftests/vsock: use ss to wait for listeners instead of /proc/net
From: Bobby Eshleman @ 2026-01-16 21:28 UTC (permalink / raw)
To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Shuah Khan, Long Li,
Jonathan Corbet
Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
linux-kselftest, berrange, Sargun Dhillon, linux-doc,
Bobby Eshleman, Bobby Eshleman
In-Reply-To: <20260116-vsock-vmtest-v15-0-bbfd1a668548@meta.com>
From: Bobby Eshleman <bobbyeshleman@meta.com>
Replace /proc/net parsing with ss(8) for detecting listening sockets in
wait_for_listener() functions and add support for TCP, VSOCK, and Unix
socket protocols.
The previous implementation parsed /proc/net/tcp using awk to detect
listening sockets, but this approach could not support vsock because
vsock does not export socket information to /proc/net/.
Instead, use ss so that we can detect listeners on tcp, vsock, and unix.
The protocol parameter is now required for all wait_for_listener family
functions (wait_for_listener, vm_wait_for_listener,
host_wait_for_listener) to explicitly specify which socket type to wait
for.
ss is added to the dependency check in check_deps().
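
The protocol-to-ss mapping can be sketched as a single lookup. This is a hedged illustration, not the code in the patch: ss_listen_args_sketch() is a hypothetical name, and only the ss options that the diff itself uses (--listening, --tcp, --vsock, --unix, --numeric) appear here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the ss options used to list listeners
# for a given protocol, or fails for an unknown protocol.
ss_listen_args_sketch() {
	case "$1" in
	tcp)
		echo "--listening --tcp --numeric"
		;;
	vsock)
		echo "--listening --vsock --numeric"
		;;
	unix)
		# For unix sockets the "port" is a socket path, so
		# numeric output is not requested.
		echo "--listening --unix"
		;;
	*)
		echo "Unknown protocol: $1" >&2
		return 1
		;;
	esac
}
```

Unlike wait_for_listener() in the patch, which breaks out of its polling loop on an unknown protocol, this sketch returns non-zero so that a caller could tell the two cases apart. That is a design choice of the sketch only.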
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
tools/testing/selftests/vsock/vmtest.sh | 47 +++++++++++++++++++++------------
1 file changed, 30 insertions(+), 17 deletions(-)
diff --git a/tools/testing/selftests/vsock/vmtest.sh b/tools/testing/selftests/vsock/vmtest.sh
index 4b5929ffc9eb..0e681d4c3a15 100755
--- a/tools/testing/selftests/vsock/vmtest.sh
+++ b/tools/testing/selftests/vsock/vmtest.sh
@@ -182,7 +182,7 @@ check_args() {
}
check_deps() {
- for dep in vng ${QEMU} busybox pkill ssh; do
+ for dep in vng ${QEMU} busybox pkill ssh ss; do
if [[ ! -x $(command -v "${dep}") ]]; then
echo -e "skip: dependency ${dep} not found!\n"
exit "${KSFT_SKIP}"
@@ -337,21 +337,32 @@ wait_for_listener()
local port=$1
local interval=$2
local max_intervals=$3
- local protocol=tcp
- local pattern
+ local protocol=$4
local i
- pattern=":$(printf "%04X" "${port}") "
-
- # for tcp protocol additionally check the socket state
- [ "${protocol}" = "tcp" ] && pattern="${pattern}0A"
-
for i in $(seq "${max_intervals}"); do
- if awk -v pattern="${pattern}" \
- 'BEGIN {rc=1} $2" "$4 ~ pattern {rc=0} END {exit rc}' \
- /proc/net/"${protocol}"*; then
+ case "${protocol}" in
+ tcp)
+ if ss --listening --tcp --numeric | grep -q ":${port} "; then
+ break
+ fi
+ ;;
+ vsock)
+ if ss --listening --vsock --numeric | grep -q ":${port} "; then
+ break
+ fi
+ ;;
+ unix)
+ # For unix sockets, port is actually the socket path
+ if ss --listening --unix | grep -q "${port}"; then
+ break
+ fi
+ ;;
+ *)
+ echo "Unknown protocol: ${protocol}" >&2
break
- fi
+ ;;
+ esac
sleep "${interval}"
done
}
@@ -359,23 +370,25 @@ wait_for_listener()
vm_wait_for_listener() {
local ns=$1
local port=$2
+ local protocol=$3
vm_ssh "${ns}" <<EOF
$(declare -f wait_for_listener)
-wait_for_listener ${port} ${WAIT_PERIOD} ${WAIT_PERIOD_MAX}
+wait_for_listener ${port} ${WAIT_PERIOD} ${WAIT_PERIOD_MAX} ${protocol}
EOF
}
host_wait_for_listener() {
local ns=$1
local port=$2
+ local protocol=$3
if [[ "${ns}" == "init_ns" ]]; then
- wait_for_listener "${port}" "${WAIT_PERIOD}" "${WAIT_PERIOD_MAX}"
+ wait_for_listener "${port}" "${WAIT_PERIOD}" "${WAIT_PERIOD_MAX}" "${protocol}"
else
ip netns exec "${ns}" bash <<-EOF
$(declare -f wait_for_listener)
- wait_for_listener ${port} ${WAIT_PERIOD} ${WAIT_PERIOD_MAX}
+ wait_for_listener ${port} ${WAIT_PERIOD} ${WAIT_PERIOD_MAX} ${protocol}
EOF
fi
}
@@ -422,7 +435,7 @@ vm_vsock_test() {
return $rc
fi
- vm_wait_for_listener "${ns}" "${port}"
+ vm_wait_for_listener "${ns}" "${port}" "tcp"
rc=$?
fi
set +o pipefail
@@ -463,7 +476,7 @@ host_vsock_test() {
return $rc
fi
- host_wait_for_listener "${ns}" "${port}"
+ host_wait_for_listener "${ns}" "${port}" "tcp"
rc=$?
fi
set +o pipefail
--
2.47.3
* [PATCH net-next v15 07/12] selftests/vsock: add vm_dmesg_{warn,oops}_count() helpers
From: Bobby Eshleman @ 2026-01-16 21:28 UTC (permalink / raw)
To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Shuah Khan, Long Li,
Jonathan Corbet
Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
linux-kselftest, berrange, Sargun Dhillon, linux-doc,
Bobby Eshleman, Bobby Eshleman
In-Reply-To: <20260116-vsock-vmtest-v15-0-bbfd1a668548@meta.com>
From: Bobby Eshleman <bobbyeshleman@meta.com>
These functions are reused by the VM tests to collect and compare dmesg
warnings and oops counts. The future VM-specific tests use them heavily.
This patch relies on vm_ssh() already supporting namespaces.
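
The before/after comparison that these helpers support can be sketched without a VM by counting matches on stdin. The names count_matches_sketch() and new_matches_detected_sketch() are hypothetical, and any log text fed to them in an example is made-up input rather than real dmesg output.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count case-insensitive matches of a pattern on stdin.
# grep -c still prints 0 when nothing matches but exits non-zero, so the
# '|| true' keeps the helper usable under 'set -e'.
count_matches_sketch() {
	grep -c -i -- "$1" || true
}

# Succeeds if the count grew between the two samples, i.e. new oops or
# warning lines appeared while the test was running.
new_matches_detected_sketch() {
	local before=$1
	local after=$2

	[[ "${after}" -gt "${before}" ]]
}
```

A test runner would take one count before the test and one after, then treat new_matches_detected_sketch "${before}" "${after}" succeeding as a failure condition.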
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v11:
- break these out into an earlier patch so that they can be used
directly in new patches (instead of causing churn by adding this
later)
---
tools/testing/selftests/vsock/vmtest.sh | 19 +++++++++++++++----
1 file changed, 15 insertions(+), 4 deletions(-)
diff --git a/tools/testing/selftests/vsock/vmtest.sh b/tools/testing/selftests/vsock/vmtest.sh
index 1d03acb62347..4b5929ffc9eb 100755
--- a/tools/testing/selftests/vsock/vmtest.sh
+++ b/tools/testing/selftests/vsock/vmtest.sh
@@ -380,6 +380,17 @@ host_wait_for_listener() {
fi
}
+vm_dmesg_oops_count() {
+ local ns=$1
+
+ vm_ssh "${ns}" -- dmesg 2>/dev/null | grep -c -i 'Oops'
+}
+
+vm_dmesg_warn_count() {
+ local ns=$1
+
+ vm_ssh "${ns}" -- dmesg --level=warn 2>/dev/null | grep -c -i 'vsock'
+}
vm_vsock_test() {
local ns=$1
@@ -587,8 +598,8 @@ run_shared_vm_test() {
host_oops_cnt_before=$(dmesg | grep -c -i 'Oops')
host_warn_cnt_before=$(dmesg --level=warn | grep -c -i 'vsock')
- vm_oops_cnt_before=$(vm_ssh -- dmesg | grep -c -i 'Oops')
- vm_warn_cnt_before=$(vm_ssh -- dmesg --level=warn | grep -c -i 'vsock')
+ vm_oops_cnt_before=$(vm_dmesg_oops_count "init_ns")
+ vm_warn_cnt_before=$(vm_dmesg_warn_count "init_ns")
name=$(echo "${1}" | awk '{ print $1 }')
eval test_"${name}"
@@ -606,13 +617,13 @@ run_shared_vm_test() {
rc=$KSFT_FAIL
fi
- vm_oops_cnt_after=$(vm_ssh -- dmesg | grep -i 'Oops' | wc -l)
+ vm_oops_cnt_after=$(vm_dmesg_oops_count "init_ns")
if [[ ${vm_oops_cnt_after} -gt ${vm_oops_cnt_before} ]]; then
echo "FAIL: kernel oops detected on vm" | log_host
rc=$KSFT_FAIL
fi
- vm_warn_cnt_after=$(vm_ssh -- dmesg --level=warn | grep -c -i 'vsock')
+ vm_warn_cnt_after=$(vm_dmesg_warn_count "init_ns")
if [[ ${vm_warn_cnt_after} -gt ${vm_warn_cnt_before} ]]; then
echo "FAIL: kernel warning detected on vm" | log_host
rc=$KSFT_FAIL
--
2.47.3
* [PATCH net-next v15 06/12] selftests/vsock: prepare vm management helpers for namespaces
From: Bobby Eshleman @ 2026-01-16 21:28 UTC (permalink / raw)
To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Shuah Khan, Long Li,
Jonathan Corbet
Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
linux-kselftest, berrange, Sargun Dhillon, linux-doc,
Bobby Eshleman, Bobby Eshleman
In-Reply-To: <20260116-vsock-vmtest-v15-0-bbfd1a668548@meta.com>
From: Bobby Eshleman <bobbyeshleman@meta.com>
Add namespace support to vm management, ssh helpers, and vsock_test
wrapper functions. This enables running VMs and test helpers in specific
namespaces, which is required for upcoming namespace isolation tests.
The functions still work correctly within the init ns, though the caller
must now pass "init_ns" explicitly.
No functional changes for existing tests. All have been updated to pass
"init_ns" explicitly.
Affected functions (such as vm_start() and vm_ssh()) now wrap their
commands with 'ip netns exec' when executing commands in non-init
namespaces.
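
A minimal sketch of the wrapping rule described above, assuming "init_ns" is the sentinel for the initial namespace as in this patch. ns_exec_prefix_sketch() is a hypothetical name used only for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: prints the command prefix for a namespace.
# "init_ns" means "run directly", anything else is wrapped with
# 'ip netns exec <ns>'.
ns_exec_prefix_sketch() {
	if [[ "$1" == "init_ns" ]]; then
		echo ""
	else
		echo "ip netns exec $1"
	fi
}
```

The patch stores the prefix in a variable and expands it unquoted in front of the command, so an empty prefix disappears and a non-empty one is split into separate words.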
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
tools/testing/selftests/vsock/vmtest.sh | 93 +++++++++++++++++++++++----------
1 file changed, 65 insertions(+), 28 deletions(-)
diff --git a/tools/testing/selftests/vsock/vmtest.sh b/tools/testing/selftests/vsock/vmtest.sh
index c2bdc293b94c..1d03acb62347 100755
--- a/tools/testing/selftests/vsock/vmtest.sh
+++ b/tools/testing/selftests/vsock/vmtest.sh
@@ -135,7 +135,18 @@ del_namespaces() {
}
vm_ssh() {
- ssh -q -o UserKnownHostsFile=/dev/null -p ${SSH_HOST_PORT} localhost "$@"
+ local ns_exec
+
+ if [[ "${1}" == init_ns ]]; then
+ ns_exec=""
+ else
+ ns_exec="ip netns exec ${1}"
+ fi
+
+ shift
+
+ ${ns_exec} ssh -q -o UserKnownHostsFile=/dev/null -p "${SSH_HOST_PORT}" localhost "$@"
+
return $?
}
@@ -258,10 +269,12 @@ terminate_pidfiles() {
vm_start() {
local pidfile=$1
+ local ns=$2
local logfile=/dev/null
local verbose_opt=""
local kernel_opt=""
local qemu_opts=""
+ local ns_exec=""
local qemu
qemu=$(command -v "${QEMU}")
@@ -282,7 +295,11 @@ vm_start() {
kernel_opt="${KERNEL_CHECKOUT}"
fi
- vng \
+ if [[ "${ns}" != "init_ns" ]]; then
+ ns_exec="ip netns exec ${ns}"
+ fi
+
+ ${ns_exec} vng \
--run \
${kernel_opt} \
${verbose_opt} \
@@ -297,6 +314,7 @@ vm_start() {
}
vm_wait_for_ssh() {
+ local ns=$1
local i
i=0
@@ -304,7 +322,8 @@ vm_wait_for_ssh() {
if [[ ${i} -gt ${WAIT_PERIOD_MAX} ]]; then
die "Timed out waiting for guest ssh"
fi
- if vm_ssh -- true; then
+
+ if vm_ssh "${ns}" -- true; then
break
fi
i=$(( i + 1 ))
@@ -338,30 +357,41 @@ wait_for_listener()
}
vm_wait_for_listener() {
- local port=$1
+ local ns=$1
+ local port=$2
- vm_ssh <<EOF
+ vm_ssh "${ns}" <<EOF
$(declare -f wait_for_listener)
wait_for_listener ${port} ${WAIT_PERIOD} ${WAIT_PERIOD_MAX}
EOF
}
host_wait_for_listener() {
- local port=$1
+ local ns=$1
+ local port=$2
- wait_for_listener "${port}" "${WAIT_PERIOD}" "${WAIT_PERIOD_MAX}"
+ if [[ "${ns}" == "init_ns" ]]; then
+ wait_for_listener "${port}" "${WAIT_PERIOD}" "${WAIT_PERIOD_MAX}"
+ else
+ ip netns exec "${ns}" bash <<-EOF
+ $(declare -f wait_for_listener)
+ wait_for_listener ${port} ${WAIT_PERIOD} ${WAIT_PERIOD_MAX}
+ EOF
+ fi
}
+
vm_vsock_test() {
- local host=$1
- local cid=$2
- local port=$3
+ local ns=$1
+ local host=$2
+ local cid=$3
+ local port=$4
local rc
# log output and use pipefail to respect vsock_test errors
set -o pipefail
if [[ "${host}" != server ]]; then
- vm_ssh -- "${VSOCK_TEST}" \
+ vm_ssh "${ns}" -- "${VSOCK_TEST}" \
--mode=client \
--control-host="${host}" \
--peer-cid="${cid}" \
@@ -369,7 +399,7 @@ vm_vsock_test() {
2>&1 | log_guest
rc=$?
else
- vm_ssh -- "${VSOCK_TEST}" \
+ vm_ssh "${ns}" -- "${VSOCK_TEST}" \
--mode=server \
--peer-cid="${cid}" \
--control-port="${port}" \
@@ -381,7 +411,7 @@ vm_vsock_test() {
return $rc
fi
- vm_wait_for_listener "${port}"
+ vm_wait_for_listener "${ns}" "${port}"
rc=$?
fi
set +o pipefail
@@ -390,22 +420,28 @@ vm_vsock_test() {
}
host_vsock_test() {
- local host=$1
- local cid=$2
- local port=$3
+ local ns=$1
+ local host=$2
+ local cid=$3
+ local port=$4
local rc
+ local cmd="${VSOCK_TEST}"
+ if [[ "${ns}" != "init_ns" ]]; then
+ cmd="ip netns exec ${ns} ${cmd}"
+ fi
+
# log output and use pipefail to respect vsock_test errors
set -o pipefail
if [[ "${host}" != server ]]; then
- ${VSOCK_TEST} \
+ ${cmd} \
--mode=client \
--peer-cid="${cid}" \
--control-host="${host}" \
--control-port="${port}" 2>&1 | log_host
rc=$?
else
- ${VSOCK_TEST} \
+ ${cmd} \
--mode=server \
--peer-cid="${cid}" \
--control-port="${port}" 2>&1 | log_host &
@@ -416,7 +452,7 @@ host_vsock_test() {
return $rc
fi
- host_wait_for_listener "${port}"
+ host_wait_for_listener "${ns}" "${port}"
rc=$?
fi
set +o pipefail
@@ -460,11 +496,11 @@ log_guest() {
}
test_vm_server_host_client() {
- if ! vm_vsock_test "server" 2 "${TEST_GUEST_PORT}"; then
+ if ! vm_vsock_test "init_ns" "server" 2 "${TEST_GUEST_PORT}"; then
return "${KSFT_FAIL}"
fi
- if ! host_vsock_test "127.0.0.1" "${VSOCK_CID}" "${TEST_HOST_PORT}"; then
+ if ! host_vsock_test "init_ns" "127.0.0.1" "${VSOCK_CID}" "${TEST_HOST_PORT}"; then
return "${KSFT_FAIL}"
fi
@@ -472,11 +508,11 @@ test_vm_server_host_client() {
}
test_vm_client_host_server() {
- if ! host_vsock_test "server" "${VSOCK_CID}" "${TEST_HOST_PORT_LISTENER}"; then
+ if ! host_vsock_test "init_ns" "server" "${VSOCK_CID}" "${TEST_HOST_PORT_LISTENER}"; then
return "${KSFT_FAIL}"
fi
- if ! vm_vsock_test "10.0.2.2" 2 "${TEST_HOST_PORT_LISTENER}"; then
+ if ! vm_vsock_test "init_ns" "10.0.2.2" 2 "${TEST_HOST_PORT_LISTENER}"; then
return "${KSFT_FAIL}"
fi
@@ -486,13 +522,14 @@ test_vm_client_host_server() {
test_vm_loopback() {
local port=60000 # non-forwarded local port
- vm_ssh -- modprobe vsock_loopback &> /dev/null || :
+ vm_ssh "init_ns" -- modprobe vsock_loopback &> /dev/null || :
- if ! vm_vsock_test "server" 1 "${port}"; then
+ if ! vm_vsock_test "init_ns" "server" 1 "${port}"; then
return "${KSFT_FAIL}"
fi
- if ! vm_vsock_test "127.0.0.1" 1 "${port}"; then
+
+ if ! vm_vsock_test "init_ns" "127.0.0.1" 1 "${port}"; then
return "${KSFT_FAIL}"
fi
@@ -621,8 +658,8 @@ cnt_total=0
if shared_vm_tests_requested "${ARGS[@]}"; then
log_host "Booting up VM"
pidfile="$(create_pidfile)"
- vm_start "${pidfile}"
- vm_wait_for_ssh
+ vm_start "${pidfile}" "init_ns"
+ vm_wait_for_ssh "init_ns"
log_host "VM booted up"
run_shared_vm_tests "${ARGS[@]}"
--
2.47.3
* [PATCH net-next v15 05/12] selftests/vsock: add namespace helpers to vmtest.sh
From: Bobby Eshleman @ 2026-01-16 21:28 UTC (permalink / raw)
To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Shuah Khan, Long Li,
Jonathan Corbet
Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
linux-kselftest, berrange, Sargun Dhillon, linux-doc,
Bobby Eshleman, Bobby Eshleman
In-Reply-To: <20260116-vsock-vmtest-v15-0-bbfd1a668548@meta.com>
From: Bobby Eshleman <bobbyeshleman@meta.com>
Add functions for initializing namespaces with the different vsock NS
modes. Callers can use add_namespaces() and del_namespaces() to create
namespaces global0, global1, local0, and local1.
The add_namespaces() function initializes global0, local0, etc. with
their respective vsock NS mode by toggling child_ns_mode before creating
the namespace.
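
The save, toggle, create, restore sequence can be sketched without root privileges by replacing the sysctl and 'ip netns add' with stand-ins. Everything named *_SKETCH or *_sketch or *_stub below is hypothetical: a shell variable stands in for /proc/sys/net/vsock/child_ns_mode, and the stub only records which mode was active when each namespace would have been created.

```bash
#!/usr/bin/env bash
# Stand-in for /proc/sys/net/vsock/child_ns_mode (assumed initial value).
CHILD_NS_MODE_SKETCH="global"
# Records "<namespace>:<mode at creation time>" for each created namespace.
CREATED_SKETCH=()

# Stand-in for 'ip netns add': a new namespace inherits the current
# child_ns_mode, so record that pairing.
create_ns_stub() {
	CREATED_SKETCH+=("$1:${CHILD_NS_MODE_SKETCH}")
}

add_namespaces_sketch() {
	local orig_mode mode

	# Save the original mode so it can be restored afterwards.
	orig_mode="${CHILD_NS_MODE_SKETCH}"

	for mode in local global; do
		CHILD_NS_MODE_SKETCH="${mode}"
		create_ns_stub "${mode}0"
		create_ns_stub "${mode}1"
	done

	CHILD_NS_MODE_SKETCH="${orig_mode}"
}

add_namespaces_sketch
```

After the call, each recorded namespace carries the mode that matches its name, and the stand-in mode is back at its original value.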
Remove namespaces upon exiting the program in cleanup(). This is
unlikely to be needed for a healthy run, but it is useful for tests that
are manually killed mid-test.
This patch is in preparation for later namespace tests.
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v13:
- initialize namespaces to use the child_ns_mode mechanism
- remove setting modes from init_namespaces() function (this function
only sets up the lo device now)
- remove ns_set_mode(ns) because ns_mode is no longer mutable
---
tools/testing/selftests/vsock/vmtest.sh | 32 ++++++++++++++++++++++++++++++++
1 file changed, 32 insertions(+)
diff --git a/tools/testing/selftests/vsock/vmtest.sh b/tools/testing/selftests/vsock/vmtest.sh
index c7b270dd77a9..c2bdc293b94c 100755
--- a/tools/testing/selftests/vsock/vmtest.sh
+++ b/tools/testing/selftests/vsock/vmtest.sh
@@ -49,6 +49,7 @@ readonly TEST_DESCS=(
)
readonly USE_SHARED_VM=(vm_server_host_client vm_client_host_server vm_loopback)
+readonly NS_MODES=("local" "global")
VERBOSE=0
@@ -103,6 +104,36 @@ check_result() {
fi
}
+add_namespaces() {
+ local orig_mode
+ orig_mode=$(cat /proc/sys/net/vsock/child_ns_mode)
+
+ for mode in "${NS_MODES[@]}"; do
+ echo "${mode}" > /proc/sys/net/vsock/child_ns_mode
+ ip netns add "${mode}0" 2>/dev/null
+ ip netns add "${mode}1" 2>/dev/null
+ done
+
+ echo "${orig_mode}" > /proc/sys/net/vsock/child_ns_mode
+}
+
+init_namespaces() {
+ for mode in "${NS_MODES[@]}"; do
+ # we need lo for qemu port forwarding
+ ip netns exec "${mode}0" ip link set dev lo up
+ ip netns exec "${mode}1" ip link set dev lo up
+ done
+}
+
+del_namespaces() {
+ for mode in "${NS_MODES[@]}"; do
+ ip netns del "${mode}0" &>/dev/null
+ ip netns del "${mode}1" &>/dev/null
+ log_host "removed ns ${mode}0"
+ log_host "removed ns ${mode}1"
+ done
+}
+
vm_ssh() {
ssh -q -o UserKnownHostsFile=/dev/null -p ${SSH_HOST_PORT} localhost "$@"
return $?
@@ -110,6 +141,7 @@ vm_ssh() {
cleanup() {
terminate_pidfiles "${!PIDFILES[@]}"
+ del_namespaces
}
check_args() {
--
2.47.3
* [PATCH net-next v15 04/12] selftests/vsock: increase timeout to 1200
From: Bobby Eshleman @ 2026-01-16 21:28 UTC (permalink / raw)
To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Shuah Khan, Long Li,
Jonathan Corbet
Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
linux-kselftest, berrange, Sargun Dhillon, linux-doc,
Bobby Eshleman, Bobby Eshleman
In-Reply-To: <20260116-vsock-vmtest-v15-0-bbfd1a668548@meta.com>
From: Bobby Eshleman <bobbyeshleman@meta.com>
Increase the timeout from 300s to 1200s. On a modern bare metal server
my last run showed the new set of tests taking ~400s. Multiply by an
(arbitrary) factor of three to account for slower/nested runners.
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
tools/testing/selftests/vsock/settings | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/tools/testing/selftests/vsock/settings b/tools/testing/selftests/vsock/settings
index 694d70710ff0..79b65bdf05db 100644
--- a/tools/testing/selftests/vsock/settings
+++ b/tools/testing/selftests/vsock/settings
@@ -1 +1 @@
-timeout=300
+timeout=1200
--
2.47.3
* [PATCH net-next v15 03/12] vsock: add netns support to virtio transports
From: Bobby Eshleman @ 2026-01-16 21:28 UTC (permalink / raw)
To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Shuah Khan, Long Li,
Jonathan Corbet
Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
linux-kselftest, berrange, Sargun Dhillon, linux-doc,
Bobby Eshleman, Bobby Eshleman
In-Reply-To: <20260116-vsock-vmtest-v15-0-bbfd1a668548@meta.com>
From: Bobby Eshleman <bobbyeshleman@meta.com>
Add netns support to loopback and vhost. Keep netns disabled for
virtio-vsock, but add necessary changes to comply with common API
updates.
This is the patch in the series where vhost-vsock namespaces actually
come online.
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v15:
- add vsock_net_mode_global() (Stefano)
Changes in v14:
- fixed merge conflicts in drivers/vhost/vsock.c
Changes in v13:
- do not store or pass the mode around now that net->vsock.mode is
immutable
- move virtio_transport_stream_allow() into virtio_transport.c
because virtio is the only caller now
Changes in v12:
- change seqpacket_allow() and stream_allow() to return true for
loopback and vhost (Stefano)
Changes in v11:
- reorder with the skb ownership patch for loopback (Stefano)
- toggle vhost_transport_supports_local_mode() to true
Changes in v10:
- Splitting patches complicates the series with meaningless placeholder
values that eventually get replaced anyway, so to avoid that, this
patch combines them into one. Links to previous patches here:
- Link: https://lore.kernel.org/all/20251111-vsock-vmtest-v9-3-852787a37bed@meta.com/
- Link: https://lore.kernel.org/all/20251111-vsock-vmtest-v9-6-852787a37bed@meta.com/
- Link: https://lore.kernel.org/all/20251111-vsock-vmtest-v9-7-852787a37bed@meta.com/
- remove placeholder values (Stefano)
- update comment describing net/net_mode for
virtio_transport_reset_no_sock()
---
drivers/vhost/vsock.c | 38 ++++++++++++++++-------
include/linux/virtio_vsock.h | 5 +--
net/vmw_vsock/virtio_transport.c | 13 ++++++--
net/vmw_vsock/virtio_transport_common.c | 54 +++++++++++++++++++--------------
net/vmw_vsock/vsock_loopback.c | 14 +++++++--
5 files changed, 84 insertions(+), 40 deletions(-)
diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 647ded6f6ea5..488d7fa6e4ec 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -48,6 +48,8 @@ static DEFINE_READ_MOSTLY_HASHTABLE(vhost_vsock_hash, 8);
struct vhost_vsock {
struct vhost_dev dev;
struct vhost_virtqueue vqs[2];
+ struct net *net;
+ netns_tracker ns_tracker;
/* Link to global vhost_vsock_hash, writes use vhost_vsock_mutex */
struct hlist_node hash;
@@ -69,7 +71,7 @@ static u32 vhost_transport_get_local_cid(void)
/* Callers must be in an RCU read section or hold the vhost_vsock_mutex.
* The return value can only be dereferenced while within the section.
*/
-static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
+static struct vhost_vsock *vhost_vsock_get(u32 guest_cid, struct net *net)
{
struct vhost_vsock *vsock;
@@ -81,9 +83,9 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
if (other_cid == 0)
continue;
- if (other_cid == guest_cid)
+ if (other_cid == guest_cid &&
+ vsock_net_check_mode(net, vsock->net))
return vsock;
-
}
return NULL;
@@ -272,7 +274,7 @@ static void vhost_transport_send_pkt_work(struct vhost_work *work)
}
static int
-vhost_transport_send_pkt(struct sk_buff *skb)
+vhost_transport_send_pkt(struct sk_buff *skb, struct net *net)
{
struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
struct vhost_vsock *vsock;
@@ -281,7 +283,7 @@ vhost_transport_send_pkt(struct sk_buff *skb)
rcu_read_lock();
/* Find the vhost_vsock according to guest context id */
- vsock = vhost_vsock_get(le64_to_cpu(hdr->dst_cid));
+ vsock = vhost_vsock_get(le64_to_cpu(hdr->dst_cid), net);
if (!vsock) {
rcu_read_unlock();
kfree_skb(skb);
@@ -308,7 +310,8 @@ vhost_transport_cancel_pkt(struct vsock_sock *vsk)
rcu_read_lock();
/* Find the vhost_vsock according to guest context id */
- vsock = vhost_vsock_get(vsk->remote_addr.svm_cid);
+ vsock = vhost_vsock_get(vsk->remote_addr.svm_cid,
+ sock_net(sk_vsock(vsk)));
if (!vsock)
goto out;
@@ -410,6 +413,12 @@ static bool vhost_transport_msgzerocopy_allow(void)
static bool vhost_transport_seqpacket_allow(struct vsock_sock *vsk,
u32 remote_cid);
+static bool
+vhost_transport_stream_allow(struct vsock_sock *vsk, u32 cid, u32 port)
+{
+ return true;
+}
+
static struct virtio_transport vhost_transport = {
.transport = {
.module = THIS_MODULE,
@@ -434,7 +443,7 @@ static struct virtio_transport vhost_transport = {
.stream_has_space = virtio_transport_stream_has_space,
.stream_rcvhiwat = virtio_transport_stream_rcvhiwat,
.stream_is_active = virtio_transport_stream_is_active,
- .stream_allow = virtio_transport_stream_allow,
+ .stream_allow = vhost_transport_stream_allow,
.seqpacket_dequeue = virtio_transport_seqpacket_dequeue,
.seqpacket_enqueue = virtio_transport_seqpacket_enqueue,
@@ -467,11 +476,12 @@ static struct virtio_transport vhost_transport = {
static bool vhost_transport_seqpacket_allow(struct vsock_sock *vsk,
u32 remote_cid)
{
+ struct net *net = sock_net(sk_vsock(vsk));
struct vhost_vsock *vsock;
bool seqpacket_allow = false;
rcu_read_lock();
- vsock = vhost_vsock_get(remote_cid);
+ vsock = vhost_vsock_get(remote_cid, net);
if (vsock)
seqpacket_allow = vsock->seqpacket_allow;
@@ -542,7 +552,8 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work *work)
if (le64_to_cpu(hdr->src_cid) == vsock->guest_cid &&
le64_to_cpu(hdr->dst_cid) ==
vhost_transport_get_local_cid())
- virtio_transport_recv_pkt(&vhost_transport, skb);
+ virtio_transport_recv_pkt(&vhost_transport, skb,
+ vsock->net);
else
kfree_skb(skb);
@@ -659,6 +670,7 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
{
struct vhost_virtqueue **vqs;
struct vhost_vsock *vsock;
+ struct net *net;
int ret;
/* This struct is large and allocation could fail, fall back to vmalloc
@@ -674,6 +686,9 @@ static int vhost_vsock_dev_open(struct inode *inode, struct file *file)
goto out;
}
+ net = current->nsproxy->net_ns;
+ vsock->net = get_net_track(net, &vsock->ns_tracker, GFP_KERNEL);
+
vsock->guest_cid = 0; /* no CID assigned yet */
vsock->seqpacket_allow = false;
@@ -715,7 +730,7 @@ static void vhost_vsock_reset_orphans(struct sock *sk)
rcu_read_lock();
/* If the peer is still valid, no need to reset connection */
- if (vhost_vsock_get(vsk->remote_addr.svm_cid)) {
+ if (vhost_vsock_get(vsk->remote_addr.svm_cid, sock_net(sk))) {
rcu_read_unlock();
return;
}
@@ -764,6 +779,7 @@ static int vhost_vsock_dev_release(struct inode *inode, struct file *file)
virtio_vsock_skb_queue_purge(&vsock->send_pkt_queue);
vhost_dev_cleanup(&vsock->dev);
+ put_net_track(vsock->net, &vsock->ns_tracker);
kfree(vsock->dev.vqs);
vhost_vsock_free(vsock);
return 0;
@@ -790,7 +806,7 @@ static int vhost_vsock_set_cid(struct vhost_vsock *vsock, u64 guest_cid)
/* Refuse if CID is already in use */
mutex_lock(&vhost_vsock_mutex);
- other = vhost_vsock_get(guest_cid);
+ other = vhost_vsock_get(guest_cid, vsock->net);
if (other && other != vsock) {
mutex_unlock(&vhost_vsock_mutex);
return -EADDRINUSE;
diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
index 1845e8d4f78d..f91704731057 100644
--- a/include/linux/virtio_vsock.h
+++ b/include/linux/virtio_vsock.h
@@ -173,6 +173,7 @@ struct virtio_vsock_pkt_info {
u32 remote_cid, remote_port;
struct vsock_sock *vsk;
struct msghdr *msg;
+ struct net *net;
u32 pkt_len;
u16 type;
u16 op;
@@ -185,7 +186,7 @@ struct virtio_transport {
struct vsock_transport transport;
/* Takes ownership of the packet */
- int (*send_pkt)(struct sk_buff *skb);
+ int (*send_pkt)(struct sk_buff *skb, struct net *net);
/* Used in MSG_ZEROCOPY mode. Checks, that provided data
* (number of buffers) could be transmitted with zerocopy
@@ -280,7 +281,7 @@ virtio_transport_dgram_enqueue(struct vsock_sock *vsk,
void virtio_transport_destruct(struct vsock_sock *vsk);
void virtio_transport_recv_pkt(struct virtio_transport *t,
- struct sk_buff *skb);
+ struct sk_buff *skb, struct net *net);
void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct sk_buff *skb);
u32 virtio_transport_get_credit(struct virtio_vsock_sock *vvs, u32 wanted);
void virtio_transport_put_credit(struct virtio_vsock_sock *vvs, u32 credit);
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index f0a9e51118f3..3f7ea2db9bd7 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -231,7 +231,7 @@ static int virtio_transport_send_skb_fast_path(struct virtio_vsock *vsock, struc
}
static int
-virtio_transport_send_pkt(struct sk_buff *skb)
+virtio_transport_send_pkt(struct sk_buff *skb, struct net *net)
{
struct virtio_vsock_hdr *hdr;
struct virtio_vsock *vsock;
@@ -536,6 +536,11 @@ static bool virtio_transport_msgzerocopy_allow(void)
return true;
}
+bool virtio_transport_stream_allow(struct vsock_sock *vsk, u32 cid, u32 port)
+{
+ return vsock_net_mode_global(vsk);
+}
+
static bool virtio_transport_seqpacket_allow(struct vsock_sock *vsk,
u32 remote_cid);
@@ -665,7 +670,11 @@ static void virtio_transport_rx_work(struct work_struct *work)
virtio_vsock_skb_put(skb, payload_len);
virtio_transport_deliver_tap_pkt(skb);
- virtio_transport_recv_pkt(&virtio_transport, skb);
+
+ /* Force virtio-transport into global mode since it
+ * does not yet support local-mode namespacing.
+ */
+ virtio_transport_recv_pkt(&virtio_transport, skb, NULL);
}
} while (!virtqueue_enable_cb(vq));
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index 718be9f33274..c126aa235091 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -413,7 +413,7 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
virtio_transport_inc_tx_pkt(vvs, skb);
- ret = t_ops->send_pkt(skb);
+ ret = t_ops->send_pkt(skb, info->net);
if (ret < 0)
break;
@@ -527,6 +527,7 @@ static int virtio_transport_send_credit_update(struct vsock_sock *vsk)
struct virtio_vsock_pkt_info info = {
.op = VIRTIO_VSOCK_OP_CREDIT_UPDATE,
.vsk = vsk,
+ .net = sock_net(sk_vsock(vsk)),
};
return virtio_transport_send_pkt_info(vsk, &info);
@@ -1043,12 +1044,6 @@ bool virtio_transport_stream_is_active(struct vsock_sock *vsk)
}
EXPORT_SYMBOL_GPL(virtio_transport_stream_is_active);
-bool virtio_transport_stream_allow(struct vsock_sock *vsk, u32 cid, u32 port)
-{
- return vsock_net_mode(sock_net(sk_vsock(vsk))) == VSOCK_NET_MODE_GLOBAL;
-}
-EXPORT_SYMBOL_GPL(virtio_transport_stream_allow);
-
int virtio_transport_dgram_bind(struct vsock_sock *vsk,
struct sockaddr_vm *addr)
{
@@ -1067,6 +1062,7 @@ int virtio_transport_connect(struct vsock_sock *vsk)
struct virtio_vsock_pkt_info info = {
.op = VIRTIO_VSOCK_OP_REQUEST,
.vsk = vsk,
+ .net = sock_net(sk_vsock(vsk)),
};
return virtio_transport_send_pkt_info(vsk, &info);
@@ -1082,6 +1078,7 @@ int virtio_transport_shutdown(struct vsock_sock *vsk, int mode)
(mode & SEND_SHUTDOWN ?
VIRTIO_VSOCK_SHUTDOWN_SEND : 0),
.vsk = vsk,
+ .net = sock_net(sk_vsock(vsk)),
};
return virtio_transport_send_pkt_info(vsk, &info);
@@ -1108,6 +1105,7 @@ virtio_transport_stream_enqueue(struct vsock_sock *vsk,
.msg = msg,
.pkt_len = len,
.vsk = vsk,
+ .net = sock_net(sk_vsock(vsk)),
};
return virtio_transport_send_pkt_info(vsk, &info);
@@ -1145,6 +1143,7 @@ static int virtio_transport_reset(struct vsock_sock *vsk,
.op = VIRTIO_VSOCK_OP_RST,
.reply = !!skb,
.vsk = vsk,
+ .net = sock_net(sk_vsock(vsk)),
};
/* Send RST only if the original pkt is not a RST pkt */
@@ -1156,9 +1155,13 @@ static int virtio_transport_reset(struct vsock_sock *vsk,
/* Normally packets are associated with a socket. There may be no socket if an
* attempt was made to connect to a socket that does not exist.
+ *
+ * net refers to the namespace of whoever sent the invalid message. For
+ * loopback, this is the namespace of the socket. For vhost, this is the
+ * namespace of the VM (i.e., vhost_vsock).
*/
static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
- struct sk_buff *skb)
+ struct sk_buff *skb, struct net *net)
{
struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
struct virtio_vsock_pkt_info info = {
@@ -1171,6 +1174,12 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
* sock_net(sk) until the reply skb is freed.
*/
.vsk = vsock_sk(skb->sk),
+
+ /* net is not defined here because we pass it directly to
+ * t->send_pkt(), instead of relying on
+ * virtio_transport_send_pkt_info() to pass it. It is not needed
+ * by virtio_transport_alloc_skb().
+ */
};
struct sk_buff *reply;
@@ -1189,7 +1198,7 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
if (!reply)
return -ENOMEM;
- return t->send_pkt(reply);
+ return t->send_pkt(reply, net);
}
/* This function should be called with sk_lock held and SOCK_DONE set */
@@ -1471,6 +1480,7 @@ virtio_transport_send_response(struct vsock_sock *vsk,
.remote_port = le32_to_cpu(hdr->src_port),
.reply = true,
.vsk = vsk,
+ .net = sock_net(sk_vsock(vsk)),
};
return virtio_transport_send_pkt_info(vsk, &info);
@@ -1513,12 +1523,12 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
int ret;
if (le16_to_cpu(hdr->op) != VIRTIO_VSOCK_OP_REQUEST) {
- virtio_transport_reset_no_sock(t, skb);
+ virtio_transport_reset_no_sock(t, skb, sock_net(sk));
return -EINVAL;
}
if (sk_acceptq_is_full(sk)) {
- virtio_transport_reset_no_sock(t, skb);
+ virtio_transport_reset_no_sock(t, skb, sock_net(sk));
return -ENOMEM;
}
@@ -1526,13 +1536,13 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
* Subsequent enqueues would lead to a memory leak.
*/
if (sk->sk_shutdown == SHUTDOWN_MASK) {
- virtio_transport_reset_no_sock(t, skb);
+ virtio_transport_reset_no_sock(t, skb, sock_net(sk));
return -ESHUTDOWN;
}
child = vsock_create_connected(sk);
if (!child) {
- virtio_transport_reset_no_sock(t, skb);
+ virtio_transport_reset_no_sock(t, skb, sock_net(sk));
return -ENOMEM;
}
@@ -1554,7 +1564,7 @@ virtio_transport_recv_listen(struct sock *sk, struct sk_buff *skb,
*/
if (ret || vchild->transport != &t->transport) {
release_sock(child);
- virtio_transport_reset_no_sock(t, skb);
+ virtio_transport_reset_no_sock(t, skb, sock_net(sk));
sock_put(child);
return ret;
}
@@ -1582,7 +1592,7 @@ static bool virtio_transport_valid_type(u16 type)
* lock.
*/
void virtio_transport_recv_pkt(struct virtio_transport *t,
- struct sk_buff *skb)
+ struct sk_buff *skb, struct net *net)
{
struct virtio_vsock_hdr *hdr = virtio_vsock_hdr(skb);
struct sockaddr_vm src, dst;
@@ -1605,24 +1615,24 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
le32_to_cpu(hdr->fwd_cnt));
if (!virtio_transport_valid_type(le16_to_cpu(hdr->type))) {
- (void)virtio_transport_reset_no_sock(t, skb);
+ (void)virtio_transport_reset_no_sock(t, skb, net);
goto free_pkt;
}
/* The socket must be in connected or bound table
* otherwise send reset back
*/
- sk = vsock_find_connected_socket(&src, &dst);
+ sk = vsock_find_connected_socket_net(&src, &dst, net);
if (!sk) {
- sk = vsock_find_bound_socket(&dst);
+ sk = vsock_find_bound_socket_net(&dst, net);
if (!sk) {
- (void)virtio_transport_reset_no_sock(t, skb);
+ (void)virtio_transport_reset_no_sock(t, skb, net);
goto free_pkt;
}
}
if (virtio_transport_get_type(sk) != le16_to_cpu(hdr->type)) {
- (void)virtio_transport_reset_no_sock(t, skb);
+ (void)virtio_transport_reset_no_sock(t, skb, net);
sock_put(sk);
goto free_pkt;
}
@@ -1641,7 +1651,7 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
*/
if (sock_flag(sk, SOCK_DONE) ||
(sk->sk_state != TCP_LISTEN && vsk->transport != &t->transport)) {
- (void)virtio_transport_reset_no_sock(t, skb);
+ (void)virtio_transport_reset_no_sock(t, skb, net);
release_sock(sk);
sock_put(sk);
goto free_pkt;
@@ -1673,7 +1683,7 @@ void virtio_transport_recv_pkt(struct virtio_transport *t,
kfree_skb(skb);
break;
default:
- (void)virtio_transport_reset_no_sock(t, skb);
+ (void)virtio_transport_reset_no_sock(t, skb, net);
kfree_skb(skb);
break;
}
diff --git a/net/vmw_vsock/vsock_loopback.c b/net/vmw_vsock/vsock_loopback.c
index deff68c64a09..8068d1b6e851 100644
--- a/net/vmw_vsock/vsock_loopback.c
+++ b/net/vmw_vsock/vsock_loopback.c
@@ -26,7 +26,7 @@ static u32 vsock_loopback_get_local_cid(void)
return VMADDR_CID_LOCAL;
}
-static int vsock_loopback_send_pkt(struct sk_buff *skb)
+static int vsock_loopback_send_pkt(struct sk_buff *skb, struct net *net)
{
struct vsock_loopback *vsock = &the_vsock_loopback;
int len = skb->len;
@@ -48,6 +48,13 @@ static int vsock_loopback_cancel_pkt(struct vsock_sock *vsk)
static bool vsock_loopback_seqpacket_allow(struct vsock_sock *vsk,
u32 remote_cid);
+
+static bool vsock_loopback_stream_allow(struct vsock_sock *vsk, u32 cid,
+ u32 port)
+{
+ return true;
+}
+
static bool vsock_loopback_msgzerocopy_allow(void)
{
return true;
@@ -77,7 +84,7 @@ static struct virtio_transport loopback_transport = {
.stream_has_space = virtio_transport_stream_has_space,
.stream_rcvhiwat = virtio_transport_stream_rcvhiwat,
.stream_is_active = virtio_transport_stream_is_active,
- .stream_allow = virtio_transport_stream_allow,
+ .stream_allow = vsock_loopback_stream_allow,
.seqpacket_dequeue = virtio_transport_seqpacket_dequeue,
.seqpacket_enqueue = virtio_transport_seqpacket_enqueue,
@@ -132,7 +139,8 @@ static void vsock_loopback_work(struct work_struct *work)
*/
virtio_transport_consume_skb_sent(skb, false);
virtio_transport_deliver_tap_pkt(skb);
- virtio_transport_recv_pkt(&loopback_transport, skb);
+ virtio_transport_recv_pkt(&loopback_transport, skb,
+ sock_net(skb->sk));
}
}
--
2.47.3
* [PATCH net-next v15 02/12] virtio: set skb owner of virtio_transport_reset_no_sock() reply
From: Bobby Eshleman @ 2026-01-16 21:28 UTC (permalink / raw)
To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Shuah Khan, Long Li,
Jonathan Corbet
Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
linux-kselftest, berrange, Sargun Dhillon, linux-doc,
Bobby Eshleman, Bobby Eshleman
In-Reply-To: <20260116-vsock-vmtest-v15-0-bbfd1a668548@meta.com>
From: Bobby Eshleman <bobbyeshleman@meta.com>
Associate reply packets with the sending socket. When vsock must reply
with an RST packet and there exists a sending socket (e.g., for
loopback), setting the skb owner to the socket correctly handles
reference counting between the skb and sk (i.e., the sk stays alive
until the skb is freed).
This allows the net namespace to be used for socket lookups for the
duration of the reply skb's lifetime, preventing race conditions between
the namespace lifecycle and vsock socket search using the namespace
pointer.
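The ownership idea described above can be pictured with a small userspace toy model. This is only a sketch under stated assumptions: the `toy_sock` and `toy_skb` types and all `toy_*` helpers are hypothetical stand-ins, and the kernel's real skb ownership accounting differs in detail. The point it illustrates is that a reply buffer with an owner pins that owner until the buffer is freed, while a reply without an owner (the non-loopback case, where the owner may be NULL) pins nothing.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for a socket and a packet buffer. */
struct toy_sock {
	int refcnt;              /* number of holders keeping the socket alive */
};

struct toy_skb {
	struct toy_sock *owner;  /* NULL when the reply has no sending socket */
};

static void toy_sock_hold(struct toy_sock *sk)
{
	sk->refcnt++;
}

static void toy_sock_put(struct toy_sock *sk)
{
	assert(sk->refcnt > 0);
	sk->refcnt--;
}

/* Allocate a reply; if an owner is given, take a reference on it. */
static struct toy_skb *toy_skb_alloc(struct toy_sock *owner)
{
	struct toy_skb *skb = calloc(1, sizeof(*skb));

	if (!skb)
		return NULL;
	if (owner) {
		toy_sock_hold(owner);
		skb->owner = owner;
	}
	return skb;
}

/* Free a reply; the owner's reference is dropped only at this point. */
static void toy_skb_free(struct toy_skb *skb)
{
	if (!skb)
		return;
	if (skb->owner)
		toy_sock_put(skb->owner);
	free(skb);
}
```

In this model, anything derived from the owner (here, by analogy, the namespace reached through the socket) stays valid for as long as the reply exists.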
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Bobby Eshleman <bobbyeshleman@meta.com>
---
Changes in v11:
- move before adding to netns support (Stefano)
Changes in v10:
- break this out into its own patch for easy revert (Stefano)
---
net/vmw_vsock/virtio_transport_common.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index fdb8f5b3fa60..718be9f33274 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -1165,6 +1165,12 @@ static int virtio_transport_reset_no_sock(const struct virtio_transport *t,
.op = VIRTIO_VSOCK_OP_RST,
.type = le16_to_cpu(hdr->type),
.reply = true,
+
+ /* Set sk owner to socket we are replying to (may be NULL for
+ * non-loopback). This keeps a reference to the sock and
+ * sock_net(sk) until the reply skb is freed.
+ */
+ .vsk = vsock_sk(skb->sk),
};
struct sk_buff *reply;
--
2.47.3
* [PATCH net-next v15 00/12] vsock: add namespace support to vhost-vsock and loopback
From: Bobby Eshleman @ 2026-01-16 21:28 UTC (permalink / raw)
To: Stefano Garzarella, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Simon Horman, Stefan Hajnoczi, Michael S. Tsirkin,
Jason Wang, Eugenio Pérez, Xuan Zhuo, K. Y. Srinivasan,
Haiyang Zhang, Wei Liu, Dexuan Cui, Bryan Tan, Vishnu Dasa,
Broadcom internal kernel review list, Shuah Khan, Long Li,
Jonathan Corbet
Cc: linux-kernel, virtualization, netdev, kvm, linux-hyperv,
linux-kselftest, berrange, Sargun Dhillon, linux-doc,
Bobby Eshleman, Bobby Eshleman
This series adds namespace support to vhost-vsock and loopback. It does
not add namespaces to any of the other guest transports (virtio-vsock,
hyperv, or vmci).
The current revision supports two modes: local and global. Local
mode is complete isolation of namespaces, while global mode is complete
sharing between namespaces of CIDs (the original behavior).
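As a rough illustration of the two modes, the reachability rule can be modeled in userspace C. This is a sketch inferred from the mode description and the test names in this cover letter, not the body of vsock_net_check_mode(); the `ns_model` type and `ns_can_reach()` are hypothetical names introduced only for this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; these are not the kernel's types. */
enum ns_mode { NS_MODE_GLOBAL, NS_MODE_LOCAL };

struct ns_model {
	int id;             /* stands in for the identity of a net namespace */
	enum ns_mode mode;  /* fixed when the namespace is created */
};

/* Returns true when a socket in namespace a may reach a peer in namespace b. */
static bool ns_can_reach(const struct ns_model *a, const struct ns_model *b)
{
	assert(a && b);

	/* Peers inside the same namespace are reachable in either mode. */
	if (a->id == b->id)
		return true;

	/* Across namespaces, both sides must be in global mode. */
	return a->mode == NS_MODE_GLOBAL && b->mode == NS_MODE_GLOBAL;
}
```

Under this model, two different global namespaces can reach each other, a local namespace can only reach itself, and a global namespace cannot reach into a local one, which lines up with the ns_diff_* and ns_same_* test names listed further down.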
The mode is set using the parent namespace's
/proc/sys/net/vsock/child_ns_mode and inherited when a new namespace is
created. The mode of the current namespace can be queried by reading
/proc/sys/net/vsock/ns_mode. The mode cannot change after the namespace
has been created.
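For example, a userspace program could query the mode of its own namespace along these lines. This is a minimal sketch: it assumes the file contains the text "global" or "local" (optionally followed by a newline), which is not spelled out in this cover letter, and `parse_ns_mode()`, `read_ns_mode()` and the `MODE_*` values are hypothetical names for this example only.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum mode_result { MODE_UNKNOWN = -1, MODE_GLOBAL = 0, MODE_LOCAL = 1 };

/* Map the text read from the sysctl to an enum; tolerate a trailing newline.
 * Assumes the kernel reports the mode as "global" or "local".
 */
static enum mode_result parse_ns_mode(const char *text)
{
	size_t len;

	assert(text != NULL);
	len = strcspn(text, "\r\n");

	if (len == 6 && strncmp(text, "global", len) == 0)
		return MODE_GLOBAL;
	if (len == 5 && strncmp(text, "local", len) == 0)
		return MODE_LOCAL;
	return MODE_UNKNOWN;
}

/* Read and parse the mode from path; MODE_UNKNOWN if the file cannot be read. */
static enum mode_result read_ns_mode(const char *path)
{
	char buf[32];
	enum mode_result mode = MODE_UNKNOWN;
	FILE *f = fopen(path, "r");

	if (!f)
		return MODE_UNKNOWN;
	if (fgets(buf, sizeof(buf), f))
		mode = parse_ns_mode(buf);
	fclose(f);
	return mode;
}
```

A caller would pass "/proc/sys/net/vsock/ns_mode" to `read_ns_mode()`; on a kernel without this series the file is absent and the sketch reports `MODE_UNKNOWN`.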
Modes are per-netns. This allows a system to configure namespaces
independently (some may share CIDs, others are completely isolated).
This also supports future possible mixed use cases, where there may be
namespaces in global mode spinning up VMs while there are mixed mode
namespaces that provide services to the VMs, but are not allowed to
allocate from the global CID pool (this mode is not implemented in this
series).
Additionally, added tests for the new namespace features:
tools/testing/selftests/vsock/vmtest.sh
1..25
ok 1 vm_server_host_client
ok 2 vm_client_host_server
ok 3 vm_loopback
ok 4 ns_host_vsock_ns_mode_ok
ok 5 ns_host_vsock_child_ns_mode_ok
ok 6 ns_global_same_cid_fails
ok 7 ns_local_same_cid_ok
ok 8 ns_global_local_same_cid_ok
ok 9 ns_local_global_same_cid_ok
ok 10 ns_diff_global_host_connect_to_global_vm_ok
ok 11 ns_diff_global_host_connect_to_local_vm_fails
ok 12 ns_diff_global_vm_connect_to_global_host_ok
ok 13 ns_diff_global_vm_connect_to_local_host_fails
ok 14 ns_diff_local_host_connect_to_local_vm_fails
ok 15 ns_diff_local_vm_connect_to_local_host_fails
ok 16 ns_diff_global_to_local_loopback_local_fails
ok 17 ns_diff_local_to_global_loopback_fails
ok 18 ns_diff_local_to_local_loopback_fails
ok 19 ns_diff_global_to_global_loopback_ok
ok 20 ns_same_local_loopback_ok
ok 21 ns_same_local_host_connect_to_local_vm_ok
ok 22 ns_same_local_vm_connect_to_local_host_ok
ok 23 ns_delete_vm_ok
ok 24 ns_delete_host_ok
ok 25 ns_delete_both_ok
SUMMARY: PASS=25 SKIP=0 FAIL=0
Thanks again for everyone's help and reviews!
Suggested-by: Sargun Dhillon <sargun@sargun.me>
Signed-off-by: Bobby Eshleman <bobbyeshleman@gmail.com>
Changes in v15:
- see per-patch change notes in 'vsock: add netns to vsock core'
- Link to v14: https://lore.kernel.org/r/20260112-vsock-vmtest-v14-0-a5c332db3e2b@meta.com
Changes in v14:
- squashed 'vsock: add per-net vsock NS mode state' into 'vsock: add
netns to vsock core' (MST)
- remove RFC tag
- fixed base-commit (still had b4 configured to depend on old vmtest.sh
series)
- Link to v13: https://lore.kernel.org/all/20251223-vsock-vmtest-v13-0-9d6db8e7c80b@meta.com/
Changes in v13:
- add support for immutable sysfs ns_mode and inheritance from sysfs child_ns_mode
- remove passing around of net_mode, can be accessed now via
vsock_net_mode(net) since it is immutable
- update tests for new uAPI
- add one patch to extend the kselftest timeout (it was starting to
fail with the new tests added)
- Link to v12: https://lore.kernel.org/r/20251126-vsock-vmtest-v12-0-257ee21cd5de@meta.com
Changes in v12:
- add ns mode checking to _allow() callbacks to reject local mode for
incompatible transports (Stefano)
- flip vhost/loopback to return true for stream_allow() and
seqpacket_allow() in "vsock: add netns support to virtio transports"
(Stefano)
- add VMADDR_CID_ANY + local mode documentation in af_vsock.c (Stefano)
- change "selftests/vsock: add tests for host <-> vm connectivity with
namespaces" to skip test 29 in vsock_test for namespace local
vsock_test calls in a host local-mode namespace. There is a
false-positive edge case for that test encountered with the
->stream_allow() approach. More details in that patch.
- updated cover letter with new test output
- Link to v11: https://lore.kernel.org/r/20251120-vsock-vmtest-v11-0-55cbc80249a7@meta.com
Changes in v11:
- vmtest: add a patch to use ss in wait_for_listener functions and
support vsock, tcp, and unix. Change all patches to use the new
functions.
- vmtest: add a patch to re-use vm dmesg / warn counting functions
- Link to v10: https://lore.kernel.org/r/20251117-vsock-vmtest-v10-0-df08f165bf3e@meta.com
Changes in v10:
- Combine virtio common patches into one (Stefano)
- Resolve vsock_loopback virtio_transport_reset_no_sock() issue
with info->vsk setting. This eliminates the need for skb->cb,
so remove skb->cb patches.
- many line width 80 fixes
- Link to v9: https://lore.kernel.org/all/20251111-vsock-vmtest-v9-0-852787a37bed@meta.com
Changes in v9:
- reorder loopback patch after patch for virtio transport common code
- remove module ordering tests patch because loopback no longer depends
on pernet ops
- major simplifications in vsock_loopback
- added a new patch for blocking local mode for guests, added test case
to check
- add net ref tracking to vsock_loopback patch
- Link to v8: https://lore.kernel.org/r/20251023-vsock-vmtest-v8-0-dea984d02bb0@meta.com
Changes in v8:
- Break generic cleanup/refactoring patches into standalone series,
remove those from this series
- Link to dependency: https://lore.kernel.org/all/20251022-vsock-selftests-fixes-and-improvements-v1-0-edeb179d6463@meta.com/
- Link to v7: https://lore.kernel.org/r/20251021-vsock-vmtest-v7-0-0661b7b6f081@meta.com
Changes in v7:
- fix hv_sock build
- break out vmtest patches into distinct, more well-scoped patches
- change `orig_net_mode` to `net_mode`
- many fixes and style changes in per-patch change sets (see individual
patches for specific changes)
- optimize `virtio_vsock_skb_cb` layout
- update commit messages with more useful descriptions
- vsock_loopback: use orig_net_mode instead of current net mode
- add tests for edge cases (ns deletion, mode changing, loopback module
load ordering)
- Link to v6: https://lore.kernel.org/r/20250916-vsock-vmtest-v6-0-064d2eb0c89d@meta.com
Changes in v6:
- define behavior when mode changes to local while socket/VM is alive
- af_vsock: clarify description of CID behavior
- af_vsock: use stronger language around CID rules (don't use "may")
- af_vsock: improve naming of buf/buffer
- af_vsock: improve string length checking on proc writes
- vsock_loopback: add space in struct to clarify lock protection
- vsock_loopback: do proper cleanup/unregister on vsock_loopback_exit()
- vsock_loopback: use virtio_vsock_skb_net() instead of sock_net()
- vsock_loopback: set loopback to NULL after kfree()
- vsock_loopback: use pernet_operations and remove callback mechanism
- vsock_loopback: add macros for "global" and "local"
- vsock_loopback: fix length checking
- vmtest.sh: check for namespace support in vmtest.sh
- Link to v5: https://lore.kernel.org/r/20250827-vsock-vmtest-v5-0-0ba580bede5b@meta.com
Changes in v5:
- /proc/net/vsock_ns_mode -> /proc/sys/net/vsock/ns_mode
- vsock_global_net -> vsock_global_dummy_net
- fix netns lookup in vhost_vsock to respect pid namespaces
- add callbacks for vsock_loopback to avoid circular dependency
- vmtest.sh loads vsock_loopback module
- remove vsock_net_mode_can_set()
- change vsock_net_write_mode() to return true/false based on success
- make vsock_net_mode enum instead of u8
- Link to v4: https://lore.kernel.org/r/20250805-vsock-vmtest-v4-0-059ec51ab111@meta.com
Changes in v4:
- removed RFC tag
- implemented loopback support
- renamed new tests to better reflect behavior
- completed suite of tests with permutations of ns modes and vsock_test
as guest/host
- simplified socat bridging with unix socket instead of tcp + veth
- only use vsock_test for success case, socat for failure case (context
in commit message)
- lots of cleanup
Changes in v3:
- add notion of "modes"
- add procfs /proc/net/vsock_ns_mode
- local and global modes only
- no /dev/vhost-vsock-netns
- vmtest.sh already merged, so new patch just adds new tests for NS
- Link to v2:
https://lore.kernel.org/kvm/20250312-vsock-netns-v2-0-84bffa1aa97a@gmail.com
Changes in v2:
- only support vhost-vsock namespaces
- all g2h namespaces retain old behavior, only common API changes
impacted by vhost-vsock changes
- add /dev/vhost-vsock-netns for "opt-in"
- leave /dev/vhost-vsock to old behavior
- removed netns module param
- Link to v1:
https://lore.kernel.org/r/20200116172428.311437-1-sgarzare@redhat.com
Changes in v1:
- added 'netns' module param to vsock.ko to enable the
network namespace support (disabled by default)
- added 'vsock_net_eq()' to check the "net" assigned to a socket
only when 'netns' support is enabled
- Link to RFC: https://patchwork.ozlabs.org/cover/1202235/
---
Bobby Eshleman (12):
vsock: add netns to vsock core
virtio: set skb owner of virtio_transport_reset_no_sock() reply
vsock: add netns support to virtio transports
selftests/vsock: increase timeout to 1200
selftests/vsock: add namespace helpers to vmtest.sh
selftests/vsock: prepare vm management helpers for namespaces
selftests/vsock: add vm_dmesg_{warn,oops}_count() helpers
selftests/vsock: use ss to wait for listeners instead of /proc/net
selftests/vsock: add tests for proc sys vsock ns_mode
selftests/vsock: add namespace tests for CID collisions
selftests/vsock: add tests for host <-> vm connectivity with namespaces
selftests/vsock: add tests for namespace deletion
Documentation/admin-guide/kernel-parameters.txt | 14 +
MAINTAINERS | 1 +
drivers/vhost/vsock.c | 44 +-
include/linux/virtio_vsock.h | 9 +-
include/net/af_vsock.h | 61 +-
include/net/net_namespace.h | 4 +
include/net/netns/vsock.h | 21 +
net/vmw_vsock/af_vsock.c | 328 ++++++-
net/vmw_vsock/hyperv_transport.c | 7 +-
net/vmw_vsock/virtio_transport.c | 22 +-
net/vmw_vsock/virtio_transport_common.c | 62 +-
net/vmw_vsock/vmci_transport.c | 26 +-
net/vmw_vsock/vsock_loopback.c | 22 +-
tools/testing/selftests/vsock/settings | 2 +-
tools/testing/selftests/vsock/vmtest.sh | 1055 +++++++++++++++++++++--
15 files changed, 1538 insertions(+), 140 deletions(-)
---
base-commit: 74ecff77dace0f9aead6aac852b57af5d4ad3b85
change-id: 20250325-vsock-vmtest-b3a21d2102c2
Best regards,
--
Bobby Eshleman <bobbyeshleman@meta.com>
* Re: [PATCH 0/2] kbuild, uapi: Mark inner unions in packed structs as packed
From: Nicolas Schier @ 2026-01-16 19:57 UTC (permalink / raw)
To: Thomas Weißschuh
Cc: K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Nathan Chancellor, Nick Desaulniers, Bill Wendling, Justin Stitt,
Hans de Goede, Arnd Bergmann, Greg Kroah-Hartman, linux-hyperv,
linux-kernel, llvm, kernel test robot, linux-kbuild
In-Reply-To: <20260115-kbuild-alignment-vbox-v1-0-076aed1623ff@linutronix.de>
Cc += linux-kbuild
On Thu, Jan 15, 2026 at 08:35:43AM +0100, Thomas Weißschuh wrote:
> The unpacked unions within a packed struct generate alignment warnings
> on clang for 32-bit ARM.
>
> With the recent changes to compile-test the UAPI headers in more cases,
> these warnings in combination with CONFIG_WERROR break the build.
>
> Fix the warnings.
>
> Intended for the kbuild tree.
>
> Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
> ---
> Thomas Weißschuh (2):
> hyper-v: Mark inner union in hv_kvp_exchg_msg_value as packed
> virt: vbox: uapi: Mark inner unions in packed structs as packed
>
> include/uapi/linux/hyperv.h | 2 +-
> include/uapi/linux/vbox_vmmdev_types.h | 4 ++--
> 2 files changed, 3 insertions(+), 3 deletions(-)
> ---
> base-commit: e3970d77ec504e54c3f91a48b2125775c16ba4c0
> change-id: 20260115-kbuild-alignment-vbox-d0409134d335
>
> Best regards,
> --
> Thomas Weißschuh <thomas.weissschuh@linutronix.de>
>
Thanks!
Tested-by: Nicolas Schier <nsc@kernel.org>
Reviewed-by: Nicolas Schier <nsc@kernel.org>
Kind regards,
Nicolas
* Re: [PATCH v3 5/6] mshv: Add definitions for stats pages
From: Nuno Das Neves @ 2026-01-16 18:33 UTC (permalink / raw)
To: Michael Kelley, Stanislav Kinsburskii
Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com,
prapal@linux.microsoft.com, mrathor@linux.microsoft.com,
paekkaladevi@linux.microsoft.com
In-Reply-To: <SN6PR02MB41575DED97B3E791238296AAD48DA@SN6PR02MB4157.namprd02.prod.outlook.com>
On 1/16/2026 9:01 AM, Michael Kelley wrote:
> From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Thursday, January 15, 2026 11:35 AM
>>
>> On 1/15/2026 8:19 AM, Stanislav Kinsburskii wrote:
>>> On Wed, Jan 14, 2026 at 01:38:02PM -0800, Nuno Das Neves wrote:
>>>> Add the definitions for hypervisor, logical processor, and partition
>>>> stats pages.
>>>>
>>>
>>> The definitions for partition and virtual processor are outdated.
>>> Now is a good time to sync the new values in.
>>>
>>> Thanks,
>>> Stanislav
>>>
>>
>> Good point, thanks, I will update it for v4.
>>
>> I'm finally noticing that these counters are not really from hvhdk.h, in
>> the windows code, but their own file. Since I'm still iterating on this,
>> what do you think about creating a file just for the counters?
>> e.g. drivers/hv/hvcounters.h, which combines hvcountersarm64 and amd64.
>>
>> That would have a couple of advantages:
>> 1. Not putting things in hvhdk.h which aren't actually there in the
>> Windows source
>> 2. Less visibility of CamelCase naming outside our driver
>> 3. I could define the enums using "X macro"s to generate the show() code
>> more cleanly in mshv_debugfs.c, which is something Michael suggested
>> here:
>> https://lore.kernel.org/linux-hyperv/SN6PR02MB4157938404BC0D12978ACD9BD4A2A@SN6PR02MB4157.namprd02.prod.outlook.com/
>>
>> It would look something like this:
>>
>> In hvcounters.h:
>>
>> #if IS_ENABLED(CONFIG_X86_64)
>>
>> #define HV_COUNTER_VP_LIST(X) \
>> X(VpTotalRunTime, 1), \
>> X(VpHypervisorRunTime, 2), \
>> X(VpRemoteNodeRunTime, 3), \
>> /* <snip> */
>>
>> #elif IS_ENABLED(CONFIG_ARM64)
>>
>> /* <snip> */
>>
>> #endif
>>
>> Just like now, it's a copy/paste from Windows + simple pattern
>> replacement. Note with this approach we need separate lists for arm64
>> and x86, but that matches how the enums are defined in Windows.
>>
>> Then, in mshv_debugfs.c:
>>
>> /*
>> * We need the strings paired with their enum values.
>> * This structure can be used for all the different stat types.
>> */
>> struct hv_counter_entry {
>> char *name;
>> int idx;
>> };
>>
>> /* Define an array entry (again, reusable) */
>> #define HV_COUNTER_LIST(name, idx) \
>> { __stringify(name), idx },
>
> Couldn't this also go in hvcounters.h, so it doesn't need to be
> passed as a parameter to HV_COUNTER_VP_LIST() and friends?
> Or is the goal to keep hvcounters.h as bare minimum as possible?
>
Oh, yes, certainly the struct and macros could all go in hvcounters.h.
>>
>> /* Create our static array */
>> static struct hv_counter_entry hv_counter_vp_array[] = {
>> HV_ST_COUNTER_VP(HV_COUNTER_VP)
>> };
>
> Shouldn't the above be HV_COUNTER_VP_LIST(HV_COUNTER_LIST)
> to match the #define in hvcounters.h, and the macro that does the
> __stringify()? Assuming so, I think I understand the overall idea you
> are proposing. It's pretty clever. :-)
>
Oh, yes, it should be HV_COUNTER_VP_LIST(HV_COUNTER_LIST).
> The #define of HV_COUNTER_VP_LIST() in hvcounters.h gets large
> for VP stats -- the #define will be about 200 lines. I have no sense
> of whether being that large is problematic for the tooling. And that
> question needs to be considered beyond just the C preprocessor and
> compiler, to include things like sparse, cscope, and other tools that
> parse source code. I had originally suggested building the static array
> directly in a .c file, which would avoid the need for the big #define.
> And maybe you could still do that with a separate .c source file just
> for the static arrays -- i.e., hvcounters.h becomes hvcounters.c. It
> seems like the "it's a copy/paste from Windows + simple pattern
> replacement" could be done to generate a .c file as easily as a .h file
> while still keeping the file contents to a bare minimum.
>
Good point... I usually reach for this "X macros" technique when I have
a list of things that needs to be repeated in multiple places (e.g.
defining a big enum AND also using the enum values in a big switch
statement).
Since we don't need the enum after all, apparently (it's not in hvhdk.h),
your original suggestion is probably the most straightforward thing; just
putting the values into a static array directly.
Putting it in its own .c file and including that might be the easiest
thing; I'll give that a go and see how it looks.
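For reference, a minimal sketch of the "X macros" pattern mentioned above: one list expanded twice, once into an enum and once into a switch statement. Every DEMO_*/Demo* name and value here is made up for illustration; none of them come from the patch or from the Windows headers.

```c
#include <stddef.h>

/* Hypothetical counter list: each entry is X(name, index). */
#define DEMO_COUNTER_LIST(X) \
	X(DemoTotalRunTime, 1) \
	X(DemoHypervisorRunTime, 2) \
	X(DemoIdealCpu, 5)

/* First expansion: turn every entry into an enumerator. */
#define DEMO_AS_ENUM(name, idx) name = idx,
enum demo_counters { DEMO_COUNTER_LIST(DEMO_AS_ENUM) DemoMaxCounter };
#undef DEMO_AS_ENUM

/* Second expansion: turn every entry into a case label returning its name. */
#define DEMO_AS_CASE(name, idx) case name: return #name;
const char *demo_counter_name(int idx)
{
	switch (idx) {
	DEMO_COUNTER_LIST(DEMO_AS_CASE)
	default:
		return NULL; /* index not present in the list */
	}
}
#undef DEMO_AS_CASE
```

The list is written once, so the enum and the switch cannot drift apart. If only one expansion is needed (just the name/index table), the indirection buys little, which is the argument for writing the array out directly.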
> Either way (.h or .c file), I like the idea.
>
> Michael
>
>>
>> static int vp_stats_show(struct seq_file *m, void *v)
>> {
>> const struct hv_stats_page **pstats = m->private;
>> int i;
>>
>> for (i = 0; i < ARRAY_SIZE(hv_counter_vp_array); ++i) {
>> struct hv_counter_entry entry = hv_counter_vp_array[i];
>> u64 parent_val = pstats[HV_STATS_AREA_PARENT]->vp_cntrs[entry.idx];
>> u64 self_val = pstats[HV_STATS_AREA_SELF]->vp_cntrs[entry.idx];
>>
>> /* Prioritize the PARENT area value */
>> seq_printf(m, "%-30s: %llu\n", entry.name,
>> parent_val ? parent_val : self_val);
>> }
>>
>> return 0;
>> }
>>
>> Any thoughts? I was originally going to just go with the pattern we had,
>> but since these definitions aren't from the hv*dk.h files, we can maybe
>> get more creative and make the resulting code look a bit better.
>>
>> Thanks
>> Nuno
>>
>>>> Move the definition for the VP stats page to its rightful place in
>>>> hvhdk.h, and add the missing members.
>>>>
>>>> While at it, correct the ARM64 value of VpRootDispatchThreadBlocked,
>>>> (which is not yet used, so there is no impact).
>>>>
>>>> These enum members retain their CamelCase style, since they are imported
>>>> directly from the hypervisor code. They will be stringified when
>>>> printing the stats out, and retain more readability in this form.
>>>>
>>>> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
>>>> ---
>>>> drivers/hv/mshv_root_main.c | 17 --
>>>> include/hyperv/hvhdk.h | 437 ++++++++++++++++++++++++++++++++++++
>>>> 2 files changed, 437 insertions(+), 17 deletions(-)
>>>>
>>>> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
>>>> index fbfc9e7d9fa4..724bbaa0b08c 100644
>>>> --- a/drivers/hv/mshv_root_main.c
>>>> +++ b/drivers/hv/mshv_root_main.c
>>>> @@ -39,23 +39,6 @@ MODULE_AUTHOR("Microsoft");
>>>> MODULE_LICENSE("GPL");
>>>> MODULE_DESCRIPTION("Microsoft Hyper-V root partition VMM interface /dev/mshv");
>>>>
>>>> -/* TODO move this to another file when debugfs code is added */
>>>> -enum hv_stats_vp_counters { /* HV_THREAD_COUNTER */
>>>> -#if defined(CONFIG_X86)
>>>> - VpRootDispatchThreadBlocked = 202,
>>>> -#elif defined(CONFIG_ARM64)
>>>> - VpRootDispatchThreadBlocked = 94,
>>>> -#endif
>>>> - VpStatsMaxCounter
>>>> -};
>>>> -
>>>> -struct hv_stats_page {
>>>> - union {
>>>> - u64 vp_cntrs[VpStatsMaxCounter]; /* VP counters */
>>>> - u8 data[HV_HYP_PAGE_SIZE];
>>>> - };
>>>> -} __packed;
>>>> -
>>>> struct mshv_root mshv_root;
>>>>
>>>> enum hv_scheduler_type hv_scheduler_type;
>>>> diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
>>>> index 469186df7826..8bddd11feeba 100644
>>>> --- a/include/hyperv/hvhdk.h
>>>> +++ b/include/hyperv/hvhdk.h
>>>> @@ -10,6 +10,443 @@
>>>> #include "hvhdk_mini.h"
>>>> #include "hvgdk.h"
>>>>
>>>> +enum hv_stats_hypervisor_counters { /* HV_HYPERVISOR_COUNTER */
>>>> + HvLogicalProcessors = 1,
>>>> + HvPartitions = 2,
>>>> + HvTotalPages = 3,
>>>> + HvVirtualProcessors = 4,
>>>> + HvMonitoredNotifications = 5,
>>>> + HvModernStandbyEntries = 6,
>>>> + HvPlatformIdleTransitions = 7,
>>>> + HvHypervisorStartupCost = 8,
>>>> + HvIOSpacePages = 10,
>>>> + HvNonEssentialPagesForDump = 11,
>>>> + HvSubsumedPages = 12,
>>>> + HvStatsMaxCounter
>>>> +};
>>>> +
>>>> +enum hv_stats_partition_counters { /* HV_PROCESS_COUNTER */
>>>> + PartitionVirtualProcessors = 1,
>>>> + PartitionTlbSize = 3,
>>>> + PartitionAddressSpaces = 4,
>>>> + PartitionDepositedPages = 5,
>>>> + PartitionGpaPages = 6,
>>>> + PartitionGpaSpaceModifications = 7,
>>>> + PartitionVirtualTlbFlushEntires = 8,
>>>> + PartitionRecommendedTlbSize = 9,
>>>> + PartitionGpaPages4K = 10,
>>>> + PartitionGpaPages2M = 11,
>>>> + PartitionGpaPages1G = 12,
>>>> + PartitionGpaPages512G = 13,
>>>> + PartitionDevicePages4K = 14,
>>>> + PartitionDevicePages2M = 15,
>>>> + PartitionDevicePages1G = 16,
>>>> + PartitionDevicePages512G = 17,
>>>> + PartitionAttachedDevices = 18,
>>>> + PartitionDeviceInterruptMappings = 19,
>>>> + PartitionIoTlbFlushes = 20,
>>>> + PartitionIoTlbFlushCost = 21,
>>>> + PartitionDeviceInterruptErrors = 22,
>>>> + PartitionDeviceDmaErrors = 23,
>>>> + PartitionDeviceInterruptThrottleEvents = 24,
>>>> + PartitionSkippedTimerTicks = 25,
>>>> + PartitionPartitionId = 26,
>>>> +#if IS_ENABLED(CONFIG_X86_64)
>>>> + PartitionNestedTlbSize = 27,
>>>> + PartitionRecommendedNestedTlbSize = 28,
>>>> + PartitionNestedTlbFreeListSize = 29,
>>>> + PartitionNestedTlbTrimmedPages = 30,
>>>> + PartitionPagesShattered = 31,
>>>> + PartitionPagesRecombined = 32,
>>>> + PartitionHwpRequestValue = 33,
>>>> +#elif IS_ENABLED(CONFIG_ARM64)
>>>> + PartitionHwpRequestValue = 27,
>>>> +#endif
>>>> + PartitionStatsMaxCounter
>>>> +};
>>>> +
>>>> +enum hv_stats_vp_counters { /* HV_THREAD_COUNTER */
>>>> + VpTotalRunTime = 1,
>>>> + VpHypervisorRunTime = 2,
>>>> + VpRemoteNodeRunTime = 3,
>>>> + VpNormalizedRunTime = 4,
>>>> + VpIdealCpu = 5,
>>>> + VpHypercallsCount = 7,
>>>> + VpHypercallsTime = 8,
>>>> +#if IS_ENABLED(CONFIG_X86_64)
>>>> + VpPageInvalidationsCount = 9,
>>>> + VpPageInvalidationsTime = 10,
>>>> + VpControlRegisterAccessesCount = 11,
>>>> + VpControlRegisterAccessesTime = 12,
>>>> + VpIoInstructionsCount = 13,
>>>> + VpIoInstructionsTime = 14,
>>>> + VpHltInstructionsCount = 15,
>>>> + VpHltInstructionsTime = 16,
>>>> + VpMwaitInstructionsCount = 17,
>>>> + VpMwaitInstructionsTime = 18,
>>>> + VpCpuidInstructionsCount = 19,
>>>> + VpCpuidInstructionsTime = 20,
>>>> + VpMsrAccessesCount = 21,
>>>> + VpMsrAccessesTime = 22,
>>>> + VpOtherInterceptsCount = 23,
>>>> + VpOtherInterceptsTime = 24,
>>>> + VpExternalInterruptsCount = 25,
>>>> + VpExternalInterruptsTime = 26,
>>>> + VpPendingInterruptsCount = 27,
>>>> + VpPendingInterruptsTime = 28,
>>>> + VpEmulatedInstructionsCount = 29,
>>>> + VpEmulatedInstructionsTime = 30,
>>>> + VpDebugRegisterAccessesCount = 31,
>>>> + VpDebugRegisterAccessesTime = 32,
>>>> + VpPageFaultInterceptsCount = 33,
>>>> + VpPageFaultInterceptsTime = 34,
>>>> + VpGuestPageTableMaps = 35,
>>>> + VpLargePageTlbFills = 36,
>>>> + VpSmallPageTlbFills = 37,
>>>> + VpReflectedGuestPageFaults = 38,
>>>> + VpApicMmioAccesses = 39,
>>>> + VpIoInterceptMessages = 40,
>>>> + VpMemoryInterceptMessages = 41,
>>>> + VpApicEoiAccesses = 42,
>>>> + VpOtherMessages = 43,
>>>> + VpPageTableAllocations = 44,
>>>> + VpLogicalProcessorMigrations = 45,
>>>> + VpAddressSpaceEvictions = 46,
>>>> + VpAddressSpaceSwitches = 47,
>>>> + VpAddressDomainFlushes = 48,
>>>> + VpAddressSpaceFlushes = 49,
>>>> + VpGlobalGvaRangeFlushes = 50,
>>>> + VpLocalGvaRangeFlushes = 51,
>>>> + VpPageTableEvictions = 52,
>>>> + VpPageTableReclamations = 53,
>>>> + VpPageTableResets = 54,
>>>> + VpPageTableValidations = 55,
>>>> + VpApicTprAccesses = 56,
>>>> + VpPageTableWriteIntercepts = 57,
>>>> + VpSyntheticInterrupts = 58,
>>>> + VpVirtualInterrupts = 59,
>>>> + VpApicIpisSent = 60,
>>>> + VpApicSelfIpisSent = 61,
>>>> + VpGpaSpaceHypercalls = 62,
>>>> + VpLogicalProcessorHypercalls = 63,
>>>> + VpLongSpinWaitHypercalls = 64,
>>>> + VpOtherHypercalls = 65,
>>>> + VpSyntheticInterruptHypercalls = 66,
>>>> + VpVirtualInterruptHypercalls = 67,
>>>> + VpVirtualMmuHypercalls = 68,
>>>> + VpVirtualProcessorHypercalls = 69,
>>>> + VpHardwareInterrupts = 70,
>>>> + VpNestedPageFaultInterceptsCount = 71,
>>>> + VpNestedPageFaultInterceptsTime = 72,
>>>> + VpPageScans = 73,
>>>> + VpLogicalProcessorDispatches = 74,
>>>> + VpWaitingForCpuTime = 75,
>>>> + VpExtendedHypercalls = 76,
>>>> + VpExtendedHypercallInterceptMessages = 77,
>>>> + VpMbecNestedPageTableSwitches = 78,
>>>> + VpOtherReflectedGuestExceptions = 79,
>>>> + VpGlobalIoTlbFlushes = 80,
>>>> + VpGlobalIoTlbFlushCost = 81,
>>>> + VpLocalIoTlbFlushes = 82,
>>>> + VpLocalIoTlbFlushCost = 83,
>>>> + VpHypercallsForwardedCount = 84,
>>>> + VpHypercallsForwardingTime = 85,
>>>> + VpPageInvalidationsForwardedCount = 86,
>>>> + VpPageInvalidationsForwardingTime = 87,
>>>> + VpControlRegisterAccessesForwardedCount = 88,
>>>> + VpControlRegisterAccessesForwardingTime = 89,
>>>> + VpIoInstructionsForwardedCount = 90,
>>>> + VpIoInstructionsForwardingTime = 91,
>>>> + VpHltInstructionsForwardedCount = 92,
>>>> + VpHltInstructionsForwardingTime = 93,
>>>> + VpMwaitInstructionsForwardedCount = 94,
>>>> + VpMwaitInstructionsForwardingTime = 95,
>>>> + VpCpuidInstructionsForwardedCount = 96,
>>>> + VpCpuidInstructionsForwardingTime = 97,
>>>> + VpMsrAccessesForwardedCount = 98,
>>>> + VpMsrAccessesForwardingTime = 99,
>>>> + VpOtherInterceptsForwardedCount = 100,
>>>> + VpOtherInterceptsForwardingTime = 101,
>>>> + VpExternalInterruptsForwardedCount = 102,
>>>> + VpExternalInterruptsForwardingTime = 103,
>>>> + VpPendingInterruptsForwardedCount = 104,
>>>> + VpPendingInterruptsForwardingTime = 105,
>>>> + VpEmulatedInstructionsForwardedCount = 106,
>>>> + VpEmulatedInstructionsForwardingTime = 107,
>>>> + VpDebugRegisterAccessesForwardedCount = 108,
>>>> + VpDebugRegisterAccessesForwardingTime = 109,
>>>> + VpPageFaultInterceptsForwardedCount = 110,
>>>> + VpPageFaultInterceptsForwardingTime = 111,
>>>> + VpVmclearEmulationCount = 112,
>>>> + VpVmclearEmulationTime = 113,
>>>> + VpVmptrldEmulationCount = 114,
>>>> + VpVmptrldEmulationTime = 115,
>>>> + VpVmptrstEmulationCount = 116,
>>>> + VpVmptrstEmulationTime = 117,
>>>> + VpVmreadEmulationCount = 118,
>>>> + VpVmreadEmulationTime = 119,
>>>> + VpVmwriteEmulationCount = 120,
>>>> + VpVmwriteEmulationTime = 121,
>>>> + VpVmxoffEmulationCount = 122,
>>>> + VpVmxoffEmulationTime = 123,
>>>> + VpVmxonEmulationCount = 124,
>>>> + VpVmxonEmulationTime = 125,
>>>> + VpNestedVMEntriesCount = 126,
>>>> + VpNestedVMEntriesTime = 127,
>>>> + VpNestedSLATSoftPageFaultsCount = 128,
>>>> + VpNestedSLATSoftPageFaultsTime = 129,
>>>> + VpNestedSLATHardPageFaultsCount = 130,
>>>> + VpNestedSLATHardPageFaultsTime = 131,
>>>> + VpInvEptAllContextEmulationCount = 132,
>>>> + VpInvEptAllContextEmulationTime = 133,
>>>> + VpInvEptSingleContextEmulationCount = 134,
>>>> + VpInvEptSingleContextEmulationTime = 135,
>>>> + VpInvVpidAllContextEmulationCount = 136,
>>>> + VpInvVpidAllContextEmulationTime = 137,
>>>> + VpInvVpidSingleContextEmulationCount = 138,
>>>> + VpInvVpidSingleContextEmulationTime = 139,
>>>> + VpInvVpidSingleAddressEmulationCount = 140,
>>>> + VpInvVpidSingleAddressEmulationTime = 141,
>>>> + VpNestedTlbPageTableReclamations = 142,
>>>> + VpNestedTlbPageTableEvictions = 143,
>>>> + VpFlushGuestPhysicalAddressSpaceHypercalls = 144,
>>>> + VpFlushGuestPhysicalAddressListHypercalls = 145,
>>>> + VpPostedInterruptNotifications = 146,
>>>> + VpPostedInterruptScans = 147,
>>>> + VpTotalCoreRunTime = 148,
>>>> + VpMaximumRunTime = 149,
>>>> + VpHwpRequestContextSwitches = 150,
>>>> + VpWaitingForCpuTimeBucket0 = 151,
>>>> + VpWaitingForCpuTimeBucket1 = 152,
>>>> + VpWaitingForCpuTimeBucket2 = 153,
>>>> + VpWaitingForCpuTimeBucket3 = 154,
>>>> + VpWaitingForCpuTimeBucket4 = 155,
>>>> + VpWaitingForCpuTimeBucket5 = 156,
>>>> + VpWaitingForCpuTimeBucket6 = 157,
>>>> + VpVmloadEmulationCount = 158,
>>>> + VpVmloadEmulationTime = 159,
>>>> + VpVmsaveEmulationCount = 160,
>>>> + VpVmsaveEmulationTime = 161,
>>>> + VpGifInstructionEmulationCount = 162,
>>>> + VpGifInstructionEmulationTime = 163,
>>>> + VpEmulatedErrataSvmInstructions = 164,
>>>> + VpPlaceholder1 = 165,
>>>> + VpPlaceholder2 = 166,
>>>> + VpPlaceholder3 = 167,
>>>> + VpPlaceholder4 = 168,
>>>> + VpPlaceholder5 = 169,
>>>> + VpPlaceholder6 = 170,
>>>> + VpPlaceholder7 = 171,
>>>> + VpPlaceholder8 = 172,
>>>> + VpPlaceholder9 = 173,
>>>> + VpPlaceholder10 = 174,
>>>> + VpSchedulingPriority = 175,
>>>> + VpRdpmcInstructionsCount = 176,
>>>> + VpRdpmcInstructionsTime = 177,
>>>> + VpPerfmonPmuMsrAccessesCount = 178,
>>>> + VpPerfmonLbrMsrAccessesCount = 179,
>>>> + VpPerfmonIptMsrAccessesCount = 180,
>>>> + VpPerfmonInterruptCount = 181,
>>>> + VpVtl1DispatchCount = 182,
>>>> + VpVtl2DispatchCount = 183,
>>>> + VpVtl2DispatchBucket0 = 184,
>>>> + VpVtl2DispatchBucket1 = 185,
>>>> + VpVtl2DispatchBucket2 = 186,
>>>> + VpVtl2DispatchBucket3 = 187,
>>>> + VpVtl2DispatchBucket4 = 188,
>>>> + VpVtl2DispatchBucket5 = 189,
>>>> + VpVtl2DispatchBucket6 = 190,
>>>> + VpVtl1RunTime = 191,
>>>> + VpVtl2RunTime = 192,
>>>> + VpIommuHypercalls = 193,
>>>> + VpCpuGroupHypercalls = 194,
>>>> + VpVsmHypercalls = 195,
>>>> + VpEventLogHypercalls = 196,
>>>> + VpDeviceDomainHypercalls = 197,
>>>> + VpDepositHypercalls = 198,
>>>> + VpSvmHypercalls = 199,
>>>> + VpBusLockAcquisitionCount = 200,
>>>> + VpLoadAvg = 201,
>>>> + VpRootDispatchThreadBlocked = 202,
>>>> +#elif IS_ENABLED(CONFIG_ARM64)
>>>> + VpSysRegAccessesCount = 9,
>>>> + VpSysRegAccessesTime = 10,
>>>> + VpSmcInstructionsCount = 11,
>>>> + VpSmcInstructionsTime = 12,
>>>> + VpOtherInterceptsCount = 13,
>>>> + VpOtherInterceptsTime = 14,
>>>> + VpExternalInterruptsCount = 15,
>>>> + VpExternalInterruptsTime = 16,
>>>> + VpPendingInterruptsCount = 17,
>>>> + VpPendingInterruptsTime = 18,
>>>> + VpGuestPageTableMaps = 19,
>>>> + VpLargePageTlbFills = 20,
>>>> + VpSmallPageTlbFills = 21,
>>>> + VpReflectedGuestPageFaults = 22,
>>>> + VpMemoryInterceptMessages = 23,
>>>> + VpOtherMessages = 24,
>>>> + VpLogicalProcessorMigrations = 25,
>>>> + VpAddressDomainFlushes = 26,
>>>> + VpAddressSpaceFlushes = 27,
>>>> + VpSyntheticInterrupts = 28,
>>>> + VpVirtualInterrupts = 29,
>>>> + VpApicSelfIpisSent = 30,
>>>> + VpGpaSpaceHypercalls = 31,
>>>> + VpLogicalProcessorHypercalls = 32,
>>>> + VpLongSpinWaitHypercalls = 33,
>>>> + VpOtherHypercalls = 34,
>>>> + VpSyntheticInterruptHypercalls = 35,
>>>> + VpVirtualInterruptHypercalls = 36,
>>>> + VpVirtualMmuHypercalls = 37,
>>>> + VpVirtualProcessorHypercalls = 38,
>>>> + VpHardwareInterrupts = 39,
>>>> + VpNestedPageFaultInterceptsCount = 40,
>>>> + VpNestedPageFaultInterceptsTime = 41,
>>>> + VpLogicalProcessorDispatches = 42,
>>>> + VpWaitingForCpuTime = 43,
>>>> + VpExtendedHypercalls = 44,
>>>> + VpExtendedHypercallInterceptMessages = 45,
>>>> + VpMbecNestedPageTableSwitches = 46,
>>>> + VpOtherReflectedGuestExceptions = 47,
>>>> + VpGlobalIoTlbFlushes = 48,
>>>> + VpGlobalIoTlbFlushCost = 49,
>>>> + VpLocalIoTlbFlushes = 50,
>>>> + VpLocalIoTlbFlushCost = 51,
>>>> + VpFlushGuestPhysicalAddressSpaceHypercalls = 52,
>>>> + VpFlushGuestPhysicalAddressListHypercalls = 53,
>>>> + VpPostedInterruptNotifications = 54,
>>>> + VpPostedInterruptScans = 55,
>>>> + VpTotalCoreRunTime = 56,
>>>> + VpMaximumRunTime = 57,
>>>> + VpWaitingForCpuTimeBucket0 = 58,
>>>> + VpWaitingForCpuTimeBucket1 = 59,
>>>> + VpWaitingForCpuTimeBucket2 = 60,
>>>> + VpWaitingForCpuTimeBucket3 = 61,
>>>> + VpWaitingForCpuTimeBucket4 = 62,
>>>> + VpWaitingForCpuTimeBucket5 = 63,
>>>> + VpWaitingForCpuTimeBucket6 = 64,
>>>> + VpHwpRequestContextSwitches = 65,
>>>> + VpPlaceholder2 = 66,
>>>> + VpPlaceholder3 = 67,
>>>> + VpPlaceholder4 = 68,
>>>> + VpPlaceholder5 = 69,
>>>> + VpPlaceholder6 = 70,
>>>> + VpPlaceholder7 = 71,
>>>> + VpPlaceholder8 = 72,
>>>> + VpContentionTime = 73,
>>>> + VpWakeUpTime = 74,
>>>> + VpSchedulingPriority = 75,
>>>> + VpVtl1DispatchCount = 76,
>>>> + VpVtl2DispatchCount = 77,
>>>> + VpVtl2DispatchBucket0 = 78,
>>>> + VpVtl2DispatchBucket1 = 79,
>>>> + VpVtl2DispatchBucket2 = 80,
>>>> + VpVtl2DispatchBucket3 = 81,
>>>> + VpVtl2DispatchBucket4 = 82,
>>>> + VpVtl2DispatchBucket5 = 83,
>>>> + VpVtl2DispatchBucket6 = 84,
>>>> + VpVtl1RunTime = 85,
>>>> + VpVtl2RunTime = 86,
>>>> + VpIommuHypercalls = 87,
>>>> + VpCpuGroupHypercalls = 88,
>>>> + VpVsmHypercalls = 89,
>>>> + VpEventLogHypercalls = 90,
>>>> + VpDeviceDomainHypercalls = 91,
>>>> + VpDepositHypercalls = 92,
>>>> + VpSvmHypercalls = 93,
>>>> + VpLoadAvg = 94,
>>>> + VpRootDispatchThreadBlocked = 95,
>>>> +#endif
>>>> + VpStatsMaxCounter
>>>> +};
>>>> +
>>>> +enum hv_stats_lp_counters { /* HV_CPU_COUNTER */
>>>> + LpGlobalTime = 1,
>>>> + LpTotalRunTime = 2,
>>>> + LpHypervisorRunTime = 3,
>>>> + LpHardwareInterrupts = 4,
>>>> + LpContextSwitches = 5,
>>>> + LpInterProcessorInterrupts = 6,
>>>> + LpSchedulerInterrupts = 7,
>>>> + LpTimerInterrupts = 8,
>>>> + LpInterProcessorInterruptsSent = 9,
>>>> + LpProcessorHalts = 10,
>>>> + LpMonitorTransitionCost = 11,
>>>> + LpContextSwitchTime = 12,
>>>> + LpC1TransitionsCount = 13,
>>>> + LpC1RunTime = 14,
>>>> + LpC2TransitionsCount = 15,
>>>> + LpC2RunTime = 16,
>>>> + LpC3TransitionsCount = 17,
>>>> + LpC3RunTime = 18,
>>>> + LpRootVpIndex = 19,
>>>> + LpIdleSequenceNumber = 20,
>>>> + LpGlobalTscCount = 21,
>>>> + LpActiveTscCount = 22,
>>>> + LpIdleAccumulation = 23,
>>>> + LpReferenceCycleCount0 = 24,
>>>> + LpActualCycleCount0 = 25,
>>>> + LpReferenceCycleCount1 = 26,
>>>> + LpActualCycleCount1 = 27,
>>>> + LpProximityDomainId = 28,
>>>> + LpPostedInterruptNotifications = 29,
>>>> + LpBranchPredictorFlushes = 30,
>>>> +#if IS_ENABLED(CONFIG_X86_64)
>>>> + LpL1DataCacheFlushes = 31,
>>>> + LpImmediateL1DataCacheFlushes = 32,
>>>> + LpMbFlushes = 33,
>>>> + LpCounterRefreshSequenceNumber = 34,
>>>> + LpCounterRefreshReferenceTime = 35,
>>>> + LpIdleAccumulationSnapshot = 36,
>>>> + LpActiveTscCountSnapshot = 37,
>>>> + LpHwpRequestContextSwitches = 38,
>>>> + LpPlaceholder1 = 39,
>>>> + LpPlaceholder2 = 40,
>>>> + LpPlaceholder3 = 41,
>>>> + LpPlaceholder4 = 42,
>>>> + LpPlaceholder5 = 43,
>>>> + LpPlaceholder6 = 44,
>>>> + LpPlaceholder7 = 45,
>>>> + LpPlaceholder8 = 46,
>>>> + LpPlaceholder9 = 47,
>>>> + LpPlaceholder10 = 48,
>>>> + LpReserveGroupId = 49,
>>>> + LpRunningPriority = 50,
>>>> + LpPerfmonInterruptCount = 51,
>>>> +#elif IS_ENABLED(CONFIG_ARM64)
>>>> + LpCounterRefreshSequenceNumber = 31,
>>>> + LpCounterRefreshReferenceTime = 32,
>>>> + LpIdleAccumulationSnapshot = 33,
>>>> + LpActiveTscCountSnapshot = 34,
>>>> + LpHwpRequestContextSwitches = 35,
>>>> + LpPlaceholder2 = 36,
>>>> + LpPlaceholder3 = 37,
>>>> + LpPlaceholder4 = 38,
>>>> + LpPlaceholder5 = 39,
>>>> + LpPlaceholder6 = 40,
>>>> + LpPlaceholder7 = 41,
>>>> + LpPlaceholder8 = 42,
>>>> + LpPlaceholder9 = 43,
>>>> + LpSchLocalRunListSize = 44,
>>>> + LpReserveGroupId = 45,
>>>> + LpRunningPriority = 46,
>>>> +#endif
>>>> + LpStatsMaxCounter
>>>> +};
>>>> +
>>>> +/*
>>>> + * Hypervisor statistics page format
>>>> + */
>>>> +struct hv_stats_page {
>>>> + union {
>>>> + u64 hv_cntrs[HvStatsMaxCounter]; /* Hypervisor counters */
>>>> + u64 pt_cntrs[PartitionStatsMaxCounter]; /* Partition counters */
>>>> + u64 vp_cntrs[VpStatsMaxCounter]; /* VP counters */
>>>> + u64 lp_cntrs[LpStatsMaxCounter]; /* LP counters */
>>>> + u8 data[HV_HYP_PAGE_SIZE];
>>>> + };
>>>> +} __packed;
>>>> +
>>>> /* Bits for dirty mask of hv_vp_register_page */
>>>> #define HV_X64_REGISTER_CLASS_GENERAL 0
>>>> #define HV_X64_REGISTER_CLASS_IP 1
>>>> --
>>>> 2.34.1
* RE: [PATCH 2/3] drivers: hv: vmbus_drv: Remove reference to hpyerv_fb
From: Michael Kelley @ 2026-01-16 17:18 UTC (permalink / raw)
To: Prasanna Kumar T S M, linux-hyperv@vger.kernel.org,
longli@microsoft.com, decui@microsoft.com, wei.liu@kernel.org,
haiyangz@microsoft.com, kys@microsoft.com, Helge Deller
Cc: linux-kernel@vger.kernel.org
In-Reply-To: <1766809622-25388-1-git-send-email-ptsm@linux.microsoft.com>
From: Prasanna Kumar T S M <ptsm@linux.microsoft.com> Sent: Friday, December 26, 2025 8:27 PM
>
Helge --
I don't know why I'm just noticing this now, but this patch that you picked up
also has a "hyperv_fb" spelling typo in the Subject: line. To match historical practice,
the Subject: line really should be:
Drivers: hv: vmbus: Remove reference to hyperv_fb
If it's something you can clean up easily, that would be nice. If it's a pain,
don't worry about it.
Michael
> Remove hyperv_fb reference as the driver is removed.
>
> Signed-off-by: Prasanna Kumar T S M <ptsm@linux.microsoft.com>
> ---
> drivers/hv/vmbus_drv.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/hv/vmbus_drv.c b/drivers/hv/vmbus_drv.c
> index a53af6fe81a6..7758d7e25a7b 100644
> --- a/drivers/hv/vmbus_drv.c
> +++ b/drivers/hv/vmbus_drv.c
> @@ -2356,8 +2356,8 @@ static void __maybe_unused vmbus_reserve_fb(void)
> }
>
> /*
> - * Release the PCI device so hyperv_drm or hyperv_fb driver can
> - * grab it later.
> + * Release the PCI device so hyperv_drm driver can grab it
> + * later.
> */
> pci_dev_put(pdev);
> }
> --
> 2.49.0
>
* RE: [PATCH v3 5/6] mshv: Add definitions for stats pages
From: Michael Kelley @ 2026-01-16 17:01 UTC (permalink / raw)
To: Nuno Das Neves, Stanislav Kinsburskii
Cc: linux-hyperv@vger.kernel.org, linux-kernel@vger.kernel.org,
kys@microsoft.com, haiyangz@microsoft.com, wei.liu@kernel.org,
decui@microsoft.com, longli@microsoft.com,
prapal@linux.microsoft.com, mrathor@linux.microsoft.com,
paekkaladevi@linux.microsoft.com
In-Reply-To: <89385dc3-e702-4bf6-8ad7-f6e634851851@linux.microsoft.com>
From: Nuno Das Neves <nunodasneves@linux.microsoft.com> Sent: Thursday, January 15, 2026 11:35 AM
>
> On 1/15/2026 8:19 AM, Stanislav Kinsburskii wrote:
> > On Wed, Jan 14, 2026 at 01:38:02PM -0800, Nuno Das Neves wrote:
> >> Add the definitions for hypervisor, logical processor, and partition
> >> stats pages.
> >>
> >
> > The definitions for partition and virtual processor are outdated.
> > Now is a good time to sync the new values in.
> >
> > Thanks,
> > Stanislav
> >
>
> Good point, thanks, I will update it for v4.
>
> I'm finally noticing that, in the Windows code, these counters are not
> really from hvhdk.h but from their own file. Since I'm still iterating on
> this, what do you think about creating a file just for the counters?
> e.g. drivers/hv/hvcounters.h, which combines hvcountersarm64 and amd64.
>
> That would have a couple of advantages:
> 1. Not putting things in hvhdk.h which aren't actually there in the
> Windows source
> 2. Less visibility of CamelCase naming outside our driver
> 3. I could define the enums using "X macro"s to generate the show() code
> more cleanly in mshv_debugfs.c, which is something Michael suggested
> here:
> https://lore.kernel.org/linux-hyperv/SN6PR02MB4157938404BC0D12978ACD9BD4A2A@SN6PR02MB4157.namprd02.prod.outlook.com/
>
> It would look something like this:
>
> In hvcounters.h:
>
> #if IS_ENABLED(CONFIG_X86_64)
>
> #define HV_COUNTER_VP_LIST(X) \
> X(VpTotalRunTime, 1), \
> X(VpHypervisorRunTime, 2), \
> X(VpRemoteNodeRunTime, 3), \
> /* <snip> */
>
> #elif IS_ENABLED(CONFIG_ARM64)
>
> /* <snip> */
>
> #endif
>
> Just like now, it's a copy/paste from Windows + simple pattern
> replacement. Note with this approach we need separate lists for arm64
> and x86, but that matches how the enums are defined in Windows.
>
> Then, in mshv_debugfs.c:
>
> /*
> * We need the strings paired with their enum values.
> * This structure can be used for all the different stat types.
> */
> struct hv_counter_entry {
> char *name;
> int idx;
> };
>
> /* Define an array entry (again, reusable) */
> #define HV_COUNTER_LIST(name, idx) \
> { __stringify(name), idx }
Couldn't this also go in hvcounters.h, so it doesn't need to be
passed as a parameter to HV_COUNTER_VP_LIST() and friends?
Or is the goal to keep hvcounters.h as bare minimum as possible?
>
> /* Create our static array */
> static struct hv_counter_entry hv_counter_vp_array[] = {
> HV_ST_COUNTER_VP(HV_COUNTER_VP)
> };
Shouldn't the above be HV_COUNTER_VP_LIST(HV_COUNTER_LIST)
to match the #define in hvcounters.h, and the macro that does the
__stringify()? Assuming so, I think I understand the overall idea you
are proposing. It's pretty clever. :-)
The #define of HV_COUNTER_VP_LIST() in hvcounters.h gets large
for VP stats -- the #define will be about 200 lines. I have no sense
of whether being that large is problematic for the tooling. And that
question needs to be considered beyond just the C preprocessor and
compiler, to include things like sparse, cscope, and other tools that
parse source code. I had originally suggested building the static array
directly in a .c file, which would avoid the need for the big #define.
And maybe you could still do that with a separate .c source file just
for the static arrays -- i.e., hvcounters.h becomes hvcounters.c. It
seems like the " it's a copy/paste from Windows + simple pattern
replacement" could be done to generate a .c file as easily as a .h file
while still keeping the file contents to a bare minimum.
Either way (.h or .c file), I like the idea.
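A rough user-space sketch of the "static array directly in a .c file" variant, for comparison. The struct, array, and helper names (demo_*) are hypothetical, the three entries are only the first ones from the x86 list quoted above, and snprintf() stands in for seq_printf(); this is not the proposed driver code.

```c
#include <stddef.h>
#include <stdio.h>

/* Name/index pair, mirroring the hv_counter_entry idea from the thread. */
struct demo_counter_entry {
	const char *name;
	int idx;
};

/* Entries written out directly: no list macro, no __stringify(). */
static const struct demo_counter_entry demo_vp_counters[] = {
	{ "VpTotalRunTime", 1 },
	{ "VpHypervisorRunTime", 2 },
	{ "VpRemoteNodeRunTime", 3 },
};

#define DEMO_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Prefer the PARENT area value; fall back to SELF when PARENT is zero. */
unsigned long long demo_pick(unsigned long long parent_val,
			     unsigned long long self_val)
{
	return parent_val ? parent_val : self_val;
}

/*
 * Format table entry i into buf using the same "%-30s: %llu\n" layout as
 * the quoted show() function. Returns the snprintf() result, or -1 if i
 * is out of range.
 */
int demo_format_entry(char *buf, size_t len, size_t i,
		      const unsigned long long *parent,
		      const unsigned long long *self)
{
	const struct demo_counter_entry *e;

	if (i >= DEMO_ARRAY_SIZE(demo_vp_counters))
		return -1;

	e = &demo_vp_counters[i];
	return snprintf(buf, len, "%-30s: %llu\n", e->name,
			demo_pick(parent[e->idx], self[e->idx]));
}
```

The copy/paste plus pattern replacement step would then emit `{ "Name", value },` lines instead of `X(Name, value)` lines, and tools only ever see an ordinary array initializer.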
Michael
>
> static int vp_stats_show(struct seq_file *m, void *v)
> {
> const struct hv_stats_page **pstats = m->private;
> int i;
>
> for (i = 0; i < ARRAY_SIZE(hv_counter_vp_array); ++i) {
> struct hv_counter_entry entry = hv_counter_vp_array[i];
> u64 parent_val = pstats[HV_STATS_AREA_PARENT]->vp_cntrs[entry.idx];
> u64 self_val = pstats[HV_STATS_AREA_SELF]->vp_cntrs[entry.idx];
>
> /* Prioritize the PARENT area value */
> seq_printf(m, "%-30s: %llu\n", entry.name,
> parent_val ? parent_val : self_val);
> }
>
> return 0;
> }
>
> Any thoughts? I was originally going to just go with the pattern we had,
> but since these definitions aren't from the hv*dk.h files, we can maybe
> get more creative and make the resulting code look a bit better.
>
> Thanks
> Nuno
>
> >> Move the definition for the VP stats page to its rightful place in
> >> hvhdk.h, and add the missing members.
> >>
> >> While at it, correct the ARM64 value of VpRootDispatchThreadBlocked,
> >> (which is not yet used, so there is no impact).
> >>
> >> These enum members retain their CamelCase style, since they are imported
> >> directly from the hypervisor code. They will be stringified when
> >> printing the stats out, and retain more readability in this form.
> >>
> >> Signed-off-by: Nuno Das Neves <nunodasneves@linux.microsoft.com>
> >> ---
> >> drivers/hv/mshv_root_main.c | 17 --
> >> include/hyperv/hvhdk.h | 437 ++++++++++++++++++++++++++++++++++++
> >> 2 files changed, 437 insertions(+), 17 deletions(-)
> >>
> >> diff --git a/drivers/hv/mshv_root_main.c b/drivers/hv/mshv_root_main.c
> >> index fbfc9e7d9fa4..724bbaa0b08c 100644
> >> --- a/drivers/hv/mshv_root_main.c
> >> +++ b/drivers/hv/mshv_root_main.c
> >> @@ -39,23 +39,6 @@ MODULE_AUTHOR("Microsoft");
> >> MODULE_LICENSE("GPL");
> >> MODULE_DESCRIPTION("Microsoft Hyper-V root partition VMM interface /dev/mshv");
> >>
> >> -/* TODO move this to another file when debugfs code is added */
> >> -enum hv_stats_vp_counters { /* HV_THREAD_COUNTER */
> >> -#if defined(CONFIG_X86)
> >> - VpRootDispatchThreadBlocked = 202,
> >> -#elif defined(CONFIG_ARM64)
> >> - VpRootDispatchThreadBlocked = 94,
> >> -#endif
> >> - VpStatsMaxCounter
> >> -};
> >> -
> >> -struct hv_stats_page {
> >> - union {
> >> - u64 vp_cntrs[VpStatsMaxCounter]; /* VP counters */
> >> - u8 data[HV_HYP_PAGE_SIZE];
> >> - };
> >> -} __packed;
> >> -
> >> struct mshv_root mshv_root;
> >>
> >> enum hv_scheduler_type hv_scheduler_type;
> >> diff --git a/include/hyperv/hvhdk.h b/include/hyperv/hvhdk.h
> >> index 469186df7826..8bddd11feeba 100644
> >> --- a/include/hyperv/hvhdk.h
> >> +++ b/include/hyperv/hvhdk.h
> >> @@ -10,6 +10,443 @@
> >> #include "hvhdk_mini.h"
> >> #include "hvgdk.h"
> >>
> >> +enum hv_stats_hypervisor_counters { /* HV_HYPERVISOR_COUNTER */
> >> + HvLogicalProcessors = 1,
> >> + HvPartitions = 2,
> >> + HvTotalPages = 3,
> >> + HvVirtualProcessors = 4,
> >> + HvMonitoredNotifications = 5,
> >> + HvModernStandbyEntries = 6,
> >> + HvPlatformIdleTransitions = 7,
> >> + HvHypervisorStartupCost = 8,
> >> + HvIOSpacePages = 10,
> >> + HvNonEssentialPagesForDump = 11,
> >> + HvSubsumedPages = 12,
> >> + HvStatsMaxCounter
> >> +};
> >> +
> >> +enum hv_stats_partition_counters { /* HV_PROCESS_COUNTER */
> >> + PartitionVirtualProcessors = 1,
> >> + PartitionTlbSize = 3,
> >> + PartitionAddressSpaces = 4,
> >> + PartitionDepositedPages = 5,
> >> + PartitionGpaPages = 6,
> >> + PartitionGpaSpaceModifications = 7,
> >> + PartitionVirtualTlbFlushEntires = 8,
> >> + PartitionRecommendedTlbSize = 9,
> >> + PartitionGpaPages4K = 10,
> >> + PartitionGpaPages2M = 11,
> >> + PartitionGpaPages1G = 12,
> >> + PartitionGpaPages512G = 13,
> >> + PartitionDevicePages4K = 14,
> >> + PartitionDevicePages2M = 15,
> >> + PartitionDevicePages1G = 16,
> >> + PartitionDevicePages512G = 17,
> >> + PartitionAttachedDevices = 18,
> >> + PartitionDeviceInterruptMappings = 19,
> >> + PartitionIoTlbFlushes = 20,
> >> + PartitionIoTlbFlushCost = 21,
> >> + PartitionDeviceInterruptErrors = 22,
> >> + PartitionDeviceDmaErrors = 23,
> >> + PartitionDeviceInterruptThrottleEvents = 24,
> >> + PartitionSkippedTimerTicks = 25,
> >> + PartitionPartitionId = 26,
> >> +#if IS_ENABLED(CONFIG_X86_64)
> >> + PartitionNestedTlbSize = 27,
> >> + PartitionRecommendedNestedTlbSize = 28,
> >> + PartitionNestedTlbFreeListSize = 29,
> >> + PartitionNestedTlbTrimmedPages = 30,
> >> + PartitionPagesShattered = 31,
> >> + PartitionPagesRecombined = 32,
> >> + PartitionHwpRequestValue = 33,
> >> +#elif IS_ENABLED(CONFIG_ARM64)
> >> + PartitionHwpRequestValue = 27,
> >> +#endif
> >> + PartitionStatsMaxCounter
> >> +};
> >> +
> >> +enum hv_stats_vp_counters { /* HV_THREAD_COUNTER */
> >> + VpTotalRunTime = 1,
> >> + VpHypervisorRunTime = 2,
> >> + VpRemoteNodeRunTime = 3,
> >> + VpNormalizedRunTime = 4,
> >> + VpIdealCpu = 5,
> >> + VpHypercallsCount = 7,
> >> + VpHypercallsTime = 8,
> >> +#if IS_ENABLED(CONFIG_X86_64)
> >> + VpPageInvalidationsCount = 9,
> >> + VpPageInvalidationsTime = 10,
> >> + VpControlRegisterAccessesCount = 11,
> >> + VpControlRegisterAccessesTime = 12,
> >> + VpIoInstructionsCount = 13,
> >> + VpIoInstructionsTime = 14,
> >> + VpHltInstructionsCount = 15,
> >> + VpHltInstructionsTime = 16,
> >> + VpMwaitInstructionsCount = 17,
> >> + VpMwaitInstructionsTime = 18,
> >> + VpCpuidInstructionsCount = 19,
> >> + VpCpuidInstructionsTime = 20,
> >> + VpMsrAccessesCount = 21,
> >> + VpMsrAccessesTime = 22,
> >> + VpOtherInterceptsCount = 23,
> >> + VpOtherInterceptsTime = 24,
> >> + VpExternalInterruptsCount = 25,
> >> + VpExternalInterruptsTime = 26,
> >> + VpPendingInterruptsCount = 27,
> >> + VpPendingInterruptsTime = 28,
> >> + VpEmulatedInstructionsCount = 29,
> >> + VpEmulatedInstructionsTime = 30,
> >> + VpDebugRegisterAccessesCount = 31,
> >> + VpDebugRegisterAccessesTime = 32,
> >> + VpPageFaultInterceptsCount = 33,
> >> + VpPageFaultInterceptsTime = 34,
> >> + VpGuestPageTableMaps = 35,
> >> + VpLargePageTlbFills = 36,
> >> + VpSmallPageTlbFills = 37,
> >> + VpReflectedGuestPageFaults = 38,
> >> + VpApicMmioAccesses = 39,
> >> + VpIoInterceptMessages = 40,
> >> + VpMemoryInterceptMessages = 41,
> >> + VpApicEoiAccesses = 42,
> >> + VpOtherMessages = 43,
> >> + VpPageTableAllocations = 44,
> >> + VpLogicalProcessorMigrations = 45,
> >> + VpAddressSpaceEvictions = 46,
> >> + VpAddressSpaceSwitches = 47,
> >> + VpAddressDomainFlushes = 48,
> >> + VpAddressSpaceFlushes = 49,
> >> + VpGlobalGvaRangeFlushes = 50,
> >> + VpLocalGvaRangeFlushes = 51,
> >> + VpPageTableEvictions = 52,
> >> + VpPageTableReclamations = 53,
> >> + VpPageTableResets = 54,
> >> + VpPageTableValidations = 55,
> >> + VpApicTprAccesses = 56,
> >> + VpPageTableWriteIntercepts = 57,
> >> + VpSyntheticInterrupts = 58,
> >> + VpVirtualInterrupts = 59,
> >> + VpApicIpisSent = 60,
> >> + VpApicSelfIpisSent = 61,
> >> + VpGpaSpaceHypercalls = 62,
> >> + VpLogicalProcessorHypercalls = 63,
> >> + VpLongSpinWaitHypercalls = 64,
> >> + VpOtherHypercalls = 65,
> >> + VpSyntheticInterruptHypercalls = 66,
> >> + VpVirtualInterruptHypercalls = 67,
> >> + VpVirtualMmuHypercalls = 68,
> >> + VpVirtualProcessorHypercalls = 69,
> >> + VpHardwareInterrupts = 70,
> >> + VpNestedPageFaultInterceptsCount = 71,
> >> + VpNestedPageFaultInterceptsTime = 72,
> >> + VpPageScans = 73,
> >> + VpLogicalProcessorDispatches = 74,
> >> + VpWaitingForCpuTime = 75,
> >> + VpExtendedHypercalls = 76,
> >> + VpExtendedHypercallInterceptMessages = 77,
> >> + VpMbecNestedPageTableSwitches = 78,
> >> + VpOtherReflectedGuestExceptions = 79,
> >> + VpGlobalIoTlbFlushes = 80,
> >> + VpGlobalIoTlbFlushCost = 81,
> >> + VpLocalIoTlbFlushes = 82,
> >> + VpLocalIoTlbFlushCost = 83,
> >> + VpHypercallsForwardedCount = 84,
> >> + VpHypercallsForwardingTime = 85,
> >> + VpPageInvalidationsForwardedCount = 86,
> >> + VpPageInvalidationsForwardingTime = 87,
> >> + VpControlRegisterAccessesForwardedCount = 88,
> >> + VpControlRegisterAccessesForwardingTime = 89,
> >> + VpIoInstructionsForwardedCount = 90,
> >> + VpIoInstructionsForwardingTime = 91,
> >> + VpHltInstructionsForwardedCount = 92,
> >> + VpHltInstructionsForwardingTime = 93,
> >> + VpMwaitInstructionsForwardedCount = 94,
> >> + VpMwaitInstructionsForwardingTime = 95,
> >> + VpCpuidInstructionsForwardedCount = 96,
> >> + VpCpuidInstructionsForwardingTime = 97,
> >> + VpMsrAccessesForwardedCount = 98,
> >> + VpMsrAccessesForwardingTime = 99,
> >> + VpOtherInterceptsForwardedCount = 100,
> >> + VpOtherInterceptsForwardingTime = 101,
> >> + VpExternalInterruptsForwardedCount = 102,
> >> + VpExternalInterruptsForwardingTime = 103,
> >> + VpPendingInterruptsForwardedCount = 104,
> >> + VpPendingInterruptsForwardingTime = 105,
> >> + VpEmulatedInstructionsForwardedCount = 106,
> >> + VpEmulatedInstructionsForwardingTime = 107,
> >> + VpDebugRegisterAccessesForwardedCount = 108,
> >> + VpDebugRegisterAccessesForwardingTime = 109,
> >> + VpPageFaultInterceptsForwardedCount = 110,
> >> + VpPageFaultInterceptsForwardingTime = 111,
> >> + VpVmclearEmulationCount = 112,
> >> + VpVmclearEmulationTime = 113,
> >> + VpVmptrldEmulationCount = 114,
> >> + VpVmptrldEmulationTime = 115,
> >> + VpVmptrstEmulationCount = 116,
> >> + VpVmptrstEmulationTime = 117,
> >> + VpVmreadEmulationCount = 118,
> >> + VpVmreadEmulationTime = 119,
> >> + VpVmwriteEmulationCount = 120,
> >> + VpVmwriteEmulationTime = 121,
> >> + VpVmxoffEmulationCount = 122,
> >> + VpVmxoffEmulationTime = 123,
> >> + VpVmxonEmulationCount = 124,
> >> + VpVmxonEmulationTime = 125,
> >> + VpNestedVMEntriesCount = 126,
> >> + VpNestedVMEntriesTime = 127,
> >> + VpNestedSLATSoftPageFaultsCount = 128,
> >> + VpNestedSLATSoftPageFaultsTime = 129,
> >> + VpNestedSLATHardPageFaultsCount = 130,
> >> + VpNestedSLATHardPageFaultsTime = 131,
> >> + VpInvEptAllContextEmulationCount = 132,
> >> + VpInvEptAllContextEmulationTime = 133,
> >> + VpInvEptSingleContextEmulationCount = 134,
> >> + VpInvEptSingleContextEmulationTime = 135,
> >> + VpInvVpidAllContextEmulationCount = 136,
> >> + VpInvVpidAllContextEmulationTime = 137,
> >> + VpInvVpidSingleContextEmulationCount = 138,
> >> + VpInvVpidSingleContextEmulationTime = 139,
> >> + VpInvVpidSingleAddressEmulationCount = 140,
> >> + VpInvVpidSingleAddressEmulationTime = 141,
> >> + VpNestedTlbPageTableReclamations = 142,
> >> + VpNestedTlbPageTableEvictions = 143,
> >> + VpFlushGuestPhysicalAddressSpaceHypercalls = 144,
> >> + VpFlushGuestPhysicalAddressListHypercalls = 145,
> >> + VpPostedInterruptNotifications = 146,
> >> + VpPostedInterruptScans = 147,
> >> + VpTotalCoreRunTime = 148,
> >> + VpMaximumRunTime = 149,
> >> + VpHwpRequestContextSwitches = 150,
> >> + VpWaitingForCpuTimeBucket0 = 151,
> >> + VpWaitingForCpuTimeBucket1 = 152,
> >> + VpWaitingForCpuTimeBucket2 = 153,
> >> + VpWaitingForCpuTimeBucket3 = 154,
> >> + VpWaitingForCpuTimeBucket4 = 155,
> >> + VpWaitingForCpuTimeBucket5 = 156,
> >> + VpWaitingForCpuTimeBucket6 = 157,
> >> + VpVmloadEmulationCount = 158,
> >> + VpVmloadEmulationTime = 159,
> >> + VpVmsaveEmulationCount = 160,
> >> + VpVmsaveEmulationTime = 161,
> >> + VpGifInstructionEmulationCount = 162,
> >> + VpGifInstructionEmulationTime = 163,
> >> + VpEmulatedErrataSvmInstructions = 164,
> >> + VpPlaceholder1 = 165,
> >> + VpPlaceholder2 = 166,
> >> + VpPlaceholder3 = 167,
> >> + VpPlaceholder4 = 168,
> >> + VpPlaceholder5 = 169,
> >> + VpPlaceholder6 = 170,
> >> + VpPlaceholder7 = 171,
> >> + VpPlaceholder8 = 172,
> >> + VpPlaceholder9 = 173,
> >> + VpPlaceholder10 = 174,
> >> + VpSchedulingPriority = 175,
> >> + VpRdpmcInstructionsCount = 176,
> >> + VpRdpmcInstructionsTime = 177,
> >> + VpPerfmonPmuMsrAccessesCount = 178,
> >> + VpPerfmonLbrMsrAccessesCount = 179,
> >> + VpPerfmonIptMsrAccessesCount = 180,
> >> + VpPerfmonInterruptCount = 181,
> >> + VpVtl1DispatchCount = 182,
> >> + VpVtl2DispatchCount = 183,
> >> + VpVtl2DispatchBucket0 = 184,
> >> + VpVtl2DispatchBucket1 = 185,
> >> + VpVtl2DispatchBucket2 = 186,
> >> + VpVtl2DispatchBucket3 = 187,
> >> + VpVtl2DispatchBucket4 = 188,
> >> + VpVtl2DispatchBucket5 = 189,
> >> + VpVtl2DispatchBucket6 = 190,
> >> + VpVtl1RunTime = 191,
> >> + VpVtl2RunTime = 192,
> >> + VpIommuHypercalls = 193,
> >> + VpCpuGroupHypercalls = 194,
> >> + VpVsmHypercalls = 195,
> >> + VpEventLogHypercalls = 196,
> >> + VpDeviceDomainHypercalls = 197,
> >> + VpDepositHypercalls = 198,
> >> + VpSvmHypercalls = 199,
> >> + VpBusLockAcquisitionCount = 200,
> >> + VpLoadAvg = 201,
> >> + VpRootDispatchThreadBlocked = 202,
> >> +#elif IS_ENABLED(CONFIG_ARM64)
> >> + VpSysRegAccessesCount = 9,
> >> + VpSysRegAccessesTime = 10,
> >> + VpSmcInstructionsCount = 11,
> >> + VpSmcInstructionsTime = 12,
> >> + VpOtherInterceptsCount = 13,
> >> + VpOtherInterceptsTime = 14,
> >> + VpExternalInterruptsCount = 15,
> >> + VpExternalInterruptsTime = 16,
> >> + VpPendingInterruptsCount = 17,
> >> + VpPendingInterruptsTime = 18,
> >> + VpGuestPageTableMaps = 19,
> >> + VpLargePageTlbFills = 20,
> >> + VpSmallPageTlbFills = 21,
> >> + VpReflectedGuestPageFaults = 22,
> >> + VpMemoryInterceptMessages = 23,
> >> + VpOtherMessages = 24,
> >> + VpLogicalProcessorMigrations = 25,
> >> + VpAddressDomainFlushes = 26,
> >> + VpAddressSpaceFlushes = 27,
> >> + VpSyntheticInterrupts = 28,
> >> + VpVirtualInterrupts = 29,
> >> + VpApicSelfIpisSent = 30,
> >> + VpGpaSpaceHypercalls = 31,
> >> + VpLogicalProcessorHypercalls = 32,
> >> + VpLongSpinWaitHypercalls = 33,
> >> + VpOtherHypercalls = 34,
> >> + VpSyntheticInterruptHypercalls = 35,
> >> + VpVirtualInterruptHypercalls = 36,
> >> + VpVirtualMmuHypercalls = 37,
> >> + VpVirtualProcessorHypercalls = 38,
> >> + VpHardwareInterrupts = 39,
> >> + VpNestedPageFaultInterceptsCount = 40,
> >> + VpNestedPageFaultInterceptsTime = 41,
> >> + VpLogicalProcessorDispatches = 42,
> >> + VpWaitingForCpuTime = 43,
> >> + VpExtendedHypercalls = 44,
> >> + VpExtendedHypercallInterceptMessages = 45,
> >> + VpMbecNestedPageTableSwitches = 46,
> >> + VpOtherReflectedGuestExceptions = 47,
> >> + VpGlobalIoTlbFlushes = 48,
> >> + VpGlobalIoTlbFlushCost = 49,
> >> + VpLocalIoTlbFlushes = 50,
> >> + VpLocalIoTlbFlushCost = 51,
> >> + VpFlushGuestPhysicalAddressSpaceHypercalls = 52,
> >> + VpFlushGuestPhysicalAddressListHypercalls = 53,
> >> + VpPostedInterruptNotifications = 54,
> >> + VpPostedInterruptScans = 55,
> >> + VpTotalCoreRunTime = 56,
> >> + VpMaximumRunTime = 57,
> >> + VpWaitingForCpuTimeBucket0 = 58,
> >> + VpWaitingForCpuTimeBucket1 = 59,
> >> + VpWaitingForCpuTimeBucket2 = 60,
> >> + VpWaitingForCpuTimeBucket3 = 61,
> >> + VpWaitingForCpuTimeBucket4 = 62,
> >> + VpWaitingForCpuTimeBucket5 = 63,
> >> + VpWaitingForCpuTimeBucket6 = 64,
> >> + VpHwpRequestContextSwitches = 65,
> >> + VpPlaceholder2 = 66,
> >> + VpPlaceholder3 = 67,
> >> + VpPlaceholder4 = 68,
> >> + VpPlaceholder5 = 69,
> >> + VpPlaceholder6 = 70,
> >> + VpPlaceholder7 = 71,
> >> + VpPlaceholder8 = 72,
> >> + VpContentionTime = 73,
> >> + VpWakeUpTime = 74,
> >> + VpSchedulingPriority = 75,
> >> + VpVtl1DispatchCount = 76,
> >> + VpVtl2DispatchCount = 77,
> >> + VpVtl2DispatchBucket0 = 78,
> >> + VpVtl2DispatchBucket1 = 79,
> >> + VpVtl2DispatchBucket2 = 80,
> >> + VpVtl2DispatchBucket3 = 81,
> >> + VpVtl2DispatchBucket4 = 82,
> >> + VpVtl2DispatchBucket5 = 83,
> >> + VpVtl2DispatchBucket6 = 84,
> >> + VpVtl1RunTime = 85,
> >> + VpVtl2RunTime = 86,
> >> + VpIommuHypercalls = 87,
> >> + VpCpuGroupHypercalls = 88,
> >> + VpVsmHypercalls = 89,
> >> + VpEventLogHypercalls = 90,
> >> + VpDeviceDomainHypercalls = 91,
> >> + VpDepositHypercalls = 92,
> >> + VpSvmHypercalls = 93,
> >> + VpLoadAvg = 94,
> >> + VpRootDispatchThreadBlocked = 95,
> >> +#endif
> >> + VpStatsMaxCounter
> >> +};
> >> +
> >> +enum hv_stats_lp_counters { /* HV_CPU_COUNTER */
> >> + LpGlobalTime = 1,
> >> + LpTotalRunTime = 2,
> >> + LpHypervisorRunTime = 3,
> >> + LpHardwareInterrupts = 4,
> >> + LpContextSwitches = 5,
> >> + LpInterProcessorInterrupts = 6,
> >> + LpSchedulerInterrupts = 7,
> >> + LpTimerInterrupts = 8,
> >> + LpInterProcessorInterruptsSent = 9,
> >> + LpProcessorHalts = 10,
> >> + LpMonitorTransitionCost = 11,
> >> + LpContextSwitchTime = 12,
> >> + LpC1TransitionsCount = 13,
> >> + LpC1RunTime = 14,
> >> + LpC2TransitionsCount = 15,
> >> + LpC2RunTime = 16,
> >> + LpC3TransitionsCount = 17,
> >> + LpC3RunTime = 18,
> >> + LpRootVpIndex = 19,
> >> + LpIdleSequenceNumber = 20,
> >> + LpGlobalTscCount = 21,
> >> + LpActiveTscCount = 22,
> >> + LpIdleAccumulation = 23,
> >> + LpReferenceCycleCount0 = 24,
> >> + LpActualCycleCount0 = 25,
> >> + LpReferenceCycleCount1 = 26,
> >> + LpActualCycleCount1 = 27,
> >> + LpProximityDomainId = 28,
> >> + LpPostedInterruptNotifications = 29,
> >> + LpBranchPredictorFlushes = 30,
> >> +#if IS_ENABLED(CONFIG_X86_64)
> >> + LpL1DataCacheFlushes = 31,
> >> + LpImmediateL1DataCacheFlushes = 32,
> >> + LpMbFlushes = 33,
> >> + LpCounterRefreshSequenceNumber = 34,
> >> + LpCounterRefreshReferenceTime = 35,
> >> + LpIdleAccumulationSnapshot = 36,
> >> + LpActiveTscCountSnapshot = 37,
> >> + LpHwpRequestContextSwitches = 38,
> >> + LpPlaceholder1 = 39,
> >> + LpPlaceholder2 = 40,
> >> + LpPlaceholder3 = 41,
> >> + LpPlaceholder4 = 42,
> >> + LpPlaceholder5 = 43,
> >> + LpPlaceholder6 = 44,
> >> + LpPlaceholder7 = 45,
> >> + LpPlaceholder8 = 46,
> >> + LpPlaceholder9 = 47,
> >> + LpPlaceholder10 = 48,
> >> + LpReserveGroupId = 49,
> >> + LpRunningPriority = 50,
> >> + LpPerfmonInterruptCount = 51,
> >> +#elif IS_ENABLED(CONFIG_ARM64)
> >> + LpCounterRefreshSequenceNumber = 31,
> >> + LpCounterRefreshReferenceTime = 32,
> >> + LpIdleAccumulationSnapshot = 33,
> >> + LpActiveTscCountSnapshot = 34,
> >> + LpHwpRequestContextSwitches = 35,
> >> + LpPlaceholder2 = 36,
> >> + LpPlaceholder3 = 37,
> >> + LpPlaceholder4 = 38,
> >> + LpPlaceholder5 = 39,
> >> + LpPlaceholder6 = 40,
> >> + LpPlaceholder7 = 41,
> >> + LpPlaceholder8 = 42,
> >> + LpPlaceholder9 = 43,
> >> + LpSchLocalRunListSize = 44,
> >> + LpReserveGroupId = 45,
> >> + LpRunningPriority = 46,
> >> +#endif
> >> + LpStatsMaxCounter
> >> +};
> >> +
> >> +/*
> >> + * Hypervisor statistics page format
> >> + */
> >> +struct hv_stats_page {
> >> + union {
> >> + u64 hv_cntrs[HvStatsMaxCounter]; /* Hypervisor counters */
> >> + u64 pt_cntrs[PartitionStatsMaxCounter]; /* Partition counters */
> >> + u64 vp_cntrs[VpStatsMaxCounter]; /* VP counters */
> >> + u64 lp_cntrs[LpStatsMaxCounter]; /* LP counters */
> >> + u8 data[HV_HYP_PAGE_SIZE];
> >> + };
> >> +} __packed;
> >> +
> >> /* Bits for dirty mask of hv_vp_register_page */
> >> #define HV_X64_REGISTER_CLASS_GENERAL 0
> >> #define HV_X64_REGISTER_CLASS_IP 1
> >> --
> >> 2.34.1
* RE: [EXTERNAL] Re: [PATCH V2,net-next, 1/2] net: mana: Add support for coalesced RX packets on CQE
From: Haiyang Zhang @ 2026-01-16 16:44 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Haiyang Zhang, linux-hyperv@vger.kernel.org,
netdev@vger.kernel.org, KY Srinivasan, Wei Liu, Dexuan Cui,
Long Li, Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
Konstantin Taranov, Simon Horman, Erni Sri Satya Vennela,
Shradha Gupta, Saurabh Sengar, Aditya Garg, Dipayaan Roy,
Shiraz Saleem, linux-kernel@vger.kernel.org,
linux-rdma@vger.kernel.org, Paul Rosswurm
In-Reply-To: <20260115181434.4494fe9f@kernel.org>
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: Thursday, January 15, 2026 9:15 PM
> To: Haiyang Zhang <haiyangz@microsoft.com>
> Cc: Haiyang Zhang <haiyangz@linux.microsoft.com>; linux-
> hyperv@vger.kernel.org; netdev@vger.kernel.org; KY Srinivasan
> <kys@microsoft.com>; Wei Liu <wei.liu@kernel.org>; Dexuan Cui
> <DECUI@microsoft.com>; Long Li <longli@microsoft.com>; Andrew Lunn
> <andrew+netdev@lunn.ch>; David S. Miller <davem@davemloft.net>; Eric
> Dumazet <edumazet@google.com>; Paolo Abeni <pabeni@redhat.com>; Konstantin
> Taranov <kotaranov@microsoft.com>; Simon Horman <horms@kernel.org>; Erni
> Sri Satya Vennela <ernis@linux.microsoft.com>; Shradha Gupta
> <shradhagupta@linux.microsoft.com>; Saurabh Sengar
> <ssengar@linux.microsoft.com>; Aditya Garg
> <gargaditya@linux.microsoft.com>; Dipayaan Roy
> <dipayanroy@linux.microsoft.com>; Shiraz Saleem
> <shirazsaleem@microsoft.com>; linux-kernel@vger.kernel.org; linux-
> rdma@vger.kernel.org; Paul Rosswurm <paulros@microsoft.com>
> Subject: Re: [EXTERNAL] Re: [PATCH V2,net-next, 1/2] net: mana: Add
> support for coalesced RX packets on CQE
>
> On Thu, 15 Jan 2026 19:57:44 +0000 Haiyang Zhang wrote:
> > > > > When coalescing is enabled, the device waits for packets which can
> > > > > have the CQE coalesced with previous packet(s). That coalescing process
> > > > > is finished (and a CQE written to the appropriate CQ) when the CQE is
> > > > > filled with 4 pkts, or time expired, or other device specific logic is
> > > > > satisfied.
> > >
> > > See, what I'm afraid is happening here is that you are enabling
> > > completion coalescing (how long the device keeps the CQE pending).
> > > Which is _not_ what rx_max_coalesced_frames controls for most NICs.
> > > For most NICs rx_max_coalesced_frames controls IRQ generation logic.
> > >
> > > The NIC first buffers up CQEs for typically single digit usecs, and
> > > > then once CQE timer expired and writeback happened it starts an IRQ
> > > coalescing timer. Once the IRQ coalescing timer expires IRQ is
> > > triggered, which schedules NAPI. (broad strokes, obviously many
> > > differences and optimizations exist)
> > >
> > > > Is my guess correct? Are you controlling CQE coalescing?
> > >
> > > Can you control the timeout instead of the frame count?
> >
> > Our NIC's timeout value cannot be controlled by driver. Also, the
> > timeout may be changed in future NIC HW.
> >
> > So, I use the ethtool/rx-frames, which is either 1 or 4 on our
> > NIC, to switch the CQE coalescing feature on/off.
>
> I feel like this is not the first time I'm having a conversation with
> you where you are not answering my direct questions, not just one
> sliver. IDK why you're doing this, but being able to participate
> in an email exchange is a bare minimum for participating upstream.
> Please consider this a warning.
Sure, let me try to reply again -- does this (see below) answer all
your questions? And, feel free to ask any further questions, we are
willing to collaborate with you and other upstream people at any time :)
> The NIC first buffers up CQEs for typically single digit usecs, and
> then once CQE timer expired and writeback happened it starts an IRQ
> coalescing timer. Once the IRQ coalescing timer expires IRQ is
> triggered, which schedules NAPI. (broad strokes, obviously many
> differences and optimizations exist)
> Is my guess correct? Are you controlling CQE coalescing?
Yes, it's correct. And we are controlling "CQE coalescing".
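To make the two stages concrete, here is a rough sketch of the timeline described in the quoted paragraph: the CQE is held for a writeback timeout, and only after writeback does the IRQ coalescing timer start. This is an illustrative model only, not driver code. The struct, the function name and the microsecond parameters are hypothetical, and the real timeout on our NIC is not visible to the driver.

```c
#include <stdint.h>

/* Hypothetical model of the two-stage RX completion path. */
struct rx_timeline {
	uint64_t cqe_writeback_us; /* when the CQE is written to the CQ */
	uint64_t irq_us;           /* when the IRQ fires and NAPI is scheduled */
};

/*
 * Stage 1: the device holds the CQE until the CQE timer expires.
 * Stage 2: the IRQ coalescing timer starts only after the writeback.
 * Both timeout values are assumed inputs for illustration.
 */
static struct rx_timeline rx_model_timeline(uint64_t pkt_arrival_us,
					    uint64_t cqe_timeout_us,
					    uint64_t irq_timeout_us)
{
	struct rx_timeline t;

	t.cqe_writeback_us = pkt_arrival_us + cqe_timeout_us;
	t.irq_us = t.cqe_writeback_us + irq_timeout_us;
	return t;
}
```

In this model the two delays add up, which is why a CQE writeback knob and an IRQ coalescing knob are not interchangeable.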
>
> If I interpret your reply correctly you are indeed coalescing writeback.
Yes, we are coalescing CQE writeback.
> You need to add a new param to the uAPI.
Since this feature is not common to other NICs, can we use an
ethtool private flag instead?
When the flag is set, the CQE coalescing will be enabled and put
up to 4 pkts in a CQE.
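As a rough illustration of the behavior being switched on/off, the sketch below models the two completion conditions mentioned in this thread: the CQE is written when it holds the maximum number of packets (1 or 4), or when a timeout expires. It is a simplified model under stated assumptions, not the MANA driver or firmware logic. All names are hypothetical, the timeout value is invented for the example, and the "other device specific logic" is not modeled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of CQE coalescing; not actual device logic. */
struct cqe_model {
	unsigned int max_pkts;  /* 1 = coalescing off, 4 = coalescing on */
	unsigned int pending;   /* packets accumulated in the open CQE */
	uint64_t first_pkt_ns;  /* arrival time of the first pending packet */
	uint64_t timeout_ns;    /* assumed value; the real one is device-internal */
};

/* Account for one received packet; true means the CQE is written back now. */
static bool cqe_model_rx(struct cqe_model *m, uint64_t now_ns)
{
	if (m->pending == 0)
		m->first_pkt_ns = now_ns;
	m->pending++;

	if (m->pending >= m->max_pkts) {
		m->pending = 0;
		return true;
	}
	return false;
}

/* Timer check; true means a partially filled CQE is flushed on timeout. */
static bool cqe_model_tick(struct cqe_model *m, uint64_t now_ns)
{
	if (m->pending && now_ns - m->first_pkt_ns >= m->timeout_ns) {
		m->pending = 0;
		return true;
	}
	return false;
}
```

With max_pkts set to 1 every packet completes immediately, which corresponds to the feature being off.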
> Please add both size and
> timeout. Expose the timeout as read only if your device doesn't support
> controlling it per queue.
Does the "size" mean the max pkts per CQE (1 or 4)?
The timeout value is not even exposed to the driver, and is subject to
change in the future. Also the HW mechanism is proprietary... So, can we
not "expose" the timeout value in "ethtool -c" outputs, because it's not
available at driver level?
Thanks,
- Haiyang