Linux virtualization list
 help / color / mirror / Atom feed
* Re: [PATCH v3 0/5] nvdimm: virtio_pmem: fix request lifetime and converge broken queue failures
From: Alison Schofield @ 2026-06-02  1:51 UTC (permalink / raw)
  To: Li Chen
  Cc: Pankaj Gupta, Dan Williams, Vishal Verma, Dave Jiang, Ira Weiny,
	virtualization, nvdimm, linux-kernel
In-Reply-To: <20260226025712.2236279-1-me@linux.beauty>

On Thu, Feb 26, 2026 at 10:57:05AM +0800, Li Chen wrote:
> Hi,
> 
> The virtio-pmem flush path uses a virtqueue cookie/token to carry a
> per-request context through completion. Under broken virtqueue / notify
> failure conditions, the submitter can return and free the request object
> while the host/backend may still complete the published request. The IRQ
> completion handler then dereferences freed memory when waking waiters,
> which is reported by KASAN as a slab-use-after-free and may manifest as
> lock corruption (e.g. "BUG: spinlock already unlocked") without KASAN.
> 
> In addition, the flush path has two wait sites: one for virtqueue
> descriptor availability (-ENOSPC from virtqueue_add_sgs()) and one for
> request completion. If the virtqueue becomes broken, forward progress is
> no longer guaranteed and these waiters may sleep indefinitely unless the
> driver converges the failure and wakes all wait sites.
> 
> This series addresses both issues:
> 
> 1/5 nvdimm: virtio_pmem: always wake -ENOSPC waiters
> Wake one -ENOSPC waiter for each reclaimed used buffer, decoupled from
> token completion.
> 
> 2/5 nvdimm: virtio_pmem: use READ_ONCE()/WRITE_ONCE() for wait flags
> Use READ_ONCE()/WRITE_ONCE() for the wait_event() flags (done and
> wq_buf_avail).
> 
> 3/5 nvdimm: virtio_pmem: refcount requests for token lifetime
> Refcount request objects so the token lifetime spans the window where it
> is reachable through the virtqueue until completion/drain drops the
> virtqueue reference.
> 
> 4/5 nvdimm: virtio_pmem: converge broken virtqueue to -EIO
> Track a device-level broken state to converge broken/notify failures to
> -EIO: wake all waiters and drain/detach outstanding requests to complete
> them with an error, and fail-fast new requests.
> 
> 5/5 nvdimm: virtio_pmem: drain requests in freeze
> Drain outstanding requests in freeze() before tearing down virtqueues so
> waiters do not sleep indefinitely.
> 
> Testing was done on QEMU x86_64 with a virtio-pmem device exported as
> /dev/pmem0, formatted with ext4 (-O fast_commit), mounted with DAX, and
> stressed with fsync-heavy workloads.
> 
> Thanks,
> Li Chen

Hi Li Chen,

Today I took a look at this set, noting that it's been sitting idle 
in our nvdimm backlog for a while. I'm not able to apply it. Can you
post a new rev that applies to 7.1-rc6 ?

Thanks,
Alison


^ permalink raw reply

* Re: [PATCH v4 02/47] x86/tsc: Add a standalone helpers for getting TSC info from CPUID.0x15
From: Borislav Petkov @ 2026-06-02  3:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Dave Hansen, x86,
	Kiryl Shutsemau, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov, Jan Kiszka,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	John Stultz, H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner
In-Reply-To: <20260529144435.704127-3-seanjc@google.com>

On Fri, May 29, 2026 at 07:43:49AM -0700, Sean Christopherson wrote:
> +static int cpuid_get_tsc_info(struct cpuid_tsc_info *info)
> +{
> +	unsigned int ecx_hz, edx;
> +
> +	memset(info, 0, sizeof(*info));

Let's not clear this unnecessarily...

> +
> +	if (boot_cpu_data.cpuid_level < CPUID_LEAF_TSC)
> +		return -ENOENT;

... just to return here...

> +
> +	/* CPUID 15H TSC/Crystal ratio, plus optionally Crystal Hz */
> +	cpuid(CPUID_LEAF_TSC, &info->denominator, &info->numerator, &ecx_hz, &edx);
> +
> +	if (!info->denominator || !info->numerator)
> +		return -ENOENT;

... or here.

We wanna clear it here, when we'll return success.

> +
> +	/*
> +	 * Note, some CPUs provide the multiplier information, but not the core

	Note: some CPUs...

> +	 * crystal frequency.  The multiplier information is still useful for
> +	 * such CPUs, as the crystal frequency can be gleaned from CPUID.0x16.
> +	 */
> +	info->crystal_khz = ecx_hz / 1000;
> +	return 0;
> +}
> +
> +int __init cpuid_get_tsc_freq(struct cpuid_tsc_info *info)
> +{
> +	if (cpuid_get_tsc_info(info) || !info->crystal_khz)
> +		return -ENOENT;
> +
> +	info->tsc_khz = info->crystal_khz * info->numerator / info->denominator;
> +	return 0;
> +}

Unused here. Add it with its first user pls.

-- 
Regards/Gruss,
    Boris.

https://people.kernel.org/tglx/notes-about-netiquette

^ permalink raw reply

* [PATCH] VIRTIO: Update the desc 'flag' fied last in packed ring.
From: yangjiale @ 2026-06-02  4:31 UTC (permalink / raw)
  To: Michael S . Tsirkin
  Cc: Jason Wang, Xuan Zhuo, Eugenio Pérez, virtualization,
	linux-kernel, yangjiale

When a descriptor list spans across cache lines,
updating the flag first can lead to a scenario where the device side
perceives the flag as valid, yet the corresponding address and length
fields remain unupdated—resulting in invalid values.
Therefore, the flag field must be updated last.

Signed-off-by: yangjiale <yangjiale133@163.com>
---
 drivers/virtio/virtio_ring.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
index fbca7ce1c6bf..036b4f90d30f 100644
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -1688,6 +1688,10 @@ static inline int virtqueue_add_packed(struct vring_virtqueue *vq,
 					     &addr, &len, premapped, attr))
 				goto unmap_release;
 
+			desc[i].addr = cpu_to_le64(addr);
+			desc[i].len = cpu_to_le32(len);
+			desc[i].id = cpu_to_le16(id);
+
 			flags = cpu_to_le16(vq->packed.avail_used_flags |
 				    (++c == total_sg ? 0 : VRING_DESC_F_NEXT) |
 				    (n < out_sgs ? 0 : VRING_DESC_F_WRITE));
@@ -1696,10 +1700,6 @@ static inline int virtqueue_add_packed(struct vring_virtqueue *vq,
 			else
 				desc[i].flags = flags;
 
-			desc[i].addr = cpu_to_le64(addr);
-			desc[i].len = cpu_to_le32(len);
-			desc[i].id = cpu_to_le16(id);
-
 			if (unlikely(vq->use_map_api)) {
 				vq->packed.desc_extra[curr].addr = premapped ?
 					DMA_MAPPING_ERROR : addr;
-- 
2.25.1


^ permalink raw reply related

* Re: [PATCH] VIRTIO: Update the desc 'flag' fied last in packed ring.
From: Eugenio Perez Martin @ 2026-06-02  6:04 UTC (permalink / raw)
  To: yangjiale
  Cc: Michael S . Tsirkin, Jason Wang, Xuan Zhuo, virtualization,
	linux-kernel
In-Reply-To: <20260602043123.10207-1-yangjiale133@163.com>

On Tue, Jun 2, 2026 at 6:34 AM yangjiale <yangjiale133@163.com> wrote:
>
> When a descriptor list spans across cache lines,
> updating the flag first can lead to a scenario where the device side
> perceives the flag as valid, yet the corresponding address and length
> fields remain unupdated—resulting in invalid values.
> Therefore, the flag field must be updated last.
>
> Signed-off-by: yangjiale <yangjiale133@163.com>
> ---
>  drivers/virtio/virtio_ring.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index fbca7ce1c6bf..036b4f90d30f 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -1688,6 +1688,10 @@ static inline int virtqueue_add_packed(struct vring_virtqueue *vq,
>                                              &addr, &len, premapped, attr))
>                                 goto unmap_release;
>
> +                       desc[i].addr = cpu_to_le64(addr);
> +                       desc[i].len = cpu_to_le32(len);
> +                       desc[i].id = cpu_to_le16(id);
> +
>                         flags = cpu_to_le16(vq->packed.avail_used_flags |
>                                     (++c == total_sg ? 0 : VRING_DESC_F_NEXT) |
>                                     (n < out_sgs ? 0 : VRING_DESC_F_WRITE));
> @@ -1696,10 +1700,6 @@ static inline int virtqueue_add_packed(struct vring_virtqueue *vq,
>                         else
>                                 desc[i].flags = flags;
>
> -                       desc[i].addr = cpu_to_le64(addr);
> -                       desc[i].len = cpu_to_le32(len);
> -                       desc[i].id = cpu_to_le16(id);
> -
>                         if (unlikely(vq->use_map_api)) {
>                                 vq->packed.desc_extra[curr].addr = premapped ?
>                                         DMA_MAPPING_ERROR : addr;

These flags are updated before the flags of the head descriptor at the
end of the function, at "vq->packed.vring.desc[head].flags =
head_flags", so the device should not see these. Because of that, the
relative order between the rest of the fields of the same descriptor
or other descriptors' fields, except for the head descriptor's flags,
should not matter. There is a write memory barrier just before
updating the head's flags.

Also, I don't get why the cache line matters here. Can you expand? Am
I missing something?


^ permalink raw reply

* Re: [PATCH] VIRTIO: Update the desc 'flag' fied last in packed ring.
From: Michael S. Tsirkin @ 2026-06-02  6:59 UTC (permalink / raw)
  To: yangjiale
  Cc: Jason Wang, Xuan Zhuo, Eugenio Pérez, virtualization,
	linux-kernel
In-Reply-To: <20260602043123.10207-1-yangjiale133@163.com>

On Tue, Jun 02, 2026 at 12:31:23PM +0800, yangjiale wrote:
> When a descriptor list spans across cache lines,
> updating the flag first can lead to a scenario where the device side
> perceives the flag as valid, yet the corresponding address and length
> fields remain unupdated—resulting in invalid values.
> Therefore, the flag field must be updated last.
> 
> Signed-off-by: yangjiale <yangjiale133@163.com>
> ---
>  drivers/virtio/virtio_ring.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index fbca7ce1c6bf..036b4f90d30f 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -1688,6 +1688,10 @@ static inline int virtqueue_add_packed(struct vring_virtqueue *vq,
>  					     &addr, &len, premapped, attr))
>  				goto unmap_release;
>  
> +			desc[i].addr = cpu_to_le64(addr);
> +			desc[i].len = cpu_to_le32(len);
> +			desc[i].id = cpu_to_le16(id);
> +
>  			flags = cpu_to_le16(vq->packed.avail_used_flags |
>  				    (++c == total_sg ? 0 : VRING_DESC_F_NEXT) |
>  				    (n < out_sgs ? 0 : VRING_DESC_F_WRITE));
> @@ -1696,10 +1700,6 @@ static inline int virtqueue_add_packed(struct vring_virtqueue *vq,
>  			else
>  				desc[i].flags = flags;
>  
> -			desc[i].addr = cpu_to_le64(addr);
> -			desc[i].len = cpu_to_le32(len);
> -			desc[i].id = cpu_to_le16(id);
> -
>  			if (unlikely(vq->use_map_api)) {
>  				vq->packed.desc_extra[curr].addr = premapped ?
>  					DMA_MAPPING_ERROR : addr;

Good catch!  And presumably we need a write barrier then?

> -- 
> 2.25.1


^ permalink raw reply

* Re: [PATCH] VIRTIO: Update the desc 'flag' fied last in packed ring.
From: Michael S. Tsirkin @ 2026-06-02  6:59 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: yangjiale, Jason Wang, Xuan Zhuo, virtualization, linux-kernel
In-Reply-To: <CAJaqyWfugEttrQuR5_LrbQaivAqCaixsprUOjEVtdiz_sexFpA@mail.gmail.com>

On Tue, Jun 02, 2026 at 08:04:13AM +0200, Eugenio Perez Martin wrote:
> On Tue, Jun 2, 2026 at 6:34 AM yangjiale <yangjiale133@163.com> wrote:
> >
> > When a descriptor list spans across cache lines,
> > updating the flag first can lead to a scenario where the device side
> > perceives the flag as valid, yet the corresponding address and length
> > fields remain unupdated—resulting in invalid values.
> > Therefore, the flag field must be updated last.
> >
> > Signed-off-by: yangjiale <yangjiale133@163.com>
> > ---
> >  drivers/virtio/virtio_ring.c | 8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> >
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index fbca7ce1c6bf..036b4f90d30f 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -1688,6 +1688,10 @@ static inline int virtqueue_add_packed(struct vring_virtqueue *vq,
> >                                              &addr, &len, premapped, attr))
> >                                 goto unmap_release;
> >
> > +                       desc[i].addr = cpu_to_le64(addr);
> > +                       desc[i].len = cpu_to_le32(len);
> > +                       desc[i].id = cpu_to_le16(id);
> > +
> >                         flags = cpu_to_le16(vq->packed.avail_used_flags |
> >                                     (++c == total_sg ? 0 : VRING_DESC_F_NEXT) |
> >                                     (n < out_sgs ? 0 : VRING_DESC_F_WRITE));
> > @@ -1696,10 +1700,6 @@ static inline int virtqueue_add_packed(struct vring_virtqueue *vq,
> >                         else
> >                                 desc[i].flags = flags;
> >
> > -                       desc[i].addr = cpu_to_le64(addr);
> > -                       desc[i].len = cpu_to_le32(len);
> > -                       desc[i].id = cpu_to_le16(id);
> > -
> >                         if (unlikely(vq->use_map_api)) {
> >                                 vq->packed.desc_extra[curr].addr = premapped ?
> >                                         DMA_MAPPING_ERROR : addr;
> 
> These flags are updated before the flags of the head descriptor at the
> end of the function, at "vq->packed.vring.desc[head].flags =
> head_flags", so the device should not see these. Because of that, the
> relative order between the rest of the fields of the same descriptor
> or other descriptors' fields, except for the head descriptor's flags,
> should not matter. There is a write memory barrier just before
> updating the head's flags.
> 
> Also, I don't get why the cache line matters here. Can you expand? Am
> I missing something?

Oh desc is just a local thing. ENOCOFFEE )


^ permalink raw reply

* Re: [PATCH] VIRTIO: Update the desc 'flag' fied last in packed ring.
From: Dongli Zhang @ 2026-06-02  9:21 UTC (permalink / raw)
  To: Michael S. Tsirkin, Eugenio Perez Martin, yangjiale
  Cc: Jason Wang, Xuan Zhuo, virtualization, linux-kernel
In-Reply-To: <20260602025930-mutt-send-email-mst@kernel.org>



On 2026-06-01 11:59 PM, Michael S. Tsirkin wrote:
> On Tue, Jun 02, 2026 at 08:04:13AM +0200, Eugenio Perez Martin wrote:
>> On Tue, Jun 2, 2026 at 6:34 AM yangjiale <yangjiale133@163.com> wrote:
>>>
>>> When a descriptor list spans across cache lines,
>>> updating the flag first can lead to a scenario where the device side
>>> perceives the flag as valid, yet the corresponding address and length
>>> fields remain unupdated—resulting in invalid values.
>>> Therefore, the flag field must be updated last.
>>>
>>> Signed-off-by: yangjiale <yangjiale133@163.com>
>>> ---
>>>  drivers/virtio/virtio_ring.c | 8 ++++----
>>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>>> index fbca7ce1c6bf..036b4f90d30f 100644
>>> --- a/drivers/virtio/virtio_ring.c
>>> +++ b/drivers/virtio/virtio_ring.c
>>> @@ -1688,6 +1688,10 @@ static inline int virtqueue_add_packed(struct vring_virtqueue *vq,
>>>                                              &addr, &len, premapped, attr))
>>>                                 goto unmap_release;
>>>
>>> +                       desc[i].addr = cpu_to_le64(addr);
>>> +                       desc[i].len = cpu_to_le32(len);
>>> +                       desc[i].id = cpu_to_le16(id);
>>> +
>>>                         flags = cpu_to_le16(vq->packed.avail_used_flags |
>>>                                     (++c == total_sg ? 0 : VRING_DESC_F_NEXT) |
>>>                                     (n < out_sgs ? 0 : VRING_DESC_F_WRITE));
>>> @@ -1696,10 +1700,6 @@ static inline int virtqueue_add_packed(struct vring_virtqueue *vq,
>>>                         else
>>>                                 desc[i].flags = flags;
>>>
>>> -                       desc[i].addr = cpu_to_le64(addr);
>>> -                       desc[i].len = cpu_to_le32(len);
>>> -                       desc[i].id = cpu_to_le16(id);
>>> -
>>>                         if (unlikely(vq->use_map_api)) {
>>>                                 vq->packed.desc_extra[curr].addr = premapped ?
>>>                                         DMA_MAPPING_ERROR : addr;
>>
>> These flags are updated before the flags of the head descriptor at the
>> end of the function, at "vq->packed.vring.desc[head].flags =
>> head_flags", so the device should not see these. Because of that, the
>> relative order between the rest of the fields of the same descriptor
>> or other descriptors' fields, except for the head descriptor's flags,
>> should not matter. There is a write memory barrier just before
>> updating the head's flags.
>>
>> Also, I don't get why the cache line matters here. Can you expand? Am
>> I missing something?
> 
> Oh desc is just a local thing. ENOCOFFEE )
> 
> 

Whether or not this can happen in practice, I notice
virtqueue_add_packed_in_order() has the same write pattern.

Dongli Zhang

^ permalink raw reply

* Re: [patch V2 04/25] pps: Convert to ktime_get_snapshot_id()
From: Thomas Gleixner @ 2026-06-02  9:31 UTC (permalink / raw)
  To: Rodolfo Giometti, LKML
  Cc: David Woodhouse, Miroslav Lichvar, John Stultz, Stephen Boyd,
	Anna-Maria Behnsen, Frederic Weisbecker, thomas.weissschuh,
	Arthur Kiyanovski, Vincent Donnefort, Marc Zyngier, Oliver Upton,
	kvmarm, Oliver Upton, Richard Cochran, netdev, Takashi Iwai,
	Miri Korenblit, Johannes Berg, Jacob Keller, Tony Nguyen,
	Saeed Mahameed, Peter Hilber, Michael S. Tsirkin, virtualization,
	linux-wireless, linux-sound, David Woodhouse, Vadim Fedorenko
In-Reply-To: <912a22e8-878f-4763-958b-71052ee69b36@enneenne.com>

On Sat, May 30 2026 at 13:08, Rodolfo Giometti wrote:
> On 29/05/2026 22:00, Thomas Gleixner wrote:
>>   static inline void pps_get_ts(struct pps_event_time *ts)
>>   {
>> +#ifdef CONFIG_NTP_PPS
>>   	struct system_time_snapshot snap;
>>   
>> -	ktime_get_snapshot(&snap);
>> -	ts->ts_real = ktime_to_timespec64(snap.real);
>> -#ifdef CONFIG_NTP_PPS
>> -	ts->ts_raw = ktime_to_timespec64(snap.raw);
>> +	ktime_get_snapshot_id(CLOCK_REALTIME, &snap);
>> +	ts->ts_real = ktime_to_timespec64(snap.systime);
>> +	ts->ts_raw = ktime_to_timespec64(snap.monoraw);
>> +#else
>> +	ktime_get_real_ts64(&ts->ts_real);
>
> 	/*
> 	 * Prevent kernel stack information disclosure if the
> 	 * structure is later copied to userspace.
> 	 */
> 	ts->ts_raw = (struct timespec64){0, 0};

Which will fail to build because of:

struct pps_event_time {
#ifdef CONFIG_NTP_PPS
        struct timespec64 ts_raw;
#endif /* CONFIG_NTP_PPS */
        struct timespec64 ts_real;
};

No?

^ permalink raw reply

* Re: iommu: iterator used after loop end in resv region insertion?
From: Will Deacon @ 2026-06-02 10:28 UTC (permalink / raw)
  To: Maoyi Xie; +Cc: joro, robin.murphy, jpb, iommu, linux-kernel, virtualization
In-Reply-To: <CAHPEe=G-FZvEXjkE+vAXN5MHXCtsOaUoKwg2RQL6m=om+c20hQ@mail.gmail.com>

On Tue, May 19, 2026 at 02:49:58AM +0800, Maoyi Xie wrote:
> Hi all,
> 
> While reading drivers/iommu/ I noticed two places that look
> like a past the end iterator pattern. I would appreciate it
> if you could take a look and let me know whether these are
> real issues and whether they are worth fixing.
> 
> The first is iommu_insert_resv_region() in drivers/iommu/iommu.c
> (linux-7.1-rc1, around line 873):
> 
>     list_for_each_entry(iter, regions, list) {
>             if (nr->start < iter->start ||
>                 (nr->start == iter->start && nr->type <= iter->type))
>                     break;
>     }
>     list_add_tail(&nr->list, &iter->list);
> 
> The second is viommu_add_resv_mem() in drivers/iommu/virtio-iommu.c
> (linux-7.1-rc1, around line 523):
> 
>     list_for_each_entry(next, &vdev->resv_regions, list) {
>             if (next->start > region->start)
>                     break;
>     }
>     list_add_tail(&region->list, &next->list);
> 
> In both cases, when the loop walks all entries without break,
> the iterator has gone one step past the last entry. &iter->list
> then aliases the list head via container_of offset cancellation,
> so the insert lands at the list tail. That is the intended
> behaviour, but the access is undefined per C11.
> 
> Jakob Koschel cleaned up many such sites in 2022, for example
> commits 99d8ae4ec8a (tracing: Remove usage of list iterator
> variable after the loop), 2966a9918df (clockevents: Use dedicated
> list iterator variable) and dc1acd5c946 (dlm: replace usage of
> found with dedicated list iterator variable). These two sites
> in drivers/iommu/ were not covered.
> 
> A candidate fix would track an explicit insert_before pointer
> initialised to the list head and overwritten to &iter->list
> only when the loop breaks early. The observable behaviour is
> unchanged.
> 
> If this is intentional or already known, please disregard.
> Otherwise, I am happy to send a [PATCH] or to leave the fix to
> you. Thank you for your time, and sorry for the noise if this
> is not actually worth fixing or has already been spotted.

Given that there's some precedent for "fixing" these type of things,
it's probably best if you just send some patches for these new instances
if you don't mind.

Cheers,

Will

^ permalink raw reply

* Re: iommu: iterator used after loop end in resv region insertion?
From: Robin Murphy @ 2026-06-02 10:44 UTC (permalink / raw)
  To: Will Deacon, Maoyi Xie; +Cc: joro, jpb, iommu, linux-kernel, virtualization
In-Reply-To: <ah6wYJM4g-GTBTCL@willie-the-truck>

On 2026-06-02 11:28 am, Will Deacon wrote:
> On Tue, May 19, 2026 at 02:49:58AM +0800, Maoyi Xie wrote:
>> Hi all,
>>
>> While reading drivers/iommu/ I noticed two places that look
>> like a past the end iterator pattern. I would appreciate it
>> if you could take a look and let me know whether these are
>> real issues and whether they are worth fixing.
>>
>> The first is iommu_insert_resv_region() in drivers/iommu/iommu.c
>> (linux-7.1-rc1, around line 873):
>>
>>      list_for_each_entry(iter, regions, list) {
>>              if (nr->start < iter->start ||
>>                  (nr->start == iter->start && nr->type <= iter->type))
>>                      break;
>>      }
>>      list_add_tail(&nr->list, &iter->list);
>>
>> The second is viommu_add_resv_mem() in drivers/iommu/virtio-iommu.c
>> (linux-7.1-rc1, around line 523):
>>
>>      list_for_each_entry(next, &vdev->resv_regions, list) {
>>              if (next->start > region->start)
>>                      break;
>>      }
>>      list_add_tail(&region->list, &next->list);
>>
>> In both cases, when the loop walks all entries without break,
>> the iterator has gone one step past the last entry. &iter->list
>> then aliases the list head via container_of offset cancellation,
>> so the insert lands at the list tail. That is the intended
>> behaviour, but the access is undefined per C11.
>>
>> Jakob Koschel cleaned up many such sites in 2022, for example
>> commits 99d8ae4ec8a (tracing: Remove usage of list iterator
>> variable after the loop), 2966a9918df (clockevents: Use dedicated
>> list iterator variable) and dc1acd5c946 (dlm: replace usage of
>> found with dedicated list iterator variable). These two sites
>> in drivers/iommu/ were not covered.
>>
>> A candidate fix would track an explicit insert_before pointer
>> initialised to the list head and overwritten to &iter->list
>> only when the loop breaks early. The observable behaviour is
>> unchanged.
>>
>> If this is intentional or already known, please disregard.
>> Otherwise, I am happy to send a [PATCH] or to leave the fix to
>> you. Thank you for your time, and sorry for the noise if this
>> is not actually worth fixing or has already been spotted.
> 
> Given that there's some precedent for "fixing" these type of things,
> it's probably best if you just send some patches for these new instances
> if you don't mind.

Agreed, given that the intent in both cases is only to find a list_head 
to use as an insertion point, rather than operate on the entries per se, 
for clarity I'd suggest changing the whole iteration to list_for_each(), 
and keeping the list_entry() dereference strictly inside the loop body.

Thanks,
Robin.

^ permalink raw reply

* Re: [patch V2 04/25] pps: Convert to ktime_get_snapshot_id()
From: David Woodhouse @ 2026-06-02 10:56 UTC (permalink / raw)
  To: Thomas Gleixner, Rodolfo Giometti, LKML
  Cc: Miroslav Lichvar, John Stultz, Stephen Boyd, Anna-Maria Behnsen,
	Frederic Weisbecker, thomas.weissschuh, Arthur Kiyanovski,
	Vincent Donnefort, Marc Zyngier, Oliver Upton, kvmarm,
	Oliver Upton, Richard Cochran, netdev, Takashi Iwai,
	Miri Korenblit, Johannes Berg, Jacob Keller, Tony Nguyen,
	Saeed Mahameed, Peter Hilber, Michael S. Tsirkin, virtualization,
	linux-wireless, linux-sound, Vadim Fedorenko
In-Reply-To: <877boh8f0j.ffs@fw13>

[-- Attachment #1: Type: text/plain, Size: 1836 bytes --]

On Tue, 2026-06-02 at 11:31 +0200, Thomas Gleixner wrote:
> On Sat, May 30 2026 at 13:08, Rodolfo Giometti wrote:
> > On 29/05/2026 22:00, Thomas Gleixner wrote:
> > >    static inline void pps_get_ts(struct pps_event_time *ts)
> > >    {
> > > +#ifdef CONFIG_NTP_PPS
> > >    	struct system_time_snapshot snap;
> > >    
> > > -	ktime_get_snapshot(&snap);
> > > -	ts->ts_real = ktime_to_timespec64(snap.real);
> > > -#ifdef CONFIG_NTP_PPS
> > > -	ts->ts_raw = ktime_to_timespec64(snap.raw);
> > > +	ktime_get_snapshot_id(CLOCK_REALTIME, &snap);
> > > +	ts->ts_real = ktime_to_timespec64(snap.systime);
> > > +	ts->ts_raw = ktime_to_timespec64(snap.monoraw);
> > > +#else
> > > +	ktime_get_real_ts64(&ts->ts_real);
> > 
> >  	/*
> >  	 * Prevent kernel stack information disclosure if the
> >  	 * structure is later copied to userspace.
> >  	 */
> >  	ts->ts_raw = (struct timespec64){0, 0};
> 
> Which will fail to build because of:
> 
> struct pps_event_time {
> #ifdef CONFIG_NTP_PPS
>         struct timespec64 ts_raw;
> #endif /* CONFIG_NTP_PPS */
>         struct timespec64 ts_real;
> };
> 
> No?

I think we should kill struct pps_event_time and just use a
system_time_snapshot anyway.

And I think we can make CONFIG_NTP_PPS work for NOHZ too. If I
understand correctly, the main problem with NOHZ is that without a
tick, the reported CLOCK_REALTIME sawtooths much more around the time
that the core timekeeping *wants* to report. That is, tk->ntp_error
goes much higher and much lower, than when the precise 'mult' rate can
be re-adjusted every tick to keep it on course.

If we add an ntp_error field to the system_time_snapshot, the PPS code
could use that to correct the CLOCK_REALTIME value and the jitter that
NOHZ introduces wouldn't matter.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH 01/11] params: bound array element output to the caller's page buffer
From: Andy Shevchenko @ 2026-06-02 11:26 UTC (permalink / raw)
  To: Kees Cook
  Cc: Luis Chamberlain, Pengpeng Hou, stable, Petr Pavlu,
	Richard Weinberger, Anton Ivanov, Johannes Berg,
	Rafael J. Wysocki, Len Brown, Corey Minyard, Gabriel Somlo,
	Michael S. Tsirkin, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	Tvrtko Ursulin, David Airlie, Simona Vetter, Bart Van Assche,
	Jason Gunthorpe, Leon Romanovsky, Laurent Pinchart, Hans de Goede,
	Mauro Carvalho Chehab, Bjorn Helgaas, Hannes Reinecke,
	James E.J. Bottomley, Martin K. Petersen, Daniel Lezcano,
	Zhang Rui, Lukasz Luba, Greg Kroah-Hartman, Jiri Slaby,
	Alan Stern, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Jason Baron, Jim Cromie, Tiwei Bie, Benjamin Berg,
	Ilpo Järvinen, David E. Box, Maciej W. Rozycki,
	Srinivas Pandruvada, Peter Zijlstra, Heiko Carstens,
	Vasily Gorbik, Sean Christopherson, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vinod Koul, Frank Li, Daniel Gomez, Sami Tolvanen,
	Aaron Tomlin, Alexander Potapenko, Marco Elver, Dmitry Vyukov,
	Andrew Morton, John Johansen, Paul Moore, James Morris,
	Serge E. Hallyn, Georgia Garcia, kvm, dmaengine, linux-modules,
	kasan-dev, linux-mm, apparmor, linux-security-module, linux-um,
	linux-acpi, openipmi-developer, qemu-devel, intel-gfx, dri-devel,
	linux-rdma, linux-media, linux-pci, linux-scsi, linux-pm,
	linuxppc-dev, linux-serial, linux-usb, usb-storage,
	virtualization, linux-kernel, linux-arch, netdev, linux-fsdevel,
	linux-hardening
In-Reply-To: <20260521133326.2465264-1-kees@kernel.org>

On Thu, May 21, 2026 at 06:33:14AM -0700, Kees Cook wrote:
> 
> param_array_get() appends each element's string representation into the
> shared sysfs page buffer by passing buffer + off to the element getter.
> 
> That works for getters that only write a small bounded string, but
> param_get_charp() and similar helpers format against PAGE_SIZE from the
> pointer they receive. Once off is non-zero, an element getter can
> therefore write past the end of the original sysfs page buffer.
> 
> Collect each element into a temporary PAGE_SIZE buffer first and then
> copy only the remaining space into the caller's page buffer.

...

> +	elem_buf = kmalloc(PAGE_SIZE, GFP_KERNEL);

get_free_page() (or how it is called)?

> +	if (!elem_buf)
> +		return -ENOMEM;
> +
>  	for (i = off = 0; i < (arr->num ? *arr->num : arr->max); i++) {
> -		/* Replace \n with comma */
> -		if (i)
> -			buffer[off - 1] = ',';
>  		p.arg = arr->elem + arr->elemsize * i;
>  		check_kparam_locked(p.mod);
> -		ret = arr->ops->get(buffer + off, &p);
> +		ret = arr->ops->get(elem_buf, &p);
>  		if (ret < 0)
> -			return ret;
> +			goto out;
> +		ret = min(ret, (int)(PAGE_SIZE - 1 - off));

It's usually discouraged to use castings in min/max/clamp. Can we make ret long
or do something different here?

> +		if (!ret)
> +			break;

> +		/* Replace the previous element's trailing newline with a comma. */
> +		if (i)
> +			buffer[off - 1] = ',';

Can't we do this after with help of strreplace()?

> +		memcpy(buffer + off, elem_buf, ret);
>  		off += ret;
> +		if (off == PAGE_SIZE - 1)
> +			break;
>  	}
>  	buffer[off] = '\0';
> -	return off;
> +	ret = off;
> +out:
> +	kfree(elem_buf);
> +	return ret;

-- 
With Best Regards,
Andy Shevchenko



^ permalink raw reply

* Re: [PATCH 01/11] params: bound array element output to the caller's page buffer
From: Jason Gunthorpe @ 2026-06-02 12:33 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Kees Cook, Luis Chamberlain, Pengpeng Hou, stable, Petr Pavlu,
	Richard Weinberger, Anton Ivanov, Johannes Berg,
	Rafael J. Wysocki, Len Brown, Corey Minyard, Gabriel Somlo,
	Michael S. Tsirkin, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	Tvrtko Ursulin, David Airlie, Simona Vetter, Bart Van Assche,
	Leon Romanovsky, Laurent Pinchart, Hans de Goede,
	Mauro Carvalho Chehab, Bjorn Helgaas, Hannes Reinecke,
	James E.J. Bottomley, Martin K. Petersen, Daniel Lezcano,
	Zhang Rui, Lukasz Luba, Greg Kroah-Hartman, Jiri Slaby,
	Alan Stern, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Jason Baron, Jim Cromie, Tiwei Bie, Benjamin Berg,
	Ilpo Järvinen, David E. Box, Maciej W. Rozycki,
	Srinivas Pandruvada, Peter Zijlstra, Heiko Carstens,
	Vasily Gorbik, Sean Christopherson, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vinod Koul, Frank Li, Daniel Gomez, Sami Tolvanen,
	Aaron Tomlin, Alexander Potapenko, Marco Elver, Dmitry Vyukov,
	Andrew Morton, John Johansen, Paul Moore, James Morris,
	Serge E. Hallyn, Georgia Garcia, kvm, dmaengine, linux-modules,
	kasan-dev, linux-mm, apparmor, linux-security-module, linux-um,
	linux-acpi, openipmi-developer, qemu-devel, intel-gfx, dri-devel,
	linux-rdma, linux-media, linux-pci, linux-scsi, linux-pm,
	linuxppc-dev, linux-serial, linux-usb, usb-storage,
	virtualization, linux-kernel, linux-arch, netdev, linux-fsdevel,
	linux-hardening
In-Reply-To: <ah699hwLxIIOZ0-7@ashevche-desk.local>

On Tue, Jun 02, 2026 at 02:26:46PM +0300, Andy Shevchenko wrote:
> On Thu, May 21, 2026 at 06:33:14AM -0700, Kees Cook wrote:
> > 
> > param_array_get() appends each element's string representation into the
> > shared sysfs page buffer by passing buffer + off to the element getter.
> > 
> > That works for getters that only write a small bounded string, but
> > param_get_charp() and similar helpers format against PAGE_SIZE from the
> > pointer they receive. Once off is non-zero, an element getter can
> > therefore write past the end of the original sysfs page buffer.
> > 
> > Collect each element into a temporary PAGE_SIZE buffer first and then
> > copy only the remaining space into the caller's page buffer.
> 
> ...
> 
> > +	elem_buf = kmalloc(PAGE_SIZE, GFP_KERNEL);
> 
> get_free_page() (or how it is called)?

I thought modern mm guidance was to use kmalloc whenever possible and
not use get_free_page() unless you intend to use the struct page bits?

Jason

^ permalink raw reply

* Re: [PATCH 01/11] params: bound array element output to the caller's page buffer
From: David Laight @ 2026-06-02 13:04 UTC (permalink / raw)
  To: Andy Shevchenko
  Cc: Kees Cook, Luis Chamberlain, Pengpeng Hou, stable, Petr Pavlu,
	Richard Weinberger, Anton Ivanov, Johannes Berg,
	Rafael J. Wysocki, Len Brown, Corey Minyard, Gabriel Somlo,
	Michael S. Tsirkin, Jani Nikula, Joonas Lahtinen, Rodrigo Vivi,
	Tvrtko Ursulin, David Airlie, Simona Vetter, Bart Van Assche,
	Jason Gunthorpe, Leon Romanovsky, Laurent Pinchart, Hans de Goede,
	Mauro Carvalho Chehab, Bjorn Helgaas, Hannes Reinecke,
	James E.J. Bottomley, Martin K. Petersen, Daniel Lezcano,
	Zhang Rui, Lukasz Luba, Greg Kroah-Hartman, Jiri Slaby,
	Alan Stern, Jason Wang, Xuan Zhuo, Eugenio Pérez,
	Jason Baron, Jim Cromie, Tiwei Bie, Benjamin Berg,
	Ilpo Järvinen, David E. Box, Maciej W. Rozycki,
	Srinivas Pandruvada, Peter Zijlstra, Heiko Carstens,
	Vasily Gorbik, Sean Christopherson, Paolo Bonzini,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Vinod Koul, Frank Li, Daniel Gomez, Sami Tolvanen,
	Aaron Tomlin, Alexander Potapenko, Marco Elver, Dmitry Vyukov,
	Andrew Morton, John Johansen, Paul Moore, James Morris,
	Serge E. Hallyn, Georgia Garcia, kvm, dmaengine, linux-modules,
	kasan-dev, linux-mm, apparmor, linux-security-module, linux-um,
	linux-acpi, openipmi-developer, qemu-devel, intel-gfx, dri-devel,
	linux-rdma, linux-media, linux-pci, linux-scsi, linux-pm,
	linuxppc-dev, linux-serial, linux-usb, usb-storage,
	virtualization, linux-kernel, linux-arch, netdev, linux-fsdevel,
	linux-hardening
In-Reply-To: <ah699hwLxIIOZ0-7@ashevche-desk.local>

On Tue, 2 Jun 2026 14:26:46 +0300
Andy Shevchenko <andriy.shevchenko@linux.intel.com> wrote:

> On Thu, May 21, 2026 at 06:33:14AM -0700, Kees Cook wrote:
> > 
> > param_array_get() appends each element's string representation into the
> > shared sysfs page buffer by passing buffer + off to the element getter.
> > 
> > That works for getters that only write a small bounded string, but
> > param_get_charp() and similar helpers format against PAGE_SIZE from the
> > pointer they receive. Once off is non-zero, an element getter can
> > therefore write past the end of the original sysfs page buffer.
> > 
> > Collect each element into a temporary PAGE_SIZE buffer first and then
> > copy only the remaining space into the caller's page buffer.  
> 
> ...
> 
> > +	elem_buf = kmalloc(PAGE_SIZE, GFP_KERNEL);  
> 
> get_free_page() (or how it is called)?

The kmalloc() should be faster and I think has to be aligned.
There is another patch set to replace get_free_pages() with kmalloc().

Although all these 'show' functions should really head to using a safer
interface.
Although, at the moment, it is really difficult to find the ones that
are guaranteed to be passed a page aligned buffer.

-- David

> 
> > +	if (!elem_buf)
> > +		return -ENOMEM;
> > +
> >  	for (i = off = 0; i < (arr->num ? *arr->num : arr->max); i++) {
> > -		/* Replace \n with comma */
> > -		if (i)
> > -			buffer[off - 1] = ',';
> >  		p.arg = arr->elem + arr->elemsize * i;
> >  		check_kparam_locked(p.mod);
> > -		ret = arr->ops->get(buffer + off, &p);
> > +		ret = arr->ops->get(elem_buf, &p);
> >  		if (ret < 0)
> > -			return ret;
> > +			goto out;
> > +		ret = min(ret, (int)(PAGE_SIZE - 1 - off));  
> 
> It's usually discouraged to use castings in min/max/clamp. Can we make ret long
> or do something different here?
> 
> > +		if (!ret)
> > +			break;  
> 
> > +		/* Replace the previous element's trailing newline with a comma. */
> > +		if (i)
> > +			buffer[off - 1] = ',';  
> 
> Can't we do this after with help of strreplace()?
> 
> > +		memcpy(buffer + off, elem_buf, ret);
> >  		off += ret;
> > +		if (off == PAGE_SIZE - 1)
> > +			break;
> >  	}
> >  	buffer[off] = '\0';
> > -	return off;
> > +	ret = off;
> > +out:
> > +	kfree(elem_buf);
> > +	return ret;  
> 


^ permalink raw reply

* Re: [PATCH 2/7] drm/vblank: timer: Fix timestamp calculation
From: Michel Dänzer @ 2026-06-02 14:14 UTC (permalink / raw)
  To: Thomas Zimmermann, simona, louis.chauvet, ville.syrjala,
	jani.nikula, mhklkml, maarten.lankhorst, mripard, airlied
  Cc: dri-devel, amd-gfx, virtualization
In-Reply-To: <f7f53f9b-2576-4488-b0c1-35b6998d123c@suse.de>

On 6/1/26 19:30, Thomas Zimmermann wrote:
> Am 01.06.26 um 18:24 schrieb Michel Dänzer:
>> On 6/1/26 16:08, Thomas Zimmermann wrote:
>>> In drm_crtc_vblank_get_vblank_timeout(), return the timestamp of the
>>> first visible scanline after the last vblank timeout. This is what the
>>> caller expects.
>>>
>>> A vblank phase starts with a vblank timeout. At this point the display
>>> is blanked for several scanlines. Afterwards the display is unblanked
>>> until the next vblank timeout occurs. The display content is only visible
>>> during that second part.
>>>
>>> The current implementation of drm_crtc_vblank_get_vblank_timeout()
>>> returns the timestamp of the last vblank timeout that started the current
>>> vblank phase. But the display only unblanks after 20 to 30 percent of
>>> the overall frame duration. The returned timestamp is therefore too early.
>>>
>>> The next vblank timeout is already known when calculating the returned
>>> timestamp. Instead of subtracting the duration of a full frame from the
>>> value, only subtract the duration of the active, visible part. The result
>>> is the timestamp of the first visible scanline, as expected by the caller.
>>>
>>> This bug was not introduced by the generic vblank timer. It appears that
>>> the get_vblank_timeout logic has always been buggy since it was first
>>> added in commit 3a0709928b17 ("drm/vkms: Add vblank events simulated by
>>> hrtimers").
>>>
>>> Signed-off-by: Thomas Zimmermann <tzimmermann@suse.de>
>>> ---
>>>   drivers/gpu/drm/drm_vblank.c | 32 +++++++++++++++++++++++++-------
>>>   1 file changed, 25 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/drivers/gpu/drm/drm_vblank.c b/drivers/gpu/drm/drm_vblank.c
>>> index 96d70c3d4522..d52df247d04e 100644
>>> --- a/drivers/gpu/drm/drm_vblank.c
>>> +++ b/drivers/gpu/drm/drm_vblank.c
>>> [...]
>>> @@ -2312,17 +2321,26 @@ bool drm_crtc_vblank_get_vblank_timeout(struct drm_crtc *crtc, ktime_t *vblank_t
>>>           *vblank_time = READ_ONCE(vtimer->timer.node.expires);
>>>       } while (cur_count != drm_crtc_vblank_count_and_time(crtc, &cur_time));
>>>   -    if (drm_WARN_ON(crtc->dev, !ktime_compare(*vblank_time, cur_time)))
>>> +    if (drm_WARN_ON(dev, !ktime_compare(*vblank_time, cur_time)))
>>>           return false; /* Already expired */
>>>   +    framedur_ns = vblank->framedur_ns;
>>> +
>>>       /*
>>> -     * To prevent races we roll the hrtimer forward before we do any
>>> -     * interrupt processing - this is how real hw works (the interrupt
>>> -     * is only generated after all the vblank registers are updated)
>>> -     * and what the vblank core expects. Therefore we need to always
>>> -     * correct the timestamp by one frame.
>>> +     * To prevent races we rolled the hrtimer forward before we did any
>>> +     * timeout processing - this is how real hw works (the interrupt is
>>> +     * only generated after all the vblank registers are updated) and what
>>> +     * the vblank core expects.
>>> +     *
>>> +     * Therefore we always need to correct the timestamp. The returned
>>> +     * time should be the time of the first active scanline after the
>>> +     * previous vblank. Hence subtract the active phase's duration from
>>> +     * the next expiration time.
>>>        */
>>> -    *vblank_time = ktime_sub(*vblank_time, vtimer->interval);
>>> +    if (drm_WARN_ON(dev, !mode->crtc_vtotal))
>>> +        return false;
>>> +    activedur_ns = div_s64(framedur_ns * mode->crtc_vdisplay, mode->crtc_vtotal);
>>> +    *vblank_time = ktime_sub_ns(*vblank_time, activedur_ns);
>> Normally the timestamp returned by drm_crtc_vblank_count_and_time is supposed to correspond to the end of vertical blank / start of active, in which case the new code here looks wrong.
>>
>> Also, while the current time is inside an active area, it's supposed to return the timestamp corresponding to the start of the current active area, not the next one.
> 
> The initial value of *vblank_time is when the vblank timer fires _next_ and the display blanks. Subtracting the length of the active period should give the time of the first active scanline within the current vblank phase.
> 
> Isn't that exactly what you describe?

I don't think so.

It means that the timestamp returned by drm_(crtc_)vblank_count_and_time (which also used e.g. in events sent to user space) corresponds to the end of active / start of vblank, not to the end of vblank / start of active as it should (and does when the vblank timer isn't used).


-- 
Earthling Michel Dänzer       \        GNOME / Xwayland / Mesa developer
https://redhat.com             \               Libre software enthusiast

^ permalink raw reply

* Re: [patch V2 16/25] virtio_rtc: Use provided clock ID for history snapshot
From: Peter Hilber @ 2026-06-02 14:44 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, David Woodhouse, Miroslav Lichvar, John Stultz,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker,
	thomas.weissschuh, Arthur Kiyanovski, Rodolfo Giometti,
	Vincent Donnefort, Marc Zyngier, Oliver Upton, kvmarm,
	Oliver Upton, Richard Cochran, netdev, Takashi Iwai,
	Miri Korenblit, Johannes Berg, Jacob Keller, Tony Nguyen,
	Saeed Mahameed, Michael S. Tsirkin, virtualization,
	linux-wireless, linux-sound, David Woodhouse, Vadim Fedorenko
In-Reply-To: <20260529195557.744271454@kernel.org>

On Fri, May 29, 2026 at 10:00:52PM +0200, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@kernel.org>
> 
> The PTP core indicates in system_device_crosststamp::clock_id the clock ID
> for which the system time stamp should be taken. That allows to utilize
> hardware timestamps with e.g. AUX clocks.
> 
> Use ktime_get_snapshot_id() and hand the provided clock ID in.
> 
> No functional change.
> 
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Tested-by: Arthur Kiyanovski <akiyano@amazon.com>
> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

Acked-by: Peter Hilber <peter.hilber@oss.qualcomm.com>

> ---
>  drivers/virtio/virtio_rtc_ptp.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> --- a/drivers/virtio/virtio_rtc_ptp.c
> +++ b/drivers/virtio/virtio_rtc_ptp.c
> @@ -139,7 +139,7 @@ static int viortc_ptp_getcrosststamp(str
>  	if (ret)
>  		return ret;
>  
> -	ktime_get_snapshot(&history_begin);
> +	ktime_get_snapshot_id(xtstamp->clock_id, &history_begin);
>  	if (history_begin.cs_id != cs_id)
>  		return -EOPNOTSUPP;
>  
> 

^ permalink raw reply

* Re: [patch V2 21/25] ALSA: hda/common: Use system_device_crosststamp::sys_systime
From: Takashi Iwai @ 2026-06-02 17:46 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, David Woodhouse, Miroslav Lichvar, John Stultz,
	Stephen Boyd, Anna-Maria Behnsen, Frederic Weisbecker,
	thomas.weissschuh, Arthur Kiyanovski, Rodolfo Giometti,
	Vincent Donnefort, Marc Zyngier, Oliver Upton, kvmarm,
	Oliver Upton, Richard Cochran, netdev, Takashi Iwai,
	Miri Korenblit, Johannes Berg, Jacob Keller, Tony Nguyen,
	Saeed Mahameed, Peter Hilber, Michael S. Tsirkin, virtualization,
	linux-wireless, linux-sound, David Woodhouse, Vadim Fedorenko
In-Reply-To: <20260529195557.995298795@kernel.org>

On Fri, 29 May 2026 22:01:13 +0200,
Thomas Gleixner wrote:
> 
> From: Thomas Gleixner <tglx@kernel.org>
> 
> sys_systime is an alias for sys_realtime. The latter will be removed so
> switch the code over to the new naming scheme.
> 
> No functional change.
> 
> Signed-off-by: Thomas Gleixner <tglx@kernel.org>
> Tested-by: Arthur Kiyanovski <akiyano@amazon.com>
> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
> Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>

Acked-by: Takashi Iwai <tiwai@suse.de>


thanks,

Takashi

> ---
>  sound/hda/common/controller.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> --- a/sound/hda/common/controller.c
> +++ b/sound/hda/common/controller.c
> @@ -525,7 +525,7 @@ static int azx_get_time_info(struct snd_
>  			break;
>  
>  		default:
> -			*system_ts = ktime_to_timespec64(xtstamp.sys_realtime);
> +			*system_ts = ktime_to_timespec64(xtstamp.sys_systime);
>  			break;
>  
>  		}
> 
> 
> 

^ permalink raw reply

* Re: [PATCH] VIRTIO: Update the desc 'flag' fied last in packed ring.
From: Xuan Zhuo @ 2026-06-03  1:09 UTC (permalink / raw)
  To: yangjiale
  Cc: Jason Wang, Eugenio Pérez, virtualization, linux-kernel,
	yangjiale, Michael S . Tsirkin
In-Reply-To: <20260602043123.10207-1-yangjiale133@163.com>

On Tue,  2 Jun 2026 12:31:23 +0800, yangjiale <yangjiale133@163.com> wrote:
> When a descriptor list spans across cache lines,
> updating the flag first can lead to a scenario where the device side
> perceives the flag as valid, yet the corresponding address and length
> fields remain unupdated—resulting in invalid values.
> Therefore, the flag field must be updated last.

Are you raising this based on your code review, or did you actually encounter an
issue during testing?

I don't think there is a problem here, as all operations are protected by the
head desc flag. Even if cacheline effects cause data to be written prematurely,
it shouldn't be an issue. The device must wait until the head desc flags
are updated before it can start any subsequent work.

Thanks.

>
> Signed-off-by: yangjiale <yangjiale133@163.com>
> ---
>  drivers/virtio/virtio_ring.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index fbca7ce1c6bf..036b4f90d30f 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -1688,6 +1688,10 @@ static inline int virtqueue_add_packed(struct vring_virtqueue *vq,
>  					     &addr, &len, premapped, attr))
>  				goto unmap_release;
>
> +			desc[i].addr = cpu_to_le64(addr);
> +			desc[i].len = cpu_to_le32(len);
> +			desc[i].id = cpu_to_le16(id);
> +
>  			flags = cpu_to_le16(vq->packed.avail_used_flags |
>  				    (++c == total_sg ? 0 : VRING_DESC_F_NEXT) |
>  				    (n < out_sgs ? 0 : VRING_DESC_F_WRITE));
> @@ -1696,10 +1700,6 @@ static inline int virtqueue_add_packed(struct vring_virtqueue *vq,
>  			else
>  				desc[i].flags = flags;
>
> -			desc[i].addr = cpu_to_le64(addr);
> -			desc[i].len = cpu_to_le32(len);
> -			desc[i].id = cpu_to_le16(id);
> -
>  			if (unlikely(vq->use_map_api)) {
>  				vq->packed.desc_extra[curr].addr = premapped ?
>  					DMA_MAPPING_ERROR : addr;
> --
> 2.25.1
>

^ permalink raw reply

* Re: [PATCH v3] drm/virtio: abort virtqueue wait on device removal to avoid hung task
From: Dmitry Osipenko @ 2026-06-03  1:23 UTC (permalink / raw)
  To: Ryosuke Yasuoka, David Airlie, Gerd Hoffmann, Gurchetan Singh,
	Chia-I Wu, Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann,
	Simona Vetter
  Cc: dri-devel, virtualization, linux-kernel
In-Reply-To: <20260601-virtio-gpu_wait_event-v3-1-89530517a98a@redhat.com>

On 6/1/26 10:53, Ryosuke Yasuoka wrote:
> virtio_gpu_queue_ctrl_sgs() and virtio_gpu_queue_cursor() use
> wait_event() without any abort condition when waiting for virtqueue
> space. If the host device stops processing commands, these waits block
> indefinitely inside a drm_dev_enter/exit() critical section. Since
> drm_dev_unplug(), which is called in device removal and system shutdown
> call path, blocks on synchronize_srcu() until all critical sections
> complete, device removal and system shutdown also hang.
> 
> Add a vqs_released flag to virtio_gpu_device and include it in the
> wait_event() condition. Set the flag and wake up both queues in a new
> virtio_gpu_release_vqs() helper, called before drm_dev_unplug() in both
> virtio_gpu_remove() and virtio_gpu_shutdown(). When the flag is set, the
> wait returns immediately and the command is aborted, following the same
> cleanup path as drm_dev_enter() failure.
> 
> Signed-off-by: Ryosuke Yasuoka <ryasuoka@redhat.com>
> ---
> Changes in v3:
> - Remove Reported-by and Closes tag from commit msg because they are not
>   related to this fix.
> - Link to v2: https://lore.kernel.org/r/20260521-virtio-gpu_wait_event-v2-1-5796b3a71d03@redhat.com
> 
> Changes in v2:
> - Update the commit message.
> - Replace wait_event_timeout() with wait_event() using a compound
> condition that includes a new vqs_released flag.
> - Add virtio_gpu_release_vqs() helper to set the flag and wake up
> both queues, called before drm_dev_unplug() in remove and shutdown
> paths.
> - Remove the hardcoded 5-second timeout. Recovery is now driven by
> the driver flag instead of an arbitrary timeout value.
> ---
>  drivers/gpu/drm/virtio/virtgpu_drv.c | 15 +++++++++++++++
>  drivers/gpu/drm/virtio/virtgpu_drv.h |  1 +
>  drivers/gpu/drm/virtio/virtgpu_vq.c  | 23 +++++++++++++++++++++--
>  3 files changed, 37 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/gpu/drm/virtio/virtgpu_drv.c b/drivers/gpu/drm/virtio/virtgpu_drv.c
> index a5ce96fb8a1d..e4fe5e0780f9 100644
> --- a/drivers/gpu/drm/virtio/virtgpu_drv.c
> +++ b/drivers/gpu/drm/virtio/virtgpu_drv.c
> @@ -119,10 +119,24 @@ static int virtio_gpu_probe(struct virtio_device *vdev)
>  	return ret;
>  }
>  
> +/*
> + * Release pending virtqueue waits so the drm_dev_enter/exit() critical
> + * sections complete before drm_dev_unplug() blocks on synchronize_srcu().
> + */
> +static void virtio_gpu_release_vqs(struct drm_device *dev)
> +{
> +	struct virtio_gpu_device *vgdev = dev->dev_private;
> +
> +	vgdev->vqs_released = true;
> +	wake_up_all(&vgdev->ctrlq.ack_queue);
> +	wake_up_all(&vgdev->cursorq.ack_queue);
> +}
> +
>  static void virtio_gpu_remove(struct virtio_device *vdev)
>  {
>  	struct drm_device *dev = vdev->priv;
>  
> +	virtio_gpu_release_vqs(dev);
>  	drm_dev_unplug(dev);
>  	drm_atomic_helper_shutdown(dev);
>  	virtio_gpu_deinit(dev);
> @@ -133,6 +147,7 @@ static void virtio_gpu_shutdown(struct virtio_device *vdev)
>  {
>  	struct drm_device *dev = vdev->priv;
>  
> +	virtio_gpu_release_vqs(dev);
>  	/* stop talking to the device */
>  	drm_dev_unplug(dev);
>  }
> diff --git a/drivers/gpu/drm/virtio/virtgpu_drv.h b/drivers/gpu/drm/virtio/virtgpu_drv.h
> index 2f3531950aa4..5f7bb6cc6ba7 100644
> --- a/drivers/gpu/drm/virtio/virtgpu_drv.h
> +++ b/drivers/gpu/drm/virtio/virtgpu_drv.h
> @@ -235,6 +235,7 @@ struct virtio_gpu_device {
>  
>  	struct virtio_gpu_queue ctrlq;
>  	struct virtio_gpu_queue cursorq;
> +	bool vqs_released;
>  	struct kmem_cache *vbufs;
>  
>  	atomic_t pending_commands;
> diff --git a/drivers/gpu/drm/virtio/virtgpu_vq.c b/drivers/gpu/drm/virtio/virtgpu_vq.c
> index 67865810a2e7..8057a9b7356d 100644
> --- a/drivers/gpu/drm/virtio/virtgpu_vq.c
> +++ b/drivers/gpu/drm/virtio/virtgpu_vq.c
> @@ -396,7 +396,19 @@ static int virtio_gpu_queue_ctrl_sgs(struct virtio_gpu_device *vgdev,
>  	if (vq->num_free < elemcnt) {
>  		spin_unlock(&vgdev->ctrlq.qlock);
>  		virtio_gpu_notify(vgdev);
> -		wait_event(vgdev->ctrlq.ack_queue, vq->num_free >= elemcnt);
> +		wait_event(vgdev->ctrlq.ack_queue,
> +			   vq->num_free >= elemcnt || vgdev->vqs_released);
> +		/*
> +		 * Set by virtio_gpu_release_vqs() to unblock
> +		 * synchronize_srcu() wait in drm_dev_unplug().
> +		 */
> +		if (vgdev->vqs_released) {
> +			if (fence && vbuf->objs)
> +				virtio_gpu_array_unlock_resv(vbuf->objs);
> +			free_vbuf(vgdev, vbuf);
> +			drm_dev_exit(idx);
> +			return -ENODEV;
> +		}
>  		goto again;
>  	}
>  
> @@ -566,7 +578,14 @@ static void virtio_gpu_queue_cursor(struct virtio_gpu_device *vgdev,
>  	ret = virtqueue_add_sgs(vq, sgs, outcnt, 0, vbuf, GFP_ATOMIC);
>  	if (ret == -ENOSPC) {
>  		spin_unlock(&vgdev->cursorq.qlock);
> -		wait_event(vgdev->cursorq.ack_queue, vq->num_free >= outcnt);
> +		wait_event(vgdev->cursorq.ack_queue,
> +			   vq->num_free >= outcnt || vgdev->vqs_released);
> +		/* See comment in virtio_gpu_queue_ctrl_sgs(). */
> +		if (vgdev->vqs_released) {
> +			free_vbuf(vgdev, vbuf);
> +			drm_dev_exit(idx);
> +			return;
> +		}
>  		spin_lock(&vgdev->cursorq.qlock);
>  		goto retry;
>  	} else {
> 
> ---
> base-commit: e43ffb69e0438cddd72aaa30898b4dc446f664f8
> change-id: 20260518-virtio-gpu_wait_event-5aa060754f12
> 
> Best regards,

Applied to misc-next, thanks!

-- 
Best regards,
Dmitry

^ permalink raw reply

* Re:Re: [PATCH] VIRTIO: Update the desc 'flag' fied last in packed ring.
From: yangjiale133 @ 2026-06-03  1:58 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Michael S . Tsirkin, Jason Wang, Xuan Zhuo,
	virtualization@lists.linux.dev, linux-kernel@vger.kernel.org
In-Reply-To: <CAJaqyWfugEttrQuR5_LrbQaivAqCaixsprUOjEVtdiz_sexFpA@mail.gmail.com>

From the device's perspective, during a single read of the descriptor 
list-and that list spans across cache lines.
the retrieved data will show `desc[head].flags` as valid, 
and `desc[i].flags` as valid as well; however, 
the `desc[i].addr` and `len` fields may be invalid.

I suspect this occurs because a single DMA operation is internally 
split into multiple parallel read transactions, 
aligned to cache line boundaries.

I apologize that I currently lack the necessary environment to verify 
whether this modification definitively resolves the issue or 
merely reduces the probability of its occurrence; 
therefore, this patch can be discarded.

yangjiale



At 2026-06-02 14:04:13, "Eugenio Perez Martin" <eperezma@redhat.com> wrote:
>On Tue, Jun 2, 2026 at 6:34 AM yangjiale <yangjiale133@163.com> wrote:
>>
>> When a descriptor list spans across cache lines,
>> updating the flag first can lead to a scenario where the device side
>> perceives the flag as valid, yet the corresponding address and length
>> fields remain unupdated—resulting in invalid values.
>> Therefore, the flag field must be updated last.
>>
>> Signed-off-by: yangjiale <yangjiale133@163.com>
>> ---
>>  drivers/virtio/virtio_ring.c | 8 ++++----
>>  1 file changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
>> index fbca7ce1c6bf..036b4f90d30f 100644
>> --- a/drivers/virtio/virtio_ring.c
>> +++ b/drivers/virtio/virtio_ring.c
>> @@ -1688,6 +1688,10 @@ static inline int virtqueue_add_packed(struct vring_virtqueue *vq,
>>                                              &addr, &len, premapped, attr))
>>                                 goto unmap_release;
>>
>> +                       desc[i].addr = cpu_to_le64(addr);
>> +                       desc[i].len = cpu_to_le32(len);
>> +                       desc[i].id = cpu_to_le16(id);
>> +
>>                         flags = cpu_to_le16(vq->packed.avail_used_flags |
>>                                     (++c == total_sg ? 0 : VRING_DESC_F_NEXT) |
>>                                     (n < out_sgs ? 0 : VRING_DESC_F_WRITE));
>> @@ -1696,10 +1700,6 @@ static inline int virtqueue_add_packed(struct vring_virtqueue *vq,
>>                         else
>>                                 desc[i].flags = flags;
>>
>> -                       desc[i].addr = cpu_to_le64(addr);
>> -                       desc[i].len = cpu_to_le32(len);
>> -                       desc[i].id = cpu_to_le16(id);
>> -
>>                         if (unlikely(vq->use_map_api)) {
>>                                 vq->packed.desc_extra[curr].addr = premapped ?
>>                                         DMA_MAPPING_ERROR : addr;
>
>These flags are updated before the flags of the head descriptor at the
>end of the function, at "vq->packed.vring.desc[head].flags =
>head_flags", so the device should not see these. Because of that, the
>relative order between the rest of the fields of the same descriptor
>or other descriptors' fields, except for the head descriptor's flags,
>should not matter. There is a write memory barrier just before
>updating the head's flags.
>
>Also, I don't get why the cache line matters here. Can you expand? Am
>I missing something?

^ permalink raw reply

* Re:Re: [PATCH] VIRTIO: Update the desc 'flag' fied last in packed ring.
From: Xuan Zhuo @ 2026-06-03  2:08 UTC (permalink / raw)
  To: yangjiale133
  Cc: Michael S . Tsirkin, Jason Wang, virtualization@lists.linux.dev,
	linux-kernel@vger.kernel.org, Eugenio Perez Martin
In-Reply-To: <46a9dc36.1986.19e8b34896c.Coremail.yangjiale133@163.com>

On Wed, 3 Jun 2026 09:58:56 +0800 (CST), "yangjiale133" <yangjiale133@163.com> wrote:
> From the device's perspective, during a single read of the descriptor
> list-and that list spans across cache lines.
> the retrieved data will show `desc[head].flags` as valid,
> and `desc[i].flags` as valid as well; however,
> the `desc[i].addr` and `len` fields may be invalid.
>
> I suspect this occurs because a single DMA operation is internally
> split into multiple parallel read transactions,
> aligned to cache line boundaries.


I am aware of one case. When reading, the device reads multiple descriptors
at once. If the head descriptor's flag is 'avail', it then checks the data of
the subsequent descriptors from that same batch. However, this does not comply
with the specification. The device should only read the subsequent descriptors
after the head descriptor's flag is confirmed to be 'avail'. Although this might
impact performance, it is the correct behavior.

Thanks


>
> I apologize that I currently lack the necessary environment to verify
> whether this modification definitively resolves the issue or
> merely reduces the probability of its occurrence;
> therefore, this patch can be discarded.
>
> yangjiale
>
>
>
> At 2026-06-02 14:04:13, "Eugenio Perez Martin" <eperezma@redhat.com> wrote:
> >On Tue, Jun 2, 2026 at 6:34 AM yangjiale <yangjiale133@163.com> wrote:
> >>
> >> When a descriptor list spans across cache lines,
> >> updating the flag first can lead to a scenario where the device side
> >> perceives the flag as valid, yet the corresponding address and length
> >> fields remain unupdated—resulting in invalid values.
> >> Therefore, the flag field must be updated last.
> >>
> >> Signed-off-by: yangjiale <yangjiale133@163.com>
> >> ---
> >>  drivers/virtio/virtio_ring.c | 8 ++++----
> >>  1 file changed, 4 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> >> index fbca7ce1c6bf..036b4f90d30f 100644
> >> --- a/drivers/virtio/virtio_ring.c
> >> +++ b/drivers/virtio/virtio_ring.c
> >> @@ -1688,6 +1688,10 @@ static inline int virtqueue_add_packed(struct vring_virtqueue *vq,
> >>                                              &addr, &len, premapped, attr))
> >>                                 goto unmap_release;
> >>
> >> +                       desc[i].addr = cpu_to_le64(addr);
> >> +                       desc[i].len = cpu_to_le32(len);
> >> +                       desc[i].id = cpu_to_le16(id);
> >> +
> >>                         flags = cpu_to_le16(vq->packed.avail_used_flags |
> >>                                     (++c == total_sg ? 0 : VRING_DESC_F_NEXT) |
> >>                                     (n < out_sgs ? 0 : VRING_DESC_F_WRITE));
> >> @@ -1696,10 +1700,6 @@ static inline int virtqueue_add_packed(struct vring_virtqueue *vq,
> >>                         else
> >>                                 desc[i].flags = flags;
> >>
> >> -                       desc[i].addr = cpu_to_le64(addr);
> >> -                       desc[i].len = cpu_to_le32(len);
> >> -                       desc[i].id = cpu_to_le16(id);
> >> -
> >>                         if (unlikely(vq->use_map_api)) {
> >>                                 vq->packed.desc_extra[curr].addr = premapped ?
> >>                                         DMA_MAPPING_ERROR : addr;
> >
> >These flags are updated before the flags of the head descriptor at the
> >end of the function, at "vq->packed.vring.desc[head].flags =
> >head_flags", so the device should not see these. Because of that, the
> >relative order between the rest of the fields of the same descriptor
> >or other descriptors' fields, except for the head descriptor's flags,
> >should not matter. There is a write memory barrier just before
> >updating the head's flags.
> >
> >Also, I don't get why the cache line matters here. Can you expand? Am
> >I missing something?
>

^ permalink raw reply

* Re: Re: [PATCH] VIRTIO: Update the desc 'flag' fied last in packed ring.
From: Michael S. Tsirkin @ 2026-06-03  5:10 UTC (permalink / raw)
  To: yangjiale133
  Cc: Eugenio Perez Martin, Jason Wang, Xuan Zhuo, virtualization,
	linux-kernel
In-Reply-To: <5a3e06d5.103d.19e8b1863dd.Coremail.yangjiale133@163.com>

On Wed, Jun 03, 2026 at 09:28:11AM +0800, yangjiale133 wrote:
> From the device's perspective, during a single read of the descriptor list
> specifically when that list spans across cache lines.
> the retrieved data will show `desc[head].flags` as valid, 
> and `desc[i].flags` as valid as well; however, 
> the `desc[i].addr` and `len` fields may be invalid.
> 
> I apologize that I currently lack the necessary environment to verify 
> whether this modification definitively resolves the issue or 
> merely reduces the probability of its occurrence; 
> therefore, this patch can be discarded.
> 
> yangjiale


it could be that your device does a single read of the descriptor.
the pci spec is explicit that after getting valid flags,
device must read the descriptor again.


> 
> At 2026-06-02 14:04:13, "Eugenio Perez Martin" <eperezma@redhat.com> wrote:
> >On Tue, Jun 2, 2026 at 6:34 AM yangjiale <yangjiale133@163.com> wrote:
> >>
> >> When a descriptor list spans across cache lines,
> >> updating the flag first can lead to a scenario where the device side
> >> perceives the flag as valid, yet the corresponding address and length
> >> fields remain unupdated—resulting in invalid values.
> >> Therefore, the flag field must be updated last.
> >>
> >> Signed-off-by: yangjiale <yangjiale133@163.com>
> >> ---
> >>  drivers/virtio/virtio_ring.c | 8 ++++----
> >>  1 file changed, 4 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> >> index fbca7ce1c6bf..036b4f90d30f 100644
> >> --- a/drivers/virtio/virtio_ring.c
> >> +++ b/drivers/virtio/virtio_ring.c
> >> @@ -1688,6 +1688,10 @@ static inline int virtqueue_add_packed(struct vring_virtqueue *vq,
> >>                                              &addr, &len, premapped, attr))
> >>                                 goto unmap_release;
> >>
> >> +                       desc[i].addr = cpu_to_le64(addr);
> >> +                       desc[i].len = cpu_to_le32(len);
> >> +                       desc[i].id = cpu_to_le16(id);
> >> +
> >>                         flags = cpu_to_le16(vq->packed.avail_used_flags |
> >>                                     (++c == total_sg ? 0 : VRING_DESC_F_NEXT) |
> >>                                     (n < out_sgs ? 0 : VRING_DESC_F_WRITE));
> >> @@ -1696,10 +1700,6 @@ static inline int virtqueue_add_packed(struct vring_virtqueue *vq,
> >>                         else
> >>                                 desc[i].flags = flags;
> >>
> >> -                       desc[i].addr = cpu_to_le64(addr);
> >> -                       desc[i].len = cpu_to_le32(len);
> >> -                       desc[i].id = cpu_to_le16(id);
> >> -
> >>                         if (unlikely(vq->use_map_api)) {
> >>                                 vq->packed.desc_extra[curr].addr = premapped ?
> >>                                         DMA_MAPPING_ERROR : addr;
> >
> >These flags are updated before the flags of the head descriptor at the
> >end of the function, at "vq->packed.vring.desc[head].flags =
> >head_flags", so the device should not see these. Because of that, the
> >relative order between the rest of the fields of the same descriptor
> >or other descriptors' fields, except for the head descriptor's flags,
> >should not matter. There is a write memory barrier just before
> >updating the head's flags.
> >
> >Also, I don't get why the cache line matters here. Can you expand? Am
> >I missing something?
> 


^ permalink raw reply

* Re: [PATCH v4 07/47] x86/tdx: Force TSC frequency with CPUID-based info provided by the TDX-Module
From: Kiryl Shutsemau @ 2026-06-03 10:02 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	Dave Hansen, x86, K. Y. Srinivasan, Haiyang Zhang, Wei Liu,
	Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov, Jan Kiszka,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	John Stultz, H. Peter Anvin, Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, kvm, linux-kernel, linux-coco, linux-hyperv,
	virtualization, xen-devel, David Woodhouse, Tom Lendacky,
	Nikunj A Dadhania, David Woodhouse, Michael Kelley,
	Thomas Gleixner
In-Reply-To: <20260529144435.704127-8-seanjc@google.com>

On Fri, May 29, 2026 at 07:43:54AM -0700, Sean Christopherson wrote:
> When running as a TDX guest, explicitly set the TSC frequency to a known
> value, using CPUID-based information, instead of potentially relying on a
> hypervisor-controlled PV routine.  For TDX guests, CPUID.0x15 is always
> emulated by the TDX-Module, i.e. the information from CPUID is more
> trustworthy than the information provided by the hypervisor.

Right. EBX is configurable by TD_PARAMS.TSC_FREQUENCY at TD build. The
rest is fixed.

> To maintain backwards compatibility with TDX guest kernels that use native
> calibration, and because it's the least awful option, retain
> native_calibrate_tsc()'s stuffing of the local APIC bus period using the
> core crystal frequency.  While it's entirely possible for the hypervisor
> to emulate the APIC timer at a different frequency than the core crystal
> frequency, the commonly accepted interpretation of Intel's SDM is that APIC
> timer runs at the core crystal frequency when that latter is enumerated via
> CPUID:
> 
>   The APIC timer frequency will be the processor’s bus clock or core
>   crystal clock frequency (when TSC/core crystal clock ratio is enumerated
>   in CPUID leaf 0x15).
> 
> If the hypervisor is malicious and deliberately runs the APIC timer at the
> wrong frequency, nothing would stop the hypervisor from modifying the
> frequency at any time, i.e. attempting to manually calibrate the frequency
> out of paranoia would be futile.

Agreed.

> Deliberately leave CPU frequency calibration as is, since the TDX-Module
> doesn't provide any guarantees with respect to CPUID.0x16.

It is fixed to zeros. Sounds like a guarantee to me :P

> Signed-off-by: Sean Christopherson <seanjc@google.com>

Looks sane to me. Including your reasoning about tsc_early_khz= in reply
to Sashiko.

Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>

-- 
  Kiryl Shutsemau / Kirill A. Shutemov

^ permalink raw reply

* Re: [PATCH v6 5/7] locking: Add contended_release tracepoint to qspinlock
From: Peter Zijlstra @ 2026-06-03 12:08 UTC (permalink / raw)
  To: Dmitry Ilvokhin
  Cc: Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
	Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Broadcom internal kernel review list, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
	Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, linux-kernel, linux-mips,
	virtualization, linux-arch, linux-mm, linux-trace-kernel,
	kernel-team, Paul E. McKenney
In-Reply-To: <agXBb0ga_6HJrrnm@shell.ilvokhin.com>

On Thu, May 14, 2026 at 12:34:55PM +0000, Dmitry Ilvokhin wrote:

> Baseline, in best case scenario of least number of executed
> instructions.
> 
>     3e0:  endbr64                          ; 4 bytes (always executed)
>     3e4:  movb $0x0,(%rdi)                 ; 3 bytes (unlock,
>                                            ; always executed)
>     3e7:  decl %gs:__preempt_count         ; 7 bytes (always executed)
>     3ee:  je   3f5                         ; 2 bytes (always executed)
>     3f0:  jmp  __x86_return_thunk          ; 5 bytes (executed if above
>                                            ; je is not taken)
>                                            ; rest is not executed
>     3f5:  call __SCT__preempt_schedule     ; 5 bytes
>     3fa:  jmp  __x86_return_thunk          ; 5 bytes
> 
> Tracepoint (again same case of least number of executed instructions).
> 
>     bc0:  endbr64                          ; 4 bytes (always executed)
>     bc4:  xchg %ax,%ax                     ; 2 bytes (always executed, this is an
>                                            ; only addition on the execution path).
>     bc6:  movb $0x0,(%rdi)                 ; 3 bytes (unlock, always executed)
>     bc9:  decl %gs:__preempt_count         ; 7 bytes (always executed)
>     bd0:  je   bde                         ; 2 bytes (always executed)
>     bd2:  jmp  __x86_return_thunk          ; 5 bytes (executed if above
>                                            ; je is not taken)
>                                            ; rest is not executed
>     bd7:  call queued_spin_release_traced  ; 5 bytes
>     bdc:  jmp  bc9                         ; 2 bytes
>     bde:  call __SCT__preempt_schedule     ; 5 bytes
>     be3:  jmp  __x86_return_thunk          ; 5 bytes
> 

So I've been playing with this a bit, and it is all really sad.

Now, since pretty much everybody+dog will have PARAVIRT_SPINLOCK=y, the
'best' solution would be changing that paravirt call with a
static_call(), that actually shrinks the code by 1 byte.

And then this tracepoint nonsense can simply use a different unlock
function, just like paravirt.

0000 00000000000001d0 <_raw_spin_unlock>:
0000  1d0:	f3 0f 1e fa          	endbr64
0004  1d4:	ff 15 00 00 00 00    	call   *0x0(%rip)        # 1da <_raw_spin_unlock+0xa>	1d6: R_X86_64_PC32	pv_ops_lock+0x4
000a  1da:	65 ff 0d 00 00 00 00 	decl   %gs:0x0(%rip)        # 1e1 <_raw_spin_unlock+0x11>	1dd: R_X86_64_PC32	__preempt_count-0x4
0011  1e1:	74 06                	je     1e9 <_raw_spin_unlock+0x19>
0013  1e3:	2e e9 00 00 00 00    	cs jmp 1e9 <_raw_spin_unlock+0x19>	1e5: R_X86_64_PLT32	__x86_return_thunk-0x4
0019  1e9:	e8 00 00 00 00       	call   1ee <_raw_spin_unlock+0x1e>	1ea: R_X86_64_PLT32	__SCT__preempt_schedule-0x4
001e  1ee:	2e e9 00 00 00 00    	cs jmp 1f4 <_raw_spin_unlock+0x24>	1f0: R_X86_64_PLT32	__x86_return_thunk-0x4


0000 00000000000001d0 <_raw_spin_unlock>:
0000  1d0:	f3 0f 1e fa          	endbr64
0004  1d4:	e8 00 00 00 00       	call   1d9 <_raw_spin_unlock+0x9>	1d5: R_X86_64_PLT32	__SCT__queued_spin_unlock-0x4
0009  1d9:	65 ff 0d 00 00 00 00 	decl   %gs:0x0(%rip)        # 1e0 <_raw_spin_unlock+0x10>	1dc: R_X86_64_PC32	__preempt_count-0x4
0010  1e0:	74 06                	je     1e8 <_raw_spin_unlock+0x18>
0012  1e2:	2e e9 00 00 00 00    	cs jmp 1e8 <_raw_spin_unlock+0x18>	1e4: R_X86_64_PLT32	__x86_return_thunk-0x4
0018  1e8:	e8 00 00 00 00       	call   1ed <_raw_spin_unlock+0x1d>	1e9: R_X86_64_PLT32	__SCT__preempt_schedule-0x4
001d  1ed:	2e e9 00 00 00 00    	cs jmp 1f3 <_raw_spin_unlock+0x23>	1ef: R_X86_64_PLT32	__x86_return_thunk-0x4


Something a little like so, which is completely untested, except to
build kernel/locking/spinlock.o (with clang-23).

Also, I think someone should go do some performance runs with
ARCH_INLINE_SPIN_* set for x86 just like for s390.

Of course, this is only x86 done, and it doesn't nicely tie in with
tracepoints, but it does give the sanest asm.

---
diff --git a/arch/x86/hyperv/hv_spinlock.c b/arch/x86/hyperv/hv_spinlock.c
index 210b494e4de0..6b4bdea18218 100644
--- a/arch/x86/hyperv/hv_spinlock.c
+++ b/arch/x86/hyperv/hv_spinlock.c
@@ -78,8 +78,8 @@ void __init hv_init_spinlocks(void)
 	pr_info("PV spinlocks enabled\n");
 
 	__pv_init_lock_hash();
-	pv_ops_lock.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath;
-	pv_ops_lock.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock);
+	static_call_update(queued_spin_lock_slowpath, __pv_queued_spin_lock_slowpath);
+	static_call_update(queued_spin_unlock, __raw_callee_save___pv_queued_spin_unlock);
 	pv_ops_lock.wait = hv_qlock_wait;
 	pv_ops_lock.kick = hv_qlock_kick;
 	pv_ops_lock.vcpu_is_preempted = PV_CALLEE_SAVE(hv_vcpu_is_preempted);
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 1d506e5d6f46..52e7ce10df57 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -225,7 +225,6 @@
 #define X86_FEATURE_EPT_AD		( 8*32+17) /* "ept_ad" Intel Extended Page Table access-dirty bit */
 #define X86_FEATURE_VMCALL		( 8*32+18) /* Hypervisor supports the VMCALL instruction */
 #define X86_FEATURE_VMW_VMMCALL		( 8*32+19) /* VMware prefers VMMCALL hypercall instruction */
-#define X86_FEATURE_PVUNLOCK		( 8*32+20) /* PV unlock function */
 #define X86_FEATURE_VCPUPREEMPT		( 8*32+21) /* PV vcpu_is_preempted function */
 #define X86_FEATURE_TDX_GUEST		( 8*32+22) /* "tdx_guest" Intel Trust Domain Extensions Guest */
 
diff --git a/arch/x86/include/asm/paravirt-spinlock.h b/arch/x86/include/asm/paravirt-spinlock.h
index 7beffcb08ed6..553910f92906 100644
--- a/arch/x86/include/asm/paravirt-spinlock.h
+++ b/arch/x86/include/asm/paravirt-spinlock.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_PARAVIRT_SPINLOCK_H
 
 #include <asm/paravirt_types.h>
+#include <linux/static_call_types.h>
 
 #ifdef CONFIG_SMP
 #include <asm/spinlock_types.h>
@@ -11,9 +12,6 @@
 struct qspinlock;
 
 struct pv_lock_ops {
-	void (*queued_spin_lock_slowpath)(struct qspinlock *lock, u32 val);
-	struct paravirt_callee_save queued_spin_unlock;
-
 	void (*wait)(u8 *ptr, u8 val);
 	void (*kick)(int cpu);
 
@@ -26,20 +24,23 @@ extern struct pv_lock_ops pv_ops_lock;
 extern void native_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
 extern void __pv_init_lock_hash(void);
 extern void __pv_queued_spin_lock_slowpath(struct qspinlock *lock, u32 val);
+extern void __raw_callee_save___native_queued_spin_unlock(struct qspinlock *lock);
 extern void __raw_callee_save___pv_queued_spin_unlock(struct qspinlock *lock);
 extern bool nopvspin;
 
+DECLARE_STATIC_CALL(queued_spin_lock_slowpath, native_queued_spin_lock_slowpath);
+DECLARE_STATIC_CALL(queued_spin_unlock, __raw_callee_save___native_queued_spin_unlock);
+
 static __always_inline void pv_queued_spin_lock_slowpath(struct qspinlock *lock,
 							 u32 val)
 {
-	PVOP_VCALL2(pv_ops_lock, queued_spin_lock_slowpath, lock, val);
+	static_call(queued_spin_lock_slowpath)(lock, val);
 }
 
 static __always_inline void pv_queued_spin_unlock(struct qspinlock *lock)
 {
-	PVOP_ALT_VCALLEE1(pv_ops_lock, queued_spin_unlock, lock,
-			  "movb $0, (%%" _ASM_ARG1 ")",
-			  ALT_NOT(X86_FEATURE_PVUNLOCK));
+	__STATIC_CALL_MOD_ADDRESSABLE(queued_spin_unlock);
+	asm volatile ("call " STATIC_CALL_TRAMP_STR(queued_spin_unlock) : ASM_CALL_CONSTRAINT);
 }
 
 static __always_inline bool pv_vcpu_is_preempted(long cpu)
diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index 29226d112029..5908d9fd94bb 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -1135,9 +1135,8 @@ void __init kvm_spinlock_init(void)
 	pr_info("PV spinlocks enabled\n");
 
 	__pv_init_lock_hash();
-	pv_ops_lock.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath;
-	pv_ops_lock.queued_spin_unlock =
-		PV_CALLEE_SAVE(__pv_queued_spin_unlock);
+	static_call_update(queued_spin_lock_slowpath, __pv_queued_spin_lock_slowpath);
+	static_call_update(queued_spin_unlock, __raw_callee_save___pv_queued_spin_unlock);
 	pv_ops_lock.wait = kvm_wait;
 	pv_ops_lock.kick = kvm_kick_cpu;
 
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index 95452444868f..28407304f123 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -25,9 +25,12 @@ __visible void __native_queued_spin_unlock(struct qspinlock *lock)
 }
 PV_CALLEE_SAVE_REGS_THUNK(__native_queued_spin_unlock);
 
+DEFINE_STATIC_CALL(queued_spin_lock_slowpath, native_queued_spin_lock_slowpath);
+DEFINE_STATIC_CALL(queued_spin_unlock, __raw_callee_save___native_queued_spin_unlock);
+
 bool pv_is_native_spin_unlock(void)
 {
-	return pv_ops_lock.queued_spin_unlock.func ==
+	return static_call_query(queued_spin_unlock) ==
 		__raw_callee_save___native_queued_spin_unlock;
 }
 
@@ -45,16 +48,11 @@ bool pv_is_native_vcpu_is_preempted(void)
 
 void __init paravirt_set_cap(void)
 {
-	if (!pv_is_native_spin_unlock())
-		setup_force_cpu_cap(X86_FEATURE_PVUNLOCK);
-
 	if (!pv_is_native_vcpu_is_preempted())
 		setup_force_cpu_cap(X86_FEATURE_VCPUPREEMPT);
 }
 
 struct pv_lock_ops pv_ops_lock = {
-	.queued_spin_lock_slowpath	= native_queued_spin_lock_slowpath,
-	.queued_spin_unlock		= PV_CALLEE_SAVE(__native_queued_spin_unlock),
 	.wait				= paravirt_nop,
 	.kick				= paravirt_nop,
 	.vcpu_is_preempted		= PV_CALLEE_SAVE(__native_vcpu_is_preempted),
diff --git a/arch/x86/kernel/static_call.c b/arch/x86/kernel/static_call.c
index 61592e41a6b1..1268b7dc93b1 100644
--- a/arch/x86/kernel/static_call.c
+++ b/arch/x86/kernel/static_call.c
@@ -3,6 +3,7 @@
 #include <linux/memory.h>
 #include <linux/bug.h>
 #include <asm/text-patching.h>
+#include <asm/paravirt-spinlock.h>
 
 enum insn_type {
 	CALL = 0, /* site call */
@@ -31,6 +32,15 @@ static const u8 retinsn[] = { RET_INSN_OPCODE, 0xcc, 0xcc, 0xcc, 0xcc };
  */
 static const u8 warninsn[] = { 0x67, 0x48, 0x0f, 0xb9, 0x3a };
 
+/*
+ * ds ds movb $0, (_ASM_ARG1)
+ */
+#ifdef CONFIG_64BIT
+static const u8 unlockinsn[] = { 0x3e, 0x3e, 0xc6, 0x07, 0x00 };
+#else
+static const u8 unlockinsn[] = { 0x3e, 0x3e, 0xc6, 0x00, 0x00 };
+#endif
+
 static u8 __is_Jcc(u8 *insn) /* Jcc.d32 */
 {
 	u8 ret = 0;
@@ -78,6 +88,10 @@ static void __ref __static_call_transform(void *insn, enum insn_type type,
 			emulate = code;
 			code = &warninsn;
 		}
+		if (func == &__raw_callee_save___native_queued_spin_unlock) {
+			emulate = code;
+			code = &unlockinsn;
+		}
 		break;
 
 	case NOP:
diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 83ac24ead289..f718e535ea7c 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -134,9 +134,8 @@ void __init xen_init_spinlocks(void)
 	printk(KERN_DEBUG "xen: PV spinlocks enabled\n");
 
 	__pv_init_lock_hash();
-	pv_ops_lock.queued_spin_lock_slowpath = __pv_queued_spin_lock_slowpath;
-	pv_ops_lock.queued_spin_unlock =
-		PV_CALLEE_SAVE(__pv_queued_spin_unlock);
+	static_call_update(queued_spin_lock_slowpath, __pv_queued_spin_lock_slowpath);
+	static_call_update(queued_spin_unlock, __raw_callee_save___pv_queued_spin_unlock);
 	pv_ops_lock.wait = xen_qlock_wait;
 	pv_ops_lock.kick = xen_qlock_kick;
 	pv_ops_lock.vcpu_is_preempted = PV_CALLEE_SAVE(xen_vcpu_stolen);
diff --git a/tools/arch/x86/include/asm/cpufeatures.h b/tools/arch/x86/include/asm/cpufeatures.h
index 86d17b195e79..61541f042f74 100644
--- a/tools/arch/x86/include/asm/cpufeatures.h
+++ b/tools/arch/x86/include/asm/cpufeatures.h
@@ -225,7 +225,6 @@
 #define X86_FEATURE_EPT_AD		( 8*32+17) /* "ept_ad" Intel Extended Page Table access-dirty bit */
 #define X86_FEATURE_VMCALL		( 8*32+18) /* Hypervisor supports the VMCALL instruction */
 #define X86_FEATURE_VMW_VMMCALL		( 8*32+19) /* VMware prefers VMMCALL hypercall instruction */
-#define X86_FEATURE_PVUNLOCK		( 8*32+20) /* PV unlock function */
 #define X86_FEATURE_VCPUPREEMPT		( 8*32+21) /* PV vcpu_is_preempted function */
 #define X86_FEATURE_TDX_GUEST		( 8*32+22) /* "tdx_guest" Intel Trust Domain Extensions Guest */
 

^ permalink raw reply related

* Re: [PATCH v6 5/7] locking: Add contended_release tracepoint to qspinlock
From: Dmitry Ilvokhin @ 2026-06-03 14:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, Will Deacon, Boqun Feng, Waiman Long,
	Thomas Bogendoerfer, Juergen Gross, Ajay Kaher, Alexey Makhalov,
	Broadcom internal kernel review list, Thomas Gleixner,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Arnd Bergmann,
	Dennis Zhou, Tejun Heo, Christoph Lameter, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, linux-kernel, linux-mips,
	virtualization, linux-arch, linux-mm, linux-trace-kernel,
	kernel-team, Paul E. McKenney
In-Reply-To: <20260603120811.GW3493090@noisy.programming.kicks-ass.net>

On Wed, Jun 03, 2026 at 02:08:11PM +0200, Peter Zijlstra wrote:
> On Thu, May 14, 2026 at 12:34:55PM +0000, Dmitry Ilvokhin wrote:
> 
> > Baseline, in best case scenario of least number of executed
> > instructions.
> > 
> >     3e0:  endbr64                          ; 4 bytes (always executed)
> >     3e4:  movb $0x0,(%rdi)                 ; 3 bytes (unlock,
> >                                            ; always executed)
> >     3e7:  decl %gs:__preempt_count         ; 7 bytes (always executed)
> >     3ee:  je   3f5                         ; 2 bytes (always executed)
> >     3f0:  jmp  __x86_return_thunk          ; 5 bytes (executed if above
> >                                            ; je is not taken)
> >                                            ; rest is not executed
> >     3f5:  call __SCT__preempt_schedule     ; 5 bytes
> >     3fa:  jmp  __x86_return_thunk          ; 5 bytes
> > 
> > Tracepoint (again same case of least number of executed instructions).
> > 
> >     bc0:  endbr64                          ; 4 bytes (always executed)
> >     bc4:  xchg %ax,%ax                     ; 2 bytes (always executed, this is an
> >                                            ; only addition on the execution path).
> >     bc6:  movb $0x0,(%rdi)                 ; 3 bytes (unlock, always executed)
> >     bc9:  decl %gs:__preempt_count         ; 7 bytes (always executed)
> >     bd0:  je   bde                         ; 2 bytes (always executed)
> >     bd2:  jmp  __x86_return_thunk          ; 5 bytes (executed if above
> >                                            ; je is not taken)
> >                                            ; rest is not executed
> >     bd7:  call queued_spin_release_traced  ; 5 bytes
> >     bdc:  jmp  bc9                         ; 2 bytes
> >     bde:  call __SCT__preempt_schedule     ; 5 bytes
> >     be3:  jmp  __x86_return_thunk          ; 5 bytes
> > 
> 
> So I've been playing with this a bit, and it is all really sad.
> 
> Now, since pretty much everybody+dog will have PARAVIRT_SPINLOCK=y, the
> 'best' solution would be changing that paravirt call with a
> static_call(), that actually shrinks the code by 1 byte.
> 
> And then this tracepoint nonsense can simply use a different unlock
> function, just like paravirt.
> 
> 0000 00000000000001d0 <_raw_spin_unlock>:
> 0000  1d0:	f3 0f 1e fa          	endbr64
> 0004  1d4:	ff 15 00 00 00 00    	call   *0x0(%rip)        # 1da <_raw_spin_unlock+0xa>	1d6: R_X86_64_PC32	pv_ops_lock+0x4
> 000a  1da:	65 ff 0d 00 00 00 00 	decl   %gs:0x0(%rip)        # 1e1 <_raw_spin_unlock+0x11>	1dd: R_X86_64_PC32	__preempt_count-0x4
> 0011  1e1:	74 06                	je     1e9 <_raw_spin_unlock+0x19>
> 0013  1e3:	2e e9 00 00 00 00    	cs jmp 1e9 <_raw_spin_unlock+0x19>	1e5: R_X86_64_PLT32	__x86_return_thunk-0x4
> 0019  1e9:	e8 00 00 00 00       	call   1ee <_raw_spin_unlock+0x1e>	1ea: R_X86_64_PLT32	__SCT__preempt_schedule-0x4
> 001e  1ee:	2e e9 00 00 00 00    	cs jmp 1f4 <_raw_spin_unlock+0x24>	1f0: R_X86_64_PLT32	__x86_return_thunk-0x4
> 
> 
> 0000 00000000000001d0 <_raw_spin_unlock>:
> 0000  1d0:	f3 0f 1e fa          	endbr64
> 0004  1d4:	e8 00 00 00 00       	call   1d9 <_raw_spin_unlock+0x9>	1d5: R_X86_64_PLT32	__SCT__queued_spin_unlock-0x4
> 0009  1d9:	65 ff 0d 00 00 00 00 	decl   %gs:0x0(%rip)        # 1e0 <_raw_spin_unlock+0x10>	1dc: R_X86_64_PC32	__preempt_count-0x4
> 0010  1e0:	74 06                	je     1e8 <_raw_spin_unlock+0x18>
> 0012  1e2:	2e e9 00 00 00 00    	cs jmp 1e8 <_raw_spin_unlock+0x18>	1e4: R_X86_64_PLT32	__x86_return_thunk-0x4
> 0018  1e8:	e8 00 00 00 00       	call   1ed <_raw_spin_unlock+0x1d>	1e9: R_X86_64_PLT32	__SCT__preempt_schedule-0x4
> 001d  1ed:	2e e9 00 00 00 00    	cs jmp 1f3 <_raw_spin_unlock+0x23>	1ef: R_X86_64_PLT32	__x86_return_thunk-0x4
> 
> 
> Something a little like so, which is completely untested, except to
> build kernel/locking/spinlock.o (with clang-23).

Thanks a lot for taking a look, Peter.

I like the static_call idea. It's truly zero cost on x86 (and, as you
note, even a byte smaller). The one caveat is that it relies on
HAVE_STATIC_CALL_INLINE to stay free. 

So my plan would be: static_call where HAVE_STATIC_CALL_INLINE is
available (x86), and a static branch fallback elsewhere, gated behind a
default-off config so it imposes nothing on arches/kernels that don't
opt in. I'm mostly interested in x86, but would like arm64 to work too,
which would use the fallback.

Concretely:

1. Split the sleepable-lock patches out and send them separately.
   They're independent of the static call work and look far less
   controversial.

2. Convert the paravirt spinlock unlock to a static_call, as the
   foundation for the unlock tracepoint. I'm happy to take a stab at it.
   Let me know if you'd rather do it yourself.

3. Build the unlock tracepoint on top: static_call where it's cheap,
   config-gated static_branch fallback where it isn't.

Does this plan sound reasonable to you?

> 
> Also, I think someone should go do some performance runs with
> ARCH_INLINE_SPIN_* set for x86 just like for s390.

That's a good point, I'll run benchmarks and report back with the
results.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox