qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] virtio: avoid cost of -ftrivial-auto-var-init in hot path
@ 2025-06-04 19:18 Stefan Hajnoczi
  2025-06-05  8:34 ` Daniel P. Berrangé
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Stefan Hajnoczi @ 2025-06-04 19:18 UTC (permalink / raw)
  To: qemu-devel
  Cc: qemu-block, Michael S. Tsirkin, Stefan Hajnoczi,
	Daniel P. Berrangé

Since commit 7ff9ff039380 ("meson: mitigate against use of uninitialize
stack for exploits") the -ftrivial-auto-var-init=zero compiler option is
used to zero local variables. While this reduces security risks
associated with uninitialized stack data, it introduced a measurable
bottleneck in the virtqueue_split_pop() and virtqueue_packed_pop()
functions.

These virtqueue functions are in the hot path. They are called for each
element (request) that is popped from a VIRTIO device's virtqueue. Using
__attribute__((uninitialized)) on large stack variables in these
functions improves fio randread bs=4k iodepth=64 performance from 304k
to 332k IOPS (+9%).

This issue was found using perf-top(1). virtqueue_split_pop() was one of
the top CPU consumers and the "annotate" feature showed that the memory
zeroing instructions at the beginning of the functions were hot.

Fixes: 7ff9ff039380 ("meson: mitigate against use of uninitialize stack for exploits")
Cc: Daniel P. Berrangé <berrange@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/qemu/compiler.h | 12 ++++++++++++
 hw/virtio/virtio.c      |  8 ++++----
 2 files changed, 16 insertions(+), 4 deletions(-)

diff --git a/include/qemu/compiler.h b/include/qemu/compiler.h
index 496dac5ac1..fabd540b02 100644
--- a/include/qemu/compiler.h
+++ b/include/qemu/compiler.h
@@ -207,6 +207,18 @@
 # define QEMU_USED
 #endif
 
+/*
+ * Disable -ftrivial-auto-var-init on a local variable. Use this in rare cases
+ * when the compiler zeroes a large on-stack variable and this causes a
+ * performance bottleneck. Only use it when performance data indicates this is
+ * necessary since security risks increase with uninitialized stack variables.
+ */
+#if __has_attribute(uninitialized)
+# define QEMU_UNINITIALIZED __attribute__((uninitialized))
+#else
+# define QEMU_UNINITIALIZED
+#endif
+
 /*
  * http://clang.llvm.org/docs/ThreadSafetyAnalysis.html
  *
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 5534251e01..82a285a31d 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -1689,8 +1689,8 @@ static void *virtqueue_split_pop(VirtQueue *vq, size_t sz)
     VirtIODevice *vdev = vq->vdev;
     VirtQueueElement *elem = NULL;
     unsigned out_num, in_num, elem_entries;
-    hwaddr addr[VIRTQUEUE_MAX_SIZE];
-    struct iovec iov[VIRTQUEUE_MAX_SIZE];
+    hwaddr QEMU_UNINITIALIZED addr[VIRTQUEUE_MAX_SIZE];
+    struct iovec QEMU_UNINITIALIZED iov[VIRTQUEUE_MAX_SIZE];
     VRingDesc desc;
     int rc;
 
@@ -1836,8 +1836,8 @@ static void *virtqueue_packed_pop(VirtQueue *vq, size_t sz)
     VirtIODevice *vdev = vq->vdev;
     VirtQueueElement *elem = NULL;
     unsigned out_num, in_num, elem_entries;
-    hwaddr addr[VIRTQUEUE_MAX_SIZE];
-    struct iovec iov[VIRTQUEUE_MAX_SIZE];
+    hwaddr QEMU_UNINITIALIZED addr[VIRTQUEUE_MAX_SIZE];
+    struct iovec QEMU_UNINITIALIZED iov[VIRTQUEUE_MAX_SIZE];
     VRingPackedDesc desc;
     uint16_t id;
     int rc;
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] virtio: avoid cost of -ftrivial-auto-var-init in hot path
  2025-06-04 19:18 [PATCH] virtio: avoid cost of -ftrivial-auto-var-init in hot path Stefan Hajnoczi
@ 2025-06-05  8:34 ` Daniel P. Berrangé
  2025-06-05 11:28   ` Philippe Mathieu-Daudé
  2025-06-05 12:44   ` Stefan Hajnoczi
  2025-06-05 18:54 ` Daniel P. Berrangé
  2025-06-10 16:41 ` Michael S. Tsirkin
  2 siblings, 2 replies; 11+ messages in thread
From: Daniel P. Berrangé @ 2025-06-05  8:34 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel, qemu-block, Michael S. Tsirkin

On Wed, Jun 04, 2025 at 03:18:43PM -0400, Stefan Hajnoczi wrote:
> Since commit 7ff9ff039380 ("meson: mitigate against use of uninitialize
> stack for exploits") the -ftrivial-auto-var-init=zero compiler option is
> used to zero local variables. While this reduces security risks
> associated with uninitialized stack data, it introduced a measurable
> bottleneck in the virtqueue_split_pop() and virtqueue_packed_pop()
> functions.
> 
> These virtqueue functions are in the hot path. They are called for each
> element (request) that is popped from a VIRTIO device's virtqueue. Using
> __attribute__((uninitialized)) on large stack variables in these
> functions improves fio randread bs=4k iodepth=64 performance from 304k
> to 332k IOPS (+9%).

IIUC, the 'hwaddr addr' variable is 8k in size, and the 'struct iovec iov'
array is 16k in size, so we have 24k on the stack that we're clearing and
then later writing the real value. Makes sense that this would have a
perf impact in a hotpath.

> This issue was found using perf-top(1). virtqueue_split_pop() was one of
> the top CPU consumers and the "annotate" feature showed that the memory
> zeroing instructions at the beginning of the functions were hot.

When you say you found it with 'perf-top' was that just discovered by
accident, or was this usage of perf-top in response to users reporting
a performance degradation vs earlier QEMU ?

> 
> Fixes: 7ff9ff039380 ("meson: mitigate against use of uninitialize stack for exploits")
> Cc: Daniel P. Berrangé <berrange@redhat.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  include/qemu/compiler.h | 12 ++++++++++++
>  hw/virtio/virtio.c      |  8 ++++----
>  2 files changed, 16 insertions(+), 4 deletions(-)

Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>


> 
> diff --git a/include/qemu/compiler.h b/include/qemu/compiler.h
> index 496dac5ac1..fabd540b02 100644
> --- a/include/qemu/compiler.h
> +++ b/include/qemu/compiler.h
> @@ -207,6 +207,18 @@
>  # define QEMU_USED
>  #endif
>  
> +/*
> + * Disable -ftrivial-auto-var-init on a local variable. Use this in rare cases
> + * when the compiler zeroes a large on-stack variable and this causes a
> + * performance bottleneck. Only use it when performance data indicates this is
> + * necessary since security risks increase with uninitialized stack variables.
> + */
> +#if __has_attribute(uninitialized)
> +# define QEMU_UNINITIALIZED __attribute__((uninitialized))
> +#else
> +# define QEMU_UNINITIALIZED
> +#endif

For the benefit of other reviewers, this attribute is specifically
intended for this very purpose:

[quote "info gcc"]
  ‘uninitialized’
     This attribute, attached to a variable with automatic storage,
     means that the variable should not be automatically initialized by
     the compiler when the option ‘-ftrivial-auto-var-init’ presents.

     With the option ‘-ftrivial-auto-var-init’, all the automatic
     variables that do not have explicit initializers will be
     initialized by the compiler.  These additional compiler
     initializations might incur run-time overhead, sometimes
     dramatically.  This attribute can be used to mark some variables to
     be excluded from such automatic initialization in order to reduce
     runtime overhead.

     This attribute has no effect when the option
     ‘-ftrivial-auto-var-init’ is not present.
[/quote]

> +
>  /*
>   * http://clang.llvm.org/docs/ThreadSafetyAnalysis.html
>   *
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index 5534251e01..82a285a31d 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -1689,8 +1689,8 @@ static void *virtqueue_split_pop(VirtQueue *vq, size_t sz)
>      VirtIODevice *vdev = vq->vdev;
>      VirtQueueElement *elem = NULL;
>      unsigned out_num, in_num, elem_entries;
> -    hwaddr addr[VIRTQUEUE_MAX_SIZE];
> -    struct iovec iov[VIRTQUEUE_MAX_SIZE];
> +    hwaddr QEMU_UNINITIALIZED addr[VIRTQUEUE_MAX_SIZE];
> +    struct iovec QEMU_UNINITIALIZED iov[VIRTQUEUE_MAX_SIZE];
>      VRingDesc desc;
>      int rc;
>  
> @@ -1836,8 +1836,8 @@ static void *virtqueue_packed_pop(VirtQueue *vq, size_t sz)
>      VirtIODevice *vdev = vq->vdev;
>      VirtQueueElement *elem = NULL;
>      unsigned out_num, in_num, elem_entries;
> -    hwaddr addr[VIRTQUEUE_MAX_SIZE];
> -    struct iovec iov[VIRTQUEUE_MAX_SIZE];
> +    hwaddr QEMU_UNINITIALIZED addr[VIRTQUEUE_MAX_SIZE];
> +    struct iovec QEMU_UNINITIALIZED iov[VIRTQUEUE_MAX_SIZE];
>      VRingPackedDesc desc;
>      uint16_t id;
>      int rc;
> -- 
> 2.49.0
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] virtio: avoid cost of -ftrivial-auto-var-init in hot path
  2025-06-05  8:34 ` Daniel P. Berrangé
@ 2025-06-05 11:28   ` Philippe Mathieu-Daudé
  2025-06-05 12:50     ` Stefan Hajnoczi
  2025-06-05 12:44   ` Stefan Hajnoczi
  1 sibling, 1 reply; 11+ messages in thread
From: Philippe Mathieu-Daudé @ 2025-06-05 11:28 UTC (permalink / raw)
  To: Daniel P. Berrangé, Stefan Hajnoczi
  Cc: qemu-devel, qemu-block, Michael S. Tsirkin

On 5/6/25 10:34, Daniel P. Berrangé wrote:
> On Wed, Jun 04, 2025 at 03:18:43PM -0400, Stefan Hajnoczi wrote:
>> Since commit 7ff9ff039380 ("meson: mitigate against use of uninitialize
>> stack for exploits") the -ftrivial-auto-var-init=zero compiler option is
>> used to zero local variables. While this reduces security risks
>> associated with uninitialized stack data, it introduced a measurable
>> bottleneck in the virtqueue_split_pop() and virtqueue_packed_pop()
>> functions.
>>
>> These virtqueue functions are in the hot path. They are called for each
>> element (request) that is popped from a VIRTIO device's virtqueue. Using
>> __attribute__((uninitialized)) on large stack variables in these
>> functions improves fio randread bs=4k iodepth=64 performance from 304k
>> to 332k IOPS (+9%).
> 
> IIUC, the 'hwaddr addr' variable is 8k in size, and the 'struct iovec iov'
> array is 16k in size, so we have 24k on the stack that we're clearing and
> then later writing the real value. Makes sense that this would have a
> perf impact in a hotpath.
> 
>> This issue was found using perf-top(1). virtqueue_split_pop() was one of
>> the top CPU consumers and the "annotate" feature showed that the memory
>> zeroing instructions at the beginning of the functions were hot.
> 
> When you say you found it with 'perf-top' was that just discovered by
> accident, or was this usage of perf-top in response to users reporting
> a performance degradation vs earlier QEMU ?

Would it make sense to move these to VirtQueue (since the structure
definition is local anyway)?

-- >8 --
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 85110bce374..b96c6ec603c 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -153,6 +153,12 @@ struct VirtQueue
      EventNotifier host_notifier;
      bool host_notifier_enabled;
      QLIST_ENTRY(VirtQueue) node;
+
+    /* Only used by virtqueue_pop() */
+    struct {
+        hwaddr addr[VIRTQUEUE_MAX_SIZE];
+        struct iovec iov[VIRTQUEUE_MAX_SIZE];
+    } pop;
  };

  const char *virtio_device_names[] = {
@@ -1680,8 +1686,8 @@ static void *virtqueue_split_pop(VirtQueue *vq, 
size_t sz)
      VirtIODevice *vdev = vq->vdev;
      VirtQueueElement *elem = NULL;
      unsigned out_num, in_num, elem_entries;
-    hwaddr addr[VIRTQUEUE_MAX_SIZE];
-    struct iovec iov[VIRTQUEUE_MAX_SIZE];
+    hwaddr *addr = vq->pop.addr;
+    struct iovec *iov = vq->pop.iov;
      VRingDesc desc;
      int rc;

@@ -1826,8 +1832,8 @@ static void *virtqueue_packed_pop(VirtQueue *vq, 
size_t sz)
      VirtIODevice *vdev = vq->vdev;
      VirtQueueElement *elem = NULL;
      unsigned out_num, in_num, elem_entries;
-    hwaddr addr[VIRTQUEUE_MAX_SIZE];
-    struct iovec iov[VIRTQUEUE_MAX_SIZE];
+    hwaddr *addr = vq->pop.addr;
+    struct iovec *iov = vq->pop.iov;
      VRingPackedDesc desc;
      uint16_t id;
      int rc;
---

> 
>>
>> Fixes: 7ff9ff039380 ("meson: mitigate against use of uninitialize stack for exploits")
>> Cc: Daniel P. Berrangé <berrange@redhat.com>
>> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
>> ---
>>   include/qemu/compiler.h | 12 ++++++++++++
>>   hw/virtio/virtio.c      |  8 ++++----
>>   2 files changed, 16 insertions(+), 4 deletions(-)



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] virtio: avoid cost of -ftrivial-auto-var-init in hot path
  2025-06-05  8:34 ` Daniel P. Berrangé
  2025-06-05 11:28   ` Philippe Mathieu-Daudé
@ 2025-06-05 12:44   ` Stefan Hajnoczi
  1 sibling, 0 replies; 11+ messages in thread
From: Stefan Hajnoczi @ 2025-06-05 12:44 UTC (permalink / raw)
  To: Daniel P. Berrangé; +Cc: qemu-devel, qemu-block, Michael S. Tsirkin

[-- Attachment #1: Type: text/plain, Size: 1699 bytes --]

On Thu, Jun 05, 2025 at 09:34:01AM +0100, Daniel P. Berrangé wrote:
> On Wed, Jun 04, 2025 at 03:18:43PM -0400, Stefan Hajnoczi wrote:
> > Since commit 7ff9ff039380 ("meson: mitigate against use of uninitialize
> > stack for exploits") the -ftrivial-auto-var-init=zero compiler option is
> > used to zero local variables. While this reduces security risks
> > associated with uninitialized stack data, it introduced a measurable
> > bottleneck in the virtqueue_split_pop() and virtqueue_packed_pop()
> > functions.
> > 
> > These virtqueue functions are in the hot path. They are called for each
> > element (request) that is popped from a VIRTIO device's virtqueue. Using
> > __attribute__((uninitialized)) on large stack variables in these
> > functions improves fio randread bs=4k iodepth=64 performance from 304k
> > to 332k IOPS (+9%).
> 
> IIUC, the 'hwaddr addr' variable is 8k in size, and the 'struct iovec iov'
> array is 16k in size, so we have 24k on the stack that we're clearing and
> then later writing the real value. Makes sense that this would have a
> perf impact in a hotpath.
> 
> > This issue was found using perf-top(1). virtqueue_split_pop() was one of
> > the top CPU consumers and the "annotate" feature showed that the memory
> > zeroing instructions at the beginning of the functions were hot.
> 
> When you say you found it with 'perf-top' was that just discovered by
> accident, or was this usage of perf-top in response to users reporting
> a performance degradation vs earlier QEMU ?

By accident. I was looking for ways to optimize my work-in-progress QEMU
io_uring patches and virtqueue_split_pop() stood out in the CPU time
profile.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] virtio: avoid cost of -ftrivial-auto-var-init in hot path
  2025-06-05 11:28   ` Philippe Mathieu-Daudé
@ 2025-06-05 12:50     ` Stefan Hajnoczi
  2025-06-05 16:16       ` Philippe Mathieu-Daudé
  0 siblings, 1 reply; 11+ messages in thread
From: Stefan Hajnoczi @ 2025-06-05 12:50 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: Daniel P. Berrangé, qemu-devel, qemu-block,
	Michael S. Tsirkin

[-- Attachment #1: Type: text/plain, Size: 4139 bytes --]

On Thu, Jun 05, 2025 at 01:28:49PM +0200, Philippe Mathieu-Daudé wrote:
> On 5/6/25 10:34, Daniel P. Berrangé wrote:
> > On Wed, Jun 04, 2025 at 03:18:43PM -0400, Stefan Hajnoczi wrote:
> > > Since commit 7ff9ff039380 ("meson: mitigate against use of uninitialize
> > > stack for exploits") the -ftrivial-auto-var-init=zero compiler option is
> > > used to zero local variables. While this reduces security risks
> > > associated with uninitialized stack data, it introduced a measurable
> > > bottleneck in the virtqueue_split_pop() and virtqueue_packed_pop()
> > > functions.
> > > 
> > > These virtqueue functions are in the hot path. They are called for each
> > > element (request) that is popped from a VIRTIO device's virtqueue. Using
> > > __attribute__((uninitialized)) on large stack variables in these
> > > functions improves fio randread bs=4k iodepth=64 performance from 304k
> > > to 332k IOPS (+9%).
> > 
> > IIUC, the 'hwaddr addr' variable is 8k in size, and the 'struct iovec iov'
> > array is 16k in size, so we have 24k on the stack that we're clearing and
> > then later writing the real value. Makes sense that this would have a
> > perf impact in a hotpath.
> > 
> > > This issue was found using perf-top(1). virtqueue_split_pop() was one of
> > > the top CPU consumers and the "annotate" feature showed that the memory
> > > zeroing instructions at the beginning of the functions were hot.
> > 
> > When you say you found it with 'perf-top' was that just discovered by
> > accident, or was this usage of perf-top in response to users reporting
> > a performance degradation vs earlier QEMU ?
> 
> Would it make sense to move these to VirtQueue (since the structure
> definition is local anyway)?
> 
> -- >8 --
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index 85110bce374..b96c6ec603c 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -153,6 +153,12 @@ struct VirtQueue
>      EventNotifier host_notifier;
>      bool host_notifier_enabled;
>      QLIST_ENTRY(VirtQueue) node;
> +
> +    /* Only used by virtqueue_pop() */
> +    struct {
> +        hwaddr addr[VIRTQUEUE_MAX_SIZE];
> +        struct iovec iov[VIRTQUEUE_MAX_SIZE];
> +    } pop;

This is an alternative. Using g_alloca() is another alternative.

I chose __attribute__((uninitialized)) because it clearly documents the
reason why these variables need special treatment. In your patch the
"Only used by virtqueue_pop()" comment isn't enough to explain why these
variables should be located here. Someone might accidentally move them
back into virtqueue_pop() functions in the future if they are unaware of
the reason.

I'm happy to change approaches based on the pros/cons. Why do you prefer
moving the local variables into VirtQueue?

>  };
> 
>  const char *virtio_device_names[] = {
> @@ -1680,8 +1686,8 @@ static void *virtqueue_split_pop(VirtQueue *vq, size_t
> sz)
>      VirtIODevice *vdev = vq->vdev;
>      VirtQueueElement *elem = NULL;
>      unsigned out_num, in_num, elem_entries;
> -    hwaddr addr[VIRTQUEUE_MAX_SIZE];
> -    struct iovec iov[VIRTQUEUE_MAX_SIZE];
> +    hwaddr *addr = vq->pop.addr;
> +    struct iovec *iov = vq->pop.iov;
>      VRingDesc desc;
>      int rc;
> 
> @@ -1826,8 +1832,8 @@ static void *virtqueue_packed_pop(VirtQueue *vq,
> size_t sz)
>      VirtIODevice *vdev = vq->vdev;
>      VirtQueueElement *elem = NULL;
>      unsigned out_num, in_num, elem_entries;
> -    hwaddr addr[VIRTQUEUE_MAX_SIZE];
> -    struct iovec iov[VIRTQUEUE_MAX_SIZE];
> +    hwaddr *addr = vq->pop.addr;
> +    struct iovec *iov = vq->pop.iov;
>      VRingPackedDesc desc;
>      uint16_t id;
>      int rc;
> ---
> 
> > 
> > > 
> > > Fixes: 7ff9ff039380 ("meson: mitigate against use of uninitialize stack for exploits")
> > > Cc: Daniel P. Berrangé <berrange@redhat.com>
> > > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > > ---
> > >   include/qemu/compiler.h | 12 ++++++++++++
> > >   hw/virtio/virtio.c      |  8 ++++----
> > >   2 files changed, 16 insertions(+), 4 deletions(-)
> 

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] virtio: avoid cost of -ftrivial-auto-var-init in hot path
  2025-06-05 12:50     ` Stefan Hajnoczi
@ 2025-06-05 16:16       ` Philippe Mathieu-Daudé
  2025-06-05 16:30         ` Peter Maydell
  0 siblings, 1 reply; 11+ messages in thread
From: Philippe Mathieu-Daudé @ 2025-06-05 16:16 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Daniel P. Berrangé, qemu-devel, qemu-block,
	Michael S. Tsirkin

On 5/6/25 14:50, Stefan Hajnoczi wrote:
> On Thu, Jun 05, 2025 at 01:28:49PM +0200, Philippe Mathieu-Daudé wrote:
>> On 5/6/25 10:34, Daniel P. Berrangé wrote:
>>> On Wed, Jun 04, 2025 at 03:18:43PM -0400, Stefan Hajnoczi wrote:
>>>> Since commit 7ff9ff039380 ("meson: mitigate against use of uninitialize
>>>> stack for exploits") the -ftrivial-auto-var-init=zero compiler option is
>>>> used to zero local variables. While this reduces security risks
>>>> associated with uninitialized stack data, it introduced a measurable
>>>> bottleneck in the virtqueue_split_pop() and virtqueue_packed_pop()
>>>> functions.
>>>>
>>>> These virtqueue functions are in the hot path. They are called for each
>>>> element (request) that is popped from a VIRTIO device's virtqueue. Using
>>>> __attribute__((uninitialized)) on large stack variables in these
>>>> functions improves fio randread bs=4k iodepth=64 performance from 304k
>>>> to 332k IOPS (+9%).
>>>
>>> IIUC, the 'hwaddr addr' variable is 8k in size, and the 'struct iovec iov'
>>> array is 16k in size, so we have 24k on the stack that we're clearing and
>>> then later writing the real value. Makes sense that this would have a
>>> perf impact in a hotpath.
>>>
>>>> This issue was found using perf-top(1). virtqueue_split_pop() was one of
>>>> the top CPU consumers and the "annotate" feature showed that the memory
>>>> zeroing instructions at the beginning of the functions were hot.
>>>
>>> When you say you found it with 'perf-top' was that just discovered by
>>> accident, or was this usage of perf-top in response to users reporting
>>> a performance degradation vs earlier QEMU ?
>>
>> Would it make sense to move these to VirtQueue (since the structure
>> definition is local anyway)?
>>
>> -- >8 --
>> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
>> index 85110bce374..b96c6ec603c 100644
>> --- a/hw/virtio/virtio.c
>> +++ b/hw/virtio/virtio.c
>> @@ -153,6 +153,12 @@ struct VirtQueue
>>       EventNotifier host_notifier;
>>       bool host_notifier_enabled;
>>       QLIST_ENTRY(VirtQueue) node;
>> +
>> +    /* Only used by virtqueue_pop() */
>> +    struct {
>> +        hwaddr addr[VIRTQUEUE_MAX_SIZE];
>> +        struct iovec iov[VIRTQUEUE_MAX_SIZE];
>> +    } pop;
> 
> This is an alternative. Using g_alloca() is another alternative.

Not a lot of these:

$ git grep -w g_alloca
backends/tpm/tpm_emulator.c:136:        buf = g_alloca(n);
tests/unit/test-char.c:1012:        be = g_alloca(sizeof(CharBackend));

The tpm_emulator.c use could be replaced by g_autofree g_malloc.

> I chose __attribute__((uninitialized)) because it clearly documents the
> reason why these variables need special treatment. In your patch the
> "Only used by virtqueue_pop()" comment isn't enough to explain why these
> variables should be located here. Someone might accidentally move them
> back into virtqueue_pop() functions in the future if they are unaware of
> the reason.

The only safe-net is a better comment.

> I'm happy to change approaches based on the pros/cons. Why do you prefer
> moving the local variables into VirtQueue?

I don't have a particular preference, I'm just wondering why these
vars have to be handled differently than the rest, by introducing
QEMU_UNINITIALIZED.

Anyway, no objection to this patch :)

Regards,

Phil.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] virtio: avoid cost of -ftrivial-auto-var-init in hot path
  2025-06-05 16:16       ` Philippe Mathieu-Daudé
@ 2025-06-05 16:30         ` Peter Maydell
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Maydell @ 2025-06-05 16:30 UTC (permalink / raw)
  To: Philippe Mathieu-Daudé
  Cc: Stefan Hajnoczi, Daniel P. Berrangé, qemu-devel, qemu-block,
	Michael S. Tsirkin

On Thu, 5 Jun 2025 at 17:18, Philippe Mathieu-Daudé <philmd@linaro.org> wrote:
>
> On 5/6/25 14:50, Stefan Hajnoczi wrote:
> > On Thu, Jun 05, 2025 at 01:28:49PM +0200, Philippe Mathieu-Daudé wrote:
> >> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >> index 85110bce374..b96c6ec603c 100644
> >> --- a/hw/virtio/virtio.c
> >> +++ b/hw/virtio/virtio.c
> >> @@ -153,6 +153,12 @@ struct VirtQueue
> >>       EventNotifier host_notifier;
> >>       bool host_notifier_enabled;
> >>       QLIST_ENTRY(VirtQueue) node;
> >> +
> >> +    /* Only used by virtqueue_pop() */
> >> +    struct {
> >> +        hwaddr addr[VIRTQUEUE_MAX_SIZE];
> >> +        struct iovec iov[VIRTQUEUE_MAX_SIZE];
> >> +    } pop;
> >
> > This is an alternative. Using g_alloca() is another alternative.
>
> Not a lot of these:
>
> $ git grep -w g_alloca
> backends/tpm/tpm_emulator.c:136:        buf = g_alloca(n);
> tests/unit/test-char.c:1012:        be = g_alloca(sizeof(CharBackend));

There are also some alloca() uses, mostly ancient code in
linux-user that should really not be using it.

Like variable-length-arrays (which we managed to eradicate)
use of alloca is tricky because if the allocation size is
large then it just runs us out of stack in an uncontrolled
way. I'm not sure it's worth trying to remove existing alloca
use, but it would probably be preferable not to add more.

(GCC has a -Walloca option, so if we ever did get rid of all
of them we could avoid new ones creeping back in.)

thanks
-- PMM


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] virtio: avoid cost of -ftrivial-auto-var-init in hot path
  2025-06-04 19:18 [PATCH] virtio: avoid cost of -ftrivial-auto-var-init in hot path Stefan Hajnoczi
  2025-06-05  8:34 ` Daniel P. Berrangé
@ 2025-06-05 18:54 ` Daniel P. Berrangé
  2025-06-06  9:33   ` Kevin Wolf
  2025-06-10 16:41 ` Michael S. Tsirkin
  2 siblings, 1 reply; 11+ messages in thread
From: Daniel P. Berrangé @ 2025-06-05 18:54 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel, qemu-block, Michael S. Tsirkin

On Wed, Jun 04, 2025 at 03:18:43PM -0400, Stefan Hajnoczi wrote:
> Since commit 7ff9ff039380 ("meson: mitigate against use of uninitialize
> stack for exploits") the -ftrivial-auto-var-init=zero compiler option is
> used to zero local variables. While this reduces security risks
> associated with uninitialized stack data, it introduced a measurable
> bottleneck in the virtqueue_split_pop() and virtqueue_packed_pop()
> functions.
> 
> These virtqueue functions are in the hot path. They are called for each
> element (request) that is popped from a VIRTIO device's virtqueue. Using
> __attribute__((uninitialized)) on large stack variables in these
> functions improves fio randread bs=4k iodepth=64 performance from 304k
> to 332k IOPS (+9%).
> 
> This issue was found using perf-top(1). virtqueue_split_pop() was one of
> the top CPU consumers and the "annotate" feature showed that the memory
> zeroing instructions at the beginning of the functions were hot.

I'm concerned we have other issues lurking, so I built qemu
with -Wframe-larger-than=8192 and looked at every source
file location it reported. I've not done performance testing
but I've found a decent number of locations that look like
they are in the I/O path, so likely hot paths. It seems I
was too naive when introducing -ftrivial-auto-var-init=zero
wrt possible perf hits.

The results are as follow, and show some areas we should
likely proactively marked with "QEMU_UNINITIALIZED' even
without checking perf results:


../block/linux-aio.c:342:1: warning: the frame size of 8208 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - ioq_submit - struct iocb *iocbs[MAX_EVENTS];

../contrib/elf2dmp/main.c:636:1: warning: the frame size of 22240 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - Misc big data structures
    Not a hot path

../contrib/ivshmem-server/main.c:274:1: warning: the frame size of 8784 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - Ivshmem state is stored in main() stack
    Not a hot path

../contrib/vhost-user-gpu/vhost-user-gpu.c:1011:1: warning: the frame size of 16432 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../contrib/vhost-user-gpu/vhost-user-gpu.c:1107:1: warning: the frame size of 16464 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../contrib/vhost-user-gpu/vhost-user-gpu.c:1120:1: warning: the frame size of 16432 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../contrib/vhost-user-gpu/vhost-user-gpu.c:304:1: warning: the frame size of 16432 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../contrib/vhost-user-gpu/vhost-user-gpu.c:351:1: warning: the frame size of 16464 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../contrib/vhost-user-gpu/vhost-user-gpu.c:435:1: warning: the frame size of 16448 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../contrib/vhost-user-gpu/vhost-user-gpu.c:748:1: warning: the frame size of 16512 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../contrib/vhost-user-gpu/vhost-user-gpu.c:857:1: warning: the frame size of 16592 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../contrib/vhost-user-gpu/virgl.c:427:1: warning: the frame size of 16576 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../contrib/vhost-user-gpu/virgl.c:459:1: warning: the frame size of 16480 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../hw/display/vhost-user-gpu.c:168:1: warning: the frame size of 16432 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../hw/display/vhost-user-gpu.c:354:1: warning: the frame size of 17536 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - Many methods use the VhostUserGpuMsg struct which has a union where
    VhostUserGpuCursorUpdate has a 16kb data field.

    Varibles are initialized explicitly at declaration avoiding
    auto-init, but only the union branches that are needed. GCC
    may zero initialize unused branches, and possibly also
    union padding bytes.

    Hot paths => probably skip init

../hw/acpi/core.c:312:1: warning: the frame size of 8256 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - Has a "char unsigned data[8192];" used for reading data from a file.
    Not a hot path.

../hw/arm/mps2-tz.c:1216:1: warning: the frame size of 12048 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - Several large "const PPCInfo" const variables
    Not a hot path

../hw/audio/cs4231a.c:571:1: warning: the frame size of 12320 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - 4k and 8k buffers for converting audio samples

    Hot path, should skip init

../hw/core/qdev-properties-system.c:474:1: warning: the frame size of 8256 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - "NetClientState *peers[MAX_QUEUE_NUM];"  is 8K in size
    Not a hot path


../hw/display/vmware_vga.c:794:1: warning: the frame size of 20624 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - Large "struct vmsvga_cursor_definition_s cursor;"
    Hot path

../hw/i386/x86-common.c:1001:1: warning: the frame size of 8400 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - 8k buffer for loading initrd header
    Not a hot path

../hw/net/rtl8139.c:1840:1: warning: the frame size of 8224 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - 8k buffer for transferring data
    Hot path

../hw/net/virtio-net.c:2057:1: warning: the frame size of 32912 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../hw/net/virtio-net.c:2821:1: warning: the frame size of 32864 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - Several arrays of VIRTQUEUE_MAX_SIZE (1k)
    Hot path

../hw/net/xgmac.c:264:1: warning: the frame size of 8256 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - 8k buffer for transferring data
    Hot path

../hw/ppc/pnv_occ.c:807:1: warning: the frame size of 12960 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../hw/ppc/pnv_occ.c:922:1: warning: the frame size of 15136 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - Large structs
    Hot path

../hw/ppc/spapr_tpm_proxy.c:100:1: warning: the frame size of 8256 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - Two 4k buffers for transferring data
    Hot path, though TPMs are low performance devices 

../hw/s390x/ipl.c:552:1: warning: the frame size of 28688 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - Large IplParameterBlock struct
    Not hot path

../hw/s390x/s390-pci-inst.c:359:1: warning: the frame size of 8304 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - 8k buffer for transferring data between host/guest
    Unclear if clp_service_call is a hot path ?

../hw/sparc64/sun4u.c:154:1: warning: the frame size of 8208 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../hw/sparc/sun4m.c:171:1: warning: the frame size of 8208 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - 8k buffers for initializing NVRAM
    Not a hot path

../hw/usb/hcd-ohci.c:833:1: warning: the frame size of 8384 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - 8k buffer for transferring data between host/guest
    Hot path

../hw/virtio/vhost-user.c:886:1: warning: the frame size of 21760 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - Several arrays of VHOST_USER_MAX_RAM_SLOTS (512)
    Unclear if vhost_user_add_remove_regions is a hot path

../hw/virtio/virtio.c:1827:1: warning: the frame size of 24784 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../hw/virtio/virtio.c:1827:1: warning: the frame size of 24800 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../hw/virtio/virtio.c:1977:1: warning: the frame size of 24816 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - Several arrays of VIRTQUEUE_MAX_SIZE (1k)
    Hot path - Stefan's patch fixes


../hw/virtio/virtio.c:2156:1: warning: the frame size of 49184 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../hw/virtio/virtio.c:2193:1: warning: the frame size of 49184 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - Struct VirtQueueElementOld containing several arrays of VIRTQUEUE_MAX_SIZE (1k)
    Unclear if qemu_{put,get}_virtqueue_element are a hot path

../net/filter.c:304:1: warning: the frame size of 8224 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../net/net.c:1647:1: warning: the frame size of 8224 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../net/net.c:461:1: warning: the frame size of 8208 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../net/net-hmp-cmds.c:150:1: warning: the frame size of 8208 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../net/net-hmp-cmds.c:170:1: warning: the frame size of 8208 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - More MAX_QUEUE_NUM variables
    Not a hot path

../net/socket.c:191:1: warning: the frame size of 69664 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../net/stream.c:192:1: warning: the frame size of 69664 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - 68k buffers for transferring data
    Hot path

../net/vhost-user.c:276:1: warning: the frame size of 8224 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../net/vhost-user.c:334:1: warning: the frame size of 8304 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - More MAX_QUEUE_NUM variables
    Not a hot path

../scsi/qemu-pr-helper.c:423:1: warning: the frame size of 8208 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../scsi/qemu-pr-helper.c:540:1: warning: the frame size of 8256 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - Large "struct prin_resp"
    Not a hot path

../target/arm/helper.c:9028:1: warning: the frame size of 9184 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - Many Large ARMCPRegInfo structs
    Unclear if register_cp_regs_for_features is a hot path ?

../target/s390x/ioinst.c:719:1: warning: the frame size of 8272 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - A TARGET_PAGE_SIZE buffer - 4k
    Unclear if ioinst_handle_chsc is a hot path

../target/xtensa/translate.c:1117:1: warning: the frame size of 42608 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - Several MAX_INSN_SLOTS arrays
    Unclear if disas_xtensa_insn is a hot path

../tests/qtest/hd-geo-test.c:510:1: warning: the frame size of 8832 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../tests/qtest/ide-test.c:553:1: warning: the frame size of 32832 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../tests/qtest/ivshmem-test.c:379:1: warning: the frame size of 8448 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../tests/qtest/ufs-test.c:680:1: warning: the frame size of 8576 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../tests/unit/test-string-input-visitor.c:464:1: warning: the frame size of 10064 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../tests/unit/test-write-threshold.c:24:1: warning: the frame size of 17168 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../tests/unit/test-write-threshold.c:37:1: warning: the frame size of 17168 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../tests/vhost-user-bridge.c:232:1: warning: the frame size of 16480 bytes is larger than 8192 bytes [-Wframe-larger-than=]
../tests/vhost-user-bridge.c:370:1: warning: the frame size of 16544 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - Tests aren't a hot path

../ui/vnc-jobs.c:336:1: warning: the frame size of 103696 bytes is larger than 8192 bytes [-Wframe-larger-than=]

  - A VncState variable
    Hot path



With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] virtio: avoid cost of -ftrivial-auto-var-init in hot path
  2025-06-05 18:54 ` Daniel P. Berrangé
@ 2025-06-06  9:33   ` Kevin Wolf
  0 siblings, 0 replies; 11+ messages in thread
From: Kevin Wolf @ 2025-06-06  9:33 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Stefan Hajnoczi, qemu-devel, qemu-block, Michael S. Tsirkin

Am 05.06.2025 um 20:54 hat Daniel P. Berrangé geschrieben:
> On Wed, Jun 04, 2025 at 03:18:43PM -0400, Stefan Hajnoczi wrote:
> > Since commit 7ff9ff039380 ("meson: mitigate against use of uninitialize
> > stack for exploits") the -ftrivial-auto-var-init=zero compiler option is
> > used to zero local variables. While this reduces security risks
> > associated with uninitialized stack data, it introduced a measurable
> > bottleneck in the virtqueue_split_pop() and virtqueue_packed_pop()
> > functions.
> > 
> > These virtqueue functions are in the hot path. They are called for each
> > element (request) that is popped from a VIRTIO device's virtqueue. Using
> > __attribute__((uninitialized)) on large stack variables in these
> > functions improves fio randread bs=4k iodepth=64 performance from 304k
> > to 332k IOPS (+9%).
> > 
> > This issue was found using perf-top(1). virtqueue_split_pop() was one of
> > the top CPU consumers and the "annotate" feature showed that the memory
> > zeroing instructions at the beginning of the functions were hot.
> 
> I'm concerned we have other issues lurking, so I built qemu
> with -Wframe-larger-than=8192 and looked at every source
> file location it reported. I've not done performance testing
> but I've found a decent number of locations that look like
> they are in the I/O path, so likely hot paths. It seems I
> was too naive when introducing -ftrivial-auto-var-init=zero
> wrt possible perf hits.
> 
> The results are as follow, and show some areas we should
> likely proactively marked with "QEMU_UNINITIALIZED' even
> without checking perf results:
> 
> 
> ../block/linux-aio.c:342:1: warning: the frame size of 8208 bytes is larger than 8192 bytes [-Wframe-larger-than=]
> 
>   - ioq_submit - struct iocb *iocbs[MAX_EVENTS];

This is certainly a hot path. We should probably try and measure it.

> ../hw/virtio/vhost-user.c:886:1: warning: the frame size of 21760 bytes is larger than 8192 bytes [-Wframe-larger-than=]
> 
>   - Several arrays of VHOST_USER_MAX_RAM_SLOTS (512)
>     Unclear if vhost_user_add_remove_regions is a hot path

Pretty sure it's not, you don't reconfigure your memory all the time.
(If you did, vhost-user performance would be destroyed anyway.)

> ../hw/virtio/virtio.c:1827:1: warning: the frame size of 24784 bytes is larger than 8192 bytes [-Wframe-larger-than=]
> ../hw/virtio/virtio.c:1827:1: warning: the frame size of 24800 bytes is larger than 8192 bytes [-Wframe-larger-than=]
> ../hw/virtio/virtio.c:1977:1: warning: the frame size of 24816 bytes is larger than 8192 bytes [-Wframe-larger-than=]
> 
>   - Several arrays of VIRTQUEUE_MAX_SIZE (1k)
>     Hot path - Stefan's patch fixes
> 
> 
> ../hw/virtio/virtio.c:2156:1: warning: the frame size of 49184 bytes is larger than 8192 bytes [-Wframe-larger-than=]
> ../hw/virtio/virtio.c:2193:1: warning: the frame size of 49184 bytes is larger than 8192 bytes [-Wframe-larger-than=]
> 
>   - Struct VirtQueueElementOld containing several arrays of VIRTQUEUE_MAX_SIZE (1k)
>     Unclear if qemu_{put,get}_virtqueue_element are a hot path

These are for (de)serialising state for migration, not a hot path.

Kevin



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] virtio: avoid cost of -ftrivial-auto-var-init in hot path
  2025-06-04 19:18 [PATCH] virtio: avoid cost of -ftrivial-auto-var-init in hot path Stefan Hajnoczi
  2025-06-05  8:34 ` Daniel P. Berrangé
  2025-06-05 18:54 ` Daniel P. Berrangé
@ 2025-06-10 16:41 ` Michael S. Tsirkin
  2025-06-10 16:52   ` Daniel P. Berrangé
  2 siblings, 1 reply; 11+ messages in thread
From: Michael S. Tsirkin @ 2025-06-10 16:41 UTC (permalink / raw)
  To: Stefan Hajnoczi; +Cc: qemu-devel, qemu-block, Daniel P. Berrangé

On Wed, Jun 04, 2025 at 03:18:43PM -0400, Stefan Hajnoczi wrote:
> Since commit 7ff9ff039380 ("meson: mitigate against use of uninitialize
> stack for exploits") the -ftrivial-auto-var-init=zero compiler option is
> used to zero local variables. While this reduces security risks
> associated with uninitialized stack data, it introduced a measurable
> bottleneck in the virtqueue_split_pop() and virtqueue_packed_pop()
> functions.
> 
> These virtqueue functions are in the hot path. They are called for each
> element (request) that is popped from a VIRTIO device's virtqueue. Using
> __attribute__((uninitialized)) on large stack variables in these
> functions improves fio randread bs=4k iodepth=64 performance from 304k
> to 332k IOPS (+9%).

I ask however whether it is always not worth it for all users.
It does reduce chances of leaking stack info, does it not?

Maybe we can start with a tri-state Kconfig knob to select between
performance/balanced/paranoid for this and similar variables?


> This issue was found using perf-top(1). virtqueue_split_pop() was one of
> the top CPU consumers and the "annotate" feature showed that the memory
> zeroing instructions at the beginning of the functions were hot.
> 
> Fixes: 7ff9ff039380 ("meson: mitigate against use of uninitialize stack for exploits")
> Cc: Daniel P. Berrangé <berrange@redhat.com>
> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> ---
>  include/qemu/compiler.h | 12 ++++++++++++
>  hw/virtio/virtio.c      |  8 ++++----
>  2 files changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/include/qemu/compiler.h b/include/qemu/compiler.h
> index 496dac5ac1..fabd540b02 100644
> --- a/include/qemu/compiler.h
> +++ b/include/qemu/compiler.h
> @@ -207,6 +207,18 @@
>  # define QEMU_USED
>  #endif
>  
> +/*
> + * Disable -ftrivial-auto-var-init on a local variable. Use this in rare cases
> + * when the compiler zeroes a large on-stack variable and this causes a
> + * performance bottleneck. Only use it when performance data indicates this is
> + * necessary since security risks increase with uninitialized stack variables.
> + */
> +#if __has_attribute(uninitialized)
> +# define QEMU_UNINITIALIZED __attribute__((uninitialized))
> +#else
> +# define QEMU_UNINITIALIZED
> +#endif
> +
>  /*
>   * http://clang.llvm.org/docs/ThreadSafetyAnalysis.html
>   *
> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> index 5534251e01..82a285a31d 100644
> --- a/hw/virtio/virtio.c
> +++ b/hw/virtio/virtio.c
> @@ -1689,8 +1689,8 @@ static void *virtqueue_split_pop(VirtQueue *vq, size_t sz)
>      VirtIODevice *vdev = vq->vdev;
>      VirtQueueElement *elem = NULL;
>      unsigned out_num, in_num, elem_entries;
> -    hwaddr addr[VIRTQUEUE_MAX_SIZE];
> -    struct iovec iov[VIRTQUEUE_MAX_SIZE];
> +    hwaddr QEMU_UNINITIALIZED addr[VIRTQUEUE_MAX_SIZE];
> +    struct iovec QEMU_UNINITIALIZED iov[VIRTQUEUE_MAX_SIZE];
>      VRingDesc desc;
>      int rc;
>  
> @@ -1836,8 +1836,8 @@ static void *virtqueue_packed_pop(VirtQueue *vq, size_t sz)
>      VirtIODevice *vdev = vq->vdev;
>      VirtQueueElement *elem = NULL;
>      unsigned out_num, in_num, elem_entries;
> -    hwaddr addr[VIRTQUEUE_MAX_SIZE];
> -    struct iovec iov[VIRTQUEUE_MAX_SIZE];
> +    hwaddr QEMU_UNINITIALIZED addr[VIRTQUEUE_MAX_SIZE];
> +    struct iovec QEMU_UNINITIALIZED iov[VIRTQUEUE_MAX_SIZE];
>      VRingPackedDesc desc;
>      uint16_t id;
>      int rc;
> -- 
> 2.49.0



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] virtio: avoid cost of -ftrivial-auto-var-init in hot path
  2025-06-10 16:41 ` Michael S. Tsirkin
@ 2025-06-10 16:52   ` Daniel P. Berrangé
  0 siblings, 0 replies; 11+ messages in thread
From: Daniel P. Berrangé @ 2025-06-10 16:52 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Stefan Hajnoczi, qemu-devel, qemu-block

On Tue, Jun 10, 2025 at 12:41:00PM -0400, Michael S. Tsirkin wrote:
> On Wed, Jun 04, 2025 at 03:18:43PM -0400, Stefan Hajnoczi wrote:
> > Since commit 7ff9ff039380 ("meson: mitigate against use of uninitialize
> > stack for exploits") the -ftrivial-auto-var-init=zero compiler option is
> > used to zero local variables. While this reduces security risks
> > associated with uninitialized stack data, it introduced a measurable
> > bottleneck in the virtqueue_split_pop() and virtqueue_packed_pop()
> > functions.
> > 
> > These virtqueue functions are in the hot path. They are called for each
> > element (request) that is popped from a VIRTIO device's virtqueue. Using
> > __attribute__((uninitialized)) on large stack variables in these
> > functions improves fio randread bs=4k iodepth=64 performance from 304k
> > to 332k IOPS (+9%).
> 
> I ask however whether it is always not worth it for all users.
> It does reduce chances of leaking stack info, does it not?
> 
> Maybe we can start with a tri-state Kconfig knob to select between
> performance/balanced/paranoid for this and similar variables?

I'd prefer this not to be configurable. With a fixed setup, when we
analyse reported bugs, we can make a clear determination of whether
it is actually an expoitable security bug or not, without having to
concern ourselves with build time settings. Essentially the
-ftrivial-auto-var-init=zero turned almost all 'uninitialized
stack variable' bugs into plain bugs, or even non-bugs in most
cases, instead of likely security issues. The QEMU_UNINITIALIZED
annotations are targetted fixed to avoid the worst case perf
hits, without compromising our security posture.

> 
> 
> > This issue was found using perf-top(1). virtqueue_split_pop() was one of
> > the top CPU consumers and the "annotate" feature showed that the memory
> > zeroing instructions at the beginning of the functions were hot.
> > 
> > Fixes: 7ff9ff039380 ("meson: mitigate against use of uninitialize stack for exploits")
> > Cc: Daniel P. Berrangé <berrange@redhat.com>
> > Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> > ---
> >  include/qemu/compiler.h | 12 ++++++++++++
> >  hw/virtio/virtio.c      |  8 ++++----
> >  2 files changed, 16 insertions(+), 4 deletions(-)
> > 
> > diff --git a/include/qemu/compiler.h b/include/qemu/compiler.h
> > index 496dac5ac1..fabd540b02 100644
> > --- a/include/qemu/compiler.h
> > +++ b/include/qemu/compiler.h
> > @@ -207,6 +207,18 @@
> >  # define QEMU_USED
> >  #endif
> >  
> > +/*
> > + * Disable -ftrivial-auto-var-init on a local variable. Use this in rare cases
> > + * when the compiler zeroes a large on-stack variable and this causes a
> > + * performance bottleneck. Only use it when performance data indicates this is
> > + * necessary since security risks increase with uninitialized stack variables.
> > + */
> > +#if __has_attribute(uninitialized)
> > +# define QEMU_UNINITIALIZED __attribute__((uninitialized))
> > +#else
> > +# define QEMU_UNINITIALIZED
> > +#endif
> > +
> >  /*
> >   * http://clang.llvm.org/docs/ThreadSafetyAnalysis.html
> >   *
> > diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> > index 5534251e01..82a285a31d 100644
> > --- a/hw/virtio/virtio.c
> > +++ b/hw/virtio/virtio.c
> > @@ -1689,8 +1689,8 @@ static void *virtqueue_split_pop(VirtQueue *vq, size_t sz)
> >      VirtIODevice *vdev = vq->vdev;
> >      VirtQueueElement *elem = NULL;
> >      unsigned out_num, in_num, elem_entries;
> > -    hwaddr addr[VIRTQUEUE_MAX_SIZE];
> > -    struct iovec iov[VIRTQUEUE_MAX_SIZE];
> > +    hwaddr QEMU_UNINITIALIZED addr[VIRTQUEUE_MAX_SIZE];
> > +    struct iovec QEMU_UNINITIALIZED iov[VIRTQUEUE_MAX_SIZE];
> >      VRingDesc desc;
> >      int rc;
> >  
> > @@ -1836,8 +1836,8 @@ static void *virtqueue_packed_pop(VirtQueue *vq, size_t sz)
> >      VirtIODevice *vdev = vq->vdev;
> >      VirtQueueElement *elem = NULL;
> >      unsigned out_num, in_num, elem_entries;
> > -    hwaddr addr[VIRTQUEUE_MAX_SIZE];
> > -    struct iovec iov[VIRTQUEUE_MAX_SIZE];
> > +    hwaddr QEMU_UNINITIALIZED addr[VIRTQUEUE_MAX_SIZE];
> > +    struct iovec QEMU_UNINITIALIZED iov[VIRTQUEUE_MAX_SIZE];
> >      VRingPackedDesc desc;
> >      uint16_t id;
> >      int rc;
> > -- 
> > 2.49.0
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2025-06-10 17:04 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-04 19:18 [PATCH] virtio: avoid cost of -ftrivial-auto-var-init in hot path Stefan Hajnoczi
2025-06-05  8:34 ` Daniel P. Berrangé
2025-06-05 11:28   ` Philippe Mathieu-Daudé
2025-06-05 12:50     ` Stefan Hajnoczi
2025-06-05 16:16       ` Philippe Mathieu-Daudé
2025-06-05 16:30         ` Peter Maydell
2025-06-05 12:44   ` Stefan Hajnoczi
2025-06-05 18:54 ` Daniel P. Berrangé
2025-06-06  9:33   ` Kevin Wolf
2025-06-10 16:41 ` Michael S. Tsirkin
2025-06-10 16:52   ` Daniel P. Berrangé

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).