All of lore.kernel.org
 help / color / mirror / Atom feed
From: Peter Xu <peterx@redhat.com>
To: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: Tejus GK <tejus.gk@nutanix.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	Fabiano Rosas <farosas@suse.de>, Eric Blake <eblake@redhat.com>,
	Markus Armbruster <armbru@redhat.com>
Subject: Re: [PATCH v2 1/1] io: make zerocopy fallback accounting more accurate
Date: Wed, 11 Mar 2026 14:43:47 -0400	[thread overview]
Message-ID: <abG349ix_S-CEn8C@x1.local> (raw)
In-Reply-To: <abGqkFCoApJv2rT1@redhat.com>

On Wed, Mar 11, 2026 at 05:46:56PM +0000, Daniel P. Berrangé wrote:
> On Wed, Mar 11, 2026 at 01:28:36PM -0400, Peter Xu wrote:
> > On Wed, Mar 11, 2026 at 04:56:17PM +0000, Daniel P. Berrangé wrote:
> > > On Wed, Mar 11, 2026 at 11:30:26AM -0400, Peter Xu wrote:
> > > > On Wed, Mar 11, 2026 at 12:02:05PM +0000, Daniel P. Berrangé wrote:
> > > > > On Mon, Mar 09, 2026 at 02:21:49PM -0400, Peter Xu wrote:
> > > > > > On Mon, Mar 09, 2026 at 05:51:29PM +0000, Daniel P. Berrangé wrote:
> > > > > > > On Mon, Mar 09, 2026 at 05:42:08PM +0000, Tejus GK wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > On 9 Mar 2026, at 10:47 PM, Daniel P. Berrangé <berrange@redhat.com> wrote:
> > > > > > > > > 
> > > > > > > > > !-------------------------------------------------------------------|
> > > > > > > > >  CAUTION: External Email
> > > > > > > > > 
> > > > > > > > > |-------------------------------------------------------------------!
> > > > > > > > > 
> > > > > > > > > On Mon, Mar 09, 2026 at 12:59:44PM -0400, Peter Xu wrote:
> > > > > > > > >> On Mon, Mar 09, 2026 at 04:48:37PM +0000, Daniel P. Berrangé wrote:
> > > > > > > > >>>> @@ -881,8 +881,8 @@ static int qio_channel_socket_flush_internal(QIOChannel *ioc,
> > > > > > > > >>>>         sioc->zero_copy_sent += serr->ee_data - serr->ee_info + 1;
> > > > > > > > >>>> 
> > > > > > > > >>>>         /* If any sendmsg() succeeded using zero copy, mark zerocopy success */
> > > > > > > > >>>> -        if (serr->ee_code != SO_EE_CODE_ZEROCOPY_COPIED) {
> > > > > > > > >>>> -            sioc->new_zero_copy_sent_success = true;
> > > > > > > > >>>> +        if (serr->ee_code == SO_EE_CODE_ZEROCOPY_COPIED) {
> > > > > > > > >>>> +            sioc->zero_copy_fallback++;
> > > > > > > > >>> 
> > > > > > > > >>> ...this is counting the number of MSG_ERRQUEUE items, which is not
> > > > > > > > >>> the same as the number of IO requests. That's why we only used it
> > > > > > > > >>> as a boolean marker originally, rather than making it a counter.
> > > > > > > > >> 
> > > > > > > > >> Would the logic still work and better than before?  Say, it's a counter of
> > > > > > > > >> "messages" rather than "IOs" then.
> > > > > > > > > 
> > > > > > > > > IIUC it is a counter of processing notifications which is not directly
> > > > > > > > > correlated to any action by QEMU - neither bytes nor syscalls.
> > > > > > > > 
> > > > > > > > Please correct me if I'm wrong about this, isn’t each notification an information 
> > > > > > > > about what happened to an individual IO?
> > > > > > > 
> > > > > > > If userspace hasn't read a queued notification yet, the kernel will
> > > > > > > merge new notifications with the existing queued one.
> > > > > > > 
> > > > > > > The line above your change
> > > > > > > 
> > > > > > >   serr->ee_data - serr->ee_info + 1;
> > > > > > > 
> > > > > > > records how many notifications were merged, so we now how many
> > > > > > > syscalls were processed.
> > > > > > > 
> > > > > > > If ee_code is  SO_EE_CODE_ZEROCOPY_COPIED though it means at least
> > > > > > > one syscall resulted in a copy, but that doesn't imply that *all*
> > > > > > > syscalls resulted in a copy.
> > > > > > > 
> > > > > > > AFAICT, it could be 1 out of a 1000 syscalls resulted in a copy,
> > > > > > > or it could be 1000 out of 1000 resulted in a copy. We don't know.
> > > > > > > 
> > > > > > > IIUC the kernel's merging of notifications appears lossy wrt this
> > > > > > > information. It could be partially mitigated by doing a flush for
> > > > > > > notifications really really frequently but that feels like it would
> > > > > > > have its own downsides
> > > > > > 
> > > > > > IMHO what this change does is removing the false negatives.
> > > > > > 
> > > > > > Before this patch, if QEMU reports fallback=0, it doesn't mean all the
> > > > > > MSG_ZEROCOPY requests were all fulfilled by zerocopy.  It's because we
> > > > > > justify it with one boolean over "a period of time" between two flushes, we
> > > > > > set the boolean to TRUE as long as there is _one_ successful report of
> > > > > > MSG_ZEROCOPY.  So even if every flush reports TRUE it only means "there is
> > > > > > at least one MSG_ZEROCOPY request that didn't fallback".  It has no
> > > > > > implication of whether a fallback happened.
> > > > > > 
> > > > > > Hence, before this v2 patch, there can be false negative reported by QEMU,
> > > > > > assuming there's no fallback (reflected in stats) but it actually happened.
> > > > > > 
> > > > > > After this patch, if QEMU reports fallback=0, it guarantees that _all_
> > > > > > MSG_ZEROCOPY requests are fulfilled with zerocopy.  It's because we monitor
> > > > > > all messages and accumulate any fallback cases.  Even if the messages can
> > > > > > be merged, when "fallback" shows anything non-zero would imply some
> > > > > > fallback happened.  Here, the counter value doesn't really matter much
> > > > > > IMHO, as long as it becomes non-zero.
> > > > > 
> > > > > AFAICT, the v1 of this patch was sufficient to address the original
> > > > > bug and maintain the current intended semantics of the migration
> > > > > counter. This v2 is mixing a bug fix with functional change in
> > > > > behaviour and I don't think the latter is justified.
> > > > 
> > > > It's just that when it cannot report all fallback cases, I don't yet see
> > > > how it would help much even if we fix the previous behavior with v1..
> > > > 
> > > > OTOH, the new behavior will be deemed to have no issue on the problem v1
> > > > was fixing.
> > > > 
> > > > So IIUC v2's behavior is the one we want, and helps identify fallback
> > > > happened.
> > > 
> > > I don't consider v2 acceptable as the value its returning is an
> > > meaningless counter that doesn't correlate to any quantity that
> > > is used by QEMU, nor visible to users of QEMU.
> > 
> > It can be a boolean if we want showing "if any fallback happened", that'll
> > at least make it accurate and avoid false negatives. But IMHO a counter is
> > always better, e.g. when we dump it from time to time we know if any more
> > fallback happened.
> > 
> > In that case, no matter how that counter is defined in granularity that'll
> > help, as long as it get boosted when fallback happened.
> > 
> > I also don't expect this value to be consumed by an user, but only reported
> > by an user and should only be consumed by a developer.
> 
> Ok, so the problem is that we've got a design inversion between what
> the kernel is reporting and what the io channel is reporting.
> 
> With the kernel notifications we can determine
> 
>  * All syscalls successfully used zero copy
>  * At least one syscall failed to use zero copy
> 
> whereas what the io channel flush is (claiming) to report is
> 
>  * 1 => all syscalls failed to use zero copy
>  * 0 => at least one syscall successfully used zero copy
> 
> and you cannot infer the latter from the former, as we have missing
> information due to merging of notifications.
> 
> So we need to invert the return values semantics of the flush method
> to account for the missing information:
> 
>  * 1 => at least one syscall failed to use zero copy
>  * 0 => all syscalls successfully used zero copy

Yep, this should be one good way to nail this problem.  Maybe Tejus, as a
real consumer of this counter, will have a preference on how it looks the
best.

Thanks,

-- 
Peter Xu



  reply	other threads:[~2026-03-11 18:44 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-09  9:09 [PATCH v2 1/1] io: make zerocopy fallback accounting more accurate Tejus GK
2026-03-09 16:48 ` Daniel P. Berrangé
2026-03-09 16:59   ` Peter Xu
2026-03-09 17:17     ` Daniel P. Berrangé
2026-03-09 17:42       ` Tejus GK
2026-03-09 17:51         ` Daniel P. Berrangé
2026-03-09 18:21           ` Peter Xu
2026-03-11 12:02             ` Daniel P. Berrangé
2026-03-11 15:30               ` Peter Xu
2026-03-11 16:56                 ` Daniel P. Berrangé
2026-03-11 17:28                   ` Peter Xu
2026-03-11 17:46                     ` Daniel P. Berrangé
2026-03-11 18:43                       ` Peter Xu [this message]
2026-03-16 16:26                         ` Tejus GK
2026-03-10  8:52 ` Markus Armbruster

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=abG349ix_S-CEn8C@x1.local \
    --to=peterx@redhat.com \
    --cc=armbru@redhat.com \
    --cc=berrange@redhat.com \
    --cc=eblake@redhat.com \
    --cc=farosas@suse.de \
    --cc=qemu-devel@nongnu.org \
    --cc=tejus.gk@nutanix.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.