From: "Michael S. Tsirkin" <mst@redhat.com>
To: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
qemu-devel@nongnu.org, Anthony Liguori <anthony@codemonkey.ws>,
David Gibson <david@gibson.dropbear.id.au>
Subject: Re: [Qemu-devel] [PATCH] Add a memory barrier to guest memory access function
Date: Tue, 22 May 2012 01:22:44 +0300
Message-ID: <20120521222243.GI17031@redhat.com>
In-Reply-To: <1337637497.2779.114.camel@pasglop>
On Tue, May 22, 2012 at 07:58:17AM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2012-05-21 at 15:18 +0300, Michael S. Tsirkin wrote:
>
> > > mb is more than just "read flush writes" (besides, it's not a statement
> > > about flushing, it's a statement about ordering; whether it has a
> > > flushing side effect on x86 is a separate issue, it doesn't on power, for
> > > example).
> >
> > I referred to reads not bypassing writes on PCI.
>
> Again, from which originator? From a given initiator, nothing bypasses
> anything, so the right thing to do here is a full mb(). However, I
> suspect what you are talking about here is read -responses- not
> bypassing writes in the direction of the response (ie, the "flushing"
> semantic of reads), which is a different matter.
No. My spec says (Transaction Ordering Rules, entries A3, A4):

    "A Posted Request must be able to pass Non-Posted Requests to
    avoid deadlocks."
> Also don't forget that
> this is only a semantic of PCI, not of the system fabric, ie, a device
> DMA read doesn't flush a CPU write that is still in that CPU store
> queue.
We need to emulate what hardware does, IMO.
So if USB has different rules, it needs different barriers.
> > This is the real argument in my eyes: that we
> > should behave the way real hardware does.
>
> But that doesn't really make much sense since we don't actually have a
> non-coherent bus sitting in the middle :-)
>
> However we should, as much as possible, be observed to behave as such, I
> agree, though I don't think we need to bother too much about timings
> since we don't really have a way to enforce the immediate visibility of
> stores within the coherent domain without a bunch of arch-specific very,
> very heavy hammers which we really don't want to wield at this point.
>
> > > Real flushing out of writes matters very much in real life, in two very
> > > different contexts that tend not to affect emulation in qemu as much.
> > >
> > > One is flushing writes in the opposite direction (or rather, having the
> > > read response queued up behind those writes), which is critical to
> > > ensuring proper completion of DMAs after an LSI, from a guest driver
> > > perspective, on real HW typically.
> > >
> > > The other classic case is to flush posted MMIO writes in order to ensure
> > > that a subsequent delay is respected.
> > >
> > > Most of those don't actually matter when doing emulation. Besides, a
> > > barrier won't provide you the second guarantee; you need a nastier
> > > construct, at least on some architectures like power.
> >
> > Exactly. This is what I was saying too.
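(For reference, the second case above is the classic posted-write flush
pattern in Linux drivers; a rough sketch, with the register name, value
and delay made up:

    writel(val, dev->regs + REG_CTRL);   /* posted MMIO write */
    (void)readl(dev->regs + REG_CTRL);   /* read-back forces the write out */
    udelay(10);                          /* only now is the delay meaningful */

No memory barrier gives you that guarantee; the read-back does.)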
>
> Right and I'm reasonably sure that none of those above is our problem.
>
> As I said, at this point, what I want to sort out is purely the
> observable ordering of DMA transactions. The side effect of reads in one
> direction on writes in the other direction is an orthogonal problem
> which, as I wrote above, is probably not hurting us.
>
> > > However, we do need to ensure that reads and writes are properly ordered
> > > vs. each other (regardless of any "flush" semantic) or things could go
> > > very wrong on out-of-order architectures (here too, my understanding of
> > > x86 is limited).
> >
> > Right. Here's a compromise:
> > - add smp_rmb() on any DMA read
> > - add smp_wmb() on any DMA write
> > This is almost zero cost on x86 at least,
> > so we are not regressing existing setups.
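(Concretely, the compromise would look something like this; a sketch only,
with the wrapper names and barrier placement assumed:

    static inline void dma_read_sketch(dma_addr_t addr, void *buf, int len)
    {
        smp_rmb();                  /* order against earlier DMA reads */
        cpu_physical_memory_read(addr, buf, len);
    }

    static inline void dma_write_sketch(dma_addr_t addr, const void *buf, int len)
    {
        smp_wmb();                  /* make earlier DMA writes visible first */
        cpu_physical_memory_write(addr, buf, len);
    }

On x86, smp_rmb() and smp_wmb() both reduce to compiler barriers, hence
the "almost zero cost".)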
>
> And it's not correct. With that setup, DMA writes can pass DMA reads
> (and vice versa), which doesn't correspond to the guarantees of the PCI
> spec.
Cite the spec, please. The Express spec matches this, at least.
> The question, I suppose, is whether this is a problem in practice...
>
> > Are there any places where devices do read after write?
>
> It's possible, things like update of a descriptor followed by reading of
> the next one, etc... I don't have an example in mind right now of
> a device that would be hurt, but I'm a bit nervous as this would be a
> violation of the PCI guaranteed ordering.
>
> > My preferred way is to find them and do pci_dma_flush() invoking
> > smp_mb(). If there is such a case it's likely on datapath anyway
> > so we do care.
> >
> > But I can also live with a global flag "latest_dma_read",
> > and on read we could do:
> >
> >     if (unlikely(latest_dma_read))
> >         smp_mb();
> >
> > if you really insist on it,
> > though I do think it's inelegant.
>
> Again, why do you object to simply making the default accessors fully
> ordered? Do you think it will make a measurable difference in most cases?
>
> Shouldn't we measure it first?
It's a lot of work. We measured the effect for virtio in
the past. I don't think we need to redo it.
> > You said above x86 is unaffected. This is portability, not safety.
>
> x86 is unaffected by the missing wmb/rmb; it might not be unaffected by
> the missing ordering between loads and stores. I don't know; as I said,
> I don't fully know the x86 memory model.
You don't need to understand it. Assume memory-barriers.h is correct.
> In any case, opposing "portability" to "safety" the way you do it means
> you are making assumptions that basically "qemu is written for x86 and
> nothing else matters".
No. But find a way to fix power without hurting working setups.
> If that's your point of view, so be it and be clear about it, but I will
> disagree :-) And while I can understand that powerpc might not be
> considered the most important arch around at this point in time,
> these problems are going to affect ARM as well.
>
> > > Almost all our
> > > devices were written without any thought given to ordering, so they
> > > basically can and should be considered all broken.
> >
> > Problem is, a lot of code is likely broken even after you sprinkle
> > barriers around. For example, qemu might write A then B where the guest
> > driver expects to see B written before A.
>
> No, these are totally unrelated bugs, nothing to do with barriers. You are
> mixing up two completely different problems and using one as an excuse
> not to fix the other one :-)
>
> A device with the above problem would be broken today on x86 regardless.
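(To make that concrete, the kind of bug being described, with a
hypothetical descriptor layout and write helper, is:

    desc->status = DESC_DONE;          /* guest may see "done"... */
    dma_write(buf_addr, data, len);    /* ...before the data has landed */

No barrier fixes this: a barrier faithfully preserves whatever order the
device code asked for, including the wrong one.)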
>
> > > Since thinking
> > > about ordering is something that, by experience, very few programmers can
> > > do and get right, the default should absolutely be fully ordered.
> >
> > Give it bus ordering. That is not fully ordered.
>
> It pretty much is actually, look at your PCI spec :-)
I looked: section 2.4.1, Transaction Ordering Rules.
> > > Performance regressions aren't a big deal to bisect in that case: if
> > > there's a regression for a given driver and it points to this specific
> > > patch adding the barriers, then we know precisely where the regression
> > > comes from, and we can get some insight into how this specific driver
> > > can be improved to use more relaxed accessors.
> > >
> > > I don't see the problem.
> > >
> > > One thing that might be worth looking at is whether indeed mb() is so
> > > much more costly than just wmb/rmb, in which case we could have some
> > > smarts in the accessors to make them skip the full mb based on knowledge
> > > of the previous access direction, though here too I would be tempted to
> > > only do that if absolutely necessary (ie if we can't instead just fix the
> > > sensitive driver to use explicitly relaxed accessors).
> >
> > We did this in virtio and yes, it is measurable.
>
> You did it in virtio in a very hot spot of a performance-critical
> driver. My argument is that:
>
> - We can do it in a way that doesn't affect virtio at all (by using the
> dma accessors instead of cpu_*)
>
> - Only a few drivers have that kind of performance criticality, and they
> can be easily hand-fixed.
>
> > branches are pretty cheap though.
>
> Depends, not always but yes, cheaper than barriers in many cases.
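(A sketch of that direction-tracking idea, with every name invented and
no claim of SMP-safety for the flag itself:

    typedef enum { DIR_NONE, DIR_READ, DIR_WRITE } DMADir;

    static DMADir dma_last_dir = DIR_NONE;

    static inline void dma_order(DMADir next)
    {
        if (dma_last_dir != DIR_NONE && dma_last_dir != next) {
            smp_mb();      /* direction changed: pay for full ordering */
        } else if (next == DIR_READ) {
            smp_rmb();     /* read after read */
        } else {
            smp_wmb();     /* write after write */
        }
        dma_last_dir = next;
    }

The branch itself is cheap; whether the bookkeeping is worth it is
exactly what would need measuring.)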
>
> > > > > So on that I will not compromise.
> > > > >
> > > > > However, I think it might be better to leave the barriers in the dma
> > > > > accessors since that's how you also get iommu transparency etc., so
> > > > > it's not a bad place to put them, and leave the cpu_physical_* for use
> > > > > by lower-level device drivers, which are thus also responsible for
> > > > > dealing with ordering if they have to.
> > > > >
> > > > > Cheers,
> > > > > Ben.
> > > >
> > > > You claim to understand what matters for all devices. I doubt that.
> > >
> > > It's pretty obvious that anything that does DMA using a classic
> > > descriptor + buffers structure is broken without appropriate ordering.
> > >
> > > And yes, I claim to have a fairly good idea of the problem, but I don't
> > > think throwing credentials around is going to be helpful.
> > >
> > > > Why don't we add safe APIs, then go over devices and switch over?
> > > > I counted 97 pci_dma_ accesses:
> > > > 33 in rtl, 32 in eepro100, 12 in lsi, 7 in e1000.
> > > >
> > > > Let maintainers make a decision where does speed matter.
> > >
> > > No. Let's fix the general bug first. Then let people who know the
> > > individual drivers intimately and understand their access patterns make
> > > the call as to when things can/should be relaxed.
> > >
> > > Ben.
> >
> > As a maintainer of a device, if you send me a patch, I can review it.
> > If you change core APIs, creating performance regressions,
> > I don't even know what to review without wasting time debugging
> > and bisecting.
>
> Well, if you don't know, then you ask on the list and others (such as
> myself or a certain mst) who happen to know those issues will help that
> lone clueless maintainer. Seriously, it's not like it's hard, and a lot
> easier than just keeping everything broken and hoping we get to audit
> them all properly.
>
> > According to what you said, you want to fix kvm on powerpc.
> > Good. Find a way that looks non-intrusive on x86, please.
>
> And ARM. And any other out-of-order arch.
>
> Ben.