* [Qemu-devel] bypassing pci_dma_read() and pci_dma_write() ?
From: Luigi Rizzo @ 2013-01-18 16:04 UTC
  To: qemu-devel; +Cc: Giuseppe Lettieri, v.maffione

Hi,
with a bunch of e1000 improvements we are at a point where we are
doing over 1Mpps (short frames) and 7-8Gbit/s (1500 byte frames)
between two guests, and two things that are high in the "perf top"
stats are phys_page_find() and related memory copies.

Both are triggered by the pci_dma_read() and pci_dma_write(),
which on e1000 (and presumably other frontends) are called on
every single descriptor and every single buffer.

I have then tried to access the guest memory without going every
time through the page lookup.

For the tx and rx rings I have a partial workaround, which tracks changes
to the base address of the ring, converts it to a host virtual address

     ....
    +#ifdef MAP_RING
    +    base = tx_desc_base(s);
    +    if (base != s->txring_phi) {
    +       hwaddr desclen = s->mac_reg[TDLEN];
    +       s->txring_phi = base;
    +       s->txring = address_space_map(pci_dma_context(&s->dev)->as,
    +                       base, &desclen, 0 /* is_write */);
    +    }
    +#endif /* MAP_RING */
     ...

and then accesses the descriptor directly into guest memory

    desc = s->txring[s->mac_reg[TDH]];

(sprinkle with memory barriers as needed).
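
For instance (just a sketch, not part of the patch; it assumes an
smp_rmb()-style read barrier is available in this file):

    /* read the producer index first, then the descriptors it covers;
     * the read barrier keeps the two reads ordered */
    tdt = s->mac_reg[TDT];
    smp_rmb();
    while (s->mac_reg[TDH] != tdt) {
        desc = s->txring[s->mac_reg[TDH]];
        /* ... process desc, advance TDH ... */
    }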

This relies on the assumption that the ring (which is contiguous in the
guest's physical address space) is also contiguous in the host's virtual
address space.  In principle the property could be easily verified once
the ring is set up.

I have not done this for buffers because I am not sure how to verify
that the same mapping holds for all packet buffers.
One way could be the following:
on the first buffer access, make the address translation and try to
determine the boundaries of the contiguous (in virtual host memory)
region that holds the buffer. Then subsequent buffers can be easily
validated against this region.
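
A rough sketch of the first-access probe, using the same mapping call
as above (bufregion_* and region_size are hypothetical names; note this
only learns about contiguity forward from addr, hence the question below):

    /* first buffer: see how much guest memory is contiguous from addr on */
    hwaddr avail = region_size;

    s->bufregion_hva = address_space_map(pci_dma_context(&s->dev)->as,
                                         addr, &avail, 0 /* is_write */);
    s->bufregion_lo = addr;
    s->bufregion_hi = addr + avail;

    /* subsequent buffers are valid iff [addr, addr+len) falls inside
     * [bufregion_lo, bufregion_hi); otherwise fall back to pci_dma_*() */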

So the problem is now the following: given a guest physical
address, is there a quick way to determine the contiguous
region of memory in the host that contains it?

And of course, am I missing some important detail?
The above could be used conditionally if the required
conditions hold, reverting to pci_dma_*()
in the other cases.

cheers
luigi


* Re: [Qemu-devel] bypassing pci_dma_read() and pci_dma_write() ?
From: Paolo Bonzini @ 2013-01-18 16:14 UTC
  To: Luigi Rizzo; +Cc: Giuseppe Lettieri, qemu-devel, v.maffione

On 18/01/2013 17:04, Luigi Rizzo wrote:
> Hi,
> with a bunch of e1000 improvements we are at a point where we are
> doing over 1Mpps (short frames) and 7-8Gbit/s (1500 byte frames)
> between two guests, and two things that are high in the "perf top"
> stats are phys_page_find() and related memory copies.
> 
> Both are triggered by the pci_dma_read() and pci_dma_write(),
> which on e1000 (and presumably other frontends) are called on
> every single descriptor and every single buffer.
> 
> I have then tried to access the guest memory without going every
> time through the page lookup. [...]
> 
> This relies on the assumption that the ring (which is contiguous in the
> guest's physical address space) is also contiguous in the host's virtual
> address space.  In principle the property could be easily verified once
> the ring is set up.

IIRC, the amount of contiguous memory is written by address_space_map in
the plen parameter.

In your case:

>     +       s->txring = address_space_map(pci_dma_context(&s->dev)->as,
>     +                       base, &desclen, 0 /* is_write */);

that would be desclen on return from address_space_map.
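
So the check could look something like this (just a sketch on top of
your snippet; on failure you simply stay on the pci_dma_*() path):

    hwaddr desclen = s->mac_reg[TDLEN];

    s->txring = address_space_map(pci_dma_context(&s->dev)->as,
                                  base, &desclen, 0 /* is_write */);
    if (!s->txring || desclen < s->mac_reg[TDLEN]) {
        /* ring not fully contiguous in host memory: drop the
         * (possibly partial) mapping and keep using pci_dma_read() */
        if (s->txring) {
            address_space_unmap(pci_dma_context(&s->dev)->as, s->txring,
                                desclen, 0 /* is_write */, 0 /* access_len */);
        }
        s->txring = NULL;
    }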

> And of course, am I missing some important detail?

Unfortunately yes.

First, host memory mappings could change (though they rarely do on PC).
 The result of address_space_map is not guaranteed to be stable.  To
avoid problems with this, however, you could use something like
hw/dataplane/hostmem.c and even avoid address_space_map altogether.

Second, pci_dma_*() could have the addresses translated by an
IOMMU.  virtio is documented to have "real" physical memory addresses,
but this does not apply to other devices.

Paolo

> The above could be used conditionally if the required
> conditions hold, reverting to pci_dma_*()
> in the other cases.
> 
> cheers
> luigi
> 
> 


* Re: [Qemu-devel] bypassing pci_dma_read() and pci_dma_write() ?
From: Luigi Rizzo @ 2013-01-18 16:33 UTC
  To: Paolo Bonzini; +Cc: Giuseppe Lettieri, qemu-devel, v.maffione

On Fri, Jan 18, 2013 at 05:14:02PM +0100, Paolo Bonzini wrote:
...
> > This relies on the assumption that the ring (which is contiguous in the
> > guest's physical address space) is also contiguous in the host's virtual
> > address space.  In principle the property could be easily verified once
> > the ring is set up.
> 
> IIRC, the amount of contiguous memory is written by address_space_map in
> the plen parameter.
> 
> In your case:
> 
> >     +       s->txring = address_space_map(pci_dma_context(&s->dev)->as,
> >     +                       base, &desclen, 0 /* is_write */);
> 
> that would be desclen on return from address_space_map.

ok thanks.

> > And of course, am I missing some important detail?
> 
> Unfortunately yes.
> 
> First, host memory mappings could change (though they rarely do on PC).
>  The result of address_space_map is not guaranteed to be stable.  To
> avoid problems with this, however, you could use something like
> hw/dataplane/hostmem.c and even avoid address_space_map altogether.

I'll look into that. Hopefully there is something that I can
use as a notification that the mapping has changed...

> Second, pci_dma_*() could have the addresses translated by an
> IOMMU.  virtio is documented to have "real" physical memory addresses,
> but this does not apply to other devices.

I see. I suppose the ability to have an IOMMU depends on the
specific NIC?  I am only planning to use the above shortcut for
e1000.

thanks a lot for the quick feedback
luigi


* Re: [Qemu-devel] bypassing pci_dma_read() and pci_dma_write() ?
From: Paolo Bonzini @ 2013-01-18 16:49 UTC
  To: Luigi Rizzo; +Cc: Giuseppe Lettieri, qemu-devel, v.maffione

On 18/01/2013 17:33, Luigi Rizzo wrote:
>> > First, host memory mappings could change (though they rarely do on PC).
>> >  The result of address_space_map is not guaranteed to be stable.  To
>> > avoid problems with this, however, you could use something like
>> > hw/dataplane/hostmem.c and even avoid address_space_map altogether.
> I'll look into that. Hopefully there is something that I can
> use as a notification that the mapping has changed...

Yes, that's the MemoryListener interface that hw/dataplane/hostmem.c uses.
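
Something along these lines (a sketch; the memory_listener and
cache_valid fields in E1000State are hypothetical, and the remaining
callbacks still need no-op stubs, as hostmem.c provides):

    static void e1000_map_changed(MemoryListener *listener,
                                  MemoryRegionSection *section)
    {
        E1000State *s = container_of(listener, E1000State, memory_listener);

        s->cache_valid = false;    /* force a fresh translation on next DMA */
    }

    ...

    s->memory_listener.region_add = e1000_map_changed;
    s->memory_listener.region_del = e1000_map_changed;
    memory_listener_register(&s->memory_listener, &address_space_memory);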

>> > Second, pci_dma_*() could have the addresses translated by an
>> > IOMMU.  virtio is documented to have "real" physical memory addresses,
>> > but this does not apply to other devices.
> I see. I suppose the ability to have an IOMMU depends on the
> specific NIC?  I am only planning to use the above shortcut for
> e1000.

It depends on the bus, in this case PCI.

Paolo


* Re: [Qemu-devel] bypassing pci_dma_read() and pci_dma_write() ?
From: Luigi Rizzo @ 2013-01-19  0:58 UTC
  To: Paolo Bonzini; +Cc: Giuseppe Lettieri, qemu-devel, v.maffione

On Fri, Jan 18, 2013 at 05:14:02PM +0100, Paolo Bonzini wrote:
> On 18/01/2013 17:04, Luigi Rizzo wrote:
> > Hi,
> > with a bunch of e1000 improvements we are at a point where we are
> > doing over 1Mpps (short frames) and 7-8Gbit/s (1500 byte frames)
> > between two guests, and two things that are high in the "perf top"
> > stats are phys_page_find() and related memory copies.
> > 
> > Both are triggered by the pci_dma_read() and pci_dma_write(),
> > which on e1000 (and presumably other frontends) are called on
> > every single descriptor and every single buffer.
> > 
> > I have then tried to access the guest memory without going every
> > time through the page lookup. [...]
> > 
> > This relies on the assumption that the ring (which is contiguous in the
> > guest's physical address space) is also contiguous in the host's virtual
> > address space.  In principle the property could be easily verified once
> > the ring is set up.
> 
> IIRC, the amount of contiguous memory is written by address_space_map in
> the plen parameter.

Unfortunately the plen parameter is modified only if the area
is smaller than the request, and there is no method that I can
find that returns the [base, len] of a RAMBlock.

What I came up with, also to check for invalid addresses,
IOMMU and the like, is something like this:

    /* addr is the guest physical address we want to translate to a host
     * virtual address.  Returns 1 and fills in the valid guest range and
     * the gpa->hva offset if direct access is possible, 0 otherwise. */
    int mappable_addr(PCIDevice *dev, hwaddr addr,
                      uint64_t *guest_ha_low, uint64_t *guest_ha_high,
                      uint64_t *gpa_to_hva)
    {
        AddressSpace *as = dev->as;
        AddressSpaceDispatch *d = as->dispatch;
        MemoryRegionSection *section;
        RAMBlock *block;

        if (dma_has_iommu(pci_dma_context(dev))) {
            return 0;   /* addresses go through an IOMMU: no direct access */
        }

        section = phys_page_find(d, addr >> TARGET_PAGE_BITS);
        if (!memory_region_is_ram(section->mr) || section->readonly) {
            return 0;   /* not plain RAM: no direct access */
        }

        /* find the RAMBlock that contains addr */
        QLIST_FOREACH(block, &ram_list.blocks, next) {
            if (addr - block->offset < block->length) {
                /* set 3 variables indicating the valid range
                 * and the offset between the two address spaces. */
                *guest_ha_low  = block->offset;
                *guest_ha_high = block->offset + block->length;
                *gpa_to_hva = (uint64_t)(uintptr_t)block->host - block->offset;
                return 1;
            }
        }
        return 0;
    }

(this probably needs to be put in exec.c or some other place
that can access phys_page_find() and RAMBlock)

The interested client (hw/e1000.c in my case) could then call
memory_listener_register() to be notified of changes,
invoke mappable_addr() on the first data buffer it has to
translate, cache the result, and use it to speed up
subsequent translations in case of a hit (with
pci_dma_read/write as the fallback methods in case
of a miss).

The cache is invalidated on updates arriving from the
memory listener, and refreshed at the next access.
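
In code, the fast path would be more or less (sketch only; the
cache_valid/gpa_* fields are the new cached state, cleared by the
memory listener callback):

    static void e1000_dma_read(E1000State *s, hwaddr addr, void *buf, int len)
    {
        if (!s->cache_valid) {
            s->cache_valid = mappable_addr(&s->dev, addr, &s->gpa_lo,
                                           &s->gpa_hi, &s->gpa_to_hva);
        }
        if (s->cache_valid && addr >= s->gpa_lo && addr + len <= s->gpa_hi) {
            memcpy(buf, (void *)(uintptr_t)(addr + s->gpa_to_hva), len);
        } else {
            pci_dma_read(&s->dev, addr, buf, len);   /* slow path fallback */
        }
    }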

Is this more sound?
The only missing piece then is the call to
invalidate_and_set_dirty() when writing into guest memory directly.
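
For the write (rx) direction that would be roughly (a sketch; whether
invalidate_and_set_dirty() can be called from device code or has to be
exported from exec.c first is exactly the open point):

    memcpy((void *)(uintptr_t)(addr + s->gpa_to_hva), buf, len);
    invalidate_and_set_dirty(addr, len);   /* keep the dirty bitmap in sync */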


cheers
luigi

