From: Jason Wang <jasowang@redhat.com>
To: Peter Xu <peterx@redhat.com>, qemu-devel@nongnu.org
Cc: tianyu.lan@intel.com, kevin.tian@intel.com, mst@redhat.com,
jan.kiszka@siemens.com, alex.williamson@redhat.com,
bd.aviv@gmail.com
Subject: Re: [Qemu-devel] [PATCH RFC v4 15/20] intel_iommu: provide its own replay() callback
Date: Sun, 22 Jan 2017 15:56:10 +0800 [thread overview]
Message-ID: <f2cceaa6-b3cf-9ed1-2c9c-f9222a0ab7ce@redhat.com> (raw)
In-Reply-To: <1484917736-32056-16-git-send-email-peterx@redhat.com>
On 2017-01-20 21:08, Peter Xu wrote:
> The default replay() doesn't work for VT-d, since VT-d has a huge
> default memory region which covers the address range 0-(2^64-1).
> Walking it normally takes a very long time (it looks like a dead loop).
>
> The solution is simple - we don't walk over all the regions. Instead, we
> jump over a region as soon as we find that its page directories are
> empty. That greatly reduces the time needed to walk the whole range.
>
> To achieve this, a page walk helper is provided which invokes a
> corresponding hook function for each page we are interested in.
> vtd_page_walk_level() is the core logic for the page walk. Its
> interface is designed to suit further use cases, e.g., invalidating a
> range of addresses.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
> hw/i386/intel_iommu.c | 216 ++++++++++++++++++++++++++++++++++++++++++++++++--
> hw/i386/trace-events | 7 ++
> include/exec/memory.h | 2 +
> 3 files changed, 220 insertions(+), 5 deletions(-)
>
> diff --git a/hw/i386/intel_iommu.c b/hw/i386/intel_iommu.c
> index 6f5f68a..f9c5142 100644
> --- a/hw/i386/intel_iommu.c
> +++ b/hw/i386/intel_iommu.c
> @@ -598,6 +598,22 @@ static inline uint32_t vtd_get_agaw_from_context_entry(VTDContextEntry *ce)
> return 30 + (ce->hi & VTD_CONTEXT_ENTRY_AW) * 9;
> }
>
> +static inline uint64_t vtd_iova_limit(VTDContextEntry *ce)
> +{
> + uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
> + return 1ULL << MIN(ce_agaw, VTD_MGAW);
> +}
> +
> +/* Return true if IOVA passes range check, otherwise false. */
> +static inline bool vtd_iova_range_check(uint64_t iova, VTDContextEntry *ce)
> +{
> + /*
> + * Check if @iova is above 2^X-1, where X is the minimum of MGAW
> + * in CAP_REG and AW in context-entry.
> + */
> + return !(iova & ~(vtd_iova_limit(ce) - 1));
> +}
> +
> static const uint64_t vtd_paging_entry_rsvd_field[] = {
> [0] = ~0ULL,
> /* For not large page */
> @@ -633,13 +649,9 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
> uint32_t level = vtd_get_level_from_context_entry(ce);
> uint32_t offset;
> uint64_t slpte;
> - uint32_t ce_agaw = vtd_get_agaw_from_context_entry(ce);
> uint64_t access_right_check;
>
> - /* Check if @iova is above 2^X-1, where X is the minimum of MGAW
> - * in CAP_REG and AW in context-entry.
> - */
> - if (iova & ~((1ULL << MIN(ce_agaw, VTD_MGAW)) - 1)) {
> + if (!vtd_iova_range_check(iova, ce)) {
> trace_vtd_err("IOVA exceeds limits");
> return -VTD_FR_ADDR_BEYOND_MGAW;
> }
> @@ -681,6 +693,168 @@ static int vtd_gpa_to_slpte(VTDContextEntry *ce, uint64_t iova, bool is_write,
> }
> }
>
> +typedef int (*vtd_page_walk_hook)(IOMMUTLBEntry *entry, void *private);
> +
> +/**
> + * vtd_page_walk_level - walk over specific level for IOVA range
> + *
> + * @addr: base GPA addr to start the walk
> + * @start: IOVA range start address
> + * @end: IOVA range end address (start <= addr < end)
> + * @hook_fn: hook func to be called when detected page
> + * @private: private data to be passed into hook func
> + * @read: whether parent level has read permission
> + * @write: whether parent level has write permission
> + * @skipped: accumulated skipped ranges
What's the use of this parameter? It looks like it is never used in
this series.
> + * @notify_unmap: whether we should notify invalid entries
> + */
> +static int vtd_page_walk_level(dma_addr_t addr, uint64_t start,
> + uint64_t end, vtd_page_walk_hook hook_fn,
> + void *private, uint32_t level,
> + bool read, bool write, uint64_t *skipped,
> + bool notify_unmap)
> +{
> + bool read_cur, write_cur, entry_valid;
> + uint32_t offset;
> + uint64_t slpte;
> + uint64_t subpage_size, subpage_mask;
> + IOMMUTLBEntry entry;
> + uint64_t iova = start;
> + uint64_t iova_next;
> + uint64_t skipped_local = 0;
> + int ret = 0;
> +
> + trace_vtd_page_walk_level(addr, level, start, end);
> +
> + subpage_size = 1ULL << vtd_slpt_level_shift(level);
> + subpage_mask = vtd_slpt_level_page_mask(level);
> +
> + while (iova < end) {
> + iova_next = (iova & subpage_mask) + subpage_size;
> +
> + offset = vtd_iova_level_offset(iova, level);
> + slpte = vtd_get_slpte(addr, offset);
> +
> + /*
> + * When one of the following case happens, we assume the whole
> + * range is invalid:
> + *
> + * 1. read block failed
I don't get the meaning of this (and I don't see any code related to it).
> + * 3. reserved area non-zero
> + * 2. both read & write flag are not set
Should these be numbered 1, 2, 3? Also, the comment above conflicts with
the code, at least when notify_unmap is true.
> + */
> +
> + if (slpte == (uint64_t)-1) {
If this is true, vtd_slpte_nonzero_rsvd(slpte) should be true too, I think?
> + trace_vtd_page_walk_skip_read(iova, iova_next);
> + skipped_local++;
> + goto next;
> + }
> +
> + if (vtd_slpte_nonzero_rsvd(slpte, level)) {
> + trace_vtd_page_walk_skip_reserve(iova, iova_next);
> + skipped_local++;
> + goto next;
> + }
> +
> + /* Permissions are stacked with parents' */
> + read_cur = read && (slpte & VTD_SL_R);
> + write_cur = write && (slpte & VTD_SL_W);
> +
> + /*
> + * As long as we have either read/write permission, this is
> + * a valid entry. The rule works for both page or page tables.
> + */
> + entry_valid = read_cur | write_cur;
> +
> + if (vtd_is_last_slpte(slpte, level)) {
> + entry.target_as = &address_space_memory;
> + entry.iova = iova & subpage_mask;
> + /*
> + * This might be meaningless addr if (!read_cur &&
> + * !write_cur), but after all this field will be
> + * meaningless in that case, so let's share the code to
> + * generate the IOTLBs no matter it's an MAP or UNMAP
> + */
> + entry.translated_addr = vtd_get_slpte_addr(slpte);
> + entry.addr_mask = ~subpage_mask;
> + entry.perm = IOMMU_ACCESS_FLAG(read_cur, write_cur);
> + if (!entry_valid && !notify_unmap) {
> + trace_vtd_page_walk_skip_perm(iova, iova_next);
> + skipped_local++;
> + goto next;
> + }
In which case do we care about unmap here? (A comment would help, I
think.)
> + trace_vtd_page_walk_one(level, entry.iova, entry.translated_addr,
> + entry.addr_mask, entry.perm);
> + if (hook_fn) {
> + ret = hook_fn(&entry, private);
For better performance, we could try to merge adjacent mappings here. I
think both vfio and vhost support this and it can save a lot of ioctls.
> + if (ret < 0) {
> + error_report("Detected error in page walk hook "
> + "function, stop walk.");
> + return ret;
> + }
> + }
> + } else {
> + if (!entry_valid) {
> + trace_vtd_page_walk_skip_perm(iova, iova_next);
> + skipped_local++;
> + goto next;
> + }
> + trace_vtd_page_walk_level(vtd_get_slpte_addr(slpte), level - 1,
> + iova, MIN(iova_next, end));
This looks duplicated?
> + ret = vtd_page_walk_level(vtd_get_slpte_addr(slpte), iova,
> + MIN(iova_next, end), hook_fn, private,
> + level - 1, read_cur, write_cur,
> + &skipped_local, notify_unmap);
> + if (ret < 0) {
> + error_report("Detected page walk error on addr 0x%"PRIx64
> + " level %"PRIu32", stop walk.", addr, level - 1);
This is guest-triggerable, so a debug macro or tracepoint would be
better than error_report().
> + return ret;
> + }
> + }
> +
> +next:
> + iova = iova_next;
> + }
> +
> + if (skipped) {
> + *skipped += skipped_local;
> + }
> +
> + return 0;
> +}
> +
> +/**
> + * vtd_page_walk - walk specific IOVA range, and call the hook
> + *
> + * @ce: context entry to walk upon
> + * @start: IOVA address to start the walk
> + * @end: IOVA range end address (start <= addr < end)
> + * @hook_fn: the hook that to be called for each detected area
> + * @private: private data for the hook function
> + */
> +static int vtd_page_walk(VTDContextEntry *ce, uint64_t start, uint64_t end,
> + vtd_page_walk_hook hook_fn, void *private)
> +{
> + dma_addr_t addr = vtd_get_slpt_base_from_context(ce);
> + uint32_t level = vtd_get_level_from_context_entry(ce);
> +
> + if (!vtd_iova_range_check(start, ce)) {
> + error_report("IOVA start 0x%"PRIx64 " end 0x%"PRIx64" exceeds limits",
> + start, end);
Also guest-triggerable, so a debug macro or tracepoint would be better
here as well.
> + return -VTD_FR_ADDR_BEYOND_MGAW;
> + }
> +
> + if (!vtd_iova_range_check(end, ce)) {
> + /* Fix end so that it reaches the maximum */
> + end = vtd_iova_limit(ce);
> + }
> +
> + trace_vtd_page_walk_level(addr, level, start, end);
This duplicates the tracepoint at the top of vtd_page_walk_level(),
doesn't it?
> +
> + return vtd_page_walk_level(addr, start, end, hook_fn, private,
> + level, true, true, NULL, false);
> +}
> +
> /* Map a device to its corresponding domain (context-entry) */
> static int vtd_dev_to_context_entry(IntelIOMMUState *s, uint8_t bus_num,
> uint8_t devfn, VTDContextEntry *ce)
> @@ -2395,6 +2569,37 @@ VTDAddressSpace *vtd_find_add_as(IntelIOMMUState *s, PCIBus *bus, int devfn)
> return vtd_dev_as;
> }
>
> +static int vtd_replay_hook(IOMMUTLBEntry *entry, void *private)
> +{
> + memory_region_notify_one((IOMMUNotifier *)private, entry);
> + return 0;
> +}
> +
> +static void vtd_iommu_replay(MemoryRegion *mr, IOMMUNotifier *n)
> +{
> + VTDAddressSpace *vtd_as = container_of(mr, VTDAddressSpace, iommu);
> + IntelIOMMUState *s = vtd_as->iommu_state;
> + uint8_t bus_n = pci_bus_num(vtd_as->bus);
> + VTDContextEntry ce;
> +
> + if (vtd_dev_to_context_entry(s, bus_n, vtd_as->devfn, &ce) == 0) {
> + /*
> + * Scanned a valid context entry, walk over the pages and
> + * notify when needed.
> + */
> + trace_vtd_replay_ce_valid(bus_n, PCI_SLOT(vtd_as->devfn),
> + PCI_FUNC(vtd_as->devfn),
> + VTD_CONTEXT_ENTRY_DID(ce.hi),
> + ce.hi, ce.lo);
> + vtd_page_walk(&ce, 0, ~0, vtd_replay_hook, (void *)n);
~0ULL?
> + } else {
> + trace_vtd_replay_ce_invalid(bus_n, PCI_SLOT(vtd_as->devfn),
> + PCI_FUNC(vtd_as->devfn));
> + }
> +
> + return;
> +}
> +
> /* Do the initialization. It will also be called when reset, so pay
> * attention when adding new initialization stuff.
> */
> @@ -2409,6 +2614,7 @@ static void vtd_init(IntelIOMMUState *s)
>
> s->iommu_ops.translate = vtd_iommu_translate;
> s->iommu_ops.notify_flag_changed = vtd_iommu_notify_flag_changed;
> + s->iommu_ops.replay = vtd_iommu_replay;
> s->root = 0;
> s->root_extended = false;
> s->dmar_enabled = false;
> diff --git a/hw/i386/trace-events b/hw/i386/trace-events
> index a273980..a3e1a9d 100644
> --- a/hw/i386/trace-events
> +++ b/hw/i386/trace-events
> @@ -31,6 +31,13 @@ vtd_iotlb_page_update(uint16_t sid, uint64_t addr, uint64_t slpte, uint16_t doma
> vtd_iotlb_cc_hit(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen) "IOTLB context hit bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32
> vtd_iotlb_cc_update(uint8_t bus, uint8_t devfn, uint64_t high, uint64_t low, uint32_t gen1, uint32_t gen2) "IOTLB context update bus 0x%"PRIx8" devfn 0x%"PRIx8" high 0x%"PRIx64" low 0x%"PRIx64" gen %"PRIu32" -> gen %"PRIu32
> vtd_iotlb_reset(const char *reason) "IOTLB reset (reason: %s)"
> +vtd_replay_ce_valid(uint8_t bus, uint8_t dev, uint8_t fn, uint16_t domain, uint64_t hi, uint64_t lo) "replay valid context device %02"PRIx8":%02"PRIx8".%02"PRIx8" domain 0x%"PRIx16" hi 0x%"PRIx64" lo 0x%"PRIx64
> +vtd_replay_ce_invalid(uint8_t bus, uint8_t dev, uint8_t fn) "replay invalid context device %02"PRIx8":%02"PRIx8".%02"PRIx8
> +vtd_page_walk_level(uint64_t addr, uint32_t level, uint64_t start, uint64_t end) "walk (base=0x%"PRIx64", level=%"PRIu32") iova range 0x%"PRIx64" - 0x%"PRIx64
> +vtd_page_walk_one(uint32_t level, uint64_t iova, uint64_t gpa, uint64_t mask, int perm) "detected page level 0x%"PRIx32" iova 0x%"PRIx64" -> gpa 0x%"PRIx64" mask 0x%"PRIx64" perm %d"
> +vtd_page_walk_skip_read(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to unable to read"
> +vtd_page_walk_skip_perm(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to perm empty"
> +vtd_page_walk_skip_reserve(uint64_t iova, uint64_t next) "Page walk skip iova 0x%"PRIx64" - 0x%"PRIx64" due to rsrv set"
>
> # hw/i386/amd_iommu.c
> amdvi_evntlog_fail(uint64_t addr, uint32_t head) "error: fail to write at addr 0x%"PRIx64" + offset 0x%"PRIx32
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index bb4e654..9fd3232 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -59,6 +59,8 @@ typedef enum {
> IOMMU_RW = 3,
> } IOMMUAccessFlags;
>
> +#define IOMMU_ACCESS_FLAG(r, w) (((r) ? IOMMU_RO : 0) | ((w) ? IOMMU_WO : 0))
> +
> struct IOMMUTLBEntry {
> AddressSpace *target_as;
> hwaddr iova;