* Re: [PATCH] intel-iommu: Fix leaks in pagetable freeing
2013-06-15 16:27 ` Alex Williamson
@ 2013-07-24 15:25 ` Alex Williamson
0 siblings, 0 replies; 12+ messages in thread
From: Alex Williamson @ 2013-07-24 15:25 UTC (permalink / raw)
To: dwmw2; +Cc: iommu, ddutile, linux-kernel, kvm
This is a pretty massive memory leak, anyone @Intel care? Thanks,

Alex

On Sat, 2013-06-15 at 10:27 -0600, Alex Williamson wrote:
> At best the current code only seems to free the leaf pagetables and
> the root. If you're unlucky enough to have a large gap (like any
> QEMU guest with more than 3G of memory), only the first chunk of leaf
> pagetables are freed (plus the root). This is a massive memory leak.
> This patch re-writes the pagetable freeing function to use a
> recursive algorithm and manages to not only free all the pagetables,
> but does it without any apparent performance loss versus the current
> broken version.
>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> Cc: stable@vger.kernel.org
> ---
>
> Suggesting for stable, would like to see some soak time, but it's
> hard to imagine this being any worse than the current code.
>
> This likely also affects device domains, but the current code does
> ok at freeing individual leaf pagetables and driver domains would
> only get a full pruning if the driver or device is removed.
>
> Some test programs:
> https://github.com/awilliam/tests/blob/master/kvm-huge-guest-test.c
> https://github.com/awilliam/tests/blob/master/vfio-huge-guest-test.c
>
> Both of these simulate a large guest on a small host system. They
> mmap 4G of memory and map it across a large address space just like
> QEMU would (aside from re-using the same mmap across multiple IOVAs).
> On existing code the vfio version (w/o a KVM memory slot limit) will
> leak over 1G of pagetables per run.
>
> drivers/iommu/intel-iommu.c | 72 +++++++++++++++++++++----------------------
> 1 file changed, 35 insertions(+), 37 deletions(-)
>
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index eec0d3e..15e9b57 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -890,56 +890,54 @@ static int dma_pte_clear_range(struct dmar_domain *domain,
> return order;
> }
>
> +static void dma_pte_free_level(struct dmar_domain *domain, int level,
> + struct dma_pte *pte, unsigned long pfn,
> + unsigned long start_pfn, unsigned long last_pfn)
> +{
> + pfn = max(start_pfn, pfn);
> + pte = &pte[pfn_level_offset(pfn, level)];
> +
> + do {
> + unsigned long level_pfn;
> + struct dma_pte *level_pte;
> +
> + if (!dma_pte_present(pte) || dma_pte_superpage(pte))
> + goto next;
> +
> + level_pfn = pfn & level_mask(level - 1);
> + level_pte = phys_to_virt(dma_pte_addr(pte));
> +
> + if (level > 2)
> + dma_pte_free_level(domain, level - 1, level_pte,
> + level_pfn, start_pfn, last_pfn);
> +
> + /* If range covers entire pagetable, free it */
> + if (!(start_pfn > level_pfn ||
> + last_pfn < level_pfn + level_size(level))) {
> + dma_clear_pte(pte);
> + domain_flush_cache(domain, pte, sizeof(*pte));
> + free_pgtable_page(level_pte);
> + }
> +next:
> + pfn += level_size(level);
> + } while (!first_pte_in_page(++pte) && pfn <= last_pfn);
> +}
> +
> /* free page table pages. last level pte should already be cleared */
> static void dma_pte_free_pagetable(struct dmar_domain *domain,
> unsigned long start_pfn,
> unsigned long last_pfn)
> {
> int addr_width = agaw_to_width(domain->agaw) - VTD_PAGE_SHIFT;
> - struct dma_pte *first_pte, *pte;
> - int total = agaw_to_level(domain->agaw);
> - int level;
> - unsigned long tmp;
> - int large_page = 2;
>
> BUG_ON(addr_width < BITS_PER_LONG && start_pfn >> addr_width);
> BUG_ON(addr_width < BITS_PER_LONG && last_pfn >> addr_width);
> BUG_ON(start_pfn > last_pfn);
>
> /* We don't need lock here; nobody else touches the iova range */
> - level = 2;
> - while (level <= total) {
> - tmp = align_to_level(start_pfn, level);
> -
> - /* If we can't even clear one PTE at this level, we're done */
> - if (tmp + level_size(level) - 1 > last_pfn)
> - return;
> -
> - do {
> - large_page = level;
> - first_pte = pte = dma_pfn_level_pte(domain, tmp, level, &large_page);
> - if (large_page > level)
> - level = large_page + 1;
> - if (!pte) {
> - tmp = align_to_level(tmp + 1, level + 1);
> - continue;
> - }
> - do {
> - if (dma_pte_present(pte)) {
> - free_pgtable_page(phys_to_virt(dma_pte_addr(pte)));
> - dma_clear_pte(pte);
> - }
> - pte++;
> - tmp += level_size(level);
> - } while (!first_pte_in_page(pte) &&
> - tmp + level_size(level) - 1 <= last_pfn);
> + dma_pte_free_level(domain, agaw_to_level(domain->agaw),
> + domain->pgd, 0, start_pfn, last_pfn);
>
> - domain_flush_cache(domain, first_pte,
> - (void *)pte - (void *)first_pte);
> -
> - } while (tmp && tmp + level_size(level) - 1 <= last_pfn);
> - level++;
> - }
> /* free pgd */
> if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw)) {
> free_pgtable_page(domain->pgd);
>
^ permalink raw reply [flat|nested] 12+ messages in thread
[parent not found: <1374679519.1675.1.camel-85EaTFmN5p//9pzu0YdTqQ@public.gmane.org>]
* Re: [PATCH] intel-iommu: Fix leaks in pagetable freeing
2013-07-24 15:25 ` Alex Williamson
@ 2013-08-06 16:08 ` Marcelo Tosatti
0 siblings, 0 replies; 12+ messages in thread
From: Marcelo Tosatti @ 2013-08-06 16:08 UTC (permalink / raw)
To: Alex Williamson; +Cc: dwmw2, iommu, ddutile, linux-kernel, kvm
On Wed, Jul 24, 2013 at 09:25:19AM -0600, Alex Williamson wrote:
>
> This is a pretty massive memory leak, anyone @Intel care? Thanks,
>
> Alex
>
> On Sat, 2013-06-15 at 10:27 -0600, Alex Williamson wrote:
> > At best the current code only seems to free the leaf pagetables and
> > the root. If you're unlucky enough to have a large gap (like any
> > QEMU guest with more than 3G of memory), only the first chunk of leaf
> > pagetables are freed (plus the root). This is a massive memory leak.
> > This patch re-writes the pagetable freeing function to use a
> > recursive algorithm and manages to not only free all the pagetables,
> > but does it without any apparent performance loss versus the current
> > broken version.
> >
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > Cc: stable@vger.kernel.org
> > ---
> >
> > Suggesting for stable, would like to see some soak time, but it's
> > hard to imagine this being any worse than the current code.
> >
> > This likely also affects device domains, but the current code does
> > ok at freeing individual leaf pagetables and driver domains would
> > only get a full pruning if the driver or device is removed.
> >
> > Some test programs:
> > https://github.com/awilliam/tests/blob/master/kvm-huge-guest-test.c
> > https://github.com/awilliam/tests/blob/master/vfio-huge-guest-test.c
> >
> > Both of these simulate a large guest on a small host system. They
> > mmap 4G of memory and map it across a large address space just like
> > QEMU would (aside from re-using the same mmap across multiple IOVAs).
> > On existing code the vfio version (w/o a KVM memory slot limit) will
> > leak over 1G of pagetables per run.
> >
> > drivers/iommu/intel-iommu.c | 72 +++++++++++++++++++++----------------------
> > 1 file changed, 35 insertions(+), 37 deletions(-)
> >
> > diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> > index eec0d3e..15e9b57 100644
> > --- a/drivers/iommu/intel-iommu.c
> > +++ b/drivers/iommu/intel-iommu.c
> > @@ -890,56 +890,54 @@ static int dma_pte_clear_range(struct dmar_domain *domain,
> > return order;
> > }
> >
> > +static void dma_pte_free_level(struct dmar_domain *domain, int level,
> > + struct dma_pte *pte, unsigned long pfn,
> > + unsigned long start_pfn, unsigned long last_pfn)
> > +{
> > + pfn = max(start_pfn, pfn);
> > + pte = &pte[pfn_level_offset(pfn, level)];
> > +
> > + do {
> > + unsigned long level_pfn;
> > + struct dma_pte *level_pte;
> > +
> > + if (!dma_pte_present(pte) || dma_pte_superpage(pte))
> > + goto next;
> > +
> > + level_pfn = pfn & level_mask(level - 1);
> > + level_pte = phys_to_virt(dma_pte_addr(pte));
> > +
> > + if (level > 2)
> > + dma_pte_free_level(domain, level - 1, level_pte,
> > + level_pfn, start_pfn, last_pfn);
> > +
> > + /* If range covers entire pagetable, free it */
> > + if (!(start_pfn > level_pfn ||
> > + last_pfn < level_pfn + level_size(level))) {
> > + dma_clear_pte(pte);
> > + domain_flush_cache(domain, pte, sizeof(*pte));
> > + free_pgtable_page(level_pte);
> > + }
> > +next:
> > + pfn += level_size(level);
> > + } while (!first_pte_in_page(++pte) && pfn <= last_pfn);
> > +}
> > +
> > /* free page table pages. last level pte should already be cleared */
> > static void dma_pte_free_pagetable(struct dmar_domain *domain,
> > unsigned long start_pfn,
> > unsigned long last_pfn)
> > {
> > int addr_width = agaw_to_width(domain->agaw) - VTD_PAGE_SHIFT;
> > - struct dma_pte *first_pte, *pte;
> > - int total = agaw_to_level(domain->agaw);
> > - int level;
> > - unsigned long tmp;
> > - int large_page = 2;
> >
> > BUG_ON(addr_width < BITS_PER_LONG && start_pfn >> addr_width);
> > BUG_ON(addr_width < BITS_PER_LONG && last_pfn >> addr_width);
> > BUG_ON(start_pfn > last_pfn);
> >
> > /* We don't need lock here; nobody else touches the iova range */
> > - level = 2;
> > - while (level <= total) {
> > - tmp = align_to_level(start_pfn, level);
> > -
> > - /* If we can't even clear one PTE at this level, we're done */
> > - if (tmp + level_size(level) - 1 > last_pfn)
> > - return;
> > -
> > - do {
> > - large_page = level;
> > - first_pte = pte = dma_pfn_level_pte(domain, tmp, level, &large_page);
> > - if (large_page > level)
> > - level = large_page + 1;
> > - if (!pte) {
> > - tmp = align_to_level(tmp + 1, level + 1);
> > - continue;
> > - }
> > - do {
> > - if (dma_pte_present(pte)) {
> > - free_pgtable_page(phys_to_virt(dma_pte_addr(pte)));
> > - dma_clear_pte(pte);
> > - }
> > - pte++;
> > - tmp += level_size(level);
> > - } while (!first_pte_in_page(pte) &&
> > - tmp + level_size(level) - 1 <= last_pfn);
> > + dma_pte_free_level(domain, agaw_to_level(domain->agaw),
> > + domain->pgd, 0, start_pfn, last_pfn);
> >
> > - domain_flush_cache(domain, first_pte,
> > - (void *)pte - (void *)first_pte);
> > -
> > - } while (tmp && tmp + level_size(level) - 1 <= last_pfn);
> > - level++;
> > - }
> > /* free pgd */
> > if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw)) {
> > free_pgtable_page(domain->pgd);
> >
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
^ permalink raw reply [flat|nested] 12+ messages in thread
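For readers following the thread, the recursive teardown described in the commit message can be illustrated with a small user-space toy. This is only a sketch under stated assumptions, not the kernel code: the table layout (4 entries per table, 3 levels), `struct table`, `populate()`, `free_level()` and `leak_demo()` are all hypothetical names made up for the illustration, and plain `calloc()`/`free()` stand in for the kernel's page table allocator. The point it shows is that visiting children before releasing their parent frees every intermediate table, even when the mapped regions are separated by a large hole.

```c
#include <assert.h>
#include <stdlib.h>

#define STRIDE   2                     /* toy value: 4 entries per table */
#define ENTRIES  (1UL << STRIDE)

/* One toy page table. A NULL child means "entry not present". */
struct table {
	struct table *child[ENTRIES];
};

static int live_tables;                /* tables allocated and not yet freed */

static struct table *table_alloc(void)
{
	struct table *t = calloc(1, sizeof(*t));

	if (!t)
		abort();
	live_tables++;
	return t;
}

/* Number of pfns translated by one entry of a table at this level. */
static unsigned long level_size(int level)
{
	return 1UL << ((level - 1) * STRIDE);
}

/* Allocate the chain of tables needed to reach the leaf table for pfn. */
static void populate(struct table *root, int levels, unsigned long pfn)
{
	struct table *t = root;
	int level;

	for (level = levels; level > 1; level--) {
		unsigned long i = (pfn / level_size(level)) % ENTRIES;

		if (!t->child[i])
			t->child[i] = table_alloc();
		t = t->child[i];
	}
}

/*
 * Free every table below t whose whole span lies inside [start, last]
 * (inclusive). base is the first pfn translated by t. Children are
 * visited before their parent is released, so nothing is orphaned.
 */
static void free_level(struct table *t, int level, unsigned long base,
		       unsigned long start, unsigned long last)
{
	unsigned long i;

	assert(level >= 2);            /* leaf tables hold no child pointers */
	for (i = 0; i < ENTRIES; i++) {
		unsigned long first = base + i * level_size(level);
		unsigned long end = first + level_size(level) - 1;
		struct table *c = t->child[i];

		if (!c || end < start || first > last)
			continue;
		if (level > 2)
			free_level(c, level - 1, first, start, last);
		if (start <= first && end <= last) {
			t->child[i] = NULL;
			free(c);
			live_tables--;
		}
	}
}

/* Build two mapped regions separated by a hole, then tear everything down. */
int leak_demo(int *before)
{
	struct table *root = table_alloc();
	unsigned long pfn;
	int after;

	for (pfn = 0; pfn < 8; pfn++)
		populate(root, 3, pfn);
	for (pfn = 48; pfn < 64; pfn++)
		populate(root, 3, pfn);
	*before = live_tables;
	free_level(root, 3, 0, 0, 63);
	after = live_tables;           /* only the root should remain */
	free(root);
	live_tables--;
	return after;
}
```

Calling `leak_demo()` from a test harness reports how many tables existed before the teardown and how many survive it. Unlike the patch, the toy scans every entry of a table and skips the ones outside the range, which keeps the sketch short at the cost of a few extra comparisons.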
* Re: [PATCH] intel-iommu: Fix leaks in pagetable freeing
2013-06-15 16:27 ` Alex Williamson
@ 2013-08-14 20:23 ` Joerg Roedel
0 siblings, 0 replies; 12+ messages in thread
From: Joerg Roedel @ 2013-08-14 20:23 UTC (permalink / raw)
To: Alex Williamson; +Cc: dwmw2, iommu, linux-kernel, kvm
On Sat, Jun 15, 2013 at 10:27:19AM -0600, Alex Williamson wrote:
> At best the current code only seems to free the leaf pagetables and
> the root. If you're unlucky enough to have a large gap (like any
> QEMU guest with more than 3G of memory), only the first chunk of leaf
> pagetables are freed (plus the root). This is a massive memory leak.
> This patch re-writes the pagetable freeing function to use a
> recursive algorithm and manages to not only free all the pagetables,
> but does it without any apparent performance loss versus the current
> broken version.
>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> Cc: stable@vger.kernel.org

Applied to iommu/fixes, thanks Alex. Will send this for v3.11 after a
couple of days in next.
^ permalink raw reply [flat|nested] 12+ messages in thread
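To put rough numbers on the page tables the commit message talks about, here is a small arithmetic sketch. It assumes 4 KiB pages and 512 entries per table (a 9-bit stride per level); `level_to_offset_bits()` and `pfn_level_offset()` are re-implemented here from how the patch appears to use them and should be treated as assumptions rather than the driver's definitions, while `pfn_t` and `tables_at_level()` are hypothetical names for this illustration only.

```c
#include <assert.h>

#define STRIDE 9                       /* assumed: 512 entries per table */

typedef unsigned long long pfn_t;      /* hypothetical type for this sketch */

/* Bits of the pfn that sit below the index for this level. */
static unsigned int level_to_offset_bits(int level)
{
	return (level - 1) * STRIDE;
}

/* Index of the entry that translates pfn in a table at this level. */
static unsigned int pfn_level_offset(pfn_t pfn, int level)
{
	return (pfn >> level_to_offset_bits(level)) & ((1U << STRIDE) - 1);
}

/* Number of level-N tables touched by the pfn range [first, last]. */
static pfn_t tables_at_level(pfn_t first, pfn_t last, int level)
{
	unsigned int span_bits = level * STRIDE; /* pfn bits one table spans */

	assert(first <= last);
	return (last >> span_bits) - (first >> span_bits) + 1;
}
```

Under those assumptions, a 4 GiB region placed at IOVA 4 GiB covers pfns 0x100000 to 0x1fffff. For that range `tables_at_level()` gives 2048 level-1 (leaf) tables, 4 level-2 tables and 1 level-3 table, and `pfn_level_offset(0x100000, 3)` is 4, so the region starts at the fifth entry of its level-3 table. Every one of those tables has to be released when the range is torn down.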
* Re: [PATCH] intel-iommu: Fix leaks in pagetable freeing
2013-06-15 16:27 ` Alex Williamson
@ 2013-10-02 8:44 ` Borislav Petkov
0 siblings, 0 replies; 12+ messages in thread
From: Borislav Petkov @ 2013-10-02 8:44 UTC (permalink / raw)
To: Alex Williamson; +Cc: dwmw2, iommu, ddutile, linux-kernel, kvm, stable
On Sat, Jun 15, 2013 at 10:27:19AM -0600, Alex Williamson wrote:
> At best the current code only seems to free the leaf pagetables and
> the root. If you're unlucky enough to have a large gap (like any
> QEMU guest with more than 3G of memory), only the first chunk of leaf
> pagetables are freed (plus the root). This is a massive memory leak.
> This patch re-writes the pagetable freeing function to use a
> recursive algorithm and manages to not only free all the pagetables,
> but does it without any apparent performance loss versus the current
> broken version.
>
> Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> Cc: stable@vger.kernel.org
> ---
>
> Suggesting for stable, would like to see some soak time, but it's
> hard to imagine this being any worse than the current code.

Btw, I have a backport for the 3.0.x series which builds fine here, in
case you guys are interested :)
--
From: Alex Williamson <alex.williamson@redhat.com>
Date: Sat, 15 Jun 2013 10:27:19 -0600
Subject: [PATCH] intel-iommu: Fix leaks in pagetable freeing
upstream commit: 3269ee0bd6686baf86630300d528500ac5b516d7
At best the current code only seems to free the leaf pagetables and
the root. If you're unlucky enough to have a large gap (like any
QEMU guest with more than 3G of memory), only the first chunk of leaf
pagetables are freed (plus the root). This is a massive memory leak.
This patch re-writes the pagetable freeing function to use a
recursive algorithm and manages to not only free all the pagetables,
but does it without any apparent performance loss versus the current
broken version.
Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
Reviewed-by: Marcelo Tosatti <mtosatti@redhat.com>
Signed-off-by: Joerg Roedel <joro@8bytes.org>
Signed-off-by: Borislav Petkov <bp@suse.de>
---
drivers/pci/intel-iommu.c | 72 +++++++++++++++++++++++------------------------
1 file changed, 35 insertions(+), 37 deletions(-)
diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index ae762ecc658b..68baf178cede 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -853,56 +853,54 @@ static int dma_pte_clear_range(struct dmar_domain *domain,
 	return order;
 }

+static void dma_pte_free_level(struct dmar_domain *domain, int level,
+			       struct dma_pte *pte, unsigned long pfn,
+			       unsigned long start_pfn, unsigned long last_pfn)
+{
+	pfn = max(start_pfn, pfn);
+	pte = &pte[pfn_level_offset(pfn, level)];
+
+	do {
+		unsigned long level_pfn;
+		struct dma_pte *level_pte;
+
+		if (!dma_pte_present(pte) || dma_pte_superpage(pte))
+			goto next;
+
+		level_pfn = pfn & level_mask(level - 1);
+		level_pte = phys_to_virt(dma_pte_addr(pte));
+
+		if (level > 2)
+			dma_pte_free_level(domain, level - 1, level_pte,
+					   level_pfn, start_pfn, last_pfn);
+
+		/* If range covers entire pagetable, free it */
+		if (!(start_pfn > level_pfn ||
+		      last_pfn < level_pfn + level_size(level))) {
+			dma_clear_pte(pte);
+			domain_flush_cache(domain, pte, sizeof(*pte));
+			free_pgtable_page(level_pte);
+		}
+next:
+		pfn += level_size(level);
+	} while (!first_pte_in_page(++pte) && pfn <= last_pfn);
+}
+
 /* free page table pages. last level pte should already be cleared */
 static void dma_pte_free_pagetable(struct dmar_domain *domain,
 				   unsigned long start_pfn,
 				   unsigned long last_pfn)
 {
 	int addr_width = agaw_to_width(domain->agaw) - VTD_PAGE_SHIFT;
-	struct dma_pte *first_pte, *pte;
-	int total = agaw_to_level(domain->agaw);
-	int level;
-	unsigned long tmp;
-	int large_page = 2;

 	BUG_ON(addr_width < BITS_PER_LONG && start_pfn >> addr_width);
 	BUG_ON(addr_width < BITS_PER_LONG && last_pfn >> addr_width);
 	BUG_ON(start_pfn > last_pfn);

 	/* We don't need lock here; nobody else touches the iova range */
-	level = 2;
-	while (level <= total) {
-		tmp = align_to_level(start_pfn, level);
-
-		/* If we can't even clear one PTE at this level, we're done */
-		if (tmp + level_size(level) - 1 > last_pfn)
-			return;
-
-		do {
-			large_page = level;
-			first_pte = pte = dma_pfn_level_pte(domain, tmp, level, &large_page);
-			if (large_page > level)
-				level = large_page + 1;
-			if (!pte) {
-				tmp = align_to_level(tmp + 1, level + 1);
-				continue;
-			}
-			do {
-				if (dma_pte_present(pte)) {
-					free_pgtable_page(phys_to_virt(dma_pte_addr(pte)));
-					dma_clear_pte(pte);
-				}
-				pte++;
-				tmp += level_size(level);
-			} while (!first_pte_in_page(pte) &&
-				 tmp + level_size(level) - 1 <= last_pfn);
+	dma_pte_free_level(domain, agaw_to_level(domain->agaw),
+			   domain->pgd, 0, start_pfn, last_pfn);

-			domain_flush_cache(domain, first_pte,
-					   (void *)pte - (void *)first_pte);
-
-		} while (tmp && tmp + level_size(level) - 1 <= last_pfn);
-		level++;
-	}
 	/* free pgd */
 	if (start_pfn == 0 && last_pfn == DOMAIN_MAX_PFN(domain->gaw)) {
 		free_pgtable_page(domain->pgd);
--
1.8.4
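
For anyone who wants to poke at the recursive idea outside the kernel, here is a
rough user-space sketch. It is not the driver code: struct table, entry_span(),
entry_index(), free_level(), populate() and demo_teardown() are all made-up
names, and the 9-bit stride and 3-level depth are assumptions picked to keep the
model small. The coverage test is written for this model's inclusive last PFN,
so it is not copied from the comparison in the patch.

```c
#include <assert.h>
#include <stdlib.h>

#define STRIDE  9			/* index bits per level (assumed) */
#define ENTRIES (1UL << STRIDE)

/* Hypothetical stand-in for a pagetable page; NULL slot = not present. */
struct table { struct table *slot[ENTRIES]; };

static unsigned long freed;		/* tables released so far */

/* PFNs spanned by one entry at `level`; level 1 entries map single pages. */
static unsigned long entry_span(int level)
{
	return 1UL << ((level - 1) * STRIDE);
}

static unsigned long entry_index(unsigned long pfn, int level)
{
	return (pfn >> ((level - 1) * STRIDE)) & (ENTRIES - 1);
}

/*
 * Walk `tbl`, a table at `level` whose first entry maps PFN `base`, and
 * free every lower-level table that lies entirely inside [start, last].
 */
static void free_level(struct table *tbl, int level, unsigned long base,
		       unsigned long start, unsigned long last)
{
	unsigned long span = entry_span(level);
	unsigned long i = (start > base) ? entry_index(start, level) : 0;

	for (; i < ENTRIES; i++) {
		unsigned long first = base + i * span;
		struct table *child = tbl->slot[i];

		if (first > last)
			break;
		if (!child)
			continue;
		/* Children of a level 2 table are leaf tables: no recursion. */
		if (level > 2)
			free_level(child, level - 1, first, start, last);
		/* Free the child only if the range covers all of it. */
		if (start <= first && first + span - 1 <= last) {
			tbl->slot[i] = NULL;
			free(child);
			freed++;
		}
	}
}

/* Allocate tables down to the leaf table that would hold `pfn`. */
static void populate(struct table *root, int levels, unsigned long pfn)
{
	struct table *t = root;
	int level;

	assert(levels >= 2);
	for (level = levels; level >= 2; level--) {
		struct table **slot = &t->slot[entry_index(pfn, level)];

		if (!*slot && !(*slot = calloc(1, sizeof(**slot))))
			abort();
		t = *slot;
	}
}

/*
 * Build a sparse 3-level tree (PFN 0 and the 4G mark), free [start, last],
 * and return how many tables that pass released. A second full-range pass
 * cleans up whatever the first pass had to leave behind.
 */
unsigned long demo_teardown(unsigned long start, unsigned long last)
{
	const int levels = 3;
	struct table *root = calloc(1, sizeof(*root));
	unsigned long n;

	if (!root)
		abort();
	freed = 0;
	populate(root, levels, 0);
	populate(root, levels, 0x100000);	/* 4 GiB / 4 KiB pages */
	free_level(root, levels, 0, start, last);
	n = freed;
	free_level(root, levels, 0, 0, entry_span(levels + 1) - 1);
	free(root);
	return n;
}
```

In this model the tree holds four tables below the root (two intermediate, two
leaf). Calling demo_teardown() over the whole address range releases all four,
while a range that covers only one 512-page leaf releases just that leaf and
keeps its parent, since the parent is only partly covered.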
--
Regards/Gruss,
Boris.
Sent from a fat crate under my desk. Formatting is fine.
--
* Re: [PATCH] intel-iommu: Fix leaks in pagetable freeing
2013-10-02 8:44 ` Borislav Petkov
@ 2013-10-05 23:41 ` Greg KH
-1 siblings, 0 replies; 12+ messages in thread
From: Greg KH @ 2013-10-05 23:41 UTC (permalink / raw)
To: Borislav Petkov
Cc: kvm, iommu, stable, linux-kernel, dwmw2
On Wed, Oct 02, 2013 at 10:44:31AM +0200, Borislav Petkov wrote:
> On Sat, Jun 15, 2013 at 10:27:19AM -0600, Alex Williamson wrote:
> > At best the current code only seems to free the leaf pagetables and
> > the root. If you're unlucky enough to have a large gap (like any
> > QEMU guest with more than 3G of memory), only the first chunk of leaf
> > pagetables are freed (plus the root). This is a massive memory leak.
> > This patch re-writes the pagetable freeing function to use a
> > recursive algorithm and manages to not only free all the pagetables,
> > but does it without any apparent performance loss versus the current
> > broken version.
> >
> > Signed-off-by: Alex Williamson <alex.williamson@redhat.com>
> > Cc: stable@vger.kernel.org
> > ---
> >
> > Suggesting for stable, would like to see some soak time, but it's
> > hard to imagine this being any worse than the current code.
>
> Btw, I have a backport for the 3.0.x series which builds fine here, in
> case you guys are interested :)
Thanks, now applied.
greg k-h