Linux IOMMU Development
From: John Garry <john.g.garry@oracle.com>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Vasant Hegde <vasant.hegde@amd.com>,
	Robin Murphy <robin.murphy@arm.com>,
	joro@8bytes.org, will@kernel.org, iommu@lists.linux.dev,
	linux-kernel@vger.kernel.org, Jakub Kicinski <kuba@kernel.org>
Subject: Re: [PATCH v4] iommu: Optimise PCI SAC address trick
Date: Tue, 18 Apr 2023 19:50:06 +0100	[thread overview]
Message-ID: <8404e7ef-d929-bd39-a10d-f4053cc4bd3e@oracle.com> (raw)
In-Reply-To: <CAHk-=whogEk1UJfU3E7aW18PDYRbdAzXta5J0ECg=CB5=sCe7g@mail.gmail.com>

On 18/04/2023 18:36, Linus Torvalds wrote:
>> JFYI, since you are using NVMe, you could alternatively try
>> something like what I did for some SCSI storage controller drivers to
>> limit the request_queue max_sectors soft limit, like:
> That patch is not only whitespace-damaged, it's randomly missing one
> '+' character

My copy-and-paste error.

> so it makes no sense even ignoring the whitespace
> problems. *And* it has a nonsensical cast to 'unsigned int' which
> makes that 'min()' possibly do crazy and invalid things (ie imagine
> dma_opt_mapping_size() returning 4GB).
> 
> You can't cast things to the smaller size just to get rid of a
> warning, for chrissake!

Yeah, sorry, I was just trying to show a very quick demo of how this can 
actually be done.

Indeed, I could have mentioned that it would actually have been easier 
to test by feeding a lower limit into /sys/block/<dev>/queue/max_sectors_kb
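For reference, lowering that soft limit from userspace is just a sysfs write. The device name below is hypothetical; the block layer will then split requests so no single request exceeds the new size:

```shell
# Read the current soft limit (in KiB) for the queue
cat /sys/block/nvme0n1/queue/max_sectors_kb

# Lower it (needs root); must not exceed max_hw_sectors_kb
echo 128 > /sys/block/nvme0n1/queue/max_sectors_kb
```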

> 
> In fact, even without the cast, it seems entirely broken, since the
> fallback for dma_opt_mapping_size() is to return 0 (admittedly *that*
> case only happens with HAS_DMA=n).
> 
> Finally, doing this inside the
> 
>          if (ctrl->max_hw_sectors) {

I think that this would be set for PCI NVMe controllers, which we were 
interested in here. But, indeed, I could check for a better place to set 
this.

> 
> conditional seems entirely wrong, since any dma mapping limits would
> be entirely independent of any driver maximum hw size, and in fact
> *easier* to hit if the block device itself doesn't have any max
> limits.
> 
> So please burn that patch in the darkest pits of hell and let's try to
> forget it ever existed. Ok?

Sure

> 
> Also, shouldn't any possible dma mapping size affect not
> 'max_sectors', but 'max_segment_size'? At least the docs imply that
> dma_opt_mapping_size() is about the max size of a *single* mapping,
> not of the whole thing?

It's meant to apply to the total mapping length, not a single segment, 
so the doc is misleading there.

> 
> Anyway, if this is actually an issue, to the point that it's now being
> discussed for a *second* block driver subsystem, then shouldn't the
> queue handling just do this all automatically, instead of adding
> random crap to random block driver architectures?

Other storage controllers may enjoy better performance with very large 
DMA mappings (whose total length exceeds the IOVA caching limit), so it 
was too risky to apply a performance-related change of this nature 
across the board when that API was introduced.

So far it had only been a single controller where we actually saw IOVA 
allocation giving very (very) poor performance.

However, as far as I am aware, there was nothing special about that 
controller, apart from the fact that it was often creating requests 
whose length exceeded that IOVA caching limit, and it was also filling 
the 32-bit IOVA space quickly - possibly because the system had lots of 
CPUs.

Since there are now reports of poor performance in other storage 
controllers and also in networking adapters, I can only assume that 
people are testing more often on IOMMU-enabled systems with lots of 
CPUs. Having said that, I would still be cautious about applying that 
limit everywhere.

> 
> And no, I don't know this code, so maybe I'm entirely missing
> something, but that patch just raised my hackles enough that I had to
> say something.

Sure.

Thanks,
John


Thread overview: 20+ messages
2023-04-13 13:40 [PATCH v4] iommu: Optimise PCI SAC address trick Robin Murphy
2023-04-13 14:02 ` Jakub Kicinski
2023-04-14 11:45 ` Joerg Roedel
2023-04-14 17:45   ` Robin Murphy
2023-05-23 16:06     ` Joerg Roedel
2023-05-24 14:56       ` Robin Murphy
2023-06-13 17:58   ` Jakub Kicinski
2023-06-15  7:49     ` John Garry
2023-06-15  9:04       ` Robin Murphy
2023-06-15 10:11         ` John Garry
2023-06-15 11:41           ` Robin Murphy
2023-06-15 12:15             ` John Garry
2023-04-18  9:23 ` Vasant Hegde
2023-04-18 10:19   ` John Garry
2023-04-18 17:36     ` Linus Torvalds
2023-04-18 18:50       ` John Garry [this message]
2023-04-18 10:57   ` Robin Murphy
2023-04-18 13:05     ` Vasant Hegde
2023-07-14 14:09 ` Joerg Roedel
2023-07-17  9:24   ` John Garry
