* Re: USB mass storage and ARM cache coherency
  [not found] ` <1267738660.22204.77.camel@pasglop>
@ 2010-03-05  1:17 ` Paul Mundt
  2010-03-05  4:44 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 4+ messages in thread
From: Paul Mundt @ 2010-03-05 1:17 UTC (permalink / raw)
To: linux-arm-kernel

On Fri, Mar 05, 2010 at 08:37:40AM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2010-03-04 at 18:07 +0000, Catalin Marinas wrote:
> > Are you more in favour of a PIO kmap API than inverting the meaning of
> > PG_arch_1?
>
> My main worry with this approach is the sheer amount of drivers that
> need fixing. I believe inverting PG_arch_1 is a better solution, and I
> somewhat fail to see how we end up doing too much flushing if we have
> per-page execute permission (but maybe SH doesn't?)

Basically we have two different MMUs on VIPT parts. The older one, on
all SH-4 parts, is read-implies-exec, with no ability to differentiate
between read and exec access. For these parts the PG_dcache_dirty
approach saves us from a lot of flushing, and the corner cases were
isolated enough that we could tolerate fixups at the driver level, even
on a write-allocate D-cache.

For second generation SH-4A (SH-X2) and later parts, read and exec are
split out, and we could reasonably adopt the PG_dcache_clean approach
there while adopting the same sort of flushing semantics as PPC to
avoid flushing constantly. The current generation of parts far
outnumbers its legacy counterparts, so it's certainly something I plan
to experiment with.

We have an additional level of complexity on some of the SMP parts with
a non-coherent I-cache: some of the early CPUs have broken broadcasting
of the cacheops in hardware and so need to rely on IPIs, while the
later parts broadcast properly. We also need to deal with D-cache IPIs
when using mixed coherency protocols on different CPUs.
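The deferred-flush scheme Paul describes can be sketched as a small userspace simulation: `flush_dcache_page()` only records that the kernel-side alias may hold dirty D-cache lines, and the actual (expensive) flush is paid once, when the page is next mapped into userspace. The struct, flag, and counter below are illustrative stand-ins for `struct page`/`PG_dcache_dirty`, not the real kernel API:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page with its arch-private flag. */
struct page {
    bool dcache_dirty;   /* plays the role of PG_dcache_dirty (PG_arch_1) */
};

static int flushes;      /* counts simulated cache writeback operations */

/* Kernel writes a page through its kernel-side alias: rather than
 * flushing immediately, just note that the D-cache may be dirty. */
static void flush_dcache_page(struct page *pg)
{
    pg->dcache_dirty = true;
}

/* A user mapping is established for the page: only now, and only if
 * the page was marked dirty, perform the flush. */
static void update_mmu_cache(struct page *pg)
{
    if (pg->dcache_dirty) {
        flushes++;               /* stands in for the real writeback */
        pg->dcache_dirty = false;
    }
}
```

The point of the scheme is visible in the counter: any number of kernel-side writes between user mappings collapses into a single flush.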
For older PIPT parts we've never used the deferred flush, since the
only time we ever had to bother with cache maintenance was in the DMA
ops, as anything closer to the CPU than the PCI DMAC had no opportunity
to be snooped.

> > I'm not familiar with SH but for PIO devices the flushing shouldn't
> > be more aggressive. For the DMA devices, Russell suggested that we
> > mark the page as clean (set PG_dcache_clean) in the DMA API to avoid
> > the default flushing.
>
> I really like that idea, as I said earlier, but I'm worried about the
> I$ side of things. IE. What I'm trying to say is that I can't see how
> to do that optimisation without ending up with missing I$
> invalidations or doing way too many of them, unless we have a separate
> bit to track I$ state.

Using PG_dcache_clean from the DMA API sounds like a pretty good idea,
and certainly worth experimenting with. I don't know how we would do
the I-cache optimization without a PG_arch_2, though.

In any event, if there's going to be a mass exodus to PG_dcache_clean,
Documentation/cachetlb.txt could use a considerable amount of
expanding. The read/exec and I-cache optimizations are something that
would be valuable to document, as opposed to simply being pointed at
the sparc64 approach with the regular PG_dcache_dirty caveats.
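Russell's suggestion quoted above can be sketched in the same simulated style: the maintenance done for a device-to-memory transfer already leaves the page's D-cache lines clean, so the unmap path can set the (inverted-sense) clean flag and let later mapping paths skip their flush. The names mirror the kernel's but this is a toy model, not the actual `dma_unmap_page()` signature or semantics:

```c
#include <assert.h>
#include <stdbool.h>

struct page {
    bool dcache_clean;   /* inverted-sense flag: PG_dcache_clean */
};

static int flushes;      /* counts simulated maintenance operations */

enum dma_data_direction { DMA_FROM_DEVICE, DMA_TO_DEVICE };

/* Unmapping a DMA_FROM_DEVICE buffer: the cache maintenance done for
 * the transfer is the one unavoidable operation, and it leaves the
 * page clean, so record that fact for later mapping paths. */
static void dma_unmap_page(struct page *pg, enum dma_data_direction dir)
{
    flushes++;                   /* maintenance done for the transfer */
    if (dir == DMA_FROM_DEVICE)
        pg->dcache_clean = true;
}

/* Default mapping path: flush only pages not already known clean. */
static void update_mmu_cache(struct page *pg)
{
    if (!pg->dcache_clean) {
        flushes++;
        pg->dcache_clean = true;
    }
}
```

As the thread notes, this covers only the D-cache side; tracking I-cache state the same way would want a second bit (the hypothetical PG_arch_2 discussed below in the thread).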
* Re: USB mass storage and ARM cache coherency
  2010-03-05  1:17 ` USB mass storage and ARM cache coherency Paul Mundt
@ 2010-03-05  4:44 ` Benjamin Herrenschmidt
  2010-03-10  3:52 ` Paul Mundt
  0 siblings, 1 reply; 4+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-05 4:44 UTC (permalink / raw)
To: linux-arm-kernel

> Basically we have two different MMUs on VIPT parts, the older one on
> all SH-4 parts being read-implies-exec with no ability to
> differentiate between read and exec access.

Ok, this is the same as the older ppc32 processors.

> For these parts the PG_dcache_dirty approach saves us from a lot of
> flushing, and the corner cases were isolated enough that we could
> tolerate fixups at the driver level, even on a write-allocate D-cache.

But how wide a range of devices do you have to support with those? Is
this a few SoCs, or people putting any random PCI device in there, for
example?

If I were to do it that way on ppc32, I'd worry that it would be more
than a few drivers that I would have to fix :-) All the 32-bit
PowerMacs and PowerBooks for example, all of the Freescale 74xx based
parts, etc... those guys have PCI, and all sorts of random HW plugged
into them.

I would -love- to avoid that horrible amount of flushing we do on
these, it's quite high on any profile run, but I haven't found a good
way to do so. There's also a nasty issue of I-cache content leaking
between processes, which I doubt is exploitable, but I had people
having a go at me about it when I tried to avoid I-cache cleaning of
anonymous pages by default.

> For second generation SH-4A (SH-X2) and up parts, read and exec are
> split out and we could reasonably adopt the PG_dcache_clean approach
> there while adopting the same sort of flushing semantics as PPC to
> avoid flushing constantly. The current generation of parts far
> outnumber their legacy counterparts, so it's certainly something I
> plan to experiment with.

I'd be curious to see whether you get a perf improvement with that.
Note that we still have this additional thing floating around in this
thread which I think is definitely worthwhile to do, which is to mark
as clean the pages that have been written to by DMA, in dma_unmap and
friends... if we can fix the I-cache problem. So far I haven't found
James's replies on this satisfactory :-) But maybe I just missed
something.

> We have an additional level of complexity on some of the SMP parts
> with a non-coherent I-cache,

I have that on some embedded ppc's too, where the icache flush
instructions aren't broadcast, like ARM11MP in fact. Pretty horrible.
Fortunately, today nobody sane (apart from BlueGene) did an SMP part
with those, so we have well localized internal hacks for them. But I've
heard that some vendors might be pumping out SoCs with that stuff too
soon, which worries me.

> some of the early CPUs have broken broadcasting of the cacheops in
> hardware and so need to rely on IPIs, while the later parts broadcast
> properly. We also need to deal with D-cache IPIs when using mixed
> coherency protocols on different CPUs.

Right, that sucks. Do those have no-exec permission support? If they
do, then you can do what I did for BG, which is to ping-pong user pages
so they are either writable or executable (since userspace code itself
will break, as it will assume the cache ops -are- broadcast, since
that's what the architecture says).

> For older PIPT parts we've never used the deferred flush, since the
> only time we ever had to bother with cache maintenance was in the DMA
> ops, as anything closer to the CPU than the PCI DMAC had no
> opportunity to be snooped.

Do you also, like ARM11MP, have a case of non-cache-coherent DMA and
non-broadcast cache ops in SMP? That's somewhat of a killer; I still
don't see how it can be dealt with properly, other than using
load/store tricks to bring the data into the local cache and flushing
it from there.
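Ben's ping-pong trick can be sketched as a small state machine: a user page is mapped writable or executable at any instant, never both, and the flip from writable to executable is where the purely local (non-broadcast) D-cache writeback and I-cache invalidate happen, on the CPU taking the fault. This is an illustrative model of the idea, not the actual BlueGene implementation:

```c
#include <assert.h>

/* Each user page is either writable or executable, never both. */
enum pstate { P_WRITABLE, P_EXECUTABLE };

struct upage {
    enum pstate state;
    int flushes;   /* local D-flush + I-invalidate pairs performed */
};

/* Fault-handler sketch: an exec fault on a writable page triggers the
 * local cache maintenance and the flip to executable; a write fault on
 * an executable page just revokes exec, deferring maintenance until
 * the next exec fault. */
static void fault(struct upage *pg, int is_exec_fault)
{
    if (is_exec_fault && pg->state == P_WRITABLE) {
        pg->flushes++;            /* writeback D$, invalidate I$ locally */
        pg->state = P_EXECUTABLE;
    } else if (!is_exec_fault && pg->state == P_EXECUTABLE) {
        pg->state = P_WRITABLE;   /* drop exec; no maintenance needed yet */
    }
}
```

The no-exec permission Ben asks about is exactly what makes the exec fault observable, so the scheme only works on parts that can deny execute per page.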
DMA ops are called way too deep into spinlock hell to rely on IPIs
(unless your HW also provides some kind of NMI IPIs).

> > > I'm not familiar with SH but for PIO devices the flushing
> > > shouldn't be more aggressive. For the DMA devices, Russell
> > > suggested that we mark the page as clean (set PG_dcache_clean) in
> > > the DMA API to avoid the default flushing.
> >
> > I really like that idea, as I said earlier, but I'm worried about
> > the I$ side of things. IE. What I'm trying to say is that I can't
> > see how to do that optimisation without ending up with missing I$
> > invalidations or doing way too many of them, unless we have a
> > separate bit to track I$ state.
>
> Using PG_dcache_clean from the DMA API sounds like a pretty good idea,
> and certainly worth experimenting with. I don't know how we would do
> the I-cache optimization without a PG_arch_2, though.

Right. That's the one thing I've been trying to figure out without
success. But then, is it a big deal to add PG_arch_2? It doesn't sound
like it to me...

> In any event, if there's going to be a mass exodus to PG_dcache_clean,
> Documentation/cachetlb.txt could use a considerable amount of
> expanding. The read/exec and I-cache optimizations are something that
> would be valuable to document, as opposed to simply being pointed at
> the sparc64 approach with the regular PG_dcache_dirty caveats.

Cheers,
Ben.
* Re: USB mass storage and ARM cache coherency
  2010-03-05  4:44 ` Benjamin Herrenschmidt
@ 2010-03-10  3:52 ` Paul Mundt
  2010-03-11 21:44 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 4+ messages in thread
From: Paul Mundt @ 2010-03-10 3:52 UTC (permalink / raw)
To: linux-arm-kernel

On Fri, Mar 05, 2010 at 03:44:55PM +1100, Benjamin Herrenschmidt wrote:
> > For these parts the PG_dcache_dirty approach saves us from a lot of
> > flushing, and the corner cases were isolated enough that we could
> > tolerate fixups at the driver level, even on a write-allocate
> > D-cache.
>
> But how wide a range of devices do you have to support with those? Is
> this a few SoCs or people putting any random PCI device in there, for
> example?
>
> If I were to do it that way on ppc32, I'd worry that it would be more
> than a few drivers that I would have to fix :-) All the 32-bit
> PowerMacs and PowerBooks for example, all of the Freescale 74xx based
> parts, etc... those guys have PCI, and all sorts of random HW plugged
> into them.

Many of those parts do support PCI, but are rarely used with arbitrary
devices. The PCI controller on those parts also permits one to
establish coherency for any transactions between PCI and memory through
a rudimentary snoop controller, which requires the CPU to avoid
entering any sleep states. This works ok in practice, since that series
of host controllers doesn't really support power management anyway (nor
do any of the cores of that generation implement any of the more
complex sleep states).

> > For second generation SH-4A (SH-X2) and up parts, read and exec are
> > split out and we could reasonably adopt the PG_dcache_clean approach
> > there while adopting the same sort of flushing semantics as PPC to
> > avoid flushing constantly. The current generation of parts far
> > outnumber their legacy counterparts, so it's certainly something I
> > plan to experiment with.
>
> I'd be curious to see whether you get a perf improvement with that.
> Note that we still have this additional thing floating around in this
> thread which I think is definitely worthwhile to do, which is to mark
> as clean the pages that have been written to by DMA, in dma_unmap and
> friends... if we can fix the I-cache problem. So far I haven't found
> James's replies on this satisfactory :-) But maybe I just missed
> something.

I'll start in on profiling some of this once I start on 2.6.35 work. I
think I still have my old numbers from when we did the PG_mapped to
PG_dcache_dirty transition, so it will be interesting to see how
PG_dcache_clean stacks up against both of those.

> > We have an additional level of complexity on some of the SMP parts
> > with a non-coherent I-cache,
>
> I have that on some embedded ppc's too, where the icache flush
> instructions aren't broadcast, like ARM11MP in fact. Pretty horrible.
> Fortunately, today nobody sane (apart from BlueGene) did an SMP part
> with those, so we have well localized internal hacks for them. But
> I've heard that some vendors might be pumping out SoCs with that stuff
> too soon, which worries me.

I-cache invalidations are broadcast on all mass-produced SH-4A SMP
parts, but we do have some early prototype chips that screwed that up.
For the mainline case, we ought to be able to assume hardware
broadcast, though.

> > some of the early CPUs have broken broadcasting of the cacheops in
> > hardware and so need to rely on IPIs, while the later parts
> > broadcast properly. We also need to deal with D-cache IPIs when
> > using mixed coherency protocols on different CPUs.
>
> Right, that sucks. Do those have no-exec permission support? If they
> do, then you can do what I did for BG, which is to ping-pong user
> pages so they are either writable or executable (since userspace code
> itself will break, as it will assume the cache ops -are- broadcast,
> since that's what the architecture says).

Yes, these all support no-exec.
I'll give the ping-ponging a try, thanks for the tip.

> Do you also, like ARM11MP, have a case of non-cache-coherent DMA and
> non-broadcast cache ops in SMP? That's somewhat of a killer; I still
> don't see how it can be dealt with properly, other than using
> load/store tricks to bring the data into the local cache and flushing
> it from there. DMA ops are called way too deep into spinlock hell to
> rely on IPIs

The only thing we really lack is I-cache coherency, which isn't such a
big deal with invalidations being broadcast. All DMA accesses are
snooped, and the D-cache is fully coherent.

> (unless your HW also provides some kind of NMI IPIs).

While we don't have anything like FIQs to work with, we do have IRQ
priority levels to play with. I'd toyed with the idea in the past of
simply having a reserved level that never gets masked, particularly for
things like broadcast backtraces.

> > Using PG_dcache_clean from the DMA API sounds like a pretty good
> > idea, and certainly worth experimenting with. I don't know how we
> > would do the I-cache optimization without a PG_arch_2, though.
>
> Right. That's the one thing I've been trying to figure out without
> success. But then, is it a big deal to add PG_arch_2? It doesn't sound
> like it to me...

Well, it does start to get a bit painful with sparsemem section or NUMA
node IDs also digging into the page flags on 32-bit... the benefits
would have to be pretty compelling to offset the pain.
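Paul's objection is about the 32-bit page->flags bit budget: zone, sparsemem section, and (on some configurations) NUMA node indices are packed into the same word as the PG_* flag bits, so every new flag tightens the squeeze. A toy calculation of the idea, with widths made up for illustration rather than taken from any real kernel configuration:

```c
#include <assert.h>

/* Illustrative 32-bit page->flags layout: the flag bits share the word
 * with zone, section, and node indices packed at the top.  All widths
 * here are hypothetical, chosen only to show the arithmetic. */
#define BITS_PER_LONG   32
#define SECTIONS_WIDTH   8   /* hypothetical sparsemem section index */
#define NODES_WIDTH      4   /* hypothetical NUMA node index */
#define ZONES_WIDTH      2

/* Bits left over for actual PG_* flags after the packed indices. */
static int flag_bits_available(void)
{
    return BITS_PER_LONG - SECTIONS_WIDTH - NODES_WIDTH - ZONES_WIDTH;
}
```

With these made-up widths only 18 bits remain for flags, which is why adding a PG_arch_2 (or, as Ben suggests next, reusing an existing flag that is meaningless for allocated pages) is a real trade-off on 32-bit.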
* Re: USB mass storage and ARM cache coherency
  2010-03-10  3:52 ` Paul Mundt
@ 2010-03-11 21:44 ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 4+ messages in thread
From: Benjamin Herrenschmidt @ 2010-03-11 21:44 UTC (permalink / raw)
To: linux-arm-kernel

On Wed, 2010-03-10 at 12:52 +0900, Paul Mundt wrote:
> Well, it does start to get a bit painful with sparsemem section or
> NUMA node IDs also digging into the page flags on 32-bit... the
> benefits would have to be pretty compelling to offset the pain.

Unless we play a dangerous trick and re-use another flag that isn't
meaningful for allocated pages... maybe PG_buddy? Or am I missing
something about that flag's semantics?

Cheers,
Ben.
Thread overview: 4+ messages
[not found] <20100302211049V.fujita.tomonori@lab.ntt.co.jp>
[not found] ` <1267549527.15401.78.camel@e102109-lin.cambridge.arm.com>
[not found] ` <20100303215437.GF2579@ucw.cz>
[not found] ` <1267709756.6526.380.camel@e102109-lin.cambridge.arm.com>
[not found] ` <20100304135128.GA12191@atrey.karlin.mff.cuni.cz>
[not found] ` <1267712512.31654.176.camel@mulgrave.site>
[not found] ` <1267716578.6526.483.camel@e102109-lin.cambridge.arm.com>
[not found] ` <20100304154103.GA9384@linux-sh.org>
[not found] ` <1267726049.6526.543.camel@e102109-lin.cambridge.arm.com>
[not found] ` <1267738660.22204.77.camel@pasglop>
2010-03-05 1:17 ` USB mass storage and ARM cache coherency Paul Mundt
2010-03-05 4:44 ` Benjamin Herrenschmidt
2010-03-10 3:52 ` Paul Mundt
2010-03-11 21:44 ` Benjamin Herrenschmidt