Re: [Linaro-mm-sig] [RFC 0/2] ARM: DMA-mapping & IOMMU integration

From: "Michael K. Edwards" <m.k.edwards@gmail.com>
To: KyongHo Cho <pullip.cho@samsung.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>,
	Russell King - ARM Linux <linux@arm.linux.org.uk>,
	Arnd Bergmann <arnd@arndb.de>, Joerg Roedel <joro@8bytes.org>,
	linaro-mm-sig@lists.linaro.org, linux-mm@kvack.org,
	Kyungmin Park <kyungmin.park@samsung.com>,
	linux-arm-kernel@lists.infradead.org
Subject: Re: [Linaro-mm-sig] [RFC 0/2] ARM: DMA-mapping & IOMMU integration
Date: Mon, 13 Jun 2011 10:55:59 -0700
Message-ID: <BANLkTi=C6NKT94Fk6Rq6wmhndVixOqC6mg@mail.gmail.com>
In-Reply-To: <BANLkTikkCV=rWM_Pq6t6EyVRHcWeoMPUqw@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 3138 bytes --]

The need to allocate pages for "write combining" access goes deeper
than anything to do with DMA or IOMMUs.  Please keep "write combine"
distinct from "coherent" in the allocation/mapping APIs.

Write-combining is a special case because it's an end-to-end
requirement, usually architecturally invisible, and getting it to
happen requires a very specific combination of mappings and code.
There's a good explanation here of the requirements on some Intel
implementations of the x86 architecture:
http://software.intel.com/en-us/articles/copying-accelerated-video-decode-frame-buffers/
As I understand it, similar considerations apply on at least some
ARMv7 implementations, with NEON multi-register load/store operations
taking the place of MOVNTDQ.  (See
http://www.arm.com/files/pdf/A8_Paper.pdf for instance, although I
don't think it gives enough detail about the conditions under which "if
the full cache line is written, the Level-2 line is simply marked
dirty and no external memory requests are required.")
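
For illustration only (this is not taken from the attached patch or from
the Intel article): a minimal user-space sketch of the x86 store pattern
that write-combining depends on, using the SSE2 intrinsic for MOVNTDQ.
The function name wc_copy_sketch and its alignment and length
requirements are assumptions made for the example.  Whether the stores
actually bypass the cache depends on the memory type of the mapping and
on the CPU implementation, which is exactly the end-to-end problem
described above; the sketch shows only the instruction sequence.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2: _mm_load_si128, _mm_stream_si128 (MOVNTDQ) */
#include <stddef.h>
#include <stdint.h>

/*
 * Copy len bytes from src to dst with non-temporal (streaming) stores.
 * Assumed for this sketch: both pointers are 16-byte aligned and len is
 * a multiple of 64, so each loop iteration writes one full 64-byte line.
 */
static void wc_copy_sketch(void *dst, const void *src, size_t len)
{
    assert(((uintptr_t)dst & 15) == 0 && ((uintptr_t)src & 15) == 0);
    assert(len % 64 == 0);

    __m128i *d = dst;
    const __m128i *s = src;

    for (size_t i = 0; i < len / 16; i += 4) {
        /* Load a whole 64-byte line first ... */
        __m128i a = _mm_load_si128(s + i);
        __m128i b = _mm_load_si128(s + i + 1);
        __m128i c = _mm_load_si128(s + i + 2);
        __m128i e = _mm_load_si128(s + i + 3);

        /* ... then issue the four streaming stores back to back. */
        _mm_stream_si128(d + i, a);
        _mm_stream_si128(d + i + 1, b);
        _mm_stream_si128(d + i + 2, c);
        _mm_stream_si128(d + i + 3, e);
    }

    /* Order the weakly-ordered streaming stores before later stores. */
    _mm_sfence();
}
```

Grouping the four stores per line is meant to give the hardware a
complete line to combine; it does not by itself prove that no line was
fetched, which still has to be shown by measurement.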

As far as I can tell, there is not yet any way to get real
cache-bypassing write-combining from userland in a mainline kernel,
for x86/x86_64 or ARM.  I have been able to do it from inside a driver
on x86, including in an ISR with some fixes to the kernel's FPU
context save/restore code (patch attached, if you're curious);
otherwise I haven't yet seen write-combining in operation on Linux.
The code that needs to bypass the cache is part of a SoC silicon
erratum workaround supplied by Intel.  It didn't work as delivered --
it oopsed the kernel -- but is now shipping inside our product, and no
problems have been reported from QA or the field.  So I'm fairly sure
that the changes I made are effective.

I am not an expert in this area; I was just forced to learn something
about it in order to make a product work.  My assertion that "there's
no way to do it yet" is almost certainly wrong.  I am hoping and
expecting to be immediately contradicted, with a working code example
and benchmarks that show that cache lines are not being fetched,
clobbered, and stored again, with the latencies hidden inside the
cache architecture.  :-)  (Seriously: there are four bits in the
Cortex-A8's "L2 Cache Auxiliary Control Register" that control various
aspects of this mechanism, and if you don't have a fairly good
explanation of which bits do and don't affect your benchmark, then I
contend that the job isn't done.  I don't begin to understand the
equivalent for the multi-core A9 I'm targeting next.)

If some kind person doesn't help me see the error of my ways, I'm
going to have to figure it out for myself on ARM in the next couple of
months, this time for performance reasons rather than to work around
silicon errata.  Unfortunately, I do not expect it to be particularly
low-hanging fruit.  I expect to switch to the hard-float ABI first
(the only remaining obstacle being a couple of TI-supplied binary-only
libraries).  That might provide enough of a system-level performance
win (by allowing the compiler to reorder fetches to NEON registers
across function/method calls) to obviate the need.

Cheers,
- Michael

[-- Attachment #2: 0011-Clean-up-task-FPU-state-thoroughly-during-exec-and-p.patch --]
[-- Type: application/octet-stream, Size: 2040 bytes --]

From ffb3feb73f89a34459b84f229a9ac699a589ed8f Mon Sep 17 00:00:00 2001
From: Michael Edwards <michaedw@cisco.com>
Date: Sat, 24 Apr 2010 20:53:38 -0700
Subject: [PATCH] x86,fpu: Protect FPU context cleanup against SSE2 in ISRs

 Clean up task FPU state thoroughly during exec() and process tear-down,
 and lock out local IRQs while doing it, so that SSE2 instructions in
 ISRs don't cause fxsave/fxrstor to/from a null pointer.
 (They still need to be guarded with kernel_fpu_(begin|end), of course.)

Signed-off-by: Michael Edwards <michaedw@cisco.com>
---
 arch/x86/kernel/process.c    |   10 ++++++++++
 arch/x86/kernel/process_32.c |    4 +---
 2 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 876e918..bde7d09 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -8,6 +8,7 @@
 #include <linux/pm.h>
 #include <linux/clockchips.h>
 #include <asm/system.h>
+#include <asm/i387.h>

 unsigned long idle_halt;
 EXPORT_SYMBOL(idle_halt);
@@ -31,12 +32,21 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 	return 0;
 }

+/*
+ * Locks out local IRQs while clearing FPU state and
+ * related task properties, so that ISRs can use SSE2.
+ */
 void free_thread_xstate(struct task_struct *tsk)
 {
+	local_irq_disable();
 	if (tsk->thread.xstate) {
+		tsk->fpu_counter = 0;
+		clear_stopped_child_used_math(tsk);
+		__clear_fpu(tsk);
 		kmem_cache_free(task_xstate_cachep, tsk->thread.xstate);
 		tsk->thread.xstate = NULL;
 	}
+	local_irq_enable();
 }

 void free_thread_info(struct thread_info *ti)
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 31f40b2..7122160 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -294,9 +294,7 @@ void flush_thread(void)
 	/*
 	 * Forget coprocessor state..
 	 */
-	tsk->fpu_counter = 0;
-	clear_fpu(tsk);
-	clear_used_math();
+	free_thread_xstate(tsk);
 }

 void release_thread(struct task_struct *dead_task)
-- 
1.7.0
