* [patch 00/10] KSPU API + AES offloaded to SPU + testing module
@ 2007-08-16 20:01 Sebastian Siewior
2007-08-16 20:01 ` [patch 01/10] t add cast to regain ablkcipher_request from private ctx Sebastian Siewior
` (9 more replies)
0 siblings, 10 replies; 16+ messages in thread
From: Sebastian Siewior @ 2007-08-16 20:01 UTC (permalink / raw)
To: cbe-oss-dev; +Cc: herbert, arnd, jk, linux-crypto
This is a complete submission of $subject against current git.
Content:
1 - required casting to retrieve private struct out of crypto API
2 - function to retrieve private crypto struct. Herbert queued this
already for 2.6.24.
3 - KSPU doc
4 - KSPU skeleton
5 - exporting required symbols within spufs.ko
6 - allocation of KSPU context
7 - KSPU, PPU side implementation
8 - KSPU, SPE side implementation
9 - AES as KSPU & Crypto user. Providing ECB+CBC block mode
10 - testing module for AES crypto.
Herbert, I've put you on CC to consider patch 1 for inclusion. As I noticed
earlier, KSPU is a general purpose interface so it is not clever to use
crypto's infrastructure since KSPU may be used for non-crypto related
tasks. However, I included a soft limit.
Figure [1] shows the performance of aes-spu in ECB mode with a transfer size
of 16 KiB. Generic is crypto/aes.c compiled for the SPU, sync means there is
no queueing at all (just copy the data and start the SPU), async is what is
provided by this patch series.
Figure [2] shows the performance with different DMA transfer sizes
(smaller chunks = more transfers). Async benefits from double buffering
and more requests at a time, but is slower at 16 KiB due to more communication.
The pdfs are 4.6 KiB each but scalable :)
[1] http://download.breakpoint.cc/spu/spu_code_async.pdf
[2] http://download.breakpoint.cc/spu/spu_async_blocksize_aligned.pdf
sleepy Sebastian
--
* [patch 01/10] t add cast to regain ablkcipher_request from private ctx
2007-08-16 20:01 [patch 00/10] KSPU API + AES offloaded to SPU + testing module Sebastian Siewior
@ 2007-08-16 20:01 ` Sebastian Siewior
2007-08-17 8:55 ` Herbert Xu
2007-08-16 20:01 ` [patch 02/10] crypto: retrieve private ctx aligned Sebastian Siewior
` (8 subsequent siblings)
9 siblings, 1 reply; 16+ messages in thread
From: Sebastian Siewior @ 2007-08-16 20:01 UTC (permalink / raw)
To: cbe-oss-dev; +Cc: herbert, arnd, jk, linux-crypto, Sebastian Siewior
[-- Attachment #1: crypto_add_casts.diff --]
[-- Type: text/plain, Size: 610 bytes --]
This cast allows regaining the struct ablkcipher_request for a request
from its private data.
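A minimal usage sketch (hypothetical driver code, not part of this patch):
given the pointer to a request's private ctx, a driver can recover the
original request in its completion path:

	static void my_driver_done(void *ctx, int err)
	{
		/* regain the request that owns this private ctx */
		struct ablkcipher_request *req = ablkcipher_ctx_cast(ctx);

		req->base.complete(&req->base, err);
	}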
Signed-off-by: Sebastian Siewior <linux-crypto@ml.breakpoint.cc>
--- a/include/linux/crypto.h
+++ b/include/linux/crypto.h
@@ -580,6 +580,12 @@ static inline struct ablkcipher_request
return container_of(req, struct ablkcipher_request, base);
}
+static inline struct ablkcipher_request *ablkcipher_ctx_cast(
+ void *ctx)
+{
+ return container_of(ctx, struct ablkcipher_request, __ctx);
+}
+
static inline struct ablkcipher_request *ablkcipher_request_alloc(
struct crypto_ablkcipher *tfm, gfp_t gfp)
{
--
* [patch 02/10] crypto: retrieve private ctx aligned
2007-08-16 20:01 [patch 00/10] KSPU API + AES offloaded to SPU + testing module Sebastian Siewior
2007-08-16 20:01 ` [patch 01/10] t add cast to regain ablkcipher_request from private ctx Sebastian Siewior
@ 2007-08-16 20:01 ` Sebastian Siewior
2007-08-16 20:01 ` [patch 03/10] spufs: kspu documentation Sebastian Siewior
` (7 subsequent siblings)
9 siblings, 0 replies; 16+ messages in thread
From: Sebastian Siewior @ 2007-08-16 20:01 UTC (permalink / raw)
To: cbe-oss-dev; +Cc: herbert, arnd, jk, linux-crypto, Sebastian Siewior
[-- Attachment #1: crypto_ctx_alignment.diff --]
[-- Type: text/plain, Size: 625 bytes --]
This function does the same thing for ablkcipher that is done for
blkcipher by crypto_blkcipher_ctx_aligned(): it returns an aligned
address of the private ctx.
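A minimal sketch of the intended use (hypothetical tfm context, not part of
this patch): an algorithm whose private ctx needs 16 byte alignment fetches
the aligned view in its handlers:

	struct my_aes_ctx {
		u8 key[32] __attribute__((aligned(16)));
		unsigned int key_len;
	};

	static int my_setkey(struct crypto_ablkcipher *tfm, const u8 *in_key,
			unsigned int key_len)
	{
		/* aligned address of the private ctx */
		struct my_aes_ctx *ctx = crypto_ablkcipher_ctx_aligned(tfm);

		memcpy(ctx->key, in_key, key_len);
		ctx->key_len = key_len;
		return 0;
	}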
Signed-off-by: Sebastian Siewior <sebastian@breakpoint.cc>
--- a/include/crypto/algapi.h
+++ b/include/crypto/algapi.h
@@ -160,6 +160,11 @@ static inline void *crypto_ablkcipher_ct
return crypto_tfm_ctx(&tfm->base);
}
+static inline void *crypto_ablkcipher_ctx_aligned(struct crypto_ablkcipher *tfm)
+{
+ return crypto_tfm_ctx_aligned(&tfm->base);
+}
+
static inline struct crypto_blkcipher *crypto_spawn_blkcipher(
struct crypto_spawn *spawn)
{
--
* [patch 03/10] spufs: kspu documentation
2007-08-16 20:01 [patch 00/10] KSPU API + AES offloaded to SPU + testing module Sebastian Siewior
2007-08-16 20:01 ` [patch 01/10] t add cast to regain ablkcipher_request from private ctx Sebastian Siewior
2007-08-16 20:01 ` [patch 02/10] crypto: retrieve private ctx aligned Sebastian Siewior
@ 2007-08-16 20:01 ` Sebastian Siewior
2007-08-16 20:01 ` [patch 04/10] spufs: kspu doc skeleton Sebastian Siewior
` (6 subsequent siblings)
9 siblings, 0 replies; 16+ messages in thread
From: Sebastian Siewior @ 2007-08-16 20:01 UTC (permalink / raw)
To: cbe-oss-dev; +Cc: herbert, arnd, jk, linux-crypto, Sebastian Siewior
[-- Attachment #1: spufs-kspu_doc.diff --]
[-- Type: text/plain, Size: 9822 bytes --]
Documentation on how to use kspu from the PPU & SPU side.
Signed-off-by: Sebastian Siewior <sebastian@breakpoint.cc>
--- /dev/null
+++ b/Documentation/powerpc/kspu.txt
@@ -0,0 +1,243 @@
+ KSPU: Utilization of SPUs for kernel tasks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+0. KSPU design
+==============
+
+The idea is to offload single, time consuming tasks to the SPU. Those tasks
+are fed with the data that they have to process.
+Once the function on the SPU side is invoked, the input data is already
+available. After the job is done, the offloaded function must kick off a DMA
+transfer that moves the result back to main memory.
+On the PPU, the KSPU user queues the job temporarily in a linked list and later
+receives a callback to queue the job directly in the SPU's ring buffer. The
+transit stop is required for two reasons:
+- It must be possible to queue work items from softirq context
+- All requests must be accepted, even if the ring buffer is full. Waiting (until
+ a slot becomes available) is not an option.
+
+The callback (for the enqueue process on the SPU) happens in kthread context,
+so a mutex may be held. However, there is only one kthread for this job, so
+every delay will have a global impact.
+The user should enqueue only one job item on every enqueue request. The user
+may enqueue more than one job item if _really_ necessary. If there are not
+enough free slots, then the enqueue function will be called as the first
+enqueue function once free slots are available again.
+After the offloaded function completed the job, the kthread calls the
+completion callback (it is the same kthread that is used for enqueue).
+Double/multi buffering is performed by KSPU.
+
+0.5 SPU usage
+=============
+Currently only one SPU is used and allocated. Allocation occurs during KSPU
+module initialization via the spufs interface. Therefore the physical SPU is
+considered in the scheduling process (and "shared" with user space). Right
+now, it is not "easy" to find out
+- how many SPUs may be taken (i.e. not used by user space)
+- how many SPUs are useful to take (depending on the workload)
+The latter is (theoretically) an easy accounting approach if there are no
+dependencies in processing (and two jobs of the same kind may be processed on
+two SPUs in parallel).
+A second SPU (context) may be required if the local store memory is used up.
+This can be prevented if "overlays" are used. The advantages over several SPU
+contexts:
+- less complexity (since there is only one kind of SPU code)
+- no tracking which function is in which context. Plus overlay code switches
+ the binary probably faster than the scheduler does.
+
+1. Overview of memory layout
+=============================
+
+ ------------------ 256 KiB
+ | RB ENTRY |
+ ------------------ Ring buffer entries
+ | ......... |
+ ------------------
+ | RB ENTRY |
+ ------------------
+ | RB state | (consumed + outstanding)
+ ------------------
+ | STACK | Stack growing downwards
+ | || |
+ | \/ |
+ ------------------
+ | ....... | unused / reserved :)
+ ------------------
+ | Data |
+ | DMA Buffers, |
+ | functions' |
+ | private data |
+ ------------------
+ | Code |
+ | offloaded SPU |
+ | functions |
+ ------------------
+ | multiplexor | spu_main.c
+ ------------------ 0
+
+The type of a ring buffer entry is struct kspu_job.
+The number of ring buffer entries is determined by RB_SLOTS.
+The number of DMA buffers is determined by DMA_BUFFERS.
+The stack grows uncontrolled. There is no (cheap) way to notice a stack
+overflow. After adding a new SPU program, the developer is encouraged to check
+the stack usage and make sure the stack will never hit the data segment. This
+task is not required if recursive functions are used (I hope the suicide part
+has been understood).
+
+1.1 Ring buffer
+===============
+The ring buffer has been chosen because the data structure allows exchange of
+data (PPU <-> SPU) without any locking. The ring buffer entry consists of two
+parts
+- Data known by the KSPU (public data).
+- Private data is only known by the user (hidden from KSPU)
+Public data contains the function parameters of the offloaded SPU program.
+Private data is meaningless to KSPU and may contain algorithm specific
+information (like where to put the result).
+The number of ring buffer entries (RB_SLOTS) has two constraints (besides
+LS_SIZE :D):
+- it must be a power of 2.
+- it must be at least DMA_BUFFERS*2
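+
+Both constraints only involve compile time constants, so they could be checked
+at build time; an illustrative check (not part of the patch) on the PPU side,
+e.g. in kspu_init(), using the macros from merged_code.h:
+
+ BUILD_BUG_ON(RB_SLOTS & (RB_SLOTS - 1)); /* power of 2 */
+ BUILD_BUG_ON(RB_SLOTS < DMA_BUFFERS * 2);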
+
+1.2 DMA Buffers
+===============
+Every DMA buffer is DMA_MAX_TRANS_SIZE bytes in size. The size reflects the
+maximum transfer size that may be requested by the SPU. Therefore the same
+requirements apply here as to the MFC DMA size: it must be a multiple of 16
+and may not be larger than 16 KiB.
+The only limit for the number of available DMA buffers (DMA_BUFFERS) (besides
+the available space) is that "DMA_BUFFERS*2 <= RB_SLOTS" must be true. The
+reason for this constraint is that the "multiplexor", once started, requests
+DMA_BUFFERS buffers and starts processing. While processing the first batch,
+the next DMA_BUFFERS are requested (to get into streaming mode). After
+processing DMA_BUFFERS*2 requests, the first point is reached, where the SPU
+starts to notify the PPU about done requests and may stop. Therefore the
+shortest run is DMA_BUFFERS*2 requests. If there are not enough requests
+available, KSPU fills the ring buffer with NOPs to fit. A NOP is a DMA
+transfer with the size zero (nop for the MFC) and just a return statement as
+the job function.
+
+2. Offloading a task to SPU
+===========================
+Three steps are required to offload a task to SPU:
+- PPU code
+- SPU code
+- Update header files & Makefile
+
+The example code shows how to offload an 'add operation' via spu_add to the
+SPU. The complete implementation is in the skeleton files.
+
+2.1 PPU code
+============
+1. Init
+- Prepare a struct with 'struct kspu_work_item' embedded in it.
+ struct my_spu_req {
+ struct kspu_work_item kspu_work;
+ void *data;
+ };
+
+- get global kspu ctx.
+ struct kspu_context *kctx = kspu_get_kctx();
+
+2. Enqueue a specific request. (struct my_spu_req spe_req)
+- Setup enqueue callback.
+ spe_req.kspu_work.enqueue = my_enqueue_func;
+
+- Enqueue it in kspu.
+ kspu_enqueue_work_item(kctx, &spe_req.kspu_work, 0);
+
+3. Wait for the callback, then enqueue it on the SPU
+- Get an empty slot
+ struct kspu_job *work_item = kspu_get_rb_slot(kctx);
+
+- fill it
+ work_item->operation = MY_ADD;
+ work_item->in = spe_req.data;
+ work_item->in_size = 16;
+
+- set the finish callback
+ spe_req.kspu_work.notify = my_notify_func;
+
+- mark it as ready
+ kspu_mark_rb_slot_ready(kctx, &spe_req.kspu_work);
+
+4. Wait for the "finish" callback.
+- job finished.
+
+2.2 SPU code
+============
+- prepare a function that matches the following params:
+ void spu_my_add(struct kspu_job *kjob, void *buffer, unsigned int buf_num)
+
+ Use init_put_data() to write data back to main memory. It is just a wrapper
+ around mfc_putf(). Use the supplied buf_num as the tag.
+ init_put_data(buffer, out, length, buf_num);
+
+2.3 Update files
+================
+- define your private data structures which are visible from your PPU program
+ and from your SPU program. They become later part of struct kspu_job if you
+ need them for parameters. Keep them as small as possible.
+
+- attach the function to SPU_OPS in
+ include/asm-powerpc/kspu/merged_code.h before TOTAL_SPU_OPS
+
+- attach the function to spu_ops[] in arch/powerpc/platforms/cell/spufs/spu_main.c
+
+2.4 Skeleton files
+=================
+PPU code in Documentation/powerpc/kspu_ppu_skeleton.c
+SPU code in Documentation/powerpc/kspu_spu_skeleton.[ch]
+
+Merge both into kspu:
+--- a/arch/powerpc/platforms/cell/spufs/Makefile
++++ b/arch/powerpc/platforms/cell/spufs/Makefile
+@@ -24,6 +24,7 @@ kspu-y += kspu_helper.o kspu_code.o
+ $(obj)/kspu_code.o: $(obj)/spu_kspu_dump.h
+
+ spu_kspu_code_obj-y += $(obj)/spu_main.o $(obj)/spu_runtime.o
++spu_kspu_code_obj-y += $(obj)/spu_kspu_ppu_skeleton.o
+ spu_kspu_code_obj-y += $(spu_kspu_code_obj-m)
+
+ $(obj)/spu_kspu: $(spu_kspu_code_obj-y)
+
+--- a/arch/powerpc/platforms/cell/spufs/spu_main.c
++++ b/arch/powerpc/platforms/cell/spufs/spu_main.c
+@@ -13,6 +13,7 @@
+
+ static spu_operation_t spu_ops[TOTAL_SPU_OPS] __attribute__((aligned(16))) = {
+ [SPU_OP_nop] = spu_nop,
++ [SPU_OP_my_add] = spu_my_add
+ };
+
+ static unsigned char kspu_buff[DMA_BUFFERS][DMA_MAX_TRANS_SIZE];
+
+--- a/arch/powerpc/platforms/cell/spufs/spu_runtime.h
++++ b/arch/powerpc/platforms/cell/spufs/spu_runtime.h
+@@ -25,5 +25,6 @@ void memcpy_aligned(void *dest, const vo
+ /* exported offloaded functions */
+ void spu_nop(struct kspu_job *kjob, void *buffer,
+ unsigned int buf_num);
+
++void spu_my_add(struct kspu_job *kjob, void *buffer,
++ unsigned int buf_num);
+
+ #endif
+
+--- a/include/asm-powerpc/kspu/merged_code.h
++++ b/include/asm-powerpc/kspu/merged_code.h
+@@ -14,6 +14,7 @@
+
+ enum SPU_OPERATIONS {
+ SPU_OP_nop,
++ SPU_OP_my_add,
+
+ TOTAL_SPU_OPS,
+ };
+@@ -23,6 +24,7 @@ struct kspu_job {
+ unsigned long long in __attribute__((aligned(16)));
+ unsigned int in_size __attribute__((aligned(16)));
+ union {
++ struct my_sum my_sum;
+ } __attribute__((aligned(16)));
+ };
+
--
* [patch 04/10] spufs: kspu doc skeleton
2007-08-16 20:01 [patch 00/10] KSPU API + AES offloaded to SPU + testing module Sebastian Siewior
` (2 preceding siblings ...)
2007-08-16 20:01 ` [patch 03/10] spufs: kspu documentation Sebastian Siewior
@ 2007-08-16 20:01 ` Sebastian Siewior
2007-08-16 20:01 ` [patch 05/10] spufs: kspu add required declarations Sebastian Siewior
` (5 subsequent siblings)
9 siblings, 0 replies; 16+ messages in thread
From: Sebastian Siewior @ 2007-08-16 20:01 UTC (permalink / raw)
To: cbe-oss-dev; +Cc: herbert, arnd, jk, linux-crypto, Sebastian Siewior
[-- Attachment #1: spufs-kspu_doc_skeleton.diff --]
[-- Type: text/plain, Size: 3162 bytes --]
Skeleton for PPU & SPU code that the documentation refers to.
Signed-off-by: Sebastian Siewior <sebastian@breakpoint.cc>
--- /dev/null
+++ b/Documentation/powerpc/kspu_ppu_skeleton.c
@@ -0,0 +1,98 @@
+/*
+ * KSPU skeleton - PPU part
+ *
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/completion.h>
+#include <asm/kspu/kspu.h>
+#include <asm/kspu/merged_code.h>
+
+struct spu_async_req {
+ struct kspu_context *kctx;
+ struct kspu_work_item kspu_work;
+ struct completion mcompletion;
+ void *n1p;
+ unsigned char n2;
+};
+
+static void my_add_finish_callback(struct kspu_work_item *kspu_work,
+ struct kspu_job *kjob)
+{
+ struct spu_async_req *req = container_of(kspu_work,
+ struct spu_async_req, kspu_work);
+
+
+ complete(&req->mcompletion);
+ return;
+}
+
+static int enqueue_on_spu(struct kspu_work_item *kspu_work)
+{
+ struct spu_async_req *req = container_of(kspu_work,
+ struct spu_async_req, kspu_work);
+ struct kspu_job *work_item;
+ struct my_sum *my_sum;
+ unsigned int i;
+
+ work_item = kspu_get_rb_slot(req->kctx);
+
+ my_sum = &work_item->my_sum;
+ work_item->operation = SPU_OP_my_add;
+
+ work_item->in = (unsigned long int) req->n1p;
+ my_sum->out = (unsigned long int) req->n1p;
+
+ for (i=0; i<16; i++)
+ my_sum->num[i] = req->n2;
+
+ kspu_work->notify = my_add_finish_callback;
+ kspu_mark_rb_slot_ready(req->kctx, kspu_work);
+ return 1;
+}
+
+static void enqueue_request(struct spu_async_req *req)
+{
+ struct kspu_work_item *work = &req->kspu_work;
+
+ work->enqueue = enqueue_on_spu;
+
+ kspu_enqueue_work_item(req->kctx, work, 0);
+}
+
+static unsigned char n1[16] __attribute__((aligned(16)));
+
+static int __init skeleton_init(void)
+{
+ struct spu_async_req req;
+ unsigned int i;
+
+ req.kctx = kspu_get_kctx();
+
+ for (i=0; i<sizeof(n1); i++)
+ n1[i] = i*i;
+
+ req.n1p = n1;
+ req.n2 = 20;
+
+ init_completion(&req.mcompletion);
+ enqueue_request(&req);
+
+ wait_for_completion_interruptible(&req.mcompletion);
+
+ for (i=0; i<sizeof(n1); i++)
+ printk("0x%02x ", n1[i]);
+
+ printk("\n");
+
+ /* don't load me plz */
+ return -ENOMEM;
+}
+
+static void __exit skeleton_finit(void)
+{
+}
+
+module_init(skeleton_init);
+module_exit(skeleton_finit);
+
+MODULE_DESCRIPTION("KSPU Skeleton for the PPU part");
+MODULE_AUTHOR("John Doe <john@the-doe-family.cc>");
+MODULE_LICENSE("GPL");
--- /dev/null
+++ b/Documentation/powerpc/kspu_spu_skeleton.c
@@ -0,0 +1,19 @@
+/*
+ * KSPU skeleton - SPU part
+ *
+ */
+
+#include <spu_intrinsics.h>
+#include <asm/kspu/spu_skeleton.h>
+#include <asm/kspu/merged_code.h>
+
+void spu_my_add(struct kspu_job *kjob, void *buffer, unsigned int buf_num)
+{
+ vector unsigned char sum1 = (*((vector unsigned char *)(buffer)));
+ struct my_sum *my_sum = &kjob->my_sum;
+ vector unsigned char sum2 = (*((vector unsigned char *)(my_sum->num)));
+ vector unsigned char sum;
+
+ sum = spu_add(sum1, sum2);
+ init_put_data(&sum, my_sum->out, 16, buf_num);
+}
--- /dev/null
+++ b/Documentation/powerpc/kspu_spu_skeleton.h
@@ -0,0 +1,9 @@
+#ifndef asm_kspu_spu_skeleton_h
+#define asm_kspu_spu_skeleton_h
+
+struct my_sum {
+ unsigned char num[16] __attribute__((aligned(16)));
+ unsigned long long out __attribute__((aligned(16)));
+};
+
+#endif
--
* [patch 05/10] spufs: kspu add required declarations
2007-08-16 20:01 [patch 00/10] KSPU API + AES offloaded to SPU + testing module Sebastian Siewior
` (3 preceding siblings ...)
2007-08-16 20:01 ` [patch 04/10] spufs: kspu doc skeleton Sebastian Siewior
@ 2007-08-16 20:01 ` Sebastian Siewior
2007-08-16 20:01 ` [patch 06/10] spufs: add kspu_alloc_context() Sebastian Siewior
` (4 subsequent siblings)
9 siblings, 0 replies; 16+ messages in thread
From: Sebastian Siewior @ 2007-08-16 20:01 UTC (permalink / raw)
To: cbe-oss-dev; +Cc: herbert, arnd, jk, linux-crypto, Sebastian Siewior
[-- Attachment #1: spufs-export_symbols2.diff --]
[-- Type: text/plain, Size: 1097 bytes --]
spu_run_init() and spu_run_fini() need to be used outside of spufs/run.c
but within spufs.ko.
Signed-off-by: Sebastian Siewior <sebastian@breakpoint.cc>
--- a/arch/powerpc/platforms/cell/spufs/run.c
+++ b/arch/powerpc/platforms/cell/spufs/run.c
@@ -126,7 +126,7 @@ out:
return ret;
}
-static int spu_run_init(struct spu_context *ctx, u32 *npc)
+int spu_run_init(struct spu_context *ctx, u32 *npc)
{
spuctx_switch_state(ctx, SPU_UTIL_SYSTEM);
@@ -160,8 +160,7 @@ static int spu_run_init(struct spu_conte
return 0;
}
-static int spu_run_fini(struct spu_context *ctx, u32 *npc,
- u32 *status)
+int spu_run_fini(struct spu_context *ctx, u32 *npc, u32 *status)
{
int ret = 0;
--- a/arch/powerpc/platforms/cell/spufs/spufs.h
+++ b/arch/powerpc/platforms/cell/spufs/spufs.h
@@ -254,6 +254,10 @@ void spu_sched_exit(void);
extern char *isolated_loader;
+/* sched */
+int spu_run_init(struct spu_context *ctx, u32 *npc);
+int spu_run_fini(struct spu_context *ctx, u32 *npc, u32 *status);
+
/*
* spufs_wait
* Same as wait_event_interruptible(), except that here
--
* [patch 06/10] spufs: add kspu_alloc_context()
2007-08-16 20:01 [patch 00/10] KSPU API + AES offloaded to SPU + testing module Sebastian Siewior
` (4 preceding siblings ...)
2007-08-16 20:01 ` [patch 05/10] spufs: kspu add required declarations Sebastian Siewior
@ 2007-08-16 20:01 ` Sebastian Siewior
2007-08-16 20:01 ` [patch 07/10] spufs: add kernel support for spu task Sebastian Siewior
` (3 subsequent siblings)
9 siblings, 0 replies; 16+ messages in thread
From: Sebastian Siewior @ 2007-08-16 20:01 UTC (permalink / raw)
To: cbe-oss-dev; +Cc: herbert, arnd, jk, linux-crypto, Sebastian Siewior
[-- Attachment #1: spufs-add_kspu_alloc_context.diff --]
[-- Type: text/plain, Size: 2132 bytes --]
The function behaves like alloc_spu_context(), but the init task's
information is used instead of current's, and the problem state bit is
removed.
Signed-off-by: Sebastian Siewior <sebastian@breakpoint.cc>
--- a/arch/powerpc/platforms/cell/spufs/context.c
+++ b/arch/powerpc/platforms/cell/spufs/context.c
@@ -32,7 +32,8 @@
atomic_t nr_spu_contexts = ATOMIC_INIT(0);
-struct spu_context *alloc_spu_context(struct spu_gang *gang)
+static struct spu_context *__alloc_spu_context(struct spu_gang *gang,
+ struct mm_struct *mm)
{
struct spu_context *ctx;
ctx = kzalloc(sizeof *ctx, GFP_KERNEL);
@@ -54,7 +55,7 @@ struct spu_context *alloc_spu_context(st
init_waitqueue_head(&ctx->mfc_wq);
ctx->state = SPU_STATE_SAVED;
ctx->ops = &spu_backing_ops;
- ctx->owner = get_task_mm(current);
+ ctx->owner = mm;
INIT_LIST_HEAD(&ctx->rq);
INIT_LIST_HEAD(&ctx->aff_list);
if (gang)
@@ -73,6 +74,37 @@ out:
return ctx;
}
+struct spu_context *alloc_spu_context(struct spu_gang *gang)
+{
+ struct mm_struct *mm;
+ struct spu_context *ctx;
+
+ mm = get_task_mm(current);
+ ctx = __alloc_spu_context(gang, mm);
+ if (!ctx)
+ mmput(mm);
+ return ctx;
+}
+
+struct spu_context *kspu_alloc_context(void)
+{
+ struct spu_context *ctx;
+
+ /* for a privileged spu context, we borrow all the task specific
+ * information from init_task.
+ */
+ atomic_inc(&init_mm.mm_users);
+ ctx = __alloc_spu_context(NULL, &init_mm);
+ if (!ctx) {
+ mmput(&init_mm);
+ return ctx;
+ }
+
+ /* remove problem state bit in order to access kernel memory */
+ ctx->csa.priv1.mfc_sr1_RW &= ~MFC_STATE1_PROBLEM_STATE_MASK;
+ return ctx;
+}
+
void destroy_spu_context(struct kref *kref)
{
struct spu_context *ctx;
--- a/arch/powerpc/platforms/cell/spufs/spufs.h
+++ b/arch/powerpc/platforms/cell/spufs/spufs.h
@@ -232,6 +232,7 @@ static inline void spu_release(struct sp
}
struct spu_context * alloc_spu_context(struct spu_gang *gang);
+struct spu_context *kspu_alloc_context(void);
void destroy_spu_context(struct kref *kref);
struct spu_context * get_spu_context(struct spu_context *ctx);
int put_spu_context(struct spu_context *ctx);
--
* [patch 07/10] spufs: add kernel support for spu task
2007-08-16 20:01 [patch 00/10] KSPU API + AES offloaded to SPU + testing module Sebastian Siewior
` (5 preceding siblings ...)
2007-08-16 20:01 ` [patch 06/10] spufs: add kspu_alloc_context() Sebastian Siewior
@ 2007-08-16 20:01 ` Sebastian Siewior
2007-08-18 16:48 ` Arnd Bergmann
2007-08-16 20:01 ` [patch 08/10] spufs: SPE side implementation of kspu Sebastian Siewior
` (2 subsequent siblings)
9 siblings, 1 reply; 16+ messages in thread
From: Sebastian Siewior @ 2007-08-16 20:01 UTC (permalink / raw)
To: cbe-oss-dev; +Cc: herbert, arnd, jk, linux-crypto, Sebastian Siewior
[-- Attachment #1: spufs-add_kspu_ppu_side.diff --]
[-- Type: text/plain, Size: 23667 bytes --]
Utilization of SPUs by the kernel, main implementation.
Functions that are offloaded to the SPU must be split into two parts:
- SPU part (executing)
- PPU part (prepare/glue)
The SPU part expects a buffer and maybe some other parameters and performs
the work on the buffer. After the work/job is done, it requests the
transfer back into main memory.
The PPU part needs to split the information into this kind of job. Every
job consists of one buffer (16 KiB max) and a few parameters. Once
everything is prepared, the request is added to a list. There is a soft
limit for the number of requests that fit into this list. Once the limit
is reached, all requests are dropped (unless a flag is passed in order not
to). The limit makes sure the user is not trying to process faster than
the SPU is capable of. The "queue anyway" flag is necessary because under
some circumstances the user may not be able to drop the request or try
again later.
A separate thread dequeues the request(s) from the list and calls a user
supplied function in order to enqueue this request in a ring buffer which is
located on the SPU. This transit stop enables
- enqueuing items if the ring buffer is full (not the list)
- enqueuing items from non-blocking context
After the callback function returns, the SPU starts the work "immediately".
Once the SPU has performed the work, KSPU invokes another callback to inform
the user that his request is complete.
The PPU code is responsible for proper alignment & transfer size.
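From the user's point of view this boils down to two callbacks and two calls
into KSPU; a minimal sketch (hypothetical names, condensed from the skeleton
in patch 04):

	static int my_enqueue(struct kspu_work_item *work)
	{
		struct kspu_job *slot = kspu_get_rb_slot(kspu_get_kctx());

		/* fill slot->operation, slot->in, slot->in_size, ... */
		work->notify = my_notify;
		kspu_mark_rb_slot_ready(kspu_get_kctx(), work);
		return 1;
	}

	/* possibly from softirq context */
	work->enqueue = my_enqueue;
	kspu_enqueue_work_item(kspu_get_kctx(), work, 0);
	/* my_notify() is called later from the KSPU kthread */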
Signed-off-by: Sebastian Siewior <sebastian@breakpoint.cc>
--- a/arch/powerpc/platforms/cell/Kconfig
+++ b/arch/powerpc/platforms/cell/Kconfig
@@ -54,6 +54,13 @@ config SPU_BASE
bool
default n
+config KSPU
+ bool "Support for utilisation of SPU by the kernel"
+ depends on SPU_FS && EXPERIMENTAL
+ help
+ With this option enabled, the kernel is able to utilize the SPUs for its
+ own tasks.
+
config CBE_RAS
bool "RAS features for bare metal Cell BE"
depends on PPC_CELL_NATIVE
--- a/arch/powerpc/platforms/cell/spufs/Makefile
+++ b/arch/powerpc/platforms/cell/spufs/Makefile
@@ -3,6 +3,7 @@ obj-y += switch.o fault.o lscsa_alloc.o
obj-$(CONFIG_SPU_FS) += spufs.o
spufs-y += inode.o file.o context.o syscalls.o coredump.o
spufs-y += sched.o backing_ops.o hw_ops.o run.o gang.o
+spufs-$(CONFIG_KSPU) += kspu.o
# Rules to build switch.o with the help of SPU tool chain
SPU_CROSS := spu-
--- a/arch/powerpc/platforms/cell/spufs/inode.c
+++ b/arch/powerpc/platforms/cell/spufs/inode.c
@@ -791,10 +791,17 @@ static int __init spufs_init(void)
if (ret)
goto out_syscalls;
+ ret = kspu_init();
+ if (ret)
+ goto out_archcoredump;
+
spufs_init_isolated_loader();
return 0;
+out_archcoredump:
+ printk("kspu_init() failed\n");
+ unregister_arch_coredump_calls(&spufs_coredump_calls);
out_syscalls:
unregister_spu_syscalls(&spufs_calls);
out_fs:
@@ -804,12 +811,14 @@ out_sched:
out_cache:
kmem_cache_destroy(spufs_inode_cache);
out:
+ printk("spufs init not performed\n");
return ret;
}
module_init(spufs_init);
static void __exit spufs_exit(void)
{
+ kspu_exit();
spu_sched_exit();
spufs_exit_isolated_loader();
unregister_arch_coredump_calls(&spufs_coredump_calls);
--- /dev/null
+++ b/arch/powerpc/platforms/cell/spufs/kspu.c
@@ -0,0 +1,645 @@
+/*
+ * Interface for accessing SPUs from the kernel.
+ *
+ * Author: Sebastian Siewior <sebastian@breakpoint.cc>
+ * License: GPLv2
+ *
+ * Utilization of SPUs by the kernel, main implementation.
+ * Functions that are offloaded to the SPU must be split into two parts:
+ * - SPU part (executing)
+ * - PPU part (prepare/glue)
+ *
+ * The SPU part expects a buffer and maybe some other parameters and performs
+ * the work on the buffer. After the work/job is done, it requests the
+ * transfer back into main memory.
+ * The PPU part needs to split the information into this kind of job. Every
+ * job consists of one buffer (16 KiB max) and a few parameters. Once
+ * everything is prepared, the request is added to a list. There is a soft
+ * limit for the number of requests that fit into this list. Once the limit
+ * is reached, all requests are dropped (unless a flag is passed in order not
+ * to). The limit makes sure the user is not trying to process faster than
+ * the SPU is capable of. The "queue anyway" flag is necessary because under
+ * some circumstances the user may not be able to drop the request or try
+ * again later.
+ * A separate thread dequeues the request(s) from the list and calls a user
+ * supplied function in order to enqueue this request in a ring buffer which is
+ * located on the SPU. This transit stop enables
+ * - enqueuing items if the ring buffer is full (not the list)
+ * - enqueuing items from non-blocking context
+ * After the callback function returns, the SPU starts the work
+ * "immediately". Once the SPU has performed the work, KSPU invokes another
+ * callback to inform the user that his request is complete.
+ * The PPU code is responsible for proper alignment & transfer size.
+ */
+
+#include <asm/spu_priv1.h>
+#include <asm/kspu/kspu.h>
+#include <asm/kspu/merged_code.h>
+#include <linux/kthread.h>
+#include <linux/module.h>
+#include <linux/init_task.h>
+#include <linux/hardirq.h>
+#include <linux/kernel.h>
+
+#include "spufs.h"
+#include "kspu_util.h"
+#include "spu_kspu_dump.h"
+
+static struct kspu_code single_spu_code = {
+ .code = spu_kspu_code,
+ .code_len = sizeof(spu_kspu_code),
+ .kspu_data_offset = KERNEL_SPU_DATA_OFFSET,
+ .queue_mask = RB_SLOTS-1,
+ .queue_entr_size = sizeof(struct kspu_job),
+};
+
+static void free_kspu_context(struct kspu_context *kctx)
+{
+ struct spu_context *spu_ctx = kctx->spu_ctx;
+ int ret;
+
+ if (spu_ctx->owner)
+ spu_forget(spu_ctx);
+ ret = put_spu_context(spu_ctx);
+ WARN_ON(!ret);
+ kfree(kctx->notify_cb_info);
+ kfree(kctx);
+}
+
+static void setup_stack(struct kspu_context *kctx)
+{
+ struct spu_context *ctx = kctx->spu_ctx;
+ u8 *ls;
+ u32 *u32p;
+
+ spu_acquire_saved(ctx);
+ ls = ctx->ops->get_ls(ctx);
+
+#define BACKCHAIN (kctx->spu_code->kspu_data_offset - 16)
+#define STACK_GAP 176
+#define INITIAL_STACK (BACKCHAIN - STACK_GAP)
+
+ BUG_ON(INITIAL_STACK > KSPU_LS_SIZE);
+
+ u32p = (u32 *) &ls[BACKCHAIN];
+ u32p[0] = 0;
+ u32p[1] = 0;
+ u32p[2] = 0;
+ u32p[3] = 0;
+
+ u32p = (u32 *) &ls[INITIAL_STACK];
+ u32p[0] = BACKCHAIN;
+ u32p[1] = 0;
+ u32p[2] = 0;
+ u32p[3] = 0;
+
+ ctx->csa.lscsa->gprs[1].slot[0] = INITIAL_STACK;
+ spu_release(ctx);
+ pr_debug("SPU's stack ready 0x%04x\n", INITIAL_STACK);
+}
+
+static struct kspu_context *__init kcreate_spu_context(int flags,
+ struct kspu_code *spu_code)
+{
+ struct kspu_context *kctx;
+ struct spu_context *ctx;
+ unsigned int ret;
+ u8 *ls;
+
+ flags |= SPU_CREATE_EVENTS_ENABLED;
+ ret = -EINVAL;
+
+ if (flags & (~SPU_CREATE_FLAG_ALL))
+ goto err;
+ /*
+ * it must be a multiple of 16 because this value is used to calculate
+ * the initial stack frame which must be 16byte aligned
+ */
+ if (spu_code->kspu_data_offset & 15)
+ goto err;
+
+ pr_debug("SPU's queue: %d elements, %d bytes each (%d bytes total)\n",
+ spu_code->queue_mask+1, spu_code->queue_entr_size,
+ (spu_code->queue_mask+1) * spu_code->queue_entr_size);
+
+ ret = -EFBIG;
+ if (spu_code->code_len > KSPU_LS_SIZE)
+ goto err;
+
+ ret = -ENOMEM;
+ kctx = kzalloc(sizeof *kctx, GFP_KERNEL);
+ if (!kctx)
+ goto err;
+
+ kctx->qlen = 0;
+ kctx->spu_code = spu_code;
+ init_waitqueue_head(&kctx->newitem_wq);
+ spin_lock_init(&kctx->queue_lock);
+ INIT_LIST_HEAD(&kctx->work_queue);
+ kctx->notify_cb_info = kzalloc(sizeof(*kctx->notify_cb_info) *
+ (kctx->spu_code->queue_mask + 1), GFP_KERNEL);
+ if (!kctx->notify_cb_info)
+ goto err_notify;
+
+ ctx = kspu_alloc_context();
+ if (!ctx)
+ goto err_spu_ctx;
+
+ kctx->spu_ctx = ctx;
+ ctx->flags = flags;
+
+ spu_acquire(ctx);
+ ls = ctx->ops->get_ls(ctx);
+ memcpy(ls, spu_code->code, spu_code->code_len);
+ spu_release(ctx);
+ setup_stack(kctx);
+
+ return kctx;
+err_spu_ctx:
+ kfree(kctx->notify_cb_info);
+
+err_notify:
+ kfree(kctx);
+err:
+ return ERR_PTR(ret);
+}
+
+/**
+ * kspu_get_rb_slot - get a free slot to queue a work request on the SPU.
+ * @kctx: kspu context, where the free slot is required
+ *
+ * Returns a free slot where a request may be queued on. Repeated calls will
+ * return the same slot until it is marked as taken (by
+ * kspu_mark_rb_slot_ready()).
+ */
+struct kspu_job *kspu_get_rb_slot(struct kspu_context *kctx)
+{
+ struct kspu_ring_data *ring_data;
+ unsigned char *ls;
+ unsigned int outstanding;
+ unsigned int queue_mask;
+ unsigned int notified;
+
+ ls = kctx->spu_ctx->ops->get_ls(kctx->spu_ctx);
+ ls += kctx->spu_code->kspu_data_offset;
+ ring_data = (struct kspu_ring_data *) ls;
+
+ queue_mask = kctx->spu_code->queue_mask;
+ outstanding = ring_data->outstanding;
+ notified = kctx->last_notified;
+
+ /* without the & an overflow won't be detected */
+ if (((outstanding + 1) & queue_mask) == (notified & queue_mask)) {
+ return NULL;
+ }
+
+ ls += sizeof(struct kspu_ring_data);
+ /* ls points now to the first queue slot */
+ ls += kctx->spu_code->queue_entr_size * (outstanding & queue_mask);
+
+ pr_debug("Return slot %d, at %p\n", (outstanding & queue_mask), ls);
+ return (struct kspu_job *) ls;
+}
+EXPORT_SYMBOL_GPL(kspu_get_rb_slot);
+
+/*
+ * kspu_mark_rb_slot_ready - mark a request valid.
+ * @kctx: kspu context that the request belongs to
+ * @work: work item that is used for notification. May be NULL.
+ *
+ * The slot will be marked as valid and not returned by kspu_get_rb_slot()
+ * until the request is processed. If @work is not NULL, work->notify will be
+ * called to notify the user that his request is done.
+ */
+void kspu_mark_rb_slot_ready(struct kspu_context *kctx,
+ struct kspu_work_item *work)
+{
+ struct kspu_ring_data *ring_data;
+ unsigned char *ls;
+ unsigned int outstanding;
+ unsigned int queue_mask;
+
+ ls = kctx->spu_ctx->ops->get_ls(kctx->spu_ctx);
+ ls += kctx->spu_code->kspu_data_offset;
+ ring_data = (struct kspu_ring_data *) ls;
+
+ queue_mask = kctx->spu_code->queue_mask;
+ outstanding = ring_data->outstanding;
+ kctx->notify_cb_info[outstanding & queue_mask] = work;
+ pr_debug("item ready: outs %d, notification data %p\n",
+ outstanding & queue_mask, work);
+ outstanding++;
+ BUG_ON((outstanding & queue_mask) == (kctx->last_notified & queue_mask));
+ ring_data->outstanding = outstanding;
+}
+EXPORT_SYMBOL_GPL(kspu_mark_rb_slot_ready);
+
+static int notify_done_reqs(struct kspu_context *kctx)
+{
+ struct kspu_ring_data *ring_data;
+ struct kspu_work_item *kspu_work;
+ unsigned char *kjob;
+ unsigned char *ls;
+ unsigned int current_notify;
+ unsigned int queue_mask;
+ unsigned ret = 0;
+
+ ls = kctx->spu_ctx->ops->get_ls(kctx->spu_ctx);
+ ls += kctx->spu_code->kspu_data_offset;
+ ring_data = (struct kspu_ring_data *) ls;
+ ls += sizeof(struct kspu_ring_data);
+
+ current_notify = kctx->last_notified;
+ queue_mask = kctx->spu_code->queue_mask;
+ pr_debug("notify| %d | %d (%d | %d)\n", current_notify & queue_mask,
+ ring_data->consumed & queue_mask,
+ current_notify, ring_data->consumed);
+
+ while (ring_data->consumed != current_notify) {
+
+ pr_debug("do notify %d. (consumed = %d)\n", current_notify, ring_data->consumed);
+
+ kspu_work = kctx->notify_cb_info[current_notify & queue_mask];
+ if (likely(kspu_work)) {
+ kjob = (unsigned char *) ls +
+ kctx->spu_code->queue_entr_size *
+ (current_notify & queue_mask);
+ kspu_work->notify(kspu_work, (struct kspu_job *) kjob);
+ }
+
+ current_notify++;
+ ret = 1;
+ }
+
+ kctx->last_notified = current_notify;
+ pr_debug("notify done\n");
+ return ret;
+}
+
+static int queue_requests(struct kspu_context *kctx)
+{
+ int ret;
+ int empty;
+ int queued = 0;
+ struct kspu_work_item *work;
+
+ WARN_ON(in_irq());
+ while(1) {
+ if (!kspu_get_rb_slot(kctx))
+ break;
+
+ spin_lock_bh(&kctx->queue_lock);
+ empty = list_empty(&kctx->work_queue);
+ if (unlikely(empty)) {
+ work = NULL;
+ } else {
+ work = list_first_entry(&kctx->work_queue,
+ struct kspu_work_item, list);
+ list_del(&work->list);
+ kctx->qlen--;
+ }
+ spin_unlock_bh(&kctx->queue_lock);
+
+ if (!work)
+ break;
+
+ pr_debug("Adding item %p to queue\n", work);
+ ret = work->enqueue(work);
+ if (unlikely(ret == 0)) {
+ pr_debug("Adding item %p again to list.\n", work);
+ spin_lock_bh(&kctx->queue_lock);
+ list_add(&work->list, &kctx->work_queue);
+ kctx->qlen++;
+ spin_unlock_bh(&kctx->queue_lock);
+ break;
+ }
+
+ queued = 1;
+ }
+ pr_debug("Queue requests done. => %d\n", queued);
+ return queued;
+}
+
+/**
+ * kspu_enqueue_work_item - Enqueue a request that supposed to be queued on the
+ * SPU.
+ * @kctx: kspu context that should be used.
+ * @work: Work item that should be placed on the SPU
+ *
+ * The function puts the work item in a list belonging to the kctx. If the
+ * queue is full (KSPU_MAX_QUEUE_LENGTH limit) the request will be discarded
+ * unless the KSPU_MUST_BACKLOG flag has been specified. The flag should be
+ * specified if the user can't drop the request or try again later (softirq).
+ * Once a SPU slot is available, the user supplied enqueue function
+ * (work->enqueue) will be called from a kthread context. The user may then
+ * enqueue the request on the SPU. This function may be called from softirq.
+ *
+ * Returns: -EINPROGRESS if the work item is enqueued,
+ * -EBUSY if the queue is full and the user should slow down. The packet
+ * is discarded unless KSPU_MUST_BACKLOG has been passed.
+ */
+int kspu_enqueue_work_item(struct kspu_context *kctx,
+ struct kspu_work_item *work, unsigned int flags)
+{
+ int ret = -EINPROGRESS;
+
+ spin_lock_bh(&kctx->queue_lock);
+ if (unlikely(kctx->qlen > KSPU_MAX_QUEUE_LENGTH)) {
+
+ ret = -EBUSY;
+ if (flags != KSPU_MUST_BACKLOG) {
+ spin_unlock_bh(&kctx->queue_lock);
+ return ret;
+ }
+ }
+
+ kctx->qlen++;
+ list_add_tail(&work->list, &kctx->work_queue);
+
+ spin_unlock_bh(&kctx->queue_lock);
+ wake_up_all(&kctx->newitem_wq);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(kspu_enqueue_work_item);
+
+static int pending_spu_work(struct kspu_context *kctx)
+{
+ struct kspu_ring_data *ring_data;
+ unsigned char *ls;
+
+ ls = kctx->spu_ctx->ops->get_ls(kctx->spu_ctx);
+ ls += kctx->spu_code->kspu_data_offset;
+ ring_data = (struct kspu_ring_data *) ls;
+
+ pr_debug("pending spu work status: %u == %u ?\n",
+ ring_data->consumed,
+ ring_data->outstanding);
+ if (ring_data->consumed == ring_data->outstanding )
+ return 0;
+
+ return 1;
+}
+
+/*
+ * Fill dummy requests in the ring buffer. Dummy requests are required
+ * to let MFC "transfer" data if there are not enough real requests.
+ * Transfers with a size of 0 bytes are nops for the MFC
+ */
+static void kspu_fill_dummy_reqs(struct kspu_context *kctx)
+{
+
+ struct kspu_ring_data *ring_data;
+ unsigned char *ls;
+ unsigned int requests;
+ unsigned int queue_mask;
+ unsigned int outstanding;
+ unsigned int consumed;
+ struct kspu_job *kjob;
+ int i;
+
+ ls = kctx->spu_ctx->ops->get_ls(kctx->spu_ctx);
+ ls += kctx->spu_code->kspu_data_offset;
+ ring_data = (struct kspu_ring_data *) ls;
+ queue_mask = kctx->spu_code->queue_mask;
+
+ outstanding = ring_data->outstanding;
+ consumed = ring_data->consumed;
+
+ requests = outstanding - consumed;
+
+ if (requests >= DMA_BUFFERS * 2)
+ return;
+
+ for (i = requests; i < (DMA_BUFFERS * 2); i++) {
+ kjob = kspu_get_rb_slot(kctx);
+ kjob->operation = SPU_OP_nop;
+ kjob->in_size = 0;
+ kspu_mark_rb_slot_ready(kctx, NULL);
+ }
+}
+
+static void print_kctx_debug(struct kspu_context *kctx)
+{
+ struct kspu_job *kjob;
+ struct kspu_ring_data *ring_data;
+ unsigned char *ls, *new_queue;
+ unsigned int requests, consumed, outstanding;
+ unsigned int queue_mask;
+ unsigned int i;
+
+ ls = kctx->spu_ctx->ops->get_ls(kctx->spu_ctx);
+ ls += kctx->spu_code->kspu_data_offset;
+ ring_data = (struct kspu_ring_data *) ls;
+ ls += sizeof(struct kspu_ring_data);
+
+ consumed = ring_data->consumed;
+ outstanding = ring_data->outstanding;
+
+ if (likely(outstanding > consumed))
+ requests = outstanding - consumed;
+ else
+ requests = UINT_MAX - consumed + outstanding +1;
+
+ queue_mask = kctx->spu_code->queue_mask;
+ /* show the last two processed as well */
+ requests +=2;
+ consumed -=2;
+
+ printk(KERN_ERR "Consumed: %d Outstanding: %d (%d)\n", consumed, outstanding, requests);
+ if (requests > 10)
+ requests = 10;
+
+ for (i = 0; i < requests; i++) {
+ new_queue = ls + kctx->spu_code->queue_entr_size * (consumed & queue_mask);
+ kjob = (struct kspu_job *) new_queue;
+
+ printk(KERN_ERR "Request: %d function: %d src addr: %08llx, length: %d\n",
+ consumed & queue_mask, kjob->operation, kjob->in, kjob->in_size);
+ consumed++;
+ }
+}
+
+/*
+ * based on run.c spufs_run_spu
+ */
+static int spufs_run_kernel_spu(void *priv)
+{
+ struct kspu_context *kctx = (struct kspu_context *) priv;
+ struct spu_context *ctx = kctx->spu_ctx;
+ int ret;
+ u32 status;
+ unsigned int npc = 0;
+ int fastpath;
+ DEFINE_WAIT(wait_for_stop);
+ DEFINE_WAIT(wait_for_ibox);
+ DEFINE_WAIT(wait_for_newitem);
+
+ spu_enable_spu(ctx);
+ ctx->event_return = 0;
+ spu_acquire(ctx);
+ if (ctx->state == SPU_STATE_SAVED) {
+ __spu_update_sched_info(ctx);
+
+ ret = spu_activate(ctx, 0);
+ if (ret) {
+ spu_release(ctx);
+ printk(KERN_ERR "could not obtain runnable spu: %d\n",
+ ret);
+ BUG();
+ }
+ } else {
+ /*
+ * We have to update the scheduling priority under active_mutex
+ * to protect against find_victim().
+ */
+ spu_update_sched_info(ctx);
+ }
+
+ spu_run_init(ctx, &npc);
+ do {
+ fastpath = 0;
+ prepare_to_wait(&ctx->stop_wq, &wait_for_stop,
+ TASK_INTERRUPTIBLE);
+ prepare_to_wait(&ctx->ibox_wq, &wait_for_ibox,
+ TASK_INTERRUPTIBLE);
+ prepare_to_wait(&kctx->newitem_wq, &wait_for_newitem,
+ TASK_INTERRUPTIBLE);
+
+ if (unlikely(test_and_clear_bit(SPU_SCHED_NOTIFY_ACTIVE,
+ &ctx->sched_flags))) {
+
+ if (!(status & SPU_STATUS_STOPPED_BY_STOP)) {
+ spu_switch_notify(ctx->spu, ctx);
+ }
+ }
+
+ spuctx_switch_state(ctx, SPU_UTIL_SYSTEM);
+
+ pr_debug("going to handle class1\n");
+ ret = spufs_handle_class1(ctx);
+ if (unlikely(ret)) {
+ /*
+ * SPE_EVENT_SPE_DATA_STORAGE => reference to invalid memory
+ */
+ printk(KERN_ERR "Invalid memory dereferenced by the"
+ " spu: %d\n", ret);
+ BUG();
+ }
+
+ /* FIXME BUG: We need a physical SPU to discover
+ * ctx->spu->class_0_pending. It is not saved on context
+ * switch. We may lose this on context switch.
+ */
+ status = ctx->ops->status_read(ctx);
+ if (unlikely((ctx->spu && ctx->spu->class_0_pending) ||
+ status & SPU_STATUS_INVALID_INSTR)) {
+ printk(KERN_ERR "kspu error, status_register: 0x%08x\n",
+ status);
+ printk(KERN_ERR "event return: 0x%08lx, spu's npc: "
+ "0x%08x\n", kctx->spu_ctx->event_return,
+ kctx->spu_ctx->ops->npc_read(
+ kctx->spu_ctx));
+ printk(KERN_ERR "class_0_pending: 0x%lx\n", ctx->spu->class_0_pending);
+ print_kctx_debug(kctx);
+ BUG();
+ }
+
+ if (notify_done_reqs(kctx))
+ fastpath = 1;
+
+ if (queue_requests(kctx))
+ fastpath = 1;
+
+ if (!(status & SPU_STATUS_RUNNING)) {
+ /* spu is currently not running */
+ pr_debug("SPU not running, last stop code was: %08x\n",
+ status >> SPU_STOP_STATUS_SHIFT);
+ if (pending_spu_work(kctx)) {
+ /* spu should run again */
+ pr_debug("Activate SPU\n");
+ kspu_fill_dummy_reqs(kctx);
+
+ spu_run_fini(ctx, &npc, &status);
+ spu_acquire_runnable(ctx, 0);
+ spu_run_init(ctx, &npc);
+ } else {
+ /* spu finished work */
+ pr_debug("SPU will remain in stop state\n");
+ spu_run_fini(ctx, &npc, &status);
+ spu_yield(ctx);
+ spu_acquire(ctx);
+ }
+ } else {
+ pr_debug("SPU is running, switch state to util user\n");
+ spuctx_switch_state(ctx, SPU_UTIL_USER);
+ }
+
+ if (fastpath)
+ continue;
+
+ spu_release(ctx);
+ schedule();
+ spu_acquire(ctx);
+
+ } while (!kthread_should_stop() || !list_empty(&kctx->work_queue));
+
+ finish_wait(&ctx->stop_wq, &wait_for_stop);
+ finish_wait(&ctx->ibox_wq, &wait_for_ibox);
+ finish_wait(&kctx->newitem_wq, &wait_for_newitem);
+
+ spu_release(ctx);
+ spu_disable_spu(ctx);
+ return 0;
+}
+
+static struct kspu_context *kspu_ctx;
+
+/**
+ * kspu_get_kctx - return a kspu context.
+ *
+ * Returns a kspu_context that identifies the SPU context used by the kernel.
+ * Right now only one static context exists which may be used by multiple users.
+ */
+struct kspu_context *kspu_get_kctx(void)
+{
+ return kspu_ctx;
+}
+EXPORT_SYMBOL_GPL(kspu_get_kctx);
+
+int __init kspu_init(void)
+{
+ int ret;
+
+ pr_debug("code @%p, len %d, offset 0x%08x, elements: %d,"
+ "element size: %d\n", single_spu_code.code,
+ single_spu_code.code_len,
+ single_spu_code.kspu_data_offset,
+ single_spu_code.queue_mask,
+ single_spu_code.queue_entr_size);
+ kspu_ctx = kcreate_spu_context(0, &single_spu_code);
+ if (IS_ERR(kspu_ctx)) {
+ ret = PTR_ERR(kspu_ctx);
+ goto out;
+ }
+
+ /* kthread_run */
+ kspu_ctx->thread = kthread_create(spufs_run_kernel_spu, kspu_ctx,
+ "spucode");
+ if (IS_ERR(kspu_ctx->thread)) {
+ ret = PTR_ERR(kspu_ctx->thread);
+ goto err_kspu_ctx;
+ }
+ wake_up_process(kspu_ctx->thread);
+
+ return 0;
+err_kspu_ctx:
+ free_kspu_context(kspu_ctx);
+out:
+ return ret;
+}
+
+void __exit kspu_exit(void)
+{
+ kthread_stop(kspu_ctx->thread);
+ free_kspu_context(kspu_ctx);
+}
--- /dev/null
+++ b/arch/powerpc/platforms/cell/spufs/kspu_util.h
@@ -0,0 +1,30 @@
+#ifndef KSPU_UTIL_H
+#define KSPU_UTIL_H
+#include <linux/wait.h>
+
+struct kspu_code {
+ const unsigned int *code;
+ unsigned int code_len;
+ unsigned int kspu_data_offset;
+ unsigned int queue_mask;
+ unsigned int queue_entr_size;
+};
+
+struct notify_cb_info {
+ void *notify;
+};
+
+struct kspu_context {
+ struct spu_context *spu_ctx;
+ wait_queue_head_t newitem_wq;
+ void **notify_cb_info;
+ unsigned int last_notified;
+ struct kspu_code *spu_code;
+ struct task_struct *thread;
+ /* spinlock protects qlen + work_queue */
+ spinlock_t queue_lock;
+ unsigned int qlen;
+ struct list_head work_queue;
+};
+
+#endif
--- a/arch/powerpc/platforms/cell/spufs/spufs.h
+++ b/arch/powerpc/platforms/cell/spufs/spufs.h
@@ -344,4 +344,18 @@ static inline void spuctx_switch_state(s
}
}
+#ifdef CONFIG_KSPU
+int __init kspu_init(void);
+void __exit kspu_exit(void);
+#else
+static inline int kspu_init(void)
+{
+ return 0;
+}
+
+static inline void kspu_exit(void)
+{
+}
+#endif
+
#endif
--- /dev/null
+++ b/include/asm-powerpc/kspu/kspu.h
@@ -0,0 +1,35 @@
+#ifndef KSPU_KSPU_H
+#define KSPU_KSPU_H
+#include <linux/list.h>
+#include <asm/kspu/merged_code.h>
+
+/*
+ * If the queue is full, the request must be accepted (it can't be dropped).
+ * The user that uses this flag should make sure that further requests arrive
+ * more slowly
+ */
+#define KSPU_MUST_BACKLOG 0x1
+
+/*
+ * Max number of requests that may be in the queue. All following items are
+ * discarded if KSPU_MUST_BACKLOG is not specified (it seems that the SPE
+ * is not working fast enough).
+ */
+#define KSPU_MAX_QUEUE_LENGTH 400
+
+struct kspu_work_item {
+ struct list_head list;
+ int (*enqueue)(struct kspu_work_item *);
+ void (*notify)(struct kspu_work_item *, struct kspu_job *);
+};
+
+struct kspu_context;
+
+struct kspu_job *kspu_get_rb_slot(struct kspu_context *kspu);
+void kspu_mark_rb_slot_ready(struct kspu_context *kspu,
+ struct kspu_work_item *work);
+int kspu_enqueue_work_item(struct kspu_context *kctx,
+ struct kspu_work_item *work, unsigned int flags);
+struct kspu_context *kspu_get_kctx(void);
+
+#endif
--
* [patch 08/10] spufs: SPE side implementation of kspu
2007-08-16 20:01 [patch 00/10] KSPU API + AES offloaded to SPU + testing module Sebastian Siewior
` (6 preceding siblings ...)
2007-08-16 20:01 ` [patch 07/10] spufs: add kernel support for spu task Sebastian Siewior
@ 2007-08-16 20:01 ` Sebastian Siewior
2007-08-16 20:01 ` [patch 09/10] spufs: SPU-AES support (kernel side) Sebastian Siewior
2007-08-16 20:01 ` [patch 10/10] cryptoapi: async speed test Sebastian Siewior
9 siblings, 0 replies; 16+ messages in thread
From: Sebastian Siewior @ 2007-08-16 20:01 UTC (permalink / raw)
To: cbe-oss-dev; +Cc: herbert, arnd, jk, linux-crypto, Sebastian Siewior
[-- Attachment #1: spufs-add_kspu_spu_side.diff --]
[-- Type: text/plain, Size: 8319 bytes --]
This is the SPU part of KSPU, which consists of a multiplexor and one helper
function. The multiplexor invokes the offloaded functions and performs
multi-buffering (DMA_BUFFERS=2 -> double buffering, DMA_BUFFERS=3 -> triple,
and so on).
The offloaded function cares only about processing the buffer and arranging
the transfer of the result. Waiting for the transfers to complete as well as
signaling the completion of functions is taken care of by the multiplexor.
Signed-off-by: Sebastian Siewior <sebastian@breakpoint.cc>
--- a/arch/powerpc/platforms/cell/spufs/Makefile
+++ b/arch/powerpc/platforms/cell/spufs/Makefile
@@ -12,13 +12,21 @@ SPU_AS := $(SPU_CROSS)gcc
SPU_LD := $(SPU_CROSS)ld
SPU_OBJCOPY := $(SPU_CROSS)objcopy
SPU_CFLAGS := -O2 -Wall -I$(srctree)/include \
- -I$(objtree)/include2 -D__KERNEL__
+ -I$(objtree)/include2 -D__KERNEL__ -ffreestanding
SPU_AFLAGS := -c -D__ASSEMBLY__ -I$(srctree)/include \
-I$(objtree)/include2 -D__KERNEL__
SPU_LDFLAGS := -N -Ttext=0x0
$(obj)/switch.o: $(obj)/spu_save_dump.h $(obj)/spu_restore_dump.h
-clean-files := spu_save_dump.h spu_restore_dump.h
+clean-files := spu_save_dump.h spu_restore_dump.h spu_kspu_dump.h
+
+$(obj)/kspu.o: $(obj)/spu_kspu_dump.h
+
+spu_kspu_code_obj-y += $(obj)/spu_main.o $(obj)/spu_runtime.o
+spu_kspu_code_obj-y += $(spu_kspu_code_obj-m)
+
+$(obj)/spu_kspu: $(spu_kspu_code_obj-y)
+ $(call if_changed,spu_ld)
# Compile SPU files
cmd_spu_cc = $(SPU_CC) $(SPU_CFLAGS) -c -o $@ $<
--- /dev/null
+++ b/arch/powerpc/platforms/cell/spufs/spu_main.c
@@ -0,0 +1,116 @@
+/*
+ * This code can be considered as crt0.S
+ * Compile with -O[123S] and make sure that there is only one function
+ * that starts at 0x0
+ * Author: Sebastian Siewior <sebastian@breakpoint.cc>
+ * License: GPLv2
+ */
+#include <asm/kspu/merged_code.h>
+#include <spu_mfcio.h>
+#include "spu_runtime.h"
+
+static spu_operation_t spu_ops[TOTAL_SPU_OPS] __attribute__((aligned(16))) = {
+ [SPU_OP_nop] = spu_nop,
+};
+static unsigned char kspu_buff[DMA_BUFFERS][DMA_MAX_TRANS_SIZE];
+
+void _start(void) __attribute__((noreturn));
+void _start(void)
+{
+ struct kernel_spu_data *spu_data;
+
+ spu_data = (struct kernel_spu_data *) KERNEL_SPU_DATA_OFFSET;
+
+ while (37) {
+ struct kspu_job *kjob;
+ unsigned char *dma_buff;
+ unsigned int consumed;
+ unsigned int outstanding;
+ unsigned int cur_req;
+ unsigned int cur_item;
+ unsigned int cur_buf;
+ unsigned int i;
+
+ spu_stop(1);
+ /*
+ * Once started, it is guaranteed that at least DMA_BUFFERS * 2
+ * requests are in the ring buffer. The work order is:
+ * 1. request DMA_BUFFERS transfers, each in a separate buffer
+ * with its own tag.
+ * 2. process those buffers and request new ones.
+ * 3. if more than (DMA_BUFFERS * 2) are available, then the
+ * main loop begins:
+ * - wait for tag to finish transfers
+ * - notify done work
+ * - process request
+ * - write back
+ * 4. if no more requests are available, process the last
+ * DMA_BUFFERS requests that are left, write them back and
+ * wait until those transfers complete and spu_stop()
+ */
+
+ consumed = spu_data->kspu_ring_data.consumed;
+ cur_req = consumed;
+ cur_item = consumed;
+
+ /* 1 */
+ for (cur_buf = 0; cur_buf < DMA_BUFFERS; cur_buf++) {
+ init_get_data(kspu_buff[cur_buf & DMA_BUFF_MASK],
+ &spu_data->work_item[cur_req & RB_MASK],
+ cur_buf & DMA_BUFF_MASK);
+ cur_req++;
+ }
+
+ /* 2 */
+ for (cur_buf = 0; cur_buf < DMA_BUFFERS; cur_buf++) {
+ wait_for_buffer(1 << (cur_buf & DMA_BUFF_MASK));
+
+ kjob = &spu_data->work_item[cur_item & RB_MASK];
+ dma_buff = kspu_buff[cur_buf & DMA_BUFF_MASK];
+ spu_ops[kjob->operation]
+ (kjob, dma_buff, cur_buf & DMA_BUFF_MASK);
+
+ init_get_data(dma_buff,
+ &spu_data->work_item[cur_req & RB_MASK],
+ cur_buf & DMA_BUFF_MASK);
+ cur_item++;
+ cur_req++;
+ }
+
+ outstanding = spu_data->kspu_ring_data.outstanding;
+ /* 3 */
+ while (cur_req != outstanding) {
+ wait_for_buffer(1 << (cur_buf & DMA_BUFF_MASK));
+ spu_data->kspu_ring_data.consumed++;
+ if (spu_stat_out_intr_mbox())
+ spu_write_out_intr_mbox(0x0);
+
+ kjob = &spu_data->work_item[cur_item & RB_MASK];
+ dma_buff = kspu_buff[cur_buf & DMA_BUFF_MASK];
+ spu_ops[kjob->operation]
+ (kjob, dma_buff, cur_buf & DMA_BUFF_MASK);
+
+ init_get_data(dma_buff,
+ &spu_data->work_item[cur_req & RB_MASK],
+ cur_buf & DMA_BUFF_MASK);
+ cur_item++;
+ cur_req++;
+ cur_buf++;
+ outstanding = spu_data->kspu_ring_data.outstanding;
+ }
+
+ /* 4 */
+ for (i = 0; i < DMA_BUFFERS; i++) {
+ wait_for_buffer(1 << (cur_buf & DMA_BUFF_MASK));
+ kjob = &spu_data->work_item[cur_item & RB_MASK];
+ dma_buff = kspu_buff[cur_buf & DMA_BUFF_MASK];
+ spu_ops[kjob->operation]
+ (kjob, dma_buff, cur_buf & DMA_BUFF_MASK);
+ cur_buf++;
+ cur_item++;
+ }
+
+ wait_for_buffer(ALL_DMA_BUFFS);
+ spu_data->kspu_ring_data.consumed = cur_item;
+ }
+}
--- /dev/null
+++ b/arch/powerpc/platforms/cell/spufs/spu_runtime.c
@@ -0,0 +1,40 @@
+/*
+ * Runtime helper functions, which are intended to replace libc. They can't be merged
+ * into spu_main.c because it must be guaranteed that _start() starts at 0x0.
+ *
+ * Author: Sebastian Siewior <sebastian@breakpoint.cc>
+ * License: GPLv2
+ */
+
+#include <spu_intrinsics.h>
+#include <asm/kspu/merged_code.h>
+
+void spu_nop(struct kspu_job *kjob, void *buffer, unsigned int buf_num)
+{
+}
+
+/*
+ * memcpy_aligned - copy memory
+ * @src: source of memory
+ * @dest: destination
+ * @num: number of bytes
+ *
+ * Copies @num bytes from @src to @dest. @src & @dest must be aligned on a
+ * 16 byte boundary. If @src or @dest is not properly aligned, wrong data will
+ * be read and/or written. @num must be a multiple of 16. If @num is not a
+ * multiple of 16 then the function simply does nothing.
+ */
+void memcpy_aligned(void *dest, const void *src, unsigned int num)
+{
+ const vector unsigned char *s = src;
+ vector unsigned char *d = dest;
+
+ if (num & 15)
+ return;
+ do {
+ *d = *s;
+ s++;
+ d++;
+ num -= 16;
+ } while (num);
+}
--- /dev/null
+++ b/arch/powerpc/platforms/cell/spufs/spu_runtime.h
@@ -0,0 +1,29 @@
+#ifndef SPU_RUNTIME_H
+#define SPU_RUNTIME_H
+#include <spu_mfcio.h>
+
+static inline void init_get_data(void *buf, struct kspu_job *job,
+ unsigned int dma_tag)
+{
+ mfc_getb(buf, job->in, job->in_size, dma_tag, 0, 0);
+}
+
+static inline void init_put_data(void *buf, unsigned long long ea,
+ unsigned int size, unsigned int dma_tag)
+{
+ mfc_putf(buf, ea, size, dma_tag, 0, 0);
+}
+
+static inline void wait_for_buffer(unsigned int dma_tag)
+{
+ mfc_write_tag_mask(dma_tag);
+ spu_mfcstat(MFC_TAG_UPDATE_ALL);
+}
+
+void memcpy_aligned(void *dest, const void *src, unsigned int n);
+
+/* exported offloaded functions */
+void spu_nop(struct kspu_job *kjob, void *buffer,
+ unsigned int buf_num);
+
+#endif
--- /dev/null
+++ b/include/asm-powerpc/kspu/merged_code.h
@@ -0,0 +1,51 @@
+#ifndef KSPU_MERGED_CODE_H
+#define KSPU_MERGED_CODE_H
+
+#define KSPU_LS_SIZE 0x40000
+
+#define RB_SLOTS 256
+#define RB_MASK (RB_SLOTS-1)
+
+#define DMA_MAX_TRANS_SIZE (16 * 1024)
+#define DMA_BUFFERS 2
+#define DMA_BUFF_MASK (DMA_BUFFERS-1)
+#define ALL_DMA_BUFFS ((1 << DMA_BUFFERS)-1)
+
+/*
+ * Every offloaded SPU operation has to register itself in the SPU_OPERATIONS
+ * enum between SPU_OP_nop & TOTAL_SPU_OPS.
+ */
+enum SPU_OPERATIONS {
+ SPU_OP_nop,
+
+ TOTAL_SPU_OPS,
+};
+
+struct kspu_job {
+ enum SPU_OPERATIONS operation __attribute__((aligned(16)));
+ unsigned long long in __attribute__((aligned(16)));
+ unsigned int in_size __attribute__((aligned(16)));
+ /*
+ * This union is reserved for the parameter block of the offloaded
+ * function.
+ */
+ union {
+ } __attribute__((aligned(16)));
+};
+
+typedef void (*spu_operation_t)(struct kspu_job *kjob, void *buffer,
+ unsigned int buf_num);
+
+struct kspu_ring_data {
+ volatile unsigned int consumed __attribute__((aligned(16)));
+ volatile unsigned int outstanding __attribute__((aligned(16)));
+};
+
+struct kernel_spu_data {
+ struct kspu_ring_data kspu_ring_data __attribute__((aligned(16)));
+ struct kspu_job work_item[RB_SLOTS] __attribute__((aligned(16)));
+};
+
+#define KERNEL_SPU_DATA_OFFSET (KSPU_LS_SIZE - sizeof(struct kernel_spu_data))
+
+#endif
--
* [patch 09/10] spufs: SPU-AES support (kernel side)
2007-08-16 20:01 [patch 00/10] KSPU API + AES offloaded to SPU + testing module Sebastian Siewior
` (7 preceding siblings ...)
2007-08-16 20:01 ` [patch 08/10] spufs: SPE side implementation of kspu Sebastian Siewior
@ 2007-08-16 20:01 ` Sebastian Siewior
[not found] ` <20070828154637.GA21007@Chamillionaire.breakpoint.cc>
2007-08-16 20:01 ` [patch 10/10] cryptoapi: async speed test Sebastian Siewior
9 siblings, 1 reply; 16+ messages in thread
From: Sebastian Siewior @ 2007-08-16 20:01 UTC (permalink / raw)
To: cbe-oss-dev; +Cc: herbert, arnd, jk, linux-crypto, Sebastian Siewior
[-- Attachment #1: aes-spu-async2.diff --]
[-- Type: text/plain, Size: 47045 bytes --]
This patch implements the AES cipher algorithm in the ECB & CBC block modes,
executed on the SPU using the crypto async interface & kspu.
CBC has one limitation: the IV is written back only in the notification
callback. That means it is not available to crypto requests that depend on
the previous request's IV (and likewise for crypto requests >16 KiB). Herbert
Xu pointed out that such dependencies currently do not occur in practice. For
instance:
- IPsec brings its own IV with every packet. A packet is usually <=
1500 bytes; the trouble would start with jumbo frames.
- eCryptfs changes the IV on a per-page basis (every enc/dec request is
PAGE_SIZE long).
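As a sketch of the intended usage (illustrative only; tfm, my_complete,
my_priv, src_sg, dst_sg, nbytes and iv are placeholders), a caller that
supplies a fresh IV with every request, as IPsec does, is not affected by the
deferred IV write-back:

	struct ablkcipher_request *req;
	int err;

	req = ablkcipher_request_alloc(tfm, GFP_KERNEL);
	if (!req)
		return -ENOMEM;
	ablkcipher_request_set_callback(req, CRYPTO_TFM_REQ_MAY_BACKLOG,
			my_complete, my_priv);
	/* the IV is supplied with the request; nbytes is at most 16 KiB */
	ablkcipher_request_set_crypt(req, src_sg, dst_sg, nbytes, iv);
	err = crypto_ablkcipher_encrypt(req);	/* -EINPROGRESS on success */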
Signed-off-by: Sebastian Siewior <sebastian@breakpoint.cc>
--- a/arch/powerpc/platforms/cell/Makefile
+++ b/arch/powerpc/platforms/cell/Makefile
@@ -24,6 +24,7 @@ obj-$(CONFIG_SPU_BASE) += spu_callback
$(spufs-modular-m) \
$(spu-priv1-y) \
$(spu-manage-y) \
- spufs/
+ spufs/ \
+ crypto/
obj-$(CONFIG_PCI_MSI) += axon_msi.o
--- /dev/null
+++ b/arch/powerpc/platforms/cell/crypto/Makefile
@@ -0,0 +1,6 @@
+#
+# Crypto, arch specific
+#
+CFLAGS_aes_vmx_key.o += -O3 -maltivec
+aes_spu-objs := aes_spu_glue.o aes_vmx_key.o
+obj-$(CONFIG_CRYPTO_AES_SPU) += aes_spu.o
--- /dev/null
+++ b/arch/powerpc/platforms/cell/crypto/aes_spu_glue.c
@@ -0,0 +1,462 @@
+/*
+ * AES interface module for the async crypto API.
+ *
+ * Author: Sebastian Siewior <sebastian@breakpoint.cc>
+ * License: GPLv2
+ */
+#include <asm/byteorder.h>
+#include <asm/system.h>
+#include <asm/kspu/kspu.h>
+#include <asm/kspu/merged_code.h>
+#include <crypto/algapi.h>
+#include <linux/module.h>
+#include <linux/crypto.h>
+#include <linux/mutex.h>
+#include <linux/err.h>
+#include <linux/list.h>
+#include <linux/delay.h>
+#include <linux/spinlock.h>
+#include <linux/mm.h>
+#include <linux/scatterlist.h>
+#include <linux/highmem.h>
+#include <linux/vmalloc.h>
+
+#include "aes_vmx_key.h"
+
+struct map_key_spu {
+ struct list_head list;
+ unsigned int spu_slot;
+ struct aes_ctx *slot_content;
+};
+
+struct aes_ctx {
+ /* the key used for enc|dec purpose */
+ struct aes_key_struct key __attribute__((aligned(16)));
+ /* identify the slot on the SPU */
+ struct map_key_spu *key_mapping;
+ /* identify the SPU that is used */
+ struct async_aes *spe_ctx;
+};
+
+struct async_d_request {
+ enum SPU_OPERATIONS crypto_operation;
+ /*
+ * If src|dst is not properly aligned, we keep a properly
+ * aligned copy of it here.
+ */
+ struct kspu_work_item kspu_work;
+ unsigned char *al_data;
+ unsigned char *mapped_src;
+ unsigned char *mapped_dst;
+ unsigned char *real_src;
+ unsigned char *real_dst;
+ unsigned int progress;
+};
+
+struct async_aes {
+ struct kspu_context *ctx;
+ struct map_key_spu mapping_key_spu[SPU_KEY_SLOTS];
+ struct list_head key_ring;
+};
+
+static struct async_aes async_spu;
+
+#define AES_MIN_KEY_SIZE 16
+#define AES_MAX_KEY_SIZE 32
+#define AES_BLOCK_SIZE 16
+#define ALIGN_MASK 15
+
+static void cleanup_requests(struct ablkcipher_request *req,
+ struct async_d_request *a_d_ctx)
+{
+ char *dst_addr;
+ char *aligned_addr;
+
+ if (a_d_ctx->al_data) {
+ aligned_addr = (char *) ALIGN((unsigned long)
+ a_d_ctx->al_data, ALIGN_MASK+1);
+ dst_addr = a_d_ctx->mapped_dst + req->dst->offset;
+
+ if ((unsigned long) dst_addr & ALIGN_MASK)
+ memcpy(dst_addr, aligned_addr, req->nbytes);
+ vfree(a_d_ctx->al_data);
+ kunmap(a_d_ctx->mapped_dst);
+ kunmap(a_d_ctx->mapped_src);
+ }
+
+}
+
+static void aes_finish_callback(struct kspu_work_item *kspu_work,
+ struct kspu_job *kjob)
+{
+ struct async_d_request *a_d_ctx = container_of(kspu_work,
+ struct async_d_request, kspu_work);
+ struct ablkcipher_request *ablk_req = ablkcipher_ctx_cast(a_d_ctx);
+
+ a_d_ctx = ablkcipher_request_ctx(ablk_req);
+ cleanup_requests(ablk_req, a_d_ctx);
+
+ if (ablk_req->info) {
+ struct aes_crypt *aes_crypt = &kjob->aes_crypt;
+
+ memcpy(ablk_req->info, aes_crypt->iv, 16);
+ }
+
+ pr_debug("Request %p done, memory cleaned. Now calling crypto user\n",
+ kspu_work);
+ local_bh_disable();
+ ablk_req->base.complete(&ablk_req->base, 0);
+ local_bh_enable();
+ return;
+}
+
+static void update_key_on_spu(struct aes_ctx *aes_ctx)
+{
+ struct list_head *tail;
+ struct map_key_spu *entry;
+ struct aes_update_key *aes_update_key;
+ struct kspu_job *work_item;
+
+ tail = async_spu.key_ring.prev;
+ entry = list_entry(tail, struct map_key_spu, list);
+ list_move(tail, &async_spu.key_ring);
+
+ entry->slot_content = aes_ctx;
+ aes_ctx->key_mapping = entry;
+
+ pr_debug("key for %p is not on the SPU. new slot: %d\n",
+ aes_ctx, entry->spu_slot);
+ work_item = kspu_get_rb_slot(aes_ctx->spe_ctx->ctx);
+ work_item->operation = SPU_OP_aes_update_key;
+ work_item->in = (unsigned long long) &aes_ctx->key;
+ work_item->in_size = sizeof(aes_ctx->key);
+
+ aes_update_key = &work_item->aes_update_key;
+ aes_update_key->keyid = entry->spu_slot;
+
+ kspu_mark_rb_slot_ready(aes_ctx->spe_ctx->ctx, NULL);
+}
+
+static int prepare_request_mem(struct ablkcipher_request *req,
+ struct async_d_request *a_d_ctx, struct aes_ctx *aes_ctx)
+{
+ char *src_addr, *dst_addr;
+
+ a_d_ctx->mapped_src = kmap(req->src->page);
+ if (!a_d_ctx->mapped_src)
+ goto err;
+
+ a_d_ctx->mapped_dst = kmap(req->dst->page);
+ if (!a_d_ctx->mapped_dst)
+ goto err_src;
+
+ src_addr = a_d_ctx->mapped_src + req->src->offset;
+ dst_addr = a_d_ctx->mapped_dst + req->dst->offset;
+
+ if ((unsigned long) src_addr & ALIGN_MASK ||
+ (unsigned long) dst_addr & ALIGN_MASK) {
+ /*
+ * vmalloc() is somewhat slower than __get_free_page().
+ * However, this is the slowpath. I expect the user to align
+ * properly in the first place :).
+ * The reason for vmalloc() is that req->nbytes may be larger
+ * than one page and I don't want to distinguish later where
+ * that memory came from.
+ */
+ a_d_ctx->al_data = vmalloc(req->nbytes);
+ if (!a_d_ctx->al_data)
+ goto err_dst;
+
+ pr_debug("Unaligned data replaced with %p\n",
+ a_d_ctx->al_data);
+
+ if ((unsigned long) src_addr & ALIGN_MASK) {
+ memcpy(a_d_ctx->al_data, src_addr, req->nbytes);
+ a_d_ctx->real_src = a_d_ctx->al_data;
+ }
+
+ if ((unsigned long) dst_addr & ALIGN_MASK)
+ a_d_ctx->real_dst = a_d_ctx->al_data;
+
+ } else {
+ a_d_ctx->al_data = NULL;
+ a_d_ctx->real_src = src_addr;
+ a_d_ctx->real_dst = dst_addr;
+ }
+ return 0;
+err_dst:
+ kunmap(a_d_ctx->mapped_dst);
+err_src:
+ kunmap(a_d_ctx->mapped_src);
+err:
+ return -ENOMEM;
+
+}
+/*
+ * aes_queue_work_items() is called by kspu to queue the work item on the SPU.
+ * kspu ensures at least one free slot when calling. The function may return 0
+ * if more slots were required but not available. In this case, kspu will call
+ * again with the same work item. The function has to notice that this work
+ * item has already been started and continue from there.
+ * Other return values (!=0) will remove the work item from the list.
+ */
+static int aes_queue_work_items(struct kspu_work_item *kspu_work)
+{
+ struct async_d_request *a_d_ctx = container_of(kspu_work,
+ struct async_d_request, kspu_work);
+ struct ablkcipher_request *ablk_req = ablkcipher_ctx_cast(a_d_ctx);
+ struct crypto_ablkcipher *tfm = crypto_ablkcipher_reqtfm(ablk_req);
+ struct aes_ctx *aes_ctx = crypto_ablkcipher_ctx_aligned(tfm);
+ struct kspu_job *work_item;
+ struct aes_crypt *aes_crypt;
+ int size_left;
+ int ret;
+
+ BUG_ON(ablk_req->nbytes & (AES_BLOCK_SIZE-1));
+
+ if (!a_d_ctx->progress) {
+ if (!aes_ctx->key_mapping || aes_ctx !=
+ aes_ctx->key_mapping->slot_content)
+ update_key_on_spu(aes_ctx);
+
+ else
+ list_move(&aes_ctx->key_mapping->list,
+ &async_spu.key_ring);
+
+ ret = prepare_request_mem(ablk_req, a_d_ctx, aes_ctx);
+ if (ret)
+ return 0;
+ }
+
+ do {
+ size_left = ablk_req->nbytes - a_d_ctx->progress;
+
+ if (!size_left)
+ return 1;
+
+ work_item = kspu_get_rb_slot(aes_ctx->spe_ctx->ctx);
+ if (!work_item)
+ return 0;
+
+ aes_crypt = &work_item->aes_crypt;
+ work_item->operation = a_d_ctx->crypto_operation;
+ work_item->in = (unsigned long int) a_d_ctx->real_src +
+ a_d_ctx->progress;
+ aes_crypt->out = (unsigned long int) a_d_ctx->real_dst +
+ a_d_ctx->progress;
+
+ if (size_left > DMA_MAX_TRANS_SIZE) {
+ a_d_ctx->progress += DMA_MAX_TRANS_SIZE;
+ work_item->in_size = DMA_MAX_TRANS_SIZE;
+ } else {
+ a_d_ctx->progress += size_left;
+ work_item->in_size = size_left;
+ }
+
+ if (ablk_req->info)
+ memcpy(aes_crypt->iv, ablk_req->info, 16);
+
+ aes_crypt->keyid = aes_ctx->key_mapping->spu_slot;
+
+ pr_debug("in: %p, out %p, data_size: %u\n",
+ (void *) work_item->in,
+ (void *) aes_crypt->out,
+ work_item->in_size);
+ pr_debug("key slot: %d, IV from: %p\n", aes_crypt->keyid,
+ ablk_req->info);
+
+ kspu_mark_rb_slot_ready(aes_ctx->spe_ctx->ctx,
+ a_d_ctx->progress == ablk_req->nbytes ?
+ kspu_work : NULL);
+ } while (1);
+}
+
+static int enqueue_request(struct ablkcipher_request *req,
+ enum SPU_OPERATIONS op_type)
+{
+ struct async_d_request *asy_d_ctx = ablkcipher_request_ctx(req);
+ struct crypto_ablkcipher *tfm = crypto_ablkcipher_reqtfm(req);
+ struct aes_ctx *ctx = crypto_ablkcipher_ctx_aligned(tfm);
+ struct kspu_work_item *work = &asy_d_ctx->kspu_work;
+
+ asy_d_ctx->crypto_operation = op_type;
+ asy_d_ctx->progress = 0;
+ work->enqueue = aes_queue_work_items;
+ work->notify = aes_finish_callback;
+
+ return kspu_enqueue_work_item(ctx->spe_ctx->ctx, &asy_d_ctx->kspu_work,
+ KSPU_MUST_BACKLOG);
+}
+
+/*
+ * This is AltiVec and not SPU code because the key may disappear after this
+ * function returns (for example if it is not properly aligned).
+ */
+static int aes_set_key_async(struct crypto_ablkcipher *parent,
+ const u8 *key, unsigned int keylen)
+{
+ struct aes_ctx *ctx = crypto_ablkcipher_ctx_aligned(parent);
+ int ret;
+
+ ctx->spe_ctx = &async_spu;
+ ctx->key.len = keylen / 4;
+ ctx->key_mapping = NULL;
+
+ preempt_disable();
+ enable_kernel_altivec();
+ ret = expand_key(key, keylen / 4, &ctx->key.enc[0], &ctx->key.dec[0]);
+ preempt_enable();
+
+ if (ret == -EINVAL)
+ crypto_ablkcipher_set_flags(parent, CRYPTO_TFM_RES_BAD_KEY_LEN);
+
+ return ret;
+}
+
+static int aes_encrypt_ecb_async(struct ablkcipher_request *req)
+{
+ req->info = NULL;
+ return enqueue_request(req, SPU_OP_aes_encrypt_ecb);
+}
+
+static int aes_decrypt_ecb_async(struct ablkcipher_request *req)
+{
+ req->info = NULL;
+ return enqueue_request(req, SPU_OP_aes_decrypt_ecb);
+}
+
+static int aes_encrypt_cbc_async(struct ablkcipher_request *req)
+{
+ return enqueue_request(req, SPU_OP_aes_encrypt_cbc);
+}
+
+static int aes_decrypt_cbc_async(struct ablkcipher_request *req)
+{
+ return enqueue_request(req, SPU_OP_aes_decrypt_cbc);
+}
+
+static int async_d_init(struct crypto_tfm *tfm)
+{
+ tfm->crt_ablkcipher.reqsize = sizeof(struct async_d_request);
+ return 0;
+}
+
+static struct crypto_alg aes_ecb_alg_async = {
+ .cra_name = "ecb(aes)",
+ .cra_driver_name = "ecb-aes-spu-async",
+ .cra_priority = 125,
+ .cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER | CRYPTO_ALG_ASYNC,
+ .cra_blocksize = AES_BLOCK_SIZE,
+ .cra_alignmask = 15,
+ .cra_ctxsize = sizeof(struct aes_ctx),
+ .cra_type = &crypto_ablkcipher_type,
+ .cra_module = THIS_MODULE,
+ .cra_list = LIST_HEAD_INIT(aes_ecb_alg_async.cra_list),
+ .cra_init = async_d_init,
+ .cra_u = {
+ .ablkcipher = {
+ .min_keysize = AES_MIN_KEY_SIZE,
+ .max_keysize = AES_MAX_KEY_SIZE,
+ .ivsize = 0,
+ .setkey = aes_set_key_async,
+ .encrypt = aes_encrypt_ecb_async,
+ .decrypt = aes_decrypt_ecb_async,
+ }
+ }
+};
+
+static struct crypto_alg aes_cbc_alg_async = {
+ .cra_name = "cbc(aes)",
+ .cra_driver_name = "cbc-aes-spu-async",
+ .cra_priority = 125,
+ .cra_flags = CRYPTO_ALG_TYPE_BLKCIPHER | CRYPTO_ALG_ASYNC,
+ .cra_blocksize = AES_BLOCK_SIZE,
+ .cra_alignmask = 15,
+ .cra_ctxsize = sizeof(struct aes_ctx),
+ .cra_type = &crypto_ablkcipher_type,
+ .cra_module = THIS_MODULE,
+ .cra_list = LIST_HEAD_INIT(aes_cbc_alg_async.cra_list),
+ .cra_init = async_d_init,
+ .cra_u = {
+ .ablkcipher = {
+ .min_keysize = AES_MIN_KEY_SIZE,
+ .max_keysize = AES_MAX_KEY_SIZE,
+ .ivsize = AES_BLOCK_SIZE,
+ .setkey = aes_set_key_async,
+ .encrypt = aes_encrypt_cbc_async,
+ .decrypt = aes_decrypt_cbc_async,
+ }
+ }
+};
+
+static void init_spu_key_mapping(struct async_aes *spe_ctx)
+{
+ unsigned int i;
+
+ INIT_LIST_HEAD(&spe_ctx->key_ring);
+
+ for (i = 0; i < SPU_KEY_SLOTS; i++) {
+ list_add_tail(&spe_ctx->mapping_key_spu[i].list,
+ &spe_ctx->key_ring);
+ spe_ctx->mapping_key_spu[i].spu_slot = i;
+ }
+}
+
+static int init_async_ctx(struct async_aes *spe_ctx)
+{
+ int ret;
+
+ spe_ctx->ctx = kspu_get_kctx();
+ init_spu_key_mapping(spe_ctx);
+
+ ret = crypto_register_alg(&aes_ecb_alg_async);
+ if (ret) {
+ printk(KERN_ERR "crypto_register_alg(ecb) failed: %d\n", ret);
+ goto err_kthread;
+ }
+
+ ret = crypto_register_alg(&aes_cbc_alg_async);
+ if (ret) {
+ printk(KERN_ERR "crypto_register_alg(cbc) failed: %d\n", ret);
+ goto fail_cbc;
+ }
+
+ return 0;
+
+fail_cbc:
+ crypto_unregister_alg(&aes_ecb_alg_async);
+
+err_kthread:
+ return ret;
+}
+
+static void deinit_async_ctx(struct async_aes *async_aes)
+{
+
+ crypto_unregister_alg(&aes_ecb_alg_async);
+ crypto_unregister_alg(&aes_cbc_alg_async);
+}
+
+static int __init aes_init(void)
+{
+ unsigned int ret;
+
+ ret = init_async_ctx(&async_spu);
+ if (ret) {
+ printk(KERN_ERR "async_api_init() failed\n");
+ return ret;
+ }
+ return 0;
+}
+
+static void __exit aes_fini(void)
+{
+ deinit_async_ctx(&async_spu);
+}
+
+module_init(aes_init);
+module_exit(aes_fini);
+
+MODULE_DESCRIPTION("AES Cipher Algorithm with SPU support");
+MODULE_AUTHOR("Sebastian Siewior <sebastian@breakpoint.cc>");
+MODULE_LICENSE("GPL");
--- /dev/null
+++ b/arch/powerpc/platforms/cell/crypto/aes_vmx_key.c
@@ -0,0 +1,283 @@
+/*
+ * Key expansion in VMX.
+ * This is a rip of my first AES implementation in VMX. Only key expansion is
+ * required, other parts are left behind.
+ *
+ * Author: Sebastian Siewior (sebastian _at_ breakpoint.cc)
+ * License: GPL v2
+ */
+
+#include <linux/errno.h>
+#include <linux/string.h>
+#include <altivec.h>
+#include "aes_vmx_key.h"
+
+static const vector unsigned char imm_7Fh = {
+ 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f,
+ 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f, 0x7f
+};
+
+/*
+ * These values are either defined in the AES standard or can be
+ * computed.
+ */
+static const unsigned int Rcon[] = {
+ 0x00000000, 0x01000000, 0x02000000, 0x04000000, 0x08000000,
+ 0x10000000, 0x20000000, 0x40000000, 0x80000000, 0x1b000000,
+ 0x36000000
+};
+
+static const vector unsigned char sbox_enc[16] = {
+ { 0x63, 0x7c, 0x77, 0x7b, 0xf2, 0x6b, 0x6f, 0xc5,
+ 0x30, 0x01, 0x67, 0x2b, 0xfe, 0xd7, 0xab, 0x76 },
+ { 0xca, 0x82, 0xc9, 0x7d, 0xfa, 0x59, 0x47, 0xf0,
+ 0xad, 0xd4, 0xa2, 0xaf, 0x9c, 0xa4, 0x72, 0xc0 },
+ { 0xb7, 0xfd, 0x93, 0x26, 0x36, 0x3f, 0xf7, 0xcc,
+ 0x34, 0xa5, 0xe5, 0xf1, 0x71, 0xd8, 0x31, 0x15 },
+ { 0x04, 0xc7, 0x23, 0xc3, 0x18, 0x96, 0x05, 0x9a,
+ 0x07, 0x12, 0x80, 0xe2, 0xeb, 0x27, 0xb2, 0x75 },
+ { 0x09, 0x83, 0x2c, 0x1a, 0x1b, 0x6e, 0x5a, 0xa0,
+ 0x52, 0x3b, 0xd6, 0xb3, 0x29, 0xe3, 0x2f, 0x84 },
+ { 0x53, 0xd1, 0x00, 0xed, 0x20, 0xfc, 0xb1, 0x5b,
+ 0x6a, 0xcb, 0xbe, 0x39, 0x4a, 0x4c, 0x58, 0xcf },
+ { 0xd0, 0xef, 0xaa, 0xfb, 0x43, 0x4d, 0x33, 0x85,
+ 0x45, 0xf9, 0x02, 0x7f, 0x50, 0x3c, 0x9f, 0xa8 },
+ { 0x51, 0xa3, 0x40, 0x8f, 0x92, 0x9d, 0x38, 0xf5,
+ 0xbc, 0xb6, 0xda, 0x21, 0x10, 0xff, 0xf3, 0xd2 },
+ { 0xcd, 0x0c, 0x13, 0xec, 0x5f, 0x97, 0x44, 0x17,
+ 0xc4, 0xa7, 0x7e, 0x3d, 0x64, 0x5d, 0x19, 0x73 },
+ { 0x60, 0x81, 0x4f, 0xdc, 0x22, 0x2a, 0x90, 0x88,
+ 0x46, 0xee, 0xb8, 0x14, 0xde, 0x5e, 0x0b, 0xdb },
+ { 0xe0, 0x32, 0x3a, 0x0a, 0x49, 0x06, 0x24, 0x5c,
+ 0xc2, 0xd3, 0xac, 0x62, 0x91, 0x95, 0xe4, 0x79 },
+ { 0xe7, 0xc8, 0x37, 0x6d, 0x8d, 0xd5, 0x4e, 0xa9,
+ 0x6c, 0x56, 0xf4, 0xea, 0x65, 0x7a, 0xae, 0x08 },
+ { 0xba, 0x78, 0x25, 0x2e, 0x1c, 0xa6, 0xb4, 0xc6,
+ 0xe8, 0xdd, 0x74, 0x1f, 0x4b, 0xbd, 0x8b, 0x8a },
+ { 0x70, 0x3e, 0xb5, 0x66, 0x48, 0x03, 0xf6, 0x0e,
+ 0x61, 0x35, 0x57, 0xb9, 0x86, 0xc1, 0x1d, 0x9e },
+ { 0xe1, 0xf8, 0x98, 0x11, 0x69, 0xd9, 0x8e, 0x94,
+ 0x9b, 0x1e, 0x87, 0xe9, 0xce, 0x55, 0x28, 0xdf },
+ { 0x8c, 0xa1, 0x89, 0x0d, 0xbf, 0xe6, 0x42, 0x68,
+ 0x41, 0x99, 0x2d, 0x0f, 0xb0, 0x54, 0xbb, 0x16 }
+};
+
+static const vector unsigned char inv_select_0e = {
+ 0x00, 0x01, 0x02, 0x03,
+ 0x04, 0x05, 0x06, 0x07,
+ 0x08, 0x09, 0x0a, 0x0b,
+ 0x0c, 0x0d, 0x0e, 0x0f
+};
+
+static const vector unsigned char inv_select_0b = {
+ 0x01, 0x02, 0x03, 0x00,
+ 0x05, 0x06, 0x07, 0x04,
+ 0x09, 0x0a, 0x0b, 0x08,
+ 0x0d, 0x0e, 0x0f, 0x0c
+};
+
+static const vector unsigned char inv_select_0d = {
+ 0x02, 0x03, 0x00, 0x01,
+ 0x06, 0x07, 0x04, 0x05,
+ 0x0a, 0x0b, 0x08, 0x09,
+ 0x0e, 0x0f, 0x0c, 0x0d
+};
+
+static const vector unsigned char inv_select_09 = {
+ 0x03, 0x00, 0x01, 0x02,
+ 0x07, 0x04, 0x05, 0x06,
+ 0x0b, 0x08, 0x09, 0x0a,
+ 0x0f, 0x0c, 0x0d, 0x0e
+};
+
+static vector unsigned char ByteSub(vector unsigned char state)
+{
+ /* line of the s-box */
+ vector unsigned char line_01, line_23, line_45, line_67,
+ line_89, line_AB, line_CD, line_EF;
+ /* selector */
+ vector unsigned char sel1, sel2, sel7;
+ /* correct lines */
+ vector unsigned char cor_0123, cor_4567, cor_89AB, cor_CDEF,
+ cor_0to7, cor_8toF;
+ vector unsigned char ret_state;
+ vector unsigned char state_shift2, state_shift1;
+
+ line_01 = vec_perm(sbox_enc[0], sbox_enc[1], state);
+ line_23 = vec_perm(sbox_enc[2], sbox_enc[3], state);
+ line_45 = vec_perm(sbox_enc[4], sbox_enc[5], state);
+ line_67 = vec_perm(sbox_enc[6], sbox_enc[7], state);
+ line_89 = vec_perm(sbox_enc[8], sbox_enc[9], state);
+ line_AB = vec_perm(sbox_enc[10], sbox_enc[11], state);
+ line_CD = vec_perm(sbox_enc[12], sbox_enc[13], state);
+ line_EF = vec_perm(sbox_enc[14], sbox_enc[15], state);
+
+ state_shift2 = vec_vslb(state, vec_splat_u8(2));
+ sel2 = (typeof (sel2))vec_vcmpgtub(state_shift2, imm_7Fh);
+ cor_0123 = vec_sel(line_01, line_23, sel2);
+ cor_4567 = vec_sel(line_45, line_67, sel2);
+ cor_89AB = vec_sel(line_89, line_AB, sel2);
+ cor_CDEF = vec_sel(line_CD, line_EF, sel2);
+
+ state_shift1 = vec_vslb(state, vec_splat_u8(1));
+ sel1 = (typeof (sel1))vec_vcmpgtub(state_shift1, imm_7Fh);
+ cor_0to7 = vec_sel(cor_0123, cor_4567, sel1);
+ cor_8toF = vec_sel(cor_89AB, cor_CDEF, sel1);
+
+ sel7 = (typeof (sel7))vec_vcmpgtub(state, imm_7Fh);
+ ret_state = vec_sel(cor_0to7, cor_8toF, sel7);
+
+ return ret_state;
+}
+
+static vector unsigned char InvMixColumn(vector unsigned char state)
+{
+ vector unsigned char op0, op1, op2, op3, op4, op5;
+ vector unsigned char mul_0e, mul_09, mul_0d, mul_0b;
+ vector unsigned char ret;
+ vector unsigned char imm_00h, imm_01h;
+ vector unsigned char need_add;
+ vector unsigned char shifted_vec, modul;
+ vector unsigned char toadd;
+ vector unsigned char mul_2, mul_4, mul_8;
+ vector unsigned char mul_2_4;
+
+ /* compute 0e, 0b, 0d, 09 in GF */
+ imm_00h = vec_splat_u8(0x00);
+ imm_01h = vec_splat_u8(0x01);
+
+ /* modul = 0x1b */
+ modul = vec_splat( vec_lvsr(0, (unsigned char *) 0), 0x0b);
+
+ need_add = (vector unsigned char)vec_vcmpgtub(state, imm_7Fh);
+ shifted_vec = vec_vslb(state, imm_01h);
+ toadd = vec_sel(imm_00h, modul, need_add);
+ mul_2 = vec_xor(toadd, shifted_vec);
+
+ need_add = (vector unsigned char)vec_vcmpgtub(mul_2, imm_7Fh);
+ shifted_vec = vec_vslb(mul_2, imm_01h);
+ toadd = vec_sel(imm_00h, modul, need_add);
+ mul_4 = vec_xor(toadd, shifted_vec);
+
+ need_add = (vector unsigned char)vec_vcmpgtub(mul_4, imm_7Fh);
+ shifted_vec = vec_vslb(mul_4, imm_01h);
+ toadd = vec_sel(imm_00h, modul, need_add);
+ mul_8 = vec_xor(toadd, shifted_vec);
+
+ mul_2_4 = vec_xor(mul_2, mul_4);
+ /* 09 = 8 * 1 */
+ mul_09 = vec_xor(mul_8, state);
+
+ /* 0e = 2 * 4 * 8 */
+ mul_0e = vec_xor(mul_2_4, mul_8);
+
+ /* 0b = 2 * 8 * 1 */
+ mul_0b = vec_xor(mul_2, mul_09);
+
+ /* 0d = 4 * 8 * 1 */
+ mul_0d = vec_xor(mul_4, mul_09);
+
+ /* prepare vectors for add */
+
+ op0 = vec_perm(mul_0e, mul_0e, inv_select_0e);
+ op1 = vec_perm(mul_0b, mul_0b, inv_select_0b);
+ op2 = vec_perm(mul_0d, mul_0d, inv_select_0d);
+ op3 = vec_perm(mul_09, mul_09, inv_select_09);
+
+ op4 = vec_xor(op0, op1);
+ op5 = vec_xor(op2, op3);
+ ret = vec_xor(op4, op5);
+ return ret;
+}
+
+static unsigned int SubWord(unsigned int in)
+{
+ unsigned char buff[16] __attribute__((aligned(16)));
+ vector unsigned char vec_buf;
+
+ buff[0] = in >> 24;
+ buff[1] = (in >> 16) & 0xff;
+ buff[2] = (in >> 8) & 0xff;
+ buff[3] = in & 0xff;
+
+ vec_buf = vec_ld(0, buff);
+ vec_buf = ByteSub(vec_buf);
+ vec_st(vec_buf, 0, buff);
+ return buff[0] << 24 | buff[1] << 16 | buff[2] << 8 | buff[3];
+}
+
+static unsigned int RotWord(unsigned int word)
+{
+ return (word << 8 | word >> 24);
+}
+
+int expand_key(const unsigned char *key, unsigned int keylen,
+ unsigned char exp_enc_key[15 *4*4],
+ unsigned char exp_dec_key[15*4*4])
+{
+ unsigned int tmp;
+ unsigned int i;
+ unsigned int rounds;
+ unsigned int expanded_key[15 *4] __attribute__((aligned(16)));
+ vector unsigned char expanded_dec_key[15];
+ vector unsigned char mixed_key;
+ vector unsigned char *cur_key;
+
+ switch (keylen) {
+ case 4:
+ rounds = 10;
+ break;
+
+ case 6:
+ rounds = 12;
+ break;
+
+ case 8:
+ rounds = 14;
+ break;
+
+ default:
+ /* wrong key size */
+ return -EINVAL;
+ }
+
+ memcpy(expanded_key, key, keylen*4);
+
+ i = keylen;
+
+ /* setup enc key */
+
+ for (; i < 4 * (rounds+1); i++) {
+ tmp = expanded_key[i-1];
+
+ if (!(i % keylen)) {
+ tmp = RotWord(tmp);
+ tmp = SubWord(tmp);
+ tmp ^= Rcon[i / keylen ];
+ } else if (keylen > 6 && (i % keylen == 4))
+ tmp = SubWord(tmp);
+
+ expanded_key[i] = expanded_key[i-keylen] ^ tmp;
+ }
+
+ memcpy(exp_enc_key, expanded_key, 15*4*4);
+
+ /* setup dec key: the key is turned around and prepared for the
+ * "alternative decryption" mode
+ */
+
+ cur_key = (vector unsigned char *) expanded_key;
+
+ memcpy(&expanded_dec_key[rounds], &expanded_key[0], 4*4);
+ memcpy(&expanded_dec_key[0], &expanded_key[rounds *4], 4*4);
+
+ cur_key++;
+ for (i = (rounds-1); i > 0; i--) {
+
+ mixed_key = InvMixColumn(*cur_key++);
+ expanded_dec_key[i] = mixed_key;
+ }
+
+ memcpy(exp_dec_key, expanded_dec_key, 15*4*4);
+ return 0;
+}
--- /dev/null
+++ b/arch/powerpc/platforms/cell/crypto/aes_vmx_key.h
@@ -0,0 +1,7 @@
+#ifndef __aes_vmx_addon_h__
+#define __aes_vmx_addon_h__
+
+int expand_key(const unsigned char *key, unsigned int keylen,
+ unsigned char exp_enc_key[15*4*4],
+ unsigned char exp_dec_key[15*4*4]);
+#endif
--- a/arch/powerpc/platforms/cell/spufs/Makefile
+++ b/arch/powerpc/platforms/cell/spufs/Makefile
@@ -11,7 +11,7 @@ SPU_CC := $(SPU_CROSS)gcc
SPU_AS := $(SPU_CROSS)gcc
SPU_LD := $(SPU_CROSS)ld
SPU_OBJCOPY := $(SPU_CROSS)objcopy
-SPU_CFLAGS := -O2 -Wall -I$(srctree)/include \
+SPU_CFLAGS := -O3 -Wall -I$(srctree)/include \
-I$(objtree)/include2 -D__KERNEL__ -ffreestanding
SPU_AFLAGS := -c -D__ASSEMBLY__ -I$(srctree)/include \
-I$(objtree)/include2 -D__KERNEL__
@@ -23,6 +23,7 @@ clean-files := spu_save_dump.h spu_resto
$(obj)/kspu.o: $(obj)/spu_kspu_dump.h
spu_kspu_code_obj-y += $(obj)/spu_main.o $(obj)/spu_runtime.o
+spu_kspu_code_obj-$(CONFIG_CRYPTO_AES_SPU) += $(obj)/spu_aes.o
spu_kspu_code_obj-y += $(spu_kspu_code_obj-m)
$(obj)/spu_kspu: $(spu_kspu_code_obj-y)
--- /dev/null
+++ b/arch/powerpc/platforms/cell/spufs/spu_aes.c
@@ -0,0 +1,677 @@
+/*
+ * AES implementation with spu support.
+ * v.03
+ *
+ * Author:
+ * Sebastian Siewior (sebastian _at_ breakpoint.cc)
+ * Arnd Bergmann (arnd _at_ arndb.de)
+ *
+ * License: GPL v2
+ *
+ * Code based on ideas from "Efficient Galois Field Arithmetic on SIMD
+ * Architectures" by Raghav Bhaskar, Pradeep K. Dubey, Vijay Kumar, Atri Rudra
+ * and Animesh Sharma.
+ *
+ * This implementation runs on the SPU and therefore assumes big endian.
+ * Tables for MixColumn() and InvMixColumn() are adjusted in order to omit
+ * ShiftRow() in all but the last round.
+ */
+#include <stddef.h>
+#include <spu_intrinsics.h>
+#include <spu_mfcio.h>
+
+#include <asm/kspu/aes.h>
+#include <asm/kspu/merged_code.h>
+#include "spu_runtime.h"
+
+#define BUG() ;
+/*
+ * These values are either defined in the AES standard or can be
+ * computed.
+ */
+static const vector unsigned char sbox_enc[16] = {
+ { 0x63, 0x7c, 0x77, 0x7b, 0xf2, 0x6b, 0x6f, 0xc5,
+ 0x30, 0x01, 0x67, 0x2b, 0xfe, 0xd7, 0xab, 0x76 },
+ { 0xca, 0x82, 0xc9, 0x7d, 0xfa, 0x59, 0x47, 0xf0,
+ 0xad, 0xd4, 0xa2, 0xaf, 0x9c, 0xa4, 0x72, 0xc0 },
+ { 0xb7, 0xfd, 0x93, 0x26, 0x36, 0x3f, 0xf7, 0xcc,
+ 0x34, 0xa5, 0xe5, 0xf1, 0x71, 0xd8, 0x31, 0x15 },
+ { 0x04, 0xc7, 0x23, 0xc3, 0x18, 0x96, 0x05, 0x9a,
+ 0x07, 0x12, 0x80, 0xe2, 0xeb, 0x27, 0xb2, 0x75 },
+ { 0x09, 0x83, 0x2c, 0x1a, 0x1b, 0x6e, 0x5a, 0xa0,
+ 0x52, 0x3b, 0xd6, 0xb3, 0x29, 0xe3, 0x2f, 0x84 },
+ { 0x53, 0xd1, 0x00, 0xed, 0x20, 0xfc, 0xb1, 0x5b,
+ 0x6a, 0xcb, 0xbe, 0x39, 0x4a, 0x4c, 0x58, 0xcf },
+ { 0xd0, 0xef, 0xaa, 0xfb, 0x43, 0x4d, 0x33, 0x85,
+ 0x45, 0xf9, 0x02, 0x7f, 0x50, 0x3c, 0x9f, 0xa8 },
+ { 0x51, 0xa3, 0x40, 0x8f, 0x92, 0x9d, 0x38, 0xf5,
+ 0xbc, 0xb6, 0xda, 0x21, 0x10, 0xff, 0xf3, 0xd2 },
+ { 0xcd, 0x0c, 0x13, 0xec, 0x5f, 0x97, 0x44, 0x17,
+ 0xc4, 0xa7, 0x7e, 0x3d, 0x64, 0x5d, 0x19, 0x73 },
+ { 0x60, 0x81, 0x4f, 0xdc, 0x22, 0x2a, 0x90, 0x88,
+ 0x46, 0xee, 0xb8, 0x14, 0xde, 0x5e, 0x0b, 0xdb },
+ { 0xe0, 0x32, 0x3a, 0x0a, 0x49, 0x06, 0x24, 0x5c,
+ 0xc2, 0xd3, 0xac, 0x62, 0x91, 0x95, 0xe4, 0x79 },
+ { 0xe7, 0xc8, 0x37, 0x6d, 0x8d, 0xd5, 0x4e, 0xa9,
+ 0x6c, 0x56, 0xf4, 0xea, 0x65, 0x7a, 0xae, 0x08 },
+ { 0xba, 0x78, 0x25, 0x2e, 0x1c, 0xa6, 0xb4, 0xc6,
+ 0xe8, 0xdd, 0x74, 0x1f, 0x4b, 0xbd, 0x8b, 0x8a },
+ { 0x70, 0x3e, 0xb5, 0x66, 0x48, 0x03, 0xf6, 0x0e,
+ 0x61, 0x35, 0x57, 0xb9, 0x86, 0xc1, 0x1d, 0x9e },
+ { 0xe1, 0xf8, 0x98, 0x11, 0x69, 0xd9, 0x8e, 0x94,
+ 0x9b, 0x1e, 0x87, 0xe9, 0xce, 0x55, 0x28, 0xdf },
+ { 0x8c, 0xa1, 0x89, 0x0d, 0xbf, 0xe6, 0x42, 0x68,
+ 0x41, 0x99, 0x2d, 0x0f, 0xb0, 0x54, 0xbb, 0x16 }
+};
+
+static const vector unsigned char shift_round = {
+ 0x00, 0x05, 0x0a, 0x0f,
+ 0x04, 0x09, 0x0e, 0x03,
+ 0x08, 0x0d, 0x02, 0x07,
+ 0x0c, 0x01, 0x06, 0x0b
+};
+
+static const vector unsigned char pre_xor_s0 = {
+ 0x10, 0x00, 0x00, 0x10,
+ 0x14, 0x04, 0x04, 0x14,
+ 0x18, 0x08, 0x08, 0x18,
+ 0x1c, 0x0c, 0x0c, 0x1c
+};
+
+static const vector unsigned char pre_xor_s1 = {
+ 0x15, 0x15, 0x05, 0x00,
+ 0x19, 0x19, 0x09, 0x04,
+ 0x1d, 0x1d, 0x0d, 0x08,
+ 0x11, 0x11, 0x01, 0x0c
+};
+
+static const vector unsigned char pre_xor_s2 = {
+ 0x05, 0x1a, 0x1a, 0x05,
+ 0x09, 0x1e, 0x1e, 0x09,
+ 0x0d, 0x12, 0x12, 0x0d,
+ 0x01, 0x16, 0x16, 0x01
+};
+
+static const vector unsigned char pre_xor_s3 = {
+ 0x0a, 0x0a, 0x1f, 0x0a,
+ 0x0e, 0x0e, 0x13, 0x0e,
+ 0x02, 0x02, 0x17, 0x02,
+ 0x06, 0x06, 0x1b, 0x06
+};
+
+static const vector unsigned char pre_xor_s4 = {
+ 0x0f, 0x0f, 0x0f, 0x1f,
+ 0x03, 0x03, 0x03, 0x13,
+ 0x07, 0x07, 0x07, 0x17,
+ 0x0b, 0x0b, 0x0b, 0x1b
+};
+
+static const vector unsigned char sbox_dec[16] = {
+ { 0x52, 0x09, 0x6a, 0xd5, 0x30, 0x36, 0xa5, 0x38,
+ 0xbf, 0x40, 0xa3, 0x9e, 0x81, 0xf3, 0xd7, 0xfb },
+ { 0x7c, 0xe3, 0x39, 0x82, 0x9b, 0x2f, 0xff, 0x87,
+ 0x34, 0x8e, 0x43, 0x44, 0xc4, 0xde, 0xe9, 0xcb },
+ { 0x54, 0x7b, 0x94, 0x32, 0xa6, 0xc2, 0x23, 0x3d,
+ 0xee, 0x4c, 0x95, 0x0b, 0x42, 0xfa, 0xc3, 0x4e },
+ { 0x08, 0x2e, 0xa1, 0x66, 0x28, 0xd9, 0x24, 0xb2,
+ 0x76, 0x5b, 0xa2, 0x49, 0x6d, 0x8b, 0xd1, 0x25 },
+ { 0x72, 0xf8, 0xf6, 0x64, 0x86, 0x68, 0x98, 0x16,
+ 0xd4, 0xa4, 0x5c, 0xcc, 0x5d, 0x65, 0xb6, 0x92 },
+ { 0x6c, 0x70, 0x48, 0x50, 0xfd, 0xed, 0xb9, 0xda,
+ 0x5e, 0x15, 0x46, 0x57, 0xa7, 0x8d, 0x9d, 0x84 },
+ { 0x90, 0xd8, 0xab, 0x00, 0x8c, 0xbc, 0xd3, 0x0a,
+ 0xf7, 0xe4, 0x58, 0x05, 0xb8, 0xb3, 0x45, 0x06 },
+ { 0xd0, 0x2c, 0x1e, 0x8f, 0xca, 0x3f, 0x0f, 0x02,
+ 0xc1, 0xaf, 0xbd, 0x03, 0x01, 0x13, 0x8a, 0x6b },
+ { 0x3a, 0x91, 0x11, 0x41, 0x4f, 0x67, 0xdc, 0xea,
+ 0x97, 0xf2, 0xcf, 0xce, 0xf0, 0xb4, 0xe6, 0x73 },
+ { 0x96, 0xac, 0x74, 0x22, 0xe7, 0xad, 0x35, 0x85,
+ 0xe2, 0xf9, 0x37, 0xe8, 0x1c, 0x75, 0xdf, 0x6e },
+ { 0x47, 0xf1, 0x1a, 0x71, 0x1d, 0x29, 0xc5, 0x89,
+ 0x6f, 0xb7, 0x62, 0x0e, 0xaa, 0x18, 0xbe, 0x1b },
+ { 0xfc, 0x56, 0x3e, 0x4b, 0xc6, 0xd2, 0x79, 0x20,
+ 0x9a, 0xdb, 0xc0, 0xfe, 0x78, 0xcd, 0x5a, 0xf4 },
+ { 0x1f, 0xdd, 0xa8, 0x33, 0x88, 0x07, 0xc7, 0x31,
+ 0xb1, 0x12, 0x10, 0x59, 0x27, 0x80, 0xec, 0x5f },
+ { 0x60, 0x51, 0x7f, 0xa9, 0x19, 0xb5, 0x4a, 0x0d,
+ 0x2d, 0xe5, 0x7a, 0x9f, 0x93, 0xc9, 0x9c, 0xef },
+ { 0xa0, 0xe0, 0x3b, 0x4d, 0xae, 0x2a, 0xf5, 0xb0,
+ 0xc8, 0xeb, 0xbb, 0x3c, 0x83, 0x53, 0x99, 0x61 },
+ { 0x17, 0x2b, 0x04, 0x7e, 0xba, 0x77, 0xd6, 0x26,
+ 0xe1, 0x69, 0x14, 0x63, 0x55, 0x21, 0x0c, 0x7d }
+};
+
+static const vector unsigned char inv_shift_round = {
+ 0x00, 0x0d, 0x0a, 0x07,
+ 0x04, 0x01, 0x0e, 0x0B,
+ 0x08, 0x05, 0x02, 0x0f,
+ 0x0c, 0x09, 0x06, 0x03
+};
+
+static const vector unsigned char inv_select_0e_shifted = {
+ 0x00, 0x0d, 0x0a, 0x07,
+ 0x04, 0x01, 0x0e, 0x0B,
+ 0x08, 0x05, 0x02, 0x0f,
+ 0x0c, 0x09, 0x06, 0x03
+};
+
+static const vector unsigned char inv_select_0b_shifted = {
+ 0x0d, 0x0a, 0x07, 0x00,
+ 0x01, 0x0e, 0x0b, 0x04,
+ 0x05, 0x02, 0x0f, 0x08,
+ 0x09, 0x06, 0x03, 0x0c
+};
+
+static const vector unsigned char inv_select_0d_shifted = {
+ 0x0a, 0x07, 0x00, 0x0d,
+ 0x0e, 0x0b, 0x04, 0x01,
+ 0x02, 0x0f, 0x08, 0x05,
+ 0x06, 0x03, 0x0c, 0x09
+};
+
+static const vector unsigned char inv_select_09_shifted = {
+ 0x07, 0x00, 0x0d, 0x0a,
+ 0x0b, 0x04, 0x01, 0x0e,
+ 0x0f, 0x08, 0x05, 0x02,
+ 0x03, 0x0c, 0x09, 0x06
+};
+
+static const vector unsigned char inv_select_0e_norm = {
+ 0x00, 0x01, 0x02, 0x03,
+ 0x04, 0x05, 0x06, 0x07,
+ 0x08, 0x09, 0x0a, 0x0b,
+ 0x0c, 0x0d, 0x0e, 0x0f
+};
+
+static const vector unsigned char inv_select_0b_norm = {
+ 0x01, 0x02, 0x03, 0x00,
+ 0x05, 0x06, 0x07, 0x04,
+ 0x09, 0x0a, 0x0b, 0x08,
+ 0x0d, 0x0e, 0x0f, 0x0c
+};
+
+static const vector unsigned char inv_select_0d_norm = {
+ 0x02, 0x03, 0x00, 0x01,
+ 0x06, 0x07, 0x04, 0x05,
+ 0x0a, 0x0b, 0x08, 0x09,
+ 0x0e, 0x0f, 0x0c, 0x0d
+};
+
+static const vector unsigned char inv_select_09_norm = {
+ 0x03, 0x00, 0x01, 0x02,
+ 0x07, 0x04, 0x05, 0x06,
+ 0x0b, 0x08, 0x09, 0x0a,
+ 0x0f, 0x0c, 0x0d, 0x0e
+};
+/* encryption code */
+
+static vector unsigned char ByteSub(vector unsigned char state)
+{
+ /* line of the s-box */
+ vector unsigned char line_01, line_23, line_45, line_67,
+ line_89, line_AB, line_CD, line_EF;
+ /* selector */
+ vector unsigned char sel1, sel2, sel7;
+ /* correct lines */
+ vector unsigned char cor_0123, cor_4567, cor_89AB, cor_CDEF,
+ cor_0to7, cor_8toF;
+ vector unsigned char ret_state, lower_state;
+ vector unsigned char state_shift2, state_shift1;
+
+ lower_state = spu_and(state, (unsigned char) 0x1f);
+ line_01 = spu_shuffle(sbox_enc[0], sbox_enc[1], lower_state);
+ line_23 = spu_shuffle(sbox_enc[2], sbox_enc[3], lower_state);
+ line_45 = spu_shuffle(sbox_enc[4], sbox_enc[5], lower_state);
+ line_67 = spu_shuffle(sbox_enc[6], sbox_enc[7], lower_state);
+ line_89 = spu_shuffle(sbox_enc[8], sbox_enc[9], lower_state);
+ line_AB = spu_shuffle(sbox_enc[10], sbox_enc[11], lower_state);
+ line_CD = spu_shuffle(sbox_enc[12], sbox_enc[13], lower_state);
+ line_EF = spu_shuffle(sbox_enc[14], sbox_enc[15], lower_state);
+
+ state_shift2 = spu_and(state, 0x3f);
+ sel2 = spu_cmpgt(state_shift2, 0x1f);
+ cor_0123 = spu_sel(line_01, line_23, sel2);
+ cor_4567 = spu_sel(line_45, line_67, sel2);
+ cor_89AB = spu_sel(line_89, line_AB, sel2);
+ cor_CDEF = spu_sel(line_CD, line_EF, sel2);
+
+ state_shift1 = spu_slqw(state, 1);
+ sel1 = spu_cmpgt(state_shift1, 0x7f);
+ cor_0to7 = spu_sel(cor_0123, cor_4567, sel1);
+ cor_8toF = spu_sel(cor_89AB, cor_CDEF, sel1);
+
+ sel7 = spu_cmpgt(state, 0x7f);
+ ret_state = spu_sel(cor_0to7, cor_8toF, sel7);
+
+ return ret_state;
+}
+
+static vector unsigned char ShiftRow(vector unsigned char state)
+{
+ return spu_shuffle(state, state, shift_round);
+}
+
+static vector unsigned char MixColumn(vector unsigned char state)
+{
+ vector unsigned char imm_00h;
+ vector unsigned char need_add, lower_state;
+ vector unsigned char shifted_vec, modul;
+ vector unsigned char toadd, xtimed;
+ vector unsigned char op1, op2, op3, op4, op5;
+ vector unsigned char xor_12, xor_34, xor_1234, ret;
+
+ imm_00h = spu_splats((unsigned char) 0x00);
+ modul = spu_splats((unsigned char) 0x1b);
+
+ need_add = (vector unsigned char)spu_cmpgt(state, 0x7f);
+ lower_state = spu_and(state, 0x7f);
+ shifted_vec = spu_slqw(lower_state, 0x01);
+ toadd = spu_sel(imm_00h, modul, need_add);
+
+ xtimed = spu_xor(toadd, shifted_vec);
+
+ op1 = spu_shuffle(state, xtimed, pre_xor_s0);
+ op2 = spu_shuffle(state, xtimed, pre_xor_s1);
+ op3 = spu_shuffle(state, xtimed, pre_xor_s2);
+ op4 = spu_shuffle(state, xtimed, pre_xor_s3);
+ op5 = spu_shuffle(state, xtimed, pre_xor_s4);
+
+ xor_12 = spu_xor(op1, op2);
+ xor_34 = spu_xor(op3, op4);
+ xor_1234 = spu_xor(xor_12, xor_34);
+ ret = spu_xor(xor_1234, op5);
+
+ return ret;
+}
+
+static vector unsigned char AddRoundKey(vector unsigned char state,
+ vector unsigned char key)
+{
+ return spu_xor(state, key);
+}
+
+static vector unsigned char normalRound(vector unsigned char state,
+ vector unsigned char key)
+{
+ vector unsigned char pstate;
+
+ pstate = ByteSub(state);
+ pstate = MixColumn(pstate);
+ pstate = AddRoundKey(pstate, key);
+ return pstate;
+}
+
+static vector unsigned char finalRound(vector unsigned char state,
+ vector unsigned char key)
+{
+ vector unsigned char pstate;
+
+ pstate = ByteSub(state);
+ pstate = ShiftRow(pstate);
+ pstate = AddRoundKey(pstate, key);
+ return pstate;
+}
+
+static vector unsigned char aes_encrypt_block(vector unsigned char in,
+ const vector unsigned char *key, unsigned char key_len)
+{
+ unsigned char i;
+ vector unsigned char pstate;
+
+ pstate = spu_xor(in, *key++);
+ switch (key_len) {
+ case 8: /* 14 rounds */
+ pstate = normalRound(pstate, *key++);
+ pstate = normalRound(pstate, *key++);
+
+ case 6: /* 12 rounds */
+ pstate = normalRound(pstate, *key++);
+ pstate = normalRound(pstate, *key++);
+
+ case 4: /* 10 rounds */
+ for (i = 0; i < 9; i++)
+ pstate = normalRound(pstate, *key++);
+
+ break;
+ default:
+ /* unsupported */
+ BUG();
+ }
+
+ pstate = finalRound(pstate, *key);
+ return pstate;
+}
+
+static int aes_encrypt_spu_block_char(unsigned char *buffer,
+ const unsigned char *kp, unsigned int key_len)
+{
+ vector unsigned char pstate;
+
+ pstate = (*((vector unsigned char *)(buffer)));
+ pstate = aes_encrypt_block(pstate, (const vector unsigned char*) kp,
+ key_len);
+
+ *((vec_uchar16 *)(buffer)) = pstate;
+ return 0;
+}
+
+/* decryption code, alternative version */
+
+static vector unsigned char InvByteSub(vector unsigned char state)
+{
+ /* line of the s-box */
+ vector unsigned char line_01, line_23, line_45, line_67,
+ line_89, line_AB, line_CD, line_EF;
+ /* selector */
+ vector unsigned char sel1, sel2, sel7;
+ /* correct lines */
+ vector unsigned char cor_0123, cor_4567, cor_89AB, cor_CDEF,
+ cor_0to7, cor_8toF;
+ vector unsigned char ret_state, lower_state;
+ vector unsigned char state_shift2, state_shift1;
+
+ lower_state = spu_and(state, 0x1f);
+ line_01 = spu_shuffle(sbox_dec[0], sbox_dec[1], lower_state);
+ line_23 = spu_shuffle(sbox_dec[2], sbox_dec[3], lower_state);
+ line_45 = spu_shuffle(sbox_dec[4], sbox_dec[5], lower_state);
+ line_67 = spu_shuffle(sbox_dec[6], sbox_dec[7], lower_state);
+ line_89 = spu_shuffle(sbox_dec[8], sbox_dec[9], lower_state);
+ line_AB = spu_shuffle(sbox_dec[10], sbox_dec[11], lower_state);
+ line_CD = spu_shuffle(sbox_dec[12], sbox_dec[13], lower_state);
+ line_EF = spu_shuffle(sbox_dec[14], sbox_dec[15], lower_state);
+
+ state_shift2 = spu_and(state, 0x3f);
+ sel2 = spu_cmpgt(state_shift2, 0x1f);
+ cor_0123 = spu_sel(line_01, line_23, sel2);
+ cor_4567 = spu_sel(line_45, line_67, sel2);
+ cor_89AB = spu_sel(line_89, line_AB, sel2);
+ cor_CDEF = spu_sel(line_CD, line_EF, sel2);
+
+ state_shift1 = spu_slqw(state, 1);
+ sel1 = spu_cmpgt(state_shift1, 0x7f);
+ cor_0to7 = spu_sel(cor_0123, cor_4567, sel1);
+ cor_8toF = spu_sel(cor_89AB, cor_CDEF, sel1);
+
+ sel7 = spu_cmpgt(state, 0x7f);
+ ret_state = spu_sel(cor_0to7, cor_8toF, sel7);
+
+ return ret_state;
+}
+
+static vector unsigned char InvShiftRow(vector unsigned char state)
+{
+
+ return spu_shuffle(state, state, inv_shift_round);
+}
+
+static vector unsigned char InvMixColumn(vector unsigned char state)
+{
+ vector unsigned char op0, op1, op2, op3, op4, op5;
+ vector unsigned char mul_0e, mul_09, mul_0d, mul_0b;
+ vector unsigned char ret;
+ vector unsigned char imm_00h;
+ vector unsigned char need_add, statef_shift;
+ vector unsigned char shifted_vec, modul;
+ vector unsigned char toadd;
+ vector unsigned char mul_2, mul_4, mul_8;
+ vector unsigned char mul_2_4;
+
+ /* compute 0e, 0b, 0d, 09 in GF */
+ imm_00h = spu_splats((unsigned char) 0x00);
+ modul = spu_splats((unsigned char) 0x1b);
+
+ need_add = (vector unsigned char)spu_cmpgt(state, 0x7f);
+ toadd = spu_sel(imm_00h, modul, need_add);
+ statef_shift = spu_and(state, 0x7f);
+ shifted_vec = spu_slqw(statef_shift, 0x01);
+ mul_2 = spu_xor(toadd, shifted_vec);
+
+ need_add = (vector unsigned char)spu_cmpgt(mul_2, 0x7f);
+ toadd = spu_sel(imm_00h, modul, need_add);
+ statef_shift = spu_and(mul_2, 0x7f);
+ shifted_vec = spu_slqw(statef_shift, 0x01);
+ mul_4 = spu_xor(toadd, shifted_vec);
+
+ need_add = (vector unsigned char)spu_cmpgt(mul_4, 0x7f);
+ statef_shift = spu_and(mul_4, 0x7f);
+ shifted_vec = spu_slqw(statef_shift, 0x01);
+ toadd = spu_sel(imm_00h, modul, need_add);
+ mul_8 = spu_xor(toadd, shifted_vec);
+
+ mul_2_4 = spu_xor(mul_2, mul_4);
+ /* 09 = 8 * 1 */
+ mul_09 = spu_xor(mul_8, state);
+
+ /* 0e = 2 * 4 * 8 */
+ mul_0e = spu_xor(mul_2_4, mul_8);
+
+ /* 0b = 2 * 8 * 1 */
+ mul_0b = spu_xor(mul_2, mul_09);
+
+ /* 0d = 4 * 8 * 1 */
+ mul_0d = spu_xor(mul_4, mul_09);
+
+ /* prepare vectors for add */
+ op0 = spu_shuffle(mul_0e, mul_0e, inv_select_0e_shifted);
+ op1 = spu_shuffle(mul_0b, mul_0b, inv_select_0b_shifted);
+ op2 = spu_shuffle(mul_0d, mul_0d, inv_select_0d_shifted);
+ op3 = spu_shuffle(mul_09, mul_09, inv_select_09_shifted);
+
+ op4 = spu_xor(op0, op1);
+ op5 = spu_xor(op2, op3);
+ ret = spu_xor(op4, op5);
+ return ret;
+}
+
+static vector unsigned char InvNormalRound(vector unsigned char state,
+ vector unsigned char key)
+{
+ vector unsigned char pstate;
+
+ pstate = InvByteSub(state);
+ pstate = InvMixColumn(pstate);
+ pstate = AddRoundKey(pstate, key);
+ return pstate;
+}
+
+static vector unsigned char InvfinalRound(vector unsigned char state,
+ vector unsigned char key)
+{
+ vector unsigned char pstate;
+
+ pstate = InvByteSub(state);
+ pstate = InvShiftRow(pstate);
+ pstate = AddRoundKey(pstate, key);
+ return pstate;
+}
+
+
+static vector unsigned char aes_decrypt_block(vector unsigned char in,
+ const vector unsigned char *key, unsigned int key_len)
+{
+ vector unsigned char pstate;
+ unsigned int i;
+
+ pstate = spu_xor(in, *key++);
+
+ switch (key_len) {
+ case 8: /* 14 rounds */
+ pstate = InvNormalRound(pstate, *key++);
+ pstate = InvNormalRound(pstate, *key++);
+
+ case 6: /* 12 rounds */
+ pstate = InvNormalRound(pstate, *key++);
+ pstate = InvNormalRound(pstate, *key++);
+
+ case 4: /* 10 rounds */
+ for (i = 0; i < 9; i++)
+ pstate = InvNormalRound(pstate, *key++);
+
+ break;
+ default:
+ BUG();
+ }
+
+ pstate = InvfinalRound(pstate, *key);
+ return pstate;
+}
+
+static int aes_decrypt_block_char(unsigned char *buffer,
+ const unsigned char *kp, unsigned int key_len)
+{
+ vector unsigned char pstate;
+
+ pstate = (*((vector unsigned char *)(buffer)));
+ pstate = aes_decrypt_block(pstate, (const vector unsigned char*) kp,
+ key_len);
+ *((vec_uchar16 *)(buffer)) = pstate;
+ return 0;
+}
+
+static int aes_encrypt_ecb(unsigned char *buffer,
+ const unsigned char *kp, unsigned int key_len, unsigned int len)
+{
+ unsigned int left = len;
+
+ while (left >= 16) {
+ aes_encrypt_spu_block_char(buffer, kp, key_len);
+ left -= 16;
+ buffer += 16;
+ }
+
+ return len;
+}
+
+static int aes_decrypt_ecb(unsigned char *buffer,
+ const unsigned char *kp, unsigned int key_len, unsigned int len)
+{
+ unsigned int left = len;
+
+ while (left >= 16) {
+ aes_decrypt_block_char(buffer, kp, key_len);
+ left -= 16;
+ buffer += 16;
+ }
+ return len;
+}
+
+static int aes_encrypt_cbc(unsigned char *buffer,
+ const unsigned char *kp, unsigned int key_len, unsigned int len,
+ unsigned char *iv_)
+{
+ unsigned int i;
+ vector unsigned char iv, input;
+
+ iv = (*((vector unsigned char *)(iv_)));
+ for (i = 0; i < len; i += 16) {
+ input = (*((vector unsigned char *)(buffer)));
+ input = spu_xor(input, iv);
+
+ iv = aes_encrypt_block(input, (const vector unsigned char*) kp,
+ key_len);
+
+ *((vec_uchar16 *)(buffer)) = iv;
+
+ buffer += 16;
+ }
+
+ *((vec_uchar16 *)(iv_)) = iv;
+ return len;
+}
+
+static int aes_decrypt_cbc(unsigned char *buffer,
+ const unsigned char *kp, unsigned int key_len, unsigned int len,
+ unsigned char *iv_)
+{
+ unsigned int i;
+ vector unsigned char iv, input, vret, decrypted;
+
+ iv = (*((vector unsigned char *)(iv_)));
+ for (i = 0; i < len; i += 16) {
+
+ input = (*((vector unsigned char *)(buffer)));
+ vret = aes_decrypt_block(input,
+ (const vector unsigned char*) kp, key_len);
+
+ decrypted = spu_xor(vret, iv);
+ iv = input;
+
+ *((vec_uchar16 *)(buffer)) = decrypted;
+
+ buffer += 16;
+ }
+
+ *((vec_uchar16 *)(iv_)) = iv;
+ return len;
+}
+
+static struct aes_key_struct keys[SPU_KEY_SLOTS];
+
+void spu_aes_update_key(struct kspu_job *kjob, void *buffer,
+ unsigned int buf_num)
+{
+ struct aes_update_key *aes_update_key = &kjob->aes_update_key;
+
+ memcpy_aligned(&keys[aes_update_key->keyid], buffer,
+ sizeof(struct aes_key_struct));
+}
+
+void spu_aes_encrypt_ecb(struct kspu_job *kjob, void *buffer,
+ unsigned int buf_num)
+{
+ struct aes_crypt *aes_crypt = &kjob->aes_crypt;
+ unsigned int cur_key;
+ unsigned long data_len;
+
+ data_len = kjob->in_size;
+ cur_key = aes_crypt->keyid;
+ aes_encrypt_ecb(buffer, keys[cur_key].enc, keys[cur_key].len, data_len);
+
+ init_put_data(buffer, aes_crypt->out, data_len, buf_num);
+}
+
+void spu_aes_decrypt_ecb(struct kspu_job *kjob, void *buffer,
+ unsigned int buf_num)
+{
+ struct aes_crypt *aes_crypt = &kjob->aes_crypt;
+ unsigned int cur_key;
+ unsigned long data_len;
+
+ data_len = kjob->in_size;
+ cur_key = aes_crypt->keyid;
+ aes_decrypt_ecb(buffer, keys[cur_key].dec, keys[cur_key].len, data_len);
+
+ init_put_data(buffer, aes_crypt->out, data_len, buf_num);
+}
+
+void spu_aes_encrypt_cbc(struct kspu_job *kjob, void *buffer,
+ unsigned int buf_num)
+{
+ struct aes_crypt *aes_crypt = &kjob->aes_crypt;
+ unsigned int cur_key;
+ unsigned long data_len;
+
+ data_len = kjob->in_size;
+ cur_key = aes_crypt->keyid;
+
+ aes_encrypt_cbc(buffer, keys[cur_key].enc, keys[cur_key].len,
+ data_len, aes_crypt->iv);
+
+ init_put_data(buffer, aes_crypt->out, data_len, buf_num);
+}
+
+void spu_aes_decrypt_cbc(struct kspu_job *kjob, void *buffer,
+ unsigned int buf_num)
+{
+ struct aes_crypt *aes_crypt = &kjob->aes_crypt;
+ unsigned int cur_key;
+ unsigned long data_len;
+
+ data_len = kjob->in_size;
+ cur_key = aes_crypt->keyid;
+
+ aes_decrypt_cbc(buffer, keys[cur_key].dec, keys[cur_key].len,
+ data_len, aes_crypt->iv);
+
+ init_put_data(buffer, aes_crypt->out, data_len, buf_num);
+}
--- a/arch/powerpc/platforms/cell/spufs/spu_main.c
+++ b/arch/powerpc/platforms/cell/spufs/spu_main.c
@@ -11,6 +11,11 @@
static spu_operation_t spu_ops[TOTAL_SPU_OPS] __attribute__((aligned(16))) = {
[SPU_OP_nop] = spu_nop,
+ [SPU_OP_aes_update_key] = spu_aes_update_key,
+ [SPU_OP_aes_encrypt_ecb] = spu_aes_encrypt_ecb,
+ [SPU_OP_aes_decrypt_ecb] = spu_aes_decrypt_ecb,
+ [SPU_OP_aes_encrypt_cbc] = spu_aes_encrypt_cbc,
+ [SPU_OP_aes_decrypt_cbc] = spu_aes_decrypt_cbc,
};
static unsigned char kspu_buff[DMA_BUFFERS][DMA_MAX_TRANS_SIZE];
--- a/arch/powerpc/platforms/cell/spufs/spu_runtime.h
+++ b/arch/powerpc/platforms/cell/spufs/spu_runtime.h
@@ -26,4 +26,14 @@ void memcpy_aligned(void *dest, const vo
void spu_nop(struct kspu_job *kjob, void *buffer,
unsigned int buf_num);
+void spu_aes_update_key(struct kspu_job *kjob, void *buffer,
+ unsigned int buf_num);
+void spu_aes_encrypt_ecb(struct kspu_job *kjob, void *buffer,
+ unsigned int buf_num);
+void spu_aes_decrypt_ecb(struct kspu_job *kjob, void *buffer,
+ unsigned int buf_num);
+void spu_aes_encrypt_cbc(struct kspu_job *kjob, void *buffer,
+ unsigned int buf_num);
+void spu_aes_decrypt_cbc(struct kspu_job *kjob, void *buffer,
+ unsigned int buf_num);
#endif
--- a/drivers/crypto/Kconfig
+++ b/drivers/crypto/Kconfig
@@ -48,6 +48,19 @@ config CRYPTO_DEV_PADLOCK_SHA
source "arch/s390/crypto/Kconfig"
+config CRYPTO_AES_SPU
+ tristate "AES cipher algorithm (SPU support)"
+ select CRYPTO_ABLKCIPHER
+ depends on SPU_FS && KSPU
+ help
+ AES cipher algorithms (FIPS-197). AES uses the Rijndael
+ algorithm.
+ AES specifies three key sizes: 128, 192 and 256 bits.
+ See <http://csrc.nist.gov/CryptoToolkit/aes/> for more information.
+
+ This version of AES performs its work on an SPU core and supports
+ the ECB and CBC block modes.
+
config CRYPTO_DEV_GEODE
tristate "Support for the Geode LX AES engine"
depends on X86_32 && PCI
--- /dev/null
+++ b/include/asm-powerpc/kspu/aes.h
@@ -0,0 +1,28 @@
+#ifndef __SPU_AES_H__
+#define __SPU_AES_H__
+
+#define MAX_AES_ROUNDS 15
+#define MAX_AES_KEYSIZE_INT (MAX_AES_ROUNDS * 4)
+#define MAX_AES_KEYSIZE_BYTE (MAX_AES_KEYSIZE_INT * 4)
+#define SPU_KEY_SLOTS 5
+
+struct aes_key_struct {
+ unsigned char enc[MAX_AES_KEYSIZE_BYTE] __attribute__((aligned(16)));
+ unsigned char dec[MAX_AES_KEYSIZE_BYTE] __attribute__((aligned(16)));
+ unsigned int len __attribute__((aligned(16)));
+};
+
+struct aes_update_key {
+ /* copy key from ea to ls into a specific slot */
+ unsigned int keyid __attribute__((aligned(16)));
+};
+
+struct aes_crypt {
+ /* in */
+ unsigned int keyid __attribute__((aligned(16)));
+
+ /* out */
+ unsigned char iv[16] __attribute__((aligned(16))); /* as well as in */
+ unsigned long long out __attribute__((aligned(16)));
+};
+#endif
--- a/include/asm-powerpc/kspu/merged_code.h
+++ b/include/asm-powerpc/kspu/merged_code.h
@@ -1,5 +1,6 @@
#ifndef KSPU_MERGED_CODE_H
#define KSPU_MERGED_CODE_H
+#include <asm/kspu/aes.h>
#define KSPU_LS_SIZE 0x40000
@@ -17,6 +18,12 @@
*/
enum SPU_OPERATIONS {
SPU_OP_nop,
+ SPU_OP_aes_setkey,
+ SPU_OP_aes_update_key,
+ SPU_OP_aes_encrypt_ecb,
+ SPU_OP_aes_decrypt_ecb,
+ SPU_OP_aes_encrypt_cbc,
+ SPU_OP_aes_decrypt_cbc,
TOTAL_SPU_OPS,
};
@@ -30,6 +37,8 @@ struct kspu_job {
* function.
*/
union {
+ struct aes_update_key aes_update_key;
+ struct aes_crypt aes_crypt;
} __attribute__((aligned(16)));
};
--
^ permalink raw reply [flat|nested] 16+ messages in thread
* [patch 10/10] cryptoapi: async speed test
2007-08-16 20:01 [patch 00/10] KSPU API + AES offloaded to SPU + testing module Sebastian Siewior
` (8 preceding siblings ...)
2007-08-16 20:01 ` [patch 09/10] spufs: SPU-AES support (kernel side) Sebastian Siewior
@ 2007-08-16 20:01 ` Sebastian Siewior
9 siblings, 0 replies; 16+ messages in thread
From: Sebastian Siewior @ 2007-08-16 20:01 UTC (permalink / raw)
To: cbe-oss-dev; +Cc: herbert, arnd, jk, linux-crypto, Sebastian Siewior
[-- Attachment #1: limi_speed_async.diff --]
[-- Type: text/plain, Size: 6087 bytes --]
This was used to test the async AES algorithm. tcrypt works as well but does
not enqueue that much :-)
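A typical run (just an example; the parameters are the module options defined
below) is something like

	modprobe limi-speeda alg=1 mode=0 keylen=16 buff_size=16384

and the throughput numbers show up in dmesg.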
Signed-off-by: Sebastian Siewior <sebastian@breakpoint.cc>
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -468,6 +468,12 @@ config CRYPTO_TEST
help
Quick & dirty crypto test module.
+config CRYPTO_LIMI_SPEED_ASYNC
+ tristate "Crypto algorithm speed test with msec resolution"
+ help
+ insmod/modprobe the module, and watch dmesg for results.
+ The test is for AES only; see modinfo for the options.
+
source "drivers/crypto/Kconfig"
endif # if CRYPTO
--- a/crypto/Makefile
+++ b/crypto/Makefile
@@ -50,6 +50,7 @@ obj-$(CONFIG_CRYPTO_MICHAEL_MIC) += mich
obj-$(CONFIG_CRYPTO_CRC32C) += crc32c.o
obj-$(CONFIG_CRYPTO_TEST) += tcrypt.o
+obj-$(CONFIG_CRYPTO_LIMI_SPEED_ASYNC) += limi-speeda.o
#
# generic algorithms and the async_tx api
--- /dev/null
+++ b/crypto/limi-speeda.c
@@ -0,0 +1,191 @@
+/*
+ * Code derived from crypto/tcrypt.c
+ *
+ * Small speed test with time resolution in msec.
+ * Author: Sebastian Siewior (sebastian _at_ breakpoint.cc)
+ * License: GPL v2
+ */
+
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/scatterlist.h>
+#include <linux/crypto.h>
+#include <linux/jiffies.h>
+#include <linux/types.h>
+#include <linux/err.h>
+
+static unsigned int buff_size = 16 * 1024;
+module_param(buff_size, uint, 0444);
+MODULE_PARM_DESC(buff_size, "Buffer allocated by kmalloc()");
+
+static unsigned int keylen = 16;
+module_param(keylen, uint, 0444);
+MODULE_PARM_DESC(keylen, "Length of the key (16, 24 or 32 bytes)");
+
+static unsigned int mode = 0;
+module_param(mode, uint, 0444);
+MODULE_PARM_DESC(mode, "0 -> encryption else decryption");
+
+static unsigned int big_loops = 10;
+module_param(big_loops, uint, 0444);
+MODULE_PARM_DESC(big_loops, "Number of measurements.");
+
+static unsigned int small_loops = 10000;
+module_param(small_loops, uint, 0444);
+MODULE_PARM_DESC(small_loops, "Loops within one measurement.");
+
+static unsigned int alg = 1;
+module_param(alg, uint, 0444);
+MODULE_PARM_DESC(alg, "0 -> ecb(aes), else -> cbc(aes)");
+
+static atomic_t req_pending;
+
+static void request_complete(struct crypto_async_request *req, int err)
+{
+ wait_queue_head_t *complete_wq = req->data;
+ struct ablkcipher_request *ablk_req = ablkcipher_request_cast(req);
+
+ atomic_dec(&req_pending);
+ ablkcipher_request_free(ablk_req);
+ wake_up_all(complete_wq);
+}
+
+static int __init init(void)
+{
+ struct scatterlist sg[1];
+ unsigned char iv[16];
+ struct crypto_ablkcipher *tfm;
+ static char *in;
+ unsigned int i;
+ unsigned int ret;
+ unsigned long start, end;
+ unsigned long total = 0;
+ unsigned long size_kb;
+ unsigned char key[32] = { 1, 2, 3, 4, 5, 6 };
+ const unsigned char *algname;
+ struct ablkcipher_request *ablk_req = NULL;
+ wait_queue_head_t complete_wq;
+ DEFINE_WAIT(wait_for_request);
+
+ algname = alg ? "cbc(aes)" : "ecb(aes)";
+ printk("Limi-speed: %s buff_size: %u, keylen: %d, mode: %s\n", algname, buff_size, keylen,
+ mode ? "decryption" : "encryption");
+ printk("loops: %d, iterations: %d, ", big_loops, small_loops);
+ size_kb = small_loops * buff_size / 1024;
+ printk("=> %lu kb or %lu mb a loop\n", size_kb, size_kb/1024);
+
+ if (keylen != 16 && keylen != 24 && keylen != 32) {
+ printk("Invalid keysize\n");
+ return -EINVAL;
+ }
+
+ in = kmalloc(buff_size, GFP_KERNEL);
+ if (in == NULL) {
+ printk("Failed to allocate memory.\n");
+ return -ENOMEM;
+ }
+ printk("'Alloc' in %d @%p\n", buff_size, in);
+ memset(in, 0x24, buff_size);
+ sg_set_buf(sg, in, buff_size);
+
+ tfm = crypto_alloc_ablkcipher(algname, 0, 0 /* CRYPTO_ALG_ASYNC */);
+
+ if (IS_ERR(tfm)) {
+ printk("failed to load transform for %s: %ld\n", algname, PTR_ERR(tfm));
+ goto leave;
+ }
+
+ ret = crypto_ablkcipher_setkey(tfm, key, keylen);
+ if (ret) {
+ printk("setkey() failed\n");
+ goto out_tfm;
+ }
+
+ init_waitqueue_head(&complete_wq);
+ atomic_set(&req_pending, 0);
+
+ for (i=0 ; i<big_loops; i++) {
+ int j;
+ ret = 0;
+ start = jiffies;
+
+ for (j=0; j < small_loops; j++) {
+ while (1) {
+
+ prepare_to_wait(&complete_wq, &wait_for_request, TASK_INTERRUPTIBLE);
+ ablk_req = ablkcipher_request_alloc(tfm, GFP_KERNEL);
+ if (likely(ablk_req))
+ break;
+ printk("ablkcipher_request_alloc() failed\n");
+ schedule();
+ }
+ atomic_inc(&req_pending);
+
+ ablkcipher_request_set_callback(ablk_req, CRYPTO_TFM_REQ_MAY_BACKLOG, request_complete, &complete_wq);
+ ablkcipher_request_set_crypt(ablk_req, sg, sg, buff_size, &iv);
+
+ if (!mode)
+ ret = crypto_ablkcipher_encrypt(ablk_req);
+ else
+ ret = crypto_ablkcipher_decrypt(ablk_req);
+
+ if (unlikely(ret == -EBUSY)) {
+ schedule();
+
+ } else if (unlikely(ret == 0)) {
+ printk("Process in SYNC mode, this is not cool\n");
+ ablkcipher_request_free(ablk_req);
+ goto out_tfm;
+ } else if (unlikely(ret != -EINPROGRESS)) {
+ printk("encryption failed: %d after (i,j) (%u,%u) iterations\n", ret, i, j);
+ ablkcipher_request_free(ablk_req);
+ goto out_tfm;
+ }
+
+ if (signal_pending(current)) {
+ printk("signal catched\n");
+ break;
+ }
+
+ }
+
+ while (1) {
+//clean_up_exit:
+ prepare_to_wait(&complete_wq, &wait_for_request, TASK_INTERRUPTIBLE);
+ if (atomic_read(&req_pending) == 0)
+ break;
+ schedule();
+ }
+
+ end = jiffies;
+ finish_wait(&complete_wq, &wait_for_request);
+
+ if ( !time_after(start, end)) {
+ printk("Run: %u msec\n", jiffies_to_msecs(end - start));
+ total += jiffies_to_msecs(end - start);
+ } else {
+ printk("Run: %u msec\n", jiffies_to_msecs(start - end));
+ total += jiffies_to_msecs(start - end);
+ }
+ if (signal_pending(current))
+ break;
+ }
+
+ total /= big_loops;
+ size_kb *= 1000;
+ size_kb /= total;
+ printk("Average: %lu msec, approx. %lu kb/sec || %lu mb/sec \n", total,
+ size_kb, size_kb/1024);
+out_tfm:
+ crypto_free_ablkcipher(tfm);
+leave:
+ kfree(in);
+ return -ENODEV;
+}
+
+static void __exit fini(void) { }
+
+module_init(init);
+module_exit(fini);
+
+MODULE_LICENSE("GPL");
--
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch 01/10] t add cast to regain ablkcipher_request from private ctx
2007-08-16 20:01 ` [patch 01/10] t add cast to regain ablkcipher_request from private ctx Sebastian Siewior
@ 2007-08-17 8:55 ` Herbert Xu
0 siblings, 0 replies; 16+ messages in thread
From: Herbert Xu @ 2007-08-17 8:55 UTC (permalink / raw)
To: Sebastian Siewior; +Cc: cbe-oss-dev, arnd, jk, linux-crypto, Sebastian Siewior
On Thu, Aug 16, 2007 at 10:01:06PM +0200, Sebastian Siewior wrote:
> This cast allows to regain the struct ablkcipher_request for a request
> from private data.
Hi Sebastian:
I think this function would make more sense as a private
function in your driver. That way you can give it an
explicit type rather than having it take a void *.
We want to avoid unnecessary casting like this where
possible as it's error-prone.
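For instance (just a sketch, the name is made up), something like this in the
AES driver itself would do:

	static inline struct ablkcipher_request *aes_req_from_ctx(
			struct async_d_request *ctx)
	{
		return container_of((void *) ctx, struct ablkcipher_request,
				__ctx);
	}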
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch 07/10] spufs: add kernel support for spu task
2007-08-16 20:01 ` [patch 07/10] spufs: add kernel support for spu task Sebastian Siewior
@ 2007-08-18 16:48 ` Arnd Bergmann
0 siblings, 0 replies; 16+ messages in thread
From: Arnd Bergmann @ 2007-08-18 16:48 UTC (permalink / raw)
To: Sebastian Siewior
Cc: cbe-oss-dev, herbert, jk, linux-crypto, Sebastian Siewior
On Thursday 16 August 2007, Sebastian Siewior wrote:
> +config KSPU
> + bool "Support for utilisation of SPU by the kernel"
> + depends on SPU_FS && EXPERIMENTAL
> + help
> + With this option enabled, the kernel is able to utilize the SPUs for its
> + own tasks.
It might be better to not have this user-selectable at all, but to
autoselect the option when it's used by other code.
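E.g. (untested sketch) the crypto driver's entry could then simply do

	config CRYPTO_AES_SPU
		tristate "AES cipher algorithm (SPU support)"
		depends on SPU_FS
		select KSPU
		select CRYPTO_ABLKCIPHER

with KSPU itself reduced to a non-visible bool option.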
> +out_archcoredump:
> + printk("kspu_init() failed\n");
> + unregister_arch_coredump_calls(&spufs_coredump_calls);
> out_syscalls:
> unregister_spu_syscalls(&spufs_calls);
> out_fs:
> @@ -804,12 +811,14 @@ out_sched:
> out_cache:
> kmem_cache_destroy(spufs_inode_cache);
> out:
> + printk("spufs init not performed\n");
> return ret;
> }
The printk lines don't follow the convention of using KERN_*
printk levels. I suggest you either remove them or turn them
into pr_debug().
> +
> +#include <asm/spu_priv1.h>
> +#include <asm/kspu/kspu.h>
> +#include <asm/kspu/merged_code.h>
> +#include <linux/kthread.h>
> +#include <linux/module.h>
> +#include <linux/init_task.h>
> +#include <linux/hardirq.h>
> +#include <linux/kernel.h>
#include lines should be ordered alphabetically, and <asm/ lines come
after <linux/ lines.
> +/*
> + * based on run.c spufs_run_spu
> + */
> +static int spufs_run_kernel_spu(void *priv)
> +{
> + struct kspu_context *kctx = (struct kspu_context *) priv;
> + struct spu_context *ctx = kctx->spu_ctx;
> + int ret;
> + u32 status;
> + unsigned int npc = 0;
> + int fastpath;
> + DEFINE_WAIT(wait_for_stop);
> + DEFINE_WAIT(wait_for_ibox);
> + DEFINE_WAIT(wait_for_newitem);
> +
> + spu_enable_spu(ctx);
> + ctx->event_return = 0;
> + spu_acquire(ctx);
> + if (ctx->state == SPU_STATE_SAVED) {
> + __spu_update_sched_info(ctx);
> +
> + ret = spu_activate(ctx, 0);
> + if (ret) {
> + spu_release(ctx);
> + printk(KERN_ERR "could not obtain runnable spu: %d\n",
> + ret);
> + BUG();
> + }
> + } else {
> + /*
> + * We have to update the scheduling priority under active_mutex
> + * to protect against find_victim().
> + */
> + spu_update_sched_info(ctx);
> + }
The code you have copied this from has recently been changed to also set an initial
time slice, you should do the same change here.
> +
> + spu_run_init(ctx, &npc);
> + do {
> + fastpath = 0;
> + prepare_to_wait(&ctx->stop_wq, &wait_for_stop,
> + TASK_INTERRUPTIBLE);
> + prepare_to_wait(&ctx->ibox_wq, &wait_for_ibox,
> + TASK_INTERRUPTIBLE);
> + prepare_to_wait(&kctx->newitem_wq, &wait_for_newitem,
> + TASK_INTERRUPTIBLE);
> +
> + if (unlikely(test_and_clear_bit(SPU_SCHED_NOTIFY_ACTIVE,
> + &ctx->sched_flags))) {
> +
> + if (!(status & SPU_STATUS_STOPPED_BY_STOP)) {
> + spu_switch_notify(ctx->spu, ctx);
> + }
> + }
> +
> + spuctx_switch_state(ctx, SPU_UTIL_SYSTEM);
> +
> + pr_debug("going to handle class1\n");
> + ret = spufs_handle_class1(ctx);
> + if (unlikely(ret)) {
> + /*
> + * SPE_EVENT_SPE_DATA_STORAGE => reference to invalid memory
> + */
> + printk(KERN_ERR "Invalid memory dereferenced by the"
> + "spu: %d\n", ret);
> + BUG();
> + }
> +
> + /* FIXME BUG: We need a physical SPU to discover
> + * ctx->spu->class_0_pending. It is not saved on context
> + * switch. We may lose this on context switch.
> + */
> + status = ctx->ops->status_read(ctx);
> + if (unlikely((ctx->spu && ctx->spu->class_0_pending) ||
> + status & SPU_STATUS_INVALID_INSTR)) {
> + printk(KERN_ERR "kspu error, status_register: 0x%08x\n",
> + status);
> + printk(KERN_ERR "event return: 0x%08lx, spu's npc: "
> + "0x%08x\n", kctx->spu_ctx->event_return,
> + kctx->spu_ctx->ops->npc_read(
> + kctx->spu_ctx));
> + printk(KERN_ERR "class_0_pending: 0x%lx\n", ctx->spu->class_0_pending);
> + print_kctx_debug(kctx);
> + BUG();
> + }
> +
> + if (notify_done_reqs(kctx))
> + fastpath = 1;
> +
> + if (queue_requests(kctx))
> + fastpath = 1;
> +
> + if (!(status & SPU_STATUS_RUNNING)) {
> + /* spu is currently not running */
> + pr_debug("SPU not running, last stop code was: %08x\n",
> + status >> SPU_STOP_STATUS_SHIFT);
> + if (pending_spu_work(kctx)) {
> + /* spu should run again */
> + pr_debug("Activate SPU\n");
> + kspu_fill_dummy_reqs(kctx);
> +
> + spu_run_fini(ctx, &npc, &status);
> + spu_acquire_runnable(ctx, 0);
> + spu_run_init(ctx, &npc);
> + } else {
> + /* spu finished work */
> + pr_debug("SPU will remain in stop state\n");
> + spu_run_fini(ctx, &npc, &status);
> + spu_yield(ctx);
> + spu_acquire(ctx);
> + }
> + } else {
> + pr_debug("SPU is running, switch state to util user\n");
> + spuctx_switch_state(ctx, SPU_UTIL_USER);
> + }
> +
> + if (fastpath)
> + continue;
> +
> + spu_release(ctx);
> + schedule();
> + spu_acquire(ctx);
> +
> + } while (!kthread_should_stop() || !list_empty(&kctx->work_queue));
The inner loop is rather long, in an already long function. Can you split out
parts into separate functions here?
> +#endif
> --- a/arch/powerpc/platforms/cell/spufs/spufs.h
> +++ b/arch/powerpc/platforms/cell/spufs/spufs.h
> @@ -344,4 +344,18 @@ static inline void spuctx_switch_state(s
> }
> }
>
> +#ifdef CONFIG_KSPU
> +int __init kspu_init(void);
> +void __exit kspu_exit(void);
The __init and __exit specifiers are not meaningful in the declaration;
you only need them in the definition.
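I.e. the spufs.h hunk only needs:

#ifdef CONFIG_KSPU
int kspu_init(void);
void kspu_exit(void);

while the definitions in the .c file keep the __init/__exit annotations.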
Arnd <><
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch 1/1] spufs: SPU-AES support (kspu+ablkcipher user)
[not found] ` <20070828154637.GA21007@Chamillionaire.breakpoint.cc>
@ 2007-08-29 7:15 ` Herbert Xu
2007-08-29 9:28 ` Sebastian Siewior
[not found] ` <18132.43463.753224.982580@cargo.ozlabs.ibm.com>
1 sibling, 1 reply; 16+ messages in thread
From: Herbert Xu @ 2007-08-29 7:15 UTC (permalink / raw)
To: linux-crypto, arnd, jk, cbe-oss-dev
On Tue, Aug 28, 2007 at 05:46:37PM +0200, Sebastian Siewior wrote:
>
> Herbert, could you please ACK / NACK your bits?
> I added the ablkcipher_request() cast helper in the driver with a proper
> type, as you suggested.
The crypto bits look alright.
Do you plan to use this for anything beyond AES? If so, it
would be good to come up with a way to share some of the crypto
code between the main kernel and what runs on the SPUs.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [Cbe-oss-dev] [patch 1/1] spufs: SPU-AES support (kspu+ablkcipher user)
[not found] ` <18132.43463.753224.982580@cargo.ozlabs.ibm.com>
@ 2007-08-29 9:09 ` Sebastian Siewior
0 siblings, 0 replies; 16+ messages in thread
From: Sebastian Siewior @ 2007-08-29 9:09 UTC (permalink / raw)
To: Paul Mackerras; +Cc: Herbert Xu, cbe-oss-dev, linux-crypto, arnd, jk
* Paul Mackerras | 2007-08-29 09:03:35 [+1000]:
>Sebastian Siewior writes:
>
>> CBC has one limitation: the IV is written back in the notification
>> callback. That means it is not available for crypto requests that
>> depend on the previous IV (as well as for crypto requests >16 KiB). Herbert Xu
>> pointed out that such requests do not currently occur in practice. For instance:
>> - IPsec brings its own IV with every packet. A packet is usually <=
>> 1500 bytes. Jumbo frames should not exceed 16 KiB.
>> - EcryptFS changes the IV on a page basis (every enc/dec request is
>> PAGE_SIZE long).
>
>The page size could be 64kB.
Yes, I am aware of this; that's why I mentioned it here. The only way
I could fix it is by caching the IV the same (or a similar) way I do
it for the key. I have not had time to implement this so far, and it
should not break IPsec or EcryptFS if you don't force it :)
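A rough sketch of that idea, with made-up names (nothing of this exists in
the driver yet): the per-tfm context would grow an IV field next to the
cached key, and the notification callback would refresh it with the
written-back IV before the next request is queued.

/* hypothetical per-tfm context, field names invented for illustration */
struct aes_spu_tfm_ctx {
	u8 key[32];	/* AES key, already cached for the SPU side today */
	u8 iv[16];	/* would hold the last IV written back by the SPU */
};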
>Paul.
Sebastian
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [patch 1/1] spufs: SPU-AES support (kspu+ablkcipher user)
2007-08-29 7:15 ` [patch 1/1] spufs: SPU-AES support (kspu+ablkcipher user) Herbert Xu
@ 2007-08-29 9:28 ` Sebastian Siewior
0 siblings, 0 replies; 16+ messages in thread
From: Sebastian Siewior @ 2007-08-29 9:28 UTC (permalink / raw)
To: Herbert Xu; +Cc: linux-crypto, arnd, jk, cbe-oss-dev
* Herbert Xu | 2007-08-29 15:15:07 [+0800]:
>Do you plan to use this for anything beyond AES? If so it
I probably won't, but others might.
>would be good to come up a way to share some of the crypto
>code between the main kernel and what runs on the SPUs.
Yes. You could share almost everything except the setkey() function on
the kernel side.
>Cheers,
Sebastian
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread
Thread overview: 16+ messages
2007-08-16 20:01 [patch 00/10] KSPU API + AES offloaded to SPU + testing module Sebastian Siewior
2007-08-16 20:01 ` [patch 01/10] t add cast to regain ablkcipher_request from private ctx Sebastian Siewior
2007-08-17 8:55 ` Herbert Xu
2007-08-16 20:01 ` [patch 02/10] crypto: retrieve private ctx aligned Sebastian Siewior
2007-08-16 20:01 ` [patch 03/10] spufs: kspu documentation Sebastian Siewior
2007-08-16 20:01 ` [patch 04/10] spufs: kspu doc skeleton Sebastian Siewior
2007-08-16 20:01 ` [patch 05/10] spufs: kspu add required declarations Sebastian Siewior
2007-08-16 20:01 ` [patch 06/10] spufs: add kspu_alloc_context() Sebastian Siewior
2007-08-16 20:01 ` [patch 07/10] spufs: add kernel support for spu task Sebastian Siewior
2007-08-18 16:48 ` Arnd Bergmann
2007-08-16 20:01 ` [patch 08/10] spufs: SPE side implementation of kspu Sebastian Siewior
2007-08-16 20:01 ` [patch 09/10] spufs: SPU-AES support (kernel side) Sebastian Siewior
[not found] ` <20070828154637.GA21007@Chamillionaire.breakpoint.cc>
2007-08-29 7:15 ` [patch 1/1] spufs: SPU-AES support (kspu+ablkcipher user) Herbert Xu
2007-08-29 9:28 ` Sebastian Siewior
[not found] ` <18132.43463.753224.982580@cargo.ozlabs.ibm.com>
2007-08-29 9:09 ` [Cbe-oss-dev] " Sebastian Siewior
2007-08-16 20:01 ` [patch 10/10] cryptoapi: async speed test Sebastian Siewior