From: Sebastian Siewior <cbe-oss-dev@ml.breakpoint.cc>
To: cbe-oss-dev@ozlabs.org
Cc: <herbert@gondor.apana.org.au>, <arnd@arndb.de>, <jk@ozlabs.org>,
	linux-crypto@vger.kernel.org,
	Sebastian Siewior <sebastian@breakpoint.cc>
Subject: [patch 03/10] spufs: kspu documentation
Date: Thu, 16 Aug 2007 22:01:08 +0200
Message-ID: <20070816200135.452834000@ml.breakpoint.cc>
In-Reply-To: 20070816200105.735608000@ml.breakpoint.cc

[-- Attachment #1: spufs-kspu_doc.diff --]
[-- Type: text/plain, Size: 9822 bytes --]

Documentation on how to use KSPU from the PPU and the SPU side.

Signed-off-by: Sebastian Siewior <sebastian@breakpoint.cc>
--- /dev/null
+++ b/Documentation/powerpc/kspu.txt
@@ -0,0 +1,243 @@
+                  KSPU: Utilization of SPUs for kernel tasks
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+0. KSPU design
+==============
+
+The idea is to offload individual time-consuming tasks to the SPU. Those tasks
+are fed with the data that they have to process.
+Once the function on the SPU side is invoked, the input data is already
+available. After the job is done, the offloaded function must kick off a DMA
+transfer that writes the result back to main memory.
+On the PPU, the KSPU user first queues the job in a linked list and later
+receives a callback to put the job directly into the SPU's ring buffer. This
+intermediate stop is required for two reasons:
+- It must be possible to queue work items from softirq context.
+- All requests must be accepted, even if the ring buffer is full. Waiting (until
+	a slot becomes available) is not an option.
+
+The callback (for the enqueue process on the SPU) happens in kthread context,
+so a mutex may be held. However, there is only one kthread for this job, so
+every delay has a global impact.
+The user should enqueue only one job item per enqueue request. The user may
+enqueue more than one job item if _really_ necessary. If there are not enough
+free slots, then the enqueue function will be called as the first enqueue
+function once free slots are available again.
+After the offloaded function has completed the job, the kthread calls the
+completion callback (it is the same kthread that is used for enqueue).
+Double/multi buffering is performed by KSPU.
+
+0.5 SPU usage
+=============
+Currently only one SPU is used and allocated. Allocation occurs during KSPU
+module initialization via the spufs interface. Therefore the physical SPU is
+considered in the scheduling process (and "shared" with user space). Right
+now, it is not "easy" to find out:
+- how many SPUs may be taken (i.e. not used by user space)
+- how many SPUs are useful to be taken (depending on the workload)
+The latter is (theoretically) an easy accounting approach if there are no
+dependencies in processing (and two jobs of the same kind may be processed on
+two SPUs in parallel).
+A second SPU (context) may be required if the local store memory is used up.
+This can be prevented if "overlays" are used. The advantages over several SPU
+contexts:
+- less complexity (since there is only one kind of SPU code)
+- no tracking of which function is in which context. Also, overlay code
+	probably switches the binary faster than the scheduler does.
+
+1. Overview of memory layout
+============================
+
+            ------------------ 256 KiB
+            |    RB ENTRY    |
+            ------------------ Ring buffer entries
+            |   .........    |
+            ------------------
+            |    RB ENTRY    |
+            ------------------
+            |    RB state    | (consumed + outstanding)
+            ------------------
+            |      STACK     | Stack growing downwards
+            |       ||       |
+            |       \/       |
+            ------------------
+            |     .......    |  unused / reserved :)
+            ------------------
+            |      Data      |
+            | DMA Buffers,   |
+            | functions'     |
+            | private data   |
+            ------------------
+            |      Code      |
+            | offloaded SPU  |
+            | functions      |
+            ------------------
+            |   multiplexor  | spu_main.c
+            ------------------ 0
+
+The type of a ring buffer entry is struct kspu_job.
+The number of ring buffer entries is determined by RB_SLOTS.
+The number of DMA buffers is determined by DMA_BUFFERS.
+The stack grows uncontrolled. There is no (cheap) way to notice a stack
+overflow. After adding a new SPU program, the developer is encouraged to check
+the stack usage and make sure the stack will never hit the data segment. This
+task is not required if recursive functions are used (I hope the suicide part
+has been understood).
+
+1.1 Ring buffer
+===============
+The ring buffer was chosen because the data structure allows exchange of
+data (PPU <-> SPU) without any locking. A ring buffer entry consists of two
+parts:
+- Public data, which is known to KSPU.
+- Private data, which is known only to the user (hidden from KSPU).
+Public data contains the function parameters of the offloaded SPU program.
+Private data is meaningless to KSPU and may contain algorithm-specific
+information (like where to put the result).
+The number of ring buffer entries (RB_SLOTS) has two constraints (besides
+LS_SIZE :D):
+- it must be a power of 2.
+- it must be at least DMA_BUFFERS*2.
+
+1.2 DMA Buffers
+===============
+Every DMA buffer is DMA_MAX_TRANS_SIZE bytes in size. The size reflects the
+maximum transfer size that may be requested by the SPU. Therefore the same
+requirements apply here as to the MFC DMA size: it must be a multiple of 16
+and may not be larger than 16 KiB.
+The only limit for the number of available DMA buffers (DMA_BUFFERS) (besides
+the available space) is that "DMA_BUFFERS*2 <= RB_SLOTS" must be true. The
+reason for this constraint is that the "multiplexor", once started, requests
+DMA_BUFFERS buffers and starts processing. While processing the first batch,
+the next DMA_BUFFERS are requested (to get into streaming mode). After
+processing DMA_BUFFERS*2 requests, the first point is reached at which the SPU
+starts to notify the PPU about completed requests and may stop. Therefore the
+shortest run is DMA_BUFFERS*2 requests. If there are not enough requests
+available, KSPU fills the ring buffer with NOPs to fit. A NOP is a DMA
+transfer of size zero (a nop for the MFC) with just a return statement as
+the job function.
+
+2. Offloading a task to SPU
+===========================
+Three steps are required to offload a task to the SPU:
+- PPU code
+- SPU code
+- Update header files & Makefile
+
+The example code shows how to offload an 'add operation' via spu_my_add to
+the SPU. The complete implementation is in the skeleton files.
+
+2.1 PPU code
+============
+1. Init
+- Prepare a struct with 'struct kspu_work_item' embedded in it.
+	struct my_spu_req {
+		struct kspu_work_item kspu_work;
+		void *data;
+	};
+
+- Get the global kspu ctx.
+	struct kspu_context *kctx = kspu_get_kctx();
+
+2. Enqueue a specific request. (struct my_spu_req spe_req)
+- Set up the enqueue callback.
+	spe_req.kspu_work.enqueue = my_enqueue_func;
+
+- Enqueue it in kspu.
+	kspu_enqueue_work_item(kctx, &spe_req.kspu_work);
+
+3. Wait for the callback, then enqueue it on the SPU
+- Get an empty slot
+	struct kspu_job *work_item = kspu_get_rb_slot(kctx);
+
+- fill it
+	work_item->operation = SPU_OP_my_add;
+	work_item->in = spe_req.data;
+	work_item->in_size = 16;
+
+- mark it as ready
+	kspu_mark_rb_slot_ready(kctx, &spe_req.kspu_work);
+
+- set the finish callback
+	spe_req.kspu_work.notify = my_notify_func;
+
+4. Wait for the "finish" callback.
+- job finished.
+
+2.2 SPU code
+============
+- Prepare a function that matches the following prototype:
+	void spu_my_add(struct kspu_job *kjob, void *buffer, unsigned int buf_num)
+
+	Use init_put_data() to write data back to main memory. It is just a wrapper
+	around mfc_putf(). Use the supplied buf_num as the tag.
+	init_put_data(buffer, out, length, buf_num);
+
+2.3 Update files
+================
+- Define your private data structures, which are visible to both your PPU
+	program and your SPU program. They later become part of struct kspu_job if
+	you need them for parameters. Keep them as small as possible.
+
+- Add the function to enum SPU_OPERATIONS in
+	include/asm-powerpc/kspu/merged_code.h before TOTAL_OP_FUNCS
+
+- Add the function to spu_ops[] in arch/powerpc/platforms/cell/spufs/spu_main.c
+
+2.4 Skeleton files
+==================
+PPU code in Documentation/powerpc/kspu_ppu_skeleton.c
+SPU code in Documentation/powerpc/kspu_spu_skeleton.[ch]
+
+Merge both into kspu:
+--- a/arch/powerpc/platforms/cell/spufs/Makefile
++++ b/arch/powerpc/platforms/cell/spufs/Makefile
+@@ -24,6 +24,7 @@ kspu-y += kspu_helper.o kspu_code.o
+ $(obj)/kspu_code.o: $(obj)/spu_kspu_dump.h
+
+ spu_kspu_code_obj-y += $(obj)/spu_main.o $(obj)/spu_runtime.o
++spu_kspu_code_obj-y += $(obj)/spu_kspu_ppu_skeleton.o
+ spu_kspu_code_obj-y += $(spu_kspu_code_obj-m)
+
+ $(obj)/spu_kspu: $(spu_kspu_code_obj-y)
+
+--- a/arch/powerpc/platforms/cell/spufs/spu_main.c
++++ b/arch/powerpc/platforms/cell/spufs/spu_main.c
+@@ -13,6 +13,7 @@
+
+ static spu_operation_t spu_ops[TOTAL_SPU_FUNCS] __attribute__((aligned(16))) = {
+        [SPU_OP_nop] = spu_nop,
++       [SPU_OP_my_add] = spu_my_add
+ };
+
+ static unsigned char kspu_buff[DMA_BUFFERS][DMA_MAX_TRANS_SIZE];
+
+--- a/arch/powerpc/platforms/cell/spufs/spu_runtime.h
++++ b/arch/powerpc/platforms/cell/spufs/spu_runtime.h
+@@ -25,5 +25,6 @@ void memcpy_aligned(void *dest, const vo
+ /* exported offloaded functions */
+ void spu_nop(struct kspu_job *kjob, void *buffer,
+                unsigned int buf_num);
+
++void spu_my_add(struct kspu_job *kjob, void *buffer,
++               unsigned int buf_num);
+
+ #endif
+
+--- a/include/asm-powerpc/kspu/merged_code.h
++++ b/include/asm-powerpc/kspu/merged_code.h
+@@ -14,6 +14,7 @@
+
+ enum SPU_OPERATIONS {
+        SPU_OP_nop,
++       SPU_OP_my_add,
+
+        TOTAL_OP_FUNCS,
+ };
+@@ -23,6 +24,7 @@ struct kspu_job {
+        unsigned long long in __attribute__((aligned(16)));
+        unsigned int in_size __attribute__((aligned(16)));
+        union {
++               struct my_sum my_sum;
+        } __attribute__((aligned(16)));
+ };
+
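For illustration only (not part of the patch): the constraints on RB_SLOTS,
DMA_BUFFERS and DMA_MAX_TRANS_SIZE described in sections 1.1 and 1.2 of the
document can be written down as a small check. This is a minimal sketch in
plain C. The EX_* values and the names kspu_cfg_valid() and
kspu_cfg_check_example() are hypothetical and do not exist in the series, and
rejecting a zero buffer count or a zero size is an assumption of the sketch.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical example values, not the ones used by the KSPU headers. */
#define EX_RB_SLOTS		16u
#define EX_DMA_BUFFERS		4u
#define EX_DMA_MAX_TRANS_SIZE	(16u * 1024u)

/* True if n is a power of two (zero is rejected). */
static bool is_power_of_two(unsigned int n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/*
 * Check the constraints from sections 1.1 and 1.2:
 * - rb_slots is a power of 2
 * - rb_slots is at least dma_buffers * 2
 * - dma_size is a multiple of 16 and at most 16 KiB
 * Rejecting a zero buffer count or size is an assumption of this sketch.
 */
static bool kspu_cfg_valid(unsigned int rb_slots, unsigned int dma_buffers,
			   unsigned int dma_size)
{
	if (!is_power_of_two(rb_slots))
		return false;
	if (dma_buffers == 0 || rb_slots < dma_buffers * 2)
		return false;
	if (dma_size == 0 || dma_size % 16 != 0 || dma_size > 16u * 1024u)
		return false;
	return true;
}

/* Hypothetical helper: abort early if the example values are inconsistent. */
static void kspu_cfg_check_example(void)
{
	assert(kspu_cfg_valid(EX_RB_SLOTS, EX_DMA_BUFFERS,
			      EX_DMA_MAX_TRANS_SIZE));
}
```

With these example values, kspu_cfg_valid(16, 4, 16384) holds, while 12 slots
(not a power of 2), 4 slots (fewer than DMA_BUFFERS*2) or a size of 100 bytes
(not a multiple of 16) are rejected.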

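For illustration only (not part of the patch): section 1.1 of the document
says the ring buffer needs no locking, and the memory layout mentions an
"RB state (consumed + outstanding)". The sketch below shows one common way
such a single-producer/single-consumer ring can work. It is an assumption
that KSPU uses the two counters this way. All ex_* names are hypothetical,
and a real implementation would also need the appropriate barriers or DMA to
publish the counters to the other side.

```c
#include <assert.h>
#include <stdbool.h>

#define EX_RB_SLOTS	8u			/* hypothetical, power of 2 */
#define EX_RB_MASK	(EX_RB_SLOTS - 1u)

/* Hypothetical state: free-running counters that are never reset. */
struct ex_rb_state {
	unsigned int outstanding;	/* slots published by the producer */
	unsigned int consumed;		/* slots finished by the consumer */
};

static void ex_rb_init(struct ex_rb_state *s)
{
	/* Masking an index only works for a power-of-2 slot count. */
	assert((EX_RB_SLOTS & EX_RB_MASK) == 0);
	s->outstanding = 0;
	s->consumed = 0;
}

/* Number of slots that are published but not yet consumed. */
static unsigned int ex_rb_used(const struct ex_rb_state *s)
{
	return s->outstanding - s->consumed;
}

/* Producer side: returns the slot index to fill, or -1 if the ring is full. */
static int ex_rb_produce(struct ex_rb_state *s)
{
	if (ex_rb_used(s) == EX_RB_SLOTS)
		return -1;
	return (int)(s->outstanding++ & EX_RB_MASK);
}

/* Consumer side: returns the slot index to process, or -1 if it is empty. */
static int ex_rb_consume(struct ex_rb_state *s)
{
	if (ex_rb_used(s) == 0)
		return -1;
	return (int)(s->consumed++ & EX_RB_MASK);
}
```

Each counter is written by only one side, which is why no lock is needed in
this scheme. Because the slot count is a power of 2, the masked index also
stays consistent when the unsigned counters wrap around.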
-- 
