async_tx: get best channel

linux-raid.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* async_tx: get best channel
@ 2007-10-19 12:23 Yuri Tikhonov
  2007-10-23 18:16 ` Dan Williams
  0 siblings, 1 reply; 4+ messages in thread
From: Yuri Tikhonov @ 2007-10-19 12:23 UTC (permalink / raw)
  To: Williams, Dan J
  Cc: Dan Williams, Neil Brown, Wolfgang Denk, Detlev Zundel,
	linux-raid

 Hello Dan,

 I have a suggestion regarding the async_tx_find_channel() procedure.

 First, a little introduction. Some processors (e.g. ppc440spe) have several DMA
engines (say DMA1 and DMA2) which are capable of performing the same type of 
operation, say XOR. The DMA2 engine may process the XOR operation faster than
the DMA1 engine, but DMA2 (which is faster) has some restrictions for the source
operand addresses, whereas there are no such restrictions for DMA1 (which is slower).
So the question is, how may ASYNC_TX select the DMA engine which will be the
most effective for the given tx operation ?

 In the example just described this means: if the faster engine, DMA2, may process
the tx operation with the given source operand addresses, then we select DMA2;
if the given source operand addresses cannot be processed with DMA2, then we
select the slower engine, DMA1.

 I see the following way for introducing such functionality.

 We may introduce an additional method in struct dma_device (let's call it device_estimate())
which would take the following as the arguments:
--- the list of sources to be processed during the given tx,
--- the type of operation (XOR, COPY, ...), 
--- perhaps something else, 
 and then estimate the effectiveness of processing this tx on the given channel.
 The async_tx_find_channel() function should call the device_estimate() method for each
registered dma channel and then select the most effective one.
 The architecture specific ADMA driver will be responsible for returning the greatest
value from the device_estimate() method for the channel which will be the most effective
for this given tx.

 What are your thoughts regarding this? Do you see any other effective ways for
enhancing ASYNC_TX with such functionality?

 Regards, Yuri

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: async_tx: get best channel
  2007-10-19 12:23 async_tx: get best channel Yuri Tikhonov
@ 2007-10-23 18:16 ` Dan Williams
  2007-10-31 17:58   ` Yuri Tikhonov
  2007-12-05 16:37   ` Yuri Tikhonov
  0 siblings, 2 replies; 4+ messages in thread
From: Dan Williams @ 2007-10-23 18:16 UTC (permalink / raw)
  To: Yuri Tikhonov
  Cc: Dan Williams, Neil Brown, Wolfgang Denk, Detlev Zundel,
	linux-raid

On Fri, 2007-10-19 at 05:23 -0700, Yuri Tikhonov wrote:
> 
>  Hello Dan,

Hi Yuri, sorry it has taken me so long to get back to you...
> 
>  I have a suggestion regarding the async_tx_find_channel() procedure.
> 
>  First, a little introduction. Some processors (e.g. ppc440spe) have several DMA
> engines (say DMA1 and DMA2) which are capable of performing the same type of
> operation, say XOR. The DMA2 engine may process the XOR operation faster than
> the DMA1 engine, but DMA2 (which is faster) has some restrictions for the source
> operand addresses, whereas there are no such restrictions for DMA1 (which is slower).
> So the question is, how may ASYNC_TX select the DMA engine which will be the
> most effective for the given tx operation ?

>  In the example just described this means: if the faster engine, DMA2, may process
> the tx operation with the given source operand addresses, then we select DMA2;
> if the given source operand addresses cannot be processed with DMA2, then we
> select the slower engine, DMA1.
> 
>  I see the following way for introducing such functionality.
> 
>  We may introduce an additional method in struct dma_device (let's call it device_estimate())
> which would take the following as the arguments:
> --- the list of sources to be processed during the given tx,
> --- the type of operation (XOR, COPY, ...),
> --- perhaps something else,
>  and then estimate the effectiveness of processing this tx on the given channel.
>  The async_tx_find_channel() function should call the device_estimate() method for each
> registered dma channel and then select the most effective one.
>  The architecture specific ADMA driver will be responsible for returning the greatest
> value from the device_estimate() method for the channel which will be the most effective
> for this given tx.
> 
>  What are your thoughts regarding this? Do you see any other effective ways for
> enhancing ASYNC_TX with such functionality?

The problem with moving this test to async_tx_find_channel() is that it
imposes extra overhead in the fast path.  It would be best if we could
keep all these decisions in the slow path, or at least hide it from
architectures that do not need to implement it.  The thing that makes
this tricky is the fact that the speed is based on the source address...

One question what are the source address restrictions, is it around
high-memory?  My thought is MD usually only operates on GFP_KERNEL
memory but sometimes sees high-memory when copying data into and out of
the cache.  You might be able to achieve your use case by disabling
(hiding) the XOR capability on the channels used for copying.  This will
cause async_tx to switch the operation from the high memory capable copy
channel to the fast low memory XOR channel.

Another way to approach this would be to implement architecture specific
definitions of dma_channel_add_remove() and async_tx_rebalance().  This
will bypass the default allocation scheme and allow you to assign the
fastest channel to an operation, but it still does not allow for dynamic
selection based on source/destination address...

> 
>  Regards, Yuri
> 
Regards,
Dan

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: async_tx: get best channel
  2007-10-23 18:16 ` Dan Williams
@ 2007-10-31 17:58   ` Yuri Tikhonov
  2007-12-05 16:37   ` Yuri Tikhonov
  1 sibling, 0 replies; 4+ messages in thread
From: Yuri Tikhonov @ 2007-10-31 17:58 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dan Williams, Neil Brown, Wolfgang Denk, Detlev Zundel,
	linux-raid


 Hi Dan,

On Tuesday 23 October 2007 22:16, Dan Williams wrote:
...
> The problem with moving this test to async_tx_find_channel() is that it
> imposes extra overhead in the fast path.  It would be best if we could
> keep all these decisions in the slow path, or at least hide it from
> architectures that do not need to implement it.  The thing that makes
> this tricky is the fact that the speed is based on the source address...

 I agree with you that extra checking will impose extra overhead, but in my 
case this overhead is expected to be less than the improvement achieved due 
to using the more effective channel.

 In the worst case, the architectures which do not need to implement the 
device_estimate() method will have the overhead because of the following:
- passing two additional parameters to function async_tx_find_channel() (these 
are the source list and the number of sources),
- and checking one condition for (depend_tx->chan->device->device_estimate != 
0).

 I guess this is not such a big overhead. Right ?

> One question what are the source address restrictions, is it around
> high-memory?  

 No, it isn't. The condition which has to be met to run the most effective DMA 
is that the source addresses might be arranged in the following way:

 src0 = addr0,
 src1 = addr0 + 1*BLOCK_SIZE,
 src2 = addr0 + 2*BLOCK_SIZE,
 src3 = addr0 + 3*BLOCK_SIZE,
 src4 = addr0 + 4*BLOCK_SIZE,
 ...
 srcN = addr0 + N*BLOCK_SIZE.

> My thought is MD usually only operates on GFP_KERNEL 
> memory but sometimes sees high-memory when copying data into and out of
> the cache.  You might be able to achieve your use case by disabling
> (hiding) the XOR capability on the channels used for copying.  This will
> cause async_tx to switch the operation from the high memory capable copy
> channel to the fast low memory XOR channel.
> 
> Another way to approach this would be to implement architecture specific
> definitions of dma_channel_add_remove() and async_tx_rebalance().  This
> will bypass the default allocation scheme and allow you to assign the
> fastest channel to an operation, but it still does not allow for dynamic
> selection based on source/destination address...

 Understood. Thanks. Unfortunately, this is not the case, because the channels 
which may do the fast XOR operations support asynchronous COPY in my ADMA 
driver, so I guess even very very fast XOR will not help too much if I'll 
have no asynchronous COPY : )

 Regards, Yuri

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: async_tx: get best channel
  2007-10-23 18:16 ` Dan Williams
  2007-10-31 17:58   ` Yuri Tikhonov
@ 2007-12-05 16:37   ` Yuri Tikhonov
  1 sibling, 0 replies; 4+ messages in thread
From: Yuri Tikhonov @ 2007-12-05 16:37 UTC (permalink / raw)
  To: Dan Williams
  Cc: Dan Williams, Neil Brown, Wolfgang Denk, Detlev Zundel,
	linux-raid


 Hello,

On Tuesday 23 October 2007 22:16, Dan Williams wrote:
> The problem with moving this test to async_tx_find_channel() is that it
> imposes extra overhead in the fast path.  It would be best if we could
> keep all these decisions in the slow path, or at least hide it from
> architectures that do not need to implement it.  The thing that makes
> this tricky is the fact that the speed is based on the source address...

 The following patch is an attempt to support the getting best channel
feature in ASYNC_TX. The architectures which do not need to implement
this keep the chan->device->device_estimate method set to NULL, and thus
avoid async_tx_find_best_channel() invocation.

 As far as the the architectures which may benefit from this feature are
concerned, it is given at the discretion of the architecture's ADMA driver
developer to assign the relative priority of each channel for a given
operation: the async_tx_find_best_channel() function looks through the list
of channels which support the 'cap' type of operation, and selects the one
which returns the biggest value from device_estimate() method. The parameters
passed to device_estimate() are as follows:
- type of operation;
- list of source operands (addresses);
- number of operands;
- size of data processed by the operation.

 Signed-off-by: Yuri Tikhonov <yur@emcraft.com>
 Signed-off-by: Mikhail Cherkashin <mike@emcraft.com>
--

diff --git a/crypto/async_tx/async_memcpy.c b/crypto/async_tx/async_memcpy.c
index a973f4e..2146857 100644
--- a/crypto/async_tx/async_memcpy.c
+++ b/crypto/async_tx/async_memcpy.c
@@ -47,7 +47,8 @@ async_memcpy(struct page *dest, struct page *src, unsigned int dest_offset,
 	struct dma_async_tx_descriptor *depend_tx,
 	dma_async_tx_callback cb_fn, void *cb_param)
 {
-	struct dma_chan *chan = async_tx_find_channel(depend_tx, DMA_MEMCPY);
+	struct dma_chan *chan = async_tx_find_channel(depend_tx, DMA_MEMCPY,
+		&src, 1, len);
 	struct dma_device *device = chan ? chan->device : NULL;
 	int int_en = cb_fn ? 1 : 0;
 	struct dma_async_tx_descriptor *tx = device ?
diff --git a/crypto/async_tx/async_memset.c b/crypto/async_tx/async_memset.c
index 66ef635..b427913 100644
--- a/crypto/async_tx/async_memset.c
+++ b/crypto/async_tx/async_memset.c
@@ -46,7 +46,8 @@ async_memset(struct page *dest, int val, unsigned int offset,
 	struct dma_async_tx_descriptor *depend_tx,
 	dma_async_tx_callback cb_fn, void *cb_param)
 {
-	struct dma_chan *chan = async_tx_find_channel(depend_tx, DMA_MEMSET);
+	struct dma_chan *chan = async_tx_find_channel(depend_tx, DMA_MEMSET,
+		NULL, 0, len);
 	struct dma_device *device = chan ? chan->device : NULL;
 	int int_en = cb_fn ? 1 : 0;
 	struct dma_async_tx_descriptor *tx = device ?
diff --git a/crypto/async_tx/async_tx.c b/crypto/async_tx/async_tx.c
index 6d77ae6..e889f76 100644
--- a/crypto/async_tx/async_tx.c
+++ b/crypto/async_tx/async_tx.c
@@ -363,26 +363,75 @@ static void __exit async_tx_exit(void)
 }
 
 /**
+ * async_tx_find_best_channel - find a channel with the maximum rank for the
+ * 	transaction type given (the rank of the operation is the value
+ *	returned by the device_estimate method).
+ * @cap: transaction type
+ * @src_lst: array of pointers to sources for the transaction
+ * @src_cnt: number of arguments (size of the srcs array)
+ * @src_sz: length of the each argument pointed by srcs
+ */
+static struct dma_chan *
+async_tx_find_best_channel (enum dma_transaction_type cap,
+	struct page **src_lst, int src_cnt, size_t src_sz)
+{
+	struct dma_chan *best_chan = NULL;
+	struct dma_chan_ref *ref;
+	int best_rank = -1;
+
+	rcu_read_lock();
+	list_for_each_entry_rcu(ref, &async_tx_master_list, node)
+		if (dma_has_cap(cap, ref->chan->device->cap_mask)) {
+			int rank;
+
+			BUG_ON (!ref->chan->device->device_estimate);
+			rank = ref->chan->device->device_estimate (ref->chan,
+				cap, src_lst, src_cnt, src_sz);
+			if (rank > best_rank) {
+				best_rank = rank;
+				best_chan = ref->chan;
+			}
+		}
+	rcu_read_unlock();
+
+	BUG_ON(best_rank == -1);
+	return best_chan;
+}
+
+/**
  * async_tx_find_channel - find a channel to carry out the operation or let
  *	the transaction execute synchronously
  * @depend_tx: transaction dependency
  * @tx_type: transaction type
+ * @src_lst: array of pointers to sources for the transaction
+ * @src_cnt: number of arguments (size of the srcs array)
+ * @src_sz: length of the each argument pointed by srcs
  */
 struct dma_chan *
 async_tx_find_channel(struct dma_async_tx_descriptor *depend_tx,
-	enum dma_transaction_type tx_type)
+	enum dma_transaction_type tx_type, struct page **src_lst,
+	int src_cnt, size_t src_sz)
 {
+	struct dma_chan_ref *ref = NULL;
+	int cpu;
+
+	cpu = get_cpu();
+	if (likely(channel_table_initialized))
+		ref = per_cpu_ptr(channel_table[tx_type], cpu)->ref;
+	put_cpu();
+
+	/* see if adma devices can estimate efficiency of ops processing */
+	if (!depend_tx && ref && ref->chan->device->device_estimate)
+		return async_tx_find_best_channel (tx_type,
+			src_lst, src_cnt, src_sz);
+
 	/* see if we can keep the chain on one channel */
 	if (depend_tx &&
 		dma_has_cap(tx_type, depend_tx->chan->device->cap_mask))
 		return depend_tx->chan;
-	else if (likely(channel_table_initialized)) {
-		struct dma_chan_ref *ref;
-		int cpu = get_cpu();
-		ref = per_cpu_ptr(channel_table[tx_type], cpu)->ref;
-		put_cpu();
-		return ref ? ref->chan : NULL;
-	} else
+	else if (likely(ref))
+		return ref->chan;
+	else
 		return NULL;
 }
 EXPORT_SYMBOL_GPL(async_tx_find_channel);
diff --git a/crypto/async_tx/async_xor.c b/crypto/async_tx/async_xor.c
index 2575f67..2cdcd21 100644
--- a/crypto/async_tx/async_xor.c
+++ b/crypto/async_tx/async_xor.c
@@ -113,7 +113,8 @@ async_xor(struct page *dest, struct page **src_list, unsigned int offset,
 	struct dma_async_tx_descriptor *depend_tx,
 	dma_async_tx_callback cb_fn, void *cb_param)
 {
-	struct dma_chan *chan = async_tx_find_channel(depend_tx, DMA_XOR);
+	struct dma_chan *chan = async_tx_find_channel(depend_tx, DMA_XOR,
+		src_list, src_cnt, len);
 	struct dma_device *device = chan ? chan->device : NULL;
 	struct dma_async_tx_descriptor *tx = NULL;
 	dma_async_tx_callback _cb_fn;
@@ -254,7 +255,8 @@ async_xor_zero_sum(struct page *dest, struct page **src_list,
 	struct dma_async_tx_descriptor *depend_tx,
 	dma_async_tx_callback cb_fn, void *cb_param)
 {
-	struct dma_chan *chan = async_tx_find_channel(depend_tx, DMA_ZERO_SUM);
+	struct dma_chan *chan = async_tx_find_channel(depend_tx, DMA_ZERO_SUM,
+		src_list, src_cnt, len);
 	struct dma_device *device = chan ? chan->device : NULL;
 	int int_en = cb_fn ? 1 : 0;
 	struct dma_async_tx_descriptor *tx = device ?
diff --git a/include/linux/async_tx.h b/include/linux/async_tx.h
index 4be0e25..7ada55b 100644
--- a/include/linux/async_tx.h
+++ b/include/linux/async_tx.h
@@ -76,7 +76,8 @@ enum dma_status dma_wait_for_async_tx(struct dma_async_tx_descriptor *tx);
 void async_tx_run_dependencies(struct dma_async_tx_descriptor *tx);
 struct dma_chan *
 async_tx_find_channel(struct dma_async_tx_descriptor *depend_tx,
-	enum dma_transaction_type tx_type);
+	enum dma_transaction_type tx_type,struct page **src_lst,
+	int src_cnt, size_t src_sz);
 #else
 static inline void async_tx_issue_pending_all(void)
 {
@@ -97,7 +98,8 @@ async_tx_run_dependencies(struct dma_async_tx_descriptor *tx)
 
 static inline struct dma_chan *
 async_tx_find_channel(struct dma_async_tx_descriptor *depend_tx,
-	enum dma_transaction_type tx_type)
+	enum dma_transaction_type tx_type, struct page **src_lst,
+	int src_cnt, size_t src_sz)
 {
 	return NULL;
 }
diff --git a/include/linux/dmaengine.h b/include/linux/dmaengine.h
index 50d1ae3..3d0f8cf 100644
--- a/include/linux/dmaengine.h
+++ b/include/linux/dmaengine.h
@@ -264,6 +264,8 @@ struct dma_async_tx_descriptor {
  * @device_prep_dma_interrupt: prepares an end of chain interrupt operation
  * @device_dependency_added: async_tx notifies the channel about new deps
  * @device_issue_pending: push pending transactions to hardware
+ * @device_estimate: estimate the efficiency of processing the transaction
+ *	by this device
  */
 struct dma_device {
 
@@ -308,6 +310,9 @@ struct dma_device {
 			dma_cookie_t cookie, dma_cookie_t *last,
 			dma_cookie_t *used);
 	void (*device_issue_pending)(struct dma_chan *chan);
+	int (*device_estimate) (struct dma_chan *chan,
+			enum dma_transaction_type cap, struct page **src_lst,
+			int src_cnt, size_t src_sz);
 };
 
 /* --- public DMA engine API --- */

-- 
Yuri Tikhonov, Senior Software Engineer
Emcraft Systems, www.emcraft.com

^ permalink raw reply related	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2007-12-05 16:37 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-10-19 12:23 async_tx: get best channel Yuri Tikhonov
2007-10-23 18:16 ` Dan Williams
2007-10-31 17:58   ` Yuri Tikhonov
2007-12-05 16:37   ` Yuri Tikhonov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).