* [PATCH v1 6/8] dmaengine: enhance network subsystem to support DMA device hotplug
[not found] <1335189109-4871-1-git-send-email-jiang.liu@huawei.com>
@ 2012-04-23 13:51 ` Jiang Liu
2012-04-23 18:30 ` Dan Williams
0 siblings, 1 reply; 6+ messages in thread
From: Jiang Liu @ 2012-04-23 13:51 UTC (permalink / raw)
To: Vinod Koul, Dan Williams
Cc: Jiang Liu, Keping Chen, David S. Miller, Alexey Kuznetsov,
James Morris, Hideaki YOSHIFUJI, Patrick McHardy, netdev,
linux-pci, linux-kernel, Jiang Liu
Enhance the network subsystem to correctly update DMA channel reference counts,
so that it won't break the DMA device hotplug logic.
Signed-off-by: Jiang Liu <liuj97@gmail.com>
---
include/net/netdma.h | 26 ++++++++++++++++++++++++++
net/ipv4/tcp.c | 10 +++-------
net/ipv4/tcp_input.c | 5 +----
net/ipv4/tcp_ipv4.c | 4 +---
net/ipv6/tcp_ipv6.c | 4 +---
5 files changed, 32 insertions(+), 17 deletions(-)
diff --git a/include/net/netdma.h b/include/net/netdma.h
index 8ba8ce2..6d71724 100644
--- a/include/net/netdma.h
+++ b/include/net/netdma.h
@@ -24,6 +24,32 @@
 #include <linux/dmaengine.h>
 #include <linux/skbuff.h>
 
+static inline bool
+net_dma_capable(void)
+{
+	struct dma_chan *chan = net_dma_find_channel();
+	dma_put_channel(chan);
+
+	return !!chan;
+}
+
+static inline struct dma_chan *
+net_dma_get_channel(struct tcp_sock *tp)
+{
+	if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
+		tp->ucopy.dma_chan = net_dma_find_channel();
+	return tp->ucopy.dma_chan;
+}
+
+static inline void
+net_dma_put_channel(struct tcp_sock *tp)
+{
+	if (tp->ucopy.dma_chan) {
+		dma_put_channel(tp->ucopy.dma_chan);
+		tp->ucopy.dma_chan = NULL;
+	}
+}
+
 int dma_skb_copy_datagram_iovec(struct dma_chan* chan,
 		struct sk_buff *skb, int offset, struct iovec *to,
 		size_t len, struct dma_pinned_list *pinned_list);
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 8bb6ade..aea4032 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -1451,8 +1451,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		available = TCP_SKB_CB(skb)->seq + skb->len - (*seq);
 		if ((available < target) &&
 		    (len > sysctl_tcp_dma_copybreak) && !(flags & MSG_PEEK) &&
-		    !sysctl_tcp_low_latency &&
-		    net_dma_find_channel()) {
+		    !sysctl_tcp_low_latency && net_dma_capable()) {
 			preempt_enable_no_resched();
 			tp->ucopy.pinned_list =
 				dma_pin_iovec_pages(msg->msg_iov, len);
@@ -1666,10 +1665,7 @@ do_prequeue:
 
 		if (!(flags & MSG_TRUNC)) {
 #ifdef CONFIG_NET_DMA
-			if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
-				tp->ucopy.dma_chan = net_dma_find_channel();
-
-			if (tp->ucopy.dma_chan) {
+			if (net_dma_get_channel(tp)) {
 				tp->ucopy.dma_cookie = dma_skb_copy_datagram_iovec(
 					tp->ucopy.dma_chan, skb, offset,
 					msg->msg_iov, used,
@@ -1758,7 +1754,7 @@ skip_copy:
 
 #ifdef CONFIG_NET_DMA
 	tcp_service_net_dma(sk, true);	/* Wait for queue to drain */
-	tp->ucopy.dma_chan = NULL;
+	net_dma_put_channel(tp);
 
 	if (tp->ucopy.pinned_list) {
 		dma_unpin_iovec_pages(tp->ucopy.pinned_list);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9944c1d..3878916 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5227,10 +5227,7 @@ static int tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb,
 	if (tp->ucopy.wakeup)
 		return 0;
 
-	if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
-		tp->ucopy.dma_chan = net_dma_find_channel();
-
-	if (tp->ucopy.dma_chan && skb_csum_unnecessary(skb)) {
+	if (net_dma_get_channel(tp) && skb_csum_unnecessary(skb)) {
 		dma_cookie = dma_skb_copy_datagram_iovec(tp->ucopy.dma_chan,
 							 skb, hlen,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 0cb86ce..90ea1c0 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1729,9 +1729,7 @@ process:
 	if (!sock_owned_by_user(sk)) {
 #ifdef CONFIG_NET_DMA
 		struct tcp_sock *tp = tcp_sk(sk);
-		if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
-			tp->ucopy.dma_chan = net_dma_find_channel();
-		if (tp->ucopy.dma_chan)
+		if (net_dma_get_channel(tp))
 			ret = tcp_v4_do_rcv(sk, skb);
 		else
 #endif
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 86cfe60..fb81bbd 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1644,9 +1644,7 @@ process:
 	if (!sock_owned_by_user(sk)) {
 #ifdef CONFIG_NET_DMA
 		struct tcp_sock *tp = tcp_sk(sk);
-		if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list)
-			tp->ucopy.dma_chan = net_dma_find_channel();
-		if (tp->ucopy.dma_chan)
+		if (net_dma_get_channel(tp))
 			ret = tcp_v6_do_rcv(sk, skb);
 		else
 #endif
--
1.7.5.4
* Re: [PATCH v1 6/8] dmaengine: enhance network subsystem to support DMA device hotplug
2012-04-23 13:51 ` [PATCH v1 6/8] dmaengine: enhance network subsystem to support DMA device hotplug Jiang Liu
@ 2012-04-23 18:30 ` Dan Williams
2012-04-24 2:30 ` Jiang Liu
0 siblings, 1 reply; 6+ messages in thread
From: Dan Williams @ 2012-04-23 18:30 UTC (permalink / raw)
To: Jiang Liu
Cc: Vinod Koul, Jiang Liu, Keping Chen, David S. Miller,
Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
Patrick McHardy, netdev, linux-pci, linux-kernel
On Mon, Apr 23, 2012 at 6:51 AM, Jiang Liu <liuj97@gmail.com> wrote:
> Enhance network subsystem to correctly update DMA channel reference counts,
> so it won't break DMA device hotplug logic.
>
> Signed-off-by: Jiang Liu <liuj97@gmail.com>
This introduces an atomic action on every channel touch, which is more
expensive than what we had previously. There has always been a
concern about the overhead of offload, which sometimes makes it
ineffective or a net loss compared to cpu copies. In the cases where
net_dma shows an improvement, this will eat into, and maybe eliminate,
that advantage.
Take a look at where dmaengine started [1]. It was from the beginning
going through contortions to avoid something like this. We made it
simpler here [2], but still kept the principle of not dirtying a
shared cacheline on every channel touch, and certainly not locking it.
If you are going to hotplug the entire IOH, then you are probably ok
with network links going down, so could you just down the links and
remove the driver with the existing code?
--
Dan
[1]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=c13c826
[2]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=6f49a57a
* Re: [PATCH v1 6/8] dmaengine: enhance network subsystem to support DMA device hotplug
2012-04-23 18:30 ` Dan Williams
@ 2012-04-24 2:30 ` Jiang Liu
2012-04-24 3:09 ` Dan Williams
0 siblings, 1 reply; 6+ messages in thread
From: Jiang Liu @ 2012-04-24 2:30 UTC (permalink / raw)
To: Dan Williams
Cc: Jiang Liu, Vinod Koul, Keping Chen, David S. Miller,
Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
Patrick McHardy, netdev, linux-pci, linux-kernel
Hi Dan,
Thanks for your great comments!
gerry
On 2012-4-24 2:30, Dan Williams wrote:
> On Mon, Apr 23, 2012 at 6:51 AM, Jiang Liu<liuj97@gmail.com> wrote:
>> Enhance network subsystem to correctly update DMA channel reference counts,
>> so it won't break DMA device hotplug logic.
>>
>> Signed-off-by: Jiang Liu<liuj97@gmail.com>
>
> This introduces an atomic action on every channel touch, which is more
> expensive than what we had previously. There has always been a
> concern about the overhead of offload that sometimes makes ineffective
> or a loss compared to cpu copies. In the cases where net_dma shows
> improvement this will eat into / maybe eliminate that advantage.
Good point, we should avoid polluting a shared cacheline here; otherwise
it may eat into the benefits of IOAT acceleration.
>
> Take a look at where dmaengine started [1]. It was from the beginning
> going through contortions to avoid something like this. We made it
> simpler here [2], but still kept the principle of not dirtying a
> shared cacheline on every channel touch, and certainly not locking it.
Thanks for the great background information, especially the second one;
its check-in log message is quoted below.
>Why?, beyond reducing complication:
>1/ Tracking reference counts per-transaction in an efficient manner, as
> is currently done, requires a complicated scheme to avoid cache-line
> bouncing effects.
The real issue here is polluting shared cachelines, right?
Would it help to use a percpu counter instead of atomic operations here?
I will try using a percpu counter for the reference count.
BTW, do you have any DMAEngine benchmarks we could use to compare the
performance difference?
>2/ Per-transaction ref-counting gives the false impression that a
> dma-driver can be gracefully removed ahead of its user (net, md, or
> dma-slave)
>3/ None of the in-tree dma-drivers talk to hot pluggable hardware, but
Seems the situation has changed now :)
The Intel 7500 (Boxboro) chipset supports hotplug, and we are working on
a system which adopts the Boxboro chipset and supports node hotplug,
so we are trying to enhance the DMAEngine to support IOAT hotplug.
On the other hand, Intel's next-generation Ivy Bridge processors have an
embedded IOH, so we need to support IOH/IOAT hotplug when supporting
processor hotplug.
> if such an engine were built one day we still would not need to notify
> clients of remove events. The driver can simply return NULL to a
> ->prep() request, something that is much easier for a client to
> handle.
Could you please give more explanation about "The driver can simply
return NULL to a ->prep() request"? I haven't quite gotten the idea yet.
>
> If you are going to hotplug the entire IOH, then you are probably ok
> with network links going down, so could you just down the links and
> remove the driver with the existing code?
I feel it's a little risky to shut down/restart all network interfaces
for hot-removal of an IOH; that may disturb the applications. And there
are also other kinds of clients, such as ASYNC_TX; it seems we can't
adopt this method to reclaim DMA channels from the ASYNC_TX subsystem.
>
> --
> Dan
>
> [1]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=c13c826
> [2]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=6f49a57a
>
* Re: [PATCH v1 6/8] dmaengine: enhance network subsystem to support DMA device hotplug
2012-04-24 2:30 ` Jiang Liu
@ 2012-04-24 3:09 ` Dan Williams
2012-04-24 3:56 ` Jiang Liu
2012-04-25 15:47 ` Jiang Liu
0 siblings, 2 replies; 6+ messages in thread
From: Dan Williams @ 2012-04-24 3:09 UTC (permalink / raw)
To: Jiang Liu
Cc: Jiang Liu, Vinod Koul, Keping Chen, David S. Miller,
Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
Patrick McHardy, netdev, linux-pci, linux-kernel
On Mon, Apr 23, 2012 at 7:30 PM, Jiang Liu <jiang.liu@huawei.com> wrote:
>> If you are going to hotplug the entire IOH, then you are probably ok
>> with network links going down, so could you just down the links and
>> remove the driver with the existing code?
>
> I feel it's a little risky to shut down/restart all network interfaces
> for hot-removal of IOH, that may disturb the applications.
I guess I'm confused... wouldn't the removal of an entire domain of
pci devices disturb userspace applications?
> And there
> are also other kinds of clients, such as ASYNC_TX, seems we can't
> adopt this method to reclaim DMA channels from ASYNC_TX subsystem.
I say handle this like block device hotplug. I.e. the driver stays
loaded but the channel is put into an 'offline' state. So the driver
hides the fact that the hardware went away. Similar to how you can
remove a disk but /dev/sda sticks around until the last reference is
gone (and the driver 'sd' sticks around until all block devices are
gone).
I expect the work will be in making sure existing clients are prepared
to handle NULL returns from ->device_prep_dma_*. In some cases the
channel is treated more like a cpu, so a NULL return from
->device_prep_dma_memcpy() has been interpreted as "device is
temporarily busy, it is safe to try again". We would need to change
that to a permanent indication that the device is gone and not attempt
retry.
--
Dan
* Re: [PATCH v1 6/8] dmaengine: enhance network subsystem to support DMA device hotplug
2012-04-24 3:09 ` Dan Williams
@ 2012-04-24 3:56 ` Jiang Liu
2012-04-25 15:47 ` Jiang Liu
1 sibling, 0 replies; 6+ messages in thread
From: Jiang Liu @ 2012-04-24 3:56 UTC (permalink / raw)
To: Dan Williams
Cc: Jiang Liu, Vinod Koul, Keping Chen, David S. Miller,
Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
Patrick McHardy, netdev, linux-pci, linux-kernel
On 2012-4-24 11:09, Dan Williams wrote:
> On Mon, Apr 23, 2012 at 7:30 PM, Jiang Liu<jiang.liu@huawei.com> wrote:
>>> If you are going to hotplug the entire IOH, then you are probably ok
>>> with network links going down, so could you just down the links and
>>> remove the driver with the existing code?
>>
>> I feel it's a little risky to shut down/restart all network interfaces
>> for hot-removal of IOH, that may disturb the applications.
>
> I guess I'm confused... wouldn't the removal of an entire domain of
> pci devices disturb userspace applications?
Here I mean that removing an IOH shouldn't affect devices under other
IOHs if possible.
With the current dmaengine implementation, a DMA device/channel may be
used by clients in other PCI domains. So to safely remove a DMA device,
we need to bring dmaengine_ref_count to zero by stopping all DMA clients.
For network, that means we need to stop all network interfaces, which
seems a little heavy :)
>
>> And there
>> are also other kinds of clients, such as ASYNC_TX, seems we can't
>> adopt this method to reclaim DMA channels from ASYNC_TX subsystem.
>
> I say handle this like block device hotplug. I.e. the driver stays
> loaded but the channel is put into an 'offline' state. So the driver
> hides the fact that the hardware went away. Similar to how you can
> remove a disk but /dev/sda sticks around until the last reference is
> gone (and the driver 'sd' sticks around until all block devices are
> gone).
Per my understanding, this mechanism could be used to stop the driver
from accessing surprise-removed devices, but it still needs a reference
count mechanism to finish the driver unbinding operation eventually.
For IOH hotplug, we need to wait for the completion of driver unbinding
operations before destroying the PCI device nodes of the IOAT, so we
still need a reference count to track channel usage.
Another way is to notify all clients to release all channels when an
IOAT device hotplug event happens, but that may need heavy modifications
to the DMA clients.
>
> I expect the work will be in making sure existing clients are prepared
> to handle NULL returns from ->device_prep_dma_*. In some cases the
> channel is treated more like a cpu, so a NULL return from
> ->device_prep_dma_memcpy() has been interpreted as "device is
> temporarily busy, it is safe to try again". We would need to change
> that to a permanent indication that the device is gone and not attempt
> retry.
Yes, some ASYNC_TX clients interpret a NULL return as EBUSY and keep
retrying when doing context-aware computations. I will try to
investigate in this direction.
>
> --
> Dan
>
* Re: [PATCH v1 6/8] dmaengine: enhance network subsystem to support DMA device hotplug
2012-04-24 3:09 ` Dan Williams
2012-04-24 3:56 ` Jiang Liu
@ 2012-04-25 15:47 ` Jiang Liu
1 sibling, 0 replies; 6+ messages in thread
From: Jiang Liu @ 2012-04-25 15:47 UTC (permalink / raw)
To: Dan Williams
Cc: Jiang Liu, Vinod Koul, Keping Chen, David S. Miller,
Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
Patrick McHardy, netdev, linux-pci, linux-kernel
Hi Dan,
Thanks for your great comments about the performance penalty issue. I'm trying
to refine the implementation to reduce the penalty caused by the hotplug logic.
If the algorithm works correctly, the optimized hot-path code will be:
------------------------------------------------------------------------------
struct dma_chan *dma_find_channel(enum dma_transaction_type tx_type)
{
	struct dma_chan *chan = this_cpu_read(channel_table[tx_type]->chan);

	this_cpu_inc(dmaengine_chan_ref_count);
	if (static_key_false(&dmaengine_quiesce))
		chan = NULL;

	return chan;
}
EXPORT_SYMBOL(dma_find_channel);

struct dma_chan *dma_get_channel(struct dma_chan *chan)
{
	if (static_key_false(&dmaengine_quiesce))
		atomic_inc(&dmaengine_dirty);
	this_cpu_inc(dmaengine_chan_ref_count);
	return chan;
}
EXPORT_SYMBOL(dma_get_channel);

void dma_put_channel(struct dma_chan *chan)
{
	this_cpu_dec(dmaengine_chan_ref_count);
}
EXPORT_SYMBOL(dma_put_channel);
------------------------------------------------------------------------------
The disassembled code is:
(gdb) disassemble dma_find_channel
Dump of assembler code for function dma_find_channel:
0x0000000000000000 <+0>: push %rbp
0x0000000000000001 <+1>: mov %rsp,%rbp
0x0000000000000004 <+4>: callq 0x9 <dma_find_channel+9>
0x0000000000000009 <+9>: mov %edi,%edi
0x000000000000000b <+11>: mov 0x0(,%rdi,8),%rax
0x0000000000000013 <+19>: mov %gs:(%rax),%rax
0x0000000000000017 <+23>: incq %gs:0x0 //overhead: this_cpu_inc(dmaengine_chan_ref_count)
0x0000000000000020 <+32>: jmpq 0x25 <dma_find_channel+37> //overhead: if (static_key_false(&dmaengine_quiesce)), will be replaced as NOP by jump label
0x0000000000000025 <+37>: pop %rbp
0x0000000000000026 <+38>: retq
0x0000000000000027 <+39>: nopw 0x0(%rax,%rax,1)
0x0000000000000030 <+48>: xor %eax,%eax
0x0000000000000032 <+50>: pop %rbp
0x0000000000000033 <+51>: retq
End of assembler dump.
(gdb) disassemble dma_put_channel // overhead: to decrease channel reference count, 6 instructions
Dump of assembler code for function dma_put_channel:
0x0000000000000070 <+0>: push %rbp
0x0000000000000071 <+1>: mov %rsp,%rbp
0x0000000000000074 <+4>: callq 0x79 <dma_put_channel+9>
0x0000000000000079 <+9>: decq %gs:0x0
0x0000000000000082 <+18>: pop %rbp
0x0000000000000083 <+19>: retq
End of assembler dump.
(gdb) disassemble dma_get_channel
Dump of assembler code for function dma_get_channel:
0x0000000000000040 <+0>: push %rbp
0x0000000000000041 <+1>: mov %rsp,%rbp
0x0000000000000044 <+4>: callq 0x49 <dma_get_channel+9>
0x0000000000000049 <+9>: mov %rdi,%rax
0x000000000000004c <+12>: jmpq 0x51 <dma_get_channel+17>
0x0000000000000051 <+17>: incq %gs:0x0
0x000000000000005a <+26>: pop %rbp
0x000000000000005b <+27>: retq
0x000000000000005c <+28>: nopl 0x0(%rax)
0x0000000000000060 <+32>: lock incl 0x0(%rip) # 0x67 <dma_get_channel+39>
0x0000000000000067 <+39>: jmp 0x51 <dma_get_channel+17>
End of assembler dump.
So for a typical dma_find_channel()/dma_put_channel() pair, the total
overhead is about 10 instructions and two percpu (local) memory updates,
and there's no shared-cacheline pollution any more. Is this acceptable
if the algorithm works as expected? I will test the code tomorrow.
For typical systems which don't support DMA device hotplug, the overhead
could be completely removed by conditional compilation.
Any comments are welcome!
Thanks!
--gerry
On 04/24/2012 11:09 AM, Dan Williams wrote:
>>> If you are going to hotplug the entire IOH, then you are probably ok