* [PATCH v1 6/8] dmaengine: enhance network subsystem to support DMA device hotplug [not found] <1335189109-4871-1-git-send-email-jiang.liu@huawei.com> @ 2012-04-23 13:51 ` Jiang Liu 2012-04-23 18:30 ` Dan Williams 0 siblings, 1 reply; 6+ messages in thread From: Jiang Liu @ 2012-04-23 13:51 UTC (permalink / raw) To: Vinod Koul, Dan Williams Cc: Jiang Liu, Keping Chen, David S. Miller, Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI, Patrick McHardy, netdev, linux-pci, linux-kernel, Jiang Liu Enhance network subsystem to correctly update DMA channel reference counts, so it won't break DMA device hotplug logic. Signed-off-by: Jiang Liu <liuj97@gmail.com> --- include/net/netdma.h | 26 ++++++++++++++++++++++++++ net/ipv4/tcp.c | 10 +++------- net/ipv4/tcp_input.c | 5 +---- net/ipv4/tcp_ipv4.c | 4 +--- net/ipv6/tcp_ipv6.c | 4 +--- 5 files changed, 32 insertions(+), 17 deletions(-) diff --git a/include/net/netdma.h b/include/net/netdma.h index 8ba8ce2..6d71724 100644 --- a/include/net/netdma.h +++ b/include/net/netdma.h @@ -24,6 +24,32 @@ #include <linux/dmaengine.h> #include <linux/skbuff.h> +static inline bool +net_dma_capable(void) +{ + struct dma_chan *chan = net_dma_find_channel(); + dma_put_channel(chan); + + return !!chan; +} + +static inline struct dma_chan * +net_dma_get_channel(struct tcp_sock *tp) +{ + if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list) + tp->ucopy.dma_chan = net_dma_find_channel(); + return tp->ucopy.dma_chan; +} + +static inline void +net_dma_put_channel(struct tcp_sock *tp) +{ + if (tp->ucopy.dma_chan) { + dma_put_channel(tp->ucopy.dma_chan); + tp->ucopy.dma_chan = NULL; + } +} + int dma_skb_copy_datagram_iovec(struct dma_chan* chan, struct sk_buff *skb, int offset, struct iovec *to, size_t len, struct dma_pinned_list *pinned_list); diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c index 8bb6ade..aea4032 100644 --- a/net/ipv4/tcp.c +++ b/net/ipv4/tcp.c @@ -1451,8 +1451,7 @@ int tcp_recvmsg(struct kiocb *iocb, struct sock *sk, struct msghdr 
*msg, available = TCP_SKB_CB(skb)->seq + skb->len - (*seq); if ((available < target) && (len > sysctl_tcp_dma_copybreak) && !(flags & MSG_PEEK) && - !sysctl_tcp_low_latency && - net_dma_find_channel()) { + !sysctl_tcp_low_latency && net_dma_capable()) { preempt_enable_no_resched(); tp->ucopy.pinned_list = dma_pin_iovec_pages(msg->msg_iov, len); @@ -1666,10 +1665,7 @@ do_prequeue: if (!(flags & MSG_TRUNC)) { #ifdef CONFIG_NET_DMA - if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list) - tp->ucopy.dma_chan = net_dma_find_channel(); - - if (tp->ucopy.dma_chan) { + if (net_dma_get_channel(tp)) { tp->ucopy.dma_cookie = dma_skb_copy_datagram_iovec( tp->ucopy.dma_chan, skb, offset, msg->msg_iov, used, @@ -1758,7 +1754,7 @@ skip_copy: #ifdef CONFIG_NET_DMA tcp_service_net_dma(sk, true); /* Wait for queue to drain */ - tp->ucopy.dma_chan = NULL; + net_dma_put_channel(tp); if (tp->ucopy.pinned_list) { dma_unpin_iovec_pages(tp->ucopy.pinned_list); diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c index 9944c1d..3878916 100644 --- a/net/ipv4/tcp_input.c +++ b/net/ipv4/tcp_input.c @@ -5227,10 +5227,7 @@ static int tcp_dma_try_early_copy(struct sock *sk, struct sk_buff *skb, if (tp->ucopy.wakeup) return 0; - if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list) - tp->ucopy.dma_chan = net_dma_find_channel(); - - if (tp->ucopy.dma_chan && skb_csum_unnecessary(skb)) { + if (net_dma_get_channel(tp) && skb_csum_unnecessary(skb)) { dma_cookie = dma_skb_copy_datagram_iovec(tp->ucopy.dma_chan, skb, hlen, diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c index 0cb86ce..90ea1c0 100644 --- a/net/ipv4/tcp_ipv4.c +++ b/net/ipv4/tcp_ipv4.c @@ -1729,9 +1729,7 @@ process: if (!sock_owned_by_user(sk)) { #ifdef CONFIG_NET_DMA struct tcp_sock *tp = tcp_sk(sk); - if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list) - tp->ucopy.dma_chan = net_dma_find_channel(); - if (tp->ucopy.dma_chan) + if (net_dma_get_channel(tp)) ret = tcp_v4_do_rcv(sk, skb); else #endif diff --git a/net/ipv6/tcp_ipv6.c 
b/net/ipv6/tcp_ipv6.c index 86cfe60..fb81bbd 100644 --- a/net/ipv6/tcp_ipv6.c +++ b/net/ipv6/tcp_ipv6.c @@ -1644,9 +1644,7 @@ process: if (!sock_owned_by_user(sk)) { #ifdef CONFIG_NET_DMA struct tcp_sock *tp = tcp_sk(sk); - if (!tp->ucopy.dma_chan && tp->ucopy.pinned_list) - tp->ucopy.dma_chan = net_dma_find_channel(); - if (tp->ucopy.dma_chan) + if (net_dma_get_channel(tp)) ret = tcp_v6_do_rcv(sk, skb); else #endif -- 1.7.5.4 ^ permalink raw reply related [flat|nested] 6+ messages in thread
* Re: [PATCH v1 6/8] dmaengine: enhance network subsystem to support DMA device hotplug 2012-04-23 13:51 ` [PATCH v1 6/8] dmaengine: enhance network subsystem to support DMA device hotplug Jiang Liu @ 2012-04-23 18:30 ` Dan Williams 2012-04-24 2:30 ` Jiang Liu 0 siblings, 1 reply; 6+ messages in thread From: Dan Williams @ 2012-04-23 18:30 UTC (permalink / raw) To: Jiang Liu Cc: Vinod Koul, Jiang Liu, Keping Chen, David S. Miller, Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI, Patrick McHardy, netdev, linux-pci, linux-kernel On Mon, Apr 23, 2012 at 6:51 AM, Jiang Liu <liuj97@gmail.com> wrote: > Enhance network subsystem to correctly update DMA channel reference counts, > so it won't break DMA device hotplug logic. > > Signed-off-by: Jiang Liu <liuj97@gmail.com> This introduces an atomic action on every channel touch, which is more expensive than what we had previously. There has always been a concern about the overhead of offload, which sometimes makes it ineffective, or even a net loss, compared to cpu copies. In the cases where net_dma shows improvement this will eat into / maybe eliminate that advantage. Take a look at where dmaengine started [1]. It was from the beginning going through contortions to avoid something like this. We made it simpler here [2], but still kept the principle of not dirtying a shared cacheline on every channel touch, and certainly not locking it. If you are going to hotplug the entire IOH, then you are probably ok with network links going down, so could you just down the links and remove the driver with the existing code? -- Dan [1]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=c13c826 [2]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=6f49a57a
* Re: [PATCH v1 6/8] dmaengine: enhance network subsystem to support DMA device hotplug 2012-04-23 18:30 ` Dan Williams @ 2012-04-24 2:30 ` Jiang Liu 2012-04-24 3:09 ` Dan Williams 0 siblings, 1 reply; 6+ messages in thread From: Jiang Liu @ 2012-04-24 2:30 UTC (permalink / raw) To: Dan Williams Cc: Jiang Liu, Vinod Koul, Keping Chen, David S. Miller, Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI, Patrick McHardy, netdev, linux-pci, linux-kernel Hi Dan, Thanks for your great comments! gerry On 2012-4-24 2:30, Dan Williams wrote: > On Mon, Apr 23, 2012 at 6:51 AM, Jiang Liu<liuj97@gmail.com> wrote: >> Enhance network subsystem to correctly update DMA channel reference counts, >> so it won't break DMA device hotplug logic. >> >> Signed-off-by: Jiang Liu<liuj97@gmail.com> > > This introduces an atomic action on every channel touch, which is more > expensive than what we had previously. There has always been a > concern about the overhead of offload, which sometimes makes it > ineffective, or even a net loss, compared to cpu copies. In the cases where net_dma shows > improvement this will eat into / maybe eliminate that advantage. Good point, we should avoid polluting a shared cacheline here, otherwise it may eat the benefits of IOAT acceleration. > > Take a look at where dmaengine started [1]. It was from the beginning > going through contortions to avoid something like this. We made it > simpler here [2], but still kept the principle of not dirtying a > shared cacheline on every channel touch, and certainly not locking it. Thanks for the great background information, especially the second one. The check-in log message is as below. >Why?, beyond reducing complication: >1/ Tracking reference counts per-transaction in an efficient manner, as > is currently done, requires a complicated scheme to avoid cache-line > bouncing effects. The real issue here is polluting shared cachelines, right? Would it help to use a percpu counter instead of atomic operations here?
I will try using a percpu counter for the reference count. BTW, do you have any DMAEngine benchmarks so we could use them to compare the performance difference? >2/ Per-transaction ref-counting gives the false impression that a > dma-driver can be gracefully removed ahead of its user (net, md, or > dma-slave) >3/ None of the in-tree dma-drivers talk to hot pluggable hardware, but Seems the situation has changed now:) Intel 7500 (Boxboro) chipset supports hotplug. And we are working on a system, which adopts the Boxboro chipset and supports node hotplug. So we try to enhance the DMAEngine to support IOAT hotplug. On the other hand, Intel's next generation processor Ivybridge has an embedded IOH, so we need to support IOH/IOAT hotplug when supporting processor hotplug. > if such an engine were built one day we still would not need to > notify > clients of remove events. The driver can simply return NULL to a > ->prep() request, something that is much easier for a client to > handle. Could you please help to give more explanations about "The driver can simply return NULL to a ->prep() request"? I haven't fully gotten the idea yet. > > If you are going to hotplug the entire IOH, then you are probably ok > with network links going down, so could you just down the links and > remove the driver with the existing code? I feel it's a little risky to shut down/restart all network interfaces for hot-removal of an IOH, as that may disturb the applications. And there are also other kinds of clients, such as ASYNC_TX; it seems we can't adopt this method to reclaim DMA channels from the ASYNC_TX subsystem. > > -- > Dan > > [1]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=c13c826 > [2]: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=6f49a57a
* Re: [PATCH v1 6/8] dmaengine: enhance network subsystem to support DMA device hotplug 2012-04-24 2:30 ` Jiang Liu @ 2012-04-24 3:09 ` Dan Williams 2012-04-24 3:56 ` Jiang Liu 2012-04-25 15:47 ` Jiang Liu 0 siblings, 2 replies; 6+ messages in thread From: Dan Williams @ 2012-04-24 3:09 UTC (permalink / raw) To: Jiang Liu Cc: Jiang Liu, Vinod Koul, Keping Chen, David S. Miller, Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI, Patrick McHardy, netdev, linux-pci, linux-kernel On Mon, Apr 23, 2012 at 7:30 PM, Jiang Liu <jiang.liu@huawei.com> wrote: >> If you are going to hotplug the entire IOH, then you are probably ok >> with network links going down, so could you just down the links and >> remove the driver with the existing code? > > I feel it's a little risky to shut down/restart all network interfaces > for hot-removal of IOH, that may disturb the applications. I guess I'm confused... wouldn't the removal of an entire domain of pci devices disturb userspace applications? > And there > are also other kinds of clients, such as ASYNC_TX, seems we can't > adopt this method to reclaim DMA channels from ASYNC_TX subsystem. I say handle this like block device hotplug. I.e. the driver stays loaded but the channel is put into an 'offline' state. So the driver hides the fact that the hardware went away. Similar to how you can remove a disk but /dev/sda sticks around until the last reference is gone (and the driver 'sd' sticks around until all block devices are gone). I expect the work will be in making sure existing clients are prepared to handle NULL returns from ->device_prep_dma_*. In some cases the channel is treated more like a cpu, so a NULL return from ->device_prep_dma_memcpy() has been interpreted as "device is temporarily busy, it is safe to try again". We would need to change that to a permanent indication that the device is gone and not attempt retry. -- Dan ^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: [PATCH v1 6/8] dmaengine: enhance network subsystem to support DMA device hotplug 2012-04-24 3:09 ` Dan Williams @ 2012-04-24 3:56 ` Jiang Liu 2012-04-25 15:47 ` Jiang Liu 1 sibling, 0 replies; 6+ messages in thread From: Jiang Liu @ 2012-04-24 3:56 UTC (permalink / raw) To: Dan Williams Cc: Jiang Liu, Vinod Koul, Keping Chen, David S. Miller, Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI, Patrick McHardy, netdev, linux-pci, linux-kernel On 2012-4-24 11:09, Dan Williams wrote: > On Mon, Apr 23, 2012 at 7:30 PM, Jiang Liu<jiang.liu@huawei.com> wrote: >>> If you are going to hotplug the entire IOH, then you are probably ok >>> with network links going down, so could you just down the links and >>> remove the driver with the existing code? >> >> I feel it's a little risky to shut down/restart all network interfaces >> for hot-removal of IOH, that may disturb the applications. > > I guess I'm confused... wouldn't the removal of an entire domain of > pci devices disturb userspace applications? Here I mean removing an IOH shouldn't affect devices under other IOHs if possible. With current dmaengine implementation, a DMA device/channel may be used by clients in other PCI domains. So to safely remove a DMA device, we need to return dmaengine_ref_count to zero by stopping all DMA clients. For network, that means we need to stop all network interfaces, seems a little heavy:) > >> And there >> are also other kinds of clients, such as ASYNC_TX, seems we can't >> adopt this method to reclaim DMA channels from ASYNC_TX subsystem. > > I say handle this like block device hotplug. I.e. the driver stays > loaded but the channel is put into an 'offline' state. So the driver > hides the fact that the hardware went away. Similar to how you can > remove a disk but /dev/sda sticks around until the last reference is > gone (and the driver 'sd' sticks around until all block devices are > gone). 
Per my understanding, this mechanism could be used to stop the driver from accessing surprise-removed devices, but it still needs a reference count mechanism to finish the driver unbinding operation eventually. For IOH hotplug, we need to wait for the completion of driver unbinding operations before destroying the PCI device nodes of IOAT, so we still need a reference count to track channel usage. Another way is to notify all clients to release all channels when IOAT device hotplug happens, but that may need heavy modification to the DMA clients. > > I expect the work will be in making sure existing clients are prepared > to handle NULL returns from ->device_prep_dma_*. In some cases the > channel is treated more like a cpu, so a NULL return from > ->device_prep_dma_memcpy() has been interpreted as "device is > temporarily busy, it is safe to try again". We would need to change > that to a permanent indication that the device is gone and not attempt > retry. Yes, some ASYNC_TX clients interpret a NULL return as EBUSY and keep retrying when doing context-aware computations. I will try to investigate in this direction. > > -- > Dan
* Re: [PATCH v1 6/8] dmaengine: enhance network subsystem to support DMA device hotplug 2012-04-24 3:09 ` Dan Williams 2012-04-24 3:56 ` Jiang Liu @ 2012-04-25 15:47 ` Jiang Liu 1 sibling, 0 replies; 6+ messages in thread From: Jiang Liu @ 2012-04-25 15:47 UTC (permalink / raw) To: Dan Williams Cc: Jiang Liu, Vinod Koul, Keping Chen, David S. Miller, Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI, Patrick McHardy, netdev, linux-pci, linux-kernel Hi Dan, Thanks for your great comments about the performance penalty issue. And I'm trying to refine the implementation to reduce penalty caused by hotplug logic. If the algorithm works correctly, the optimized hot path code will be: ------------------------------------------------------------------------------ struct dma_chan *dma_find_channel(enum dma_transaction_type tx_type) { struct dma_chan *chan = this_cpu_read(channel_table[tx_type]->chan); this_cpu_inc(dmaengine_chan_ref_count); if (static_key_false(&dmaengine_quiesce)) { chan = NULL; } return chan; } EXPORT_SYMBOL(dma_find_channel); struct dma_chan *dma_get_channel(struct dma_chan *chan) { if (static_key_false(&dmaengine_quiesce)) atomic_inc(&dmaengine_dirty); this_cpu_inc(dmaengine_chan_ref_count); return chan; } EXPORT_SYMBOL(dma_get_channel); void dma_put_channel(struct dma_chan *chan) { this_cpu_dec(dmaengine_chan_ref_count); } EXPORT_SYMBOL(dma_put_channel); ----------------------------------------------------------------------------- The disassembled code is: (gdb) disassemble dma_find_channel Dump of assembler code for function dma_find_channel: 0x0000000000000000 <+0>: push %rbp 0x0000000000000001 <+1>: mov %rsp,%rbp 0x0000000000000004 <+4>: callq 0x9 <dma_find_channel+9> 0x0000000000000009 <+9>: mov %edi,%edi 0x000000000000000b <+11>: mov 0x0(,%rdi,8),%rax 0x0000000000000013 <+19>: mov %gs:(%rax),%rax 0x0000000000000017 <+23>: incq %gs:0x0 //overhead: this_cpu_inc(dmaengine_chan_ref_count) 0x0000000000000020 <+32>: jmpq 0x25 <dma_find_channel+37> 
//overhead: if (static_key_false(&dmaengine_quiesce)), will be replaced as NOP by jump label 0x0000000000000025 <+37>: pop %rbp 0x0000000000000026 <+38>: retq 0x0000000000000027 <+39>: nopw 0x0(%rax,%rax,1) 0x0000000000000030 <+48>: xor %eax,%eax 0x0000000000000032 <+50>: pop %rbp 0x0000000000000033 <+51>: retq End of assembler dump. (gdb) disassemble dma_put_channel // overhead: to decrease channel reference count, 6 instructions Dump of assembler code for function dma_put_channel: 0x0000000000000070 <+0>: push %rbp 0x0000000000000071 <+1>: mov %rsp,%rbp 0x0000000000000074 <+4>: callq 0x79 <dma_put_channel+9> 0x0000000000000079 <+9>: decq %gs:0x0 0x0000000000000082 <+18>: pop %rbp 0x0000000000000083 <+19>: retq End of assembler dump. (gdb) disassemble dma_get_channel Dump of assembler code for function dma_get_channel: 0x0000000000000040 <+0>: push %rbp 0x0000000000000041 <+1>: mov %rsp,%rbp 0x0000000000000044 <+4>: callq 0x49 <dma_get_channel+9> 0x0000000000000049 <+9>: mov %rdi,%rax 0x000000000000004c <+12>: jmpq 0x51 <dma_get_channel+17> 0x0000000000000051 <+17>: incq %gs:0x0 0x000000000000005a <+26>: pop %rbp 0x000000000000005b <+27>: retq 0x000000000000005c <+28>: nopl 0x0(%rax) 0x0000000000000060 <+32>: lock incl 0x0(%rip) # 0x67 <dma_get_channel+39> 0x0000000000000067 <+39>: jmp 0x51 <dma_get_channel+17> End of assembler dump. So for a typical dma_find_channel()/dma_put_channel() pair, the total overhead is about 10 instructions and two percpu (local) memory updates. And there's no shared cache pollution any more. Is this acceptable if the algorithm works as expected? I will test the code tomorrow. For typical systems which don't support DMA device hotplug, the overhead could be completely removed by conditional compilation. Any comments are welcome! Thanks! --gerry On 04/24/2012 11:09 AM, Dan Williams wrote: >>> If you are going to hotplug the entire IOH, then you are probably ok
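[Editorial sketch] The hot path Jiang proposes above combines two techniques: per-CPU reference counts (no shared cacheline is dirtied on get/put) and a static key that is flipped only while a hot-remove is in progress. The userspace C model below imitates that scheme with a plain array standing in for per-CPU variables and a bool standing in for the jump label; it is an illustration of the algorithm, not the kernel implementation:

```c
#include <stdbool.h>
#include <stdio.h>

#define NR_CPUS 4

/* Per-"cpu" reference counts: each CPU touches only its own slot, so
 * the hot path needs no atomics and dirties no shared cacheline --
 * the role played by this_cpu_inc/this_cpu_dec in Jiang's code. */
static long chan_ref[NR_CPUS];

/* Models the dmaengine_quiesce static key: normally compiled to a NOP
 * in the hot path, flipped on only while a hot-remove is in flight. */
static bool quiesce;

static int fake_channel = 1;            /* hypothetical channel object */

static void *get_channel(int cpu)
{
    chan_ref[cpu]++;                    /* local increment, no lock   */
    return quiesce ? NULL : &fake_channel; /* refuse new work while   */
}                                          /* removal is in progress  */

static void put_channel(int cpu)
{
    chan_ref[cpu]--;                    /* local decrement, no lock */
}

/* Slow path, run once at hot-remove time: after flipping "quiesce",
 * sum the per-cpu counters; the channel may be torn down only when
 * the sum drains to zero. */
static long total_refs(void)
{
    long sum = 0;
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        sum += chan_ref[cpu];
    return sum;
}

int main(void)
{
    get_channel(0);
    get_channel(2);
    printf("in-flight refs: %ld\n", total_refs());

    quiesce = true;                     /* hot-remove begins          */
    if (!get_channel(1))                /* new user is turned away... */
        put_channel(1);                 /* ...and drops its ref       */

    put_channel(0);
    put_channel(2);
    printf("refs after drain: %ld\n", total_refs()); /* 0: safe to remove */
    return 0;
}
```

Note the asymmetry this models: the expensive cross-CPU summation happens only in the rare removal path, while the per-operation cost stays at the "about 10 instructions, two local memory updates" Jiang measures in the disassembly above.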