* qdisc spin lock @ 2016-03-30 7:20 Michael Ma 2016-03-31 19:18 ` Jesper Dangaard Brouer 2016-03-31 22:16 ` Cong Wang 0 siblings, 2 replies; 21+ messages in thread
From: Michael Ma @ 2016-03-30 7:20 UTC (permalink / raw)
To: netdev

Hi -

I know this might be an old topic, so bear with me – what we are facing is that applications are sending small packets from hundreds of threads, so contention on the spin lock in __dev_xmit_skb increases the latency of dev_queue_xmit significantly. We're building a network QoS solution using HTB to avoid interference between different applications. But in this case, when some applications send massive numbers of small packets in parallel, the application to be protected sees its throughput suffer (it does synchronous network communication from multiple threads, so its throughput is sensitive to the increased latency).

Here is the profiling from perf:

  - 67.57% iperf  [kernel.kallsyms]  [k] _spin_lock
     - 99.94% dev_queue_xmit
          96.91% _spin_lock
        - 2.62% __qdisc_run
           - 98.98% sch_direct_xmit
                99.98% _spin_lock
             1.01% _spin_lock

As far as I understand, the design of TC is to keep the locking scheme simple and to minimize the work done in __qdisc_run so that throughput isn't affected, especially with large packets. However, if multiple classes in the queueing discipline only have a shaping limit, there isn't really any necessary correlation between the different classes. The only synchronization point should be when a packet is dequeued from the qdisc queue and enqueued to the transmit queue of the device. My question is – is it worth investing in avoiding the lock contention by partitioning the queue/lock so that this scenario is handled with lower latency?

I must have oversimplified a lot of details since I'm not familiar with the TC implementation at this point – I just want your input on whether this is a worthwhile effort, or whether there is something fundamental that I'm not aware of. If it is just a matter of a fair amount of additional work, I would also appreciate help outlining the required work. I would also appreciate any information about the latest status of this work: http://www.ijcset.com/docs/IJCSET13-04-04-113.pdf

Thanks,
Ke Ma

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: qdisc spin lock 2016-03-30 7:20 qdisc spin lock Michael Ma @ 2016-03-31 19:18 ` Jesper Dangaard Brouer 2016-03-31 23:41 ` Michael Ma 2016-03-31 22:16 ` Cong Wang 1 sibling, 1 reply; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2016-03-31 19:18 UTC (permalink / raw)
To: Michael Ma; +Cc: brouer, netdev

On Wed, 30 Mar 2016 00:20:03 -0700 Michael Ma <make0818@gmail.com> wrote:

> I know this might be an old topic so bare with me – what we are facing
> is that applications are sending small packets using hundreds of
> threads so the contention on spin lock in __dev_xmit_skb increases the
> latency of dev_queue_xmit significantly. We're building a network QoS
> solution to avoid interference of different applications using HTB.

Yes, as you have noticed, with HTB there is a single qdisc lock, and contention obviously happens :-)

It is possible to make it scale with different tricks. I believe Google is using a variant of HTB, and it scales for them. They have not open-sourced their modifications to HTB (which likely also involve a great deal of setup tricks).

If your purpose is to limit traffic/bandwidth per "cloud" instance, then you can just use another TC setup structure, like using MQ and assigning an HTB per MQ queue (where the MQ queues are bound to each CPU/HW queue)... But you have to figure out this setup yourself...

> But in this case when some applications send massive small packets in
> parallel, the application to be protected will get its throughput
> affected (because it's doing synchronous network communication using
> multiple threads and throughput is sensitive to the increased latency)
>
> Here is the profiling from perf:
>
> - 67.57% iperf [kernel.kallsyms] [k] _spin_lock
> - 99.94% dev_queue_xmit
> - 96.91% _spin_lock
> - 2.62% __qdisc_run
> - 98.98% sch_direct_xmit
> - 99.98% _spin_lock
>
> As far as I understand the design of TC is to simplify locking schema
> and minimize the work in __qdisc_run so that throughput won't be
> affected, especially with large packets. However if the scenario is
> that multiple classes in the queueing discipline only have the shaping
> limit, there isn't really a necessary correlation between different
> classes. The only synchronization point should be when the packet is
> dequeued from the qdisc queue and enqueued to the transmit queue of
> the device. My question is – is it worth investing on avoiding the
> locking contention by partitioning the queue/lock so that this
> scenario is addressed with relatively smaller latency?

Yes, there is a lot to gain, but it is not easy ;-)

> I must have oversimplified a lot of details since I'm not familiar
> with the TC implementation at this point – just want to get your input
> in terms of whether this is a worthwhile effort or there is something
> fundamental that I'm not aware of. If this is just a matter of quite
> some additional work, would also appreciate helping to outline the
> required work here.
>
> Also would appreciate if there is any information about the latest
> status of this work http://www.ijcset.com/docs/IJCSET13-04-04-113.pdf

This article seems to be very low quality... spelling errors, only 5 pages, no real code, etc.

--
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply [flat|nested] 21+ messages in thread
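Concretely, the MQ-plus-per-queue-HTB layout Jesper describes might look like the sketch below. The handles, the 500mbit rates and the two-queue NIC are illustrative assumptions rather than a tested recipe, and the aggregate limit ends up statically split across the queues:

    # attach an mq root so that each hardware tx queue gets its own child class
    tc qdisc add dev eth0 root handle 1: mq
    # give every tx queue an independent HTB (each with its own qdisc lock)
    tc qdisc add dev eth0 parent 1:1 handle 11: htb default 10
    tc class add dev eth0 parent 11: classid 11:10 htb rate 500mbit ceil 500mbit
    tc qdisc add dev eth0 parent 1:2 handle 12: htb default 10
    tc class add dev eth0 parent 12: classid 12:10 htb rate 500mbit ceil 500mbit
    # optionally bind each tx queue to a CPU set via XPS so a CPU only
    # contends for the lock of "its" queue
    echo 1 > /sys/class/net/eth0/queues/tx-0/xps_cpus
    echo 2 > /sys/class/net/eth0/queues/tx-1/xps_cpus

As Michael notes later in the thread, the price of this layout is that the rate limit has to be divided up front across the queues instead of being shared by one HTB tree.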
* Re: qdisc spin lock 2016-03-31 19:18 ` Jesper Dangaard Brouer @ 2016-03-31 23:41 ` Michael Ma 2016-04-16 8:52 ` Andrew 0 siblings, 1 reply; 21+ messages in thread From: Michael Ma @ 2016-03-31 23:41 UTC (permalink / raw) To: Jesper Dangaard Brouer; +Cc: netdev Thanks for the suggestion - I'll try the MQ solution out. It seems to be able to solve the problem well with the assumption that bandwidth can be statically partitioned. 2016-03-31 12:18 GMT-07:00 Jesper Dangaard Brouer <brouer@redhat.com>: > > On Wed, 30 Mar 2016 00:20:03 -0700 Michael Ma <make0818@gmail.com> wrote: > >> I know this might be an old topic so bare with me – what we are facing >> is that applications are sending small packets using hundreds of >> threads so the contention on spin lock in __dev_xmit_skb increases the >> latency of dev_queue_xmit significantly. We’re building a network QoS >> solution to avoid interference of different applications using HTB. > > Yes, as you have noticed with HTB there is a single qdisc lock, and > congestion obviously happens :-) > > It is possible with different tricks to make it scale. I believe > Google is using a variant of HTB, and it scales for them. They have > not open source their modifications to HTB (which likely also involves > a great deal of setup tricks). > > If your purpose it to limit traffic/bandwidth per "cloud" instance, > then you can just use another TC setup structure. Like using MQ and > assigning a HTB per MQ queue (where the MQ queues are bound to each > CPU/HW queue)... But you have to figure out this setup yourself... > > >> But in this case when some applications send massive small packets in >> parallel, the application to be protected will get its throughput >> affected (because it’s doing synchronous network communication using >> multiple threads and throughput is sensitive to the increased latency) >> >> Here is the profiling from perf: >> >> - 67.57% iperf [kernel.kallsyms] [k] _spin_lock >> - 99.94% dev_queue_xmit >> - 96.91% _spin_lock >> - 2.62% __qdisc_run >> - 98.98% sch_direct_xmit >> - 99.98% _spin_lock >> >> As far as I understand the design of TC is to simplify locking schema >> and minimize the work in __qdisc_run so that throughput won’t be >> affected, especially with large packets. However if the scenario is >> that multiple classes in the queueing discipline only have the shaping >> limit, there isn’t really a necessary correlation between different >> classes. The only synchronization point should be when the packet is >> dequeued from the qdisc queue and enqueued to the transmit queue of >> the device. My question is – is it worth investing on avoiding the >> locking contention by partitioning the queue/lock so that this >> scenario is addressed with relatively smaller latency? > > Yes, there is a lot go gain, but it is not easy ;-) > >> I must have oversimplified a lot of details since I’m not familiar >> with the TC implementation at this point – just want to get your input >> in terms of whether this is a worthwhile effort or there is something >> fundamental that I’m not aware of. If this is just a matter of quite >> some additional work, would also appreciate helping to outline the >> required work here. >> >> Also would appreciate if there is any information about the latest >> status of this work http://www.ijcset.com/docs/IJCSET13-04-04-113.pdf > > This article seems to be very low quality... spelling errors, only 5 > pages, no real code, etc. 
> > -- > Best regards, > Jesper Dangaard Brouer > MSc.CS, Principal Kernel Engineer at Red Hat > Author of http://www.iptv-analyzer.org > LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: qdisc spin lock 2016-03-31 23:41 ` Michael Ma @ 2016-04-16 8:52 ` Andrew 0 siblings, 0 replies; 21+ messages in thread
From: Andrew @ 2016-04-16 8:52 UTC (permalink / raw)
To: Michael Ma, Jesper Dangaard Brouer; +Cc: netdev

I think it isn't a good solution - unless you can bind a specific host (src/dst) to a specific txq. Usually traffic is spread across txqs by a src+dst IP (or even IP:port) hash, which results in traffic spreading among all MQ queues on the device and in wrong bandwidth limiting (N*bandwidth on multi-session loads like p2p/server traffic)...

People have said that the HFSC shaper has better performance, but I haven't tested it.

01.04.2016 02:41, Michael Ma wrote:
> Thanks for the suggestion - I'll try the MQ solution out. It seems to
> be able to solve the problem well with the assumption that bandwidth
> can be statically partitioned.
>
> 2016-03-31 12:18 GMT-07:00 Jesper Dangaard Brouer <brouer@redhat.com>:
>> On Wed, 30 Mar 2016 00:20:03 -0700 Michael Ma <make0818@gmail.com> wrote:
>>
>>> I know this might be an old topic so bare with me – what we are facing
>>> is that applications are sending small packets using hundreds of
>>> threads so the contention on spin lock in __dev_xmit_skb increases the
>>> latency of dev_queue_xmit significantly. We're building a network QoS
>>> solution to avoid interference of different applications using HTB.
>> Yes, as you have noticed with HTB there is a single qdisc lock, and
>> congestion obviously happens :-)
>>
>> It is possible with different tricks to make it scale. I believe
>> Google is using a variant of HTB, and it scales for them. They have
>> not open source their modifications to HTB (which likely also involves
>> a great deal of setup tricks).
>>
>> If your purpose it to limit traffic/bandwidth per "cloud" instance,
>> then you can just use another TC setup structure. Like using MQ and
>> assigning a HTB per MQ queue (where the MQ queues are bound to each
>> CPU/HW queue)... But you have to figure out this setup yourself...
>>
>>> But in this case when some applications send massive small packets in
>>> parallel, the application to be protected will get its throughput
>>> affected (because it's doing synchronous network communication using
>>> multiple threads and throughput is sensitive to the increased latency)
>>>
>>> Here is the profiling from perf:
>>>
>>> - 67.57% iperf [kernel.kallsyms] [k] _spin_lock
>>> - 99.94% dev_queue_xmit
>>> - 96.91% _spin_lock
>>> - 2.62% __qdisc_run
>>> - 98.98% sch_direct_xmit
>>> - 99.98% _spin_lock
>>>
>>> As far as I understand the design of TC is to simplify locking schema
>>> and minimize the work in __qdisc_run so that throughput won't be
>>> affected, especially with large packets. However if the scenario is
>>> that multiple classes in the queueing discipline only have the shaping
>>> limit, there isn't really a necessary correlation between different
>>> classes. The only synchronization point should be when the packet is
>>> dequeued from the qdisc queue and enqueued to the transmit queue of
>>> the device. My question is – is it worth investing on avoiding the
>>> locking contention by partitioning the queue/lock so that this
>>> scenario is addressed with relatively smaller latency?
>> Yes, there is a lot go gain, but it is not easy ;-) >> >>> I must have oversimplified a lot of details since I’m not familiar >>> with the TC implementation at this point – just want to get your input >>> in terms of whether this is a worthwhile effort or there is something >>> fundamental that I’m not aware of. If this is just a matter of quite >>> some additional work, would also appreciate helping to outline the >>> required work here. >>> >>> Also would appreciate if there is any information about the latest >>> status of this work http://www.ijcset.com/docs/IJCSET13-04-04-113.pdf >> This article seems to be very low quality... spelling errors, only 5 >> pages, no real code, etc. >> >> -- >> Best regards, >> Jesper Dangaard Brouer >> MSc.CS, Principal Kernel Engineer at Red Hat >> Author of http://www.iptv-analyzer.org >> LinkedIn: http://www.linkedin.com/in/brouer ^ permalink raw reply [flat|nested] 21+ messages in thread
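Andrew's objection is that hash-based txq selection spreads one sender across all queues and multiplies the effective limit. When the competing senders are local applications (as in this thread) rather than forwarded flows, one way around that is to steer each application onto its own queue set explicitly instead of relying on the hash. A rough sketch, where the priority number, queue counts and cgroup path are all made up and the NIC is assumed to expose at least 8 tx queues:

    # mqprio maps skb->priority to traffic classes; here priority 5 gets a
    # dedicated traffic class backed by one tx queue, everything else shares
    # the remaining queues
    tc qdisc add dev eth0 root mqprio num_tc 2 map 0 0 0 0 0 1 0 0 \
            queues 7@0 1@7 hw 0
    # put the protected application into a net_prio cgroup (assumes the
    # net_prio controller is mounted at this path) so its sockets get
    # skb->priority 5 on eth0; setsockopt(SO_PRIORITY) in the app works too
    mkdir /sys/fs/cgroup/net_prio/protected
    echo "eth0 5" > /sys/fs/cgroup/net_prio/protected/net_prio.ifpriomap
    echo $APP_PID > /sys/fs/cgroup/net_prio/protected/tasks

Michael's own mqprio configuration later in the thread follows the same pattern.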
* Re: qdisc spin lock 2016-03-30 7:20 qdisc spin lock Michael Ma 2016-03-31 19:18 ` Jesper Dangaard Brouer @ 2016-03-31 22:16 ` Cong Wang 2016-03-31 23:48 ` Michael Ma 1 sibling, 1 reply; 21+ messages in thread
From: Cong Wang @ 2016-03-31 22:16 UTC (permalink / raw)
To: Michael Ma; +Cc: Linux Kernel Network Developers

On Wed, Mar 30, 2016 at 12:20 AM, Michael Ma <make0818@gmail.com> wrote:
> As far as I understand the design of TC is to simplify locking schema
> and minimize the work in __qdisc_run so that throughput won't be
> affected, especially with large packets. However if the scenario is
> that multiple classes in the queueing discipline only have the shaping
> limit, there isn't really a necessary correlation between different
> classes. The only synchronization point should be when the packet is
> dequeued from the qdisc queue and enqueued to the transmit queue of
> the device. My question is – is it worth investing on avoiding the
> locking contention by partitioning the queue/lock so that this
> scenario is addressed with relatively smaller latency?

If your HTB classes don't share bandwidth, why do you still put them under the same hierarchy? IOW, you can just isolate them with some other qdisc, or even with separate interfaces.

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: qdisc spin lock 2016-03-31 22:16 ` Cong Wang @ 2016-03-31 23:48 ` Michael Ma 2016-04-01 2:19 ` David Miller ` (2 more replies) 0 siblings, 3 replies; 21+ messages in thread From: Michael Ma @ 2016-03-31 23:48 UTC (permalink / raw) To: Cong Wang; +Cc: Linux Kernel Network Developers I didn't really know that multiple qdiscs can be isolated using MQ so that each txq can be associated with a particular qdisc. Also we don't really have multiple interfaces... With this MQ solution we'll still need to assign transmit queues to different classes by doing some math on the bandwidth limit if I understand correctly, which seems to be less convenient compared with a solution purely within HTB. I assume that with this solution I can still share qdisc among multiple transmit queues - please let me know if this is not the case. 2016-03-31 15:16 GMT-07:00 Cong Wang <xiyou.wangcong@gmail.com>: > On Wed, Mar 30, 2016 at 12:20 AM, Michael Ma <make0818@gmail.com> wrote: >> As far as I understand the design of TC is to simplify locking schema >> and minimize the work in __qdisc_run so that throughput won’t be >> affected, especially with large packets. However if the scenario is >> that multiple classes in the queueing discipline only have the shaping >> limit, there isn’t really a necessary correlation between different >> classes. The only synchronization point should be when the packet is >> dequeued from the qdisc queue and enqueued to the transmit queue of >> the device. My question is – is it worth investing on avoiding the >> locking contention by partitioning the queue/lock so that this >> scenario is addressed with relatively smaller latency? > > If your HTB classes don't share bandwidth, why do you still make them > under the same hierarchy? IOW, you can just isolate them either with some > other qdisc or just separated interfaces. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: qdisc spin lock 2016-03-31 23:48 ` Michael Ma @ 2016-04-01 2:19 ` David Miller 2016-04-01 17:17 ` Michael Ma 2016-04-01 3:44 ` John Fastabend 2016-04-08 14:19 ` Eric Dumazet 2 siblings, 1 reply; 21+ messages in thread From: David Miller @ 2016-04-01 2:19 UTC (permalink / raw) To: make0818; +Cc: xiyou.wangcong, netdev From: Michael Ma <make0818@gmail.com> Date: Thu, 31 Mar 2016 16:48:43 -0700 > I didn't really know that multiple qdiscs can be isolated using MQ so ... Please stop top-posting. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: qdisc spin lock 2016-04-01 2:19 ` David Miller @ 2016-04-01 17:17 ` Michael Ma 0 siblings, 0 replies; 21+ messages in thread From: Michael Ma @ 2016-04-01 17:17 UTC (permalink / raw) To: David Miller; +Cc: Cong Wang, Linux Kernel Network Developers 2016-03-31 19:19 GMT-07:00 David Miller <davem@davemloft.net>: > From: Michael Ma <make0818@gmail.com> > Date: Thu, 31 Mar 2016 16:48:43 -0700 > >> I didn't really know that multiple qdiscs can be isolated using MQ so > ... > > Please stop top-posting. Sorry that I wasn't aware of this... ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: qdisc spin lock 2016-03-31 23:48 ` Michael Ma 2016-04-01 2:19 ` David Miller @ 2016-04-01 3:44 ` John Fastabend 2016-04-13 18:23 ` Michael Ma 2016-04-08 14:19 ` Eric Dumazet 2 siblings, 1 reply; 21+ messages in thread From: John Fastabend @ 2016-04-01 3:44 UTC (permalink / raw) To: Michael Ma, Cong Wang; +Cc: Linux Kernel Network Developers On 16-03-31 04:48 PM, Michael Ma wrote: > I didn't really know that multiple qdiscs can be isolated using MQ so > that each txq can be associated with a particular qdisc. Also we don't > really have multiple interfaces... MQ will assign a default qdisc to each txq and the default qdisc can be changed to htb or any other qdisc of your choice. > > With this MQ solution we'll still need to assign transmit queues to > different classes by doing some math on the bandwidth limit if I > understand correctly, which seems to be less convenient compared with > a solution purely within HTB. > Agreed. > I assume that with this solution I can still share qdisc among > multiple transmit queues - please let me know if this is not the case. Nope sorry doesn't work that way unless you employ some sort of stacked netdevice strategy which does start to get a bit complex. The basic hint would be to stack some type of virtual netdev on top of a device and run the htb qdisc there. Push traffic onto the netdev depending on the class it belongs to. Its ugly yes. Noting all that I posted an RFC patch some time back to allow writing qdiscs that do not require taking the lock. I'll try to respin these and submit them when net-next opens again. The next logical step is to write a "better" HTB probably using a shared counter and dropping the requirement that it be exact. Sorry I didn't get a chance to look at the paper in your post so not sure if they suggest something similar or not. Thanks, John > > 2016-03-31 15:16 GMT-07:00 Cong Wang <xiyou.wangcong@gmail.com>: >> On Wed, Mar 30, 2016 at 12:20 AM, Michael Ma <make0818@gmail.com> wrote: >>> As far as I understand the design of TC is to simplify locking schema >>> and minimize the work in __qdisc_run so that throughput won’t be >>> affected, especially with large packets. However if the scenario is >>> that multiple classes in the queueing discipline only have the shaping >>> limit, there isn’t really a necessary correlation between different >>> classes. The only synchronization point should be when the packet is >>> dequeued from the qdisc queue and enqueued to the transmit queue of >>> the device. My question is – is it worth investing on avoiding the >>> locking contention by partitioning the queue/lock so that this >>> scenario is addressed with relatively smaller latency? >> >> If your HTB classes don't share bandwidth, why do you still make them >> under the same hierarchy? IOW, you can just isolate them either with some >> other qdisc or just separated interfaces. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: qdisc spin lock 2016-04-01 3:44 ` John Fastabend @ 2016-04-13 18:23 ` Michael Ma 0 siblings, 0 replies; 21+ messages in thread From: Michael Ma @ 2016-04-13 18:23 UTC (permalink / raw) To: John Fastabend; +Cc: Cong Wang, Linux Kernel Network Developers 2016-03-31 20:44 GMT-07:00 John Fastabend <john.fastabend@gmail.com>: > On 16-03-31 04:48 PM, Michael Ma wrote: >> I didn't really know that multiple qdiscs can be isolated using MQ so >> that each txq can be associated with a particular qdisc. Also we don't >> really have multiple interfaces... > > MQ will assign a default qdisc to each txq and the default qdisc can > be changed to htb or any other qdisc of your choice. > >> >> With this MQ solution we'll still need to assign transmit queues to >> different classes by doing some math on the bandwidth limit if I >> understand correctly, which seems to be less convenient compared with >> a solution purely within HTB. >> > > Agreed. > >> I assume that with this solution I can still share qdisc among >> multiple transmit queues - please let me know if this is not the case. > > Nope sorry doesn't work that way unless you employ some sort of stacked > netdevice strategy which does start to get a bit complex. The basic hint > would be to stack some type of virtual netdev on top of a device and > run the htb qdisc there. Push traffic onto the netdev depending on the > class it belongs to. Its ugly yes. Thanks for the input - is it possible to implement a MQ to allow one qdisc to be associated with multiple TXQ belonging to one inner class? Does that require any additional synchronization in qdisc routines? I think it shouldn't because the common use of TC already associates one qdisc with all the TXQ. This would simplify the way that qdisc is crafted since it doesn't need to be created for each queue. Also is it possible to allow multiple qdiscs to be associated with one TXQ? I can see one issue with this - the TC interface for MQ today doesn't allow one TXQ to be associated with multiple classes, but this seems to be a restriction only from the tc command interface perspective. MQ anyway only establishes a connection between TXQ and qdisc so theoretically we can associate one TXQ to multiple classes/qdiscs and use priority/classid of the packet to do the pre-selection of qdisc. Again, I might have missed some potential synchronization issue when this actually happens because today TXQ is not shared by multiple qdiscs in any case. > > Noting all that I posted an RFC patch some time back to allow writing > qdiscs that do not require taking the lock. I'll try to respin these > and submit them when net-next opens again. The next logical step is to > write a "better" HTB probably using a shared counter and dropping the > requirement that it be exact. > > Sorry I didn't get a chance to look at the paper in your post so not > sure if they suggest something similar or not. > > Thanks, > John > >> >> 2016-03-31 15:16 GMT-07:00 Cong Wang <xiyou.wangcong@gmail.com>: >>> On Wed, Mar 30, 2016 at 12:20 AM, Michael Ma <make0818@gmail.com> wrote: >>>> As far as I understand the design of TC is to simplify locking schema >>>> and minimize the work in __qdisc_run so that throughput won’t be >>>> affected, especially with large packets. However if the scenario is >>>> that multiple classes in the queueing discipline only have the shaping >>>> limit, there isn’t really a necessary correlation between different >>>> classes. 
The only synchronization point should be when the packet is >>>> dequeued from the qdisc queue and enqueued to the transmit queue of >>>> the device. My question is – is it worth investing on avoiding the >>>> locking contention by partitioning the queue/lock so that this >>>> scenario is addressed with relatively smaller latency? >>> >>> If your HTB classes don't share bandwidth, why do you still make them >>> under the same hierarchy? IOW, you can just isolate them either with some >>> other qdisc or just separated interfaces. > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: qdisc spin lock 2016-03-31 23:48 ` Michael Ma 2016-04-01 2:19 ` David Miller 2016-04-01 3:44 ` John Fastabend @ 2016-04-08 14:19 ` Eric Dumazet 2016-04-15 22:46 ` Michael Ma 2016-04-20 21:24 ` Michael Ma 2 siblings, 2 replies; 21+ messages in thread From: Eric Dumazet @ 2016-04-08 14:19 UTC (permalink / raw) To: Michael Ma; +Cc: Cong Wang, Linux Kernel Network Developers On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote: > I didn't really know that multiple qdiscs can be isolated using MQ so > that each txq can be associated with a particular qdisc. Also we don't > really have multiple interfaces... > > With this MQ solution we'll still need to assign transmit queues to > different classes by doing some math on the bandwidth limit if I > understand correctly, which seems to be less convenient compared with > a solution purely within HTB. > > I assume that with this solution I can still share qdisc among > multiple transmit queues - please let me know if this is not the case. Note that this MQ + HTB thing works well, unless you use a bonding device. (Or you need the MQ+HTB on the slaves, with no way of sharing tokens between the slaves) https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bb1d912323d5dd50e1079e389f4e964be14f0ae3 bonding can not really be used as a true MQ device yet. I might send a patch to disable this 'bonding feature' if no slave sets a queue_id. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: qdisc spin lock 2016-04-08 14:19 ` Eric Dumazet @ 2016-04-15 22:46 ` Michael Ma 2016-04-15 22:54 ` Eric Dumazet 2016-04-20 21:24 ` Michael Ma 1 sibling, 1 reply; 21+ messages in thread From: Michael Ma @ 2016-04-15 22:46 UTC (permalink / raw) To: Eric Dumazet; +Cc: Cong Wang, Linux Kernel Network Developers 2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>: > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote: >> I didn't really know that multiple qdiscs can be isolated using MQ so >> that each txq can be associated with a particular qdisc. Also we don't >> really have multiple interfaces... >> >> With this MQ solution we'll still need to assign transmit queues to >> different classes by doing some math on the bandwidth limit if I >> understand correctly, which seems to be less convenient compared with >> a solution purely within HTB. >> >> I assume that with this solution I can still share qdisc among >> multiple transmit queues - please let me know if this is not the case. > > Note that this MQ + HTB thing works well, unless you use a bonding > device. (Or you need the MQ+HTB on the slaves, with no way of sharing > tokens between the slaves) > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bb1d912323d5dd50e1079e389f4e964be14f0ae3 > > bonding can not really be used as a true MQ device yet. > > I might send a patch to disable this 'bonding feature' if no slave sets > a queue_id. > > So there is no way of using this MQ solution when bonding interface is used, right? Then modifying HTB might be the only solution? ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: qdisc spin lock 2016-04-15 22:46 ` Michael Ma @ 2016-04-15 22:54 ` Eric Dumazet 2016-04-15 23:05 ` Michael Ma 0 siblings, 1 reply; 21+ messages in thread From: Eric Dumazet @ 2016-04-15 22:54 UTC (permalink / raw) To: Michael Ma; +Cc: Cong Wang, Linux Kernel Network Developers On Fri, 2016-04-15 at 15:46 -0700, Michael Ma wrote: > 2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>: > > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote: > >> I didn't really know that multiple qdiscs can be isolated using MQ so > >> that each txq can be associated with a particular qdisc. Also we don't > >> really have multiple interfaces... > >> > >> With this MQ solution we'll still need to assign transmit queues to > >> different classes by doing some math on the bandwidth limit if I > >> understand correctly, which seems to be less convenient compared with > >> a solution purely within HTB. > >> > >> I assume that with this solution I can still share qdisc among > >> multiple transmit queues - please let me know if this is not the case. > > > > Note that this MQ + HTB thing works well, unless you use a bonding > > device. (Or you need the MQ+HTB on the slaves, with no way of sharing > > tokens between the slaves) > > > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bb1d912323d5dd50e1079e389f4e964be14f0ae3 > > > > bonding can not really be used as a true MQ device yet. > > > > I might send a patch to disable this 'bonding feature' if no slave sets > > a queue_id. > > > > > So there is no way of using this MQ solution when bonding interface is > used, right? Then modifying HTB might be the only solution? I probably can submit a bonding patch very soon if there is interest. Modifying HTB is way more complicated :( ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: qdisc spin lock 2016-04-15 22:54 ` Eric Dumazet @ 2016-04-15 23:05 ` Michael Ma 2016-04-15 23:56 ` Eric Dumazet 0 siblings, 1 reply; 21+ messages in thread From: Michael Ma @ 2016-04-15 23:05 UTC (permalink / raw) To: Eric Dumazet; +Cc: Cong Wang, Linux Kernel Network Developers 2016-04-15 15:54 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>: > On Fri, 2016-04-15 at 15:46 -0700, Michael Ma wrote: >> 2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>: >> > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote: >> >> I didn't really know that multiple qdiscs can be isolated using MQ so >> >> that each txq can be associated with a particular qdisc. Also we don't >> >> really have multiple interfaces... >> >> >> >> With this MQ solution we'll still need to assign transmit queues to >> >> different classes by doing some math on the bandwidth limit if I >> >> understand correctly, which seems to be less convenient compared with >> >> a solution purely within HTB. >> >> >> >> I assume that with this solution I can still share qdisc among >> >> multiple transmit queues - please let me know if this is not the case. >> > >> > Note that this MQ + HTB thing works well, unless you use a bonding >> > device. (Or you need the MQ+HTB on the slaves, with no way of sharing >> > tokens between the slaves) >> > >> > >> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bb1d912323d5dd50e1079e389f4e964be14f0ae3 >> > >> > bonding can not really be used as a true MQ device yet. >> > >> > I might send a patch to disable this 'bonding feature' if no slave sets >> > a queue_id. >> > >> > >> So there is no way of using this MQ solution when bonding interface is >> used, right? Then modifying HTB might be the only solution? > > I probably can submit a bonding patch very soon if there is interest. Would definitely appreciate that. If you can share the patch it will be helpful as well. Let me know if I can help with this... > > Modifying HTB is way more complicated :( > > > ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: qdisc spin lock 2016-04-15 23:05 ` Michael Ma @ 2016-04-15 23:56 ` Eric Dumazet 0 siblings, 0 replies; 21+ messages in thread From: Eric Dumazet @ 2016-04-15 23:56 UTC (permalink / raw) To: Michael Ma; +Cc: Cong Wang, Linux Kernel Network Developers On Fri, 2016-04-15 at 16:05 -0700, Michael Ma wrote: > Would definitely appreciate that. If you can share the patch it will > be helpful as well. Let me know if I can help with this... Sure, here is what I am going to test : diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 941ec99cd3b6..f890ebe78629 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -4003,11 +4003,17 @@ static u16 bond_select_queue(struct net_device *dev, struct sk_buff *skb, * skb_tx_hash and will put the skbs in the queue we expect on their * way down to the bonding driver. */ - u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0; + struct bonding *bond = netdev_priv(dev); + u16 txq; /* Save the original txq to restore before passing to the driver */ qdisc_skb_cb(skb)->slave_dev_queue_mapping = skb->queue_mapping; + if (!bond->queue_id_selection_enabled) + return fallback(dev, skb); + + txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0; + if (unlikely(txq >= dev->real_num_tx_queues)) { do { txq -= dev->real_num_tx_queues; @@ -4020,7 +4026,8 @@ static netdev_tx_t __bond_start_xmit(struct sk_buff *skb, struct net_device *dev { struct bonding *bond = netdev_priv(dev); - if (bond_should_override_tx_queue(bond) && + if (bond->queue_id_selection_enabled && + bond_should_override_tx_queue(bond) && !bond_slave_override(bond, skb)) return NETDEV_TX_OK; diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c index 577e57cad1dc..1a8a5153f027 100644 --- a/drivers/net/bonding/bond_options.c +++ b/drivers/net/bonding/bond_options.c @@ -1270,6 +1270,7 @@ static int bond_option_ad_select_set(struct bonding *bond, static int bond_option_queue_id_set(struct bonding *bond, const struct bond_opt_value *newval) { + bool queue_id_selection_enabled = false; struct slave *slave, *update_slave; struct net_device *sdev; struct list_head *iter; @@ -1318,6 +1319,12 @@ static int bond_option_queue_id_set(struct bonding *bond, /* Actually set the qids for the slave */ update_slave->queue_id = qid; + bond_for_each_slave(bond, slave, iter) { + if (update_slave->queue_id) + queue_id_selection_enabled = true; + } + bond->queue_id_selection_enabled = queue_id_selection_enabled; + out: return ret; diff --git a/include/net/bonding.h b/include/net/bonding.h index 791800ddd6d9..29e29491c571 100644 --- a/include/net/bonding.h +++ b/include/net/bonding.h @@ -204,6 +204,7 @@ struct bonding { struct slave __rcu *primary_slave; struct bond_up_slave __rcu *slave_arr; /* Array of usable slaves */ bool force_primary; + bool queue_id_selection_enabled; s32 slave_cnt; /* never change this value outside the attach/detach wrappers */ int (*recv_probe)(const struct sk_buff *, struct bonding *, struct slave *); ^ permalink raw reply related [flat|nested] 21+ messages in thread
* Re: qdisc spin lock 2016-04-08 14:19 ` Eric Dumazet 2016-04-15 22:46 ` Michael Ma @ 2016-04-20 21:24 ` Michael Ma 2016-04-20 22:34 ` Eric Dumazet 1 sibling, 1 reply; 21+ messages in thread
From: Michael Ma @ 2016-04-20 21:24 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Cong Wang, Linux Kernel Network Developers

2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
> On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote:
>> I didn't really know that multiple qdiscs can be isolated using MQ so
>> that each txq can be associated with a particular qdisc. Also we don't
>> really have multiple interfaces...
>>
>> With this MQ solution we'll still need to assign transmit queues to
>> different classes by doing some math on the bandwidth limit if I
>> understand correctly, which seems to be less convenient compared with
>> a solution purely within HTB.
>>
>> I assume that with this solution I can still share qdisc among
>> multiple transmit queues - please let me know if this is not the case.
>
> Note that this MQ + HTB thing works well, unless you use a bonding
> device. (Or you need the MQ+HTB on the slaves, with no way of sharing
> tokens between the slaves)

Actually MQ+HTB works well for small packets - a flow of 512-byte packets can be throttled by HTB using one txq without being affected by other flows of small packets. However, I found that with this solution large packets (10k for example) only achieve very limited bandwidth. In my test I used MQ to assign one txq to an HTB whose rate is set to 1Gbit/s; 512-byte packets can reach the ceiling rate using 30 threads, but sending 10k packets from 10 threads achieves only 10 Mbit/s with the same TC configuration. If I increase the burst and cburst of HTB to some extremely large value (like 50MB), the ceiling rate can be hit.

The strange thing is that I don't see this problem when using HTB as the root, so the txq number seems to be a factor here - however it's really hard to understand why it would only affect larger packets. Is this a known issue? Any suggestion on how to investigate the issue further? Profiling shows that the CPU utilization is pretty low.

>
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=bb1d912323d5dd50e1079e389f4e964be14f0ae3
>
> bonding can not really be used as a true MQ device yet.
>
> I might send a patch to disable this 'bonding feature' if no slave sets
> a queue_id.
>

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: qdisc spin lock 2016-04-20 21:24 ` Michael Ma @ 2016-04-20 22:34 ` Eric Dumazet 2016-04-21 5:51 ` Michael Ma 0 siblings, 1 reply; 21+ messages in thread From: Eric Dumazet @ 2016-04-20 22:34 UTC (permalink / raw) To: Michael Ma; +Cc: Cong Wang, Linux Kernel Network Developers On Wed, 2016-04-20 at 14:24 -0700, Michael Ma wrote: > 2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>: > > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote: > >> I didn't really know that multiple qdiscs can be isolated using MQ so > >> that each txq can be associated with a particular qdisc. Also we don't > >> really have multiple interfaces... > >> > >> With this MQ solution we'll still need to assign transmit queues to > >> different classes by doing some math on the bandwidth limit if I > >> understand correctly, which seems to be less convenient compared with > >> a solution purely within HTB. > >> > >> I assume that with this solution I can still share qdisc among > >> multiple transmit queues - please let me know if this is not the case. > > > > Note that this MQ + HTB thing works well, unless you use a bonding > > device. (Or you need the MQ+HTB on the slaves, with no way of sharing > > tokens between the slaves) > > Actually MQ+HTB works well for small packets - like flow of 512 byte > packets can be throttled by HTB using one txq without being affected > by other flows with small packets. However I found using this solution > large packets (10k for example) will only achieve very limited > bandwidth. In my test I used MQ to assign one txq to a HTB which sets > rate at 1Gbit/s, 512 byte packets can achieve the ceiling rate by > using 30 threads. But sending 10k packets using 10 threads has only 10 > Mbit/s with the same TC configuration. If I increase burst and cburst > of HTB to some extreme large value (like 50MB) the ceiling rate can be > hit. > > The strange thing is that I don't see this problem when using HTB as > the root. So txq number seems to be a factor here - however it's > really hard to understand why would it only affect larger packets. Is > this a known issue? Any suggestion on how to investigate the issue > further? Profiling shows that the cpu utilization is pretty low. You could try perf record -a -g -e skb:kfree_skb sleep 5 perf report So that you see where the packets are dropped. Chances are that your UDP sockets SO_SNDBUF is too big, and packets are dropped at qdisc enqueue time, instead of having backpressure. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: qdisc spin lock 2016-04-20 22:34 ` Eric Dumazet @ 2016-04-21 5:51 ` Michael Ma 2016-04-21 12:41 ` Eric Dumazet 0 siblings, 1 reply; 21+ messages in thread
From: Michael Ma @ 2016-04-21 5:51 UTC (permalink / raw)
To: Eric Dumazet; +Cc: Cong Wang, Linux Kernel Network Developers

2016-04-20 15:34 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
> On Wed, 2016-04-20 at 14:24 -0700, Michael Ma wrote:
>> 2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>:
>> > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote:
>> >> I didn't really know that multiple qdiscs can be isolated using MQ so
>> >> that each txq can be associated with a particular qdisc. Also we don't
>> >> really have multiple interfaces...
>> >>
>> >> With this MQ solution we'll still need to assign transmit queues to
>> >> different classes by doing some math on the bandwidth limit if I
>> >> understand correctly, which seems to be less convenient compared with
>> >> a solution purely within HTB.
>> >>
>> >> I assume that with this solution I can still share qdisc among
>> >> multiple transmit queues - please let me know if this is not the case.
>> >
>> > Note that this MQ + HTB thing works well, unless you use a bonding
>> > device. (Or you need the MQ+HTB on the slaves, with no way of sharing
>> > tokens between the slaves)
>>
>> Actually MQ+HTB works well for small packets - like flow of 512 byte
>> packets can be throttled by HTB using one txq without being affected
>> by other flows with small packets. However I found using this solution
>> large packets (10k for example) will only achieve very limited
>> bandwidth. In my test I used MQ to assign one txq to a HTB which sets
>> rate at 1Gbit/s, 512 byte packets can achieve the ceiling rate by
>> using 30 threads. But sending 10k packets using 10 threads has only 10
>> Mbit/s with the same TC configuration. If I increase burst and cburst
>> of HTB to some extreme large value (like 50MB) the ceiling rate can be
>> hit.
>>
>> The strange thing is that I don't see this problem when using HTB as
>> the root. So txq number seems to be a factor here - however it's
>> really hard to understand why would it only affect larger packets. Is
>> this a known issue? Any suggestion on how to investigate the issue
>> further? Profiling shows that the cpu utilization is pretty low.
>
> You could try
>
> perf record -a -g -e skb:kfree_skb sleep 5
> perf report
>
> So that you see where the packets are dropped.
>
> Chances are that your UDP sockets SO_SNDBUF is too big, and packets are
> dropped at qdisc enqueue time, instead of having backpressure.
>

Thanks for the hint - how should I read the perf report? Also, we're using a TCP socket in this test - the TCP window size is set to 70kB.

  - 35.88% init     [kernel.kallsyms]  [k] intel_idle
       intel_idle
  - 15.83% strings  libc-2.5.so        [.] __GI___connect_internal
     - __GI___connect_internal
        - 50.00% get_mapping
             __nscd_get_map_ref
          50.00% __nscd_open_socket
  - 13.19% strings  libc-2.5.so        [.] __GI___libc_recvmsg
     - __GI___libc_recvmsg
        + 64.52% getifaddrs
        + 35.48% __check_pf
  - 10.55% strings  libc-2.5.so        [.] __sendto_nocancel
     - __sendto_nocancel
          100.00% 0

>
>

^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: qdisc spin lock 2016-04-21 5:51 ` Michael Ma @ 2016-04-21 12:41 ` Eric Dumazet 2016-04-21 22:12 ` Michael Ma 0 siblings, 1 reply; 21+ messages in thread From: Eric Dumazet @ 2016-04-21 12:41 UTC (permalink / raw) To: Michael Ma; +Cc: Cong Wang, Linux Kernel Network Developers On Wed, 2016-04-20 at 22:51 -0700, Michael Ma wrote: > 2016-04-20 15:34 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>: > > On Wed, 2016-04-20 at 14:24 -0700, Michael Ma wrote: > >> 2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>: > >> > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote: > >> >> I didn't really know that multiple qdiscs can be isolated using MQ so > >> >> that each txq can be associated with a particular qdisc. Also we don't > >> >> really have multiple interfaces... > >> >> > >> >> With this MQ solution we'll still need to assign transmit queues to > >> >> different classes by doing some math on the bandwidth limit if I > >> >> understand correctly, which seems to be less convenient compared with > >> >> a solution purely within HTB. > >> >> > >> >> I assume that with this solution I can still share qdisc among > >> >> multiple transmit queues - please let me know if this is not the case. > >> > > >> > Note that this MQ + HTB thing works well, unless you use a bonding > >> > device. (Or you need the MQ+HTB on the slaves, with no way of sharing > >> > tokens between the slaves) > >> > >> Actually MQ+HTB works well for small packets - like flow of 512 byte > >> packets can be throttled by HTB using one txq without being affected > >> by other flows with small packets. However I found using this solution > >> large packets (10k for example) will only achieve very limited > >> bandwidth. In my test I used MQ to assign one txq to a HTB which sets > >> rate at 1Gbit/s, 512 byte packets can achieve the ceiling rate by > >> using 30 threads. But sending 10k packets using 10 threads has only 10 > >> Mbit/s with the same TC configuration. If I increase burst and cburst > >> of HTB to some extreme large value (like 50MB) the ceiling rate can be > >> hit. > >> > >> The strange thing is that I don't see this problem when using HTB as > >> the root. So txq number seems to be a factor here - however it's > >> really hard to understand why would it only affect larger packets. Is > >> this a known issue? Any suggestion on how to investigate the issue > >> further? Profiling shows that the cpu utilization is pretty low. > > > > You could try > > > > perf record -a -g -e skb:kfree_skb sleep 5 > > perf report > > > > So that you see where the packets are dropped. > > > > Chances are that your UDP sockets SO_SNDBUF is too big, and packets are > > dropped at qdisc enqueue time, instead of having backpressure. > > > > Thanks for the hint - how should I read the perf report? Also we're > using TCP socket in this testing - TCP window size is set to 70kB. But how are you telling TCP to send 10k packets ? AFAIK you can not : TCP happily aggregates packets in write queue (see current MSG_EOR discussion) I suspect a bug in your tc settings. ^ permalink raw reply [flat|nested] 21+ messages in thread
* Re: qdisc spin lock 2016-04-21 12:41 ` Eric Dumazet @ 2016-04-21 22:12 ` Michael Ma 2016-04-25 17:29 ` Michael Ma 0 siblings, 1 reply; 21+ messages in thread From: Michael Ma @ 2016-04-21 22:12 UTC (permalink / raw) To: Eric Dumazet; +Cc: Cong Wang, Linux Kernel Network Developers 2016-04-21 5:41 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>: > On Wed, 2016-04-20 at 22:51 -0700, Michael Ma wrote: >> 2016-04-20 15:34 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>: >> > On Wed, 2016-04-20 at 14:24 -0700, Michael Ma wrote: >> >> 2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>: >> >> > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote: >> >> >> I didn't really know that multiple qdiscs can be isolated using MQ so >> >> >> that each txq can be associated with a particular qdisc. Also we don't >> >> >> really have multiple interfaces... >> >> >> >> >> >> With this MQ solution we'll still need to assign transmit queues to >> >> >> different classes by doing some math on the bandwidth limit if I >> >> >> understand correctly, which seems to be less convenient compared with >> >> >> a solution purely within HTB. >> >> >> >> >> >> I assume that with this solution I can still share qdisc among >> >> >> multiple transmit queues - please let me know if this is not the case. >> >> > >> >> > Note that this MQ + HTB thing works well, unless you use a bonding >> >> > device. (Or you need the MQ+HTB on the slaves, with no way of sharing >> >> > tokens between the slaves) >> >> >> >> Actually MQ+HTB works well for small packets - like flow of 512 byte >> >> packets can be throttled by HTB using one txq without being affected >> >> by other flows with small packets. However I found using this solution >> >> large packets (10k for example) will only achieve very limited >> >> bandwidth. In my test I used MQ to assign one txq to a HTB which sets >> >> rate at 1Gbit/s, 512 byte packets can achieve the ceiling rate by >> >> using 30 threads. But sending 10k packets using 10 threads has only 10 >> >> Mbit/s with the same TC configuration. If I increase burst and cburst >> >> of HTB to some extreme large value (like 50MB) the ceiling rate can be >> >> hit. >> >> >> >> The strange thing is that I don't see this problem when using HTB as >> >> the root. So txq number seems to be a factor here - however it's >> >> really hard to understand why would it only affect larger packets. Is >> >> this a known issue? Any suggestion on how to investigate the issue >> >> further? Profiling shows that the cpu utilization is pretty low. >> > >> > You could try >> > >> > perf record -a -g -e skb:kfree_skb sleep 5 >> > perf report >> > >> > So that you see where the packets are dropped. >> > >> > Chances are that your UDP sockets SO_SNDBUF is too big, and packets are >> > dropped at qdisc enqueue time, instead of having backpressure. >> > >> >> Thanks for the hint - how should I read the perf report? Also we're >> using TCP socket in this testing - TCP window size is set to 70kB. > > But how are you telling TCP to send 10k packets ? > We just write to the socket with 10k buffer and wait for a response from the server (using read()) before the next write. Using tcpdump I can see the 10k write is actually sent through 3 packets (7.3k/1.5k/1.3k). > AFAIK you can not : TCP happily aggregates packets in write queue > (see current MSG_EOR discussion) > > I suspect a bug in your tc settings. > > Could you help to check my tc setting? 
sudo tc qdisc add dev eth0 root mqprio num_tc 6 map 0 1 2 3 4 5 0 0 queues 19@0 1@19 1@20 1@21 1@22 1@23 hw 0 sudo tc qdisc add dev eth0 parent 805a:1a handle 8001:0 htb default 10 sudo tc class add dev eth0 parent 8001: classid 8001:10 htb rate 1000Mbit I didn't set r2q/burst/cburst/mtu/mpu so the default value should be used. ^ permalink raw reply [flat|nested] 21+ messages in thread
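Given the 1Gbit/s class above, one thing worth double-checking is the burst/cburst sizing: left at the defaults, HTB only accumulates roughly one timer tick's worth of tokens, which lines up with the earlier observation that raising burst/cburst to an extreme value like 50MB lets the class reach its ceiling. A back-of-the-envelope sizing might look like the following; the 10ms budget is an assumption for illustration, not a verified fix:

    # 1 Gbit/s for 10 ms  ->  1e9 * 0.010 / 8 = 1,250,000 bytes
    tc class change dev eth0 parent 8001: classid 8001:10 \
            htb rate 1000Mbit burst 1250k cburst 1250k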
* Re: qdisc spin lock 2016-04-21 22:12 ` Michael Ma @ 2016-04-25 17:29 ` Michael Ma 0 siblings, 0 replies; 21+ messages in thread From: Michael Ma @ 2016-04-25 17:29 UTC (permalink / raw) To: Eric Dumazet; +Cc: Cong Wang, Linux Kernel Network Developers 2016-04-21 15:12 GMT-07:00 Michael Ma <make0818@gmail.com>: > 2016-04-21 5:41 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>: >> On Wed, 2016-04-20 at 22:51 -0700, Michael Ma wrote: >>> 2016-04-20 15:34 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>: >>> > On Wed, 2016-04-20 at 14:24 -0700, Michael Ma wrote: >>> >> 2016-04-08 7:19 GMT-07:00 Eric Dumazet <eric.dumazet@gmail.com>: >>> >> > On Thu, 2016-03-31 at 16:48 -0700, Michael Ma wrote: >>> >> >> I didn't really know that multiple qdiscs can be isolated using MQ so >>> >> >> that each txq can be associated with a particular qdisc. Also we don't >>> >> >> really have multiple interfaces... >>> >> >> >>> >> >> With this MQ solution we'll still need to assign transmit queues to >>> >> >> different classes by doing some math on the bandwidth limit if I >>> >> >> understand correctly, which seems to be less convenient compared with >>> >> >> a solution purely within HTB. >>> >> >> >>> >> >> I assume that with this solution I can still share qdisc among >>> >> >> multiple transmit queues - please let me know if this is not the case. >>> >> > >>> >> > Note that this MQ + HTB thing works well, unless you use a bonding >>> >> > device. (Or you need the MQ+HTB on the slaves, with no way of sharing >>> >> > tokens between the slaves) >>> >> >>> >> Actually MQ+HTB works well for small packets - like flow of 512 byte >>> >> packets can be throttled by HTB using one txq without being affected >>> >> by other flows with small packets. However I found using this solution >>> >> large packets (10k for example) will only achieve very limited >>> >> bandwidth. In my test I used MQ to assign one txq to a HTB which sets >>> >> rate at 1Gbit/s, 512 byte packets can achieve the ceiling rate by >>> >> using 30 threads. But sending 10k packets using 10 threads has only 10 >>> >> Mbit/s with the same TC configuration. If I increase burst and cburst >>> >> of HTB to some extreme large value (like 50MB) the ceiling rate can be >>> >> hit. >>> >> >>> >> The strange thing is that I don't see this problem when using HTB as >>> >> the root. So txq number seems to be a factor here - however it's >>> >> really hard to understand why would it only affect larger packets. Is >>> >> this a known issue? Any suggestion on how to investigate the issue >>> >> further? Profiling shows that the cpu utilization is pretty low. >>> > >>> > You could try >>> > >>> > perf record -a -g -e skb:kfree_skb sleep 5 >>> > perf report >>> > >>> > So that you see where the packets are dropped. >>> > >>> > Chances are that your UDP sockets SO_SNDBUF is too big, and packets are >>> > dropped at qdisc enqueue time, instead of having backpressure. >>> > >>> >>> Thanks for the hint - how should I read the perf report? Also we're >>> using TCP socket in this testing - TCP window size is set to 70kB. >> >> But how are you telling TCP to send 10k packets ? >> > We just write to the socket with 10k buffer and wait for a response > from the server (using read()) before the next write. Using tcpdump I > can see the 10k write is actually sent through 3 packets > (7.3k/1.5k/1.3k). > >> AFAIK you can not : TCP happily aggregates packets in write queue >> (see current MSG_EOR discussion) >> >> I suspect a bug in your tc settings. 
>>
>>
>
> Could you help to check my tc setting?
>
> sudo tc qdisc add dev eth0 root mqprio num_tc 6 map 0 1 2 3 4 5 0 0
> queues 19@0 1@19 1@20 1@21 1@22 1@23 hw 0
> sudo tc qdisc add dev eth0 parent 805a:1a handle 8001:0 htb default 10
> sudo tc class add dev eth0 parent 8001: classid 8001:10 htb rate 1000Mbit
>
> I didn't set r2q/burst/cburst/mtu/mpu so the default value should be used.

Just to circle back on this - it seems there is sometimes a 200ms delay during the data push which stalls the sending:

01:34:44.046232 IP (tos 0x0, ttl 64, id 2863, offset 0, flags [DF], proto: TCP (6), length: 8740) 10.101.197.75.59126 > 10.101.197.105.redwood-broker: . 250025:258713(8688) ack 1901 win 58 <nop,nop,timestamp 507571833 196626529>
01:34:44.046304 IP (tos 0x0, ttl 64, id 15420, offset 0, flags [DF], proto: TCP (6), length: 52) 10.101.197.105.redwood-broker > 10.101.197.75.59126: ., cksum 0x187d (correct), 1901:1901(0) ack 258713 win 232 <nop,nop,timestamp 196626529 507571833>
01:34:44.247184 IP (tos 0x0, ttl 64, id 2869, offset 0, flags [DF], proto: TCP (6), length: 1364) 10.101.197.75.59126 > 10.101.197.105.redwood-broker: P 258713:260025(1312) ack 1901 win 58 <nop,nop,timestamp 507571833 196626529>
01:34:44.247186 IP (tos 0x0, ttl 64, id 2870, offset 0, flags [DF], proto: TCP (6), length: 1364) 10.101.197.75.59126 > 10.101.197.105.redwood-broker: P 258713:260025(1312) ack 1901 win 58 <nop,nop,timestamp 507572034 196626529>

At 44.046s there was an ack from the iperf server (10.101.197.105) for a previously sent packet of size 8740; then, after exactly 200 ms (44.247s above), two identical packets were pushed from the client (10.101.197.75).

It looks like some TCP timer is being triggered - however, disabling Nagle or delayed ack doesn't help. So maybe TC delayed the first packet and for some reason it was only sent 200 ms later, together with the retransmitted one. As I mentioned before, setting burst/cburst to 50MB eliminates this problem. Also, setting the TCP receive window on the server side to some value between 4k and 12k solves the issue - but from the tcpdump output this might just have caused the packet to be segmented further, so that it isn't gated as much by HTB. Using TCP window auto-scaling doesn't help.

It all looks like HTB delayed a packet when the rate limit was hit (I did the computation manually and the timing matches), and instead of the packet being sent via a TC timer (which should be much less than 200ms - 100ms?), it was sent only when TCP decided to retransmit the same packet.

^ permalink raw reply [flat|nested] 21+ messages in thread
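One way to test the "HTB is holding the packet" theory without changing the configuration is to watch the class and qdisc counters while the stall reproduces; a steadily growing overlimits count with a small backlog and no drops would point at token starvation or timer granularity rather than real queueing. The observation loop below is generic and assumes nothing beyond the eth0 setup shown earlier:

    watch -d -n 0.5 'tc -s -d class show dev eth0; tc -s qdisc show dev eth0'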