* Tun congestion/BQL
@ 2019-04-10 12:01 David Woodhouse
  2019-04-10 13:01 ` David Woodhouse
  0 siblings, 1 reply; 17+ messages in thread
From: David Woodhouse @ 2019-04-10 12:01 UTC (permalink / raw)
  To: netdev

I've been working on OpenConnect VPN performance. After fixing some
local stupidities I am basically crypto-bound as I suck packets out of
the tun device and feed them out over the public network as fast as the
crypto library can encrypt them.

However, the tun device is dropping packets.

I'm testing with an ESP setup that the kernel happens to support. If I
do netperf UDP_STREAM testing with the kernel doing ESP, I get this:


Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1400   10.00     1198093      0    1341.86
212992           10.00     1198044           1341.80


Change to doing it in userspace through the tun device, though, and it
looks more like this:

Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1400   10.00     8194693      0    9178.04
212992           10.00     1536155           1720.49

The discrepancy between sent and received packets is all seen as packet
loss on the tun0 interface, where userspace is not reading the packets
out fast enough:

$ netstat -i
Kernel Interface table
Iface             MTU    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0             9001 56790193      0 1718127 0      129849068      0      0      0 BMRU
lo              65536       42      0      0 0            42      0      0      0 LRU
tun0             1500        9      0      0 0       1546968      0 6647739      0 MOPRU


So... I threw together something to stop the queue when the tx_ring was
full (which I know is incomplete but was enough for my test)


diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index e9ca1c088d0b..a15fca23ef45 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1125,7 +1128,9 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	if (tfile->flags & TUN_FASYNC)
 		kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
 	tfile->socket.sk->sk_data_ready(tfile->socket.sk);
 
+	if (!ptr_ring_empty(&tfile->tx_ring))
+		netif_stop_queue(tun->dev);
 	rcu_read_unlock();
 	return NETDEV_TX_OK;
 
@@ -2237,7 +2239,7 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 		else
 			consume_skb(skb);
 	}
-
+	netif_wake_queue(tun->dev);
 	return ret;
 }
 


So now netperf doesn't send lots of packets that get dropped by the tun
device. But it doesn't send anywhere near as many packets successfully,
either...

Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1400   10.00     1250223      0    1400.25
212992           10.00     1245458           1394.91

That's actually dropped me back to the performance I was getting with
the kernel's ESP implementation. Is that something we should expect?

I don't think it's purely the overhead I've added in the driver. If I
leave the netif_wake_queue() in place and add '&& 0' to the condition
for the netif_stop_queue(), which still leaves the locking in
ptr_ring_empty() happening, the performance goes back up.

What's going on? Am I actually better off letting it drop packets
silently?


* Re: Tun congestion/BQL
  2019-04-10 12:01 Tun congestion/BQL David Woodhouse
@ 2019-04-10 13:01 ` David Woodhouse
  2019-04-10 13:25   ` Jason Wang
  0 siblings, 1 reply; 17+ messages in thread
From: David Woodhouse @ 2019-04-10 13:01 UTC (permalink / raw)
  To: netdev

On Wed, 2019-04-10 at 15:01 +0300, David Woodhouse wrote:
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1125,7 +1128,9 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>         if (tfile->flags & TUN_FASYNC)
>                 kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
>         tfile->socket.sk->sk_data_ready(tfile->socket.sk);
>  
> +       if (!ptr_ring_empty(&tfile->tx_ring))
> +               netif_stop_queue(tun->dev);
>         rcu_read_unlock();
>         return NETDEV_TX_OK;
>  
> 

Hm, that should be using ptr_ring_full() shouldn't it? So...

--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1121,6 +1121,9 @@ static netdev_tx_t tun_net_xmit(struct s
 	if (ptr_ring_produce(&tfile->tx_ring, skb))
 		goto drop;
 
+	if (ptr_ring_full(&tfile->tx_ring))
+		netif_stop_queue(tun->dev);
+
 	/* Notify and wake up reader process */
 	if (tfile->flags & TUN_FASYNC)
 		kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
@@ -2229,6 +2232,7 @@ static ssize_t tun_do_read(struct tun_st
 			consume_skb(skb);
 	}
 
+	netif_wake_queue(tun->dev);
 	return ret;
 }
 

That doesn't seem to make much difference at all; it's still dropping a
lot of packets because ptr_ring_produce() is returning non-zero.


Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1400   10.00     7747169      0    8676.81
212992           10.00     1471769           1648.38

Making it call netif_stop_queue() when ptr_ring_produce() fails causes
it to perform even worse...

Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1400   10.00     1906235      0    2134.98
212992           10.00      985428           1103.68


At this point I'm mostly just confused. 


* Re: Tun congestion/BQL
  2019-04-10 13:01 ` David Woodhouse
@ 2019-04-10 13:25   ` Jason Wang
  2019-04-10 13:42     ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Wang @ 2019-04-10 13:25 UTC (permalink / raw)
  To: David Woodhouse, netdev


On 2019/4/10 9:01 PM, David Woodhouse wrote:
> On Wed, 2019-04-10 at 15:01 +0300, David Woodhouse wrote:
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -1125,7 +1128,9 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>          if (tfile->flags & TUN_FASYNC)
>>                  kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
>>          tfile->socket.sk->sk_data_ready(tfile->socket.sk);
>>   
>> +       if (!ptr_ring_empty(&tfile->tx_ring))
>> +               netif_stop_queue(tun->dev);
>>          rcu_read_unlock();
>>          return NETDEV_TX_OK;
>>   
>>
> Hm, that should be using ptr_ring_full() shouldn't it? So...
>
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1121,6 +1121,9 @@ static netdev_tx_t tun_net_xmit(struct s
>   	if (ptr_ring_produce(&tfile->tx_ring, skb))
>   		goto drop;
>   
> +	if (ptr_ring_full(&tfile->tx_ring))
> +		netif_stop_queue(tun->dev);
> +
>   	/* Notify and wake up reader process */
>   	if (tfile->flags & TUN_FASYNC)
>   		kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
> @@ -2229,6 +2232,7 @@ static ssize_t tun_do_read(struct tun_st
>   			consume_skb(skb);
>   	}
>   
> +	netif_wake_queue(tun->dev);
>   	return ret;
>   }
>   
>
> That doesn't seem to make much difference at all; it's still dropping a
> lot of packets because ptr_ring_produce() is returning non-zero.


I think you need to try to stop the queue just in this case? Ideally we
may want to stop the queue when the queue is about to be full, but we
don't have such a helper currently.

Thanks


>
>
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
>
> 212992    1400   10.00     7747169      0    8676.81
> 212992           10.00     1471769           1648.38
>
> Making it call netif_stop_queue() when ptr_ring_produce() fails causes
> it to perform even worse...
>
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
>
> 212992    1400   10.00     1906235      0    2134.98
> 212992           10.00      985428           1103.68
>
>
> At this point I'm mostly just confused.



* Re: Tun congestion/BQL
  2019-04-10 13:25   ` Jason Wang
@ 2019-04-10 13:42     ` Toke Høiland-Jørgensen
  2019-04-10 14:33       ` David Woodhouse
  2019-04-11  7:01       ` Jason Wang
  0 siblings, 2 replies; 17+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-04-10 13:42 UTC (permalink / raw)
  To: Jason Wang, David Woodhouse, netdev

Jason Wang <jasowang@redhat.com> writes:

> On 2019/4/10 9:01 PM, David Woodhouse wrote:
>> On Wed, 2019-04-10 at 15:01 +0300, David Woodhouse wrote:
>>> --- a/drivers/net/tun.c
>>> +++ b/drivers/net/tun.c
>>> @@ -1125,7 +1128,9 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>          if (tfile->flags & TUN_FASYNC)
>>>                  kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
>>>          tfile->socket.sk->sk_data_ready(tfile->socket.sk);
>>>   
>>> +       if (!ptr_ring_empty(&tfile->tx_ring))
>>> +               netif_stop_queue(tun->dev);
>>>          rcu_read_unlock();
>>>          return NETDEV_TX_OK;
>>>   
>>>
>> Hm, that should be using ptr_ring_full() shouldn't it? So...
>>
>> --- a/drivers/net/tun.c
>> +++ b/drivers/net/tun.c
>> @@ -1121,6 +1121,9 @@ static netdev_tx_t tun_net_xmit(struct s
>>   	if (ptr_ring_produce(&tfile->tx_ring, skb))
>>   		goto drop;
>>   
>> +	if (ptr_ring_full(&tfile->tx_ring))
>> +		netif_stop_queue(tun->dev);
>> +
>>   	/* Notify and wake up reader process */
>>   	if (tfile->flags & TUN_FASYNC)
>>   		kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
>> @@ -2229,6 +2232,7 @@ static ssize_t tun_do_read(struct tun_st
>>   			consume_skb(skb);
>>   	}
>>   
>> +	netif_wake_queue(tun->dev);
>>   	return ret;
>>   }
>>   
>>
>> That doesn't seem to make much difference at all; it's still dropping a
>> lot of packets because ptr_ring_produce() is returning non-zero.
>
>
> I think you need try to stop the queue just in this case? Ideally we may 
> want to stop the queue when the queue is about to full, but we don't 
> have such helper currently.

Ideally we want to react when the queue starts building rather than when
it starts getting full; by pushing back on upper layers (or, if
forwarding, dropping packets to signal congestion).

In practice, this means tuning the TX ring to the *minimum* size it can
be without starving (this is basically what BQL does for Ethernet), and
keeping packets queued in the qdisc layer instead, where it can be
managed...
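
(For an Ethernet driver the BQL side of that is basically just the two
accounting calls below; the queue index and byte counts are of course
driver-specific, this is only a sketch of the shape:

	/* TX path, after handing the packet to the hardware ring: */
	netdev_tx_sent_queue(netdev_get_tx_queue(dev, qid), skb->len);

	/* TX completion path, after reclaiming descriptors: */
	netdev_tx_completed_queue(netdev_get_tx_queue(dev, qid),
				  pkts_completed, bytes_completed);

BQL then throttles the stack-facing queue based on how many bytes are
still in flight, so the driver ring never needs to hold more than that.)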

-Toke


* Re: Tun congestion/BQL
  2019-04-10 13:42     ` Toke Høiland-Jørgensen
@ 2019-04-10 14:33       ` David Woodhouse
  2019-04-10 15:01         ` Toke Høiland-Jørgensen
  2019-04-11  7:17         ` Jason Wang
  2019-04-11  7:01       ` Jason Wang
  1 sibling, 2 replies; 17+ messages in thread
From: David Woodhouse @ 2019-04-10 14:33 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Jason Wang, netdev

On Wed, 2019-04-10 at 15:42 +0200, Toke Høiland-Jørgensen wrote:
> > > That doesn't seem to make much difference at all; it's still dropping a
> > > lot of packets because ptr_ring_produce() is returning non-zero.
> > 
> > 
> > I think you need try to stop the queue just in this case? Ideally we may 
> > want to stop the queue when the queue is about to full, but we don't 
> > have such helper currently.

I don't quite understand. If the ring isn't full after I've put a
packet into it... how can it be full subsequently? We can't end up in
tun_net_xmit() concurrently, right? I'm not (knowingly) using XDP.

> Ideally we want to react when the queue starts building rather than when
> it starts getting full; by pushing back on upper layers (or, if
> forwarding, dropping packets to signal congestion).

This is precisely what my first accidental if (!ptr_ring_empty())
variant was doing, right? :)

> In practice, this means tuning the TX ring to the *minimum* size it can
> be without starving (this is basically what BQL does for Ethernet), and
> keeping packets queued in the qdisc layer instead, where it can be
> managed...

I was going to add BQL (as $SUBJECT may have caused you to infer) but
trivially adding the netdev_sent_queue() in tun_net_xmit() and
netdev_completed_queue() for xdp vs. skb in tun_do_read() was tripping
the BUG in dql_completed(). I just ripped that part out and focused on
the queue stop/start and haven't gone back to it yet.
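
For reference, the trivial version was roughly this shape (a sketch, not
the exact diff I tested; it ignores the per-queue txq and just uses the
whole-device helpers):

--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ ... @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	if (ptr_ring_produce(&tfile->tx_ring, skb))
 		goto drop;
 
+	/* account bytes handed to the ring */
+	netdev_sent_queue(tun->dev, skb->len);
+
 	/* Notify and wake up reader process */
@@ ... @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 			consume_skb(skb);
 	}
 
+	/* account bytes userspace has now consumed */
+	if (ret > 0)
+		netdev_completed_queue(tun->dev, 1, ret);
 	return ret;
 }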




* Re: Tun congestion/BQL
  2019-04-10 14:33       ` David Woodhouse
@ 2019-04-10 15:01         ` Toke Høiland-Jørgensen
  2019-04-10 15:32           ` David Woodhouse
  2019-04-11  7:17         ` Jason Wang
  1 sibling, 1 reply; 17+ messages in thread
From: Toke Høiland-Jørgensen @ 2019-04-10 15:01 UTC (permalink / raw)
  To: David Woodhouse, Jason Wang, netdev

David Woodhouse <dwmw2@infradead.org> writes:

> On Wed, 2019-04-10 at 15:42 +0200, Toke Høiland-Jørgensen wrote:
>> > > That doesn't seem to make much difference at all; it's still dropping a
>> > > lot of packets because ptr_ring_produce() is returning non-zero.
>> > 
>> > 
>> > I think you need try to stop the queue just in this case? Ideally we may 
>> > want to stop the queue when the queue is about to full, but we don't 
>> > have such helper currently.
>
> I don't quite understand. If the ring isn't full after I've put a
> packet into it... how can it be full subsequently? We can't end up in
> tun_net_xmit() concurrently, right? I'm not (knowingly) using XDP.
>
>> Ideally we want to react when the queue starts building rather than when
>> it starts getting full; by pushing back on upper layers (or, if
>> forwarding, dropping packets to signal congestion).
>
> This is precisely what my first accidental if (!ptr_ring_empty())
> variant was doing, right? :)

Yeah, I guess. But maybe a bit too aggressively? How are you processing
packets on the dequeue side (for crypto)? One at a time, or is there
some kind of batching in play?

>> In practice, this means tuning the TX ring to the *minimum* size it can
>> be without starving (this is basically what BQL does for Ethernet), and
>> keeping packets queued in the qdisc layer instead, where it can be
>> managed...
>
> I was going to add BQL (as $SUBJECT may have caused you to infer) but
> trivially adding the netdev_sent_queue() in tun_net_xmit() and
> netdev_completed_queue() for xdp vs. skb in tun_do_read() was tripping
> the BUG in dql_completed(). I just ripped that part out and focused on
> the queue stop/start and haven't gone back to it yet.

Right, makes sense. What qdisc are you running on the tun device? Also,
I assume that netperf is running on the same host that has the tun
device on it, right?

-Toke


* Re: Tun congestion/BQL
  2019-04-10 15:01         ` Toke Høiland-Jørgensen
@ 2019-04-10 15:32           ` David Woodhouse
  2019-04-11  7:22             ` Jason Wang
  0 siblings, 1 reply; 17+ messages in thread
From: David Woodhouse @ 2019-04-10 15:32 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Jason Wang, netdev

On Wed, 2019-04-10 at 17:01 +0200, Toke Høiland-Jørgensen wrote:
> David Woodhouse <dwmw2@infradead.org> writes:
> 
> > On Wed, 2019-04-10 at 15:42 +0200, Toke Høiland-Jørgensen wrote:
> > > > > That doesn't seem to make much difference at all; it's still dropping a
> > > > > lot of packets because ptr_ring_produce() is returning non-zero.
> > > > 
> > > > 
> > > > I think you need try to stop the queue just in this case? Ideally we may 
> > > > want to stop the queue when the queue is about to full, but we don't 
> > > > have such helper currently.
> > 
> > I don't quite understand. If the ring isn't full after I've put a
> > packet into it... how can it be full subsequently? We can't end up in
> > tun_net_xmit() concurrently, right? I'm not (knowingly) using XDP.
> > 
> > > Ideally we want to react when the queue starts building rather than when
> > > it starts getting full; by pushing back on upper layers (or, if
> > > forwarding, dropping packets to signal congestion).
> > 
> > This is precisely what my first accidental if (!ptr_ring_empty())
> > variant was doing, right? :)
> 
> Yeah, I guess. But maybe a too aggressively? How are you processing
> packets on the dequeue side (for crypto)? One at a time, or is there
> some kind of batching in play?

Slight batching. The main loop in OpenConnect will suck packets out of
the tun device until its queue is "full", which by default is 10
packets but tweaking that makes little difference at all to my testing
until I take it below 3.

(Until fairly recently, I was *ignoring* the result of sendto() on the
UDP side, which meant that I was wasting time encrypting packets that
got dropped. Now I react appropriately to -EAGAIN (-ENOBUFS?) on the
sending side, and I don't pull any more packets from the tun device
until my packet queue is no longer "full". The latest 8.02 release of
OpenConnect still has the old behaviour.)
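
The shape of the loop is roughly this (a simplified sketch with made-up
names; encrypt_one() is just a stand-in for the ESP encryption step, not
the actual OpenConnect code):

	#include <errno.h>
	#include <poll.h>
	#include <string.h>
	#include <sys/socket.h>
	#include <unistd.h>

	#define QUEUE_MAX 10		/* the "full" threshold mentioned above */
	#define PKT_MAX   2048

	struct pkt { unsigned char buf[PKT_MAX]; int len; };

	extern int encrypt_one(const struct pkt *in, unsigned char *out, int outlen);

	static void mainloop(int tun_fd, int udp_fd)	/* both O_NONBLOCK, udp_fd connect()ed */
	{
		struct pkt q[QUEUE_MAX];
		unsigned char esp[PKT_MAX + 256];
		int nq = 0;

		for (;;) {
			/* Pull from tun only while our own queue has room. */
			while (nq < QUEUE_MAX) {
				int len = read(tun_fd, q[nq].buf, PKT_MAX);
				if (len <= 0)
					break;			/* tun drained */
				q[nq++].len = len;
			}

			/* Encrypt and send; on EAGAIN/ENOBUFS stop and keep the queue. */
			while (nq) {
				int len = encrypt_one(&q[0], esp, sizeof(esp));
				if (send(udp_fd, esp, len, 0) < 0 &&
				    (errno == EAGAIN || errno == ENOBUFS))
					break;
				memmove(&q[0], &q[1], --nq * sizeof(q[0]));
			}

			struct pollfd pfd[2] = {
				{ .fd = tun_fd, .events = nq < QUEUE_MAX ? POLLIN : 0 },
				{ .fd = udp_fd, .events = nq ? POLLOUT : 0 },
			};
			poll(pfd, 2, -1);
		}
	}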


> > > In practice, this means tuning the TX ring to the *minimum* size it can
> > > be without starving (this is basically what BQL does for Ethernet), and
> > > keeping packets queued in the qdisc layer instead, where it can be
> > > managed...
> > 
> > I was going to add BQL (as $SUBJECT may have caused you to infer) but
> > trivially adding the netdev_sent_queue() in tun_net_xmit() and
> > netdev_completed_queue() for xdp vs. skb in tun_do_read() was tripping
> > the BUG in dql_completed(). I just ripped that part out and focused on
> > the queue stop/start and haven't gone back to it yet.
> 
> Right, makes sense. What qdisc are you running on the tun device? Also,
> I assume that netperf is running on the same host that has the tun
> device on it, right?

I haven't looked at the qdisc, so it's fq_codel, which it got as
standard. And yes, netperf is running on the same host.

My test setup is here (inline in the first, refined and attached to the
second):

https://lists.infradead.org/pipermail/openconnect-devel/2019-April/005261.html
https://lists.infradead.org/pipermail/openconnect-devel/2019-April/005265.html

Basically, I started by setting up in-kernel ESP over UDP between the
two hosts, then tricked OpenConnect into thinking there was a real VPN
server on one of them.



* Re: Tun congestion/BQL
  2019-04-10 13:42     ` Toke Høiland-Jørgensen
  2019-04-10 14:33       ` David Woodhouse
@ 2019-04-11  7:01       ` Jason Wang
  1 sibling, 0 replies; 17+ messages in thread
From: Jason Wang @ 2019-04-11  7:01 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, David Woodhouse, netdev


On 2019/4/10 9:42 PM, Toke Høiland-Jørgensen wrote:
> Jason Wang <jasowang@redhat.com> writes:
>
>> On 2019/4/10 9:01 PM, David Woodhouse wrote:
>>> On Wed, 2019-04-10 at 15:01 +0300, David Woodhouse wrote:
>>>> --- a/drivers/net/tun.c
>>>> +++ b/drivers/net/tun.c
>>>> @@ -1125,7 +1128,9 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>>>>           if (tfile->flags & TUN_FASYNC)
>>>>                   kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
>>>>           tfile->socket.sk->sk_data_ready(tfile->socket.sk);
>>>>    
>>>> +       if (!ptr_ring_empty(&tfile->tx_ring))
>>>> +               netif_stop_queue(tun->dev);
>>>>           rcu_read_unlock();
>>>>           return NETDEV_TX_OK;
>>>>    
>>>>
>>> Hm, that should be using ptr_ring_full() shouldn't it? So...
>>>
>>> --- a/drivers/net/tun.c
>>> +++ b/drivers/net/tun.c
>>> @@ -1121,6 +1121,9 @@ static netdev_tx_t tun_net_xmit(struct s
>>>    	if (ptr_ring_produce(&tfile->tx_ring, skb))
>>>    		goto drop;
>>>    
>>> +	if (ptr_ring_full(&tfile->tx_ring))
>>> +		netif_stop_queue(tun->dev);
>>> +
>>>    	/* Notify and wake up reader process */
>>>    	if (tfile->flags & TUN_FASYNC)
>>>    		kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
>>> @@ -2229,6 +2232,7 @@ static ssize_t tun_do_read(struct tun_st
>>>    			consume_skb(skb);
>>>    	}
>>>    
>>> +	netif_wake_queue(tun->dev);
>>>    	return ret;
>>>    }
>>>    
>>>
>>> That doesn't seem to make much difference at all; it's still dropping a
>>> lot of packets because ptr_ring_produce() is returning non-zero.
>>
>> I think you need try to stop the queue just in this case? Ideally we may
>> want to stop the queue when the queue is about to full, but we don't
>> have such helper currently.
> Ideally we want to react when the queue starts building rather than when
> it starts getting full; by pushing back on upper layers (or, if
> forwarding, dropping packets to signal congestion).
>
> In practice, this means tuning the TX ring to the *minimum* size it can
> be without starving (this is basically what BQL does for Ethernet), and
> keeping packets queued in the qdisc layer instead, where it can be
> managed...
>
> -Toke


Yes, but we do have users that don't use a qdisc at all.

Thanks



* Re: Tun congestion/BQL
  2019-04-10 14:33       ` David Woodhouse
  2019-04-10 15:01         ` Toke Høiland-Jørgensen
@ 2019-04-11  7:17         ` Jason Wang
  2019-04-11  8:56           ` David Woodhouse
  1 sibling, 1 reply; 17+ messages in thread
From: Jason Wang @ 2019-04-11  7:17 UTC (permalink / raw)
  To: David Woodhouse, Toke Høiland-Jørgensen, netdev
  Cc: Michael S. Tsirkin


On 2019/4/10 10:33 PM, David Woodhouse wrote:
> On Wed, 2019-04-10 at 15:42 +0200, Toke Høiland-Jørgensen wrote:
>>>> That doesn't seem to make much difference at all; it's still dropping a
>>>> lot of packets because ptr_ring_produce() is returning non-zero.
>>>
>>> I think you need try to stop the queue just in this case? Ideally we may
>>> want to stop the queue when the queue is about to full, but we don't
>>> have such helper currently.
> I don't quite understand. If the ring isn't full after I've put a
> packet into it... how can it be full subsequently? We can't end up in
> tun_net_xmit() concurrently, right?


Note that NETIF_F_LLTX is used for tun. But this reminds me of a commit
from Michael:

5d097109257c03a71845729f8db6b5770c4bbedc ("tun: only queue packets on
device"). It looks like you want to recover the IFF_ONE_QUEUE
behaviour. But that commit is kind of confusing, since we do
skb_orphan_frags() anyway. Maybe Michael can say more on this.


> I'm not (knowingly) using XDP.
>
>> Ideally we want to react when the queue starts building rather than when
>> it starts getting full; by pushing back on upper layers (or, if
>> forwarding, dropping packets to signal congestion).
> This is precisely what my first accidental if (!ptr_ring_empty())
> variant was doing, right? :)


But I gave your ptr_ring_full() patch a try in a VM, and it looks like
it works (single flow): no packets were dropped by TAP anymore. How
many flows did you use?


>
>> In practice, this means tuning the TX ring to the *minimum* size it can
>> be without starving (this is basically what BQL does for Ethernet), and
>> keeping packets queued in the qdisc layer instead, where it can be
>> managed...
> I was going to add BQL (as $SUBJECT may have caused you to infer) but
> trivially adding the netdev_sent_queue() in tun_net_xmit() and
> netdev_completed_queue() for xdp vs. skb in tun_do_read() was tripping
> the BUG in dql_completed().


Something like https://lists.openwall.net/netdev/2012/11/12/67 ?

Thanks


>   I just ripped that part out and focused on
> the queue stop/start and haven't gone back to it yet.
>
>


* Re: Tun congestion/BQL
  2019-04-10 15:32           ` David Woodhouse
@ 2019-04-11  7:22             ` Jason Wang
  2019-04-11  9:25               ` David Woodhouse
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Wang @ 2019-04-11  7:22 UTC (permalink / raw)
  To: David Woodhouse, Toke Høiland-Jørgensen, netdev


On 2019/4/10 11:32 PM, David Woodhouse wrote:
> On Wed, 2019-04-10 at 17:01 +0200, Toke Høiland-Jørgensen wrote:
David Woodhouse <dwmw2@infradead.org> writes:
>>
>>> On Wed, 2019-04-10 at 15:42 +0200, Toke Høiland-Jørgensen wrote:
>>>>>> That doesn't seem to make much difference at all; it's still dropping a
>>>>>> lot of packets because ptr_ring_produce() is returning non-zero.
>>>>> I think you need try to stop the queue just in this case? Ideally we may
>>>>> want to stop the queue when the queue is about to full, but we don't
>>>>> have such helper currently.
>>> I don't quite understand. If the ring isn't full after I've put a
>>> packet into it... how can it be full subsequently? We can't end up in
>>> tun_net_xmit() concurrently, right? I'm not (knowingly) using XDP.
>>>
>>>> Ideally we want to react when the queue starts building rather than when
>>>> it starts getting full; by pushing back on upper layers (or, if
>>>> forwarding, dropping packets to signal congestion).
>>> This is precisely what my first accidental if (!ptr_ring_empty())
>>> variant was doing, right? :)
>> Yeah, I guess. But maybe a too aggressively? How are you processing
>> packets on the dequeue side (for crypto)? One at a time, or is there
>> some kind of batching in play?
> Slight batching. The main loop in OpenConnect will suck packets out of
> the tun device until its queue is "full", which by default is 10
> packets but tweaking that makes little difference at all to my testing
> until I take it below 3.
>
> (Until fairly recently, I was *ignoring* the result of sendto() on the
> UDP side, which meant that I was wasting time encrypting packets that
> got dropped. Now I react appropriately to -EAGAIN (-ENOBUFS?) on the
> sending side, and I don't pull any more packets from the tun device
> until my packet queue is no longer "full". The latest 8.02 release of
> OpenConnect still has that behaviour.)
>
>

If you care about userspace performance, you may want to try vhost + TAP 
instead. I admit the API is not user friendly and needs to be improved, 
but then there will be no syscall overhead on packet transmission and 
reception, and eventfd will be used for notification.

Thanks




* Re: Tun congestion/BQL
  2019-04-11  7:17         ` Jason Wang
@ 2019-04-11  8:56           ` David Woodhouse
  2019-04-11  9:04             ` Jason Wang
  0 siblings, 1 reply; 17+ messages in thread
From: David Woodhouse @ 2019-04-11  8:56 UTC (permalink / raw)
  To: Jason Wang, Toke Høiland-Jørgensen, netdev; +Cc: Michael S. Tsirkin

On Thu, 2019-04-11 at 15:17 +0800, Jason Wang wrote:
> > > Ideally we want to react when the queue starts building rather than when
> > > it starts getting full; by pushing back on upper layers (or, if
> > > forwarding, dropping packets to signal congestion).
> > 
> > This is precisely what my first accidental if (!ptr_ring_empty())
> > variant was doing, right? :)
> 
> 
> But I give a try on your ptr_ring_full() patch on VM, looks like it 
> works (single flow), no packets were dropped by TAP anymore. How many 
> flows did you use?

Hm, I thought I was only using one. This is just a simple case of
userspace opening /dev/net/tun, TUNSETIFF, and reading/writing.

But if I was stopping the *wrong* queue that might explain things.

This is a persistent tun device.
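
i.e. nothing more exotic than the usual sequence, roughly this (error
handling omitted, and "tun0" standing in for whatever the persistent
device is actually called):

	#include <fcntl.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/if.h>
	#include <linux/if_tun.h>

	int tun_open(void)
	{
		struct ifreq ifr;
		int fd = open("/dev/net/tun", O_RDWR);

		memset(&ifr, 0, sizeof(ifr));
		ifr.ifr_flags = IFF_TUN | IFF_NO_PI;
		strncpy(ifr.ifr_name, "tun0", IFNAMSIZ - 1);
		ioctl(fd, TUNSETIFF, &ifr);

		return fd;	/* then just read()/write() packets on fd */
	}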

> 
> > 
> > > In practice, this means tuning the TX ring to the *minimum* size it can
> > > be without starving (this is basically what BQL does for Ethernet), and
> > > keeping packets queued in the qdisc layer instead, where it can be
> > > managed...
> > 
> > I was going to add BQL (as $SUBJECT may have caused you to infer) but
> > trivially adding the netdev_sent_queue() in tun_net_xmit() and
> > netdev_completed_queue() for xdp vs. skb in tun_do_read() was tripping
> > the BUG in dql_completed().
> 
> 
> Something like https://lists.openwall.net/netdev/2012/11/12/6767 ?

Fairly much.

Except again I was being lazy for the proof-of-concept, ignoring 'txq'
and just using netdev_sent_queue() etc.



* Re: Tun congestion/BQL
  2019-04-11  8:56           ` David Woodhouse
@ 2019-04-11  9:04             ` Jason Wang
  2019-04-11  9:16               ` David Woodhouse
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Wang @ 2019-04-11  9:04 UTC (permalink / raw)
  To: David Woodhouse, Toke Høiland-Jørgensen, netdev
  Cc: Michael S. Tsirkin


On 2019/4/11 4:56 PM, David Woodhouse wrote:
> On Thu, 2019-04-11 at 15:17 +0800, Jason Wang wrote:
>>>> Ideally we want to react when the queue starts building rather than when
>>>> it starts getting full; by pushing back on upper layers (or, if
>>>> forwarding, dropping packets to signal congestion).
>>> This is precisely what my first accidental if (!ptr_ring_empty())
>>> variant was doing, right? :)
>>
>> But I give a try on your ptr_ring_full() patch on VM, looks like it
>> works (single flow), no packets were dropped by TAP anymore. How many
>> flows did you use?
> Hm, I thought I was only using one. This is just a simple case of
> userspace opening /dev/net/tun, TUNSETIFF, and reading/writing.
>
> But if I was stopping the *wrong* queue that might explain things.


Btw, I forgot to mention: I modified your patch to use 
netif_stop/wake_subqueue() instead.

Thanks


>
> This is a persistent tun device.
>
>>>> In practice, this means tuning the TX ring to the *minimum* size it can
>>>> be without starving (this is basically what BQL does for Ethernet), and
>>>> keeping packets queued in the qdisc layer instead, where it can be
>>>> managed...
>>> I was going to add BQL (as $SUBJECT may have caused you to infer) but
>>> trivially adding the netdev_sent_queue() in tun_net_xmit() and
>>> netdev_completed_queue() for xdp vs. skb in tun_do_read() was tripping
>>> the BUG in dql_completed().
>>
>> Something like https://lists.openwall.net/netdev/2012/11/12/6767 ?
> Fairly much.
>
> Except again I was being lazy for the proof-of-concept, ignoring 'txq'
> and just using netdev_sent_queue() etc.
>


* Re: Tun congestion/BQL
  2019-04-11  9:04             ` Jason Wang
@ 2019-04-11  9:16               ` David Woodhouse
  2019-04-12  4:23                 ` Jason Wang
  0 siblings, 1 reply; 17+ messages in thread
From: David Woodhouse @ 2019-04-11  9:16 UTC (permalink / raw)
  To: Jason Wang, Toke Høiland-Jørgensen, netdev; +Cc: Michael S. Tsirkin

On Thu, 2019-04-11 at 17:04 +0800, Jason Wang wrote:
> Btw, forget to mention, I modify your patch to use 
> netif_stop/wake_subqueue() instead.

Hm...


--- /usr/src/debug/kernel-5.0.fc29/linux-5.0.5-200.fc29.x86_64/drivers/net/tun.c	2019-03-03 23:21:29.000000000 +0000
+++ /home/fedora/tun/tun.c	2019-04-11 09:11:20.781683956 +0000
@@ -1118,8 +1118,14 @@ static netdev_tx_t tun_net_xmit(struct s
 
 	nf_reset(skb);
 
-	if (ptr_ring_produce(&tfile->tx_ring, skb))
+	if (ptr_ring_produce(&tfile->tx_ring, skb)) {
+		netif_stop_subqueue(tun->dev, txq);
 		goto drop;
+	}
+
+	if (ptr_ring_full(&tfile->tx_ring)) {
+		netif_stop_subqueue(tun->dev, txq);
+	}
 
 	/* Notify and wake up reader process */
 	if (tfile->flags & TUN_FASYNC)
@@ -2229,6 +2235,8 @@ static ssize_t tun_do_read(struct tun_st
 			consume_skb(skb);
 	}
 
+	netif_wake_subqueue(tun->dev, tfile->queue_index);
+
 	return ret;
 }
 

That gives me

Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1400   10.90     2277474      0    2340.95
212992           10.90     1174728           1207.47

Without the first netif_stop_subqueue() call in the case where we
actually *do* drop the packet (which AFAICT should never happen?) it
looks like this:

Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1400   10.00     7675563      0    8596.61
212992           10.00     1441186           1614.13

And without the patch at all, it was still faster when I just let it
drop packets:

Socket  Message  Elapsed      Messages                
Size    Size     Time         Okay Errors   Throughput
bytes   bytes    secs            #      #   10^6bits/sec

212992    1400   10.00     8143709      0    9120.93
212992           10.00     1545035           1730.44




* Re: Tun congestion/BQL
  2019-04-11  7:22             ` Jason Wang
@ 2019-04-11  9:25               ` David Woodhouse
  2019-04-12  4:26                 ` Jason Wang
  0 siblings, 1 reply; 17+ messages in thread
From: David Woodhouse @ 2019-04-11  9:25 UTC (permalink / raw)
  To: Jason Wang, Toke Høiland-Jørgensen, netdev

On Thu, 2019-04-11 at 15:22 +0800, Jason Wang wrote:
> If you care about userspace performance, you may want to try vhost + TAP 
> instead. I admit the API is not user friendly which needs to be improved 
> but then there will be no syscall overhead on packet transmission and 
> receiving, and eventfd will be used for notification.

That would be very useful. Is there any example code I can follow?



* Re: Tun congestion/BQL
  2019-04-11  9:16               ` David Woodhouse
@ 2019-04-12  4:23                 ` Jason Wang
  0 siblings, 0 replies; 17+ messages in thread
From: Jason Wang @ 2019-04-12  4:23 UTC (permalink / raw)
  To: David Woodhouse, Toke Høiland-Jørgensen, netdev
  Cc: Michael S. Tsirkin


On 2019/4/11 5:16 PM, David Woodhouse wrote:
> On Thu, 2019-04-11 at 17:04 +0800, Jason Wang wrote:
>> Btw, forget to mention, I modify your patch to use
>> netif_stop/wake_subqueue() instead.
> Hm...
>
>
> --- /usr/src/debug/kernel-5.0.fc29/linux-5.0.5-200.fc29.x86_64/drivers/net/tun.c	2019-03-03 23:21:29.000000000 +0000
> +++ /home/fedora/tun/tun.c	2019-04-11 09:11:20.781683956 +0000
> @@ -1118,8 +1118,14 @@ static netdev_tx_t tun_net_xmit(struct s
>   
>   	nf_reset(skb);
>   
> -	if (ptr_ring_produce(&tfile->tx_ring, skb))
> +	if (ptr_ring_produce(&tfile->tx_ring, skb)) {
> +		netif_stop_subqueue(tun->dev, txq);
>   		goto drop;
> +	}
> +
> +	if (ptr_ring_full(&tfile->tx_ring)) {
> +		netif_stop_subqueue(tun->dev, txq);
> +	}
>   
>   	/* Notify and wake up reader process */
>   	if (tfile->flags & TUN_FASYNC)
> @@ -2229,6 +2235,8 @@ static ssize_t tun_do_read(struct tun_st
>   			consume_skb(skb);
>   	}
>   
> +	netif_wake_subqueue(tun->dev, tfile->queue_index);
> +
>   	return ret;
>   }
>   
>
> That gives me
>
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
>
> 212992    1400   10.90     2277474      0    2340.95
> 212992           10.90     1174728           1207.47
>
> Without the first netif_stop_subqueue() call in the case where we
> actually *do* drop the packet (which AFAICT should never happen?) it
> looks like this:


It looks like the subqueue wakeup can happen right after the subqueue
stop, so you probably need to check !ptr_ring_full() before waking up
the queue.

But it may still have issues:

- it's still racy between tun_net_xmit() and tun_do_read(); to solve
this you need to protect the ring-full check and the subqueue status
change with the producer lock (e.g. something like the sketch below)

- it's still racy when there are multiple tun_net_xmit() callers

- and probing the producer index may lead to cacheline bouncing, which
rather conflicts with the design of ptr_ring, which doesn't want the
consumer to peek at the producer index


If we really want to go this way, it needs more careful thought, e.g.
we need to bring back the IFF_ONE_QUEUE behaviour and clear
SOCK_ZEROCOPY when IFF_ONE_QUEUE is set.
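
Just to illustrate that first point, a rough (untested) sketch of making
the full check and the queue stop atomic with the produce, using the
unlocked __ptr_ring_*() helpers under the ring's own producer lock:

	spin_lock(&tfile->tx_ring.producer_lock);
	if (__ptr_ring_produce(&tfile->tx_ring, skb)) {
		spin_unlock(&tfile->tx_ring.producer_lock);
		goto drop;
	}
	if (__ptr_ring_full(&tfile->tx_ring))
		netif_stop_subqueue(tun->dev, txq);
	spin_unlock(&tfile->tx_ring.producer_lock);

and symmetrically on the consumer side, only waking the subqueue when
the ring is no longer full. The cacheline bouncing concern above still
applies, of course.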


>
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
>
> 212992    1400   10.00     7675563      0    8596.61
> 212992           10.00     1441186           1614.13
>
> And without the patch at all, it was still faster when I just let it
> drop packets:
>
> Socket  Message  Elapsed      Messages
> Size    Size     Time         Okay Errors   Throughput
> bytes   bytes    secs            #      #   10^6bits/sec
>
> 212992    1400   10.00     8143709      0    9120.93
> 212992           10.00     1545035           1730.44
>
>

I think this is expected; the patch adds some overhead.

Thanks



* Re: Tun congestion/BQL
  2019-04-11  9:25               ` David Woodhouse
@ 2019-04-12  4:26                 ` Jason Wang
  2019-04-12  5:45                   ` David Woodhouse
  0 siblings, 1 reply; 17+ messages in thread
From: Jason Wang @ 2019-04-12  4:26 UTC (permalink / raw)
  To: David Woodhouse, Toke Høiland-Jørgensen, netdev


On 2019/4/11 5:25 PM, David Woodhouse wrote:
> On Thu, 2019-04-11 at 15:22 +0800, Jason Wang wrote:
>> If you care about userspace performance, you may want to try vhost + TAP
>> instead. I admit the API is not user friendly which needs to be improved
>> but then there will be no syscall overhead on packet transmission and
>> receiving, and eventfd will be used for notification.
> That would be very useful. Is there any example code I can follow?
>

Yes, you can refer to:

1) Qemu hw/virtio/vhost.c or hw/net/vhost_net.c

2) dpdk drivers/net/virtio/virtio_user/vhost_kernel_tap.c

The DPDK code seems more compact.

Basically, the setup of TUN/TAP should be the same; then userspace
needs to allocate the virtio rings and pass them, together with the tap
fd, to vhost through the vhost_net ioctls.
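
Very roughly, the ioctl ordering looks like the sketch below (a single
TX vring only, no error handling, not tested here; in particular the
memory-table setup is simplified, see the DPDK code for how to size the
regions properly):

	#include <fcntl.h>
	#include <stdint.h>
	#include <stdlib.h>
	#include <string.h>
	#include <sys/eventfd.h>
	#include <sys/ioctl.h>
	#include <linux/vhost.h>
	#include <linux/virtio_ring.h>

	#define RING_NUM 256

	/* Returns the vhost fd; vr/kick_fd/call_fd come back to the caller. */
	int setup_vhost_tx(int tap_fd, struct vring *vr, int *kick_fd, int *call_fd)
	{
		int vhost_fd = open("/dev/vhost-net", O_RDWR);
		uint64_t features;
		void *arena;
		size_t arena_sz = vring_size(RING_NUM, 4096);

		ioctl(vhost_fd, VHOST_SET_OWNER, NULL);
		ioctl(vhost_fd, VHOST_GET_FEATURES, &features);
		/* mask 'features' down to what you actually support, then: */
		ioctl(vhost_fd, VHOST_SET_FEATURES, &features);

		/* Lay out the ring in one identity-mapped region.  Real code
		 * must also cover the packet buffers it puts in the ring. */
		posix_memalign(&arena, 4096, arena_sz);
		memset(arena, 0, arena_sz);
		vring_init(vr, RING_NUM, arena, 4096);

		struct vhost_memory *mem = calloc(1, sizeof(*mem) + sizeof(mem->regions[0]));
		mem->nregions = 1;
		mem->regions[0].guest_phys_addr = (uintptr_t)arena;
		mem->regions[0].userspace_addr  = (uintptr_t)arena;
		mem->regions[0].memory_size     = arena_sz;
		ioctl(vhost_fd, VHOST_SET_MEM_TABLE, mem);

		/* Index 1 is the TX vq for vhost-net (vhost drains it into the tap). */
		struct vhost_vring_state num  = { .index = 1, .num = RING_NUM };
		struct vhost_vring_state base = { .index = 1, .num = 0 };
		struct vhost_vring_addr  addr = {
			.index           = 1,
			.desc_user_addr  = (uintptr_t)vr->desc,
			.avail_user_addr = (uintptr_t)vr->avail,
			.used_user_addr  = (uintptr_t)vr->used,
		};
		struct vhost_vring_file kick = { .index = 1, .fd = eventfd(0, 0) };
		struct vhost_vring_file call = { .index = 1, .fd = eventfd(0, 0) };
		struct vhost_vring_file backend = { .index = 1, .fd = tap_fd };

		ioctl(vhost_fd, VHOST_SET_VRING_NUM, &num);
		ioctl(vhost_fd, VHOST_SET_VRING_BASE, &base);
		ioctl(vhost_fd, VHOST_SET_VRING_ADDR, &addr);
		ioctl(vhost_fd, VHOST_SET_VRING_KICK, &kick);	/* we write here to kick vhost */
		ioctl(vhost_fd, VHOST_SET_VRING_CALL, &call);	/* vhost signals us here */
		ioctl(vhost_fd, VHOST_NET_SET_BACKEND, &backend);

		*kick_fd = kick.fd;
		*call_fd = call.fd;
		/* From now on: fill vr->desc/avail, write to kick_fd, poll call_fd. */
		return vhost_fd;
	}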

Thanks



* Re: Tun congestion/BQL
  2019-04-12  4:26                 ` Jason Wang
@ 2019-04-12  5:45                   ` David Woodhouse
  0 siblings, 0 replies; 17+ messages in thread
From: David Woodhouse @ 2019-04-12  5:45 UTC (permalink / raw)
  To: Jason Wang, Toke Høiland-Jørgensen, netdev

On Fri, 2019-04-12 at 12:26 +0800, Jason Wang wrote:
> Yes, you can refer:
> 
> 1) Qemu hw/virtio/vhost.c or hw/net/vhost_net.c
> 
> 2) dpdk drivers/net/virtio/virtio_user/vhost_kernel_tap.c
> 
> DPDK code seems more compact.
> 
> Basically, the setup of TUN/TAP should be the same, then userspace need 
> to allocate virtio rings and pass them and tap fd to vhost through 
> vhost_net ioctls.

Thanks, I'll take a look. Although I might have to do something about
the server side first. I had been using my client with the kernel's ESP
support on an adjacent identical box, naïvely assuming that the kernel
would be able to keep up with anything that userspace could send.

But ksoftirqd is eating 100% of a single CPU and the kernel is starting
to drop incoming packets already.

