* [RFC PATCH net-next 0/1] bonding: Return TX congested if no active slave
@ 2024-07-12 19:24 Nick Child
2024-07-12 19:24 ` [RFC PATCH net-next 1/1] " Nick Child
2024-07-12 23:34 ` [RFC PATCH net-next 0/1] " Jay Vosburgh
0 siblings, 2 replies; 4+ messages in thread
From: Nick Child @ 2024-07-12 19:24 UTC (permalink / raw)
To: netdev; +Cc: Nick Child
Hello. I am posting this as an RFC because I understand that changing return codes can have unexpected results, and I am not well-versed enough to claim with certainty that this won't affect something elsewhere.
We are seeing a rare TCP connection timeout after only ~7.5 seconds of inactivity. This is mainly due to the ibmvnic driver hogging the RTNL lock for too long (~2 seconds per ibmvnic device). We are working on getting the driver off the RTNL lock but figured the core of the issue should also be considered.
Because most of this is new ground to me, I listed the background knowledge that is needed to identify the issue. Feel free to skip this part:
1. During a zero window probe, the socket attempts to get an updated window from the peer 15 times (the default of tcp_retries2). Looking at tcp_send_probe0(), the timer between probes is either doubled or set to 0.5 seconds. The timer is doubled if the skb transmit returns NET_XMIT_SUCCESS or NET_XMIT_CN (see net_xmit_eval()). Note that the timer is set to a static 0.5 seconds if NET_XMIT_DROP is returned, which means the socket can time out with ETIMEDOUT after only ~7.5 seconds. The return code here is typically the return code of __dev_queue_xmit(). A rough standalone model of this mapping follows the list.
2. In __dev_queue_xmit(), the skb is enqueued to the qdisc if the qdisc's enqueue function is defined. In that case, the qdisc enqueue function's return code propagates up the stack. On the other hand, if the qdisc enqueue function is NULL, the driver's xmit function is called directly via dev_hard_start_xmit() and the driver's xmit return code propagates up the stack.
3. The bonding device uses IFF_NO_QUEUE, which sets its qdisc to noqueue_qdisc_ops. noqueue_qdisc_ops has a NULL enqueue function. Therefore, when the bonding device is carrier UP, bond_start_xmit() is called directly. In this function, depending on the bonding mode, a slave device is assigned to the skb and __dev_queue_xmit() is called again. This time the slave's qdisc enqueue function (which is almost always defined) is called.
4. When a device calls netif_carrier_off(), it schedules dev_deactivate(), which grabs the rtnl lock and sets the qdisc to noop_qdisc. The enqueue function of noop_qdisc is defined, but it simply returns NET_XMIT_CN.
5. The miimon of the bonding device periodically checks the carrier status of its slaves. If it detects that all of its slaves are down, it sets the currently active slave to NULL and calls netif_carrier_off() on itself.
6. In the bonding device's xmit function, if the bond does not have any active slaves, it returns NET_XMIT_DROP.
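To make point 1 concrete, here is a rough, standalone model of the return code handling. The helper names and the fixed 0.5 second clamp reflect my reading of net_xmit_eval() and tcp_send_probe0(); treat it as a sketch, not kernel code:

#include <assert.h>

/* NET_XMIT_* values as in include/linux/netdevice.h */
#define NET_XMIT_SUCCESS 0x00
#define NET_XMIT_DROP    0x01
#define NET_XMIT_CN      0x02

/* NET_XMIT_CN ("congestion notification") is treated as success by callers. */
static int net_xmit_eval(int rc)
{
	return rc == NET_XMIT_CN ? 0 : rc;
}

/* Rough model of the timer decision in tcp_send_probe0(): anything that
 * evaluates to <= 0 doubles the backoff; otherwise the probe is retried
 * on a fixed short interval (0.5 seconds).
 */
static int next_probe_interval_ms(int xmit_rc, int cur_interval_ms)
{
	if (net_xmit_eval(xmit_rc) <= 0)
		return cur_interval_ms * 2;	/* exponential backoff */
	return 500;				/* fixed 0.5 second retry */
}

int main(void)
{
	assert(next_probe_interval_ms(NET_XMIT_SUCCESS, 1000) == 2000);
	assert(next_probe_interval_ms(NET_XMIT_CN, 1000) == 2000);
	assert(next_probe_interval_ms(NET_XMIT_DROP, 1000) == 500);
	return 0;
}

With the default of 15 probes, a steady 0.5 second interval gives the ~7.5 second timeout we are seeing.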
Given these observations, assume a bonding device's slaves all suddenly call netif_carrier_off(), and also assume that a TCP connection is in a zero window probe. There is a window for a behavioral issue here:
1. If the bonding device has not yet noticed that its slaves are down (maybe its miimon interval is too large, or the miimon commit could not grab rtnl), then the slave's enqueue function is invoked. This will return either NET_XMIT_SUCCESS or NET_XMIT_CN. The probe timer is doubled.
2. If the bonding device notices the slaves are down, it sets the active slave to NULL and calls netif_carrier_off() on itself. dev_deactivate() is scheduled:
a. If dev_deactivate() has executed, the bonding enqueue function (which was NULL) is now noop_enqueue and returns NET_XMIT_CN. The probe timer is doubled.
b. If dev_deactivate() cannot execute for some time (say, because it is waiting on rtnl), then bond_start_xmit() is called. It sees that it does not have an active slave and returns NET_XMIT_DROP. The probe timer is set to 0.5 seconds.
I believe that when the active slave is NULL, it is safe to say that the bonding device calls netif_carrier_off() on itself. But there is a window where active_slave == NULL and the qdisc's enqueue != noop_enqueue, and during this window the return code is different. Consider the following timeline (a condensed sketch of these phases follows it):
slaves go down
|
| returns NET_XMIT_CN
|
bond calls carrier_off
|
| returns NET_XMIT_DROP
|
dev_deactivate on bond
|
| returns NET_XMIT_CN
|
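The same three phases, condensed into a sketch. The phase names are made up for illustration and the mapping is my summary of the points above, not kernel code:

/* NET_XMIT_* values as in include/linux/netdevice.h */
#define NET_XMIT_SUCCESS 0x00
#define NET_XMIT_DROP    0x01
#define NET_XMIT_CN      0x02

enum bond_phase {
	SLAVES_DOWN_BOND_UNAWARE,	/* before the bond notices */
	BOND_CARRIER_OFF_PENDING,	/* no active slave, dev_deactivate() not yet run */
	BOND_DEACTIVATED,		/* bond qdisc is now noop_qdisc */
};

/* Which return code the TCP layer sees in each phase of the timeline above. */
static int bond_path_xmit_rc(enum bond_phase phase)
{
	switch (phase) {
	case SLAVES_DOWN_BOND_UNAWARE:
		/* The bond has no queue, so bond_start_xmit() runs directly
		 * and re-enqueues the skb to the slave's qdisc (e.g.
		 * pfifo_fast): NET_XMIT_SUCCESS or NET_XMIT_CN.
		 */
		return NET_XMIT_CN;
	case BOND_CARRIER_OFF_PENDING:
		/* bond_start_xmit() sees no active slave: NET_XMIT_DROP,
		 * which clamps the zero window probe timer to 0.5 seconds.
		 */
		return NET_XMIT_DROP;
	case BOND_DEACTIVATED:
		/* noop_enqueue() reports NET_XMIT_CN (point 4 above). */
		return NET_XMIT_CN;
	}
	return NET_XMIT_DROP;
}

The middle phase is the only one where TCP sees NET_XMIT_DROP, and it lasts exactly as long as dev_deactivate() is kept waiting on rtnl.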
Because the bonding xmit function should really only be a route to a slave device, I propose that it return NET_XMIT_CN in these scenarios instead of NET_XMIT_DROP. Note that the bond will still return NET_XMIT_DROP when it has NO slaves at all. Since miimon calls netif_carrier_off() when there are no active slaves, I believe this change will only affect the window between the bonding device's call to netif_carrier_off() and the execution of dev_deactivate().
I was able to see this issue with bpftrace:
10:28:27:283929 - dev_deactivate eth252
10:28:27:312805 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 0+1
10:28:27:760016 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 1+1
10:28:28:147410 - dev_deactivate eth251
10:28:28:348387 - dev_queue_xmit rc = 2
10:28:28:348394 - dev_queue_xmit rc = 2
10:28:28:670013 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 2+1
10:28:28:670066 - dev_queue_xmit rc = 2
10:28:28:670070 - dev_queue_xmit rc = 2
10:28:30:440025 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 3+1
10:28:30:440084 - dev_queue_xmit rc = 2
10:28:30:440088 - dev_queue_xmit rc = 2
10:28:33:505982 - netif_carrier_off bond1
netif_carrier_off+112
bond_set_carrier+296
bond_select_active_slave+296
bond_mii_monitor+1900
process_one_work+760
worker_thread+136
kthread+456
ret_from_kernel_thread+92
10:28:33:790050 - dev_queue_xmit rc = 1
10:28:33:870015 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 4+1
10:28:33:870061 - dev_queue_xmit rc = 1
10:28:34:300269 - dev_queue_xmit rc = 1
10:28:34:380025 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 5+1
10:28:34:380072 - dev_queue_xmit rc = 1
10:28:34:432446 - dev_queue_xmit rc = 1
10:28:34:810059 - dev_queue_xmit rc = 1
10:28:34:900014 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 6+1
10:28:34:900059 - dev_queue_xmit rc = 1
10:28:35:000050 - dev_queue_xmit rc = 1
10:28:35:330054 - dev_queue_xmit rc = 1
10:28:35:410020 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 7+1
10:28:35:410070 - dev_queue_xmit rc = 1
10:28:35:630865 - dev_queue_xmit rc = 1
10:28:35:850072 - dev_queue_xmit rc = 1
10:28:35:920025 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 8+1
10:28:35:920069 - dev_queue_xmit rc = 1
10:28:35:940348 - dev_queue_xmit rc = 1
10:28:36:140055 - dev_queue_xmit rc = 1
10:28:36:370050 - dev_queue_xmit rc = 1
10:28:36:430029 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 9+1
10:28:36:430089 - dev_queue_xmit rc = 1
10:28:36:460052 - dev_queue_xmit rc = 1
10:28:36:650049 - dev_queue_xmit rc = 1
10:28:36:880059 - dev_queue_xmit rc = 1
10:28:36:940024 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 10+1
10:28:36:940074 - dev_queue_xmit rc = 1
10:28:36:980044 - dev_queue_xmit rc = 1
10:28:37:160331 - dev_queue_xmit rc = 1
10:28:37:390060 - dev_queue_xmit rc = 1
10:28:37:450024 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 11+1
10:28:37:450082 - dev_queue_xmit rc = 1
10:28:37:480045 - dev_queue_xmit rc = 1
10:28:37:730281 - dev_queue_xmit rc = 1
10:28:37:900051 - dev_queue_xmit rc = 1
10:28:37:970019 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 12+1
10:28:37:970062 - dev_queue_xmit rc = 1
10:28:38:000045 - dev_queue_xmit rc = 1
10:28:38:240089 - dev_queue_xmit rc = 1
10:28:38:420053 - dev_queue_xmit rc = 1
10:28:38:490012 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 13+1
10:28:38:490076 - dev_queue_xmit rc = 1
10:28:38:520048 - dev_queue_xmit rc = 1
10:28:38:750069 - dev_queue_xmit rc = 1
10:28:39:000029 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 14+1
10:28:39:000093 - dev_queue_xmit rc = 1
10:28:39:030052 - dev_queue_xmit rc = 1
10:28:39:260046 - dev_queue_xmit rc = 1
10:28:39:450050 - dev_queue_xmit rc = 1
10:28:39:520044 - SK_ERR(110) port=8682 - snd_wnd 0
sk_error_report+12
tcp_write_err+64
tcp_write_timer_handler+564
tcp_write_timer+424
call_timer_fn+88
run_timer_softirq+1896
__do_softirq+348
do_softirq_own_stack+64
irq_exit+392
timer_interrupt+380
decrementer_common_virt+528
10:28:47:813297 - dev_deactivate bond1
Again, I know the easier solution is to reduce the rtnl users so that dev_deactivate() can run sooner (we are working on that as well).
Nick Child (1):
bonding: Return TX congested if no active slave
include/net/bonding.h | 7 +++++++
1 file changed, 7 insertions(+)
--
2.43.0
* [RFC PATCH net-next 1/1] bonding: Return TX congested if no active slave
2024-07-12 19:24 [RFC PATCH net-next 0/1] bonding: Return TX congested if no active slave Nick Child
@ 2024-07-12 19:24 ` Nick Child
2024-07-12 23:34 ` [RFC PATCH net-next 0/1] " Jay Vosburgh
1 sibling, 0 replies; 4+ messages in thread
From: Nick Child @ 2024-07-12 19:24 UTC (permalink / raw)
To: netdev; +Cc: Nick Child
When the bonding device cannot find an active slave, it is safe to
assume that monitoring will eventually call carrier_off on the bonding
device.
This means that the bonding device's qdisc will eventually become noop
and return congested. But there is an indefinite amount of time between
losing all active slaves and the bonding device's qdisc becoming noop.
Previously, NET_XMIT_DROP was returned during this period. Upper-level
networking functions react differently to a drop than to congestion.
Therefore, return NET_XMIT_CN instead of NET_XMIT_DROP because it
accurately predicts the impending noop_enqueue return code and helps
to keep sockets alive until then.
Signed-off-by: Nick Child <nnac123@linux.ibm.com>
---
include/net/bonding.h | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/include/net/bonding.h b/include/net/bonding.h
index b61fb1aa3a56..22da0916d4a6 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -805,8 +805,15 @@ extern const u8 lacpdu_mcast_addr[];
 static inline netdev_tx_t bond_tx_drop(struct net_device *dev, struct sk_buff *skb)
 {
+	struct bonding *bond = netdev_priv(dev);
+
 	dev_core_stats_tx_dropped_inc(dev);
 	dev_kfree_skb_any(skb);
+
+	/* monitoring will trigger dev_deactivate soon, imitate noop until then */
+	if (bond_has_slaves(bond))
+		return NET_XMIT_CN;
+
 	return NET_XMIT_DROP;
 }
--
2.43.0
* Re: [RFC PATCH net-next 0/1] bonding: Return TX congested if no active slave
2024-07-12 19:24 [RFC PATCH net-next 0/1] bonding: Return TX congested if no active slave Nick Child
2024-07-12 19:24 ` [RFC PATCH net-next 1/1] " Nick Child
@ 2024-07-12 23:34 ` Jay Vosburgh
2024-07-13 12:35 ` Nick Child
1 sibling, 1 reply; 4+ messages in thread
From: Jay Vosburgh @ 2024-07-12 23:34 UTC (permalink / raw)
To: Nick Child; +Cc: netdev
Nick Child <nnac123@linux.ibm.com> wrote:
>Hello. Posting this as RFC because I understand that changing return
>codes can have unexpected results and I am not well versed enough to
>claim with certainty that this won't affect something elsewhere.
>
>We are seeing a rare TCP connection timeout after only ~7.5 seconds of
>inactivity. This is mainly due to the ibmvnic driver hogging the RTNL
>lock for too long (~2 seconds per ibmvnic device). We are working on
>getting the driver off the RTNL lock but figured the core of the issue
>should also be considered.
>
>Because most of this is new ground to me, I listed the background
>knowledge that is needed to identify the issue. Feel free to skip this
>part:
>
>1. During a zero window probe, the socket attempts to get an updated
>window from the peer 15 times (default of tcp_retries2). Looking at
>tcp_send_probe0(), the timer between probes is either doubled or 0.5
>seconds. The timer is doubled if the skb transmit returns
>NET_XMIT_SUCCESS or NET_XMIT_CN (see net_xmit_eval()). Note that the
>timer is set to a static 0.5 if NET_XMIT_DROP is returned. This means
>the socket can ETIMEOUT after 7.5 seconds. The return code is typically
>the return code of __dev_queue_xmit()
I'm not sure that your description of the behavior of
tcp_send_probe0() matches the code.
It looks to me like the timer doubles for "if (err <= 0)",
meaning a negative value or NET_XMIT_SUCCESS. The 0.5 second timeout
value (in TCP_RESOURCE_PROBE_INTERVAL) is set in the "else" after
"if (err <= 0)", so NET_XMIT_DROP and NET_XMIT_CN both qualify and would
presumably result in the 0.5 second timeout.
However, since tcp_write_wakeup() -> tcp_transmit_skb() ->
__tcp_transmit_skb() will convert NET_XMIT_CN to 0 (which is
NET_XMIT_SUCCESS) via net_xmit_eval(), I'm not sure that it's possible
for err to equal NET_XMIT_CN here.
I'll note that the 0.5 second timeout logic had a relatively
recent change in c1d5674f8313 ("tcp: less aggressive window probing on
local congestion"). From reading the log, the intent seems to be to
bound the maximum probe interval to 0.5 seconds in low-RTT environments.
>2. In __dev_queue_xmit(), the skb is enqueued to the qdisc if the
>enqueue function is defined. In this circumstance, the qdisc enqueue
>function return code propagates up the stack. On the other hand, if the
>qdisc enqueue function is NULL then the drivers xmit function is called
>directly via dev_hard_start_xmit(). In this circumstance, the drivers
>xmit return code propagates up the stack.
>
>3. The bonding device uses IFF_NO_QUEUE, this sets qdisc to
>noqueue_qdisc_ops. noqueue_qdisc_ops has NULL enqueue
>function. Therefore, when the bonding device is carrier UP,
>bond_start_xmit is called directly. In this function, depending on
>bonding mode, a slave device is assigned to the skb and
>__dev_queue_xmit() is called again. This time the slaves qdisc enqueue
>function (which is almost always defined) is called.
Does your analysis, or the behavior, change if the bond itself does
have a qdisc? IFF_NO_QUEUE does not install one by default, but users
are free to add one.
>4. When a device calls netif_carrier_off(), it schedules dev_deactivate
>which grabs the rtnl lock and sets the qdisc to noop_qdisc. The enqueue
>function of noop_qdisc is defined but simply returns NET_XMIT_CN.
>
>5. The miimon of the bonding device periodically checks for the carrier
>status of its slaves. If it detects that all of its slaves are down,
>then it sets currently_active_slave to NULL and calls
>netif_carrier_off() on itself.
>
>6. In the bonding devices xmit function, if it does not have any active
>slaves, it returns NET_XMIT_DROP.
>
>Given these observations. Assume a bonding devices slaves all suddenly
>call netif_carrier_off(). Also assume that a tcp connection is in a zero
>window probe. There is a window for a behavioral issue here:
>1. If the bonding device does not notice that its slaves are down
>(maybe its miimon interval is too large or the miimon commit could not
>grab rtnl), then the slaves enqueue function is invoked. This will
>either return NET_XMIT_SUCCESS OR NET_XMIT_CN. The probe timer is
>doubled.
As I mentioned above, I'm not sure this accurately describes the
behavior in tcp_send_probe0().
-J
>2. If the bonding device notices the slaves are down. It sets active
>slave to NULL and calls netif_carrier_off() on itself. dev_deactivate()
>is scheduled:
>
> a. If dev_deactivate() executes. The bonding enqueue function
>(which was null) is now noop_enqueue and returns NET_XMIT_CN. The probe
>timer is doubled.
>
> b. If dev_deactivate() cannot execute for some time (say because it
>is waiting on rtnl). Then bond_start_xmit() is called. It sees that it
>does not have an active slave and returns NET_XMIT_DROP. The probe
>timer is set to 0.5 seconds.
>
>I believe that when the active slave is NULL, it is safe to say that
>the bonding device calls netif_carrier_off() on itself. But there is a
>time where active_slave == NULL and the qdisc.enqueue !=
>noop_enqueue. During this time the return code is different. Consider
>the following timeline:
>
> slaves go down
> |
> | returns NET_XMIT_CN
> |
> bond calls carrier_off
> |
> | returns NET_XMIT_DROP
> |
> dev_deactivate on bond
> |
> | returns NET_XMIT_CN
> |
>
>Because the bonding xmit function should really only be a route to a
>slave device, I propose that it returns NET_XMIT_CN in these scenarios
>instead of NET_XMIT_DROP. Note that the bond will still return
>NET_XMIT_DROP when it has NO slaves. Since miimon calls
>netif_carrier_off() when it doesn't have any active slaves, I believe
>this change will only effect the time between the bonding devices call
>to netif_carrier_off() and the execution of dev_deactivate().
>
>I was able to see this issue with bpftrace:
> 10:28:27:283929 - dev_deactivate eth252
> 10:28:27:312805 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 0+1
> 10:28:27:760016 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 1+1
> 10:28:28:147410 - dev_deactivate eth251
> 10:28:28:348387 - dev_queue_xmit rc = 2
> 10:28:28:348394 - dev_queue_xmit rc = 2
> 10:28:28:670013 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 2+1
> 10:28:28:670066 - dev_queue_xmit rc = 2
> 10:28:28:670070 - dev_queue_xmit rc = 2
> 10:28:30:440025 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 3+1
> 10:28:30:440084 - dev_queue_xmit rc = 2
> 10:28:30:440088 - dev_queue_xmit rc = 2
> 10:28:33:505982 - netif_carrier_off bond1
> netif_carrier_off+112
> bond_set_carrier+296
> bond_select_active_slave+296
> bond_mii_monitor+1900
> process_one_work+760
> worker_thread+136
> kthread+456
> ret_from_kernel_thread+92
> 10:28:33:790050 - dev_queue_xmit rc = 1
> 10:28:33:870015 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 4+1
> 10:28:33:870061 - dev_queue_xmit rc = 1
> 10:28:34:300269 - dev_queue_xmit rc = 1
> 10:28:34:380025 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 5+1
> 10:28:34:380072 - dev_queue_xmit rc = 1
> 10:28:34:432446 - dev_queue_xmit rc = 1
> 10:28:34:810059 - dev_queue_xmit rc = 1
> 10:28:34:900014 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 6+1
> 10:28:34:900059 - dev_queue_xmit rc = 1
> 10:28:35:000050 - dev_queue_xmit rc = 1
> 10:28:35:330054 - dev_queue_xmit rc = 1
> 10:28:35:410020 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 7+1
> 10:28:35:410070 - dev_queue_xmit rc = 1
> 10:28:35:630865 - dev_queue_xmit rc = 1
> 10:28:35:850072 - dev_queue_xmit rc = 1
> 10:28:35:920025 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 8+1
> 10:28:35:920069 - dev_queue_xmit rc = 1
> 10:28:35:940348 - dev_queue_xmit rc = 1
> 10:28:36:140055 - dev_queue_xmit rc = 1
> 10:28:36:370050 - dev_queue_xmit rc = 1
> 10:28:36:430029 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 9+1
> 10:28:36:430089 - dev_queue_xmit rc = 1
> 10:28:36:460052 - dev_queue_xmit rc = 1
> 10:28:36:650049 - dev_queue_xmit rc = 1
> 10:28:36:880059 - dev_queue_xmit rc = 1
> 10:28:36:940024 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 10+1
> 10:28:36:940074 - dev_queue_xmit rc = 1
> 10:28:36:980044 - dev_queue_xmit rc = 1
> 10:28:37:160331 - dev_queue_xmit rc = 1
> 10:28:37:390060 - dev_queue_xmit rc = 1
> 10:28:37:450024 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 11+1
> 10:28:37:450082 - dev_queue_xmit rc = 1
> 10:28:37:480045 - dev_queue_xmit rc = 1
> 10:28:37:730281 - dev_queue_xmit rc = 1
> 10:28:37:900051 - dev_queue_xmit rc = 1
> 10:28:37:970019 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 12+1
> 10:28:37:970062 - dev_queue_xmit rc = 1
> 10:28:38:000045 - dev_queue_xmit rc = 1
> 10:28:38:240089 - dev_queue_xmit rc = 1
> 10:28:38:420053 - dev_queue_xmit rc = 1
> 10:28:38:490012 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 13+1
> 10:28:38:490076 - dev_queue_xmit rc = 1
> 10:28:38:520048 - dev_queue_xmit rc = 1
> 10:28:38:750069 - dev_queue_xmit rc = 1
> 10:28:39:000029 - send_probe0 - port=8682 - snd_wnd 0, icsk_probes_out = 14+1
> 10:28:39:000093 - dev_queue_xmit rc = 1
> 10:28:39:030052 - dev_queue_xmit rc = 1
> 10:28:39:260046 - dev_queue_xmit rc = 1
> 10:28:39:450050 - dev_queue_xmit rc = 1
> 10:28:39:520044 - SK_ERR(110) port=8682 - snd_wnd 0
> sk_error_report+12
> tcp_write_err+64
> tcp_write_timer_handler+564
> tcp_write_timer+424
> call_timer_fn+88
> run_timer_softirq+1896
> __do_softirq+348
> do_softirq_own_stack+64
> irq_exit+392
> timer_interrupt+380
> decrementer_common_virt+528
> 10:28:47:813297 - dev_deactivate bond1
>
>Again, I know the easier solution is to remove rtnl users to reduce the
>time before dev_deactivate (we are working on that as well).
>
>Nick Child (1):
> bonding: Return TX congested if no active slave
>
> include/net/bonding.h | 7 +++++++
> 1 file changed, 7 insertions(+)
>
>--
>2.43.0
---
-Jay Vosburgh, jv@jvosburgh.net
* Re: [RFC PATCH net-next 0/1] bonding: Return TX congested if no active slave
2024-07-12 23:34 ` [RFC PATCH net-next 0/1] " Jay Vosburgh
@ 2024-07-13 12:35 ` Nick Child
0 siblings, 0 replies; 4+ messages in thread
From: Nick Child @ 2024-07-13 12:35 UTC (permalink / raw)
To: Jay Vosburgh; +Cc: netdev
Hi Jay, thanks for taking a look.
On 7/12/24 18:34, Jay Vosburgh wrote:
> Nick Child <nnac123@linux.ibm.com> wrote:
>
>> 1. During a zero window probe, the socket attempts to get an updated
>> window from the peer 15 times (default of tcp_retries2). Looking at
>> tcp_send_probe0(), the timer between probes is either doubled or 0.5
>> seconds. The timer is doubled if the skb transmit returns
>> NET_XMIT_SUCCESS or NET_XMIT_CN (see net_xmit_eval()). Note that the
>> timer is set to a static 0.5 if NET_XMIT_DROP is returned. This means
>> the socket can ETIMEOUT after 7.5 seconds. The return code is typically
>> the return code of __dev_queue_xmit()
>
> I'm not sure that your description of the behavior of
> tcp_send_probe0() matches the code.
>
> It looks to me like the timer doubles for "if (err <= 0)",
> meaning a negative value or NET_XMIT_SUCCESS. The 0.5 timeout value (in
> TCP_RESOURCE_PROBE_INTERVAL) is set in the "else" after "if (err <= 0)",
> so NET_XMIT_DROP and NET_XMIT_CN both qualify and would presumably
> result in the 0.5 second timeout.
>
> However, since tcp_write_wakeup() -> tcp_transmit_skb() ->
> __tcp_transmit_skb() will convert NET_XMIT_CN to 0 (which is
> NET_XMIT_SUCCESS) via net_xmit_eval(), I'm not sure that it's possible
> for err to equal NET_XMIT_CN here.
Apologies, I was oversimplifying in my explanation. I was referring to
dev_queue_xmit() returning either CN or DROP. On its way up the stack,
the CN return code is mapped to 0 (in net_xmit_eval()). So the proper
phrasing is: "when dev_queue_xmit returns CN, the timer is doubled; when
dev_queue_xmit returns DROP, the timer is 0.5 seconds".
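To make that mapping concrete (the NET_XMIT_* values are from
include/linux/netdevice.h; the net_xmit_eval() below is my paraphrase of
the macro in include/net/sock.h, a sketch rather than a verbatim copy):

#define NET_XMIT_SUCCESS 0x00
#define NET_XMIT_DROP    0x01
#define NET_XMIT_CN      0x02
#define net_xmit_eval(e) ((e) == NET_XMIT_CN ? 0 : (e))

/* "err <= 0" in tcp_send_probe0(): backoff is doubled */
_Static_assert(net_xmit_eval(NET_XMIT_SUCCESS) <= 0, "SUCCESS doubles the timer");
_Static_assert(net_xmit_eval(NET_XMIT_CN) <= 0, "CN doubles the timer");
/* "err > 0": probe retried on the fixed 0.5 second interval */
_Static_assert(net_xmit_eval(NET_XMIT_DROP) > 0, "DROP clamps the timer to 0.5s");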
> I'll note that the 0.5 second timeout logic had a relatively
> recent change in c1d5674f8313 ("tcp: less aggressive window probing on
> local congestion"). From reading the log, the intent seems to be to
> bound the maximum probe interval to 0.5 seconds in low-RTT environments.
>
I don't have any complaints about the quick timeout. I believe it has
valid applications; I just don't think this scenario (when the bond
device is waiting for dev_deactivate()) falls into that category.
>> 2. In __dev_queue_xmit(), the skb is enqueued to the qdisc if the
>> enqueue function is defined. In this circumstance, the qdisc enqueue
>> function return code propagates up the stack. On the other hand, if the
>> qdisc enqueue function is NULL then the drivers xmit function is called
>> directly via dev_hard_start_xmit(). In this circumstance, the drivers
>> xmit return code propagates up the stack.
>>
>> 3. The bonding device uses IFF_NO_QUEUE, this sets qdisc to
>> noqueue_qdisc_ops. noqueue_qdisc_ops has NULL enqueue
>> function. Therefore, when the bonding device is carrier UP,
>> bond_start_xmit is called directly. In this function, depending on
>> bonding mode, a slave device is assigned to the skb and
>> __dev_queue_xmit() is called again. This time the slaves qdisc enqueue
>> function (which is almost always defined) is called.
>
> Does your analysis or behavior change if the bond itself does
> have a qdisc? IFF_NO_QUEUE does not install one by default, but users
> are free to add one.
>
Good question. I did not try it, though I don't think we would see this
issue because there would not be a way for the return code of
dev_queue_xmit() to propagate all the way up to tcp_write_wakeup().
>> 4. When a device calls netif_carrier_off(), it schedules dev_deactivate
>> which grabs the rtnl lock and sets the qdisc to noop_qdisc. The enqueue
>> function of noop_qdisc is defined but simply returns NET_XMIT_CN.
>>
>> 5. The miimon of the bonding device periodically checks for the carrier
>> status of its slaves. If it detects that all of its slaves are down,
>> then it sets currently_active_slave to NULL and calls
>> netif_carrier_off() on itself.
>>
>> 6. In the bonding devices xmit function, if it does not have any active
>> slaves, it returns NET_XMIT_DROP.
>>
>> Given these observations. Assume a bonding devices slaves all suddenly
>> call netif_carrier_off(). Also assume that a tcp connection is in a zero
>> window probe. There is a window for a behavioral issue here:
>
>> 1. If the bonding device does not notice that its slaves are down
>> (maybe its miimon interval is too large or the miimon commit could not
>> grab rtnl), then the slaves enqueue function is invoked. This will
>> either return NET_XMIT_SUCCESS OR NET_XMIT_CN. The probe timer is
>> doubled.
>
> As I mentioned above, I'm not sure this accurately describes the
> behavior in tcp_send_probe0()
>
> -J
>
The call stack I describe here looks like this:
pfifo_fast_enqueue+12
dev_qdisc_enqueue+76
__dev_queue_xmit+1700
bond_dev_queue_xmit+68
bond_start_xmit+1092
dev_hard_start_xmit+300
__dev_queue_xmit+3124
ip_finish_output2+1020
ip_output+228
ip_local_out+104
__ip_queue_xmit+384
__tcp_transmit_skb+1644
Considering that tcp_write_wakeup() calls tcp_transmit_skb(), and
__tcp_transmit_skb() translates NET_XMIT_CN to 0 through net_xmit_eval(),
I believe my statement is correct: if the slave's enqueue function is
invoked (in this case pfifo_fast) and NET_XMIT_SUCCESS or NET_XMIT_CN is
returned, then the probe timer is doubled.
>> --
>> 2.43.0
>
> ---
> -Jay Vosburgh, jv@jvosburgh.net