netdev.vger.kernel.org archive mirror
* [PATCH net] net/smc: avoid data corruption caused by decline
@ 2023-11-07  3:56 D. Wythe
  2023-11-08  2:38 ` Jakub Kicinski
  0 siblings, 1 reply; 3+ messages in thread
From: D. Wythe @ 2023-11-07  3:56 UTC (permalink / raw)
  To: kgraul, wenjia, jaka, wintera; +Cc: kuba, davem, netdev, linux-s390, linux-rdma

From: "D. Wythe" <alibuda@linux.alibaba.com>

We found a data corruption issue during testing of SMC-R on Redis
applications.

The benchmark has a low probability of reporting a strange error as
shown below.

"Error: Protocol error, got "\xe2" as reply type byte"

Finally, we found that the retrieved error data was as follows:

0xE2 0xD4 0xC3 0xD9 0x04 0x00 0x2C 0x20 0xA6 0x56 0x00 0x16 0x3E 0x0C
0xCB 0x04 0x02 0x01 0x00 0x00 0x20 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xE2

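The leading bytes 0xE2 0xD4 0xC3 0xD9 are the EBCDIC eyecatcher "SMCR"
and the following type byte 0x04 marks a DECLINE, so this is an SMC
DECLINE message. In other words, the application received an SMC
protocol message as payload.
We found that this was caused by the following situation: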

client			server
	   proposal
	------------->
	   accept
	<-------------
	   confirm
	------------->
wait confirm

	 failed llc confirm
	    x------
(after 2s)timeout
			wait rsp

wait decline

(after 1s) timeout
			(after 2s) timeout
	    decline
	-------------->
	    decline
	<--------------

As a result, a DECLINE message was sent on a connection that had
already fallen back to TCP, and the fallback connection read this
message from the TCP stream as if it were application data.

This patch doubles the client timeout, making it 2x the server value.
With this simple change, the DECLINE messages should never cross or
collide (during a CONFIRM LINK timeout).

This issue requires an immediate solution, since the protocol updates
involve a more long-term solution.

Signed-off-by: D. Wythe <alibuda@linux.alibaba.com>
---
 net/smc/af_smc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index abd2667..5b91f55 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -599,7 +599,7 @@ static int smcr_clnt_conf_first_link(struct smc_sock *smc)
 	int rc;
 
 	/* receive CONFIRM LINK request from server over RoCE fabric */
-	qentry = smc_llc_wait(link->lgr, NULL, SMC_LLC_WAIT_TIME,
+	qentry = smc_llc_wait(link->lgr, NULL, 2 * SMC_LLC_WAIT_TIME,
 			      SMC_LLC_CONFIRM_LINK);
 	if (!qentry) {
 		struct smc_clc_msg_decline dclc;
-- 
1.8.3.1



* Re: [PATCH net] net/smc: avoid data corruption caused by decline
  2023-11-07  3:56 [PATCH net] net/smc: avoid data corruption caused by decline D. Wythe
@ 2023-11-08  2:38 ` Jakub Kicinski
  2023-11-08  9:42   ` D. Wythe
  0 siblings, 1 reply; 3+ messages in thread
From: Jakub Kicinski @ 2023-11-08  2:38 UTC (permalink / raw)
  To: D. Wythe
  Cc: kgraul, wenjia, jaka, wintera, davem, netdev, linux-s390,
	linux-rdma

On Tue,  7 Nov 2023 11:56:16 +0800 D. Wythe wrote:
> This issue requires an immediate solution, since the protocol updates
> involve a more long-term solution.

Please provide an appropriate Fixes tag.
-- 
pw-bot: cr
pv-bot: fixes


* Re: [PATCH net] net/smc: avoid data corruption caused by decline
  2023-11-08  2:38 ` Jakub Kicinski
@ 2023-11-08  9:42   ` D. Wythe
  0 siblings, 0 replies; 3+ messages in thread
From: D. Wythe @ 2023-11-08  9:42 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: kgraul, wenjia, jaka, wintera, davem, netdev, linux-s390,
	linux-rdma



On 11/8/23 10:38 AM, Jakub Kicinski wrote:
> On Tue,  7 Nov 2023 11:56:16 +0800 D. Wythe wrote:
>> This issue requires an immediate solution, since the protocol updates
>> involve a more long-term solution.
> Please provide an appropriate Fixes tag.

Thanks for the reminder.

D. Wythe
