From: Jia He
Subject: [PATCH v5 1/1] ring: guarantee load/load order in enqueue and dequeue
Date: Fri, 10 Nov 2017 01:51:09 +0000
Message-ID: <1510278669-8489-2-git-send-email-hejianet@gmail.com>
References: <1510118764-29697-1-git-send-email-hejianet@gmail.com>
 <1510278669-8489-1-git-send-email-hejianet@gmail.com>
In-Reply-To: <1510278669-8489-1-git-send-email-hejianet@gmail.com>
To: jerin.jacob@caviumnetworks.com, dev@dpdk.org, olivier.matz@6wind.com
Cc: konstantin.ananyev@intel.com, bruce.richardson@intel.com,
 jianbo.liu@arm.com, hemant.agrawal@nxp.com, Jia He, Jia He,
 jie2.liu@hxt-semitech.com, bing.zhao@hxt-semitech.com

We observed an rte panic in mbuf_autotest on our Qualcomm arm64 server.

In __rte_ring_move_cons_head()
...
        do {
                /* Restore n as it may change every loop */
                n = max;

                *old_head = r->cons.head;                 // 1st load
                const uint32_t prod_tail = r->prod.tail;  // 2nd load

cpu1(producer)          cpu2(consumer)          cpu3(consumer)
                        load r->prod.tail
in enqueue:
load r->cons.tail
load r->prod.head
store r->prod.tail
                                                load r->cons.head
                                                load r->prod.tail
                                                ...
                                                store r->cons.{head,tail}
                        load r->cons.head

On weakly ordered architectures (powerpc, arm), the 2nd load can be
reordered before the 1st load, which makes *entries bigger than intended.
This nasty reordering breaks enqueue/dequeue: r->cons.head ends up bigger
than prod_tail, *entries becomes very large, and the consumer moves
forward incorrectly.

After this patch, even with the context switches above, the old cons.head
is recalculated after rte_atomic32_cmpset() fails, so no race condition
is left.

There is no such issue on x86, because x86 has a strong memory ordering
model. But rte_smp_rmb() has no runtime performance impact on x86, so keep
the same code without architecture-specific concerns.

Signed-off-by: Jia He
Signed-off-by: jie2.liu@hxt-semitech.com
Signed-off-by: bing.zhao@hxt-semitech.com
---
 lib/librte_ring/rte_ring.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 5e9b3b7..3e8085a 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -409,6 +409,11 @@ __rte_ring_move_prod_head(struct rte_ring *r, int is_sp,
 		n = max;
 
 		*old_head = r->prod.head;
+
+		/* add rmb barrier to avoid load/load reorder in weak
+		 * memory model. It is noop on x86 */
+		rte_smp_rmb();
+
 		const uint32_t cons_tail = r->cons.tail;
 		/*
 		 *  The subtraction is done between two unsigned 32bits value
@@ -517,6 +522,11 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 		n = max;
 
 		*old_head = r->cons.head;
+
+		/* add rmb barrier to avoid load/load reorder in weak
+		 * memory model. It is noop on x86 */
+		rte_smp_rmb();
+
 		const uint32_t prod_tail = r->prod.tail;
 		/* The subtraction is done between two unsigned 32bits value
 		 * (the result is always modulo 32 bits even if we have
--
2.7.4
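
To make the ordering requirement concrete, here is a minimal, self-contained
C11 sketch of the consumer head move. It is an illustration only, not the
DPDK code: atomic_thread_fence(memory_order_acquire) stands in for
rte_smp_rmb(), atomic_compare_exchange_weak_explicit() stands in for
rte_atomic32_cmpset(), and the names (struct ring, move_cons_head) are made
up for this example. Without the fence between the two loads, a weakly
ordered CPU may read prod_tail before cons_head and compute a bogus entry
count.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

struct ring {
        _Atomic uint32_t cons_head;
        _Atomic uint32_t cons_tail;
        _Atomic uint32_t prod_head;
        _Atomic uint32_t prod_tail;
};

/* Try to move cons_head forward by up to n entries;
 * return the number of entries actually claimed. */
static uint32_t
move_cons_head(struct ring *r, uint32_t n)
{
        uint32_t old_head, new_head, entries;
        const uint32_t max = n;
        bool ok;

        do {
                n = max;

                /* 1st load */
                old_head = atomic_load_explicit(&r->cons_head,
                                                memory_order_relaxed);

                /* Read barrier between the two loads: keeps the prod_tail
                 * load from being observed before the cons_head load on
                 * weakly ordered CPUs (same role as rte_smp_rmb()). */
                atomic_thread_fence(memory_order_acquire);

                /* 2nd load */
                uint32_t prod_tail = atomic_load_explicit(&r->prod_tail,
                                                          memory_order_relaxed);

                /* Unsigned subtraction is correct even across wrap-around. */
                entries = prod_tail - old_head;
                if (n > entries)
                        n = entries;
                if (n == 0)
                        return 0;

                new_head = old_head + n;

                /* On failure (another consumer moved cons_head first),
                 * loop and recompute from the fresh cons_head. */
                ok = atomic_compare_exchange_weak_explicit(&r->cons_head,
                                                           &old_head, new_head,
                                                           memory_order_relaxed,
                                                           memory_order_relaxed);
        } while (!ok);

        return n;
}

As in the patch, the barrier sits between the head load and the opposing
tail load; the retry loop then guarantees that any stale head observed
before a lost compare-and-set is re-read on the next iteration.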