From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751335AbcGQWwn (ORCPT ); Sun, 17 Jul 2016 18:52:43 -0400 Received: from mail-sn1nam02on0106.outbound.protection.outlook.com ([104.47.36.106]:53654 "EHLO NAM02-SN1-obe.outbound.protection.outlook.com" rhost-flags-OK-OK-OK-FAIL) by vger.kernel.org with ESMTP id S1751142AbcGQWwj (ORCPT ); Sun, 17 Jul 2016 18:52:39 -0400 Authentication-Results: spf=none (sender IP is ) smtp.mailfrom=waiman.long@hpe.com; Message-ID: <578C0C2C.8020809@hpe.com> Date: Sun, 17 Jul 2016 18:52:28 -0400 From: Waiman Long User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.12) Gecko/20130109 Thunderbird/10.0.12 MIME-Version: 1.0 To: Peter Zijlstra CC: Pan Xinhui , Ingo Molnar , , Boqun Feng , Scott J Norton , Douglas Hatch Subject: Re: [PATCH v2 2/5] locking/pvqspinlock: Fix missed PV wakeup problem References: <1464713631-1066-1-git-send-email-Waiman.Long@hpe.com> <1464713631-1066-3-git-send-email-Waiman.Long@hpe.com> <20160715084732.GF30921@twins.programming.kicks-ass.net> <3c5d5c29-7956-572f-2638-b85299c72432@linux.vnet.ibm.com> <20160715100703.GQ30154@twins.programming.kicks-ass.net> <20160715163556.GA7141@twins.programming.kicks-ass.net> In-Reply-To: <20160715163556.GA7141@twins.programming.kicks-ass.net> Content-Type: text/plain; charset="ISO-8859-1"; format=flowed Content-Transfer-Encoding: 7bit X-Originating-IP: [72.71.243.249] X-ClientProxiedBy: BY2PR1001CA0048.namprd10.prod.outlook.com (10.164.163.16) To CS1PR84MB0312.NAMPRD84.PROD.OUTLOOK.COM (10.162.190.30) X-MS-Office365-Filtering-Correlation-Id: f2c8000a-e1d9-41a2-e569-08d3ae950bda X-Microsoft-Exchange-Diagnostics: 1;CS1PR84MB0312;2:HnshjgtTwF0XhGTFWzaQX0fLAAa60AZj2szmaxTwNKCDhogRwqJvhlxN0btnB5DD1jKnHhvTU244DfUYeRIUJT2x0CFvOfqb/1R9AT3gw+HdNoZVrq2SGhG8D4kxEgJKt9yMp3UTJA+3TizFoZWdzYsNNH3gDHLSnMOpeUQ+HXmhUynwxV7QnqioVCojP/MN;3:BUvqlxgwtoIGywbTCvliigHK0w1Uh3onpYtMYNWh2oDAm4N3zaXCiks9h9lD5e9zoC+/M4T5aQHdsVA9AKbWEX4qJkeOJAKwoyxHFcEpO6iiEgfID4kSKdyUNm+vRc1x;25:0/omFFGmNEHpVcAHIN6Zlc+8jNdbX6TtyJ8MVVnv0W0syZrcDVhNm8Hy5ggPEq26kouLAX0qF1OVU7gOgfYXApIKIXzrcaeVeOWNyfUstWVvdkW4g3ZjK9swFzddRhkYolpDH8oTgZP7OkKK0jKxCGfeTriCTXLRFZcrcTPzvZXFqg0Zpd7dEp0ENa3r+L7fRPYRWGhT4QnFE/TTg05BsPLftyNIHr+scKz+jKipxIG3ljInsusl/DBtb+x8Lfr9TJ+Hg+7I66QVMCVYENyNpAnCq3CyEnQol1H6eenqM6gXHUvSXGLKKa/gEak51O4mWbmNniS5MNF3FA7El4J/fSs2GDjj1dHGWS3M7o4BzVcWJxunqJo+P8R0huT1ij58x8wgcSz59YPsauKQ5VhJZvIQz6SD+vm0lzIOpz97JMU=;31:WfXnsBINHPBmEI52r/HMe9lqk9psL3f5I7Hw2ac3T1F5Mc4iDbUkXp57OfdibP0iyiHzgScGF65HPq2T8Lx2O1tV4EyW2JYAIp+nLT7WidTjLm1T/mzIr6mPS7ytczCmyNpqpYxZK2kmi0gGzJBTh/GNtFPouXIPBAGmyIzNHrumah0kKoWVh3QtokoNGvCoh5xEipEUSr1fqn39BhlNnQ== X-Microsoft-Antispam: UriScan:;BCL:0;PCL:0;RULEID:;SRVR:CS1PR84MB0312; X-Microsoft-Exchange-Diagnostics: 1;CS1PR84MB0312;20:pqQ9/kBPslyo5tpfYpCRjfX7Dm+ALH1j/FDCY53NFe+WIkzcNUO85yEd8iYTfBMu1qikUvkAnnfGbRAPjC5P5lxdEWCHtA5RSR34W/DPZPc3V5uWK/D1fx5yp9aewHwInC/OhbznJLhBXSsqFP+5NErm8ZNgP1oSc7OWRpZMszfXiR7voYYq0+pE1rM/9kEgAupgC7ICc26ZwVl6pWCaEZ1jPZPSaqox0AgbPFq04+qYoz9K7/EBngKl5gPJOpuVoVrmdomCI3eVfK+E3cwELvwVQtwtDssLbjJxDjdY7qD9yhcrHGjOWI7G+M1E7SGLaI+4UJw5QGhfyGWpTdzvPAdJ2W3UYBQdorlAlKkHW7hAoyFX2aQep3uWV9Mc1R9/aKcMpj0QOhXvoE/QxZVobmnpd3PEbBh2ixvQPImpJon7M8RSIzZIT5RDZ8/93b2BDF2IFrWboJrPfA0mrohnLTTmpRHt3ks2fi5cJG4rcr9qJBbcrEVbka+9T3s6yTPR;4:efnAGqBJrueVqByQ0u4YNa9euQT09rME+hI/DuxdpSxNhs8rOLWmu4NdZoYLdbFx8etgVH+oJYO8q+Ngfs8qOGKJDPbwcvUTEpgxolXtTB83fSKkfHjJlM0nYkQ6KxUTPMKKP+tRIXiATcGeLj5dZ3qNtT1+UpbFQV2VxLJWMbdF2gHkUY4M9PyVQRzh23u53XFh0w7R7OOuVPO78B7TcT4jIxsJTXspsNZuzO0Dbl+BU131tE/lERPN3a+OQCKB8qsarQRfoZ64vZPnx34o69yNZDxU7H5J+nO5pdu7xDCqtH4o8m9oh+YO8PBsZMEth341zcUXB1IGadTW21RjUtTeTZZ1PHw5wpRpDVXGQUXK0qX/dxe5d3yWtClbuD7nXScL7uGviKKrUciqpYJvWDBQAa/cQL8wqaPocjwXF64cqpRB8+rVvRGEmUlmw383 X-Microsoft-Antispam-PRVS: X-Exchange-Antispam-Report-Test: UriScan:(190756311086443); X-Exchange-Antispam-Report-CFA-Test: BCL:0;PCL:0;RULEID:(601004)(2401047)(5005006)(8121501046)(10201501046)(3002001)(6055026);SRVR:CS1PR84MB0312;BCL:0;PCL:0;RULEID:;SRVR:CS1PR84MB0312; X-Forefront-PRVS: 00064751B6 X-Forefront-Antispam-Report: SFV:NSPM;SFS:(10019020)(4630300001)(6049001)(6009001)(7916002)(24454002)(189002)(199003)(377454003)(33964002)(586003)(4326007)(110136002)(101416001)(65816999)(117156001)(54356999)(42186005)(80316001)(59896002)(97736004)(36756003)(76176999)(83506001)(6116002)(68736007)(87266999)(47776003)(3846002)(50986999)(92566002)(2906002)(86362001)(65806001)(93886004)(105586002)(4001350100001)(50466002)(8676002)(23756003)(2950100001)(77096005)(66066001)(81166006)(64126003)(189998001)(230700001)(33656002)(305945005)(7846002)(7736002)(106356001)(65956001)(81156014);DIR:OUT;SFP:1102;SCL:1;SRVR:CS1PR84MB0312;H:[192.168.142.169];FPR:;SPF:None;PTR:InfoNoRecords;MX:1;A:1;LANG:en; X-Microsoft-Exchange-Diagnostics: =?iso-8859-1?Q?1;CS1PR84MB0312;23:E73fp4V5/GJnTjx47+az1/iYgDrqmvEWbUpaCif?= =?iso-8859-1?Q?zKuuXcs8mkyfZTc1LOwVBs+9phpdqDicM0D1K8uh/NvH1EnsZ1C/AnN+rN?= =?iso-8859-1?Q?IrZ2GQWGWur0a1sEG8UNducGmhMUa0+a6U0jhdfhu+UnU9wPVzcDxAh9jF?= =?iso-8859-1?Q?0wbmfCXV8SxuF+tj6iXcqEixDQF5hmxg+tABw1Z8vSOGpTErVmWTy7n632?= =?iso-8859-1?Q?8fGlJu0BJJGa9PPkN+YMyDcljFITlnL3RGv4d87C8elRsS3Rdk/e22P1gn?= =?iso-8859-1?Q?9S5grKAxa4T4Q+vPN36RhYK66Sfu+yhJVoLmKwUPXs6Np2cFdebzOMPWkz?= =?iso-8859-1?Q?529notUyrG2P7inJUnSJjQNE/vkL2MSq+v0ZI0D4YXY77DBlax0jOSn8+U?= =?iso-8859-1?Q?CBjkcjWzH1QEAs9rkNXHLvzPAID9pu77yb8aP9NmHe/3mmlGOeWUw5Qkp0?= =?iso-8859-1?Q?tzrrxBkwq6ZVkXtEcmbOh+mf2KuejDmRW6L5fqIKvq5ClAqmxsrQSKN//6?= =?iso-8859-1?Q?GbX4Neo+Ddlt7ZdkafIcu6+RIIydvf3z+seiuG3k3kVWz71MWV6RAZwEWt?= =?iso-8859-1?Q?ai8uV1QqD1lqgTdJjwByY/H5WmUZ86mMKAsMmg501sf+0DdLWX57fn6Q69?= =?iso-8859-1?Q?liBhy6kUX6216b4CcyMQz7k1CoBqSJiNcwd1EzobZXuz5WDgg3H9IjV8+O?= =?iso-8859-1?Q?aVS4mzxMEwzl8IJV6MTk9zv/v4AtvPpI1gJZrIeUnON4O9WJPTvsTibX0x?= =?iso-8859-1?Q?+yYxe+9BBSek0qogufFpyUODlLew+YIqtZuN/d3rgVuiKM9a4tG86DgOAJ?= =?iso-8859-1?Q?zswDyj6ns8SO4wwUgYAIepuc4JxOr/Srm5MOmbYJ77Jhz7R6BtvBZzG0zC?= =?iso-8859-1?Q?IrYgppMfuufo07BYh4G1GCO1ebJVXlsnT1kWW9vq6h/om8lEfSgIh0XETI?= =?iso-8859-1?Q?vOSBu5/tHWhiMbiaQmZgzBqLxReMNFxJ1C/EqdyGOawrvJRiGqsDJoN/XG?= =?iso-8859-1?Q?WAMhg6BubV1jLiVMpMm2LsIAlrhrjN6D+Y+O3NDZ7NouN6BxnGBzq1LPpb?= =?iso-8859-1?Q?wbW+wQJzZaZvOlkcRMh64vPnSynWxsaQ4kdgEx6g6FF/eEd9Q3GNM8iMhm?= =?iso-8859-1?Q?9GEWmoXOvMjcFx+Yh2amK4JC0gC6CTktLIcqLt1ut0de4OTqt5i+wzypBU?= =?iso-8859-1?Q?nVPG1DvYPmLMTnmulRPdCMQob7ncG4/MQ5aHibnK5+6TGHyHWPT7UbzELc?= =?iso-8859-1?Q?CK9QFqp0xwKO4hOF6E3L1YScqwSz7A+GeNVT0YO0iKVZPJVx9O/hwDAEQo?= =?iso-8859-1?Q?DkM3lwfNDxPcF+Ed7RfwsNTUtoXOLsZbySf+cmt7ePkuw=3D=3D?= X-Microsoft-Exchange-Diagnostics: 1;CS1PR84MB0312;6:ijCaLcp403+dR7fmrDnnUnGsst59lyyxsDEmMp4JHIVs0g9O0isv9y/prm1bcsQa86jSn58wx9RBtKBazI9Zm2RmEKss4cAHhi2zXswZ4TrdeCq49eGxxkYQf+TmMS3rvtFP+Q5Ypgm9L46tfOwNv6RbOoVgHv6LF8GqJNSXZ2wT6KaMIsF5qheIrTryIK6SsVjvgZYAj8DQjCqclgZbGGYGsdlrfznYNY1e8eU3tf5oQGpHTpbRISQoqI6LLSqY5UVn1wBgItnQKVZlnPBdOuGri67oYVJQZOAHPTUHBQ6lpYs6xAsB9GbfRKsZyw7zC78kp1FuAba51iIWkorifw==;5:GDLntqyaB9rhW7yEviJ/BaB8BMYVBjMsylywDbt8MsOu7a6rbKvjNI+3hyArpXuOsGjpbryINyddQMEEzhZw3KEOspqISQIXGFzJEd6BhSN2fS2px6f9iHSM8S3cDmgyhLqNTVSC8RHSr0Mcqcj9/Q==;24:VjvevZ5BXTZK0LxMFB6yvhchjn86pqa2/5pm84/mTKx3gybMS8yX9kEuz0ppZdHQu+hrRnDcyLivoyvQxSaQRiQ0VlcmNy0z4PbOhojUoJA=;7:R/T0k6VL7jbc7eXgCDuYdSrW3GAoclYuDvnkRNVGl/4UCONQAbTvuGF7gTHCJ98yOfkoDhP9NkPmJ0OoYA7QNqX/sPVol5QOFZrJekF+FFQhzBsgmIt2VESA6sT358a2+sT9n9AQDtaswlzD7WMXAS4NN6yy94t6NczoMZ/xTJEL4Z9ZD1EtAcOus9TI+OQ7Vm2/fzLegBIePquTiuw4p8W0Tnz9qaFGzF2rlQY1qCI982GS5qmBxVKlwPg11htp SpamDiagnosticOutput: 1:99 SpamDiagnosticMetadata: NSPM X-OriginatorOrg: hpe.com X-MS-Exchange-CrossTenant-OriginalArrivalTime: 17 Jul 2016 22:52:35.1058 (UTC) X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted X-MS-Exchange-Transport-CrossTenantHeadersStamped: CS1PR84MB0312 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On 07/15/2016 12:35 PM, Peter Zijlstra wrote: > On Fri, Jul 15, 2016 at 12:07:03PM +0200, Peter Zijlstra wrote: >>> So if we are kicked by the unlock_slowpath, and the lock is stealed by >>> someone else, we need hash its node again and set l->locked to >>> _Q_SLOW_VAL, then enter pv_wait. >> Right, let me go think about this a bit. > Urgh, brain hurt. > > So I _think_ the below does for it but I could easily have missed yet > another case. > > Waiman's patch has the problem that it can have two pv_hash() calls for > the same lock in progress and I'm thinking that means we can hit the > BUG() in pv_hash() due to that. > > If we can't, it still has a problem because its not telling us either. > > > > --- a/kernel/locking/qspinlock_paravirt.h > +++ b/kernel/locking/qspinlock_paravirt.h > @@ -20,7 +20,8 @@ > * native_queued_spin_unlock(). > */ > > -#define _Q_SLOW_VAL (3U<< _Q_LOCKED_OFFSET) > +#define _Q_HASH_VAL (3U<< _Q_LOCKED_OFFSET) > +#define _Q_SLOW_VAL (7U<< _Q_LOCKED_OFFSET) The hash state is a transient state. It is not obvious until I read through the code. Maybe some comments on allowable state transition will help. > > /* > * Queue Node Adaptive Spinning > @@ -36,14 +37,11 @@ > */ > #define PV_PREV_CHECK_MASK 0xff > > -/* > - * Queue node uses: vcpu_running& vcpu_halted. > - * Queue head uses: vcpu_running& vcpu_hashed. > - */ > enum vcpu_state { > - vcpu_running = 0, > - vcpu_halted, /* Used only in pv_wait_node */ > - vcpu_hashed, /* = pv_hash'ed + vcpu_halted */ > + vcpu_node_running = 0, > + vcpu_node_halted, > + vcpu_head_running, > + vcpu_head_halted, > }; > > struct pv_node { > @@ -263,7 +261,7 @@ pv_wait_early(struct pv_node *prev, int > if ((loop& PV_PREV_CHECK_MASK) != 0) > return false; > > - return READ_ONCE(prev->state) != vcpu_running; > + return READ_ONCE(prev->state)& 1; > } > This relies on the implicit ordering of the enum vcpu_state variable. I think we need some warning above that all the halt states must have bit 0 set and running states have that bit cleared. We would like to make sure that any future changes in vcpu_state won't affect that rule. > /* > @@ -311,20 +309,19 @@ static void pv_wait_node(struct mcs_spin > * > * Matches the cmpxchg() from pv_kick_node(). > */ > - smp_store_mb(pn->state, vcpu_halted); > + smp_store_mb(pn->state, vcpu_node_halted); > > - if (!READ_ONCE(node->locked)) { > - qstat_inc(qstat_pv_wait_node, true); > - qstat_inc(qstat_pv_wait_early, wait_early); > - pv_wait(&pn->state, vcpu_halted); > - } > + if (READ_ONCE(node->locked)) > + return; > + > + qstat_inc(qstat_pv_wait_node, true); > + qstat_inc(qstat_pv_wait_early, wait_early); > + pv_wait(&pn->state, vcpu_node_halted); > > /* > - * If pv_kick_node() changed us to vcpu_hashed, retain that > - * value so that pv_wait_head_or_lock() knows to not also try > - * to hash this lock. > + * If pv_kick_node() advanced us, retain that state. > */ > - cmpxchg(&pn->state, vcpu_halted, vcpu_running); > + cmpxchg(&pn->state, vcpu_node_halted, vcpu_node_running); > > /* > * If the locked flag is still not set after wakeup, it is a > @@ -362,18 +359,17 @@ static void pv_kick_node(struct qspinloc > * > * Matches with smp_store_mb() and cmpxchg() in pv_wait_node() > */ > - if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted) > + if (cmpxchg(&pn->state, vcpu_node_halted, vcpu_head_running) != vcpu_node_halted) > return; > > /* > - * Put the lock into the hash table and set the _Q_SLOW_VAL. > - * > - * As this is the same vCPU that will check the _Q_SLOW_VAL value and > - * the hash table later on at unlock time, no atomic instruction is > - * needed. > + * See pv_wait_head_or_lock(). We have to hash and force the unlock > + * into the slow path to deliver the actual kick for waking. > */ > - WRITE_ONCE(l->locked, _Q_SLOW_VAL); > - (void)pv_hash(lock, pn); > + if (cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_HASH_VAL) == _Q_LOCKED_VAL) { > + (void)pv_hash(lock, pn); > + smp_store_release(&l->locked, _Q_SLOW_VAL); > + } > } > > /* > @@ -388,28 +384,22 @@ pv_wait_head_or_lock(struct qspinlock *l > { > struct pv_node *pn = (struct pv_node *)node; > struct __qspinlock *l = (void *)lock; > - struct qspinlock **lp = NULL; > int waitcnt = 0; > int loop; > > /* > - * If pv_kick_node() already advanced our state, we don't need to > - * insert ourselves into the hash table anymore. > - */ > - if (READ_ONCE(pn->state) == vcpu_hashed) > - lp = (struct qspinlock **)1; > - > - /* > * Tracking # of slowpath locking operations > */ > qstat_inc(qstat_pv_lock_slowpath, true); > > for (;; waitcnt++) { > + u8 locked; > + > /* > * Set correct vCPU state to be used by queue node wait-early > * mechanism. > */ > - WRITE_ONCE(pn->state, vcpu_running); > + WRITE_ONCE(pn->state, vcpu_head_running); > > /* > * Set the pending bit in the active lock spinning loop to > @@ -423,33 +413,38 @@ pv_wait_head_or_lock(struct qspinlock *l > } > clear_pending(lock); > > + /* > + * We want to go sleep; ensure we're hashed so that > + * __pv_queued_spin_unlock_slow() can find us for a wakeup. > + */ > + locked = cmpxchg(&l->locked, _Q_LOCKED_VAL, _Q_HASH_VAL); > + switch (locked) { > + /* > + * We're not hashed yet, either we're fresh from pv_wait_node() > + * or __pv_queued_spin_unlock_slow() unhashed us but we lost > + * the trylock to a steal and have to re-hash. > + */ > + case _Q_LOCKED_VAL: > + (void)pv_hash(lock, pn); > + smp_store_release(&l->locked, _Q_SLOW_VAL); > + break; > > - if (!lp) { /* ONCE */ > - lp = pv_hash(lock, pn); > + /* > + * pv_kick_node() is hashing us, wait for it. > + */ > + case _Q_HASH_VAL: > + while (READ_ONCE(l->locked) == _Q_HASH_VAL) > + cpu_relax(); > + break; > > - /* > - * We must hash before setting _Q_SLOW_VAL, such that > - * when we observe _Q_SLOW_VAL in __pv_queued_spin_unlock() > - * we'll be sure to be able to observe our hash entry. > - * > - * [S] [Rmw] l->locked == _Q_SLOW_VAL > - * MB RMB > - * [RmW] l->locked = _Q_SLOW_VAL [L] > - * > - * Matches the smp_rmb() in __pv_queued_spin_unlock(). > - */ > - if (xchg(&l->locked, _Q_SLOW_VAL) == 0) { > - /* > - * The lock was free and now we own the lock. > - * Change the lock value back to _Q_LOCKED_VAL > - * and unhash the table. > - */ > - WRITE_ONCE(l->locked, _Q_LOCKED_VAL); > - WRITE_ONCE(*lp, NULL); > - goto gotlock; > - } > + /* > + * Ooh, unlocked, try and grab it. > + */ > + case 0: > + continue; > } > - WRITE_ONCE(pn->state, vcpu_hashed); > + > + WRITE_ONCE(pn->state, vcpu_head_halted); > qstat_inc(qstat_pv_wait_head, true); > qstat_inc(qstat_pv_wait_again, waitcnt); > pv_wait(&l->locked, _Q_SLOW_VAL); > @@ -480,7 +475,7 @@ __pv_queued_spin_unlock_slowpath(struct > struct __qspinlock *l = (void *)lock; > struct pv_node *node; > > - if (unlikely(locked != _Q_SLOW_VAL)) { > + if (unlikely(locked != _Q_SLOW_VAL&& locked != _Q_HASH_VAL)) { > WARN(!debug_locks_silent, > "pvqspinlock: lock 0x%lx has corrupted value 0x%x!\n", > (unsigned long)lock, atomic_read(&lock->val)); > @@ -488,18 +483,17 @@ __pv_queued_spin_unlock_slowpath(struct > } > > /* > - * A failed cmpxchg doesn't provide any memory-ordering guarantees, > - * so we need a barrier to order the read of the node data in > - * pv_unhash *after* we've read the lock being _Q_SLOW_VAL. > - * > - * Matches the cmpxchg() in pv_wait_head_or_lock() setting _Q_SLOW_VAL. > + * Wait until the hash-bucket is complete. > */ > - smp_rmb(); > + while (READ_ONCE(l->locked) == _Q_HASH_VAL) > + cpu_relax(); > > /* > - * Since the above failed to release, this must be the SLOW path. > - * Therefore start by looking up the blocked node and unhashing it. > + * Must first observe _Q_SLOW_VAL in order to observe > + * consistent hash bucket. > */ > + smp_rmb(); > + > node = pv_unhash(lock); > > /* I like your patch. Other than a bit more comment to clarify thing, I think it is a good one. Cheers, Longman