Date: Thu, 21 Feb 2013 10:55:19 +1100
From: NeilBrown
To: bstroesser@ts.fujitsu.com
Cc: bfields@fieldses.org, linux-nfs@vger.kernel.org
Subject: Re: [PATCH] sunrpc.ko: RPC cache fix races
Message-ID: <20130221105519.5f46d3e7@notabene.brown>

On 20 Feb 2013 14:57:07 +0100 bstroesser@ts.fujitsu.com wrote:

> On 20 Feb 2013 04:09:00 +0100 neilb@suse.de wrote:
>
> > On 19 Feb 2013 18:08:40 +0100 bstroesser@ts.fujitsu.com wrote:
> >
> > > Second attempt using the correct FROM. Sorry for the noise.
> > >
> > >
> > > Hi,
> > >
> > > I found a problem in sunrpc.ko on a SLES11 SP1 (2.6.32.59-0.7.1) and
> > > fixed it (see first patch for 2.6.32.60 below).
> > > For us the patch works fine (tested on 2.6.32.59-0.7.1).
> > >
> > > AFAICS from the code, the problem seems to persist in current kernel
> > > versions also. Thus, I added the second patch for 3.7.9.
> > > As the setup to reproduce the problem is quite complex, I couldn't
> > > test the second patch yet. So consider this one as a RFC.
> > >
> > > Best regards,
> > > Bodo
> > >
> > > Please CC me, I'm not on the list.
> > >
> > > =========================================
> > > From: Bodo Stroesser
> > > Date: Fri, 08 Feb 2013
> > > Subject: [PATCH] net: sunrpc: fix races in RPC cache
> > >
> > > We found the problem and tested the patch on a SLES11 SP1
> > > 2.6.32.59-0.7.1
> > >
> > > This patch applies to linux-2.6.32.60 (changed monotonic_seconds -->
> > > get_seconds())
> > >
> > > Sporadically NFS3 RPC requests to the nfs server are dropped due to
> > > cache_check() (net/sunrpc/cache.c) returning -ETIMEDOUT for an entry
> > > of the "auth.unix.gid" cache.
> > > In this case, no NFS reply is sent to the client.
> > >
> > > The reason for the dropped requests are races in cache_check() when
> > > cache_make_upcall() returns -EINVAL (because it is called for a cache
> > > without readers) and cache_check() therefore refreshes the cache entry
> > > (rv == -EAGAIN).
> > >
> > > There are three details that need to be changed:
> > > 1) cache_revisit_request() must not be called before cache_fresh_locked()
> > >    has updated the cache entry, as cache_revisit_request() wakes up
> > >    threads waiting for the cache entry to be updated.
> >
> > This certainly seems correct. It is wrong to call cache_revisit_request() so early.
> >
> > >    The explicit call to cache_revisit_request() is not needed, as
> > >    cache_fresh_unlocked() calls it anyway.
> >
> > But cache_fresh_unlocked is only called if "rv == -EAGAIN", however we also need to call it in the case where "age > refresh_age/2" - it must always be called after clearing CACHE_PENDING.
> >
> > Also, cache_fresh_unlocked() only calls cache_revisit_request() if CACHE_PENDING is set, but we have just cleared it! So there is definitely something wrong here.
> > (Note that I'm looking at the SLES 2.6.32 code at the moment, mainline is a bit different).
> >
> >
> > >    (But in case of not updating the cache entry, cache_revisit_request()
> > >    must be called. Thus, we use a fall through in the "case").
> >
> > Hmm... I don't like case fallthroughs unless they have nice big comments:
> >    /* FALLTHROUGH */
> > or similar. :-)
> >
> > > 2) CACHE_PENDING must not be cleared before cache_fresh_locked() has
> > >    updated the cache entry, as cache_defer_req() can return without really
> > >    sleeping if it detects CACHE_PENDING unset.
> >
> > Agreed. So we should leave the clearing of CACHE_PENDING to cache_fresh_unlocked().
> >
> >
> > >    (In case of not updating the cache entry again we use the fall
> > >    through)
> > > 3) Imagine a thread that calls cache_check() and gets rv = -ENOENT from
> > >    cache_is_valid(). Then it sets CACHE_PENDING and calls
> > >    cache_make_upcall().
> > >    We assume that meanwhile get_seconds() advances to the next
> > >    sec. and a second thread also calls cache_check(). It gets -EAGAIN from
> > >    cache_is_valid() for the same cache entry. As CACHE_PENDING still is
> > >    set, it calls cache_defer_req() immediately and waits for a wakeup from
> > >    the first thread.
> > >    After cache_make_upcall() returned -EINVAL, the first thread does not
> > >    update the cache entry as it had got rv = -ENOENT, but wakes up the
> > >    second thread by calling cache_revisit_request().
> > >    Thread two wakes up, calls cache_is_valid() and again gets -EAGAIN.
> > >    Thus, the result of the second cache_check() is -ETIMEDOUT and the
> > >    NFS request is dropped.
> >
> > Yep, that's not so good....
> >
> >
> > >    To solve this, the cache entry now is updated not only if rv == -EAGAIN
> > >    but if rv == -ENOENT also. This is a sufficient workaround, as the
> > >    first thread would have to stay in cache_check() between its call to
> > >    cache_is_valid() and clearing CACHE_PENDING for more than 60 seconds
> > >    to break the workaround.
> >
> > Still, it isn't nice to just have a work-around. It would be best to have a fix.
> > The key problem here is that cache_is_valid() is time-sensitive. This has been addressed in mainline - cache_is_valid doesn't depend on the current time there.
>
> So, the solution would be a backport of the current mainline code ...

That would always be my preference, when it is practical.

>
> Anyway, I think for SLES11 SP1 and 2.6.32.60 the work-around would be sufficient.
> BTW: it has the positive side effect, that - while a cache entry is in its second
> half of life - no longer each cache_check() tries to do a cache_make_upcall().

I agree that is an improvement.

>
>
> >
> >
> > >
> > > Signed-off-by: Bodo Stroesser
> > > ---
> > >
> > > --- a/net/sunrpc/cache.c 2012-08-08 21:35:09.000000000 +0200
> > > +++ b/net/sunrpc/cache.c 2013-02-08 14:29:41.000000000 +0100
> > > @@ -233,15 +233,14 @@ int cache_check(struct cache_detail *det
> > >          if (!test_and_set_bit(CACHE_PENDING, &h->flags)) {
> > >              switch (cache_make_upcall(detail, h)) {
> > >              case -EINVAL:
> > > -                clear_bit(CACHE_PENDING, &h->flags);
> > > -                cache_revisit_request(h);
> > > -                if (rv == -EAGAIN) {
> > > +                if (rv == -EAGAIN || rv == -ENOENT) {
> > >                      set_bit(CACHE_NEGATIVE, &h->flags);
> > >                      cache_fresh_locked(h, get_seconds()+CACHE_NEW_EXPIRY);
> > > +                    clear_bit(CACHE_PENDING, &h->flags);
> > >                      cache_fresh_unlocked(h, detail);
> > >                      rv = -ENOENT;
> > > +                    break;
> > >                  }
> > > -                break;
> > >
> > >              case -EAGAIN:
> > >                  clear_bit(CACHE_PENDING, &h->flags);
> >
> > I agree with some of this....
> > Maybe:
> >
> >     switch(cache_make_upcall(detail, h)) {
> >     case -EINVAL:
> >         if (rv) {
> >             set_bit(CACHE_NEGATIVE, &h->flags);
> >             cache_fresh_locked(h, get_seconds() + CACHE_NEW_EXPIRY);
> >             rv = -ENOENT;
> >         }
> >         /* FALLTHROUGH */
> >     case -EAGAIN:
> >         cache_fresh_unlocked(h, detail);
> >     }
>
> I agree, your patch is obviously better than mine.
> But let me suggest one little change: I would like to substitute
> cache_fresh_unlocked() by clear_bit() and cache_revisit_request(),
> as the call to cache_dequeue() in cache_fresh_unlocked() seems to
> be obsolete here:

It is exactly this sort of thinking (on my part) that got us into this mess
in the first place. I reasoned that the full locking/testing/whatever wasn't
necessary and took a short cut. It wasn't a good idea.

Given that this is obviously difficult code to get right, we should make it
as easy to review as possible. Having "cache_fresh_unlocked" makes it more
obviously correct, and that is a good thing.
Maybe cache_dequeue isn't needed here, but it won't hurt so I'd much rather
have the clearer code.

In fact, I'd also like to change

    if (test_and_clear_bit(CACHE_PENDING, &ch->flags))
        cache_dequeue(current_detail, ch);
    cache_revisit_request(ch);

near the end of cache_clean to call cache_fresh_unlocked().

NeilBrown