From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from cn.fujitsu.com ([222.73.24.84]:19320 "EHLO song.cn.fujitsu.com" rhost-flags-OK-FAIL-OK-OK) by vger.kernel.org with ESMTP id S1757464Ab3AQEM3 (ORCPT ); Wed, 16 Jan 2013 23:12:29 -0500 Message-ID: <50F77A50.1000109@cn.fujitsu.com> Date: Thu, 17 Jan 2013 12:13:04 +0800 From: Miao Xie Reply-To: miaox@cn.fujitsu.com MIME-Version: 1.0 To: dsterba@suse.cz CC: Linux Btrfs Subject: Re: [PATCH 05/11] Btrfs: protect fs_info->alloc_start References: <50EEB880.4060308@cn.fujitsu.com> <20130110171040.GG20089@twin.jikos.cz> <50F3BC2B.2010407@cn.fujitsu.com> <20130116124300.GR20089@twin.jikos.cz> In-Reply-To: <20130116124300.GR20089@twin.jikos.cz> Content-Type: multipart/mixed; boundary="------------090806070006090203080508" Sender: linux-btrfs-owner@vger.kernel.org List-ID: --------------090806070006090203080508 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: quoted-printable On wed, 16 Jan 2013 13:43:00 +0100, David Sterba wrote: > On Mon, Jan 14, 2013 at 04:04:59PM +0800, Miao Xie wrote: >> On Thu, 10 Jan 2013 18:10:40 +0100, David Sterba wrote: >>> On Thu, Jan 10, 2013 at 08:48:00PM +0800, Miao Xie wrote: >>>> fs=5Finfo->alloc=5Fstart was not protected strictly, it might be chang= ed while >>>> we were accessing it. This patch fixes this problem by using two locks= to >>>> protect it (fs=5Finfo->chunk=5Fmutex and sb->s=5Fumount). On the read = side, we >>>> just need get one of these two locks, and on the write side, we must l= ock >>>> all of them. >>> >>> Can you please describe how this can theoretically cause any problems? >>> >>> Out of curiosity I've been running this command on a 60G ssd disk for a= while >>> >>> for i in `seq 1 10000|shuf`;do >>> mount -o remount,alloc=5Fstart=3D$(($i*1024*1024)) /mnt/sdc;done >>> #sleep 0 / 1 / 5 >>> done >>> >>> together with one of my favourite stresstests (heavy writes, subvolume >>> activity). >>> >>> There are two direct users of fs=5Finfo->alloc=5Fstart: >>> >>> * statfs via btrfs=5Fcalc=5Favail=5Fdata=5Fspace >>> * find=5Ffree=5Fdev=5Fextent >>> >>> 960 search=5Fstart =3D max(root->fs=5Finfo->alloc=5Fstart, 1024= ull * 1024); >>> >>> as remount calls sync, there is a very tiny window where an allocation = could >>> get the old value of alloc=5Fstart just between sync and do=5Fremount. = Theoretical >>> and without any visible effect. >> >> ->alloc=5Fstart is a 64bits variant, on the 32bits machine, we have to l= oad it >> by two instructions. >> >> Assuming -> alloc=5Fstart is 0x00010000 at the beginning, then we remoun= t and ^ it should be 0x0000 0001 0000 0000 >> set ->alloc=5Fstart to 0x00000100 ^ should be 0x0000 0000 0001 0000 >> Task0 Task1 >> load low 32 bits >> set high 32 bits >> load high 32 bits >> set low 32 bits >> >> Task1 will get 0. >=20 > That's an insufficient description of how a race could actually happen, > it's only a high-level scheme how a 64bit value can get mixed up with > 32bit access. >=20 > How exactly does it cause problems in btrfs code? >=20 > IMO the bug is not there due to various reasons like instruction > scheduling and cache effect that may hide the actual split of the 32bit > value. >=20 > * the value set on one cpu is not immediatelly visible on another cpu > * the full 64bit value of alloc=5Fstart lives inside a single cacheline > and will be updated as a whole in context of btrfs -- there's only one I know it lives inside a single cacheline, but only the cacheline will be u= pdated as a whole, not the 64bit value. And cacheline is not a lock, it can not pr= event that it is acquired by the other CPU when only a half is updated. Right? > place where it's set so multiple places that would store the value and > fight over the cacheline exclusivity are out of question; > the storing cpu will flush the pending writes at some point in time > so they're visible by other cpus, until then the load side reads a > consistent value -- in almost all cases, and this gets even more wild > and counts on exact timing of a memory barriers triggered from other > cpu and cacheline ownership, so the store finishes only one half, > barrier, load finds half new and half old value, store does the other > half > * this needs to happen within 4-5 instructions for an operation that > involves a IO sync -- makes it much harder to provoke right timing >=20 > Your example with 0x00010000 and 0x00000100 uses only 32bit values, so My mistake, sorry. The number we want to use is 64bit(see above). > it cannot be applied here. Also, alloc=5Fstart is always compared with the > hardcoded 1M limit so it cannot go below to 0. 0 is just a sample. The point is we should not allocate a extent from a une= xpected address. > The point is not the one extra mutex lock, but that you're attempting to > fix a bug highly improbable if possible at all without a proper On 32bit machines, it is not highly improbable, we can trigger this bug eas= ily (The attached code which simulates the process of the btrfs can reproduce t= he bug). This problem don't happen on btrfs is because: 1. Most of the users use btrfs on 64bit machine 2. Changing alloc=5Fstart is infrequent=EF=BC=8Cand chunk allocation is not= frequent. So the race is hard to be triggered on btrfs. But it doesn't means it can not happen. > description and analysis. I don't want to spend the time on this and do > the work for instead of you next time, please try harder to convince us. Miao = --------------090806070006090203080508 Content-Transfer-Encoding: 7bit Content-Type: text/x-csrc; name="test.c" Content-Disposition: attachment; filename="test.c" #include #include #include unsigned long long global = 0x00010000; #define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x)) void *thread_set_function(void *arg); void *thread_read_function(void *arg); int main(int argc, char **argv) { int ret; pthread_t pthread_set; pthread_t pthread_read; ret = pthread_create(&pthread_set, NULL, thread_set_function, NULL); if (ret) { fprintf(stderr, "ERROR: fail to create set_thread\n"); exit(EXIT_FAILURE); } ret = pthread_create(&pthread_read, NULL, thread_read_function, NULL); if (ret) { fprintf(stderr, "ERROR: fail to create read_thread\n"); exit(EXIT_FAILURE); } pthread_join(pthread_set, NULL); pthread_join(pthread_read, NULL); return 0; } void *thread_set_function(void *arg) { sleep(1); while(1) { ACCESS_ONCE(global) = 0x0000000000000100; ACCESS_ONCE(global) = 0x0010000000000000; } } void *thread_read_function(void *arg) { if (global) printf("no 0 \n"); while(1) { if (global == 0) { printf("0 is detected\n"); exit(0); } } } --------------090806070006090203080508--