stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Please include following patches to 4.4/4.9/4.14
@ 2018-05-21  7:18 Nikolay Borisov
  2018-05-21  8:24 ` Greg KH
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Nikolay Borisov @ 2018-05-21  7:18 UTC (permalink / raw)
  To: stable

[-- Attachment #1: Type: text/plain, Size: 194 bytes --]

Hello Greg,

Here are trivial backports for upstream commit, which is tagged as stable:

02a3307aa9c2 ("btrfs: fix reading stale metadata blocks after degraded
raid1 mounts")


Regards,
Nikolay

[-- Attachment #2: 0001-btrfs-fix-reading-stale-metadata-blocks-after-degrad.patch-44 --]
[-- Type: text/plain, Size: 3449 bytes --]

From dd19d5d5c7b9f839d3c179f5b3494c461f386e5a Mon Sep 17 00:00:00 2001
From: Liu Bo <bo.liu@linux.alibaba.com>
Date: Wed, 16 May 2018 01:37:36 +0800
Subject: [PATCH] btrfs: fix reading stale metadata blocks after degraded raid1
 mounts

If a btree block, aka. extent buffer, is not available in the extent
buffer cache, it'll be read out from the disk instead, i.e.

btrfs_search_slot()
  read_block_for_search()  # hold parent and its lock, go to read child
    btrfs_release_path()
    read_tree_block()  # read child

Unfortunately, the parent lock got released before reading child, so
commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race") had
used 0 as parent transid to read the child block.  It forces
read_tree_block() not to check if parent transid is different with the
generation id of the child that it reads out from disk.

A simple PoC is included in btrfs/124,

0. A two-disk raid1 btrfs,

1. Right after mkfs.btrfs, block A is allocated to be device tree's root.

2. Mount this filesystem and put it in use, after a while, device tree's
   root got COW but block A hasn't been allocated/overwritten yet.

3. Umount it and reload the btrfs module to remove both disks from the
   global @fs_devices list.

4. mount -odegraded dev1 and write some data, so now block A is allocated
   to be a leaf in checksum tree.  Note that only dev1 has the latest
   metadata of this filesystem.

5. Umount it and mount it again normally (with both disks), since raid1
   can pick up one disk by the writer task's pid, if btrfs_search_slot()
   needs to read block A, dev2 which does NOT have the latest metadata
   might be read for block A, then we got a stale block A.

6. As parent transid is not checked, block A is marked as uptodate and
   put into the extent buffer cache, so the future search won't bother
   to read disk again, which means it'll make changes on this stale
   one and make it dirty and flush it onto disk.

To avoid the problem, parent transid needs to be passed to
read_tree_block().

In order to get a valid parent transid, we need to hold the parent's
lock until finishing reading child.

This patch needs to be slightly adapted for stable kernels, the
&first_key parameter added to read_tree_block() is from 4.16+
(581c1760415c4). The fix is to replace 0 by 'gen'.

Fixes: 5bdd3536cbbe ("Btrfs: Fix block generation verification race")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/ctree.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 0f2b7c622ce3..eebb835db673 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2497,10 +2497,8 @@ read_block_for_search(struct btrfs_trans_handle *trans,
 	if (p->reada)
 		reada_for_search(root, p, level, slot, key->objectid);
 
-	btrfs_release_path(p);
-
 	ret = -EAGAIN;
-	tmp = read_tree_block(root, blocknr, 0);
+	tmp = read_tree_block(fs_info, blocknr, gen);
 	if (!IS_ERR(tmp)) {
 		/*
 		 * If the read above didn't mark this buffer up to date,
@@ -2512,6 +2510,8 @@ read_block_for_search(struct btrfs_trans_handle *trans,
 			ret = -EIO;
 		free_extent_buffer(tmp);
 	}
+
+	btrfs_release_path(p);
 	return ret;
 }
 
-- 
2.7.4


[-- Attachment #3: 0001-btrfs-fix-reading-stale-metadata-blocks-after-degrad.patch-49 --]
[-- Type: text/plain, Size: 3453 bytes --]

From bbbfbf4b39635872cde02ddb62a3b810f1d5cefe Mon Sep 17 00:00:00 2001
From: Liu Bo <bo.liu@linux.alibaba.com>
Date: Wed, 16 May 2018 01:37:36 +0800
Subject: [PATCH] btrfs: fix reading stale metadata blocks after degraded raid1
 mounts

If a btree block, aka. extent buffer, is not available in the extent
buffer cache, it'll be read out from the disk instead, i.e.

btrfs_search_slot()
  read_block_for_search()  # hold parent and its lock, go to read child
    btrfs_release_path()
    read_tree_block()  # read child

Unfortunately, the parent lock got released before reading child, so
commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race") had
used 0 as parent transid to read the child block.  It forces
read_tree_block() not to check if parent transid is different with the
generation id of the child that it reads out from disk.

A simple PoC is included in btrfs/124,

0. A two-disk raid1 btrfs,

1. Right after mkfs.btrfs, block A is allocated to be device tree's root.

2. Mount this filesystem and put it in use, after a while, device tree's
   root got COW but block A hasn't been allocated/overwritten yet.

3. Umount it and reload the btrfs module to remove both disks from the
   global @fs_devices list.

4. mount -odegraded dev1 and write some data, so now block A is allocated
   to be a leaf in checksum tree.  Note that only dev1 has the latest
   metadata of this filesystem.

5. Umount it and mount it again normally (with both disks), since raid1
   can pick up one disk by the writer task's pid, if btrfs_search_slot()
   needs to read block A, dev2 which does NOT have the latest metadata
   might be read for block A, then we got a stale block A.

6. As parent transid is not checked, block A is marked as uptodate and
   put into the extent buffer cache, so the future search won't bother
   to read disk again, which means it'll make changes on this stale
   one and make it dirty and flush it onto disk.

To avoid the problem, parent transid needs to be passed to
read_tree_block().

In order to get a valid parent transid, we need to hold the parent's
lock until finishing reading child.

This patch needs to be slightly adapted for stable kernels, the
&first_key parameter added to read_tree_block() is from 4.16+
(581c1760415c4). The fix is to replace 0 by 'gen'.

Fixes: 5bdd3536cbbe ("Btrfs: Fix block generation verification race")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/ctree.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index f6ba165d3f81..2c6d348bde55 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2486,10 +2486,8 @@ read_block_for_search(struct btrfs_trans_handle *trans,
 	if (p->reada != READA_NONE)
 		reada_for_search(root, p, level, slot, key->objectid);
 
-	btrfs_release_path(p);
-
 	ret = -EAGAIN;
-	tmp = read_tree_block(root, blocknr, 0);
+	tmp = read_tree_block(fs_info, blocknr, gen);
 	if (!IS_ERR(tmp)) {
 		/*
 		 * If the read above didn't mark this buffer up to date,
@@ -2503,6 +2501,8 @@ read_block_for_search(struct btrfs_trans_handle *trans,
 	} else {
 		ret = PTR_ERR(tmp);
 	}
+
+	btrfs_release_path(p);
 	return ret;
 }
 
-- 
2.7.4


[-- Attachment #4: 0001-btrfs-fix-reading-stale-metadata-blocks-after-degrad.patch-414 --]
[-- Type: text/plain, Size: 3485 bytes --]

From caaf1c9b942884ca5557079ecea8a34624eae936 Mon Sep 17 00:00:00 2001
From: Liu Bo <bo.liu@linux.alibaba.com>
Date: Wed, 16 May 2018 01:37:36 +0800
Subject: [PATCH] btrfs: fix reading stale metadata blocks after degraded raid1
 mounts

If a btree block, aka. extent buffer, is not available in the extent
buffer cache, it'll be read out from the disk instead, i.e.

btrfs_search_slot()
  read_block_for_search()  # hold parent and its lock, go to read child
    btrfs_release_path()
    read_tree_block()  # read child

Unfortunately, the parent lock got released before reading child, so
commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race") had
used 0 as parent transid to read the child block.  It forces
read_tree_block() not to check if parent transid is different with the
generation id of the child that it reads out from disk.

A simple PoC is included in btrfs/124,

0. A two-disk raid1 btrfs,

1. Right after mkfs.btrfs, block A is allocated to be device tree's root.

2. Mount this filesystem and put it in use, after a while, device tree's
   root got COW but block A hasn't been allocated/overwritten yet.

3. Umount it and reload the btrfs module to remove both disks from the
   global @fs_devices list.

4. mount -odegraded dev1 and write some data, so now block A is allocated
   to be a leaf in checksum tree.  Note that only dev1 has the latest
   metadata of this filesystem.

5. Umount it and mount it again normally (with both disks), since raid1
   can pick up one disk by the writer task's pid, if btrfs_search_slot()
   needs to read block A, dev2 which does NOT have the latest metadata
   might be read for block A, then we got a stale block A.

6. As parent transid is not checked, block A is marked as uptodate and
   put into the extent buffer cache, so the future search won't bother
   to read disk again, which means it'll make changes on this stale
   one and make it dirty and flush it onto disk.

To avoid the problem, parent transid needs to be passed to
read_tree_block().

In order to get a valid parent transid, we need to hold the parent's
lock until finishing reading child.

This patch needs to be slightly adapted for stable kernels, the
&first_key parameter added to read_tree_block() is from 4.16+
(581c1760415c4). The fix is to replace 0 by 'gen'.

Fixes: 5bdd3536cbbe ("Btrfs: Fix block generation verification race")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/ctree.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 21cc27509993..9620c5b5e53e 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2497,10 +2497,8 @@ read_block_for_search(struct btrfs_root *root, struct btrfs_path *p,
 	if (p->reada != READA_NONE)
 		reada_for_search(fs_info, p, level, slot, key->objectid);
 
-	btrfs_release_path(p);
-
 	ret = -EAGAIN;
-	tmp = read_tree_block(fs_info, blocknr, 0);
+	tmp = read_tree_block(fs_info, blocknr, gen);
 	if (!IS_ERR(tmp)) {
 		/*
 		 * If the read above didn't mark this buffer up to date,
@@ -2514,6 +2512,8 @@ read_block_for_search(struct btrfs_root *root, struct btrfs_path *p,
 	} else {
 		ret = PTR_ERR(tmp);
 	}
+
+	btrfs_release_path(p);
 	return ret;
 }
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: Please include following patches to 4.4/4.9/4.14
  2018-05-21  7:18 Please include following patches to 4.4/4.9/4.14 Nikolay Borisov
@ 2018-05-21  8:24 ` Greg KH
  2018-05-21  8:27   ` Nikolay Borisov
  2018-05-21  8:33 ` Greg KH
  2018-05-21  8:47 ` Greg KH
  2 siblings, 1 reply; 9+ messages in thread
From: Greg KH @ 2018-05-21  8:24 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: stable

On Mon, May 21, 2018 at 10:18:00AM +0300, Nikolay Borisov wrote:
> Hello Greg,
> 
> Here are trivial backports for upstream commit, which is tagged as stable:
> 
> 02a3307aa9c2 ("btrfs: fix reading stale metadata blocks after degraded
> raid1 mounts")

What about 4.16?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Please include following patches to 4.4/4.9/4.14
  2018-05-21  8:24 ` Greg KH
@ 2018-05-21  8:27   ` Nikolay Borisov
  2018-05-21  8:31     ` Greg KH
  0 siblings, 1 reply; 9+ messages in thread
From: Nikolay Borisov @ 2018-05-21  8:27 UTC (permalink / raw)
  To: Greg KH; +Cc: stable



On 21.05.2018 11:24, Greg KH wrote:
> On Mon, May 21, 2018 at 10:18:00AM +0300, Nikolay Borisov wrote:
>> Hello Greg,
>>
>> Here are trivial backports for upstream commit, which is tagged as stable:
>>
>> 02a3307aa9c2 ("btrfs: fix reading stale metadata blocks after degraded
>> raid1 mounts")
> 
> What about 4.16?

I only provided patches for LTS releases. In any case, the code in
question hasn't changed recently so the 4.14 backport should apply to 4.16.


> 
> thanks,
> 
> greg k-h
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Please include following patches to 4.4/4.9/4.14
  2018-05-21  8:27   ` Nikolay Borisov
@ 2018-05-21  8:31     ` Greg KH
  0 siblings, 0 replies; 9+ messages in thread
From: Greg KH @ 2018-05-21  8:31 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: stable

On Mon, May 21, 2018 at 11:27:37AM +0300, Nikolay Borisov wrote:
> 
> 
> On 21.05.2018 11:24, Greg KH wrote:
> > On Mon, May 21, 2018 at 10:18:00AM +0300, Nikolay Borisov wrote:
> >> Hello Greg,
> >>
> >> Here are trivial backports for upstream commit, which is tagged as stable:
> >>
> >> 02a3307aa9c2 ("btrfs: fix reading stale metadata blocks after degraded
> >> raid1 mounts")
> > 
> > What about 4.16?
> 
> I only provided patches for LTS releases. In any case, the code in
> question hasn't changed recently so the 4.14 backport should apply to 4.16.

I don't want someone moving from a 4.14-lts release to 4.16 to hit a
regression, that would be foolish :)

Anyway, thanks, I'll take the 4.14 backport.

greg k-h

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Please include following patches to 4.4/4.9/4.14
  2018-05-21  7:18 Please include following patches to 4.4/4.9/4.14 Nikolay Borisov
  2018-05-21  8:24 ` Greg KH
@ 2018-05-21  8:33 ` Greg KH
  2018-05-21  8:47 ` Greg KH
  2 siblings, 0 replies; 9+ messages in thread
From: Greg KH @ 2018-05-21  8:33 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: stable

On Mon, May 21, 2018 at 10:18:00AM +0300, Nikolay Borisov wrote:
> Hello Greg,
> 
> Here are trivial backports for upstream commit, which is tagged as stable:
> 
> 02a3307aa9c2 ("btrfs: fix reading stale metadata blocks after degraded
> raid1 mounts")

All now queued up, thanks.

greg k-h

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Please include following patches to 4.4/4.9/4.14
  2018-05-21  7:18 Please include following patches to 4.4/4.9/4.14 Nikolay Borisov
  2018-05-21  8:24 ` Greg KH
  2018-05-21  8:33 ` Greg KH
@ 2018-05-21  8:47 ` Greg KH
  2018-05-21  8:48   ` Nikolay Borisov
  2018-05-21  9:00   ` Nikolay Borisov
  2 siblings, 2 replies; 9+ messages in thread
From: Greg KH @ 2018-05-21  8:47 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: stable

On Mon, May 21, 2018 at 10:18:00AM +0300, Nikolay Borisov wrote:
> Hello Greg,
> 
> Here are trivial backports for upstream commit, which is tagged as stable:
> 
> 02a3307aa9c2 ("btrfs: fix reading stale metadata blocks after degraded
> raid1 mounts")

the 4.9 and 4.4 patches break the build :(

Please test build them at the least...

greg k-h

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Please include following patches to 4.4/4.9/4.14
  2018-05-21  8:47 ` Greg KH
@ 2018-05-21  8:48   ` Nikolay Borisov
  2018-05-21  9:00   ` Nikolay Borisov
  1 sibling, 0 replies; 9+ messages in thread
From: Nikolay Borisov @ 2018-05-21  8:48 UTC (permalink / raw)
  To: Greg KH; +Cc: stable



On 21.05.2018 11:47, Greg KH wrote:
> On Mon, May 21, 2018 at 10:18:00AM +0300, Nikolay Borisov wrote:
>> Hello Greg,
>>
>> Here are trivial backports for upstream commit, which is tagged as stable:
>>
>> 02a3307aa9c2 ("btrfs: fix reading stale metadata blocks after degraded
>> raid1 mounts")
> 
> the 4.9 and 4.4 patches break the build :(

Huhz, I tested on 4.14 and it was ok so I assumed other will be as well.

Nothing works as expected on Mondays .....


On it!
> 
> Please test build them at the least...
> 
> greg k-h
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: Please include following patches to 4.4/4.9/4.14
  2018-05-21  8:47 ` Greg KH
  2018-05-21  8:48   ` Nikolay Borisov
@ 2018-05-21  9:00   ` Nikolay Borisov
  2018-05-21 16:38     ` Greg KH
  1 sibling, 1 reply; 9+ messages in thread
From: Nikolay Borisov @ 2018-05-21  9:00 UTC (permalink / raw)
  To: Greg KH; +Cc: stable

[-- Attachment #1: Type: text/plain, Size: 453 bytes --]



On 21.05.2018 11:47, Greg KH wrote:
> On Mon, May 21, 2018 at 10:18:00AM +0300, Nikolay Borisov wrote:
>> Hello Greg,
>>
>> Here are trivial backports for upstream commit, which is tagged as stable:
>>
>> 02a3307aa9c2 ("btrfs: fix reading stale metadata blocks after degraded
>> raid1 mounts")
> 
> the 4.9 and 4.4 patches break the build :(
> 
> Please test build them at the least...

Fixed and test-built on both kernel versions!

> 
> greg k-h
> 

[-- Attachment #2: 0001-btrfs-fix-reading-stale-metadata-blocks-after-degrad.patch-44 --]
[-- Type: text/plain, Size: 3498 bytes --]

From c6c4b6698919c50e71c52f0a58aa1c0734d8ce61 Mon Sep 17 00:00:00 2001
From: Liu Bo <bo.liu@linux.alibaba.com>
Date: Wed, 16 May 2018 01:37:36 +0800
Subject: [PATCH] btrfs: fix reading stale metadata blocks after degraded raid1
 mounts

If a btree block, aka. extent buffer, is not available in the extent
buffer cache, it'll be read out from the disk instead, i.e.

btrfs_search_slot()
  read_block_for_search()  # hold parent and its lock, go to read child
    btrfs_release_path()
    read_tree_block()  # read child

Unfortunately, the parent lock got released before reading child, so
commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race") had
used 0 as parent transid to read the child block.  It forces
read_tree_block() not to check if parent transid is different with the
generation id of the child that it reads out from disk.

A simple PoC is included in btrfs/124,

0. A two-disk raid1 btrfs,

1. Right after mkfs.btrfs, block A is allocated to be device tree's root.

2. Mount this filesystem and put it in use, after a while, device tree's
   root got COW but block A hasn't been allocated/overwritten yet.

3. Umount it and reload the btrfs module to remove both disks from the
   global @fs_devices list.

4. mount -odegraded dev1 and write some data, so now block A is allocated
   to be a leaf in checksum tree.  Note that only dev1 has the latest
   metadata of this filesystem.

5. Umount it and mount it again normally (with both disks), since raid1
   can pick up one disk by the writer task's pid, if btrfs_search_slot()
   needs to read block A, dev2 which does NOT have the latest metadata
   might be read for block A, then we got a stale block A.

6. As parent transid is not checked, block A is marked as uptodate and
   put into the extent buffer cache, so the future search won't bother
   to read disk again, which means it'll make changes on this stale
   one and make it dirty and flush it onto disk.

To avoid the problem, parent transid needs to be passed to
read_tree_block().

In order to get a valid parent transid, we need to hold the parent's
lock until finishing reading child.

This patch needs to be slightly adapted for stable kernels, the
&first_key parameter added to read_tree_block() is from 4.16+
(581c1760415c4). The fix is to replace 0 by 'gen'.

Fixes: 5bdd3536cbbe ("Btrfs: Fix block generation verification race")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/ctree.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index 0f2b7c622ce3..e2f5be261532 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2497,10 +2497,8 @@ read_block_for_search(struct btrfs_trans_handle *trans,
 	if (p->reada)
 		reada_for_search(root, p, level, slot, key->objectid);
 
-	btrfs_release_path(p);
-
 	ret = -EAGAIN;
-	tmp = read_tree_block(root, blocknr, 0);
+	tmp = read_tree_block(root, blocknr, gen);
 	if (!IS_ERR(tmp)) {
 		/*
 		 * If the read above didn't mark this buffer up to date,
@@ -2512,6 +2510,8 @@ read_block_for_search(struct btrfs_trans_handle *trans,
 			ret = -EIO;
 		free_extent_buffer(tmp);
 	}
+
+	btrfs_release_path(p);
 	return ret;
 }
 
-- 
2.7.4


[-- Attachment #3: 0001-btrfs-fix-reading-stale-metadata-blocks-after-degrad.patch-49 --]
[-- Type: text/plain, Size: 3502 bytes --]

From b55269e3b6217d1982b70ca7b81352621ca446d3 Mon Sep 17 00:00:00 2001
From: Liu Bo <bo.liu@linux.alibaba.com>
Date: Wed, 16 May 2018 01:37:36 +0800
Subject: [PATCH] btrfs: fix reading stale metadata blocks after degraded raid1
 mounts

If a btree block, aka. extent buffer, is not available in the extent
buffer cache, it'll be read out from the disk instead, i.e.

btrfs_search_slot()
  read_block_for_search()  # hold parent and its lock, go to read child
    btrfs_release_path()
    read_tree_block()  # read child

Unfortunately, the parent lock got released before reading child, so
commit 5bdd3536cbbe ("Btrfs: Fix block generation verification race") had
used 0 as parent transid to read the child block.  It forces
read_tree_block() not to check if parent transid is different with the
generation id of the child that it reads out from disk.

A simple PoC is included in btrfs/124,

0. A two-disk raid1 btrfs,

1. Right after mkfs.btrfs, block A is allocated to be device tree's root.

2. Mount this filesystem and put it in use, after a while, device tree's
   root got COW but block A hasn't been allocated/overwritten yet.

3. Umount it and reload the btrfs module to remove both disks from the
   global @fs_devices list.

4. mount -odegraded dev1 and write some data, so now block A is allocated
   to be a leaf in checksum tree.  Note that only dev1 has the latest
   metadata of this filesystem.

5. Umount it and mount it again normally (with both disks), since raid1
   can pick up one disk by the writer task's pid, if btrfs_search_slot()
   needs to read block A, dev2 which does NOT have the latest metadata
   might be read for block A, then we got a stale block A.

6. As parent transid is not checked, block A is marked as uptodate and
   put into the extent buffer cache, so the future search won't bother
   to read disk again, which means it'll make changes on this stale
   one and make it dirty and flush it onto disk.

To avoid the problem, parent transid needs to be passed to
read_tree_block().

In order to get a valid parent transid, we need to hold the parent's
lock until finishing reading child.

This patch needs to be slightly adapted for stable kernels, the
&first_key parameter added to read_tree_block() is from 4.16+
(581c1760415c4). The fix is to replace 0 by 'gen'.

Fixes: 5bdd3536cbbe ("Btrfs: Fix block generation verification race")
CC: stable@vger.kernel.org # 4.4+
Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
[ update changelog ]
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Nikolay Borisov <nborisov@suse.com>

Signed-off-by: Nikolay Borisov <nborisov@suse.com>
---
 fs/btrfs/ctree.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/ctree.c b/fs/btrfs/ctree.c
index f6ba165d3f81..409b12392474 100644
--- a/fs/btrfs/ctree.c
+++ b/fs/btrfs/ctree.c
@@ -2486,10 +2486,8 @@ read_block_for_search(struct btrfs_trans_handle *trans,
 	if (p->reada != READA_NONE)
 		reada_for_search(root, p, level, slot, key->objectid);
 
-	btrfs_release_path(p);
-
 	ret = -EAGAIN;
-	tmp = read_tree_block(root, blocknr, 0);
+	tmp = read_tree_block(root, blocknr, gen);
 	if (!IS_ERR(tmp)) {
 		/*
 		 * If the read above didn't mark this buffer up to date,
@@ -2503,6 +2501,8 @@ read_block_for_search(struct btrfs_trans_handle *trans,
 	} else {
 		ret = PTR_ERR(tmp);
 	}
+
+	btrfs_release_path(p);
 	return ret;
 }
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: Please include following patches to 4.4/4.9/4.14
  2018-05-21  9:00   ` Nikolay Borisov
@ 2018-05-21 16:38     ` Greg KH
  0 siblings, 0 replies; 9+ messages in thread
From: Greg KH @ 2018-05-21 16:38 UTC (permalink / raw)
  To: Nikolay Borisov; +Cc: stable

On Mon, May 21, 2018 at 12:00:15PM +0300, Nikolay Borisov wrote:
> 
> 
> On 21.05.2018 11:47, Greg KH wrote:
> > On Mon, May 21, 2018 at 10:18:00AM +0300, Nikolay Borisov wrote:
> >> Hello Greg,
> >>
> >> Here are trivial backports for upstream commit, which is tagged as stable:
> >>
> >> 02a3307aa9c2 ("btrfs: fix reading stale metadata blocks after degraded
> >> raid1 mounts")
> > 
> > the 4.9 and 4.4 patches break the build :(
> > 
> > Please test build them at the least...
> 
> Fixed and test-built on both kernel versions!

Many thanks, now queued up.

greg k-h

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2018-05-21 16:39 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-05-21  7:18 Please include following patches to 4.4/4.9/4.14 Nikolay Borisov
2018-05-21  8:24 ` Greg KH
2018-05-21  8:27   ` Nikolay Borisov
2018-05-21  8:31     ` Greg KH
2018-05-21  8:33 ` Greg KH
2018-05-21  8:47 ` Greg KH
2018-05-21  8:48   ` Nikolay Borisov
2018-05-21  9:00   ` Nikolay Borisov
2018-05-21 16:38     ` Greg KH

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).