* btrfs seems to do COW while inode has NODATACOW set
@ 2012-10-25 18:35 Alex Lyakas
  2012-10-25 18:40 ` cwillu
  2012-10-25 18:58 ` Wade Cline
  0 siblings, 2 replies; 9+ messages in thread
From: Alex Lyakas @ 2012-10-25 18:35 UTC
  To: linux-btrfs

Hi everybody,
I need some help understanding the nodatacow behavior.

I have set up a large file (5GiB), which has very few EXTENT_DATAs
(all are real, not bytenr=0). The file has NODATASUM and NODATACOW
flags set (flags=0x3):
	item 4 key (257 INODE_ITEM 0) itemoff 3591 itemsize 160
		inode generation 5 transid 5 size 5368709120 nbytes 5368709120 owner[0:0] mode 100644
		inode blockgroup 0 nlink 1 flags 0x3 seq 0
	item 7 key (257 EXTENT_DATA 131072) itemoff 3469 itemsize 53
	item 8 key (257 EXTENT_DATA 33554432) itemoff 3416 itemsize 53
	item 9 key (257 EXTENT_DATA 67108864) itemoff 3363 itemsize 53
	item 10 key (257 EXTENT_DATA 67112960) itemoff 3310 itemsize 53
	item 11 key (257 EXTENT_DATA 67117056) itemoff 3257 itemsize 53
	item 12 key (257 EXTENT_DATA 67121152) itemoff 3204 itemsize 53
	item 13 key (257 EXTENT_DATA 67125248) itemoff 3151 itemsize 53
	item 14 key (257 EXTENT_DATA 67129344) itemoff 3098 itemsize 53
	item 15 key (257 EXTENT_DATA 67133440) itemoff 3045 itemsize 53
	item 16 key (257 EXTENT_DATA 67137536) itemoff 2992 itemsize 53
	item 17 key (257 EXTENT_DATA 67141632) itemoff 2939 itemsize 53
	item 18 key (257 EXTENT_DATA 67145728) itemoff 2886 itemsize 53
	item 19 key (257 EXTENT_DATA 67149824) itemoff 2833 itemsize 53
	item 20 key (257 EXTENT_DATA 67153920) itemoff 2780 itemsize 53
	item 21 key (257 EXTENT_DATA 67158016) itemoff 2727 itemsize 53
	item 22 key (257 EXTENT_DATA 67162112) itemoff 2674 itemsize 53
	item 23 key (257 EXTENT_DATA 67166208) itemoff 2621 itemsize 53
	item 24 key (257 EXTENT_DATA 67170304) itemoff 2568 itemsize 53
	item 25 key (257 EXTENT_DATA 67174400) itemoff 2515 itemsize 53
		extent data disk byte 67174400 nr 5301534720
		extent data offset 0 nr 5301534720 ram 5301534720
		extent compression 0
As you can see from the last extent, the file size is exactly 5GiB.
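
The size claim can be checked from the last EXTENT_DATA item: its key offset plus its length should equal the inode size. A minimal Python sketch of that arithmetic, using only the numbers printed in the dump above (the variable names are illustrative, not btrfs identifiers):

```python
# Values copied from the dump above.
last_extent_file_offset = 67174400    # key offset of item 25 (EXTENT_DATA)
last_extent_num_bytes = 5301534720    # "nr" of that extent
inode_size = 5368709120               # "size" in the INODE_ITEM

end_of_file = last_extent_file_offset + last_extent_num_bytes
print(end_of_file == inode_size)      # the last extent ends where the file ends
print(end_of_file == 5 * 1024**3)     # and that point is 5 GiB
```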

Then I also mount btrfs with the nodatacow option.

root@vc:/btrfs-progs# ./btrfs fi df /mnt/src/
Data: total=5.47GB, used=5.00GB
System: total=32.00MB, used=4.00KB
Metadata: total=512.00MB, used=28.00KB

(I have set up block groups myself by playing with the mkfs code and
conversion code to learn about the extent tree. The filesystem passes
btrfsck fine, with no errors. All superblock copies are consistent.)

Then I run parallel random IOs on the file, and almost immediately hit
ENOSPC. When looking at the file, I see that now it has a huge number
of EXTENT_DATAs:
item 4 key (257 INODE_ITEM 0) itemoff 3593 itemsize 160
	inode generation 5 transid 21 size 5368709120 nbytes 5368709120 owner[0:0] mode 100644
	inode blockgroup 0 nlink 1 flags 0x3 seq 130098
item 6 key (257 EXTENT_DATA 0) itemoff 3525 itemsize 53
item 7 key (257 EXTENT_DATA 131072) itemoff 3472 itemsize 53
item 8 key (257 EXTENT_DATA 262144) itemoff 3419 itemsize 53
item 9 key (257 EXTENT_DATA 524288) itemoff 3366 itemsize 53
item 10 key (257 EXTENT_DATA 655360) itemoff 3313 itemsize 53
item 11 key (257 EXTENT_DATA 1310720) itemoff 3260 itemsize 53
item 12 key (257 EXTENT_DATA 1441792) itemoff 3207 itemsize 53
item 13 key (257 EXTENT_DATA 2097152) itemoff 3154 itemsize 53
item 14 key (257 EXTENT_DATA 2228224) itemoff 3101 itemsize 53
item 15 key (257 EXTENT_DATA 2752512) itemoff 3048 itemsize 53
item 16 key (257 EXTENT_DATA 2883584) itemoff 2995 itemsize 53
item 17 key (257 EXTENT_DATA 11927552) itemoff 2942 itemsize 53
item 18 key (257 EXTENT_DATA 12058624) itemoff 2889 itemsize 53
item 19 key (257 EXTENT_DATA 13238272) itemoff 2836 itemsize 53
item 20 key (257 EXTENT_DATA 13369344) itemoff 2783 itemsize 53
item 21 key (257 EXTENT_DATA 16646144) itemoff 2730 itemsize 53
item 22 key (257 EXTENT_DATA 16777216) itemoff 2677 itemsize 53
item 23 key (257 EXTENT_DATA 17432576) itemoff 2624 itemsize 53
...

and:
root@vc:/btrfs-progs# ./btrfs fi df /mnt/src/
Data: total=5.47GB, used=5.46GB
System: total=32.00MB, used=4.00KB
Metadata: total=512.00MB, used=992.00KB

Kernel is for-linus branch from Chris's tree, up to
f46dbe3dee853f8a860f889cb2b7ff4c624f2a7a (this is the last commit
there now).

I was under the impression that if a file is marked as NODATACOW, then new
writes will never allocate new EXTENT_DATAs if appropriate EXTENT_DATAs
already exist. However, this is clearly not the case, or maybe I am
doing something wrong.
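
To double-check that flags=0x3 really means both flags are set, the value can be decoded bit by bit. This is only a sketch: the two bit values are, to my knowledge, the ones the kernel's btrfs headers use for BTRFS_INODE_NODATASUM and BTRFS_INODE_NODATACOW, and the helper name decode_inode_flags is made up for illustration.

```python
# Bit values assumed to match the kernel's btrfs headers.
BTRFS_INODE_NODATASUM = 1 << 0
BTRFS_INODE_NODATACOW = 1 << 1


def decode_inode_flags(flags):
    """Return the names of the two flags of interest that are set."""
    names = []
    if flags & BTRFS_INODE_NODATASUM:
        names.append("NODATASUM")
    if flags & BTRFS_INODE_NODATACOW:
        names.append("NODATACOW")
    return names


print(decode_inode_flags(0x3))
```

Both dumps above show flags 0x3, so the NODATACOW bit was still set on the inode after the random writes.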

Can anybody please help me to debug further and understand why this is
happening?

Thanks,
Alex.


* Re: btrfs seems to do COW while inode has NODATACOW set
  2012-10-25 18:35 btrfs seems to do COW while inode has NODATACOW set Alex Lyakas
@ 2012-10-25 18:40 ` cwillu
  2012-10-25 18:47   ` Alex Lyakas
  2012-10-25 18:58 ` Wade Cline
  1 sibling, 1 reply; 9+ messages in thread
From: cwillu @ 2012-10-25 18:40 UTC
  To: Alex Lyakas; +Cc: linux-btrfs

On Thu, Oct 25, 2012 at 12:35 PM, Alex Lyakas
<alex.btrfs@zadarastorage.com> wrote:
> I was under the impression that if a file is marked as NODATACOW, then new
> writes will never allocate new EXTENT_DATAs if appropriate EXTENT_DATAs
> already exist. However, this is clearly not the case, or maybe I am
> doing something wrong.
>
> Can anybody please help me to debug further and understand why this is
> happening?

Have there been any snapshots taken, and/or was the filesystem
converted from ext?  In those cases, there will be one final copy
taken for the write.


* Re: btrfs seems to do COW while inode has NODATACOW set
  2012-10-25 18:40 ` cwillu
@ 2012-10-25 18:47   ` Alex Lyakas
  0 siblings, 0 replies; 9+ messages in thread
From: Alex Lyakas @ 2012-10-25 18:47 UTC
  To: cwillu; +Cc: linux-btrfs

Hi cwillu,

The filesystem has a single subvolume and a single file within it. I
know that ext2 conversion creates an image file that references the same
extents, which should cause the COW. I actually used examples from the
conversion & mkfs code to create this filesystem. Maybe I have some
inconsistencies there, although btrfsck passes fine.
Is there any other reason that COW should happen? Any hint on how to debug
deeper is appreciated :)

Alex.



* Re: btrfs seems to do COW while inode has NODATACOW set
  2012-10-25 18:35 btrfs seems to do COW while inode has NODATACOW set Alex Lyakas
  2012-10-25 18:40 ` cwillu
@ 2012-10-25 18:58 ` Wade Cline
  2012-10-25 19:09   ` Alex Lyakas
  1 sibling, 1 reply; 9+ messages in thread
From: Wade Cline @ 2012-10-25 18:58 UTC
  To: Alex Lyakas; +Cc: linux-btrfs

Hi Alex,

Someone correct me if I am wrong, but I'm pretty sure that the purpose of
'nodatacow' is to prevent the location of extents on the disk itself from
moving. However, it may be necessary to allocate more extents in the metadata
(which I presume are represented by EXTENT_DATA) in order to do this.

For example, say you preallocated space for a 1GB file using fallocate. Then
you'd have one EXTENT_DATA to represent the entire 1GB range, say:

         item 7 key (257 EXTENT_DATA 131072) itemoff 3469 itemsize 53

Then, if you performed a single write to the middle of the 1GB file, that one,
preallocated extent would need to be broken up into three extents; one for the
preallocated area before the write, one for the written area, and the last one
for the preallocated area after the write, say:

         item 7 key (257 EXTENT_DATA 131072) itemoff 3469 itemsize 53
         item 8 key (257 EXTENT_DATA 33554432) itemoff 3416 itemsize 53
         item 9 key (257 EXTENT_DATA 67108864) itemoff 3363 itemsize 53

The main point I'm trying to make is that it may be necessary to create more
EXTENT_DATAs in order to preserve the correct on-disk location.
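
The three-way split described above can be illustrated with a toy model of file-offset ranges. This is a sketch of the idea only, not the kernel's code: split_extent and the "prealloc"/"written" labels are hypothetical names, and the on-disk location is not modeled because in this scenario it does not move.

```python
def split_extent(start, length, write_start, write_len):
    """Split the file range [start, start + length) around a write.

    Returns (file_offset, length, state) tuples; empty pieces are dropped.
    """
    end = start + length
    write_end = write_start + write_len
    if not (start <= write_start and write_end <= end):
        raise ValueError("write must fall inside the extent")
    pieces = [
        (start, write_start - start, "prealloc"),
        (write_start, write_len, "written"),
        (write_end, end - write_end, "prealloc"),
    ]
    return [p for p in pieces if p[1] > 0]


GiB = 1024**3
# A 4 KiB write into the middle of a 1 GiB preallocated range.
for piece in split_extent(0, GiB, GiB // 2, 4096):
    print(piece)
```

A write at the very start or end of the range yields only two pieces, since one of the preallocated remainders is empty.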

Since you're not using a preallocated file, I'd guess that the writes are
reading in part of a larger extent, which isn't fully read into memory, and
then the write ends up breaking that extent into two smaller extents. You may
have better luck figuring out what's happening using the 'filefrag -v <file>'
command.

Hope this helps/answers your question.

Regards,
Wade

  



* Re: btrfs seems to do COW while inode has NODATACOW set
  2012-10-25 18:58 ` Wade Cline
@ 2012-10-25 19:09   ` Alex Lyakas
  2012-10-25 20:52     ` Wade Cline
  0 siblings, 1 reply; 9+ messages in thread
From: Alex Lyakas @ 2012-10-25 19:09 UTC
  To: Wade Cline; +Cc: linux-btrfs

Wade, thanks.

Yes, with the preallocated extent I saw the behavior you describe, and
it makes perfect sense to alloc a new EXTENT_DATA in this case.
In my case, I did another simple test:

Before:
	item 4 key (257 INODE_ITEM 0) itemoff 3593 itemsize 160
		inode generation 5 transid 5 size 5368709120 nbytes 5368709120 owner[0:0] mode 100644
		inode blockgroup 0 nlink 1 flags 0x3 seq 0
	item 5 key (257 INODE_REF 256) itemoff 3578 itemsize 15
		inode ref index 2 namelen 5 name: vol-1
	item 6 key (257 EXTENT_DATA 0) itemoff 3525 itemsize 53
		extent data disk byte 5368709120 nr 131072
		extent data offset 0 nr 131072 ram 131072
		extent compression 0
	item 7 key (257 EXTENT_DATA 131072) itemoff 3472 itemsize 53
		extent data disk byte 5905842176 nr 33423360
		extent data offset 0 nr 33423360 ram 33423360
		extent compression 0
                ...

I am going to do a single write of a 4KiB block into the (257 EXTENT_DATA
131072) extent:

dd if=/dev/urandom of=/mnt/src/subvol-1/vol-1 bs=4096 seek=32 count=1 conv=notrunc

After:
	item 4 key (257 INODE_ITEM 0) itemoff 3593 itemsize 160
		inode generation 5 transid 21 size 5368709120 nbytes 5368709120 owner[0:0] mode 100644
		inode blockgroup 0 nlink 1 flags 0x3 seq 1
	item 5 key (257 INODE_REF 256) itemoff 3578 itemsize 15
		inode ref index 2 namelen 5 name: vol-1
	item 6 key (257 EXTENT_DATA 0) itemoff 3525 itemsize 53
		extent data disk byte 5368709120 nr 131072
		extent data offset 0 nr 131072 ram 131072
		extent compression 0
	item 7 key (257 EXTENT_DATA 131072) itemoff 3472 itemsize 53
		extent data disk byte 5368840192 nr 4096
		extent data offset 0 nr 4096 ram 4096
		extent compression 0
	item 8 key (257 EXTENT_DATA 135168) itemoff 3419 itemsize 53
		extent data disk byte 5905842176 nr 33423360
		extent data offset 4096 nr 33419264 ram 33423360
		extent compression 0

We clearly see that a new extent has been allocated for some reason
(bytenr=5368840192), and the previous extent (bytenr=5905842176) is still
there, but used at an offset of 4096. This is exactly COW, I believe.
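
The same conclusion can be reached by computing which disk byte backs the written file offset before and after the write. A small Python sketch using only the numbers from the two dumps (the tuple layout and the helper disk_location are illustrative, not btrfs structures):

```python
# (file_offset, disk_bytenr, extent_offset, num_bytes), copied from the dumps.
before = (131072, 5905842176, 0, 33423360)
after_new = (131072, 5368840192, 0, 4096)
after_tail = (135168, 5905842176, 4096, 33419264)

write_offset = 32 * 4096   # dd seek=32 with bs=4096
write_len = 4096


def disk_location(extent, file_offset):
    """Disk byte that backs file_offset under this extent item."""
    start, bytenr, ext_off, _ = extent
    return bytenr + ext_off + (file_offset - start)


in_place = disk_location(before, write_offset)    # where a NOCOW write would land
actual = disk_location(after_new, write_offset)   # where the data is now
print(write_offset == before[0])   # the write hits the start of the old extent
print(in_place, actual, in_place == actual)
# The rest of the old extent still maps to the same disk bytes as before.
print(disk_location(after_tail, 135168) == disk_location(before, 135168))
```

The written block moved to a different disk location while the tail of the old extent did not, which matches the COW reading of the dumps.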

However, your hint about not being able to read into memory may be
useful; it would be good if we could find the place in the code that
makes that decision to COW.

I guess I am looking for a way to never ever allocate new EXTENT_DATAs
on a fully-mapped file. Is there one?

Thanks!
Alex.







* Re: btrfs seems to do COW while inode has NODATACOW set
  2012-10-25 19:09   ` Alex Lyakas
@ 2012-10-25 20:52     ` Wade Cline
  2012-10-26 13:33       ` Kyle Gates
  0 siblings, 1 reply; 9+ messages in thread
From: Wade Cline @ 2012-10-25 20:52 UTC (permalink / raw)
  To: Alex Lyakas; +Cc: linux-btrfs

On 10/25/2012 12:09 PM, Alex Lyakas wrote:

> Wade, thanks.
>
> Yes, with the preallocated extent I saw the behavior you describe, and
> it makes perfect sense to alloc a new EXTENT_DATA in this case.
> In my case, I did another simple test:
>
> Before:
> 	item 4 key (257 INODE_ITEM 0) itemoff 3593 itemsize 160
> 		inode generation 5 transid 5 size 5368709120 nbytes 5368709120
> owner[0:0] mode 100644
> 		inode blockgroup 0 nlink 1 flags 0x3 seq 0
> 	item 5 key (257 INODE_REF 256) itemoff 3578 itemsize 15
> 		inode ref index 2 namelen 5 name: vol-1
> 	item 6 key (257 EXTENT_DATA 0) itemoff 3525 itemsize 53
> 		extent data disk byte 5368709120 nr 131072
> 		extent data offset 0 nr 131072 ram 131072
> 		extent compression 0
> 	item 7 key (257 EXTENT_DATA 131072) itemoff 3472 itemsize 53
> 		extent data disk byte 5905842176 nr 33423360
> 		extent data offset 0 nr 33423360 ram 33423360
> 		extent compression 0
>                  ...
>
> I am going to do a single write of a 4KiB block into the (257 EXTENT_DATA
> 131072) extent:
>
> dd if=/dev/urandom of=/mnt/src/subvol-1/vol-1 bs=4096 seek=32 count=1
> conv=notrunc
>
> After:
> 	item 4 key (257 INODE_ITEM 0) itemoff 3593 itemsize 160
> 		inode generation 5 transid 21 size 5368709120 nbytes 5368709120
> owner[0:0] mode 100644
> 		inode blockgroup 0 nlink 1 flags 0x3 seq 1
> 	item 5 key (257 INODE_REF 256) itemoff 3578 itemsize 15
> 		inode ref index 2 namelen 5 name: vol-1
> 	item 6 key (257 EXTENT_DATA 0) itemoff 3525 itemsize 53
> 		extent data disk byte 5368709120 nr 131072
> 		extent data offset 0 nr 131072 ram 131072
> 		extent compression 0
> 	item 7 key (257 EXTENT_DATA 131072) itemoff 3472 itemsize 53
> 		extent data disk byte 5368840192 nr 4096
> 		extent data offset 0 nr 4096 ram 4096
> 		extent compression 0
> 	item 8 key (257 EXTENT_DATA 135168) itemoff 3419 itemsize 53
> 		extent data disk byte 5905842176 nr 33423360
> 		extent data offset 4096 nr 33419264 ram 33423360
> 		extent compression 0
>
> We clearly see that a new extent has been allocated for some reason
> (bytenr=5368840192), and previous extent (bytenr=5905842176) is still
> there, but used at an offset of 4096. This is exactly COW, I believe.
Hmm, I'm pretty sure that using 'dd' in this fashion skips the first 32 4096-byte
blocks and thus writes -past- the length of this extent (i.e., it writes bytes 131072 to
135167). This causes a new extent to be allocated after the previous extent.
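
For reference, the byte range touched by the quoted 'dd' command can be worked out with a few lines of Python. This is only an arithmetic sketch: the helper names are made up for illustration, and the extent boundaries are copied from the "Before" dump quoted above. It suggests that the write starts exactly where item 6 ends, i.e. at the first byte covered by the extent keyed at 131072.

```python
# Extent boundaries copied from the "Before" dump quoted above:
# (file offset of the EXTENT_DATA key, number of bytes it maps).
EXTENTS_BEFORE = [
    (0, 131072),
    (131072, 33423360),
]


def dd_write_range(bs, seek, count):
    """Half-open byte range [start, end) that dd writes in the output file."""
    start = bs * seek          # 'seek' skips this many bs-sized output blocks
    return start, start + bs * count


def extent_containing(extents, offset):
    """Return the (key_offset, length) pair covering `offset`, or None."""
    for key_offset, length in extents:
        if key_offset <= offset < key_offset + length:
            return key_offset, length
    return None


start, end = dd_write_range(bs=4096, seek=32, count=1)
print("first byte:", start, "last byte:", end - 1)
print("lands in extent:", extent_containing(EXTENTS_BEFORE, start))
```

With seek=31 instead, the same helpers would place the write in the last 4096 bytes of the extent keyed at 0.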

But even if using 'dd' with a 'seek' value of '31' created a new EXTENT_DATA, it
would not necessarily be data CoW, since data CoW refers only to the location of
the -data- (i.e., not metadata and thus not EXTENT_DATA) on disk. The key thing
is to look at where the EXTENT_DATAs point, not how many EXTENT_DATAs
there are.
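
One way to apply that test to the two dumps quoted above is to translate file offsets to disk bytes before and after the write and compare them. The sketch below does this in Python; the tuples are copied from the "Before" and "After" dumps, while the function names and the tuple layout are assumptions made for illustration, not btrfs-progs code.

```python
# (file_offset, disk_bytenr, extent_offset, num_bytes) per EXTENT_DATA item,
# copied from the "Before" and "After" dumps quoted above.
BEFORE = [
    (0, 5368709120, 0, 131072),
    (131072, 5905842176, 0, 33423360),
]
AFTER = [
    (0, 5368709120, 0, 131072),
    (131072, 5368840192, 0, 4096),
    (135168, 5905842176, 4096, 33419264),
]


def physical(extents, file_offset):
    """Translate a file offset to a disk byte using the extent items."""
    for key_offset, disk_bytenr, extent_offset, num_bytes in extents:
        if key_offset <= file_offset < key_offset + num_bytes:
            return disk_bytenr + extent_offset + (file_offset - key_offset)
    return None  # offset not covered by the listed items


def relocated(before, after, file_offset):
    """True if the data at file_offset lives at a different disk byte now."""
    return physical(before, file_offset) != physical(after, file_offset)


for off in (0, 131072, 135168):
    print(off, physical(BEFORE, off), physical(AFTER, off),
          "relocated" if relocated(BEFORE, AFTER, off) else "in place")
```

By this measure only the 4096 bytes at file offset 131072 moved to a new disk location; the data from 135168 onward still resolves to the same disk bytes even though it is now described by a separate EXTENT_DATA item.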

> However, your hint about not being able to read into memory may be
> useful; it would be good if we can find the place in the code that
> makes the decision to COW.
Try looking at the callers of btrfs_cow_block(), but you'll be on your own from
there :)

> I guess I am looking for a way to never ever allocate new EXTENT_DATAs
> on a fully-mapped file. Is there one?
Hmm, I don't think that this exists right now. You could try a '-o autodefrag' to
minimize the number of EXTENT_DATAs, though.

Regards,
Wade

>
> Thanks!
> Alex.



* RE: btrfs seems to do COW while inode has NODATACOW set
  2012-10-25 20:52     ` Wade Cline
@ 2012-10-26 13:33       ` Kyle Gates
  2012-10-28 12:12         ` Alex Lyakas
  0 siblings, 1 reply; 9+ messages in thread
From: Kyle Gates @ 2012-10-26 13:33 UTC (permalink / raw)
  To: Wade Cline, Alex Lyakas; +Cc: linux-btrfs@vger.kernel.org

> > Wade, thanks.
> >
> > Yes, with the preallocated extent I saw the behavior you describe, and
> > it makes perfect sense to alloc a new EXTENT_DATA in this case.
> > In my case, I did another simple test:
> >
> > Before:
> > item 4 key (257 INODE_ITEM 0) itemoff 3593 itemsize 160
> > inode generation 5 transid 5 size 5368709120 nbytes 5368709120
> > owner[0:0] mode 100644
> > inode blockgroup 0 nlink 1 flags 0x3 seq 0
> > item 5 key (257 INODE_REF 256) itemoff 3578 itemsize 15
> > inode ref index 2 namelen 5 name: vol-1
> > item 6 key (257 EXTENT_DATA 0) itemoff 3525 itemsize 53
> > extent data disk byte 5368709120 nr 131072
> > extent data offset 0 nr 131072 ram 131072
> > extent compression 0
> > item 7 key (257 EXTENT_DATA 131072) itemoff 3472 itemsize 53
> > extent data disk byte 5905842176 nr 33423360
> > extent data offset 0 nr 33423360 ram 33423360
> > extent compression 0
> > ...
> >
> > I am going to do a single write of a 4KiB block into the (257 EXTENT_DATA
> > 131072) extent:
> >
> > dd if=/dev/urandom of=/mnt/src/subvol-1/vol-1 bs=4096 seek=32 count=1
> > conv=notrunc
> >
> > After:
> > item 4 key (257 INODE_ITEM 0) itemoff 3593 itemsize 160
> > inode generation 5 transid 21 size 5368709120 nbytes 5368709120
> > owner[0:0] mode 100644
> > inode blockgroup 0 nlink 1 flags 0x3 seq 1
> > item 5 key (257 INODE_REF 256) itemoff 3578 itemsize 15
> > inode ref index 2 namelen 5 name: vol-1
> > item 6 key (257 EXTENT_DATA 0) itemoff 3525 itemsize 53
> > extent data disk byte 5368709120 nr 131072
> > extent data offset 0 nr 131072 ram 131072
> > extent compression 0
> > item 7 key (257 EXTENT_DATA 131072) itemoff 3472 itemsize 53
> > extent data disk byte 5368840192 nr 4096
> > extent data offset 0 nr 4096 ram 4096
> > extent compression 0
> > item 8 key (257 EXTENT_DATA 135168) itemoff 3419 itemsize 53
> > extent data disk byte 5905842176 nr 33423360
> > extent data offset 4096 nr 33419264 ram 33423360
> > extent compression 0
> >
> > We clearly see that a new extent has been allocated for some reason
> > (bytenr=5368840192), and previous extent (bytenr=5905842176) is still
> > there, but used at an offset of 4096. This is exactly COW, I believe.
> Hmm, I'm pretty sure that using 'dd' in this fashion skips the first 32 4096-byte
> blocks and thus writes -past- the length of this extent (i.e., it writes bytes 131072 to
> 135167). This causes a new extent to be allocated after the previous extent.
>
> But even if using 'dd' with a 'seek' value of '31' created a new EXTENT_DATA, it
> would not necessarily be data CoW, since data CoW refers only to the location of
> the -data- (i.e., not metadata and thus not EXTENT_DATA) on disk. The key thing
> is to look at where the EXTENT_DATAs point, not how many EXTENT_DATAs
> there are.
>
> > However, your hint about not being able to read into memory may be
> > useful; it would be good if we can find the place in the code that
> > makes the decision to COW.
> Try looking at the callers of btrfs_cow_block(), but you'll be on your own from
> there :)
>
> > I guess I am looking for a way to never ever allocate new EXTENT_DATAs
> > on a fully-mapped file. Is there one?
> Hmm, I don't think that this exists right now. You could try a '-o autodefrag' to
> minimize the number of EXTENT_DATAs, though.

This seems to be a start at what you're looking for:
Commit: 7e97b8daf63487c20f78487bd4045f39b0d97cf4
btrfs: allow setting NOCOW for a zero sized file via ioctl

In short, the nodatacow option won't be honored if any checksums have been assigned to any extents of a file.
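
For context, the inode dumps quoted above show "flags 0x3". A minimal sketch of decoding that value follows, assuming the two low bits carry the meanings the kernel's btrfs headers give them (BTRFS_INODE_NODATASUM = 1 << 0, BTRFS_INODE_NODATACOW = 1 << 1). The decoder function itself is hypothetical, and other flag bits are deliberately left out.

```python
# Bit values as defined in the kernel's btrfs headers; only the two
# flags discussed in this thread are listed.
BTRFS_INODE_NODATASUM = 1 << 0
BTRFS_INODE_NODATACOW = 1 << 1


def decode_inode_flags(flags):
    """Return the names of the known flags set in an inode's flags field."""
    names = []
    if flags & BTRFS_INODE_NODATASUM:
        names.append("NODATASUM")
    if flags & BTRFS_INODE_NODATACOW:
        names.append("NODATACOW")
    return names


print(decode_inode_flags(0x3))
```

On the inode in question both bits are set, so the file is marked both no-checksum and no-COW.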

>
> Regards,
> Wade
>
> >
> > Thanks!
> > Alex.


* Re: btrfs seems to do COW while inode has NODATACOW set
  2012-10-26 13:33       ` Kyle Gates
@ 2012-10-28 12:12         ` Alex Lyakas
  2012-10-29 17:18           ` Alex Lyakas
  0 siblings, 1 reply; 9+ messages in thread
From: Alex Lyakas @ 2012-10-28 12:12 UTC (permalink / raw)
  To: linux-btrfs@vger.kernel.org; +Cc: Wade Cline, Kyle Gates, cwillu

Hi,
it appears that I have found why the COW is happening. The code in the
kernel that triggers this is:
check_committed_ref():
	if (btrfs_extent_generation(leaf, ei) <=
	    btrfs_root_last_snapshot(&root->root_item))
		goto out;
It appears that both "extent_generation" and "last_snapshot" are 0 in my case.
How did "extent_generation" end up being 0? This is the converter's
fault; in record_file_extent() it has:
btrfs_set_extent_generation(leaf, ei, 0);
instead of
btrfs_set_extent_generation(leaf, ei, trans->transid);

After fixing this, I see that no COW is happening and
EXTENT_DATAs/EXTENT_ITEMs remain exactly the same, which is awesome!
(Community, if you feel this bug should be fixed, I can send this
trivial patch for the converter).
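
The effect of the quoted comparison can be illustrated with a small model. This is a simplified sketch of just that one test, not the full check_committed_ref() logic; the function name must_cow is hypothetical, and the generation value 5 is only an example taken from the inode dump earlier in the thread.

```python
def must_cow(extent_generation, last_snapshot):
    """Model of the quoted test: an extent whose generation is not newer
    than the root's last snapshot is treated as possibly shared, so the
    write cannot be done in place."""
    return extent_generation <= last_snapshot


# Converter wrote generation 0 and the root was never snapshotted (0):
print(must_cow(extent_generation=0, last_snapshot=0))   # falls back to COW

# With the generation set from the transaction id (5 used as an example):
print(must_cow(extent_generation=5, last_snapshot=0))   # in-place write allowed
```

Because the comparison is "<=", a generation of 0 can never pass it, regardless of whether any snapshot exists.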

However, I still receive ENOSPC when running IO to the file. I set up a
loopback device on the file, and when running IOs to /dev/loop0, I get:
Oct 28 13:49:41 vc kernel: [ 1243.775530] loop: Write error at byte
offset 3637841920, length 4096, prev_pos=3637841920, bw=-28.
Oct 28 13:49:41 vc kernel: [ 1243.780909] loop: Write error at byte
offset 163704832, length 4096, prev_pos=163704832, bw=-28.
Oct 28 13:49:41 vc kernel: [ 1243.783282] loop: Write error at byte
offset 3637899264, length 4096, prev_pos=3637899264, bw=-28.
Oct 28 13:49:41 vc kernel: [ 1243.788148] loop: Write error at byte
offset 498728960, length 4096, prev_pos=498728960, bw=-28.
Oct 28 13:49:41 vc kernel: [ 1243.790573] loop: Write error at byte
offset 498855936, length 4096, prev_pos=498855936, bw=-28.
Oct 28 13:49:41 vc kernel: [ 1243.793017] loop: Write error at byte
offset 407240704, length 4096, prev_pos=407240704, bw=-28.
...
(I added the print to __do_lo_send_write() in drivers/block/loop.c;
file->f_op->write returns -28.)
When writing later to the same offsets with "dd" I don't get this
problem. Free space also seems fine:
root@vc:/btrfs-progs# ./btrfs fi df /mnt/src/
Data: total=5.47GB, used=5.00GB
System: total=32.00MB, used=4.00KB
Metadata: total=512.00MB, used=36.00KB

How can I get ENOSPC back with NOCOW?
Can anybody please help me debug this further? There are no prints
from btrfs. The kernel is the latest from Chris's tree.

Thanks,
Alex.








* Re: btrfs seems to do COW while inode has NODATACOW set
  2012-10-28 12:12         ` Alex Lyakas
@ 2012-10-29 17:18           ` Alex Lyakas
  0 siblings, 0 replies; 9+ messages in thread
From: Alex Lyakas @ 2012-10-29 17:18 UTC (permalink / raw)
  To: linux-btrfs@vger.kernel.org, Josef Bacik; +Cc: Wade Cline, Kyle Gates, cwillu

FWIW,
I have found where I am hitting ENOSPC.

btrfs_check_data_free_space() has this code:
...
	/* make sure we have enough space to handle the data first */
	spin_lock(&data_sinfo->lock);
	used = data_sinfo->bytes_used + data_sinfo->bytes_reserved +
		data_sinfo->bytes_pinned + data_sinfo->bytes_readonly +
		data_sinfo->bytes_may_use;

	if (used + bytes > data_sinfo->total_bytes) {
		struct btrfs_trans_handle *trans;

...
		return -ENOSPC;
	}
	data_sinfo->bytes_may_use += bytes;

Josef, I have read your doc at
https://btrfs.wiki.kernel.org/index.php/ENOSPC and also the related
email thread. You mention only the metadata reservations there. In my
case, bytes_may_use gets bumped up for data. Eventually I hit ENOSPC
because I have very little extra space for data, but plenty of space for
metadata. However, I am using NOCOW. Is this the intended behavior:
bumping up bytes_may_use even though we won't need any new space
for data eventually?
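
The accounting described above can be modelled in a few lines. The sketch below is a simplified Python model of only the quoted part of btrfs_check_data_free_space(): the elided retry logic is left out, nothing here releases the reservation again, and the class name and the numbers are hypothetical values chosen for illustration.

```python
import errno
from dataclasses import dataclass


@dataclass
class DataSpaceInfo:
    """Counters named after the fields in the quoted kernel code."""
    total_bytes: int
    bytes_used: int = 0
    bytes_reserved: int = 0
    bytes_pinned: int = 0
    bytes_readonly: int = 0
    bytes_may_use: int = 0


def check_data_free_space(sinfo, nbytes):
    """Reserve nbytes of data space, or fail with -ENOSPC."""
    used = (sinfo.bytes_used + sinfo.bytes_reserved + sinfo.bytes_pinned
            + sinfo.bytes_readonly + sinfo.bytes_may_use)
    if used + nbytes > sinfo.total_bytes:
        return -errno.ENOSPC
    # The reservation is taken regardless of whether the write will
    # later turn out to be an in-place (NOCOW) overwrite.
    sinfo.bytes_may_use += nbytes
    return 0


# Hypothetical numbers: room for only two more 4 KiB reservations.
sinfo = DataSpaceInfo(total_bytes=10 * 4096, bytes_used=8 * 4096)
results = [check_data_free_space(sinfo, 4096) for _ in range(3)]
print(results, sinfo.bytes_may_use)
```

In this model the third reservation fails even though none of the three writes would need new data space if they were all overwrites of already-mapped ranges, which is the situation the question above describes.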

Thanks,
Alex.






end of thread, other threads:[~2012-10-29 17:18 UTC | newest]

Thread overview: 9+ messages
2012-10-25 18:35 btrfs seems to do COW while inode has NODATACOW set Alex Lyakas
2012-10-25 18:40 ` cwillu
2012-10-25 18:47   ` Alex Lyakas
2012-10-25 18:58 ` Wade Cline
2012-10-25 19:09   ` Alex Lyakas
2012-10-25 20:52     ` Wade Cline
2012-10-26 13:33       ` Kyle Gates
2012-10-28 12:12         ` Alex Lyakas
2012-10-29 17:18           ` Alex Lyakas
