Strange performance degradation when COW writes happen at fixed offsets

* Strange performance degradation when COW writes happen at fixed offsets
@ 2012-02-24  1:32 Nik Markovic
  2012-02-24  2:31 ` Nik Markovic
  0 siblings, 1 reply; 7+ messages in thread
From: Nik Markovic @ 2012-02-24  1:32 UTC
  To: linux-btrfs

Hi,

My kernel version is 32-bit 3.2.0-rc5, and I am using btrfs-tools 0.19.

I was having performance issues with BTRFS caused by fragmentation on
HDDs, so I decided to switch to an SSD to see if they would go away.
Performance was much better, but at times I would see a "freeze"
that I can't really explain. The CPU would spike up to 100%
at times.

I decided to try to reproduce this. Though it may or may not be related,
while testing BTRFS performance I encountered an interesting problem
where performance depends on whether a file is freshly copied
onto a BTRFS filesystem or obtained via COW "children". This is all
happening on a Crucial M4 SSD, so something in the SSD firmware could
be causing the issue, but I feel it's related to BTRFS metadata.

Here is the test:
1. Write a fresh large file, called A, to the file system
2. Make a reflink (COW) copy of A, called B
3. Modify a set of random blocks on B
4. Remove A
5. Repeat 2-4, using the newly produced B as the new A

Expected results:
Each step takes an equal amount of time to complete on an SSD, because
there is no fragmentation involved and the system is in the same state
at step #2 every time: there is always only one file on the filesystem.

I used a 1GB file as my source. I repeated the tests using different
algorithms for the "write" in step #3 above.
Algorithm 1 (random): Write 8 bytes at random offsets
Algorithm 2 (fixed): Write the first 8 bytes, then continue at 50k offsets
Algorithm 3 (incremental): Write the first 8 bytes at offset = random(50k),
then continue at 50k offsets
For each test, there were 40k writes total. The algorithm is in the Java code below.
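
The three schemes, as the prose above describes them, can be sketched in a few lines of bash. This is only an illustration: STRIDE and the function names are made up here, and the posted Java picks its incremental offset with nextInt(offset + i * BLOCK_SIZE), which is not quite the same thing.

```shell
#!/bin/bash
# Sketch of the three write-offset schemes as described in the prose.
# STRIDE and the function names are illustrative, not from the posted script.
STRIDE=50000   # 50k spacing between writes

# Algorithm 1 (random): any offset that leaves room for an 8-byte write.
random_offset() {            # $1 = file size in bytes
    local size=$1
    # $RANDOM is only 15 bits wide, so combine three draws for large files.
    echo $(( ((RANDOM << 30) | (RANDOM << 15) | RANDOM) % (size - 8) ))
}

# Algorithm 2 (fixed): write number i lands at i * STRIDE.
fixed_offset() {             # $1 = write index, starting at 0
    echo $(( $1 * STRIDE ))
}

# Algorithm 3 (incremental): same stride, shifted by a per-run random start.
incremental_offset() {       # $1 = write index, $2 = start in [0, STRIDE)
    echo $(( $2 + $1 * STRIDE ))
}
```

A per-run start for the incremental scheme could be drawn once with start=$(( ((RANDOM << 15) | RANDOM) % STRIDE )) and then passed to incremental_offset for every write in that run.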

The following is observed with each iteration ONLY when using algorithm #3:
1. Over time, the time to modify the file increases
2. Over time, the time to make the reflink copy increases
3. Over time, the time to remove the file increases
4. The first few writes take less than the normal time to complete.

Data for 1st/5th/10th/15th/20th iteration:
Algorithm 1 and 2 (the same at every iteration):
Write: 6s
Copy: 0.5s
Remove: 0.10s

Algorithm 3 (seconds):
Write: 2/6/9/10/11.5
Copy: 0.5/3/4.5/5.5/6
Remove: 0.1/1/2/2/2

As you can see, things degrade and taper off after the 10th iteration.
This probably has to do with the 4k block size being near 50k/10. I don't
think this has to do with SSD garbage collection, because I ran these
tests multiple times.
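
One way to sanity-check the 4k remark is to work out how many writes and blocks are involved. The sketch below assumes a 10^9-byte file and 4096-byte blocks; the thread only says "1GB" and "4k", so both exact figures are assumptions.

```shell
#!/bin/bash
# Back-of-envelope numbers. FILE_SIZE and BLOCK are assumed values.
FILE_SIZE=1000000000   # "1GB" source file, taken as 10^9 bytes
STRIDE=50000           # spacing between writes, from the test description
BLOCK=4096             # assumed filesystem block size

writes_per_pass=$(( FILE_SIZE / STRIDE ))   # writes that fit before end of file
blocks_in_file=$(( FILE_SIZE / BLOCK ))     # whole 4k blocks in the file
blocks_per_stride=$(( STRIDE / BLOCK ))     # whole blocks between two writes

echo "writes per pass:   $writes_per_pass"
echo "blocks in file:    $blocks_in_file"
echo "blocks per stride: $blocks_per_stride"
```

Under these assumptions a pass touches at most one block in twelve, so it would take on the order of a dozen passes with different starting offsets before a sizeable share of the blocks has been rewritten at least once. That is in the same range as the iteration where the timings level off, but it is a plausibility check rather than an explanation.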

To use this script, cd into an empty directory on a btrfs filesystem
and run it with "incremental" as the argument. You can use the other modes
to confirm the expected behavior.
Script used to produce the bug:
#!/bin/bash

mode=$1
if [ -z "$mode" ]; then
	echo "Usage: $0 <incremental|random|fixed>"
	exit 1
fi

src=`pwd`/test/src
dst=`pwd`/test/dst
srcfile=$src/test.tar
dstfile=$dst/test.tar

mkdir -p $src
mkdir -p $dst

filesize=100MB

# Build a 1GB file from a smaller download. You can tweak filesize and
# the loop below for lower bandwidth.
if [ ! -f $srcfile ]; then
	cd $src
	if [ ! -f $srcfile.dl ]; then
		wget http://download.thinkbroadband.com/${filesize}.zip \
			--output-document=$srcfile.dl
	fi
	rm -rf tarbase
	mkdir tarbase
	for  i in {1..10}; do
		cp --reflink=always $srcfile.dl tarbase/$i.dl
	done
	tar -cvf $srcfile tarbase
	rm -rf tarbase
fi

cat <<END > $src/FileTest.java
import java.io.IOException;
import java.io.RandomAccessFile;
public class FileTest {
    public static final int BLOCK_SIZE = 50000;
    public static final int MAX_ITERATIONS = 40000;
    public static void main(String args[]) throws IOException {
        String mode = args[0];
        RandomAccessFile f = new RandomAccessFile(args[1], "rw");
        //int offset = 0;
        int i;
        // initializer ONLY for incremental mode
        int offset = new java.util.Random().nextInt(BLOCK_SIZE);
        for (i=0; i < MAX_ITERATIONS; i++) {
            try {
                int writeOffset;
                if (mode.equals("incremental")) {
                    writeOffset = new java.util.Random().nextInt(offset + i * BLOCK_SIZE);
                } else { // mode.equals random
                    writeOffset = new java.util.Random().nextInt(((int)f.length() - 100));
                    offset = writeOffset; // for reporting it at the end
                }
                f.seek(writeOffset);
                f.writeBytes("DEADBEEF");
            } catch (java.io.IOException e) {
                System.out.println("EOF");
                break;
            }
        }
        System.out.print("Last offset=" + offset);
        System.out.println(". Made " + i + " random writes.");
        f.close();
    }
}

END

cd $src
javac FileTest.java

/usr/bin/time --format 'rm: %E' rm -rf $dst/*
cp --reflink=always $srcfile.dl $dst/1.tst
cd $dst
for i in {1..20}; do	
	echo -n "$i."
	i_plus=`expr $i + 1`
	/usr/bin/time --format 'write: %E' java -cp $src FileTest $mode $i.tst
	/usr/bin/time --format 'cp:    %E' cp --reflink=always $i.tst $i_plus.tst
	/usr/bin/time --format 'rm:    %E' rm $i.tst
	/usr/bin/time --format 'sync:  %E' sync
	sleep 1
done

* Re: Strange performance degradation when COW writes happen at fixed offsets
  2012-02-24  1:32 Strange performance degradation when COW writes happen at fixed offsets Nik Markovic
@ 2012-02-24  2:31 ` Nik Markovic
  2012-02-24  6:38   ` Duncan
  0 siblings, 1 reply; 7+ messages in thread
From: Nik Markovic @ 2012-02-24  2:31 UTC
  To: linux-btrfs

I noticed a few errors in the script that I used. I corrected it and
it seems that degradation is occurring even at fully random writes:


#!/bin/bash

mode=$1
if [ -z "$mode" ]; then
	echo "Usage: $0 <incremental|random|fixed>"
	exit 1
fi

src=`pwd`/test/src
dst=`pwd`/test/dst
srcfile=$src/test.tar
dstfile=$dst/test.tar

mkdir -p $src
mkdir -p $dst

filesize=100MB

# Build a 10GB file from a smaller download. You can tweak filesize and
# the loop below for lower bandwidth.
if [ ! -f $srcfile ]; then
	cd $src
	if [ ! -f $srcfile.dl ]; then
		wget http://download.thinkbroadband.com/${filesize}.zip \
			--output-document=$srcfile.dl
	fi
	rm -rf tarbase
	mkdir tarbase
	for  i in {1..100}; do
		cp --reflink=always $srcfile.dl tarbase/$i.dl
	done
	tar -cvf $srcfile tarbase
	rm -rf tarbase
fi

cat <<END > $src/FileTest.java
import java.io.IOException;
import java.io.RandomAccessFile;
public class FileTest {
    public static final int BLOCK_SIZE = 50000;
    public static final int MAX_ITERATIONS = 40000;
    public static void main(String args[]) throws IOException {
        String mode = args[0];
        RandomAccessFile f = new RandomAccessFile(args[1], "rw");
        //int offset = 0;
        int i;
        // initializer ONLY for incremental mode
        int offset = new java.util.Random().nextInt(BLOCK_SIZE);
        for (i=0; i < MAX_ITERATIONS; i++) {
            try {
                int writeOffset;
                if (mode.equals("incremental")) {
                    writeOffset = new java.util.Random().nextInt(offset + i * BLOCK_SIZE);
                }  else if (mode.equals("fixed")) {
                    writeOffset = i * BLOCK_SIZE;
                    offset = writeOffset; // for reporting it at the end
                } else { // mode.equals random
                    writeOffset = new java.util.Random().nextInt(((int)f.length() - 100));
                    offset = writeOffset; // for reporting it at the end
                }
                if (writeOffset > (f.length() - 100)) {
                    break;
                }
                f.seek(writeOffset);
                f.writeBytes("DEADBEEF");
            } catch (java.io.IOException e) {
                System.out.println("EOF");
                break;
            }
        }
        System.out.print("Last offset=" + offset);
        System.out.println(". Made " + i + " random writes.");
        f.close();
    }
}
END

cd $src
javac FileTest.java


/usr/bin/time --format 'rm: %E' rm -rf $dst/*
cp --reflink=always $srcfile $dst/1.tst
cd $dst
for i in {1..100}; do	
	echo -n "$i."
	i_plus=`expr $i + 1`
	/usr/bin/time --format 'write: %E' java -cp $src FileTest $mode $i.tst
	/usr/bin/time --format 'cp:    %E' cp --reflink=always $i.tst $i_plus.tst
	/usr/bin/time --format 'rm:    %E' rm $i.tst
	/usr/bin/time --format 'sync:  %E' sync
	sleep 1
done

* Re: Strange performance degradation when COW writes happen at fixed offsets
  2012-02-24  2:31 ` Nik Markovic
@ 2012-02-24  6:38   ` Duncan
  2012-02-24 20:38     ` Nik Markovic
  0 siblings, 1 reply; 7+ messages in thread
From: Duncan @ 2012-02-24  6:38 UTC
  To: linux-btrfs

Nik Markovic posted on Thu, 23 Feb 2012 20:31:02 -0600 as excerpted:

> I noticed a few errors in the script that I used. I corrected it and it
> seems that degradation is occurring even at fully random writes:

I don't have an ssd, but is it possible that you're simply seeing erase-
block related degradation due to multi-write-block sized erase-blocks?

It seems to me that when originally written to the btrfs-on-ssd, the file 
will likely be written block-sequentially enough that the file as a whole 
takes up relatively few erase-blocks.  As you COW-write individual 
blocks, they'll be written elsewhere, perhaps all the changed blocks to a 
new erase-block, perhaps each to a different erase block.

As you increase the successive COW generation count, the file's file-
system/write blocks will be spread thru more and more erase-blocks, 
basically fragmentation but of the SSD-critical type, into more and more 
erase blocks, thus affecting modification and removal time but not read 
time.

IIRC I saw a note about this on the wiki, in regard to the nodatacow 
mount-option.  Let's see if I can find it again.  Hmm... yes...

http://btrfs.ipv5.de/index.php?title=Getting_started#Mount_Options

In particular this (for nodatacow, read the rest as there's additional 
implications):

>>>>>
Performance gain is usually < 5% unless the workload is random writes to 
large database files, where the difference can become very large.
<<<<<

In addition to nodatacow, see the note on the autodefrag option.

IOW, with the repeated generations of random-writes to cow-copies, you're 
apparently triggering a cow-worst-case fragmentation situation.  It 
shouldn't affect read-time much on SSD, but it certainly will affect copy 
and erase time, as the data and metadata (which as you'll recall is 2X by 
default on btrfs) gets written to more and more blocks that need updated 
at copy/erase time, 

That /might/ be the problem triggering the freezes you noted that set off 
the original investigation as well, if the SSD firmware is running out of 
erase blocks and having to pause access while it rearranges data to allow 
operations to continue.  Since your original issue on "rotating rust" 
drives was fragmentation, rewriting would seem to be something you do 
quite a lot of, triggering different but similar-cause issues on SSDs as 
well.

FWIW, with that sort of database-style workload, large files constantly 
random-change rewritten, something like xfs might be more appropriate 
than btrfs.  See the recent xfs presentations (were they at ScaleX or 
LinuxConf.au? both happened about the same time and were covered in the 
same LWN weekly edition) as covered a couple weeks ago on LWN for more.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

* Re: Strange performance degradation when COW writes happen at fixed offsets
  2012-02-24  6:38   ` Duncan
@ 2012-02-24 20:38     ` Nik Markovic
  2012-02-24 21:33       ` Nik Markovic
  2012-02-25  3:34       ` Duncan
  0 siblings, 2 replies; 7+ messages in thread
From: Nik Markovic @ 2012-02-24 20:38 UTC
  To: Duncan; +Cc: linux-btrfs

On Fri, Feb 24, 2012 at 12:38 AM, Duncan <1i5t5.duncan@cox.net> wrote:
> Nik Markovic posted on Thu, 23 Feb 2012 20:31:02 -0600 as excerpted:
>
>> I noticed a few errors in the script that I used. I corrected it and it
>> seems that degradation is occurring even at fully random writes:
>
> I don't have an ssd, but is it possible that you're simply seeing erase-
> block related degradation due to multi-write-block sized erase-blocks?
>
> It seems to me that when originally written to the btrfs-on-ssd, the file
> will likely be written block-sequentially enough that the file as a whole
> takes up relatively few erase-blocks.  As you COW-write individual
> blocks, they'll be written elsewhere, perhaps all the changed blocks to a
> new erase-block, perhaps each to a different erase block.

This is a very interesting insight. I wasn't even aware of the
erase-block issue, so I did some reading up on it...

>
> As you increase the successive COW generation count, the file's file-
> system/write blocks will be spread thru more and more erase-blocks,
> basically fragmentation but of the SSD-critical type, into more and more
> erase blocks, thus affecting modification and removal time but not read
> time.

OK, so the time to write would increase due to fragmentation; that
now makes sense (though I don't see why small writes would affect
this, but writes are not my concern anyway). But why would cp
--reflink time increase so much? Yes, new extents would be created,
but btrfs doesn't write into data blocks, does it? I figured its
metadata would be kept in one place. I figure the only things BTRFS
would do on cp --reflink=always are:
1. Take a collection of extents owned by source.
2. Make the new copy use the same collection of extents.
3. Write the collection of extents to the "directory".

Now this process seems to be CPU intensive. When I remove or make a
reflink copy, one core spikes up to 100%, which tells me that there's a
performance issue there, not an SSD issue. Also, only one CPU thread
is being used for this. I figured that I could improve this with some
setting. Maybe the thread_pool mount option? Are there any updates in
later kernels that I should possibly pick up?
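
If reflink and remove times track the number of extents rather than the file size, logging the extent count for each generation should show it. A minimal sketch, assuming filefrag (from e2fsprogs) is installed and prints its usual one-line summary; parse_extents and extent_count are names made up for this example.

```shell
#!/bin/bash
# Sketch: report how many extents a file has before it is reflinked.
# Assumes filefrag is available; the function names are illustrative.

# Pull N out of a summary line such as "1.tst: 1234 extents found".
# Counting fields from the end keeps this working for names with spaces.
parse_extents() {
    awk '{ print $(NF - 2) }'
}

# Print the extent count of the file given as $1.
extent_count() {
    filefrag "$1" | tail -n 1 | parse_extents
}
```

In the test loop this could be called as echo "extents: $(extent_count $i.tst)" just before the cp --reflink line, so each timing has an extent count next to it.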

>
> IIRC I saw a note about this on the wiki, in regard to the nodatacow
> mount-option.  Let's see if I can find it again.  Hmm... yes...
>
> http://btrfs.ipv5.de/index.php?title=Getting_started#Mount_Options
>
> In particular this (for nodatacow, read the rest as there's additional
> implications):
>
> >>>>>
> Performance gain is usually < 5% unless the workload is random writes to
> large database files, where the difference can become very large.
> <<<<<
>

Unless I am wrong, this would disable COW completely, and reflink copies
along with it. Reflinks are a crucial component and the sole
reason I picked BTRFS for the system that I am writing for my company.
The autodefrag option addresses multiple writes. Writing is not the
problem, but cp --reflink should be near-instant. That was the reason
we chose BTRFS over ZFS, which seemed to be the only feasible
alternative. ZFS snapshots complicate the design, and the deduplication copy
time is the same as (or not much better than) a raw copy.

> In addition to nodatacow, see the note on the autodefrag option.
>
> IOW, with the repeated generations of random-writes to cow-copies, you're
> apparently triggering a cow-worst-case fragmentation situation.  It
> shouldn't affect read-time much on SSD, but it certainly will affect copy
> and erase time, as the data and metadata (which as you'll recall is 2X by
> default on btrfs) gets written to more and more blocks that need updated
> at copy/erase time,
>
>
> That /might/ be the problem triggering the freezes you noted that set off
> the original investigation as well, if the SSD firmware is running out of
> erase blocks and having to pause access while it rearranges data to allow
> operations to continue.  Since your original issue on "rotating rust"
> drives was fragmentation, rewriting would seem to be something you do
> quite a lot of, triggering different but similar-cause issues on SSDs as
> well.
>
> FWIW, with that sort of database-style workload, large files constantly
> random-change rewritten, something like xfs might be more appropriate
> than btrfs.  See the recent xfs presentations (were they at ScaleX or
> LinuxConf.au? both happened about the same time and were covered in the
> same LWN weekly edition) as covered a couple weeks ago on LWN for more.
>

As I mentioned above, COW is the crucial component of our system, so
XFS won't do. Our system does not do random writes. In fact it is
mainly heavy on read operations. The system does occasional "rotation
of rust" on large files the way a version control system would
(large files are modified and then used as a new baseline).

> --
> Duncan - List replies preferred.   No HTML msgs.
> "Every nonfree program has a lord, a master --
> and if you use the program, he is your master."  Richard Stallman
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

Thanks for all your help on this issue. I hope that someone can point
out some more tweaks or added features/fixes after 3.2 RC5 that I may
do.

* Re: Strange performance degradation when COW writes happen at fixed offsets
  2012-02-24 20:38     ` Nik Markovic
@ 2012-02-24 21:33       ` Nik Markovic
  2012-02-27  8:29         ` Christian Brunner
  2012-02-25  3:34       ` Duncan
  1 sibling, 1 reply; 7+ messages in thread
From: Nik Markovic @ 2012-02-24 21:33 UTC
  To: Duncan; +Cc: linux-btrfs

To add... I also tried the nodatasum (only) and nodatacow options. I found
somewhere that nodatacow doesn't really mean that COW is disabled.
Test data is still the same - CPU spikes and times are the same.


* Re: Strange performance degradation when COW writes happen at fixed offsets
  2012-02-24 21:33       ` Nik Markovic
@ 2012-02-27  8:29         ` Christian Brunner
  0 siblings, 0 replies; 7+ messages in thread
From: Christian Brunner @ 2012-02-27  8:29 UTC
  To: Nik Markovic; +Cc: linux-btrfs

2012/2/24 Nik Markovic <nmarkovi.navteq@gmail.com>:
> To add... I also tried the nodatasum (only) and nodatacow options. I found
> somewhere that nodatacow doesn't really mean that COW is disabled.
> Test data is still the same - CPU spikes and times are the same.

The symptoms you are reporting are quite similar to what I'm seeing in
our Ceph cluster:

http://comments.gmane.org/gmane.comp.file-systems.btrfs/15413

AFAIK, Chris and Josef are working on it, but you'll have to wait for
kernel 3.4 before this is available in mainline. If you are
feeling adventurous, you could try the patches in Josef's git tree,
but I think they are still experimental.

Regards,
Christian

* Re: Strange performance degradation when COW writes happen at fixed offsets
  2012-02-24 20:38     ` Nik Markovic
  2012-02-24 21:33       ` Nik Markovic
@ 2012-02-25  3:34       ` Duncan
  1 sibling, 0 replies; 7+ messages in thread
From: Duncan @ 2012-02-25  3:34 UTC
  To: linux-btrfs

Nik Markovic posted on Fri, 24 Feb 2012 14:38:57 -0600 as excerpted:

> On Fri, Feb 24, 2012 at 12:38 AM, Duncan <1i5t5.duncan@cox.net> wrote:
>> Nik Markovic posted on Thu, 23 Feb 2012 20:31:02 -0600 as excerpted:
>>
>>> I noticed a few errors in the script that I used. I corrected it and
>>> it seems that degradation is occurring even at fully random writes:
>>
>> I don't have an ssd, but is it possible that you're simply seeing
>> erase-block related degradation due to multi-write-block sized
>> erase-blocks?
>>
>> It seems to me that when originally written to the btrfs-on-ssd, the
>> file will likely be written block-sequentially enough that the file as
>> a whole takes up relatively few erase-blocks.  As you COW-write
>> individual blocks, they'll be written elsewhere, perhaps all the
>> changed blocks to a new erase-block, perhaps each to a different erase
>> block.
>
> This is a very interesting insight. I wasn't even aware of the
> erase-block issue, so I did some reading up on it...

I take it you looked at TRIM/discard, then, as well?  In theory and for
some SSD firmware, it works well at helping to alleviate the problem by
informing the SSD of data areas that it no longer needs to care about
(empty space), thus allowing more effective management of those erase-
blocks.

Reality is however not quite so simple, and it doesn't help a lot with
some SSDs, plus there's a potential performance issue when doing the
discard, especially on earlier devices, since TRIM is an unqueueable
command in the earlier standards (I've read it's defined as queueable in
the latest standards, however), thus forcing a flush of all activity in
the queue before the discard, potentially triggering I/O freeze
behavior.  Additionally, when run on top of dm-crypt, there's a potential
security issue (examination of the raw undecrypted storage reveals
whether there's data there or not, and possibly the filesystem type used
based on patterns, a potential deniability issue in that they know the
data is there, tho it doesn't affect the strength of the encryption
itself).

So since on a lot of firmware it doesn't make a lot of difference anyway,
and there's a couple of down sides, the btrfs ssd mount option does NOT
enable discard as well.  However, there *IS* a discard option that you
can experiment with if you like, and it probably WILL help with erase-
block handling on SOME firmware.

See the FAQ, part 3 Features, # 3.4 on TRIM/discard, for a bit more. I've
really covered what it says, above, but there's a link to the encryption
security vs TRIM research, for instance.  And the discard mount-option
for whatever reason isn't listed in mount options, or at least I didn't
see it, only in the FAQ.

http://btrfs.ipv5.de/index.php?title=FAQ#Does_Btrfs_support_TRIM.2Fdiscard.3F

Bottom line, if it is indeed an erase-block issue, the discard mount
option MIGHT help, or it might not, depending on your device firmware.
It's an experiment-and-see thing.
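
Before experimenting, it helps to know whether discard is already active
on the filesystem under test.  Here's a minimal sketch, assuming a Linux
system where /proc/mounts uses its usual whitespace-separated layout
(device, mount point, filesystem type, comma-separated options).  The
class and method names (DiscardCheck, btrfsHasOption) are made up for
illustration and aren't part of any btrfs tooling.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

final class DiscardCheck {
    /**
     * Returns true if the given /proc/mounts line describes a btrfs mount
     * that carries the named option. Options with a value ("name=value")
     * are matched on the name part only.
     */
    static boolean btrfsHasOption(String mountLine, String option) {
        // Assumed field order: device, mount point, fs type, options, dump, pass.
        String[] fields = mountLine.trim().split("\\s+");
        if (fields.length < 4 || !fields[2].equals("btrfs")) {
            return false;
        }
        for (String opt : fields[3].split(",")) {
            String name = opt.split("=", 2)[0];
            if (name.equals(option)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        // Report the ssd and discard state of every mounted btrfs filesystem.
        List<String> lines = Files.readAllLines(Paths.get("/proc/mounts"));
        for (String line : lines) {
            String[] fields = line.trim().split("\\s+");
            if (fields.length >= 4 && fields[2].equals("btrfs")) {
                System.out.println(fields[1]
                        + ": ssd=" + btrfsHasOption(line, "ssd")
                        + " discard=" + btrfsHasOption(line, "discard"));
            }
        }
    }
}
```

Run it once with the filesystem mounted normally and once after
remounting with discard, so each test run records which mode it was
actually timed under.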

>> As you increase the successive COW generation count, the file's
>> file-system/write blocks will be spread thru more and more
>> erase-blocks, basically fragmentation but of the SSD-critical type,
>> into more and more erase blocks, thus affecting modification and
>> removal time but not read time.
>
> OK, so time to write would increase due to fragmentation and writing, it
> now makes sense (though I don't see why small writes would affect this,
> but my concerns are not writes anyway), but why would cp --reflink time
> increase so much. Yes, new extents would be created, but btrfs doesn't
> write into data blocks, does it? I figured its metadata would be kept in
> one place. I figure the only thing BTRFS would do on cp
> --reflink=always:
> 1. Take a collection of extents owned by source.
> 2. Make the new copy use the same collection of extents.
> 3. Write the collection of extents to the "directory".
>
> Now this process seems to be CPU intensive. When I remove or make a
> reflink copy, one core spikes up to 100%, which tells me that there's a
> performance issue there, not an ssd issue. Also, only one CPU thread is
> being used for this. I figured that I can improve this by some setting.
> Maybe thread_pool mount option? Are there any updates in later kernels
> that I should possibly pick up?

FWIW... I am by no means an expert on this.  I /think/ I understand
enough of it to somewhat guide trial and error testing to arrive at a
reasonable if not best-case config for any setup I might deal with, and
well enough to hopefully point you in the right direction for your own
research and testing, but I'm not going to claim to be able to explain
the whys of individual cases, or even necessarily to understand them
myself, just understand enough to know of the issue and to trial and
error resolve to a hopefully reasonable situation on any hardware I might
run.

However, I could speculate (enough to guide my own testing were I
troubleshooting here) that it's one of several things or more likely a
combination of them.

One, I'm not sure if the metadata ends up being COW also, or not, but if
it is, then your test case is fragmenting it too, thus explaining the
reflink copy issue.  And keep in mind that by default, btrfs uses DUP for
metadata, so there's TWO copies of it written, thereby DOUBLING the
performance effects of anything affecting metadata!

Two, see the FAQ deduplication question/answer a couple questions below
the TRIM/discard one mentioned above.  I'm rather fuzzy on the filesystem
implications of this myself, but it seems to me that our COW assumptions
might be wrong because they're assuming deduplication effects that simply
aren't the way btrfs works presently, as it hasn't implemented
deduplication.  Admittedly, this is at best a handwavy black-box factor,
but that's the best I can do with it, presently.  I guess that at least
gives you another place to do additional research, if it comes to that.
(In this regard I do wish the COW subsection of the sysadmin guide page
on the wiki was written, it's simply punted ATM, since there's a fair
chance that a good explanation there would cover the filesystem viewpoint
differences between full deduplication and the COW that btrfs does,
perhaps clearing up some misconceptions people including me may have
about it, as well.)

Three, as evident in the discussion on the nodatacow and autodefrag
options I mentioned before, there's known issues with some use cases
involving large files and rewrites of data at random locations within
them.  But I'm not sure if these known issues are simply the ones we've
been discussing, or if there's other factors I'm unaware of in this
regard.  Knowing more about just what those known issues are and the
specific scenarios under which they occur, could go a long way toward
resolving the situation for you.

But I'm only a recent list regular, joining a few weeks ago as part of my
own research into btrfs (FWIW my use case involves N-way mirroring, with
N=3-4; since only no-mirroring and N=2 is available today and 3-way/n-way
is planned to layer on top of raid5/6, which is planned for kernel
3.4/3.5, I'm now waiting for that... while continuing to stay current on
the list), so whatever research or test cases led to the remarks on the
wiki regarding large files with random data rewrites, predates my
involvement likely by quite some time.

Four, there's additional block alignment issues having to do with the
alignment of the partition on the physical storage, as it relates to
read-, write- and erase-block sizes and alignment.  On SSDs, erase-block
sizes are the biggest, so the optimum alignment would be to erase-block
size.  Getting it wrong can result in multiple block writes and/or erases
where proper alignment would require only one.  This phenomenon is called
write-amplification (and less commonly, erase-amplification).  However,
depending on what you used to create the partition on which the
filesystem resides (and loopback files do tend toward worst-case), it's
quite possible you don't have block-alignment level control at all.

FWIW, that's one use case for the mkbtrfs/mkfs.btrfs --alloc-start/-A
option, since that allows you to align the allocation within the
partition as necessary for alignment, regardless of the partition
alignment.
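
To make the alignment arithmetic concrete, here's a minimal sketch.  The
512 KiB erase-block size is purely an assumption for illustration (real
devices vary and often don't document it), and the names EraseBlockMath,
eraseBlocksTouched and isAligned are made up for this example.  It
compares a partition starting at sector 63 (the old MBR-era default)
with one starting at 1 MiB, both with 512-byte sectors.

```java
final class EraseBlockMath {
    /**
     * Number of erase blocks touched by a write of `length` bytes that
     * begins at absolute device offset `start` (both in bytes).
     */
    static long eraseBlocksTouched(long start, long length, long eraseBlockSize) {
        if (length <= 0) {
            return 0;
        }
        long first = start / eraseBlockSize;
        long last = (start + length - 1) / eraseBlockSize;
        return last - first + 1;
    }

    /** True if the partition start falls exactly on an erase-block boundary. */
    static boolean isAligned(long partitionStartBytes, long eraseBlockSize) {
        return partitionStartBytes % eraseBlockSize == 0;
    }

    public static void main(String[] args) {
        long eraseBlock = 512L * 1024;    // ASSUMED erase-block size, illustration only
        long writeSize = eraseBlock;      // one erase block worth of data
        long legacyStart = 63L * 512;     // partition at sector 63
        long alignedStart = 2048L * 512;  // partition at sector 2048 = 1 MiB

        // Same write at the very start of each partition.
        System.out.println("sector 63:   aligned=" + isAligned(legacyStart, eraseBlock)
                + " blocks=" + eraseBlocksTouched(legacyStart, writeSize, eraseBlock));
        System.out.println("sector 2048: aligned=" + isAligned(alignedStart, eraseBlock)
                + " blocks=" + eraseBlocksTouched(alignedStart, writeSize, eraseBlock));
    }
}
```

Under those assumptions the sector-63 partition straddles two erase
blocks for a write that fits in one on the 1 MiB-aligned partition,
which is the doubling described above.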

FWIW2, gptfdisk (a gpt partitioner as opposed to the old mbr style) has
reasonable alignment defaults of 1 MiB on disks without an existing
partition layout, and attempts 8-sector (4 KiB) alignment even on
existing layouts, for disks >=300 GB at least.  That's what I've been
using for the last few years, having converted to gpt-based partitioning
for everything, even USB-thumb-drives, if partitioned.  (GPT was designed
for EFI, but can be used on BIOS based systems as well, which is what I'm
doing.  Grub2 understands gpt well and puts to good use any reserved BIOS
partition it finds, and there's options in the kernel for it that need
enabled as well.)

Block alignment is DEFINITELY something you can play with, in terms of
testing whether it makes a difference on your drives, SSD or "spinning
rust".

There's probably other factors involved of which I'm unaware, as well.

>> IIRC I saw a note about this on the wiki, in regard to the nodatacow
>> mount-option.

>> In addition to nodatacow, see the note on the autodefrag option.

> Unless I am wrong, this would disable COW completely and reflink copy.
> Reflinks are a crucial component and the sole reason I picked BTRFS for
> the system that I am writing for my company.
> The autodefrag option addresses multiple writes. Writing is not the
> problem, but cp --reflink should be near-instant. That was the reason we
> chose BTRFS over ZFS, which seemed to be the only feasible alternative.
> ZFS snapshots complicate the design and deduplication copy time is the
> same as (or not much better than) raw copy.

> As I mentioned above, the COW is the crucial component of our system,
> XFS won't do. Our system does not do random writes. In fact it is mainly
> heavy on read operation. The system does occasional "rotation of rust"
> on large files in a way that version control system would (large files
> are modified and then used as a new baseline)

Pardon me, I think I might have been too vague with that "rotating rust"
allusion and lost you.  Either that or you're taking the allusion out
even further and potentially lost me! =;^0

I meant spinning magnetic media with that "rotating rust" reference, the
"rotating rust" bit being a double entendre allusion both to the iron
oxide (rust) used as the data storage layer, and to the fact that many
view rotating magnetic media as a legacy technology (rusting out)
compared to SSDs. =:^)  As it happens, I saw that double-meaning
word-play used elsewhere recently with the same two allusions attached,
and liked it enough to use it myself, when I got the chance.  Only I'm
not sure you got the reference, because...

You used it quite differently, referring to file rotation.  So either you
saw my reference and upped the ante, so to speak, leaving me to pick up
the pieces, or I lost you with the original reference, one of the two.

But I guess we should be on the same page knowing each other's meaning,
now.  Meanwhile...

[I do see your followup mentioning that it doesn't actually disable /all/
COW, and that you tested it, without significant change in the results...]

FWIW, I wasn't so much SUGGESTING those options, as noting the
INFORMATION contained in their description, the random writes to large db
files and its effect on btrfs bit.  But testing (which you did) is a good
idea, just to see what difference it makes, little in your case, so
either the nocow option isn't disabling it in your case (specific use of
cp --reflink), or the cow isn't the problem at all.

While you're at testing, tho, the question occurred to me of whether
simply using btrfs' snapshotting would make a difference.  (I did say I
don't claim a full understanding, and that trial and error testing would
be my method here, that I really only understand enough to hopefully
guide me a bit in what to test...)  Snapshotting by definition uses the
COW capacities, but it occurs to me that since it's doing it on a
filesystem-wide basis instead of a single-file basis, that might allow
more efficiency in metadata handling.

Note that I don't necessarily expect that snapshotting would be a
workable final solution for you, but if in testing you discover that the
speed stays reasonable with the snapshot method (still only changing the
single file between snapshots), while it degrades (as you've found) with
the single cp --reflink method, then that's important data for the test
case, and given btrfs' state of development, it could well lead to major
optimizations of the single-file cp --reflink case as well, which you
presumably COULD use in final deployment.
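
A rough sketch of how that comparison might be timed, in Java to match
the original test program.  It assumes cp (GNU coreutils, with
--reflink support) and the btrfs command-line tool are on the PATH, that
the source file lives inside a subvolume, and that the process has
permission to create snapshots.  The class CowTiming and its methods are
hypothetical helpers for illustration, not part of the original test.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

final class CowTiming {
    /** Command line for a single-file reflink copy. */
    static List<String> reflinkCommand(String src, String dst) {
        return Arrays.asList("cp", "--reflink=always", src, dst);
    }

    /** Command line for a snapshot of the subvolume containing the file. */
    static List<String> snapshotCommand(String srcSubvol, String dstSnapshot) {
        return Arrays.asList("btrfs", "subvolume", "snapshot", srcSubvol, dstSnapshot);
    }

    /** Runs the command, waits for it, and returns elapsed wall-clock nanoseconds. */
    static long timeCommand(List<String> command) throws IOException, InterruptedException {
        long start = System.nanoTime();
        Process process = new ProcessBuilder(command).inheritIO().start();
        int exit = process.waitFor();
        long elapsed = System.nanoTime() - start;
        if (exit != 0) {
            throw new IOException("exit code " + exit + " from " + command);
        }
        return elapsed;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 4) {
            System.err.println("usage: CowTiming <srcFile> <dstFile> <srcSubvol> <dstSnapshot>");
            return;
        }
        long reflinkNs = timeCommand(reflinkCommand(args[0], args[1]));
        long snapshotNs = timeCommand(snapshotCommand(args[2], args[3]));
        System.out.printf("cp --reflink: %.1f ms%n", reflinkNs / 1e6);
        System.out.printf("snapshot:     %.1f ms%n", snapshotNs / 1e6);
    }
}
```

Running this once per generation of the modify-and-recopy loop, and
logging both numbers, would show whether the snapshot time stays flat
while the cp --reflink time climbs.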

> Thanks for all your help on this issue. I hope that someone can point
> out some more tweaks or added features/fixes after 3.2 RC5 that I may
> do.

Talking about which... since you mentioned 3.2-rc5, you do seem aware of
the fact that btrfs is still very much experimental status, in active
development, and the need for staying current on the kernel.

However, unless your testing is for a system with actual deployment
scheduled for say a year or more out, I'd question btrfs as a reasonable
solution in any case.  One of the things that a lot of people don't seem
to realize is just how much active btrfs development is still going on,
and that it's NOT just corner-case use cases such as the multi-mirror
raid1 that I'm waiting on ATM, but that there's still data corruption
issues being traced and fixed, etc.

IOW, btrfs isn't something I'd recommend on either a production system or
even a general user's system, for the time being.  If the intent is to
test btrfs, filling it with data that you are not only prepared for it to
be destroyed, but expect it to happen, so you not only have backups or
simply don't value the data enough to be worth backups, you're not
counting on the btrfs copy as anything but experimental "garbage" data,
expected to be lost in testing, as well, then that's FINE.  Such testing,
and hopefully bug reporting, and patching where possible, is what btrfs
is out there for, ATM.

But if the intent is to actually put production data on the filesystem,
or use it as the primary copy of data that you don't want to lose, btrfs
isn't an appropriate choice at this point, and I'd say probably won't be
until say Q4, or even next year, so if your production deployment is
scheduled for before that, really, you shouldn't be looking at btrfs for
it, as it's not fit for that purpose ATM and isn't likely to be, for
another year or so (and even then, it'll be suitable for only the early
adopters, the cautious folk will wait another year or more after that,
just as many of the cautious folk are only now warming to ext4 as opposed
to ext3).

I just don't want to see you back here as one of those folks asking
questions about recovering data on a screwed filesystem, because they had
no backups or the backups weren't kept current, because they were using
btrfs for real-life use beyond testing purposes, and that's simply not
the sort of use btrfs is designed to or can properly deliver at this
point!

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman


end of thread, other threads:[~2012-02-27  8:29 UTC | newest]

Thread overview: 7+ messages
2012-02-24  1:32 Strange prformance degradation when COW writes happen at fixed offsets Nik Markovic
2012-02-24  2:31 ` Nik Markovic
2012-02-24  6:38   ` Duncan
2012-02-24 20:38     ` Nik Markovic
2012-02-24 21:33       ` Nik Markovic
2012-02-27  8:29         ` Christian Brunner
2012-02-25  3:34       ` Duncan
