From: Jens Axboe
Subject: Re: [lvm-devel] dm thin: optimize away writing all zeroes to unprovisioned blocks
Date: Tue, 09 Dec 2014 08:41:59 -0700
Message-ID: <54871847.6020009@kernel.dk>
In-Reply-To: <548715D2.1000509@kernel.dk>
References: <20141204153358.GA19315@redhat.com> <5481EB1C.4000202@kernel.dk> <20141205183342.GA27397@redhat.com> <5483B04D.5030606@kernel.dk> <5485D86C.9040800@kernel.dk> <548715D2.1000509@kernel.dk>
Reply-To: device-mapper development
To: Eric Wheeler
Cc: dm-devel@redhat.com, ejt@redhat.com, LVM2 development

On 12/09/2014 08:31 AM, Jens Axboe wrote:
> On 12/09/2014 01:02 AM, Eric Wheeler wrote:
>> On Fri, 5 Dec 2014, Mike Snitzer wrote:
>>> I do wonder what the performance impact is on this for dm. Have you
>>> tried a (worst case) test of writing blocks that are zero filled,
>>
>> Jens, thank you for your help w/ fio for generating zeroed writes!
>> Clearly fio is superior to dd as a sequential benchmarking tool; I was
>> actually able to push on the system's memory bandwidth.
>>
>> Results:
>>
>> I hacked block/loop.c and md/dm-thin.c to always call bio_is_zero_filled()
>> and then complete without writing to disk, regardless of the return value
>> from bio_is_zero_filled(). In loop.c this was done in
>> do_bio_filebacked(), and for dm-thin.c this was done within
>> provision_block().
>>
>> This allows us to compare the performance difference between the simple
>> loopback block device driver vs the more complex dm-thinp implementation
>> just prior to block allocation. These benchmarks give us a sense of how
>> performance differences relate between bio_is_zero_filled() and block
>> device implementation complexity, in addition to the raw performance of
>> bio_is_zero_filled in best- and worst-case scenarios.
>>
>> Since we always complete without writing after the call to
>> bio_is_zero_filled, regardless of the bio's content (all zeros or not), we
>> can benchmark the difference in the common use case of random data, as
>> well as the edge case of skipping writes for bio's that contain all zeros
>> when writing to unallocated space of thin-provisioned volumes.
>>
>> These benchmarks were performed under KVM, so expect them to be lower
>> bounds due to overhead. The hardware is a Intel(R) Xeon(R) CPU E3-1230 V2
>> @ 3.30GHz. The VM was allocated 4GB of memory with 4 cpu cores.
>>
>> Benchmarks were performed using fio-2.1.14-33-gf8b8f
>> --name=writebw
>> --rw=write
>> --time_based
>> --runtime=7 --ramp_time=3
>> --norandommap
>> --ioengine=libaio
>> --group_reporting
>> --direct=1
>> --bs=1m
>> --filename=/dev/X
>> --numjobs=Y
>>
>> Random data was tested using:
>> --zero_buffers=0 --scramble_buffers=1
>>
>> Zeroed data was tested using:
>> --zero_buffers=1 --scramble_buffers=0
>>
>> Values below are from aggrb.
>>
>>                 dm-thinp (MB/s)   loopback (MB/s)   loop faster by factor of
>> ==============+======================================================
>> random jobs=4 |      18496.0           33522.0          1.68x
>> zeros  jobs=4 |       8119.2            9767.2          1.20x
>> ==============+======================================================
>> random jobs=1 |       7330.5           12330.0          1.81x
>> zeros  jobs=1 |       4965.2            6799.9          1.11x
>
> This looks more reasonable in terms of throughput.
>
> One major worry here is that checking every write is blowing your cache,
> so you could have a major impact on performance in general. Even for
> O_DIRECT writes, you are now accessing the memory. Have you looked into
> doing non-temporal memory compares instead? I think that would be the
> way to go.

So I found your patch in the thread. For each vector, use memcmp()
instead and hope it does the right thing. You can compare with
empty_zero_page. That should drastically cut down on the amount of hand
rolled code you have in bio_is_zero_filled() at the moment.

--
Jens Axboe
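
For readers following the thread, the memcmp() suggestion above can be sketched in userspace roughly as follows. This is only an illustration, not the patch under discussion: `zero_page` is a stand-in for the kernel's empty_zero_page, `struct iovec` stands in for the bio's vectors, the function names `buf_is_zero_filled()` and `segments_are_zero_filled()` are hypothetical, and the 4096-byte page size is an assumption. It is a plain cached compare, so it does nothing about the non-temporal concern raised earlier in the message.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Assumed page size; the kernel would use PAGE_SIZE. */
#define ZERO_PAGE_SIZE 4096

/*
 * Userspace stand-in for the kernel's empty_zero_page.
 * Objects with static storage duration are zero-initialized.
 */
static const unsigned char zero_page[ZERO_PAGE_SIZE];

/* Hypothetical helper: true if every byte in buf[0..len) is zero. */
static bool buf_is_zero_filled(const void *buf, size_t len)
{
	const unsigned char *p = buf;

	/* Compare at most one zero page's worth of data per memcmp() call. */
	while (len > 0) {
		size_t chunk = len < ZERO_PAGE_SIZE ? len : ZERO_PAGE_SIZE;

		if (memcmp(p, zero_page, chunk) != 0)
			return false;
		p += chunk;
		len -= chunk;
	}
	return true;
}

/*
 * Hypothetical analogue of bio_is_zero_filled(): walk each segment
 * and stop at the first one that contains a non-zero byte.
 */
static bool segments_are_zero_filled(const struct iovec *iov, int cnt)
{
	for (int i = 0; i < cnt; i++) {
		if (!buf_is_zero_filled(iov[i].iov_base, iov[i].iov_len))
			return false;
	}
	return true;
}
```

The early return matters for the random-data case: a buffer of random bytes will usually fail within the first few bytes of the first segment, so the full scan cost is paid only for buffers that really are all zeros.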