* Fwd: block level cow operation
  2013-04-09  9:05 UTC
  From: Prashant Shah
  To: linux-ext4

Hi,

I am trying to implement a copy-on-write operation by reading the
original disk block, writing it to some other location, and only then
allowing the write to pass through (the write operation is blocked
until the read of the original block completes). I tried using
submit_bio() / sb_bread() to read the block and the completion API to
signal the end of the read, but the performance of this is very bad:
disk writes take around 12 times longer. Is there any better way to
improve the performance?

Not waiting for the completion of the read operation and letting the
disk write go through gives good performance, but in under 10% of the
cases the read happens after the write and ends up with the new data
and not the original data.

Regards.
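For reference, a minimal sketch of the synchronous scheme described
above, assuming a 2013-era kernel API (two-argument submit_bio(rw, bio)
and two-argument bi_end_io); the helper name read_orig_block_sync is
hypothetical. Every write stalls behind a full disk read here, which is
consistent with the slowdown reported:

    #include <linux/bio.h>
    #include <linux/blkdev.h>
    #include <linux/completion.h>

    static void read_done(struct bio *bio, int error)
    {
            /* Wake whoever is waiting in read_orig_block_sync(). */
            complete((struct completion *)bio->bi_private);
    }

    /* Read the original block and wait for it; the caller holds up the
     * incoming write until this returns, serializing every write
     * behind a disk read. */
    static int read_orig_block_sync(struct block_device *bdev,
                                    sector_t sector, struct page *page)
    {
            DECLARE_COMPLETION_ONSTACK(done);
            struct bio *bio = bio_alloc(GFP_NOIO, 1);

            bio->bi_bdev = bdev;
            bio->bi_sector = sector;
            bio_add_page(bio, page, PAGE_SIZE, 0);
            bio->bi_private = &done;
            bio->bi_end_io = read_done;

            submit_bio(READ, bio);
            wait_for_completion(&done);
            bio_put(bio);
            return 0;
    }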
* Re: Fwd: block level cow operation
  2013-04-09  9:56 UTC
  From: Lukáš Czerner
  To: Prashant Shah; Cc: linux-ext4

On Tue, 9 Apr 2013, Prashant Shah wrote:

> Hi,
>
> I am trying to implement copy on write operation

Hi,

In ext4? Why are you trying to do that?

> by reading the original disk block and writing it to some other
> location and then allowing the write to pass through (block the
> write operation till the read of the original block completes) I
> tried using submit_bio() / sb_bread() to read the block and using
> the completion API to signal the end of reading the block but the
> performance of this is very bad. It takes around 12 times more time
> for any disk writes. Is there any better way to improve the
> performance?

I am not sure what you're trying to achieve here, but the simplest
answer is yes, there is a way to improve the performance - use the
device mapper to do this. The thin provisioning (thinp) target gives
you block-level COW functionality, which lets you do snapshots
efficiently, for example.

-Lukas

> Not waiting for the completion of the read operation and letting the
> disk write go through gives good performance but under 10% of the
> cases the read happens after the write and ends up with the new data
> and not the original data.
>
> Regards.
* Re: Fwd: block level cow operation
  2013-04-09 14:46 UTC
  From: Dmitry Monakhov
  To: Prashant Shah, linux-ext4

On Tue, 9 Apr 2013 14:35:56 +0530, Prashant Shah
<pshah.mumbai@gmail.com> wrote:
> Hi,
>
> I am trying to implement copy on write operation by reading the
> original disk block and writing it to some other location and then
> allowing the write to pass through (block the write operation till
> the read of the original block completes) I tried using submit_bio()
> / sb_bread() to read the block and using the completion API to
> signal the end of reading the block but the performance of this is
> very bad. It takes around 12 times more time for any disk writes.
> Is there any better way to improve the performance?

Yes, obviously. Instead of synchronous block-by-block handling, which
gives about ~1-3MB/s, you should not block bio/request handling, but
simply defer the original bio. Something like this:

    OUR_MAIN_ENTRY_POINT {
            if (bio->bi_rw == WRITE && cow_required(bio)) {
                    /* Save the original content first; the original
                     * write is resubmitted from cow_end_io(). */
                    cow_bio = create_cow_bio(bio);
                    submit_bio(READ, cow_bio);
                    return;
            }
            /* COW is not required */
            submit_bio(bio->bi_rw, bio);
    }

    create_cow_bio(struct bio *bio)
    {
            /* Build a read bio covering the same sectors as the
             * incoming write; once it completes we will issue the
             * original bio. */
            cow_bio = bio_alloc(...);
            cow_bio->bi_sector = bio->bi_sector;
            ...
            cow_bio->bi_private = bio;
            cow_bio->bi_end_io = cow_end_io;
            return cow_bio;
    }

    cow_end_io(struct bio *cow_bio, int error)
    {
            /* Once we are done saving the original content we may
             * send the original bio.  But end_io may be called from
             * various contexts, even interrupt context, so we are not
             * allowed to call submit_bio() here.  Put the original
             * bio on a list and let our worker thread submit it for
             * us later. */
            add_bio_to_the_list((struct bio *)cow_bio->bi_private);
    }

This approach gives us reasonable performance, roughly 3 times slower
than raw disk throughput. For a reference implementation you may look
at drivers/md/dm-snap or at the Acronis snapapi module (AFAIR it is
open source).

> Not waiting for the completion of the read operation and letting the
> disk write go through gives good performance but under 10% of the
> cases the read happens after the write and ends up with the new data
> and not the original data.

Noooo, never do that. The block layer will not guarantee you any
ordering.
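The end_io comment above leaves the worker thread itself to the
reader. A minimal sketch of that deferred-submission path, assuming a
2013-era kernel API (submit_bio(rw, bio) and the bio_list helpers from
<linux/bio.h>); the cow_deferred_* names are hypothetical:

    #include <linux/bio.h>
    #include <linux/spinlock.h>
    #include <linux/workqueue.h>

    static void cow_deferred_fn(struct work_struct *work);

    static struct bio_list cow_deferred_bios = BIO_EMPTY_LIST;
    static DEFINE_SPINLOCK(cow_deferred_lock);
    static DECLARE_WORK(cow_deferred_work, cow_deferred_fn);

    /* Called from cow_end_io(); must be safe in interrupt context. */
    static void add_bio_to_the_list(struct bio *bio)
    {
            unsigned long flags;

            spin_lock_irqsave(&cow_deferred_lock, flags);
            bio_list_add(&cow_deferred_bios, bio);
            spin_unlock_irqrestore(&cow_deferred_lock, flags);

            schedule_work(&cow_deferred_work);
    }

    /* Runs in process context, where submit_bio() is allowed. */
    static void cow_deferred_fn(struct work_struct *work)
    {
            struct bio *bio;
            unsigned long flags;

            for (;;) {
                    spin_lock_irqsave(&cow_deferred_lock, flags);
                    bio = bio_list_pop(&cow_deferred_bios);
                    spin_unlock_irqrestore(&cow_deferred_lock, flags);
                    if (!bio)
                            break;
                    submit_bio(bio->bi_rw, bio);
            }
    }

bio_list_add() is cheap and is called under a spinlock with interrupts
disabled, so cow_end_io() may safely invoke it from interrupt context,
while the actual submit_bio() happens later in the work item.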
* Re: Fwd: block level cow operation
  2013-04-25 13:00 UTC
  From: Prashant Shah
  To: Dmitry Monakhov; Cc: linux-ext4

Hi,

On Tue, Apr 9, 2013 at 8:16 PM, Dmitry Monakhov <dmonakhov@openvz.org>
wrote:
>
> you should not block bio/request handling, but simply defer the
> original bio. Something like this:
> ...
> This approach gives us reasonable performance, roughly 3 times
> slower than raw disk throughput. For a reference implementation you
> may look at drivers/md/dm-snap or at the Acronis snapapi module
> (AFAIR it is open source).

Thanks. That is what I was looking for. I got the reference code from
the snapapi module, which is open source. It is not specific to any
filesystem.

Regards.
* Re: Fwd: block level cow operation
  2013-05-10 13:14 UTC
  From: Prashant Shah
  To: Dmitry Monakhov; Cc: linux-ext4

Hi,

On Thu, Apr 25, 2013 at 6:30 PM, Prashant Shah
<pshah.mumbai@gmail.com> wrote:
> On Tue, Apr 9, 2013 at 8:16 PM, Dmitry Monakhov
> <dmonakhov@openvz.org> wrote:
>>
>> you should not block bio/request handling, but simply defer the
>> original bio. Something like this:
>> ...

Is this scenario possible? A write bio (bio1) for a particular sector
is under COW and waiting for the read of the original block to
complete. At the same time another write bio (bio2) arrives for the
same sector. The original order is bio1 then bio2, but since bio1 is
delayed by the COW, the order that reaches the queue becomes bio2
followed by bio1, which would make bio1 the final on-disk write.

Regards.
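As Dmitry noted earlier, the block layer does not guarantee ordering,
so the reordering described above is possible unless the driver
serializes it itself. A hypothetical sketch of one way to do that,
reusing the cow_deferred_* names from the earlier sketch: keep a table
of sectors whose COW read is still in flight, and park any new write
that hits one of them behind that COW (this mirrors what dm-snap's
pending-exception hash does):

    #include <linux/bio.h>
    #include <linux/list.h>
    #include <linux/spinlock.h>

    struct pending_cow {
            struct list_head entry;
            sector_t sector;
            struct bio_list waiters;  /* later writes to this sector */
    };

    static LIST_HEAD(pending_cows);
    static DEFINE_SPINLOCK(pending_lock);

    /* Called from the main entry point before a write is allowed
     * through. If the sector is still being copied, queue the bio
     * behind the COW instead of submitting it, so bio2 cannot
     * overtake bio1. */
    static bool defer_if_cow_pending(struct bio *bio)
    {
            struct pending_cow *pc;
            bool deferred = false;

            spin_lock(&pending_lock);
            list_for_each_entry(pc, &pending_cows, entry) {
                    if (pc->sector == bio->bi_sector) {
                            bio_list_add(&pc->waiters, bio);
                            deferred = true;
                            break;
                    }
            }
            spin_unlock(&pending_lock);
            return deferred;
    }

In this scheme, cow_end_io() would remove the pending_cow entry and
splice its waiters onto the deferred list after the original bio,
preserving the submission order for that sector.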
* Re: Fwd: block level cow operation
  2013-04-09 21:02 UTC
  From: Theodore Ts'o
  To: Prashant Shah; Cc: linux-ext4

On Tue, Apr 09, 2013 at 02:35:56PM +0530, Prashant Shah wrote:
> I am trying to implement copy on write operation by reading the
> original disk block and writing it to some other location....

Lukas asked the correct first question, which is: why are you trying
to do this?

If the goal is to make COW snapshots, then there's a lot of accounting
information that you'll need to keep track of, and it is very doubtful
ext4 will be the right place to do things.

If the goal is to do efficient writes into cheap eMMC flash for random
write workloads (i.e., the same problem f2fs is trying to solve), it's
not totally insane to try to adapt ext4 to handle this problem.

#1 You'd need to add support to mballoc so it understands how to
   align its block writes on eMMC erase-block boundaries, and has a
   mode where it hands out sequentially increasing physical blocks,
   ignoring the logical block numbers.

#2 You'd need to intercept the write requests at the writepages() and
   writepage() calls, and that's where the decision would have to be
   made to allocate a new set of block numbers, based on some flag
   set either per filesystem or per open file. As part of the I/O
   completion callback, where today we have code paths that convert
   an uninitialized extent to an initialized extent, we could teach
   that code path to update the logical block mapping.

#3 You'd have to come up with some approach to deal with direct I/O
   (including potentially not supporting COW writes for DIO).

#4 You'd probably only want to do this for indirect-block-mapped
   files, since for a random write workload the extent tree would
   become very inefficient very quickly.

So it's not _insane_, but it's a huge amount of work, it would be
very tricky, and it's not something I would recommend, say, as a term
project for a student.

It would also not be faster on SSDs or HDDs. The only reason to do
something like this would be to deal with the extremely low-cost FTL
of cheap eMMC flash devices (where the BOM cost of eMMC is roughly
two orders of magnitude cheaper than SSDs). So if you are
benchmarking this on an HDD or SSD, don't be surprised if it's much
slower. And if you are benchmarking on eMMC, you have to make sure
that your writes are appropriately erase-block aligned, or any
performance gains would be hopeless.

Regards,

					- Ted
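To make the alignment point in #1 and the closing paragraph concrete,
a small sketch of the arithmetic, assuming a power-of-two erase block;
the 4 MiB figure is an assumed example, not a measured eMMC parameter:

    /* Hypothetical: 4 MiB erase blocks, 4 KiB filesystem blocks,
     * so one erase block spans 1024 fs blocks. */
    #define ERASE_BLOCK_FS_BLOCKS   ((4096 * 1024) / 4096)

    /* Round a physical block number up to the next erase-block
     * boundary so a sequential write burst starts aligned. */
    static inline unsigned long align_to_erase_block(unsigned long pblk)
    {
            return (pblk + ERASE_BLOCK_FS_BLOCKS - 1) &
                   ~((unsigned long)ERASE_BLOCK_FS_BLOCKS - 1);
    }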