* API for multi-segment atomic IO
From: Doug Dumitru @ 2015-07-08 3:33 UTC (permalink / raw)
To: device-mapper development
... is anyone working on such an animal? I was thinking of a "struct bio"
extension. I know that Fusion had a private API for this, but I am not a
Fusion customer so I have not seen their concepts.
--
Doug Dumitru
EasyCo LLC
* Re: API for multi-segment atomic IO
From: Bart Van Assche @ 2015-07-08 15:38 UTC (permalink / raw)
To: doug, device-mapper development
On 07/07/2015 08:33 PM, Doug Dumitru wrote:
> ... is anyone working on such an animal. I was thinking of a "struct
> bio" extension. I know that Fusion had a private API for this, but I am
> not a Fusion customer so I have not seen their concepts.
The following document may be helpful: "SBC-4 SPC-5 Atomic writes and
reads" (http://www.t10.org/cgi-bin/ac.pl?t=d&f=13-064r9.pdf). Please
note that the information about atomic writes in that document has been
superseded by the section in the SBC-4 document about atomic writes
(http://www.t10.org/cgi-bin/ac.pl?t=f&f=sbc4r07c.pdf).
Bart.
* Re: API for multi-segment atomic IO
From: Doug Dumitru @ 2015-07-08 16:21 UTC (permalink / raw)
To: Bart Van Assche; +Cc: device-mapper development
Mr. Assche,
Thank you for the T10 links. I was actually looking for something at a
different level.
I have a "smart" block device that can implement multi-segment atomic
writes. This is more than what T10 envisions in that the writes can span
multiple LBA ranges, and multiple devices in the case of an array. Thus I
was looking at creating an OS level API for this type of operation.
My initial thought was to add two new fields to struct bio:
    u64 bi_atomic_transaction_id;
    u32 bi_atomic_transaction_length;
A writer would populate multiple bio requests with the same transaction_id
and a single transaction length, which is the combined length of all bio
requests in the transaction. The bio requests do not need to be linear or
adjacent. There would probably need to be alignment rules. There would be
no requirement that all of the requests are submitted at the same time.
There would probably need to be size and concurrency limits to keep the
engine reasonable.
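As a rough illustration of the tagging scheme described above, here is a
userspace sketch, not kernel code. The names atomic_tag, seg_request,
txn_state, txn_begin and txn_add are hypothetical, and the assumption that
the transaction length counts sectors is mine; the proposal does not fix a
unit. The sketch only shows how a device could tell when every request of a
transaction has arrived, even if the requests are non-adjacent and are
submitted at different times.

```c
#include <assert.h>   /* used by callers that check the results */
#include <stdint.h>

/* Hypothetical mirror of the two proposed struct bio fields. */
struct atomic_tag {
    uint64_t transaction_id;     /* same value on every request in the group */
    uint32_t transaction_length; /* combined length of the group (assumed: sectors) */
};

/* Simplified request standing in for a tagged bio. */
struct seg_request {
    uint64_t lba;
    uint32_t sectors;
    struct atomic_tag tag;
};

enum txn_status { TXN_PENDING, TXN_COMPLETE, TXN_REJECTED };

/* Per-transaction bookkeeping kept by the device. */
struct txn_state {
    uint64_t transaction_id;
    uint32_t expected;  /* total sectors announced by the tag */
    uint32_t received;  /* sectors accepted so far */
};

static void txn_begin(struct txn_state *t, uint64_t id, uint32_t total_sectors)
{
    t->transaction_id = id;
    t->expected = total_sectors;
    t->received = 0;
}

/* Account for one request; reject it if the tag does not match or if it
 * would push the transaction past its announced length. */
static enum txn_status txn_add(struct txn_state *t, const struct seg_request *rq)
{
    if (rq->tag.transaction_id != t->transaction_id ||
        rq->tag.transaction_length != t->expected ||
        rq->sectors == 0 ||
        rq->sectors > t->expected - t->received)
        return TXN_REJECTED;

    t->received += rq->sectors;
    return t->received == t->expected ? TXN_COMPLETE : TXN_PENDING;
}
```

A device would only make the data visible once txn_add() reports
TXN_COMPLETE; what happens to a transaction that never completes is exactly
the kind of size and concurrency limit mentioned above.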
My concern with just "making up an experimental API" is that, to be useful,
it needs to play nice inside of the block layer with other pieces. For
example, if I have a device that supports this API, I would like to be able
to use LVM on top of it, or have it as a member of a software mirror set.
This implies that the block stack has a convention to handle layered
devices with these transactions embedded within them.
Again, thank you for the links.
Doug Dumitru
--
Doug Dumitru
EasyCo LLC
* Re: API for multi-segment atomic IO
From: Christoph Hellwig @ 2015-07-08 19:38 UTC (permalink / raw)
To: doug, device-mapper development; +Cc: Bart Van Assche
On Wed, Jul 08, 2015 at 09:21:21AM -0700, Doug Dumitru wrote:
> I have a "smart" block device that can implement multi-segment atomic
> writes.
How about submitting your driver upstream first and then we can work
with you on an API that fits the devices and the consumers' needs?
* Re: API for multi-segment atomic IO
From: Doug Dumitru @ 2015-07-09 15:41 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: Bart Van Assche, device-mapper development
Mr. Hellwig,
I usually like to start with an interface and then implement the drivers
from there.
In this case, this is a block-level interface that supports new
functionality (atomic writes). In the past, you would approach this type
of problem by having the atomic and user layers as a monolithic solution.
Consider database updates and the complexity that they go through to ensure
database integrity. If a block device could provide a database with an
atomic update interface, the database would get a lot simpler. The same
discussion holds true for file systems. Depending on the atomic update
implementation, you might end up in the same place in terms of total code,
but you might also end up somewhere completely different.
The impetus for this is some research on file system "write
amplification". In general, file system design seems to be heading in the
direction of higher and higher write amplification. For example, the tree
structure of zfs is shockingly inefficient in terms of write overhead.
This is happening at the same time as Flash is becoming popular but is also
moving to smaller and smaller geometries. So write efficiency is becoming
more and more important.
Decoupling the atomic update semantics from file systems and other block
device "users" gives devices the opportunity to implement atomic updates
internal to, or in cooperation with, Flash management algorithms. In
theory, you can implement atomic updates without any extra writes. In
practice, some devices will be better than others.
I was hoping to stumble across someone interested in this as a concept, or
someone who has researched this area, as I don't have any existing code
that is near production quality. I could pretty easily hack in a couple of
extra fields in struct bio that would accomplish what I have in mind, but
others might have differing input.
Doug Dumitru
--
Doug Dumitru
EasyCo LLC
* Re: API for multi-segment atomic IO
From: Bart Van Assche @ 2015-07-09 16:34 UTC (permalink / raw)
To: doug, device-mapper development, Christoph Hellwig
On 07/09/2015 08:41 AM, Doug Dumitru wrote:
> Mr. Hellwig,
> On Wed, Jul 8, 2015 at 12:38 PM, Christoph Hellwig <hch@infradead.org
> <mailto:hch@infradead.org>> wrote:
>
> On Wed, Jul 08, 2015 at 09:21:21AM -0700, Doug Dumitru wrote:
> > I have a "smart" block device that can implement multi-segment atomic
> > writes.
>
> How about submitting your driver upstream first and then we can work
> with you on an API that fits the devices and the consumers needs.
>
> I usually like to start with an interface and then implement the
> driver's from there.
>
> In this case, this is a block-level interface that supports new
> functionality (atomic writes). In the past, you would approach this
> type of problem by having the atomic and user layers as a monolithic
> solution. Consider database updates and the complexity that they go
> through to insure database integrity. If a block device could provide a
> database with an atomic update interface, the database would get a lot
> simpler. The same discussion holds true for file systems. Depending on
> the atomic update implementation, you might end up in the same place in
> terms of total code, but you might also end up somewhere completely
> different.
>
> The impetus for this is some research on file system "write
> amplification". In general, file system design seems to be heading in
> the direction of higher and higher write amplification. For example,
> the tree structure of zfs is shockingly inefficient in terms of write
> overhead. This is happening at the same time as Flash is becoming
> popular but is also moving to smaller and smaller geometries. So write
> efficiency is becoming more and more important.
>
> By decoupling the atomic update semantics from file system and other
> block device "users", this gives devices the opportunity to implement
> atomic updates internal to or in cooperation with Flash management
> algorithms. In theory, you can implement atomic updates without any
> extra writes. In practice, some devices will be better than others.
>
> I was hoping to stumble across someone interested in this as a concept,
> or someone who has researched this area, as I don't have any near
> production existing code. I could pretty easily hack in a couple of
> extra fields in struct bio that would accomplish what I see, but others
> might have differing input.
Hello Doug,
When designing such an API, please try to stay close to the semantics of
the already standardized SCSI commands. As you probably know the Linux
SCSI core has been implemented as a block driver. Any new command that
is added to the Linux block layer has to be translated by the Linux SCSI
core into a SCSI command. An example of a patch series that adds support
for a new block layer primitive is the patch series that adds
compare-and-write support
(http://thread.gmane.org/gmane.linux.scsi/95869). Although that patch
series is not yet upstream I think it is a good example of how to add
new functionality to the block layer and SCSI core.
Bart.
* Re: API for multi-segment atomic IO
From: Doug Dumitru @ 2015-07-09 17:08 UTC (permalink / raw)
To: Bart Van Assche; +Cc: Christoph Hellwig, device-mapper development
Mr. Assche,
The problem with this is that the SCSI commands do not go far enough to
actually address the needs of applications that need atomic updates. An
application would "like" to be able to update a large, arbitrary set of
sectors on a device with atomic semantics. The SCSI commands require the
set to be contiguous. Application design starts to get interesting when
the contiguous restriction goes away.
My initial thoughts are to tag multiple IO requests with an ID and combined
length field. This would be compatible with the SCSI spec if the request
was contiguous, but nonsensical if the request were multi-segment. On the
other hand, just hitting the SCSI spec is probably as simple as adding an
"atomic" bit to the current structure so that IO pieces are not cut up.
But then, you don't address the multi-segment functionality that is
possible. Regardless, there will be issues as pieces of the current stack
don't lend themselves well to propagating atomic operations up and down the
stack. Just how do you split an atomic write across SCSI devices in a RAID
set anyway?
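The compatibility point above, that a tagged group only maps onto a
contiguous atomic write when its segments happen to be contiguous, can be
checked mechanically. The following is a minimal userspace sketch under that
reading; lba_range and ranges_are_contiguous are hypothetical names, and
nothing here is taken from the SCSI specification itself.

```c
#include <assert.h>   /* used by callers that check the results */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* One segment of a tagged transaction: start LBA and length in sectors. */
struct lba_range {
    uint64_t lba;
    uint32_t sectors;
};

/* qsort comparator: order ranges by starting LBA. */
static int cmp_range(const void *a, const void *b)
{
    const struct lba_range *x = a;
    const struct lba_range *y = b;
    return (x->lba > y->lba) - (x->lba < y->lba);
}

/* Sorts the ranges in place and returns true only if they tile a single
 * extent with no gaps and no overlaps, i.e. the whole group could be
 * expressed as one contiguous write. */
static bool ranges_are_contiguous(struct lba_range *r, size_t n)
{
    if (n == 0)
        return false;

    qsort(r, n, sizeof(*r), cmp_range);
    for (size_t i = 1; i < n; i++) {
        if (r[i - 1].lba + r[i - 1].sectors != r[i].lba)
            return false;
    }
    return true;
}
```

A group that fails this check is the multi-segment case that a contiguous
atomic write cannot express, so it would need some other mechanism.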
What I am most interested in is keeping the stack working with at least 1:1
mapping layers, and with 1:many layers below the layer that implements the
atomic functionality. Think of a dm-atomic.ko device that uses a log
internally to implement multi-segment atomic writes. It can talk safely to
RAID below it, but LVM should be able to sit above it and still have a file
system that expects atomic functionality to work. Now getting a SAN
connection to work like this would involve a new transport as iSCSI doesn't
really have the semantics for this, but maybe there are some extra
"transaction ID" bits that could be put into play (it has been a long time
since I dug into the depths of the SCSI layers).
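To make the dm-atomic idea a little more concrete, here is a toy in-memory
sketch of a log that stages several non-adjacent sector writes and applies
them only after a commit record exists. It is an assumption-laden
illustration, not a device-mapper target: atomic_dev, log_write, log_commit
and log_recover are hypothetical names, the "disk" is an array, and the
committed flag stands in for a durable commit record.

```c
#include <assert.h>   /* used by callers that check the results */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE  16  /* tiny sectors keep the example readable */
#define DISK_SECTORS 64
#define LOG_SLOTS    8

struct log_entry {
    uint64_t lba;
    uint8_t data[SECTOR_SIZE];
};

struct atomic_dev {
    uint8_t disk[DISK_SECTORS][SECTOR_SIZE]; /* home locations */
    struct log_entry log[LOG_SLOTS];         /* staged writes */
    size_t log_used;
    bool committed;  /* stands in for a durable commit record */
};

/* Stage one sector in the log; the home location is not touched yet. */
static bool log_write(struct atomic_dev *d, uint64_t lba, const uint8_t *data)
{
    if (d->committed || d->log_used == LOG_SLOTS || lba >= DISK_SECTORS)
        return false;
    d->log[d->log_used].lba = lba;
    memcpy(d->log[d->log_used].data, data, SECTOR_SIZE);
    d->log_used++;
    return true;
}

/* Mark the staged writes as one committed transaction. */
static void log_commit(struct atomic_dev *d)
{
    d->committed = true;
}

/* Run after a commit and again after any simulated crash: a committed log
 * is replayed in full, an uncommitted log is discarded in full. */
static void log_recover(struct atomic_dev *d)
{
    if (d->committed) {
        for (size_t i = 0; i < d->log_used; i++)
            memcpy(d->disk[d->log[i].lba], d->log[i].data, SECTOR_SIZE);
    }
    d->log_used = 0;
    d->committed = false;
}
```

Because the home locations change only during replay of a committed log,
either all segments of the transaction become visible or none do. The price
is that every sector is written twice, which is the write overhead a device
with its own Flash management might be able to avoid.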
Doug
--
Doug Dumitru
EasyCo LLC
* Re: API for multi-segment atomic IO
From: Bart Van Assche @ 2015-07-09 17:24 UTC (permalink / raw)
To: doug@easyco.com; +Cc: Christoph Hellwig, device-mapper development
On 07/09/2015 10:09 AM, Doug Dumitru wrote:
> The problem with this is that the SCSI commands do not go far enough to
> actually address the needs of applications that need atomic updates. An
> application would "like" to be able to update a large arbitrary set, of
> sectors on a device with atomic semantics. The SCSI commands require the
> set to be contiguous. Application design starts to get interesting when
> the contiguous restriction goes away.
>
> My initial thoughts are to tag multiple IO requests with an ID and
> combined length field. This would be compatible with the SCSI spec if
> the request was contiguous, but nonsensical if the request were
> multi-segment. On the other hand, just hitting the SCSI spec is
> probably as simple as adding an "atomic" bit to the current structure so
> that IO pieces are not cut up. But then, you don't address the
> multi-segment functionality that is possible. Regardless, there will be
> issues as pieces of the current stack don't lend themselves well to
> propagating atomic operations up and down the stack. Just how do you
> split an atomic write across scsi devices in a raid set anyway?
>
> What I am most interested in is keeping the stack working with at least
> 1:1 mapping layers, and with 1:many layers below the layer that
> implements the atomic functionality. Think of a dm-atomic.ko device
> that uses a log internally to implement multi-segment atomic writes. It
> can talk safely to raid below it, but lvm should be able to sit above it
> and still have a file system that expects atomic functionality to work.
> Now getting a SAN connection to work like this would involve a new
> transport as iSCSI doesn't really have the semantics for this, but maybe
> there are some extra "transaction ID" bits that could be put into play
> (it has been a long time since I dug into the depths of the SCSI layers).
Hello Doug,
The original proposal, from which the SBC-4 ATOMIC WRITE specification
has been derived, had support for scattered writes and gathered reads.
The title of that document is "SBC-4 SPC-5 Atomic writes and reads"
(http://www.t10.org/cgi-bin/ac.pl?t=d&f=14-043r4.pdf).
Bart.
* Re: API for multi-segment atomic IO
From: Doug Dumitru @ 2015-07-09 20:39 UTC (permalink / raw)
To: Bart Van Assche; +Cc: Christoph Hellwig, device-mapper development
Mr. Assche,
Thank you for the link. The proposal describes quite well "why"
multi-segment atomic writes are useful. I am not sure that this proposal
actually made it into the SPC-5 spec. It looks like it might overload some
of the data integrity fields and use them to link transactions together,
but I am rusty at reading specs like these. I will re-read it over the
weekend. If anyone "knows" whether the multi-segment support is IN or
OUT of the SPC-5 spec, please chime in.
Doug
--
Doug Dumitru
EasyCo LLC