* API for multi-segment atomic IO
From: Doug Dumitru @ 2015-07-08 3:33 UTC
To: device-mapper development

... is anyone working on such an animal? I was thinking of a "struct bio" extension. I know that Fusion had a private API for this, but I am not a Fusion customer so I have not seen their concepts.

--
Doug Dumitru
EasyCo LLC

* Re: API for multi-segment atomic IO
From: Bart Van Assche @ 2015-07-08 15:38 UTC
To: doug, device-mapper development

On 07/07/2015 08:33 PM, Doug Dumitru wrote:
> ... is anyone working on such an animal. I was thinking of a "struct
> bio" extension. I know that Fusion had a private API for this, but I am
> not a Fusion customer so I have not seen their concepts.

The following document may be helpful: "SBC-4 SPC-5 Atomic writes and reads" (http://www.t10.org/cgi-bin/ac.pl?t=d&f=13-064r9.pdf). Please note that the information about atomic writes in that document has been superseded by the section in the SBC-4 document about atomic writes (http://www.t10.org/cgi-bin/ac.pl?t=f&f=sbc4r07c.pdf).

Bart.

* Re: API for multi-segment atomic IO
From: Doug Dumitru @ 2015-07-08 16:21 UTC
To: Bart Van Assche; +Cc: device-mapper development

Mr. Assche,

Thank you for the T10 links. I was actually looking for something at a different level.

I have a "smart" block device that can implement multi-segment atomic writes. This is more than what T10 envisions, in that the writes can span multiple LBA ranges and, in the case of an array, multiple devices. Thus I was looking at creating an OS-level API for this type of operation.

My initial thought was to create two new fields in the bio struct:

    u64 bi_atomic_transaction_id;
    u32 bi_atomic_transaction_length;

A writer would populate multiple bio requests with the same transaction_id and a single write length, which is the combined length of all bio requests in the transaction. The bio requests do not need to be linear or adjacent. There would probably need to be alignment rules. There would be no requirement that all of the requests are submitted at the same time. There would probably need to be size and concurrency limits to keep the engine reasonable.

My concern with just "making up an experimental API" is that, to be useful, it needs to play nicely with the other pieces inside the block layer. For example, if I have a device that supports this API, I would like to be able to use LVM on top of it, or have it as a member of a software mirror set. This implies that the block stack has a convention for handling layered devices with these transactions embedded within them.

Again, thank you for the links.

Doug Dumitru
EasyCo LLC

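The two-field proposal in the message above can be illustrated with a small userspace sketch. This is not kernel code and not anything the thread's author posted: only the field names `bi_atomic_transaction_id` and `bi_atomic_transaction_length` come from the message, while `struct atomic_bio`, `struct txn_table`, `txn_submit()`, and the `MAX_OPEN_TXNS` limit are hypothetical names chosen for illustration. It assumes the length is counted in sectors and that a device considers a transaction complete once the segments it has received add up to the announced total.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace stand-in for a bio carrying the two proposed
 * fields. This is NOT the kernel's struct bio. */
struct atomic_bio {
    uint64_t sector;                        /* starting LBA of this segment */
    uint32_t nr_sectors;                    /* length of this segment */
    uint64_t bi_atomic_transaction_id;      /* shared by every bio in a transaction */
    uint32_t bi_atomic_transaction_length;  /* total sectors in the whole transaction */
};

#define MAX_OPEN_TXNS 8  /* assumed concurrency limit, for illustration only */

struct txn_state {
    bool     in_use;
    uint64_t id;
    uint32_t expected;  /* announced total length */
    uint32_t received;  /* sectors seen so far */
};

struct txn_table {
    struct txn_state slot[MAX_OPEN_TXNS];
};

enum txn_result { TXN_PENDING, TXN_COMPLETE, TXN_ERROR };

/* Account for one segment. Segments may arrive at different times and may
 * target non-adjacent LBA ranges; only the running total matters here. */
enum txn_result txn_submit(struct txn_table *t, const struct atomic_bio *bio)
{
    struct txn_state *s = NULL, *free_slot = NULL;

    for (size_t i = 0; i < MAX_OPEN_TXNS; i++) {
        struct txn_state *cur = &t->slot[i];
        if (cur->in_use && cur->id == bio->bi_atomic_transaction_id) {
            s = cur;
            break;
        }
        if (!cur->in_use && !free_slot)
            free_slot = cur;
    }

    if (!s) {
        if (!free_slot)
            return TXN_ERROR;  /* too many open transactions */
        s = free_slot;
        s->in_use = true;
        s->id = bio->bi_atomic_transaction_id;
        s->expected = bio->bi_atomic_transaction_length;
        s->received = 0;
    }

    /* Every segment must announce the same total and must not overrun it. */
    if (bio->nr_sectors == 0 ||
        bio->bi_atomic_transaction_length != s->expected ||
        bio->nr_sectors > s->expected - s->received) {
        s->in_use = false;  /* abort the whole transaction */
        return TXN_ERROR;
    }

    s->received += bio->nr_sectors;
    if (s->received == s->expected) {
        s->in_use = false;  /* all segments present: safe to commit */
        return TXN_COMPLETE;
    }
    return TXN_PENDING;
}
```

A caller would zero-initialize a `struct txn_table` and feed it segments as they arrive; two segments of 8 and 16 sectors tagged with the same ID and a total of 24 yield `TXN_PENDING` and then `TXN_COMPLETE`. The sketch deliberately ignores alignment rules and timeouts, both of which the message flags as open questions.
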
* Re: API for multi-segment atomic IO
From: Christoph Hellwig @ 2015-07-08 19:38 UTC
To: doug, device-mapper development; +Cc: Bart Van Assche

On Wed, Jul 08, 2015 at 09:21:21AM -0700, Doug Dumitru wrote:
> I have a "smart" block device that can implement multi-segment atomic
> writes.

How about submitting your driver upstream first, and then we can work with you on an API that fits the device's and the consumers' needs.

* Re: API for multi-segment atomic IO
From: Doug Dumitru @ 2015-07-09 15:41 UTC
To: Christoph Hellwig; +Cc: Bart Van Assche, device-mapper development

Mr. Hellwig,

I usually like to start with an interface and then implement the drivers from there.

In this case, this is a block-level interface that supports new functionality (atomic writes). In the past, you would approach this type of problem by having the atomic and user layers as a monolithic solution. Consider database updates and the complexity that they go through to ensure database integrity. If a block device could provide a database with an atomic update interface, the database would get a lot simpler. The same discussion holds true for file systems. Depending on the atomic update implementation, you might end up in the same place in terms of total code, but you might also end up somewhere completely different.

The impetus for this is some research on file system "write amplification". In general, file system design seems to be heading in the direction of higher and higher write amplification. For example, the tree structure of zfs is shockingly inefficient in terms of write overhead. This is happening at the same time as Flash is becoming popular but is also moving to smaller and smaller geometries. So write efficiency is becoming more and more important.

Decoupling the atomic update semantics from file systems and other block device "users" gives devices the opportunity to implement atomic updates internal to, or in cooperation with, Flash management algorithms. In theory, you can implement atomic updates without any extra writes. In practice, some devices will be better than others.

I was hoping to stumble across someone interested in this as a concept, or someone who has researched this area, as I don't have any existing code that is near production. I could pretty easily hack in a couple of extra fields in struct bio that would accomplish what I see, but others might have differing input.

Doug Dumitru
EasyCo LLC

* Re: API for multi-segment atomic IO
From: Bart Van Assche @ 2015-07-09 16:34 UTC
To: doug, device-mapper development, Christoph Hellwig

On 07/09/2015 08:41 AM, Doug Dumitru wrote:
> [...]
> I was hoping to stumble across someone interested in this as a concept,
> or someone who has researched this area, as I don't have any near
> production existing code. I could pretty easily hack in a couple of
> extra fields in struct bio that would accomplish what I see, but others
> might have differing input.

Hello Doug,

When designing such an API, please try to stay close to the semantics of the already standardized SCSI commands. As you probably know, the Linux SCSI core has been implemented as a block driver. Any new command that is added to the Linux block layer has to be translated by the Linux SCSI core into a SCSI command. An example of a patch series that adds support for a new block layer primitive is the patch series that adds compare-and-write support (http://thread.gmane.org/gmane.linux.scsi/95869). Although that patch series is not yet upstream, I think it is a good example of how to add new functionality to the block layer and SCSI core.

Bart.

* Re: API for multi-segment atomic IO
From: Doug Dumitru @ 2015-07-09 17:08 UTC
To: Bart Van Assche; +Cc: Christoph Hellwig, device-mapper development

Mr. Assche,

The problem with this is that the SCSI commands do not go far enough to actually address the needs of applications that need atomic updates. An application would "like" to be able to update a large arbitrary set of sectors on a device with atomic semantics. The SCSI commands require the set to be contiguous. Application design starts to get interesting when the contiguous restriction goes away.

My initial thoughts are to tag multiple IO requests with an ID and a combined length field. This would be compatible with the SCSI spec if the request was contiguous, but nonsensical if the request were multi-segment. On the other hand, just hitting the SCSI spec is probably as simple as adding an "atomic" bit to the current structure so that IO pieces are not cut up. But then you don't address the multi-segment functionality that is possible. Regardless, there will be issues, as pieces of the current stack don't lend themselves well to propagating atomic operations up and down the stack. Just how do you split an atomic write across SCSI devices in a RAID set anyway?

What I am most interested in is keeping the stack working with at least 1:1 mapping layers, and with 1:many layers below the layer that implements the atomic functionality. Think of a dm-atomic.ko device that uses a log internally to implement multi-segment atomic writes. It can talk safely to RAID below it, but LVM should be able to sit above it and still have a file system that expects atomic functionality to work.

Now, getting a SAN connection to work like this would involve a new transport, as iSCSI doesn't really have the semantics for this, but maybe there are some extra "transaction ID" bits that could be put into play (it has been a long time since I dug into the depths of the SCSI layers).

Doug

--
Doug Dumitru
EasyCo LLC

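The "dm-atomic.ko device that uses a log internally" idea from the message above can be sketched in userspace. No such module is shown anywhere in the thread, so everything below is an assumption made for illustration: `struct atomic_dev`, `atomic_stage()`, `atomic_commit()`, and `atomic_recover()` are hypothetical names, the "device" is an in-memory array, and a single boolean stands in for a durable commit record. The point is only to show the general write-ahead-log pattern: segments are staged in a log, and scattered home locations change only if the commit record was written.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512
#define DEV_SECTORS 64   /* tiny simulated device */
#define LOG_SLOTS   16   /* assumed per-transaction size limit */

struct log_entry {
    uint64_t sector;             /* home location of this segment */
    uint8_t  data[SECTOR_SIZE];  /* staged payload */
};

/* Hypothetical in-memory model of a log-backed atomic device. */
struct atomic_dev {
    uint8_t          media[DEV_SECTORS][SECTOR_SIZE]; /* home locations */
    struct log_entry log[LOG_SLOTS];                  /* staging log */
    uint32_t         log_used;
    bool             committed;  /* stands in for a durable commit record */
};

/* Stage one sector of the transaction in the log; home data is untouched. */
int atomic_stage(struct atomic_dev *d, uint64_t sector, const uint8_t *buf)
{
    if (d->committed || d->log_used == LOG_SLOTS || sector >= DEV_SECTORS)
        return -1;
    d->log[d->log_used].sector = sector;
    memcpy(d->log[d->log_used].data, buf, SECTOR_SIZE);
    d->log_used++;
    return 0;
}

/* Writing the commit record is the single atomic step. */
void atomic_commit(struct atomic_dev *d)
{
    d->committed = true;
}

/* Run after commit and at startup: apply a committed log to the home
 * locations, or discard an uncommitted one. Either all segments land
 * or none do. */
void atomic_recover(struct atomic_dev *d)
{
    if (d->committed) {
        for (uint32_t i = 0; i < d->log_used; i++)
            memcpy(d->media[d->log[i].sector], d->log[i].data, SECTOR_SIZE);
    }
    d->log_used = 0;
    d->committed = false;
}
```

Staging two non-adjacent sectors and calling `atomic_recover()` without `atomic_commit()` models a crash mid-transaction and leaves the media unchanged; staging, committing, and then recovering applies both sectors. Note that this naive scheme writes every sector twice, which is exactly the kind of write amplification the earlier message hopes a Flash-aware device could avoid.
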
* Re: API for multi-segment atomic IO
From: Bart Van Assche @ 2015-07-09 17:24 UTC
To: doug@easyco.com; +Cc: Christoph Hellwig, device-mapper development

On 07/09/2015 10:09 AM, Doug Dumitru wrote:
> The problem with this is that the SCSI commands do not go far enough to
> actually address the needs of applications that need atomic updates. An
> application would "like" to be able to update a large arbitrary set, of
> sectors on a device with atomic semantics. The SCSI commands require the
> set to be contiguous. Application design starts to get interesting when
> the contiguous restriction goes away.
> [...]

Hello Doug,

The original proposal, from which the SBC-4 ATOMIC WRITE specification has been derived, had support for scattered writes and gathered reads. The title of that document is "SBC-4 SPC-5 Atomic writes and reads" (http://www.t10.org/cgi-bin/ac.pl?t=d&f=14-043r4.pdf).

Bart.

* Re: API for multi-segment atomic IO
From: Doug Dumitru @ 2015-07-09 20:39 UTC
To: Bart Van Assche; +Cc: Christoph Hellwig, device-mapper development

Mr. Assche,

Thank you for the link. The proposal describes quite well "why" multi-segment atomic writes are useful.

I am not sure that this proposal actually made it into the SPC-5 spec. It looks like it might overload some of the data integrity fields and use them to link transactions together, but I am rusty at reading specs like these. I will re-read it over the weekend.

If anyone "knows" whether the multi-segment support is IN or OUT of the SPC-5 spec, please chime in.

Doug

--
Doug Dumitru
EasyCo LLC
