* [LSF/MM TOPIC] end-to-end data and metadata corruption detection
From: Chuck Lever @ 2012-01-17 20:15 UTC
To: lsf-pc
Cc: linux-fsdevel, Linux NFS Mailing List, linux-scsi

Hi-

I know there is some work on ext4 regarding metadata corruption detection; btrfs also has some corruption detection facilities. The IETF NFS working group is considering the addition of corruption detection to the next NFSv4 minor version. T10 has introduced DIF/DIX.

I'm probably ignorant of the current state of implementation in Linux, but I'm interested in understanding common ground among local file systems, block storage, and network file systems. Example questions include: Do we need standardized APIs for block device corruption detection? How much of T10 DIF/DIX should NFS support? What are the drivers for this feature (broad use cases)?

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
* Re: [LSF/MM TOPIC] end-to-end data and metadata corruption detection
From: Bernd Schubert @ 2012-01-26 12:31 UTC
To: Chuck Lever
Cc: lsf-pc, linux-fsdevel, Linux NFS Mailing List, linux-scsi, Sven Breuner

On 01/17/2012 09:15 PM, Chuck Lever wrote:
> Hi-
>
> I know there is some work on ext4 regarding metadata corruption detection; btrfs also has some corruption detection facilities. The IETF NFS working group is considering the addition of corruption detection to the next NFSv4 minor version. T10 has introduced DIF/DIX.
>
> I'm probably ignorant of the current state of implementation in Linux, but I'm interested in understanding common ground among local file systems, block storage, and network file systems. Example questions include: Do we need standardized APIs for block device corruption detection? How much of T10 DIF/DIX should NFS support? What are the drivers for this feature (broad use cases)?

Other network file systems such as Lustre already use their own network data checksums. As far as I know, Lustre plans (planned?) to also use the underlying ZFS checksums for network transfers, giving real client-to-disk (end-to-end) checksums. Using T10 DIF/DIX might be on their todo list.

We from the Fraunhofer FhGFS team would also like to see the T10 DIF/DIX API exposed to user space, so that we could make use of it for our FhGFS file system. And I think this feature is not only useful for file systems; in general, scientific applications, databases, etc. would also benefit from assurance of data integrity.

Cheers,
Bernd

--
Bernd Schubert
Fraunhofer ITWM
* Re: [LSF/MM TOPIC] end-to-end data and metadata corruption detection [not found] ` <4F2147BA.6030607-mPn0NPGs4xGatNDF+KUbs4QuADTiUCJX@public.gmane.org> @ 2012-01-26 14:53 ` Martin K. Petersen [not found] ` <yq1k44e1pn6.fsf-+q57XtR/GgMb6DWv4sQWN6xOck334EZe@public.gmane.org> [not found] ` <DE0353DF-83EA-480E-9C42-1EE760D6EE41@dilger.ca> 0 siblings, 2 replies; 25+ messages in thread From: Martin K. Petersen @ 2012-01-26 14:53 UTC (permalink / raw) To: Bernd Schubert Cc: Chuck Lever, lsf-pc-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-fsdevel, Linux NFS Mailing List, linux-scsi-u79uwXL29TY76Z2rM5mHXA, Sven Breuner >>>>> "Bernd" == Bernd Schubert <bernd.schubert-mPn0NPGs4xGatNDF+KUbs4QuADTiUCJX@public.gmane.org> writes: Bernd> We from the Fraunhofer FhGFS team would like to also see the T10 Bernd> DIF/DIX API exposed to user space, so that we could make use of Bernd> it for our FhGFS file system. And I think this feature is not Bernd> only useful for file systems, but in general, scientific Bernd> applications, databases, etc also would benefit from insurance of Bernd> data integrity. I'm attending a SNIA meeting today to discuss a (cross-OS) data integrity aware API. We'll see what comes out of that. With the Linux hat on I'm still mainly interested in pursuing the sys_dio interface Joel and I proposed last year. We have good experience with that I/O model and it suits applications that want to interact with the protection information well. libaio is also on my list. But obviously any help and input is appreciated... -- Martin K. Petersen Oracle Linux Engineering -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM TOPIC] end-to-end data and metadata corruption detection
From: Bernd Schubert @ 2012-01-26 16:27 UTC
To: Martin K. Petersen
Cc: Chuck Lever, lsf-pc, linux-fsdevel, Linux NFS Mailing List, linux-scsi, Sven Breuner

On 01/26/2012 03:53 PM, Martin K. Petersen wrote:
>>>>>> "Bernd" == Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> writes:
>
> Bernd> We from the Fraunhofer FhGFS team would like to also see the T10 DIF/DIX API exposed to user space, so that we could make use of it for our FhGFS file system. And I think this feature is not only useful for file systems, but in general, scientific applications, databases, etc also would benefit from insurance of data integrity.
>
> I'm attending a SNIA meeting today to discuss a (cross-OS) data integrity aware API. We'll see what comes out of that.
>
> With the Linux hat on I'm still mainly interested in pursuing the sys_dio interface Joel and I proposed last year. We have good experience with that I/O model and it suits applications that want to interact with the protection information well. libaio is also on my list.
>
> But obviously any help and input is appreciated...

I guess you are referring to the interface described here:

http://www.spinics.net/lists/linux-mm/msg14512.html

Hmm, direct I/O would mean we could not use the page cache. As we are using it, that would not really suit us. libaio might then be another option. What kind of help exactly do you need?

Thanks,
Bernd
* Re: [LSF/MM TOPIC] end-to-end data and metadata corruption detection 2012-01-26 16:27 ` Bernd Schubert @ 2012-01-26 23:21 ` James Bottomley [not found] ` <1327620104.6151.23.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org> 2012-01-31 2:10 ` Martin K. Petersen 1 sibling, 1 reply; 25+ messages in thread From: James Bottomley @ 2012-01-26 23:21 UTC (permalink / raw) To: Bernd Schubert Cc: Martin K. Petersen, Chuck Lever, lsf-pc, linux-fsdevel, Linux NFS Mailing List, linux-scsi, Sven Breuner On Thu, 2012-01-26 at 17:27 +0100, Bernd Schubert wrote: > On 01/26/2012 03:53 PM, Martin K. Petersen wrote: > >>>>>> "Bernd" == Bernd Schubert<bernd.schubert@itwm.fraunhofer.de> writes: > > > > Bernd> We from the Fraunhofer FhGFS team would like to also see the T10 > > Bernd> DIF/DIX API exposed to user space, so that we could make use of > > Bernd> it for our FhGFS file system. And I think this feature is not > > Bernd> only useful for file systems, but in general, scientific > > Bernd> applications, databases, etc also would benefit from insurance of > > Bernd> data integrity. > > > > I'm attending a SNIA meeting today to discuss a (cross-OS) data > > integrity aware API. We'll see what comes out of that. > > > > With the Linux hat on I'm still mainly interested in pursuing the > > sys_dio interface Joel and I proposed last year. We have good experience > > with that I/O model and it suits applications that want to interact with > > the protection information well. libaio is also on my list. > > > > But obviously any help and input is appreciated... > > > > I guess you are referring to the interface described here > > http://www.spinics.net/lists/linux-mm/msg14512.html > > Hmm, direct IO would mean we could not use the page cache. As we are > using it, that would not really suit us. libaio then might be another > option then. Are you really sure you want protection information and the page cache? The reason for using DIO is that no-one could really think of a valid page cache based use case. What most applications using protection information want is to say: This is my data and this is the integrity verification, send it down and assure me you wrote it correctly. If you go via the page cache, we have all sorts of problems, like our granularity is a page (not a block) so you'd have to guarantee to write a page at a time (a mechanism for combining subpage units of protection information sounds like a nightmare). The write becomes mark page dirty and wait for the system to flush it, and we can update the page in the meantime. How do we update the page and its protection information atomically. What happens if the page gets updated but no protection information is supplied and so on ... The can of worms just gets more squirmy. Doing DIO only avoids all of this. James ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM TOPIC] end-to-end data and metadata corruption detection
From: Bernd Schubert @ 2012-01-31 19:16 UTC
To: James Bottomley
Cc: Martin K. Petersen, Chuck Lever, lsf-pc, linux-fsdevel, Linux NFS Mailing List, linux-scsi, Sven Breuner

On 01/27/2012 12:21 AM, James Bottomley wrote:
> On Thu, 2012-01-26 at 17:27 +0100, Bernd Schubert wrote:
>> On 01/26/2012 03:53 PM, Martin K. Petersen wrote:
>>>>>>>> "Bernd" == Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> writes:
>>>
>>> Bernd> We from the Fraunhofer FhGFS team would like to also see the T10 DIF/DIX API exposed to user space, so that we could make use of it for our FhGFS file system. And I think this feature is not only useful for file systems, but in general, scientific applications, databases, etc also would benefit from insurance of data integrity.
>>>
>>> I'm attending a SNIA meeting today to discuss a (cross-OS) data integrity aware API. We'll see what comes out of that.
>>>
>>> With the Linux hat on I'm still mainly interested in pursuing the sys_dio interface Joel and I proposed last year. We have good experience with that I/O model and it suits applications that want to interact with the protection information well. libaio is also on my list.
>>>
>>> But obviously any help and input is appreciated...
>>
>> I guess you are referring to the interface described here
>>
>> http://www.spinics.net/lists/linux-mm/msg14512.html
>>
>> Hmm, direct IO would mean we could not use the page cache. As we are using it, that would not really suit us. libaio then might be another option then.
>
> Are you really sure you want protection information and the page cache? The reason for using DIO is that no-one could really think of a valid page cache based use case. What most applications using protection information want is to say: This is my data and this is the integrity verification, send it down and assure me you wrote it correctly. If you go via the page cache, we have all sorts of problems, like our granularity is a page (not a block) so you'd have to guarantee to write a page at a time (a mechanism for combining subpage units of protection information sounds like a nightmare). The write becomes mark page dirty and wait for the system to flush it, and we can update the page in the meantime. How do we update the page and its protection information atomically. What happens if the page gets updated but no protection information is supplied and so on ... The can of worms just gets more squirmy. Doing DIO only avoids all of this.

Well, entirely direct I/O will not work anyway: FhGFS is a parallel network file system, so data are sent from clients to servers and are not entirely direct anymore. The problem with server-side direct I/O to storage is that it is too slow for several workloads. I guess the write performance could mostly be solved somehow, but then the read cache would still be entirely missing. From Lustre history I know that the server-side read cache improved application performance at several sites. So I really wouldn't like to disable it for FhGFS...
I guess if we couldn't use the page cache, we probably wouldn't attempt to use the DIF/DIX interface, but would calculate our own checksums once we start working on the data integrity feature on our side.

Cheers,
Bernd
* Re: [LSF/MM TOPIC] end-to-end data and metadata corruption detection 2012-01-31 19:16 ` Bernd Schubert @ 2012-01-31 19:21 ` Chuck Lever 2012-01-31 20:04 ` Martin K. Petersen 0 siblings, 1 reply; 25+ messages in thread From: Chuck Lever @ 2012-01-31 19:21 UTC (permalink / raw) To: Bernd Schubert Cc: James Bottomley, Martin K. Petersen, lsf-pc, linux-fsdevel, Linux NFS Mailing List, linux-scsi, Sven Breuner On Jan 31, 2012, at 2:16 PM, Bernd Schubert wrote: > On 01/27/2012 12:21 AM, James Bottomley wrote: >> On Thu, 2012-01-26 at 17:27 +0100, Bernd Schubert wrote: >>> On 01/26/2012 03:53 PM, Martin K. Petersen wrote: >>>>>>>>> "Bernd" == Bernd Schubert<bernd.schubert@itwm.fraunhofer.de> writes: >>>> >>>> Bernd> We from the Fraunhofer FhGFS team would like to also see the T10 >>>> Bernd> DIF/DIX API exposed to user space, so that we could make use of >>>> Bernd> it for our FhGFS file system. And I think this feature is not >>>> Bernd> only useful for file systems, but in general, scientific >>>> Bernd> applications, databases, etc also would benefit from insurance of >>>> Bernd> data integrity. >>>> >>>> I'm attending a SNIA meeting today to discuss a (cross-OS) data >>>> integrity aware API. We'll see what comes out of that. >>>> >>>> With the Linux hat on I'm still mainly interested in pursuing the >>>> sys_dio interface Joel and I proposed last year. We have good experience >>>> with that I/O model and it suits applications that want to interact with >>>> the protection information well. libaio is also on my list. >>>> >>>> But obviously any help and input is appreciated... >>>> >>> >>> I guess you are referring to the interface described here >>> >>> http://www.spinics.net/lists/linux-mm/msg14512.html >>> >>> Hmm, direct IO would mean we could not use the page cache. As we are >>> using it, that would not really suit us. libaio then might be another >>> option then. >> >> Are you really sure you want protection information and the page cache? >> The reason for using DIO is that no-one could really think of a valid >> page cache based use case. What most applications using protection >> information want is to say: This is my data and this is the integrity >> verification, send it down and assure me you wrote it correctly. If you >> go via the page cache, we have all sorts of problems, like our >> granularity is a page (not a block) so you'd have to guarantee to write >> a page at a time (a mechanism for combining subpage units of protection >> information sounds like a nightmare). The write becomes mark page dirty >> and wait for the system to flush it, and we can update the page in the >> meantime. How do we update the page and its protection information >> atomically. What happens if the page gets updated but no protection >> information is supplied and so on ... The can of worms just gets more >> squirmy. Doing DIO only avoids all of this. > > Well, entirely direct-IO will not work anyway as FhGFS is a parallel network file system, so data are sent from clients to servers, so data are not entirely direct anymore. > The problem with server side storage direct-IO is that it is too slow for several work cases. I guess the write performance could be mostly solved somehow, but then still the read-cache would be entirely missing. From Lustre history I know that server side read-cache improved performance of applications at several sites. So I really wouldn't like to disable it for FhGFS... 
> I guess if we couldn't use the page cache, we probably wouldn't attempt to use DIF/DIX interface, but will calculate our own checksums once we are going to work on the data integrity feature on our side. This is interesting. I imagine the Linux kernel NFS server will have the same issue: it depends on the page cache for good performance, and does not, itself, use direct I/O. Thus it wouldn't be able to use a direct I/O-only DIF/DIX implementation, and we can't use DIF/DIX for end-to-end corruption detection for a Linux client - Linux server configuration. If high-performance applications such as databases demand corruption detection, it will need to work without introducing significant performance overhead. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM TOPIC] end-to-end data and metadata corruption detection
From: Martin K. Petersen @ 2012-01-31 20:04 UTC
To: Chuck Lever
Cc: Bernd Schubert, James Bottomley, Martin K. Petersen, lsf-pc, linux-fsdevel, Linux NFS Mailing List, linux-scsi, Sven Breuner

>>>>> "Chuck" == Chuck Lever <chuck.lever@oracle.com> writes:

>> I guess if we couldn't use the page cache, we probably wouldn't attempt to use DIF/DIX interface, but will calculate our own checksums once we are going to work on the data integrity feature on our side.

Chuck> This is interesting. I imagine the Linux kernel NFS server will have the same issue: it depends on the page cache for good performance, and does not, itself, use direct I/O.

Just so we're perfectly clear here: There's nothing that prevents you from hanging protection information off of your page private pointer so it can be submitted along with the data. The concern is purely that the filesystem owning the pages needs to handle access conflicts (racing DI and non-DI updates) and potentially sub-page modifications for small filesystem block sizes.

The other problem with buffered I/O is that the notion of "all these pages are belong to us^wone write request" goes out the window. You really need something like aio to be able to get completion status and re-drive the I/O if there's an integrity error. Otherwise you might as well just let the kernel autoprotect the pages from the block layer down (which is what we do now).

--
Martin K. Petersen
Oracle Linux Engineering
* Re: [LSF/MM TOPIC] end-to-end data and metadata corruption detection 2012-01-26 16:27 ` Bernd Schubert 2012-01-26 23:21 ` James Bottomley @ 2012-01-31 2:10 ` Martin K. Petersen 2012-01-31 19:22 ` Bernd Schubert 1 sibling, 1 reply; 25+ messages in thread From: Martin K. Petersen @ 2012-01-31 2:10 UTC (permalink / raw) To: Bernd Schubert Cc: Martin K. Petersen, Chuck Lever, lsf-pc, linux-fsdevel, Linux NFS Mailing List, linux-scsi, Sven Breuner >>>>> "Bernd" == Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> writes: Bernd> Hmm, direct IO would mean we could not use the page cache. As we Bernd> are using it, that would not really suit us. libaio then might be Bernd> another option then. Bernd> What kind of help do you exactly need? As far as libaio is concerned I had a PoC working a few years ago. I'll be happy to revive it if people are actually interested. So a real world use case would be a great help... But James is right that buffered I/O is much more challenging than direct I/O. And all the use cases we have had have involved databases and business apps that were doing direct I/O anyway. -- Martin K. Petersen Oracle Linux Engineering ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM TOPIC] end-to-end data and metadata corruption detection 2012-01-31 2:10 ` Martin K. Petersen @ 2012-01-31 19:22 ` Bernd Schubert 2012-01-31 19:28 ` Gregory Farnum 0 siblings, 1 reply; 25+ messages in thread From: Bernd Schubert @ 2012-01-31 19:22 UTC (permalink / raw) To: Martin K. Petersen Cc: Chuck Lever, lsf-pc, linux-fsdevel, Linux NFS Mailing List, linux-scsi, Sven Breuner On 01/31/2012 03:10 AM, Martin K. Petersen wrote: >>>>>> "Bernd" == Bernd Schubert<bernd.schubert@itwm.fraunhofer.de> writes: > > Bernd> Hmm, direct IO would mean we could not use the page cache. As we > Bernd> are using it, that would not really suit us. libaio then might be > Bernd> another option then. > > Bernd> What kind of help do you exactly need? > > As far as libaio is concerned I had a PoC working a few years ago. I'll > be happy to revive it if people are actually interested. So a real world > use case would be a great help... I guess it would be useful for us, although right now data integrity is not on our todo list for the next couple of months. Unless other people would be interested in it right now, can we postpone for some time? > > But James is right that buffered I/O is much more challenging than > direct I/O. And all the use cases we have had have involved databases > and business apps that were doing direct I/O anyway. > I guess we should talk to developers of other parallel file systems and see what they think about it. I think cephfs already uses data integrity provided by btrfs, although I'm not entirely sure and need to check the code. As I said before, Lustre does network checksums already and *might* be interested. Cheers, Bernd ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [LSF/MM TOPIC] end-to-end data and metadata corruption detection
From: Gregory Farnum @ 2012-01-31 19:28 UTC
To: Bernd Schubert
Cc: Martin K. Petersen, Chuck Lever, lsf-pc, linux-fsdevel, Linux NFS Mailing List, linux-scsi, Sven Breuner

On Tue, Jan 31, 2012 at 11:22 AM, Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> wrote:
> I guess we should talk to developers of other parallel file systems and see what they think about it. I think cephfs already uses data integrity provided by btrfs, although I'm not entirely sure and need to check the code. As I said before, Lustre does network checksums already and *might* be interested.

Actually, right now Ceph doesn't check btrfs' data integrity information, but since Ceph doesn't have any data-at-rest integrity verification it relies on btrfs if you want that. Integrating integrity verification throughout the system is on our long-term to-do list.

We too will be sad if using a kernel-level integrity system requires using DIO, although we could probably work out a way to do "translation" between our own integrity checksums and the btrfs-generated ones if we have to (thanks to replication).
-Greg
* Re: [Lsf-pc] [LSF/MM TOPIC] end-to-end data and metadata corruption detection 2012-01-31 19:28 ` Gregory Farnum @ 2012-02-01 16:45 ` Chris Mason 2012-02-01 16:52 ` James Bottomley 0 siblings, 1 reply; 25+ messages in thread From: Chris Mason @ 2012-02-01 16:45 UTC (permalink / raw) To: Gregory Farnum Cc: Bernd Schubert, Linux NFS Mailing List, linux-scsi, Martin K. Petersen, Sven Breuner, Chuck Lever, linux-fsdevel, lsf-pc On Tue, Jan 31, 2012 at 11:28:26AM -0800, Gregory Farnum wrote: > On Tue, Jan 31, 2012 at 11:22 AM, Bernd Schubert > <bernd.schubert@itwm.fraunhofer.de> wrote: > > I guess we should talk to developers of other parallel file systems and see > > what they think about it. I think cephfs already uses data integrity > > provided by btrfs, although I'm not entirely sure and need to check the > > code. As I said before, Lustre does network checksums already and *might* be > > interested. > > Actually, right now Ceph doesn't check btrfs' data integrity > information, but since Ceph doesn't have any data-at-rest integrity > verification it relies on btrfs if you want that. Integrating > integrity verification throughout the system is on our long-term to-do > list. > We too will be said if using a kernel-level integrity system requires > using DIO, although we could probably work out a way to do > "translation" between our own integrity checksums and the > btrfs-generated ones if we have to (thanks to replication). DIO isn't really required, but doing this without synchronous writes will get painful in a hurry. There's nothing wrong with letting the data sit in the page cache after the IO is done though. -chris ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] end-to-end data and metadata corruption detection 2012-02-01 16:45 ` [Lsf-pc] " Chris Mason @ 2012-02-01 16:52 ` James Bottomley [not found] ` <1328115175.2768.11.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org> 2012-02-01 18:15 ` Martin K. Petersen 0 siblings, 2 replies; 25+ messages in thread From: James Bottomley @ 2012-02-01 16:52 UTC (permalink / raw) To: Chris Mason Cc: Gregory Farnum, Bernd Schubert, Linux NFS Mailing List, linux-scsi, Martin K. Petersen, Sven Breuner, Chuck Lever, linux-fsdevel, lsf-pc On Wed, 2012-02-01 at 11:45 -0500, Chris Mason wrote: > On Tue, Jan 31, 2012 at 11:28:26AM -0800, Gregory Farnum wrote: > > On Tue, Jan 31, 2012 at 11:22 AM, Bernd Schubert > > <bernd.schubert@itwm.fraunhofer.de> wrote: > > > I guess we should talk to developers of other parallel file systems and see > > > what they think about it. I think cephfs already uses data integrity > > > provided by btrfs, although I'm not entirely sure and need to check the > > > code. As I said before, Lustre does network checksums already and *might* be > > > interested. > > > > Actually, right now Ceph doesn't check btrfs' data integrity > > information, but since Ceph doesn't have any data-at-rest integrity > > verification it relies on btrfs if you want that. Integrating > > integrity verification throughout the system is on our long-term to-do > > list. > > We too will be said if using a kernel-level integrity system requires > > using DIO, although we could probably work out a way to do > > "translation" between our own integrity checksums and the > > btrfs-generated ones if we have to (thanks to replication). > > DIO isn't really required, but doing this without synchronous writes > will get painful in a hurry. There's nothing wrong with letting the > data sit in the page cache after the IO is done though. I broadly agree with this, but even if you do sync writes and cache read only copies, we still have the problem of how we do the read side verification of DIX. In theory, when you read, you could either get the cached copy or an actual read (which will supply protection information), so for the cached copy we need to return cached protection information implying that we need some way of actually caching it. James ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] end-to-end data and metadata corruption detection [not found] ` <1328115175.2768.11.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org> @ 2012-02-01 17:41 ` Chris Mason 2012-02-01 17:59 ` Bernd Schubert 0 siblings, 1 reply; 25+ messages in thread From: Chris Mason @ 2012-02-01 17:41 UTC (permalink / raw) To: James Bottomley Cc: Gregory Farnum, Bernd Schubert, Linux NFS Mailing List, linux-scsi-u79uwXL29TY76Z2rM5mHXA, Martin K. Petersen, Sven Breuner, Chuck Lever, linux-fsdevel, lsf-pc-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA On Wed, Feb 01, 2012 at 10:52:55AM -0600, James Bottomley wrote: > On Wed, 2012-02-01 at 11:45 -0500, Chris Mason wrote: > > On Tue, Jan 31, 2012 at 11:28:26AM -0800, Gregory Farnum wrote: > > > On Tue, Jan 31, 2012 at 11:22 AM, Bernd Schubert > > > <bernd.schubert-mPn0NPGs4xGatNDF+KUbs4QuADTiUCJX@public.gmane.org> wrote: > > > > I guess we should talk to developers of other parallel file systems and see > > > > what they think about it. I think cephfs already uses data integrity > > > > provided by btrfs, although I'm not entirely sure and need to check the > > > > code. As I said before, Lustre does network checksums already and *might* be > > > > interested. > > > > > > Actually, right now Ceph doesn't check btrfs' data integrity > > > information, but since Ceph doesn't have any data-at-rest integrity > > > verification it relies on btrfs if you want that. Integrating > > > integrity verification throughout the system is on our long-term to-do > > > list. > > > We too will be said if using a kernel-level integrity system requires > > > using DIO, although we could probably work out a way to do > > > "translation" between our own integrity checksums and the > > > btrfs-generated ones if we have to (thanks to replication). > > > > DIO isn't really required, but doing this without synchronous writes > > will get painful in a hurry. There's nothing wrong with letting the > > data sit in the page cache after the IO is done though. > > I broadly agree with this, but even if you do sync writes and cache read > only copies, we still have the problem of how we do the read side > verification of DIX. In theory, when you read, you could either get the > cached copy or an actual read (which will supply protection > information), so for the cached copy we need to return cached protection > information implying that we need some way of actually caching it. Good point, reading from the cached copy is a lower level of protection because in theory bugs in your scsi drivers could corrupt the pages later on. But I think even without keeping the crcs attached to the page, there is value in keeping the cached copy in lots of workloads. The database is going to O_DIRECT read (with crcs checked) and then stuff it into a database buffer cache for long term use. Stuffing it into a page cache on the kernel side is about the same. -chris -- To unsubscribe from this list: send the line "unsubscribe linux-nfs" in the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] end-to-end data and metadata corruption detection
From: Bernd Schubert @ 2012-02-01 17:59 UTC
To: Chris Mason, James Bottomley, Gregory Farnum, Bernd Schubert, Linux NFS Mailing List, linux-scsi, Martin K. Petersen, Sven Breuner, Chuck Lever, linux-fsdevel, lsf-pc

On 02/01/2012 06:41 PM, Chris Mason wrote:
> On Wed, Feb 01, 2012 at 10:52:55AM -0600, James Bottomley wrote:
>> On Wed, 2012-02-01 at 11:45 -0500, Chris Mason wrote:
>>> On Tue, Jan 31, 2012 at 11:28:26AM -0800, Gregory Farnum wrote:
>>>> On Tue, Jan 31, 2012 at 11:22 AM, Bernd Schubert <bernd.schubert@itwm.fraunhofer.de> wrote:
>>>>> I guess we should talk to developers of other parallel file systems and see what they think about it. I think cephfs already uses data integrity provided by btrfs, although I'm not entirely sure and need to check the code. As I said before, Lustre does network checksums already and *might* be interested.
>>>>
>>>> Actually, right now Ceph doesn't check btrfs' data integrity information, but since Ceph doesn't have any data-at-rest integrity verification it relies on btrfs if you want that. Integrating integrity verification throughout the system is on our long-term to-do list. We too will be said if using a kernel-level integrity system requires using DIO, although we could probably work out a way to do "translation" between our own integrity checksums and the btrfs-generated ones if we have to (thanks to replication).
>>>
>>> DIO isn't really required, but doing this without synchronous writes will get painful in a hurry. There's nothing wrong with letting the data sit in the page cache after the IO is done though.
>>
>> I broadly agree with this, but even if you do sync writes and cache read only copies, we still have the problem of how we do the read side verification of DIX. In theory, when you read, you could either get the cached copy or an actual read (which will supply protection information), so for the cached copy we need to return cached protection information implying that we need some way of actually caching it.
>
> Good point, reading from the cached copy is a lower level of protection because in theory bugs in your scsi drivers could corrupt the pages later on.

But that only matters if the application is going to verify whether data are really on disk. For example (client-server scenario):

1) client-A writes a page
2) client-B reads this page

client-B simply is not interested in where it gets the page from, as long as it gets correct data. The network file system in between will also just be happy with the existing in-cache CRCs for network verification. Only if the page is later dropped from the cache and read again do the on-disk CRCs matter. If those are bad, one of the layers is going to complain or correct the data.

If the application wants to check data on disk it can either use DIO or alternatively something like fadvise(DONTNEED_LOCAL_AND_REMOTE) (something I have wanted to propose for some time already; at least I'm not happy that posix_fadvise(POSIX_FADV_DONTNEED) is not passed to the file system at all).

Cheers,
Bernd
* Re: [Lsf-pc] [LSF/MM TOPIC] end-to-end data and metadata corruption detection 2012-02-01 17:59 ` Bernd Schubert @ 2012-02-01 18:16 ` James Bottomley 2012-02-01 18:30 ` Andrea Arcangeli 0 siblings, 1 reply; 25+ messages in thread From: James Bottomley @ 2012-02-01 18:16 UTC (permalink / raw) To: Bernd Schubert Cc: Chris Mason, Gregory Farnum, Bernd Schubert, Linux NFS Mailing List, linux-scsi, Martin K. Petersen, Sven Breuner, Chuck Lever, linux-fsdevel, lsf-pc On Wed, 2012-02-01 at 18:59 +0100, Bernd Schubert wrote: > On 02/01/2012 06:41 PM, Chris Mason wrote: > > On Wed, Feb 01, 2012 at 10:52:55AM -0600, James Bottomley wrote: > >> On Wed, 2012-02-01 at 11:45 -0500, Chris Mason wrote: [...] > >>> DIO isn't really required, but doing this without synchronous writes > >>> will get painful in a hurry. There's nothing wrong with letting the > >>> data sit in the page cache after the IO is done though. > >> > >> I broadly agree with this, but even if you do sync writes and cache read > >> only copies, we still have the problem of how we do the read side > >> verification of DIX. In theory, when you read, you could either get the > >> cached copy or an actual read (which will supply protection > >> information), so for the cached copy we need to return cached protection > >> information implying that we need some way of actually caching it. > > > > Good point, reading from the cached copy is a lower level of protection > > because in theory bugs in your scsi drivers could corrupt the pages > > later on. > > But that only matters if the application is going to verify if data are > really on disk. For example (client server scenario) Um, well, then why do you want DIX? If you don't care about having the client verify the data, that means you trust the integrity of the page cache and then you just use the automated DIF within the driver layer and SCSI will verify the data all the way up until block places it in the page cache. The whole point of supplying protection information to user space is that the application can verify the data didn't get corrupted after it left the DIF protected block stack. > 1) client-A writes a page > 2) client-B reads this page > > client-B is simply not interested here where it gets the page from, as > long as it gets correct data. How does it know it got correct data if it doesn't verify? Something might have corrupted the page between the time the block layer placed the DIF verified data there and the client reads it. > The network files system in between also > will just be happy existing in-cache crcs for network verification. > Only if the page is later on dropped from the cache and read again, > on-disk crcs matter. If those are bad, one of the layers is going to > complain or correct those data. > > If the application wants to check data on disk it can either use DIO or > alternatively something like fadvsise(DONTNEED_LOCAL_AND_REMOTE) > (something I wanted to propose for some time already, at least I'm not > happy that posix_fadvise(POSIX_FADV_DONTNEED) is not passed to the file > system at all). supplying protection information to user space isn't about the application checking what's on disk .. there's automatic verification in the chain to do that (both the HBA and the disk will check the protection information on entry/exit and transfer). Supplying protection information to userspace is about checking nothing went wrong in the handoff between the end of the DIF stack and the application. James ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] end-to-end data and metadata corruption detection
From: Andrea Arcangeli @ 2012-02-01 18:30 UTC
To: James Bottomley
Cc: Bernd Schubert, Linux NFS Mailing List, linux-scsi, Martin K. Petersen, linux-fsdevel, Chuck Lever, Sven Breuner, Gregory Farnum, lsf-pc, Chris Mason

On Wed, Feb 01, 2012 at 12:16:05PM -0600, James Bottomley wrote:
> supplying protection information to user space isn't about the application checking what's on disk .. there's automatic verification in the chain to do that (both the HBA and the disk will check the protection information on entry/exit and transfer). Supplying protection information to userspace is about checking nothing went wrong in the handoff between the end of the DIF stack and the application.

Not sure if I got this right, but keeping protection information for the in-RAM pagecache and exposing it to userland somehow sounds to me like a bit of overkill as a concept. Then you should want that for anonymous memory too. If you copy the pagecache to a malloc()ed buffer and verify the pagecache was consistent, but then the buffer is corrupted by a hardware bitflip or software bug, what's the point? Besides, if this is getting exposed to userland and it's not hidden in the kernel (FS/storage layers), userland could code its own verification logic without much added complexity. With CRC in hardware on the CPU it doesn't sound like a big cost to do it fully in userland, and then you could run it on anonymous memory too if you need to and not be dependent on hardware or filesystem details (well, other than a cpuid check at startup).
* Re: [Lsf-pc] [LSF/MM TOPIC] end-to-end data and metadata corruption detection
From: Bernd Schubert @ 2012-02-02 9:04 UTC
To: Andrea Arcangeli
Cc: James Bottomley, Bernd Schubert, Linux NFS Mailing List, linux-scsi, Martin K. Petersen, linux-fsdevel, Chuck Lever, Sven Breuner, Gregory Farnum, lsf-pc, Chris Mason

On 02/01/2012 07:30 PM, Andrea Arcangeli wrote:
> On Wed, Feb 01, 2012 at 12:16:05PM -0600, James Bottomley wrote:
>> supplying protection information to user space isn't about the application checking what's on disk .. there's automatic verification in the chain to do that (both the HBA and the disk will check the protection information on entry/exit and transfer). Supplying protection information to userspace is about checking nothing went wrong in the handoff between the end of the DIF stack and the application.
>
> Not sure if I got this right, but keeping protection information for in-ram pagecache and exposing it to userland somehow, to me sounds a bit of overkill as a concept. Then you should want that for anonymous memory too. If you copy the pagecache to a malloc()ed buffer and verify pagecache was consistent, but then the buffer is corrupt by hardware bitflip or software bug, then what's the point. Besides if this is getting exposed to userland and it's not hidden in the kernel (FS/Storage layers), userland could code its own verification logic without much added complexity. With CRC in hardware on the CPU it doesn't sound like a big cost to do it fully in userland and then you could run it on anonymous memory too if you need and not be dependent on hardware or filesystem details (well other than a cpuid check at startup).

I think the point for network file systems is that they can reuse the disk checksum for network verification. So instead of calculating one checksum for the network and one for the disk, just use one for both. The checksum is also supposed to be cached in memory, as that avoids re-calculation for other clients.

1) client-1: sends data and checksum

   server: receives those data and verifies the checksum -> network transfer was ok; sends data and checksum to disk

2) client-2 ... client-N: ask for those data

   server: sends cached data and cached checksum

   client-2 ... client-N: receive data and verify checksum

So the whole point of caching checksums is to avoid the server having to recalculate them for dozens of clients. Recalculating checksums simply does not scale with an increasing number of clients that want to read data processed by another client.

Cheers,
Bernd
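To make the caching scheme above concrete, here is a minimal user-space sketch of the kind of per-block checksum cache a file server could keep, so that a checksum verified once on ingest is reused both for the disk write and for every later network reply. The structure and function names are invented for illustration, and zlib's crc32() merely stands in for whatever algorithm disk and network would share:

    #include <stdint.h>
    #include <string.h>
    #include <zlib.h>   /* crc32() as a stand-in checksum algorithm */

    #define BLOCK_SIZE 4096

    /* One cached block on the server: payload plus the checksum received
     * from (and verified against) the writing client. */
    struct cached_block {
        unsigned char data[BLOCK_SIZE];
        uint32_t      crc;     /* checksum shared by network and disk formats */
        int           valid;
    };

    /* Step 1: client-1 sends data + checksum; verify once, then cache both. */
    int server_write(struct cached_block *blk, const unsigned char *buf,
                     uint32_t client_crc)
    {
        if ((uint32_t)crc32(0L, buf, BLOCK_SIZE) != client_crc)
            return -1;                 /* network transfer was corrupted */
        memcpy(blk->data, buf, BLOCK_SIZE);
        blk->crc = client_crc;         /* reused for the disk write and later reads */
        blk->valid = 1;
        return 0;                      /* data + crc now go to disk together */
    }

    /* Step 2: clients 2..N read; serve cached data and cached checksum
     * without recomputing anything on the server. */
    int server_read(const struct cached_block *blk, unsigned char *buf,
                    uint32_t *crc_out)
    {
        if (!blk->valid)
            return -1;                 /* would have to re-read data + PI from disk */
        memcpy(buf, blk->data, BLOCK_SIZE);
        *crc_out = blk->crc;           /* the client verifies on receipt */
        return 0;
    }

The point of the sketch is only the data flow: the checksum is computed or verified once at the edge and then travels with the cached copy, which is what makes the scheme scale with the number of reading clients.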
* Re: [Lsf-pc] [LSF/MM TOPIC] end-to-end data and metadata corruption detection 2012-02-02 9:04 ` Bernd Schubert @ 2012-02-02 19:26 ` Andrea Arcangeli [not found] ` <20120202192643.GC5873-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> 0 siblings, 1 reply; 25+ messages in thread From: Andrea Arcangeli @ 2012-02-02 19:26 UTC (permalink / raw) To: Bernd Schubert Cc: Linux NFS Mailing List, linux-scsi, Martin K. Petersen, Bernd Schubert, James Bottomley, Sven Breuner, Chuck Lever, linux-fsdevel, Gregory Farnum, lsf-pc, Chris Mason On Thu, Feb 02, 2012 at 10:04:59AM +0100, Bernd Schubert wrote: > I think the point for network file systems is that they can reuse the > disk-checksum for network verification. So instead of calculating a > checksum for network and disk, just use one for both. The checksum also > is supposed to be cached in memory, as that avoids re-calculation for > other clients. > > 1) > client-1: sends data and checksum > > server: Receives those data and verifies the checksum -> network > transfer was ok, sends data and checksum to disk > > 2) > client-2 ... client-N: Ask for those data > > server: send cached data and cached checksum > > client-2 ... client-N: Receive data and verify checksum > > > So the hole point of caching checksums is to avoid the server needs to > recalculate those for dozens of clients. Recalculating checksums simply > does not scale with an increasing number of clients, which want to read > data processed by another client. This makes sense indeed. My argument was only about the exposure of the storage hw format cksum to userland (through some new ioctl for further userland verification of the pagecache data in the client pagecache, done by whatever program is reading from the cache). The network fs client lives in kernel, the network fs server lives in kernel, so no need to expose the cksum to userland to do what you described above. I meant if we can't trust the pagecache to be correct (after the network fs client code already checked the cksum cached by the server and sent to the client along the server cached data), I don't see much value added through a further verification by the userland program running on the client and accessing pagecache in the client. If we can't trust client pagecache to be safe against memory bitflips or software bugs, we can hardly trust the anonymous memory too. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [Lsf-pc] [LSF/MM TOPIC] end-to-end data and metadata corruption detection
From: Andreas Dilger @ 2012-02-02 19:46 UTC
To: Andrea Arcangeli
Cc: Bernd Schubert, Linux NFS Mailing List, linux-scsi, Martin K. Petersen, James Bottomley, Sven Breuner, Chuck Lever, linux-fsdevel, Gregory Farnum, lsf-pc, Chris Mason

On 2012-02-02, at 12:26, Andrea Arcangeli wrote:
> On Thu, Feb 02, 2012 at 10:04:59AM +0100, Bernd Schubert wrote:
>> I think the point for network file systems is that they can reuse the disk-checksum for network verification. So instead of calculating a checksum for network and disk, just use one for both. The checksum also is supposed to be cached in memory, as that avoids re-calculation for other clients.
>>
>> 1) client-1: sends data and checksum
>>
>> server: Receives those data and verifies the checksum -> network transfer was ok, sends data and checksum to disk
>>
>> 2) client-2 ... client-N: Ask for those data
>>
>> server: send cached data and cached checksum
>>
>> client-2 ... client-N: Receive data and verify checksum
>>
>> So the hole point of caching checksums is to avoid the server needs to recalculate those for dozens of clients. Recalculating checksums simply does not scale with an increasing number of clients, which want to read data processed by another client.
>
> This makes sense indeed. My argument was only about the exposure of the storage hw format cksum to userland (through some new ioctl for further userland verification of the pagecache data in the client pagecache, done by whatever program is reading from the cache). The network fs client lives in kernel, the network fs server lives in kernel, so no need to expose the cksum to userland to do what you described above.
>
> I meant if we can't trust the pagecache to be correct (after the network fs client code already checked the cksum cached by the server and sent to the client along the server cached data), I don't see much value added through a further verification by the userland program running on the client and accessing pagecache in the client. If we can't trust client pagecache to be safe against memory bitflips or software bugs, we can hardly trust the anonymous memory too.

For servers, and clients to a lesser extent, the data may reside in cache for a long time. I agree that in many cases the data will be used immediately after the kernel verifies the data checksum from disk, but for long-lived data the chance of accidental corruption (bit flip, bad pointer, other software bug) increases.

For our own checksum implementation in Lustre, we are planning to keep the checksum attached to the pages in cache on both the client and server, along with a "last checked" time, and periodically revalidate the in-memory checksum. As Bernd states, this dramatically reduces the checksum overhead on the server, and avoids duplicate checksum calculations for the disk and network transfers if the same algorithms can be used for both.
Cheers, Andreas
* Re: [Lsf-pc] [LSF/MM TOPIC] end-to-end data and metadata corruption detection
From: Bernd Schubert @ 2012-02-02 22:52 UTC
To: Andrea Arcangeli
Cc: Linux NFS Mailing List, linux-scsi, Martin K. Petersen, Bernd Schubert, James Bottomley, Sven Breuner, Chuck Lever, linux-fsdevel, Gregory Farnum, lsf-pc, Chris Mason

On 02/02/2012 08:26 PM, Andrea Arcangeli wrote:
> On Thu, Feb 02, 2012 at 10:04:59AM +0100, Bernd Schubert wrote:
>> I think the point for network file systems is that they can reuse the disk-checksum for network verification. So instead of calculating a checksum for network and disk, just use one for both. The checksum also is supposed to be cached in memory, as that avoids re-calculation for other clients.
>>
>> 1) client-1: sends data and checksum
>>
>> server: Receives those data and verifies the checksum -> network transfer was ok, sends data and checksum to disk
>>
>> 2) client-2 ... client-N: Ask for those data
>>
>> server: send cached data and cached checksum
>>
>> client-2 ... client-N: Receive data and verify checksum
>>
>> So the hole point of caching checksums is to avoid the server needs to recalculate those for dozens of clients. Recalculating checksums simply does not scale with an increasing number of clients, which want to read data processed by another client.
>
> This makes sense indeed. My argument was only about the exposure of the storage hw format cksum to userland (through some new ioctl for further userland verification of the pagecache data in the client pagecache, done by whatever program is reading from the cache). The network fs client lives in kernel, the network fs server lives in kernel, so no need to expose the cksum to userland to do what you described above.
>
> I meant if we can't trust the pagecache to be correct (after the network fs client code already checked the cksum cached by the server and sent to the client along the server cached data), I don't see much value added through a further verification by the userland program running on the client and accessing pagecache in the client. If we can't trust client pagecache to be safe against memory bitflips or software bugs, we can hardly trust the anonymous memory too.

Well, now it gets a bit troublesome: not all file systems are in kernel space. FhGFS uses kernel clients, but has user space daemons. I think Ceph does it similarly. I'm not sure about the roadmap of Gluster and whether data verification is planned at all, but if it wanted to do that, even the clients would need to get access to the checksums in user space.

Now let's assume we ignore user space clients for now; what about using the splice interface to also send checksums? As a basic concept, file system servers are not interested in the real data at all, but only do the management between disk and network. So a possible solution to not expose checksums to user space daemons is to simply not expose data to the servers at all. However, in that case the server-side kernel would need to do the checksum verification, even for user space daemons. The remaining issue with splice is that it does not work with InfiniBand ibverbs due to the missing socket fd.
Another solution that might also work is to expose checksums to user space read-only.

Cheers,
Bernd
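A rough sketch of the splice() idea above, under the assumption that the user-space server only moves descriptors and never maps the payload: data flows socket to pipe to file entirely inside the kernel, so any checksum or protection-information handling would have to happen in a kernel-side hook (not shown, and purely hypothetical). Only the splice() calls themselves are existing API:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <unistd.h>

    /* Move 'len' bytes from a connected socket to a file at offset 'off'
     * without the user-space daemon ever touching the payload. A kernel-side
     * hook (hypothetical) would have to verify or attach protection
     * information, since the daemon never sees the data it could checksum. */
    ssize_t receive_to_file(int sock_fd, int file_fd, loff_t off, size_t len)
    {
        int pipefd[2];
        ssize_t moved = 0;

        if (pipe(pipefd) < 0)
            return -1;

        while ((size_t)moved < len) {
            /* socket -> pipe: data stays in kernel space */
            ssize_t n = splice(sock_fd, NULL, pipefd[1], NULL,
                               len - moved, SPLICE_F_MOVE | SPLICE_F_MORE);
            if (n <= 0)
                break;
            /* pipe -> file: still no user-space copy */
            ssize_t m = splice(pipefd[0], NULL, file_fd, &off,
                               n, SPLICE_F_MOVE);
            if (m < 0)
                break;
            moved += m;
        }

        close(pipefd[0]);
        close(pipefd[1]);
        return moved;
    }

This also illustrates the ibverbs problem mentioned above: the pattern depends on having a socket file descriptor to splice from, which RDMA verbs transfers do not provide.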
* Re: [Lsf-pc] [LSF/MM TOPIC] end-to-end data and metadata corruption detection
From: Martin K. Petersen @ 2012-02-01 18:15 UTC
To: James Bottomley
Cc: Chris Mason, Gregory Farnum, Bernd Schubert, Linux NFS Mailing List, linux-scsi, Martin K. Petersen, Sven Breuner, Chuck Lever, linux-fsdevel, lsf-pc

>>>>> "James" == James Bottomley <James.Bottomley@HansenPartnership.com> writes:

James> I broadly agree with this, but even if you do sync writes and cache read only copies, we still have the problem of how we do the read side verification of DIX.

Whoever requested the protected information will know how to verify it. Right now, if the Oracle DB enables a protected transfer, it'll do a verification pass once a read I/O completes. Similarly, the block layer will verify data+PI if the auto-protection feature has been turned on.

James> In theory, when you read, you could either get the cached copy or an actual read (which will supply protection information), so for the cached copy we need to return cached protection information implying that we need some way of actually caching it.

Let's assume we add a PI buffer to kaio. If an application wants to send or receive PI it needs to sit on top of a filesystem that can act as a conduit for PI. That filesystem will need to store the PI for each page somewhere hanging off of its page private pointer.

When submitting a write the filesystem must iterate over these PI buffers and generate a bio integrity payload that it can attach to the data bio. This works exactly the same way as iterating over the data pages to build the data portion of the bio.

When an application is requesting PI, the filesystem must allocate the relevant memory and update its private data to reflect the PI buffers. These buffers are then attached the same way as on a write. And when the I/O completes, the PI buffers contain the relevant PI from storage. Then the application gets completion and can proceed to verify that data and PI match.

IOW, the filesystem should only ever act as a conduit. The only real challenge as far as I can tell is how to handle concurrent protected and unprotected updates to a page. If a non-PI-aware app updates a cached page which is subsequently read by an app requesting PI, that means we may have to force a write-out followed by a read to get valid PI. We could synthesize it to avoid the I/O but I think that would be violating the premise of protected transfer. Another option is to have an exclusive write access mechanism that only permits either protected or unprotected access to a page.

--
Martin K. Petersen
Oracle Linux Engineering
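A condensed sketch of the conduit model described above, using the block layer's existing bio integrity calls (bio_integrity_alloc() and bio_integrity_add_page(); exact prototypes and error conventions vary by kernel version). The fs_page_private structure hanging off the page private pointer is invented for illustration:

    #include <linux/bio.h>
    #include <linux/err.h>
    #include <linux/gfp.h>

    /* Hypothetical per-page private data kept by a PI-aware filesystem. */
    struct fs_page_private {
        struct page *pi_page;     /* buffer holding PI tuples for this page */
        unsigned int pi_len;      /* bytes of PI, e.g. 8 per 512-byte sector */
        unsigned int pi_offset;
    };

    /*
     * Attach application-supplied protection information to a data bio
     * before submission. The filesystem acts purely as a conduit: it does
     * not generate or verify the PI, it just hands the buffers down.
     */
    static int fs_attach_pi(struct bio *bio, struct fs_page_private **priv,
                            unsigned int nr_pages)
    {
        struct bio_integrity_payload *bip;
        unsigned int i;

        bip = bio_integrity_alloc(bio, GFP_NOIO, nr_pages);
        if (IS_ERR_OR_NULL(bip))
            return -ENOMEM;

        for (i = 0; i < nr_pages; i++) {
            /* Mirrors building the data portion of the bio, page by page. */
            if (!bio_integrity_add_page(bio, priv[i]->pi_page,
                                        priv[i]->pi_len, priv[i]->pi_offset))
                return -EIO;
        }
        return 0;   /* submit_bio() then carries data + PI down the stack */
    }

On a read the same attachment happens with empty PI buffers, which the HBA fills in so the application can verify data and PI after completion.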
[parent not found: <yq1d39ys9n1.fsf-+q57XtR/GgMb6DWv4sQWN6xOck334EZe@public.gmane.org>]
* Re: [Lsf-pc] [LSF/MM TOPIC] end-to-end data and metadata corruption detection
  [not found] ` <yq1d39ys9n1.fsf-+q57XtR/GgMb6DWv4sQWN6xOck334EZe@public.gmane.org>
@ 2012-02-01 23:03 ` Boaz Harrosh
  0 siblings, 0 replies; 25+ messages in thread
From: Boaz Harrosh @ 2012-02-01 23:03 UTC (permalink / raw)
  To: Martin K. Petersen
  Cc: James Bottomley, Chris Mason, Gregory Farnum, Bernd Schubert,
	Linux NFS Mailing List, linux-scsi-u79uwXL29TY76Z2rM5mHXA,
	Sven Breuner, Chuck Lever, linux-fsdevel,
	lsf-pc-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA

On 02/01/2012 08:15 PM, Martin K. Petersen wrote:
>>>>>> "James" == James Bottomley <James.Bottomley-d9PhHud1JfjCXq6kfMZ53/egYHeGw8Jk@public.gmane.org> writes:
>
> IOW, the filesystem should only ever act as a conduit. The only real
> challenge as far as I can tell is how to handle concurrent protected and
> unprotected updates to a page. If a non-PI-aware app updates a cached
> page which is subsequently read by an app requesting PI that means we
> may have to force a write-out followed by a read to get valid PI. We
> could synthesize it to avoid the I/O but I think that would be violating
> the premise of protected transfer. Another option is to have an
> exclusive write access mechanism that only permits either protected or
> unprotected access to a page.

Yes. A protected write implies byte-range locking on the file (and can be
implemented with one). Also, the open() call demands an O_PROTECT option.
Protection is then a file attribute as well.

Boaz
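
As a sketch of those proposed semantics: O_PROTECT does not exist anywhere
and is only the hypothetical flag suggested above, while the byte-range
lock half can already be expressed with plain fcntl():

#include <fcntl.h>
#include <unistd.h>

#ifndef O_PROTECT
#define O_PROTECT 0	/* hypothetical flag, not in any kernel */
#endif

/* Protected write: take an exclusive byte-range lock over the region,
 * then write. Closing the fd drops the lock again. */
static int protected_pwrite(const char *path, const void *buf,
			    size_t len, off_t off)
{
	struct flock fl = {
		.l_type   = F_WRLCK,
		.l_whence = SEEK_SET,
		.l_start  = off,
		.l_len    = len,
	};
	int fd = open(path, O_WRONLY | O_PROTECT);
	ssize_t written = -1;

	if (fd < 0)
		return -1;
	if (fcntl(fd, F_SETLKW, &fl) == 0)
		written = pwrite(fd, buf, len, off);
	close(fd);
	return written == (ssize_t)len ? 0 : -1;
}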
[parent not found: <DE0353DF-83EA-480E-9C42-1EE760D6EE41@dilger.ca>]
* Re: [LSF/MM TOPIC] end-to-end data and metadata corruption detection
  [not found] ` <DE0353DF-83EA-480E-9C42-1EE760D6EE41@dilger.ca>
@ 2012-01-31 2:22 ` Martin K. Petersen
  0 siblings, 0 replies; 25+ messages in thread
From: Martin K. Petersen @ 2012-01-31 2:22 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Martin K. Petersen, Bernd Schubert, Chuck Lever,
	lsf-pc@lists.linux-foundation.org, linux-fsdevel,
	Linux NFS Mailing List, linux-scsi@vger.kernel.org, Sven Breuner

>>>>> "Andreas" == Andreas Dilger <adilger@dilger.ca> writes:

Andreas> Is there a description of sys_dio() somewhere?

This was the original draft:

http://www.spinics.net/lists/linux-mm/msg14512.html

Andreas> In particular, I'm interested to know whether it allows full
Andreas> scatter-gather IO submission, unlike pwrite() which only allows
Andreas> multiple input buffers, and not multiple file offsets.

Each request descriptor contains buffer, target file, and offset. So it's
a single entry per descriptor. But many descriptors can be submitted (and
reaped) in a single syscall. So you don't have the single file offset
limitation of pwritev().

-- 
Martin K. Petersen	Oracle Linux Engineering
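
The draft at the link above is the authoritative description; purely to
illustrate the one-buffer/one-file/one-offset-per-descriptor,
many-descriptors-per-syscall shape, a hypothetical sketch (none of these
names appear in the actual proposal):

#include <stdint.h>

/* Hypothetical request descriptor: one buffer, one target file, one
 * offset per entry. */
struct dio_desc {
	int       fd;       /* target file */
	void     *buf;      /* data buffer */
	void     *pi_buf;   /* optional protection information buffer */
	uint64_t  offset;   /* file offset for this request */
	uint32_t  len;
	int32_t   result;   /* filled in on completion */
};

/* Hypothetical submission call: many independent descriptors per syscall,
 * so scatter/gather across files and offsets is possible without
 * pwritev()'s single-offset limitation. */
int dio_submit(struct dio_desc *descs, unsigned int nr);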
* Re: [LSF/MM TOPIC] end-to-end data and metadata corruption detection
  [not found] ` <38C050B3-2AAD-4767-9A25-02C33627E427-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
  2012-01-26 12:31 ` Bernd Schubert
@ 2012-01-26 15:36 ` Martin K. Petersen
  1 sibling, 0 replies; 25+ messages in thread
From: Martin K. Petersen @ 2012-01-26 15:36 UTC (permalink / raw)
  To: Chuck Lever
  Cc: lsf-pc-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA, linux-fsdevel,
	Linux NFS Mailing List, linux-scsi-u79uwXL29TY76Z2rM5mHXA

>>>>> "Chuck" == Chuck Lever <chuck.lever-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org> writes:

Chuck> I'm probably ignorant of the current state of implementation in
Chuck> Linux, but I'm interested in understanding common ground among
Chuck> local file systems, block storage, and network file systems.
Chuck> Example questions include: Do we need standardized APIs for block
Chuck> device corruption detection?

The block layer integrity stuff aims to be format agnostic. It was
designed to accommodate different types of protection information (back
then the ATA proposal was still on the table).

Chuck> How much of T10 DIF/DIX should NFS support?

You can either support the T10 PI format and act as a conduit, or you can
invent your own format and potentially force a conversion. I'd prefer the
former (despite the limitations of T10 PI).

Chuck> What are the drivers for this feature (broad use cases)?

Two things:

1. Continuity. Downtime can be very costly, and many applications have to
   be taken offline to do recovery after a corruption error.

2. Archival. You want to make sure you write a good copy to backup. With
   huge amounts of data it is often infeasible to scrub and verify the
   data on the backup media.

-- 
Martin K. Petersen	Oracle Linux Engineering
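
As an illustration of that format-agnostic design, a driver can register
its own protection format by handing the block layer nothing more than
generate/verify callbacks and a tuple size. This is a rough sketch against
the 2012-era struct blk_integrity interface; the checksum routine, tuple
layout and profile name below are made up:

#include <linux/blkdev.h>

/* Hypothetical checksum routine and on-disk tuple for a non-T10 format. */
u16 my_csum(const void *buf, unsigned int len);

struct my_tuple {
	u16 csum;
	u16 reserved;
	u32 tag;
};

/* Generate callback: fill one protection tuple per data sector. */
static void my_generate(struct blk_integrity_exchg *bix)
{
	struct my_tuple *t = bix->prot_buf;
	const char *data = bix->data_buf;
	unsigned int i;

	for (i = 0; i < bix->data_size; i += bix->sector_size, t++) {
		t->csum = my_csum(data + i, bix->sector_size);
		t->reserved = 0;
		t->tag = 0;
	}
}

/* Verify callback would walk the same loop and compare checksums. */
static int my_verify(struct blk_integrity_exchg *bix)
{
	return 0;
}

static struct blk_integrity my_integrity = {
	.name		= "EXAMPLE-CSUM-V0",	/* made-up format name */
	.generate_fn	= my_generate,
	.verify_fn	= my_verify,
	.tuple_size	= sizeof(struct my_tuple),
	.tag_size	= 0,
};

/* blk_integrity_register(disk, &my_integrity) attaches the profile to a
 * gendisk; the block layer itself never interprets the tuple contents. */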
end of thread, other threads:[~2012-02-02 22:52 UTC | newest]

Thread overview: 25+ messages
2012-01-17 20:15 [LSF/MM TOPIC] end-to-end data and metadata corruption detection Chuck Lever
     [not found] ` <38C050B3-2AAD-4767-9A25-02C33627E427-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
2012-01-26 12:31 ` Bernd Schubert
     [not found] ` <4F2147BA.6030607-mPn0NPGs4xGatNDF+KUbs4QuADTiUCJX@public.gmane.org>
2012-01-26 14:53 ` Martin K. Petersen
     [not found] ` <yq1k44e1pn6.fsf-+q57XtR/GgMb6DWv4sQWN6xOck334EZe@public.gmane.org>
2012-01-26 16:27 ` Bernd Schubert
2012-01-26 23:21 ` James Bottomley
     [not found] ` <1327620104.6151.23.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>
2012-01-31 19:16 ` Bernd Schubert
2012-01-31 19:21 ` Chuck Lever
2012-01-31 20:04 ` Martin K. Petersen
2012-01-31  2:10 ` Martin K. Petersen
2012-01-31 19:22 ` Bernd Schubert
2012-01-31 19:28 ` Gregory Farnum
2012-02-01 16:45 ` [Lsf-pc] " Chris Mason
2012-02-01 16:52 ` James Bottomley
     [not found] ` <1328115175.2768.11.camel-sFMDBYUN5F8GjUHQrlYNx2Wm91YjaHnnhRte9Li2A+AAvxtiuMwx3w@public.gmane.org>
2012-02-01 17:41 ` Chris Mason
2012-02-01 17:59 ` Bernd Schubert
2012-02-01 18:16 ` James Bottomley
2012-02-01 18:30 ` Andrea Arcangeli
2012-02-02  9:04 ` Bernd Schubert
2012-02-02 19:26 ` Andrea Arcangeli
     [not found] ` <20120202192643.GC5873-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
2012-02-02 19:46 ` Andreas Dilger
2012-02-02 22:52 ` Bernd Schubert
2012-02-01 18:15 ` Martin K. Petersen
     [not found] ` <yq1d39ys9n1.fsf-+q57XtR/GgMb6DWv4sQWN6xOck334EZe@public.gmane.org>
2012-02-01 23:03 ` Boaz Harrosh
     [not found] ` <DE0353DF-83EA-480E-9C42-1EE760D6EE41@dilger.ca>
2012-01-31  2:22 ` Martin K. Petersen
2012-01-26 15:36 ` Martin K. Petersen