From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Message-ID: <572791E1.7000103@plexistor.com>
Date: Mon, 02 May 2016 20:44:01 +0300
From: Boaz Harrosh
MIME-Version: 1.0
To: Dan Williams
CC: Vishal Verma, "linux-nvdimm@lists.01.org", linux-block@vger.kernel.org, Jan Kara, Matthew Wilcox, Dave Chinner, "linux-kernel@vger.kernel.org", XFS Developers, Jens Axboe, Linux MM, Al Viro, Christoph Hellwig, linux-fsdevel, Andrew Morton, linux-ext4
Subject: Re: [PATCH v4 5/7] fs: prioritize and separate direct_io from dax_io
References: <1461878218-3844-1-git-send-email-vishal.l.verma@intel.com> <1461878218-3844-6-git-send-email-vishal.l.verma@intel.com> <5727753F.6090104@plexistor.com> <57277EDA.9000803@plexistor.com>
In-Reply-To:
Content-Type: text/plain; charset=utf-8
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID:

On 05/02/2016 07:49 PM, Dan Williams wrote:
> On Mon, May 2, 2016 at 9:22 AM, Boaz Harrosh wrote:
>> On 05/02/2016 07:01 PM, Dan Williams wrote:
>>> On Mon, May 2, 2016 at 8:41 AM, Boaz Harrosh wrote:
>>>> On 04/29/2016 12:16 AM, Vishal Verma wrote:
>>>>> All IO in a dax filesystem used to go through dax_do_io, which cannot
>>>>> handle media errors, and thus cannot provide a recovery path that can
>>>>> send a write through the driver to clear errors.
>>>>>
>>>>> Add a new iocb flag for DAX, and set it only for DAX mounts. In the IO
>>>>> path for DAX filesystems, use the same direct_IO path for both DAX and
>>>>> direct_io iocbs, but use the flags to identify when we are in O_DIRECT
>>>>> mode vs non O_DIRECT with DAX, and for O_DIRECT, use the conventional
>>>>> direct_IO path instead of DAX.
>>>>>
>>>>
>>>> Really? What is your thinking here?
>>>>
>>>> What about all the current users of O_DIRECT? You have just made them
>>>> 4 times slower and "less concurrent*" than "buffered io" users, since
>>>> the direct_IO path will queue an IO request and all.
>>>> (And if it is not so slow then why do we need dax_do_io at all? [Rhetorical])
>>>>
>>>> I hate it that you overload the semantics of a known and expected
>>>> O_DIRECT flag, for special pmem quirks. This is an incompatible
>>>> and unrelated overload of the semantics of O_DIRECT.
>>>
>>> I think it is the opposite situation, it is undoing the premature
>>> overloading of O_DIRECT that went in without performance numbers.
>>
>> We have tons of measurements. It is not hard to imagine the results though.
>> Especially the 1000 threads case.
>>
>>> This implementation clarifies that dax_do_io() handles the lack of a
>>> page cache for buffered I/O and O_DIRECT behaves as it nominally would
>>> by sending an I/O to the driver.
>>
>>> It has the benefit of matching the
>>> error semantics of a typical block device where a buffered write could
>>> hit an error filling the page cache, but an O_DIRECT write potentially
>>> triggers the drive to remap the block.
>>>
>>
>> I fail to see how, in writes, the device error semantics regarding remapping of
>> blocks are any different between buffered and direct IO. As far as the block
>> device is concerned it is the same exact code path. The big difference is higher
>> in the VFS.
>>
>> And ... So you are willing to sacrifice the 99% hotpath for the sake of the
>> 1% error path? And piggybacking on poor O_DIRECT.
>>
>> Again there are tons of O_DIRECT apps out there, why are you forcing them to
>> change if they want true pmem performance?
>
> This isn't forcing them to change. This is the path of least surprise
> as error semantics are identical to a typical block device. Yes, an
> application can go faster by switching to the "buffered" / dax_do_io()
> path; it can go even faster by switching to mmap() I/O and using DAX
> directly. If we can later optimize the O_DIRECT path to bring its
> performance more in line with dax_do_io(), great, but the
> implementation should be correct first and optimized later.
>

Why does it need to be either/or? Why not both?

And also I disagree: if you are correct and dax_do_io is bad and needs fixing,
then you have broken applications. Because in the current model:

	read => -EIO, write-buffered, sync()
gives you the same error semantics as:
	read => -EIO, write-direct-io

In fact this is what the delete, restore-from-backup model does today.
Who said it uses, or must use, direct IO? Actually I think it does not.

Two things I can think of which are better:

[1]
Why not go deeper into the dax io loops and, for any page whose WRITE
failed, call bdev_rw_page() to let pmem.c clear / relocate the error page?

So reads return -EIO, which is what you wanted, no?
Writes get a memory error and retry with bdev_rw_page() to let the bdev
relocate / clear the error, which is what you wanted, no?

In the partial-page WRITE case on bad sectors, we can carefully
read-modify-write sector by sector and zero out the bad sectors that could
not be read; what else? (Or enhance the bdev_rw_page() API.)

[2]
Only switch to the slow O_DIRECT path in the presence of errors, like you wanted.
But I still hate that you overload error semantics with O_DIRECT which does not exist today see above

Thanks
Boaz
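
For readers following the thread, proposals [1] and [2] in the mail can be sketched as a small userspace model in C. This is only an illustration under stated assumptions: struct fake_pmem, fast_write(), fast_read(), driver_write() and write_with_fallback() are hypothetical names invented here, not kernel APIs. fast_write() stands in for the dax_do_io() copy loop, and driver_write() stands in for a write sent through the driver, the role the mail gives to bdev_rw_page(). The real bdev_rw_page() signature and the pmem.c error-clearing logic are not modelled.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SECTOR_SIZE 512
#define NR_SECTORS  8

/* Hypothetical userspace model of a pmem device; not a kernel structure. */
struct fake_pmem {
	unsigned char data[NR_SECTORS * SECTOR_SIZE];
	bool bad[NR_SECTORS];	/* true: sector currently has a media error */
	int slow_writes;	/* how many writes took the driver path */
};

/* Fast path (stand-in for the dax_do_io() copy loop): fails on a bad sector. */
static int fast_write(struct fake_pmem *dev, size_t sector, const void *buf)
{
	if (sector >= NR_SECTORS)
		return -EINVAL;
	if (dev->bad[sector])
		return -EIO;
	memcpy(dev->data + sector * SECTOR_SIZE, buf, SECTOR_SIZE);
	return 0;
}

/* Reads of a bad sector return -EIO, as in the mail. */
static int fast_read(const struct fake_pmem *dev, size_t sector, void *buf)
{
	if (sector >= NR_SECTORS)
		return -EINVAL;
	if (dev->bad[sector])
		return -EIO;
	memcpy(buf, dev->data + sector * SECTOR_SIZE, SECTOR_SIZE);
	return 0;
}

/*
 * Slow path (stand-in for a write through the driver): in this model a
 * full-sector overwrite clears the recorded media error.
 */
static int driver_write(struct fake_pmem *dev, size_t sector, const void *buf)
{
	if (sector >= NR_SECTORS)
		return -EINVAL;
	memcpy(dev->data + sector * SECTOR_SIZE, buf, SECTOR_SIZE);
	dev->bad[sector] = false;
	dev->slow_writes++;
	return 0;
}

/* Try the fast path first; take the driver path only when it reports -EIO. */
static int write_with_fallback(struct fake_pmem *dev, size_t sector,
			       const void *buf)
{
	int ret = fast_write(dev, sector, buf);

	if (ret == -EIO)
		ret = driver_write(dev, sector, buf);
	return ret;
}
```

In this model a write to a healthy sector never leaves the fast path, and only a write that hits a bad sector pays for the driver path, after which reads of that sector succeed again. The partial-page read-modify-write case mentioned under [1] is not covered by the sketch.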