From: Boaz Harrosh <boaz@plexistor.com>
To: Dan Williams <dan.j.williams@intel.com>
Cc: Vishal Verma <vishal.l.verma@intel.com>,
"linux-nvdimm@lists.01.org" <linux-nvdimm@lists.01.org>,
linux-block@vger.kernel.org, Jan Kara <jack@suse.cz>,
Matthew Wilcox <matthew@wil.cx>,
Dave Chinner <david@fromorbit.com>,
"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
XFS Developers <xfs@oss.sgi.com>, Jens Axboe <axboe@fb.com>,
Linux MM <linux-mm@kvack.org>, Al Viro <viro@zeniv.linux.org.uk>,
Christoph Hellwig <hch@infradead.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
Andrew Morton <akpm@linux-foundation.org>,
linux-ext4 <linux-ext4@vger.kernel.org>
Subject: Re: [PATCH v4 5/7] fs: prioritize and separate direct_io from dax_io
Date: Mon, 02 May 2016 20:44:01 +0300
Message-ID: <572791E1.7000103@plexistor.com>
In-Reply-To: <CAPcyv4jnz69a3S+XZgLaLojHZmpfoVXGDkJkt_1Q=8kk0gik9w@mail.gmail.com>
On 05/02/2016 07:49 PM, Dan Williams wrote:
> On Mon, May 2, 2016 at 9:22 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
>> On 05/02/2016 07:01 PM, Dan Williams wrote:
>>> On Mon, May 2, 2016 at 8:41 AM, Boaz Harrosh <boaz@plexistor.com> wrote:
>>>> On 04/29/2016 12:16 AM, Vishal Verma wrote:
>>>>> All IO in a dax filesystem used to go through dax_do_io, which cannot
>>>>> handle media errors, and thus cannot provide a recovery path that can
>>>>> send a write through the driver to clear errors.
>>>>>
>>>>> Add a new iocb flag for DAX, and set it only for DAX mounts. In the IO
>>>>> path for DAX filesystems, use the same direct_IO path for both DAX and
>>>>> direct_io iocbs, but use the flags to identify when we are in O_DIRECT
>>>>> mode vs non O_DIRECT with DAX, and for O_DIRECT, use the conventional
>>>>> direct_IO path instead of DAX.
>>>>>
>>>>
>>>> Really? What is your thinking here?
>>>>
>>>> What about all the current users of O_DIRECT? You have just made them
>>>> 4 times slower and "less concurrent*" than "buffered io" users, since
>>>> the direct_IO path will queue an IO request and all.
>>>> (And if it is not so slow then why do we need dax_do_io at all? [Rhetorical])
>>>>
>>>> I hate it that you overload the semantics of a known and expected
>>>> O_DIRECT flag, for special pmem quirks. This is an incompatible
>>>> and unrelated overload of the semantics of O_DIRECT.
>>>
>>> I think it is the opposite situation, it is undoing the premature
>>> overloading of O_DIRECT that went in without performance numbers.
>>
>> We have tons of measurements. It is not hard to imagine the results though,
>> especially in the 1000-thread case.
>>
>>> This implementation clarifies that dax_do_io() handles the lack of a
>>> page cache for buffered I/O and O_DIRECT behaves as it nominally would
>>> by sending an I/O to the driver.
>>
>>> It has the benefit of matching the
>>> error semantics of a typical block device where a buffered write could
>>> hit an error filling the page cache, but an O_DIRECT write potentially
>>> triggers the drive to remap the block.
>>>
>>
>> I fail to see how, for writes, the device error semantics regarding remapping of
>> blocks are any different between buffered and direct IO. As far as the block
>> device is concerned it is the exact same code path. All the big difference is
>> higher up, in the VFS.
>>
>> And ... so you are willing to sacrifice the 99% hot path for the sake of the
>> 1% error path? And piggyback on poor O_DIRECT?
>>
>> Again there are tons of O_DIRECT apps out there, why are you forcing them to
>> change if they want true pmem performance?
>
> This isn't forcing them to change. This is the path of least surprise
> as error semantics are identical to a typical block device. Yes, an
> application can go faster by switching to the "buffered" / dax_do_io()
> path, and it can go even faster by switching to mmap() I/O and using DAX
> directly. If we can later optimize the O_DIRECT path to bring its
> performance more in line with dax_do_io(), great, but the
> implementation should be correct first and optimized later.
>
Why does it need to be either/or? Why not both?

Also, I disagree: if you are correct that dax_do_io is bad and needs fixing,
then you have broken applications. Because in the current model:

	read => -EIO, write-buffered, sync()

gives you the same error semantics as:

	read => -EIO, write-direct-io

In fact this is what the delete, restore-from-backup model does today.
Who said it uses, or must use, direct IO? Actually I think it does not.
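
For illustration only, the application-side sequence above (read fails with -EIO, then a buffered write followed by a sync) might look roughly like the userspace sketch below. The helper name toy_restore_range() and its return convention are hypothetical, and whether the write really clears the media error depends on the kernel path being debated in this thread.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: probe a file range and, if the read fails with EIO,
 * overwrite the range from a backup copy with a buffered write and flush it.
 * Returns 0 if the range was readable, 1 if it was rewritten, or a negative
 * errno value on any other failure.
 */
static int toy_restore_range(int fd, off_t off, const void *backup, size_t len)
{
	void *scratch;
	ssize_t n;
	int ret = 0;

	assert(backup != NULL);

	scratch = malloc(len);
	if (!scratch)
		return -ENOMEM;

	/* Step 1: the read. Only an EIO result triggers the restore. */
	n = pread(fd, scratch, len, off);
	if (n < 0)
		ret = (errno == EIO) ? 1 : -errno;
	free(scratch);
	if (ret <= 0)
		return ret;

	/* Step 2: buffered write of the replacement data. */
	n = pwrite(fd, backup, len, off);
	if (n < 0)
		return -errno;
	if ((size_t)n != len)
		return -EIO;	/* short write: treated as failure in this sketch */

	/* Step 3: force the data out before reporting success. */
	if (fsync(fd) < 0)
		return -errno;
	return 1;
}
```

Note that the file descriptor does not need to be opened with O_DIRECT for this sequence, which is the point being argued.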

Two things I can think of which are better:

[1]
Why not go deeper into the dax io loops, and for any failed WRITE page
call bdev_rw_page() to let pmem.c clear / relocate the error page?

So reads return -EIO - is that not what you wanted?

Writes get a memory error and retry with bdev_rw_page() to let the bdev
relocate / clear the error - is that not what you wanted?

In the partial-page WRITE case on bad sectors, we can carefully
read-modify-write sector-by-sector and zero out the bad sectors that could
not be read. What else?
(Or enhance the bdev_rw_page() API)

[2]
Only switch to slow O_DIRECT in the presence of errors, like you wanted. But
I still hate that you overload O_DIRECT with error semantics that do not
exist today; see above.
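
As a rough illustration of proposal [1] above, here is a toy userspace model, not kernel code. Every toy_* name is hypothetical; toy_driver_write_page() only stands in for the role this message assigns to bdev_rw_page() and does not reproduce its real signature. It handles whole pages only, so the partial-page read-modify-write case is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE 4096
#define TOY_NR_PAGES  4

/* Toy stand-in for a pmem device: page data plus a per-page "bad" flag. */
struct toy_pmem {
	unsigned char data[TOY_NR_PAGES][TOY_PAGE_SIZE];
	bool bad[TOY_NR_PAGES];
};

/* Fast path read: a bad page yields -EIO, as the message asks for reads. */
static int toy_dax_read_page(const struct toy_pmem *dev, size_t page, void *dst)
{
	if (dev->bad[page])
		return -EIO;
	memcpy(dst, dev->data[page], TOY_PAGE_SIZE);
	return 0;
}

/* Fast path write: a plain copy that fails on a bad page (the "memory error"). */
static int toy_dax_write_page(struct toy_pmem *dev, size_t page, const void *src)
{
	if (dev->bad[page])
		return -EIO;
	memcpy(dev->data[page], src, TOY_PAGE_SIZE);
	return 0;
}

/* Slow path: the driver rewrites the whole page and clears the error. */
static int toy_driver_write_page(struct toy_pmem *dev, size_t page, const void *src)
{
	memcpy(dev->data[page], src, TOY_PAGE_SIZE);
	dev->bad[page] = false;
	return 0;
}

/*
 * Write loop: try the fast path for each page and fall back to the driver
 * only for pages that failed. Returns the number of pages that took the
 * slow path, or a negative errno value.
 */
static int toy_dax_write(struct toy_pmem *dev, size_t first, size_t nr,
			 const unsigned char *buf)
{
	int slow = 0;

	assert(first + nr <= TOY_NR_PAGES);
	for (size_t i = 0; i < nr; i++) {
		const unsigned char *src = buf + i * TOY_PAGE_SIZE;
		int ret = toy_dax_write_page(dev, first + i, src);

		if (ret == -EIO) {
			ret = toy_driver_write_page(dev, first + i, src);
			slow++;
		}
		if (ret < 0)
			return ret;
	}
	return slow;
}
```

In this model, pages without errors never leave the fast path, and only the pages that actually failed pay for the driver round trip.
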
Thanks
Boaz