* ext4 DIO read performance issue on SSD
@ 2009-10-09 23:34 Jiaying Zhang

From: Jiaying Zhang
To: ext4 development
Cc: Andrew Morton, Michael Rubin, Manuel Benitez

Hello,

Recently, we have been evaluating ext4 performance on a high-speed SSD.
One problem we found is that ext4 performance doesn't scale well with
multiple threads or multiple AIOs reading a single file with O_DIRECT.
E.g., with 4k block size, multi-threaded DIO AIO random read on ext4
can lose up to 50% of the throughput we get via raw IO.

After some initial analysis, we think the ext4 performance problem is
caused by the i_mutex lock taken during DIO reads. I.e., during a DIO
read, we grab the i_mutex lock in __blockdev_direct_IO because ext4
uses the default DIO_LOCKING behavior from the generic fs code. I did
a quick test by calling blockdev_direct_IO_no_locking() in
ext4_direct_IO(), and ext4 DIO read achieved 99% of raw IO performance.

As we understand it, the reason for taking the i_mutex lock during a
DIO read is to prevent it from accessing stale data that may be exposed
by a simultaneous write. We saw that Mingming Cao has implemented a
patch set with which, when a get_block request comes from a direct
write, ext4 only allocates or splits an uninitialized extent; that
uninitialized extent is marked initialized in the end_io callback. We
are wondering whether we can extend this idea to buffered writes as
well. I.e., we always allocate an uninitialized extent first during any
write and convert it to initialized in the end_io callback. This would
eliminate the need to hold the i_mutex lock during direct reads,
because a DIO read should never see a block marked initialized before
the block has been written with new data.

We haven't implemented anything yet because we want to ask here first
whether this proposal makes sense to you.

Regards,

Jiaying
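[A minimal sketch of the quick test described above, against the
2.6.31-era direct IO API. ext4_direct_IO_nolock() is a hypothetical
name; this variant drops the stale-data protection that DIO_LOCKING
provides, so it is suitable for measurement only.]

#include <linux/fs.h>

/*
 * Hypothetical test variant of ext4_direct_IO(): route the transfer
 * through the no-locking direct IO path, so i_mutex is not taken in
 * __blockdev_direct_IO().
 */
static ssize_t ext4_direct_IO_nolock(int rw, struct kiocb *iocb,
				     const struct iovec *iov, loff_t offset,
				     unsigned long nr_segs)
{
	struct file *file = iocb->ki_filp;
	struct inode *inode = file->f_mapping->host;

	/* Same arguments as blockdev_direct_IO(), but this helper
	 * passes DIO_NO_LOCKING down, skipping i_mutex entirely. */
	return blockdev_direct_IO_no_locking(rw, iocb, inode,
					     inode->i_sb->s_bdev, iov,
					     offset, nr_segs,
					     ext4_get_block, NULL);
}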
* Re: ext4 DIO read performance issue on SSD
@ 2009-10-14 18:48 Mingming

From: Mingming
To: Jiaying Zhang
Cc: ext4 development, Andrew Morton, Michael Rubin, Manuel Benitez

On Fri, 2009-10-09 at 16:34 -0700, Jiaying Zhang wrote:
> After some initial analysis, we think the ext4 performance problem is
> caused by the i_mutex lock taken during DIO reads. [...] I did a
> quick test by calling blockdev_direct_IO_no_locking() in
> ext4_direct_IO(), and ext4 DIO read achieved 99% of raw IO
> performance.

This is very interesting... and an impressive number.

I tried to change ext4 to call blockdev_direct_IO_no_locking()
directly, but then realized that we can't do this all the time: ext4
supports ext3 non-extent-based files, and uninitialized extents are
not supported on ext3-format files.

> As we understand it, the reason for taking the i_mutex lock during a
> DIO read is to prevent it from accessing stale data that may be
> exposed by a simultaneous write. We saw that Mingming Cao has
> implemented a patch set with which, when a get_block request comes
> from a direct write, ext4 only allocates or splits an uninitialized
> extent; that uninitialized extent is marked initialized in the end_io
> callback.

Though I need to clarify that, with all the patches in mainline, we
only treat newly allocated blocks from direct IO writes to holes this
way, not writes to the end of the file. I actually proposed treating
writes to the end of the file as uninitialized extents too, but there
were concerns that this gets tricky with updating the inode size for
async direct IO. So that part hasn't been done yet.

> We are wondering whether we can extend this idea to buffered writes
> as well. I.e., we always allocate an uninitialized extent first
> during any write and convert it to initialized in the end_io
> callback. This would eliminate the need to hold the i_mutex lock
> during direct reads, because a DIO read should never see a block
> marked initialized before the block has been written with new data.

Oh, I don't think so. For buffered IO, the data is copied into the page
cache; a direct IO read would first flush what's in the page cache to
disk, then read from disk. So if there is a concurrent buffered write
and direct read, removing the i_mutex lock from the direct IO path
should still guarantee the right ordering, without having to treat
buffered allocation with uninitialized extents/end_io.

The i_mutex lock, from my understanding, is there to protect a direct
IO write to a hole against a concurrent direct IO read; we should be
able to remove this lock for extent-based ext4 files.

> We haven't implemented anything yet because we want to ask here first
> whether this proposal makes sense to you.

It does make sense to me.

Mingming
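[A sketch of the restriction Mingming describes, keying the locking
decision off the inode's extent flag. ext4_direct_IO_nolock() is the
hypothetical helper sketched earlier in the thread; the flag test is
the standard EXT4_I(inode)->i_flags check.]

/*
 * Sketch only: non-extent (ext3-format) files keep the default
 * DIO_LOCKING path, since the uninitialized-extent scheme requires
 * extent-mapped inodes.
 */
static ssize_t ext4_direct_IO(int rw, struct kiocb *iocb,
			      const struct iovec *iov, loff_t offset,
			      unsigned long nr_segs)
{
	struct inode *inode = iocb->ki_filp->f_mapping->host;

	if (EXT4_I(inode)->i_flags & EXT4_EXTENTS_FL)
		return ext4_direct_IO_nolock(rw, iocb, iov, offset, nr_segs);

	/* ext3-format file: keep i_mutex protection. */
	return blockdev_direct_IO(rw, iocb, inode, inode->i_sb->s_bdev,
				  iov, offset, nr_segs, ext4_get_block, NULL);
}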
* Re: ext4 DIO read performance issue on SSD
@ 2009-10-14 19:48 Jiaying Zhang

From: Jiaying Zhang
To: Mingming
Cc: ext4 development, Andrew Morton, Michael Rubin, Manuel Benitez

On Wed, Oct 14, 2009 at 11:48 AM, Mingming <cmm@us.ibm.com> wrote:
> Oh, I don't think so. For buffered IO, the data is copied into the
> page cache; a direct IO read would first flush what's in the page
> cache to disk, then read from disk.

Hmm, do you mean the filemap_write_and_wait_range() in
__blockdev_direct_IO? Or do we flush the page cache after calling
get_block in the DIO read?

Jiaying
* Re: ext4 DIO read performance issue on SSD
@ 2009-10-14 20:57 Mingming

From: Mingming
To: Jiaying Zhang
Cc: ext4 development, Andrew Morton, Michael Rubin, Manuel Benitez

On Wed, 2009-10-14 at 12:48 -0700, Jiaying Zhang wrote:
> Hmm, do you mean the filemap_write_and_wait_range() in
> __blockdev_direct_IO?

Yes, that's the one that flushes the page cache before a direct read.
* Re: ext4 DIO read performance issue on SSD
@ 2009-10-14 21:42 Jiaying Zhang

From: Jiaying Zhang
To: Mingming
Cc: ext4 development, Michael Rubin, Manuel Benitez, Andrew Morton

On Wed, Oct 14, 2009 at 1:57 PM, Mingming <cmm@us.ibm.com> wrote:
>> Hmm, do you mean the filemap_write_and_wait_range() in
>> __blockdev_direct_IO?
>
> Yes, that's the one that flushes the page cache before a direct read.

I don't think that function is called with DIO_NO_LOCKING.

Also, if we no longer hold the i_mutex lock during a DIO read, I think
there is a time window in which a buffered write can allocate an
initialized block after the DIO read flushes the page cache but before
it calls get_block. That DIO read can then get an initialized block
containing stale data.

Jiaying
* Re: ext4 DIO read performance issue on SSD
@ 2009-10-15 17:27 Mingming

From: Mingming
To: Jiaying Zhang
Cc: ext4 development, Michael Rubin, Manuel Benitez, Andrew Morton

On Wed, 2009-10-14 at 14:42 -0700, Jiaying Zhang wrote:
> I don't think that function is called with DIO_NO_LOCKING.

Oh, I meant the filemap_write_and_wait_range() in
generic_file_aio_read().

> Also, if we no longer hold the i_mutex lock during a DIO read, I
> think there is a time window in which a buffered write can allocate
> an initialized block after the DIO read flushes the page cache but
> before it calls get_block. That DIO read can then get an initialized
> block containing stale data.

Ah, thinking it over: the key is to prevent get_block() from exposing
an initialized extent to a direct read. A concurrent buffered write to
a hole could cause get_block() to allocate blocks before the direct IO
read maps them. That could be addressed in a way similar to what we did
for async direct IO writes to holes...
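[A conceptual sketch of that direction, not the mainline patch: make
write-triggered allocations come back as uninitialized extents, so a
racing lockless DIO read treats the range as a hole until the writer's
completion path converts it. EXT4_GET_BLOCKS_CREATE_UNINIT_EXT is the
flag ext4 already uses for fallocate-style allocation; journal handle
setup is omitted here.]

/*
 * Sketch: any block allocated on behalf of a write is created as an
 * uninitialized extent. A concurrent DIO read mapping the same range
 * sees an unwritten extent and reads it back as a hole (zeros)
 * instead of stale on-disk contents.
 */
static int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
				    struct buffer_head *bh, int create)
{
	int flags = create ? EXT4_GET_BLOCKS_CREATE_UNINIT_EXT : 0;
	int ret;

	/* NULL handle for brevity; real code starts a journal handle. */
	ret = ext4_get_blocks(NULL, inode, iblock,
			      bh->b_size >> inode->i_blkbits, bh, flags);
	if (ret > 0) {
		bh->b_size = (unsigned long)ret << inode->i_blkbits;
		ret = 0;
	}
	return ret;
}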
* Re: ext4 DIO read performance issue on SSD
@ 2009-10-16 1:27 Jiaying Zhang

From: Jiaying Zhang
To: Mingming
Cc: ext4 development, Michael Rubin, Manuel Benitez, Andrew Morton

Thinking about this again, I think there is another benefit to always
allocating uninitialized blocks first in all kinds of writes,
including size-extending writes.

Currently, we avoid exposing stale data during a size-extending write
by adding the inode to the orphan list before changing the file size.
This helps us with journaled ext4 because the journal guarantees that
the on-disk orphan list is updated before data is written to disk and
i_size is updated. But with non-journal ext4, we don't update the
orphan list during a size-extending write, because without a journal
there is no guarantee on the ordering of writes to disk. So if the
system crashes after the on-disk i_size is updated but before the data
hits disk (this is rare but can happen during fsync), we may end up
exposing stale data with non-journal ext4.

I think allocating an uninitialized extent first during such a
size-extending write, and then converting the extent to initialized at
end_io time, can help close this security hole for non-journal ext4.

Any thoughts on this?

Jiaying

On Thu, Oct 15, 2009 at 10:27 AM, Mingming <cmm@us.ibm.com> wrote:
> Ah, thinking it over: the key is to prevent get_block() from exposing
> an initialized extent to a direct read. A concurrent buffered write
> to a hole could cause get_block() to allocate blocks before the
> direct IO read maps them. That could be addressed in a way similar to
> what we did for async direct IO writes to holes...
* Re: ext4 DIO read performance issue on SSD
@ 2009-10-16 19:15 Theodore Tso

From: Theodore Tso
To: Jiaying Zhang
Cc: ext4 development, Andrew Morton, Michael Rubin, Manuel Benitez, Mingming

On Fri, Oct 09, 2009 at 04:34:08PM -0700, Jiaying Zhang wrote:
> After some initial analysis, we think the ext4 performance problem is
> caused by the i_mutex lock taken during DIO reads. [...] I did a
> quick test by calling blockdev_direct_IO_no_locking() in
> ext4_direct_IO(), and ext4 DIO read achieved 99% of raw IO
> performance.

Your diagnosis makes sense, and the test using
blockdev_direct_IO_no_locking() tends to confirm things, but before we
start surgery there's a very easy way of confirming that i_mutex is
under contention. Enable CONFIG_LOCK_STAT, do "echo 0 >
/proc/lock_stat" before running your benchmark, and then capture the
contents of /proc/lock_stat after running your benchmark. For more
information see Documentation/lockstat.txt in the kernel sources. It's
always nice to grab before and after snapshots, and it's a handy way
of finding other contention points.

> That uninitialized extent will be marked as initialized in the end_io
> callback. We are wondering whether we can extend this idea to
> buffered writes as well. [...]

Obviously this will only work for extent-mapped inodes. So it means
complicating the callback code somewhat, but in theory this should
work.

On Thu, Oct 15, 2009 at 06:27:12PM -0700, Jiaying Zhang wrote:
> Currently, we avoid exposing stale data during a size-extending write
> by adding the inode to the orphan list before changing the file size.
> This helps us with journaled ext4 because the journal guarantees that
> the on-disk orphan list is updated before data is written to disk and
> i_size is updated. But with non-journal ext4, we don't update the
> orphan list during a size-extending write [...] So if the system
> crashes after the on-disk i_size is updated but before the data hits
> disk (this is rare but can happen during fsync), we may end up
> exposing stale data with non-journal ext4.

Actually, that's not quite correct. That's the plan for how it works in
ext3's guarded mode (which hasn't been merged into mainline nor ported
to ext4), but in ext3's and ext4's ordered mode the way we avoid
exposing stale data is by forcing data blocks to disk before the commit
is allowed to complete. You're right that either way, it doesn't help
you if you're not using a journal. (This is also going to be a problem
in supporting TRIM/DISCARD, since we are depending on the commit
mechanism to determine when it's safe to send the TRIM command to the
block device --- but that's a separate issue.)

> I think allocating an uninitialized extent first during such a
> size-extending write, and then converting the extent to initialized
> at end_io time, can help close this security hole for non-journal
> ext4.

Well... strictly speaking, using end_io won't guarantee that stale
data is never exposed, since the end_io callback is called when the
data has been transferred to the disk controller, not when the data
has actually been written to iron oxide. In other words, since there's
no barrier operation, there are no guarantees on a power drop. Still,
it's certainly better than nothing, and it will make things much
better than they were previously.

By the way, something that would be worth testing is making sure that
a direct IO read racing with a buffered write with delayed allocation
works correctly today. What should happen (although it may not be what
application programmers expect) is that when a direct IO read hits
part of a file that has just been written via a buffered write with
delayed allocation enabled, and the block hasn't been allocated yet,
the direct IO read should behave as a read from a hole. (XFS behaves
the same way; we don't guarantee cache coherence when direct IO and
non-direct IO are mixed.)

- Ted
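[A sketch of the completion side being discussed, using the
2.6.31-era dio_iodone_t signature. ext4_convert_unwritten_extents() is
the helper Mingming's DIO patches add; wiring it to buffered-write
completion as well (rather than just DIO) is the speculative part.]

/*
 * Sketch: at write completion, convert the just-written range from
 * uninitialized to initialized. Only after this point may a lockless
 * DIO read see these blocks as valid data; before it, the unwritten
 * extent reads back as zeros.
 */
static void ext4_end_io_convert(struct kiocb *iocb, loff_t offset,
				ssize_t size, void *private)
{
	struct inode *inode = iocb->ki_filp->f_mapping->host;

	if (size <= 0)
		return;

	ext4_convert_unwritten_extents(inode, offset, size);
}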
* Re: ext4 DIO read performance issue on SSD
@ 2009-10-20 1:26 Jiaying Zhang

From: Jiaying Zhang
To: Theodore Tso
Cc: ext4 development, Andrew Morton, Michael Rubin, Manuel Benitez, Mingming

Ted,

Thanks a lot for your reply!

On Fri, Oct 16, 2009 at 12:15 PM, Theodore Tso <tytso@mit.edu> wrote:
> Your diagnosis makes sense, and the test using
> blockdev_direct_IO_no_locking() tends to confirm things, but before
> we start surgery there's a very easy way of confirming that i_mutex
> is under contention. Enable CONFIG_LOCK_STAT [...]

I tried lock_stat as you suggested. I did see a lot of contention with
multiple threads reading a single file via DIO AIO, and a large amount
of time was spent waiting on the i_mutex. I think this confirms our
theory that the use of the i_mutex lock during DIO caused most of the
overhead we saw on the ext4 SSD.

> Well... strictly speaking, using end_io won't guarantee that stale
> data is never exposed, since the end_io callback is called when the
> data has been transferred to the disk controller, not when the data
> has actually been written to iron oxide. [...]

I agree. It will be hard to eliminate the stale-data exposure problem
in the non-journal case with a disk write cache, but I think it can
still help us with SSDs or battery-backed disks.

> By the way, something that would be worth testing is making sure that
> a direct IO read racing with a buffered write with delayed allocation
> works correctly today. What should happen (although it may not be
> what application programmers expect) is that when a direct IO read
> hits part of a file that has just been written via a buffered write
> with delayed allocation enabled, and the block hasn't been allocated
> yet, the direct IO read should behave as a read from a hole.

Hmm, do you mean that if a DIO read happens right after a buffered
write with delayed allocation, it should return zeros instead of the
data written by the buffered write?

Jiaying
* Re: ext4 DIO read performance issue on SSD
@ 2009-10-19 19:04 Mingming

From: Mingming
To: Jiaying Zhang
Cc: ext4 development, Michael Rubin, Manuel Benitez, Andrew Morton

On Thu, 2009-10-15 at 18:27 -0700, Jiaying Zhang wrote:
> Currently, we avoid exposing stale data during a size-extending write
> by adding the inode to the orphan list before changing the file size.

Jan Kara corrected me on this misunderstanding before. The orphan list
in the direct IO code is there to trim the inode back to its original
size if the system/fs crashes in the middle of expanding a file. The
file doesn't get its size updated until all the IO is completed. The
ordered journal mode is what avoids exposing stale data during a
size-extending write.

> I think allocating an uninitialized extent first during such a
> size-extending write, and then converting the extent to initialized
> at end_io time, can help close this security hole for non-journal
> ext4.
>
> Any thoughts on this?

Probably true; something similar to data=guarded mode.
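[For reference, a sketch of the orphan-list pattern Mingming
describes, modeled on what ext4_ind_direct_IO() does for extending
writes; the helper name and the 2-credit journal estimate are
illustrative assumptions, and error paths are trimmed.]

/*
 * Sketch: before an extending direct write, put the inode on the
 * orphan list so that crash recovery trims any blocks beyond the old
 * on-disk i_size; the caller removes it from the list once the write
 * and the i_size update complete.
 */
static int ext4_orphan_protect(struct inode *inode, loff_t offset,
			       size_t count)
{
	handle_t *handle;
	int err;

	if (offset + count <= i_size_read(inode))
		return 0;		/* not size-extending */

	handle = ext4_journal_start(inode, 2);	/* illustrative credits */
	if (IS_ERR(handle))
		return PTR_ERR(handle);

	err = ext4_orphan_add(handle, inode);
	ext4_journal_stop(handle);
	return err;
}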
* Re: ext4 DIO read performance issue on SSD
@ 2009-10-15 5:14 Jiaying Zhang

From: Jiaying Zhang
To: Mingming
Cc: ext4 development, Andrew Morton, Michael Rubin, Manuel Benitez

Mingming,

On Wed, Oct 14, 2009 at 11:48 AM, Mingming <cmm@us.ibm.com> wrote:
> Though I need to clarify that, with all the patches in mainline, we
> only treat newly allocated blocks from direct IO writes to holes this
> way, not writes to the end of the file. I actually proposed treating
> writes to the end of the file as uninitialized extents too, but there
> were concerns that this gets tricky with updating the inode size for
> async direct IO. So that part hasn't been done yet.

I read your previous email thread again. As I understand it, the main
concern with allocating uninitialized blocks in an i_size-extending
write is that we may end up with uninitialized blocks beyond i_size if
the system crashes during the write. Can we protect against this case
by adding the inode to the orphan list in ext4_ext_direct_IO, i.e.,
the same as we do in ext4_ind_direct_IO?

Jiaying
* Re: ext4 DIO read performance issue on SSD
@ 2009-10-15 17:31 Mingming

From: Mingming
To: Jiaying Zhang
Cc: ext4 development, Andrew Morton, Michael Rubin, Manuel Benitez

On Wed, 2009-10-14 at 22:14 -0700, Jiaying Zhang wrote:
> Mingming,

Hi Jiaying,

> I read your previous email thread again. As I understand it, the main
> concern with allocating uninitialized blocks in an i_size-extending
> write is that we may end up with uninitialized blocks beyond i_size
> if the system crashes during the write. Can we protect against this
> case by adding the inode to the orphan list in ext4_ext_direct_IO,
> i.e., the same as we do in ext4_ind_direct_IO?

Sure, we could do that, though initially I hoped we could get rid of
that :)

The tricky part is an async direct write to the end of the file. By
the time the IO is completed, the inode may have been truncated or
extended further. Updating the most "safe" size is the part I haven't
thought through yet.
* Re: ext4 DIO read performance issue on SSD 2009-10-15 17:31 ` Mingming @ 2009-10-15 20:07 ` Jiaying Zhang 2009-10-15 23:28 ` Mingming 0 siblings, 1 reply; 18+ messages in thread From: Jiaying Zhang @ 2009-10-15 20:07 UTC (permalink / raw) To: Mingming; +Cc: ext4 development, Andrew Morton, Michael Rubin, Manuel Benitez On Thu, Oct 15, 2009 at 10:31 AM, Mingming <cmm@us.ibm.com> wrote: > On Wed, 2009-10-14 at 22:14 -0700, Jiaying Zhang wrote: >> Mingming, >> > > Hi Jiaying, > >> On Wed, Oct 14, 2009 at 11:48 AM, Mingming <cmm@us.ibm.com> wrote: >> > On Fri, 2009-10-09 at 16:34 -0700, Jiaying Zhang wrote: >> >> Hello, >> >> >> >> Recently, we are evaluating the ext4 performance on a high speed SSD. >> >> One problem we found is that ext4 performance doesn't scale well with >> >> multiple threads or multiple AIOs reading a single file with O_DIRECT. >> >> E.g., with 4k block size, multiple-thread DIO AIO random read on ext4 >> >> can lose up to 50% throughput compared to the results we get via RAW IO. >> >> >> >> After some initial analysis, we think the ext4 performance problem is caused >> >> by the use of i_mutex lock during DIO read. I.e., during DIO read, we grab >> >> the i_mutex lock in __blockdev_direct_IO because ext4 uses the default >> >> DIO_LOCKING from the generic fs code. I did a quick test by calling >> >> blockdev_direct_IO_no_locking() in ext4_direct_IO() and I saw ext4 DIO read >> >> got 99% performance as raw IO. >> >> >> > >> > This is very interesting...and impressive number. >> > >> > I tried to change ext4 to call blockdev_direct_IO_no_locking() directly, >> > but then realize that we can't do this all the time, as ext4 support >> > ext3 non-extent based files, and uninitialized extent is not support on >> > ext3 format file. >> > >> >> As we understand, the reason why we want to take i_mutex lock during DIO >> >> read is to prevent it from accessing stale data that may be exposed by a >> >> simultaneous write. We saw that Mingming Cao has implemented a patch set >> >> with which when a get_block request comes from direct write, ext4 only >> >> allocates or splits an uninitialized extent. That uninitialized extent >> >> will be marked as initialized at the end_io callback. >> > >> > Though I need to clarify that with all the patches in mainline, we only >> > treat new allocated blocks form direct io write to holes, not to writes >> > to the end of file. I actually have proposed to treat the write to the >> > end of file also as unintialized extents, but there is some concerns >> > that this getting tricky with updating inode size when it is async IO >> > direct IO. So it didn't getting done yet. >> >> I read you previous email thread again. As I understand, the main >> concern for allocating uninitialized blocks in i_size extending write >> is that we may end up having uninitialized blocks beyond i_size >> if the system crashes during write. Can we protect this case by >> adding the inode into the orphan list in ext4_ext_direct_IO, >> i.e., same as we have done in ext4_ind_direct_IO? >> > > Sure we could do that, though initially I hoped we could get rid of > that:) > > The tricky part is async direct write to the end of file. By the time > the IO is completed, the inode may be truncated or extended larger. > Updating the most "safe" size is the part I haven't thought through yet. > Ok. I think I understand the problem better now :). 
Looking at the __blockdev_direct_IO(), I saw it actually treats size-extending aio dio write as synchronous and expects the dio to complete before return (fs/direct-io.c line 1204 and line 1056-1061). Can we follow the same rule and check whether it is a size-extending aio write in ext4_end_io_dio()? In such cases, we can call ext4_end_aio_dio_nolock() synchronously instead of queuing the work. I think this will protect us from truncate because we are still holding i_mutex and i_alloc_sem. Jiaying > > >> Jiaying >> >> > >> >> We are wondering >> >> whether we can extend this idea to buffer write as well. I.e., we always >> >> allocate an uninitialized extent first during any write and convert it >> >> as initialized at the time of end_io callback. This will eliminate the need >> >> to hold i_mutex lock during direct read because a DIO read should never get >> >> a block marked initialized before the block has been written with new data. >> >> >> > >> > Oh I don't think so. For buffered IO, the data is being copied to >> > buffer, direct IO read would first flush what's in page cache to disk, >> > then read from disk. So if there is concurrent buffered write and direct >> > read, removing the i_mutex locks from the direct IO path should still >> > gurantee the right order, without having to treat buffered allocation >> > with uninitialized extent/end_io. >> > >> > The i_mutex lock, from my understanding, is there to protect direct IO >> > write to hole and concurrent direct IO read, we should able to remove >> > this lock for extent based ext4 file. >> > >> >> We haven't implemented anything yet because we want to ask here first to >> >> see whether this proposal makes sense to you. >> >> >> > >> > It does make sense to me. >> > >> > Mingming >> >> Regards, >> >> >> >> Jiaying >> >> -- >> >> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in >> >> the body of a message to majordomo@vger.kernel.org >> >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > >> > >> > > > > -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
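A rough sketch of the completion-side check Jiaying proposes here, reusing the names from Mingming's patch set (ext4_end_io_dio, ext4_end_aio_dio_nolock, ext4_io_end_t); the field layout and the wq variable (standing for the per-superblock unwritten-extent workqueue) are assumptions for illustration, not the actual patch:

	static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
				    ssize_t size, void *private)
	{
		ext4_io_end_t *io_end = iocb->private;

		if (!io_end || size <= 0)
			return;

		io_end->offset = offset;
		io_end->size = size;

		/*
		 * __blockdev_direct_IO forces size-extending writes to
		 * complete synchronously, so for them we are still under
		 * i_mutex and i_alloc_sem here: convert the extent in-line
		 * rather than queuing the work, and truncate cannot run
		 * underneath us.
		 */
		if (offset + size > i_size_read(io_end->inode))
			ext4_end_aio_dio_nolock(io_end);
		else
			queue_work(wq, &io_end->work);

		iocb->private = NULL;
	}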
* Re: ext4 DIO read performance issue on SSD 2009-10-15 20:07 ` Jiaying Zhang @ 2009-10-15 23:28 ` Mingming 2009-10-15 23:33 ` Jiaying Zhang 0 siblings, 1 reply; 18+ messages in thread From: Mingming @ 2009-10-15 23:28 UTC (permalink / raw) To: Jiaying Zhang Cc: ext4 development, Andrew Morton, Michael Rubin, Manuel Benitez On Thu, 2009-10-15 at 13:07 -0700, Jiaying Zhang wrote: > On Thu, Oct 15, 2009 at 10:31 AM, Mingming <cmm@us.ibm.com> wrote: > > On Wed, 2009-10-14 at 22:14 -0700, Jiaying Zhang wrote: > >> Mingming, > >> > > > > Hi Jiaying, > > > >> On Wed, Oct 14, 2009 at 11:48 AM, Mingming <cmm@us.ibm.com> wrote: > >> > On Fri, 2009-10-09 at 16:34 -0700, Jiaying Zhang wrote: > >> >> Hello, > >> >> > >> >> Recently, we are evaluating the ext4 performance on a high speed SSD. > >> >> One problem we found is that ext4 performance doesn't scale well with > >> >> multiple threads or multiple AIOs reading a single file with O_DIRECT. > >> >> E.g., with 4k block size, multiple-thread DIO AIO random read on ext4 > >> >> can lose up to 50% throughput compared to the results we get via RAW IO. > >> >> > >> >> After some initial analysis, we think the ext4 performance problem is caused > >> >> by the use of i_mutex lock during DIO read. I.e., during DIO read, we grab > >> >> the i_mutex lock in __blockdev_direct_IO because ext4 uses the default > >> >> DIO_LOCKING from the generic fs code. I did a quick test by calling > >> >> blockdev_direct_IO_no_locking() in ext4_direct_IO() and I saw ext4 DIO read > >> >> got 99% performance as raw IO. > >> >> > >> > > >> > This is very interesting...and impressive number. > >> > > >> > I tried to change ext4 to call blockdev_direct_IO_no_locking() directly, > >> > but then realize that we can't do this all the time, as ext4 support > >> > ext3 non-extent based files, and uninitialized extent is not support on > >> > ext3 format file. > >> > > >> >> As we understand, the reason why we want to take i_mutex lock during DIO > >> >> read is to prevent it from accessing stale data that may be exposed by a > >> >> simultaneous write. We saw that Mingming Cao has implemented a patch set > >> >> with which when a get_block request comes from direct write, ext4 only > >> >> allocates or splits an uninitialized extent. That uninitialized extent > >> >> will be marked as initialized at the end_io callback. > >> > > >> > Though I need to clarify that with all the patches in mainline, we only > >> > treat new allocated blocks form direct io write to holes, not to writes > >> > to the end of file. I actually have proposed to treat the write to the > >> > end of file also as unintialized extents, but there is some concerns > >> > that this getting tricky with updating inode size when it is async IO > >> > direct IO. So it didn't getting done yet. > >> > >> I read you previous email thread again. As I understand, the main > >> concern for allocating uninitialized blocks in i_size extending write > >> is that we may end up having uninitialized blocks beyond i_size > >> if the system crashes during write. Can we protect this case by > >> adding the inode into the orphan list in ext4_ext_direct_IO, > >> i.e., same as we have done in ext4_ind_direct_IO? > >> > > > > Sure we could do that, though initially I hoped we could get rid of > > that:) > > > > The tricky part is async direct write to the end of file. By the time > > the IO is completed, the inode may be truncated or extended larger. > > Updating the most "safe" size is the part I haven't thought through yet. 
> > > > Ok. I think I understand the problem better now :). > > Looking at the __blockdev_direct_IO(), I saw it actually treats > size-extending aio dio write as synchronous and expects the dio to > complete before return (fs/direct-io.c line 1204 and line 1056-1061). Oh? It seems it will keep the write async as long as it's not a partial write /* * The only time we want to leave bios in flight is when a successful * partial aio read or full aio write have been setup. In that case * bio completion will call aio_complete. The only time it's safe to * call aio_complete is when we return -EIOCBQUEUED, so we key on that. * This had *better* be the only place that raises -EIOCBQUEUED. */ BUG_ON(ret == -EIOCBQUEUED); if (dio->is_async && ret == 0 && dio->result && ((rw & READ) || (dio->result == dio->size))) ret = -EIOCBQUEUED; if (ret != -EIOCBQUEUED) dio_await_completion(dio); > Can we follow the same rule and check whether it is a size-extending > aio write in ext4_end_io_dio()? In such cases, we can call > ext4_end_aio_dio_nolock() synchronously instead of queuing > the work. I think this will protect us from truncate because we > are still holding i_mutex and i_alloc_sem. > > Jiaying > > > > > > >> Jiaying > >> > >> > > >> >> We are wondering > >> >> whether we can extend this idea to buffer write as well. I.e., we always > >> >> allocate an uninitialized extent first during any write and convert it > >> >> as initialized at the time of end_io callback. This will eliminate the need > >> >> to hold i_mutex lock during direct read because a DIO read should never get > >> >> a block marked initialized before the block has been written with new data. > >> >> > >> > > >> > Oh I don't think so. For buffered IO, the data is being copied to > >> > buffer, direct IO read would first flush what's in page cache to disk, > >> > then read from disk. So if there is concurrent buffered write and direct > >> > read, removing the i_mutex locks from the direct IO path should still > >> > gurantee the right order, without having to treat buffered allocation > >> > with uninitialized extent/end_io. > >> > > >> > The i_mutex lock, from my understanding, is there to protect direct IO > >> > write to hole and concurrent direct IO read, we should able to remove > >> > this lock for extent based ext4 file. > >> > > >> >> We haven't implemented anything yet because we want to ask here first to > >> >> see whether this proposal makes sense to you. > >> >> > >> > > >> > It does make sense to me. > >> > > >> > Mingming > >> >> Regards, > >> >> > >> >> Jiaying > >> >> -- > >> >> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > >> >> the body of a message to majordomo@vger.kernel.org > >> >> More majordomo info at http://vger.kernel.org/majordomo-info.html > >> > > >> > > >> > > > > > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: ext4 DIO read performance issue on SSD 2009-10-15 23:28 ` Mingming @ 2009-10-15 23:33 ` Jiaying Zhang 2009-10-16 18:56 ` Mingming 0 siblings, 1 reply; 18+ messages in thread From: Jiaying Zhang @ 2009-10-15 23:33 UTC (permalink / raw) To: Mingming; +Cc: ext4 development, Andrew Morton, Michael Rubin, Manuel Benitez On Thu, Oct 15, 2009 at 4:28 PM, Mingming <cmm@us.ibm.com> wrote: > On Thu, 2009-10-15 at 13:07 -0700, Jiaying Zhang wrote: >> On Thu, Oct 15, 2009 at 10:31 AM, Mingming <cmm@us.ibm.com> wrote: >> > On Wed, 2009-10-14 at 22:14 -0700, Jiaying Zhang wrote: >> >> Mingming, >> >> >> > >> > Hi Jiaying, >> > >> >> On Wed, Oct 14, 2009 at 11:48 AM, Mingming <cmm@us.ibm.com> wrote: >> >> > On Fri, 2009-10-09 at 16:34 -0700, Jiaying Zhang wrote: >> >> >> Hello, >> >> >> >> >> >> Recently, we are evaluating the ext4 performance on a high speed SSD. >> >> >> One problem we found is that ext4 performance doesn't scale well with >> >> >> multiple threads or multiple AIOs reading a single file with O_DIRECT. >> >> >> E.g., with 4k block size, multiple-thread DIO AIO random read on ext4 >> >> >> can lose up to 50% throughput compared to the results we get via RAW IO. >> >> >> >> >> >> After some initial analysis, we think the ext4 performance problem is caused >> >> >> by the use of i_mutex lock during DIO read. I.e., during DIO read, we grab >> >> >> the i_mutex lock in __blockdev_direct_IO because ext4 uses the default >> >> >> DIO_LOCKING from the generic fs code. I did a quick test by calling >> >> >> blockdev_direct_IO_no_locking() in ext4_direct_IO() and I saw ext4 DIO read >> >> >> got 99% performance as raw IO. >> >> >> >> >> > >> >> > This is very interesting...and impressive number. >> >> > >> >> > I tried to change ext4 to call blockdev_direct_IO_no_locking() directly, >> >> > but then realize that we can't do this all the time, as ext4 support >> >> > ext3 non-extent based files, and uninitialized extent is not support on >> >> > ext3 format file. >> >> > >> >> >> As we understand, the reason why we want to take i_mutex lock during DIO >> >> >> read is to prevent it from accessing stale data that may be exposed by a >> >> >> simultaneous write. We saw that Mingming Cao has implemented a patch set >> >> >> with which when a get_block request comes from direct write, ext4 only >> >> >> allocates or splits an uninitialized extent. That uninitialized extent >> >> >> will be marked as initialized at the end_io callback. >> >> > >> >> > Though I need to clarify that with all the patches in mainline, we only >> >> > treat new allocated blocks form direct io write to holes, not to writes >> >> > to the end of file. I actually have proposed to treat the write to the >> >> > end of file also as unintialized extents, but there is some concerns >> >> > that this getting tricky with updating inode size when it is async IO >> >> > direct IO. So it didn't getting done yet. >> >> >> >> I read you previous email thread again. As I understand, the main >> >> concern for allocating uninitialized blocks in i_size extending write >> >> is that we may end up having uninitialized blocks beyond i_size >> >> if the system crashes during write. Can we protect this case by >> >> adding the inode into the orphan list in ext4_ext_direct_IO, >> >> i.e., same as we have done in ext4_ind_direct_IO? >> >> >> > >> > Sure we could do that, though initially I hoped we could get rid of >> > that:) >> > >> > The tricky part is async direct write to the end of file. 
By the time >> > the IO is completed, the inode may be truncated or extended larger. >> > Updating the most "safe" size is the part I haven't thought through yet. >> > >> >> Ok. I think I understand the problem better now :). >> >> Looking at the __blockdev_direct_IO(), I saw it actually treats >> size-extending aio dio write as synchronous and expects the dio to >> complete before return (fs/direct-io.c line 1204 and line 1056-1061). > > Oh? It seems it will keep the write async as long as it's not a partial > write > /* > * The only time we want to leave bios in flight is when a successful > * partial aio read or full aio write have been setup. In that case > * bio completion will call aio_complete. The only time it's safe to > * call aio_complete is when we return -EIOCBQUEUED, so we key on that. > * This had *better* be the only place that raises -EIOCBQUEUED. > */ > BUG_ON(ret == -EIOCBQUEUED); > if (dio->is_async && ret == 0 && dio->result && > ((rw & READ) || (dio->result == dio->size))) > ret = -EIOCBQUEUED; > > if (ret != -EIOCBQUEUED) > dio_await_completion(dio); > If I read the code correctly, dio->is_async is not set for file extending write: /* * For file extending writes updating i_size before data * writeouts complete can expose uninitialized blocks. So * even for AIO, we need to wait for i/o to complete before * returning in this case. */ dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) && (end > i_size_read(inode))); Jiaying >> Can we follow the same rule and check whether it is a size-extending >> aio write in ext4_end_io_dio()? In such cases, we can call >> ext4_end_aio_dio_nolock() synchronously instead of queuing >> the work. I think this will protect us from truncate because we >> are still holding i_mutex and i_alloc_sem. >> >> Jiaying >> >> > >> > >> >> Jiaying >> >> >> >> > >> >> >> We are wondering >> >> >> whether we can extend this idea to buffer write as well. I.e., we always >> >> >> allocate an uninitialized extent first during any write and convert it >> >> >> as initialized at the time of end_io callback. This will eliminate the need >> >> >> to hold i_mutex lock during direct read because a DIO read should never get >> >> >> a block marked initialized before the block has been written with new data. >> >> >> >> >> > >> >> > Oh I don't think so. For buffered IO, the data is being copied to >> >> > buffer, direct IO read would first flush what's in page cache to disk, >> >> > then read from disk. So if there is concurrent buffered write and direct >> >> > read, removing the i_mutex locks from the direct IO path should still >> >> > gurantee the right order, without having to treat buffered allocation >> >> > with uninitialized extent/end_io. >> >> > >> >> > The i_mutex lock, from my understanding, is there to protect direct IO >> >> > write to hole and concurrent direct IO read, we should able to remove >> >> > this lock for extent based ext4 file. >> >> > >> >> >> We haven't implemented anything yet because we want to ask here first to >> >> >> see whether this proposal makes sense to you. >> >> >> >> >> > >> >> > It does make sense to me. 
>> >> > >> >> > Mingming >> >> >> Regards, >> >> >> >> >> >> Jiaying >> >> >> -- >> >> >> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in >> >> >> the body of a message to majordomo@vger.kernel.org >> >> >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> > >> >> > >> >> > >> > >> > >> > >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > > -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
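Putting the two quoted fragments together: __blockdev_direct_IO() only returns -EIOCBQUEUED, and hence only stays genuinely asynchronous, when both conditions hold. A condensed paraphrase of the 2.6.31-era logic, not literal kernel code:

	static bool dio_stays_async(struct kiocb *iocb, int rw, loff_t end,
				    struct inode *inode, ssize_t result,
				    ssize_t size)
	{
		/* set up front: size-extending writes are never async ... */
		bool is_async = !is_sync_kiocb(iocb) &&
				!((rw & WRITE) && end > i_size_read(inode));

		/*
		 * ... and even then, only reads and full (non-partial)
		 * writes are completed later via aio_complete.
		 */
		return is_async && result &&
		       ((rw & READ) || result == size);
	}

So both readings are right: a non-extending AIO write stays asynchronous unless it is partial, but a file-extending AIO DIO write is quietly degraded to synchronous completion, which is what makes the i_size update safe.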
* Re: ext4 DIO read performance issue on SSD 2009-10-15 23:33 ` Jiaying Zhang @ 2009-10-16 18:56 ` Mingming 2009-10-16 19:44 ` Jiaying Zhang 0 siblings, 1 reply; 18+ messages in thread From: Mingming @ 2009-10-16 18:56 UTC (permalink / raw) To: Jiaying Zhang Cc: ext4 development, Andrew Morton, Michael Rubin, Manuel Benitez On Thu, 2009-10-15 at 16:33 -0700, Jiaying Zhang wrote: > On Thu, Oct 15, 2009 at 4:28 PM, Mingming <cmm@us.ibm.com> wrote: > > On Thu, 2009-10-15 at 13:07 -0700, Jiaying Zhang wrote: > >> On Thu, Oct 15, 2009 at 10:31 AM, Mingming <cmm@us.ibm.com> wrote: > >> > On Wed, 2009-10-14 at 22:14 -0700, Jiaying Zhang wrote: > >> >> Mingming, > >> >> > >> > > >> > Hi Jiaying, > >> > > >> >> On Wed, Oct 14, 2009 at 11:48 AM, Mingming <cmm@us.ibm.com> wrote: > >> >> > On Fri, 2009-10-09 at 16:34 -0700, Jiaying Zhang wrote: > >> >> >> Hello, > >> >> >> > >> >> >> Recently, we are evaluating the ext4 performance on a high speed SSD. > >> >> >> One problem we found is that ext4 performance doesn't scale well with > >> >> >> multiple threads or multiple AIOs reading a single file with O_DIRECT. > >> >> >> E.g., with 4k block size, multiple-thread DIO AIO random read on ext4 > >> >> >> can lose up to 50% throughput compared to the results we get via RAW IO. > >> >> >> > >> >> >> After some initial analysis, we think the ext4 performance problem is caused > >> >> >> by the use of i_mutex lock during DIO read. I.e., during DIO read, we grab > >> >> >> the i_mutex lock in __blockdev_direct_IO because ext4 uses the default > >> >> >> DIO_LOCKING from the generic fs code. I did a quick test by calling > >> >> >> blockdev_direct_IO_no_locking() in ext4_direct_IO() and I saw ext4 DIO read > >> >> >> got 99% performance as raw IO. > >> >> >> > >> >> > > >> >> > This is very interesting...and impressive number. > >> >> > > >> >> > I tried to change ext4 to call blockdev_direct_IO_no_locking() directly, > >> >> > but then realize that we can't do this all the time, as ext4 support > >> >> > ext3 non-extent based files, and uninitialized extent is not support on > >> >> > ext3 format file. > >> >> > > >> >> >> As we understand, the reason why we want to take i_mutex lock during DIO > >> >> >> read is to prevent it from accessing stale data that may be exposed by a > >> >> >> simultaneous write. We saw that Mingming Cao has implemented a patch set > >> >> >> with which when a get_block request comes from direct write, ext4 only > >> >> >> allocates or splits an uninitialized extent. That uninitialized extent > >> >> >> will be marked as initialized at the end_io callback. > >> >> > > >> >> > Though I need to clarify that with all the patches in mainline, we only > >> >> > treat new allocated blocks form direct io write to holes, not to writes > >> >> > to the end of file. I actually have proposed to treat the write to the > >> >> > end of file also as unintialized extents, but there is some concerns > >> >> > that this getting tricky with updating inode size when it is async IO > >> >> > direct IO. So it didn't getting done yet. > >> >> > >> >> I read you previous email thread again. As I understand, the main > >> >> concern for allocating uninitialized blocks in i_size extending write > >> >> is that we may end up having uninitialized blocks beyond i_size > >> >> if the system crashes during write. Can we protect this case by > >> >> adding the inode into the orphan list in ext4_ext_direct_IO, > >> >> i.e., same as we have done in ext4_ind_direct_IO? 
> >> >> > >> > > >> > Sure we could do that, though initially I hoped we could get rid of > >> > that:) > >> > > >> > The tricky part is async direct write to the end of file. By the time > >> > the IO is completed, the inode may be truncated or extended larger. > >> > Updating the most "safe" size is the part I haven't thought through yet. > >> > > >> > >> Ok. I think I understand the problem better now :). > >> > >> Looking at the __blockdev_direct_IO(), I saw it actually treats > >> size-extending aio dio write as synchronous and expects the dio to > >> complete before return (fs/direct-io.c line 1204 and line 1056-1061). > > > > Oh? It seems it will keep the write async as long as it's not a partial > > write > > /* > > * The only time we want to leave bios in flight is when a successful > > * partial aio read or full aio write have been setup. In that case > > * bio completion will call aio_complete. The only time it's safe to > > * call aio_complete is when we return -EIOCBQUEUED, so we key on that. > > * This had *better* be the only place that raises -EIOCBQUEUED. > > */ > > BUG_ON(ret == -EIOCBQUEUED); > > if (dio->is_async && ret == 0 && dio->result && > > ((rw & READ) || (dio->result == dio->size))) > > ret = -EIOCBQUEUED; > > > > if (ret != -EIOCBQUEUED) > > dio_await_completion(dio); > > > > If I read the code correctly, dio->is_async is not set for file extending write: > > /* > * For file extending writes updating i_size before data > * writeouts complete can expose uninitialized blocks. So > * even for AIO, we need to wait for i/o to complete before > * returning in this case. > */ > dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) && > (end > i_size_read(inode))); > Ah... that's a bit of a surprise. An async write to the end of the file is actually synchronous. That explains why ext3 async direct IO could always expand inode i_size safely without having to worry about exposing stale data. It would be better to fix this with an end_io callback, since users expect higher performance from async mode. But that is separate from removing the i_mutex lock from direct IO read, IMHO. A dio write to the end of the file won't extend the inode size until the IO is completed (AIO or non-AIO mode). So if we remove the i_mutex lock from the dio read, then in the case of a parallel dio/buffered write to the end of the file and a concurrent direct IO read, the updated inode block mapping tree won't be exposed to that read, as i_size hasn't changed yet. The only remaining concern with removing the lock from the dio read path is a buffered write to a hole. Isn't it? > Jiaying > > >> Can we follow the same rule and check whether it is a size-extending > >> aio write in ext4_end_io_dio()? In such cases, we can call > >> ext4_end_aio_dio_nolock() synchronously instead of queuing > >> the work. I think this will protect us from truncate because we > >> are still holding i_mutex and i_alloc_sem. > >> > >> Jiaying > >> > >> > > >> > > >> >> Jiaying > >> >> > >> >> > > >> >> >> We are wondering > >> >> >> whether we can extend this idea to buffer write as well. I.e., we always > >> >> >> allocate an uninitialized extent first during any write and convert it > >> >> >> as initialized at the time of end_io callback. This will eliminate the need > >> >> >> to hold i_mutex lock during direct read because a DIO read should never get > >> >> >> a block marked initialized before the block has been written with new data. > >> >> >> > >> >> > > >> >> > Oh I don't think so.
For buffered IO, the data is being copied to > >> >> > buffer, direct IO read would first flush what's in page cache to disk, > >> >> > then read from disk. So if there is concurrent buffered write and direct > >> >> > read, removing the i_mutex locks from the direct IO path should still > >> >> > gurantee the right order, without having to treat buffered allocation > >> >> > with uninitialized extent/end_io. > >> >> > > >> >> > The i_mutex lock, from my understanding, is there to protect direct IO > >> >> > write to hole and concurrent direct IO read, we should able to remove > >> >> > this lock for extent based ext4 file. > >> >> > > >> >> >> We haven't implemented anything yet because we want to ask here first to > >> >> >> see whether this proposal makes sense to you. > >> >> >> > >> >> > > >> >> > It does make sense to me. > >> >> > > >> >> > Mingming > >> >> >> Regards, > >> >> >> > >> >> >> Jiaying > >> >> >> -- > >> >> >> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > >> >> >> the body of a message to majordomo@vger.kernel.org > >> >> >> More majordomo info at http://vger.kernel.org/majordomo-info.html > >> >> > > >> >> > > >> >> > > >> > > >> > > >> > > >> -- > >> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > >> the body of a message to majordomo@vger.kernel.org > >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > > > > > ^ permalink raw reply [flat|nested] 18+ messages in thread
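For what a fix "with end_io callback" might look like: the setup code in fs/direct-io.c could let a filesystem opt out of the forced-synchronous behaviour once it can safely update i_size (and clean up via the orphan list) from its completion handler. DIO_SKIP_EXTEND below is an invented flag, used purely to illustrate the shape of the change; no such flag exists in the kernel under discussion:

	#define DIO_SKIP_EXTEND	0x04	/* invented, for illustration only */

	static bool dio_should_be_async(struct kiocb *iocb, int rw, loff_t end,
					struct inode *inode, int dio_flags)
	{
		if (is_sync_kiocb(iocb))
			return false;
		/*
		 * Today an extending write is always forced synchronous; a
		 * filesystem that extends i_size in end_io could skip that.
		 */
		if ((rw & WRITE) && end > i_size_read(inode) &&
		    !(dio_flags & DIO_SKIP_EXTEND))
			return false;
		return true;
	}

The hard part, as noted earlier in the thread, is deciding which i_size is safe to record at completion time when the I/O races with a truncate or a further extension.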
* Re: ext4 DIO read performance issue on SSD 2009-10-16 18:56 ` Mingming @ 2009-10-16 19:44 ` Jiaying Zhang 2009-10-19 20:23 ` Mingming 0 siblings, 1 reply; 18+ messages in thread From: Jiaying Zhang @ 2009-10-16 19:44 UTC (permalink / raw) To: Mingming; +Cc: ext4 development, Andrew Morton, Michael Rubin, Manuel Benitez On Fri, Oct 16, 2009 at 11:56 AM, Mingming <cmm@us.ibm.com> wrote: > On Thu, 2009-10-15 at 16:33 -0700, Jiaying Zhang wrote: >> On Thu, Oct 15, 2009 at 4:28 PM, Mingming <cmm@us.ibm.com> wrote: >> > On Thu, 2009-10-15 at 13:07 -0700, Jiaying Zhang wrote: >> >> On Thu, Oct 15, 2009 at 10:31 AM, Mingming <cmm@us.ibm.com> wrote: >> >> > On Wed, 2009-10-14 at 22:14 -0700, Jiaying Zhang wrote: >> >> >> Mingming, >> >> >> >> >> > >> >> > Hi Jiaying, >> >> > >> >> >> On Wed, Oct 14, 2009 at 11:48 AM, Mingming <cmm@us.ibm.com> wrote: >> >> >> > On Fri, 2009-10-09 at 16:34 -0700, Jiaying Zhang wrote: >> >> >> >> Hello, >> >> >> >> >> >> >> >> Recently, we are evaluating the ext4 performance on a high speed SSD. >> >> >> >> One problem we found is that ext4 performance doesn't scale well with >> >> >> >> multiple threads or multiple AIOs reading a single file with O_DIRECT. >> >> >> >> E.g., with 4k block size, multiple-thread DIO AIO random read on ext4 >> >> >> >> can lose up to 50% throughput compared to the results we get via RAW IO. >> >> >> >> >> >> >> >> After some initial analysis, we think the ext4 performance problem is caused >> >> >> >> by the use of i_mutex lock during DIO read. I.e., during DIO read, we grab >> >> >> >> the i_mutex lock in __blockdev_direct_IO because ext4 uses the default >> >> >> >> DIO_LOCKING from the generic fs code. I did a quick test by calling >> >> >> >> blockdev_direct_IO_no_locking() in ext4_direct_IO() and I saw ext4 DIO read >> >> >> >> got 99% performance as raw IO. >> >> >> >> >> >> >> > >> >> >> > This is very interesting...and impressive number. >> >> >> > >> >> >> > I tried to change ext4 to call blockdev_direct_IO_no_locking() directly, >> >> >> > but then realize that we can't do this all the time, as ext4 support >> >> >> > ext3 non-extent based files, and uninitialized extent is not support on >> >> >> > ext3 format file. >> >> >> > >> >> >> >> As we understand, the reason why we want to take i_mutex lock during DIO >> >> >> >> read is to prevent it from accessing stale data that may be exposed by a >> >> >> >> simultaneous write. We saw that Mingming Cao has implemented a patch set >> >> >> >> with which when a get_block request comes from direct write, ext4 only >> >> >> >> allocates or splits an uninitialized extent. That uninitialized extent >> >> >> >> will be marked as initialized at the end_io callback. >> >> >> > >> >> >> > Though I need to clarify that with all the patches in mainline, we only >> >> >> > treat new allocated blocks form direct io write to holes, not to writes >> >> >> > to the end of file. I actually have proposed to treat the write to the >> >> >> > end of file also as unintialized extents, but there is some concerns >> >> >> > that this getting tricky with updating inode size when it is async IO >> >> >> > direct IO. So it didn't getting done yet. >> >> >> >> >> >> I read you previous email thread again. As I understand, the main >> >> >> concern for allocating uninitialized blocks in i_size extending write >> >> >> is that we may end up having uninitialized blocks beyond i_size >> >> >> if the system crashes during write. 
Can we protect this case by >> >> >> adding the inode into the orphan list in ext4_ext_direct_IO, >> >> >> i.e., same as we have done in ext4_ind_direct_IO? >> >> >> >> >> > >> >> > Sure we could do that, though initially I hoped we could get rid of >> >> > that:) >> >> > >> >> > The tricky part is async direct write to the end of file. By the time >> >> > the IO is completed, the inode may be truncated or extended larger. >> >> > Updating the most "safe" size is the part I haven't thought through yet. >> >> > >> >> >> >> Ok. I think I understand the problem better now :). >> >> >> >> Looking at the __blockdev_direct_IO(), I saw it actually treats >> >> size-extending aio dio write as synchronous and expects the dio to >> >> complete before return (fs/direct-io.c line 1204 and line 1056-1061). >> > >> > Oh? It seems it will keep the write async as long as it's not a partial >> > write >> > /* >> > * The only time we want to leave bios in flight is when a successful >> > * partial aio read or full aio write have been setup. In that case >> > * bio completion will call aio_complete. The only time it's safe to >> > * call aio_complete is when we return -EIOCBQUEUED, so we key on that. >> > * This had *better* be the only place that raises -EIOCBQUEUED. >> > */ >> > BUG_ON(ret == -EIOCBQUEUED); >> > if (dio->is_async && ret == 0 && dio->result && >> > ((rw & READ) || (dio->result == dio->size))) >> > ret = -EIOCBQUEUED; >> > >> > if (ret != -EIOCBQUEUED) >> > dio_await_completion(dio); >> > >> >> If I read the code correctly, dio->is_async is not set for file extending write: >> >> /* >> * For file extending writes updating i_size before data >> * writeouts complete can expose uninitialized blocks. So >> * even for AIO, we need to wait for i/o to complete before >> * returning in this case. >> */ >> dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) && >> (end > i_size_read(inode))); >> > > Ah... that's a little surprise to know. Async write to the end of file > is actually synchonous. That explains why ext3 async direct IO could > always expand inode i_size safely without having to worry about expose > stale data out. > > It would be better to fix this with end_io callback, user expects get > higher performance via async mode. > > But, these are seperate from removing i_mutex lock from direct IO read, > IMHO. The dio write to end of file won't extend inode size until the IO > is completed (AIO or non AIO mode). So if we remove the i_mutex lock > from the dio read, in the case of parallel dio/buffered write to end of > file and direct io read, it won't expose updated inode's block mapping > tree to parallel direct IO read, as the i_size hasn't been changed yet. > > The only concern to remove the the lock from dio read path, is the > buffered write to the hole. Isn't it? > Hmm. Will we be safe between dio read and size-extending buffer write if we don't hold i_mutex lock during dio read but still allocating initialized blocks in size-extending buffer write? As I understand, we update file size in ext4_write_end. The buffer write have finished at this time but the changes are still in memory. Currently, we prevent dio read from reading stale data by calling filemap_write_and_wait_range() after holding the i_mutex lock in __blockdev_direct_IO(). If we no longer hold i_mutex during dio read, a buffer write can finish after a dio read calls filemap_write_and_wait_range() but before it calls get_block(). 
At the time that dio read calls get_block(), the i_size of the file has been updated, but the written data is still in memory, so it may read stale data. I was thinking that we may change to call filemap_write_and_wait_range() after get_block() during dio read. This seems to eliminate the above problem. However, I am also worried about non-journal ext4. With the current code, I think there is a chance that we can expose stale data after reboot if the system crashed at a bad time. I think always converting uninitialized blocks to initialized during end_io time may help to close this security hole in non-journal ext4. Jiaying >> Jiaying >> >> >> Can we follow the same rule and check whether it is a size-extending >> >> aio write in ext4_end_io_dio()? In such cases, we can call >> >> ext4_end_aio_dio_nolock() synchronously instead of queuing >> >> the work. I think this will protect us from truncate because we >> >> are still holding i_mutex and i_alloc_sem. >> >> >> >> Jiaying >> >> >> >> > >> >> > >> >> >> Jiaying >> >> >> >> >> >> > >> >> >> >> We are wondering >> >> >> >> whether we can extend this idea to buffer write as well. I.e., we always >> >> >> >> allocate an uninitialized extent first during any write and convert it >> >> >> >> as initialized at the time of end_io callback. This will eliminate the need >> >> >> >> to hold i_mutex lock during direct read because a DIO read should never get >> >> >> >> a block marked initialized before the block has been written with new data. >> >> >> >> >> >> >> > >> >> >> > Oh I don't think so. For buffered IO, the data is being copied to >> >> >> > buffer, direct IO read would first flush what's in page cache to disk, >> >> >> > then read from disk. So if there is concurrent buffered write and direct >> >> >> > read, removing the i_mutex locks from the direct IO path should still >> >> >> > gurantee the right order, without having to treat buffered allocation >> >> >> > with uninitialized extent/end_io. >> >> >> > >> >> >> > The i_mutex lock, from my understanding, is there to protect direct IO >> >> >> > write to hole and concurrent direct IO read, we should able to remove >> >> >> > this lock for extent based ext4 file. >> >> >> > >> >> >> >> We haven't implemented anything yet because we want to ask here first to >> >> >> >> see whether this proposal makes sense to you. >> >> >> >> >> >> >> > >> >> >> > It does make sense to me. >> >> >> > >> >> >> > Mingming >> >> >> >> Regards, >> >> >> >> >> >> >> >> Jiaying >> >> >> >> -- >> >> >> >> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in >> >> >> >> the body of a message to majordomo@vger.kernel.org >> >> >> >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> >> >> > >> >> >> > >> >> >> > >> >> > >> >> > >> >> > >> >> -- >> >> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in >> >> the body of a message to majordomo@vger.kernel.org >> >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > >> > >> > > > > -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
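The conversion Jiaying suggests at the end is in the spirit of the helpers from Mingming's patch set; the ext4_convert_unwritten_extents() call below is the real helper from that code, while the wrapper around it and its error handling are illustrative only. The sketch shows why the scheme closes the stale-data hole even without a journal:

	/*
	 * Blocks allocated by a write stay marked uninitialized on disk
	 * until the data I/O has actually completed.  A crash before this
	 * point leaves the extent uninitialized, so a later read returns
	 * zeroes instead of stale disk contents, with or without a journal.
	 */
	static void example_write_end_io(struct inode *inode, loff_t offset,
					 ssize_t size)
	{
		int err;

		/* flip the uninitialized bit now that the data is on disk */
		err = ext4_convert_unwritten_extents(inode, offset, size);
		if (err)
			printk(KERN_ERR "ext4: extent conversion failed "
					"for inode %lu\n", inode->i_ino);
	}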
* Re: ext4 DIO read performance issue on SSD 2009-10-16 19:44 ` Jiaying Zhang @ 2009-10-19 20:23 ` Mingming 0 siblings, 0 replies; 18+ messages in thread From: Mingming @ 2009-10-19 20:23 UTC (permalink / raw) To: Jiaying Zhang Cc: ext4 development, Andrew Morton, Michael Rubin, Manuel Benitez On Fri, 2009-10-16 at 12:44 -0700, Jiaying Zhang wrote: > On Fri, Oct 16, 2009 at 11:56 AM, Mingming <cmm@us.ibm.com> wrote: > > On Thu, 2009-10-15 at 16:33 -0700, Jiaying Zhang wrote: > >> On Thu, Oct 15, 2009 at 4:28 PM, Mingming <cmm@us.ibm.com> wrote: > >> > On Thu, 2009-10-15 at 13:07 -0700, Jiaying Zhang wrote: > >> >> On Thu, Oct 15, 2009 at 10:31 AM, Mingming <cmm@us.ibm.com> wrote: > >> >> > On Wed, 2009-10-14 at 22:14 -0700, Jiaying Zhang wrote: > >> >> >> Mingming, > >> >> >> > >> >> > > >> >> > Hi Jiaying, > >> >> > > >> >> >> On Wed, Oct 14, 2009 at 11:48 AM, Mingming <cmm@us.ibm.com> wrote: > >> >> >> > On Fri, 2009-10-09 at 16:34 -0700, Jiaying Zhang wrote: > >> >> >> >> Hello, > >> >> >> >> > >> >> >> >> Recently, we are evaluating the ext4 performance on a high speed SSD. > >> >> >> >> One problem we found is that ext4 performance doesn't scale well with > >> >> >> >> multiple threads or multiple AIOs reading a single file with O_DIRECT. > >> >> >> >> E.g., with 4k block size, multiple-thread DIO AIO random read on ext4 > >> >> >> >> can lose up to 50% throughput compared to the results we get via RAW IO. > >> >> >> >> > >> >> >> >> After some initial analysis, we think the ext4 performance problem is caused > >> >> >> >> by the use of i_mutex lock during DIO read. I.e., during DIO read, we grab > >> >> >> >> the i_mutex lock in __blockdev_direct_IO because ext4 uses the default > >> >> >> >> DIO_LOCKING from the generic fs code. I did a quick test by calling > >> >> >> >> blockdev_direct_IO_no_locking() in ext4_direct_IO() and I saw ext4 DIO read > >> >> >> >> got 99% performance as raw IO. > >> >> >> >> > >> >> >> > > >> >> >> > This is very interesting...and impressive number. > >> >> >> > > >> >> >> > I tried to change ext4 to call blockdev_direct_IO_no_locking() directly, > >> >> >> > but then realize that we can't do this all the time, as ext4 support > >> >> >> > ext3 non-extent based files, and uninitialized extent is not support on > >> >> >> > ext3 format file. > >> >> >> > > >> >> >> >> As we understand, the reason why we want to take i_mutex lock during DIO > >> >> >> >> read is to prevent it from accessing stale data that may be exposed by a > >> >> >> >> simultaneous write. We saw that Mingming Cao has implemented a patch set > >> >> >> >> with which when a get_block request comes from direct write, ext4 only > >> >> >> >> allocates or splits an uninitialized extent. That uninitialized extent > >> >> >> >> will be marked as initialized at the end_io callback. > >> >> >> > > >> >> >> > Though I need to clarify that with all the patches in mainline, we only > >> >> >> > treat new allocated blocks form direct io write to holes, not to writes > >> >> >> > to the end of file. I actually have proposed to treat the write to the > >> >> >> > end of file also as unintialized extents, but there is some concerns > >> >> >> > that this getting tricky with updating inode size when it is async IO > >> >> >> > direct IO. So it didn't getting done yet. > >> >> >> > >> >> >> I read you previous email thread again. 
As I understand, the main > >> >> >> concern for allocating uninitialized blocks in i_size extending write > >> >> >> is that we may end up having uninitialized blocks beyond i_size > >> >> >> if the system crashes during write. Can we protect this case by > >> >> >> adding the inode into the orphan list in ext4_ext_direct_IO, > >> >> >> i.e., same as we have done in ext4_ind_direct_IO? > >> >> >> > >> >> > > >> >> > Sure we could do that, though initially I hoped we could get rid of > >> >> > that:) > >> >> > > >> >> > The tricky part is async direct write to the end of file. By the time > >> >> > the IO is completed, the inode may be truncated or extended larger. > >> >> > Updating the most "safe" size is the part I haven't thought through yet. > >> >> > > >> >> > >> >> Ok. I think I understand the problem better now :). > >> >> > >> >> Looking at the __blockdev_direct_IO(), I saw it actually treats > >> >> size-extending aio dio write as synchronous and expects the dio to > >> >> complete before return (fs/direct-io.c line 1204 and line 1056-1061). > >> > > >> > Oh? It seems it will keep the write async as long as it's not a partial > >> > write > >> > /* > >> > * The only time we want to leave bios in flight is when a successful > >> > * partial aio read or full aio write have been setup. In that case > >> > * bio completion will call aio_complete. The only time it's safe to > >> > * call aio_complete is when we return -EIOCBQUEUED, so we key on that. > >> > * This had *better* be the only place that raises -EIOCBQUEUED. > >> > */ > >> > BUG_ON(ret == -EIOCBQUEUED); > >> > if (dio->is_async && ret == 0 && dio->result && > >> > ((rw & READ) || (dio->result == dio->size))) > >> > ret = -EIOCBQUEUED; > >> > > >> > if (ret != -EIOCBQUEUED) > >> > dio_await_completion(dio); > >> > > >> > >> If I read the code correctly, dio->is_async is not set for file extending write: > >> > >> /* > >> * For file extending writes updating i_size before data > >> * writeouts complete can expose uninitialized blocks. So > >> * even for AIO, we need to wait for i/o to complete before > >> * returning in this case. > >> */ > >> dio->is_async = !is_sync_kiocb(iocb) && !((rw & WRITE) && > >> (end > i_size_read(inode))); > >> > > > > Ah... that's a little surprise to know. Async write to the end of file > > is actually synchonous. That explains why ext3 async direct IO could > > always expand inode i_size safely without having to worry about expose > > stale data out. > > > > It would be better to fix this with end_io callback, user expects get > > higher performance via async mode. > > > > But, these are seperate from removing i_mutex lock from direct IO read, > > IMHO. The dio write to end of file won't extend inode size until the IO > > is completed (AIO or non AIO mode). So if we remove the i_mutex lock > > from the dio read, in the case of parallel dio/buffered write to end of > > file and direct io read, it won't expose updated inode's block mapping > > tree to parallel direct IO read, as the i_size hasn't been changed yet. > > > > The only concern to remove the the lock from dio read path, is the > > buffered write to the hole. Isn't it? > > > > Hmm. Will we be safe between dio read and size-extending buffer > write if we don't hold i_mutex lock during dio read but still allocating > initialized blocks in size-extending buffer write? As I understand, we > update file size in ext4_write_end. The buffer write have finished at > this time but the changes are still in memory. 
Currently, we prevent > dio read from reading stale data by calling filemap_write_and_wait_range() > after holding the i_mutex lock in __blockdev_direct_IO(). If we no > longer hold i_mutex during dio read, a buffer write can finish after > a dio read calls filemap_write_and_wait_range() but before it calls > get_block(). At the time that dio read calls get_block(), the i_size > of the file has been updated, but the written data is still in memory, > so it may read stale data. > If the page was dirtied after the direct read called filemap_write_and_wait_range(), then yes, there is a window in which the file gets extended and the direct read sees the blocks being allocated. This window is much smaller with delayed allocation, as data writeout and allocation happen at the same time. If the page was dirtied (i.e., by a buffered write to that file) before the direct IO read, filemap_write_and_wait_range() will write out that dirty page and ensure the block allocation and i_size update have all completed before we get back to the direct IO read. > I was thinking that we may change to call filemap_write_and_wait_range() > after get_block() during dio read. This seems to eliminate the above > problem. That would impact the performance, I think. > However, I am also worried about non-journal ext4. With the > current code, I think there is a chance that we can expose stale data > after reboot if the system crashed at a bad time. I think always > converting uninitialized blocks to initialized during end_io time may > help to close this security hole in non-journal ext4. > > Jiaying > > >> Jiaying > >> > >> >> Can we follow the same rule and check whether it is a size-extending > >> >> aio write in ext4_end_io_dio()? In such cases, we can call > >> >> ext4_end_aio_dio_nolock() synchronously instead of queuing > >> >> the work. I think this will protect us from truncate because we > >> >> are still holding i_mutex and i_alloc_sem. > >> >> > >> >> Jiaying > >> >> > >> >> > > >> >> > > >> >> >> Jiaying > >> >> >> > >> >> >> > > >> >> >> >> We are wondering > >> >> >> >> whether we can extend this idea to buffer write as well. I.e., we always > >> >> >> >> allocate an uninitialized extent first during any write and convert it > >> >> >> >> as initialized at the time of end_io callback. This will eliminate the need > >> >> >> >> to hold i_mutex lock during direct read because a DIO read should never get > >> >> >> >> a block marked initialized before the block has been written with new data. > >> >> >> >> > >> >> >> > > >> >> >> > Oh I don't think so. For buffered IO, the data is being copied to > >> >> >> > buffer, direct IO read would first flush what's in page cache to disk, > >> >> >> > then read from disk. So if there is concurrent buffered write and direct > >> >> >> > read, removing the i_mutex locks from the direct IO path should still > >> >> >> > gurantee the right order, without having to treat buffered allocation > >> >> >> > with uninitialized extent/end_io. > >> >> >> > > >> >> >> > The i_mutex lock, from my understanding, is there to protect direct IO > >> >> >> > write to hole and concurrent direct IO read, we should able to remove > >> >> >> > this lock for extent based ext4 file. > >> >> >> > > >> >> >> >> We haven't implemented anything yet because we want to ask here first to > >> >> >> >> see whether this proposal makes sense to you. > >> >> >> >> > >> >> >> > > >> >> >> > It does make sense to me.
> >> >> >> > > >> >> >> > Mingming > >> >> >> >> Regards, > >> >> >> >> > >> >> >> >> Jiaying > >> >> >> >> -- > >> >> >> >> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > >> >> >> >> the body of a message to majordomo@vger.kernel.org > >> >> >> >> More majordomo info at http://vger.kernel.org/majordomo-info.html > >> >> >> > > >> >> >> > > >> >> >> > > >> >> > > >> >> > > >> >> > > >> >> -- > >> >> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > >> >> the body of a message to majordomo@vger.kernel.org > >> >> More majordomo info at http://vger.kernel.org/majordomo-info.html > >> > > >> > > >> > > > > > > > > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2009-10-21 22:48 UTC | newest]

Thread overview: 18+ messages
2009-10-09 23:34 ext4 DIO read performance issue on SSD Jiaying Zhang
2009-10-14 18:48 ` Mingming
2009-10-14 19:48 ` Jiaying Zhang
2009-10-14 20:57 ` Mingming
2009-10-14 21:42 ` Jiaying Zhang
2009-10-15 17:27 ` Mingming
2009-10-16  1:27 ` Jiaying Zhang
2009-10-16 19:15 ` Theodore Tso
2009-10-20  1:26 ` Jiaying Zhang
2009-10-19 19:04 ` Mingming
2009-10-15  5:14 ` Jiaying Zhang
2009-10-15 17:31 ` Mingming
2009-10-15 20:07 ` Jiaying Zhang
2009-10-15 23:28 ` Mingming
2009-10-15 23:33 ` Jiaying Zhang
2009-10-16 18:56 ` Mingming
2009-10-16 19:44 ` Jiaying Zhang
2009-10-19 20:23 ` Mingming