* EXT4 nodelalloc => back to stone age.
From: Dmitry Monakhov @ 2013-04-01 11:06 UTC
To: ext4 development; +Cc: linux-fsdevel, axboe, Jan Kara
I've mounted ext4 with -o nodelalloc on my SSD (INTEL SSDSA2CW120G3, 4PC10362).
It shows numbers that are slower than an HDD produced 15 years ago:
#mount $SCRATCH_DEV $SCRATCH_MNT -onodelalloc
# dd if=/dev/zero of=/mnt_scratch/file bs=1M count=1024 conv=fsync,notrunc
1073741824 bytes (1.1 GB) copied, 46.7948 s, 22.9 MB/s
# dd if=/dev/zero of=/mnt_scratch/file bs=1M count=1024 conv=fsync,notrunc
1073741824 bytes (1.1 GB) copied, 41.2717 s, 26.0 MB/s
blktrace shows horrible traces:
253,1 0 11 0.004965203 13618 Q WS 1219360 + 8 [jbd2/dm-1-8]
253,1 0 11 0.004965203 13618 Q WS 1219360 + 8 [jbd2/dm-1-8]
253,1 0 11 0.004965203 13618 Q WS 1219360 + 8 [jbd2/dm-1-8]
253,1 0 11 0.004965203 13618 Q WS 1219360 + 8 [jbd2/dm-1-8]
253,1 1 39 0.004983642 0 C WS 1219344 + 8 [0]
253,1 1 39 0.004983642 0 C WS 1219344 + 8 [0]
253,1 1 39 0.004983642 0 C WS 1219344 + 8 [0]
253,1 1 39 0.004983642 0 C WS 1219344 + 8 [0]
253,1 1 40 0.005082898 0 C WS 1219352 + 8 [0]
253,1 1 40 0.005082898 0 C WS 1219352 + 8 [0]
253,1 1 40 0.005082898 0 C WS 1219352 + 8 [0]
253,1 1 40 0.005082898 0 C WS 1219352 + 8 [0]
253,1 3 12 0.005106049 2580 Q W 1219368 + 8 [flush-253:1]
253,1 3 12 0.005106049 2580 Q W 1219368 + 8 [flush-253:1]
253,1 3 12 0.005106049 2580 Q W 1219368 + 8 [flush-253:1]
253,1 3 12 0.005106049 2580 Q W 1219368 + 8 [flush-253:1]
253,1 2 17 0.005197143 13750 Q WS 1219376 + 8 [dd]
253,1 2 17 0.005197143 13750 Q WS 1219376 + 8 [dd]
253,1 2 17 0.005197143 13750 Q WS 1219376 + 8 [dd]
253,1 2 17 0.005197143 13750 Q WS 1219376 + 8 [dd]
253,1 1 41 0.005199871 0 C WS 1219360 + 8 [0]
253,1 1 41 0.005199871 0 C WS 1219360 + 8 [0]
253,1 1 41 0.005199871 0 C WS 1219360 + 8 [0]
253,1 1 41 0.005199871 0 C WS 1219360 + 8 [0]
As one can see, data is written from two threads (dd and jbd2) on a per-page basis,
and jbd2 submits pages with WRITE_SYNC, i.e. we write page-by-page
synchronously :)
Exact calltrace:
journal_submit_inode_data_buffers
  wbc.sync_mode = WB_SYNC_ALL
  ->generic_writepages
    ->write_cache_pages
      ->ext4_writepage
        ->ext4_bio_write_page
          ->io_submit_add_bh
            ->io_submit_init
              io->io_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
          ->ext4_io_submit(io);
1) Do we really have to use WRITE_SYNC in the WB_SYNC_ALL case?
Why is the blk_finish_plug(&plug) that is called from generic_writepages()
not enough? As far as I can see this code was copy-pasted from XFS,
and DIO also tags bios with WRITE_SYNC; but what happens if the file
is highly fragmented (or the block device is RAID0)? We will end up
doing synchronous IO.
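
For reference, here is roughly what generic_writepages() does around the
page walk (a minimal sketch approximating ~3.x mm/page-writeback.c;
details may differ by kernel version):

    /* Sketch: the plug batches the per-page bios so the block layer
     * can merge adjacent requests when the plug is flushed. */
    int generic_writepages(struct address_space *mapping,
                           struct writeback_control *wbc)
    {
            struct blk_plug plug;
            int ret;

            /* deal with chardevs and other special files */
            if (!mapping->a_ops->writepage)
                    return 0;

            blk_start_plug(&plug);
            ret = write_cache_pages(mapping, wbc, __writepage, mapping);
            blk_finish_plug(&plug); /* queued bios submitted/merged here */
            return ret;
    }

i.e. the question is whether this plug alone should already give the
block layer enough batching, without tagging every page WRITE_SYNC.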
2) Why don't we have writepages() for the non-delalloc case?

I want to fix (2) by implementing writepages() for the non-delalloc case.
Once that is done we may add a new flag, WB_SYNC_NOALLOC, so
journal_submit_inode_data_buffers() will use
__filemap_fdatawrite_range(, , , WB_SYNC_ALL | WB_SYNC_NOALLOC),
which will call the optimized ->ext4_writepages().
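
A rough sketch of the proposed plumbing (entirely hypothetical:
WB_SYNC_NOALLOC and the .no_alloc field below do not exist, and the
names are placeholders for the proposal):

    /* Hypothetical sketch.  jbd2 currently uses generic_writepages()
     * here, which goes page by page through ->writepage(). */
    static int journal_submit_inode_data_buffers(struct address_space *mapping)
    {
            struct writeback_control wbc = {
                    .sync_mode   = WB_SYNC_ALL,
                    .no_alloc    = 1, /* hypothetical WB_SYNC_NOALLOC hint */
                    .nr_to_write = mapping->nrpages * 2,
                    .range_start = 0,
                    .range_end   = i_size_read(mapping->host),
            };

            /* With the hint set, ext4 could route this to a ->writepages()
             * that maps the already-allocated blocks and submits large,
             * mergeable bios instead of one page at a time. */
            return do_writepages(mapping, &wbc);
    }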
* Re: EXT4 nodelalloc => back to stone age.
From: Eric Sandeen @ 2013-04-01 15:18 UTC
To: Dmitry Monakhov; +Cc: ext4 development, linux-fsdevel, axboe, Jan Kara
On 4/1/13 6:06 AM, Dmitry Monakhov wrote:
> 1) Do we really have to use WRITE_SYNC in the WB_SYNC_ALL case?
...
> 2) Why don't we have writepages() for the non-delalloc case?
...
I'd add:
3) Why do we have a "nodelalloc" mount option at all?
but then I thought:
Is it also this bad when using the ext4 driver to run an ext3 fs?
-Eric
* Re: EXT4 nodelalloc => back to stone age.
From: Theodore Ts'o @ 2013-04-01 15:39 UTC
To: Eric Sandeen
Cc: Dmitry Monakhov, ext4 development, linux-fsdevel, axboe, Jan Kara
On Mon, Apr 01, 2013 at 10:18:51AM -0500, Eric Sandeen wrote:
> I'd add:
>
> 3) Why do we have a "nodelalloc" mount option at all?
>
> but then I thought:
>
> Is it also this bad when using the ext4 driver to run an ext3 fs?
Yes, and there would be a similar performance problem if you are
using the ext3 file system driver, since ext3_*_writepage() also ends
up calling block_write_full_page(), which will also result in the
writes happening with WRITE_SYNC.
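
(The same WB_SYNC_ALL-to-WRITE_SYNC decision sits in the generic buffer
path - approximately, from fs/buffer.c of that era:

    /* __block_write_full_page(), approximate ~3.x fs/buffer.c: every
     * buffer of the page gets the same WRITE_SYNC tagging under
     * WB_SYNC_ALL that ext4_bio_write_page() applies. */
    int write_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
    ...
    submit_bh(write_op, bh);

which is why the ext3-style writepage path hits the same behavior.)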
The main reason why we keep nodelalloc at this point is bug-for-bug
compatibility with ext3 file systems --- basically, for users who are
using this as a workaround for the O_PONIES issue instead of fixing
their applications to use fsync() appropriately.
So another question is: how much do we care about exact emulation of
ext3's behaviour for those distributions that wish to use the ext4
file system driver for ext2 and ext3 file systems?
One of the reasons for keeping nodelalloc mode was the argument that
removing it wouldn't really allow us to remove that much complexity
from ext4. But adding a nodelalloc-specific ext4_writepages() would
add a huge amount of complexity, and my first reaction is that it's
really not worth the code maintenance headache.
Dmitry, is there a reason why you are especially worried about the
performance of nodelalloc mode?
- Ted
* Re: EXT4 nodelalloc => back to stone age.
From: Chris Mason @ 2013-04-01 15:45 UTC
To: Eric Sandeen, Dmitry Monakhov
Cc: ext4 development, linux-fsdevel@vger.kernel.org, axboe@kernel.dk, Jan Kara
Quoting Eric Sandeen (2013-04-01 11:18:51)
> On 4/1/13 6:06 AM, Dmitry Monakhov wrote:
>
> > 1) Do we really have to use WRITE_SYNC in the WB_SYNC_ALL case?
Yes? The stuff we wait on should be WRITE_SYNC.
>
> ...
>
> > 2) Why don't we have writepages() for the non-delalloc case?
>
> ...
>
> I'd add:
>
> 3) Why do we have a "nodelalloc" mount option at all?
>
> but then I thought:
>
> Is it also this bad when using the ext4 driver to run an ext3 fs?
Quick comparison on a single iodrive:
Ext4 (defaults):
# dd if=/dev/zero of=foo bs=1M count=1024 conv=fsync,notrunc
1073741824 bytes (1.1 GB) copied, 1.95442 s, 549 MB/s
# dd if=/dev/zero of=foo bs=1M count=1024 conv=fsync,notrunc
1073741824 bytes (1.1 GB) copied, 1.45012 s, 740 MB/s
Ext4 (nodelalloc):
# dd if=/dev/zero of=foo bs=1M count=1024 conv=fsync,notrunc
1073741824 bytes (1.1 GB) copied, 2.97308 s, 361 MB/s
# dd if=/dev/zero of=foo bs=1M count=1024 conv=fsync,notrunc
1073741824 bytes (1.1 GB) copied, 1.76617 s, 608 MB/s
# dd if=/dev/zero of=foo bs=1M count=1024 conv=fsync,notrunc
XFS gives 628, 733MB/s
Btrfs gives 659, 635MB/s -- since we're doing fsync, this includes all
the crcs for the data.
Ext3 mounted by ext4.ko: 291, 467MB/s
-chris
* Re: EXT4 nodelalloc => back to stone age.
From: Chris Mason @ 2013-04-01 15:57 UTC
To: Chris Mason, Eric Sandeen, Dmitry Monakhov
Cc: ext4 development, linux-fsdevel@vger.kernel.org, axboe@kernel.dk, Jan Kara
Quoting Chris Mason (2013-04-01 11:45:41)
> Quoting Eric Sandeen (2013-04-01 11:18:51)
> > On 4/1/13 6:06 AM, Dmitry Monakhov wrote:
> >
> > > 1) Do we really have to use WRITE_SYNC in the WB_SYNC_ALL case?
>
> Yes? The stuff we wait on should be WRITE_SYNC.
>
> >
> > ...
> >
> > > 2) Why don't we have writepages() for the non-delalloc case?
> >
> > ...
> >
> > I'd add:
> >
> > 3) Why do we have a "nodelalloc" mount option at all?
> >
> > but then I thought:
> >
> > Is it also this bad when using the ext4 driver to run an ext3 fs?
>
> Quick comparison on a single iodrive:
On the theory that writepages is the problem, try echo 1 >
/sys/block/xxx/queue/rotational. With request merging on, here in
nodelalloc mode:
dd if=/dev/zero of=foo bs=1M count=1024 conv=fsync,notrunc
1073741824 bytes (1.1 GB) copied, 2.53741 s, 423 MB/s
dd if=/dev/zero of=foo bs=1M count=1024 conv=fsync,notrunc
1073741824 bytes (1.1 GB) copied, 1.37795 s, 779 MB/s
-chris
* Re: EXT4 nodelalloc => back to stone age.
From: Eric Sandeen @ 2013-04-01 16:00 UTC
To: Theodore Ts'o
Cc: Dmitry Monakhov, ext4 development, linux-fsdevel, axboe, Jan Kara
On 4/1/13 10:39 AM, Theodore Ts'o wrote:
> On Mon, Apr 01, 2013 at 10:18:51AM -0500, Eric Sandeen wrote:
>> I'd add:
>>
>> 3) Why do we have a "nodelalloc" mount option at all?
>>
>> but then I thought:
>>
>> Is it also this bad when using the ext4 driver to run an ext3 fs?
>
> Yes, and there would be a similar performance problem if you are
> using the ext3 file system driver, since ext3_*_writepage() also ends
> up calling block_write_full_page(), which will also result in the
> writes happening with WRITE_SYNC.
> The main reason why we keep nodelalloc at this point is bug-for-bug
> compatibility with ext3 file systems --- basically, for users who are
> using this as a workaround for the O_PONIES issue instead of fixing
> their applications to use fsync() appropriately.
Sorry for getting off the original thread here, but IMHO these are
two different things:
non-delalloc behavior makes sense for ext3, but
the -o nodelalloc mount option doesn't make sense for ext4.
> So another question is: how much do we care about exact emulation of
> ext3's behaviour for those distributions that wish to use the ext4
> file system driver for ext2 and ext3 file systems?
>
> One of the reasons for keeping nodelalloc mode was the argument
> that removing it wouldn't really allow us to remove that much
> complexity from ext4.
IMHO we should keep the mode for ext2/3, but lose the ext4 option.
It'd just be one less row in the ext4 test matrix.
-Eric
> But adding a nodelalloc-specific ext4_writepages() would add a huge
> amount of complexity, and my first reaction is that it's really not
> worth the code maintenance headache.
> Dmitry, is there a reason why you are especially worried about the
> performance of nodelalloc mode?
>
> - Ted
>
* Re: EXT4 nodelalloc => back to stone age.
From: Zheng Liu @ 2013-04-01 16:34 UTC
To: Eric Sandeen
Cc: Theodore Ts'o, Dmitry Monakhov, ext4 development, linux-fsdevel, axboe, Jan Kara
Hi Eric,
On 04/02/2013 12:00 AM, Eric Sandeen wrote:
> On 4/1/13 10:39 AM, Theodore Ts'o wrote:
>> On Mon, Apr 01, 2013 at 10:18:51AM -0500, Eric Sandeen wrote:
>>> I'd add:
>>>
>>> 3) Why do we have a "nodelalloc" mount option at all?
>>>
>>> but then I thought:
>>>
>>> Is it also this bad when using the ext4 driver to run an ext3 fs?
>>
>> Yes, and there would be a similar performance problem if you are
>> using the ext3 file system driver, since ext3_*_writepage() also ends
>> up calling block_write_full_page(), which will also result in the
>> writes happening with WRITE_SYNC.
>
>> The main reason why we keep nodelalloc at this point is bug-for-bug
>> compatibility with ext3 file systems --- basically, for users who are
>> using this as a workaround for the O_PONIES issue instead of fixing
>> their applications to use fsync() appropriately.
>
> Sorry for getting off the original thread here, but IMHO these are
> two different things:
>
> non-delalloc behavior makes sense for ext3, but
> the -o nodelalloc mount option doesn't make sense for ext4.
nodelalloc makes sense to me. In our production system, we hit a
latency problem caused by the delalloc feature. The workload is a web
app that does some append writes (approximately 5MB/s) and waits for
the flusher to write them out. We observe that every 30 seconds the
latency reaches a high level (approximately 100-200ms or higher,
versus 10-20ms normally). The reason is that when the flusher tries to
write dirty pages out, it takes the i_data_sem lock (write lock) and
allocates blocks for these dirty pages. But in the meantime the app
does some append write(2)s that also try to take the i_data_sem lock
(read lock), so the app gets delayed. So I think nodelalloc is still
useful for us.
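
To illustrate the contention, a simplified sketch (not the literal
ext4 call paths):

    /* flusher thread, writing out delayed-allocated pages: */
    down_write(&EXT4_I(inode)->i_data_sem); /* exclusive: allocating blocks */
    /* ... allocate extents and map the dirty pages ... */
    up_write(&EXT4_I(inode)->i_data_sem);

    /* app thread, append write(2) needing a lookup/reservation: */
    down_read(&EXT4_I(inode)->i_data_sem);  /* stalls behind the flusher */
    /* ... */
    up_read(&EXT4_I(inode)->i_data_sem);

With nodelalloc the blocks are allocated at write(2) time, so the
flusher does not hold i_data_sem for long allocation passes.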
Regards,
- Zheng
* Re: EXT4 nodelalloc => back to stone age.
From: Jan Kara @ 2013-04-02 13:46 UTC
To: Dmitry Monakhov; +Cc: ext4 development, linux-fsdevel, axboe, Jan Kara
On Mon 01-04-13 15:06:18, Dmitry Monakhov wrote:
>
> I've mounted ext4 with -o nodelalloc on my SSD (INTEL SSDSA2CW120G3, 4PC10362).
> It shows numbers that are slower than an HDD produced 15 years ago:
> #mount $SCRATCH_DEV $SCRATCH_MNT -onodelalloc
> # dd if=/dev/zero of=/mnt_scratch/file bs=1M count=1024 conv=fsync,notrunc
> 1073741824 bytes (1.1 GB) copied, 46.7948 s, 22.9 MB/s
> # dd if=/dev/zero of=/mnt_scratch/file bs=1M count=1024 conv=fsync,notrunc
> 1073741824 bytes (1.1 GB) copied, 41.2717 s, 26.0 MB/s
> blktrace shows horrible traces:
> 253,1 0 11 0.004965203 13618 Q WS 1219360 + 8 [jbd2/dm-1-8]
> 253,1 0 11 0.004965203 13618 Q WS 1219360 + 8 [jbd2/dm-1-8]
> 253,1 0 11 0.004965203 13618 Q WS 1219360 + 8 [jbd2/dm-1-8]
> 253,1 0 11 0.004965203 13618 Q WS 1219360 + 8 [jbd2/dm-1-8]
> 253,1 1 39 0.004983642 0 C WS 1219344 + 8 [0]
> 253,1 1 39 0.004983642 0 C WS 1219344 + 8 [0]
> 253,1 1 39 0.004983642 0 C WS 1219344 + 8 [0]
> 253,1 1 39 0.004983642 0 C WS 1219344 + 8 [0]
> 253,1 1 40 0.005082898 0 C WS 1219352 + 8 [0]
> 253,1 1 40 0.005082898 0 C WS 1219352 + 8 [0]
> 253,1 1 40 0.005082898 0 C WS 1219352 + 8 [0]
> 253,1 1 40 0.005082898 0 C WS 1219352 + 8 [0]
> 253,1 3 12 0.005106049 2580 Q W 1219368 + 8 [flush-253:1]
> 253,1 3 12 0.005106049 2580 Q W 1219368 + 8 [flush-253:1]
> 253,1 3 12 0.005106049 2580 Q W 1219368 + 8 [flush-253:1]
> 253,1 3 12 0.005106049 2580 Q W 1219368 + 8 [flush-253:1]
> 253,1 2 17 0.005197143 13750 Q WS 1219376 + 8 [dd]
> 253,1 2 17 0.005197143 13750 Q WS 1219376 + 8 [dd]
> 253,1 2 17 0.005197143 13750 Q WS 1219376 + 8 [dd]
> 253,1 2 17 0.005197143 13750 Q WS 1219376 + 8 [dd]
> 253,1 1 41 0.005199871 0 C WS 1219360 + 8 [0]
> 253,1 1 41 0.005199871 0 C WS 1219360 + 8 [0]
> 253,1 1 41 0.005199871 0 C WS 1219360 + 8 [0]
> 253,1 1 41 0.005199871 0 C WS 1219360 + 8 [0]
Hum, not sure why you see all the events 4x. But that's not important I
guess.
> As one can see, data is written from two threads (dd and jbd2) on a per-page basis,
> and jbd2 submits pages with WRITE_SYNC, i.e. we write page-by-page
> synchronously :)
>
> Exact calltrace:
> journal_submit_inode_data_buffers
>   wbc.sync_mode = WB_SYNC_ALL
>   ->generic_writepages
>     ->write_cache_pages
>       ->ext4_writepage
>         ->ext4_bio_write_page
>           ->io_submit_add_bh
>             ->io_submit_init
>               io->io_op = (wbc->sync_mode == WB_SYNC_ALL ? WRITE_SYNC : WRITE);
>           ->ext4_io_submit(io);
>
> 1) Do we really have to use WRITE_SYNC in the WB_SYNC_ALL case?
Actually, WRITE_SYNC doesn't mean we write synchronously. We just tell the
IO scheduler that we are going to wait for the IO to complete soon, so it
prioritizes these writes over other async writes. We don't have to use
WRITE_SYNC, but in this case we really do pretty much what the IO scheduler
people want - flag IO that's going to be waited upon.
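
(Concretely, in kernels of that era WRITE_SYNC is just a normal write
tagged with scheduler hints - approximately:

    /* include/linux/fs.h, approximate ~3.x definition: REQ_SYNC and
     * REQ_NOIDLE are IO-scheduler hints, not a request for
     * synchronous completion. */
    #define WRITE_SYNC      (WRITE | REQ_SYNC | REQ_NOIDLE)
)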
> Why is the blk_finish_plug(&plug) that is called from generic_writepages()
> not enough? As far as I can see this code was copy-pasted from XFS,
> and DIO also tags bios with WRITE_SYNC; but what happens if the file
> is highly fragmented (or the block device is RAID0)? We will end up
> doing synchronous IO.
I see you are tracing the DM device. That may actually be somewhat
confusing, since you are missing some actions like request merges and
dispatches to the underlying device.
> 2) Why don't we have writepages() for the non-delalloc case?
>
> I want to fix (2) by implementing writepages() for the non-delalloc case.
> Once that is done we may add a new flag, WB_SYNC_NOALLOC, so
> journal_submit_inode_data_buffers() will use
> __filemap_fdatawrite_range(, , , WB_SYNC_ALL | WB_SYNC_NOALLOC),
> which will call the optimized ->ext4_writepages().
So what would you expect from a ->writepages() implementation?
Anyway, the throughput you see looks bad. What kernel version are you using?
There's a possibility my recent changes to ext4_writepage() could have
slowed something down...
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR