Re: Aw: Re: Ext4: Slow performance on first write after mount

* Re: Aw: Re: Ext4: Slow performance on first write after mount
@ 2013-05-18 22:34 frankcmoeller
  0 siblings, 0 replies; 9+ messages in thread
From: frankcmoeller @ 2013-05-18 22:34 UTC (permalink / raw)
  To: linux-ext4

Hi Andrei,

Thanks for the information! I didn't know that it comes to around 32 MB of data for a 1 TB disk.

Regarding bigalloc: I read on the ext4 website (https://ext4.wiki.kernel.org/index.php/Bigalloc) this:
"The bigalloc feature first appeared in the v3.2 kernel. As of this writing (in the v3.7 kernel) 
bigalloc still has some problems if the delayed allocation is enabled, especially if the file 
system is close to full."
Is bigalloc really stable? Since when has it been stable? Were there major bugs in some versions?
I ask because the software we use (OpenPli) runs different kernel versions on different boxes.
Some boxes use the 3.8.7 kernel, some 3.3.8 and so on (this cannot be changed because of closed-source
drivers).

Is an ext4 bigalloc partition resizable? I saw a bug report and a patch regarding this in January 2013.
If it works well, I could shrink my partition, create a new bigalloc one, move the files over and then
resize again. Or is a reformat the only possibility?

Regards,
Frank

----- Original Message -----
From:    "Sidorov, Andrei" <Andrei.Sidorov@arrisi.com>
To:      "frankcmoeller@arcor.de" <frankcmoeller@arcor.de>
Date:    18.05.2013 22:34
Subject: Re: Aw: Re: Ext4: Slow performance on first write after mount

> Frank,
> 
> Well, the main point was to use bigalloc. Unfortunately it requires
> a reformat.
> Without bigalloc there will be ~7800 block groups for a 1T drive. Those
> groups take 32M of on-disk data and up to 64M when it comes to RAM
> because of the runtime buddy bitmaps. I don't think it is worth storing
> buddy bitmaps on the drive. It's no surprise that it can take a long
> time to read lots of block bitmaps scattered over the drive and
> construct buddies out of them. And it's no surprise that some of these
> pages are evicted under high memory pressure.
> With bigalloc and a 1M cluster size you get 256 times less metadata
> (128K instead of 32M) and you get all the benefits of faster allocate
> and truncate and less fragmentation.
> 
> Yes, you don't know the file size in advance, but speculatively
> preallocating, say, 128M at a time is clearly a benefit. Truncate to the
> real file size once recording has finished to release the unused
> preallocated space.
> There are some caveats with O_DIRECT, but it is faster if done correctly.
> 
> Regards,
> Andrei.
> 
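
The metadata figures quoted above can be checked with a little arithmetic. This is only a rough sketch: it assumes 4 KiB blocks, one bitmap block per group (each bit covering one block, or one cluster with bigalloc) and a drive of exactly 1 TiB, so the group count comes out slightly higher than the ~7800 quoted. The function name is made up for illustration.

```python
BLOCK_SIZE = 4096  # bytes; assumed ext4 block size


def bitmap_footprint(fs_bytes, cluster_bytes=BLOCK_SIZE):
    """Return (number of block groups, total bytes of block bitmaps)."""
    bits_per_bitmap = BLOCK_SIZE * 8               # one bitmap block per group
    group_bytes = bits_per_bitmap * cluster_bytes  # space covered by one group
    groups = -(-fs_bytes // group_bytes)           # ceiling division
    return groups, groups * BLOCK_SIZE


TIB = 1 << 40
print(bitmap_footprint(TIB))           # plain 4 KiB blocks
print(bitmap_footprint(TIB, 1 << 20))  # bigalloc with 1 MiB clusters
```

Under these assumptions the first call gives 8192 groups and 32 MiB of bitmaps, the second 32 groups and 128 KiB, which is the factor of 256 mentioned in the quote.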


[parent not found: <D1047C91-765D-4EBD-A6CC-869DF0D5AD90@dilger.ca>]

* Ext4: Slow performance on first write after mount
@ 2013-05-17 16:51 ` frankcmoeller
  2013-05-19 14:00   ` Theodore Ts'o
  0 siblings, 1 reply; 9+ messages in thread
From: frankcmoeller @ 2013-05-17 16:51 UTC (permalink / raw)
  To: linux-ext4

Hi,

we're using ext4 on satellite boxes (e.g. XTrend et9200) with mounted hard disks. The receiver runs Linux with kernel version 3.8.7. Some users (like me) have problems with the first recording right after boot. Most of them have big partitions (1-2 TB) and high disk usage (over 50%). In this case the application signals a buffer overflow (the buffer is 4 MB). We found out that one of the first writes after boot or remount is very slow. I have debugged it. The test case was a umount, then a mount, and then a write of 64 MB of data to the disk.
The problem is the initialization of the buffer cache, which takes very long (ext4_mb_init_group, called from ext4_mb_good_group). In my case it loads around 1300 groups per second (with the patch which avoids loading full groups). The beginning of my disk is quite full, so it needs to read around 8200 groups to find a "good" one. This takes over 6 seconds. Here is the output:
May 10 02:06:15 et9x00 user.info kernel: EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator start                         time: 4284161251798
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before for group loop cr: 0
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 0  time: 4284161318465
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after allocate buffer_heads;         time: 4284161355983
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_read_block_bitmap_nowait  time: 4284161440835
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_wait_block_bitmap         time: 4284167134687
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after allocate buffer_heads;         time: 4284167180243
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_read_block_bitmap_nowait  time: 4284167198909
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_wait_block_bitmap         time: 4284167212835
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 1  time: 4284167260724
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 2  time: 4284167276576
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 3  time: 4284167291205
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 4  time: 4284167305798
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 5  time: 4284167320280
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 6  time: 4284167334835
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 7  time: 4284167349317
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8  time: 4284167363909
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 9  time: 4284167378391
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 10  time: 4284167392872
...
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8240  time: 4290297430389
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after allocate buffer_heads;         time: 4290297464612
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_read_block_bitmap_nowait  time: 4290297521464
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_wait_block_bitmap         time: 4290310304019
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after allocate buffer_heads;         time: 4290310346352
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_read_block_bitmap_nowait  time: 4290310363908
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_wait_block_bitmap         time: 4290310377834
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8241  time: 4290310425945
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8242  time: 4290310443241
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8243  time: 4290310458352
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8244  time: 4290310473204
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8245  time: 4290310488056
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8246  time: 4290310503167
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8247  time: 4290310517982
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8248  time: 4290310533093
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8249  time: 4290310547945
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8250  time: 4290310562797
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8251  time: 4290310577871
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8252  time: 4290310592945
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8253  time: 4290310608019
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8254  time: 4290310622871
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8255  time: 4290310637723
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after allocate buffer_heads;         time: 4290310668278
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_read_block_bitmap_nowait  time: 4290310739464
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_wait_block_bitmap         time: 4290310979093
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after allocate buffer_heads;         time: 4290311058093
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_read_block_bitmap_nowait  time: 4290311077538
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_wait_block_bitmap         time: 4290311091167
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator after good group;  group: 8255  time: 4290311137649
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator end                           time: 4290311168501

Don't be surprised by the output format: I couldn't activate kernel tracing on the satellite box, perhaps because of the 2 closed-source kernel modules which we need to use. Instead, I added some ext4_msg statements to mballoc.c.
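
The stall and the scan rate can be read directly off the timestamps in the output above. The sketch below copies the first and the last two allocator lines from that log and assumes the `time:` values are nanoseconds, which matches the roughly six seconds between the syslog timestamps. The helper names are illustrative only.

```python
import re

# First and last allocator messages, copied from the debug output above.
LOG = """\
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator start  time: 4284161251798
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator after good group;  group: 8255  time: 4290311137649
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator end  time: 4290311168501
"""


def timestamps(log):
    """All 'time:' values in the log, in order of appearance."""
    return [int(t) for t in re.findall(r"time:\s*(\d+)", log)]


def last_group(log):
    """Highest group number mentioned in the log."""
    return max(int(g) for g in re.findall(r"group:\s*(\d+)", log))


times = timestamps(LOG)
elapsed_ns = times[-1] - times[0]      # assumed to be nanoseconds
groups_scanned = last_group(LOG) + 1   # groups 0..8255
rate = groups_scanned / (elapsed_ns / 1e9)
print(f"{elapsed_ns / 1e9:.2f} s for {groups_scanned} groups, {rate:.0f} groups/s")
```

This comes out at a little over 6 seconds and roughly 1300 to 1350 groups per second, in line with the numbers given in the description.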

If I understood correctly, this can also happen hours after the first write, because there might be some free space at the beginning of the hard disk, and once it is consumed the initialization of the buffer cache continues...
For debugging I tested with a 64 MB write, which took more than 7 seconds (the subsequent writes were much faster):
root@et9x00:~# dd if=/dev/zero of=/hdd/test.6434 bs=64M count=1
1+0 records in
1+0 records out
67108864 bytes (64.0MB) copied, 7.379178 seconds, 8.7MB/s

The application writes 188 KB blocks to disk. There, too, after boot or mount we see a few quick writes followed by writes that take several seconds.

So we have real-time data (up to 2 MB per second per recording) which needs to be written within at most 2 or 3 seconds (depending on the bitrate of the channel). We cannot have a very big buffer in the application because of limited resources, which we cannot change.
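
To make the deadline concrete, here is the headroom calculation as a small sketch. It uses only the figures given in this mail (a 4 MB buffer, up to 2 MB per second per recording, a stall of "over 6 seconds"); the function name is invented for illustration.

```python
def seconds_until_overflow(buffer_bytes, bitrate_bytes_per_s):
    """How long a full-rate stream can be buffered while no write completes."""
    return buffer_bytes / bitrate_bytes_per_s


MB = 1_000_000
headroom = seconds_until_overflow(4 * MB, 2 * MB)
stall = 6.0  # seconds; "over 6 seconds" as reported above
print(headroom, stall > headroom)
```

With a single recording at the maximum bitrate the buffer lasts about 2 seconds, so a 6 second allocator stall overflows it several times over.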

So my questions:
- What can we do to avoid this (ideally without reformatting)?
- Why do you throw away the buffer cache instead of storing it on disk during umount? The initialization of the buffer cache is quite awful for applications which need a specific write throughput.
- A workaround would be to read the whole /proc/.../mb_groups file right after every mount. Correct?
- I can try to add a mount option to initialize the cache at mount time. Would you be interested in such a patch?
- I can see (see debug output) that during buffer cache initialization the call of ext4_wait_block_bitmap in mballoc.c line 848 takes the longest time (around 1/100 of a second). Can this be improved?

Regards,
Frank


* Re: Ext4: Slow performance on first write after mount
  2013-05-17 16:51 ` frankcmoeller
@ 2013-05-19 14:00   ` Theodore Ts'o
  2013-05-20  6:39     ` Andreas Dilger
  0 siblings, 1 reply; 9+ messages in thread
From: Theodore Ts'o @ 2013-05-19 14:00 UTC (permalink / raw)
  To: frankcmoeller; +Cc: linux-ext4

On Fri, May 17, 2013 at 06:51:23PM +0200, frankcmoeller@arcor.de wrote:
> - Why do you throw away buffer cache and don't store it on disk during umount? The initialization of the buffer cache is quite awful for application which need a specific write throughput.
> - A workaround would be to read whole /proc/.../mb_groups file right after every mount. Correct?

Simply adding "cat /proc/fs/ext4/<dev>/mb_groups > /dev/null" to one of
the /etc/init.d scripts, or to /etc/rc.local, is probably the simplest
fix, yes.
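
For boxes where a script is more convenient than a shell one-liner, the same workaround can be sketched in Python. This is an illustration, not a tested tool: it assumes the file lives under /proc/fs/ext4/<dev>/mb_groups, and the function name and the `proc_root` parameter are made up (the latter only so the path can be overridden).

```python
import os


def warm_mb_groups(device, proc_root="/proc/fs/ext4"):
    """Read the whole mb_groups file for one device and discard the contents.

    Returns the number of bytes read.
    """
    path = os.path.join(proc_root, device, "mb_groups")
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(64 * 1024):  # read to EOF in 64 KiB pieces
            total += len(chunk)
    return total
```

Calling `warm_mb_groups("sda1")` right after the mount would then have the same effect as the `cat` command.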

> - I can try to add a mount option to initialize the cache at mount time. Would you be interested in such a patch?

Given the simple nature of the above workaround, it's not obvious to
me that trying to make file system format changes, or even adding a
new mount option, is really worth it.  This is especially true given
that mount -a is sequential, so if there are a large number of big file
systems, using this as a mount option would slow down the boot
significantly.  It would be better to do this in parallel, which you
could do in userspace much more easily using the "cat
/proc/fs/ext4/<dev>/mb_groups" workaround.
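
As a rough illustration of doing this in parallel from userspace, the sketch below reads the mb_groups file of several devices concurrently, one thread per device. The function names, the `proc_root` parameter and the /proc/fs/ext4 location are assumptions made for the example.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_and_discard(path):
    """Read a file to the end and return the byte count."""
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(64 * 1024):
            total += len(chunk)
    return total


def warm_all(devices, proc_root="/proc/fs/ext4"):
    """Warm every listed device concurrently; returns {device: bytes read}."""
    devices = list(devices)
    paths = [os.path.join(proc_root, dev, "mb_groups") for dev in devices]
    # One worker per device, so a slow disk does not hold up the others.
    with ThreadPoolExecutor(max_workers=max(1, len(paths))) as pool:
        sizes = list(pool.map(read_and_discard, paths))
    return dict(zip(devices, sizes))
```

The total time is then bounded by the slowest disk rather than by the sum over all disks, which is the point of not doing it inside a sequential mount -a.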

> - I can see (see debug output) that the call of ext4_wait_block_bitmap in mballoc.c line 848 takes during buffer cache initialization the longest time (some 1/100 of a second). Can this be improved?

The delay is caused purely by I/O delay, so short of replacing the HDD
with a SSD, not really....

Regards,

						- Ted


* Re: Ext4: Slow performance on first write after mount
  2013-05-19 14:00   ` Theodore Ts'o
@ 2013-05-20  6:39     ` Andreas Dilger
  2013-05-20 11:46       ` Theodore Ts'o
  0 siblings, 1 reply; 9+ messages in thread
From: Andreas Dilger @ 2013-05-20  6:39 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: frankcmoeller@arcor.de, linux-ext4@vger.kernel.org

On 2013-05-19, at 8:00, Theodore Ts'o <tytso@mit.edu> wrote:
> On Fri, May 17, 2013 at 06:51:23PM +0200, frankcmoeller@arcor.de wrote:
>> - Why do you throw away buffer cache and don't store it on disk during umount? The initialization of the buffer cache is quite awful for application which need a specific write throughput.
>> - A workaround would be to read whole /proc/.../mb_groups file right after every mount. Correct?
> 
> Simply adding "cat /proc/fs/ext4/<dev>/mb_groups > /dev/null" to one of
> the /etc/init.d scripts, or to /etc/rc.local, is probably the simplest
> fix, yes.
> 
>> - I can try to add a mount option to initialize the cache at mount time. Would you be interested in such a patch?
> 
> Given the simple nature of the above workaround, it's not obvious to
> me that trying to make file system format changes, or even adding a
> new mount option, is really worth it.  This is especially true given
> that mount -a is sequential, so if there are a large number of big file
> systems, using this as a mount option would slow down the boot
> significantly.  It would be better to do this in parallel, which you
> could do in userspace much more easily using the "cat
> /proc/fs/ext4/<dev>/mb_groups" workaround.

Since we already have a thread starting at mount time to check the
inode table zeroing, it would also be possible to co-opt this thread
for preloading the group metadata from the bitmaps. 

>> - I can see (see debug output) that the call of ext4_wait_block_bitmap in mballoc.c line 848 takes during buffer cache initialization the longest time (some 1/100 of a second). Can this be improved?
> 
> The delay is caused purely by I/O delay, so short of replacing the HDD
> with a SSD, not really....

Well, with a larger flex_bg factor at format time there will be more
bitmaps allocated together on disk, so fewer seeks needed to load
them after a new mount. We use a flex_bg factor of 256 for this
reason on our very large storage targets.
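
A back-of-the-envelope sketch of why this helps: if the bitmaps of every `flex_bg_size` consecutive groups are laid out together, the number of separate disk locations to visit shrinks accordingly. The group count of 8192 (about 1 TiB with 4 KiB blocks) and the comparison values are assumptions for illustration; 16 is, as far as I know, the usual mke2fs default, and the factor itself is set at format time with `mke2fs -G`.

```python
def bitmap_clusters(groups, flex_bg_size):
    """Number of separate on-disk locations that hold block bitmaps."""
    return -(-groups // flex_bg_size)  # ceiling division


GROUPS = 8192  # roughly a 1 TiB file system with 4 KiB blocks
for factor in (1, 16, 256):
    print(factor, bitmap_clusters(GROUPS, factor))
```

Under these assumptions the count drops from 8192 locations without flex_bg to 512 with a factor of 16 and 32 with a factor of 256.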

Cheers, Andreas


* Re: Ext4: Slow performance on first write after mount
  2013-05-20  6:39     ` Andreas Dilger
@ 2013-05-20 11:46       ` Theodore Ts'o
  2013-05-21 18:02         ` Aw: " frankcmoeller
  0 siblings, 1 reply; 9+ messages in thread
From: Theodore Ts'o @ 2013-05-20 11:46 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: frankcmoeller@arcor.de, linux-ext4@vger.kernel.org

On Mon, May 20, 2013 at 12:39:50AM -0600, Andreas Dilger wrote:
> 
> Since we already have a thread starting at mount time to check the
> inode table zeroing, it would also be possible to co-opt this thread
> for preloading the group metadata from the bitmaps. 

True.  Since I wrote my earlier post, I've also been considering the
possibility that e2fsck or the kernel should just simply issue
readahead requests for all of the bitmap blocks.  The advantage of
doing it in e2fsck is that it happens earlier.

In fact, since in e2fsck the prereads can be done in parallel, I was
even thinking about a scheme where e2fsck would synchronously force
all of the allocation blocks into the buffer cache, and then in the
kernel, we could have a loop which checks to see if the bitmap blocks
were already in cache, and if they were, to initialize the buddy
bitmaps pages.  That way, even if subsequent memory pressure were to
push the buddy bitmap pages and allocation bitmaps out of the cache,
it would mean that all of the ext4_group_info structures would be
initialized, and just having the bb_largest_free_order information
will very much help things.
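
For readers unfamiliar with the buddy scheme, here is a simplified sketch of what a "largest free order" summarizes: the largest k for which some aligned run of 2**k blocks is completely free. It is not the mballoc code, just the general buddy idea, and the bitmap convention (True means free) is an assumption of the example.

```python
def largest_free_order(free, max_order):
    """Largest k such that an aligned chunk of 2**k blocks is entirely free.

    free: sequence of booleans, one per block (True = free).
    Returns -1 if no block is free at all.
    """
    for order in range(max_order, -1, -1):
        size = 1 << order
        # Buddy chunks of a given order start at multiples of their size.
        for start in range(0, len(free) - size + 1, size):
            if all(free[start:start + size]):
                return order
    return -1


bitmap = [False] * 16
bitmap[4:8] = [True] * 4          # an aligned run of four free blocks
print(largest_free_order(bitmap, 4))
```

Once this one number per group is known, the allocator can reject a group for a large request without looking at the group's bitmap again.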

On Sun, 19 May 2013 21:36:02 +0200 (CEST) Frank C Moeller wrote:
>From my point (end user) I would prefer a builtin solution. I'm also a
>programmer and I can therefore understand why you don't want to change
>anything.

It's not that I don't want to change anything, it's that I'm very
hesitant to add new mount options or new code paths that now need more
testing unless there's no other way of addressing a particular use
case.  Another consideration is how to do it in such a way that it
doesn't degrade other users' performance.

Issuing readahead requests for the bitmap blocks might be a good
compromise; since they are readahead requests, as low priority
requests they won't interfere with anything else going on, and in
practice, unless you are starting your video recording **immediately**
after the reboot, it should address your concern.  (Also note that for
most people willing to hack a DVR, adding a line to /etc/rc.local is
usually considered easier than building a new kernel from sources and
then after making file system format changes, requiring a reformat of
their data disk!)

So it's not that I'm against solutions that involve kernel changes or
file system format changes.  It's just that I want to make sure we
explore the entire solution space, since there are costs in terms of
testing costs, the need to do a backup-reformat-restore pass, etc,
etc., to some of the solutions that have been suggested so far.

Regards,

						- Ted


* Aw: Re: Ext4: Slow performance on first write after mount
  2013-05-20 11:46       ` Theodore Ts'o
@ 2013-05-21 18:02         ` frankcmoeller
  2013-05-22  0:27           ` Andreas Dilger
  0 siblings, 1 reply; 9+ messages in thread
From: frankcmoeller @ 2013-05-21 18:02 UTC (permalink / raw)
  To: adilger; +Cc: linux-ext4

Hi Andreas,

only a short question: 

> I like the idea of keeping the high bits of the buddy bitmap in the group
> descriptor, instead of just the largest free order. It takes the same
> amount of space, but provides more information. 
More information for what? The allocator, or rather the good_group function,
needs bb_largest_free_order and in some cases the fragment count. Do you
want to use the bitmap for a fragment count calculation that is not 100% correct?
Or is there another use for it?

Best regards,
Frank

> 
> > On Sun, 19 May 2013 21:36:02 +0200 (CEST) Frank C Moeller wrote:
> >> From my point (end user) I would prefer a builtin solution. I'm also a
> >> programmer and I can therefore understand why you don't want to change
> >> anything.
> > 
> > It's not that I don't want to change anything, it's that I'm very
> > hesitant to add new mount options or new code paths that now need more
> > testing unless there's no other way of addressing a particular use
> > case.  Another consideration is how to do it in such a way that it
> > doesn't degrade other users' performance.
> > 
> > Issuing readahead requests for the bitmap blocks might be good
> > compromise; since they are readahead requests, as low priority
> > requests they won't interfere with anything else going on, and in
> > practice, unless you are starting your video recording **immediately**
> > after the reboot, it should address your concern.
> 
> Right. Some of our users do something similar in userspace to avoid
> slowdown on first write, which doesn't _usually_ happen immediately
> after mount, but this isn't always helpful. 
> 
> >  (Also note that for
> > most people willing to hack a DVR, adding a line to /etc/rc.local is
> > usually considered easier than building a new kernel from sources and
> > then after making file system format changes, requiring a reformat of
> > their data disk!)
> 
> I think storing the buddy bitmap top bits in the GDT could be a COMPAT
> feature.  It is just a hint that could be ignored or incorrect, since
> the actual bitmap would be authoritative. 
> 
> Cheers, Andreas
> 
> > So it's not that I'm against solutions that involve kernel changes or
> > file system format changes.  It's just that I want to make sure we
> > explore the entire solution space, since there are costs in terms of
> > testing costs, the need to do a backup-reformat-restore pass, etc,
> > etc., to some of the solutions that have been suggested so far.
> > 
> > Regards,
> > 
> >                        - Ted
> 


* Re: Aw: Re: Ext4: Slow performance on first write after mount
  2013-05-21 18:02         ` Aw: " frankcmoeller
@ 2013-05-22  0:27           ` Andreas Dilger
  0 siblings, 0 replies; 9+ messages in thread
From: Andreas Dilger @ 2013-05-22  0:27 UTC (permalink / raw)
  To: frankcmoeller; +Cc: linux-ext4

On 2013-05-21, at 12:02 PM, frankcmoeller@arcor.de wrote:
>> I like the idea of keeping the high bits of the buddy bitmap
>> in the group descriptor, instead of just the largest free order.
>> It takes the same amount of space, but provides more information. 
> More informations for what?

Sorry, what I meant to write was that it provides more information
than just recording e.g. the number of blocks in the largest free
extent.

> The allocator or better the good_group function
> needs bb_largest_free_order and in some cases fragment count. Do you 
> want to use the bitmap for a not 100% correct fragment count calculation? Or is there another use for it?

The bitmap would provide the largest_free_order value directly
(assuming it is at least 4MB in size).

Cheers, Andreas

>>> On Sun, 19 May 2013 21:36:02 +0200 (CEST) Frank C Moeller wrote:
>>>> From my point (end user) I would prefer a builtin solution. I'm also a
>>>> programmer and I can therefore understand why you don't want to change
>>>> anything.
>>> 
>>> It's not that I don't want to change anything, it's that I'm very
>>> hesitant to add new mount options or new code paths that now need more
>>> testing unless there's no other way of addressing a particular use
>>> case.  Another consideration is how to do it in such a way that it
>>> doesn't degrade other users' performance.
>>> 
>>> Issuing readahead requests for the bitmap blocks might be good
>>> compromise; since they are readahead requests, as low priority
>>> requests they won't interfere with anything else going on, and in
>>> practice, unless you are starting your video recording **immediately**
>>> after the reboot, it should address your concern.
>> 
>> Right. Some of our users do something similar in userspace to avoid
>> slowdown on first write, which doesn't _usually_ happen immediately
>> after mount, but this isn't always helpful. 
>> 
>>> (Also note that for
>>> most people willing to hack a DVR, adding a line to /etc/rc.local is
>>> usually considered easier than building a new kernel from sources and
>>> then after making file system format changes, requiring a reformat of
>>> their data disk!)
>> 
>> I think storing the buddy bitmap top bits in the GDT could be a COMPAT
>> feature.  It is just a hint that could be ignored or incorrect, since
>> the actual bitmap would be authoritative. 
>> 
>> Cheers, Andreas
>> 
>>> So it's not that I'm against solutions that involve kernel changes or
>>> file system format changes.  It's just that I want to make sure we
>>> explore the entire solution space, since there are costs in terms of
>>> testing costs, the need to do a backup-reformat-restore pass, etc,
>>> etc., to some of the solutions that have been suggested so far.
>>> 
>>> Regards,
>>> 
>>>                       - Ted
>> 



* Aw: Re: Ext4: Slow performance on first write after mount
@ 2013-05-20 20:54 frankcmoeller
  0 siblings, 0 replies; 9+ messages in thread
From: frankcmoeller @ 2013-05-20 20:54 UTC (permalink / raw)
  To: adilger, tytso; +Cc: linux-ext4

Hi all,

> > and then in the
> > kernel, we could have a loop which checks to see if the bitmap blocks
> > were already in cache, and if they were, to initialize the buddy
> > bitmaps pages.  That way, even if subsequent memory pressure were to
> > push the buddy bitmap pages and allocation bitmaps out of the cache,
> > it would mean that all of the ext4_group_info structures would be
> > initialized, and just having the bb_largest_free_order information
> > will very much help things.
> 
> I like the idea of keeping the high bits of the buddy bitmap in the group
> descriptor, instead of just the largest free order. It takes the same
> amount of space, but provides more information. 

If you use the reserved field for this, users don't need to reformat their disks
and "only" need to use a new kernel, right? That sounds really good to me.

> > Issuing readahead requests for the bitmap blocks might be good
> > compromise; since they are readahead requests, as low priority
> > requests they won't interfere with anything else going on, and in
> > practice, unless you are starting your video recording **immediately**
> > after the reboot, it should address your concern.

For a normal scheduled recording, the PVR starts up a few minutes (I think 2-3)
before the recording starts. If a user starts the PVR, timeshift (which is also a
recording) might start around 30-40 seconds after the disk is mounted.
But if that is problematic, I can let it start later.
 
> >  (Also note that for
> > most people willing to hack a DVR, adding a line to /etc/rc.local is
> > usually considered easier than building a new kernel from sources and
> > then after making file system format changes, requiring a reformat of
> > their data disk!)
> 
> I think storing the buddy bitmap top bits in the GDT could be a COMPAT
> feature.  It is just a hint that could be ignored or incorrect, since
> the actual bitmap would be authoritative. 

Yes, adding a line to rc.local is easy. Building a new kernel is also no problem,
if the patch is compatible with the currently used kernel versions (3.8.7, and
ideally also 3.3.8). The PVR uses a good software management system
(something like apt) and the users can update their software, including the kernel,
every day.
Reformatting is a problem, but if it is unavoidable, users can choose between
the workaround and a reformat.

Best regards,
Frank

> 
> Cheers, Andreas
> 
> > So it's not that I'm against solutions that involve kernel changes or
> > file system format changes.  It's just that I want to make sure we
> > explore the entire solution space, since there are costs in terms of
> > testing costs, the need to do a backup-reformat-restore pass, etc,
> > etc., to some of the solutions that have been suggested so far.
> > 
> > Regards,
> > 
> >                        - Ted
> 


* Aw: Re: Ext4: Slow performance on first write after mount
@ 2013-05-19 19:36 frankcmoeller
  0 siblings, 0 replies; 9+ messages in thread
From: frankcmoeller @ 2013-05-19 19:36 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

Hi Ted,

> Simply adding "cat /proc/fs/ext4/<dev>/mb_groups > /dev/null" to one of
> the /etc/init.d scripts, or to /etc/rc.local, is probably the simplest
> fix, yes.
Thanks for confirming that the workaround fixes the problem!

> Given the simple nature of the above workaround, it's not obvious to
> me that trying to make file system format changes, or even adding a
> new mount option, is really worth it.  This is especially true given
> that mount -a is sequential, so if there are a large number of big file
> systems, using this as a mount option would slow down the boot
> significantly.  It would be better to do this in parallel, which you
> could do in userspace much more easily using the "cat
> /proc/fs/ext4/<dev>/mb_groups" workaround.
From my point of view (as an end user) I would prefer a built-in solution. I'm also a
programmer and I can therefore understand why you don't want to change
anything. It's a little bit surprising to me that only a few people seem to have
this problem. But I believe that many live with it and don't know that the slow
boot or write is caused by ext4 (many end users have small ext4 partitions,
and servers run 24/7 without remounting the fs...). Only a few applications
rely on a constant write throughput.

> > - I can see (see debug output) that the call of ext4_wait_block_bitmap in
> > mballoc.c line 848 takes during buffer cache initialization the longest time
> > (some 1/100 of a second). Can this be improved?
> 
> The delay is caused purely by I/O delay, so short of replacing the HDD
> with a SSD, not really....
Well, SSDs are really cool, but for a PVR an HDD is still a good choice: it is cheap,
big, more reliable (hopefully), quick enough and has no problems writing several
GB of data per day.

Regards,
Frank

> 
> Regards,
> 
> 						- Ted
> 


* Re: Aw: Re: Ext4: Slow performance on first write after mount
@ 2013-05-19 10:01 frankcmoeller
  0 siblings, 0 replies; 9+ messages in thread
From: frankcmoeller @ 2013-05-19 10:01 UTC (permalink / raw)
  To: linux-ext4

Hi Andreas,

> Part of the problem is that filesystems are rarely unmounted cleanly, so it
> means that this information would need to be updated periodically to disk so
> that it is available after a crash.
> I wouldn't object to some kind of "lazy" updating of group information on
> disk that at least gives the newly-mounted filesystem a rough idea of what
> each group's usage is. It wouldn't have to be totally accurate (it wouldn't
> replace the bitmaps), but maybe 2 bits per group would be enough as a
> starting point?
> For a 32 TB filesystem that would be about 16 4kB blocks of bits that would
> be updated periodically (e.g. every five minutes or so). Since the allocator
> will typically work in successive groups that might not cause too much
> churn. 
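
The size estimate in the quoted paragraph can be reproduced with a short calculation. The sketch assumes 4 KiB blocks and therefore 128 MiB block groups; the function name is invented for the example.

```python
def hint_blocks(fs_bytes, bits_per_group=2, block_size=4096):
    """Return (number of groups, blocks needed for the per-group hint bits)."""
    group_bytes = block_size * 8 * block_size       # 32768 blocks of 4 KiB = 128 MiB
    groups = -(-fs_bytes // group_bytes)            # ceiling division
    hint_bytes = -(-groups * bits_per_group // 8)   # round up to whole bytes
    return groups, -(-hint_bytes // block_size)     # round up to whole blocks


print(hint_blocks(32 << 40))  # 32 TiB file system
```

For 32 TiB this gives 262144 groups and 16 blocks, matching the "about 16 4kB blocks" in the quote.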

Yes, you're right. The stored data wouldn't be 100% reliable. And yes, it would be really good if
the filesystem knew a bit more right after mount, so that it could find a good group more quickly.
What do you think of this:
1. I have read this already in some discussions: you already store the amount of free space for every
  group. Why not also store the size of the largest contiguous run of free space in each group? Then you
  don't have to read the whole group.
2. What about a list (in memory and also stored on disk) of all unused groups (1 bit per group)?
  If the allocator cannot find a good group within, let's say, half a second, a group from this list is used.
  The list would also not be 100% reliable (because of the unclean unmounts mentioned above), so you would
  still need to search the list for a good group. If no good group is found in the list, the allocator can
  continue searching. This doesn't help in all situations (e.g. an almost full disk, or every group containing
  a small amount of data), but in many cases it should be much faster, as long as the list is not totally outdated.
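
The fallback described in point 2 could look roughly like the following Python sketch. It is only an illustration of the idea, not ext4 code: `pick_group`, `scan`, `hinted_empty` and the injectable `clock` are hypothetical names, and `scan(g)` stands in for the expensive step of reading a group's bitmap and returning its largest free run.

```python
import time


def pick_group(n_groups, scan, hinted_empty, want, budget_s=0.5,
               clock=time.monotonic):
    """Return a group with a free run of at least `want` blocks, or None.

    scan(g) is assumed to be expensive (it reads the group's bitmap) and
    returns the largest free run in group g. hinted_empty is the set of
    groups that the possibly stale on-disk list marks as unused.
    """
    deadline = clock() + budget_s
    g = 0
    # Normal search: scan groups in order until the time budget is spent.
    while g < n_groups and clock() < deadline:
        if scan(g) >= want:
            return g
        g += 1
    # Budget exhausted: try the hinted groups, but verify each one,
    # because the hint list may be outdated after an unclean unmount.
    for h in sorted(hinted_empty):
        if h >= g and scan(h) >= want:
            return h
    # No usable hint: continue the normal search where it stopped.
    for rest in range(g, n_groups):
        if rest not in hinted_empty and scan(rest) >= want:
            return rest
    return None
```

The half-second budget is the value suggested above; every hinted group is still scanned once before use, so a stale hint costs one bitmap read rather than a wrong allocation.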

> It would be possible to fallocate() at some expected size (e.g. average file
> size) and then either truncate off the unused space, or fallocate() some
> more in another thread when you are close to running out. 
> If the fallocate() is done in a separate thread the latency can be hidden
> from the main application?
Adding a new thread for fallocate shouldn't be a big problem. But fallocate might 
generate high disk usage (while searching for a good group). I don't know whether
parallel writing from the other thread is quick enough.

One question regarding fallocate: I create a new file and do a 100MB fallocate 
with FALLOC_FL_KEEP_SIZE. Then I write only 70MB to that file and close it.
Is the 30 MB unused preallocated space still preallocated for that file after closing
it? Or does a close release the preallocated space?

Regards,
Frank

> 
> Cheers, Andreas 
> 
> > You also have to take care of alignment, and there are several threads on
> > the internet which explain why you shouldn't use it (or only in very special
> > situations, and I don't think that my situation is one of them). And ext4
> > group initialization also takes place when using O_DIRECT (as said before,
> > perhaps I did something wrong).
> > 
> > Regards,
> > Frank
> > 
> > ----- Original Message ----
> > From:    "Sidorov, Andrei" <Andrei.Sidorov@arrisi.com>
> > To:      "frankcmoeller@arcor.de" <frankcmoeller@arcor.de>, ext4 development <linux-ext4@vger.kernel.org>
> > Date:    17.05.2013 23:18
> > Subject: Re: Ext4: Slow performance on first write after mount
> > 
> >> Hi Frank,
> >> 
> >> Consider using bigalloc feature (requires reformat), preallocate space
> >> with fallocate and use O_DIRECT for reads/writes. However, 188k writes
> >> are too small for good throughput with O_DIRECT. You might also want to
> >> adjust max_sectors_kb to something larger than 512k.
> >> 
> >> We're doing 6in+6out 20Mbps streams just fine.
> >> 
> >> Regards,
> >> Andrei.
> >> 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

* Aw: Re: Ext4: Slow performance on first write after mount
@ 2013-05-18 10:50 frankcmoeller
  2013-05-18 20:34 ` Sidorov, Andrei
  2013-05-19  1:49 ` Andreas Dilger
  0 siblings, 2 replies; 9+ messages in thread
From: frankcmoeller @ 2013-05-18 10:50 UTC (permalink / raw)
  To: linux-ext4

Hi Andrei,

thanks for your quick answer!
Perhaps you misunderstood me. The general write performance is quite good. We can record more than 4 HD channels at the same time without problems, except for the problems with the first write after mount. And there are also some users who have problems 1-2 times during a recording.
I think the ext4 group initialization is the main problem, because it takes so long (as written before: around 1300 groups per second). Why don't you store the gathered information on disk when a umount takes place?

With fallocate, the group initialization is partly done before the first write. This helps, but it's no solution, because the final file size is unknown. So I cannot preallocate space for the complete file. And after the preallocated space is consumed, the same initialization problem arises until all groups are initialized.

I also made some tests with O_DIRECT (my first tests ever). Perhaps I did something wrong, but it isn't very fast. You also have to take care of alignment, and there are several threads on the internet which explain why you shouldn't use it (or only in very special situations, and I don't think that my situation is one of them). And ext4 group initialization also takes place when using O_DIRECT (as said before, perhaps I did something wrong).

Regards,
Frank

----- Original Message ----
From:    "Sidorov, Andrei" <Andrei.Sidorov@arrisi.com>
To:      "frankcmoeller@arcor.de" <frankcmoeller@arcor.de>, ext4 development <linux-ext4@vger.kernel.org>
Date:    17.05.2013 23:18
Subject: Re: Ext4: Slow performance on first write after mount

> Hi Frank,
> 
> Consider using bigalloc feature (requires reformat), preallocate space
> with fallocate and use O_DIRECT for reads/writes. However, 188k writes
> are too small for good throughput with O_DIRECT. You might also want to
> adjust max_sectors_kb to something larger than 512k.
> 
> We're doing 6in+6out 20Mbps streams just fine.
> 
> Regards,
> Andrei.
> 

* Re: Aw: Re: Ext4: Slow performance on first write after mount
  2013-05-18 10:50 frankcmoeller
@ 2013-05-18 20:34 ` Sidorov, Andrei
  2013-05-19  1:49 ` Andreas Dilger
  1 sibling, 0 replies; 9+ messages in thread
From: Sidorov, Andrei @ 2013-05-18 20:34 UTC (permalink / raw)
  To: frankcmoeller@arcor.de; +Cc: linux-ext4@vger.kernel.org

Frank,

Well, the main point was to use bigalloc. Unfortunately it requires a
reformat.
W/o bigalloc there will be ~7800 block groups for a 1T drive. Those groups
take 32M of on-disk data and up to 64M when it comes to RAM because of
runtime buddy bitmaps. I don't think it's worth storing buddy bitmaps on
the drive. It's not a surprise that it can take a long time to read lots of
block bitmaps scattered over the drive and construct buddies out of them. And
it's not a surprise that some of these pages are evicted under high memory
pressure. With bigalloc and a 1M cluster size you get 256 times less metadata
(128K instead of 32M) and you get all the benefits of faster allocate,
truncate and less fragmentation.
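
The figures above can be checked with a little arithmetic. The sketch below assumes 4 KiB blocks and one bitmap block per group, so that each group covers 32768 allocation units (blocks, or clusters with bigalloc); `bitmap_footprint` is a hypothetical helper, and it uses a binary 1 TiB, which gives 8192 groups rather than the rounded ~7800.

```python
BLOCK_SIZE = 4096                  # assumed filesystem block size in bytes
UNITS_PER_GROUP = BLOCK_SIZE * 8   # one bitmap block holds 32768 bits


def bitmap_footprint(disk_bytes, unit_bytes=BLOCK_SIZE):
    """Return (number of groups, total bytes of block bitmaps).

    unit_bytes is the allocation unit: the block size without bigalloc,
    or the cluster size with bigalloc.
    """
    group_bytes = UNITS_PER_GROUP * unit_bytes
    groups = -(-disk_bytes // group_bytes)   # ceiling division
    return groups, groups * BLOCK_SIZE


TIB = 1 << 40
plain_groups, plain_bytes = bitmap_footprint(TIB)        # 4 KiB blocks
big_groups, big_bytes = bitmap_footprint(TIB, 1 << 20)   # 1 MiB clusters
print(plain_groups, plain_bytes // (1 << 20), "MiB")
print(big_groups, big_bytes // (1 << 10), "KiB")
```

Under these assumptions the bitmap data shrinks from 32 MiB to 128 KiB, which is the factor of 256 mentioned above.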

Yes, you don't know the file size in advance, but speculatively preallocating,
say, 128M at a time is clearly a benefit. Truncate to the real file size once the
recording has finished to release unused preallocated space.
There are some caveats with O_DIRECT, but it is faster if done correctly.
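
A minimal Python sketch of this preallocate-then-truncate pattern follows. `PreallocatingWriter` is a hypothetical name, and the sketch uses `os.posix_fallocate`, which grows the visible file size (unlike `fallocate()` with `FALLOC_FL_KEEP_SIZE`), so the final `ftruncate` is what releases the unused tail.

```python
import os

CHUNK = 128 * 1024 * 1024   # speculative preallocation step


class PreallocatingWriter:
    """Append-only writer that reserves space one chunk ahead of the data."""

    def __init__(self, path, chunk=CHUNK):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        self.chunk = chunk
        self.written = 0    # bytes of real data written so far
        self.reserved = 0   # bytes preallocated so far

    def write(self, data):
        # Extend the reservation before the data would run past it.
        while self.written + len(data) > self.reserved:
            os.posix_fallocate(self.fd, self.reserved, self.chunk)
            self.reserved += self.chunk
        view = memoryview(data)
        while view:
            n = os.pwrite(self.fd, view, self.written)
            self.written += n
            view = view[n:]

    def close(self):
        # Cut the file back to the real data size, freeing the unused tail.
        os.ftruncate(self.fd, self.written)
        os.close(self.fd)
```

Here the preallocation still happens in the writing thread, so a slow fallocate would still stall the writer at each chunk boundary.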

Regards,
Andrei.

On 18.05.2013 03:50, frankcmoeller@arcor.de wrote:
> Hi Andrei,
>
> thanks for your quick answer!
> Perhaps you misunderstood me. The general write performance is quite good. We can record more than 4 HD channels at the same time without problems, except for the problems with the first write after mount. And there are also some users who have problems 1-2 times during a recording.
> I think the ext4 group initialization is the main problem, because it takes so long (as written before: around 1300 groups per second). Why don't you store the gathered information on disk when a umount takes place?
>
> With fallocate, the group initialization is partly done before the first write. This helps, but it's no solution, because the final file size is unknown. So I cannot preallocate space for the complete file. And after the preallocated space is consumed, the same initialization problem arises until all groups are initialized.
>
> I also made some tests with O_DIRECT (my first tests ever). Perhaps I did something wrong, but it isn't very fast. You also have to take care of alignment, and there are several threads on the internet which explain why you shouldn't use it (or only in very special situations, and I don't think that my situation is one of them). And ext4 group initialization also takes place when using O_DIRECT (as said before, perhaps I did something wrong).
>
> Regards,
> Frank
>
> ----- Original Message ----
> From:    "Sidorov, Andrei" <Andrei.Sidorov@arrisi.com>
> To:      "frankcmoeller@arcor.de" <frankcmoeller@arcor.de>, ext4 development <linux-ext4@vger.kernel.org>
> Date:    17.05.2013 23:18
> Subject: Re: Ext4: Slow performance on first write after mount
>
>> Hi Frank,
>>
>> Consider using bigalloc feature (requires reformat), preallocate space
>> with fallocate and use O_DIRECT for reads/writes. However, 188k writes
>> are too small for good throughput with O_DIRECT. You might also want to
>> adjust max_sectors_kb to something larger than 512k.
>>
>> We're doing 6in+6out 20Mbps streams just fine.
>>
>> Regards,
>> Andrei.
>>
>
>


* Re: Aw: Re: Ext4: Slow performance on first write after mount
  2013-05-18 10:50 frankcmoeller
  2013-05-18 20:34 ` Sidorov, Andrei
@ 2013-05-19  1:49 ` Andreas Dilger
  1 sibling, 0 replies; 9+ messages in thread
From: Andreas Dilger @ 2013-05-19  1:49 UTC (permalink / raw)
  To: frankcmoeller@arcor.de; +Cc: linux-ext4@vger.kernel.org

On 2013-05-18, at 4:50, frankcmoeller@arcor.de wrote:
> thanks for your quick answer!
> Perhaps you misunderstood me. The general write performance is quite good. We can record more than 4 HD channels at the same time without problems, except for the problems with the first write after mount. And there are also some users who have problems 1-2 times during a recording.
> I think the ext4 group initialization is the main problem, because it takes so long (as written before: around 1300 groups per second). Why don't you store the gathered information on disk when a umount takes place?

Part of the problem is that filesystems are rarely unmounted cleanly, so it means that this information would need to be updated periodically to disk so that it is available after a crash.

I wouldn't object to some kind of "lazy" updating of group information on disk that at least gives the newly-mounted filesystem a rough idea of what each group's usage is. It wouldn't have to be totally accurate (it wouldn't replace the bitmaps), but maybe 2 bits per group would be enough as a starting point?

For a 32 TB filesystem that would be about 16 4kB blocks of bits that would be updated periodically (e.g. every five minutes or so). Since the allocator will typically work in successive groups that might not cause too much churn. 
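
As a rough check of that estimate, and to show how small such a summary would be, here is a Python sketch. It assumes 128 MiB block groups (4 KiB blocks, 32768 blocks per group); `summary_blocks` and `GroupSummary` are hypothetical names, and the meaning of the four 2-bit states (for example empty, mostly free, mostly full, full) is likewise only an assumption.

```python
GROUP_BYTES = 128 * 1024 * 1024   # assumed: 4 KiB blocks, 32768 per group
BLOCK = 4096


def summary_blocks(fs_bytes, bits_per_group=2):
    """Return (group count, number of 4 KiB blocks holding the summary)."""
    groups = fs_bytes // GROUP_BYTES
    total_bytes = -(-(groups * bits_per_group) // 8)   # ceiling division
    return groups, -(-total_bytes // BLOCK)


class GroupSummary:
    """Packed array with one 2-bit usage state (0..3) per block group."""

    def __init__(self, groups):
        self.bits = bytearray((groups * 2 + 7) // 8)

    def set(self, group, state):
        byte, shift = divmod(group * 2, 8)
        cleared = self.bits[byte] & ~(3 << shift) & 0xFF
        self.bits[byte] = cleared | ((state & 3) << shift)

    def get(self, group):
        byte, shift = divmod(group * 2, 8)
        return (self.bits[byte] >> shift) & 3


print(summary_blocks(32 << 40))
```

With these assumptions a 32 TiB filesystem has 262144 groups, and 2 bits per group fit in 64 KiB, i.e. the 16 blocks of 4 kB mentioned above.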

> With fallocate, the group initialization is partly done before the first write. This helps, but it's no solution, because the final file size is unknown.

It would be possible to fallocate() at some expected size (e.g. average file size) and then either truncate off the unused space, or fallocate() some more in another thread when you are close to running out. 

> So I cannot preallocate space for the complete file. And after the preallocated space is consumed, the same initialization problem arises until all groups are initialized.

If the fallocate() is done in a separate thread the latency can be hidden from the main application?
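
One way to do that, sketched in Python under stated assumptions: `BackgroundPreallocator`, `note_written` and `low_water` are hypothetical names, and `os.posix_fallocate` is used, which grows the visible file size, so the file would still need a truncate to the real size at the end.

```python
import os
import threading


class BackgroundPreallocator:
    """Extends a file's reservation in a helper thread, ahead of the writer."""

    def __init__(self, fd, chunk, low_water):
        self.fd = fd
        self.chunk = chunk          # bytes to add per extension
        self.low_water = low_water  # extend when this little headroom is left
        self.reserved = 0
        self._lock = threading.Lock()
        self._worker = None

    def _extend(self):
        # The potentially slow part: runs outside the writer thread.
        os.posix_fallocate(self.fd, self.reserved, self.chunk)
        with self._lock:
            self.reserved += self.chunk

    def note_written(self, written):
        """Call after each write; returns without waiting for fallocate."""
        with self._lock:
            busy = self._worker is not None and self._worker.is_alive()
            if busy or self.reserved - written > self.low_water:
                return
            self._worker = threading.Thread(target=self._extend, daemon=True)
            self._worker.start()

    def wait(self):
        """Block until any extension in flight has finished."""
        worker = self._worker
        if worker is not None:
            worker.join()
```

The writer calls `note_written()` with its current offset after each write; only one extension runs at a time, and the writer never waits unless it calls `wait()`.
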
> 
> I also made some tests with O_DIRECT (my first tests ever). Perhaps I did something wrong, but it isn't very fast.

That is true, and depends heavily on your workload. 

Cheers, Andreas 

> You also have to take care of alignment, and there are several threads on the internet which explain why you shouldn't use it (or only in very special situations, and I don't think that my situation is one of them). And ext4 group initialization also takes place when using O_DIRECT (as said before, perhaps I did something wrong).
> 
> Regards,
> Frank
> 
> ----- Original Message ----
> From:    "Sidorov, Andrei" <Andrei.Sidorov@arrisi.com>
> To:      "frankcmoeller@arcor.de" <frankcmoeller@arcor.de>, ext4 development <linux-ext4@vger.kernel.org>
> Date:    17.05.2013 23:18
> Subject: Re: Ext4: Slow performance on first write after mount
> 
>> Hi Frank,
>> 
>> Consider using bigalloc feature (requires reformat), preallocate space
>> with fallocate and use O_DIRECT for reads/writes. However, 188k writes
>> are too small for good throughput with O_DIRECT. You might also want to
>> adjust max_sectors_kb to something larger than 512k.
>> 
>> We're doing 6in+6out 20Mbps streams just fine.
>> 
>> Regards,
>> Andrei.
>> 

end of thread, other threads:[~2013-05-22  0:27 UTC | newest]

Thread overview: 9+ messages
2013-05-18 22:34 Aw: Re: Ext4: Slow performance on first write after mount frankcmoeller
     [not found] <D1047C91-765D-4EBD-A6CC-869DF0D5AD90@dilger.ca>
2013-05-17 16:51 ` frankcmoeller
2013-05-19 14:00   ` Theodore Ts'o
2013-05-20  6:39     ` Andreas Dilger
2013-05-20 11:46       ` Theodore Ts'o
2013-05-21 18:02         ` Aw: " frankcmoeller
2013-05-22  0:27           ` Andreas Dilger
  -- strict thread matches above, loose matches on Subject: below --
2013-05-20 20:54 frankcmoeller
2013-05-19 19:36 frankcmoeller
2013-05-19 10:01 frankcmoeller
2013-05-18 10:50 frankcmoeller
2013-05-18 20:34 ` Sidorov, Andrei
2013-05-19  1:49 ` Andreas Dilger
