* Ext4: Slow performance on first write after mount
@ 2013-05-17 16:51 ` frankcmoeller
2013-05-17 21:18 ` Sidorov, Andrei
2013-05-19 14:00 ` Theodore Ts'o
0 siblings, 2 replies; 9+ messages in thread
From: frankcmoeller @ 2013-05-17 16:51 UTC (permalink / raw)
To: linux-ext4
Hi,
we're using ext4 on satellite boxes (e.g. XTrend et9200) with attached harddisks. The receiver runs Linux with kernel version 3.8.7. Some users (like me) have problems with the first recording right after boot. Most of them have big partitions (1-2 TB) and high disk usage (over 50%). In this case the application signals a buffer overflow (the buffer is 4 MB). We found out that one of the first writes after boot or remount is very slow. I have debugged it. The test case was a umount, then a mount, and then a write of 64 MB of data to the disk:
The problem is the initialization of the buffer cache, which takes very long (ext4_mb_init_group, called from the ext4_mb_good_group path). In my case it loads around 1300 groups per second (with the patch which avoids loading full groups). My disk is quite full at the beginning, so it needs to read around 8200 groups to find a "good" one. This takes over 6 seconds. Here is the output:
May 10 02:06:15 et9x00 user.info kernel: EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator start time: 4284161251798
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before for group loop cr: 0
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 0 time: 4284161318465
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after allocate buffer_heads; time: 4284161355983
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_read_block_bitmap_nowait time: 4284161440835
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_wait_block_bitmap time: 4284167134687
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after allocate buffer_heads; time: 4284167180243
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_read_block_bitmap_nowait time: 4284167198909
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_wait_block_bitmap time: 4284167212835
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 1 time: 4284167260724
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 2 time: 4284167276576
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 3 time: 4284167291205
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 4 time: 4284167305798
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 5 time: 4284167320280
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 6 time: 4284167334835
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 7 time: 4284167349317
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8 time: 4284167363909
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 9 time: 4284167378391
May 10 02:06:31 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 10 time: 4284167392872
...
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8240 time: 4290297430389
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after allocate buffer_heads; time: 4290297464612
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_read_block_bitmap_nowait time: 4290297521464
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_wait_block_bitmap time: 4290310304019
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after allocate buffer_heads; time: 4290310346352
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_read_block_bitmap_nowait time: 4290310363908
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_wait_block_bitmap time: 4290310377834
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8241 time: 4290310425945
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8242 time: 4290310443241
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8243 time: 4290310458352
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8244 time: 4290310473204
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8245 time: 4290310488056
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8246 time: 4290310503167
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8247 time: 4290310517982
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8248 time: 4290310533093
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8249 time: 4290310547945
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8250 time: 4290310562797
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8251 time: 4290310577871
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8252 time: 4290310592945
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8253 time: 4290310608019
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8254 time: 4290310622871
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator before good group; group: 8255 time: 4290310637723
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after allocate buffer_heads; time: 4290310668278
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_read_block_bitmap_nowait time: 4290310739464
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_wait_block_bitmap time: 4290310979093
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after allocate buffer_heads; time: 4290311058093
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_read_block_bitmap_nowait time: 4290311077538
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_init_cache after ext4_wait_block_bitmap time: 4290311091167
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator after good group; group: 8255 time: 4290311137649
May 10 02:06:37 et9x00 user.info kernel: EXT4-fs (sda1): ext4_mb_regular_allocator end time: 4290311168501
Don't be surprised that I couldn't activate kernel tracing on the satellite box; perhaps that's because of two closed-source kernel modules which we need to use. Instead, I added some ext4_msg statements in mballoc.c.
If I understood correctly, this can also happen hours after the first write: there might be some free space at the beginning of the harddisk, and once it is consumed, the initialization of the buffer cache continues...
For debugging I tested with a 64 MB write, which took more than 7 seconds (subsequent writes were much faster):
root@et9x00:~# dd if=/dev/zero of=/hdd/test.6434 bs=64M count=1
1+0 records in
1+0 records out
67108864 bytes (64.0MB) copied, 7.379178 seconds, 8.7MB/s
The application writes 188 KB blocks to disk. There, too, after some quick writes we see writes that take several seconds right after boot or mount.
So we have real-time data (up to 2 MB per second per recording) which needs to be written within at most 2 or 3 seconds (depending on the bitrate of the channel). We cannot use a very big buffer in the application because of limited resources, which we cannot change.
So my questions:
- What can we do to avoid this (ideally with no reformatting)?
- Why do you throw away the buffer cache and not store it on disk during umount? The initialization of the buffer cache is quite awful for applications which need a specific write throughput.
- A workaround would be to read the whole /proc/.../mb_groups file right after every mount. Correct?
- I can try to add a mount option to initialize the cache at mount time. Would you be interested in such a patch?
- I can see (see debug output) that the call to ext4_wait_block_bitmap in mballoc.c line 848 takes the longest time during buffer cache initialization (some 1/100 of a second). Can this be improved?
Regards,
Frank
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Ext4: Slow performance on first write after mount
2013-05-17 16:51 ` Ext4: Slow performance on first write after mount frankcmoeller
@ 2013-05-17 21:18 ` Sidorov, Andrei
2013-05-19 14:00 ` Theodore Ts'o
1 sibling, 0 replies; 9+ messages in thread
From: Sidorov, Andrei @ 2013-05-17 21:18 UTC (permalink / raw)
To: frankcmoeller@arcor.de, ext4 development
Hi Frank,
Consider using the bigalloc feature (requires a reformat), preallocate space
with fallocate, and use O_DIRECT for reads/writes. However, 188k writes
are too small for good throughput with O_DIRECT. You might also want to
adjust max_sectors_kb to something larger than 512k.
We're doing 6in+6out 20Mbps streams just fine.
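For what it's worth, the max_sectors_kb tuning above can be applied from a boot script; here is a hedged sketch (the device name and the 1024 value are illustrative assumptions, not values taken from this thread):

```shell
#!/bin/sh
# Illustrative sketch of the max_sectors_kb tuning suggested above.
# The device name and the 1024 KB value are assumptions -- pick values
# appropriate for your disk and workload.
tune_max_sectors() {
    dev="$1"; kb="$2"
    sysfs="/sys/block/$dev/queue/max_sectors_kb"
    if [ -w "$sysfs" ]; then
        echo "$kb" > "$sysfs"       # allow larger requests per I/O
    else
        echo "skipping $dev: $sysfs not writable" >&2
    fi
}
# e.g. from /etc/rc.local:  tune_max_sectors sda 1024
```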
Regards,
Andrei.
On 17.05.2013 09:51, frankcmoeller@arcor.de wrote:
> Hi,
>
> [ original message and debug log snipped ]
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Ext4: Slow performance on first write after mount
2013-05-17 16:51 ` Ext4: Slow performance on first write after mount frankcmoeller
2013-05-17 21:18 ` Sidorov, Andrei
@ 2013-05-19 14:00 ` Theodore Ts'o
2013-05-20 6:39 ` Andreas Dilger
1 sibling, 1 reply; 9+ messages in thread
From: Theodore Ts'o @ 2013-05-19 14:00 UTC (permalink / raw)
To: frankcmoeller; +Cc: linux-ext4
On Fri, May 17, 2013 at 06:51:23PM +0200, frankcmoeller@arcor.de wrote:
> - Why do you throw away buffer cache and don't store it on disk during umount? The initialization of the buffer cache is quite awful for application which need a specific write throughput.
> - A workaround would be to read whole /proc/.../mb_groups file right after every mount. Correct?
Simply adding "cat /proc/fs/<dev>/mb_groups > /dev/null" to one of the
/etc/init.d scripts, or to /etc/rc.local is probably the simplest fix,
yes.
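A minimal boot-time warm-up along these lines might look as follows; the /proc/fs/ext4/<dev>/mb_groups path is the one used by mainline kernels, and the loop is simply a no-op if nothing matches:

```shell
#!/bin/sh
# Warm the mballoc group cache for every mounted ext4 filesystem by
# reading mb_groups once, e.g. from /etc/rc.local. If the glob matches
# nothing, the loop body never runs, so this is safe on any system.
for f in /proc/fs/ext4/*/mb_groups; do
    if [ -r "$f" ]; then
        cat "$f" > /dev/null
    fi
done
```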
> - I can try to add a mount option to initialize the cache at mount time. Would you be interested in such a patch?
Given the simple nature of the above workaround, it's not obvious to
me that making file system format changes, or even adding a
new mount option, is really worth it. This is especially true given
that mount -a is sequential, so if there are a large number of big file
systems, using this as a mount option would slow down the boot
significantly. It would be better to do this in parallel, which you
could do in userspace much more easily using the "cat
/proc/fs/<dev>/mb_groups" workaround.
> - I can see (see debug output) that the call of ext4_wait_block_bitmap in mballoc.c line 848 takes during buffer cache initialization the longest time (some 1/100 of a second). Can this be improved?
The delay is caused purely by I/O delay, so short of replacing the HDD
with an SSD, not really....
Regards,
- Ted
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Ext4: Slow performance on first write after mount
2013-05-19 14:00 ` Theodore Ts'o
@ 2013-05-20 6:39 ` Andreas Dilger
2013-05-20 11:46 ` Theodore Ts'o
2013-05-20 12:37 ` Eric Sandeen
0 siblings, 2 replies; 9+ messages in thread
From: Andreas Dilger @ 2013-05-20 6:39 UTC (permalink / raw)
To: Theodore Ts'o; +Cc: frankcmoeller@arcor.de, linux-ext4@vger.kernel.org
On 2013-05-19, at 8:00, Theodore Ts'o <tytso@mit.edu> wrote:
> On Fri, May 17, 2013 at 06:51:23PM +0200, frankcmoeller@arcor.de wrote:
>> - Why do you throw away buffer cache and don't store it on disk during umount? The initialization of the buffer cache is quite awful for application which need a specific write throughput.
>> - A workaround would be to read whole /proc/.../mb_groups file right after every mount. Correct?
>
> Simply adding "cat /proc/fs/<dev>/mb_groups > /dev/null" to one of the
> /etc/init.d scripts, or to /etc/rc.local is probably the simplest fix,
> yes.
>
>> - I can try to add a mount option to initialize the cache at mount time. Would you be interested in such a patch?
>
> Given the simple nature of the above workaround, it's not obvious to
> me that making file system format changes, or even adding a
> new mount option, is really worth it. This is especially true given
> that mount -a is sequential, so if there are a large number of big file
> systems, using this as a mount option would slow down the boot
> significantly. It would be better to do this in parallel, which you
> could do in userspace much more easily using the "cat
> /proc/fs/<dev>/mb_groups" workaround.
Since we already have a thread starting at mount time to check the
inode table zeroing, it would also be possible to co-opt this thread
for preloading the group metadata from the bitmaps.
>> - I can see (see debug output) that the call of ext4_wait_block_bitmap in mballoc.c line 848 takes during buffer cache initialization the longest time (some 1/100 of a second). Can this be improved?
>
> The delay is caused purely by I/O delay, so short of replacing the HDD
> with an SSD, not really....
Well, with a larger flex_bg factor at format time there will be more
bitmaps allocated together on disk, so fewer seeks are needed to load
them after a new mount. We use a flex_bg factor of 256 for this
reason on our very large storage targets.
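For anyone reformatting anyway, the flex_bg factor is chosen at mkfs time; a sketch (destructive, and /dev/sdX is a placeholder):

```shell
# DESTRUCTIVE: reformats the device. -G sets the flex_bg group size;
# 256 matches the factor mentioned above for large storage targets.
# /dev/sdX is a placeholder -- double-check the device before running.
mkfs.ext4 -G 256 /dev/sdX
```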
Cheers, Andreas
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Ext4: Slow performance on first write after mount
2013-05-19 13:00 ` Aw: " frankcmoeller
@ 2013-05-20 7:04 ` Andreas Dilger
0 siblings, 0 replies; 9+ messages in thread
From: Andreas Dilger @ 2013-05-20 7:04 UTC (permalink / raw)
To: frankcmoeller@arcor.de; +Cc: linux-ext4@vger.kernel.org
On 2013-05-19, at 7:00, frankcmoeller@arcor.de wrote:
>> One question regarding fallocate: I create a new file and do a 100MB
>> fallocate
>> with FALLOC_FL_KEEP_SIZE. Then I write only 70MB to that file and close it.
>> Is the 30 MB unused preallocated space still preallocated for that file
>> after closing
>> it? Or does a close release the preallocated space?
>
> I did some tests and now I can answer it myself ;-)
> The space stays preallocated after closing the file. Also, umount doesn't release
> the space. Interesting!
Yes, this is how it is expected to work. Your application would need
to truncate the file to the final size when it is finished writing to it.
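The preallocate-write-truncate pattern described here can be sketched with util-linux tools (file name and sizes are made up; `fallocate -n` maps to FALLOC_FL_KEEP_SIZE):

```shell
#!/bin/sh
# Sketch of preallocate-write-truncate, per the advice above. Sizes and
# the file name are illustrative. fallocate may fail on filesystems
# without preallocation support, hence the best-effort "|| true".
record_with_prealloc() {
    f="$1"; prealloc_mb="$2"; written_mb="$3"
    : > "$f"                                         # create/empty the file
    fallocate -n -l "${prealloc_mb}M" "$f" || true   # KEEP_SIZE prealloc
    dd if=/dev/zero of="$f" bs=1M count="$written_mb" conv=notrunc 2>/dev/null
    truncate -s "${written_mb}M" "$f"                # drop blocks past EOF
}
# e.g.: record_with_prealloc /hdd/rec.ts 100 70
```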
> I was testing concurrent fallocates and writes to the same file descriptor. It
> seems to work. Whether it is quick enough, I cannot say at the moment.
>
>> it would be really good if, right after mount, the filesystem
>> knew something more in order to find a good group quicker.
>> What do you think of this:
>> 1. I read this already in some discussions: You already store the
>> free space amount for every group. Why not also store how
>> big the biggest contiguous free space block in a group is?
>> Then you don't have to read the whole group.
Yes, this is done in memory already, and updating it on disk is no
more effort than updating the free block count when blocks are allocated
or freed in that group.
One option would be to store the first 32 bits of the buddy bitmap
in the bg_reserved field for each group. That would give us the
distribution down to 4 MB chunks in each group (if I calculate correctly).
That would consume the last free field in the group descriptor, but it
might be worthwhile? Alternately, it could be put into a separate
file, but that would cause more IO.
>> 2. What about a list (in memory and also stored on disk) with all unused
>> groups (1 bit for every group).
Having only 1 bit per group is useless. The full/not full information
can already be had from the free blocks counter in the group descriptor,
which is always in memory.
The problem is with groups that appear to have _some_ free space,
but need the bitmap to be read to see if it is contiguous or not. Some
heuristics might be used to improve this scanning, but having part of
the buddy bitmap loaded would be more useful.
>> If the allocator cannot find a good group within, let's say, half a second, a
>> group from this list is used.
>> The list is also not 100% reliable (because of the mentioned unclean
>> unmounts), so you need to search for
>> a good group in the list. If no good group is found in the list, the
>> allocator can continue searching.
>> This doesn't help in all situations (e.g. an almost full disk, or every group
>> containing a small amount of data),
>> but in many cases it should be much faster, if the list is not totally
>> outdated.
I think this could be an administrator tunable, if latency is more
important than space efficiency. It can already do this from the
data in the group descriptors that are loaded at mount time.
Cheers, Andreas
>>> It would be possible to fallocate() at some expected size (e.g. average
>> file
>>> size) and then either truncate off the unused space, or fallocate() some
>>> more in another thread when you are close to tunning out.
>>> If the fallocate() is done in a separate thread the latency can be hidden
>>> from the main application?
>> Adding a new thread for fallocate shouldn't be a big problem. But fallocate
>> might
>> generate high disk usage (while searching for a good group). I don't know
>> whether
>> parallel writing from the other thread is quick enough.
>>
>> One question regarding fallocate: I create a new file and do a 100MB
>> fallocate
>> with FALLOC_FL_KEEP_SIZE. Then I write only 70MB to that file and close it.
>> Is the 30 MB unused preallocated space still preallocated for that file
>> after closing
>> it? Or does a close release the preallocated space?
>>
>> Regards,
>> Frank
>>
>>>
>>> Cheers, Andreas
>>>
>>>> And you have to take care about alignment and there are several threads
>> in
>>> the internet which explain why you shouldn't use it (or only in very
>> special
>>> situations and I don't think that my situation is one of them). And ext4
>>> group initialization takes also place when using O_DIRECT (as said before
>>> perhaps I did something wrong).
>>>>
>>>> Regards,
>>>> Frank
>>>>
>>>> ----- Original Nachricht ----
>>>> Von: "Sidorov, Andrei" <Andrei.Sidorov@arrisi.com>
>>>> An: "frankcmoeller@arcor.de" <frankcmoeller@arcor.de>, ext4
>>> development <linux-ext4@vger.kernel.org>
>>>> Datum: 17.05.2013 23:18
>>>> Betreff: Re: Ext4: Slow performance on first write after mount
>>>>
>>>>> Hi Frank,
>>>>>
>>>>> Consider using bigalloc feature (requires reformat), preallocate space
>>>>> with fallocate and use O_DIRECT for reads/writes. However, 188k writes
>>>>> are too small for good throughput with O_DIRECT. You might also want
>> to
>>>>> adjust max_sectors_kb to something larger than 512k.
>>>>>
>>>>> We're doing 6in+6out 20Mbps streams just fine.
>>>>>
>>>>> Regards,
>>>>> Andrei.
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: Ext4: Slow performance on first write after mount
2013-05-20 6:39 ` Andreas Dilger
@ 2013-05-20 11:46 ` Theodore Ts'o
2013-05-21 18:02 ` Aw: " frankcmoeller
2013-05-20 12:37 ` Eric Sandeen
1 sibling, 1 reply; 9+ messages in thread
From: Theodore Ts'o @ 2013-05-20 11:46 UTC (permalink / raw)
To: Andreas Dilger; +Cc: frankcmoeller@arcor.de, linux-ext4@vger.kernel.org
On Mon, May 20, 2013 at 12:39:50AM -0600, Andreas Dilger wrote:
>
> Since we already have a thread starting at mount time to check the
> inode table zeroing, it would also be possible to co-opt this thread
> for preloading the group metadata from the bitmaps.
True. Since I wrote my earlier post, I've also been considering the
possibility that e2fsck or the kernel should simply issue
readahead requests for all of the bitmap blocks. The advantage of
doing it in e2fsck is that it happens earlier.
In fact, since in e2fsck the prereads can be done in parallel, I was
even thinking about a scheme where e2fsck would synchronously force
all of the allocation bitmaps into the buffer cache, and then in the
kernel, we could have a loop which checks to see if the bitmap blocks
were already in cache, and if they were, initialize the buddy
bitmap pages. That way, even if subsequent memory pressure were to
push the buddy bitmap pages and allocation bitmaps out of the cache,
all of the ext4_group_info structures would still be
initialized, and just having the bb_largest_free_order information
would very much help things.
On Sun, 19 May 2013 21:36:02 +0200 (CEST) Frank C Moeller wrote:
> From my point of view (as an end user) I would prefer a built-in solution. I'm also a
> programmer and I can therefore understand why you don't want to change
> anything.
It's not that I don't want to change anything, it's that I'm very
hesitant to add new mount options or new code paths that now need more
testing unless there's no other way of addressing a particular use
case. Another consideration is how to do it in such a way that it
doesn't degrade other users' performance.
Issuing readahead requests for the bitmap blocks might be a good
compromise; since they are readahead requests, as low-priority
requests they won't interfere with anything else going on, and in
practice, unless you are starting your video recording **immediately**
after the reboot, it should address your concern. (Also note that for
most people willing to hack a DVR, adding a line to /etc/rc.local is
usually considered easier than building a new kernel from sources, let
alone making file system format changes that would require a reformat
of their data disk!)
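For reference, the rc.local line alluded to here can be the same `cat mb_groups` workaround quoted earlier in the thread; the exact /proc path below assumes the data disk is sda1:

```shell
# Hypothetical /etc/rc.local addition: walk mb_groups once shortly
# after mount so the buddy-cache initialization cost is paid at boot,
# not on the first recording.  Adjust "sda1" to the actual partition.
if [ -r /proc/fs/ext4/sda1/mb_groups ]; then
    cat /proc/fs/ext4/sda1/mb_groups > /dev/null &
fi
```

Backgrounding the cat keeps it from delaying the rest of the boot scripts.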
So it's not that I'm against solutions that involve kernel changes or
file system format changes. It's just that I want to make sure we
explore the entire solution space, since some of the solutions that
have been suggested so far carry real costs: additional testing, a
backup-reformat-restore pass, and so on.
Regards,
- Ted
* Re: Ext4: Slow performance on first write after mount
2013-05-20 6:39 ` Andreas Dilger
2013-05-20 11:46 ` Theodore Ts'o
@ 2013-05-20 12:37 ` Eric Sandeen
1 sibling, 0 replies; 9+ messages in thread
From: Eric Sandeen @ 2013-05-20 12:37 UTC (permalink / raw)
To: Andreas Dilger
Cc: Theodore Ts'o, frankcmoeller@arcor.de,
linux-ext4@vger.kernel.org
On 5/20/13 1:39 AM, Andreas Dilger wrote:
> On 2013-05-19, at 8:00, Theodore Ts'o <tytso@mit.edu> wrote:
>> On Fri, May 17, 2013 at 06:51:23PM +0200, frankcmoeller@arcor.de wrote:
>>> - Why do you throw away the buffer cache instead of storing it on disk during umount? The initialization of the buffer cache is quite painful for applications which need a specific write throughput.
>>> - A workaround would be to read whole /proc/.../mb_groups file right after every mount. Correct?
>>
>> Simply adding "cat /proc/fs/<dev>/mb_groups > /dev/null" to one of the
>> /etc/init.d scripts, or to /etc/rc.local is probably the simplest fix,
>> yes.
>>
>>> - I can try to add a mount option to initialize the cache at mount time. Would you be interested in such a patch?
>>
>> Given the simple nature of the above workaround, it's not obvious to
>> me that trying to make file system format changes, or even adding a
>> new mount option, is really worth it. This is especially true given
>> that mount -a is sequential, so if there are a large number of big file
>> systems, using this as a mount option would slow down the boot
>> significantly. It would be better to do this in parallel, which you
>> could do in userspace much more easily using the "cat
>> /proc/fs/<dev>/mb_groups" workaround.
>
> Since we already have a thread starting at mount time to check the
> inode table zeroing, it would also be possible to co-opt this thread
> for preloading the group metadata from the bitmaps.
Only up to a point, I hope; if the fs is so big that you start dropping the
first ones that were read, it'd be pointless. So it'd need some nuance,
at the very least.
How much memory are you willing to dedicate to this, and how much does
it really help long-term, given that it's not pinned in any way?
As long as we don't have efficiently-searchable on-disk freespace info
it seems like anything else is just a workaround, I'm afraid.
-Eric
* Aw: Re: Ext4: Slow performance on first write after mount
2013-05-20 11:46 ` Theodore Ts'o
@ 2013-05-21 18:02 ` frankcmoeller
2013-05-22 0:27 ` Andreas Dilger
0 siblings, 1 reply; 9+ messages in thread
From: frankcmoeller @ 2013-05-21 18:02 UTC (permalink / raw)
To: adilger; +Cc: linux-ext4
Hi Andreas,
only a short question:
> I like the idea of keeping the high bits of the buddy bitmap in the group
> descriptor, instead of just the largest free order. It takes the same
> amount of space, but provides more information.
More information for what? The allocator, or more precisely the good_group
function, needs bb_largest_free_order and in some cases the fragment count.
Do you want to use the bitmap for a not-100%-correct fragment count
calculation? Or is there another use for it?
Best regards,
Frank
>
> > On Sun, 19 May 2013 21:36:02 +0200 (CEST) Frank C Moeller wrote:
> >> From my point (end user) I would prefer a builtin solution. I'm also a
> >> programmer and I can therefore understand why you don't want to change
> >> anything.
> >
> > It's not that I don't want to change anything, it's that I'm very
> > hesitant to add new mount options or new code paths that now need more
> > testing unless there's no other way of addressing a particular use
> > case. Another consideration is how to do it in such a way that it
> > doesn't degrade other users' performance.
> >
> > Issuing readahead requests for the bitmap blocks might be a good
> > compromise; since they are readahead requests, as low priority
> > requests they won't interfere with anything else going on, and in
> > practice, unless you are starting your video recording **immediately**
> > after the reboot, it should address your concern.
>
> Right. Some of our users do something similar in userspace to avoid
> slowdown on first write, which doesn't _usually_ happen immediately
> after mount, but this isn't always helpful.
>
> > (Also note that for
> > most people willing to hack a DVR, adding a line to /etc/rc.local is
> > usually considered easier than building a new kernel from sources and
> > then after making file system format changes, requiring a reformat of
> > their data disk!)
>
> I think storing the buddy bitmap top bits in the GDT could be a COMPAT
> feature. It is just a hint that could be ignored or incorrect, since
> the actual bitmap would be authoritative.
>
> Cheers, Andreas
>
> > So it's not that I'm against solutions that involve kernel changes or
> > file system format changes. It's just that I want to make sure we
> > explore the entire solution space, since some of the solutions that
> > have been suggested so far carry real costs: additional testing, a
> > backup-reformat-restore pass, and so on.
> >
> > Regards,
> >
> > - Ted
>
* Re: Aw: Re: Ext4: Slow performance on first write after mount
2013-05-21 18:02 ` Aw: " frankcmoeller
@ 2013-05-22 0:27 ` Andreas Dilger
0 siblings, 0 replies; 9+ messages in thread
From: Andreas Dilger @ 2013-05-22 0:27 UTC (permalink / raw)
To: frankcmoeller; +Cc: linux-ext4
On 2013-05-21, at 12:02 PM, frankcmoeller@arcor.de wrote:
>> I like the idea of keeping the high bits of the buddy bitmap
>> in the group descriptor, instead of just the largest free order.
>> It takes the same amount of space, but provides more information.
> More information for what?
Sorry, what I meant to write was that it provides more information
than just recording e.g. the number of blocks in the largest free
extent.
> The allocator, or more precisely the good_group function,
> needs bb_largest_free_order and in some cases the fragment count. Do you
> want to use the bitmap for a not-100%-correct fragment count calculation? Or is there another use for it?
The bitmap would provide the largest_free_order value directly
(assuming it is at least 4MB in size).
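As an illustration of why the stored top levels would suffice (this is not the on-disk format; the layout and names here are invented), treat each stored level as one bit per buddy chunk, with a set bit meaning the chunk is entirely free:

```c
#include <stdint.h>

/* Illustrative only: levels[0] holds the highest tracked order, one bit
 * per buddy chunk (1 = chunk entirely free).  The largest free order is
 * simply the first level with any bit set, which is why keeping these
 * bits in the group descriptor would hand the allocator
 * bb_largest_free_order without reading the block bitmap at all. */
static int largest_free_order(const uint32_t *levels, int norders)
{
    for (int i = 0; i < norders; i++)
        if (levels[i] != 0)
            return norders - 1 - i; /* convert level index back to order */
    return -1; /* no free chunk at any tracked order */
}
```

A single scan over a handful of words replaces the per-group bitmap read that currently stalls the first allocation after mount.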
Cheers, Andreas
>>> On Sun, 19 May 2013 21:36:02 +0200 (CEST) Frank C Moeller wrote:
>>>> From my point (end user) I would prefer a builtin solution. I'm also a
>>>> programmer and I can therefore understand why you don't want to change
>>>> anything.
>>>
>>> It's not that I don't want to change anything, it's that I'm very
>>> hesitant to add new mount options or new code paths that now need more
>>> testing unless there's no other way of addressing a particular use
>>> case. Another consideration is how to do it in such a way that it
>>> doesn't degrade other users' performance.
>>>
>>> Issuing readahead requests for the bitmap blocks might be a good
>>> compromise; since they are readahead requests, as low priority
>>> requests they won't interfere with anything else going on, and in
>>> practice, unless you are starting your video recording **immediately**
>>> after the reboot, it should address your concern.
>>
>> Right. Some of our users do something similar in userspace to avoid
>> slowdown on first write, which doesn't _usually_ happen immediately
>> after mount, but this isn't always helpful.
>>
>>> (Also note that for
>>> most people willing to hack a DVR, adding a line to /etc/rc.local is
>>> usually considered easier than building a new kernel from sources and
>>> then after making file system format changes, requiring a reformat of
>>> their data disk!)
>>
>> I think storing the buddy bitmap top bits in the GDT could be a COMPAT
>> feature. It is just a hint that could be ignored or incorrect, since
>> the actual bitmap would be authoritative.
>>
>> Cheers, Andreas
>>
>>> So it's not that I'm against solutions that involve kernel changes or
>>> file system format changes. It's just that I want to make sure we
>>> explore the entire solution space, since some of the solutions that
>>> have been suggested so far carry real costs: additional testing, a
>>> backup-reformat-restore pass, and so on.
>>>
>>> Regards,
>>>
>>> - Ted
>>
Thread overview: 9+ messages
[not found] <D1047C91-765D-4EBD-A6CC-869DF0D5AD90@dilger.ca>
2013-05-17 16:51 ` Ext4: Slow performance on first write after mount frankcmoeller
2013-05-17 21:18 ` Sidorov, Andrei
2013-05-19 14:00 ` Theodore Ts'o
2013-05-20 6:39 ` Andreas Dilger
2013-05-20 11:46 ` Theodore Ts'o
2013-05-21 18:02 ` Aw: " frankcmoeller
2013-05-22 0:27 ` Andreas Dilger
2013-05-20 12:37 ` Eric Sandeen
2013-05-19 10:01 Aw: " frankcmoeller
2013-05-19 13:00 ` Aw: " frankcmoeller
2013-05-20 7:04 ` Andreas Dilger