linux-ext4.vger.kernel.org archive mirror
* Re: Aw: Re: Ext4: Slow performance on first write after mount
@ 2013-05-18 22:34 frankcmoeller
  0 siblings, 0 replies; 9+ messages in thread
From: frankcmoeller @ 2013-05-18 22:34 UTC (permalink / raw)
  To: linux-ext4

Hi Andrei,

Thanks for the information! I didn't know that it amounts to around 32 MB of data for a 1 TB disk.

Regarding bigalloc: I read on the ext4 website (https://ext4.wiki.kernel.org/index.php/Bigalloc) this:
"The bigalloc feature first appeared in the v3.2 kernel. As of this writing (in the v3.7 kernel) 
bigalloc still has some problems if the delayed allocation is enabled, especially if the file 
system is close to full."
Is bigalloc really stable? Since which kernel version has it been stable? Were there major bugs in some versions?
I ask because the software we use (OpenPli) runs on different kernel versions for different boxes.
Some boxes use a 3.8.7 kernel, some 3.3.8 and so on (this can't be changed because of closed-source
drivers).

Is an ext4 bigalloc partition resizable? I saw a bug report and a patch from January 2013 regarding this.
If it works well, I could resize my partition and create a new bigalloc one, then move the files and resize
again. Or is a reformat the only possibility?

Regards,
Frank

----- Original Message -----
From:    "Sidorov, Andrei" <Andrei.Sidorov@arrisi.com>
To:      "frankcmoeller@arcor.de" <frankcmoeller@arcor.de>
Date:    18.05.2013 22:34
Subject: Re: Aw: Re: Ext4: Slow performance on first write after mount

> Frank,
> 
> Well, the main point was to use bigalloc. Unfortunately it requires a
> reformat.
> W/o bigalloc there will be ~7800 block groups for a 1T drive. Those groups
> take 32M of on-disk data and up to 64M when it comes to RAM because of the
> runtime buddy bitmaps. I don't think it's worth storing buddy bitmaps on the
> drive. It's not a surprise that it can take a long time to read lots of block
> bitmaps scattered over the drive and construct buddies out of them. And it's
> not a surprise that some of these pages are evicted under high memory pressure.
> With a bigalloc 1M cluster size you get 256 times less metadata (128K
> instead of 32M) and you get all the benefits of faster allocation and
> truncation and less fragmentation.
> 
> Yes, you don't know the file size in advance, but speculatively
> preallocating, say, 128M at a time is clearly a benefit. Truncate to the
> real file size once recording has finished to release the unused
> preallocated space.
> There are some caveats with O_DIRECT, but it is faster if done correctly.
> 
> Regards,
> Andrei.
> 
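
As a rough check of the metadata figures quoted above, the sketch below recomputes the block bitmap overhead for a drive. It assumes 4 KiB blocks, counts only one block bitmap per group, and takes "1T" as 2**40 bytes (which gives 8192 groups, slightly above the ~7800 quoted). The function name is made up for illustration.

```python
# Back-of-the-envelope check of the block-group metadata figures in the thread.
# Assumptions (not stated in the thread): 4 KiB blocks, one block bitmap per
# group, and a "1T" drive of 2**40 bytes.

BLOCK_SIZE = 4096
BITS_PER_BITMAP_BLOCK = BLOCK_SIZE * 8  # one bitmap block tracks 32768 units


def group_metadata(disk_bytes, cluster_size=BLOCK_SIZE):
    """Return (number of groups, total bytes of on-disk block bitmaps)."""
    # Each bit of the bitmap covers one allocation unit (block or cluster).
    group_bytes = BITS_PER_BITMAP_BLOCK * cluster_size
    groups = -(-disk_bytes // group_bytes)  # ceiling division
    return groups, groups * BLOCK_SIZE


if __name__ == "__main__":
    print(group_metadata(2**40))                      # plain 4 KiB blocks
    print(group_metadata(2**40, cluster_size=2**20))  # bigalloc, 1 MiB clusters
```

Under these assumptions the two calls give 32 MiB and 128 KiB of bitmaps respectively, which matches the factor of 256 mentioned in the message.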

[parent not found: <D1047C91-765D-4EBD-A6CC-869DF0D5AD90@dilger.ca>]
* Aw: Re: Ext4: Slow performance on first write after mount
@ 2013-05-20 20:54 frankcmoeller
  0 siblings, 0 replies; 9+ messages in thread
From: frankcmoeller @ 2013-05-20 20:54 UTC (permalink / raw)
  To: adilger, tytso; +Cc: linux-ext4

Hi all,

> > and then in the
> > kernel, we could have a loop which checks to see if the bitmap blocks
> > were already in cache, and if they were, to initialize the buddy
> > bitmaps pages.  That way, even if subsequent memory pressure were to
> > push the buddy bitmap pages and allocation bitmaps out of the cache,
> > it would mean that all of the ext4_group_info structures would be
> > initialized, and just having the bb_largest_free_order information
> > will very much help things.
> 
> I like the idea of keeping the high bits of the buddy bitmap in the group
> descriptor, instead of just the largest free order. It takes the same
> amount of space, but provides more information. 

If you use the reserved field for this, users don't need to reformat their disks
and "only" need to use a new kernel, right? That sounds really good to me.
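
To illustrate what a per-group "largest free order" hint summarises, here is a simplified model. It is not the kernel's mballoc code: it treats a block bitmap as a plain list (1 = in use, 0 = free) and reports the largest k for which some aligned run of 2**k blocks is entirely free. The function name and the list representation are assumptions made for this sketch.

```python
def largest_free_order(bitmap):
    """Return the largest order k with an aligned, fully free run of 2**k
    blocks, or -1 if no block is free. bitmap holds 1 for used, 0 for free."""
    best = -1
    order = 0
    # free[i] is True when aligned chunk i of size 2**order is entirely free.
    free = [bit == 0 for bit in bitmap]
    while free:
        if any(free):
            best = order
        # Merge each pair of adjacent "buddies" into one chunk of the next
        # order; the merged chunk is free only if both halves are free.
        free = [free[i] and free[i + 1] for i in range(0, len(free) - 1, 2)]
        order += 1
    return best
```

Storing only this one number per group loses the detail of how many chunks exist at each order, which is the extra information the "high bits of the buddy bitmap" idea would keep.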

> > Issuing readahead requests for the bitmap blocks might be a good
> > compromise; since they are readahead requests, as low priority
> > requests they won't interfere with anything else going on, and in
> > practice, unless you are starting your video recording **immediately**
> > after the reboot, it should address your concern.

For a normal recording, the PVR starts a few minutes (I think 2-3)
before the recording begins. If a user starts the PVR, timeshift (which is
also a recording) might start around 30-40 seconds after the disk is mounted.
But if that is problematic, I can let it start later.
 
> >  (Also note that for
> > most people willing to hack a DVR, adding a line to /etc/rc.local is
> > usually considered easier than building a new kernel from sources and
> > then after making file system format changes, requiring a reformat of
> > their data disk!)
> 
> I think storing the buddy bitmap top bits in the GDT could be a COMPAT
> feature.  It is just a hint that could be ignored or incorrect, since
> the actual bitmap would be authoritative. 

Yes, adding a line to rc.local is easy. Building a new kernel is also no problem,
if the patch is compatible with the kernel versions currently in use (3.8.7, and
ideally 3.3.8 as well). The PVR uses a good software management system
(something like apt), so users can update their software, including the kernel,
every day.
Reformatting is a problem, but if it can't be avoided, users can choose between
the workaround and a reformat.

Best regards,
Frank

> 
> Cheers, Andreas
> 
> > So it's not that I'm against solutions that involve kernel changes or
> > file system format changes.  It's just that I want to make sure we
> > explore the entire solution space, since there are costs in terms of
> > testing costs, the need to do a backup-reformat-restore pass, etc,
> > etc., to some of the solutions that have been suggested so far.
> > 
> > Regards,
> > 
> >                        - Ted
> 

* Aw: Re: Ext4: Slow performance on first write after mount
@ 2013-05-19 19:36 frankcmoeller
  0 siblings, 0 replies; 9+ messages in thread
From: frankcmoeller @ 2013-05-19 19:36 UTC (permalink / raw)
  To: tytso; +Cc: linux-ext4

Hi Ted,

> Simply adding "cat /proc/fs/<dev>/mb_groups > /dev/null" to one of the
> /etc/init.d scripts, or to /etc/rc.local is probably the simplest fix,
> yes.
Thanks for confirming that the workaround fixes the problem!

> Given the simple nature of the above workaround, it's not obvious to
> me that trying to make file system format changes, or even adding a
> new mount option, is really worth it.  This is especially true given
> that mount -a is sequential so if there are a large number of big file
> systems, using this as a mount option would slow down the boot
> significantly.  It would be better to do this in parallel, which you
> could do in userspace much more easily using the "cat
> /proc/fs/<dev>/mb_groups" workaround.
From my point of view (as an end user) I would prefer a built-in solution. I'm also a
programmer, so I can understand why you don't want to change
anything. It's a little bit surprising to me that only a few people seem to have
this problem. But I believe that many live with it and don't know that the slow
boot or write is caused by ext4 (and many end users have small ext4 partitions,
while servers run 24/7 without remounting the fs...). Only a few applications
rely on a constant write throughput.
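
The userspace workaround discussed above (reading each filesystem's mb_groups file and discarding the output, for several filesystems in parallel) could be sketched roughly as follows. The function names are made up for this example, and the caller is expected to pass in the actual mb_groups paths for the mounted devices.

```python
from concurrent.futures import ThreadPoolExecutor


def read_and_discard(path, chunk_size=1 << 16):
    """Read a file to the end, discarding the data; return bytes read."""
    total = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def warm_mb_groups(paths):
    """Read every given mb_groups file concurrently, one thread per file.

    Returns a dict mapping each path to the number of bytes read.
    """
    paths = list(paths)
    if not paths:
        return {}
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        return dict(zip(paths, pool.map(read_and_discard, paths)))
```

A boot script might call it as warm_mb_groups(glob.glob("/proc/fs/ext4/*/mb_groups")). That glob pattern is an assumption to verify on the target kernel, since the quoted command abbreviates the path. Threads are sufficient here because the work is I/O-bound.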

> > - I can see (see debug output) that the call to ext4_wait_block_bitmap at
> > mballoc.c line 848 takes the longest time during buffer cache initialization
> > (some 1/100 of a second). Can this be improved?
> 
> The delay is caused purely by I/O delay, so short of replacing the HDD
> with a SSD, not really....
Well, SSDs are really cool, but for a PVR an HDD is still a good choice: cheap,
big, more reliable (hopefully), quick enough, and it has no problems writing several
GB of data per day.

Regards,
Frank

> 
> Regards,
> 
> 						- Ted
> 

* Re: Aw: Re: Ext4: Slow performance on first write after mount
@ 2013-05-19 10:01 frankcmoeller
  0 siblings, 0 replies; 9+ messages in thread
From: frankcmoeller @ 2013-05-19 10:01 UTC (permalink / raw)
  To: linux-ext4

Hi Andreas,

> Part of the problem is that filesystems are rarely unmounted cleanly, so it
> means that this information would need to be updated periodically to disk so
> that it is available after a crash.
> I wouldn't object to some kind of "lazy" updating of group information on
> disk that at least gives the newly-mounted filesystem a rough idea of what
> each group's usage is. It wouldn't have to be totally accurate (it wouldn't
> replace the bitmaps), but maybe 2 bits per group would be enough as a
> starting point?
> For a 32 TB filesystem that would be about 16 4kB blocks of bits that would
> be updated periodically (e.g. every five minutes or so). Since the allocator
> will typically work in successive groups that might not cause too much
> churn. 
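
The "about 16 4kB blocks" estimate for a 32 TB filesystem can be reproduced with a little arithmetic. The sketch assumes 4 KiB blocks without bigalloc (so each group spans 128 MiB) and takes 32 TB as 32 * 2**40 bytes; the function name is hypothetical.

```python
def hint_blocks(fs_bytes, bits_per_group=2, block_size=4096):
    """Return how many blocks are needed to hold a per-group usage hint."""
    # One bitmap block has block_size * 8 bits, each covering one block.
    group_bytes = block_size * 8 * block_size
    groups = -(-fs_bytes // group_bytes)            # ceiling division
    hint_bytes = -(-groups * bits_per_group // 8)   # bits -> bytes, rounded up
    return -(-hint_bytes // block_size)             # bytes -> blocks, rounded up


if __name__ == "__main__":
    print(hint_blocks(32 * 2**40))
```

With these assumptions there are 262144 groups, so 2 bits per group come to 64 KiB, i.e. 16 blocks of 4 KiB.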

Yes, you're right. The stored data wouldn't be 100% reliable. And yes, it would be really good if,
right after mount, the filesystem knew a bit more so that it could find a good group more quickly.
What do you think of this:
1. I have already read this in some discussions: you already store the amount of free space for every
  group. Why not also store the size of the biggest contiguous free extent in a group? Then you
  don't have to read the whole group.
2. What about a list (in memory and also stored on disk) of all unused groups (1 bit for every group)?
  If the allocator cannot find a good group within, let's say, half a second, a group from this list is used.
  The list would also not be 100% reliable (because of the unclean unmounts mentioned above), so you need to
  search for a good group in the list. If no good group is found in the list, the allocator can continue searching.
  This doesn't help in all situations (e.g. an almost full disk, or every group containing a small amount of data),
  but in many cases it should be much faster, as long as the list is not totally outdated.

> It would be possible to fallocate() at some expected size (e.g. average file
> size) and then either truncate off the unused space, or fallocate() some
> more in another thread when you are close to running out. 
> If the fallocate() is done in a separate thread the latency can be hidden
> from the main application?
Adding a new thread for fallocate shouldn't be a big problem. But fallocate might 
generate high disk usage (while searching for a good group). I don't know whether
parallel writing from the other thread is quick enough.

One question regarding fallocate: I create a new file and do a 100MB fallocate 
with FALLOC_FL_KEEP_SIZE. Then I write only 70MB to that file and close it.
Is the 30 MB unused preallocated space still preallocated for that file after closing
it? Or does a close release the preallocated space?
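
For reference, the "preallocate in steps, then truncate to the real size" pattern suggested in this thread might look roughly like the sketch below. It uses os.posix_fallocate, which, unlike the FALLOC_FL_KEEP_SIZE mode mentioned above, extends the visible file size, so the final os.ftruncate is what trims the unused tail. The function name, the default step size and the data_chunks iterable are assumptions made for this example.

```python
import os

CHUNK = 128 * 1024 * 1024  # preallocation step; 128M as suggested in the thread


def record(path, data_chunks, prealloc_step=CHUNK):
    """Write data_chunks to path, preallocating ahead of the writes.

    Returns the final file size after trimming the unused preallocation.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        allocated = 0
        for chunk in data_chunks:
            # Extend the preallocated region before a write would run past it.
            while written + len(chunk) > allocated:
                os.posix_fallocate(fd, allocated, prealloc_step)
                allocated += prealloc_step
            view = memoryview(chunk)
            while view:  # os.write may write fewer bytes than requested
                n = os.write(fd, view)
                view = view[n:]
            written += len(chunk)
        # Release whatever was preallocated but never written.
        os.ftruncate(fd, written)
        return written
    finally:
        os.close(fd)
```

In a real recorder the preallocation call could be moved to a helper thread, as proposed above, so that its latency does not stall the writer.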

Regards,
Frank

> 
> Cheers, Andreas 
> 
> > You also have to take care of alignment, and there are several threads on
> > the internet which explain why you shouldn't use it (or only in very special
> > situations, and I don't think that my situation is one of them). And ext4
> > group initialization also takes place when using O_DIRECT (as said before,
> > perhaps I did something wrong).
> > 
> > Regards,
> > Frank
> > 
> > ----- Original Message -----
> > From:    "Sidorov, Andrei" <Andrei.Sidorov@arrisi.com>
> > To:      "frankcmoeller@arcor.de" <frankcmoeller@arcor.de>, ext4
> > development <linux-ext4@vger.kernel.org>
> > Date:    17.05.2013 23:18
> > Subject: Re: Ext4: Slow performance on first write after mount
> > 
> >> Hi Frank,
> >> 
> >> Consider using bigalloc feature (requires reformat), preallocate space
> >> with fallocate and use O_DIRECT for reads/writes. However, 188k writes
> >> are too small for good throughput with O_DIRECT. You might also want to
> >> adjust max_sectors_kb to something larger than 512k.
> >> 
> >> We're doing 6in+6out 20Mbps streams just fine.
> >> 
> >> Regards,
> >> Andrei.
> >> 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

* Aw: Re: Ext4: Slow performance on first write after mount
@ 2013-05-18 10:50 frankcmoeller
  2013-05-18 20:34 ` Sidorov, Andrei
  2013-05-19  1:49 ` Andreas Dilger
  0 siblings, 2 replies; 9+ messages in thread
From: frankcmoeller @ 2013-05-18 10:50 UTC (permalink / raw)
  To: linux-ext4

Hi Andrei,

Thanks for your quick answer!
Perhaps you misunderstood me. The general write performance is quite good. We can record more than 4 HD channels at the same time without problems, except for the problems with the first write after mount. There are also some users who have problems 1-2 times during a recording.
I think the ext4 group initialization is the main problem, because it takes so long (as written before: around 1300 groups per second). Why don't you store the gathered information on disk when an unmount takes place?

With fallocate, the group initialization is partly done before the first write. This helps, but it's no solution, because the final file size is unknown, so I cannot preallocate space for the complete file. And once the preallocated space is consumed, the same initialization problem arises again until all groups are initialized.

I also made some tests with O_DIRECT (my first tests ever). Perhaps I did something wrong, but it isn't very fast. You also have to take care of alignment, and there are several threads on the internet which explain why you shouldn't use it (or only in very special situations, and I don't think that my situation is one of them). And ext4 group initialization also takes place when using O_DIRECT (as said before, perhaps I did something wrong).

Regards,
Frank

----- Original Message -----
From:    "Sidorov, Andrei" <Andrei.Sidorov@arrisi.com>
To:      "frankcmoeller@arcor.de" <frankcmoeller@arcor.de>, ext4 development <linux-ext4@vger.kernel.org>
Date:    17.05.2013 23:18
Subject: Re: Ext4: Slow performance on first write after mount

> Hi Frank,
> 
> Consider using bigalloc feature (requires reformat), preallocate space
> with fallocate and use O_DIRECT for reads/writes. However, 188k writes
> are too small for good throughput with O_DIRECT. You might also want to
> adjust max_sectors_kb to something larger than 512k.
> 
> We're doing 6in+6out 20Mbps streams just fine.
> 
> Regards,
> Andrei.
> 


end of thread, other threads:[~2013-05-22  0:27 UTC | newest]

Thread overview: 9+ messages
2013-05-18 22:34 Aw: Re: Ext4: Slow performance on first write after mount frankcmoeller
     [not found] <D1047C91-765D-4EBD-A6CC-869DF0D5AD90@dilger.ca>
2013-05-17 16:51 ` frankcmoeller
2013-05-19 14:00   ` Theodore Ts'o
2013-05-20  6:39     ` Andreas Dilger
2013-05-20 11:46       ` Theodore Ts'o
2013-05-21 18:02         ` Aw: " frankcmoeller
2013-05-22  0:27           ` Andreas Dilger
  -- strict thread matches above, loose matches on Subject: below --
2013-05-20 20:54 frankcmoeller
2013-05-19 19:36 frankcmoeller
2013-05-19 10:01 frankcmoeller
2013-05-18 10:50 frankcmoeller
2013-05-18 20:34 ` Sidorov, Andrei
2013-05-19  1:49 ` Andreas Dilger
