* After unlinking a large file on ext4, the process stalls for a long time
@ 2014-07-16 14:09 Mason
2014-07-16 15:16 ` John Stoffel
0 siblings, 1 reply; 13+ messages in thread
From: Mason @ 2014-07-16 14:09 UTC (permalink / raw)
To: linux-kernel, linux-fsdevel
Hello everyone,
I'm using Linux (3.1.10 at the moment) on an embedded system similar in
spec to a desktop PC from 15 years ago (256 MB RAM, 800-MHz CPU, USB).
I need to be able to create large files (50-1000 GB) "as fast as possible".
These files are created on an external hard disk drive, connected over
Hi-Speed USB (typical throughput 30 MB/s).
Sparse files were not an acceptable solution (because the space must be
reserved, and the operation must fail if the space is unavailable).
And filling the file with zeros was too slow (typically 35 s/GB).
Someone mentioned fallocate on an ext4 partition.
So I create an ext4 partition with
$ mkfs.ext4 -m 0 -i 1024000 -O ^has_journal,^huge_file /dev/sda1
(Using e2fsprogs-1.42.10 if it matters)
And mount with "typical" mount options
$ mount -t ext4 /dev/sda1 /mnt/hdd -o noexec,noatime
/dev/sda1 on /mnt/hdd type ext4 (rw,noexec,noatime,barrier=1)
I wrote a small test program to create a large file, then immediately
unlink it.
My problem is that, while file creation is "fast enough" (4 seconds
for a 300 GB file) and unlink is "immediate", the process hangs
while it waits (I suppose) for the OS to actually complete the
operation (almost two minutes for a 300 GB file).
I also note that the (weak) CPU is pegged, so perhaps this problem
does not occur on a desktop workstation?
/tmp # time ./foo /mnt/hdd/xxx 5
posix_fallocate(fd, 0, size_in_GiB << 30): 0 [68 ms]
unlink(filename): 0 [0 ms]
0.00user 1.86system 0:01.92elapsed 97%CPU (0avgtext+0avgdata 528maxresident)k
0inputs+0outputs (0major+168minor)pagefaults 0swaps
/tmp # time ./foo /mnt/hdd/xxx 10
posix_fallocate(fd, 0, size_in_GiB << 30): 0 [141 ms]
unlink(filename): 0 [0 ms]
0.00user 3.71system 0:03.83elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k
0inputs+0outputs (0major+168minor)pagefaults 0swaps
/tmp # time ./foo /mnt/hdd/xxx 100
posix_fallocate(fd, 0, size_in_GiB << 30): 0 [1882 ms]
unlink(filename): 0 [0 ms]
0.00user 37.12system 0:38.93elapsed 95%CPU (0avgtext+0avgdata 528maxresident)k
0inputs+0outputs (0major+168minor)pagefaults 0swaps
/tmp # time ./foo /mnt/hdd/xxx 300
posix_fallocate(fd, 0, size_in_GiB << 30): 0 [3883 ms]
unlink(filename): 0 [0 ms]
0.00user 111.38system 1:55.04elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k
0inputs+0outputs (0major+168minor)pagefaults 0swaps
QUESTIONS:
1) Did I provide enough information for someone to reproduce?
2) Is this expected behavior?
3) Are there knobs I can tweak (at FS creation, or at mount time)
to improve the performance of file unlinking?
(Maybe there is a safety/performance trade-off?)
My test program:
#define _FILE_OFFSET_BITS 64
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
#include <time.h>
#define BENCH(op) do { \
struct timespec t0; clock_gettime(CLOCK_MONOTONIC, &t0); \
int err = op; \
struct timespec t1; clock_gettime(CLOCK_MONOTONIC, &t1); \
int ms = (t1.tv_sec-t0.tv_sec)*1000 + (t1.tv_nsec-t0.tv_nsec)/1000000; \
printf("%s: %d [%d ms]\n", #op, err, ms); } while(0)
int main(int argc, char **argv)
{
if (argc != 3) { puts("Usage: prog filename size"); return 42; }
char *filename = argv[1];
int fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600);
if (fd < 0) { perror("open"); return 1; }
long long size_in_GiB = atoi(argv[2]);
BENCH(posix_fallocate(fd, 0, size_in_GiB << 30));
BENCH(unlink(filename));
return 0;
}
--
Regards.
^ permalink raw reply [flat|nested] 13+ messages in thread

* Re: After unlinking a large file on ext4, the process stalls for a long time
2014-07-16 14:09 After unlinking a large file on ext4, the process stalls for a long time Mason
@ 2014-07-16 15:16 ` John Stoffel
2014-07-16 17:16 ` Mason
0 siblings, 1 reply; 13+ messages in thread
From: John Stoffel @ 2014-07-16 15:16 UTC (permalink / raw)
To: Mason; +Cc: linux-kernel, linux-fsdevel

Mason> I'm using Linux (3.1.10 at the moment) on an embedded system
Mason> similar in spec to a desktop PC from 15 years ago (256 MB RAM,
Mason> 800-MHz CPU, USB).

Sounds like a Raspberry Pi... And have you investigated using
something like XFS as your filesystem instead?

Mason> I need to be able to create large files (50-1000 GB) "as fast
Mason> as possible". These files are created on an external hard disk
Mason> drive, connected over Hi-Speed USB (typical throughput 30
Mason> MB/s).

Really... so you just need to create allocations of space as quickly
as possible, which will then be filled in later with actual data? So
basically someone will say "give me 600G of space reservation" and
then will eventually fill it up, otherwise you say "Nope, can't do it!"

Mason> Sparse files were not an acceptable solution (because the space
Mason> must be reserved, and the operation must fail if the space is
Mason> unavailable). And filling the file with zeros was too slow
Mason> (typically 35 s/GB).

Mason> Someone mentioned fallocate on an ext4 partition.

Mason> So I create an ext4 partition with
Mason> $ mkfs.ext4 -m 0 -i 1024000 -O ^has_journal,^huge_file /dev/sda1
Mason> (Using e2fsprogs-1.42.10 if it matters)

Mason> And mount with "typical" mount options
Mason> $ mount -t ext4 /dev/sda1 /mnt/hdd -o noexec,noatime
Mason> /dev/sda1 on /mnt/hdd type ext4 (rw,noexec,noatime,barrier=1)

Mason> I wrote a small test program to create a large file, then immediately
Mason> unlink it.
Mason> My problem is that, while file creation is "fast enough" (4 seconds
Mason> for a 300 GB file) and unlink is "immediate", the process hangs
Mason> while it waits (I suppose) for the OS to actually complete the
Mason> operation (almost two minutes for a 300 GB file).

Mason> I also note that the (weak) CPU is pegged, so perhaps this problem
Mason> does not occur on a desktop workstation?

Mason> /tmp # time ./foo /mnt/hdd/xxx 5
Mason> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [68 ms]
Mason> unlink(filename): 0 [0 ms]
Mason> 0.00user 1.86system 0:01.92elapsed 97%CPU (0avgtext+0avgdata 528maxresident)k
Mason> 0inputs+0outputs (0major+168minor)pagefaults 0swaps

Mason> /tmp # time ./foo /mnt/hdd/xxx 10
Mason> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [141 ms]
Mason> unlink(filename): 0 [0 ms]
Mason> 0.00user 3.71system 0:03.83elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k
Mason> 0inputs+0outputs (0major+168minor)pagefaults 0swaps

Mason> /tmp # time ./foo /mnt/hdd/xxx 100
Mason> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [1882 ms]
Mason> unlink(filename): 0 [0 ms]
Mason> 0.00user 37.12system 0:38.93elapsed 95%CPU (0avgtext+0avgdata 528maxresident)k
Mason> 0inputs+0outputs (0major+168minor)pagefaults 0swaps

Mason> /tmp # time ./foo /mnt/hdd/xxx 300
Mason> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [3883 ms]
Mason> unlink(filename): 0 [0 ms]
Mason> 0.00user 111.38system 1:55.04elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k
Mason> 0inputs+0outputs (0major+168minor)pagefaults 0swaps

Mason> QUESTIONS:

Mason> 1) Did I provide enough information for someone to reproduce?

Sure, but you didn't give enough information to explain what you're
trying to accomplish here. And what the use case is. Also, since you
know you cannot fill 500Gb in any sort of reasonable time over USB2,
why are you concerned that the delete takes so long?

I think that maybe using the filesystem for the reservations is the
wrong approach.
You should use a simple daemon which listens for
requests, and then checks the filesystem space and decides if it can
honor them or not. Then you just store the files as they get written...

Mason> 2) Is this expected behavior?

Sure, unlinking a 1Gb file that's been written to means (on EXT4)
that you need to update all the filesystem structures. Now it should
be quicker honestly, but maybe you're not mounting it with a journal?

And have you tried tuning the filesystem to use larger allocations and
blocks? You're not going to make a lot of files on there obviously,
but just a few large ones.

Mason> 3) Are there knobs I can tweak (at FS creation, or at mount
Mason> time) to improve the performance of file unlinking? (Maybe
Mason> there is a safety/performance trade-off?)

Sure, there are all kinds of things you can do. For example, how
many of these files are you expecting to store? Will you have to be
able to handle writing of more than one file at a time? Or are they
purely sequential?

If you are creating a small embedded system to manage a bunch of USB2
hard drives and write data to them with a space reservation process,
then you need to make sure you can actually handle the data throughput
requirements. And I'm not sure you can.

John

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: After unlinking a large file on ext4, the process stalls for a long time
2014-07-16 15:16 ` John Stoffel
@ 2014-07-16 17:16 ` Mason
2014-07-16 20:18 ` John Stoffel
2014-07-17 3:37 ` Andreas Dilger
0 siblings, 2 replies; 13+ messages in thread
From: Mason @ 2014-07-16 17:16 UTC (permalink / raw)
To: John Stoffel; +Cc: linux-kernel, linux-fsdevel

(I hope you'll forgive me for reformatting the quote characters
to my taste.)

On 16/07/2014 17:16, John Stoffel wrote:
> Mason wrote:
>
>> I'm using Linux (3.1.10 at the moment) on an embedded system
>> similar in spec to a desktop PC from 15 years ago (256 MB RAM,
>> 800-MHz CPU, USB).
>
> Sounds like a Raspberry Pi... And have you investigated using
> something like XFS as your filesystem instead?

The system is a set-top box (DVB-S2 receiver). The system CPU is
MIPS 74K, not ARM (not that it matters, in this case).

No, I have not investigated other file systems (yet).

>> I need to be able to create large files (50-1000 GB) "as fast
>> as possible". These files are created on an external hard disk
>> drive, connected over Hi-Speed USB (typical throughput 30 MB/s).
>
> Really... so you just need to create allocations of space as quickly
> as possible,

I may not have been clear. The creation needs to be fast (in UX terms,
so less than 5-10 seconds), but it only occurs a few times during the
lifetime of the system.

> which will then be filled in later with actual data?

Yes. In fact, I use the loopback device to format the file as an
ext4 partition.

> basically someone will say "give me 600G of space reservation" and
> then will eventually fill it up, otherwise you say "Nope, can't do
> it!"
Right, take a 1000 GB disk,

Reserve(R1 = 300 GB) <- SUCCESS
Reserve(R2 = 300 GB) <- SUCCESS
Reserve(R3 = 300 GB) <- SUCCESS
Reserve(R4 = 300 GB) <- FAIL
Delete (R1) <- SUCCESS
Reserve(R4 = 300 GB) <- SUCCESS

>> So I create an ext4 partition with
>> $ mkfs.ext4 -m 0 -i 1024000 -O ^has_journal,^huge_file /dev/sda1
>> (Using e2fsprogs-1.42.10 if it matters)
>>
>> And mount with "typical" mount options
>> $ mount -t ext4 /dev/sda1 /mnt/hdd -o noexec,noatime
>> /dev/sda1 on /mnt/hdd type ext4 (rw,noexec,noatime,barrier=1)
>>
>> I wrote a small test program to create a large file, then immediately
>> unlink it.
>>
>> My problem is that, while file creation is "fast enough" (4 seconds
>> for a 300 GB file) and unlink is "immediate", the process hangs
>> while it waits (I suppose) for the OS to actually complete the
>> operation (almost two minutes for a 300 GB file).

[snip performance numbers]

>> QUESTIONS:
>>
>> 1) Did I provide enough information for someone to reproduce?
>
> Sure, but you didn't give enough information to explain what you're
> trying to accomplish here. And what the use case is. Also, since you
> know you cannot fill 500Gb in any sort of reasonable time over USB2,
> why are you concerned that the delete takes so long?

I don't understand your question. If the user asks to create a 300 GB
file, then immediately realizes that he won't need it, and asks for it
to be deleted, I don't see why the process should hang for 2 minutes.

The use case is
- allocate a large file
- stick a file system on it
- store stuff (typically video files) inside this "private" FS
- when the user decides he doesn't need it anymore, unmount and unlink
(I also have a resize operation in there, but I wanted to get the
basics before taking the hard stuff head on.)

So, in the limit, we don't store anything at all: just create and
immediately delete. This was my test.

> I think that maybe using the filesystem for the reservations is the
> wrong approach.
> You should use a simple daemon which listens for
> requests, and then checks the filesystem space and decides if it can
> honor them or not.

I considered using ACTUAL partitions, but there were too many downsides.
NB: there may be several "containers" active at the same time.

>> 2) Is this expected behavior?
>
> Sure, unlinking a 1Gb file that's been written to means (on EXT4)
> that you need to update all the filesystem structures.

Well creating such a file means updating all the filesystem structures,
yet that operation is 30x faster. Also note that I have not written
ANYTHING to the file; my test did:

open();
posix_fallocate();
unlink();

> Now it should
> be quicker honestly, but maybe you're not mounting it with a journal?

Indeed no, I expected the journal to slow things down.
$ mkfs.ext4 -m 0 -i 1024000 -O ^has_journal,^huge_file /dev/sda1
https://lwn.net/Articles/313514/

Also, the user might format a Flash-based device, and I've read that
journals and Flash-based storage are not a good mix.

> And have you tried tuning the filesystem to use larger allocations and
> blocks? You're not going to make a lot of files on there obviously,
> but just a few large ones.

Are you suggesting bigalloc?
https://ext4.wiki.kernel.org/index.php/Bigalloc

1. It is not supported by my kernel AFAIU.
2. It is still experimental AFAICT.
3. Resizing bigalloc file systems is not well tested.

>> 3) Are there knobs I can tweak (at FS creation, or at mount
>> time) to improve the performance of file unlinking? (Maybe
>> there is a safety/performance trade-off?)
>
> Sure, there are all kinds of things you can do. For example, how
> many of these files are you expecting to store?

I do not support more than 8 containers. (But the drive is used to
store other (mostly large) files.)

This is why I specified "-i 1024000" to mkfs.ext4, to limit the number
of inodes created. Is this incorrect?

What other improvements would you suggest?
(I'd like to get the unlink operation to complete in < 10 seconds.)

> Will you have to be able to handle writing of more than one file
> at a time? Or are they purely sequential?

All containers may be active concurrently, and since they are proper
file systems, they are written to as the FS driver sees fit (i.e. not
sequentially). However, the max write throughput is limited to 3 MB/s
(which USB2 should easily manage to handle).

> If you are creating a small embedded system to manage a bunch of USB2
> hard drives and write data to them with a space reservation process,
> then you need to make sure you can actually handle the data throughput
> requirements. And I'm not sure you can.

AFAIK, the plan is to support only one drive, and not to write faster
than 3 MB/s. I think it should handle it.

Thanks for your insightful questions :-)

Regards.

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: After unlinking a large file on ext4, the process stalls for a long time
2014-07-16 17:16 ` Mason
@ 2014-07-16 20:18 ` John Stoffel
2014-07-16 21:46 ` Mason
2014-07-17 3:37 ` Andreas Dilger
1 sibling, 1 reply; 13+ messages in thread
From: John Stoffel @ 2014-07-16 20:18 UTC (permalink / raw)
To: Mason; +Cc: John Stoffel, linux-kernel, linux-fsdevel

Mason> (I hope you'll forgive me for reformatting the quote characters
Mason> to my taste.)

No problem.

Mason> On 16/07/2014 17:16, John Stoffel wrote:
>> Mason wrote:
>>
>>> I'm using Linux (3.1.10 at the moment) on an embedded system
>>> similar in spec to a desktop PC from 15 years ago (256 MB RAM,
>>> 800-MHz CPU, USB).
>>
>> Sounds like a Raspberry Pi... And have you investigated using
>> something like XFS as your filesystem instead?

Mason> The system is a set-top box (DVB-S2 receiver). The system CPU is
Mason> MIPS 74K, not ARM (not that it matters, in this case).

So it's a slow slow box... and it's only going to handle writing data
at 3 MB/s... so why do you insist that the filesystem work at magic
speeds?

Mason> No, I have not investigated other file systems (yet).

>>> I need to be able to create large files (50-1000 GB) "as fast
>>> as possible". These files are created on an external hard disk
>>> drive, connected over Hi-Speed USB (typical throughput 30 MB/s).
>>
>> Really... so you just need to create allocations of space as quickly
>> as possible,

Mason> I may not have been clear. The creation needs to be fast (in UX terms,
Mason> so less than 5-10 seconds), but it only occurs a few times during the
Mason> lifetime of the system.

If this only happens a few times, why do you care how quick the delete
is? And if it's only happening a few times, why don't you just do the
space reservation OUTSIDE of the filesystem? Or do you need to do
encryption of these containers and strictly segregate them?
Basically, implement a daemon which knows how much free space is on
the device, how much is already pre-committed to other users, and then
how much free space there is. If the space isn't actually used, then
you don't care, because you've reserved it.

>> which will then be filled in later with actual data?

Mason> Yes. In fact, I use the loopback device to format the file as an
Mason> ext4 partition.

Why are you doing it like this? What advantage does this buy you? In
any case, you're now slowing things down because you have the overhead
of the base filesystem, which you then create a large file on top of,
which you then mount and format with a SECOND filesystem.

Instead, you should probably just have a small boot/OS filesystem, and
then put the rest of the storage under LVM control. At that point,
you can reserve space using 'lvcreate ...' which will succeed or
fail. If good, create a filesystem in there and use it. When you
need to delete it, just unmount the LV and just do 'lvdestroy' which
should be much faster, since you won't bother to zero out the blocks.

Now I don't know offhand if lvcreate on top of a recently deleted LV
volume will make sure to zero all the blocks, but I suspect so, and
probably only when they're used.

Does this make more sense? It seems to fit your strange requirements
better...

John

>> basically someone will say "give me 600G of space reservation" and
>> then will eventually fill it up, otherwise you say "Nope, can't do
>> it!"
Mason> Right, take a 1000 GB disk,

Mason> Reserve(R1 = 300 GB) <- SUCCESS
Mason> Reserve(R2 = 300 GB) <- SUCCESS
Mason> Reserve(R3 = 300 GB) <- SUCCESS
Mason> Reserve(R4 = 300 GB) <- FAIL
Mason> Delete (R1) <- SUCCESS
Mason> Reserve(R4 = 300 GB) <- SUCCESS

>>> So I create an ext4 partition with
>>> $ mkfs.ext4 -m 0 -i 1024000 -O ^has_journal,^huge_file /dev/sda1
>>> (Using e2fsprogs-1.42.10 if it matters)
>>>
>>> And mount with "typical" mount options
>>> $ mount -t ext4 /dev/sda1 /mnt/hdd -o noexec,noatime
>>> /dev/sda1 on /mnt/hdd type ext4 (rw,noexec,noatime,barrier=1)
>>>
>>> I wrote a small test program to create a large file, then immediately
>>> unlink it.
>>>
>>> My problem is that, while file creation is "fast enough" (4 seconds
>>> for a 300 GB file) and unlink is "immediate", the process hangs
>>> while it waits (I suppose) for the OS to actually complete the
>>> operation (almost two minutes for a 300 GB file).

Mason> [snip performance numbers]

>>> QUESTIONS:
>>>
>>> 1) Did I provide enough information for someone to reproduce?
>>
>> Sure, but you didn't give enough information to explain what you're
>> trying to accomplish here. And what the use case is. Also, since you
>> know you cannot fill 500Gb in any sort of reasonable time over USB2,
>> why are you concerned that the delete takes so long?

Mason> I don't understand your question. If the user asks to create a 300 GB
Mason> file, then immediately realizes that he won't need it, and asks for it
Mason> to be deleted, I don't see why the process should hang for 2 minutes.

Mason> The use case is
Mason> - allocate a large file
Mason> - stick a file system on it
Mason> - store stuff (typically video files) inside this "private" FS
Mason> - when the user decides he doesn't need it anymore, unmount and unlink
Mason> (I also have a resize operation in there, but I wanted to get the
Mason> basics before taking the hard stuff head on.)
Mason> So, in the limit, we don't store anything at all: just create and
Mason> immediately delete. This was my test.

>> I think that maybe using the filesystem for the reservations is the
>> wrong approach. You should use a simple daemon which listens for
>> requests, and then checks the filesystem space and decides if it can
>> honor them or not.

Mason> I considered using ACTUAL partitions, but there were too many downsides.
Mason> NB: there may be several "containers" active at the same time.

>>> 2) Is this expected behavior?
>>
>> Sure, unlinking a 1Gb file that's been written to means (on EXT4)
>> that you need to update all the filesystem structures.

Mason> Well creating such a file means updating all the filesystem structures,
Mason> yet that operation is 30x faster. Also note that I have not written
Mason> ANYTHING to the file; my test did:

Mason> open();
Mason> posix_fallocate();
Mason> unlink();

>> Now it should
>> be quicker honestly, but maybe you're not mounting it with a journal?

Mason> Indeed no, I expected the journal to slow things down.
Mason> $ mkfs.ext4 -m 0 -i 1024000 -O ^has_journal,^huge_file /dev/sda1
Mason> https://lwn.net/Articles/313514/

Mason> Also, the user might format a Flash-based device, and I've read that
Mason> journals and Flash-based storage are not a good mix.

>> And have you tried tuning the filesystem to use larger allocations and
>> blocks? You're not going to make a lot of files on there obviously,
>> but just a few large ones.

Mason> Are you suggesting bigalloc?
Mason> https://ext4.wiki.kernel.org/index.php/Bigalloc

Mason> 1. It is not supported by my kernel AFAIU.
Mason> 2. It is still experimental AFAICT.
Mason> 3. Resizing bigalloc file systems is not well tested.

>>> 3) Are there knobs I can tweak (at FS creation, or at mount
>>> time) to improve the performance of file unlinking? (Maybe
>>> there is a safety/performance trade-off?
>>
>> Sure, there are all kinds of things you can do.
>> For example, how
>> many of these files are you expecting to store?

Mason> I do not support more than 8 containers. (But the drive is used to
Mason> store other (mostly large) files.)

Mason> This is why I specified "-i 1024000" to mkfs.ext4, to limit the number
Mason> of inodes created. Is this incorrect?

Mason> What other improvements would you suggest?
Mason> (I'd like to get the unlink operation to complete in < 10 seconds.)

>> Will you have to be able to handle writing of more than one file
>> at a time? Or are they purely sequential?

Mason> All containers may be active concurrently, and since they are proper
Mason> file systems, they are written to as the FS driver sees fit (i.e. not
Mason> sequentially). However, the max write throughput is limited to 3 MB/s
Mason> (which USB2 should easily manage to handle).

>> If you are creating a small embedded system to manage a bunch of USB2
>> hard drives and write data to them with a space reservation process,
>> then you need to make sure you can actually handle the data throughput
>> requirements. And I'm not sure you can.

Mason> AFAIK, the plan is to support only one drive, and not to write faster
Mason> than 3 MB/s. I think it should handle it.

Mason> Thanks for your insightful questions :-)

Mason> Regards.

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: After unlinking a large file on ext4, the process stalls for a long time
2014-07-16 20:18 ` John Stoffel
@ 2014-07-16 21:46 ` Mason
0 siblings, 0 replies; 13+ messages in thread
From: Mason @ 2014-07-16 21:46 UTC (permalink / raw)
To: John Stoffel; +Cc: linux-kernel, linux-fsdevel

On 16/07/2014 22:18, John Stoffel wrote:
> Mason wrote:
>
>> The system is a set-top box (DVB-S2 receiver). The system CPU is
>> MIPS 74K, not ARM (not that it matters, in this case).
>
> So it's a slow slow box... and it's only going to handle writing data
> at 3 MBps...

What do you mean "it's only going to handle writing data at 3 MBps"?
It's a good thing that I am nowhere near the peak throughput! I have
tested the system for hours, and it can handle 30 MB/s with only a few
minor hiccups (unexplained, so far).

> so why do you insist that the filesystem work at magic speeds?

Why do you say that deleting an empty 300-GB fallocated file in less
than 10 seconds is "working at magic speeds"? What numbers do you get
for this benchmark?

I would expect to get roughly similar numbers for creation and
deletion. Why aren't you surprised by the deletion numbers? You know
something I don't. So please share with me.

>> I may not have been clear. The creation needs to be fast (in UX terms,
>> so less than 5-10 seconds), but it only occurs a few times during the
>> lifetime of the system.
>
> If this only happens a few times, why do you care how quick the delete
> is? And if it's only happening a few times, why don't you just do the
> space reservation OUTSIDE of the filesystem?

I care because the delete operation is done when the user asks for it,
and it hangs the UI for 2 minutes. Isn't that reason enough to care?

I don't know what you mean by reservation OUTSIDE of the FS. The user
supplies the external HDD, and the sizes to reserve are known at
run-time (sent in the broadcast signal).

> Or do you need to do encryption of these containers and strictly
> segregate them?
> Basically, implement a daemon which knows how much free
> space is on the device, how much is already pre-committed to other
> users, and then how much free space there is.

I don't think you've thought this through... You propose to have a
daemon that will mediate every file system write to the external HDD.
That means that the application has to explicitly be coded to talk to
the daemon. (My solution is transparent to the app.) Or did you have
some kind of interposition of write system calls?

Anyway, this code would be duplicating the bean counting done inside
the file system driver. (I considered using quotas, but I didn't see
how to make it work as required.)

Note that the OS and main app reside in Flash (ubifs). And there are
also a couple tmpfs. There are about 15 mount points (most of them
pseudo file systems, though).

> Why are you doing it like this? What advantage does this buy you? In
> any case, you're now slowing things down because you have the overhead
> of the base filesystem, which you then create a large file on top of,
> which you then mount and format with a SECOND filesystem.

Write performance is secondary. I was just providing insight into what
I planned to do with the large files, but the performance problem of
unlinking occurs *even when nothing was done to the file*. (Please
don't get distracted by the FS-within-FS gizmo.)

> Instead, you should probably just have a small boot/OS filesystem, and
> then put the rest of the storage under LVM control. At that point,
> you can reserve space using 'lvcreate ...' which will succeed or
> fail. If good, create a filesystem in there and use it. When you
> need to delete it, just unmount the LV and just do 'lvdestroy' which
> should be much faster, since you won't bother to zero out the blocks.

I didn't look closely at the LVM solution, because space in Flash is
tight, and I had the prejudice that LVM was large, as a
server/workstation solution.
[huge snip of my other questions]

I will try to test XFS on the STB, and also run my ext4 test on my
workstation, see if I get the same disappointing results.

Regards.

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: After unlinking a large file on ext4, the process stalls for a long time
2014-07-16 17:16 ` Mason
2014-07-16 20:18 ` John Stoffel
@ 2014-07-17 3:37 ` Andreas Dilger
2014-07-17 10:30 ` Mason
1 sibling, 1 reply; 13+ messages in thread
From: Andreas Dilger @ 2014-07-17 3:37 UTC (permalink / raw)
To: Mason; +Cc: John Stoffel, Ext4 Developers List, linux-fsdevel

[-- Attachment #1: Type: text/plain, Size: 4103 bytes --]

On Jul 16, 2014, at 11:16 AM, Mason <mpeg.blue@free.fr> wrote:
> (I hope you'll forgive me for reformatting the quote characters
> to my taste.)

Thank you.

> On 16/07/2014 17:16, John Stoffel wrote:
>> Mason wrote:
>>> I'm using Linux (3.1.10 at the moment) on an embedded system
>>> similar in spec to a desktop PC from 15 years ago (256 MB RAM,
>>> 800-MHz CPU, USB).
>>
>> Sounds like a Raspberry Pi... And have you investigated using
>> something like XFS as your filesystem instead?
>
> The system is a set-top box (DVB-S2 receiver). The system CPU is
> MIPS 74K, not ARM (not that it matters, in this case).
>
> No, I have not investigated other file systems (yet).
>
>>> I need to be able to create large files (50-1000 GB) "as fast
>>> as possible". These files are created on an external hard disk
>>> drive, connected over Hi-Speed USB (typical throughput 30 MB/s).
>>
>> Really... so you just need to create allocations of space as quickly
>> as possible,
>
> I may not have been clear. The creation needs to be fast (in UX terms,
> so less than 5-10 seconds), but it only occurs a few times during the
> lifetime of the system.
>
>> which will then be filled in later with actual data?
>
> Yes. In fact, I use the loopback device to format the file as an
> ext4 partition.
>
> The use case is
> - allocate a large file
> - stick a file system on it
> - store stuff (typically video files) inside this "private" FS
> - when the user decides he doesn't need it anymore, unmount and unlink
> (I also have a resize operation in there, but I wanted to get the
> basics before taking the hard stuff head on.)
>
> So, in the limit, we don't store anything at all: just create and
> immediately delete. This was my test.

I would agree that LVM is the real solution that you want to use.
It is specifically designed for this, and has much less overhead than
a filesystem on a loopback device on a file on another filesystem.
The amount of space overhead is tuneable, but typically the volumes
are allocated in multiples of 4MB chunks.

That said, I think you've found some kind of strange performance problem,
and it is worthwhile to figure this out.

>>> /tmp # time ./foo /mnt/hdd/xxx 5
>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [68 ms]
>>> unlink(filename): 0 [0 ms]
>>> 0.00user 1.86system 0:01.92elapsed 97%CPU (0avgtext+0avgdata 528maxresident)k
>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
>>>
>>> /tmp # time ./foo /mnt/hdd/xxx 10
>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [141 ms]
>>> unlink(filename): 0 [0 ms]
>>> 0.00user 3.71system 0:03.83elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k
>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
>>>
>>> /tmp # time ./foo /mnt/hdd/xxx 100
>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [1882 ms]
>>> unlink(filename): 0 [0 ms]
>>> 0.00user 37.12system 0:38.93elapsed 95%CPU (0avgtext+0avgdata 528maxresident)k
>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
>>>
>>> /tmp # time ./foo /mnt/hdd/xxx 300
>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [3883 ms]
>>> unlink(filename): 0 [0 ms]
>>> 0.00user 111.38system 1:55.04elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k
>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps

Firstly, have you tried using "fallocate()" directly,
instead of posix_fallocate()? It may be (depending on your userspace)
that posix_fallocate() is writing zeroes to the file instead of using
the fallocate() syscall, and the kernel is busy cleaning up all of the
dirty pages when the file is unlinked. You could try using strace to
see what system calls are actually being used.

Secondly, where is the process actually stuck? From your output above,
the unlink() call takes no measurable time before returning, so I don't
see where it is actually stuck. Again, running your test with
"strace -tt -T ./foo /mnt/hdd/xxx 300" will show which syscall is
actually taking so much time to complete. I don't think it is unlink().

Cheers, Andreas

[-- Attachment #2: Message signed with OpenPGP using GPGMail --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: After unlinking a large file on ext4, the process stalls for a long time 2014-07-17 3:37 ` Andreas Dilger @ 2014-07-17 10:30 ` Mason 2014-07-17 10:40 ` Lukáš Czerner 0 siblings, 1 reply; 13+ messages in thread From: Mason @ 2014-07-17 10:30 UTC (permalink / raw) To: Andreas Dilger; +Cc: Ext4 Developers List, linux-fsdevel Hello, Andreas Dilger wrote: > Mason wrote: > >> The use case is >> - allocate a large file >> - stick a file system on it >> - store stuff (typically video files) inside this "private" FS >> - when the user decides he doesn't need it anymore, unmount and unlink >> (I also have a resize operation in there, but I wanted to get the >> basics before taking the hard stuff head on.) >> >> So, in the limit, we don't store anything at all: just create and >> immediately delete. This was my test. > > I would agree that LVM is the real solution that you want to use. > It is specifically designed for this, and has much less overhead than > a filesystem on a loopback device on a file on another filesystem. > The amount of space overhead is tuneable, but typically the volumes > are allocated in multiples of 4MB chunks. I'll take a look at LVM. (But, at this point, it's too late to change the architecture of the system.) > That said, I think you've found some kind of strange performance problem, > and it is worthwhile to figure this out. 
> >>>> /tmp # time ./foo /mnt/hdd/xxx 5 >>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [68 ms] >>>> unlink(filename): 0 [0 ms] >>>> 0.00user 1.86system 0:01.92elapsed 97%CPU (0avgtext+0avgdata 528maxresident)k >>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps >>>> >>>> /tmp # time ./foo /mnt/hdd/xxx 10 >>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [141 ms] >>>> unlink(filename): 0 [0 ms] >>>> 0.00user 3.71system 0:03.83elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k >>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps >>>> >>>> /tmp # time ./foo /mnt/hdd/xxx 100 >>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [1882 ms] >>>> unlink(filename): 0 [0 ms] >>>> 0.00user 37.12system 0:38.93elapsed 95%CPU (0avgtext+0avgdata 528maxresident)k >>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps >>>> >>>> /tmp # time ./foo /mnt/hdd/xxx 300 >>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [3883 ms] >>>> unlink(filename): 0 [0 ms] >>>> 0.00user 111.38system 1:55.04elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k >>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps Preliminary info: The partition was created/mounted with $ mkfs.ext4 -m 0 -i 1024000 -L ZOZO -O ^has_journal,^huge_file /dev/sda1 $ mount -t ext4 /dev/sda1 /mnt/hdd -o noexec,noatime (mount is busybox, in case it matters) mke2fs 1.42.10 (18-May-2014) /dev/sda1 contains a ext4 file system labelled 'ZOZO' last mounted on /mnt/hdd on Wed Jul 16 15:40:40 2014 Proceed anyway? 
(y,n) y Creating filesystem with 104857600 4k blocks and 460800 inodes Filesystem UUID: 8c12c8fe-6ab8-4888-b9a3-6f28c86020eb Superblock backups stored on blocks: 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, 102400000 Allocating group tables: done Writing inode tables: done Writing superblocks and filesystem accounting information: done /dev/sda1 on /mnt/hdd type ext4 (rw,noexec,noatime,barrier=1) /* No support for xattr in this kernel */ # dumpe2fs -h /dev/sda1 dumpe2fs 1.42.10 (18-May-2014) Filesystem volume name: ZOZO Last mounted on: <not available> Filesystem UUID: 8c12c8fe-6ab8-4888-b9a3-6f28c86020eb Filesystem magic number: 0xEF53 Filesystem revision #: 1 (dynamic) Filesystem features: ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file uninit_bg dir_nlink extra_isize Filesystem flags: signed_directory_hash Default mount options: user_xattr acl Filesystem state: not clean Errors behavior: Continue Filesystem OS type: Linux Inode count: 460800 Block count: 104857600 Reserved block count: 0 Free blocks: 104803944 Free inodes: 460789 First block: 0 Block size: 4096 Fragment size: 4096 Reserved GDT blocks: 999 Blocks per group: 32768 Fragments per group: 32768 Inodes per group: 144 Inode blocks per group: 9 Flex block group size: 16 Filesystem created: Thu Jul 17 11:14:27 2014 Last mount time: Thu Jul 17 11:14:29 2014 Last write time: Thu Jul 17 11:14:29 2014 Mount count: 1 Maximum mount count: -1 Last checked: Thu Jul 17 11:14:27 2014 Check interval: 0 (<none>) Lifetime writes: 4883 kB Reserved blocks uid: 0 (user root) Reserved blocks gid: 0 (group unknown) First inode: 11 Inode size: 256 Required extra isize: 28 Desired extra isize: 28 Default directory hash: half_md4 Directory Hash Seed: 157f2107-76fc-417b-9a07-491951c873b7 > Firstly, have you tried using "fallocate()" directly, instead of > posix_fallocate()? 
It may be (depending on your userspace) that
> posix_fallocate() is writing zeroes to the file instead of using
> the fallocate() syscall, and the kernel is busy cleaning up all
> of the dirty pages when the file is unlinked. You could try using
> strace to see what system calls are actually being used.

Unfortunately, I'm using a prehistoric version of glibc (2.8)
that doesn't support the fallocate wrapper (introduced in 2.10).

I'm 70% sure that posix_fallocate() is not actually writing zeros
to the file, because when I tested it on ext2, creating a 300-GB
file took hours, literally (approx. 3 hours). The same operation
on ext4 takes a few seconds. (Although, now that I think of it,
it could be working asynchronously, or defer some operation, that
I eventually have to pay for on deletion.)

# time strace -tt -T ./foo /mnt/hdd/xxx 300 2> strace.out
posix_fallocate(fd, 0, size_in_GiB << 30): 0 [414 ms]
unlink(filename): 0 [1 ms]

12:23:27.218838 open("/mnt/hdd/xxx", O_WRONLY|O_CREAT|O_EXCL|O_LARGEFILE, 0600) = 3 <0.000486>
12:23:27.220121 clock_gettime(CLOCK_MONOTONIC, {79879, 926227018}) = 0 <0.000105>
12:23:27.221029 SYS_4320() = 0 <0.412013>
12:23:27.633673 clock_gettime(CLOCK_MONOTONIC, {79880, 339646593}) = 0 <0.000104>
12:23:27.634657 fstat64(1, {st_mode=S_IFCHR|0755, st_rdev=makedev(4, 64), ...}) = 0 <0.000116>
12:23:27.636187 ioctl(1, TIOCNXCL, {B115200 opost isig icanon echo ...}) = 0 <0.000146>
12:23:27.637509 old_mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x77248000 <0.000143>
12:23:27.638306 write(1, "posix_fallocate(fd, 0, size_in_G"..., 54) = 54 <0.000237>
12:23:27.639496 clock_gettime(CLOCK_MONOTONIC, {79880, 345448452}) = 0 <0.000102>
12:23:27.640168 unlink("/mnt/hdd/xxx") = 0 <0.000231>
12:23:27.641174 clock_gettime(CLOCK_MONOTONIC, {79880, 347202581}) = 0 <0.000100>
12:23:27.641984 write(1, "unlink(filename): 0 [1 ms]\n", 27) = 27 <0.000157>
12:23:27.643056 exit_group(0) = ?
0.02user 111.51system 1:51.99elapsed 99%CPU (0avgtext+0avgdata 864maxresident)k
0inputs+0outputs (0major+459minor)pagefaults 0swaps

AFAICT, SYS_4320() is fallocate.

/*
 * Linux o32 style syscalls are in the range from 4000 to 4999.
 */
#define __NR_Linux		4000
#define __NR_fallocate		(__NR_Linux + 320)

Where is the process stalling? That is a mystery. Seems it's stuck
in exit_group(), waiting for the kernel to clean up on its behalf?
Maybe I need ftrace, or something to profile the kernel?

> Secondly, where is the process actually stuck? From your output
> above, the unlink() call takes no measurable time before returning,
> so I don't see where it is actually stuck. Again, running your
> test with "strace -tt -T ./foo /mnt/hdd/xxx 300" will show which
> syscall is actually taking so much time to complete. I don't
> think it is unlink().

See above, the process is stalled, but I don't know where!

-- 
Regards.
* Re: After unlinking a large file on ext4, the process stalls for a long time 2014-07-17 10:30 ` Mason @ 2014-07-17 10:40 ` Lukáš Czerner 2014-07-17 11:17 ` Mason 0 siblings, 1 reply; 13+ messages in thread From: Lukáš Czerner @ 2014-07-17 10:40 UTC (permalink / raw) To: Mason; +Cc: Andreas Dilger, Ext4 Developers List, linux-fsdevel On Thu, 17 Jul 2014, Mason wrote: > Date: Thu, 17 Jul 2014 12:30:34 +0200 > From: Mason <mpeg.blue@free.fr> > To: Andreas Dilger <adilger@dilger.ca> > Cc: Ext4 Developers List <linux-ext4@vger.kernel.org>, > linux-fsdevel <linux-fsdevel@vger.kernel.org> > Subject: Re: After unlinking a large file on ext4, > the process stalls for a long time > > Hello, > > Andreas Dilger wrote: > > > Mason wrote: > > > >> The use case is > >> - allocate a large file > >> - stick a file system on it > >> - store stuff (typically video files) inside this "private" FS > >> - when the user decides he doesn't need it anymore, unmount and unlink > >> (I also have a resize operation in there, but I wanted to get the > >> basics before taking the hard stuff head on.) > >> > >> So, in the limit, we don't store anything at all: just create and > >> immediately delete. This was my test. > > > > I would agree that LVM is the real solution that you want to use. > > It is specifically designed for this, and has much less overhead than > > a filesystem on a loopback device on a file on another filesystem. > > The amount of space overhead is tuneable, but typically the volumes > > are allocated in multiples of 4MB chunks. > > I'll take a look at LVM. (But, at this point, it's too late to change > the architecture of the system.) > > > That said, I think you've found some kind of strange performance problem, > > and it is worthwhile to figure this out. 
> > > >>>> /tmp # time ./foo /mnt/hdd/xxx 5 > >>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [68 ms] > >>>> unlink(filename): 0 [0 ms] > >>>> 0.00user 1.86system 0:01.92elapsed 97%CPU (0avgtext+0avgdata 528maxresident)k > >>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps > >>>> > >>>> /tmp # time ./foo /mnt/hdd/xxx 10 > >>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [141 ms] > >>>> unlink(filename): 0 [0 ms] > >>>> 0.00user 3.71system 0:03.83elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k > >>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps > >>>> > >>>> /tmp # time ./foo /mnt/hdd/xxx 100 > >>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [1882 ms] > >>>> unlink(filename): 0 [0 ms] > >>>> 0.00user 37.12system 0:38.93elapsed 95%CPU (0avgtext+0avgdata 528maxresident)k > >>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps > >>>> > >>>> /tmp # time ./foo /mnt/hdd/xxx 300 > >>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [3883 ms] > >>>> unlink(filename): 0 [0 ms] > >>>> 0.00user 111.38system 1:55.04elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k > >>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps > > Preliminary info: > > The partition was created/mounted with > $ mkfs.ext4 -m 0 -i 1024000 -L ZOZO -O ^has_journal,^huge_file /dev/sda1 > $ mount -t ext4 /dev/sda1 /mnt/hdd -o noexec,noatime > (mount is busybox, in case it matters) > > mke2fs 1.42.10 (18-May-2014) > /dev/sda1 contains a ext4 file system labelled 'ZOZO' > last mounted on /mnt/hdd on Wed Jul 16 15:40:40 2014 > Proceed anyway? 
(y,n) y > Creating filesystem with 104857600 4k blocks and 460800 inodes > Filesystem UUID: 8c12c8fe-6ab8-4888-b9a3-6f28c86020eb > Superblock backups stored on blocks: > 32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, > 4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968, > 102400000 > > Allocating group tables: done > Writing inode tables: done > Writing superblocks and filesystem accounting information: done > > /dev/sda1 on /mnt/hdd type ext4 (rw,noexec,noatime,barrier=1) > /* No support for xattr in this kernel */ > > # dumpe2fs -h /dev/sda1 > dumpe2fs 1.42.10 (18-May-2014) > Filesystem volume name: ZOZO > Last mounted on: <not available> > Filesystem UUID: 8c12c8fe-6ab8-4888-b9a3-6f28c86020eb > Filesystem magic number: 0xEF53 > Filesystem revision #: 1 (dynamic) > Filesystem features: ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file uninit_bg dir_nlink extra_isize > Filesystem flags: signed_directory_hash > Default mount options: user_xattr acl > Filesystem state: not clean > Errors behavior: Continue > Filesystem OS type: Linux > Inode count: 460800 > Block count: 104857600 > Reserved block count: 0 > Free blocks: 104803944 > Free inodes: 460789 > First block: 0 > Block size: 4096 > Fragment size: 4096 > Reserved GDT blocks: 999 > Blocks per group: 32768 > Fragments per group: 32768 > Inodes per group: 144 > Inode blocks per group: 9 > Flex block group size: 16 > Filesystem created: Thu Jul 17 11:14:27 2014 > Last mount time: Thu Jul 17 11:14:29 2014 > Last write time: Thu Jul 17 11:14:29 2014 > Mount count: 1 > Maximum mount count: -1 > Last checked: Thu Jul 17 11:14:27 2014 > Check interval: 0 (<none>) > Lifetime writes: 4883 kB > Reserved blocks uid: 0 (user root) > Reserved blocks gid: 0 (group unknown) > First inode: 11 > Inode size: 256 > Required extra isize: 28 > Desired extra isize: 28 > Default directory hash: half_md4 > Directory Hash Seed: 157f2107-76fc-417b-9a07-491951c873b7 > 
> > Firstly, have you tried using "fallocate()" directly, instead of > > posix_fallocate()? It may be (depending on your userspace) that > > posix_fallocate() is writing zeroes to the file instead of using > > the fallocate() syscall, and the kernel is busy cleaning up all > > of the dirty pages when the file is unlinked. You could try using > > strace to see what system calls are actually being used. > > Unfortunately, I'm using a prehistoric version of glibc (2.8) > that doesn't support the fallocate wrapper (imported in 2.10). > > I'm 70% sure that posix_fallocate() is not actually writing zeros > to the file, because when I tested it on ext2, creating a 300-GB > file took hours, literally (approx. 3 hours). The same operation > on ext4 takes a few seconds. (Although, now that I think of it, > it could be working asynchronously, or defer some operation, that > I eventually have to pay for on deletion.) > > # time strace -tt -T ./foo /mnt/hdd/xxx 300 2> strace.out > posix_fallocate(fd, 0, size_in_GiB << 30): 0 [414 ms] > unlink(filename): 0 [1 ms] > > > 12:23:27.218838 open("/mnt/hdd/xxx", O_WRONLY|O_CREAT|O_EXCL|O_LARGEFILE, 0600) = 3 <0.000486> > 12:23:27.220121 clock_gettime(CLOCK_MONOTONIC, {79879, 926227018}) = 0 <0.000105> > 12:23:27.221029 SYS_4320() = 0 <0.412013> > 12:23:27.633673 clock_gettime(CLOCK_MONOTONIC, {79880, 339646593}) = 0 <0.000104> > 12:23:27.634657 fstat64(1, {st_mode=S_IFCHR|0755, st_rdev=makedev(4, 64), ...}) = 0 <0.000116> > 12:23:27.636187 ioctl(1, TIOCNXCL, {B115200 opost isig icanon echo ...}) = 0 <0.000146> > 12:23:27.637509 old_mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x77248000 <0.000143> > 12:23:27.638306 write(1, "posix_fallocate(fd, 0, size_in_G"..., 54) = 54 <0.000237> > 12:23:27.639496 clock_gettime(CLOCK_MONOTONIC, {79880, 345448452}) = 0 <0.000102> > 12:23:27.640168 unlink("/mnt/hdd/xxx") = 0 <0.000231> > 12:23:27.641174 clock_gettime(CLOCK_MONOTONIC, {79880, 347202581}) = 0 <0.000100> > 
12:23:27.641984 write(1, "unlink(filename): 0 [1 ms]\n", 27) = 27 <0.000157>
> 12:23:27.643056 exit_group(0) = ?
> 0.02user 111.51system 1:51.99elapsed 99%CPU (0avgtext+0avgdata 864maxresident)k
> 0inputs+0outputs (0major+459minor)pagefaults 0swaps

So it really does not seem to be stalling in fallocate, nor unlink.
Can you add close() before unlink, just to be sure what's happening
there ?

Thanks!
-Lukas

> AFAICT, SYS_4320() is fallocate.
>
> /*
>  * Linux o32 style syscalls are in the range from 4000 to 4999.
>  */
> #define __NR_Linux		4000
> #define __NR_fallocate		(__NR_Linux + 320)
>
> Where is the process stalling? That is a mystery. Seems it's stuck
> in exit_group(), waiting for the kernel to clean up on its behalf?
> Maybe I need ftrace, or something to profile the kernel?
>
> > Secondly, where is the process actually stuck? From your output
> > above, the unlink() call takes no measurable time before returning,
> > so I don't see where it is actually stuck. Again, running your
> > test with "strace -tt -T ./foo /mnt/hdd/xxx 300" will show which
> > syscall is actually taking so much time to complete. I don't
> > think it is unlink().
>
> See above, the process is stalled, but I don't know where!
* Re: After unlinking a large file on ext4, the process stalls for a long time
  2014-07-17 10:40 ` Lukáš Czerner
@ 2014-07-17 11:17 ` Mason
  2014-07-17 13:37 ` Theodore Ts'o
  0 siblings, 1 reply; 13+ messages in thread
From: Mason @ 2014-07-17 11:17 UTC (permalink / raw)
To: Lukáš Czerner
Cc: Andreas Dilger, Ext4 Developers List, linux-fsdevel

Lukáš Czerner wrote:

> So it really does not seem to be stalling in fallocate, nor unlink.
> Can you add close() before unlink, just to be sure what's happening
> there ?

Doh! Good catch! Unlinking was fast because the ref count didn't drop
to 0 on unlink; it did so on the implicit close done on exit, which
would explain why the process stalled "at the end". If I unlink a
closed file, it is indeed unlink that stalls.

[BTW, some of the e2fsprogs devs may be reading this. I suppose you
already know, but the cross-compile build was broken in 1.42.10. I
wrote a trivial patch to fix it (cf. the end of this message) although
I'm not sure I did it the canonical way.]
# time strace -T ./foo /mnt/hdd/xxx 300 2> strace.out
posix_fallocate(fd, 0, size_in_GiB << 30): 0 [412 ms]
close(fd): 0 [0 ms]
unlink(filename): 0 [111481 ms]

open("/mnt/hdd/xxx", O_WRONLY|O_CREAT|O_EXCL|O_LARGEFILE, 0600) = 3 <0.000456>
clock_gettime(CLOCK_MONOTONIC, {82152, 251657385}) = 0 <0.000085>
SYS_4320() = 0 <0.411628>
clock_gettime(CLOCK_MONOTONIC, {82152, 664179762}) = 0 <0.000089>
fstat64(1, {st_mode=S_IFCHR|0755, st_rdev=makedev(4, 64), ...}) = 0 <0.000094>
ioctl(1, TIOCNXCL, {B115200 opost isig icanon echo ...}) = 0 <0.000128>
old_mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x773e4000 <0.000195>
write(1, "posix_fallocate(fd, 0, size_in_G"..., 54) = 54 <0.000281>
clock_gettime(CLOCK_MONOTONIC, {82152, 668413115}) = 0 <0.000077>
close(3) = 0 <0.000119>
clock_gettime(CLOCK_MONOTONIC, {82152, 669249479}) = 0 <0.000129>
write(1, "close(fd): 0 [0 ms]\n", 20) = 20 <0.000145>
clock_gettime(CLOCK_MONOTONIC, {82152, 670361133}) = 0 <0.000078>
unlink("/mnt/hdd/xxx") = 0 <111.479283>
clock_gettime(CLOCK_MONOTONIC, {82264, 150551496}) = 0 <0.000080>
write(1, "unlink(filename): 0 [111481 ms]\n", 32) = 32 <0.000225>
exit_group(0) = ?
0.01user 111.48system 1:51.99elapsed 99%CPU (0avgtext+0avgdata 772maxresident)k
0inputs+0outputs (0major+434minor)pagefaults 0swaps

For reference, here's my minimal test case:

#define _FILE_OFFSET_BITS 64
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
#include <time.h>

#define BENCH(op) do { \
	struct timespec t0; clock_gettime(CLOCK_MONOTONIC, &t0); \
	int err = op; \
	struct timespec t1; clock_gettime(CLOCK_MONOTONIC, &t1); \
	int ms = (t1.tv_sec-t0.tv_sec)*1000 + (t1.tv_nsec-t0.tv_nsec)/1000000; \
	printf("%s: %d [%d ms]\n", #op, err, ms); } while(0)

int main(int argc, char **argv)
{
	if (argc != 3) { puts("Usage: prog filename size"); return 42; }

	char *filename = argv[1];
	int fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600);
	if (fd < 0) { perror("open"); return 1; }

	long long size_in_GiB = atoi(argv[2]);
	BENCH(posix_fallocate(fd, 0, size_in_GiB << 30));
	BENCH(close(fd));
	BENCH(unlink(filename));

	return 0;
}

$ cat e2fsprogs-1.42.10.patch
diff -ur a/util/Makefile.in b/util/Makefile.in
--- a/util/Makefile.in	2014-05-15 19:04:08.000000000 +0200
+++ b/util/Makefile.in	2014-07-10 15:31:04.819352596 +0200
@@ -15,7 +15,7 @@
 .c.o:
 	$(E) "	CC $<"
-	$(Q) $(BUILD_CC) -c $(BUILD_CFLAGS) $< -o $@
+	$(Q) $(BUILD_CC) $(CPPFLAGS) -c $(BUILD_CFLAGS) $< -o $@
 	$(Q) $(CHECK_CMD) $(ALL_CFLAGS) $<

 PROGS=		subst symlinks

-- 
Regards.
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
* Re: After unlinking a large file on ext4, the process stalls for a long time
  2014-07-17 11:17 ` Mason
@ 2014-07-17 13:37 ` Theodore Ts'o
  2014-07-17 16:07 ` Mason
  0 siblings, 1 reply; 13+ messages in thread
From: Theodore Ts'o @ 2014-07-17 13:37 UTC (permalink / raw)
To: Mason
Cc: Lukáš Czerner, Andreas Dilger, Ext4 Developers List, linux-fsdevel

On Thu, Jul 17, 2014 at 01:17:11PM +0200, Mason wrote:
> unlink("/mnt/hdd/xxx") = 0 <111.479283>
>
> 0.01user 111.48system 1:51.99elapsed 99%CPU (0avgtext+0avgdata 772maxresident)k
> 0inputs+0outputs (0major+434minor)pagefaults 0swaps

... and we're CPU bound inside the kernel.

Can you run perf so we can see exactly where we're spending the CPU?
You're not using a journal, so I'm pretty sure what you will find is
that we're spending all of our time in mb_free_blocks(), when it is
updating the internal mballoc buddy bitmaps.

With a journal, this work done by mb_free_blocks() is hidden in the
kjournal thread, and happens after the commit is completed, so it
won't block other file system operations (other than burning some
extra CPU on one of the multiple cores available on a typical x86
CPU).

Also, I suspect the CPU overhead is *much* less on an x86 CPU, which
has native bit test/set/clear instructions, whereas the MIPS
architecture was designed by Prof. Hennessy at Stanford, who was a
doctrinaire RISC fanatic, so there would be no bitop instructions.

Even though I'm pretty sure what we'll find, knowing exactly *where*
in mb_free_blocks() or the function it calls would be helpful in
knowing what we need to optimize. So if you could try using perf
(assuming that the perf is supported MIPS; not sure if it does) that
would be really helpful.

Thanks,

- Ted
* Re: After unlinking a large file on ext4, the process stalls for a long time 2014-07-17 13:37 ` Theodore Ts'o @ 2014-07-17 16:07 ` Mason 2014-07-17 16:32 ` Mason 2014-07-18 9:29 ` Lukáš Czerner 0 siblings, 2 replies; 13+ messages in thread From: Mason @ 2014-07-17 16:07 UTC (permalink / raw) To: Theodore Ts'o Cc: Lukáš Czerner, Andreas Dilger, Ext4 Developers List, linux-fsdevel Theodore Ts'o wrote: > Mason wrote: > >> unlink("/mnt/hdd/xxx") = 0 <111.479283> >> >> 0.01user 111.48system 1:51.99elapsed 99%CPU (0avgtext+0avgdata 772maxresident)k >> 0inputs+0outputs (0major+434minor)pagefaults 0swaps > > ... and we're CPU bound inside the kernel. > > Can you run perf so we can see exactly where we're spending the CPU? > You're not using a journal, so I'm pretty sure what you will find is > that we're spending all of our time in mb_free_blocks(), when it is > updating the internal mballoc buddy bitmaps. > > With a journal, this work done by mb_free_blocks() is hidden in the > kjournal thread, and happens after the commit is completed, so it > won't block other file system operations (other than burning some > extra CPU on one of the multiple cores available on a typical x86 > CPU). > > Also, I suspect the CPU overhead is *much* less on an x86 CPU, which > has native bit test/set/clear instructions, whereas the MIPS > architecture was designed by Prof. Hennessy at Stanford, who was a > doctrinaire RISC fanatic, so there would be no bitop instructions. > > Even though I'm pretty sure what we'll find, knowing exactly *where* > in mb_free_blocks() or the function it calls would be helpful in > knowing what we need to optimize. So if you could try using perf > (assuming that the perf is supported MIPS; not sure if it does) that > would be really helpful. Is perf "better" than oprofile? 
(For some metric)

I have enabled:

CONFIG_PERF_EVENTS=y
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
CONFIG_OPROFILE=y
CONFIG_HAVE_OPROFILE=y
CONFIG_KPROBES=y
CONFIG_KRETPROBES=y

What command-line do you suggest I run to get the output you expect?
(I'll try to get it done, but I might have to wait two weeks before
I can run these tests.)

-- 
Regards.
* Re: After unlinking a large file on ext4, the process stalls for a long time 2014-07-17 16:07 ` Mason @ 2014-07-17 16:32 ` Mason 2014-07-18 9:29 ` Lukáš Czerner 1 sibling, 0 replies; 13+ messages in thread From: Mason @ 2014-07-17 16:32 UTC (permalink / raw) To: Theodore Ts'o Cc: Lukáš Czerner, Andreas Dilger, Ext4 Developers List, linux-fsdevel On 17/07/2014 18:07, Mason wrote: > Theodore Ts'o wrote: > >> Mason wrote: >> >>> unlink("/mnt/hdd/xxx") = 0 <111.479283> >>> >>> 0.01user 111.48system 1:51.99elapsed 99%CPU (0avgtext+0avgdata 772maxresident)k >>> 0inputs+0outputs (0major+434minor)pagefaults 0swaps >> >> ... and we're CPU bound inside the kernel. >> >> Can you run perf so we can see exactly where we're spending the CPU? >> You're not using a journal, so I'm pretty sure what you will find is >> that we're spending all of our time in mb_free_blocks(), when it is >> updating the internal mballoc buddy bitmaps. >> >> With a journal, this work done by mb_free_blocks() is hidden in the >> kjournal thread, and happens after the commit is completed, so it >> won't block other file system operations (other than burning some >> extra CPU on one of the multiple cores available on a typical x86 >> CPU). >> >> Also, I suspect the CPU overhead is *much* less on an x86 CPU, which >> has native bit test/set/clear instructions, whereas the MIPS >> architecture was designed by Prof. Hennessy at Stanford, who was a >> doctrinaire RISC fanatic, so there would be no bitop instructions. >> >> Even though I'm pretty sure what we'll find, knowing exactly *where* >> in mb_free_blocks() or the function it calls would be helpful in >> knowing what we need to optimize. So if you could try using perf >> (assuming that the perf is supported MIPS; not sure if it does) that >> would be really helpful. > > Is perf "better" than oprofile? 
(For some metric)
>
> I have enabled:
>
> CONFIG_PERF_EVENTS=y
> CONFIG_PROFILING=y
> CONFIG_TRACEPOINTS=y
> CONFIG_OPROFILE=y
> CONFIG_HAVE_OPROFILE=y
> CONFIG_KPROBES=y
> CONFIG_KRETPROBES=y
>
> What command-line do you suggest I run to get the output you expect?
> (I'll try to get it done, but I might have to wait two weeks before
> I can run these tests.)

So much for oprofile...

  CC      arch/mips/oprofile/../../../drivers/oprofile/oprof.o
arch/mips/oprofile/../../../drivers/oprofile/oprof.c: In function 'oprofile_init':
arch/mips/oprofile/../../../drivers/oprofile/oprof.c:316: error: 'timer' undeclared (first use in this function)
arch/mips/oprofile/../../../drivers/oprofile/oprof.c:316: error: (Each undeclared identifier is reported only once
arch/mips/oprofile/../../../drivers/oprofile/oprof.c:316: error: for each function it appears in.)
arch/mips/oprofile/../../../drivers/oprofile/oprof.c: In function '__check_timer':
arch/mips/oprofile/../../../drivers/oprofile/oprof.c:373: error: 'timer' undeclared (first use in this function)
arch/mips/oprofile/../../../drivers/oprofile/oprof.c: At top level:
arch/mips/oprofile/../../../drivers/oprofile/oprof.c:373: error: 'timer' undeclared here (not in a function)
cc1: warnings being treated as errors
arch/mips/oprofile/../../../drivers/oprofile/oprof.c:373: error: type defaults to 'int' in declaration of 'type name'
make[1]: *** [arch/mips/oprofile/../../../drivers/oprofile/oprof.o] Error 1
make: *** [arch/mips/oprofile] Error 2

Dunno if this happens on vanilla kernels, or if the ODM messed
something up (again).

$ ll tools/perf/arch/
drwxrwxr-x 4 bob bob 4096 Mar 27 17:12 arm/
drwxrwxr-x 4 bob bob 4096 Mar 27 17:12 powerpc/
drwxrwxr-x 4 bob bob 4096 Mar 27 17:12 s390/
drwxrwxr-x 4 bob bob 4096 Mar 27 17:12 sh/
drwxrwxr-x 4 bob bob 4096 Mar 27 17:12 sparc/
drwxrwxr-x 4 bob bob 4096 Mar 27 17:12 x86/

I'm not sure perf supports MIPS... Or maybe it does

$ g -rni mips .
./Makefile:45:		-e s/ppc.*/powerpc/ -e s/mips.*/mips/ \
Binary file ./.Makefile.swp matches
./perf.h:76:#ifdef __mips__
./perf.h:77:#include "../../arch/mips/include/asm/unistd.h"
./perf.h:79:	".set	mips2\n\t" \
./perf.h:81:	".set	mips0" \

-- 
Regards.
* Re: After unlinking a large file on ext4, the process stalls for a long time 2014-07-17 16:07 ` Mason 2014-07-17 16:32 ` Mason @ 2014-07-18 9:29 ` Lukáš Czerner 1 sibling, 0 replies; 13+ messages in thread From: Lukáš Czerner @ 2014-07-18 9:29 UTC (permalink / raw) To: Mason Cc: Theodore Ts'o, Andreas Dilger, Ext4 Developers List, linux-fsdevel [-- Attachment #1: Type: TEXT/PLAIN, Size: 2556 bytes --] On Thu, 17 Jul 2014, Mason wrote: > Date: Thu, 17 Jul 2014 18:07:30 +0200 > From: Mason <mpeg.blue@free.fr> > To: Theodore Ts'o <tytso@mit.edu> > Cc: Lukáš Czerner <lczerner@redhat.com>, Andreas Dilger <adilger@dilger.ca>, > Ext4 Developers List <linux-ext4@vger.kernel.org>, > linux-fsdevel <linux-fsdevel@vger.kernel.org> > Subject: Re: After unlinking a large file on ext4, > the process stalls for a long time > > Theodore Ts'o wrote: > > > Mason wrote: > > > >> unlink("/mnt/hdd/xxx") = 0 <111.479283> > >> > >> 0.01user 111.48system 1:51.99elapsed 99%CPU (0avgtext+0avgdata 772maxresident)k > >> 0inputs+0outputs (0major+434minor)pagefaults 0swaps > > > > ... and we're CPU bound inside the kernel. > > > > Can you run perf so we can see exactly where we're spending the CPU? > > You're not using a journal, so I'm pretty sure what you will find is > > that we're spending all of our time in mb_free_blocks(), when it is > > updating the internal mballoc buddy bitmaps. > > > > With a journal, this work done by mb_free_blocks() is hidden in the > > kjournal thread, and happens after the commit is completed, so it > > won't block other file system operations (other than burning some > > extra CPU on one of the multiple cores available on a typical x86 > > CPU). > > > > Also, I suspect the CPU overhead is *much* less on an x86 CPU, which > > has native bit test/set/clear instructions, whereas the MIPS > > architecture was designed by Prof. Hennessy at Stanford, who was a > > doctrinaire RISC fanatic, so there would be no bitop instructions. 
> > Even though I'm pretty sure what we'll find, knowing exactly *where*
> > in mb_free_blocks() or the function it calls would be helpful in
> > knowing what we need to optimize. So if you could try using perf
> > (assuming that the perf is supported MIPS; not sure if it does) that
> > would be really helpful.
>
> Is perf "better" than oprofile? (For some metric)
>
> I have enabled:
>
> CONFIG_PERF_EVENTS=y
> CONFIG_PROFILING=y
> CONFIG_TRACEPOINTS=y
> CONFIG_OPROFILE=y
> CONFIG_HAVE_OPROFILE=y
> CONFIG_KPROBES=y
> CONFIG_KRETPROBES=y
>
> What command-line do you suggest I run to get the output you expect?
> (I'll try to get it done, but I might have to wait two weeks before
> I can run these tests.)

If perf works on your system you can record data with

  perf record -g ./test file <size>

and then report with

  perf report --stdio

That should yield some interesting information about where we spend
the most time in kernel.

Thanks!
-Lukas
end of thread, other threads: [~2014-07-18 9:29 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-07-16 14:09 After unlinking a large file on ext4, the process stalls for a long time Mason
2014-07-16 15:16 ` John Stoffel
2014-07-16 17:16 ` Mason
2014-07-16 20:18 ` John Stoffel
2014-07-16 21:46 ` Mason
2014-07-17  3:37 ` Andreas Dilger
2014-07-17 10:30 ` Mason
2014-07-17 10:40 ` Lukáš Czerner
2014-07-17 11:17 ` Mason
2014-07-17 13:37 ` Theodore Ts'o
2014-07-17 16:07 ` Mason
2014-07-17 16:32 ` Mason
2014-07-18  9:29 ` Lukáš Czerner