Re: After unlinking a large file on ext4, the process stalls for a long time

linux-ext4.vger.kernel.org archive mirror

* Re: After unlinking a large file on ext4, the process stalls for a long time
       [not found]   ` <53C6B38A.3000100@free.fr>
@ 2014-07-17  3:37     ` Andreas Dilger
  2014-07-17 10:30       ` Mason
  0 siblings, 1 reply; 12+ messages in thread
From: Andreas Dilger @ 2014-07-17  3:37 UTC (permalink / raw)
  To: Mason; +Cc: John Stoffel, Ext4 Developers List, linux-fsdevel

On Jul 16, 2014, at 11:16 AM, Mason <mpeg.blue@free.fr> wrote:
> (I hope you'll forgive me for reformatting the quote characters
> to my taste.)

Thank you.

> On 16/07/2014 17:16, John Stoffel wrote:
>> Mason wrote:
>>> I'm using Linux (3.1.10 at the moment) on an embedded system
>>> similar in spec to a desktop PC from 15 years ago (256 MB RAM,
>>> 800-MHz CPU, USB).
>> 
>> Sounds like a Raspberry Pi...  And have you investigated using
>> something like XFS as your filesystem instead?
> 
> The system is a set-top box (DVB-S2 receiver). The system CPU is
> MIPS 74K, not ARM (not that it matters, in this case).
> 
> No, I have not investigated other file systems (yet).
> 
>>> I need to be able to create large files (50-1000 GB) "as fast
>>> as possible".  These files are created on an external hard disk
>>> drive, connected over Hi-Speed USB (typical throughput 30 MB/s).
>> 
>> Really... so you just need to create allocations of space as quickly
>> as possible,
> 
> I may not have been clear. The creation needs to be fast (in UX terms,
> so less than 5-10 seconds), but it only occurs a few times during the
> lifetime of the system.
> 
>> which will then be filled in later with actual data?
> 
> Yes. In fact, I use the loopback device to format the file as an
> ext4 partition. 
> 
> The use case is
> - allocate a large file
> - stick a file system on it
> - store stuff (typically video files) inside this "private" FS
> - when the user decides he doesn't need it anymore, unmount and unlink
> (I also have a resize operation in there, but I wanted to get the
> basics before taking the hard stuff head on.)
> 
> So, in the limit, we don't store anything at all: just create and
> immediately delete. This was my test.

I would agree that LVM is the real solution that you want to use.
It is specifically designed for this, and has much less overhead than
a filesystem on a loopback device on a file on another filesystem.
The amount of space overhead is tuneable, but typically the volumes
are allocated in multiples of 4MB chunks.
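
To make the 4MB granularity concrete, here is a minimal sketch of the
rounding it implies.  The 4 MiB extent size is simply the typical figure
mentioned above (it is tuneable), and extents_needed() is a hypothetical
helper for illustration, not an LVM interface.

```c
#include <stdint.h>

/* Assumption: 4 MiB extents, the typical figure quoted above. */
#define EXTENT_BYTES (4ULL << 20)

/* Number of whole extents needed to hold `bytes`, rounding up. */
uint64_t extents_needed(uint64_t bytes)
{
  return (bytes + EXTENT_BYTES - 1) / EXTENT_BYTES;
}
```

Under that assumption a 50 GiB volume takes 12800 extents, and the space
lost to rounding is always less than one extent.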

That said, I think you've found some kind of strange performance problem,
and it is worthwhile to figure this out.

>>> /tmp # time ./foo /mnt/hdd/xxx 5
>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [68 ms]
>>> unlink(filename): 0 [0 ms]
>>> 0.00user 1.86system 0:01.92elapsed 97%CPU (0avgtext+0avgdata 528maxresident)k
>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
>>> 
>>> /tmp # time ./foo /mnt/hdd/xxx 10
>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [141 ms]
>>> unlink(filename): 0 [0 ms]
>>> 0.00user 3.71system 0:03.83elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k
>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
>>> 
>>> /tmp # time ./foo /mnt/hdd/xxx 100
>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [1882 ms]
>>> unlink(filename): 0 [0 ms]
>>> 0.00user 37.12system 0:38.93elapsed 95%CPU (0avgtext+0avgdata 528maxresident)k
>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
>>> 
>>> /tmp # time ./foo /mnt/hdd/xxx 300
>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [3883 ms]
>>> unlink(filename): 0 [0 ms]
>>> 0.00user 111.38system 1:55.04elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k
>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps

Firstly, have you tried using "fallocate()" directly, instead of
posix_fallocate()?  It may be (depending on your userspace) that
posix_fallocate() is writing zeroes to the file instead of using
the fallocate() syscall, and the kernel is busy cleaning up all
of the dirty pages when the file is unlinked.  You could try using
strace to see what system calls are actually being used.

Secondly, where is the process actually stuck?  From your output
above, the unlink() call takes no measurable time before returning,
so I don't see where it is actually stuck.  Again, running your
test with "strace -tt -T ./foo /mnt/hdd/xxx 300" will show which
syscall is actually taking so much time to complete.  I don't
think it is unlink().

Cheers, Andreas







* Re: After unlinking a large file on ext4, the process stalls for a long time
  2014-07-17  3:37     ` After unlinking a large file on ext4, the process stalls for a long time Andreas Dilger
@ 2014-07-17 10:30       ` Mason
  2014-07-17 10:40         ` Lukáš Czerner
  0 siblings, 1 reply; 12+ messages in thread
From: Mason @ 2014-07-17 10:30 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Ext4 Developers List, linux-fsdevel

Hello,

Andreas Dilger wrote:

> Mason wrote:
> 
>> The use case is
>> - allocate a large file
>> - stick a file system on it
>> - store stuff (typically video files) inside this "private" FS
>> - when the user decides he doesn't need it anymore, unmount and unlink
>> (I also have a resize operation in there, but I wanted to get the
>> basics before taking the hard stuff head on.)
>> 
>> So, in the limit, we don't store anything at all: just create and
>> immediately delete. This was my test.
> 
> I would agree that LVM is the real solution that you want to use.
> It is specifically designed for this, and has much less overhead than
> a filesystem on a loopback device on a file on another filesystem.
> The amount of space overhead is tuneable, but typically the volumes
> are allocated in multiples of 4MB chunks.

I'll take a look at LVM. (But, at this point, it's too late to change
the architecture of the system.)

> That said, I think you've found some kind of strange performance problem,
> and it is worthwhile to figure this out.
> 
>>>> /tmp # time ./foo /mnt/hdd/xxx 5
>>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [68 ms]
>>>> unlink(filename): 0 [0 ms]
>>>> 0.00user 1.86system 0:01.92elapsed 97%CPU (0avgtext+0avgdata 528maxresident)k
>>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
>>>>
>>>> /tmp # time ./foo /mnt/hdd/xxx 10
>>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [141 ms]
>>>> unlink(filename): 0 [0 ms]
>>>> 0.00user 3.71system 0:03.83elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k
>>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
>>>>
>>>> /tmp # time ./foo /mnt/hdd/xxx 100
>>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [1882 ms]
>>>> unlink(filename): 0 [0 ms]
>>>> 0.00user 37.12system 0:38.93elapsed 95%CPU (0avgtext+0avgdata 528maxresident)k
>>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
>>>>
>>>> /tmp # time ./foo /mnt/hdd/xxx 300
>>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [3883 ms]
>>>> unlink(filename): 0 [0 ms]
>>>> 0.00user 111.38system 1:55.04elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k
>>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps

Preliminary info:

The partition was created/mounted with
$ mkfs.ext4 -m 0 -i 1024000 -L ZOZO -O ^has_journal,^huge_file /dev/sda1
$ mount -t ext4 /dev/sda1 /mnt/hdd -o noexec,noatime
(mount is busybox, in case it matters)

mke2fs 1.42.10 (18-May-2014)
/dev/sda1 contains a ext4 file system labelled 'ZOZO'
        last mounted on /mnt/hdd on Wed Jul 16 15:40:40 2014
Proceed anyway? (y,n) y
Creating filesystem with 104857600 4k blocks and 460800 inodes
Filesystem UUID: 8c12c8fe-6ab8-4888-b9a3-6f28c86020eb
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
        102400000

Allocating group tables: done
Writing inode tables: done
Writing superblocks and filesystem accounting information: done

/dev/sda1 on /mnt/hdd type ext4 (rw,noexec,noatime,barrier=1)
/* No support for xattr in this kernel */

# dumpe2fs -h /dev/sda1
dumpe2fs 1.42.10 (18-May-2014)
Filesystem volume name:   ZOZO
Last mounted on:          <not available>
Filesystem UUID:          8c12c8fe-6ab8-4888-b9a3-6f28c86020eb
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         not clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              460800
Block count:              104857600
Reserved block count:     0
Free blocks:              104803944
Free inodes:              460789
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      999
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         144
Inode blocks per group:   9
Flex block group size:    16
Filesystem created:       Thu Jul 17 11:14:27 2014
Last mount time:          Thu Jul 17 11:14:29 2014
Last write time:          Thu Jul 17 11:14:29 2014
Mount count:              1
Maximum mount count:      -1
Last checked:             Thu Jul 17 11:14:27 2014
Check interval:           0 (<none>)
Lifetime writes:          4883 kB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group unknown)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Default directory hash:   half_md4
Directory Hash Seed:      157f2107-76fc-417b-9a07-491951c873b7
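
As a sanity check on the geometry above, the figures are consistent with
each other.  This is only a sketch of the arithmetic; block_groups() and
inodes_per_group() are hypothetical helper names, and the inputs are the
values printed by dumpe2fs.

```c
#include <stdint.h>

/* Number of block groups, rounding up for a partial last group. */
uint64_t block_groups(uint64_t total_blocks, uint64_t blocks_per_group)
{
  return (total_blocks + blocks_per_group - 1) / blocks_per_group;
}

/* Inodes that fit in the inode-table blocks of one group. */
uint64_t inodes_per_group(uint64_t inode_blocks_per_group,
                          uint64_t block_size, uint64_t inode_size)
{
  return inode_blocks_per_group * (block_size / inode_size);
}
```

With 104857600 blocks at 32768 per group there are 3200 groups; 9 inode
blocks of 4096 bytes hold 144 inodes of 256 bytes each; and 3200 x 144
gives the reported inode count of 460800.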

> Firstly, have you tried using "fallocate()" directly, instead of
> posix_fallocate()?  It may be (depending on your userspace) that
> posix_fallocate() is writing zeroes to the file instead of using
> the fallocate() syscall, and the kernel is busy cleaning up all
> of the dirty pages when the file is unlinked.  You could try using
> strace to see what system calls are actually being used.

Unfortunately, I'm using a prehistoric version of glibc (2.8)
that doesn't support the fallocate wrapper (imported in 2.10).

I'm 70% sure that posix_fallocate() is not actually writing zeros
to the file, because when I tested it on ext2, creating a 300-GB
file took hours, literally (approx. 3 hours). The same operation
on ext4 takes a few seconds. (Although, now that I think of it,
it could be working asynchronously, or deferring some operation that
I eventually have to pay for on deletion.)
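
For what it's worth, a glibc without the fallocate() wrapper can still
reach the system call through syscall(2).  The sketch below assumes a
64-bit ABI, where each 64-bit argument fits in one register.  On a 32-bit
ABI such as MIPS o32 the offset and length have to be split across
register pairs, which this sketch does not attempt (it returns ENOSYS
instead).  raw_fallocate() is a hypothetical name.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/*
 * Preallocate [offset, offset + len) with mode 0 by invoking the
 * fallocate system call directly.  Returns 0 on success or an errno
 * value on failure.
 */
int raw_fallocate(int fd, int64_t offset, int64_t len)
{
  /* Assumption: 64-bit ABI only; 32-bit argument splitting is not handled. */
  if (sizeof(long) < sizeof(int64_t))
    return ENOSYS;

  if (syscall(SYS_fallocate, fd, 0, (long)offset, (long)len) == 0)
    return 0;
  return errno;
}
```

Unlike posix_fallocate(), the raw call has no write-zeroes fallback: on a
filesystem without preallocation support it fails with EOPNOTSUPP, so a
slow emulation cannot go unnoticed.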

# time strace -tt -T ./foo /mnt/hdd/xxx 300 2> strace.out
posix_fallocate(fd, 0, size_in_GiB << 30): 0 [414 ms]
unlink(filename): 0 [1 ms]


12:23:27.218838 open("/mnt/hdd/xxx", O_WRONLY|O_CREAT|O_EXCL|O_LARGEFILE, 0600) = 3 <0.000486>
12:23:27.220121 clock_gettime(CLOCK_MONOTONIC, {79879, 926227018}) = 0 <0.000105>
12:23:27.221029 SYS_4320()              = 0 <0.412013>
12:23:27.633673 clock_gettime(CLOCK_MONOTONIC, {79880, 339646593}) = 0 <0.000104>
12:23:27.634657 fstat64(1, {st_mode=S_IFCHR|0755, st_rdev=makedev(4, 64), ...}) = 0 <0.000116>
12:23:27.636187 ioctl(1, TIOCNXCL, {B115200 opost isig icanon echo ...}) = 0 <0.000146>
12:23:27.637509 old_mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x77248000 <0.000143>
12:23:27.638306 write(1, "posix_fallocate(fd, 0, size_in_G"..., 54) = 54 <0.000237>
12:23:27.639496 clock_gettime(CLOCK_MONOTONIC, {79880, 345448452}) = 0 <0.000102>
12:23:27.640168 unlink("/mnt/hdd/xxx")  = 0 <0.000231>
12:23:27.641174 clock_gettime(CLOCK_MONOTONIC, {79880, 347202581}) = 0 <0.000100>
12:23:27.641984 write(1, "unlink(filename): 0 [1 ms]\n", 27) = 27 <0.000157>
12:23:27.643056 exit_group(0)           = ?
0.02user 111.51system 1:51.99elapsed 99%CPU (0avgtext+0avgdata 864maxresident)k
0inputs+0outputs (0major+459minor)pagefaults 0swaps


AFAICT, SYS_4320() is fallocate.

/*
 * Linux o32 style syscalls are in the range from 4000 to 4999.
 */
#define __NR_Linux  4000
#define __NR_fallocate  (__NR_Linux + 320)
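
The decoding above can be sketched as a small helper.  The base and the
4000 to 4999 range come from the header comment just quoted;
o32_syscall_index() is a hypothetical name used only for illustration.

```c
/* Base of the o32 syscall range, per the header excerpt above. */
#define O32_SYSCALL_BASE 4000

/*
 * Map a raw number such as the 4320 printed by strace to its index in
 * the o32 table, or return -1 if it lies outside 4000..4999.
 */
int o32_syscall_index(int nr)
{
  if (nr < O32_SYSCALL_BASE || nr > O32_SYSCALL_BASE + 999)
    return -1;
  return nr - O32_SYSCALL_BASE;
}
```

SYS_4320 maps to index 320, which matches __NR_fallocate.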


Where is the process stalling? That is a mystery. Seems it's stuck
in exit_group(), waiting for the kernel to clean up on its behalf?
Maybe I need ftrace, or something to profile the kernel?

> Secondly, where is the process actually stuck?  From your output
> above, the unlink() call takes no measurable time before returning,
> so I don't see where it is actually stuck.  Again, running your
> test with "strace -tt -T ./foo /mnt/hdd/xxx 300" will show which
> syscall is actually taking so much time to complete.  I don't
> think it is unlink().

See above, the process is stalled, but I don't know where!

-- 
Regards.

* Re: After unlinking a large file on ext4, the process stalls for a long time
  2014-07-17 10:30       ` Mason
@ 2014-07-17 10:40         ` Lukáš Czerner
  2014-07-17 11:17           ` Mason
  0 siblings, 1 reply; 12+ messages in thread
From: Lukáš Czerner @ 2014-07-17 10:40 UTC (permalink / raw)
  To: Mason; +Cc: Andreas Dilger, Ext4 Developers List, linux-fsdevel

On Thu, 17 Jul 2014, Mason wrote:

> Date: Thu, 17 Jul 2014 12:30:34 +0200
> From: Mason <mpeg.blue@free.fr>
> To: Andreas Dilger <adilger@dilger.ca>
> Cc: Ext4 Developers List <linux-ext4@vger.kernel.org>,
>     linux-fsdevel <linux-fsdevel@vger.kernel.org>
> Subject: Re: After unlinking a large file on ext4,
>     the process stalls for a long time
> 
> Hello,
> 
> Andreas Dilger wrote:
> 
> > Mason wrote:
> > 
> >> The use case is
> >> - allocate a large file
> >> - stick a file system on it
> >> - store stuff (typically video files) inside this "private" FS
> >> - when the user decides he doesn't need it anymore, unmount and unlink
> >> (I also have a resize operation in there, but I wanted to get the
> >> basics before taking the hard stuff head on.)
> >> 
> >> So, in the limit, we don't store anything at all: just create and
> >> immediately delete. This was my test.
> > 
> > I would agree that LVM is the real solution that you want to use.
> > It is specifically designed for this, and has much less overhead than
> > a filesystem on a loopback device on a file on another filesystem.
> > The amount of space overhead is tuneable, but typically the volumes
> > are allocated in multiples of 4MB chunks.
> 
> I'll take a look at LVM. (But, at this point, it's too late to change
> the architecture of the system.)
> 
> > That said, I think you've found some kind of strange performance problem,
> > and it is worthwhile to figure this out.
> > 
> >>>> /tmp # time ./foo /mnt/hdd/xxx 5
> >>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [68 ms]
> >>>> unlink(filename): 0 [0 ms]
> >>>> 0.00user 1.86system 0:01.92elapsed 97%CPU (0avgtext+0avgdata 528maxresident)k
> >>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
> >>>>
> >>>> /tmp # time ./foo /mnt/hdd/xxx 10
> >>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [141 ms]
> >>>> unlink(filename): 0 [0 ms]
> >>>> 0.00user 3.71system 0:03.83elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k
> >>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
> >>>>
> >>>> /tmp # time ./foo /mnt/hdd/xxx 100
> >>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [1882 ms]
> >>>> unlink(filename): 0 [0 ms]
> >>>> 0.00user 37.12system 0:38.93elapsed 95%CPU (0avgtext+0avgdata 528maxresident)k
> >>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
> >>>>
> >>>> /tmp # time ./foo /mnt/hdd/xxx 300
> >>>> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [3883 ms]
> >>>> unlink(filename): 0 [0 ms]
> >>>> 0.00user 111.38system 1:55.04elapsed 96%CPU (0avgtext+0avgdata 528maxresident)k
> >>>> 0inputs+0outputs (0major+168minor)pagefaults 0swaps
> 
> Preliminary info:
> 
> The partition was created/mounted with
> $ mkfs.ext4 -m 0 -i 1024000 -L ZOZO -O ^has_journal,^huge_file /dev/sda1
> $ mount -t ext4 /dev/sda1 /mnt/hdd -o noexec,noatime
> (mount is busybox, in case it matters)
> 
> mke2fs 1.42.10 (18-May-2014)
> /dev/sda1 contains a ext4 file system labelled 'ZOZO'
>         last mounted on /mnt/hdd on Wed Jul 16 15:40:40 2014
> Proceed anyway? (y,n) y
> Creating filesystem with 104857600 4k blocks and 460800 inodes
> Filesystem UUID: 8c12c8fe-6ab8-4888-b9a3-6f28c86020eb
> Superblock backups stored on blocks:
>         32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
>         4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
>         102400000
> 
> Allocating group tables: done
> Writing inode tables: done
> Writing superblocks and filesystem accounting information: done
> 
> /dev/sda1 on /mnt/hdd type ext4 (rw,noexec,noatime,barrier=1)
> /* No support for xattr in this kernel */
> 
> # dumpe2fs -h /dev/sda1
> dumpe2fs 1.42.10 (18-May-2014)
> Filesystem volume name:   ZOZO
> Last mounted on:          <not available>
> Filesystem UUID:          8c12c8fe-6ab8-4888-b9a3-6f28c86020eb
> Filesystem magic number:  0xEF53
> Filesystem revision #:    1 (dynamic)
> Filesystem features:      ext_attr resize_inode dir_index filetype extent flex_bg sparse_super large_file uninit_bg dir_nlink extra_isize
> Filesystem flags:         signed_directory_hash
> Default mount options:    user_xattr acl
> Filesystem state:         not clean
> Errors behavior:          Continue
> Filesystem OS type:       Linux
> Inode count:              460800
> Block count:              104857600
> Reserved block count:     0
> Free blocks:              104803944
> Free inodes:              460789
> First block:              0
> Block size:               4096
> Fragment size:            4096
> Reserved GDT blocks:      999
> Blocks per group:         32768
> Fragments per group:      32768
> Inodes per group:         144
> Inode blocks per group:   9
> Flex block group size:    16
> Filesystem created:       Thu Jul 17 11:14:27 2014
> Last mount time:          Thu Jul 17 11:14:29 2014
> Last write time:          Thu Jul 17 11:14:29 2014
> Mount count:              1
> Maximum mount count:      -1
> Last checked:             Thu Jul 17 11:14:27 2014
> Check interval:           0 (<none>)
> Lifetime writes:          4883 kB
> Reserved blocks uid:      0 (user root)
> Reserved blocks gid:      0 (group unknown)
> First inode:              11
> Inode size:               256
> Required extra isize:     28
> Desired extra isize:      28
> Default directory hash:   half_md4
> Directory Hash Seed:      157f2107-76fc-417b-9a07-491951c873b7
> 
> > Firstly, have you tried using "fallocate()" directly, instead of
> > posix_fallocate()?  It may be (depending on your userspace) that
> > posix_fallocate() is writing zeroes to the file instead of using
> > the fallocate() syscall, and the kernel is busy cleaning up all
> > of the dirty pages when the file is unlinked.  You could try using
> > strace to see what system calls are actually being used.
> 
> Unfortunately, I'm using a prehistoric version of glibc (2.8)
> that doesn't support the fallocate wrapper (imported in 2.10).
> 
> I'm 70% sure that posix_fallocate() is not actually writing zeros
> to the file, because when I tested it on ext2, creating a 300-GB
> file took hours, literally (approx. 3 hours). The same operation
> on ext4 takes a few seconds. (Although, now that I think of it,
> it could be working asynchronously, or deferring some operation that
> I eventually have to pay for on deletion.)
> 
> # time strace -tt -T ./foo /mnt/hdd/xxx 300 2> strace.out
> posix_fallocate(fd, 0, size_in_GiB << 30): 0 [414 ms]
> unlink(filename): 0 [1 ms]
> 
> 
> 12:23:27.218838 open("/mnt/hdd/xxx", O_WRONLY|O_CREAT|O_EXCL|O_LARGEFILE, 0600) = 3 <0.000486>
> 12:23:27.220121 clock_gettime(CLOCK_MONOTONIC, {79879, 926227018}) = 0 <0.000105>
> 12:23:27.221029 SYS_4320()              = 0 <0.412013>
> 12:23:27.633673 clock_gettime(CLOCK_MONOTONIC, {79880, 339646593}) = 0 <0.000104>
> 12:23:27.634657 fstat64(1, {st_mode=S_IFCHR|0755, st_rdev=makedev(4, 64), ...}) = 0 <0.000116>
> 12:23:27.636187 ioctl(1, TIOCNXCL, {B115200 opost isig icanon echo ...}) = 0 <0.000146>
> 12:23:27.637509 old_mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x77248000 <0.000143>
> 12:23:27.638306 write(1, "posix_fallocate(fd, 0, size_in_G"..., 54) = 54 <0.000237>
> 12:23:27.639496 clock_gettime(CLOCK_MONOTONIC, {79880, 345448452}) = 0 <0.000102>
> 12:23:27.640168 unlink("/mnt/hdd/xxx")  = 0 <0.000231>
> 12:23:27.641174 clock_gettime(CLOCK_MONOTONIC, {79880, 347202581}) = 0 <0.000100>
> 12:23:27.641984 write(1, "unlink(filename): 0 [1 ms]\n", 27) = 27 <0.000157>
> 12:23:27.643056 exit_group(0)           = ?
> 0.02user 111.51system 1:51.99elapsed 99%CPU (0avgtext+0avgdata 864maxresident)k
> 0inputs+0outputs (0major+459minor)pagefaults 0swaps

So it really does not seem to be stalling in fallocate or unlink.
Can you add close() before unlink, just to be sure what's happening
there?

Thanks!
-Lukas


> 
> 
> AFAICT, SYS_4320() is fallocate.
> 
> /*
>  * Linux o32 style syscalls are in the range from 4000 to 4999.
>  */
> #define __NR_Linux  4000
> #define __NR_fallocate  (__NR_Linux + 320)
> 
> 
> Where is the process stalling? That is a mystery. Seems it's stuck
> in exit_group(), waiting for the kernel to clean up on its behalf?
> Maybe I need ftrace, or something to profile the kernel?
> 
> > Secondly, where is the process actually stuck?  From your output
> > above, the unlink() call takes no measurable time before returning,
> > so I don't see where it is actually stuck.  Again, running your
> > test with "strace -tt -T ./foo /mnt/hdd/xxx 300" will show which
> > syscall is actually taking so much time to complete.  I don't
> > think it is unlink().
> 
> See above, the process is stalled, but I don't know where!
> 
> 

* Re: After unlinking a large file on ext4, the process stalls for a long time
  2014-07-17 10:40         ` Lukáš Czerner
@ 2014-07-17 11:17           ` Mason
  2014-07-17 13:37             ` Theodore Ts'o
  0 siblings, 1 reply; 12+ messages in thread
From: Mason @ 2014-07-17 11:17 UTC (permalink / raw)
  To: Lukáš Czerner
  Cc: Andreas Dilger, Ext4 Developers List, linux-fsdevel

Lukáš Czerner wrote:

> So it really does not seem to be stalling in fallocate or unlink.
> Can you add close() before unlink, just to be sure what's happening
> there?

Doh! Good catch! Unlinking was fast because the ref count didn't drop
to 0 on unlink; it did so on the implicit close done at exit, which
would explain why the process stalled "at the end".

If I unlink a closed file, it is indeed unlink that stalls.
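
The effect can be reproduced in isolation.  The following is a minimal
sketch, assuming a local filesystem that reports a link count of 0 for an
unlinked-but-open file; nlink_after_unlink() is a hypothetical helper,
and its argument must be a writable mkstemp() template ending in "XXXXXX".

```c
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Create a temporary file, unlink it while it is still open, and return
 * the link count seen through the open descriptor (or -1 on error).
 */
long nlink_after_unlink(char *path_template)
{
  int fd = mkstemp(path_template);   /* fills in the XXXXXX suffix */
  if (fd < 0)
    return -1;

  struct stat st;
  long result = -1;
  /* unlink() only removes the name; the inode survives while fd is open. */
  if (unlink(path_template) == 0 && fstat(fd, &st) == 0)
    result = (long)st.st_nlink;

  close(fd);   /* last reference dropped: the storage is released here */
  return result;
}
```

The name disappears at unlink(), but the inode and its blocks are only
released by the final close(), whether explicit or implicit at exit.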

[BTW, some of the e2fsprogs devs may be reading this. I suppose you
already know, but the cross-compile build was broken in 1.42.10.
I wrote a trivial patch to fix it (cf. the end of this message),
although I'm not sure I did it the canonical way.]


# time strace -T ./foo /mnt/hdd/xxx 300 2> strace.out
posix_fallocate(fd, 0, size_in_GiB << 30): 0 [412 ms]
close(fd): 0 [0 ms]
unlink(filename): 0 [111481 ms]

open("/mnt/hdd/xxx", O_WRONLY|O_CREAT|O_EXCL|O_LARGEFILE, 0600) = 3 <0.000456>
clock_gettime(CLOCK_MONOTONIC, {82152, 251657385}) = 0 <0.000085>
SYS_4320()                              = 0 <0.411628>
clock_gettime(CLOCK_MONOTONIC, {82152, 664179762}) = 0 <0.000089>
fstat64(1, {st_mode=S_IFCHR|0755, st_rdev=makedev(4, 64), ...}) = 0 <0.000094>
ioctl(1, TIOCNXCL, {B115200 opost isig icanon echo ...}) = 0 <0.000128>
old_mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x773e4000 <0.000195>
write(1, "posix_fallocate(fd, 0, size_in_G"..., 54) = 54 <0.000281>
clock_gettime(CLOCK_MONOTONIC, {82152, 668413115}) = 0 <0.000077>
close(3)                                = 0 <0.000119>
clock_gettime(CLOCK_MONOTONIC, {82152, 669249479}) = 0 <0.000129>
write(1, "close(fd): 0 [0 ms]\n", 20)   = 20 <0.000145>
clock_gettime(CLOCK_MONOTONIC, {82152, 670361133}) = 0 <0.000078>
unlink("/mnt/hdd/xxx")                  = 0 <111.479283>
clock_gettime(CLOCK_MONOTONIC, {82264, 150551496}) = 0 <0.000080>
write(1, "unlink(filename): 0 [111481 ms]\n", 32) = 32 <0.000225>
exit_group(0)                           = ?

0.01user 111.48system 1:51.99elapsed 99%CPU (0avgtext+0avgdata 772maxresident)k
0inputs+0outputs (0major+434minor)pagefaults 0swaps


For reference, here's my minimal test case:

#define _FILE_OFFSET_BITS 64  /* 64-bit off_t, so sizes above 2 GiB fit on a 32-bit target */
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
#include <time.h>

/* Run op once, then print its return value and the elapsed time in ms. */
#define BENCH(op) do { \
  struct timespec t0; clock_gettime(CLOCK_MONOTONIC, &t0); \
  int err = op; \
  struct timespec t1; clock_gettime(CLOCK_MONOTONIC, &t1); \
  int ms = (t1.tv_sec-t0.tv_sec)*1000 + (t1.tv_nsec-t0.tv_nsec)/1000000; \
  printf("%s: %d [%d ms]\n", #op, err, ms); } while(0)

int main(int argc, char **argv)
{
  if (argc != 3) { puts("Usage: prog filename size"); return 42; }

  char *filename = argv[1];
  int fd = open(filename, O_CREAT | O_EXCL | O_WRONLY, 0600);
  if (fd < 0) { perror("open"); return 1; }

  long long size_in_GiB = atoi(argv[2]);
  BENCH(posix_fallocate(fd, 0, size_in_GiB << 30));  /* returns 0 or an error number */
  BENCH(close(fd));         /* drop the last open reference before unlinking */
  BENCH(unlink(filename));  /* no descriptor left, so the blocks are freed here */
  return 0;
}


$ cat e2fsprogs-1.42.10.patch 
diff -ur a/util/Makefile.in b/util/Makefile.in
--- a/util/Makefile.in	2014-05-15 19:04:08.000000000 +0200
+++ b/util/Makefile.in	2014-07-10 15:31:04.819352596 +0200
@@ -15,7 +15,7 @@
 
 .c.o:
 	$(E) "	CC $<"
-	$(Q) $(BUILD_CC) -c $(BUILD_CFLAGS) $< -o $@
+	$(Q) $(BUILD_CC) $(CPPFLAGS) -c $(BUILD_CFLAGS) $< -o $@
 	$(Q) $(CHECK_CMD) $(ALL_CFLAGS) $<
 
 PROGS=		subst symlinks



-- 
Regards.

* Re: After unlinking a large file on ext4, the process stalls for a long time
  2014-07-17 11:17           ` Mason
@ 2014-07-17 13:37             ` Theodore Ts'o
  2014-07-17 16:07               ` Mason
  0 siblings, 1 reply; 12+ messages in thread
From: Theodore Ts'o @ 2014-07-17 13:37 UTC (permalink / raw)
  To: Mason
  Cc: Lukáš Czerner, Andreas Dilger, Ext4 Developers List,
	linux-fsdevel

On Thu, Jul 17, 2014 at 01:17:11PM +0200, Mason wrote:
> unlink("/mnt/hdd/xxx")                  = 0 <111.479283>
> 
> 0.01user 111.48system 1:51.99elapsed 99%CPU (0avgtext+0avgdata 772maxresident)k
> 0inputs+0outputs (0major+434minor)pagefaults 0swaps

... and we're CPU bound inside the kernel.

Can you run perf so we can see exactly where we're spending the CPU?
You're not using a journal, so I'm pretty sure what you will find is
that we're spending all of our time in mb_free_blocks(), when it is
updating the internal mballoc buddy bitmaps.
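
A back-of-the-envelope check of the numbers above, assuming the
4096-byte block size reported by dumpe2fs earlier in the thread.  This
is only a sketch of the arithmetic with hypothetical helper names; it
says nothing about where the time goes, only how much each freed block
costs on average.

```c
#include <stdint.h>

/* Filesystem blocks covered by a file of `gib` GiB. */
uint64_t blocks_in_gib(uint64_t gib, uint64_t block_size)
{
  return (gib << 30) / block_size;
}

/* Average cost per block, in nanoseconds. */
double ns_per_block(double seconds, uint64_t blocks)
{
  return seconds * 1e9 / (double)blocks;
}
```

For the 300 GiB run that is 78643200 blocks freed in 111.48 s of system
time, or roughly 1.4 microseconds per block, which is on the order of
1100 cycles at the 800 MHz clock mentioned at the start of the thread.
The earlier runs (1.86 s for 5 GiB, 37.12 s for 100 GiB) come out at the
same 0.37 s per GiB, so the cost looks linear in the file size.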

With a journal, this work done by mb_free_blocks() is hidden in the
kjournal thread, and happens after the commit is completed, so it
won't block other file system operations (other than burning some
extra CPU on one of the multiple cores available on a typical x86
CPU).

Also, I suspect the CPU overhead is *much* less on an x86 CPU, which
has native bit test/set/clear instructions, whereas the MIPS
architecture was designed by Prof. Hennessy at Stanford, who was a
doctrinaire RISC fanatic, so there would be no bitop instructions.
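
The three primitives in question are easy to picture in portable C.
This is only an illustrative sketch with made-up names (bitmap_set,
bitmap_clear, bitmap_test); it is not the kernel's bitops code and it is
not atomic.

```c
#include <limits.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Set bit number `bit` in a bitmap stored as an array of words. */
void bitmap_set(unsigned long *map, unsigned long bit)
{
  map[bit / BITS_PER_WORD] |= 1UL << (bit % BITS_PER_WORD);
}

/* Clear bit number `bit`. */
void bitmap_clear(unsigned long *map, unsigned long bit)
{
  map[bit / BITS_PER_WORD] &= ~(1UL << (bit % BITS_PER_WORD));
}

/* Return 1 if bit number `bit` is set, 0 otherwise. */
int bitmap_test(const unsigned long *map, unsigned long bit)
{
  return (int)((map[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1UL);
}
```

Written this way, each call is an index computation, a shift to build
the mask, and a load/modify/store.  How many instructions that becomes
depends on the CPU and the compiler.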

Even though I'm pretty sure what we'll find, knowing exactly *where*
in mb_free_blocks() or the function it calls would be helpful in
knowing what we need to optimize.  So if you could try using perf
(assuming that perf is supported on MIPS; I'm not sure if it is) that
would be really helpful.

Thanks,

					- Ted

* Re: After unlinking a large file on ext4, the process stalls for a long time
  2014-07-17 13:37             ` Theodore Ts'o
@ 2014-07-17 16:07               ` Mason
  2014-07-17 16:32                 ` Mason
  2014-07-18  9:29                 ` Lukáš Czerner
  0 siblings, 2 replies; 12+ messages in thread
From: Mason @ 2014-07-17 16:07 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Lukáš Czerner, Andreas Dilger, Ext4 Developers List,
	linux-fsdevel

Theodore Ts'o wrote:

> Mason wrote:
> 
>> unlink("/mnt/hdd/xxx")                  = 0 <111.479283>
>>
>> 0.01user 111.48system 1:51.99elapsed 99%CPU (0avgtext+0avgdata 772maxresident)k
>> 0inputs+0outputs (0major+434minor)pagefaults 0swaps
> 
> ... and we're CPU bound inside the kernel.
> 
> Can you run perf so we can see exactly where we're spending the CPU?
> You're not using a journal, so I'm pretty sure what you will find is
> that we're spending all of our time in mb_free_blocks(), when it is
> updating the internal mballoc buddy bitmaps.
> 
> With a journal, this work done by mb_free_blocks() is hidden in the
> kjournal thread, and happens after the commit is completed, so it
> won't block other file system operations (other than burning some
> extra CPU on one of the multiple cores available on a typical x86
> CPU).
> 
> Also, I suspect the CPU overhead is *much* less on an x86 CPU, which
> has native bit test/set/clear instructions, whereas the MIPS
> architecture was designed by Prof. Hennessy at Stanford, who was a
> doctrinaire RISC fanatic, so there would be no bitop instructions.
> 
> Even though I'm pretty sure what we'll find, knowing exactly *where*
> in mb_free_blocks() or the function it calls would be helpful in
> knowing what we need to optimize.  So if you could try using perf
> (assuming that perf is supported on MIPS; I'm not sure if it is) that
> would be really helpful.

Is perf "better" than oprofile? (For some metric)

I have enabled:

CONFIG_PERF_EVENTS=y
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
CONFIG_OPROFILE=y
CONFIG_HAVE_OPROFILE=y
CONFIG_KPROBES=y
CONFIG_KRETPROBES=y

What command-line do you suggest I run to get the output you expect?
(I'll try to get it done, but I might have to wait two weeks before
I can run these tests.)

-- 
Regards.

* Re: After unlinking a large file on ext4, the process stalls for a long time
  2014-07-17 16:07               ` Mason
@ 2014-07-17 16:32                 ` Mason
  2014-07-18  9:29                 ` Lukáš Czerner
  1 sibling, 0 replies; 12+ messages in thread
From: Mason @ 2014-07-17 16:32 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Lukáš Czerner, Andreas Dilger, Ext4 Developers List,
	linux-fsdevel

On 17/07/2014 18:07, Mason wrote:

> Theodore Ts'o wrote:
> 
>> Mason wrote:
>>
>>> unlink("/mnt/hdd/xxx")                  = 0 <111.479283>
>>>
>>> 0.01user 111.48system 1:51.99elapsed 99%CPU (0avgtext+0avgdata 772maxresident)k
>>> 0inputs+0outputs (0major+434minor)pagefaults 0swaps
>>
>> ... and we're CPU bound inside the kernel.
>>
>> Can you run perf so we can see exactly where we're spending the CPU?
>> You're not using a journal, so I'm pretty sure what you will find is
>> that we're spending all of our time in mb_free_blocks(), when it is
>> updating the internal mballoc buddy bitmaps.
>>
>> With a journal, this work done by mb_free_blocks() is hidden in the
>> kjournal thread, and happens after the commit is completed, so it
>> won't block other file system operations (other than burning some
>> extra CPU on one of the multiple cores available on a typical x86
>> CPU).
>>
>> Also, I suspect the CPU overhead is *much* less on an x86 CPU, which
>> has native bit test/set/clear instructions, whereas the MIPS
>> architecture was designed by Prof. Hennessy at Stanford, who was a
>> doctrinaire RISC fanatic, so there would be no bitop instructions.
>>
>> Even though I'm pretty sure what we'll find, knowing exactly *where*
>> in mb_free_blocks() or the function it calls would be helpful in
>> knowing what we need to optimize.  So if you could try using perf
>> (assuming that perf is supported on MIPS; not sure if it is) that
>> would be really helpful.
> 
> Is perf "better" than oprofile? (For some metric)
> 
> I have enabled:
> 
> CONFIG_PERF_EVENTS=y
> CONFIG_PROFILING=y
> CONFIG_TRACEPOINTS=y
> CONFIG_OPROFILE=y
> CONFIG_HAVE_OPROFILE=y
> CONFIG_KPROBES=y
> CONFIG_KRETPROBES=y
> 
> What command-line do you suggest I run to get the output you expect?
> (I'll try to get it done, but I might have to wait two weeks before
> I can run these tests.)

So much for oprofile...

  CC      arch/mips/oprofile/../../../drivers/oprofile/oprof.o
arch/mips/oprofile/../../../drivers/oprofile/oprof.c: In function 'oprofile_init':
arch/mips/oprofile/../../../drivers/oprofile/oprof.c:316: error: 'timer' undeclared (first use in this function)
arch/mips/oprofile/../../../drivers/oprofile/oprof.c:316: error: (Each undeclared identifier is reported only once
arch/mips/oprofile/../../../drivers/oprofile/oprof.c:316: error: for each function it appears in.)
arch/mips/oprofile/../../../drivers/oprofile/oprof.c: In function '__check_timer':
arch/mips/oprofile/../../../drivers/oprofile/oprof.c:373: error: 'timer' undeclared (first use in this function)
arch/mips/oprofile/../../../drivers/oprofile/oprof.c: At top level:
arch/mips/oprofile/../../../drivers/oprofile/oprof.c:373: error: 'timer' undeclared here (not in a function)
cc1: warnings being treated as errors
arch/mips/oprofile/../../../drivers/oprofile/oprof.c:373: error: type defaults to 'int' in declaration of 'type name'
make[1]: *** [arch/mips/oprofile/../../../drivers/oprofile/oprof.o] Error 1
make: *** [arch/mips/oprofile] Error 2

Dunno if this happens on vanilla kernels, or if the ODM messed
something up (again).

$ ll tools/perf/arch/
drwxrwxr-x 4 bob bob 4096 Mar 27 17:12 arm/
drwxrwxr-x 4 bob bob 4096 Mar 27 17:12 powerpc/
drwxrwxr-x 4 bob bob 4096 Mar 27 17:12 s390/
drwxrwxr-x 4 bob bob 4096 Mar 27 17:12 sh/
drwxrwxr-x 4 bob bob 4096 Mar 27 17:12 sparc/
drwxrwxr-x 4 bob bob 4096 Mar 27 17:12 x86/

I'm not sure perf supports MIPS...

Or maybe it does

$ g -rni mips .
./Makefile:45:				  -e s/ppc.*/powerpc/ -e s/mips.*/mips/ \
Binary file ./.Makefile.swp matches
./perf.h:76:#ifdef __mips__
./perf.h:77:#include "../../arch/mips/include/asm/unistd.h"
./perf.h:79:				".set	mips2\n\t"			\
./perf.h:81:				".set	mips0"				\


-- 
Regards.


* Re: After unlinking a large file on ext4, the process stalls for a long time
  2014-07-17 16:07               ` Mason
  2014-07-17 16:32                 ` Mason
@ 2014-07-18  9:29                 ` Lukáš Czerner
       [not found]                   ` <53DF9918.3010206@free.fr>
  1 sibling, 1 reply; 12+ messages in thread
From: Lukáš Czerner @ 2014-07-18  9:29 UTC (permalink / raw)
  To: Mason
  Cc: Theodore Ts'o, Andreas Dilger, Ext4 Developers List,
	linux-fsdevel

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2556 bytes --]

On Thu, 17 Jul 2014, Mason wrote:

> Date: Thu, 17 Jul 2014 18:07:30 +0200
> From: Mason <mpeg.blue@free.fr>
> To: Theodore Ts'o <tytso@mit.edu>
> Cc: Lukáš Czerner <lczerner@redhat.com>, Andreas Dilger <adilger@dilger.ca>,
>     Ext4 Developers List <linux-ext4@vger.kernel.org>,
>     linux-fsdevel <linux-fsdevel@vger.kernel.org>
> Subject: Re: After unlinking a large file on ext4,
>     the process stalls for a long time
> 
> Theodore Ts'o wrote:
> 
> > Mason wrote:
> > 
> >> unlink("/mnt/hdd/xxx")                  = 0 <111.479283>
> >>
> >> 0.01user 111.48system 1:51.99elapsed 99%CPU (0avgtext+0avgdata 772maxresident)k
> >> 0inputs+0outputs (0major+434minor)pagefaults 0swaps
> > 
> > ... and we're CPU bound inside the kernel.
> > 
> > Can you run perf so we can see exactly where we're spending the CPU?
> > You're not using a journal, so I'm pretty sure what you will find is
> > that we're spending all of our time in mb_free_blocks(), when it is
> > updating the internal mballoc buddy bitmaps.
> > 
> > With a journal, this work done by mb_free_blocks() is hidden in the
> > kjournal thread, and happens after the commit is completed, so it
> > won't block other file system operations (other than burning some
> > extra CPU on one of the multiple cores available on a typical x86
> > CPU).
> > 
> > Also, I suspect the CPU overhead is *much* less on an x86 CPU, which
> > has native bit test/set/clear instructions, whereas the MIPS
> > architecture was designed by Prof. Hennessy at Stanford, who was a
> > doctrinaire RISC fanatic, so there would be no bitop instructions.
> > 
> > Even though I'm pretty sure what we'll find, knowing exactly *where*
> > in mb_free_blocks() or the function it calls would be helpful in
> > knowing what we need to optimize.  So if you could try using perf
> > (assuming that perf is supported on MIPS; not sure if it is) that
> > would be really helpful.
> 
> Is perf "better" than oprofile? (For some metric)
> 
> I have enabled:
> 
> CONFIG_PERF_EVENTS=y
> CONFIG_PROFILING=y
> CONFIG_TRACEPOINTS=y
> CONFIG_OPROFILE=y
> CONFIG_HAVE_OPROFILE=y
> CONFIG_KPROBES=y
> CONFIG_KRETPROBES=y
> 
> What command-line do you suggest I run to get the output you expect?
> (I'll try to get it done, but I might have to wait two weeks before
> I can run these tests.)

If perf works on your system you can record data with

perf record -g ./test file <size>

and then report with

perf report --stdio

That should yield some interesting information about where we spend
the most time in kernel.
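
The `./test` program is not shown in this message. As a rough sketch
only, the part of such a test that matters for the profile could look
like the helper below. The name `timed_unlink` is hypothetical, and the
step that creates the large file beforehand is left out.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L	/* for clock_gettime() under strict -std modes */
#endif
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: unlink `path` and return the wall-clock time
 * spent inside unlink(), in seconds. Returns -1.0 if unlink() fails.
 */
double timed_unlink(const char *path)
{
	struct timespec t0, t1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (unlink(path) != 0)
		return -1.0;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	return (double)(t1.tv_sec - t0.tv_sec) +
	       (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Running a program built around this under `perf record -g` should
attribute nearly all samples to the unlink path, which is the part of
interest here.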

Thanks!
-Lukas


[parent not found: <53DF9918.3010206@free.fr>]

* Re: After unlinking a large file on ext4, the process stalls for a long time
       [not found]                   ` <53DF9918.3010206@free.fr>
@ 2014-08-04 22:55                     ` Andreas Dilger
  2014-08-05  2:33                       ` Theodore Ts'o
  2014-08-05 12:06                       ` Mason
  0 siblings, 2 replies; 12+ messages in thread
From: Andreas Dilger @ 2014-08-04 22:55 UTC (permalink / raw)
  To: Mason; +Cc: Lukáš Czerner, Theodore Ts'o, Ext4 Developers List

It would be possible to optimize mb_free_blocks() by having it
clear a whole word at a time instead of a series of bits.

I thought that was done already, but it doesn't appear to be the case.
Also, it isn't clear that the bit "normalization" is needed anymore.
This was done back in the ancient times when the buddy bitmaps were
stored on disk instead of being regenerated only at mount time.
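
To illustrate the word-at-a-time idea, here is a userspace sketch. It
is not the mballoc code: the function names are made up for
illustration, and it assumes a plain array of unsigned long as the
bitmap, with bit i living in word i / WORD_BITS.

```c
#include <limits.h>
#include <stddef.h>

#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/* Reference version: clear `len` bits starting at bit `start`, one by one. */
void clear_bits_slow(unsigned long *map, size_t start, size_t len)
{
	for (size_t i = start; i < start + len; i++)
		map[i / WORD_BITS] &= ~(1UL << (i % WORD_BITS));
}

/* Same result, but fully covered words are zeroed with a single store. */
void clear_bits_fast(unsigned long *map, size_t start, size_t len)
{
	size_t end = start + len;	/* one past the last bit to clear */

	/* Leading bits, up to the next word boundary. */
	while (start < end && start % WORD_BITS != 0) {
		map[start / WORD_BITS] &= ~(1UL << (start % WORD_BITS));
		start++;
	}

	/* Whole words in the middle of the range. */
	while (end - start >= WORD_BITS) {
		map[start / WORD_BITS] = 0;
		start += WORD_BITS;
	}

	/* Trailing bits in the last, partially covered word. */
	while (start < end) {
		map[start / WORD_BITS] &= ~(1UL << (start % WORD_BITS));
		start++;
	}
}
```

For a long extent the middle loop does one store per word instead of
one read-modify-write per bit, which is where any saving would come
from.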

Cheers, Andreas

> On Aug 4, 2014, at 16:30, Mason <mpeg.blue@free.fr> wrote:
> 
>> On 18/07/2014 11:29, Lukáš Czerner wrote:
>> 
>> Mason wrote:
>> 
>>> Theodore Ts'o wrote:
>>> 
>>>> Mason wrote:
>>>> 
>>>>> unlink("/mnt/hdd/xxx")                  = 0 <111.479283>
>>>>> 
>>>>> 0.01user 111.48system 1:51.99elapsed 99%CPU (0avgtext+0avgdata 772maxresident)k
>>>>> 0inputs+0outputs (0major+434minor)pagefaults 0swaps
>>>> 
>>>> ... and we're CPU bound inside the kernel.
>>>> 
>>>> Can you run perf so we can see exactly where we're spending the CPU?
>>>> You're not using a journal, so I'm pretty sure what you will find is
>>>> that we're spending all of our time in mb_free_blocks(), when it is
>>>> updating the internal mballoc buddy bitmaps.
>>>> 
>>>> With a journal, this work done by mb_free_blocks() is hidden in the
>>>> kjournal thread, and happens after the commit is completed, so it
>>>> won't block other file system operations (other than burning some
>>>> extra CPU on one of the multiple cores available on a typical x86
>>>> CPU).
>>>> 
>>>> Also, I suspect the CPU overhead is *much* less on an x86 CPU, which
>>>> has native bit test/set/clear instructions, whereas the MIPS
>>>> architecture was designed by Prof. Hennessy at Stanford, who was a
>>>> doctrinaire RISC fanatic, so there would be no bitop instructions.
> 
> I've attached the output of "mips-linux-gnu-objdump -xd mballoc.o"
> in case someone wants to peek at the generated code.
> 
>>>> Even though I'm pretty sure what we'll find, knowing exactly *where*
>>>> in mb_free_blocks() or the function it calls would be helpful in
>>>> knowing what we need to optimize.  So if you could try using perf
>>>> (assuming that perf is supported on MIPS; not sure if it is) that
>>>> would be really helpful.
> 
> How do you get perf to tell you where in mb_free_blocks we are spending
> the most time?
> 
>>> What command-line do you suggest I run to get the output you expect?
>> 
>> If perf works on your system you can record data with
>> 
>> perf record -g ./test file <size>
>> 
>> and then report with
>> 
>> perf report --stdio
>> 
>> That should yield some interesting information about where we spend
>> the most time in kernel.
> 
> I've no idea why, but the unlink operation, which used to take
> 111 seconds to run, now only takes 53...
> 
> Anyway, here is the requested output.
> 
> # time perf record -g foo /mnt/hdd/xxx 300
> [ perf record: Woken up 8 times to write data ]
> [ perf record: Captured and wrote 1.909 MB perf.data (~83406 samples) ]
> 0.04user 0.08system 0:53.54elapsed 0%CPU (0avgtext+0avgdata 3616maxresident)k
> 0inputs+0outputs (0major+984minor)pagefaults 0swaps
> 
> # perf report --stdio > report.txt
> (Complete report attached as report.txt.xz)
> 
> What can I do to improve the latency of unlinking large files?
> Would sparse_super2 help at all?
> 
> 
> # Events: 14K cycles
> #
> # Overhead  Command      Shared Object                        Symbol
> # ........  .......  .................  ............................
> #
>    33.94%      foo  [kernel.kallsyms]  [k] mb_free_blocks
>               |
>               --- mb_free_blocks
>                   ext4_free_blocks
>                   ext4_ext_rm_leaf
>                   ext4_ext_truncate
>                   ext4_truncate
>                   ext4_evict_inode
>                   evict
>                   do_unlinkat
>                   stack_done
> 
>    21.11%      foo  [kernel.kallsyms]  [k] __find_get_block
>               |
>               --- __find_get_block
>                  |          
>                  |--99.94%-- ext4_free_blocks
>                  |          ext4_ext_rm_leaf
>                  |          ext4_ext_truncate
>                  |          ext4_truncate
>                  |          ext4_evict_inode
>                  |          evict
>                  |          do_unlinkat
>                  |          stack_done
>                   --0.06%-- [...]
> 
>     8.33%      foo  [kernel.kallsyms]  [k] radix_tree_lookup_slot
>               |
>               --- radix_tree_lookup_slot
>                   find_get_page
>                   __find_get_block_slow
>                   __find_get_block
>                   ext4_free_blocks
>                   ext4_ext_rm_leaf
>                   ext4_ext_truncate
>                   ext4_truncate
>                   ext4_evict_inode
>                   evict
>                   do_unlinkat
>                   stack_done
> 
>     6.99%      foo  [kernel.kallsyms]  [k] mb_find_buddy
>               |
>               --- mb_find_buddy
>                   mb_free_blocks
>                   ext4_free_blocks
>                   ext4_ext_rm_leaf
>                   ext4_ext_truncate
>                   ext4_truncate
>                   ext4_evict_inode
>                   evict
>                   do_unlinkat
>                   stack_done
> 
>     4.21%      foo  [kernel.kallsyms]  [k] trace_preempt_off
>               |
>               --- trace_preempt_off
>                  |          
>                  |--99.99%-- __find_get_block
>                  |          ext4_free_blocks
>                  |          ext4_ext_rm_leaf
>                  |          ext4_ext_truncate
>                  |          ext4_truncate
>                  |          ext4_evict_inode
>                  |          evict
>                  |          do_unlinkat
>                  |          stack_done
>                   --0.01%-- [...]
> 
>     4.19%      foo  [kernel.kallsyms]  [k] ext4_free_blocks
>               |
>               --- ext4_free_blocks
>                   ext4_ext_rm_leaf
>                   ext4_ext_truncate
>                   ext4_truncate
>                   ext4_evict_inode
>                   evict
>                   do_unlinkat
>                   stack_done
> 
>     4.14%      foo  [kernel.kallsyms]  [k] sub_preempt_count
>               |
>               --- sub_preempt_count
>                  |          
>                  |--99.69%-- __find_get_block
>                  |          ext4_free_blocks
>                  |          ext4_ext_rm_leaf
>                  |          ext4_ext_truncate
>                  |          ext4_truncate
>                  |          ext4_evict_inode
>                  |          evict
>                  |          do_unlinkat
>                  |          stack_done
>                   --0.31%-- [...]
> 
>     3.97%      foo  [kernel.kallsyms]  [k] __find_get_block_slow
>               |
>               --- __find_get_block_slow
>                   __find_get_block
>                   ext4_free_blocks
>                   ext4_ext_rm_leaf
>                   ext4_ext_truncate
>                   ext4_truncate
>                   ext4_evict_inode
>                   evict
>                   do_unlinkat
>                   stack_done
> 
>     3.53%      foo  [kernel.kallsyms]  [k] __rcu_read_unlock
>               |
>               --- __rcu_read_unlock
>                  |          
>                  |--100.00%-- find_get_page
>                  |          __find_get_block_slow
>                  |          __find_get_block
>                  |          ext4_free_blocks
>                  |          ext4_ext_rm_leaf
>                  |          ext4_ext_truncate
>                  |          ext4_truncate
>                  |          ext4_evict_inode
>                  |          evict
>                  |          do_unlinkat
>                  |          stack_done
>                   --0.00%-- [...]
> 
>     3.26%      foo  [kernel.kallsyms]  [k] trace_preempt_on
>               |
>               --- trace_preempt_on
>                   sub_preempt_count
>                  |          
>                  |--100.00%-- __find_get_block
>                  |          ext4_free_blocks
>                  |          ext4_ext_rm_leaf
>                  |          ext4_ext_truncate
>                  |          ext4_truncate
>                  |          ext4_evict_inode
>                  |          evict
>                  |          do_unlinkat
>                  |          stack_done
>                   --0.00%-- [...]
> 
>     2.06%      foo  [kernel.kallsyms]  [k] find_get_page
>               |
>               --- find_get_page
>                  |          
>                  |--100.00%-- __find_get_block_slow
>                  |          __find_get_block
>                  |          ext4_free_blocks
>                  |          ext4_ext_rm_leaf
>                  |          ext4_ext_truncate
>                  |          ext4_truncate
>                  |          ext4_evict_inode
>                  |          evict
>                  |          do_unlinkat
>                  |          stack_done
>                   --0.00%-- [...]
> 
>     1.39%      foo  [kernel.kallsyms]  [k] add_preempt_count
>               |
>               --- add_preempt_count
>                  |          
>                  |--99.99%-- __find_get_block
>                  |          ext4_free_blocks
>                  |          ext4_ext_rm_leaf
>                  |          ext4_ext_truncate
>                  |          ext4_truncate
>                  |          ext4_evict_inode
>                  |          evict
>                  |          do_unlinkat
>                  |          stack_done
>                   --0.01%-- [...]
> 
>     1.26%      foo  [kernel.kallsyms]  [k] __rcu_read_lock
>               |
>               --- __rcu_read_lock
>                   find_get_page
>                   __find_get_block_slow
>                   __find_get_block
>                   ext4_free_blocks
>                   ext4_ext_rm_leaf
>                   ext4_ext_truncate
>                   ext4_truncate
>                   ext4_evict_inode
>                   evict
>                   do_unlinkat
>                   stack_done
> 
> -- 
> Regards.
> 
> <report.txt.xz>
> <mballoc.dump.xz>


* Re: After unlinking a large file on ext4, the process stalls for a long time
  2014-08-04 22:55                     ` Andreas Dilger
@ 2014-08-05  2:33                       ` Theodore Ts'o
  2014-08-05 21:54                         ` Andreas Dilger
  2014-08-05 12:06                       ` Mason
  1 sibling, 1 reply; 12+ messages in thread
From: Theodore Ts'o @ 2014-08-05  2:33 UTC (permalink / raw)
  To: Andreas Dilger; +Cc: Mason, Lukáš Czerner, Ext4 Developers List

On Tue, Aug 05, 2014 at 12:55:14AM +0200, Andreas Dilger wrote:
> It would be possible to optimize mb_free_blocks() by having it
> clear a whole word at a time instead of a series of bits.

It looks like we're doing this already in mb_test_and_clear_bits(),
aren't we?

> I thought that was done already, but it doesn't appear to be the case.
> Also, it isn't clear that the bit "normalization" is needed anymore.
> This was done back in the ancient times when the buddy bitmaps were
> stored on disk instead of being regenerated only at mount time.

I'm not sure what you mean by this; the only reference I can find
to normalization is with normalizing requests?

						- Ted


* Re: After unlinking a large file on ext4, the process stalls for a long time
  2014-08-05  2:33                       ` Theodore Ts'o
@ 2014-08-05 21:54                         ` Andreas Dilger
  0 siblings, 0 replies; 12+ messages in thread
From: Andreas Dilger @ 2014-08-05 21:54 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Mason, Lukáš Czerner, Ext4 Developers List

On Aug 5, 2014, at 4:33, Theodore Ts'o <tytso@mit.edu> wrote:
> 
>> On Tue, Aug 05, 2014 at 12:55:14AM +0200, Andreas Dilger wrote:
>> It would be possible to optimize mb_free_blocks() by having it
>> clear a whole word at a time instead of a series of bits.
> 
> It looks like we're doing this already in mb_test_and_clear_bits(),
> aren't we?

Sorry, I didn't see mb_test_and_clear_bits(), I was only looking at
mb_clear_bit() to see if it had the multi-bit optimization.

>> I thought that was done already, but it doesn't appear to be the case.
>> Also, it isn't clear that the bit "normalization" is needed anymore.
>> This was done back in the ancient times when the buddy bitmaps were
>> stored on disk instead of being regenerated only at mount time.
> 
> I'm not sure what you mean by this; the only reference I can find
> to normalization is with normalizing requests?

I meant mb_correct_addr_and_bit(). 
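
For readers without the source at hand, the kind of address/bit
"normalization" meant here can be sketched in userspace as below. This
is an illustration of the idea, not a copy of the kernel function: the
name `align_addr_and_bit` is made up, and it assumes 8-bit bytes and
bit numbers that count upward through consecutive bytes.

```c
#include <stdint.h>

/*
 * Round `addr` down to an unsigned long boundary and add the dropped
 * bytes (times 8) to *bit, so that the returned address plus the new
 * bit index names the same bit as the original pair did.
 */
void *align_addr_and_bit(void *addr, int *bit)
{
	uintptr_t a = (uintptr_t)addr;
	uintptr_t mask = sizeof(unsigned long) - 1;

	*bit += (int)(a & mask) * 8;	/* bytes skipped, expressed in bits */
	return (void *)(a & ~mask);	/* word-aligned base address */
}
```

The adjustment only does something when the bitmap pointer is not
already word-aligned; for an aligned pointer it leaves both values
unchanged.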

Cheers, Andreas


* Re: After unlinking a large file on ext4, the process stalls for a long time
  2014-08-04 22:55                     ` Andreas Dilger
  2014-08-05  2:33                       ` Theodore Ts'o
@ 2014-08-05 12:06                       ` Mason
  1 sibling, 0 replies; 12+ messages in thread
From: Mason @ 2014-08-05 12:06 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Lukáš Czerner, Theodore Ts'o, Ext4 Developers List

On 05/08/2014 00:55, Andreas Dilger wrote:

> It would be possible to optimize mb_free_blocks() by having it
> clear a whole word at a time instead of a series of bits.
> 
> I thought that was done already, but it doesn't appear to be the case.
> Also, it isn't clear that the bit "normalization" is needed anymore.
> This was done back in the ancient times when the buddy bitmaps were
> stored on disk instead of being regenerated only at mount time.

Are there any other tests you'd like me to run?
(I will be permanently losing access to this platform in a few days.)

-- 
Regards.


Thread overview: 12+ messages
     [not found] <53C687B1.30809@free.fr>
     [not found] ` <21446.38705.190786.631403@quad.stoffel.home>
     [not found]   ` <53C6B38A.3000100@free.fr>
2014-07-17  3:37     ` After unlinking a large file on ext4, the process stalls for a long time Andreas Dilger
2014-07-17 10:30       ` Mason
2014-07-17 10:40         ` Lukáš Czerner
2014-07-17 11:17           ` Mason
2014-07-17 13:37             ` Theodore Ts'o
2014-07-17 16:07               ` Mason
2014-07-17 16:32                 ` Mason
2014-07-18  9:29                 ` Lukáš Czerner
     [not found]                   ` <53DF9918.3010206@free.fr>
2014-08-04 22:55                     ` Andreas Dilger
2014-08-05  2:33                       ` Theodore Ts'o
2014-08-05 21:54                         ` Andreas Dilger
2014-08-05 12:06                       ` Mason
