public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* perf loss on parallel compile due to contention on the buf semaphore
@ 2024-08-15 12:25 Mateusz Guzik
  2024-08-15 12:26 ` Mateusz Guzik
  2024-08-15 22:56 ` Dave Chinner
  0 siblings, 2 replies; 3+ messages in thread
From: Mateusz Guzik @ 2024-08-15 12:25 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

I have an ext4-based system where xfs got mounted on tmpfs for testing
purposes. The directory is being used a lot by gcc when compiling.

I'm testing with 24 compilers running in parallel, each operating on
their own hello world source file, listed at the end for reference.

Both ext4 and btrfs backing the directory result in 100% cpu
utilization and about 1500 compiles/second. With xfs I see about 20%
idle(!) and about 1100 compiles/second.

According to offcputime-bpfcc -K the time is spent waiting on the buf
thing, sample traces:

   finish_task_switch.isra.0
    __schedule
    schedule
    schedule_timeout
    __down_common
    down
    xfs_buf_lock
    xfs_buf_find_lock
    xfs_buf_get_map
    xfs_buf_read_map
    xfs_trans_read_buf_map
    xfs_read_agi
    xfs_ialloc_read_agi
    xfs_dialloc
    xfs_create
    xfs_generic_create
    path_openat
    do_filp_open
    do_sys_openat2
    __x64_sys_openat
    do_syscall_64
    entry_SYSCALL_64_after_hwframe
    -                cc (602142)
        10639

    finish_task_switch.isra.0
    __schedule
    schedule
    schedule_timeout
    __down_common
    down
    xfs_buf_lock
    xfs_buf_find_lock
    xfs_buf_get_map
    xfs_buf_read_map
    xfs_trans_read_buf_map
    xfs_read_agi
    xfs_iunlink
    xfs_dir_remove_child
    xfs_remove
    xfs_vn_unlink
    vfs_unlink
    do_unlinkat
    __x64_sys_unlink
    do_syscall_64
    entry_SYSCALL_64_after_hwframe
    -                as (598688)
        12050

The fact that this is contended aside, I'll note the stock semaphore
code does not do adaptive spinning, which avoidably significantly
worsens the impact. You can probably convert this to a rw semaphore
and only ever writelock, which should sort out this aspect. I did not
check what can be done to contend less to begin with.

reproducing:
create a hello world .c file (say /tmp/src.c) and plop into /src:
for i in $(seq 0 23); do cp /tmp/src.c /src/src${i}.c; done

plop the following into will-it-scale/tests/cc.c && ./cc_processes -t 24

#include <sys/types.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

char *testcase_description = "compile";

void testcase(unsigned long long *iterations, unsigned long nr)
{
        char cmd[1024];

        snprintf(cmd, sizeof(cmd), "cc -c -o /tmp/out.%lu /src/src%lu.c", nr, nr);

        while (1) {
                system(cmd);

                (*iterations)++;
        }
}

-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: perf loss on parallel compile due to contention on the buf semaphore
  2024-08-15 12:25 perf loss on parallel compile due to contention on the buf semaphore Mateusz Guzik
@ 2024-08-15 12:26 ` Mateusz Guzik
  2024-08-15 22:56 ` Dave Chinner
  1 sibling, 0 replies; 3+ messages in thread
From: Mateusz Guzik @ 2024-08-15 12:26 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Thu, Aug 15, 2024 at 2:25 PM Mateusz Guzik <mjguzik@gmail.com> wrote:
>
> I have an ext4-based system where xfs got mounted on tmpfs for testing

erm, i mean on /tmp :)

I also used noatime.

> purposes. The directory is being used a lot by gcc when compiling.
>
> I'm testing with 24 compilers running in parallel, each operating on
> their own hello world source file, listed at the end for reference.
>
> Both ext4 and btrfs backing the directory result in 100% cpu
> utilization and about 1500 compiles/second. With xfs I see about 20%
> idle(!) and about 1100 compiles/second.
>
> According to offcputime-bpfcc -K the time is spent waiting on the buf
> thing, sample traces:
>
>    finish_task_switch.isra.0
>     __schedule
>     schedule
>     schedule_timeout
>     __down_common
>     down
>     xfs_buf_lock
>     xfs_buf_find_lock
>     xfs_buf_get_map
>     xfs_buf_read_map
>     xfs_trans_read_buf_map
>     xfs_read_agi
>     xfs_ialloc_read_agi
>     xfs_dialloc
>     xfs_create
>     xfs_generic_create
>     path_openat
>     do_filp_open
>     do_sys_openat2
>     __x64_sys_openat
>     do_syscall_64
>     entry_SYSCALL_64_after_hwframe
>     -                cc (602142)
>         10639
>
>     finish_task_switch.isra.0
>     __schedule
>     schedule
>     schedule_timeout
>     __down_common
>     down
>     xfs_buf_lock
>     xfs_buf_find_lock
>     xfs_buf_get_map
>     xfs_buf_read_map
>     xfs_trans_read_buf_map
>     xfs_read_agi
>     xfs_iunlink
>     xfs_dir_remove_child
>     xfs_remove
>     xfs_vn_unlink
>     vfs_unlink
>     do_unlinkat
>     __x64_sys_unlink
>     do_syscall_64
>     entry_SYSCALL_64_after_hwframe
>     -                as (598688)
>         12050
>
> The fact that this is contended aside, I'll note the stock semaphore
> code does not do adaptive spinning, which avoidably significantly
> worsens the impact. You can probably convert this to a rw semaphore
> and only ever writelock, which should sort out this aspect. I did not
> check what can be done to contend less to begin with.
>
> reproducing:
> create a hello world .c file (say /tmp/src.c) and plop into /src:
> for i in $(seq 0 23); do cp /tmp/src.c /src/src${i}.c; done
>
> plop the following into will-it-scale/tests/cc.c && ./cc_processes -t 24
>
> #include <sys/types.h>
> #include <unistd.h>
> #include <stdio.h>
> #include <stdlib.h>
>
> char *testcase_description = "compile";
>
> void testcase(unsigned long long *iterations, unsigned long nr)
> {
>         char cmd[1024];
>
>         snprintf(cmd, sizeof(cmd), "cc -c -o /tmp/out.%lu /src/src%lu.c", nr, nr);
>
>         while (1) {
>                 system(cmd);
>
>                 (*iterations)++;
>         }
> }
>
> --
> Mateusz Guzik <mjguzik gmail.com>



-- 
Mateusz Guzik <mjguzik gmail.com>

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: perf loss on parallel compile due to contention on the buf semaphore
  2024-08-15 12:25 perf loss on parallel compile due to contention on the buf semaphore Mateusz Guzik
  2024-08-15 12:26 ` Mateusz Guzik
@ 2024-08-15 22:56 ` Dave Chinner
  1 sibling, 0 replies; 3+ messages in thread
From: Dave Chinner @ 2024-08-15 22:56 UTC (permalink / raw)
  To: Mateusz Guzik; +Cc: linux-xfs

On Thu, Aug 15, 2024 at 02:25:48PM +0200, Mateusz Guzik wrote:
> I have an ext4-based system where xfs got mounted on tmpfs for testing
> purposes. The directory is being used a lot by gcc when compiling.
> 
> I'm testing with 24 compilers running in parallel, each operating on
> their own hello world source file, listed at the end for reference.
>
> Both ext4 and btrfs backing the directory result in 100% cpu
> utilization and about 1500 compiles/second. With xfs I see about 20%
> idle(!) and about 1100 compiles/second.

Yup, you're not using any of the allocation parallelism in XFS by
running all the microbenchmark threads in the same directory. That
serialises the tasks on inode and extent allocation and freeing
because they all hit the same allocation group.

Start by separating threads per directory because XFS puts
directories in different allocation groups when they are allocated,
and then keeps the contents of the directories local to the AG the
directory is located in. This largely gives perfect scalability
across directories as long as the filesystem has enough AGs in it.

For scalability microbenchmarks, I tend to use an AG count of 2x
max thread count.  i.e. for 24 threads, I'd probably use:

# mkfs.xfs -d agcount=48 ....

and put every thread instance in a newly created directory.
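
A minimal sketch of that setup; the device and mount point here are
hypothetical, so adjust them to your environment:

```shell
# Hypothetical scratch device and mount point, not from this thread.
mkfs.xfs -f -d agcount=48 /dev/vdb        # 2x the 24 benchmark threads
mount -o noatime /dev/vdb /mnt/test

# One working directory per benchmark thread, so each thread's inode
# and extent allocations land in a different allocation group.
for i in $(seq 0 23); do mkdir "/mnt/test/thread${i}"; done
```

`xfs_info /mnt/test` will confirm the agcount the filesystem was made
with.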

For normal workloads (e.g. compiling a large source tree) this
special setup step is not necessary. e.g. the creation of a large
source tree naturally distributes all the directories and files over
all the AGs in the filesystem and so there isn't a single AGI or AGF
buffer lock that serialises the entire concurrent compilation.

You're going to see the same thing with any other will-it-scale
concurrency microbenchmark that has each thread allocate/free inodes
or extents on files in the same directory.

IOWs, this is purely a microbenchmarking setup issue, not a real
world filesystem scalability issue.

> The fact that this is contended aside, I'll note the stock semaphore
> code does not do adaptive spinning, which avoidably significantly
> worsens the impact.

No, we most definitely do not want adaptive spinning. This is a long
hold, non-owner sleeping lock - it is owned by the buffer, not the
task that locks the buffer. The semaphore protects the contents of
the buffer as IO is performed on it (i.e. while it has no task
associated with it, but hardware is modifying the contents via
asynchronous DMA).

It is also held for long periods of time even when the task that
locked it is on-cpu. Inode and extent allocation/freeing can involve
updating multiple btrees that each contain millions of records, and
all the buffers may be cached and so the task running the allocation
and holding the AGI/AGF locked might actually run for many
milliseconds before it yields the lock.

We absolutely do not want tens of threads optimistically spinning on
these locks when contention occurs - optimistic spinning is
extremely power-inefficient and these locks are held long enough
that you can measure spinning lock contention events via the power
socket monitoring...

> You can probably convert this to a rw semaphore
> and only ever writelock, which should sort out this aspect. I did not
> check what can be done to contend less to begin with.

No.  We cannot use any other Linux kernel lock for this, because
they are all mutexes (including rwsems). The optimistic spinning is
based on a task owning the lock and doing the unlock (that's why
rwsems track the write owner task).

We need *pure* sleeping semaphore locks for these buffers and we'd
really, really like for rwsems to be pure semaphores and not a
bastardised rwmutex for the same reasons....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2024-08-15 22:56 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-08-15 12:25 perf loss on parallel compile due to contention on the buf semaphore Mateusz Guzik
2024-08-15 12:26 ` Mateusz Guzik
2024-08-15 22:56 ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox