* recent issues with heavy deletes causing soft lockups
@ 2018-10-27 18:40 Thomas Fjellstrom
2018-10-27 19:20 ` Jens Axboe
0 siblings, 1 reply; 6+ messages in thread
From: Thomas Fjellstrom @ 2018-10-27 18:40 UTC (permalink / raw)
To: linux-block
Hi
For the past few months or so I've been dealing with my workstation locking
up for upwards of minutes at a time when deleting a large directory tree. I
don't recall this being a problem before.

The current setup is 3 SATA SSDs in an LVM VG; most space is allocated to an
ext4 /home where my work projects live.
The main use case causing problems is deleting the "out" directory of an
Android AOSP build tree. It can be upwards of 95GB in size with 240k or more
files. If I run a `rm -fr out` or `make clean`, it will lock up anything
attempting to use the disk (e.g. Plasma, IntelliJ, Android Studio, Chrome, etc.),
sometimes for minutes.
I have tried different block scheduler settings, including none, mq-deadline,
kyber, and bfq, none of which seem to improve things much at all.
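For reference, the per-device scheduler can be inspected and switched at
runtime through sysfs. A minimal sketch (paths assume sysfs mounted at /sys;
the `parse_active` helper is just for illustration):

```shell
# The active scheduler is the bracketed entry in
# /sys/block/<dev>/queue/scheduler, e.g. "[mq-deadline] kyber bfq none".
parse_active() {
    printf '%s\n' "$1" | sed -n 's/.*\[\([^]]*\)\].*/\1/p'
}

# Show the active scheduler for each block device, where sysfs is available.
for f in /sys/block/*/queue/scheduler; do
    [ -e "$f" ] || continue
    dev=${f#/sys/block/}
    printf '%s: %s\n' "${dev%%/*}" "$(parse_active "$(cat "$f")")"
done

# Switching is a plain write (as root), e.g.:
#   echo kyber > /sys/block/sda/queue/scheduler
```

A change made this way only lasts until reboot, which makes it convenient for
A/B-testing schedulers against the same delete workload.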
It may be worth noting that disk space is starting to run low; perhaps there's
some interaction going on with free space handling or SSD wear leveling.
That said, it seems to have started happening (or at least gotten worse) some
time around when mq was made the default and only implementation for SATA.
If it helps, my system specs are:
Kernel: Debian Sid's 4.18.0-2-amd64 (4.18.10-2)
CPU: AMD FX-8320 OCed to 4.4GHz
RAM: 32GB DDR3 1866
MB: Asus 970 Aura Pro Gaming
Storage: Kingston HyperX 3K 240G + Samsung 850 Evo 250G + SanDisk X300 500G
I'm thinking of testing with a different or older kernel; what would be the
best one to test with?
Thanks for any assistance.
--
Thomas Fjellstrom
thomas@fjellstrom.ca
^ permalink raw reply [flat|nested] 6+ messages in thread
* Re: recent issues with heavy deletes causing soft lockups
2018-10-27 18:40 recent issues with heavy deletes causing soft lockups Thomas Fjellstrom
@ 2018-10-27 19:20 ` Jens Axboe
2018-11-02 18:25 ` Thomas Fjellstrom
2018-11-02 20:32 ` Thomas Fjellstrom
0 siblings, 2 replies; 6+ messages in thread
From: Jens Axboe @ 2018-10-27 19:20 UTC (permalink / raw)
To: Thomas Fjellstrom; +Cc: linux-block
On Oct 27, 2018, at 12:40 PM, Thomas Fjellstrom <thomas@fjellstrom.ca> wrote:
>
> Hi
>
> As of the past few months or so I've been dealing with my workstation locking
> up for upwards of minutes at a time when deleting a large directory tree. I
> don't recall this being a problem before.
>
> Current setup is 3 SATA SSDs in an lvm vg. most space is allocated to an ext4
> /home where my work projects live.
>
> The main use case causing problems is deleting the "out" directory of an
> android AOSP build tree. It can be upwards of 95GB in size with 240k or more
> files. If I run a `rm -fr out` or `make clean` it will lock up anything
> attempting to use the disk (eg: plasma, intellij, android studio, chrome, etc)
> for sometimes minutes.
>
> I have tried different block scheduler settings including none, mq-deadline,
> kyber and bfq none of which seem to improve things much at all.
>
> It may be worth noting that disk space is starting to run low, perhaps there's
> some interaction going on with free space handling or ssd wear leveling...
>
> That said, it seems to have started happening (or at least made worse) some
> time around when mq was made the default and only implementation for sata.
>
> if it helps, my system specs are:
>
> Kernel: Debian Sid's 4.18.0-2-amd64 (4.18.10-2)
> CPU: AMD FX-8320 OCed to 4.4Ghz
> RAM: 32GB DDR3 1866
> MB: Asus 970 Aura Pro Gaming
> Storage: Kingston HyperX 3K 240G + Samsung 850 Evo 250G + SanDisk X300 500G
>
> I'm thinking of testing with a different or older kernel, what would be the
> best to test with?
Can you try 4.19? A patch went in since 4.18 that fixes a starvation issue
around requeue conditions, which SATA is the most likely to hit.

Jens
* Re: recent issues with heavy deletes causing soft lockups
2018-10-27 19:20 ` Jens Axboe
@ 2018-11-02 18:25 ` Thomas Fjellstrom
2018-11-02 20:32 ` Thomas Fjellstrom
1 sibling, 0 replies; 6+ messages in thread
From: Thomas Fjellstrom @ 2018-11-02 18:25 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-block
On Saturday, October 27, 2018 1:20:10 PM MDT Jens Axboe wrote:
> On Oct 27, 2018, at 12:40 PM, Thomas Fjellstrom <thomas@fjellstrom.ca> wrote:
> > Hi
[snip explanation of problem]
>
> Can you try 4.19? A patch went in since 4.18 that fixes a starvation issue
> around requeue conditions, which SATA is the one to most often hit.
Gave it a shot with the vanilla kernel from git linux-stable/v4.19. It was a
bit of a pain, as the amdgpu driver seems to be broken for my R9 390 on many
kernels, including 4.19. I had to reconfigure to the radeon driver, which I must
say seems to work a lot better than it used to.

At any rate, it doesn't seem to have helped a lot so far. I did end up adding
"scsi_mod.use_blk_mq=0 dm_mod.use_blk_mq=0" to the default kernel boot command
line in grub. It seems to have helped a little, but I haven't tested fully
with a full delete of the build directory; I haven't had time to sit and wait
the 40+ minutes it takes to rebuild the entire thing. And I'm low enough on
disk space that I can't easily make a copy of the 109GB build folder; I've got
about 25GB free out of 780GB. I'll try to test some more soon.
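For what it's worth, whether those boot options took effect can be checked
without guessing: on kernels of this era the settings are exposed as Y/N
module parameters under /sys/module. A sketch (paths assumed; the `mq_label`
helper is just for readability):

```shell
# Translate the Y/N module parameter into the I/O path it selects.
mq_label() {
    case "$1" in
        Y) echo "blk-mq" ;;
        N) echo "legacy" ;;
        *) echo "unknown" ;;
    esac
}

# scsi_mod.use_blk_mq and dm_mod.use_blk_mq appear here when the modules
# (or built-ins) are loaded.
for p in /sys/module/scsi_mod/parameters/use_blk_mq \
         /sys/module/dm_mod/parameters/use_blk_mq; do
    if [ -e "$p" ]; then
        printf '%s: %s\n' "$p" "$(mq_label "$(cat "$p")")"
    fi
done
```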
> Jens
--
Thomas Fjellstrom
thomas@fjellstrom.ca
* Re: recent issues with heavy deletes causing soft lockups
2018-10-27 19:20 ` Jens Axboe
2018-11-02 18:25 ` Thomas Fjellstrom
@ 2018-11-02 20:32 ` Thomas Fjellstrom
2018-11-02 20:37 ` Jens Axboe
1 sibling, 1 reply; 6+ messages in thread
From: Thomas Fjellstrom @ 2018-11-02 20:32 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-block
On Saturday, October 27, 2018 1:20:10 PM MDT Jens Axboe wrote:
> On Oct 27, 2018, at 12:40 PM, Thomas Fjellstrom <thomas@fjellstrom.ca>
[snip]
>
> Can you try 4.19? A patch went in since 4.18 that fixes a starvation issue
> around requeue conditions, which SATA is the one to most often hit.
>
> Jens
I just had to do a clean, and I have the mq kernel options I mentioned in my
previous mail enabled (mq should be disabled), and it appears to still be
causing issues. The current I/O scheduler appears to be cfq, and that "make
clean" took about 4 minutes; a lot of that time was spent with Plasma, IntelliJ,
and Chrome all starved of I/O.
I did switch to a terminal and checked iostat -d 1, and it showed very little
actual I/O for the time I was looking at it.
I have no idea what's going on.
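One way to confirm whether tasks really are blocked on I/O during a stall
(rather than spending their time elsewhere) is to look for processes in
uninterruptible sleep. A rough sketch; `d_state` is just an illustrative
filter over `ps` output:

```shell
# Tasks stuck waiting on I/O usually show as "D" (uninterruptible sleep)
# in ps output; filter "STATE PID COMM" lines down to those entries.
d_state() {
    awk '$1 ~ /^D/ { print $2, $3 }'
}

# Run this during the stall; an empty result suggests the time is going
# somewhere other than blocked I/O.
ps -eo state=,pid=,comm= 2>/dev/null | d_state || true
```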
--
Thomas Fjellstrom
thomas@fjellstrom.ca
* Re: recent issues with heavy deletes causing soft lockups
2018-11-02 20:32 ` Thomas Fjellstrom
@ 2018-11-02 20:37 ` Jens Axboe
2018-11-21 21:25 ` Thomas Fjellstrom
0 siblings, 1 reply; 6+ messages in thread
From: Jens Axboe @ 2018-11-02 20:37 UTC (permalink / raw)
To: Thomas Fjellstrom; +Cc: linux-block
On 11/2/18 2:32 PM, Thomas Fjellstrom wrote:
> On Saturday, October 27, 2018 1:20:10 PM MDT Jens Axboe wrote:
>> On Oct 27, 2018, at 12:40 PM, Thomas Fjellstrom <thomas@fjellstrom.ca>
> [snip]
>>
>> Can you try 4.19? A patch went in since 4.18 that fixes a starvation issue
>> around requeue conditions, which SATA is the one to most often hit.
>>
>> Jens
>
> I just had to do a clean, and I have the mq kernel options I mentioned in my
> previous mail enabled. (mq should be disabled) and it appears to still be
> causing issues. current io scheduler appears to be cfq, and it took that "make
> clean" about 4 minutes, a lot of that time was spent with plasma, intelij, and
> chrome all starved of IO.
>
> I did switch to a terminal and checked iostat -d 1, and it showed very little
> actual io for the time I was looking at it.
>
> I have no idea what's going on.
If you're using cfq, then it's not using mq at all. Maybe do something like:

# perf record -ag -- sleep 10

while the slowdown is happening, and then do

# perf report -g --no-children

and see if that yields anything interesting. It sounds like time is being
spent elsewhere and you aren't actually waiting on I/O.
--
Jens Axboe
* Re: recent issues with heavy deletes causing soft lockups
2018-11-02 20:37 ` Jens Axboe
@ 2018-11-21 21:25 ` Thomas Fjellstrom
0 siblings, 0 replies; 6+ messages in thread
From: Thomas Fjellstrom @ 2018-11-21 21:25 UTC (permalink / raw)
To: Jens Axboe; +Cc: linux-block
On Friday, November 2, 2018 2:37:08 PM MST Jens Axboe wrote:
> On 11/2/18 2:32 PM, Thomas Fjellstrom wrote:
> > On Saturday, October 27, 2018 1:20:10 PM MDT Jens Axboe wrote:
> >> On Oct 27, 2018, at 12:40 PM, Thomas Fjellstrom <thomas@fjellstrom.ca>
> >
> > [snip]
> >
> >> Can you try 4.19? A patch went in since 4.18 that fixes a starvation
> >> issue
> >> around requeue conditions, which SATA is the one to most often hit.
> >>
> >> Jens
> >
> > I just had to do a clean, and I have the mq kernel options I mentioned in
> > my previous mail enabled. (mq should be disabled) and it appears to still
> > be causing issues. current io scheduler appears to be cfq, and it took
> > that "make clean" about 4 minutes, a lot of that time was spent with
> > plasma, intelij, and chrome all starved of IO.
> >
> > I did switch to a terminal and checked iostat -d 1, and it showed very
> > little actual io for the time I was looking at it.
> >
> > I have no idea what's going on.
>
> If you're using cfq, then it's not using mq at all. Maybe do something ala:
Yeah, I switched off mq to test. I mentioned it in a previous mail.
> # perf record -ag -- sleep 10
>
> while the slowdown is happening and then do perf report -g --no-children and
> see if that yields anything interesting. Sounds like time is being spent
> elsewhere and you aren't actually waiting on IO.
OK, with the 4.19.1 kernel from linux-stable I've managed to catch the issue
during real use, rather than just a dd command.

I should note that I have swap turned off, so I'm not sure what the "swapper"
process in the log below is doing. I also see the problem with swap enabled,
but right now I'd rather certain apps die than have the entire system slow
down.

I also have a perf report -t log if that'd be helpful. It shows a lot of "use"
in do_idle/acpi_idle_do_entry, though I presume that's real idle time, not
actual use. The next most eye-catching item in the -t log is chrome spending
17% of its time in glibc's free function.

(the top ~100 lines from perf report -g)
# Total Lost Samples: 0
#
# Samples: 456K of event 'cycles'
# Event count (approx.): 136347735217
#
# Overhead Command Shared Object Symbol
# ........  ...............  ......................................  ......
#
25.64% swapper [kernel.kallsyms] [k] acpi_idle_do_entry
|
---0xffffffffa16000d4
|
|--22.23%--start_secondary
| cpu_startup_entry
| do_idle
| cpuidle_enter_state
| acpi_idle_enter
| acpi_idle_do_entry
|
--3.41%--start_kernel
cpu_startup_entry
do_idle
cpuidle_enter_state
acpi_idle_enter
acpi_idle_do_entry
0.61% swapper [kernel.kallsyms] [k] apic_timer_interrupt
|
---0xffffffffa16000d4
|
--0.52%--start_secondary
cpu_startup_entry
do_idle
|
--0.52%--cpuidle_enter_state
0.54% chrome chrome [.] _fini
0.42% swapper [kernel.kallsyms] [k] native_sched_clock
0.41% swapper [kernel.kallsyms] [k] menu_select
0.40% swapper [kernel.kallsyms] [k] check_preemption_disabled
0.35% http.so libQt5Core.so.5.11.2 [.] QTranslatorPrivate::do_translate
0.35% swapper [kernel.kallsyms] [k] x86_pmu_disable_all
0.32% TaskSchedulerFo [kernel.kallsyms] [k] osq_lock
0.31% Chrome_IOThread chrome [.] _fini
0.30% chrome libpthread-2.27.so [.] __pthread_mutex_lock
0.29% swapper [kernel.kallsyms] [k] _raw_spin_lock_irqsave
0.28% swapper [kernel.kallsyms] [k] read_tsc
0.26% chrome libpthread-2.27.so [.] __pthread_mutex_unlock_usercnt
0.26% swapper [kernel.kallsyms] [k] reschedule_interrupt
0.24% swapper [kernel.kallsyms] [k] _raw_spin_lock
0.24% swapper [kernel.kallsyms] [k] __sched_text_start
0.24% swapper [kernel.kallsyms] [k] native_load_gs_index
0.23% swapper [kernel.kallsyms] [k] __switch_to
0.22% swapper [kernel.kallsyms] [k] do_idle
0.21% TaskSchedulerFo [kernel.kallsyms] [k] mutex_lock
0.21% swapper [kernel.kallsyms] [k] cpuidle_enter_state
0.21% TaskSchedulerFo chrome [.] 0x000000000306c000
0.20% chrome [kernel.kallsyms] [k] native_sched_clock
0.20% TaskSchedulerFo [kernel.kallsyms] [k] mutex_unlock
0.18% chrome [kernel.kallsyms] [k] entry_SYSCALL_64
0.18% thumbnail.so ld-2.27.so [.] do_lookup_x
0.17% Xorg [kernel.kallsyms] [k] delay_tsc
0.17% rm [ext4] [k] ext4_mark_iloc_dirty
0.16% swapper [kernel.kallsyms] [k] update_blocked_averages
0.16% chrome [kernel.kallsyms] [k] check_preemption_disabled
0.15% swapper [kernel.kallsyms] [k] update_load_avg
0.15% swapper [kernel.kallsyms] [k] interrupt_entry
0.15% swapper [kernel.kallsyms] [k] ktime_get
0.15% swapper [kernel.kallsyms] [k] switch_mm_irqs_off
0.15% TaskSchedulerFo [kernel.kallsyms] [k] __mutex_lock.isra.5
0.14% rm [kernel.kallsyms] [k] check_preemption_disabled
0.14% TaskSchedulerFo chrome [.] 0x000000000306c009
0.13% swapper [kernel.kallsyms] [k] __update_load_avg_se
0.13% chrome libc-2.27.so [.] __memcpy_ssse3
0.13% swapper [kernel.kallsyms] [k] __update_load_avg_cfs_rq
0.12% http.so libQt5Core.so.5.11.2 [.] QCoreApplicationPrivate::sendPostedEvents
0.12% rm [kernel.kallsyms] [k] __find_get_block
0.12% swapper [kernel.kallsyms] [k] timerqueue_add
0.12% swapper [kernel.kallsyms] [k] acpi_idle_enter
0.12% apt-cache libz.so.1.2.11 [.] adler32_z
0.12% swapper [kernel.kallsyms] [k] rcu_dynticks_eqs_exit
0.12% Xorg [radeon] [k] cail_reg_read
0.12% swapper [kernel.kallsyms] [k] trace_hardirqs_off
0.11% swapper [kernel.kallsyms] [k] set_next_entity
0.11% swapper [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
0.11% http.so libQt5Core.so.5.11.2 [.] QCoreApplication::translate
0.11% http.so [kernel.kallsyms] [k] __switch_to
0.11% Chrome_ChildIOT chrome [.] _fini
0.11% chrome [kernel.kallsyms] [k] __fget
0.10% swapper [kernel.kallsyms] [k] __hrtimer_next_event_base
0.10% http.so [kernel.kallsyms] [k] native_load_gs_index
0.10% swapper [kernel.kallsyms] [k] rcu_check_callbacks
0.10% drkonqi ld-2.27.so [.] do_lookup_x
0.10% TaskSchedulerFo chrome [.] 0x000000000306e42b
0.10% http.so [kernel.kallsyms] [k] native_sched_clock
0.10% swapper [kernel.kallsyms] [k] x86_pmu_enable_all
0.10% swapper [kernel.kallsyms] [k] find_busiest_group
0.10% radeon_cs:0 [kernel.kallsyms] [k] refcount_sub_and_test_checked
0.10% http.so [vdso] [.] 0x00000000000008d9
Thanks,
--
Thomas Fjellstrom
thomas@fjellstrom.ca
end of thread, other threads:[~2018-11-21 21:26 UTC | newest]
Thread overview: 6+ messages
2018-10-27 18:40 recent issues with heavy deletes causing soft lockups Thomas Fjellstrom
2018-10-27 19:20 ` Jens Axboe
2018-11-02 18:25 ` Thomas Fjellstrom
2018-11-02 20:32 ` Thomas Fjellstrom
2018-11-02 20:37 ` Jens Axboe
2018-11-21 21:25 ` Thomas Fjellstrom