From: ein <ein.net@gmail.com>
To: Duncan <1i5t5.duncan@cox.net>, linux-btrfs@vger.kernel.org
Subject: Re: Periodic frame losses when recording to btrfs volume with OBS
Date: Tue, 23 Jan 2018 09:38:13 +0100	[thread overview]
Message-ID: <5A66F475.3010902@gmail.com> (raw)
In-Reply-To: <pan$ef8df$2d463ba0$6a8d2cfc$bcf35f@cox.net>



On 01/22/2018 09:59 AM, Duncan wrote:
> Sebastian Ochmann posted on Sun, 21 Jan 2018 16:27:55 +0100 as excerpted:

> [...]

> On 2018-01-20 18:47, Sebastian Ochmann wrote:
>>>> Hello,
>>>>
>>>> I would like to describe a real-world use case where btrfs does not
>>>> perform well for me. I'm recording 60 fps, larger-than-1080p video
>>>> using OBS Studio [1] where it is important that the video stream is
>>>> encoded and written out to disk in real-time for a prolonged period of
>>>> time (2-5 hours). The result is a H264 video encoded on the GPU with a
>>>> data rate ranging from approximately 10-50 MB/s.
>>>
>>>> The hardware used is powerful enough to handle this task. When I use a
>>>> XFS volume for recording, no matter whether it's a SSD or HDD, the
>>>> recording is smooth and no frame drops are reported (OBS has a nice
>>>> Stats window where it shows the number of frames dropped due to
>>>> encoding lag which seemingly also includes writing the data out to
>>>> disk).
>>>>
>>>> However, when using a btrfs volume I quickly observe severe, periodic
>>>> frame drops. It's not single frames but larger chunks of frames that are
>>>> dropped at a time. I tried mounting the volume with nobarrier but to
>>>> no avail.
>>> What's the drop interval? Something near 30s?
>>> If so, try mount option commit=300 to see if it helps.
>> [...]
> 64 GB RAM...
>
> Do you know about the /proc/sys/vm/dirty_* files and how to use/tweak 
> them?  If not, read $KERNDIR/Documentation/sysctl/vm.txt, focusing on 
> these files.
>
> These tunables control the amount of writeback cache that is allowed to 
> accumulate before the system starts flushing it.  The problem is that the 
> defaults for these tunables were selected back when system memory 
> was normally measured in MiB rather than GiB, so the default ratios 
> allow too much dirty data to accumulate before attempting to flush it to 
> storage, resulting in flush storms that hog the available IO and starve 
> other tasks that might be trying to use it.
>
> The fix is to tweak these settings to try to smooth things out, starting 
> background flush earlier, so with a bit of luck the system never hits 
> high priority foreground flush mode, or if it does there's not so much to 
> be written as much of it has already been done in the background.
>
> There are five files: two pairs of size settings, one pair controlling 
> the foreground limit and the other the background limit, plus a file 
> setting the time limit.  The sizes can be set either as a ratio (a 
> percentage of RAM) or in bytes, with whichever one you don't set 
> reading back as zero.
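
(For reference, a quick way to list the current values of all of these on
a typical system is:

  grep . /proc/sys/vm/dirty_*

which prints each file name alongside its current setting.)
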
>
> To set these temporarily you write to the appropriate file.  Once you 
> have a setting that works well for you, write it to your distro's sysctl 
> configuration (/etc/sysctl.conf or /etc/sysctl.d/*.conf, usually), and 
> it should be automatically applied at boot for you.
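
(As root, a temporary change can be made either by echoing into the /proc
files or with sysctl -w, for example:

  echo 3 > /proc/sys/vm/dirty_ratio
  echo 1 > /proc/sys/vm/dirty_background_ratio
  # or equivalently:
  sysctl -w vm.dirty_ratio=3 vm.dirty_background_ratio=1

To persist it, the vm.* lines go into the distro's sysctl config as you
say, e.g. a drop-in like /etc/sysctl.d/99-writeback.conf reloaded with
"sysctl --system"; the 99-writeback.conf name is only an example.)
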
>
> Here's the settings in my /etc/sysctl.conf, complete with notes about the 
> defaults and the values I've chosen for my 16G of RAM.  Note that while I 
> have fast ssds now, I set these values back when I had spinning rust.  I 
> was happy with them then, and while I shouldn't really need the settings 
> on my ssds, I've seen no reason to change them.
>
> At 16G, 1% ~ 160M.  At 64G, it'd be four times larger, 640M, likely too 
> chunky a granularity to be useful, so you'll probably want to set the 
> bytes value instead of ratio.
>
> # write-cache, foreground/background flushing
> # vm.dirty_ratio = 10 (% of RAM)
> # make it 3% of 16G ~ half a gig
> vm.dirty_ratio = 3
> # vm.dirty_bytes = 0
>
> # vm.dirty_background_ratio = 5 (% of RAM)
> # make it 1% of 16G ~ 160 M
> vm.dirty_background_ratio = 1
> # vm.dirty_background_bytes = 0
>
> # vm.dirty_expire_centisecs = 2999 (30 sec)
> # vm.dirty_writeback_centisecs = 499 (5 sec)
> # make it 10 sec
> vm.dirty_writeback_centisecs = 1000
>
>
> Now the other factor in the picture is how fast your actual hardware can 
> write.  hdparm's -t parameter times sequential buffered reads, which can 
> still give you some idea of the device's bandwidth.  You'll need to run it 
> as root:
>
> hdparm -t /dev/sda
>
> /dev/sda:
>  Timing buffered disk reads: 1578 MB in  3.00 seconds = 525.73 MB/sec
>
> ... Like I said, fast ssd...  I believe fast modern spinning rust should 
> be 100 MB/sec or so, tho slower devices may only do 30 MB/sec, likely too 
> slow for your reported 10-50 MB/sec stream, tho you say yours should be 
> fast enough as it's fine with xfs.
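
(Since hdparm -t only times reads, a rough sequential write figure can be
had by writing a test file directly onto the recording volume, something
like the following, where the path is just a placeholder:

  dd if=/dev/zero of=/path/to/recording/volume/testfile bs=1M count=2048 \
     oflag=direct conv=fsync    # path is only an example
  rm /path/to/recording/volume/testfile

dd prints the achieved throughput when it finishes; writing zeros will
look optimistic if compression is enabled on the filesystem.)
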
>
>
> Now here's the problem.  As Qu mentions elsewhere on-thread, 30 seconds 
> of your 10-50 MB/sec stream is 300-1500 MiB.  Say your available device 
> IO bandwidth is 100 MiB/sec.  That should be fine.  But the default 
> dirty_* settings allow 5% of RAM in dirty writeback cache before even 
> starting low priority background flush, while it won't kick to high 
> priority until 10% of RAM or 30 seconds, whichever comes first.
>
> And at 64 GiB RAM, 1% is as I said, about 640 MiB, so 10% is 6.4 GB dirty 
> before it kicks to high priority, and 3.2 GB is the 5% accumulation 
> before it even starts low priority background writing.  That's assuming 
> the 30 second timeout hasn't expired yet, of course.
>
> But as we established above the write stream maxes out at ~1.5 GiB in 30 
> seconds, and that's well below the ~3.2 GiB @ 64 GiB RAM that would kick 
> in even low priority background writeback!
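
Just to line the numbers up, using the figures above:

  50 MB/s * 30 s              ~ 1500 MiB   max dirty data per 30 s window
   5% of 64 GiB (background)  ~ 3200 MiB   never reached by the stream
  10% of 64 GiB (foreground)  ~ 6400 MiB   never reached either
  1500 MiB / 100 MiB/s        ~ 15 s       worst-case stall when the 30 s
                                           expiry finally forces a flush
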
>
> So at the defaults, the background writeback never kicks in at all, until 
> the 30 second timeout expires, forcing immediate high priority foreground 
> flushing!
>
> Meanwhile, the way the kernel handles /background/ writeback flushing is 
> that it will take the opportunity to writeback what it can while the 
> device is idle.  But as we've just established, background never kicks in.
>
> So then the timeout expires and the kernel kicks in high priority 
> foreground writeback.
>
> And the kernel handles foreground writeback *MUCH* differently!  
> Basically, it stops anything attempting to dirty more writeback cache 
> until it can write the dirty cache out.  And it charges the time it 
> spends doing just that to the thread it stopped in order to do that 
> high priority writeback! 
>
> Now as designed this should work well, and it does when the dirty_* 
> values are set correctly, because any process that's trying to dirty the 
> writeback cache faster than it can be written out, thus kicking in 
> foreground mode, gets stopped until the data can be written out, thus 
> preventing it from dirtying even MORE cache faster than the system can 
> handle it, which in /theory/ is what kicked it into high priority 
> foreground mode in the /first/ place.
>
> But as I said, the default ratios were selected when memory was far 
> smaller.  With half a gig of RAM, the default 5% to kick in background 
> mode would be only ~25 MiB, easily writable within a second on modern 
> devices and back then, still writable within say 5-10 seconds.  And if it 
> ever reached foreground mode, that would still be only 50 MiB worth, and 
> it would still complete in well under the 30 seconds before the next 
> expiry.
>
> But with modern RAM levels (my 16 GiB to some extent, and your 64 GiB is 
> even worse), as we've seen, even our max ~1500 MiB doesn't kick in 
> background writeback mode, so the stuff just sits there until it expires 
> and then gets slammed into high priority foreground mode, stopping your 
> streaming until it has a chance to write some of that dirty data out.
>
> And at our assumed 100 MiB/sec IO bandwidth, that 300-1500 MiB is going 
> to take 3-15 seconds to write out, well within the 30 seconds before the 
> next expiry, but for a time-critical streaming app, stopping it even the 
> minimal 3 seconds is very likely to drop frames!
>
>
> So try setting something a bit more reasonable and see if it helps.  That 
> 1% ratio at 16 GiB RAM for ~160 MB was fine for me, but I'm not doing 
> critical streaming, and at 64 GiB you're looking at ~640 MB per 1%, as I 
> said, too chunky.  For streaming, I'd suggest something approaching your 
> per-second IO bandwidth for the background value (we're assuming 100 
> MB/sec here, so 100 MiB, but let's round that up to a nice binary 128 
> MiB), and for foreground perhaps half a GiB, about 5 seconds worth of 
> writeback time and 4 times the background value.  So:
>
> vm.dirty_background_bytes = 134217728   # 128*1024*1024, 128 MiB
> vm.dirty_bytes = 536870912              # 512*1024*1024, 512 MiB
>
>
> As mentioned, try writing those values directly into /proc/sys/vm/
> dirty_background_bytes and dirty_bytes first, to see if it helps.  If 
> my guess is correct, that should vastly improve the situation for you.  
> If it does but not quite enough or you just want to try tweaking some 
> more, you can tweak it from there, but those are reasonable starting 
> values and really should work far better than the default 5% and 10% of 
> RAM with 64 GiB of it!
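
(To test those right away before making them permanent, as root:

  echo 134217728 > /proc/sys/vm/dirty_background_bytes
  echo 536870912 > /proc/sys/vm/dirty_bytes

Setting the *_bytes files makes the corresponding *_ratio files read back
as zero, as described above.)
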
>
>
> Other things to try tweaking include the IO scheduler -- the default is 
> the venerable CFQ but deadline may well be better for a streaming use-
> case, and now there's the new multi-queue stuff and the multi-queue kyber 
> and bfq schedulers, as well -- and setting IO priority -- probably by 
> increasing the IO priority of the streaming app.  The tool to use for the 
> latter is called ionice.  Do note, however, that not all schedulers 
> implement IO priorities.  CFQ does, but while I think deadline should 
> work better for the streaming use-case, it's simpler code and I don't 
> believe it implements IO priority.  Similarly for multi-queue, I'd guess 
> the low-code-designed-for-fast-direct-PCIE-connected-SSD kyber doesn't 
> implement IO priorities, while the more complex and general purpose 
> suitable-for-spinning-rust bfq /might/ implement IO priorities.
>
> But I know less about that stuff and it's googlable, should you decide to 
> try playing with it too.  I know what the dirty_* stuff does from 
> personal experience. =:^)
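
(For completeness, checking and switching the scheduler and raising the
recorder's IO priority would look roughly like this; sda and the obs
process name are only examples, and the schedulers offered depend on the
kernel configuration:

  cat /sys/block/sda/queue/scheduler          # e.g. noop deadline [cfq]
  echo deadline > /sys/block/sda/queue/scheduler
  ionice -c2 -n0 -p "$(pidof obs)"            # best-effort class, top priority

And as you say, the ionice priority only takes effect with a scheduler
that honors it, such as CFQ.)
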
>
>
> And to tie up a loose end, xfs has somewhat different design principles 
> and may well not be particularly sensitive to the dirty_* settings, while 
> btrfs, due to COW and other design choices, is likely more sensitive to 
> them than the widely used ext* and reiserfs (my old choice and the basis 
> of my own settings, above).

Excellent, book-like writeup showing how /proc/sys/vm/ works, but I
wonder: how do you explain why XFS works fine in this case?

> -- 
> PGP Public Key (RSA/4096b):
> ID: 0xF2C6EA10
> SHA-1: 51DA 40EE 832A 0572 5AD8 B3C0 7AFF 69E1 F2C6 EA10


