public inbox for linux-xfs@vger.kernel.org
From: Dave Chinner <david@fromorbit.com>
To: Nikola Ciprich <nikola.ciprich@linuxbox.cz>
Cc: Eric Sandeen <esandeen@redhat.com>,
	Eric Sandeen <sandeen@sandeen.net>,
	linux-xfs@vger.kernel.org
Subject: Re: XFS / xfs_repair - problem reading very large sparse files on very large filesystem
Date: Mon, 8 Nov 2021 09:25:50 +1100	[thread overview]
Message-ID: <20211107222550.GH449541@dread.disaster.area> (raw)
In-Reply-To: <20211105161947.GK32555@pcnci.linuxbox.cz>

On Fri, Nov 05, 2021 at 05:19:47PM +0100, Nikola Ciprich wrote:
> > 
> > ok, thanks for the clarification.
> 
> no problem... in the meantime, xfs_bmap finished as well,
> resulting output has 1.5GB, showing total of 25354643 groups :-O

Yeah, that'll do it. If you are on spinning disks, at ~250 extents
per btree block you're talking about a hundred thousand IOs to read
in the extent list on first access to the file after mount.
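A quick back-of-envelope check of that estimate (a sketch using the extent count quoted above; "250 extents per btree block" and "~100 random IOPS per spinning disk" are rough working assumptions, not measured values):

```python
# Estimate the metadata reads needed to pull in this file's extent list.
extents = 25_354_643        # extents reported by xfs_bmap in this thread
per_block = 250             # ~extents per BMBT leaf block (approximate)

leaf_blocks = -(-extents // per_block)   # ceiling division
print(leaf_blocks)                       # ~101k blocks -> ~100k IOs

# At ~100 random read IOPS for a single spinning disk, that first
# access spends on the order of a thousand seconds just walking the
# extent btree.
seconds = leaf_blocks / 100
print(round(seconds))
```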

> > Though I've never heard of streaming video writes that weren't sequential ...
> > have you actually observed that via strace or whatnot?
> those are streams from many cameras, somehow multiplexed by processing software.
> The guy I communicate with, who's responsible for it, unfortunately doesn't
> know many details

The multiplexing is the problem here. Look at the allocation pattern
in the trace.

	680367: [872751104..872759863]: 870787280..870796039
	680368: [872759864..872760423]: 870799440..870799999
	680369: [872760424..872761527]: 870921888..870922991
	680370: [872761528..872762079]: 870959584..870960135
	680371: [872762080..872763631]: 871192144..871193695
	680372: [872763632..872763647]: 871183760..871183775
	680373: [872763648..872767487]: hole
	680374: [872767488..872768687]: 870796040..870797239
	680375: [872768688..872769887]: 870800000..870801199
	680376: [872769888..872772367]: 870922992..870925471
	680377: [872772368..872773559]: 870989000..870990191
	680378: [872773560..872775639]: 871193696..871195775
	680379: [872775640..872775679]: hole
	680380: [872775680..872776231]: 870797240..870797791
	680381: [872776232..872776775]: 870801200..870801743
	680382: [872776776..872777847]: 870870440..870871511
	680383: [872777848..872778383]: 870990192..870990727
	680384: [872778384..872779727]: 871195776..871197119
	680385: [872779728..872779791]: 871175064..871175127
	680386: [872779792..872783871]: hole
	680387: [872783872..872785519]: 870797792..870799439
	680388: [872785520..872786927]: 870801744..870803151
	680389: [872786928..872789671]: 870925472..870928215
	680390: [872789672..872791087]: 870990728..870992143
	680391: [872791088..872791991]: 871197120..871198023
	680392: [872791992..872792063]: hole

Let's lay that out into sequential blocks:

Stream 1:
	680367: [872751104..872759863]: 870787280..870796039
	680374: [872767488..872768687]: 870796040..870797239
	680380: [872775680..872776231]: 870797240..870797791
	680387: [872783872..872785519]: 870797792..870799439

Stream 2:
	680368: [872759864..872760423]: 870799440..870799999
	680375: [872768688..872769887]: 870800000..870801199
	680381: [872776232..872776775]: 870801200..870801743
	680388: [872785520..872786927]: 870801744..870803151

Stream 3:
	680369: [872760424..872761527]: 870921888..870922991
	680376: [872769888..872772367]: 870922992..870925471
	680382: [872776776..872777847]: 870870440..870871511 (discontig)
	680389: [872786928..872789671]: 870925472..870928215

Stream 4:
	680370: [872761528..872762079]: 870959584..870960135
	680377: [872772368..872773559]: 870989000..870990191
	680383: [872777848..872778383]: 870990192..870990727
	680390: [872789672..872791087]: 870990728..870992143

Stream 5:
	680371: [872762080..872763631]: 871192144..871193695
	680378: [872773560..872775639]: 871193696..871195775
	680384: [872778384..872779727]: 871195776..871197119
	680391: [872791088..872791991]: 871197120..871198023

Stream 6:
	680372: [872763632..872763647]: 871183760..871183775
	680373: [872763648..872767487]: hole	(contig with 680372)
	680379: [872775640..872775679]: hole
	680385: [872779728..872779791]: 871175064..871175127
	680386: [872779792..872783871]: hole	(contig with 680385)
	680392: [872791992..872792063]: hole
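The grouping above can be reproduced mechanically. A sketch (the helper is hypothetical, not an xfs tool): attach each extent to a stream when its physical start is exactly one block past that stream's last physical end.

```python
import re

# Parse xfs_bmap-style lines "extno: [lstart..lend]: pstart..pend" and
# group extents into per-writer streams by physical contiguity.
LINE = re.compile(r"(\d+): \[(\d+)\.\.(\d+)\]: (\d+)\.\.(\d+)")

def deinterleave(bmap_lines):
    streams = []                      # list of lists of (extno, pstart, pend)
    for line in bmap_lines:
        m = LINE.search(line)
        if not m:                     # skip holes and other noise
            continue
        extno, lstart, lend, pstart, pend = map(int, m.groups())
        for s in streams:
            if s[-1][2] + 1 == pstart:        # continues this stream
                s.append((extno, pstart, pend))
                break
        else:
            streams.append([(extno, pstart, pend)])
    return streams

sample = [
    "680367: [872751104..872759863]: 870787280..870796039",
    "680368: [872759864..872760423]: 870799440..870799999",
    "680374: [872767488..872768687]: 870796040..870797239",
    "680375: [872768688..872769887]: 870800000..870801199",
]
for s in deinterleave(sample):
    print([e[0] for e in s])
# prints [680367, 680374] then [680368, 680375]
```

Run over the full xfs_bmap output, this recovers the per-camera streams laid out by hand above.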

The reason I point this out is that the XFS allocator peels off a
chunk of the longest free extent on every new physical allocation
for a non-contiguous file offset.

Hence when we see this physical allocation pattern:

	680367: [872751104..872759863]: 870787280..870796039
	680374: [872767488..872768687]: 870796040..870797239
	680380: [872775680..872776231]: 870797240..870797791
	680387: [872783872..872785519]: 870797792..870799439

It indicates the order in which the writes are occurring. Hence it
would appear that the application is doing sparse writes for chunks
in the file, then going back later and partially filling the holes
with another run of sparse writes. Eventually all the holes are
filled, but you end up with a fragmented file.
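That hole-filling behaviour can be illustrated with a toy model (this is not the real XFS allocator, just the "peel from the longest free extent" rule described above, with made-up free-space numbers): two passes of sparse writes end up scattered across free space, and no two logically adjacent chunks land physically adjacent.

```python
# Toy free-space allocator: every write at a non-contiguous file
# offset peels a chunk off the front of the longest free extent.
def peel(free, want):
    ext = max(free, key=lambda e: e[1])   # longest free extent
    start = ext[0]
    ext[0] += want
    ext[1] -= want
    return start

free = [[0, 400], [1000, 300]]   # two free extents: [start, length]
mapping = {}                     # logical chunk -> physical start

# pass 1: sparse writes, leaving holes at the odd chunks
for chunk in (0, 2, 4):
    mapping[chunk] = peel(free, 100)
# pass 2: go back and fill the holes
for chunk in (1, 3, 5):
    mapping[chunk] = peel(free, 100)

print(sorted(mapping.items()))
# -> [(0, 0), (1, 200), (2, 100), (3, 1100), (4, 1000), (5, 300)]
# Logically sequential chunks are physically scattered: a fragmented file.
```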

This is actually by design - the XFS allocator is optimised for
efficient write IO (i.e. sequentialises writes as much as possible)
rather than optimal read IO.

From the allocation pattern, I suspect there are 6 cameras in this
multiplexer setup, each sample time that it needs to store an image
has a frame from each camera, and a series of frames is written per
camera before writing the next set of frames from the next camera.
Hence the allocation pattern on disk is effectively sequential for
each camera stream as they are written, but when viewed as a
multiplexed file, it's extremely fragmented because the individual
camera streams are interleaved.

> > What might be happening is that if you are streaming multiple
> > files into a single directory at the same time, it competes for
> > the allocator, and they will interleave.
> > 
> > XFS has an allocator mode called "filestreams" which was
> > designed just for this (video ingest).

Won't do anything - filestreams is for ensuring that "file per
frame" video ingest places all the files for a given video stream
contiguously in an AG. This looks like "multiple cameras and many
frames per file", which means the filestreams code will not trigger
or do anything different here.

> anyways I'll rather preallocate files fully for now, it takes a
> lot of time, but should be the safest way before we know what
> exactly is wrong..

That may well cause serious problems for camera data ingest,
because it forces the ingest write IO pattern to be non-contiguous
rather than sequential. Hence instead of the larger, sequentialised
writes per incoming data set that the above pattern suggests,
preallocation will result in many more small, sparse write IOs that
cannot merge.

This will increase write IO latency and reduce the amount of data
that can be written to disk. The likely result of this is that it
will reduce the number of cameras that can be supported per spinning
disk.

I would suggest that the best solution is to rotate camera data
files at a much smaller size so that the extent list doesn't get too
large. e.g. max file size is 1TB, keep historic records in 500x1TB
files instead of one single 500TB file...
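A sketch of that rotation scheme (the class, names, and the default 1 TiB cap are illustrative assumptions, not from any real ingest software):

```python
import os

# Close the current capture file and open the next one once a size cap
# is reached, so no single file ever accumulates a huge extent list.
class RotatingWriter:
    def __init__(self, basename, max_size=1 << 40):   # default cap: 1 TiB
        self.basename = basename
        self.max_size = max_size
        self.index = 0           # incremented each time a file is opened
        self.fd = None
        self.written = 0
        self._open_next()

    def _open_next(self):
        if self.fd is not None:
            os.close(self.fd)
        path = f"{self.basename}.{self.index:06d}"
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.index += 1
        self.written = 0

    def write(self, data):
        # rotate before this write would push the file past the cap
        if self.written + len(data) > self.max_size:
            self._open_next()
        n = os.write(self.fd, data)
        self.written += n
        return n
```

Rotating this way keeps each file's extent count bounded, so first access after mount only ever has to read a modest extent list per file.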

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

Thread overview: 10+ messages
2021-11-04  9:09 XFS / xfs_repair - problem reading very large sparse files on very large filesystem Nikola Ciprich
2021-11-04 16:20 ` Eric Sandeen
2021-11-05 14:13   ` Nikola Ciprich
2021-11-05 14:17     ` Nikola Ciprich
2021-11-05 14:56       ` Eric Sandeen
2021-11-05 15:59         ` Nikola Ciprich
2021-11-05 16:11           ` Eric Sandeen
2021-11-05 16:19             ` Nikola Ciprich
2021-11-07 22:25               ` Dave Chinner [this message]
2021-11-04 23:04 ` Dave Chinner
