public inbox for linux-xfs@vger.kernel.org
From: Avi Kivity <avi@scylladb.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: ENSOPC on a 10% used disk
Date: Sun, 21 Oct 2018 11:55:47 +0300	[thread overview]
Message-ID: <da5448d6-2ea1-ac8b-e21c-ce1124d400a1@scylladb.com> (raw)
In-Reply-To: <20181019075109.GM6311@dastard>


On 19/10/2018 10.51, Dave Chinner wrote:
> On Thu, Oct 18, 2018 at 04:36:42PM +0300, Avi Kivity wrote:
>> On 18/10/2018 14.00, Avi Kivity wrote:
>>>> Can I get access to the metadump to dig around in the filesystem
>>>> directly so I can see how everything has ended up laid out? that
>>>> will help me work out what is actually occurring and determine if
>>>> mkfs/mount options can address the problem or whether deeper
>>>> allocator algorithm changes may be necessary....
>>> I will ask permission to share the dump.
>> I'll send you a link privately.
> Thanks - I've started looking at this - the information here is
> just layout stuff - I've omitted filenames and anything else that
> might be identifying from the output.
>
> Looking at a commit log file:
>
> stat.size = 33554432
> stat.blocks = 34720
> fsxattr.xflags = 0x800 [----------e-----]
> fsxattr.projid = 0
> fsxattr.extsize = 33554432
> fsxattr.cowextsize = 0
> fsxattr.nextents = 14
>
>
> and the layout:
>
> EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET            TOTAL FLAGS
>    0: [0..4079]:       2646677520..2646681599 22 (95606800..95610879)  4080 001010
>    1: [4080..8159]:    2643130384..2643134463 22 (92059664..92063743)  4080 001010
>    2: [8160..12239]:   2642124816..2642128895 22 (91054096..91058175)  4080 001010
>    3: [12240..16319]:  2640666640..2640670719 22 (89595920..89599999)  4080 001010
>    4: [16320..18367]:  2640523264..2640525311 22 (89452544..89454591)  2048 000000
>    5: [18368..20415]:  2640119808..2640121855 22 (89049088..89051135)  2048 000000
>    6: [20416..21287]:  2639874064..2639874935 22 (88803344..88804215)   872 001111
>    7: [21288..21295]:  2639874936..2639874943 22 (88804216..88804223)     8 011111
>    8: [21296..24495]:  2639874944..2639878143 22 (88804224..88807423)  3200 001010
>    9: [24496..26543]:  2639427584..2639429631 22 (88356864..88358911)  2048 000000
>   10: [26544..28591]:  2638981120..2638983167 22 (87910400..87912447)  2048 000000
>   11: [28592..30639]:  2638770176..2638772223 22 (87699456..87701503)  2048 000000
>   12: [30640..31279]:  2638247952..2638248591 22 (87177232..87177871)   640 001111
>   13: [31280..34719]:  2638248592..2638252031 22 (87177872..87181311)  3440 011010
>   14: [34720..65535]:  hole                                           30816
>
> The first thing I note is the initial allocations are just short of
> 2MB and so the extent size hint is, indeed, being truncated here
> according to contiguous free space limitations. I had thought that
> should occur from reading the code, but it's complex and I wasn't
> 100% certain what minimum allocation length would be used.
>
> Looking at the system batchlog files, I'm guessing the filesystem
> ran out of contiguous 32MB free space extents some time around
> September 25. The *Data.db files from 24 Sep and earlier are
> all nice 32MB extents; from 25 Sep onwards they never make the full
> 32MB (30-31MB max). eg, good:
>
>   EXT: FILE-OFFSET       BLOCK-RANGE          AG AG-OFFSET            TOTAL FLAGS
>     0: [0..65535]:       350524552..350590087  3 (2651272..2716807)   65536 001111
>     1: [65536..131071]:  353378024..353443559  3 (5504744..5570279)   65536 001111
>     2: [131072..196607]: 355147016..355212551  3 (7273736..7339271)   65536 001111
>     3: [196608..262143]: 360029416..360094951  3 (12156136..12221671) 65536 001111
>     4: [262144..327679]: 362244144..362309679  3 (14370864..14436399) 65536 001111
>     5: [327680..343415]: 365809456..365825191  3 (17936176..17951911) 15736 001111
>
> bad:
>
> EXT: FILE-OFFSET      BLOCK-RANGE            AG AG-OFFSET            TOTAL FLAGS
>    0: [0..64127]:       512855496..512919623  4 (49024456..49088583) 64128 001111
>    1: [64128..128247]:  266567048..266631167  2 (34651528..34715647) 64120 001010
>    2: [128248..142327]: 264401888..264415967  2 (32486368..32500447) 14080 001111
>   


So the extent size is a hint, but the extent alignment is a hard 
requirement. Since the ENOSPC ultimately happened due to the alignment 
restriction, I think the alignment requirement should be made a hint too.


> Hmmm - there's 2 million files in this filesystem. That is quite a
> lot...
>
> Ok... I see where all the files are - there's a db that was
> snapshotted every half hour going back to December 19 2017. There's
> 55GB of snapshot data there: 14362 snapshots holding 1.8 million
> files.
>
> Ok, now I understand how the filesystem got into this mess. It has
> nothing really to do with the filesystem allocator, geometry, extent
> size hints, etc. It isn't really even an XFS specific problem - I
> think most filesystems would be in trouble if you did this to them.


Well, if you create snapshots and never delete them you'd run into a 
real ENOSPC sooner or later, so the main problem was lack of snapshot 
hygiene. But it did trigger a premature ENOSPC due to the alignment 
restriction on those small files with hints (which I'm going to remove).


>
> First, let me demonstrate that the freespace fragmentation is caused
> by these snapshots by removing them all:
>
> before:
>     from      to extents  blocks    pct
>        1       1    5916    5916   0.00
>        2       3   10235   22678   0.01
>        4       7   12251   66829   0.02
>        8      15    5521   59556   0.01
>       16      31    5703  132031   0.03
>       32      63    9754  463825   0.11
>       64     127   16742 1590339   0.37
>      128     255 1550511 390108625  89.87
>      256     511   71516 29178504   6.72
>      512    1023      19   15355   0.00
>     1024    2047     287  461824   0.11
>     2048    4095     528 1611413   0.37
>     4096    8191    1537 10352304   2.38
>     8192   16383       2   19015   0.00
>
> Run a delete:
>
> for d in snapshots/*; do
> 	rm -rf $d &
> done
>
> <cranking along at ~12,000 write iops>
>
> # uptime
> 17:41:08 up 22:07,  1 user,  load average: 14293.17, 13840.37, 9517.14
> #
>
> 500,000 files removed:
>     from      to extents  blocks    pct
>       64     127   22564 2054234   0.47
>      128     255  900480 226428059  51.43
>      256     511  189904 91033237  20.68
>      512    1023   68304 54958788  12.48
>     1024    2047   25187 38284024   8.70
>     2048    4095    5508 15204528   3.45
>     4096    8191    1665 10999789   2.50
>     8192   16383      15  139424   0.03
>
> 1m files removed:
>    from      to extents  blocks    pct
>       64     127   21940 1991685   0.45
>      128     255  536985 134731402  30.35
>      256     511  152092 73465972  16.55
>      512    1023  100471 82971130  18.69
>     1024    2047   48519 74016490  16.67
>     2048    4095   17272 49209538  11.09
>     4096    8191    4307 25135374   5.66
>     8192   16383     135 1254037   0.28
>
> 1.5m files removed:
>    from      to extents  blocks    pct
>       64     127    9851  924782   0.20
>      128     255  227945 57079302  12.32
>      256     511   38723 18129086   3.91
>      512    1023   33547 28027554   6.05
>     1024    2047   31904 50171699  10.83
>     2048    4095   25263 75381887  16.27
>     4096    8191   16885 102836365  22.19
>     8192   16383    6367 68809645  14.85
>    16384   32767    1862 40183775   8.67
>    32768   65535     385 16228869   3.50
>    65536  131071      51 4213237   0.91
>   131072  262143       6  958528   0.21
>
> after:
>    from      to extents  blocks    pct
>      128     255  154063 38785829   8.64
>      256     511   11037 4942114   1.10
>      512    1023    8576 6930035   1.54
>     1024    2047    8496 13464298   3.00
>     2048    4095    7664 23034455   5.13
>     4096    8191    8497 55217061  12.31
>     8192   16383    4233 45867691  10.22
>    16384   32767    1533 33488995   7.46
>    32768   65535     520 23924895   5.33
>    65536  131071     305 28675646   6.39
>   131072  262143     230 42411732   9.45
>   262144  524287      98 37213190   8.29
>   524288 1048575      41 29163579   6.50
> 1048576 2097151      27 40502889   9.03
> 2097152 4194303       5 14576157   3.25
> 4194304 8388607       2 10005670   2.23
>
> Ok, so the result is not perfect, but there are now huge contiguous
> free space extents available again - ~70% of the free space is now
> contiguous extents >=32MB in length. There's every chance that the
> fs would continue to help reform large contiguous free spaces as the
> database files come and go now, as long as the snapshot problem is
> dealt with.
>
> So, what's the problem? Well, it's simply that the workload is
> mixing data with vastly different temporal characteristics in the
> same physical locality. Every half an hour, a set of ~100 smallish
> files are written into a new directory which lands them at the low
> end of the largest free space extent in that AG. Each new snapshot
> directory ends up in a different AG, so it slowly spreads the
> snapshots across all the AGs in the filesystem.


Not exactly - those snapshots are hard links into the live database 
files, which eventually get removed. Usually, small files get removed 
early, but with the snapshots they get to live forever.


> Each snapshot effectively appends to the current working area in the
> AG, chopping it out of the largest contiguous free space. By the
> time the next snapshot in that AG comes around, there's other new
> short term data between the old snapshot and the new one. The new
> snapshot chops up the largest freespace, and on goes the cycle.
>
> Eventually the short term data between the snapshots gets removed,
> but this doesn't reform large contiguous free spaces because the
> snapshot data is in the way. And so this cycle continues with the
> snapshot data chopping up the largest freespace extents in the
> filesystem until there are no more large free space extents to be
> found.
>
> The solution is to manage the snapshot data better. We need to keep
> all the long term data physically isolated from the short term data
> so they don't fragment free space. A short term application level
> solution would require migrating the snapshot data out of the
> filesystem to somewhere else and point to it with symlinks.


Snapshots should not live forever on the disk. The procedure is to 
create a snapshot, copy it away, and then delete the snapshot. It's okay 
to let snapshots live for a while, but not all of them and not without a 
bound on their lifetime.


The filesystem did have a role in this, by requiring alignment of the 
extent to the RAID stripe size. Given that this was a RAID array with a 
single member, alignment is pointless here, but most of our deployments 
are to RAID arrays with more than one member, and alignment does save 
12.5% of IOPS compared to unaligned extents for compactions and writes 
(our scans/writes use 128k buffers, and the alignment is to 1MB). The 
database caused the problem by indirectly requiring 1MB alignment for 
files that are much smaller than 1MB, and the user contributed to the 
problem by causing millions of such small files to be kept.


>
> From the filesystem POV, I'm not sure that there is much we can do
> about this directly - we have no idea what the lifetime of the data
> is going to be....
>
> <ding>
>
> Hold on....
>
> <rummage in code>
>
> ....we already have an interface for setting those sorts of hints.
>
> fcntl(F_SET_RW_HINT, rw_hint)
>
> /*
>   * Valid hint values for F_{GET,SET}_RW_HINT. 0 is "not set", or can be
>   * used to clear any hints previously set.
>   */
> #define RWF_WRITE_LIFE_NOT_SET  0
> #define RWH_WRITE_LIFE_NONE     1
> #define RWH_WRITE_LIFE_SHORT    2
> #define RWH_WRITE_LIFE_MEDIUM   3
> #define RWH_WRITE_LIFE_LONG     4
> #define RWH_WRITE_LIFE_EXTREME  5
>
> Avi, does this sound like something that you could use to
> classify the different types of data the data base writes out?


So long as the penalty for a mis-classification is not too large, we 
certainly can. Commitlog files have a short lifespan, and so do newly 
born small data files. Those small data files are compacted into 
increasingly larger and longer-lived files, and this information is 
known at the time of creation.


Even without the filesystem altering its allocation according to the 
hint, this is still useful, since the disk will alter its internal 
allocation and maybe do something useful with it (as long as the 
filesystem passes the hint to the disk).


>
> I'll need to have a think about how to apply this to the allocator
> policy algorithms before going any further, but I suspect making use
> of this hint interface will allow us to prevent interleaving of short-
> and long-term data and so avoid the freespace fragmentation it is
> causing here....


IIUC, the problem (of having ENOSPC on a 10% used disk) is not 
fragmentation per se, it's the alignment requirement. To take it to the 
extreme, a 1TB disk can only hold a million files if those files must 
be aligned to 1MB, even if everything is perfectly laid out. For sure 
fragmentation would have degraded performance sooner or later, but 
that's not as bad as that ENOSPC.


I'm addressing the ENOSPC by removing the extent allocation hint on 
files that are known to be small (and increasing their application 
buffer sizes). In fact that will increase fragmentation, as the 
filesystem will allocate one extent per buffer rather than one extent 
for the entire file. But I think that, since the extent size is treated 
as a hint (or so I infer from the fact that we have <32MB extents), the 
alignment should be treated as a hint too. Perhaps allocation with a 
hint should be performed in two passes: first trying to match size and 
alignment, then relaxing both restrictions.

Thread overview: 26+ messages
2018-10-17  7:52 ENSOPC on a 10% used disk Avi Kivity
2018-10-17  8:47 ` Christoph Hellwig
2018-10-17  8:57   ` Avi Kivity
2018-10-17 10:54     ` Avi Kivity
2018-10-18  1:37 ` Dave Chinner
2018-10-18  7:55   ` Avi Kivity
2018-10-18 10:05     ` Dave Chinner
2018-10-18 11:00       ` Avi Kivity
2018-10-18 13:36         ` Avi Kivity
2018-10-19  7:51           ` Dave Chinner
2018-10-21  8:55             ` Avi Kivity [this message]
2018-10-21 14:28               ` Dave Chinner
2018-10-22  8:35                 ` Avi Kivity
2018-10-22  9:52                   ` Dave Chinner
2018-10-18 15:44         ` Avi Kivity
2018-10-18 16:11           ` Avi Kivity
2018-10-19  1:24           ` Dave Chinner
2018-10-21  9:00             ` Avi Kivity
2018-10-21 14:34               ` Dave Chinner
2018-10-19  1:15         ` Dave Chinner
2018-10-21  9:21           ` Avi Kivity
2018-10-21 15:06             ` Dave Chinner
2018-10-18 15:54 ` Eric Sandeen
2018-10-21 11:49   ` Avi Kivity
2019-02-05 21:48 ` Dave Chinner
2019-02-07 10:51   ` Avi Kivity
