Re: illegal snapshot, cannot be deleted

From: Austin S Hemmelgarn <ahferroin7@gmail.com>
To: Hugo Mills <hugo@carfax.org.uk>,
	Vedran Vucic <vedran.vucic@gmail.com>,
	linux-btrfs@vger.kernel.org
Subject: Re: illegal snapshot, cannot be deleted
Date: Fri, 13 Nov 2015 15:20:26 -0500
Message-ID: <5646460A.1010601@gmail.com>
In-Reply-To: <20151113195520.GG24333@carfax.org.uk>

On 2015-11-13 14:55, Hugo Mills wrote:
> On Fri, Nov 13, 2015 at 02:40:44PM -0500, Austin S Hemmelgarn wrote:
>> On 2015-11-13 13:42, Hugo Mills wrote:
>>> On Fri, Nov 13, 2015 at 01:10:12PM -0500, Austin S Hemmelgarn wrote:
>>>> On 2015-11-13 12:30, Vedran Vucic wrote:
>>>>> Hello,
>>>>>
>>>>> Here are outputs of commands as you requested:
>>>>>   btrfs fi df /
>>>>> Data, single: total=8.00GiB, used=7.71GiB
>>>>> System, DUP: total=32.00MiB, used=16.00KiB
>>>>> Metadata, DUP: total=1.12GiB, used=377.25MiB
>>>>> GlobalReserve, single: total=128.00MiB, used=0.00B
>>>>>
>>>>> btrfs fi show
>>>>> Label: none  uuid: d6934db3-3ac9-49d0-83db-287be7b995a5
>>>>>          Total devices 1 FS bytes used 8.08GiB
>>>>>          devid    1 size 18.71GiB used 10.31GiB path /dev/sda6
>>>>>
>>>>> btrfs-progs v4.0+20150429
>>>>>
>>>> Hmm, that's odd, based on these numbers, you should be having no
>>>> issue at all trying to run a balance. You might be hitting some
>>>> other bug in the kernel, however, but I don't remember if there were
>>>> any known bugs related to ENOSPC or balance in the version you're
>>>> running.
>>>
>>>     There's one specific bug that shows up with ENOSPC exactly like
>>> this. It's in all versions of the kernel, there's no known solution,
>>> and no guaranteed mitigation strategy, I'm afraid. Various things like
>>> balancing, or adding, balancing, and removing a device again have been
>>> tried. Sometimes they seem to help; sometimes they just make the
>>> problem worse.
>>>
>>>     We average maybe one report a week or so with this particular
>>> set of symptoms.
>> We should get this listed on the wiki's Gotchas page ASAP,
>> especially considering that it's a pretty significant bug (not quite
>> as bad as data corruption, but pretty darn close).
>
>     It's certainly mentioned in the FAQ, in the main entry on
> unexpected ENOSPC. The text takes you through identifying when there's
> the "usual" problem, then goes on to say that if you've hit ENOSPC
> with free space still to be unallocated, you've got this issue.
It should probably still be on the Gotchas page as well, since it
definitely fits the general description of what's listed there.
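
For reference, the check the FAQ describes can be done by hand from the numbers posted earlier in the thread: 'btrfs fi show' reports a device size of 18.71GiB with 10.31GiB allocated to chunks. A rough sketch of the arithmetic follows. The values are copied by hand from that output rather than parsed from a live filesystem, and the variable names are only illustrative.

```shell
# Device size and chunk-allocated ("used") space from 'btrfs fi show', in GiB.
size_gib=18.71
allocated_gib=10.31

# Unallocated space is what is still available for new chunks.
unallocated_gib=$(awk -v s="$size_gib" -v a="$allocated_gib" \
    'BEGIN { printf "%.2f", s - a }')

echo "unallocated: ${unallocated_gib} GiB"
```

With several GiB still unallocated, an ENOSPC here does not look like the "usual" case of a fully allocated device.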
>> Vedran, could you try running the balance with just '-dusage=40' and
>> then again with just '-musage=40'?  If just one of those fails, it
>> could help narrow things down significantly.
>>
>> Hugo, is there anything else known about this issue (I don't recall
>> seeing it mentioned before, and a quick web search didn't turn up
>> much)?
>
>     I grumble about it regularly on IRC, where we get many more reports
> of it than on the mailing list. There have been a couple on here that
> I can recall, but not many.
Ah, that would explain it; I'm almost never on IRC.
>
>>   In particular:
>> 1. Is there any known way to reliably reproduce it (I would assume
>> not, as that would likely lead to a mitigation strategy.  If someone
>> does find a reliable reproducer, please let me know, I've got some
>> significant spare processor time and storage space I could dedicate
>> to getting traces and filesystem images for debugging, and already
>> have most of the required infrastructure set up for something like
>> this)?
>
>     None that I know of. I can start asking people for btrfs-image
> dumps again, if you want to investigate. I did do that for a while, to
> pass them to josef, but he said he didn't need any more of them after
> a while. (He was always planning on investigating it, but kept getting
> diverted by data corruption bugs, which have higher priority).
I don't have the experience to properly debug it myself from images (my
expertise has always been finding bugs, not necessarily fixing them). I
was offering more to try to generate images: if we could find some series
of commands that reproduces this at least some of the time, I have the
resources to run a couple of VMs doing that over and over until one hits
the bug. If I could get that far, I might be able to put some assertions
into the kernel so that it panics when there's an ENOSPC in the balance
code and gives a stack trace, but the more I think about it, the less
likely it seems that that would be much help.
>
>> 2. Is it contagious (that is, if I send a snapshot from a filesystem
>> that is affected by it, does the filesystem that receives the
>> snapshot become affected; if we could find a way to reproduce it, I
>> could easily answer this question within a couple of minutes of
>> reproducing it)?
>
>     No, as far as I know, it doesn't transfer via send/receive.
> send/receive is largely equivalent to copying the data by other means
> -- receive is implemented almost exclusively in userspace, with only a
> couple of ioctls for mucking around with the UUIDs at the end.
I thought that might be the case, but wanted to ask just to be safe. I
do local backups on some systems using send/receive, largely because it
means that if my regular root filesystem gets corrupted, I can boot the
backups directly, run a couple of commands, and have a working system
again in about 5 or 10 minutes. If this could spread through
send/receive, backups done this way would be less useful, because this
is something that I would treat similarly to regular FS corruption.
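
A minimal sketch of that kind of send/receive backup, for anyone unfamiliar with it. SNAP_DIR, BACKUP_MNT and the snapshot naming are hypothetical and not taken from this thread. The function only prints the commands so they can be reviewed before being run as root.

```shell
# Sketch only: SNAP_DIR and BACKUP_MNT are hypothetical paths.
SNAP_DIR=${SNAP_DIR:-/.snapshots}
BACKUP_MNT=${BACKUP_MNT:-/mnt/backup}

# Print (not run) the commands for one backup; $1 is a label such as a date.
backup_cmds() {
    snap="$SNAP_DIR/root-$1"
    # A snapshot must be read-only before it can be sent.
    echo "btrfs subvolume snapshot -r / $snap"
    # send serialises the snapshot; receive recreates it on the backup filesystem.
    echo "btrfs send $snap | btrfs receive $BACKUP_MNT"
}

backup_cmds "$(date +%Y%m%d)"
```

Because receive rebuilds the subvolume on the target through ordinary writes, this matches Hugo's point that it is largely equivalent to copying the data by other means.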
>
>> 3. Do we have any kind of statistics beyond the rate of reports (for
>> example, does it happen more often on bigger filesystems, or
>> possibly more frequently with certain chunk profiles)?
>
>     Not that I've noticed, no. We've had it on small and large,
> single-device and many devices, HDD and SSD, converted and not
> converted. At one point, a couple of years ago, I did think it was
> down to converted filesystems, because we had a run of them, but that
> seems not to be the case.
That would seem to indicate it's somewhere in the common path for
balance, which narrows things down at least, although not by much. Has
anyone tried balancing just data chunks or just metadata chunks? That
might narrow things down even further. If it's corruption in the FS
itself, I would assume it's in the system chunks, the metadata chunks,
or the space cache (if it's the space cache, mounting with clear_cache
should fix it).
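
To make those steps concrete, here is a hedged sketch of the commands involved. The mount point (/) and device (/dev/sda6) come from the output earlier in the thread. The /mnt mount point and the function name are assumptions for illustration. The function prints the commands rather than running them, so each can be run by hand as root and its result noted separately.

```shell
MNT=${MNT:-/}            # filesystem from the 'btrfs fi df /' output
DEV=${DEV:-/dev/sda6}    # device from the 'btrfs fi show' output

# Print (not run) the narrowing-down steps, one command per line.
narrowing_cmds() {
    # Data chunks only: relocate data chunks that are under 40% used.
    echo "btrfs balance start -dusage=40 $MNT"
    # Metadata chunks only, same threshold.
    echo "btrfs balance start -musage=40 $MNT"
    # Rebuild the free space cache. clear_cache is a mount option, so for a
    # root filesystem this would be done from a rescue environment.
    # /mnt is an assumed mount point.
    echo "mount -o clear_cache $DEV /mnt"
}

narrowing_cmds
```

If only one of the two balance runs fails with ENOSPC, that points at the corresponding chunk type. If both start working after the clear_cache mount, the space cache becomes the main suspect.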

