On 2015-11-13 13:42, Hugo Mills wrote:
> On Fri, Nov 13, 2015 at 01:10:12PM -0500, Austin S Hemmelgarn wrote:
>> On 2015-11-13 12:30, Vedran Vucic wrote:
>>> Hello,
>>>
>>> Here are outputs of commands as you requested:
>>>   btrfs fi df /
>>> Data, single: total=8.00GiB, used=7.71GiB
>>> System, DUP: total=32.00MiB, used=16.00KiB
>>> Metadata, DUP: total=1.12GiB, used=377.25MiB
>>> GlobalReserve, single: total=128.00MiB, used=0.00B
>>>
>>> btrfs fi show
>>> Label: none  uuid: d6934db3-3ac9-49d0-83db-287be7b995a5
>>>          Total devices 1 FS bytes used 8.08GiB
>>>          devid    1 size 18.71GiB used 10.31GiB path /dev/sda6
>>>
>>> btrfs-progs v4.0+20150429
>>>
>> Hmm, that's odd, based on these numbers, you should be having no
>> issue at all trying to run a balance. You might be hitting some
>> other bug in the kernel, however, but I don't remember if there were
>> any known bugs related to ENOSPC or balance in the version you're
>> running.
>
>     There's one specific bug that shows up with ENOSPC exactly like
> this. It's in all versions of the kernel, there's no known solution,
> and no guaranteed mitigation strategy, I'm afraid. Various things like
> balancing, or adding, balancing, and removing a device again have been
> tried. Sometimes they seem to help; sometimes they just make the
> problem worse.
>
>     We average maybe one report a week or so with this particular
> set of symptoms.
We should get this listed on the Wiki on the Gotchas page ASAP, 
especially considering that it's a pretty significant bug (not quite as 
bad as data corruption, but pretty darn close).

Vedran, could you try running the balance with just '-dusage=40' and 
then again with just '-musage=40'?  If just one of those fails, it could 
help narrow things down significantly.
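
Something like the following would do it. This is only a rough sketch: 
the mount point '/' is taken from your 'btrfs fi df /' output, and the 
BTRFS and MNT variables and the run_balance helper are names I made up 
for this script. By default it only prints the commands; run it with 
BTRFS=btrfs (as root) to actually start the balances.

```shell
#!/bin/sh
# Sketch: run a data-only and then a metadata-only balance, reporting each
# result separately so it is clear which of the two fails.

# Assumption: the affected filesystem is mounted at / (override with MNT=...).
MNT="${MNT:-/}"
# Dry run by default: "echo btrfs" prints the command instead of running it.
# Set BTRFS=btrfs to run the balances for real.
BTRFS="${BTRFS:-echo btrfs}"

run_balance() {
    # $1 is the single balance filter to apply, e.g. -dusage=40
    $BTRFS balance start "$1" "$MNT"
    rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "balance $1: OK"
    else
        echo "balance $1: FAILED (exit status $rc)"
    fi
}

run_balance -dusage=40   # data chunks at most 40% full
run_balance -musage=40   # metadata chunks at most 40% full
```

If either one fails, the exact error text from that run would be useful 
to see.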

Hugo, is there anything else known about this issue?  I don't recall 
seeing it mentioned before, and a quick web search didn't turn up much. 
In particular:
1. Is there any known way to reliably reproduce it?  I would assume not, 
as that would likely lead to a mitigation strategy.  If someone does 
find a reliable reproducer, please let me know: I've got some 
significant spare processor time and storage space I could dedicate to 
getting traces and filesystem images for debugging, and already have 
most of the required infrastructure set up for something like this.
2. Is it contagious?  That is, if I send a snapshot from a filesystem 
that is affected by it, does the filesystem that receives the snapshot 
become affected?  If we could find a way to reproduce it, I could easily 
answer this question within a couple of minutes of reproducing it.
3. Do we have any kind of statistics beyond the rate of reports (for 
example, does it happen more often on bigger filesystems, or possibly 
more frequently with certain chunk profiles)?