On 2015-11-13 13:42, Hugo Mills wrote:
> On Fri, Nov 13, 2015 at 01:10:12PM -0500, Austin S Hemmelgarn wrote:
>> On 2015-11-13 12:30, Vedran Vucic wrote:
>>> Hello,
>>>
>>> Here are outputs of commands as you requested:
>>> btrfs fi df /
>>> Data, single: total=8.00GiB, used=7.71GiB
>>> System, DUP: total=32.00MiB, used=16.00KiB
>>> Metadata, DUP: total=1.12GiB, used=377.25MiB
>>> GlobalReserve, single: total=128.00MiB, used=0.00B
>>>
>>> btrfs fi show
>>> Label: none uuid: d6934db3-3ac9-49d0-83db-287be7b995a5
>>> Total devices 1 FS bytes used 8.08GiB
>>> devid 1 size 18.71GiB used 10.31GiB path /dev/sda6
>>>
>>> btrfs-progs v4.0+20150429
>>>
>> Hmm, that's odd, based on these numbers, you should be having no
>> issue at all trying to run a balance. You might be hitting some
>> other bug in the kernel, however, but I don't remember if there were
>> any known bugs related to ENOSPC or balance in the version you're
>> running.
>
> There's one specific bug that shows up with ENOSPC exactly like
> this. It's in all versions of the kernel, there's no known solution,
> and no guaranteed mitigation strategy, I'm afraid. Various things like
> balancing, or adding, balancing, and removing a device again have been
> tried. Sometimes they seem to help; sometimes they just make the
> problem worse.
>
> We average maybe one report a week or so with this particular
> set of symptoms.

We should get this listed on the Gotchas page of the wiki ASAP,
especially considering that it's a pretty significant bug (not quite as
bad as data corruption, but pretty darn close).

Vedran, could you try running the balance with just '-dusage=40' and
then again with just '-musage=40'? If only one of those fails, it could
help narrow things down significantly.

Hugo, is there anything else known about this issue? I don't recall
seeing it mentioned before, and a quick web search didn't turn up much.
In particular:

1. Is there any known way to reliably reproduce it? I would assume not,
   as that would likely lead to a mitigation strategy. If someone does
   find a reliable reproducer, please let me know: I've got some
   significant spare processor time and storage space I could dedicate
   to getting traces and filesystem images for debugging, and I already
   have most of the required infrastructure set up for something like
   this.
2. Is it contagious? That is, if I send a snapshot from a filesystem
   that is affected by it, does the filesystem that receives the
   snapshot become affected? If we could find a way to reproduce it, I
   could answer this question within a couple of minutes of
   reproducing it.
3. Do we have any kind of statistics beyond the rate of reports? For
   example, does it happen more often on bigger filesystems, or more
   frequently with certain chunk profiles?
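
For reference, here is the arithmetic behind the "should be having no issue" remark on the numbers Vedran posted, as a minimal Python sketch. The helper names (to_bytes, parse_fi_df) are made up for illustration and are not part of btrfs-progs, and the regex assumes the 'btrfs fi df' line format exactly as pasted in the quoted output. It only restates the reported figures; it is not a test for the bug Hugo describes.

```python
import re

# Binary unit suffixes as printed by btrfs-progs in the quoted output.
UNITS = {"B": 1, "KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}
GIB = 2**30


def to_bytes(text):
    """Convert a size string such as '7.71GiB' to a byte count (hypothetical helper)."""
    m = re.fullmatch(r"([\d.]+)(B|KiB|MiB|GiB|TiB)", text)
    if m is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return float(m.group(1)) * UNITS[m.group(2)]


def parse_fi_df(output):
    """Parse 'btrfs fi df' lines into {type: (profile, total_bytes, used_bytes)}."""
    pattern = re.compile(r"(\w+), (\w+): total=(\S+), used=(\S+)")
    result = {}
    for line in output.splitlines():
        m = pattern.fullmatch(line.strip())
        if m:
            kind, profile, total, used = m.groups()
            result[kind] = (profile, to_bytes(total), to_bytes(used))
    return result


# Values copied from the quoted 'btrfs fi df /' output.
FI_DF = """\
Data, single: total=8.00GiB, used=7.71GiB
System, DUP: total=32.00MiB, used=16.00KiB
Metadata, DUP: total=1.12GiB, used=377.25MiB
GlobalReserve, single: total=128.00MiB, used=0.00B
"""

chunks = parse_fi_df(FI_DF)

# Values copied from the 'devid 1' line of the quoted 'btrfs fi show' output.
device_size = to_bytes("18.71GiB")
device_allocated = to_bytes("10.31GiB")

# Space on the device not yet allocated to any chunk.
unallocated_gib = (device_size - device_allocated) / GIB
# How full the already-allocated chunks are (used / total).
data_fullness = chunks["Data"][2] / chunks["Data"][1]
meta_fullness = chunks["Metadata"][2] / chunks["Metadata"][1]

print(f"unallocated: {unallocated_gib:.2f} GiB")
print(f"data chunks: {data_fullness:.0%} full")
print(f"metadata chunks: {meta_fullness:.0%} full")
```

So there is roughly 8.4 GiB still unallocated on the device, with the data chunks about 96% full and the metadata chunks about a third full. On paper that leaves room for balance to allocate new chunks, which is why the ENOSPC looks odd from these numbers alone.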