linux-btrfs.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Goffredo Baroncelli <kreijack@inwind.it>
To: bo.li.liu@oracle.com
Cc: linux-btrfs@vger.kernel.org
Subject: Re: [PATCH RFC] Btrfs: expose bad chunks in sysfs
Date: Fri, 9 Feb 2018 18:02:01 +0100	[thread overview]
Message-ID: <5c754f5f-5f16-1c9c-4a0e-c9c5c55cdc60@inwind.it> (raw)
In-Reply-To: <20180208190714.GC9008@lim.localdomain>

On 02/08/2018 08:07 PM, Liu Bo wrote:
> On Thu, Feb 08, 2018 at 07:52:05PM +0100, Goffredo Baroncelli wrote:
>> On 02/06/2018 12:15 AM, Liu Bo wrote:
>> [...]
>>> One way to mitigate the data loss pain is to expose 'bad chunks',
>>> i.e. degraded chunks, to users, so that they can use 'btrfs balance'
>>> to relocate the whole chunk and get the full raid6 protection again
>>> (if the relocation works).
>>
>> [...]
>> [...]
>>
>>> +
>>> +	/* read lock please */
>>> +	do {
>>> +		seq = read_seqbegin(&fs_info->bc_lock);
>>> +		list_for_each_entry(bc, &fs_info->bad_chunks, list) {
>>> +			len += snprintf(buf + len, PAGE_SIZE - len, "%llu\n",
>>> +					bc->chunk_offset);
>>> +			/* chunk offset is u64 */
>>> +			if (len >= PAGE_SIZE)
>>> +				break;
>>> +		}
>>> +	} while (read_seqretry(&fs_info->bc_lock, seq));
>>
>> Using this interface, how many chunks can you list ? If I read the code correctly, only up to fill a kernel page.
>>
>> If my math are correctly (PAGE_SIZE=4k, a u64 could require up to 19 bytes) it is possible to list only few hundred of chunks (~200). Not more; and the last one could be even listed incomplete.
>>
> 
> That's true.
> 
>> IIRC a chunk size is max 1GB; If you lost a 500GB of disks, the chunks to list could be more than 200.
>>
>> My first suggestion is to limit the number of chunks to show to 200 (a page should be big enough to contains all these chunks offset). If the chunks number are greater, ends the list with a marker (something like '[...]\n').
>> This would solve the ambiguity about the fact that the list chunks are complete or not. Anyway you cannot list all the chunks.
>>
> 
> Good point, and I need to add one more field to each record to specify
> it's metadata or data.
> 
> So what I have in my mind is that this kind of error is rare and
> reflects bad sectors on disk, but if there are really that many
> errors, I think we need to replace the disk without hesitation.

What happens if you loose an entire disk ? If so, you fill "bad_sector"


> 
>> However, my second suggestions is to ... change completely the interface. What about adding a directory in sysfs, where each entry is a chunk ?
>>
>> Something like:
>>
>> /sys/fs/btrfs/<FS-UUID>/chunks/<chunks-offset>/type		# data/metadata/sys
>> /sys/fs/btrfs/<FS-UUID>/chunks/<chunks-offset>/profile		# dup/linear....
>> /sys/fs/btrfs/<FS-UUID>/chunks/<chunks-offset>/size		# size
>> /sys/fs/btrfs/<FS-UUID>/chunks/<chunks-offset>/devs		# chunks devs 
>>
>> And so on. 
>>
>> Checking  "[...]<chunks-offset>/devs", it would be easy understand if the chunk is in "degraded" mode or not.
> 
> I'm afraid we cannot do that as it'll occupy too much memory for large
> filesystems given a typical chunk size is 1GB.
> 
>>
>> However I have to admit that I don't know how feasible is iterate over a sysfs directory which is a map of a kernel objects list.
>>
>> I think that if these interface would be good enough, we could get rid of a lot of ioctl(TREE_SEARCH) from btrfs-progs. 
>>
> 
> TREE_SEARCH is faster than iterating sysfs (if we could), isn't it?

Likely yes. However TREE_SEARCH is bad because it is hard linked to the filesystem structure. I.e. if for some reason BTRFS change the on disk layout, what happens for the old application which expect the previous disk format ? And for the same reason, it is root-only (UID==0)

Let me to give another idea: what about exporting these information via an ioctl ? It could be extended to query all information about (all) the chunks...

> 
> thanks,
> -liubo
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5

  parent reply	other threads:[~2018-02-09 17:02 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2018-02-05 23:15 [PATCH RFC] Btrfs: expose bad chunks in sysfs Liu Bo
2018-02-06  1:28 ` Qu Wenruo
2018-02-07 22:57   ` Liu Bo
2018-02-08  2:49     ` Qu Wenruo
2018-02-06  9:11 ` Nikolay Borisov
2018-02-12 14:17   ` Liu Bo
2018-02-12 17:39     ` Nikolay Borisov
2018-02-13  0:06       ` Liu Bo
2018-02-08  9:47 ` Anand Jain
2018-02-12 14:37   ` Liu Bo
2018-02-08 18:52 ` Goffredo Baroncelli
2018-02-08 19:07   ` Liu Bo
2018-02-09  0:23     ` Qu Wenruo
2018-02-09 17:02     ` Goffredo Baroncelli [this message]
2018-02-12 15:26       ` Liu Bo
2018-02-09  4:57 ` Darrick J. Wong
2018-02-12 15:23   ` Liu Bo
2018-02-26 17:18     ` David Sterba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5c754f5f-5f16-1c9c-4a0e-c9c5c55cdc60@inwind.it \
    --to=kreijack@inwind.it \
    --cc=bo.li.liu@oracle.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).