linux-btrfs.vger.kernel.org archive mirror
From: Anand Jain <anand.jain@oracle.com>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>,
	linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [DOC] BTRFS Volume operations, Device Lists and Locks all in one page
Date: Thu, 12 Jul 2018 20:33:29 +0800
Message-ID: <49fc4dbb-5e02-ab13-d7f1-7e52bf8868d6@oracle.com>
In-Reply-To: <db352d50-ce1b-f5fe-afbb-57bdd322b5fb@gmx.com>



On 07/12/2018 01:43 PM, Qu Wenruo wrote:
> 
> 
> On 2018-07-11 15:50, Anand Jain wrote:
>>
>>
>> BTRFS Volume operations, Device Lists and Locks all in one page:
>>
>> Devices are managed in two contexts, the scan context and the mounted
>> context. In the scan context the threads originate from the btrfs_control
>> ioctl, and in the mounted context the threads originate from the mount
>> point ioctl.
>> Apart from these two contexts, there can also be two transient states
>> where the device state is transitioning from the scan to the mount context
>> or from the mount to the scan context.
>>
>> Device List and Locks:-
>>
>>   Count: btrfs_fs_devices::num_devices
>>   List : btrfs_fs_devices::devices -> btrfs_devices::dev_list
>>   Lock : btrfs_fs_devices::device_list_mutex
>>
>>   Count: btrfs_fs_devices::rw_devices
> 
> So btrfs_fs_devices::num_devices = btrfs_fs_devices::rw_devices + RO
> devices.
> How are seed and RO devices different in this case?

  Given:
  btrfs_fs_devices::total_devices = btrfs_super_num_devices(disk_super);

  Consider no missing devices, no replace target, no seeding. Then,
    btrfs_fs_devices::total_devices == btrfs_fs_devices::num_devices

  And in case of seeding,
    btrfs_fs_devices::total_devices == (btrfs_fs_devices::num_devices +
                                 btrfs_fs_devices::seed::total_devices)

    All devices in the list [1] are RW/Sprout
      [1] fs_info::btrfs_fs_devices::devices
    All devices in the list [2] are RO/Seed
      [2] fs_info::btrfs_fs_devices::seed::devices
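
  A rough userspace sketch of the count relation above. The struct and
  the helper counts_consistent() are hypothetical stand-ins that keep
  only the counters; they are not the kernel definitions. It assumes no
  missing devices and no replace target, as stated above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct btrfs_fs_devices: counters only. */
struct fs_devices_sketch {
	unsigned long long total_devices; /* from the super block */
	unsigned long long num_devices;   /* entries on the devices list */
	struct fs_devices_sketch *seed;   /* NULL when there is no seed */
};

/*
 * Check the relation described above: without seeding, total_devices
 * equals num_devices; with seeding, the seed's total_devices is added.
 */
static bool counts_consistent(const struct fs_devices_sketch *fs)
{
	unsigned long long expected;

	assert(fs != NULL);
	expected = fs->num_devices;
	if (fs->seed)
		expected += fs->seed->total_devices;
	return fs->total_devices == expected;
}
```

  For example, a sprout with two RW devices on top of a one-device seed
  would be expected to report total_devices == 3.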


  Thanks for asking; I will add this part to the doc.


> 
>>   List : btrfs_fs_devices::alloc_list -> btrfs_devices::dev_alloc_list
>>   Lock : btrfs_fs_info::chunk_mutex
> 
> At least the chunk_mutex is also shared with chunk allocator,

  Right.

> or we
> should have some mutex in btrfs_fs_devices other than fs_info.
> Right?

  More locks? No. But some of the locks and flags wrongly belong to
  fs_info when they should have been in fs_devices. When the dust
  settles I plan to propose migrating them to fs_devices.

>>   Lock: set_bit btrfs_fs_info::flags::BTRFS_FS_EXCL_OP
>>
>> FSID List and Lock:-
>>
>>   Count : None
>>   HEAD  : Global::fs_uuids -> btrfs_fs_devices::fs_list
>>   Lock  : Global::uuid_mutex
>>
>>
>> After the fs_devices is mounted, the btrfs_fs_devices::opened > 0.
> 
> fs_devices::opened should be btrfs_fs_devices::num_devices if no device
> is missing and -1 or -2 for the degraded case, right?

  No. I think you are confusing it with
     btrfs_fs_devices::open_devices

  btrfs_fs_devices::opened
   indicates how many times the volume has been opened. In practice it
  stays at 1 (except for a short duration during a subsequent subvol
  mount).
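
  To make the difference concrete, here is a small userspace sketch.
  The struct vol_sketch and the helpers volume_open() and
  volume_close() are hypothetical names for illustration only; the
  point is that opened counts opens of the volume while open_devices
  counts member devices that could be opened.

```c
#include <assert.h>

struct vol_sketch {
	int opened;       /* how many times the volume has been opened */
	int open_devices; /* how many member devices are currently open */
	int num_devices;  /* how many member devices are known */
};

/* The first open brings up the devices; later opens only bump the count. */
static void volume_open(struct vol_sketch *v, int missing)
{
	assert(missing >= 0 && missing <= v->num_devices);
	if (v->opened == 0)
		v->open_devices = v->num_devices - missing;
	v->opened++;
}

/* The last close releases the devices. */
static void volume_close(struct vol_sketch *v)
{
	assert(v->opened > 0);
	if (--v->opened == 0)
		v->open_devices = 0;
}
```

  With three devices and one missing, open_devices would be 2 while
  opened is 1, which is the degraded case you had in mind.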


>> In the scan context we have the following device operations..
>>
>> Device SCAN:-  which creates the btrfs_fs_devices and its corresponding
>> btrfs_device entries, also checks and frees the duplicate device entries.
>> Lock: uuid_mutex
>>    SCAN
>>    if (found_duplicate && btrfs_fs_devices::opened == 0)
>>       Free_duplicate
>> Unlock: uuid_mutex
>>
>> Device READY:- check if the volume is ready. Also does an implicit scan
>> and duplicate device free as in Device SCAN.
>> Lock: uuid_mutex
>>    SCAN
>>    if (found_duplicate && btrfs_fs_devices::opened == 0)
>>       Free_duplicate
>>    Check READY
>> Unlock: uuid_mutex
>>
>> Device FORGET:- (planned) free a given or all unmounted devices and
>> empty fs_devices if any.
>> Lock: uuid_mutex
>>    if (found_duplicate && btrfs_fs_devices::opened == 0)
>>      Free duplicate
>> Unlock: uuid_mutex
>>
>> Device mount operation -> A Transient state leading to the mounted context
>> Lock: uuid_mutex
>>   Find, SCAN, btrfs_fs_devices::opened++
>> Unlock: uuid_mutex
>>
>> Device umount operation -> A transient state leading to the unmounted
>> context or scan context
>> Lock: uuid_mutex
>>    btrfs_fs_devices::opened--
>> Unlock: uuid_mutex
>>
>>
>> In the mounted context we have the following device operations..
>>
>> Device Rename through SCAN:- This is a special case where the device
>> path gets renamed after it's been mounted. (Ubuntu changes the boot path
>> during boot up so we need this feature). Currently, this is part of
>> Device SCAN as above. And we need the locks as below, because a
>> dynamically disappearing device might clean up the btrfs_device::name
>> Lock: btrfs_fs_devices::device_list_mutex
>>     Rename
>> Unlock: btrfs_fs_devices::device_list_mutex
>>
>> Commit Transaction:- Write All supers.
>> Lock: btrfs_fs_devices::device_list_mutex
>>    Write all super of btrfs_devices::dev_list
>> Unlock: btrfs_fs_devices::device_list_mutex
>>
>> Device add:- Add a new device to the existing mounted volume.
>> set_bit: btrfs_fs_info::flags::BTRFS_FS_EXCL_OP
>> Lock: btrfs_fs_devices::device_list_mutex
>> Lock: btrfs_fs_info::chunk_mutex
>>     List_add btrfs_devices::dev_list
>>     List_add btrfs_devices::dev_alloc_list
>> Unlock: btrfs_fs_info::chunk_mutex
>> Unlock: btrfs_fs_devices::device_list_mutex
>>
>> Device remove:- Remove a device from the mounted volume.
>> set_bit: btrfs_fs_info::flags::BTRFS_FS_EXCL_OP
>> Lock: btrfs_fs_devices::device_list_mutex
>> Lock: btrfs_fs_info::chunk_mutex
>>     List_del btrfs_devices::dev_list
>>     List_del btrfs_devices::dev_alloc_list
>> Unlock: btrfs_fs_info::chunk_mutex
>> Unlock: btrfs_fs_devices::device_list_mutex
>>
>> Device Replace:- Replace a device.
>> set_bit: btrfs_fs_info::flags::BTRFS_FS_EXCL_OP
>> Lock: btrfs_fs_devices::device_list_mutex
>> Lock: btrfs_fs_info::chunk_mutex
>>     List_update btrfs_devices::dev_list
> 
> Here we still just add a new device but not deleting the existing one
> until the replace is finished.

  Right, I did not elaborate on that part. By List_update I meant
  add/delete accordingly.
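
  Roughly, as a userspace sketch with pthread mutexes standing in for
  the kernel mutexes. All names here (replace_sketch, replace_start(),
  replace_finish(), the integer device ids) are hypothetical, and the
  exact point at which the target joins the alloc list is my
  assumption for illustration. The lock order follows the doc:
  device_list_mutex first, then chunk_mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define MAX_DEVS 8

struct replace_sketch {
	pthread_mutex_t device_list_mutex; /* guards dev_list */
	pthread_mutex_t chunk_mutex;       /* guards alloc_list */
	int dev_list[MAX_DEVS];
	int nr_devs;
	int alloc_list[MAX_DEVS];
	int nr_alloc;
};

static void list_add_id(int *list, int *nr, int id)
{
	assert(*nr < MAX_DEVS);
	list[(*nr)++] = id;
}

/* Remove id by moving the last entry into its slot; order is not kept. */
static bool list_del_id(int *list, int *nr, int id)
{
	for (int i = 0; i < *nr; i++) {
		if (list[i] == id) {
			list[i] = list[--(*nr)];
			return true;
		}
	}
	return false;
}

/* Replace start: only the target is added; the source stays listed. */
static void replace_start(struct replace_sketch *fs, int tgt_id)
{
	pthread_mutex_lock(&fs->device_list_mutex);
	pthread_mutex_lock(&fs->chunk_mutex);
	list_add_id(fs->dev_list, &fs->nr_devs, tgt_id);
	pthread_mutex_unlock(&fs->chunk_mutex);
	pthread_mutex_unlock(&fs->device_list_mutex);
}

/* Replace finish: the source is deleted and the target takes its place. */
static void replace_finish(struct replace_sketch *fs, int src_id, int tgt_id)
{
	pthread_mutex_lock(&fs->device_list_mutex);
	pthread_mutex_lock(&fs->chunk_mutex);
	list_del_id(fs->dev_list, &fs->nr_devs, src_id);
	list_del_id(fs->alloc_list, &fs->nr_alloc, src_id);
	list_add_id(fs->alloc_list, &fs->nr_alloc, tgt_id);
	pthread_mutex_unlock(&fs->chunk_mutex);
	pthread_mutex_unlock(&fs->device_list_mutex);
}
```

  So between start and finish the devices list holds one extra entry,
  and the delete happens only once the replace is finished.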

>>     List_update btrfs_devices::dev_alloc_list
>> Unlock: btrfs_fs_info::chunk_mutex
>> Unlock: btrfs_fs_devices::device_list_mutex
>>
>> Sprouting:- Add a RW device to the mounted RO seed device, so as to make
>> the mount point writable.
>> The following steps are used to hold the seed and sprout fs_devices.
>> (The first two steps are not necessary for the sprouting; they are there to
>> ensure the seed device remains scanned, and this might change.)
>> . Clone the (mounted) fs_devices, let's call it old_devices
>> . Now add old_devices to fs_uuids (yeah, there is a duplicate fsid in the
>> list but we change the other fsid before we release the uuid_mutex, so
>> it's fine).
>>
>> . Alloc a new fs_devices, let's call it seed_devices
>> . Copy fs_devices into the seed_devices
>> . Move the fs_devices devices list into seed_devices
>> . Bring seed_devices under fs_devices (fs_devices->seed = seed_devices)
>> . Assign a new FSID to the fs_devices and add the new writable device to
>> the fs_devices.
>>
>> In the unmounted context the fs_devices::seed is always NULL.
>> We alloc the fs_devices::seed only at the time of mount and/or at
>> sprouting, and free it at the time of umount or if the seed device is
>> replaced or deleted.
>>
>> Locks: Sprouting:
>> Lock: uuid_mutex <-- because fsid rename and Device SCAN
>> Reuses Device Add code
>>
>> Locks: Splitting: (Delete OR Replace a seed device)
>> uuid_mutex is not required as fs_devices::seed which is local to
>> fs_devices is being altered.
>> Reuses Device replace code
>>
>>
>> Device resize:- Resize the given volume or device.
>> Lock: btrfs_fs_info::chunk_mutex
>>     Update
>> Unlock: btrfs_fs_info::chunk_mutex
>>
>>
>> (Planned) Dynamic Device missing/reappearing:- A missing device might
>> reappear after its volume has been mounted; we have the same btrfs_control
>> ioctl which does the scan of the reappearing device but in the mounted
>> context. Conversely, a device of a volume in a mounted context can
>> go missing as well, and the volume will still continue in the mounted
>> context.
>> Missing:
>> Lock: btrfs_fs_devices::device_list_mutex
>> Lock: btrfs_fs_info::chunk_mutex
>>    List_del: btrfs_devices::dev_alloc_list
>>    Close_bdev
>>    btrfs_device::bdev == NULL
>>    btrfs_device::name = NULL
>>    set_bit BTRFS_DEV_STATE_MISSING
>>    set_bit BTRFS_VOL_STATE_DEGRADED
>> Unlock: btrfs_fs_info::chunk_mutex
>> Unlock: btrfs_fs_devices::device_list_mutex
>>
>> Reappearing:
>> Lock: btrfs_fs_devices::device_list_mutex
>> Lock: btrfs_fs_info::chunk_mutex
>>    Open_bdev
>>    btrfs_device::name = PATH
>>    clear_bit BTRFS_DEV_STATE_MISSING
>>    clear_bit BTRFS_VOL_STATE_DEGRADED
>>    List_add: btrfs_devices::dev_alloc_list
>>    set_bit BTRFS_VOL_STATE_RESILVERING
>>    kthread_run HEALTH_CHECK
> 
> For this part, I'm planning to add scrub support for a certain generation
> range, so scrubbing just the block groups which are newer than the
> last generation of the re-appeared device should be enough.
>
> However, I'm wondering if it's possible to reuse btrfs_balance_args, as
> we really have a lot of similarity when specifying block groups to
> relocate/scrub.

  What you proposed sounds interesting. But what about writes that failed
  at some generation number, not necessarily at the last generation?
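
  A small sketch of the concern, under assumed data structures. The
  struct bg_sketch, its write_failed flag and both helpers are
  hypothetical; nothing like them exists in the tree today. A scrub
  limited to block groups newer than the device's last generation
  would skip a block group whose write failed at an older generation.

```c
#include <assert.h>
#include <stddef.h>

struct bg_sketch {
	unsigned long long start;      /* block group logical start */
	unsigned long long generation; /* generation of the last write */
	int write_failed;              /* hypothetical: a device write failed */
};

/* Collect the block groups a "newer than dev_last_gen" scrub would visit. */
static size_t select_by_generation(const struct bg_sketch *bgs, size_t nr,
				   unsigned long long dev_last_gen,
				   const struct bg_sketch **out)
{
	size_t n = 0;

	assert(bgs != NULL && out != NULL);
	for (size_t i = 0; i < nr; i++)
		if (bgs[i].generation > dev_last_gen)
			out[n++] = &bgs[i];
	return n;
}

/* Count failed writes that the generation filter would not cover. */
static size_t missed_failures(const struct bg_sketch *bgs, size_t nr,
			      unsigned long long dev_last_gen)
{
	size_t missed = 0;

	assert(bgs != NULL);
	for (size_t i = 0; i < nr; i++)
		if (bgs[i].write_failed && bgs[i].generation <= dev_last_gen)
			missed++;
	return missed;
}
```

  If the device's last generation is 8 and a write failed in a block
  group last written at generation 5, the filter selects only the
  block groups at generation 9 and later, and that failure is missed.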

  I have been scratching at a fix for this [3] for some time now. Thanks
  for participating. In my understanding we are missing across-tree
  parent transid verification at the lowest possible granularity. The
  other approach is to modify Liubo's approach to provide a list of
  degraded chunks, but without a journal disk.
    [3] https://patchwork.kernel.org/patch/10403311/

  Further, as we do self-adapting chunk allocation in RAID1, it needs a
  balance-convert to fix. IMO at some point we have to provide degraded
  raid1 chunk allocation and also modify the scrub to be chunk granular.

Thanks, Anand

> Any idea on this?
> 
> Thanks,
> Qu
> 
>> Unlock: btrfs_fs_info::chunk_mutex
>> Unlock: btrfs_fs_devices::device_list_mutex
>>
>> -----------------------------------------------------------------------
>>
>> Thanks, Anand
> 

