Linux Btrfs filesystem development
 help / color / mirror / Atom feed
From: Anand Jain <anand.jain@oracle.com>
To: Josef Bacik <josef@toxicpanda.com>
Cc: Eli V <eliventer@gmail.com>, linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH RFC v2 0/2] readmirror feature
Date: Tue, 24 Sep 2019 22:27:41 +0800	[thread overview]
Message-ID: <159bfdff-35e6-3b32-3270-e56138ca01fe@oracle.com> (raw)
In-Reply-To: <e84756ec-8351-559b-bbd5-c7f532795a6d@oracle.com>

On 9/16/19 4:19 PM, Anand Jain wrote:
> On 12/9/19 6:13 PM, Josef Bacik wrote:
>> On Thu, Sep 12, 2019 at 06:10:08PM +0800, Anand Jain wrote:
>>>
>>>
>>>> On 12 Sep 2019, at 6:03 PM, Josef Bacik <josef@toxicpanda.com> wrote:
>>>>
>>>> On Thu, Sep 12, 2019 at 06:00:21PM +0800, Anand Jain wrote:
>>>>>
>>>>>
>>>>>> On 12 Sep 2019, at 5:50 PM, Josef Bacik <josef@toxicpanda.com> wrote:
>>>>>>
>>>>>> On Thu, Sep 12, 2019 at 03:41:42PM +0800, Anand Jain wrote:
>>>>>>>
>>>>>>>
>>>>>>> Thanks for the comments. More below.
>>>>>>>
>>>>>>> On 12/9/19 3:16 AM, Josef Bacik wrote:
>>>>>>>> On Wed, Sep 11, 2019 at 03:13:21PM -0400, Eli V wrote:
>>>>>>>>> On Wed, Sep 11, 2019 at 2:46 PM Josef Bacik 
>>>>>>>>> <josef@toxicpanda.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote:
>>>>>>>>>>> Function call chain  __btrfs_map_block()->find_live_mirror() 
>>>>>>>>>>> uses
>>>>>>>>>>> thread pid to determine the %mirror_num when the mirror_num=0.
>>>>>>>>>>>
>>>>>>>>>>> This patch introduces a framework so that we can add policies 
>>>>>>>>>>> to determine
>>>>>>>>>>> the %mirror_num. And also adds the devid as the readmirror 
>>>>>>>>>>> policy.
>>>>>>>>>>>
>>>>>>>>>>> The new property is stored as an item in the device tree as 
>>>>>>>>>>> show below.
>>>>>>>>>>>     (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY, 
>>>>>>>>>>> devid)
>>>>>>>>>>>
>>>>>>>>>>> To be able to set and get this new property also introduces 
>>>>>>>>>>> new ioctls
>>>>>>>>>>> BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The 
>>>>>>>>>>> ioctl argument
>>>>>>>>>>> is defined as
>>>>>>>>>>>         struct btrfs_ioctl_readmirror_args {
>>>>>>>>>>>                 __u64 type; /* RW */
>>>>>>>>>>>                 __u64 device_bitmap; /* RW */
>>>>>>>>>>>         }
>>>>>>>>>>>
>>>>>>>>>>> An usage example as follows:
>>>>>>>>>>>         btrfs property set /btrfs readmirror devid:1,3
>>>>>>>>>>>         btrfs property get /btrfs readmirror
>>>>>>>>>>>           readmirror devid:1 3
>>>>>>>>>>>         btrfs property set /btrfs readmirror ""
>>>>>>>>>>>         btrfs property get /btrfs readmirror
>>>>>>>>>>>           readmirror default
>>>>>>>>>>>
>>>>>>>>>>> This patchset has been tested completely, however marked as 
>>>>>>>>>>> RFC for the
>>>>>>>>>>> following reasons and comments on them (or any other) are 
>>>>>>>>>>> appreciated as
>>>>>>>>>>> usual.
>>>>>>>>>>> . The new objectid is defined as
>>>>>>>>>>>       #define BTRFS_READMIRROR_OBJECTID -1ULL
>>>>>>>>>>>    Need consent we are fine to use this value, and with this 
>>>>>>>>>>> value it
>>>>>>>>>>>    shall be placed just before the DEV_STATS_OBJECTID item 
>>>>>>>>>>> which is more
>>>>>>>>>>>    frequently used only during the device errors.
>>>>>>>>>>>
>>>>>>>>>>> .  I am using a u64 bitmap to represent the devices id, so 
>>>>>>>>>>> the max device
>>>>>>>>>>>    id that we could represent is 63, its a kind of limitation 
>>>>>>>>>>> which should
>>>>>>>>>>>    be addressed before integration, I wonder if there is any 
>>>>>>>>>>> suggestion?
>>>>>>>>>>>    Kindly note that, multiple ioctls with each time 
>>>>>>>>>>> representing a set of
>>>>>>>>>>>    device(s) is not a choice because we need to make sure the 
>>>>>>>>>>> readmirror
>>>>>>>>>>>    changes happens in a commit transaction.
>>>>>>>>>>>
>>>>>>>>>>> v1->RFC v2:
>>>>>>>>>>>   . Property is stored as a dev-tree item instead of root 
>>>>>>>>>>> inode extended
>>>>>>>>>>>     attribute.
>>>>>>>>>>>   . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to 
>>>>>>>>>>> BTRFS_DEV_STATE_READ_PREFERRED.
>>>>>>>>>>>   . Changed format specifier from devid1,2,3.. to devid:1,2,3..
>>>>>>>>>>>
>>>>>>>>>>> RFC->v1:
>>>>>>>>>>>   Drops pid as one of the readmirror policy choices and as 
>>>>>>>>>>> usual remains
>>>>>>>>>>>   as default. And when the devid is reset the readmirror 
>>>>>>>>>>> policy falls back
>>>>>>>>>>>   to pid.
>>>>>>>>>>>   Drops the mount -o readmirror idea, it can be added at a 
>>>>>>>>>>> later point of
>>>>>>>>>>>   time.
>>>>>>>>>>>   Property now accepts more than 1 devid as readmirror 
>>>>>>>>>>> device. As shown
>>>>>>>>>>>   in the example above.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This is a lot of infrastructure
>>>>>>>
>>>>>>> Ok. Any idea on a better implementation?
>>>>>>> How about extended attribute approach? v1 patches proposed
>>>>>>> it, but it abused the extended attribute as commented here [1]
>>>>>>> and v2 got changed to an item-key.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://lore.kernel.org/linux-btrfs/be68e6ea-00bc-b750-25e1-9c584b99308f@gmx.com/ 
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> That's a NAK on the prop interface.  This is a fs wide policy, not a
>>>>>> directory/inode policy.
>>>>>>
>>>>>>>
>>>>>>>>>> to just change which mirror we read to based on
>>>>>>>>>> some arbitrary user policy.  I assume this is to solve the 
>>>>>>>>>> case where you have
>>>>>>>>>> slow and fast disks, so you can always read from the fast 
>>>>>>>>>> disk?  And then it's
>>>>>>>>>> only used in RAID1, so the very narrow usecase of having a 
>>>>>>>>>> RAID1 setup with a
>>>>>>>>>> SSD and a normal disk?  I'm not seeing a point to this much 
>>>>>>>>>> code for one
>>>>>>>>>> particular obscure setup.  Thanks,
>>>>>>>>>>
>>>>>>>>>> Josef
>>>>>>>>>
>>>>>>>>> Not commenting on the code itself, but as a user I see this SSD 
>>>>>>>>> RAID1
>>>>>>>>> acceleration as a future much have feature. It's only obscure 
>>>>>>>>> at the
>>>>>>>>> moment because we don't have code to take advantage of it. But on
>>>>>>>>> large btrfs filesystems with hundreds of GB of metadata, like I 
>>>>>>>>> have
>>>>>>>>> for backups, the usability of the filesystem is dramatically 
>>>>>>>>> improved
>>>>>>>>> having the metadata on an SSD( though currently only half of 
>>>>>>>>> the time
>>>>>>>>> due to the even/odd pid distribution.)
>>>>>>>>
>>>>>>>> But that's different from a mirror.  100% it would be nice to 
>>>>>>>> say "put my
>>>>>>>> metadata on the ssd, data elsewhere".  That's not what this 
>>>>>>>> patch is about, this
>>>>>>>> patch is specifically about changing which drive we choose in a 
>>>>>>>> mirrored setup,
>>>>>>>> which is super unlikely to mirror a SSD with a slow drive, cause 
>>>>>>>> it's just going
>>>>>>>> to be slow no matter what.  Sure we could make it so reads 
>>>>>>>> always go to the SSD,
>>>>>>>> but we can accomplish that by just adding a check for 
>>>>>>>> nonrotational in the code,
>>>>>>>> and then we don't have to encode all this nonsense in the file 
>>>>>>>> system.  Thanks,
>>>>>>>
>>>>>>> I wrote about the readmirror policy framework here[2],
>>>>>>> I forgot to link it here, sorry about that, my mistake.
>>>>>>>
>>>>>>> [2]
>>>>>>>
>>>>>>> https://lore.kernel.org/linux-btrfs/1552989624-29577-1-git-send-email-anand.jain@oracle.com/ 
>>>>>>>
>>>>>>>
>>>>>>> Readmirror policy is for raid1, raid10 and future N way mirror.
>>>>>>> Yes for now its only for raid1.
>>>>>>>
>>>>>>> Here the idea is to create a framework so that readmirror policy
>>>>>>> can be configured as needed. And nonrotational can be one such 
>>>>>>> policy.
>>>>>>>
>>>>>>> The example of hard-coded nonrotational policy does not work in case
>>>>>>> of ssd and a remote iscsi ssd, OR in case of local ssd and a NVME 
>>>>>>> block
>>>>>>> device, as all these are still nonrotational devices. So hard-coded
>>>>>>> policy is not a good idea. If we have to hardcode then there is 
>>>>>>> Q-depth
>>>>>>> based readmirror routing is better (patch in the ML), but that is
>>>>>>> not good enough, because some configs wants it based on the disk-LBA
>>>>>>> so that SAN storage target cache is balanced and not duplicated.
>>>>>>> So in short it must be a configurable policy.
>>>>>>>
>>>>>>
>>>>>> Again, if you are mixing disk types you likely always want 
>>>>>> non-rotational, but
>>>>>> still mixing different speed devices in a mirror setup is just 
>>>>>> asking for weird
>>>>>> latency problems.  I don't think solving this use case is 
>>>>>> necessary.  If you mix
>>>>>> ssd + network device in a serious production setup then you 
>>>>>> probably should be
>>>>>> fired cause you don't know what you are doing.  Having the generic
>>>>>> "nonrotational gets priority" is going to cover 99% of the actual 
>>>>>> use cases that
>>>>>> make sense.
>>>>>>
>>>>>> The SAN usecase I can sort of see, but again I don't feel like 
>>>>>> it's a problem we
>>>>>> need to solve with on-disk format.  Add a priority to sysfs so you 
>>>>>> can change it
>>>>>> with udev or something on the fly.  Thanks,
>>>>>>
>>>>>
>>>>> Ok.
>>>>> Sysfs is fine however we need configuration to be persistent across 
>>>>> reboots.
>>>>> Any idea?
>>>>>
>>>>
>>>> Udev rules.  Thanks,
>>>>
>>>
>>>   Josef, configs moving along with the luns in san seems to be more
>>>   easy to manage, otherwise when the host fails, each potential new
>>>   server has to be pre-configured with the udev rules.
>>>
>>
>> It's 2019, if people haven't figured out how to do persistent 
>> configuration by
>> now then all hope is lost.  Facebook persistently configures millions of
>> machines, I'm sure people can figure out how to make sure a udev rule 
>> ends up on
>> the right host connected to a SAN that doesn't move.  Thanks,
>>
>> Josef
>>
> 
> yeah. sysfs interface should be fine for now, and based on the
> usage feedback we could consider persistent storage for the readmirror.
> 
> Typical to WORM (Write Once Read Many times) setups they mainly need
> good read performance as data is imported once from somewhere else.
> Just out of curiosity ran read performance checks using the readmirror 
> property on ssd and nvme device.
> 
> $ btrfs fi show
> Label: none  uuid: e24bd8b5-a9e9-41d4-b28e-e95c0c545466
>      Total devices 2 FS bytes used 391.64MiB
>      devid    1 size 447.13GiB used 2.01GiB path /dev/sdf
>      devid    2 size 5.82TiB used 2.01GiB path /dev/nvme0n1
> 
> $ cat /sys/block/sdf/queue/rotational
> 0
> $ cat /sys/block/nvme0n1/queue/rotational
> 0
> 
> $ dropcache; sleep 3; btrfs prop set /btrfs readmirror devid:2 && time 
> md5sum /btrfs/fOH2
> 6129ccf74f7a761e0c3e096e051ba7a2  /btrfs/fOH2
> 
> real    0m0.725s
> user    0m0.604s
> sys    0m0.113s
> 
> $ dropcache; sleep 3; btrfs prop set /btrfs readmirror devid:1 && time 
> md5sum /btrfs/fOH2
> 6129ccf74f7a761e0c3e096e051ba7a2  /btrfs/fOH2
> 
> real    0m1.125s
> user    0m0.643s
> sys    0m0.087s
> 
> nvme provides ~40% improvement for 400mb read io, however there are no
> other IOs. The question will be what kind of read io load balancer will
> be required at the full throttle of the preferred readmirror
> device IO Q depth [1]. Which means a heuristic is required to juggle
> around the policies and will be another readmirror policy to be part
> of the same framework when required.
> 
> [1]
> $ cat /sys/block/nvme0n1/queue/nr_requests
> 1023
> $ cat /sys/block/sdf/queue/nr_requests
> 64
> 
> Thanks, Anand


Josef,

  Here I infer as follows .. I wonder if it makes sense.

  - Potential readmirror policies are: pid, devid/manual, lba/cache,
    Q-depth(patches are in the ml from facebook) each one of them has
    their own advantage so a framework is essential.

  - As the block device performance keeps evolving so its common to have
    heterogeneous block devices on the host. readmirror policy helps
    to configure for the best possible performance.

  - sysfs is a good interface for readmirror policy as in ioctl we can't
    pass all possible devices in the ioctl argument.

  - ondisk storage is essential for the readmirror, and key-item pair
    is the only way for its storage. (On the 2nd thought storage cannot
    be implemented at a later point as sysfs interface must be consistent
    in each version. So implementing together sysfs and ondisk key-item
    pair makes sense to me.)

Thanks, Anand

      reply	other threads:[~2019-09-24 14:28 UTC|newest]

Thread overview: 17+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-26  9:04 [PATCH RFC v2 0/2] readmirror feature Anand Jain
2019-08-26  9:04 ` [PATCH RFC v2] btrfs: add readmirror framework and policy devid Anand Jain
2019-08-26  9:04 ` [PATCH RFC v2] btrfs-progs: add readmirror property and ioctl to set get readmirror Anand Jain
2019-08-29  3:39   ` [PATCH RFC v2.1] btrfs-progs: add readmirror property and ioctl to set get Anand Jain
2019-09-11 16:37 ` [PATCH RFC v2 0/2] readmirror feature David Sterba
2019-09-12  7:48   ` Anand Jain
2019-09-11 18:42 ` Josef Bacik
2019-09-11 19:13   ` Eli V
2019-09-11 19:16     ` Josef Bacik
2019-09-12  7:41       ` Anand Jain
2019-09-12  9:50         ` Josef Bacik
2019-09-12 10:00           ` Anand Jain
2019-09-12 10:03             ` Josef Bacik
2019-09-12 10:10               ` Anand Jain
2019-09-12 10:13                 ` Josef Bacik
2019-09-16  8:19                   ` Anand Jain
2019-09-24 14:27                     ` Anand Jain [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=159bfdff-35e6-3b32-3270-e56138ca01fe@oracle.com \
    --to=anand.jain@oracle.com \
    --cc=eliventer@gmail.com \
    --cc=josef@toxicpanda.com \
    --cc=linux-btrfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox