From: Anand Jain <anand.jain@oracle.com>
To: Josef Bacik <josef@toxicpanda.com>
Cc: Eli V <eliventer@gmail.com>, linux-btrfs <linux-btrfs@vger.kernel.org>
Subject: Re: [PATCH RFC v2 0/2] readmirror feature
Date: Tue, 24 Sep 2019 22:27:41 +0800 [thread overview]
Message-ID: <159bfdff-35e6-3b32-3270-e56138ca01fe@oracle.com> (raw)
In-Reply-To: <e84756ec-8351-559b-bbd5-c7f532795a6d@oracle.com>
On 9/16/19 4:19 PM, Anand Jain wrote:
> On 12/9/19 6:13 PM, Josef Bacik wrote:
>> On Thu, Sep 12, 2019 at 06:10:08PM +0800, Anand Jain wrote:
>>>
>>>
>>>> On 12 Sep 2019, at 6:03 PM, Josef Bacik <josef@toxicpanda.com> wrote:
>>>>
>>>> On Thu, Sep 12, 2019 at 06:00:21PM +0800, Anand Jain wrote:
>>>>>
>>>>>
>>>>>> On 12 Sep 2019, at 5:50 PM, Josef Bacik <josef@toxicpanda.com> wrote:
>>>>>>
>>>>>> On Thu, Sep 12, 2019 at 03:41:42PM +0800, Anand Jain wrote:
>>>>>>>
>>>>>>>
>>>>>>> Thanks for the comments. More below.
>>>>>>>
>>>>>>> On 12/9/19 3:16 AM, Josef Bacik wrote:
>>>>>>>> On Wed, Sep 11, 2019 at 03:13:21PM -0400, Eli V wrote:
>>>>>>>>> On Wed, Sep 11, 2019 at 2:46 PM Josef Bacik
>>>>>>>>> <josef@toxicpanda.com> wrote:
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 26, 2019 at 05:04:36PM +0800, Anand Jain wrote:
>>>>>>>>>>> Function call chain __btrfs_map_block()->find_live_mirror()
>>>>>>>>>>> uses
>>>>>>>>>>> thread pid to determine the %mirror_num when the mirror_num=0.
>>>>>>>>>>>
>>>>>>>>>>> This patch introduces a framework so that we can add policies
>>>>>>>>>>> to determine
>>>>>>>>>>> the %mirror_num. And also adds the devid as the readmirror
>>>>>>>>>>> policy.
>>>>>>>>>>>
>>>>>>>>>>> The new property is stored as an item in the device tree as
>>>>>>>>>>> show below.
>>>>>>>>>>> (BTRFS_READMIRROR_OBJECTID, BTRFS_PERSISTENT_ITEM_KEY,
>>>>>>>>>>> devid)
>>>>>>>>>>>
>>>>>>>>>>> To be able to set and get this new property also introduces
>>>>>>>>>>> new ioctls
>>>>>>>>>>> BTRFS_IOC_GET_READMIRROR and BTRFS_IOC_SET_READMIRROR. The
>>>>>>>>>>> ioctl argument
>>>>>>>>>>> is defined as
>>>>>>>>>>> struct btrfs_ioctl_readmirror_args {
>>>>>>>>>>> __u64 type; /* RW */
>>>>>>>>>>> __u64 device_bitmap; /* RW */
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> An usage example as follows:
>>>>>>>>>>> btrfs property set /btrfs readmirror devid:1,3
>>>>>>>>>>> btrfs property get /btrfs readmirror
>>>>>>>>>>> readmirror devid:1 3
>>>>>>>>>>> btrfs property set /btrfs readmirror ""
>>>>>>>>>>> btrfs property get /btrfs readmirror
>>>>>>>>>>> readmirror default
>>>>>>>>>>>
>>>>>>>>>>> This patchset has been tested completely, however marked as
>>>>>>>>>>> RFC for the
>>>>>>>>>>> following reasons and comments on them (or any other) are
>>>>>>>>>>> appreciated as
>>>>>>>>>>> usual.
>>>>>>>>>>> . The new objectid is defined as
>>>>>>>>>>> #define BTRFS_READMIRROR_OBJECTID -1ULL
>>>>>>>>>>> Need consent we are fine to use this value, and with this
>>>>>>>>>>> value it
>>>>>>>>>>> shall be placed just before the DEV_STATS_OBJECTID item
>>>>>>>>>>> which is more
>>>>>>>>>>> frequently used only during the device errors.
>>>>>>>>>>>
>>>>>>>>>>> . I am using a u64 bitmap to represent the devices id, so
>>>>>>>>>>> the max device
>>>>>>>>>>> id that we could represent is 63, its a kind of limitation
>>>>>>>>>>> which should
>>>>>>>>>>> be addressed before integration, I wonder if there is any
>>>>>>>>>>> suggestion?
>>>>>>>>>>> Kindly note that, multiple ioctls with each time
>>>>>>>>>>> representing a set of
>>>>>>>>>>> device(s) is not a choice because we need to make sure the
>>>>>>>>>>> readmirror
>>>>>>>>>>> changes happens in a commit transaction.
>>>>>>>>>>>
>>>>>>>>>>> v1->RFC v2:
>>>>>>>>>>> . Property is stored as a dev-tree item instead of root
>>>>>>>>>>> inode extended
>>>>>>>>>>> attribute.
>>>>>>>>>>> . Rename BTRFS_DEV_STATE_READ_OPRIMIZED to
>>>>>>>>>>> BTRFS_DEV_STATE_READ_PREFERRED.
>>>>>>>>>>> . Changed format specifier from devid1,2,3.. to devid:1,2,3..
>>>>>>>>>>>
>>>>>>>>>>> RFC->v1:
>>>>>>>>>>> Drops pid as one of the readmirror policy choices and as
>>>>>>>>>>> usual remains
>>>>>>>>>>> as default. And when the devid is reset the readmirror
>>>>>>>>>>> policy falls back
>>>>>>>>>>> to pid.
>>>>>>>>>>> Drops the mount -o readmirror idea, it can be added at a
>>>>>>>>>>> later point of
>>>>>>>>>>> time.
>>>>>>>>>>> Property now accepts more than 1 devid as readmirror
>>>>>>>>>>> device. As shown
>>>>>>>>>>> in the example above.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This is a lot of infrastructure
>>>>>>>
>>>>>>> Ok. Any idea on a better implementation?
>>>>>>> How about extended attribute approach? v1 patches proposed
>>>>>>> it, but it abused the extended attribute as commented here [1]
>>>>>>> and v2 got changed to an item-key.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://lore.kernel.org/linux-btrfs/be68e6ea-00bc-b750-25e1-9c584b99308f@gmx.com/
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> That's a NAK on the prop interface. This is a fs wide policy, not a
>>>>>> directory/inode policy.
>>>>>>
>>>>>>>
>>>>>>>>>> to just change which mirror we read to based on
>>>>>>>>>> some arbitrary user policy. I assume this is to solve the
>>>>>>>>>> case where you have
>>>>>>>>>> slow and fast disks, so you can always read from the fast
>>>>>>>>>> disk? And then it's
>>>>>>>>>> only used in RAID1, so the very narrow usecase of having a
>>>>>>>>>> RAID1 setup with a
>>>>>>>>>> SSD and a normal disk? I'm not seeing a point to this much
>>>>>>>>>> code for one
>>>>>>>>>> particular obscure setup. Thanks,
>>>>>>>>>>
>>>>>>>>>> Josef
>>>>>>>>>
>>>>>>>>> Not commenting on the code itself, but as a user I see this SSD
>>>>>>>>> RAID1
>>>>>>>>> acceleration as a future much have feature. It's only obscure
>>>>>>>>> at the
>>>>>>>>> moment because we don't have code to take advantage of it. But on
>>>>>>>>> large btrfs filesystems with hundreds of GB of metadata, like I
>>>>>>>>> have
>>>>>>>>> for backups, the usability of the filesystem is dramatically
>>>>>>>>> improved
>>>>>>>>> having the metadata on an SSD( though currently only half of
>>>>>>>>> the time
>>>>>>>>> due to the even/odd pid distribution.)
>>>>>>>>
>>>>>>>> But that's different from a mirror. 100% it would be nice to
>>>>>>>> say "put my
>>>>>>>> metadata on the ssd, data elsewhere". That's not what this
>>>>>>>> patch is about, this
>>>>>>>> patch is specifically about changing which drive we choose in a
>>>>>>>> mirrored setup,
>>>>>>>> which is super unlikely to mirror a SSD with a slow drive, cause
>>>>>>>> it's just going
>>>>>>>> to be slow no matter what. Sure we could make it so reads
>>>>>>>> always go to the SSD,
>>>>>>>> but we can accomplish that by just adding a check for
>>>>>>>> nonrotational in the code,
>>>>>>>> and then we don't have to encode all this nonsense in the file
>>>>>>>> system. Thanks,
>>>>>>>
>>>>>>> I wrote about the readmirror policy framework here[2],
>>>>>>> I forgot to link it here, sorry about that, my mistake.
>>>>>>>
>>>>>>> [2]
>>>>>>>
>>>>>>> https://lore.kernel.org/linux-btrfs/1552989624-29577-1-git-send-email-anand.jain@oracle.com/
>>>>>>>
>>>>>>>
>>>>>>> Readmirror policy is for raid1, raid10 and future N way mirror.
>>>>>>> Yes for now its only for raid1.
>>>>>>>
>>>>>>> Here the idea is to create a framework so that readmirror policy
>>>>>>> can be configured as needed. And nonrotational can be one such
>>>>>>> policy.
>>>>>>>
>>>>>>> The example of hard-coded nonrotational policy does not work in case
>>>>>>> of ssd and a remote iscsi ssd, OR in case of local ssd and a NVME
>>>>>>> block
>>>>>>> device, as all these are still nonrotational devices. So hard-coded
>>>>>>> policy is not a good idea. If we have to hardcode then there is
>>>>>>> Q-depth
>>>>>>> based readmirror routing is better (patch in the ML), but that is
>>>>>>> not good enough, because some configs wants it based on the disk-LBA
>>>>>>> so that SAN storage target cache is balanced and not duplicated.
>>>>>>> So in short it must be a configurable policy.
>>>>>>>
>>>>>>
>>>>>> Again, if you are mixing disk types you likely always want
>>>>>> non-rotational, but
>>>>>> still mixing different speed devices in a mirror setup is just
>>>>>> asking for weird
>>>>>> latency problems. I don't think solving this use case is
>>>>>> necessary. If you mix
>>>>>> ssd + network device in a serious production setup then you
>>>>>> probably should be
>>>>>> fired cause you don't know what you are doing. Having the generic
>>>>>> "nonrotational gets priority" is going to cover 99% of the actual
>>>>>> use cases that
>>>>>> make sense.
>>>>>>
>>>>>> The SAN usecase I can sort of see, but again I don't feel like
>>>>>> it's a problem we
>>>>>> need to solve with on-disk format. Add a priority to sysfs so you
>>>>>> can change it
>>>>>> with udev or something on the fly. Thanks,
>>>>>>
>>>>>
>>>>> Ok.
>>>>> Sysfs is fine however we need configuration to be persistent across
>>>>> reboots.
>>>>> Any idea?
>>>>>
>>>>
>>>> Udev rules. Thanks,
>>>>
>>>
>>> Josef, configs moving along with the luns in san seems to be more
>>> easy to manage, otherwise when the host fails, each potential new
>>> server has to be pre-configured with the udev rules.
>>>
>>
>> It's 2019, if people haven't figured out how to do persistent
>> configuration by
>> now then all hope is lost. Facebook persistently configures millions of
>> machines, I'm sure people can figure out how to make sure a udev rule
>> ends up on
>> the right host connected to a SAN that doesn't move. Thanks,
>>
>> Josef
>>
>
> yeah. sysfs interface should be fine for now, and based on the
> usage feedback we could consider persistent storage for the readmirror.
>
> Typical to WORM (Write Once Read Many times) setups they mainly need
> good read performance as data is imported once from somewhere else.
> Just out of curiosity ran read performance checks using the readmirror
> property on ssd and nvme device.
>
> $ btrfs fi show
> Label: none uuid: e24bd8b5-a9e9-41d4-b28e-e95c0c545466
> Total devices 2 FS bytes used 391.64MiB
> devid 1 size 447.13GiB used 2.01GiB path /dev/sdf
> devid 2 size 5.82TiB used 2.01GiB path /dev/nvme0n1
>
> $ cat /sys/block/sdf/queue/rotational
> 0
> $ cat /sys/block/nvme0n1/queue/rotational
> 0
>
> $ dropcache; sleep 3; btrfs prop set /btrfs readmirror devid:2 && time
> md5sum /btrfs/fOH2
> 6129ccf74f7a761e0c3e096e051ba7a2 /btrfs/fOH2
>
> real 0m0.725s
> user 0m0.604s
> sys 0m0.113s
>
> $ dropcache; sleep 3; btrfs prop set /btrfs readmirror devid:1 && time
> md5sum /btrfs/fOH2
> 6129ccf74f7a761e0c3e096e051ba7a2 /btrfs/fOH2
>
> real 0m1.125s
> user 0m0.643s
> sys 0m0.087s
>
> nvme provides ~40% improvement for 400mb read io, however there are no
> other IOs. The question will be what kind of read io load balancer will
> be required at the full throttle of the preferred readmirror
> device IO Q depth [1]. Which means a heuristic is required to juggle
> around the policies and will be another readmirror policy to be part
> of the same framework when required.
>
> [1]
> $ cat /sys/block/nvme0n1/queue/nr_requests
> 1023
> $ cat /sys/block/sdf/queue/nr_requests
> 64
>
> Thanks, Anand
Josef,
Here I infer as follows .. I wonder if it makes sense.
- Potential readmirror policies are: pid, devid/manual, lba/cache,
Q-depth(patches are in the ml from facebook) each one of them has
their own advantage so a framework is essential.
- As the block device performance keeps evolving so its common to have
heterogeneous block devices on the host. readmirror policy helps
to configure for the best possible performance.
- sysfs is a good interface for readmirror policy as in ioctl we can't
pass all possible devices in the ioctl argument.
- ondisk storage is essential for the readmirror, and key-item pair
is the only way for its storage. (On the 2nd thought storage cannot
be implemented at a later point as sysfs interface must be consistent
in each version. So implementing together sysfs and ondisk key-item
pair makes sense to me.)
Thanks, Anand
prev parent reply other threads:[~2019-09-24 14:28 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-08-26 9:04 [PATCH RFC v2 0/2] readmirror feature Anand Jain
2019-08-26 9:04 ` [PATCH RFC v2] btrfs: add readmirror framework and policy devid Anand Jain
2019-08-26 9:04 ` [PATCH RFC v2] btrfs-progs: add readmirror property and ioctl to set get readmirror Anand Jain
2019-08-29 3:39 ` [PATCH RFC v2.1] btrfs-progs: add readmirror property and ioctl to set get Anand Jain
2019-09-11 16:37 ` [PATCH RFC v2 0/2] readmirror feature David Sterba
2019-09-12 7:48 ` Anand Jain
2019-09-11 18:42 ` Josef Bacik
2019-09-11 19:13 ` Eli V
2019-09-11 19:16 ` Josef Bacik
2019-09-12 7:41 ` Anand Jain
2019-09-12 9:50 ` Josef Bacik
2019-09-12 10:00 ` Anand Jain
2019-09-12 10:03 ` Josef Bacik
2019-09-12 10:10 ` Anand Jain
2019-09-12 10:13 ` Josef Bacik
2019-09-16 8:19 ` Anand Jain
2019-09-24 14:27 ` Anand Jain [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=159bfdff-35e6-3b32-3270-e56138ca01fe@oracle.com \
--to=anand.jain@oracle.com \
--cc=eliventer@gmail.com \
--cc=josef@toxicpanda.com \
--cc=linux-btrfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox