From: Josef Bacik <josef@redhat.com>
To: Arne Jansen <sensille@gmx.net>
Cc: chris.mason@oracle.com, linux-btrfs@vger.kernel.org
Subject: Re: [PATCH v4 4/7] btrfs: initial readahead code and prototypes
Date: Thu, 30 Jun 2011 08:49:07 -0400 [thread overview]
Message-ID: <4E0C70C3.1000307@redhat.com> (raw)
In-Reply-To: <4E0C27A0.7040002@gmx.net>
On 06/30/2011 03:37 AM, Arne Jansen wrote:
> On 29.06.2011 23:49, Josef Bacik wrote:
>> On 06/29/2011 04:10 PM, Arne Jansen wrote:
>
>>
>>> + struct kref refcnt;
>>> + wait_queue_head_t wait;
>>> +};
>>> +
>>> +struct reada_extctl {
>>> + struct list_head list;
>>> + struct reada_control *rc;
>>> + u64 generation;
>>> +};
>>> +
>>
>> This is completely useless, kill this struct and just put the generation
>> and the list into reada_control.
>
> This struct is the link between reada_extent and reada_control.
> In case more than one readahead is running, each reada_extent
> can point to multiple reada_controls, so I need this link struct.
>
And this is where I get confused. Why would we need multiple
reada_controls for the same extent? Can't we just say "Oh hey, there's
already a reada_control outstanding for this extent, take a ref on the
control and wait for that"?
>>> +
>>> +/* call it with fs_info->reada_lock held */
>>> +static void reada_zone_put(struct reada_zone *zone)
>>> +{
>>> + if (!kref_put(&zone->refcnt, reada_kref_dummy))
>>> + return;
>>> +
>>> + radix_tree_delete(&zone->device->reada_zones,
>>> + zone->end >> PAGE_CACHE_SHIFT);
>>> +
>>
>> Instead of making the callers take the reada_lock, move it into this
>> function so that in the fast case we're not taking an extra spin_lock.
>
> I had to move this out, mainly because reada_start_machine_dev needs to
> hold the lock for several operations, one of which might call zone_put.
> Maybe I can defer the zone_put until afterwards.
> I'd like to wait with these kinds of optimizations until there are some
> uses of readahead in more time-critical paths, or at least until we have
> settled that there will be such uses ;)
>
>> Also if you are going to use the kref stuff you might as well use the
>> release function stuff.
>
> Right, in this case I can. I didn't do it, for symmetry with the other
> cases, but I've changed it now.
>
>>
>>> + kfree(zone);
>>> +
>>> + return;
>>> +}
>>> +
>>> +static void reada_control_put(struct reada_control *rc)
>>> +{
>>> + if (kref_put(&rc->refcnt, reada_kref_dummy)) {
>>> + kfree(rc);
>>> + return;
>>
>> Don't need the return here.
>
> I killed the whole function and built reada_control_release instead.
>
>>
>>> + }
>>> +}
>>> +
>>> +static int reada_add_block(struct reada_control *rc, u64 logical,
>>> + struct btrfs_key *top, int level, u64 generation)
>>> +{
>>> + struct btrfs_root *root = rc->root;
>>> + struct reada_extent *re;
>>> + struct reada_extctl *rec;
>>> +
>>> + re = reada_find_extent(root, logical, top, level); /* takes one ref */
>>> + if (!re)
>>> + return -1;
>>> +
>>> + rec = kzalloc(sizeof(*rec), GFP_NOFS);
>>> + if (!rec) {
>>> + reada_extent_put(root->fs_info, re);
>>> + return -1;
>>> + }
>>> +
>>> + rec->rc = rc;
>>> + rec->generation = generation;
>>> + spin_lock(&rc->lock);
>>> + ++rc->elems;
>>> + spin_unlock(&rc->lock);
>>> +
>>> + spin_lock(&re->lock);
>>> + list_add_tail(&rec->list, &re->extctl);
>>> + spin_unlock(&re->lock);
>>> +
>>> + /* leave the ref on the extent */
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +/*
>>> + * called with fs_info->reada_lock held
>>> + */
>>> +static void reada_peer_zones_set_lock(struct reada_zone *zone, int lock)
>>> +{
>>> + int i;
>>> + unsigned long index = zone->end >> PAGE_CACHE_SHIFT;
>>> +
>>> + for (i = 0; i < zone->ndevs; ++i) {
>>> + struct reada_zone *peer;
>>> + peer = radix_tree_lookup(&zone->devs[i]->reada_zones, index);
>>> + if (peer && peer->device != zone->device)
>>> + peer->locked = lock;
>>> + }
>>> +}
>>> +
>>> +/*
>>> + * called with fs_info->reada_lock held
>>> + */
>>> +static int reada_pick_zone(struct btrfs_device *dev)
>>> +{
>>> + struct reada_zone *top_zone = NULL;
>>> + struct reada_zone *top_locked_zone = NULL;
>>> + u64 top_elems = 0;
>>> + u64 top_locked_elems = 0;
>>> + unsigned long index = 0;
>>> + int ret;
>>> +
>>> + if (dev->reada_curr_zone) {
>>> + reada_peer_zones_set_lock(dev->reada_curr_zone, 0);
>>> + reada_zone_put(dev->reada_curr_zone);
>>> + dev->reada_curr_zone = NULL;
>>> + }
>>> + /* pick the zone with the most elements */
>>> + while (1) {
>>> + struct reada_zone *zone;
>>> +
>>> + ret = radix_tree_gang_lookup(&dev->reada_zones,
>>> + (void **)&zone, index, 1);
>>> + if (ret == 0)
>>> + break;
>>> + index = (zone->end >> PAGE_CACHE_SHIFT) + 1;
>>> + if (zone->locked) {
>>> + if (zone->elems > top_locked_elems) {
>>> + top_locked_elems = zone->elems;
>>> + top_locked_zone = zone;
>>> + }
>>> + } else {
>>> + if (zone->elems > top_elems) {
>>> + top_elems = zone->elems;
>>> + top_zone = zone;
>>> + }
>>> + }
>>> + }
>>> + if (top_zone)
>>> + dev->reada_curr_zone = top_zone;
>>> + else if (top_locked_zone)
>>> + dev->reada_curr_zone = top_locked_zone;
>>> + else
>>> + return 0;
>>> +
>>> + dev->reada_next = dev->reada_curr_zone->start;
>>> + kref_get(&dev->reada_curr_zone->refcnt);
>>> + reada_peer_zones_set_lock(dev->reada_curr_zone, 1);
>>> +
>>> + return 1;
>>> +}
>>> +
>>> +static int reada_start_machine_dev(struct btrfs_fs_info *fs_info,
>>> + struct btrfs_device *dev)
>>> +{
>>> + struct reada_extent *re = NULL;
>>> + int mirror_num = 0;
>>> + struct extent_buffer *eb = NULL;
>>> + u64 logical;
>>> + u32 blocksize;
>>> + int ret;
>>> + int i;
>>> + int need_kick = 0;
>>> +
>>> + spin_lock(&fs_info->reada_lock);
>>> + if (dev->reada_curr_zone == NULL) {
>>> + ret = reada_pick_zone(dev);
>>> + if (!ret) {
>>> + spin_unlock(&fs_info->reada_lock);
>>> + return 0;
>>> + }
>>> + }
>>> + /*
>>> + * FIXME currently we issue the reads one extent at a time. If we have
>>> + * a contiguous block of extents, we could also coagulate them or use
>>> + * plugging to speed things up
>>> + */
>>> + ret = radix_tree_gang_lookup(&dev->reada_extents, (void **)&re,
>>> + dev->reada_next >> PAGE_CACHE_SHIFT, 1);
>>> + if (ret == 0 || re->logical >= dev->reada_curr_zone->end) {
>>> + ret = reada_pick_zone(dev);
>>> + if (!ret) {
>>> + spin_unlock(&fs_info->reada_lock);
>>> + return 0;
>>> + }
>>> + re = NULL;
>>> + ret = radix_tree_gang_lookup(&dev->reada_extents, (void **)&re,
>>> + dev->reada_next >> PAGE_CACHE_SHIFT, 1);
>>> + }
>>> + if (ret == 0) {
>>> + spin_unlock(&fs_info->reada_lock);
>>> + return 0;
>>> + }
>>> + dev->reada_next = re->logical + re->blocksize;
>>> + kref_get(&re->refcnt);
>>> +
>>> + spin_unlock(&fs_info->reada_lock);
>>> +
>>> + /*
>>> + * find mirror num
>>> + */
>>> + for (i = 0; i < re->nzones; ++i) {
>>> + if (re->zones[i]->device == dev) {
>>> + mirror_num = i + 1;
>>> + break;
>>> + }
>>> + }
>>> + logical = re->logical;
>>> + blocksize = re->blocksize;
>>> +
>>> + spin_lock(&re->lock);
>>> + if (re->scheduled_for == NULL) {
>>> + re->scheduled_for = dev;
>>> + need_kick = 1;
>>> + }
>>> + spin_unlock(&re->lock);
>>> +
>>> + reada_extent_put(fs_info, re);
>>> +
>>> + if (!need_kick)
>>> + return 0;
>>> +
>>> + atomic_inc(&dev->reada_in_flight);
>>> + ret = reada_tree_block_flagged(fs_info->extent_root, logical, blocksize,
>>> + mirror_num, &eb);
>>> + if (eb) {
>>> + __readahead_hook(fs_info->extent_root, eb, eb->start, ret);
>>> + free_extent_buffer(eb);
>>> + }
>>> +
>>> + return 1;
>>> +
>>> +}
>>> +
>>> +static void reada_start_machine_worker(struct btrfs_work *work)
>>> +{
>>> + struct reada_machine_work *rmw;
>>> + struct btrfs_fs_info *fs_info;
>>> +
>>> + rmw = container_of(work, struct reada_machine_work, work);
>>> + fs_info = rmw->fs_info;
>>> +
>>> + kfree(rmw);
>>> +
>>> + __reada_start_machine(fs_info);
>>> +}
>>> +
>>> +static void __reada_start_machine(struct btrfs_fs_info *fs_info)
>>> +{
>>> + struct btrfs_device *device;
>>> + struct btrfs_fs_devices *fs_devices = fs_info->fs_devices;
>>> + u64 enqueued;
>>> + u64 total = 0;
>>> + int i;
>>> +
>>> + do {
>>> + enqueued = 0;
>>> + list_for_each_entry(device, &fs_devices->devices, dev_list) {
>>> + if (atomic_read(&device->reada_in_flight) <
>>> + MAX_IN_FLIGHT)
>>> + enqueued += reada_start_machine_dev(fs_info,
>>> + device);
>>> + }
>>> + total += enqueued;
>>> + } while (enqueued && total < 10000);
>>> +
>>
>> What is this? Are we doing this so that the worker stays alive and
>> can continue to process new requests coming in? If that's the case we
>> need a proper kthread that doesn't exit until we unmount or
>> something, not this weirdness.
>
> Hopefully the comment below explains what the intention is. Maybe I
> should move it up to answer the question before it arises :) A kthread
> is not enough, as I want parallelism here.
>
> I'll kill the FIXME as it is done already.
>
Yeah, the comment wasn't and still isn't clear to me. You are using the
workers here, which means we already have a thread per cpu running, so
we don't need to do anything to artificially keep one of them alive.
We just give them work to do and the worker threads will dispatch a
thread to do it; we have built-in parallelism, so this is just confusing
and unnecessary.
>>
>>> + if (enqueued == 0)
>>> + return;
>>> +
>>> + /*
>>> + * If everything is already in the cache, this is effectively single
>>> + * threaded. To a) not hold the caller for too long and b) to utilize
>>> + * more cores, we broke the loop above after 10000 iterations and now
>>> + * enqueue to workers to finish it. This will distribute the load to
>>> + * the cores.
>>> + * FIXME we might need our own workqueue here, with an idle threshold
>>> + * of one. Also these workers are relatively long-running.
>>> + */
>>> + for (i = 0; i < 2; ++i)
>>> + reada_start_machine(fs_info);
>>> +}
>>> +
>>> +static void reada_start_machine(struct btrfs_fs_info *fs_info)
>>> +{
>>> + struct reada_machine_work *rmw;
>>> +
>>> + rmw = kzalloc(sizeof(*rmw), GFP_NOFS);
>>> + if (!rmw) {
>>> + /* FIXME we cannot handle this properly right now */
>>> + BUG();
>>
>> Yes you can, everywhere that calls this can handle failures, so make
>> this return an int and have it return -ENOMEM if it fails.
>
> Just passing up the error isn't enough. We also need to signal the error
> to all waiters and clean up all data structures. Maybe it's easier to
> just keep a small cache of these structs, maybe one per CPU, so we can
> never fail here.
Well, this is just used to add the current readahead work, right? We can
fail here with no problem; it just means the person wanting to add the
readahead work failed. There is no reason to stop any of the other
workers that may already have work going on. Thanks,
Josef