From: Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
To: Jens Axboe <axboe@kernel.dk>, Ming Lei <ming.lei@redhat.com>
Cc: linux-block@vger.kernel.org,
Florian-Ewald Mueller <florian-ewald.mueller@cloud.ionos.com>
Subject: Re: [PATCH RFC 4/5] block: add a statistic table for io latency
Date: Sun, 12 Jul 2020 23:04:02 +0200 [thread overview]
Message-ID: <bce66026-9bb2-b20a-ca29-07152c4b4336@cloud.ionos.com> (raw)
In-Reply-To: <8fa699e5-06ba-660d-4101-f4f5a3d439d2@kernel.dk>
On 7/12/20 10:44 PM, Jens Axboe wrote:
> On 7/12/20 2:39 PM, Guoqing Jiang wrote:
>> On 7/11/20 3:32 AM, Ming Lei wrote:
>>> On Fri, Jul 10, 2020 at 12:29:28PM +0200, Guoqing Jiang wrote:
>>>> On 7/10/20 12:00 PM, Ming Lei wrote:
>>>>> On Fri, Jul 10, 2020 at 10:55:24AM +0200, Guoqing Jiang wrote:
>>>>>> Hi Ming,
>>>>>>
>>>>>> On 7/10/20 2:53 AM, Ming Lei wrote:
>>>>>>> Hi Guoqing,
>>>>>>>
>>>>>>> On Thu, Jul 09, 2020 at 08:48:08PM +0200, Guoqing Jiang wrote:
>>>>>>>> Hi Ming,
>>>>>>>>
>>>>>>>> On 7/8/20 4:06 PM, Guoqing Jiang wrote:
>>>>>>>>> On 7/8/20 4:02 PM, Guoqing Jiang wrote:
>>>>>>>>>>> Hi Guoqing,
>>>>>>>>>>>
>>>>>>>>>>> I believe it isn't hard to write an ebpf-based script (bcc or
>>>>>>>>>>> bpftrace) to collect this kind of performance data, so it looks
>>>>>>>>>>> unnecessary to do it in kernel.
>>>>>>>>>> Hi Ming,
>>>>>>>>>>
>>>>>>>>>> Sorry, I don't know bcc or bpftrace well, but I assume they need
>>>>>>>>>> to read the latency value from somewhere inside the kernel. Could
>>>>>>>>>> you point out how I can get the latency value? Thanks in advance!
>>>>>>>>> Hmm, I suppose biolatency is suitable for tracking latency, will
>>>>>>>>> look into it.
>>>>>>>> I think biolatency can't trace data if it is not running,
>>>>>>> Yeah, the ebpf prog is only injected when the trace is started.
>>>>>>>
>>>>>>>> also it seems no place inside the kernel records such information
>>>>>>>> for ebpf to read; correct me if my understanding is wrong.
>>>>>>> Just record the info by starting the bcc script when you need it; is
>>>>>>> there anything wrong with this usage? Always doing such stuff in the
>>>>>>> kernel isn't fair to users who don't care about or need this info.
>>>>>> That is why we add a Kconfig option and set it to N by default. And I
>>>>>> suppose with a modern CPU, the cost of several more instructions would
>>>>>> not be that expensive even when the option is enabled, just my $0.02.
>>>>>>
>>>>>>>> And as a cloud provider, we would like to see the data when
>>>>>>>> necessary instead of collecting it by keeping a script running,
>>>>>>>> because that is more expensive than just reading a sysfs node IMHO.
>>>>>>> It shouldn't be expensive. It might be a bit slow to inject the ebpf
>>>>>>> prog because the code has to be verified, but once it is inside the
>>>>>>> kernel it should be efficient enough. The kernel-side prog only
>>>>>>> updates & stores the latency summary data in a bpf map, and the
>>>>>>> stored summary data can be read out anytime by userspace.
>>>>>>>
>>>>>>> Could you explain a bit why it is expensive, for example with biolatency?
>>>>>> I was comparing reading a sysfs node plus a few extra instructions in
>>>>>> the kernel against launching a dedicated monitoring process, which
>>>>>> needs more resources (memory) and context switches. And biolatency
>>>>>> calls bpf_ktime_get_ns to calculate the latency of each IO, which I
>>>>>> assume triggers ktime_get_ns in the end, and that is not cheap, as
>>>>>> you said.
>>>>> You can replace one read of the timestamp with rq->start_time_ns too,
>>>>> just like this patch does. You can write your own bcc/bpftrace script,
>>>>> which is quite easy to get started with. Once you learn its power,
>>>>> maybe you will love it.
>>>> Yes, I definitely need to learn more about it :-). But even with that
>>>> change, I still believe reading a node is cheaper than running a script.
>>>>
>>>> And it seems biolatency can't trace bio-based drivers, per the snippet
>>>> below, while collecting the data in-tree lets us trace all block drivers.
>>>>
>>>> # load BPF program
>>>> b = BPF(text=bpf_text)
>>>> if args.queued:
>>>>     b.attach_kprobe(event="blk_account_io_start", fn_name="trace_req_start")
>>>> else:
>>>>     b.attach_kprobe(event="blk_start_request", fn_name="trace_req_start")
>>>>     b.attach_kprobe(event="blk_mq_start_request", fn_name="trace_req_start")
>>>> b.attach_kprobe(event="blk_account_io_completion",
>>>>                 fn_name="trace_req_completion")
>>>>
>>>> Would it be possible to extend it to trace both requests and bios?
>>>> Otherwise we have to run another script to trace md raid.
>>> It is pretty easy to extend it to support bios: just add kprobes on
>>> submit_bio() and bio_endio().
>>>
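A rough user-space model of what such a bio-level tracer would keep: a map
from bio to submit timestamp, filled at submit_bio() and drained at
bio_endio(), feeding a log2 latency histogram. The class and method names
below are invented for illustration; in a real bcc script the maps live in
the BPF program and bpf_log2l() does the bucketing:

```python
class BioLatencyModel:
    """Illustrative model of a submit_bio/bio_endio latency tracer.

    start maps an in-flight bio id to its submit timestamp (ns);
    hist counts completions per log2(latency) bucket.
    """

    def __init__(self):
        self.start = {}   # bio id -> submit timestamp (ns)
        self.hist = {}    # log2 bucket -> completion count

    def submit_bio(self, bio_id, now_ns):
        # probe on submit: remember when this bio entered the block layer
        self.start[bio_id] = now_ns

    def bio_endio(self, bio_id, now_ns):
        # probe on completion: compute latency and bump the log2 bucket
        t0 = self.start.pop(bio_id, None)
        if t0 is None:
            return None   # completion with no recorded submit; ignore
        delta = now_ns - t0
        bucket = delta.bit_length()   # plays the role of bpf_log2l()
        self.hist[bucket] = self.hist.get(bucket, 0) + 1
        return delta
```

This also shows why a bio-level tracer needs its own map keyed by the bio:
unlike requests, a bio carries no start timestamp of its own to reuse.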
>> The thing is that we don't like the cost of the ebpf-based solution.
>> And FWIW, two years ago we compared an ebpf solution with recording
>> the data in the kernel:
>>
>> A. in-kernel monitor: 1~5% performance drop
>> B. ebpf monitor: 10~15% performance drop
>>
>> Note, we even copied each bio in approach A, which means the performance
>> could be even better now since we no longer clone the bio.
>>
>> And I think the major concern is the additional Kconfig option; since
>> Jens doesn't like it, I guess there is no need to make that change in
>> the upstream kernel.
> No, my main concern is trying to justify it with having a Kconfig option
> to turn it off. Fact is, distros will likely turn it on, and then
> everybody gets that overhead. There's a temptation to hide features like
> that behind a Kconfig option with this exact justification, and it just
> doesn't work like that in practice.
Good to know that, :-).
> I might be amenable to the change if:
>
> 1) it doesn't add any (real) overhead to the normal fast path
It is possible to remove ktime_get_ns from the current change, but I admit
we can't avoid a small overhead (several instructions per IO, I think),
and that is not fair to people who don't care about this data.
> 2) probably needs to be opt-in. not via kconfig, but as a sysfs
> attribute. Like we have 'iostats' today.
Thanks for the suggestion, will investigate a better way.
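The opt-in Jens describes amounts to a single flag check in the completion
path, like the existing 'iostats' queue attribute: disabled by default, so
an uninterested user pays only one branch per IO. A Python sketch; the
attribute name and structure here are invented for illustration:

```python
class QueueStats:
    """Runtime-toggled latency accounting, off by default.

    lat_stats stands in for a sysfs attribute such as
    /sys/block/<dev>/queue/iostats; nothing is collected until
    userspace flips it on.
    """

    def __init__(self):
        self.lat_stats = False   # sysfs-style opt-in toggle
        self.samples = []        # collected latencies (ns)

    def complete_io(self, start_ns, now_ns):
        if not self.lat_stats:   # normal fast path: a single branch
            return
        self.samples.append(now_ns - start_ns)
```

The design point is that the cost moves from "always, for everyone" to
"only while someone has explicitly turned the statistics on".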
> Something I've mentioned in the past has been a blktrace flight recorder
> mode, but that's largely been superseded by having bpf available. But
> the point is that something like blktrace generally doesn't add ANY
> overhead at all if blktrace isn't being run. Your solution ends up
> collecting stats all the time, regardless of whether or not anyone is
> actually looking at the data. That's a bad idea, and I'd be much happier
> with a solution that only adds overhead when actually used, not all the
> time.
I appreciate your input; it helps me understand your concern better.
Thanks,
Guoqing