public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Naman Jain <namjain@linux.microsoft.com>
To: Michael Kelley <mhklinux@outlook.com>,
	"K . Y . Srinivasan" <kys@microsoft.com>,
	Haiyang Zhang <haiyangz@microsoft.com>,
	Wei Liu <wei.liu@kernel.org>, Dexuan Cui <decui@microsoft.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	"x86@kernel.org" <x86@kernel.org>,
	"H . Peter Anvin" <hpa@zytor.com>,
	Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: "linux-hyperv@vger.kernel.org" <linux-hyperv@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"stable@vger.kernel.org" <stable@vger.kernel.org>
Subject: Re: [PATCH v2] clocksource: hyper-v: Fix hv tsc page based sched_clock for hibernation
Date: Thu, 12 Sep 2024 21:07:16 +0530	[thread overview]
Message-ID: <d0b42e38-c956-4140-926f-64fad527fbd1@linux.microsoft.com> (raw)
In-Reply-To: <SN6PR02MB4157967915492AAF473CB682D4642@SN6PR02MB4157.namprd02.prod.outlook.com>



On 9/12/2024 8:51 PM, Michael Kelley wrote:
> From: Naman Jain <namjain@linux.microsoft.com> Sent: Wednesday, September 11, 2024 9:51 PM
>>
>> On 9/12/2024 9:09 AM, Michael Kelley wrote:
>>> From: Naman Jain <namjain@linux.microsoft.com> Sent: Tuesday, September 10,
>> 2024 9:57 PM
>>>>
>>>
>>> This version of the patch looks good to me from the standpoint of
>>> separating the x86 specific functionality from the arch independent
>>> functionality. And I think the patch works as intended. But there
>>> are parts of the description and variable naming that don't align
>>> with my understanding of the problem and the fix. So I've added
>>> some additional comments below.
>>>
>>> Nit: Now that most of the code changes are in mshyperv.c, the
>>> patch Subject: prefix should perhaps be "x86/hyperv:" instead
>>> of "clocksource: hyperv:".
>>
>> Thanks a lot for reviewing Michael. As you rightly pointed out, these
>> comments and variable names made more sense when they were in
>> hyperv_timer.c. I'll change them accordingly in next patch.
>>
>> Will change commit msg subject as well.
>>
>>>
>>>> read_hv_sched_clock_tsc() assumes that the Hyper-V clock counter is
>>>> bigger than the variable hv_sched_clock_offset, which is cached during
>>>> early boot, but depending on the timing this assumption may be false
>>>> when a hibernated VM starts again (the clock counter starts from 0
>>>> again) and is resuming back (Note: hv_init_tsc_clocksource() is not
>>>> called during hibernation/resume); consequently,
>>>> read_hv_sched_clock_tsc() may return a negative integer (which is
>>>> interpreted as a huge positive integer since the return type is u64)
>>>> and new kernel messages are prefixed with huge timestamps before
>>>> read_hv_sched_clock_tsc() grows big enough (which typically takes
>>>> several seconds).
>>>
>>> Just so I'm clear on the sequence, when a new VM is created to
>>> resume the hibernated VM, I think the following happens:
>>>
>>> 1) The VM being used to resume the hibernation image boots a
>>> fresh instance of the Linux kernel. The sched clock and sched clock
>>> offset value are initialized as with any kernel, and kernel messages
>>> are printed with the correct timestamps starting at zero.
>>>
>>> 2) The new Linux kernel then loads the hibernation image and
>>> transfers control to it, whereupon the "resume" callbacks are run
>>> in the context of the hibernation image.  At this point, any kernel
>>> timestamps are wrong, and might even be negative, because the
>>> sched clock value is calculated based on the new Hyper-V reference
>>> time (which started again at zero) minus the old sched clock offset.
>>> The goal is that the sched clock value should be continuous with
>>> the sched clock value from the original VM. If the original VM
>>> had been running for 1000 seconds when the hibernation was
>>> done, the sched clock value in the resumed hibernation image
>>> should continue, starting at ~1000 seconds.
>>>
>>> 3) The fix is to adjust the sched clock offset in the resumed
>>> hibernation image, and make it more negative by that ~1000
>>> seconds.
>>>
>>> Is that all correct?  If so, then it seems like this patch is doing
>>> more than just cleaning up the negative values for sched clock.
>>> It's also making the sched clock values continuous with the
>>> sched clock values in the original VM rather than restarting
>>> near zero after hibernation image is resumed.
>>>
>>
>> Yes, that's exactly what this patch is trying to do. There was an option
>> to correct in suspend-resume callbacks of original VM in hyperv_timer.c,
>> but these are executed very late, and we still end up getting many logs
>> with these incorrect timestamps. We took reference from the code where
>> tsc clock correction takes place, and thought that similar should be
>> done here.
> 
> Yes, agreed. I'm glad the mechanism for the TSC clock correction
> is available to use. :-)

Yes.

"arch/x86/kernel/tsc.c"

void tsc_save_sched_clock_state(void)

/*
  * Even on processors with invariant TSC, TSC gets reset in some the
  * ACPI system sleep states. And in some systems BIOS seem to reinit TSC to
  * arbitrary value (still sync'd across cpu's) during resume from such 
sleep
  * states. To cope up with this, recompute the cyc2ns_offset for each 
cpu so
  * that sched_clock() continues from the point where it was left off during
  * suspend.
  */
void tsc_restore_sched_clock_state(void)


called from "arch/x86/power/cpu.c" in
__restore_processor_state() -> x86_platform.restore_sched_clock_state();

> 
>>
>> We can tweak the commit msg to add this additional detail.
> 
> Thanks. I think the change to make the sched clock time (and
> therefore the dmesg log timestamps) continuous with the
> original VM is important to call out.  It's a change that will be
> visible to users.
> 
> [snip]
> 
>>>> +/*
>>>> + * Hyper-V clock counter resets during hibernation. Save and restore clock
>>>> + * offset during suspend/resume, while also considering the time passed
>>>> + * before suspend. This is to make sure that sched_clock using hv tsc page
>>>> + * based clocksource, proceeds from where it left off during suspend and
>>>> + * it shows correct time for the timestamps of kernel messages after resume.
>>>> + */
>>
>> I added it here, but the same should be added in commit msg as well.
>> I'll add it.
> 
> Ah, indeed, you did add it in the comment.  But yes, it should go in
> the commit message as well.
> 
>>
>>>> +static void save_hv_clock_tsc_state(void)
>>>> +{
>>>> +	hv_sched_clock_offset_saved = hv_read_reference_counter();
>>>
>>> Naming this variable hv_sched_clock_offset_saved doesn't seem to match
>>> what it actually contains. The saved value is not a sched_clock_offset. It's
>>> the value of the Hyper-V reference counter at the time the original VM
>>> hibernates does "suspend".  The sched_clock_offset in the original VM will
>>> typically be a pretty small value (a few seconds or even less). But the
>>> Hyper-V reference counter value might be thousands of seconds if the
>>> VM has been running a while before it hibernates.
>>
>> I'll change it to something that conveys the right information. Thanks
>> for the suggestion.
>>
>>>
>>>> +}
>>>> +
>>>> +static void restore_hv_clock_tsc_state(void)
>>>> +{
>>>> +	/*
>>>> +	 * hv_sched_clock_offset = offset that is used by hyperv_timer clocksource driver
>>>> +	 *                         to get time.
>>>> +	 * Time passed before suspend = hv_sched_clock_offset_saved
>>>> +	 *                            - hv_sched_clock_offset (old)
>>>> +	 *
>>>> +	 * After Hyper-V clock counter resets, hv_sched_clock_offset needs a correction.
>>>> +	 *
>>>> +	 * New time = hv_read_reference_counter() (future) - hv_sched_clock_offset (new)
>>>> +	 * New time = Time passed before suspend + hv_read_reference_counter() (future)
>>>> +	 *                                       - hv_read_reference_counter() (now)
>>>> +	 *
>>>> +	 * Solving the above two equations gives:
>>>> +	 *
>>>> +	 * hv_sched_clock_offset (new) = hv_sched_clock_offset (old)
>>>> +	 *                             - hv_sched_clock_offset_saved
>>>> +	 *                             + hv_read_reference_counter() (now))
>>>> +	 */
>>>> +	hv_adj_sched_clock_offset(hv_sched_clock_offset_saved - hv_read_reference_counter());
>>>
>>> The argument passed to hv_adj_sched_clock_offset() makes sense to me if I think
>>> of it as:
>>>
>>> 	hv_ref_time_at_hibernate - hv_read_reference_counter()
>>>
>>> where hv_read_reference_counter() is just "ref time now".
>>>
>>> I think of it like this: The Hyper-V reference counter value changed underneath
>>> the resumed hibernation image when it starts running in the new VM. The adjustment
>>> changes the sched clock offset to compensate for that change so that sched clock
>>> values are continuous across the suspend/resume hibernation sequence.
>>>
>>> I don't completely understand what you've explained with the two equations and
>>> solving them, though the result matches my expectations.
>>
>> Yeah :) it made more sense when we look at it from hyperv_timer.c driver
>> POV because these offsets are nothing but reference counters at
>> different points of time.
> 
> Well, yes and no. The value of the Hyper-V reference counter is
> an absolute value at the time it is read. But the hv_sched_clock_offset
> is a "delta" value -- the difference between two absolute time values.
> While hv_sched_clock_offset is initially set to the current value of the
> Hyper-V reference counter, it is used as a delta value. And after
> the adjustment is applied when resuming from hibernation,
> hv_sched_clock_offset is definitely no longer a reference counter
> value from some point in time -- it's clearly only a delta value.
> 

Details matter :) Thanks for your feedback. I'll wait for a few days
before posting v3, in case there are some more review comments.


>> Having said that, I think we can go with a
>> comment explaining the intention, and skip adding these equations which
>> may be confusing here as there is no concept of offsets here, as you
>> rightly pointed out in your previous reply as well.
> 
> Works for me.
> 
> Thanks,
> 
> Michael

Regards,
Naman

      reply	other threads:[~2024-09-12 15:37 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-09-11  4:56 [PATCH v2] clocksource: hyper-v: Fix hv tsc page based sched_clock for hibernation Naman Jain
2024-09-12  3:39 ` Michael Kelley
2024-09-12  4:50   ` Naman Jain
2024-09-12 15:21     ` Michael Kelley
2024-09-12 15:37       ` Naman Jain [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=d0b42e38-c956-4140-926f-64fad527fbd1@linux.microsoft.com \
    --to=namjain@linux.microsoft.com \
    --cc=bp@alien8.de \
    --cc=daniel.lezcano@linaro.org \
    --cc=dave.hansen@linux.intel.com \
    --cc=decui@microsoft.com \
    --cc=haiyangz@microsoft.com \
    --cc=hpa@zytor.com \
    --cc=kys@microsoft.com \
    --cc=linux-hyperv@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mhklinux@outlook.com \
    --cc=mingo@redhat.com \
    --cc=stable@vger.kernel.org \
    --cc=tglx@linutronix.de \
    --cc=wei.liu@kernel.org \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox