[PATCH v8 01/11] timekeeping: move multigrain timestamp floor handling into timekeeper

Jeff Layton posted 11 patches 2 months, 2 weeks ago
There is a newer version of this series
[PATCH v8 01/11] timekeeping: move multigrain timestamp floor handling into timekeeper
Posted by Jeff Layton 2 months, 2 weeks ago
For multigrain timestamps, we must keep track of the latest timestamp
that has ever been handed out, and never hand out a coarse time below
that value.

Add a static singleton atomic64_t into timekeeper.c that we can use to
keep track of the latest fine-grained time ever handed out. This is
tracked as a monotonic ktime_t value to ensure that it isn't affected by
clock jumps.

Add two new public interfaces:

- ktime_get_coarse_real_ts64_mg() fills a timespec64 with the later of the
  coarse-grained clock and the floor time

- ktime_get_real_ts64_mg() gets the fine-grained clock value, and tries
  to swap it into the floor. A timespec64 is filled with the result.

Since the floor is global, we take great pains to avoid updating it
unless it's absolutely necessary. If we do the cmpxchg and find that the
value has been updated since we fetched it, then we discard the
fine-grained time that was fetched in favor of the recent update.

To maximize the window of this occurring when multiple tasks are racing
to update the floor, ktime_get_coarse_real_ts64_mg returns a cookie
value that represents the state of the floor tracking word, and
ktime_get_real_ts64_mg accepts a cookie value that it uses as the "old"
value when calling cmpxchg().

Signed-off-by: Jeff Layton <jlayton@kernel.org>
---
 include/linux/timekeeping.h |  4 +++
 kernel/time/timekeeping.c   | 82 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 86 insertions(+)

diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
index fc12a9ba2c88..7aa85246c183 100644
--- a/include/linux/timekeeping.h
+++ b/include/linux/timekeeping.h
@@ -45,6 +45,10 @@ extern void ktime_get_real_ts64(struct timespec64 *tv);
 extern void ktime_get_coarse_ts64(struct timespec64 *ts);
 extern void ktime_get_coarse_real_ts64(struct timespec64 *ts);
 
+/* Multigrain timestamp interfaces */
+extern void ktime_get_coarse_real_ts64_mg(struct timespec64 *ts);
+extern void ktime_get_real_ts64_mg(struct timespec64 *ts);
+
 void getboottime64(struct timespec64 *ts);
 
 /*
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 5391e4167d60..16937242b904 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -114,6 +114,13 @@ static struct tk_fast tk_fast_raw  ____cacheline_aligned = {
 	.base[1] = FAST_TK_INIT,
 };
 
+/*
+ * This represents the latest fine-grained time that we have handed out as a
+ * timestamp on the system. Tracked as a monotonic ktime_t, and converted to the
+ * realtime clock on an as-needed basis.
+ */
+static __cacheline_aligned_in_smp atomic64_t mg_floor;
+
 static inline void tk_normalize_xtime(struct timekeeper *tk)
 {
 	while (tk->tkr_mono.xtime_nsec >= ((u64)NSEC_PER_SEC << tk->tkr_mono.shift)) {
@@ -2394,6 +2401,81 @@ void ktime_get_coarse_real_ts64(struct timespec64 *ts)
 }
 EXPORT_SYMBOL(ktime_get_coarse_real_ts64);
 
+/**
+ * ktime_get_coarse_real_ts64_mg - get later of coarse grained time or floor
+ * @ts: timespec64 to be filled
+ *
+ * Adjust floor to realtime and compare it to the coarse time. Fill
+ * @ts with the latest one. Note that this is a filesystem-specific
+ * interface and should be avoided outside of that context.
+ */
+void ktime_get_coarse_real_ts64_mg(struct timespec64 *ts)
+{
+	struct timekeeper *tk = &tk_core.timekeeper;
+	u64 floor = atomic64_read(&mg_floor);
+	ktime_t f_real, offset, coarse;
+	unsigned int seq;
+
+	WARN_ON(timekeeping_suspended);
+
+	do {
+		seq = read_seqcount_begin(&tk_core.seq);
+		*ts = tk_xtime(tk);
+		offset = *offsets[TK_OFFS_REAL];
+	} while (read_seqcount_retry(&tk_core.seq, seq));
+
+	coarse = timespec64_to_ktime(*ts);
+	f_real = ktime_add(floor, offset);
+	if (ktime_after(f_real, coarse))
+		*ts = ktime_to_timespec64(f_real);
+}
+EXPORT_SYMBOL_GPL(ktime_get_coarse_real_ts64_mg);
+
+/**
+ * ktime_get_real_ts64_mg - attempt to update floor value and return result
+ * @ts:		pointer to the timespec to be set
+ *
+ * Get a current monotonic fine-grained time value and attempt to swap
+ * it into the floor. @ts will be filled with the resulting floor value,
+ * regardless of the outcome of the swap. Note that this is a filesystem
+ * specific interface and should be avoided outside of that context.
+ */
+void ktime_get_real_ts64_mg(struct timespec64 *ts, u64 cookie)
+{
+	struct timekeeper *tk = &tk_core.timekeeper;
+	ktime_t old = atomic64_read(&mg_floor);
+	ktime_t offset, mono;
+	unsigned int seq;
+	u64 nsecs;
+
+	WARN_ON(timekeeping_suspended);
+
+	do {
+		seq = read_seqcount_begin(&tk_core.seq);
+
+		ts->tv_sec = tk->xtime_sec;
+		mono = tk->tkr_mono.base;
+		nsecs = timekeeping_get_ns(&tk->tkr_mono);
+		offset = *offsets[TK_OFFS_REAL];
+	} while (read_seqcount_retry(&tk_core.seq, seq));
+
+	mono = ktime_add_ns(mono, nsecs);
+
+	if (atomic64_try_cmpxchg(&mg_floor, &old, mono)) {
+		ts->tv_nsec = 0;
+		timespec64_add_ns(ts, nsecs);
+	} else {
+		/*
+		 * Something has changed mg_floor since "old" was
+		 * fetched. "old" has now been updated with the
+		 * current value of mg_floor, so use that to return
+		 * the current coarse floor value.
+		 */
+		*ts = ktime_to_timespec64(ktime_add(old, offset));
+	}
+}
+EXPORT_SYMBOL_GPL(ktime_get_real_ts64_mg);
+
 void ktime_get_coarse_ts64(struct timespec64 *ts)
 {
 	struct timekeeper *tk = &tk_core.timekeeper;

-- 
2.46.0
Re: [PATCH v8 01/11] timekeeping: move multigrain timestamp floor handling into timekeeper
Posted by Thomas Gleixner 2 months ago
On Sat, Sep 14 2024 at 13:07, Jeff Layton wrote:

> For multigrain timestamps, we must keep track of the latest timestamp
> that has ever been handed out, and never hand out a coarse time below
> that value.

How is that correct when the clock is off by an hour and then set back
to the correct value? Then you'd get the same stale timestamp for an
hour unless something invokes ktime_get_real_ts64_mg() which will set
the "latest" timestamp back to a time before the previous one.

> Add a static singleton atomic64_t into timekeeper.c that we can use to
> keep track of the latest fine-grained time ever handed out. This is
> tracked as a monotonic ktime_t value to ensure that it isn't affected by
> clock jumps.

That's just wishful thinking.

ktime_get_real_ts64_mg(ts)
   ts = Tmono_1 + offset_1;   // TReal_1
   floor = Tmono_1;

                                // newtime < TReal_1                                
                                clock_settime(REALTIME, newtime);
                                   xtime = newtime; // TReal_2
                                   offset_2 = offset_1 + Treal_2 - TReal(now);
                                   --> offset_2 < offset_1

ktime_get_coarse_real_ts64_mg(ts)
    ts = tk_xtime();       // TReal_2
    offs = offset_2;

    if (Tmono_1 + offset_2 > ts)
       ts = Tmono_1 + offset_2; // Not taken

So this returns T_Real_2 because

    offset_2 < offset_1

and therefore

    Tmono_1 + offset_2 < TReal_2

so the returned time will jump backwards vs. TReal_1 as it should
because that's the actual time, no?

So if that's the intended behaviour then the changelog is misleading at
best.

If the intention is to never return a value < TReal_1 then this does not
work. You can make it work by using the Realtime timestamp as floor, but
that'd be more than questionable vs. clock_settime() making the clock go
backwards.

Thanks,

        tglx
Re: [PATCH v8 01/11] timekeeping: move multigrain timestamp floor handling into timekeeper
Posted by Jeff Layton 2 months ago
On Mon, 2024-09-30 at 21:13 +0200, Thomas Gleixner wrote:
> On Sat, Sep 14 2024 at 13:07, Jeff Layton wrote:
> 
> > For multigrain timestamps, we must keep track of the latest timestamp
> > that has ever been handed out, and never hand out a coarse time below
> > that value.
> 
> How is that correct when the clock is off by an hour and then set back
> to the correct value? Then you'd get the same stale timestamp for an
> hour unless something invokes ktime_get_real_ts64_mg() which will set
> the "latest" timestamp back to a time before the previous one.
>
> > Add a static singleton atomic64_t into timekeeper.c that we can use to
> > keep track of the latest fine-grained time ever handed out. This is
> > tracked as a monotonic ktime_t value to ensure that it isn't affected by
> > clock jumps.
> 
> That's just wishful thinking.
> 
> ktime_get_real_ts64_mg(ts)
>    ts = Tmono_1 + offset_1;   // TReal_1
>    floor = Tmono_1;
> 
>                                 // newtime < TReal_1                                
>                                 clock_settime(REALTIME, newtime);
>                                    xtime = newtime; // TReal_2
>                                    offset_2 = offset_1 + Treal_2 - TReal(now);
>                                    --> offset_2 < offset_1
> 
> ktime_get_coarse_real_ts64_mg(ts)
>     ts = tk_xtime();       // TReal_2
>     offs = offset_2;
> 
>     if (Tmono_1 + offset_2 > ts)
>        ts = Tmono_1 + offset_2; // Not taken
> 
> So this returns T_Real_2 because
> 
>     offset_2 < offset_1
> 
> and therefore
> 
>     Tmono_1 + offset_2 < TReal_2
> 
> so the returned time will jump backwards vs. TReal_1 as it should
> because that's the actual time, no?
> 
> So if that's the intended behaviour then the changelog is misleading at
> best.
> 
> If the intention is to never return a value < TReal_1 then this does not
> work. You can make it work by using the Realtime timestamp as floor, but
> that'd be more than questionable vs. clock_settime() making the clock go
> backwards.
> 

That is the intended behavior and I'll plan to fix the changelog to
clarify this point:

If someone jumps the realtime clock backward by a large value, then the
realtime timestamp _can_ appear to go backward. This is a problem today
even without this patchset.

If two files get stamped and a realtime clock jump backward happens in
between them, all bets are off as to which one will appear to have been
modified first. I don't think that is something we can reasonably
prevent, since we must stamp files according to the realtime clock.

The main thing I'm trying to prevent is the timestamps being misordered
in the absence of such a clock jump. Without tracking the floor as I am
here, that's a possibility.
-- 
Jeff Layton <jlayton@kernel.org>
Re: [PATCH v8 01/11] timekeeping: move multigrain timestamp floor handling into timekeeper
Posted by Thomas Gleixner 2 months ago
On Mon, Sep 30 2024 at 15:27, Jeff Layton wrote:
> On Mon, 2024-09-30 at 21:13 +0200, Thomas Gleixner wrote:
>> So if that's the intended behaviour then the changelog is misleading at
>> best.
>
> That is the intended behavior and I'll plan to fix the changelog to
> clarify this point:
>
> If someone jumps the realtime clock backward by a large value, then the
> realtime timestamp _can_ appear to go backward. This is a problem today
> even without this patchset.

Correct.

> If two files get stamped and a realtime clock jump backward happens in
> between them, all bets are off as to which one will appear to have been
> modified first. I don't think that is something we can reasonably
> prevent, since we must stamp files according to the realtime clock.

True. I just was utterly confused about the changelog.

> The main thing I'm trying to prevent is the timestamps being misordered
> in the absence of such a clock jump. Without tracking the floor as I am
> here, that's a possibility.

Correct.
Re: [PATCH v8 01/11] timekeeping: move multigrain timestamp floor handling into timekeeper
Posted by Thomas Gleixner 2 months, 2 weeks ago
On Sat, Sep 14 2024 at 13:07, Jeff Layton wrote:
> For multigrain timestamps, we must keep track of the latest timestamp

What is a multgrain timestamp? Can you please describe the concept
behind it? I'm not going to chase random documentation or whatever
because change logs have to self contained.

And again 'we' do nothing. Describe the problem in technical terms and
do not impersonate code.

> To maximize the window of this occurring when multiple tasks are racing
> to update the floor, ktime_get_coarse_real_ts64_mg returns a cookie
> value that represents the state of the floor tracking word, and
> ktime_get_real_ts64_mg accepts a cookie value that it uses as the "old"
> value when calling cmpxchg().

Clearly:

> +void ktime_get_real_ts64_mg(struct timespec64 *ts, u64 cookie)

Can you please get your act together?

> +/**
> + * ktime_get_coarse_real_ts64_mg - get later of coarse grained time or floor
> + * @ts: timespec64 to be filled
> + *
> + * Adjust floor to realtime and compare it to the coarse time. Fill
> + * @ts with the latest one.

This explains nothing.

>      Note that this is a filesystem-specific
> + * interface and should be avoided outside of that context.
> + */
> +void ktime_get_coarse_real_ts64_mg(struct timespec64 *ts)
> +{
> +	struct timekeeper *tk = &tk_core.timekeeper;
> +	u64 floor = atomic64_read(&mg_floor);
> +	ktime_t f_real, offset, coarse;
> +	unsigned int seq;
> +
> +	WARN_ON(timekeeping_suspended);
> +
> +	do {
> +		seq = read_seqcount_begin(&tk_core.seq);
> +		*ts = tk_xtime(tk);
> +		offset = *offsets[TK_OFFS_REAL];

Why this indirection? What's wrong with using
tk_core.timekeeper.offs_real directly?

> +	} while (read_seqcount_retry(&tk_core.seq, seq));
> +
> +	coarse = timespec64_to_ktime(*ts);
> +	f_real = ktime_add(floor, offset);

How is any of this synchronized against concurrent updates of the floor
value or the offset? I'm failing to see anything which keeps this
consistent. If this is magically consistent then it wants a big fat
comment in the code which explains it.

> +void ktime_get_real_ts64_mg(struct timespec64 *ts, u64 cookie)

What is this cookie argument for and how does that match the
declaration?

> +extern void ktime_get_real_ts64_mg(struct timespec64 *ts);

This does not even build.

> +{
> +	struct timekeeper *tk = &tk_core.timekeeper;
> +	ktime_t old = atomic64_read(&mg_floor);
> +	ktime_t offset, mono;
> +	unsigned int seq;
> +	u64 nsecs;
> +
> +	WARN_ON(timekeeping_suspended);

WARN_ON_ONCE() if at all.

> +	do {
> +		seq = read_seqcount_begin(&tk_core.seq);
> +
> +		ts->tv_sec = tk->xtime_sec;
> +		mono = tk->tkr_mono.base;
> +		nsecs = timekeeping_get_ns(&tk->tkr_mono);
> +		offset = *offsets[TK_OFFS_REAL];
> +	} while (read_seqcount_retry(&tk_core.seq, seq));
> +
> +	mono = ktime_add_ns(mono, nsecs);
> +
> +	if (atomic64_try_cmpxchg(&mg_floor, &old, mono)) {
> +		ts->tv_nsec = 0;
> +		timespec64_add_ns(ts, nsecs);
> +	} else {
> +		/*
> +		 * Something has changed mg_floor since "old" was
> +		 * fetched. "old" has now been updated with the
> +		 * current value of mg_floor, so use that to return
> +		 * the current coarse floor value.

'Something has changed' is a truly understandable technical
explanation.

I'm not going to accept this voodoo which makes everyone scratch his
head who wasn't involved in this.

Thanks,

        tglx
Re: [PATCH v8 01/11] timekeeping: move multigrain timestamp floor handling into timekeeper
Posted by Jeff Layton 2 months, 1 week ago
On Mon, 2024-09-16 at 12:12 +0200, Thomas Gleixner wrote:
> On Sat, Sep 14 2024 at 13:07, Jeff Layton wrote:
> > For multigrain timestamps, we must keep track of the latest timestamp
> 
> What is a multgrain timestamp? Can you please describe the concept
> behind it? I'm not going to chase random documentation or whatever
> because change logs have to self contained.
> 
> And again 'we' do nothing. Describe the problem in technical terms and
> do not impersonate code.
> 

Hi Thomas!

Sorry for the delay in responding. I'll try to summarize below, but
I'll also note that patch #7 in the v8 series adds a file to
Documentation/ that explains this in a bit more depth:

Currently the kernel always stamps files (mtime, ctime, etc.) using the
coarse-grained clock. This is usually a good thing, since it reduces
the number of metadata updates, but means that you can't reliably use
file timestamps to detect whether there have been changes to the file
since it was last checked. This is particularly a problem for NFSv3
clients, which use timestamps to know when to invalidate their
pagecache for an inode [1].

The idea is to allow the kernel to use fine-grained timestamps (mtime
and ctime) on files when they are under direct observation. When a task
does a ->getattr against an inode for STATX_MTIME or STATX_CTIME, a
flag is set in the inode that tells the kernel to use the fine-grained
clock for the timestamp update iff the current coarse-grained clock
value would not cause a change to the mtime/ctime.

This works, but there is a problem:

It's possible for one inode to get a fine-grained timestamp, and then
another to get a coarse-grained timestamp. If this happens within a
single coarse-grained timer tick, then the files may appear to have
been modified in reverse order, which breaks POSIX guarantees (and
obscure programs like "make").

The fix for this is to establish a floor value for the coarse-grained
clock. When stamping a file with a fine-grained timestamp, we update
the floor value with the current monotonic time (using cmpxchg). Then
later, when a coarse-grained timestamp is requested, check whether the
floor is later than the current coarse-grained time. If it is, then the
kernel will return the floor value (converted to realtime) instead of
the current coarse-grained clock. That allows us to maintain the
ordering guarantees.

My original implementation of this tracked the floor value in
fs/inode.c (also using cmpxchg), but that caused a performance
regression, mostly due to multiple calls into the timekeeper functions
with seqcount loops. By adding the floor to the timekeeper we can get
that back down to 1 seqcount loop.

Let me know if you have more questions about this, or suggestions about
how to do this better. The timekeeping code is not my area of expertise
(obviously) so I'm open to doing this a better way if there is one.

Thanks for the review so far!

[1]: NFSv4 mandates an opaque change attribute (usually using
inode->i_version), but only some filesystems have a proper
implementation of it (XFS being the notable exception). For the others,
we end up using the ctime to generate a change attribute, which means
that NFSv4 has the same problem on those filesystems. i_version also
doesn't help NFSv3 clients and servers.
-- 
Jeff Layton <jlayton@kernel.org>
Re: [PATCH v8 01/11] timekeeping: move multigrain timestamp floor handling into timekeeper
Posted by Thomas Gleixner 2 months ago
On Thu, Sep 19 2024 at 18:50, Jeff Layton wrote:
> The fix for this is to establish a floor value for the coarse-grained
> clock. When stamping a file with a fine-grained timestamp, we update
> the floor value with the current monotonic time (using cmpxchg). Then
> later, when a coarse-grained timestamp is requested, check whether the
> floor is later than the current coarse-grained time. If it is, then the
> kernel will return the floor value (converted to realtime) instead of
> the current coarse-grained clock. That allows us to maintain the
> ordering guarantees.
>
> My original implementation of this tracked the floor value in
> fs/inode.c (also using cmpxchg), but that caused a performance
> regression, mostly due to multiple calls into the timekeeper functions
> with seqcount loops. By adding the floor to the timekeeper we can get
> that back down to 1 seqcount loop.
>
> Let me know if you have more questions about this, or suggestions about
> how to do this better. The timekeeping code is not my area of expertise
> (obviously) so I'm open to doing this a better way if there is one.

The comments I made about races and the clock_settime() inconsistency
vs. the change log aside, I don't see room for improvement there.

What worries me is the atomic_cmpxchg() under contention on large
machines, but as it is not a cmpxchg() loop it might be not completely
horrible.

Thanks,

        tglx
Re: [PATCH v8 01/11] timekeeping: move multigrain timestamp floor handling into timekeeper
Posted by Jeff Layton 2 months ago
On Mon, 2024-09-30 at 21:43 +0200, Thomas Gleixner wrote:
> On Thu, Sep 19 2024 at 18:50, Jeff Layton wrote:
> > The fix for this is to establish a floor value for the coarse-grained
> > clock. When stamping a file with a fine-grained timestamp, we update
> > the floor value with the current monotonic time (using cmpxchg). Then
> > later, when a coarse-grained timestamp is requested, check whether the
> > floor is later than the current coarse-grained time. If it is, then the
> > kernel will return the floor value (converted to realtime) instead of
> > the current coarse-grained clock. That allows us to maintain the
> > ordering guarantees.
> > 
> > My original implementation of this tracked the floor value in
> > fs/inode.c (also using cmpxchg), but that caused a performance
> > regression, mostly due to multiple calls into the timekeeper functions
> > with seqcount loops. By adding the floor to the timekeeper we can get
> > that back down to 1 seqcount loop.
> > 
> > Let me know if you have more questions about this, or suggestions about
> > how to do this better. The timekeeping code is not my area of expertise
> > (obviously) so I'm open to doing this a better way if there is one.
> 
> The comments I made about races and the clock_settime() inconsistency
> vs. the change log aside, I don't see room for improvement there.
> 
> What worries me is the atomic_cmpxchg() under contention on large
> machines, but as it is not a cmpxchg() loop it might be not completely
> horrible.
> 


Thanks for taking a look.

The code does try to avoid requesting a fine-grained timestamp unless
there is no other option. In practice, that means that a
write()+stat()+write() has to occur within the same coarse-grained
timer tick. That obviously does happen, but I'm hoping it's not very
frequent on real-world workloads.

The main place where we see a performance hit from this set is the
will-it-scale pipe test, which does 1-byte writes and reads on a pipe
as fast as possible.

The previous patchset tracked the floor inside fs/inode.c, and each
timestamp update on the pipe inode could make multiple trips into the
timekeeper seqcount loops. That added extra barriers and slowed things
down.

This patchset moves the code into the timekeeper, which means only a
single barrier. That gets the cost down, but it's not entirely free.
There is some small cost to dealing with the floor, regardless.
-- 
Jeff Layton <jlayton@kernel.org>
Re: [PATCH v8 01/11] timekeeping: move multigrain timestamp floor handling into timekeeper
Posted by Thomas Gleixner 2 months, 2 weeks ago
On Mon, Sep 16 2024 at 12:12, Thomas Gleixner wrote:
> On Sat, Sep 14 2024 at 13:07, Jeff Layton wrote:
>> +	do {
>> +		seq = read_seqcount_begin(&tk_core.seq);
>> +
>> +		ts->tv_sec = tk->xtime_sec;
>> +		mono = tk->tkr_mono.base;
>> +		nsecs = timekeeping_get_ns(&tk->tkr_mono);
>> +		offset = *offsets[TK_OFFS_REAL];
>> +	} while (read_seqcount_retry(&tk_core.seq, seq));
>> +
>> +	mono = ktime_add_ns(mono, nsecs);
>> +
>> +	if (atomic64_try_cmpxchg(&mg_floor, &old, mono)) {
>> +		ts->tv_nsec = 0;
>> +		timespec64_add_ns(ts, nsecs);
>> +	} else {
>> +		/*
>> +		 * Something has changed mg_floor since "old" was
>> +		 * fetched. "old" has now been updated with the
>> +		 * current value of mg_floor, so use that to return
>> +		 * the current coarse floor value.
>
> 'Something has changed' is a truly understandable technical
> explanation.

     old = mg_floor
                                mono = T1;
                                mg_floor = mono
preemption

     do {
        mono = T2;
     }

     cmpxchg fails and the function returns a value based on T1

No?

Thanks,

        tglx
Re: [PATCH v8 01/11] timekeeping: move multigrain timestamp floor handling into timekeeper
Posted by Jeff Layton 2 months, 2 weeks ago
On Mon, 2024-09-16 at 12:32 +0200, Thomas Gleixner wrote:
> On Mon, Sep 16 2024 at 12:12, Thomas Gleixner wrote:
> > On Sat, Sep 14 2024 at 13:07, Jeff Layton wrote:
> > > +	do {
> > > +		seq = read_seqcount_begin(&tk_core.seq);
> > > +
> > > +		ts->tv_sec = tk->xtime_sec;
> > > +		mono = tk->tkr_mono.base;
> > > +		nsecs = timekeeping_get_ns(&tk->tkr_mono);
> > > +		offset = *offsets[TK_OFFS_REAL];
> > > +	} while (read_seqcount_retry(&tk_core.seq, seq));
> > > +
> > > +	mono = ktime_add_ns(mono, nsecs);
> > > +
> > > +	if (atomic64_try_cmpxchg(&mg_floor, &old, mono)) {
> > > +		ts->tv_nsec = 0;
> > > +		timespec64_add_ns(ts, nsecs);
> > > +	} else {
> > > +		/*
> > > +		 * Something has changed mg_floor since "old" was
> > > +		 * fetched. "old" has now been updated with the
> > > +		 * current value of mg_floor, so use that to return
> > > +		 * the current coarse floor value.
> > 
> > 'Something has changed' is a truly understandable technical
> > explanation.
> 
>      old = mg_floor
>                                 mono = T1;
>                                 mg_floor = mono
> preemption
> 
>      do {
>         mono = T2;
>      }
> 
>      cmpxchg fails and the function returns a value based on T1
> 
> No?
> 
> 

Packing for LPC, so I can't respond to all of these just now, but I
will later. You're correct, but either outcome is OK.

The requirement is that we don't hand out any values that were below
the floor at the time that the task entered the kernel. Since the time
changed while the task was already inside the kernel, either T1 or T2
would be valid timestamps.
-- 
Jeff Layton <jlayton@kernel.org>
Re: [PATCH v8 01/11] timekeeping: move multigrain timestamp floor handling into timekeeper
Posted by Thomas Gleixner 2 months ago
On Mon, Sep 16 2024 at 06:57, Jeff Layton wrote:
> On Mon, 2024-09-16 at 12:32 +0200, Thomas Gleixner wrote:
>> > 'Something has changed' is a truly understandable technical
>> > explanation.
>> 
>>      old = mg_floor
>>                                 mono = T1;
>>                                 mg_floor = mono
>> preemption
>> 
>>      do {
>>         mono = T2;
>>      }
>> 
>>      cmpxchg fails and the function returns a value based on T1
>> 
>> No?
>> 
>> 
>
> Packing for LPC, so I can't respond to all of these just now, but I
> will later. You're correct, but either outcome is OK.
>
> The requirement is that we don't hand out any values that were below
> the floor at the time that the task entered the kernel. Since the time
> changed while the task was already inside the kernel, either T1 or T2
> would be valid timestamps.

That really needs to be documented. A similar scenario exists
vs. ktime_get_coarse_real_ts64_mg().

Thanks,

        tglx
Re: [PATCH v8 01/11] timekeeping: move multigrain timestamp floor handling into timekeeper
Posted by Jeff Layton 2 months ago
On Mon, 2024-09-30 at 21:16 +0200, Thomas Gleixner wrote:
> On Mon, Sep 16 2024 at 06:57, Jeff Layton wrote:
> > On Mon, 2024-09-16 at 12:32 +0200, Thomas Gleixner wrote:
> > > > 'Something has changed' is a truly understandable technical
> > > > explanation.
> > > 
> > >      old = mg_floor
> > >                                 mono = T1;
> > >                                 mg_floor = mono
> > > preemption
> > > 
> > >      do {
> > >         mono = T2;
> > >      }
> > > 
> > >      cmpxchg fails and the function returns a value based on T1
> > > 
> > > No?
> > > 
> > > 
> > 
> > Packing for LPC, so I can't respond to all of these just now, but I
> > will later. You're correct, but either outcome is OK.
> > 
> > The requirement is that we don't hand out any values that were below
> > the floor at the time that the task entered the kernel. Since the time
> > changed while the task was already inside the kernel, either T1 or T2
> > would be valid timestamps.
> 
> That really needs to be documented. A similar scenario exists
> vs. ktime_get_coarse_real_ts64_mg().
> 

Yes.

I have the following section in the multigrain-ts.rst file that gets
added in patch 7 of this series. I'll also plan to add some extra
wording about how backward realtime clock jumps can affect ordering:

Inode Timestamp Ordering
========================

In addition to providing info about changes to individual files, file                          
timestamps also serve an important purpose in applications like "make". These                       
programs measure timestamps in order to determine whether source files might be                     
newer than cached objects.                                                                          

Userland applications like make can only determine ordering based on                                
operational boundaries. For a syscall those are the syscall entry and exit                          
points. For io_uring or nfsd operations, that's the request submission and                          
response. In the case of concurrent operations, userland can make no                                
determination about the order in which things will occur.

For instance, if a single thread modifies one file, and then another file in                        
sequence, the second file must show an equal or later mtime than the first. The                     
same is true if two threads are issuing similar operations that do not overlap                      
in time.

If however, two threads have racing syscalls that overlap in time, then there                       
is no such guarantee, and the second file may appear to have been modified                          
before, after or at the same time as the first, regardless of which one was                         
submitted first.

-- 
Jeff Layton <jlayton@kernel.org>
Re: [PATCH v8 01/11] timekeeping: move multigrain timestamp floor handling into timekeeper
Posted by Thomas Gleixner 2 months ago
On Mon, Sep 30 2024 at 15:37, Jeff Layton wrote:
> On Mon, 2024-09-30 at 21:16 +0200, Thomas Gleixner wrote:
> I have the following section in the multigrain-ts.rst file that gets
> added in patch 7 of this series. I'll also plan to add some extra
> wording about how backward realtime clock jumps can affect ordering:

Please also add comments into the code / interface.

> Inode Timestamp Ordering
> ========================
>
> In addition to providing info about changes to individual files, file                          
> timestamps also serve an important purpose in applications like "make". These                       
> programs measure timestamps in order to determine whether source files might be                     
> newer than cached objects.                                                                          
>
> Userland applications like make can only determine ordering based on                                
> operational boundaries. For a syscall those are the syscall entry and exit                          
> points. For io_uring or nfsd operations, that's the request submission and                          
> response. In the case of concurrent operations, userland can make no                                
> determination about the order in which things will occur.
>
> For instance, if a single thread modifies one file, and then another file in                        
> sequence, the second file must show an equal or later mtime than the first. The                     
> same is true if two threads are issuing similar operations that do not overlap                      
> in time.
>
> If however, two threads have racing syscalls that overlap in time, then there                       
> is no such guarantee, and the second file may appear to have been modified                          
> before, after or at the same time as the first, regardless of which one was                         
> submitted first.

That makes me ask a question. Are the timestamps always taken in thread
(syscall) context or can they be taken in other contexts (worker,
[soft]interrupt, etc.) too?

Thanks,

        tglx
Re: [PATCH v8 01/11] timekeeping: move multigrain timestamp floor handling into timekeeper
Posted by Jeff Layton 2 months ago
On Mon, 2024-09-30 at 22:19 +0200, Thomas Gleixner wrote:
> On Mon, Sep 30 2024 at 15:37, Jeff Layton wrote:
> > On Mon, 2024-09-30 at 21:16 +0200, Thomas Gleixner wrote:
> > I have the following section in the multigrain-ts.rst file that gets
> > added in patch 7 of this series. I'll also plan to add some extra
> > wording about how backward realtime clock jumps can affect ordering:
> 
> Please also add comments into the code / interface.
> 

Will do.

> > Inode Timestamp Ordering
> > ========================
> > 
> > In addition to providing info about changes to individual files, file                          
> > timestamps also serve an important purpose in applications like "make". These                       
> > programs measure timestamps in order to determine whether source files might be                     
> > newer than cached objects.                                                                          
> > 
> > Userland applications like make can only determine ordering based on                                
> > operational boundaries. For a syscall those are the syscall entry and exit                          
> > points. For io_uring or nfsd operations, that's the request submission and                          
> > response. In the case of concurrent operations, userland can make no                                
> > determination about the order in which things will occur.
> > 
> > For instance, if a single thread modifies one file, and then another file in                        
> > sequence, the second file must show an equal or later mtime than the first. The                     
> > same is true if two threads are issuing similar operations that do not overlap                      
> > in time.
> > 
> > If however, two threads have racing syscalls that overlap in time, then there                       
> > is no such guarantee, and the second file may appear to have been modified                          
> > before, after or at the same time as the first, regardless of which one was                         
> > submitted first.
> 
> That makes me ask a question. Are the timestamps always taken in thread
> (syscall) context or can they be taken in other contexts (worker,
> [soft]interrupt, etc.) too?
> 

That's a good question.

The main place we do this is inode_set_ctime_current(). That is mostly
called in the context of a syscall or similar sort of operation
(io_uring, nfsd RPC request, etc.).

I certainly wouldn't rule out a workqueue job calling that function,
but this is something we do while dirtying an inode, and that's not
typically done in interrupt context.
-- 
Jeff Layton <jlayton@kernel.org>
Re: [PATCH v8 01/11] timekeeping: move multigrain timestamp floor handling into timekeeper
Posted by Thomas Gleixner 1 month, 4 weeks ago
On Mon, Sep 30 2024 at 16:53, Jeff Layton wrote:
> On Mon, 2024-09-30 at 22:19 +0200, Thomas Gleixner wrote:
>> On Mon, Sep 30 2024 at 15:37, Jeff Layton wrote:
>> > If however, two threads have racing syscalls that overlap in time, then there                       
>> > is no such guarantee, and the second file may appear to have been modified                          
>> > before, after or at the same time as the first, regardless of which one was                         
>> > submitted first.
>> 
>> That makes me ask a question. Are the timestamps always taken in thread
>> (syscall) context or can they be taken in other contexts (worker,
>> [soft]interrupt, etc.) too?
>> 
>
> That's a good question.
>
> The main place we do this is inode_set_ctime_current(). That is mostly
> called in the context of a syscall or similar sort of operation
> (io_uring, nfsd RPC request, etc.).
>
> I certainly wouldn't rule out a workqueue job calling that function,
> but this is something we do while dirtying an inode, and that's not
> typically done in interrupt context.

The reason I'm asking is that if it's always syscall context,
i.e. write() or io_uring()/RPC request etc., then you can avoid the
whole global floor value dance and make it strictly per thread, which
simplifies the exercise significantly.

But even if it's not syscall/thread context then the worker or io_uring
state machine might just require to serialize against itself and not
coordinate with something else. But what do I know.

Thanks,

        tglx
Re: [PATCH v8 01/11] timekeeping: move multigrain timestamp floor handling into timekeeper
Posted by Jeff Layton 1 month, 4 weeks ago
On Mon, 2024-09-30 at 23:35 +0200, Thomas Gleixner wrote:
> On Mon, Sep 30 2024 at 16:53, Jeff Layton wrote:
> > On Mon, 2024-09-30 at 22:19 +0200, Thomas Gleixner wrote:
> > > On Mon, Sep 30 2024 at 15:37, Jeff Layton wrote:
> > > > If however, two threads have racing syscalls that overlap in time, then there                       
> > > > is no such guarantee, and the second file may appear to have been modified                          
> > > > before, after or at the same time as the first, regardless of which one was                         
> > > > submitted first.
> > > 
> > > That makes me ask a question. Are the timestamps always taken in thread
> > > (syscall) context or can they be taken in other contexts (worker,
> > > [soft]interrupt, etc.) too?
> > > 
> > 
> > That's a good question.
> > 
> > The main place we do this is inode_set_ctime_current(). That is mostly
> > called in the context of a syscall or similar sort of operation
> > (io_uring, nfsd RPC request, etc.).
> > 
> > I certainly wouldn't rule out a workqueue job calling that function,
> > but this is something we do while dirtying an inode, and that's not
> > typically done in interrupt context.
> 
> The reason I'm asking is that if it's always syscall context,
> i.e. write() or io_uring()/RPC request etc., then you can avoid the
> whole global floor value dance and make it strictly per thread, which
> simplifies the exercise significantly.
> 

I'm not sure I follow what you're proposing here.

Consider two threads doing writes serially to different files. IOW, the
second thread enters the write() syscall after the first thread returns
from its write(). In that situation, the second timestamp mustn't
appear to be earlier than the first (assuming no backward clock jump,
of course).

How would we ensure that with only per-thread structures?

> But even if it's not syscall/thread context then the worker or io_uring
> state machine might just require to serialize against itself and not
> coordinate with something else. But what do I know.


-- 
Jeff Layton <jlayton@kernel.org>
Re: [PATCH v8 01/11] timekeeping: move multigrain timestamp floor handling into timekeeper
Posted by Thomas Gleixner 1 month, 4 weeks ago
On Tue, Oct 01 2024 at 05:45, Jeff Layton wrote:
> On Mon, 2024-09-30 at 23:35 +0200, Thomas Gleixner wrote:
>> > I certainly wouldn't rule out a workqueue job calling that function,
>> > but this is something we do while dirtying an inode, and that's not
>> > typically done in interrupt context.
>> 
>> The reason I'm asking is that if it's always syscall context,
>> i.e. write() or io_uring()/RPC request etc., then you can avoid the
>> whole global floor value dance and make it strictly per thread, which
>> simplifies the exercise significantly.
>> 
>
> I'm not sure I follow what you're proposing here.
>
> Consider two threads doing writes serially to different files. IOW, the
> second thread enters the write() syscall after the first thread returns
> from its write(). In that situation, the second timestamp mustn't
> appear to be earlier than the first (assuming no backward clock jump,
> of course).
>
> How would we ensure that with only per-thread structures?

Bah. Right. Ignore my sleep deprived rambling.
Re: [PATCH v8 01/11] timekeeping: move multigrain timestamp floor handling into timekeeper
Posted by Jeff Layton 1 month, 4 weeks ago
On Tue, 2024-10-01 at 14:45 +0200, Thomas Gleixner wrote:
> On Tue, Oct 01 2024 at 05:45, Jeff Layton wrote:
> > On Mon, 2024-09-30 at 23:35 +0200, Thomas Gleixner wrote:
> > > > I certainly wouldn't rule out a workqueue job calling that function,
> > > > but this is something we do while dirtying an inode, and that's not
> > > > typically done in interrupt context.
> > > 
> > > The reason I'm asking is that if it's always syscall context,
> > > i.e. write() or io_uring()/RPC request etc., then you can avoid the
> > > whole global floor value dance and make it strictly per thread, which
> > > simplifies the exercise significantly.
> > > 
> > 
> > I'm not sure I follow what you're proposing here.
> > 
> > Consider two threads doing writes serially to different files. IOW, the
> > second thread enters the write() syscall after the first thread returns
> > from its write(). In that situation, the second timestamp mustn't
> > appear to be earlier than the first (assuming no backward clock jump,
> > of course).
> > 
> > How would we ensure that with only per-thread structures?
> 
> Bah. Right. Ignore my sleep deprived rambling.

No worries. My one big takeaway from working on all of this is that
anything dealing with clocks and time is just difficult to
conceptualize.

Could I trouble you for an Acked-by on this patch? I think this series
should probably go in via Christian's tree, but I don't think he wants
to merge it without an explicit ack from the timekeeping maintainers.

Thanks again for the review!
-- 
Jeff Layton <jlayton@kernel.org>
Re: [PATCH v8 01/11] timekeeping: move multigrain timestamp floor handling into timekeeper
Posted by John Stultz 2 months, 2 weeks ago
On Sat, Sep 14, 2024 at 10:07 AM Jeff Layton <jlayton@kernel.org> wrote:
>
> For multigrain timestamps, we must keep track of the latest timestamp
> that has ever been handed out, and never hand out a coarse time below
> that value.
>
> Add a static singleton atomic64_t into timekeeper.c that we can use to
> keep track of the latest fine-grained time ever handed out. This is
> tracked as a monotonic ktime_t value to ensure that it isn't affected by
> clock jumps.
>
> Add two new public interfaces:
>
> - ktime_get_coarse_real_ts64_mg() fills a timespec64 with the later of the
>   coarse-grained clock and the floor time
>
> - ktime_get_real_ts64_mg() gets the fine-grained clock value, and tries
>   to swap it into the floor. A timespec64 is filled with the result.
>
> Since the floor is global, we take great pains to avoid updating it
> unless it's absolutely necessary. If we do the cmpxchg and find that the
> value has been updated since we fetched it, then we discard the
> fine-grained time that was fetched in favor of the recent update.
>
> To maximize the window of this occurring when multiple tasks are racing
> to update the floor, ktime_get_coarse_real_ts64_mg returns a cookie
> value that represents the state of the floor tracking word, and
> ktime_get_real_ts64_mg accepts a cookie value that it uses as the "old"
> value when calling cmpxchg().

This last bit seems out of date.

> ---
>  include/linux/timekeeping.h |  4 +++
>  kernel/time/timekeeping.c   | 82 +++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 86 insertions(+)
>
> diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
> index fc12a9ba2c88..7aa85246c183 100644
> --- a/include/linux/timekeeping.h
> +++ b/include/linux/timekeeping.h
> @@ -45,6 +45,10 @@ extern void ktime_get_real_ts64(struct timespec64 *tv);
>  extern void ktime_get_coarse_ts64(struct timespec64 *ts);
>  extern void ktime_get_coarse_real_ts64(struct timespec64 *ts);
>
> +/* Multigrain timestamp interfaces */
> +extern void ktime_get_coarse_real_ts64_mg(struct timespec64 *ts);
> +extern void ktime_get_real_ts64_mg(struct timespec64 *ts);
> +
>  void getboottime64(struct timespec64 *ts);
>
>  /*
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index 5391e4167d60..16937242b904 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -114,6 +114,13 @@ static struct tk_fast tk_fast_raw  ____cacheline_aligned = {
>         .base[1] = FAST_TK_INIT,
>  };
>
> +/*
> + * This represents the latest fine-grained time that we have handed out as a
> + * timestamp on the system. Tracked as a monotonic ktime_t, and converted to the
> + * realtime clock on an as-needed basis.
> + */
> +static __cacheline_aligned_in_smp atomic64_t mg_floor;
> +
>  static inline void tk_normalize_xtime(struct timekeeper *tk)
>  {
>         while (tk->tkr_mono.xtime_nsec >= ((u64)NSEC_PER_SEC << tk->tkr_mono.shift)) {
> @@ -2394,6 +2401,81 @@ void ktime_get_coarse_real_ts64(struct timespec64 *ts)
>  }
>  EXPORT_SYMBOL(ktime_get_coarse_real_ts64);
>
> +/**
> + * ktime_get_coarse_real_ts64_mg - get later of coarse grained time or floor
> + * @ts: timespec64 to be filled
> + *
> + * Adjust floor to realtime and compare it to the coarse time. Fill
> + * @ts with the latest one. Note that this is a filesystem-specific
> + * interface and should be avoided outside of that context.
> + */
> +void ktime_get_coarse_real_ts64_mg(struct timespec64 *ts)
> +{
> +       struct timekeeper *tk = &tk_core.timekeeper;
> +       u64 floor = atomic64_read(&mg_floor);
> +       ktime_t f_real, offset, coarse;
> +       unsigned int seq;
> +
> +       WARN_ON(timekeeping_suspended);
> +
> +       do {
> +               seq = read_seqcount_begin(&tk_core.seq);
> +               *ts = tk_xtime(tk);
> +               offset = *offsets[TK_OFFS_REAL];
> +       } while (read_seqcount_retry(&tk_core.seq, seq));
> +
> +       coarse = timespec64_to_ktime(*ts);
> +       f_real = ktime_add(floor, offset);
> +       if (ktime_after(f_real, coarse))
> +               *ts = ktime_to_timespec64(f_real);
> +}
> +EXPORT_SYMBOL_GPL(ktime_get_coarse_real_ts64_mg);
> +
> +/**
> + * ktime_get_real_ts64_mg - attempt to update floor value and return result
> + * @ts:                pointer to the timespec to be set
> + *
> + * Get a current monotonic fine-grained time value and attempt to swap
> + * it into the floor. @ts will be filled with the resulting floor value,
> + * regardless of the outcome of the swap. Note that this is a filesystem
> + * specific interface and should be avoided outside of that context.
> + */
> +void ktime_get_real_ts64_mg(struct timespec64 *ts, u64 cookie)

Still passing a cookie. It doesn't match the header definition, so I'm
surprised this builds.

> +{
> +       struct timekeeper *tk = &tk_core.timekeeper;
> +       ktime_t old = atomic64_read(&mg_floor);
> +       ktime_t offset, mono;
> +       unsigned int seq;
> +       u64 nsecs;
> +
> +       WARN_ON(timekeeping_suspended);
> +
> +       do {
> +               seq = read_seqcount_begin(&tk_core.seq);
> +
> +               ts->tv_sec = tk->xtime_sec;
> +               mono = tk->tkr_mono.base;
> +               nsecs = timekeeping_get_ns(&tk->tkr_mono);
> +               offset = *offsets[TK_OFFS_REAL];
> +       } while (read_seqcount_retry(&tk_core.seq, seq));
> +
> +       mono = ktime_add_ns(mono, nsecs);
> +
> +       if (atomic64_try_cmpxchg(&mg_floor, &old, mono)) {
> +               ts->tv_nsec = 0;
> +               timespec64_add_ns(ts, nsecs);
> +       } else {
> +               /*
> +                * Something has changed mg_floor since "old" was
> +                * fetched. "old" has now been updated with the
> +                * current value of mg_floor, so use that to return
> +                * the current coarse floor value.
> +                */
> +               *ts = ktime_to_timespec64(ktime_add(old, offset));
> +       }
> +}
> +EXPORT_SYMBOL_GPL(ktime_get_real_ts64_mg);

Other than those issues, I'm ok with it. Thanks again for working
through my concerns!

Since I'm traveling for LPC soon, to save the next cycle, once the
fixes above are sorted:
 Acked-by: John Stultz <jstultz@google.com>

thanks
-john
Re: [PATCH v8 01/11] timekeeping: move multigrain timestamp floor handling into timekeeper
Posted by Jeff Layton 2 months, 2 weeks ago
On Sat, 2024-09-14 at 13:10 -0700, John Stultz wrote:
> On Sat, Sep 14, 2024 at 10:07 AM Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > For multigrain timestamps, we must keep track of the latest timestamp
> > that has ever been handed out, and never hand out a coarse time below
> > that value.
> > 
> > Add a static singleton atomic64_t into timekeeper.c that we can use to
> > keep track of the latest fine-grained time ever handed out. This is
> > tracked as a monotonic ktime_t value to ensure that it isn't affected by
> > clock jumps.
> > 
> > Add two new public interfaces:
> > 
> > - ktime_get_coarse_real_ts64_mg() fills a timespec64 with the later of the
> >   coarse-grained clock and the floor time
> > 
> > - ktime_get_real_ts64_mg() gets the fine-grained clock value, and tries
> >   to swap it into the floor. A timespec64 is filled with the result.
> > 
> > Since the floor is global, we take great pains to avoid updating it
> > unless it's absolutely necessary. If we do the cmpxchg and find that the
> > value has been updated since we fetched it, then we discard the
> > fine-grained time that was fetched in favor of the recent update.
> > 
> > To maximize the window of this occurring when multiple tasks are racing
> > to update the floor, ktime_get_coarse_real_ts64_mg returns a cookie
> > value that represents the state of the floor tracking word, and
> > ktime_get_real_ts64_mg accepts a cookie value that it uses as the "old"
> > value when calling cmpxchg().
> 
> This last bit seems out of date.
> 

Thanks. Dropped the last paragraph.

> > ---
> >  include/linux/timekeeping.h |  4 +++
> >  kernel/time/timekeeping.c   | 82 +++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 86 insertions(+)
> > 
> > diff --git a/include/linux/timekeeping.h b/include/linux/timekeeping.h
> > index fc12a9ba2c88..7aa85246c183 100644
> > --- a/include/linux/timekeeping.h
> > +++ b/include/linux/timekeeping.h
> > @@ -45,6 +45,10 @@ extern void ktime_get_real_ts64(struct timespec64 *tv);
> >  extern void ktime_get_coarse_ts64(struct timespec64 *ts);
> >  extern void ktime_get_coarse_real_ts64(struct timespec64 *ts);
> > 
> > +/* Multigrain timestamp interfaces */
> > +extern void ktime_get_coarse_real_ts64_mg(struct timespec64 *ts);
> > +extern void ktime_get_real_ts64_mg(struct timespec64 *ts);
> > +
> >  void getboottime64(struct timespec64 *ts);
> > 
> >  /*
> > diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> > index 5391e4167d60..16937242b904 100644
> > --- a/kernel/time/timekeeping.c
> > +++ b/kernel/time/timekeeping.c
> > @@ -114,6 +114,13 @@ static struct tk_fast tk_fast_raw  ____cacheline_aligned = {
> >         .base[1] = FAST_TK_INIT,
> >  };
> > 
> > +/*
> > + * This represents the latest fine-grained time that we have handed out as a
> > + * timestamp on the system. Tracked as a monotonic ktime_t, and converted to the
> > + * realtime clock on an as-needed basis.
> > + */
> > +static __cacheline_aligned_in_smp atomic64_t mg_floor;
> > +
> >  static inline void tk_normalize_xtime(struct timekeeper *tk)
> >  {
> >         while (tk->tkr_mono.xtime_nsec >= ((u64)NSEC_PER_SEC << tk->tkr_mono.shift)) {
> > @@ -2394,6 +2401,81 @@ void ktime_get_coarse_real_ts64(struct timespec64 *ts)
> >  }
> >  EXPORT_SYMBOL(ktime_get_coarse_real_ts64);
> > 
> > +/**
> > + * ktime_get_coarse_real_ts64_mg - get later of coarse grained time or floor
> > + * @ts: timespec64 to be filled
> > + *
> > + * Adjust floor to realtime and compare it to the coarse time. Fill
> > + * @ts with the latest one. Note that this is a filesystem-specific
> > + * interface and should be avoided outside of that context.
> > + */
> > +void ktime_get_coarse_real_ts64_mg(struct timespec64 *ts)
> > +{
> > +       struct timekeeper *tk = &tk_core.timekeeper;
> > +       u64 floor = atomic64_read(&mg_floor);
> > +       ktime_t f_real, offset, coarse;
> > +       unsigned int seq;
> > +
> > +       WARN_ON(timekeeping_suspended);
> > +
> > +       do {
> > +               seq = read_seqcount_begin(&tk_core.seq);
> > +               *ts = tk_xtime(tk);
> > +               offset = *offsets[TK_OFFS_REAL];
> > +       } while (read_seqcount_retry(&tk_core.seq, seq));
> > +
> > +       coarse = timespec64_to_ktime(*ts);
> > +       f_real = ktime_add(floor, offset);
> > +       if (ktime_after(f_real, coarse))
> > +               *ts = ktime_to_timespec64(f_real);
> > +}
> > +EXPORT_SYMBOL_GPL(ktime_get_coarse_real_ts64_mg);
> > +
> > +/**
> > + * ktime_get_real_ts64_mg - attempt to update floor value and return result
> > + * @ts:                pointer to the timespec to be set
> > + *
> > + * Get a current monotonic fine-grained time value and attempt to swap
> > + * it into the floor. @ts will be filled with the resulting floor value,
> > + * regardless of the outcome of the swap. Note that this is a filesystem
> > + * specific interface and should be avoided outside of that context.
> > + */
> > +void ktime_get_real_ts64_mg(struct timespec64 *ts, u64 cookie)
> 
> Still passing a cookie. It doesn't match the header definition, so I'm
> surprised this builds.
> 

Yeah, I didn't see a warning when I built it. Luckily the extra
parameter is ignored anyway, so it no harm. I've fixed that up in my
tree.

> > +{
> > +       struct timekeeper *tk = &tk_core.timekeeper;
> > +       ktime_t old = atomic64_read(&mg_floor);
> > +       ktime_t offset, mono;
> > +       unsigned int seq;
> > +       u64 nsecs;
> > +
> > +       WARN_ON(timekeeping_suspended);
> > +
> > +       do {
> > +               seq = read_seqcount_begin(&tk_core.seq);
> > +
> > +               ts->tv_sec = tk->xtime_sec;
> > +               mono = tk->tkr_mono.base;
> > +               nsecs = timekeeping_get_ns(&tk->tkr_mono);
> > +               offset = *offsets[TK_OFFS_REAL];
> > +       } while (read_seqcount_retry(&tk_core.seq, seq));
> > +
> > +       mono = ktime_add_ns(mono, nsecs);
> > +
> > +       if (atomic64_try_cmpxchg(&mg_floor, &old, mono)) {
> > +               ts->tv_nsec = 0;
> > +               timespec64_add_ns(ts, nsecs);
> > +       } else {
> > +               /*
> > +                * Something has changed mg_floor since "old" was
> > +                * fetched. "old" has now been updated with the
> > +                * current value of mg_floor, so use that to return
> > +                * the current coarse floor value.
> > +                */
> > +               *ts = ktime_to_timespec64(ktime_add(old, offset));
> > +       }
> > +}
> > +EXPORT_SYMBOL_GPL(ktime_get_real_ts64_mg);
> 
> Other than those issues, I'm ok with it. Thanks again for working
> through my concerns!
> 
> Since I'm traveling for LPC soon, to save the next cycle, once the
> fixes above are sorted:
>  Acked-by: John Stultz <jstultz@google.com>
> 

Thanks for the review!
-- 
Jeff Layton <jlayton@kernel.org>