twx-linux/kernel
Jiri Wiesner af7ab5da39 clocksource: Skip watchdog check for large watchdog intervals
commit 644649553508b9bacf0fc7a5bdc4f9e0165576a5 upstream.

There have been reports of the watchdog marking clocksources unstable on
machines with 8 NUMA nodes:

  clocksource: timekeeping watchdog on CPU373:
  Marking clocksource 'tsc' as unstable because the skew is too large:
  clocksource:   'hpet' wd_nsec: 14523447520
  clocksource:   'tsc'  cs_nsec: 14524115132

The measured clocksource skew - the absolute difference between cs_nsec
and wd_nsec - was 668 microseconds:

  cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612

The kernel used 200 microseconds for the uncertainty_margin of both the
clocksource and watchdog, resulting in a threshold of 400 microseconds (the
md variable). Both the cs_nsec and the wd_nsec value indicate that the
readout interval was circa 14.5 seconds.  The observed behaviour is that
watchdog checks failed for large readout intervals on 8 NUMA node
machines. This indicates that the size of the skew was directly proportinal
to the length of the readout interval on those machines. The measured
clocksource skew, 668 microseconds, was evaluated against a threshold (the
md variable) that is suited for readout intervals of roughly
WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5 second.

The intention of 2e27e793e280 ("clocksource: Reduce clocksource-skew
threshold") was to tighten the threshold for evaluating skew and set the
lower bound for the uncertainty_margin of clocksources to twice
WATCHDOG_MAX_SKEW. Later in c37e85c135ce ("clocksource: Loosen clocksource
watchdog constraints"), the WATCHDOG_MAX_SKEW constant was increased to
125 microseconds to fit the limit of NTP, which is able to use a
clocksource that suffers from up to 500 microseconds of skew per second.
Both the TSC and the HPET use default uncertainty_margin. When the
readout interval gets stretched the default uncertainty_margin is no
longer a suitable lower bound for evaluating skew - it imposes a limit
that is far stricter than the skew with which NTP can deal.

The root causes of the skew being directly proportinal to the length of
the readout interval are:

  * the inaccuracy of the shift/mult pairs of clocksources and the watchdog
  * the conversion to nanoseconds is imprecise for large readout intervals

Prevent this by skipping the current watchdog check if the readout
interval exceeds 2 * WATCHDOG_INTERVAL. Considering the maximum readout
interval of 2 * WATCHDOG_INTERVAL, the current default uncertainty margin
(of the TSC and HPET) corresponds to a limit on clocksource skew of 250
ppm (microseconds of skew per second).  To keep the limit imposed by NTP
(500 microseconds of skew per second) for all possible readout intervals,
the margins would have to be scaled so that the threshold value is
proportional to the length of the actual readout interval.

As for why the readout interval may get stretched: Since the watchdog is
executed in softirq context the expiration of the watchdog timer can get
severely delayed on account of a ksoftirqd thread not getting to run in a
timely manner. Surely, a system with such belated softirq execution is not
working well and the scheduling issue should be looked into but the
clocksource watchdog should be able to deal with it accordingly.

Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")
Suggested-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: Jiri Wiesner <jwiesner@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Paul E. McKenney <paulmck@kernel.org>
Reviewed-by: Feng Tang <feng.tang@intel.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20240122172350.GA740@incl
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2024-01-31 16:19:13 -08:00
..
bpf bpf: Add bpf_sock_addr_set_sun_path() to allow writing unix sockaddr from bpf 2024-01-31 16:19:04 -08:00
cgroup cgroup_freezer: cgroup_freezing: Check if not frozen 2023-12-13 18:45:22 +01:00
configs Kbuild updates for v6.6 2023-09-05 11:01:47 -07:00
debug kdb: Fix a potential buffer overflow in kdb_local() 2024-01-25 15:36:00 -08:00
dma dma-mapping: clear dev->dma_mem to NULL after freeing it 2024-01-25 15:35:26 -08:00
entry entry: Remove empty addr_limit_user_check() 2023-08-23 10:32:39 +02:00
events perf: Fix perf_event_validate_size() lockdep splat 2023-12-20 17:02:01 +01:00
futex futex: Don't include process MM in futex key on no-MMU 2023-11-20 11:58:53 +01:00
gcov gcov: shut up missing prototype warnings for internal stubs 2023-08-18 10:18:58 -07:00
irq genirq: Initialize resend_node hlist for all interrupt descriptors 2024-01-31 16:19:13 -08:00
kcsan kcsan: Don't expect 64 bits atomic builtins from 32 bits architectures 2023-06-09 23:29:50 +10:00
livepatch livepatch: Fix missing newline character in klp_resolve_symbols() 2023-11-20 11:59:25 +01:00
locking lockdep: Fix block chain corruption 2023-12-03 07:33:06 +01:00
module module/decompress: use kvmalloc() consistently 2023-11-20 11:59:37 +01:00
power PM: hibernate: Enforce ordering during image compression/decompression 2024-01-31 16:18:49 -08:00
printk Merge branch 'rework/misc-cleanups' into for-linus 2023-10-11 12:58:14 +02:00
rcu rcu: Defer RCU kthreads wakeup when CPU is dying 2024-01-31 16:19:03 -08:00
sched sched/fair: Update min_vruntime for reweight_entity() correctly 2024-01-25 15:35:14 -08:00
time clocksource: Skip watchdog check for large watchdog intervals 2024-01-31 16:19:13 -08:00
trace tracing: Ensure visibility when inserting an element into tracing_map 2024-01-31 16:19:01 -08:00
.gitignore
acct.c audit/stable-6.6 PR 20230829 2023-08-30 08:17:35 -07:00
async.c async: Introduce async_schedule_dev_nocall() 2024-01-31 16:18:49 -08:00
audit_fsnotify.c
audit_tree.c
audit_watch.c audit: don't WARN_ON_ONCE(!current->mm) in audit_exe_compare() 2023-11-28 17:19:56 +00:00
audit.c audit: move trailing statements to next line 2023-08-15 18:16:14 -04:00
audit.h audit: correct audit_filter_inodes() definition 2023-07-21 12:17:25 -04:00
auditfilter.c audit: move trailing statements to next line 2023-08-15 18:16:14 -04:00
auditsc.c audit,io_uring: io_uring openat triggers audit reference count underflow 2023-10-13 18:34:46 +02:00
backtracetest.c
bounds.c
capability.c lsm: constify the 'target' parameter in security_capget() 2023-08-08 16:48:47 -04:00
cfi.c
compat.c sched_getaffinity: don't assume 'cpumask_size()' is fully initialized 2023-03-14 19:32:38 -07:00
configs.c
context_tracking.c locking/atomic: treewide: use raw_atomic*_<op>() 2023-06-05 09:57:20 +02:00
cpu_pm.c cpuidle, cpu_pm: Remove RCU fiddling from cpu_pm_{enter,exit}() 2023-01-13 11:48:15 +01:00
cpu.c hrtimers: Push pending hrtimers away from outgoing CPU earlier 2023-12-13 18:44:56 +01:00
crash_core.c Crash: add lock to serialize crash hotplug handling 2023-09-29 17:20:48 -07:00
crash_dump.c
cred.c cred: get rid of CONFIG_DEBUG_CREDENTIALS 2023-12-20 17:01:51 +01:00
delayacct.c delayacct: track delays from IRQ/SOFTIRQ 2023-04-18 16:39:34 -07:00
dma.c
exec_domain.c
exit.c cred: get rid of CONFIG_DEBUG_CREDENTIALS 2023-12-20 17:01:51 +01:00
extable.c
fail_function.c kernel/fail_function: fix memory leak with using debugfs_lookup() 2023-02-08 13:36:22 +01:00
fork.c mm: add a NO_INHERIT flag to the PR_SET_MDWE prctl 2023-12-03 07:33:06 +01:00
freezer.c
gen_kheaders.sh Revert "kheaders: substituting --sort in archive creation" 2023-05-28 16:20:21 +09:00
groups.c
hung_task.c kernel/hung_task.c: set some hung_task.c variables storage-class-specifier to static 2023-04-08 13:45:37 -07:00
iomem.c kernel/iomem.c: remove __weak ioremap_cache helper 2023-08-21 13:37:28 -07:00
irq_work.c trace: Add trace_ipi_send_cpu() 2023-03-24 11:01:29 +01:00
jump_label.c jump_label: Prevent key->enabled int overflow 2022-12-01 15:53:05 -08:00
kallsyms_internal.h kallsyms: Reduce the memory occupied by kallsyms_seqs_of_names[] 2022-11-12 18:47:36 -08:00
kallsyms_selftest.c Modules changes for v6.6-rc1 2023-08-29 17:32:32 -07:00
kallsyms_selftest.h kallsyms: Add self-test facility 2022-11-15 00:42:02 -08:00
kallsyms.c kallsyms: Change func signature for cleanup_symbol_name() 2023-08-25 15:00:36 -07:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.kexec kexec: select CRYPTO from KEXEC_FILE instead of depending on it 2024-01-05 15:19:41 +01:00
Kconfig.locks
Kconfig.preempt
kcov.c kcov: add prototypes for helper functions 2023-06-09 17:44:17 -07:00
kexec_core.c kexec: do syscore_shutdown() in kernel_kexec 2024-01-31 16:18:56 -08:00
kexec_elf.c
kexec_file.c integrity-v6.6 2023-08-30 09:16:56 -07:00
kexec_internal.h
kexec.c kernel: kexec: copy user-array safely 2023-11-28 17:19:40 +00:00
kheaders.c kheaders: Use array declaration instead of char 2023-03-24 20:10:59 -07:00
kprobes.c kprobes: consistent rcu api usage for kretprobe holder 2023-12-13 18:45:31 +01:00
ksyms_common.c kallsyms: make kallsyms_show_value() as generic function 2023-06-08 12:27:20 -07:00
ksysfs.c crash: hotplug support for kexec_load() 2023-08-24 16:25:14 -07:00
kthread.c kthread: unexport __kthread_should_park() 2023-08-18 10:18:59 -07:00
latencytop.c
Makefile v6.5-rc1-modules-next 2023-06-28 15:51:08 -07:00
module_signature.c
notifier.c notifiers: add tracepoints to the notifiers infrastructure 2023-04-08 13:45:38 -07:00
nsproxy.c nsproxy: Convert nsproxy.count to refcount_t 2023-08-21 11:29:12 -07:00
padata.c crypto: pcrypt - Fix hungtask for PADATA_RESET 2023-11-28 17:19:42 +00:00
panic.c panic: Reenable preemption in WARN slowpath 2023-09-15 11:28:08 +02:00
params.c kernel: params: Remove unnecessary ‘0’ values from err 2023-07-10 12:47:01 -07:00
pid_namespace.c memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy 2023-08-21 13:37:59 -07:00
pid_sysctl.h memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy 2023-08-21 13:37:59 -07:00
pid.c pidfd: prevent a kernel-doc warning 2023-09-19 13:21:33 -07:00
profile.c
ptrace.c ptrace: Provide set/get interface for syscall user dispatch 2023-04-16 14:23:07 +02:00
range.c
reboot.c kernel/reboot: emergency_restart: Set correct system_state 2023-11-28 17:20:04 +00:00
regset.c
relay.c kernel: relay: remove unnecessary NULL values from relay_open_buf 2023-08-18 10:18:55 -07:00
resource_kunit.c
resource.c kernel/resource: Increment by align value in get_free_mem_region() 2024-01-10 17:16:58 +01:00
rseq.c rseq: Extend struct rseq with per-memory-map concurrency ID 2022-12-27 12:52:12 +01:00
scftorture.c scftorture: Pause testing after memory-allocation failure 2023-07-14 15:02:57 -07:00
scs.c scs: add support for dynamic shadow call stacks 2022-11-09 18:06:35 +00:00
seccomp.c seccomp: Add missing kerndoc notations 2023-08-17 12:32:15 -07:00
signal.c signal: print comm and exe name on fatal signals 2023-08-18 10:18:50 -07:00
smp.c smp,csd: Throw an error if a CSD lock is stuck for too long 2023-11-28 17:19:36 +00:00
smpboot.c cpu/hotplug: Remove unused state functions 2023-05-15 13:45:00 +02:00
smpboot.h
softirq.c sched/core: introduce sched_core_idle_cpu() 2023-07-13 15:21:50 +02:00
stackleak.c stackleak: allow to specify arch specific stackleak poison function 2023-04-20 11:36:35 +02:00
stacktrace.c
static_call_inline.c
static_call.c
stop_machine.c
sys_ni.c posix-timers: Get rid of [COMPAT_]SYS_NI() uses 2024-01-20 11:51:46 +01:00
sys.c prctl: Disable prctl(PR_SET_MDWE) on parisc 2023-12-03 07:33:06 +01:00
sysctl-test.c
sysctl.c v6.5-rc1-sysctl-next 2023-06-28 16:05:21 -07:00
task_work.c task_work: add kerneldoc annotation for 'data' argument 2023-09-19 13:21:32 -07:00
taskstats.c
torture.c rcutorture: Fix stuttering races and other issues 2023-11-28 17:20:08 +00:00
tracepoint.c tracepoint: Allow livepatch module add trace event 2023-02-18 14:34:36 -05:00
tsacct.c
ucount.c sysctl: Add size to register_sysctl 2023-08-15 15:26:17 -07:00
uid16.c
uid16.h
umh.c sysctl: fix unused proc_cap_handler() function warning 2023-06-29 15:19:43 -07:00
up.c
user_namespace.c userns: fix a struct's kernel-doc notation 2023-02-02 22:50:04 -08:00
user-return-notifier.c
user.c kernel/user: Allow user_struct::locked_vm to be usable for iommufd 2022-11-30 20:16:49 -04:00
usermode_driver.c
utsname_sysctl.c utsname: simplify one-level sysctl registration for uts_kern_table 2023-04-13 11:49:35 -07:00
utsname.c
vhost_task.c vhost: Fix worker hangs due to missed wake up calls 2023-06-08 15:43:09 -04:00
watch_queue.c kernel: watch_queue: copy user-array safely 2023-11-28 17:19:40 +00:00
watchdog_buddy.c watchdog/hardlockup: move SMP barriers from common code to buddy code 2023-06-19 16:25:28 -07:00
watchdog_perf.c watchdog/perf: add a weak function for an arch to detect if perf can use NMIs 2023-06-09 17:44:21 -07:00
watchdog.c watchdog: move softlockup_panic back to early_param 2023-11-28 17:19:57 +00:00
workqueue_internal.h workqueue: Drop the special locking rule for worker->flags and worker_pool->flags 2023-08-07 15:57:22 -10:00
workqueue.c workqueue: Make sure that wq_unbound_cpumask is never empty 2023-12-13 18:45:24 +01:00