twx-linux/kernel
Mel Gorman f169c62ff7 sched/numa: Complete scanning of inactive VMAs when there is no alternative
VMAs are skipped if there is no recent fault activity but this represents
a chicken-and-egg problem as there may be no fault activity if the PTEs
are never updated to trap NUMA hints. There is an indirect reliance on
scanning to be forced early in the lifetime of a task but this may fail
to detect changes in phase behaviour. Force inactive VMAs to be scanned
when all other eligible VMAs have been updated within the same scan
sequence.

Test results in general look good with some changes in performance, both
negative and positive, depending on whether the additional scanning and
faulting was beneficial or not to the workload. The autonuma benchmark
workload NUMA01_THREADLOCAL was picked for closer examination. The workload
creates two processes with numerous threads and thread-local storage that
is zero-filled in a loop. It exercises the corner case where unrelated
threads may skip VMAs that are thread-local to another thread and still
has some VMAs that inactive while the workload executes.

The VMA skipping activity frequency with and without the patch:

	6.6.0-rc2-sched-numabtrace-v1
	=============================
	    649 reason=scan_delay
	  9,094 reason=unsuitable
	 48,915 reason=shared_ro
	143,919 reason=inaccessible
	193,050 reason=pid_inactive

	6.6.0-rc2-sched-numabselective-v1
	=============================
	    146 reason=seq_completed
	    622 reason=ignore_pid_inactive

	    624 reason=scan_delay
	  6,570 reason=unsuitable
	 16,101 reason=shared_ro
	 27,608 reason=inaccessible
	 41,939 reason=pid_inactive

Note that with the patch applied, the PID activity is ignored
(ignore_pid_inactive) to ensure a VMA with some activity is completely
scanned. In addition, a small number of VMAs are scanned when no other
eligible VMA is available during a single scan window (seq_completed).
The number of times a VMA is skipped due to no PID activity from the
scanning task (pid_inactive) drops dramatically. It is expected that
this will increase the number of PTEs updated for NUMA hinting faults
as well as hinting faults but these represent PTEs that would otherwise
have been missed. The tradeoff is scan+fault overhead versus improving
locality due to migration.

On a 2-socket Cascade Lake test machine, the time to complete the
workload is as follows;

                                                 6.6.0-rc2              6.6.0-rc2
                                       sched-numabtrace-v1 sched-numabselective-v1
  Min       elsp-NUMA01_THREADLOCAL      174.22 (   0.00%)      117.64 (  32.48%)
  Amean     elsp-NUMA01_THREADLOCAL      175.68 (   0.00%)      123.34 *  29.79%*
  Stddev    elsp-NUMA01_THREADLOCAL        1.20 (   0.00%)        4.06 (-238.20%)
  CoeffVar  elsp-NUMA01_THREADLOCAL        0.68 (   0.00%)        3.29 (-381.70%)
  Max       elsp-NUMA01_THREADLOCAL      177.18 (   0.00%)      128.03 (  27.74%)

The time to complete the workload is reduced by almost 30%:

                     6.6.0-rc2   6.6.0-rc2
                  sched-numabtrace-v1 sched-numabselective-v1 /
  Duration User       91201.80    63506.64
  Duration System      2015.53     1819.78
  Duration Elapsed     1234.77      868.37

In this specific case, system CPU time was not increased but it's not
universally true.

From vmstat, the NUMA scanning and fault activity is as follows;

                                        6.6.0-rc2      6.6.0-rc2
                              sched-numabtrace-v1 sched-numabselective-v1
  Ops NUMA base-page range updates       64272.00    26374386.00
  Ops NUMA PTE updates                   36624.00       55538.00
  Ops NUMA PMD updates                      54.00       51404.00
  Ops NUMA hint faults                   15504.00       75786.00
  Ops NUMA hint local faults %           14860.00       56763.00
  Ops NUMA hint local percent               95.85          74.90
  Ops NUMA pages migrated                 1629.00     6469222.00

Both the number of PTE updates and hint faults is dramatically
increased. While this is superficially unfortunate, it represents
ranges that were simply skipped without the patch. As a result
of the scanning and hinting faults, many more pages were also
migrated but as the time to completion is reduced, the overhead
is offset by the gain.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Tested-by: Raghavendra K T <raghavendra.kt@amd.com>
Link: https://lore.kernel.org/r/20231010083143.19593-7-mgorman@techsingularity.net
2023-10-10 23:42:15 +02:00
..
bpf Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf 2023-09-16 11:16:00 +01:00
cgroup cgroup: fix build when CGROUP_SCHED is not enabled 2023-09-02 08:27:17 -07:00
configs Kbuild updates for v6.6 2023-09-05 11:01:47 -07:00
debug printk changes for 6.6 2023-09-04 13:20:19 -07:00
dma swiotlb: fix the check whether a device has used software IO TLB 2023-09-27 11:19:15 +02:00
entry entry: Remove empty addr_limit_user_check() 2023-08-23 10:32:39 +02:00
events powerpc updates for 6.6 2023-08-31 12:43:10 -07:00
futex mm/mm_init.c: remove obsolete macro HASH_SMALL 2023-08-18 10:12:07 -07:00
gcov gcov: shut up missing prototype warnings for internal stubs 2023-08-18 10:18:58 -07:00
irq Boring updates for the interrupt subsystem: 2023-08-28 14:33:11 -07:00
kcsan
livepatch
locking - An extensive rework of kexec and crash Kconfig from Eric DeVolder 2023-08-29 14:53:51 -07:00
module module/decompress: use vmalloc() for zstd decompression workspace 2023-08-29 09:39:08 -07:00
power PM: hibernate: Fix the exclusive get block device in test_resume mode 2023-09-12 11:45:15 +02:00
printk Revert "printk: export symbols for debug modules" 2023-09-07 14:19:42 +02:00
rcu TTY/Serial driver changes for 6.6-rc1 2023-09-01 09:38:00 -07:00
sched sched/numa: Complete scanning of inactive VMAs when there is no alternative 2023-10-10 23:42:15 +02:00
time Fix false positive "softirq work is pending" messages on -rt 2023-09-02 09:01:48 -07:00
trace tracing/user_events: Align set_bit() address for all archs 2023-09-30 16:25:41 -04:00
.gitignore
acct.c audit/stable-6.6 PR 20230829 2023-08-30 08:17:35 -07:00
async.c
audit_fsnotify.c
audit_tree.c
audit_watch.c
audit.c audit: move trailing statements to next line 2023-08-15 18:16:14 -04:00
audit.h audit: correct audit_filter_inodes() definition 2023-07-21 12:17:25 -04:00
auditfilter.c audit: move trailing statements to next line 2023-08-15 18:16:14 -04:00
auditsc.c Including fixes from netfilter and bpf. 2023-09-07 18:33:07 -07:00
backtracetest.c
bounds.c
capability.c lsm: constify the 'target' parameter in security_capget() 2023-08-08 16:48:47 -04:00
cfi.c
compat.c
configs.c
context_tracking.c
cpu_pm.c
cpu.c cpu/hotplug: Prevent self deadlock on CPU hot-unplug 2023-08-30 12:24:22 +02:00
crash_core.c Crash: add lock to serialize crash hotplug handling 2023-09-29 17:20:48 -07:00
crash_dump.c
cred.c cred: convert printks to pr_<level> 2023-08-18 10:18:49 -07:00
delayacct.c
dma.c
exec_domain.c
exit.c
extable.c
fail_function.c
fork.c percpu: changes for v6.6 2023-09-01 15:44:45 -07:00
freezer.c freezer,sched: Use saved_state to reduce some spurious wakeups 2023-09-18 08:14:36 +02:00
gen_kheaders.sh
groups.c
hung_task.c
iomem.c kernel/iomem.c: remove __weak ioremap_cache helper 2023-08-21 13:37:28 -07:00
irq_work.c
jump_label.c
kallsyms_internal.h
kallsyms_selftest.c Modules changes for v6.6-rc1 2023-08-29 17:32:32 -07:00
kallsyms_selftest.h
kallsyms.c kallsyms: Change func signature for cleanup_symbol_name() 2023-08-25 15:00:36 -07:00
kcmp.c
Kconfig.freezer
Kconfig.hz
Kconfig.kexec crash: hotplug support for kexec_load() 2023-08-24 16:25:14 -07:00
Kconfig.locks
Kconfig.preempt
kcov.c
kexec_core.c crash: add generic infrastructure for crash hotplug support 2023-08-24 16:25:13 -07:00
kexec_elf.c
kexec_file.c integrity-v6.6 2023-08-30 09:16:56 -07:00
kexec_internal.h
kexec.c crash: hotplug support for kexec_load() 2023-08-24 16:25:14 -07:00
kheaders.c
kprobes.c kernel: kprobes: Use struct_size() 2023-08-23 09:38:17 +09:00
ksyms_common.c
ksysfs.c crash: hotplug support for kexec_load() 2023-08-24 16:25:14 -07:00
kthread.c kthread: unexport __kthread_should_park() 2023-08-18 10:18:59 -07:00
latencytop.c
Makefile v6.5-rc1-modules-next 2023-06-28 15:51:08 -07:00
module_signature.c
notifier.c
nsproxy.c nsproxy: Convert nsproxy.count to refcount_t 2023-08-21 11:29:12 -07:00
padata.c
panic.c panic: Reenable preemption in WARN slowpath 2023-09-15 11:28:08 +02:00
params.c kernel: params: Remove unnecessary ‘0’ values from err 2023-07-10 12:47:01 -07:00
pid_namespace.c memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy 2023-08-21 13:37:59 -07:00
pid_sysctl.h memfd: replace ratcheting feature from vm.memfd_noexec with hierarchy 2023-08-21 13:37:59 -07:00
pid.c pidfd: prevent a kernel-doc warning 2023-09-19 13:21:33 -07:00
profile.c
ptrace.c
range.c
reboot.c
regset.c
relay.c kernel: relay: remove unnecessary NULL values from relay_open_buf 2023-08-18 10:18:55 -07:00
resource_kunit.c
resource.c
rseq.c
scftorture.c scftorture: Pause testing after memory-allocation failure 2023-07-14 15:02:57 -07:00
scs.c
seccomp.c seccomp: Add missing kerndoc notations 2023-08-17 12:32:15 -07:00
signal.c signal: print comm and exe name on fatal signals 2023-08-18 10:18:50 -07:00
smp.c smp: Reduce NMI traffic from CSD waiters to CSD destination 2023-07-10 14:19:04 -07:00
smpboot.c
smpboot.h
softirq.c sched/core: introduce sched_core_idle_cpu() 2023-07-13 15:21:50 +02:00
stackleak.c
stacktrace.c
static_call_inline.c
static_call.c
stop_machine.c
sys_ni.c x86/shstk: Introduce map_shadow_stack syscall 2023-08-02 15:01:51 -07:00
sys.c prctl: move PR_GET_AUXV out of PR_MCE_KILL 2023-07-17 12:53:21 -07:00
sysctl-test.c
sysctl.c v6.5-rc1-sysctl-next 2023-06-28 16:05:21 -07:00
task_work.c task_work: add kerneldoc annotation for 'data' argument 2023-09-19 13:21:32 -07:00
taskstats.c
torture.c torture: Stop right-shifting torture_random() return values 2023-08-14 15:01:08 -07:00
tracepoint.c
tsacct.c
ucount.c sysctl: Add size to register_sysctl 2023-08-15 15:26:17 -07:00
uid16.c
uid16.h
umh.c sysctl: fix unused proc_cap_handler() function warning 2023-06-29 15:19:43 -07:00
up.c
user_namespace.c
user-return-notifier.c
user.c
usermode_driver.c
utsname_sysctl.c
utsname.c
vhost_task.c
watch_queue.c
watchdog_buddy.c
watchdog_perf.c
watchdog.c watchdog/hardlockup: avoid large stack frames in watchdog_hardlockup_check() 2023-08-18 10:19:00 -07:00
workqueue_internal.h workqueue: Drop the special locking rule for worker->flags and worker_pool->flags 2023-08-07 15:57:22 -10:00
workqueue.c workqueue: Fix missed pwq_release_worker creation in wq_cpu_intensive_thresh_init() 2023-09-18 08:50:31 -10:00