twx-linux/drivers
Davidlohr Bueso b980077899 mm: introduce per-node proactive reclaim interface
This adds support for proactive reclaim in general on a NUMA system.  A
per-node interface extends support beyond the memcg-specific interface
while respecting the current semantics of memory.reclaim: it honors LRU
aging and does not support artificially triggering eviction on nodes
belonging to non-bottom tiers.

This patch allows userspace to do:

     echo "512M swappiness=10" > /sys/devices/system/node/nodeX/reclaim
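
The same write can be issued from a script.  The following is a minimal
sketch, not part of the patch: the helper names (format_reclaim_request,
reclaim_node), the retry count and the delay are hypothetical.  It assumes
that swappiness= is optional, as it is for memory.reclaim, and that a
contended node fails the write with EBUSY, as the akpm note in this commit
describes.

```python
import errno
import time

# Path taken from the commit message; "{node}" stands in for nodeX.
NODE_RECLAIM_PATH = "/sys/devices/system/node/node{node}/reclaim"


def format_reclaim_request(size, swappiness=None):
    """Build the request string, e.g. '512M swappiness=10'."""
    if swappiness is None:
        # Assumption: swappiness= may be omitted, as with memory.reclaim.
        return size
    return f"{size} swappiness={swappiness}"


def reclaim_node(node, size, swappiness=None, retries=3, delay=0.1,
                 path_template=NODE_RECLAIM_PATH):
    """Write a reclaim request for one node, retrying on EBUSY.

    Returns True if a write succeeded, False if every attempt hit EBUSY.
    Any other OSError (for example EACCES or ENOENT) is re-raised.
    """
    request = format_reclaim_request(size, swappiness)
    path = path_template.format(node=node)
    for _ in range(retries):
        try:
            # The buffered data is flushed when the file is closed, so a
            # kernel-side error surfaces inside this try block.
            with open(path, "w") as f:
                f.write(request)
            return True
        except OSError as e:
            if e.errno != errno.EBUSY:
                raise
            time.sleep(delay)  # hypothetical back-off before retrying
    return False
```

For example, reclaim_node(1, "512M", swappiness=10) would write the same
string as the echo command above to node1, and typically needs root.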

One of the premises for this is to semantically align as best as possible
with memory.reclaim.  For a brief time memcg did support a nodemask, until
55ab834a86a9 (Revert "mm: add nodes= arg to memory.reclaim"); the
semantics around reclaim (eviction) vs demotion were not clear, which
broke charging expectations.

With this approach:

1. Users who do not use memcg can benefit from proactive reclaim.  The
   memcg interface is not NUMA aware, and there are use cases that
   focus on NUMA balancing rather than workload memory footprint.

2. Proactive reclaim on top tiers will trigger demotion, for which
   memory is still byte-addressable.  Reclaiming on the bottom nodes will
   trigger evicting to swap (the traditional sense of reclaim).  This
   follows the semantics of what is today part of the aging process on
   tiered memory, mirroring what every other form of reclaim does
   (reactive and memcg proactive reclaim).  Furthermore per-node proactive
   reclaim is not as susceptible to the memcg charging problem mentioned
   above.

3. Unlike the nodes= arg, this interface avoids confusing semantics,
   such as what exactly the user wants when mixing top-tier and low-tier
   nodes in the nodemask.  Further, the per-node interface is less exposed
   to "free up memory in my container" use cases, where eviction is
   intended.

4. Users that *really* want to free up memory can use proactive
   reclaim on nodes known to be on the bottom tiers to force eviction
   in a natural way - higher access latencies are still better than swap.
   If compelled, although there are no guarantees and it is perhaps not
   worth the effort, users could also potentially follow a ladder-like
   approach to eventually free up the memory.  Alternatively, perhaps an
   'evict' option could be added to the parameters for both memory.reclaim
   and per-node interfaces to force this action unconditionally.

[akpm@linux-foundation.org: user_proactive_reclaim(): return -EBUSY on PGDAT_RECLAIM_LOCKED contention, per Roman]
[dave@stgolabs.net: memcg && node is also a bogus case, per Shakeel]
  Link: https://lkml.kernel.org/r/20250717235604.2atyx2aobwowpge3@offworld
Link: https://lkml.kernel.org/r/20250623185851.830632-5-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso <dave@stgolabs.net>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-19 18:59:53 -07:00
accel accel/amdxdna: Fix incorrect PSP firmware size 2025-06-09 07:16:32 -07:00
accessibility
acpi drivers,hmat: use node-notifier instead of memory-notifier 2025-07-13 16:38:16 -07:00
amba
android Char/Misc/IIO pull request for 6.16-rc1 2025-06-06 11:50:47 -07:00
ata ata: ahci: Use correct DMI identifier for ASUSPRO-D840SA LPM quirk 2025-06-25 15:17:57 +02:00
atm atm: idt77252: Add missing dma_map_error() 2025-06-25 15:28:57 -07:00
auxdisplay treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
base mm: introduce per-node proactive reclaim interface 2025-07-19 18:59:53 -07:00
bcma
block null_blk: use memzero_page() 2025-07-09 22:42:08 -07:00
bluetooth driver: bluetooth: hci_qca:fix unable to load the BT driver 2025-06-20 11:55:03 -04:00
bus treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
cache
cdrom cdrom: Remove unnecessary NULL check before unregister_sysctl_table() 2025-05-15 16:25:20 -06:00
cdx
char treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
clk I've recently moved computers (among other things) so I'm sending this from a 2025-05-30 09:15:40 -07:00
clocksource MFD for v6.16 2025-06-03 11:53:55 -07:00
comedi treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
connector
counter Second set of Counter updates for 6.16 2025-05-24 08:29:32 +02:00
cpufreq rust: Use CpuId in place of raw CPU numbers 2025-06-12 10:31:28 +05:30
cpuidle Merge branch 'pm-cpuidle' 2025-05-30 20:21:36 +02:00
crypto treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
cxl drivers,cxl: use node-notifier instead of memory-notifier 2025-07-13 16:38:15 -07:00
dax mm: remove callers of pfn_t functionality 2025-07-09 22:42:19 -07:00
dca
devfreq
dio
dma treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
dma-buf dma-buf: fix timeout handling in dma_resv_wait_timeout v2 2025-06-30 13:15:44 +02:00
dpll
edac EDAC: Initialize EDAC features sysfs attributes 2025-06-30 10:57:24 +02:00
eisa
extcon
firewire treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
firmware Samsung SoC fixes for v6.16 2025-07-03 16:23:53 +02:00
fpga FPGA Manager changes for 6.16-rc1 2025-05-21 14:08:44 +02:00
fsi
fwctl
gnss
gpio gpio: mlxbf3: only get IRQ for device instance 0 2025-06-18 12:19:39 +02:00
gpu mm: remove callers of pfn_t functionality 2025-07-09 22:42:19 -07:00
greybus treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
hid hid-for-linus-2025070502 2025-07-05 16:14:03 -07:00
hsi treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
hte
hv hyperv-next for v6.16 2025-06-03 08:39:20 -07:00
hwmon hwmon: (ltc4282) avoid repeated register write 2025-06-16 06:30:58 -07:00
hwspinlock
hwtracing mm: remove callers of pfn_t functionality 2025-07-09 22:42:19 -07:00
i2c i2c-for-6.16-rc5 2025-07-05 12:54:24 -07:00
i3c i3c: controllers do not need to depend on I3C 2025-05-24 22:49:07 +02:00
idle intel_idle: Update arguments of mwait_idle_with_hints() 2025-06-10 21:09:28 +02:00
iio treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
infiniband SCSI fixes on 20250703 2025-07-03 11:52:39 -07:00
input Input updates for v6.16-rc4 2025-07-04 09:54:15 -07:00
interconnect Merge branch 'icc-sa8775p' into icc-next 2025-05-19 17:09:50 +03:00
iommu iommu/vt-d: Assign devtlb cache tag on ATS enablement 2025-07-04 10:33:56 +02:00
ipack
irqchip irqchip/irq-msi-lib: Select CONFIG_GENERIC_MSI_IRQ 2025-06-30 16:59:12 +02:00
isdn treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
leds treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
macintosh
mailbox treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
mcb
md mm: remove callers of pfn_t functionality 2025-07-09 22:42:19 -07:00
media treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
memory treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
memstick treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
message
mfd mfd: Fix building without CONFIG_OF 2025-06-19 11:05:30 +01:00
misc mm/balloon_compaction: convert balloon_page_delete() to balloon_page_finalize() 2025-07-13 16:38:25 -07:00
mmc mtk-sd: reset host->mrq on prepare_data() error 2025-06-25 14:42:51 +02:00
most treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
mtd mtd: nand: qpic_common: prevent out of bounds access of BAM arrays 2025-06-29 22:10:47 +01:00
mux
net net: ngbe: specify IRQ vector when the number of VFs is 7 2025-07-03 11:51:40 +02:00
nfc treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
ntb
nubus
nvdimm mm: remove callers of pfn_t functionality 2025-07-09 22:42:19 -07:00
nvme block-6.16-20250704 2025-07-04 09:33:59 -07:00
nvmem Char/Misc/IIO pull request for 6.16-rc1 2025-06-06 11:50:47 -07:00
of - The 11 patch series "Add folio_mk_pte()" from Matthew Wilcox 2025-05-31 15:44:16 -07:00
opp OPP: switch to use kmemdup_array() 2025-05-19 15:37:53 +05:30
parisc
parport treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
pci pci-v6.16-fixes-2 2025-06-27 20:17:48 -07:00
pcmcia treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
peci
perf arm64 updates for 6.16 2025-05-28 14:55:35 -07:00
phy phy-for-6.16 2025-06-05 08:20:21 -07:00
pinctrl pinctrl: sunxi: dt: Consider pin base when calculating bank number from pin 2025-06-10 14:35:40 +02:00
platform platform-drivers-x86 for v6.16-3 2025-07-04 10:05:31 -07:00
pmdomain pmdomain: ti: Fix STANDBY handling of PER power domain 2025-05-19 16:11:05 +02:00
pnp
power - The 3 patch series "hung_task: extend blocking task stacktrace dump to 2025-05-31 19:12:53 -07:00
powercap powercap: intel_rapl: Do not change CLAMPING bit if ENABLE bit cannot be changed 2025-06-30 20:32:29 +02:00
pps treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
ps3
ptp ptp: allow reading of currently dialed frequency to succeed on free-running clocks 2025-06-17 16:13:09 -07:00
pwm pwm: axi-pwmgen: Fix handling of external clock 2025-06-06 13:16:50 -07:00
rapidio drivers/rapidio/rio_cm.c: prevent possible heap overwrite 2025-06-11 22:42:36 -07:00
ras
regulator regulator: gpio: Fix the out-of-bounds access to drvdata::gpiods 2025-07-03 12:22:35 +01:00
remoteproc remoteproc updates for v6.16 2025-06-02 11:04:29 -07:00
reset
rpmsg rpmsg: qcom_smd: Fix uninitialized return variable in __qcom_smd_send() 2025-05-20 21:46:10 -05:00
rtc rtc: pcf2127: add missing semicolon after statement 2025-06-24 16:06:14 +02:00
s390 mm: remove callers of pfn_t functionality 2025-07-09 22:42:19 -07:00
sbus
scsi scsi: core: Enforce unlimited max_segment_size when virt_boundary_mask is set 2025-06-24 21:20:58 -04:00
sh sh: Switch to irq_domain_create_*() 2025-05-16 21:06:11 +02:00
siox
slimbus
soc soc: drivers for 6.16 2025-05-31 07:53:30 -07:00
soundwire soundwire updates for 6.16 2025-06-05 08:07:24 -07:00
spi spi: cadence-quadspi: fix cleanup of rx_chan on failure paths 2025-07-01 14:02:26 +01:00
spmi irqdomain: spmi: Switch to irq_domain_create_tree() 2025-05-21 14:53:17 +02:00
ssb
staging staging: rtl8723bs: Avoid memset() in aes_cipher() and aes_decipher() 2025-06-19 17:33:43 +02:00
target scsi: target: Fix NULL pointer dereference in core_scsi3_decode_spec_i_port() 2025-06-16 14:35:57 -04:00
tc
tee A fix in the OP-TEE driver for v6.16 2025-07-03 16:26:08 +02:00
thermal Thermal control updates for 6.16-rc1 2025-05-27 16:28:02 -07:00
thunderbolt thunderbolt: Changes for v6.16 merge window 2025-05-21 12:26:51 +02:00
tty serial: imx: Restore original RXTL for console to fix data loss 2025-06-24 15:34:21 +01:00
ufs scsi: ufs: core: Fix spelling of a sysfs attribute name 2025-06-24 21:22:20 -04:00
uio Char/Misc/IIO pull request for 6.16-rc1 2025-06-06 11:50:47 -07:00
usb usb: hub: Fix flushing of delayed work used for post resume purposes 2025-06-30 15:36:00 +02:00
vdpa vdpa/octeon_ep: Control PCI dev enabling manually 2025-05-27 10:27:53 -04:00
vfio mm: remove callers of pfn_t functionality 2025-07-09 22:42:19 -07:00
vhost virtio, vhost: features, fixes 2025-05-29 08:15:35 -07:00
video treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
virt treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
virtio mm/balloon_compaction: convert balloon_page_delete() to balloon_page_finalize() 2025-07-13 16:38:25 -07:00
w1 Char/Misc/IIO pull request for 6.16-rc1 2025-06-06 11:50:47 -07:00
watchdog treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
xen xen/x86: fix initial memory balloon target 2025-05-23 07:09:00 +02:00
zorro
Kconfig
Makefile