twx-linux/drivers
Donet Tom 4f745def81 drivers/base/node: optimize memory block registration to reduce boot time
Patch series "drivers/base/node.c: optimization and cleanups", v7.


This patch (of 7)

During node device initialization, `memory blocks` are registered under
each NUMA node.  The `memory blocks` to be registered are identified using
the node's start and end PFNs, which are obtained from the node's pg_data

However, not all PFNs within this range necessarily belong to the same
node—some may belong to other nodes.  Additionally, due to the
discontiguous nature of physical memory, certain sections within a `memory
block` may be absent.

As a result, `memory blocks` that fall between a node's start and end PFNs
may span across multiple nodes, and some sections within those blocks may
be missing.  `Memory blocks` have a fixed size, which is architecture
dependent.

Due to these considerations, the memory block registration is currently
performed as follows:

for_each_online_node(nid):
    start_pfn = pgdat->node_start_pfn;
    end_pfn = pgdat->node_start_pfn + node_spanned_pages;
    for_each_memory_block_between(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn))
        mem_blk = memory_block_id(pfn_to_section_nr(pfn));
        pfn_mb_start=section_nr_to_pfn(mem_blk->start_section_nr)
        pfn_mb_end = pfn_start + memory_block_pfns - 1
        for (pfn = pfn_mb_start; pfn < pfn_mb_end; pfn++):
            if (get_nid_for_pfn(pfn) != nid):
                continue;
            else
                do_register_memory_block_under_node(nid, mem_blk,
                                                        MEMINIT_EARLY);

Here, we derive the start and end PFNs from the node's pg_data, then
determine the memory blocks that may belong to the node.  For each `memory
block` in this range, we inspect all PFNs it contains and check their
associated NUMA node ID.  If a PFN within the block matches the current
node, the memory block is registered under that node.

If CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, get_nid_for_pfn() performs
a binary search in the `memblock regions` to determine the NUMA node ID
for a given PFN.  If it is not enabled, the node ID is retrieved directly
from the struct page.

On large systems, this process can become time-consuming, especially since
we iterate over each `memory block` and all PFNs within it until a match
is found.  When CONFIG_DEFERRED_STRUCT_PAGE_INIT is enabled, the
additional overhead of the binary search increases the execution time
significantly, potentially leading to soft lockups during boot.

In this patch, we iterate over `memblock region` to identify the `memory
blocks` that belong to the current NUMA node.  `memblock regions` are
contiguous memory ranges, each associated with a single NUMA node, and
they do not span across multiple nodes.

for_each_memory_region(r): // r => region
  if (!node_online(r->nid)):
    continue;
  else
    for_each_memory_block_between(r->base, r->base + r->size - 1):
      do_register_memory_block_under_node(r->nid, mem_blk, MEMINIT_EARLY);

We iterate over all memblock regions, and if the node associated with the
region is online, we calculate the start and end memory blocks based on
the region's start and end PFNs.  We then register all the memory blocks
within that range under the region node.

Test Results on My system with 32TB RAM
=======================================
1. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.

Without this patch
------------------
Startup finished in 1min 16.528s (kernel)

With this patch
---------------
Startup finished in 17.236s (kernel) - 78% Improvement

2. Boot time with CONFIG_DEFERRED_STRUCT_PAGE_INIT disabled.

Without this patch
------------------
Startup finished in 28.320s (kernel)

With this patch
---------------
Startup finished in 15.621s (kernel) - 46% Improvement

[donettom@linux.ibm.com: restore removed extra line]
  Link: https://lkml.kernel.org/r/20250609140354.467908-1-donettom@linux.ibm.com
Link: https://lkml.kernel.org/r/2a0a05c2dffc62a742bf1dd030098be4ce99be28.1748452241.git.donettom@linux.ibm.com
Link: https://lkml.kernel.org/r/2a0a05c2dffc62a742bf1dd030098be4ce99be28.1748452241.git.donettom@linux.ibm.com
Signed-off-by: Donet Tom <donettom@linux.ibm.com>
Acked-by: David Hildenbrand <david@redhat.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org>
Acked-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2025-07-09 22:41:59 -07:00
..
accel accel/amdxdna: Fix incorrect PSP firmware size 2025-06-09 07:16:32 -07:00
accessibility
acpi ACPI fix for 6.16-rc5. 2025-07-04 17:25:41 -07:00
amba
android Char/Misc/IIO pull request for 6.16-rc1 2025-06-06 11:50:47 -07:00
ata ata: ahci: Use correct DMI identifier for ASUSPRO-D840SA LPM quirk 2025-06-25 15:17:57 +02:00
atm atm: idt77252: Add missing dma_map_error() 2025-06-25 15:28:57 -07:00
auxdisplay treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
base drivers/base/node: optimize memory block registration to reduce boot time 2025-07-09 22:41:59 -07:00
bcma
block block-6.16-20250704 2025-07-04 09:33:59 -07:00
bluetooth driver: bluetooth: hci_qca:fix unable to load the BT driver 2025-06-20 11:55:03 -04:00
bus treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
cache
cdrom
cdx
char treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
clk I've recently moved computers (among other things) so I'm sending this from a 2025-05-30 09:15:40 -07:00
clocksource MFD for v6.16 2025-06-03 11:53:55 -07:00
comedi treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
connector
counter Second set of Counter updates for 6.16 2025-05-24 08:29:32 +02:00
cpufreq rust: Use CpuId in place of raw CPU numbers 2025-06-12 10:31:28 +05:30
cpuidle Merge branch 'pm-cpuidle' 2025-05-30 20:21:36 +02:00
crypto treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
cxl cxl/edac: Fix using wrong repair type to check dram event record 2025-06-25 12:05:45 -07:00
dax
dca
devfreq
dio
dma treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
dma-buf dma-buf: fix timeout handling in dma_resv_wait_timeout v2 2025-06-30 13:15:44 +02:00
dpll
edac EDAC: Initialize EDAC features sysfs attributes 2025-06-30 10:57:24 +02:00
eisa
extcon
firewire treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
firmware Samsung SoC fixes for v6.16 2025-07-03 16:23:53 +02:00
fpga FPGA Manager changes for 6.16-rc1 2025-05-21 14:08:44 +02:00
fsi
fwctl
gnss
gpio gpio: mlxbf3: only get IRQ for device instance 0 2025-06-18 12:19:39 +02:00
gpu mm: remove the for_reclaim field from struct writeback_control 2025-07-09 22:41:58 -07:00
greybus treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
hid hid-for-linus-2025070502 2025-07-05 16:14:03 -07:00
hsi treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
hte
hv hyperv-next for v6.16 2025-06-03 08:39:20 -07:00
hwmon hwmon: (ltc4282) avoid repeated register write 2025-06-16 06:30:58 -07:00
hwspinlock
hwtracing coresight: updates for Linux v6.16 2025-05-22 18:04:43 +02:00
i2c i2c-for-6.16-rc5 2025-07-05 12:54:24 -07:00
i3c i3c: controllers do not need to depend on I3C 2025-05-24 22:49:07 +02:00
idle intel_idle: Update arguments of mwait_idle_with_hints() 2025-06-10 21:09:28 +02:00
iio treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
infiniband SCSI fixes on 20250703 2025-07-03 11:52:39 -07:00
input Input updates for v6.16-rc4 2025-07-04 09:54:15 -07:00
interconnect Merge branch 'icc-sa8775p' into icc-next 2025-05-19 17:09:50 +03:00
iommu iommu/vt-d: Assign devtlb cache tag on ATS enablement 2025-07-04 10:33:56 +02:00
ipack
irqchip irqchip/irq-msi-lib: Select CONFIG_GENERIC_MSI_IRQ 2025-06-30 16:59:12 +02:00
isdn treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
leds treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
macintosh
mailbox treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
mcb
md - dm-crypt: fix a crash on 32-bit machines 2025-06-23 15:02:57 -07:00
media treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
memory treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
memstick treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
message
mfd mfd: Fix building without CONFIG_OF 2025-06-19 11:05:30 +01:00
misc treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
mmc mtk-sd: reset host->mrq on prepare_data() error 2025-06-25 14:42:51 +02:00
most treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
mtd mtd: nand: qpic_common: prevent out of bounds access of BAM arrays 2025-06-29 22:10:47 +01:00
mux
net net: ngbe: specify IRQ vector when the number of VFs is 7 2025-07-03 11:51:40 +02:00
nfc treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
ntb
nubus
nvdimm
nvme block-6.16-20250704 2025-07-04 09:33:59 -07:00
nvmem Char/Misc/IIO pull request for 6.16-rc1 2025-06-06 11:50:47 -07:00
of - The 11 patch series "Add folio_mk_pte()" from Matthew Wilcox 2025-05-31 15:44:16 -07:00
opp OPP: switch to use kmemdup_array() 2025-05-19 15:37:53 +05:30
parisc
parport treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
pci pci-v6.16-fixes-2 2025-06-27 20:17:48 -07:00
pcmcia treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
peci
perf arm64 updates for 6.16 2025-05-28 14:55:35 -07:00
phy phy-for-6.16 2025-06-05 08:20:21 -07:00
pinctrl pinctrl: sunxi: dt: Consider pin base when calculating bank number from pin 2025-06-10 14:35:40 +02:00
platform platform-drivers-x86 for v6.16-3 2025-07-04 10:05:31 -07:00
pmdomain pmdomain: ti: Fix STANDBY handling of PER power domain 2025-05-19 16:11:05 +02:00
pnp
power - The 3 patch series "hung_task: extend blocking task stacktrace dump to 2025-05-31 19:12:53 -07:00
powercap powercap: intel_rapl: Do not change CLAMPING bit if ENABLE bit cannot be changed 2025-06-30 20:32:29 +02:00
pps treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
ps3
ptp ptp: allow reading of currently dialed frequency to succeed on free-running clocks 2025-06-17 16:13:09 -07:00
pwm pwm: axi-pwmgen: Fix handling of external clock 2025-06-06 13:16:50 -07:00
rapidio drivers/rapidio/rio_cm.c: prevent possible heap overwrite 2025-06-11 22:42:36 -07:00
ras
regulator regulator: gpio: Fix the out-of-bounds access to drvdata::gpiods 2025-07-03 12:22:35 +01:00
remoteproc remoteproc updates for v6.16 2025-06-02 11:04:29 -07:00
reset
rpmsg rpmsg: qcom_smd: Fix uninitialized return variable in __qcom_smd_send() 2025-05-20 21:46:10 -05:00
rtc rtc: pcf2127: add missing semicolon after statement 2025-06-24 16:06:14 +02:00
s390 s390/pkey: Prevent overflow in size calculation for memdup_user() 2025-06-16 16:15:24 +02:00
sbus
scsi scsi: core: Enforce unlimited max_segment_size when virt_boundary_mask is set 2025-06-24 21:20:58 -04:00
sh
siox
slimbus
soc soc: drivers for 6.16 2025-05-31 07:53:30 -07:00
soundwire soundwire updates for 6.16 2025-06-05 08:07:24 -07:00
spi spi: cadence-quadspi: fix cleanup of rx_chan on failure paths 2025-07-01 14:02:26 +01:00
spmi irqdomain: spmi: Switch to irq_domain_create_tree() 2025-05-21 14:53:17 +02:00
ssb
staging staging: rtl8723bs: Avoid memset() in aes_cipher() and aes_decipher() 2025-06-19 17:33:43 +02:00
target scsi: target: Fix NULL pointer dereference in core_scsi3_decode_spec_i_port() 2025-06-16 14:35:57 -04:00
tc
tee A fix in the OP-TEE driver for v6.16 2025-07-03 16:26:08 +02:00
thermal Thermal control updates for 6.16-rc1 2025-05-27 16:28:02 -07:00
thunderbolt thunderbolt: Changes for v6.16 merge window 2025-05-21 12:26:51 +02:00
tty serial: imx: Restore original RXTL for console to fix data loss 2025-06-24 15:34:21 +01:00
ufs scsi: ufs: core: Fix spelling of a sysfs attribute name 2025-06-24 21:22:20 -04:00
uio Char/Misc/IIO pull request for 6.16-rc1 2025-06-06 11:50:47 -07:00
usb usb: hub: Fix flushing of delayed work used for post resume purposes 2025-06-30 15:36:00 +02:00
vdpa vdpa/octeon_ep: Control PCI dev enabling manually 2025-05-27 10:27:53 -04:00
vfio pci-v6.16-changes 2025-06-04 11:26:17 -07:00
vhost virtio, vhost: features, fixes 2025-05-29 08:15:35 -07:00
video treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
virt treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
virtio virtio_ring: Fix error reporting in virtqueue_resize 2025-07-03 11:40:02 +02:00
w1 Char/Misc/IIO pull request for 6.16-rc1 2025-06-06 11:50:47 -07:00
watchdog treewide, timers: Rename from_timer() to timer_container_of() 2025-06-08 09:07:37 +02:00
xen xen/x86: fix initial memory balloon target 2025-05-23 07:09:00 +02:00
zorro
Kconfig
Makefile