a3ebb59eee
2462 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
8f7aa3d3c7 |
Networking changes for 6.19.
Core & protocols
----------------
- Replace busylock at the Tx queuing layer with a lockless list. Resulting
in a 300% (4x) improvement on heavy TX workloads, sending twice the
number of packets per second, for half the cpu cycles.
- Allow constantly busy flows to migrate to a more suitable CPU/NIC
queue. Normally we perform queue re-selection when flow comes out
of idle, but under extreme circumstances the flows may be constantly
busy. Add sysctl to allow periodic rehashing even if it'd risk packet
reordering.
- Optimize the NAPI skb cache, make it larger, use it in more paths.
- Attempt returning Tx skbs to the originating CPU (like we already did
for Rx skbs).
- Various data structure layout and prefetch optimizations from Eric.
- Remove ktime_get() from the recvmsg() fast path, ktime_get() is sadly
quite expensive on recent AMD machines.
- Extend threaded NAPI polling to allow the kthread busy poll for packets.
- Make MPTCP use Rx backlog processing. This lowers the lock pressure,
improving the Rx performance.
- Support memcg accounting of MPTCP socket memory.
- Allow admin to opt sockets out of global protocol memory accounting
(using a sysctl or BPF-based policy). The global limits are a poor fit
for modern container workloads, where limits are imposed using cgroups.
- Improve heuristics for when to kick off AF_UNIX garbage collection.
- Allow users to control TCP SACK compression, and default to 33% of RTT.
- Add tcp_rcvbuf_low_rtt sysctl to let datacenter users avoid unnecessarily
aggressive rcvbuf growth and overshot when the connection RTT is low.
- Preserve skb metadata space across skb_push / skb_pull operations.
- Support for IPIP encapsulation in the nftables flowtable offload.
- Support appending IP interface information to ICMP messages (RFC 5837).
- Support setting max record size in TLS (RFC 8449).
- Remove taking rtnl_lock from RTM_GETNEIGHTBL and RTM_SETNEIGHTBL.
- Use a dedicated lock (and RCU) in MPLS, instead of rtnl_lock.
- Let users configure the number of write buffers in SMC.
- Add new struct sockaddr_unsized for sockaddr of unknown length,
from Kees.
- Some conversions away from the crypto_ahash API, from Eric Biggers.
- Some preparations for slimming down struct page.
- YAML Netlink protocol spec for WireGuard.
- Add a tool on top of YAML Netlink specs/lib for reporting commonly
computed derived statistics and summarized system state.
Driver API
----------
- Add CAN XL support to the CAN Netlink interface.
- Add uAPI for reporting PHY Mean Square Error (MSE) diagnostics,
as defined by the OPEN Alliance's "Advanced diagnostic features
for 100BASE-T1 automotive Ethernet PHYs" specification.
- Add DPLL phase-adjust-gran pin attribute (and implement it in zl3073x).
- Refactor xfrm_input lock to reduce contention when NIC offloads IPsec
and performs RSS.
- Add info to devlink params whether the current setting is the default
or a user override. Allow resetting back to default.
- Add standard device stats for PSP crypto offload.
- Leverage DSA frame broadcast to implement simple HSR frame duplication
for a lot of switches without dedicated HSR offload.
- Add uAPI defines for 1.6Tbps link modes.
Device drivers
--------------
- Add Motorcomm YT921x gigabit Ethernet switch support.
- Add MUCSE driver for N500/N210 1GbE NIC series.
- Convert drivers to support dedicated ops for timestamping control,
and away from the direct IOCTL handling. While at it support GET
operations for PHY timestamping.
- Add (and convert most drivers to) a dedicated ethtool callback
for reading the Rx ring count.
- Significant refactoring efforts in the STMMAC driver, which supports
Synopsys turn-key MAC IP integrated into a ton of SoCs.
- Ethernet high-speed NICs:
- Broadcom (bnxt):
- support PPS in/out on all pins
- Intel (100G, ice, idpf):
- ice: implement standard ethtool and timestamping stats
- i40e: support setting the max number of MAC addresses per VF
- iavf: support RSS of GTP tunnels for 5G and LTE deployments
- nVidia/Mellanox (mlx5):
- reduce downtime on interface reconfiguration
- disable being an XDP redirect target by default (same as other
drivers) to avoid wasting resources if feature is unused
- Meta (fbnic):
- add support for Linux-managed PCS on 25G, 50G, and 100G links
- Wangxun:
- support Rx descriptor merge, and Tx head writeback
- support Rx coalescing offload
- support 25G SPF and 40G QSFP modules
- Ethernet virtual:
- Google (gve):
- allow ethtool to configure rx_buf_len
- implement XDP HW RX Timestamping support for DQ descriptor format
- Microsoft vNIC (mana):
- support HW link state events
- handle hardware recovery events when probing the device
- Ethernet NICs consumer, and embedded:
- usbnet: add support for Byte Queue Limits (BQL)
- AMD (amd-xgbe):
- add device selftests
- NXP (enetc):
- add i.MX94 support
- Broadcom integrated MACs (bcmgenet, bcmasp):
- bcmasp: add support for PHY-based Wake-on-LAN
- Broadcom switches (b53):
- support port isolation
- support BCM5389/97/98 and BCM63XX ARL formats
- Lantiq/MaxLinear switches:
- support bridge FDB entries on the CPU port
- use regmap for register access
- allow user to enable/disable learning
- support Energy Efficient Ethernet
- support configuring RMII clock delays
- add tagging driver for MaxLinear GSW1xx switches
- Synopsys (stmmac):
- support using the HW clock in free running mode
- add Eswin EIC7700 support
- add Rockchip RK3506 support
- add Altera Agilex5 support
- Cadence (macb):
- cleanup and consolidate descriptor and DMA address handling
- add EyeQ5 support
- TI:
- icssg-prueth: support AF_XDP
- Airoha access points:
- add missing Ethernet stats and link state callback
- add AN7583 support
- support out-of-order Tx completion processing
- Power over Ethernet:
- pd692x0: preserve PSE configuration across reboots
- add support for TPS23881B devices
- Ethernet PHYs:
- Open Alliance OATC14 10BASE-T1S PHY cable diagnostic support
- Support 50G SerDes and 100G interfaces in Linux-managed PHYs
- micrel:
- support for non PTP SKUs of lan8814
- enable in-band auto-negotiation on lan8814
- realtek:
- cable testing support on RTL8224
- interrupt support on RTL8221B
- motorcomm: support for PHY LEDs on YT853
- microchip: support for LAN867X Rev.D0 PHYs w/ SQI and cable diag
- mscc: support for PHY LED control
- CAN drivers:
- m_can: add support for optional reset and system wake up
- remove can_change_mtu() obsoleted by core handling
- mcp251xfd: support GPIO controller functionality
- Bluetooth:
- add initial support for PASTa
- WiFi:
- split ieee80211.h file, it's way too big
- improvements in VHT radiotap reporting, S1G, Channel Switch
Announcement handling, rate tracking in mesh networks
- improve multi-radio monitor mode support, and add a cfg80211 debugfs
interface for it
- HT action frame handling on 6 GHz
- initial chanctx work towards NAN
- MU-MIMO sniffer improvements
- WiFi drivers:
- RealTek (rtw89):
- support USB devices RTL8852AU and RTL8852CU
- initial work for RTL8922DE
- improved injection support
- Intel:
- iwlwifi: new sniffer API support
- MediaTek (mt76):
- WED support for >32-bit DMA
- airoha NPU support
- regdomain improvements
- continued WiFi7/MLO work
- Qualcomm/Atheros:
- ath10k: factory test support
- ath11k: TX power insertion support
- ath12k: BSS color change support
- ath12k: statistics improvements
- brcmfmac: Acer A1 840 tablet quirk
- rtl8xxxu: 40 MHz connection fixes/support
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmkveRQACgkQMUZtbf5S
IrvY7A/+Nb0o4BxLHjPkAl1m3t3q2d0Y29B7SNkwnwEtxAV8EkNeZ3GWrdtDnTQY
MYhmc7LEzvz8/lihapr7UJkcokzSASUV54hbez5jDBKC8EEoyUk8FdWDPerwlcRI
zmCFNAVFyh9GX8i7wcrzKbDTHT5+GZLbSlGl9U5mhLsDdRlJgH7d8PJ7vWcmtLFY
XN0paDyaeHfCl8wReWNAYx4C/I0ODOvlscpO0tnAKhB0ngJbQCKY2t6tn3rOYdif
ZSQ5KwVRnJtQ4fYOFMOy9+FSCjVXtyrxF8KLxD+mqom2ZhmO00UpOMl09tqhq3uT
WnvwoHUVBt6F+iITHwg5kMgIDPUq1kpUvL4S4UbVSuUm9ZKD+4KRU2ZHRBYMx+MU
bsqmtY8/IULClUoRz+tZhltA8eb0NEqNZE2JPOFDiJHn1YiCCkFwxibhir893oM3
sB7x65D7LQI2ty2BBGVGYnwYDPtyaxOA/s3WTwPvLEi3+Y/TGNIIrS9lBLA4U+Yr
Gi93WQGVjttMmVyaHgXBUGmi3L52hvolm0AZ8zSRGrnIEpecjhly2KfYuaOzuxXC
IHEQ6AFLdRh6JzafXGb/mQwGCHNmhwsY8A49i94fakWQamaL/L6A+1dyPu4LXMqi
NwqCmlVb/LKGlfNG+V4wT27srJ+yBA2Vk3tpR1sZQQytFh0LKHI=
=UoDR
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Replace busylock at the Tx queuing layer with a lockless list.
Resulting in a 300% (4x) improvement on heavy TX workloads, sending
twice the number of packets per second, for half the cpu cycles.
- Allow constantly busy flows to migrate to a more suitable CPU/NIC
queue.
Normally we perform queue re-selection when flow comes out of idle,
but under extreme circumstances the flows may be constantly busy.
Add sysctl to allow periodic rehashing even if it'd risk packet
reordering.
- Optimize the NAPI skb cache, make it larger, use it in more paths.
- Attempt returning Tx skbs to the originating CPU (like we already
did for Rx skbs).
- Various data structure layout and prefetch optimizations from Eric.
- Remove ktime_get() from the recvmsg() fast path, ktime_get() is
sadly quite expensive on recent AMD machines.
- Extend threaded NAPI polling to allow the kthread busy poll for
packets.
- Make MPTCP use Rx backlog processing. This lowers the lock
pressure, improving the Rx performance.
- Support memcg accounting of MPTCP socket memory.
- Allow admin to opt sockets out of global protocol memory accounting
(using a sysctl or BPF-based policy). The global limits are a poor
fit for modern container workloads, where limits are imposed using
cgroups.
- Improve heuristics for when to kick off AF_UNIX garbage collection.
- Allow users to control TCP SACK compression, and default to 33% of
RTT.
- Add tcp_rcvbuf_low_rtt sysctl to let datacenter users avoid
unnecessarily aggressive rcvbuf growth and overshot when the
connection RTT is low.
- Preserve skb metadata space across skb_push / skb_pull operations.
- Support for IPIP encapsulation in the nftables flowtable offload.
- Support appending IP interface information to ICMP messages (RFC
5837).
- Support setting max record size in TLS (RFC 8449).
- Remove taking rtnl_lock from RTM_GETNEIGHTBL and RTM_SETNEIGHTBL.
- Use a dedicated lock (and RCU) in MPLS, instead of rtnl_lock.
- Let users configure the number of write buffers in SMC.
- Add new struct sockaddr_unsized for sockaddr of unknown length,
from Kees.
- Some conversions away from the crypto_ahash API, from Eric Biggers.
- Some preparations for slimming down struct page.
- YAML Netlink protocol spec for WireGuard.
- Add a tool on top of YAML Netlink specs/lib for reporting commonly
computed derived statistics and summarized system state.
Driver API:
- Add CAN XL support to the CAN Netlink interface.
- Add uAPI for reporting PHY Mean Square Error (MSE) diagnostics, as
defined by the OPEN Alliance's "Advanced diagnostic features for
100BASE-T1 automotive Ethernet PHYs" specification.
- Add DPLL phase-adjust-gran pin attribute (and implement it in
zl3073x).
- Refactor xfrm_input lock to reduce contention when NIC offloads
IPsec and performs RSS.
- Add info to devlink params whether the current setting is the
default or a user override. Allow resetting back to default.
- Add standard device stats for PSP crypto offload.
- Leverage DSA frame broadcast to implement simple HSR frame
duplication for a lot of switches without dedicated HSR offload.
- Add uAPI defines for 1.6Tbps link modes.
Device drivers:
- Add Motorcomm YT921x gigabit Ethernet switch support.
- Add MUCSE driver for N500/N210 1GbE NIC series.
- Convert drivers to support dedicated ops for timestamping control,
and away from the direct IOCTL handling. While at it support GET
operations for PHY timestamping.
- Add (and convert most drivers to) a dedicated ethtool callback for
reading the Rx ring count.
- Significant refactoring efforts in the STMMAC driver, which
supports Synopsys turn-key MAC IP integrated into a ton of SoCs.
- Ethernet high-speed NICs:
- Broadcom (bnxt):
- support PPS in/out on all pins
- Intel (100G, ice, idpf):
- ice: implement standard ethtool and timestamping stats
- i40e: support setting the max number of MAC addresses per VF
- iavf: support RSS of GTP tunnels for 5G and LTE deployments
- nVidia/Mellanox (mlx5):
- reduce downtime on interface reconfiguration
- disable being an XDP redirect target by default (same as
other drivers) to avoid wasting resources if feature is
unused
- Meta (fbnic):
- add support for Linux-managed PCS on 25G, 50G, and 100G links
- Wangxun:
- support Rx descriptor merge, and Tx head writeback
- support Rx coalescing offload
- support 25G SPF and 40G QSFP modules
- Ethernet virtual:
- Google (gve):
- allow ethtool to configure rx_buf_len
- implement XDP HW RX Timestamping support for DQ descriptor
format
- Microsoft vNIC (mana):
- support HW link state events
- handle hardware recovery events when probing the device
- Ethernet NICs consumer, and embedded:
- usbnet: add support for Byte Queue Limits (BQL)
- AMD (amd-xgbe):
- add device selftests
- NXP (enetc):
- add i.MX94 support
- Broadcom integrated MACs (bcmgenet, bcmasp):
- bcmasp: add support for PHY-based Wake-on-LAN
- Broadcom switches (b53):
- support port isolation
- support BCM5389/97/98 and BCM63XX ARL formats
- Lantiq/MaxLinear switches:
- support bridge FDB entries on the CPU port
- use regmap for register access
- allow user to enable/disable learning
- support Energy Efficient Ethernet
- support configuring RMII clock delays
- add tagging driver for MaxLinear GSW1xx switches
- Synopsys (stmmac):
- support using the HW clock in free running mode
- add Eswin EIC7700 support
- add Rockchip RK3506 support
- add Altera Agilex5 support
- Cadence (macb):
- cleanup and consolidate descriptor and DMA address handling
- add EyeQ5 support
- TI:
- icssg-prueth: support AF_XDP
- Airoha access points:
- add missing Ethernet stats and link state callback
- add AN7583 support
- support out-of-order Tx completion processing
- Power over Ethernet:
- pd692x0: preserve PSE configuration across reboots
- add support for TPS23881B devices
- Ethernet PHYs:
- Open Alliance OATC14 10BASE-T1S PHY cable diagnostic support
- Support 50G SerDes and 100G interfaces in Linux-managed PHYs
- micrel:
- support for non PTP SKUs of lan8814
- enable in-band auto-negotiation on lan8814
- realtek:
- cable testing support on RTL8224
- interrupt support on RTL8221B
- motorcomm: support for PHY LEDs on YT853
- microchip: support for LAN867X Rev.D0 PHYs w/ SQI and cable diag
- mscc: support for PHY LED control
- CAN drivers:
- m_can: add support for optional reset and system wake up
- remove can_change_mtu() obsoleted by core handling
- mcp251xfd: support GPIO controller functionality
- Bluetooth:
- add initial support for PASTa
- WiFi:
- split ieee80211.h file, it's way too big
- improvements in VHT radiotap reporting, S1G, Channel Switch
Announcement handling, rate tracking in mesh networks
- improve multi-radio monitor mode support, and add a cfg80211
debugfs interface for it
- HT action frame handling on 6 GHz
- initial chanctx work towards NAN
- MU-MIMO sniffer improvements
- WiFi drivers:
- RealTek (rtw89):
- support USB devices RTL8852AU and RTL8852CU
- initial work for RTL8922DE
- improved injection support
- Intel:
- iwlwifi: new sniffer API support
- MediaTek (mt76):
- WED support for >32-bit DMA
- airoha NPU support
- regdomain improvements
- continued WiFi7/MLO work
- Qualcomm/Atheros:
- ath10k: factory test support
- ath11k: TX power insertion support
- ath12k: BSS color change support
- ath12k: statistics improvements
- brcmfmac: Acer A1 840 tablet quirk
- rtl8xxxu: 40 MHz connection fixes/support"
* tag 'net-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1381 commits)
net: page_pool: sanitise allocation order
net: page pool: xa init with destroy on pp init
net/mlx5e: Support XDP target xmit with dummy program
net/mlx5e: Update XDP features in switch channels
selftests/tc-testing: Test CAKE scheduler when enqueue drops packets
net/sched: sch_cake: Fix incorrect qlen reduction in cake_drop
wireguard: netlink: generate netlink code
wireguard: uapi: generate header with ynl-gen
wireguard: uapi: move flag enums
wireguard: uapi: move enum wg_cmd
wireguard: netlink: add YNL specification
selftests: drv-net: Fix tolerance calculation in devlink_rate_tc_bw.py
selftests: drv-net: Fix and clarify TC bandwidth split in devlink_rate_tc_bw.py
selftests: drv-net: Set shell=True for sysfs writes in devlink_rate_tc_bw.py
selftests: drv-net: Use Iperf3Runner in devlink_rate_tc_bw.py
selftests: drv-net: introduce Iperf3Runner for measurement use cases
selftests: drv-net: Add devlink_rate_tc_bw.py to TEST_PROGS
net: ps3_gelic_net: Use napi_alloc_skb() and napi_gro_receive()
Documentation: net: dsa: mention simple HSR offload helpers
Documentation: net: dsa: mention availability of RedBox
...
|
||
|
|
015e7b0b0e |
bpf-next-6.19
-----BEGIN PGP SIGNATURE----- iQIzBAABCAAdFiEE+soXsSLHKoYyzcli6rmadz2vbToFAmktzC4ACgkQ6rmadz2v bTpA1w/+PZ45N3y6O+NQVIpBlpnHG7DEMK7Lw19On0xVLwH+XPHz6J5PEfzjyJR1 SCbsV30qkJ1YCtgRHHf+ZCuWPWm58hY8dXYwSDyjNavdQyVGOdf17aBu9pvH45NW K20OhwQHpCHWIfDlijjPkDdiHnYf5S7Xy6ctt/3ztF0pMDHIaghGxJymG4wULcDT iLKnT37kwO8b2ihmw/HbcZPQYMWfHRye7X009K+wCv0dnhJ6q/Ny1m+Pg4kF92e6 ON/RY26ep2dq7LpaNWa1rI1yOgFlI7uUlVojqrAuAb+xrg+64wUDBxeijvE37EN1 s/+PuEKAR6xwz1dbY2cWAI0D633saz24UdV6kCBW9HrjHKVRQ7ZSsBF9ENkS4DTK nowx4wOe1ZHc/6YgTktZp9LEn/0YrmQtFxjqEAJiYUgD18FrBrSjmhHpBiL+HghP sTqy41qDQGoKtg3bRu42Co9wmNeeLsnxT8NQExCmTYQ4ufpdA/VMQux9cBVX3GBq EchJb465+AcvvCJUiKbnHLxDsHCQz1YYytz3RqyFLgGDFZnHOE0FjwPJmM8I5kkK gvDB3ZYdO3Halm8BZfZZBnKv5uK7myuAWwqRLgMRanuZcRgmIV1oUP5EP88HdH75 fB20vZSVcfzB17SLyhiM20ivEWodJa9VCLEw9WDOmDoml+33Pks= =kaCJ -----END PGP SIGNATURE----- Merge tag 'bpf-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: - Convert selftests/bpf/test_tc_edt and test_tc_tunnel from .sh to test_progs runner (Alexis Lothoré) - Convert selftests/bpf/test_xsk to test_progs runner (Bastien Curutchet) - Replace bpf memory allocator with kmalloc_nolock() in bpf_local_storage (Amery Hung), and in bpf streams and range tree (Puranjay Mohan) - Introduce support for indirect jumps in BPF verifier and x86 JIT (Anton Protopopov) and arm64 JIT (Puranjay Mohan) - Remove runqslower bpf tool (Hoyeon Lee) - Fix corner cases in the verifier to close several syzbot reports (Eduard Zingerman, KaFai Wan) - Several improvements in deadlock detection in rqspinlock (Kumar Kartikeya Dwivedi) - Implement "jmp" mode for BPF trampoline and corresponding DYNAMIC_FTRACE_WITH_JMP. It improves "fexit" program type performance from 80 M/s to 136 M/s. With Steven's Ack. (Menglong Dong) - Add ability to test non-linear skbs in BPF_PROG_TEST_RUN (Paul Chaignon) - Do not let BPF_PROG_TEST_RUN emit invalid GSO types to stack (Daniel Borkmann) - Generalize buildid reader into bpf_dynptr (Mykyta Yatsenko) - Optimize bpf_map_update_elem() for map-in-map types (Ritesh Oedayrajsingh Varma) - Introduce overwrite mode for BPF ring buffer (Xu Kuohai) * tag 'bpf-next-6.19' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (169 commits) bpf: optimize bpf_map_update_elem() for map-in-map types bpf: make kprobe_multi_link_prog_run always_inline selftests/bpf: do not hardcode target rate in test_tc_edt BPF program selftests/bpf: remove test_tc_edt.sh selftests/bpf: integrate test_tc_edt into test_progs selftests/bpf: rename test_tc_edt.bpf.c section to expose program type selftests/bpf: Add success stats to rqspinlock stress test rqspinlock: Precede non-head waiter queueing with AA check rqspinlock: Disable spinning for trylock fallback rqspinlock: Use trylock fallback when per-CPU rqnode is busy rqspinlock: Perform AA checks immediately rqspinlock: Enclose lock/unlock within lock entry acquisitions bpf: Remove runqslower tool selftests/bpf: Remove usage of lsm/file_alloc_security in selftest bpf: Disable file_alloc_security hook bpf: check for insn arrays in check_ptr_alignment bpf: force BPF_F_RDONLY_PROG on insn array creation bpf: Fix exclusive map memory leak selftests/bpf: Make CS length configurable for rqspinlock stress test selftests/bpf: Add lock wait time stats to rqspinlock stress test ... |
||
|
|
f2310b6271 |
nolibc changes for v6.19
Highlights:
* Preparations to the use of nolibc in UML.
* Cleanup of sparse warnings.
* Library mode without _start().
* More consistency when disabling errno.
* Unconditional installation of all architecture support files.
* Always 64-bit wide ino_t and off_t.
* Various cleanups and bug fixes.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQTg4lxklFHAidmUs57B+h1jyw5bOAUCaSwY2wAKCRDB+h1jyw5b
OPpKAP9wvshTAWselixtw/xR8BXBIxHUh9y/NeT827Ut4IvpUgD/UH3duISRfB++
31nIRPI5BSZpC4CgaN9QKChcmxR8rw4=
=CbTX
-----END PGP SIGNATURE-----
Merge tag 'nolibc-20251130-for-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/nolibc/linux-nolibc
Pull nolibc updates from Thomas Weißschuh:
- Preparations to the use of nolibc in UML:
- Cleanup of sparse warnings
- Library mode without _start()
- More consistency when disabling errno
- Unconditional installation of all architecture support files
- Always 64-bit wide ino_t and off_t
- Various cleanups and bug fixes
* tag 'nolibc-20251130-for-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/nolibc/linux-nolibc: (25 commits)
selftests/nolibc: error out on linker warnings
selftests/nolibc: use lld to link loongarch binaries
tools/nolibc: remove more __nolibc_enosys() fallbacks
tools/nolibc: remove now superfluous overflow check in llseek
tools/nolibc: use 64-bit off_t
tools/nolibc: prefer the llseek syscall
tools/nolibc: handle 64-bit off_t for llseek
tools/nolibc: use 64-bit ino_t
tools/nolibc: avoid using plain integer as NULL pointer
tools/nolibc: add support for fchdir()
tools/nolibc: clean up outdated comments in generic arch.h
tools/nolibc: make the "headers" target install all supported archs
tools/nolibc: add the more portable inttypes.h
tools/nolibc: provide the portable sys/select.h
tools/nolibc: add missing memchr() to string.h
tools/nolibc: fix misleading help message regarding installation path
tools/nolibc: add uio.h with readv and writev
tools/nolibc: add option to disable runtime
tools/nolibc: use __fallthrough__ rather than fallthrough
tools/nolibc: implement %m if errno is not defined
...
|
||
|
|
2547f79b0b |
s390 updates for 6.19 merge window
- Provide a new interface for dynamic configuration and deconfiguration of hotplug memory, allowing with and without memmap_on_memory support. This makes the way memory hotplug is handled on s390 much more similar to other architectures - Remove compat support. There shouldn't be any compat user space around anymore, therefore get rid of a lot of code which also doesn't need to be tested anymore - Add stackprotector support. GCC 16 will get new compiler options, which allow to generate code required for kernel stackprotector support - Merge pai_crypto and pai_ext PMU drivers into a new driver. This removes a lot of duplicated code. The new driver is also extendable and allows to support new PMUs - Add driver override support for AP queues - Rework and extend zcrypt and AP trace events to allow for tracing of crypto requests - Support block sizes larger than 65535 bytes for CCW tape devices - Since the rework of the virtual kernel address space the module area and the kernel image are within the same 4GB area. This eliminates the need of weak per cpu variables. Get rid of ARCH_MODULE_NEEDS_WEAK_PER_CPU - Various other small improvements and fixes -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEECMNfWEw3SLnmiLkZIg7DeRspbsIFAmktZioACgkQIg7DeRsp bsK4Rw//VzkvHyzOtGKZ8Hb4S+Sh/PFlaZQXNhj+Xt5gWoOhP1uPmmhBe6LxjYaB J9Ns3hpONQ1dTHV7VVkds8FvM/SBcGe8m5RpefmChC/bjm5UEOV/MppKtA0aLnEH hJmdubIrrRAXKggxlHEfRLzBsFvV/rJ9Xf16FhRxGDc4pgmgkI1NPQ41/dyCHklQ dB3YrFVPIETywVYYVB/G3h11JgF5Z6CKtjYCdSx72Fkbj65+6JPfcPgLKMpcJuPd UxUXtCo1FCXlP70jsz8JQI8cdieG0KDQTtnZP4P/pqjQ3wirOqvMewNa9t9xmQ2e p6Rc1Vx5DESkq9bRWtQEaprTVVzK7DDLH3RuZwB+uLrcLGD8JvVS6/m9n9CgzBMT BnJXG2sLZH+gdQy+DSD/fVDD7OvIk8TGrH+OFwVIKhrT/J3B2E7ZSYyZZCNIS7VG yiuypoDGYg3ZpYjH9+qOXWB3nc0vQWrlFzb1bsQu1omJGmunLv4jtTjAKGN82C33 auBsIYAlQW20X7DV0vZa59PwqwtBqtdQQcTidwtSztzKogRXAdK8KKHtN60JM4S2 7sWFOFCQaTChAeDNw6MF5EtULb551nwH2RtJ9x3CrJj+OGK6clbQNcxIA7Oy0veR Sl9v1lMfeKOgDrPdDy3ArQBJ8WLlF9qX9wLKbiaNyIKmkz2ymkg= =CNrb -----END PGP SIGNATURE----- Merge tag 's390-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 updates from Heiko Carstens: - Provide a new interface for dynamic configuration and deconfiguration of hotplug memory, allowing with and without memmap_on_memory support. This makes the way memory hotplug is handled on s390 much more similar to other architectures - Remove compat support. There shouldn't be any compat user space around anymore, therefore get rid of a lot of code which also doesn't need to be tested anymore - Add stackprotector support. GCC 16 will get new compiler options, which allow to generate code required for kernel stackprotector support - Merge pai_crypto and pai_ext PMU drivers into a new driver. This removes a lot of duplicated code. The new driver is also extendable and allows to support new PMUs - Add driver override support for AP queues - Rework and extend zcrypt and AP trace events to allow for tracing of crypto requests - Support block sizes larger than 65535 bytes for CCW tape devices - Since the rework of the virtual kernel address space the module area and the kernel image are within the same 4GB area. This eliminates the need of weak per cpu variables. Get rid of ARCH_MODULE_NEEDS_WEAK_PER_CPU - Various other small improvements and fixes * tag 's390-6.19-1' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (92 commits) watchdog: diag288_wdt: Remove KMSG_COMPONENT macro s390/entry: Use lay instead of aghik s390/vdso: Get rid of -m64 flag handling s390/vdso: Rename vdso64 to vdso s390: Rename head64.S to head.S s390/vdso: Use common STABS_DEBUG and DWARF_DEBUG macros s390: Add stackprotector support s390/modules: Simplify module_finalize() slightly s390: Remove KMSG_COMPONENT macro s390/percpu: Get rid of ARCH_MODULE_NEEDS_WEAK_PER_CPU s390/ap: Restrict driver_override versus apmask and aqmask use s390/ap: Rename mutex ap_perms_mutex to ap_attr_mutex s390/ap: Support driver_override for AP queue devices s390/ap: Use all-bits-one apmask/aqmask for vfio in_use() checks s390/debug: Update description of resize operation s390/syscalls: Switch to generic system call table generation s390/syscalls: Remove system call table pointer from thread_struct s390/uapi: Remove 31 bit support from uapi header files s390: Remove compat support tools: Remove s390 compat support ... |
||
|
|
6c26fbe8c9 |
Performance events changes for v6.19:
Callchain support:
- Add support for deferred user-space stack unwinding for
perf, enabled on x86. (Peter Zijlstra, Steven Rostedt)
- unwind_user/x86: Enable frame pointer unwinding on x86
(Josh Poimboeuf)
x86 PMU support and infrastructure:
- x86/insn: Simplify for_each_insn_prefix() (Peter Zijlstra)
- x86/insn,uprobes,alternative: Unify insn_is_nop()
(Peter Zijlstra)
Intel PMU driver:
- Large series to prepare for and implement architectural PEBS
support for Intel platforms such as Clearwater Forest (CWF)
and Panther Lake (PTL). (Dapeng Mi, Kan Liang)
- Check dynamic constraints (Kan Liang)
- Optimize PEBS extended config (Peter Zijlstra)
- cstates: Remove PC3 support from LunarLake (Zhang Rui)
- cstates: Add Pantherlake support (Zhang Rui)
- cstates: Clearwater Forest support (Zide Chen)
AMD PMU driver:
- x86/amd: Check event before enable to avoid GPF (George Kennedy)
Fixes and cleanups:
- task_work: Fix NMI race condition (Peter Zijlstra)
- perf/x86: Fix NULL event access and potential PEBS record loss
(Dapeng Mi)
- Misc other fixes and cleanups.
(Dapeng Mi, Ingo Molnar, Peter Zijlstra)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmktcU0RHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gNKw//ThLmbkoGJ0/yLOdEcW8rA/7HB43Oz6j9
k0Vs7zwDBMRFP4zQg2XeF5SH7CWS9p/nI3eMhorgmH77oJCvXJxVtD5991zmlZhf
eafOar5ZMVaoMz+tK8WWiENZyuN0bt0mumZmz9svXR3KV1S/q18XZ8bCas0itwnq
D0T3Gqi/Z39gJIy7bHNgLoFY2zvI9b2EJNDKlzHk3NJ7UamA4GuMHN0cM2dIzKGK
2L+wXOe2BH9YYzYrz/cdKq7sBMjOvFsCQ/5jh23A2Yu6JI4nJbw0WmexZRK1OWCp
GAdMjBuqIShibLRxK746WRO9iut49uTsah4iSG80hXzhpwf7VaegOarost1nLaqm
zweIOr3iwJRf273r6IqRuaporVHpQYMj2w2H63z36sQtGtkKHNyxZ50b6bqpwwjU
LikLEJ9Bmh3mlvlXsOx2wX6dTb1fUk+cy2ezCDKUHqOLjqy4dM8V+jYhuRO4yxXz
mj9aHZKgyuREt8yo/3nLqAzF5Okj9cXp7H6F1hCKWuCoAhNXkrvYcvbg8h6aRxOX
2vGhMYjpElkl/DG6OWCSwuqCt9nVEC/dazW9fKQjh4S0CFOVopaMGSkGcS/xUPub
92J4XMDEJX4RJ6dfspeQr97+1fETXEIWNv4WbKnDjqJlAucU1gnOTprVnAYUjcWw
74320FjGN1E=
=/8GE
-----END PGP SIGNATURE-----
Merge tag 'perf-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull performance events updates from Ingo Molnar:
"Callchain support:
- Add support for deferred user-space stack unwinding for perf,
enabled on x86. (Peter Zijlstra, Steven Rostedt)
- unwind_user/x86: Enable frame pointer unwinding on x86 (Josh
Poimboeuf)
x86 PMU support and infrastructure:
- x86/insn: Simplify for_each_insn_prefix() (Peter Zijlstra)
- x86/insn,uprobes,alternative: Unify insn_is_nop() (Peter Zijlstra)
Intel PMU driver:
- Large series to prepare for and implement architectural PEBS
support for Intel platforms such as Clearwater Forest (CWF) and
Panther Lake (PTL). (Dapeng Mi, Kan Liang)
- Check dynamic constraints (Kan Liang)
- Optimize PEBS extended config (Peter Zijlstra)
- cstates:
- Remove PC3 support from LunarLake (Zhang Rui)
- Add Pantherlake support (Zhang Rui)
- Clearwater Forest support (Zide Chen)
AMD PMU driver:
- x86/amd: Check event before enable to avoid GPF (George Kennedy)
Fixes and cleanups:
- task_work: Fix NMI race condition (Peter Zijlstra)
- perf/x86: Fix NULL event access and potential PEBS record loss
(Dapeng Mi)
- Misc other fixes and cleanups (Dapeng Mi, Ingo Molnar, Peter
Zijlstra)"
* tag 'perf-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (38 commits)
perf/x86/intel: Fix and clean up intel_pmu_drain_arch_pebs() type use
perf/x86/intel: Optimize PEBS extended config
perf/x86/intel: Check PEBS dyn_constraints
perf/x86/intel: Add a check for dynamic constraints
perf/x86/intel: Add counter group support for arch-PEBS
perf/x86/intel: Setup PEBS data configuration and enable legacy groups
perf/x86/intel: Update dyn_constraint base on PEBS event precise level
perf/x86/intel: Allocate arch-PEBS buffer and initialize PEBS_BASE MSR
perf/x86/intel: Process arch-PEBS records or record fragments
perf/x86/intel/ds: Factor out PEBS group processing code to functions
perf/x86/intel/ds: Factor out PEBS record processing code to functions
perf/x86/intel: Initialize architectural PEBS
perf/x86/intel: Correct large PEBS flag check
perf/x86/intel: Replace x86_pmu.drain_pebs calling with static call
perf/x86: Fix NULL event access and potential PEBS record loss
perf/x86: Remove redundant is_x86_event() prototype
entry,unwind/deferred: Fix unwind_reset_info() placement
unwind_user/x86: Fix arch=um build
perf: Support deferred user unwind
unwind_user/x86: Teach FP unwind about start of function
...
|
||
|
|
63e6995005 |
objtool updates for v6.19:
- klp-build livepatch module generation (Josh Poimboeuf)
Introduce new objtool features and a klp-build
script to generate livepatch modules using a
source .patch as input.
This builds on concepts from the longstanding out-of-tree
kpatch project which began in 2012 and has been used for
many years to generate livepatch modules for production kernels.
However, this is a complete rewrite which incorporates
hard-earned lessons from 12+ years of maintaining kpatch.
Key improvements compared to kpatch-build:
- Integrated with objtool: Leverages objtool's existing control-flow
graph analysis to help detect changed functions.
- Works on vmlinux.o: Supports late-linked objects, making it
compatible with LTO, IBT, and similar.
- Simplified code base: ~3k fewer lines of code.
- Upstream: No more out-of-tree #ifdef hacks, far less cruft.
- Cleaner internals: Vastly simplified logic for symbol/section/reloc
inclusion and special section extraction.
- Robust __LINE__ macro handling: Avoids false positive binary diffs
caused by the __LINE__ macro by introducing a fix-patch-lines script
which injects #line directives into the source .patch to preserve
the original line numbers at compile time.
- Disassemble code with libopcodes instead of running objdump
(Alexandre Chartre)
- Disassemble support (-d option to objtool) by Alexandre Chartre,
which supports the decoding of various Linux kernel code generation
specials such as alternatives:
17ef: sched_balance_find_dst_group+0x62f mov 0x34(%r9),%edx
17f3: sched_balance_find_dst_group+0x633 | <alternative.17f3> | X86_FEATURE_POPCNT
17f3: sched_balance_find_dst_group+0x633 | call 0x17f8 <__sw_hweight64> | popcnt %rdi,%rax
17f8: sched_balance_find_dst_group+0x638 cmp %eax,%edx
... jump table alternatives:
1895: sched_use_asym_prio+0x5 test $0x8,%ch
1898: sched_use_asym_prio+0x8 je 0x18a9 <sched_use_asym_prio+0x19>
189a: sched_use_asym_prio+0xa | <jump_table.189a> | JUMP
189a: sched_use_asym_prio+0xa | jmp 0x18ae <sched_use_asym_prio+0x1e> | nop2
189c: sched_use_asym_prio+0xc mov $0x1,%eax
18a1: sched_use_asym_prio+0x11 and $0x80,%ecx
... exception table alternatives:
native_read_msr:
5b80: native_read_msr+0x0 mov %edi,%ecx
5b82: native_read_msr+0x2 | <ex_table.5b82> | EXCEPTION
5b82: native_read_msr+0x2 | rdmsr | resume at 0x5b84 <native_read_msr+0x4>
5b84: native_read_msr+0x4 shl $0x20,%rdx
.... x86 feature flag decoding (also see the X86_FEATURE_POPCNT
example in sched_balance_find_dst_group() above):
2faaf: start_thread_common.constprop.0+0x1f jne 0x2fba4 <start_thread_common.constprop.0+0x114>
2fab5: start_thread_common.constprop.0+0x25 | <alternative.2fab5> | X86_FEATURE_ALWAYS | X86_BUG_NULL_SEG
2fab5: start_thread_common.constprop.0+0x25 | jmp 0x2faba <.altinstr_aux+0x2f4> | jmp 0x4b0 <start_thread_common.constprop.0+0x3f> | nop5
2faba: start_thread_common.constprop.0+0x2a mov $0x2b,%eax
... NOP sequence shortening:
1048e2: snapshot_write_finalize+0xc2 je 0x104917 <snapshot_write_finalize+0xf7>
1048e4: snapshot_write_finalize+0xc4 nop6
1048ea: snapshot_write_finalize+0xca nop11
1048f5: snapshot_write_finalize+0xd5 nop11
104900: snapshot_write_finalize+0xe0 mov %rax,%rcx
104903: snapshot_write_finalize+0xe3 mov 0x10(%rdx),%rax
... and much more.
- Function validation tracing support (Alexandre Chartre)
- Various -ffunction-sections fixes (Josh Poimboeuf)
- Clang AutoFDO (Automated Feedback-Directed Optimizations) support (Josh Poimboeuf)
- Misc fixes and cleanups (Borislav Petkov, Chen Ni,
Dylan Hatch, Ingo Molnar, John Wang, Josh Poimboeuf,
Pankaj Raghav, Peter Zijlstra, Thorsten Blum)
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmktavcRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1j3IhAAvc9tRV8SJcohim6DrkPGxCN/S80uzt5S
q8v1x5tBzMYmUxftfpoLsPCri6Ww0jprNuhnbRCvWAzXFuW79HWBNdVkEO7V/cym
OsCKQv3r0mWv5UXP3o8VM5K3tnU61wOAIx3yZCz5XKWeOg6NPXBJCSGWYpLuA7z0
1wUWAXuHgmj4RHMlHu5x0FZnSqGU3/TkUDGAqdxrY+myhdwm0Ul+dSwWGQdQjCgA
59Y/gDsWWEe5BVL56suwKZ1e+8UFnpbncbWkjELD6euJpYpDSNMOW/S6PYqOOz5M
rjMv06XIX5ma7QQbF5fMG/sXW64tZtc090UocDnx7hpDq9mLEyNNkXsqRQlmd8Wt
wG19IaeWo8aG9DTQkiv8OhtmssPKZHJsVjRUvXGnjktvxnsYSomgOT1lNme38dJD
X9jHgZCFMdPsQmG0dp00Y0oejfTChqIDef7qSpYwT96R7l9VQQF7K7AxfJwSeLGO
3hClZ0Gz/u9NiJTUUWTxUmR+YEy+1xIeaQSDq6t4JRtNJaMZlcevfVW+F2Lm04XH
9eSeF7bJS2XKrlLHVdPgWCGZOmee+ghdQ7svsyEGpzdzaAZ7UveTucHJ9CvW3Fft
Dcrl8rxX2NiD2PLz03HCHR/JVUDc3W3Exrer1TD8PD4LcZhFoBEGQbZ/gFlkBTxb
TOcemtJT03U=
=yPrS
-----END PGP SIGNATURE-----
Merge tag 'objtool-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull objtool updates from Ingo Molnar:
- klp-build livepatch module generation (Josh Poimboeuf)
Introduce new objtool features and a klp-build script to generate
livepatch modules using a source .patch as input.
This builds on concepts from the longstanding out-of-tree kpatch
project which began in 2012 and has been used for many years to
generate livepatch modules for production kernels. However, this is a
complete rewrite which incorporates hard-earned lessons from 12+
years of maintaining kpatch.
Key improvements compared to kpatch-build:
- Integrated with objtool: Leverages objtool's existing control-flow
graph analysis to help detect changed functions.
- Works on vmlinux.o: Supports late-linked objects, making it
compatible with LTO, IBT, and similar.
- Simplified code base: ~3k fewer lines of code.
- Upstream: No more out-of-tree #ifdef hacks, far less cruft.
- Cleaner internals: Vastly simplified logic for
symbol/section/reloc inclusion and special section extraction.
- Robust __LINE__ macro handling: Avoids false positive binary diffs
caused by the __LINE__ macro by introducing a fix-patch-lines
script which injects #line directives into the source .patch to
preserve the original line numbers at compile time.
- Disassemble code with libopcodes instead of running objdump
(Alexandre Chartre)
- Disassemble support (-d option to objtool) by Alexandre Chartre,
which supports the decoding of various Linux kernel code generation
specials such as alternatives:
17ef: sched_balance_find_dst_group+0x62f mov 0x34(%r9),%edx
17f3: sched_balance_find_dst_group+0x633 | <alternative.17f3> | X86_FEATURE_POPCNT
17f3: sched_balance_find_dst_group+0x633 | call 0x17f8 <__sw_hweight64> | popcnt %rdi,%rax
17f8: sched_balance_find_dst_group+0x638 cmp %eax,%edx
... jump table alternatives:
1895: sched_use_asym_prio+0x5 test $0x8,%ch
1898: sched_use_asym_prio+0x8 je 0x18a9 <sched_use_asym_prio+0x19>
189a: sched_use_asym_prio+0xa | <jump_table.189a> | JUMP
189a: sched_use_asym_prio+0xa | jmp 0x18ae <sched_use_asym_prio+0x1e> | nop2
189c: sched_use_asym_prio+0xc mov $0x1,%eax
18a1: sched_use_asym_prio+0x11 and $0x80,%ecx
... exception table alternatives:
native_read_msr:
5b80: native_read_msr+0x0 mov %edi,%ecx
5b82: native_read_msr+0x2 | <ex_table.5b82> | EXCEPTION
5b82: native_read_msr+0x2 | rdmsr | resume at 0x5b84 <native_read_msr+0x4>
5b84: native_read_msr+0x4 shl $0x20,%rdx
.... x86 feature flag decoding (also see the X86_FEATURE_POPCNT
example in sched_balance_find_dst_group() above):
2faaf: start_thread_common.constprop.0+0x1f jne 0x2fba4 <start_thread_common.constprop.0+0x114>
2fab5: start_thread_common.constprop.0+0x25 | <alternative.2fab5> | X86_FEATURE_ALWAYS | X86_BUG_NULL_SEG
2fab5: start_thread_common.constprop.0+0x25 | jmp 0x2faba <.altinstr_aux+0x2f4> | jmp 0x4b0 <start_thread_common.constprop.0+0x3f> | nop5
2faba: start_thread_common.constprop.0+0x2a mov $0x2b,%eax
... NOP sequence shortening:
1048e2: snapshot_write_finalize+0xc2 je 0x104917 <snapshot_write_finalize+0xf7>
1048e4: snapshot_write_finalize+0xc4 nop6
1048ea: snapshot_write_finalize+0xca nop11
1048f5: snapshot_write_finalize+0xd5 nop11
104900: snapshot_write_finalize+0xe0 mov %rax,%rcx
104903: snapshot_write_finalize+0xe3 mov 0x10(%rdx),%rax
... and much more.
- Function validation tracing support (Alexandre Chartre)
- Various -ffunction-sections fixes (Josh Poimboeuf)
- Clang AutoFDO (Automated Feedback-Directed Optimizations) support
(Josh Poimboeuf)
- Misc fixes and cleanups (Borislav Petkov, Chen Ni, Dylan Hatch, Ingo
Molnar, John Wang, Josh Poimboeuf, Pankaj Raghav, Peter Zijlstra,
Thorsten Blum)
* tag 'objtool-core-2025-12-01' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (129 commits)
objtool: Fix segfault on unknown alternatives
objtool: Build with disassembly can fail when including bdf.h
objtool: Trim trailing NOPs in alternative
objtool: Add wide output for disassembly
objtool: Compact output for alternatives with one instruction
objtool: Improve naming of group alternatives
objtool: Add Function to get the name of a CPU feature
objtool: Provide access to feature and flags of group alternatives
objtool: Fix address references in alternatives
objtool: Disassemble jump table alternatives
objtool: Disassemble exception table alternatives
objtool: Print addresses with alternative instructions
objtool: Disassemble group alternatives
objtool: Print headers for alternatives
objtool: Preserve alternatives order
objtool: Add the --disas=<function-pattern> action
objtool: Do not validate IBT for .return_sites and .call_sites
objtool: Improve tracing of alternative instructions
objtool: Add functions to better name alternatives
objtool: Identify the different types of alternatives
...
|
||
|
|
415d34b92c |
namespace-6.19-rc1
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCaSmOZQAKCRCRxhvAZXjc
ooKwAP4kR5kMjHlthf8jHmmCjVU3nQFO9hUZsIQL9gFJLOIQMAD+LLoTaq1WJufl
oSgZpREXZVmI1TK61eR6EZMB1YikGAo=
=TExi
-----END PGP SIGNATURE-----
Merge tag 'namespace-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs
Pull namespace updates from Christian Brauner:
"This contains substantial namespace infrastructure changes including a new
system call, active reference counting, and extensive header cleanups.
The branch depends on the shared kbuild branch for -fms-extensions support.
Features:
- listns() system call
Add a new listns() system call that allows userspace to iterate
through namespaces in the system. This provides a programmatic
interface to discover and inspect namespaces, addressing
longstanding limitations:
Currently, there is no direct way for userspace to enumerate
namespaces. Applications must resort to scanning /proc/*/ns/ across
all processes, which is:
- Inefficient - requires iterating over all processes
- Incomplete - misses namespaces not attached to any running
process but kept alive by file descriptors, bind mounts, or
parent references
- Permission-heavy - requires access to /proc for many processes
- No ordering or ownership information
- No filtering per namespace type
The listns() system call solves these problems:
ssize_t listns(const struct ns_id_req *req, u64 *ns_ids,
size_t nr_ns_ids, unsigned int flags);
struct ns_id_req {
__u32 size;
__u32 spare;
__u64 ns_id;
struct /* listns */ {
__u32 ns_type;
__u32 spare2;
__u64 user_ns_id;
};
};
Features include:
- Pagination support for large namespace sets
- Filtering by namespace type (MNT_NS, NET_NS, USER_NS, etc.)
- Filtering by owning user namespace
- Permission checks respecting namespace isolation
- Active Reference Counting
Introduce an active reference count that tracks namespace
visibility to userspace. A namespace is visible in the following
cases:
- The namespace is in use by a task
- The namespace is persisted through a VFS object (namespace file
descriptor or bind-mount)
- The namespace is a hierarchical type and is the parent of child
namespaces
The active reference count does not regulate lifetime (that's still
done by the normal reference count) - it only regulates visibility
to namespace file handles and listns().
This prevents resurrection of namespaces that are pinned only for
internal kernel reasons (e.g., user namespaces held by
file->f_cred, lazy TLB references on idle CPUs, etc.) which should
not be accessible via (1)-(3).
- Unified Namespace Tree
Introduce a unified tree structure for all namespaces with:
- Fixed IDs assigned to initial namespaces
- Lookup based solely on inode number
- Maintained list of owned namespaces per user namespace
- Simplified rbtree comparison helpers
Cleanups
- Header Reorganization:
- Move namespace types into separate header (ns_common_types.h)
- Decouple nstree from ns_common header
- Move nstree types into separate header
- Switch to new ns_tree_{node,root} structures with helper functions
- Use guards for ns_tree_lock
- Initial Namespace Reference Count Optimization
- Make all reference counts on initial namespaces a nop to avoid
pointless cacheline ping-pong for namespaces that can never go
away
- Drop custom reference count initialization for initial namespaces
- Add NS_COMMON_INIT() macro and use it for all namespaces
- pid: rely on common reference count behavior
- Miscellaneous Cleanups
- Rename exit_task_namespaces() to exit_nsproxy_namespaces()
- Rename is_initial_namespace() and make argument const
- Use boolean to indicate anonymous mount namespace
- Simplify owner list iteration in nstree
- nsfs: raise SB_I_NODEV, SB_I_NOEXEC, and DCACHE_DONTCACHE explicitly
- nsfs: use inode_just_drop()
- pidfs: raise DCACHE_DONTCACHE explicitly
- pidfs: simplify PIDFD_GET__NAMESPACE ioctls
- libfs: allow to specify s_d_flags
- cgroup: add cgroup namespace to tree after owner is set
- nsproxy: fix free_nsproxy() and simplify create_new_namespaces()
Fixes:
- setns(pidfd, ...) race condition
Fix a subtle race when using pidfds with setns(). When the target
task exits after prepare_nsset() but before commit_nsset(), the
namespace's active reference count might have been dropped. If
setns() then installs the namespaces, it would bump the active
reference count from zero without taking the required reference on
the owner namespace, leading to underflow when later decremented.
The fix resurrects the ownership chain if necessary - if the caller
succeeded in grabbing passive references, the setns() should
succeed even if the target task exits or gets reaped.
- Return EFAULT on put_user() error instead of success
- Make sure references are dropped outside of RCU lock (some
namespaces like mount namespace sleep when putting the last
reference)
- Don't skip active reference count initialization for network
namespace
- Add asserts for active refcount underflow
- Add asserts for initial namespace reference counts (both passive
and active)
- ipc: enable is_ns_init_id() assertions
- Fix kernel-doc comments for internal nstree functions
- Selftests
- 15 active reference count tests
- 9 listns() functionality tests
- 7 listns() permission tests
- 12 inactive namespace resurrection tests
- 3 threaded active reference count tests
- commit_creds() active reference tests
- Pagination and stress tests
- EFAULT handling test
- nsid tests fixes"
* tag 'namespace-6.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (103 commits)
pidfs: simplify PIDFD_GET_<type>_NAMESPACE ioctls
nstree: fix kernel-doc comments for internal functions
nsproxy: fix free_nsproxy() and simplify create_new_namespaces()
selftests/namespaces: fix nsid tests
ns: drop custom reference count initialization for initial namespaces
pid: rely on common reference count behavior
ns: add asserts for initial namespace active reference counts
ns: add asserts for initial namespace reference counts
ns: make all reference counts on initial namespace a nop
ipc: enable is_ns_init_id() assertions
fs: use boolean to indicate anonymous mount namespace
ns: rename is_initial_namespace()
ns: make is_initial_namespace() argument const
nstree: use guards for ns_tree_lock
nstree: simplify owner list iteration
nstree: switch to new structures
nstree: add helper to operate on struct ns_tree_{node,root}
nstree: move nstree types into separate header
nstree: decouple from ns_common header
ns: move namespace types into separate header
...
|
||
|
|
68e83f3472 |
tools: ynl-gen: add regeneration comment
Add a comment on regeneration to the generated files. The comment is placed after the YNL-GEN line[1], as to not interfere with ynl-regen.sh's detection logic. [1] and after the optional YNL-ARG line. Link: https://lore.kernel.org/r/aR5m174O7pklKrMR@zx2c4.com/ Suggested-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251120174429.390574-3-ast@fiberby.net Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
31b4d3af63 |
tools/nolibc: remove more __nolibc_enosys() fallbacks
Commit e6366101ce1f ("tools/nolibc: remove __nolibc_enosys() fallback
from time64-related functions") removed many of these fallbacks but
forgot a few.
Finish the job.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Acked-by: Willy Tarreau <w@1wt.eu>
|
||
|
|
3e1da545db |
tools/nolibc: remove now superfluous overflow check in llseek
As off_t is now always 64-bit wide this overflow can not happen anymore, remove the check. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Willy Tarreau <w@1wt.eu> Reviewed-by: Arnd Bergmann <arnd@arndb.de> |
||
|
|
e800e94468 |
tools/nolibc: use 64-bit off_t
The kernel uses 64-bit values for file offsets. Currently these might be truncated to 32-bit when assigned to nolibc's off_t values. Switch to 64-bit off_t consistently. Suggested-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/lkml/cec27d94-c99d-4c57-9a12-275ea663dda8@app.fastmail.com/ Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Willy Tarreau <w@1wt.eu> |
||
|
|
19c5a681b2 |
tools/nolibc: prefer the llseek syscall
Make sure to always use the 64-bit safe system call in preparation for 64-bit off_t on 32 bit architectures. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Willy Tarreau <w@1wt.eu> |
||
|
|
d93d0593dd |
tools/nolibc: handle 64-bit off_t for llseek
Correctly handle 64-bit off_t values in preparation for 64-bit off_t on 32-bit architectures. Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Willy Tarreau <w@1wt.eu> |
||
|
|
87506e44cb |
tools/nolibc: use 64-bit ino_t
The kernel uses 64-bit values for inode numbers. Currently these might be truncated to 32-bit when assigned to nolibc's ino_t values. Switch to 64-bit ino_t consistently. As ino_t is never used directly in kernel ABIs, no systemcall wrappers need to be adapted. Suggested-by: Arnd Bergmann <arnd@arndb.de> Link: https://lore.kernel.org/lkml/cec27d94-c99d-4c57-9a12-275ea663dda8@app.fastmail.com/ Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Willy Tarreau <w@1wt.eu> |
||
|
|
169ebcbb90 |
tools: Remove s390 compat support
Remove s390 compat support from everything within tools, since s390 compat support will be removed from the kernel. Reviewed-by: Arnd Bergmann <arnd@arndb.de> Acked-by: Thomas Weißschuh <linux@weissschuh.net> # tools/nolibc selftests/nolibc Reviewed-by: Thomas Weißschuh <linux@weissschuh.net> # selftests/vDSO Acked-by: Alexei Starovoitov <ast@kernel.org> # bpf bits Signed-off-by: Heiko Carstens <hca@linux.ibm.com> |
||
|
|
e47b68bda4 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf after 6.18-rc5+
Cross-merge BPF and other fixes after downstream PR. Minor conflict in kernel/bpf/helpers.c Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
c99ebb6132 |
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Cross-merge networking fixes after downstream PR (net-6.18-rc6).
No conflicts, adjacent changes in:
drivers/net/phy/micrel.c
96a9178a29a6 ("net: phy: micrel: lan8814 fix reset of the QSGMII interface")
61b7ade9ba8c ("net: phy: micrel: Add support for non PTP SKUs for lan8814")
and a trivial one in tools/testing/selftests/drivers/net/Makefile.
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
||
|
|
d851f2b2b2 |
Linux 6.18-rc5
-----BEGIN PGP SIGNATURE----- iQFSBAABCgA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmkRH1seHHRvcnZhbGRz QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGUCgH/j+fMbEg618ajVS2 SWdAXZKEDVtCqN6bq9VT3g3rwk/zNgvppjMdCBqyXFpjvkGGIxlZnNgiTVuTLzvR cjl0c5C1a38lJ+DzmLjTF1TJ3t0CcA/8l2iWKu3Dm1ch2yuxm5ZcM2b9ujBholf7 pYd7jZ7JhVm5eXD7U5X03AkZPUWAIx/Nip37cO7RLGzlkRSGLB7OXq3TB2u4e2ti gDpP4O+cgOqSuS71Hz0/8T6KIVQ9IZ/qzANWAYeHZD2DQwI3OZXI1WRnc1iw401o QaMaV21NirKwAANKetvbj7FgtmpdfQs/7FA+yR7YW2ARTpkc1EXrxgMZ6NuphGKE kYQo55g= =QaZ2 -----END PGP SIGNATURE----- Merge tag 'v6.18-rc5' into objtool/core, to pick up fixes Signed-off-by: Ingo Molnar <mingo@kernel.org> |
||
|
|
2d8482959e |
tools/nolibc: avoid using plain integer as NULL pointer
While an integer zero is a valid NULL pointer as per the C standard, sparse will complain about it. Use explicit NULL pointers instead. Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/r/202509261452.g5peaXCc-lkp@intel.com/ Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> Acked-by: Willy Tarreau <w@1wt.eu> |
||
|
|
107eb8336e |
tools/nolibc: add support for fchdir()
Add support for the file descriptor based variant of chdir(). Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de> Acked-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> |
||
|
|
b4ce5923e7 |
bpf, x86: add new map type: instructions array
On bpf(BPF_PROG_LOAD) syscall user-supplied BPF programs are
translated by the verifier into "xlated" BPF programs. During this
process the original instructions offsets might be adjusted and/or
individual instructions might be replaced by new sets of instructions,
or deleted.
Add a new BPF map type which is aimed to keep track of how, for a
given program, the original instructions were relocated during the
verification. Also, besides keeping track of the original -> xlated
mapping, make x86 JIT to build the xlated -> jitted mapping for every
instruction listed in an instruction array. This is required for every
future application of instruction arrays: static keys, indirect jumps
and indirect calls.
A map of the BPF_MAP_TYPE_INSN_ARRAY type must be created with a u32
keys and value of size 8. The values have different semantics for
userspace and for BPF space. For userspace a value consists of two
u32 values – xlated and jitted offsets. For BPF side the value is
a real pointer to a jitted instruction.
On map creation/initialization, before loading the program, each
element of the map should be initialized to point to an instruction
offset within the program. Before the program load such maps should
be made frozen. After the program verification xlated and jitted
offsets can be read via the bpf(2) syscall.
If a tracked instruction is removed by the verifier, then the xlated
offset is set to (u32)-1 which is considered to be too big for a valid
BPF program offset.
One such a map can, obviously, be used to track one and only one BPF
program. If the verification process was unsuccessful, then the same
map can be re-used to verify the program with a different log level.
However, if the program was loaded fine, then such a map, being
frozen in any case, can't be reused by other programs even after the
program release.
Example. Consider the following original and xlated programs:
Original prog: Xlated prog:
0: r1 = 0x0 0: r1 = 0
1: *(u32 *)(r10 - 0x4) = r1 1: *(u32 *)(r10 -4) = r1
2: r2 = r10 2: r2 = r10
3: r2 += -0x4 3: r2 += -4
4: r1 = 0x0 ll 4: r1 = map[id:88]
6: call 0x1 6: r1 += 272
7: r0 = *(u32 *)(r2 +0)
8: if r0 >= 0x1 goto pc+3
9: r0 <<= 3
10: r0 += r1
11: goto pc+1
12: r0 = 0
7: r6 = r0 13: r6 = r0
8: if r6 == 0x0 goto +0x2 14: if r6 == 0x0 goto pc+4
9: call 0x76 15: r0 = 0xffffffff8d2079c0
17: r0 = *(u64 *)(r0 +0)
10: *(u64 *)(r6 + 0x0) = r0 18: *(u64 *)(r6 +0) = r0
11: r0 = 0x0 19: r0 = 0x0
12: exit 20: exit
An instruction array map, containing, e.g., instructions [0,4,7,12]
will be translated by the verifier to [0,4,13,20]. A map with
index 5 (the middle of 16-byte instruction) or indexes greater than 12
(outside the program boundaries) would be rejected.
The functionality provided by this patch will be extended in consequent
patches to implement BPF Static Keys, indirect jumps, and indirect calls.
Signed-off-by: Anton Protopopov <a.s.protopopov@gmail.com>
Reviewed-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20251105090410.1250500-2-a.s.protopopov@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
||
|
|
c18d4b190a |
net: Extend NAPI threaded polling to allow kthread based busy polling
Add a new state NAPI_STATE_THREADED_BUSY_POLL to the NAPI state enum to enable and disable threaded busy polling. When threaded busy polling is enabled for a NAPI, enable NAPI_STATE_THREADED also. When the threaded NAPI is scheduled, set NAPI_STATE_IN_BUSY_POLL to signal napi_complete_done not to rearm interrupts. Whenever NAPI_STATE_THREADED_BUSY_POLL is unset, the NAPI_STATE_IN_BUSY_POLL will be unset, napi_complete_done unsets the NAPI_STATE_SCHED_THREADED bit also, which in turn will make the kthread go to sleep. Signed-off-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Martin Karsten <mkarsten@uwaterloo.ca> Tested-by: Martin Karsten <mkarsten@uwaterloo.ca> Link: https://patch.msgid.link/20251028203007.575686-2-skhawaja@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org> |
||
|
|
6fc9baa49d
|
nsfs: update tools header
Ensure all the new uapi bits are visible for the selftests. Link: https://patch.msgid.link/20251029-work-namespace-nstree-listns-v4-21-2e6f823ebdc0@kernel.org Tested-by: syzbot@syzkaller.appspotmail.com Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org> |
||
|
|
549042f167 |
tools headers asm: Sync fls headers header with the kernel sources
To pick the changes in:
6606c8c7e8188656 ("bitops: Add __attribute_const__ to generic ffs()-family implementations")
This addresses these tools build warnings:
Warning: Kernel ABI header differences:
diff -u tools/include/asm-generic/bitops/__fls.h include/asm-generic/bitops/__fls.h
diff -u tools/include/asm-generic/bitops/fls.h include/asm-generic/bitops/fls.h
diff -u tools/include/asm-generic/bitops/fls64.h include/asm-generic/bitops/fls64.h
Please see tools/include/uapi/README for further details.
Cc: Kees Cook <kees@kernel.org>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
||
|
|
29e4d12a29 |
tools headers UAPI: Sync linux/kvm.h with the kernel sources
To pick the changes in:
fe2bf6234e947bf5 ("KVM: guest_memfd: Add INIT_SHARED flag, reject user page faults if not set")
d2042d8f96ddefde ("KVM: Rework KVM_CAP_GUEST_MEMFD_MMAP into KVM_CAP_GUEST_MEMFD_FLAGS")
3d3a04fad25a6621 ("KVM: Allow and advertise support for host mmap() on guest_memfd files")
That just rebuilds perf, as these patches don't add any new KVM ioctl to
be harvested for the 'perf trace' ioctl syscall argument beautifiers.
This addresses this perf build warning:
Warning: Kernel ABI header differences:
diff -u tools/include/uapi/linux/kvm.h include/uapi/linux/kvm.h
Please see tools/include/uapi/README for further details.
Cc: Sean Christopherson <seanjc@google.com>
Cc: Fuad Tabba <tabba@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
||
|
|
f8950b47db |
tools headers UAPI: Update tools's copy of drm.h to pick DRM_IOCTL_GEM_CHANGE_HANDLE
Picking the changes from:
0864197382fa7c8c ("drm: Move drm_gem ioctl kerneldoc to uapi file")
53096728b8910c69 ("drm: Add DRM prime interface to reassign GEM handle")
Addressing these perf build warnings:
Warning: Kernel ABI header differences:
Now 'perf trace' and other code that might use the tools/perf/trace/beauty
autogenerated tables will be able to translate this new ioctl command into
a string:
$ tools/perf/trace/beauty/drm_ioctl.sh > before
$ cp include/uapi/drm/drm.h tools/include/uapi/drm/drm.h
$ tools/perf/trace/beauty/drm_ioctl.sh > after
$ diff -u before after
--- before 2025-11-03 09:57:34.832553174 -0300
+++ after 2025-11-03 09:57:47.969409428 -0300
@@ -111,6 +111,7 @@
[0xCF] = "SYNCOBJ_EVENTFD",
[0xD0] = "MODE_CLOSEFB",
[0xD1] = "SET_CLIENT_NAME",
+ [0xD2] = "GEM_CHANGE_HANDLE",
[DRM_COMMAND_BASE + 0x00] = "I915_INIT",
[DRM_COMMAND_BASE + 0x01] = "I915_FLUSH",
[DRM_COMMAND_BASE + 0x02] = "I915_FLIP",
$
Please see tools/include/uapi/README for further details.
Cc: Christian König <christian.koenig@amd.com>
Cc: David Francis <David.Francis@amd.com>
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
|
||
|
|
7534b9bfe6 |
tools/nolibc: clean up outdated comments in generic arch.h
Along the code reorganizations, the file has been keeping the original comments about argv and envp which are no longer relevant to this file anymore. Let's just drop them. Acked-by: Thomas Weißschuh <linux@weissschuh.net> Signed-off-by: Willy Tarreau <w@1wt.eu> |
||
|
|
1868c027b6 |
tools/nolibc: make the "headers" target install all supported archs
The efforts we go through by installing a single arch are counter productive when the base directory already supports them all, and the arch-specific files are really small. Let's make the "headers" target simply install headers for all supported archs and stop trying to build a hybrid "arch.h" file on the fly, to instead keep the generic one. Now the same nolibc headers installation will be usable with any arch-specific uapi installation. Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> |
||
|
|
44bf8bbe29 |
tools/nolibc: add the more portable inttypes.h
It's often recommended to only use inttypes.h instead of stdint.h for portability reasons since the former is always present when the latter is present, but not conversely, and the former includes the latter. Due to this some simple programs fail to build when including inttypes.h. Let's add one that simply includes nolibc.h to better support these programs. Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> |
||
|
|
10f407c660 |
tools/nolibc: provide the portable sys/select.h
Modern programs tend to include sys/select.h to get FD_SET() and FD_CLR() definitions as well as struct fd_set, but in our case it didn't exist. The definitions were moved from types.h to sys/select.h, which is now included from nolibc.h, and the sys_select() definition moved there as well from sys.h. Signed-off-by: Willy Tarreau <w@1wt.eu> [Thomas: adapt to current -next branch] Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> |
||
|
|
db75042e93 |
tools/nolibc: add missing memchr() to string.h
Surprisingly we forgot to add this common one. It was added with a per-arch guard allowing to later implement it in arch-specific asm code like was done for a few other ones. The test verifies that we don't search past the indicated length. Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> |
||
|
|
09c873c91f |
tools/nolibc: fix misleading help message regarding installation path
The help message says the headers are going to be installed into tools/include/nolibc but this is only the default if $OUTPUT is not set, so better clarify this (the current value of $OUTPUT is already shown). Signed-off-by: Willy Tarreau <w@1wt.eu> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> |
||
|
|
4bb30188c7 |
tools/nolibc: add uio.h with readv and writev
This is generally useful and struct iovec is also needed for other purposes such as ptrace. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> |
||
|
|
3d66c4e14f |
tools/nolibc: add option to disable runtime
In principle, it is possible to use nolibc for only some object files in a program. In that case, the startup code in _start and _start_c is not going to be used. Add the NOLIBC_NO_RUNTIME compile time option to disable it entirely and also remove anything that depends on it. Doing this avoids warnings from modpost for UML as the _start_c code references the main function from the .init.text section while it is not inside .init itself. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> |
||
|
|
2cb6cc8361 |
tools/nolibc: use __fallthrough__ rather than fallthrough
Use the version of the attribute with underscores to avoid issues if fallthrough has been defined by another header file already. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> |
||
|
|
fbd1b7f6b3 |
tools/nolibc: implement %m if errno is not defined
For improved compatibility, print %m as "unknown error" when nolibc is compiled using NOLIBC_IGNORE_ERRNO. Signed-off-by: Benjamin Berg <benjamin.berg@intel.com> Signed-off-by: Thomas Weißschuh <linux@weissschuh.net> |
||
|
|
4ada5679f1 |
tools/nolibc/dirent: avoid errno in readdir_r
Using errno is not possible when NOLIBC_IGNORE_ERRNO is set. Use
sys_lseek instead of lseek as that avoids using errno.
Fixes: 665fa8dea90d ("tools/nolibc: add support for directory access")
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
|
||
|
|
c485ca3aff |
tools/nolibc/stdio: let perror work when NOLIBC_IGNORE_ERRNO is set
There is no errno variable when NOLIBC_IGNORE_ERRNO is defined. As such,
simply print the message with "unknown error" rather than the integer
value of errno.
Fixes: acab7bcdb1bc ("tools/nolibc/stdio: add perror() to report the errno value")
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
|
||
|
|
089c0a9853 |
tools/nolibc: remove outdated comment about __sysret() in mmap()
Since commit fb01ff635efd ("tools/nolibc: keep brk(), sbrk(), mmap()
away from __sysret()") the implementation of mmap() does not use the
__sysret() macro anymore.
Remove the outdated comment.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
|
||
|
|
c69993ecdd |
perf: Support deferred user unwind
Add support for deferred userspace unwind to perf. Where perf currently relies on in-place stack unwinding; from NMI context and all that. This moves the userspace part of the unwind to right before the return-to-userspace. This has two distinct benefits, the biggest is that it moves the unwind to a faultable context. It becomes possible to fault in debug info (.eh_frame, SFrame etc.) that might not otherwise be readily available. And secondly, it de-duplicates the user callchain where multiple samples happen during the same kernel entry. To facilitate this the perf interface is extended with a new record type: PERF_RECORD_CALLCHAIN_DEFERRED and two new attribute flags: perf_event_attr::defer_callchain - to request the user unwind be deferred perf_event_attr::defer_output - to request PERF_RECORD_CALLCHAIN_DEFERRED records The existing PERF_RECORD_SAMPLE callchain section gets a new context type: PERF_CONTEXT_USER_DEFERRED After which will come a single entry, denoting the 'cookie' of the deferred callchain that should be attached here, matching the 'cookie' field of the above mentioned PERF_RECORD_CALLCHAIN_DEFERRED. The 'defer_callchain' flag is expected on all events with PERF_SAMPLE_CALLCHAIN. The 'defer_output' flag is expect on the event responsible for collecting side-band events (like mmap, comm etc.). Setting 'defer_output' on multiple events will get you duplicated PERF_RECORD_CALLCHAIN_DEFERRED records. Based on earlier patches by Josh and Steven. Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251023150002.GR4067720@noisy.programming.kicks-ass.net |
||
|
|
feeaf1346f |
bpf: Add overwrite mode for BPF ring buffer
When the BPF ring buffer is full, a new event cannot be recorded until one
or more old events are consumed to make enough space for it. In cases such
as fault diagnostics, where recent events are more useful than older ones,
this mechanism may lead to critical events being lost.
So add overwrite mode for BPF ring buffer to address it. In this mode, the
new event overwrites the oldest event when the buffer is full.
The basic idea is as follows:
1. producer_pos tracks the next position to record new event. When there
is enough free space, producer_pos is simply advanced by producer to
make space for the new event.
2. To avoid waiting for consumer when the buffer is full, a new variable,
overwrite_pos, is introduced for producer. It points to the oldest event
committed in the buffer. It is advanced by producer to discard one or more
oldest events to make space for the new event when the buffer is full.
3. pending_pos tracks the oldest event to be committed. pending_pos is never
passed by producer_pos, so multiple producers never write to the same
position at the same time.
The following example diagrams show how it works in a 4096-byte ring buffer.
1. At first, {producer,overwrite,pending,consumer}_pos are all set to 0.
0 512 1024 1536 2048 2560 3072 3584 4096
+-----------------------------------------------------------------------+
| |
| |
| |
+-----------------------------------------------------------------------+
^
|
|
producer_pos = 0
overwrite_pos = 0
pending_pos = 0
consumer_pos = 0
2. Now reserve a 512-byte event A.
There is enough free space, so A is allocated at offset 0. And producer_pos
is advanced to 512, the end of A. Since A is not submitted, the BUSY bit is
set.
0 512 1024 1536 2048 2560 3072 3584 4096
+-----------------------------------------------------------------------+
| | |
| A | |
| [BUSY] | |
+-----------------------------------------------------------------------+
^ ^
| |
| |
| producer_pos = 512
|
overwrite_pos = 0
pending_pos = 0
consumer_pos = 0
3. Reserve event B, size 1024.
B is allocated at offset 512 with BUSY bit set, and producer_pos is advanced
to the end of B.
0 512 1024 1536 2048 2560 3072 3584 4096
+-----------------------------------------------------------------------+
| | | |
| A | B | |
| [BUSY] | [BUSY] | |
+-----------------------------------------------------------------------+
^ ^
| |
| |
| producer_pos = 1536
|
overwrite_pos = 0
pending_pos = 0
consumer_pos = 0
4. Reserve event C, size 2048.
C is allocated at offset 1536, and producer_pos is advanced to 3584.
0 512 1024 1536 2048 2560 3072 3584 4096
+-----------------------------------------------------------------------+
| | | | |
| A | B | C | |
| [BUSY] | [BUSY] | [BUSY] | |
+-----------------------------------------------------------------------+
^ ^
| |
| |
| producer_pos = 3584
|
overwrite_pos = 0
pending_pos = 0
consumer_pos = 0
5. Submit event A.
The BUSY bit of A is cleared. B becomes the oldest event to be committed, so
pending_pos is advanced to 512, the start of B.
0 512 1024 1536 2048 2560 3072 3584 4096
+-----------------------------------------------------------------------+
| | | | |
| A | B | C | |
| | [BUSY] | [BUSY] | |
+-----------------------------------------------------------------------+
^ ^ ^
| | |
| | |
| pending_pos = 512 producer_pos = 3584
|
overwrite_pos = 0
consumer_pos = 0
6. Submit event B.
The BUSY bit of B is cleared, and pending_pos is advanced to the start of C,
which is now the oldest event to be committed.
0 512 1024 1536 2048 2560 3072 3584 4096
+-----------------------------------------------------------------------+
| | | | |
| A | B | C | |
| | | [BUSY] | |
+-----------------------------------------------------------------------+
^ ^ ^
| | |
| | |
| pending_pos = 1536 producer_pos = 3584
|
overwrite_pos = 0
consumer_pos = 0
7. Reserve event D, size 1536 (3 * 512).
There are 2048 bytes not being written between producer_pos (currently 3584)
and pending_pos, so D is allocated at offset 3584, and producer_pos is advanced
by 1536 (from 3584 to 5120).
Since event D will overwrite all bytes of event A and the first 512 bytes of
event B, overwrite_pos is advanced to the start of event C, the oldest event
that is not overwritten.
0 512 1024 1536 2048 2560 3072 3584 4096
+-----------------------------------------------------------------------+
| | | | |
| D End | | C | D Begin|
| [BUSY] | | [BUSY] | [BUSY] |
+-----------------------------------------------------------------------+
^ ^ ^
| | |
| | pending_pos = 1536
| | overwrite_pos = 1536
| |
| producer_pos=5120
|
consumer_pos = 0
8. Reserve event E, size 1024.
Although there are 512 bytes not being written between producer_pos and
pending_pos, E cannot be reserved, as it would overwrite the first 512
bytes of event C, which is still being written.
9. Submit event C and D.
pending_pos is advanced to the end of D.
0 512 1024 1536 2048 2560 3072 3584 4096
+-----------------------------------------------------------------------+
| | | | |
| D End | | C | D Begin|
| | | | |
+-----------------------------------------------------------------------+
^ ^ ^
| | |
| | overwrite_pos = 1536
| |
| producer_pos=5120
| pending_pos=5120
|
consumer_pos = 0
The performance data for overwrite mode will be provided in a follow-up
patch that adds overwrite-mode benchmarks.
A sample of performance data for non-overwrite mode, collected on an x86_64
CPU and an arm64 CPU, before and after this patch, is shown below. As we can
see, no obvious performance regression occurs.
- x86_64 (AMD EPYC 9654)
Before:
Ringbuf, multi-producer contention
==================================
rb-libbpf nr_prod 1 11.623 ± 0.027M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 2 15.812 ± 0.014M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 3 7.871 ± 0.003M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 4 6.703 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 8 2.896 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 12 2.054 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 16 1.864 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 20 1.580 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 24 1.484 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 28 1.369 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 32 1.316 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 36 1.272 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 40 1.239 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 44 1.226 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 48 1.213 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 52 1.193 ± 0.001M/s (drops 0.000 ± 0.000M/s)
After:
Ringbuf, multi-producer contention
==================================
rb-libbpf nr_prod 1 11.845 ± 0.036M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 2 15.889 ± 0.006M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 3 8.155 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 4 6.708 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 8 2.918 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 12 2.065 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 16 1.870 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 20 1.582 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 24 1.482 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 28 1.372 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 32 1.323 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 36 1.264 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 40 1.236 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 44 1.209 ± 0.002M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 48 1.189 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 52 1.165 ± 0.002M/s (drops 0.000 ± 0.000M/s)
- arm64 (HiSilicon Kunpeng 920)
Before:
Ringbuf, multi-producer contention
==================================
rb-libbpf nr_prod 1 11.310 ± 0.623M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 2 9.947 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 3 6.634 ± 0.011M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 4 4.502 ± 0.003M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 8 3.888 ± 0.003M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 12 3.372 ± 0.005M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 16 3.189 ± 0.010M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 20 2.998 ± 0.006M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 24 3.086 ± 0.018M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 28 2.845 ± 0.004M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 32 2.815 ± 0.008M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 36 2.771 ± 0.009M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 40 2.814 ± 0.011M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 44 2.752 ± 0.006M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 48 2.695 ± 0.006M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 52 2.710 ± 0.006M/s (drops 0.000 ± 0.000M/s)
After:
Ringbuf, multi-producer contention
==================================
rb-libbpf nr_prod 1 11.283 ± 0.550M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 2 9.993 ± 0.003M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 3 6.898 ± 0.006M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 4 5.257 ± 0.001M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 8 3.830 ± 0.005M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 12 3.528 ± 0.013M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 16 3.265 ± 0.018M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 20 2.990 ± 0.007M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 24 2.929 ± 0.014M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 28 2.898 ± 0.010M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 32 2.818 ± 0.006M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 36 2.789 ± 0.012M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 40 2.770 ± 0.006M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 44 2.651 ± 0.007M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 48 2.669 ± 0.005M/s (drops 0.000 ± 0.000M/s)
rb-libbpf nr_prod 52 2.695 ± 0.009M/s (drops 0.000 ± 0.000M/s)
Signed-off-by: Xu Kuohai <xukuohai@huawei.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20251018035738.4039621-2-xukuohai@huaweicloud.com
|
||
|
|
531b87d865 |
bpf: widen dynptr size/offset to 64 bit
Dynptr currently caps size and offset at 24 bits, which isn’t sufficient for file-backed use cases; even 32 bits can be limiting. Refactor dynptr helpers/kfuncs to use 64-bit size and offset, ensuring consistency across the APIs. This change does not affect internals of xdp, skb or other dynptrs, which continue to behave as before. Also it does not break binary compatibility. The widening enables large-file access support via dynptr, implemented in the next patches. Signed-off-by: Mykyta Yatsenko <yatsenko@meta.com> Acked-by: Eduard Zingerman <eddyz87@gmail.com> Link: https://lore.kernel.org/r/20251026203853.135105-3-mykyta.yatsenko5@gmail.com Signed-off-by: Alexei Starovoitov <ast@kernel.org> |
||
|
|
2602949b22 |
tools/nolibc: x86: fix section mismatch caused by asm "mem*" functions
I recently got occasional build failures at -Os or -Oz that would always
involve waitpid(), where the assembler would complain about this:
init.s: Error: .size expression for waitpid.constprop.0 does not evaluate to a constant
And without -fno-asynchronous-unwind-tables it could also spit such
errors:
init.s:836: Error: CFI instruction used without previous .cfi_startproc
init.s:838: Error: .cfi_endproc without corresponding .cfi_startproc
init.s: Error: open CFI at the end of file; missing .cfi_endproc directive
A trimmed down reproducer is as simple as this:
int main(int argc, char **argv)
{
int ret, status;
if (argc == 0)
ret = waitpid(-1, &status, 0);
else
ret = waitpid(-1, &status, 0);
return status;
}
It produces the following asm code on x86_64:
.text
.section .text.nolibc_memmove_memcpy
.weak memmove
.weak memcpy
memmove:
memcpy:
movq %rdx, %rcx
(...)
retq
.section .text.nolibc_memset
.weak memset
memset:
xchgl %eax, %esi
movq %rdx, %rcx
pushq %rdi
rep stosb
popq %rax
retq
.type waitpid.constprop.0.isra.0, @function
waitpid.constprop.0.isra.0:
subq $8, %rsp
(...)
jmp *.L5(,%rax,8)
.section .rodata
.align 8
.align 4
.L5:
.quad .L10
(...)
.quad .L4
.text
.L10:
(...)
.cfi_def_cfa_offset 8
ret
.cfi_endproc
.LFE273:
.size waitpid.constprop.0.isra.0, .-waitpid.constprop.0.isra.0
It's a bit dense, but here's the explanation: the compiler has emitted a
".text" statement because it knows it's working in the .text section.
Then, our hand-written asm code for the mem* functions forced the section
to .text.something without the compiler knowing about it, so it thinks
the code is still being emitted for .text. As such, without any .section
statement, the waitpid.constprop.0.isra.0 label is in fact placed in the
previously created section, here .text.nolibc_memset.
The waitpid() function involves a switch/case statement that can be
turned to a jump table, which is what the compiler does with the .rodata
section, and after that it restores .text, which is no longer the
previous .text.nolibc_memset section. Then the CFI statements cross a
section, so does the .size calculation, which explains the error.
While a first approach consisting in placing an explicit ".text" at the
end of these functions was verified to work, it's still unreliable as
it depends on what the compiler remembers having emitted previously. A
better approach is to replace the ".section" with ".pushsection", and
place a ".popsection" at the end, so that these code blocks are agnostic
to where they're placed relative to other blocks.
Fixes: 553845eebd60 ("tools/nolibc: x86-64: Use `rep movsb` for `memcpy()` and `memmove()`")
Fixes: 12108aa8c1a1 ("tools/nolibc: x86-64: Use `rep stosb` for `memset()`")
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
|
||
|
|
38163af068 |
bpf: Introduce SK_BPF_BYPASS_PROT_MEM.
If a socket has sk->sk_bypass_prot_mem flagged, the socket opts out
of the global protocol memory accounting.
This is easily controlled by net.core.bypass_prot_mem sysctl, but it
lacks flexibility.
Let's support flagging (and clearing) sk->sk_bypass_prot_mem via
bpf_setsockopt() at the BPF_CGROUP_INET_SOCK_CREATE hook.
int val = 1;
bpf_setsockopt(ctx, SOL_SOCKET, SK_BPF_BYPASS_PROT_MEM,
&val, sizeof(val));
As with net.core.bypass_prot_mem, this is inherited to child sockets,
and BPF always takes precedence over sysctl at socket(2) and accept(2).
SK_BPF_BYPASS_PROT_MEM is only supported at BPF_CGROUP_INET_SOCK_CREATE
and not supported on other hooks for some reasons:
1. UDP charges memory under sk->sk_receive_queue.lock instead
of lock_sock()
2. Modifying the flag after skb is charged to sk requires such
adjustment during bpf_setsockopt() and complicates the logic
unnecessarily
We can support other hooks later if a real use case justifies that.
Most changes are inline and hard to trace, but a microbenchmark on
__sk_mem_raise_allocated() during neper/tcp_stream showed that more
samples completed faster with sk->sk_bypass_prot_mem == 1. This will
be more visible under tcp_mem pressure (but it's not a fair comparison).
# bpftrace -e 'kprobe:__sk_mem_raise_allocated { @start[tid] = nsecs; }
kretprobe:__sk_mem_raise_allocated /@start[tid]/
{ @end[tid] = nsecs - @start[tid]; @times = hist(@end[tid]); delete(@start[tid]); }'
# tcp_stream -6 -F 1000 -N -T 256
Without bpf prog:
[128, 256) 3846 | |
[256, 512) 1505326 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K) 1371006 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[1K, 2K) 198207 |@@@@@@ |
[2K, 4K) 31199 |@ |
With bpf prog in the next patch:
(must be attached before tcp_stream)
# bpftool prog load sk_bypass_prot_mem.bpf.o /sys/fs/bpf/test type cgroup/sock_create
# bpftool cgroup attach /sys/fs/cgroup/test cgroup_inet_sock_create pinned /sys/fs/bpf/test
[128, 256) 6413 | |
[256, 512) 1868425 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K) 1101697 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@ |
[1K, 2K) 117031 |@@@@ |
[2K, 4K) 11773 | |
Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
Acked-by: Roman Gushchin <roman.gushchin@linux.dev>
Link: https://patch.msgid.link/20251014235604.3057003-6-kuniyu@google.com
|
||
|
|
dd590d4d57 |
objtool/klp: Introduce klp diff subcommand for diffing object files
Add a new klp diff subcommand which performs a binary diff between two
object files and extracts changed functions into a new object which can
then be linked into a livepatch module.
This builds on concepts from the longstanding out-of-tree kpatch [1]
project which began in 2012 and has been used for many years to generate
livepatch modules for production kernels. However, this is a complete
rewrite which incorporates hard-earned lessons from 12+ years of
maintaining kpatch.
Key improvements compared to kpatch-build:
- Integrated with objtool: Leverages objtool's existing control-flow
graph analysis to help detect changed functions.
- Works on vmlinux.o: Supports late-linked objects, making it
compatible with LTO, IBT, and similar.
- Simplified code base: ~3k fewer lines of code.
- Upstream: No more out-of-tree #ifdef hacks, far less cruft.
- Cleaner internals: Vastly simplified logic for symbol/section/reloc
inclusion and special section extraction.
- Robust __LINE__ macro handling: Avoids false positive binary diffs
caused by the __LINE__ macro by introducing a fix-patch-lines script
(coming in a later patch) which injects #line directives into the
source .patch to preserve the original line numbers at compile time.
Note the end result of this subcommand is not yet functionally complete.
Livepatch needs some ELF magic which linkers don't like:
- Two relocation sections (.rela*, .klp.rela*) for the same text
section.
- Use of SHN_LIVEPATCH to mark livepatch symbols.
Unfortunately linkers tend to mangle such things. To work around that,
klp diff generates a linker-compliant intermediate binary which encodes
the relevant KLP section/reloc/symbol metadata.
After module linking, a klp post-link step (coming soon) will clean up
the mess and convert the linked .ko into a fully compliant livepatch
module.
Note this subcommand requires the diffed binaries to have been compiled
with -ffunction-sections and -fdata-sections, and processed with
'objtool --checksum'. Those constraints will be handled by a klp-build
script introduced in a later patch.
Without '-ffunction-sections -fdata-sections', reliable object diffing
would be infeasible due to toolchain limitations:
- For intra-file+intra-section references, the compiler might
occasionally generated hard-coded instruction offsets instead of
relocations.
- Section-symbol-based references can be ambiguous:
- Overlapping or zero-length symbols create ambiguity as to which
symbol is being referenced.
- A reference to the end of a symbol (e.g., checking array bounds)
can be misinterpreted as a reference to the next symbol, or vice
versa.
A potential future alternative to '-ffunction-sections -fdata-sections'
would be to introduce a toolchain option that forces symbol-based
(non-section) relocations.
Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
|
||
|
|
58f36a5756 |
objtool: Add ANNOTATE_DATA_SPECIAL
In preparation for the objtool klp diff subcommand, add an ANNOTATE_DATA_SPECIAL macro which annotates special section entries so that objtool can determine their size and location and extract them when needed. Acked-by: Petr Mladek <pmladek@suse.com> Tested-by: Joe Lawrence <joe.lawrence@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> |
||
|
|
b37491d72b |
interval_tree: Fix ITSTATIC usage for *_subtree_search()
For consistency with the other function templates, change _subtree_search_*() to use the user-supplied ITSTATIC rather than the hard-coded 'static'. Acked-by: Petr Mladek <pmladek@suse.com> Tested-by: Joe Lawrence <joe.lawrence@redhat.com> Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org> |
||
|
|
9b7eacac22 |
interval_tree: Sync interval_tree_generic.h with tools
The following commit made an improvement to interval_tree_generic.h, but
didn't sync it to the tools copy:
19811285784f ("lib/interval_tree: skip the check before go to the right subtree")
Sync it, and add it to objtool's sync-check.sh so they are more likely
to stay in sync going forward.
Acked-by: Petr Mladek <pmladek@suse.com>
Tested-by: Joe Lawrence <joe.lawrence@redhat.com>
Signed-off-by: Josh Poimboeuf <jpoimboe@kernel.org>
|
||
|
|
812f223fe9 |
tools/nolibc: handle NULL wstatus argument to waitpid()
wstatus is allowed to be NULL. Avoid a segmentation fault in this case.
Fixes: 0c89abf5ab3f ("tools/nolibc: implement waitpid() in terms of waitid()")
Signed-off-by: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Acked-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
|
||
|
|
9591fdb061 |
- Remove a bunch of asm implementing condition flags testing in KVM's
emulator in favor of int3_emulate_jcc() which is written in C - Replace KVM fastops with C-based stubs which avoids problems with the fastop infra related to latter not adhering to the C ABI due to their special calling convention and, more importantly, bypassing compiler control-flow integrity checking because they're written in asm - Remove wrongly used static branches and other ugliness accumulated over time in hyperv's hypercall implementation with a proper static function call to the correct hypervisor call variant - Add some fixes and modifications to allow running FRED-enabled kernels in KVM even on non-FRED hardware - Add kCFI improvements like validating indirect calls and prepare for enabling kCFI with GCC. Add cmdline params documentation and other code cleanups - Use the single-byte 0xd6 insn as the official #UD single-byte undefined opcode instruction as agreed upon by both x86 vendors - Other smaller cleanups and touchups all over the place -----BEGIN PGP SIGNATURE----- iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmjqXxkACgkQEsHwGGHe VUq9QBAAsjaay99a1+Dc53xyP1/HzCUFZDOzEYhj9zF85I8/xA9vTXZr7Qg2m6os +4EEmnlwU43AR5KgwGJcuszLF9qSqTMz5qkAdFpvnoQ1Hbc8b49A+3yo9/hM7NA2 gPGH0gVZVBcffoETiQ8tJN6C9H6Ec0nTZwKTbasWwxz5oUAw+ppjP+aF4rFQ2/5w b1ofrcga5yucjvSlXjBOEwHvd21l7O9iMre1oGEn6b0E2LU8ldToRkJkVZIhkWeL 2Iq3gYtVNN4Ao06WbV/EfXAqg5HWXjcm5bLcUXDtSF+Blae+gWoCjrT7XQdQGyEq J12l4FbIZk5Ha8eWAC425ye9i3Wwo+oie3Cc4SVCMdv5A+AmOF0ijAlo1hcxq0rX eGNWm8BKJOJ9zz1kxLISO7CfjULKgpsXLabF5a19uwoCsQgj5YrhlJezaIKHXbnK OWwHWg9IuRkN2KLmJa7pXtHkuAHp4MtEV9TP9kU2WCvCInrNrzp3gYtds3pri82c 8ove+WA3yb/AQ6RCq5vAMLYXBxMRbN7FrmY5ZuwgWJTMi6cp1Sp02mhobwJOgNhO H7nKWCZnQMyCLPzVeg97HTSgqSXw13dSrujWX9gWYVWBMfZO1B9HcUrhtiOhH7Q9 cvELkcqaxKrCKdRHLLYgHeMIQU2tdpsQ5TXHm7C7liEcZPZpk+g= =3Otb -----END PGP SIGNATURE----- Merge tag 'x86_core_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull more x86 updates from Borislav Petkov: - Remove a bunch of asm implementing condition flags testing in KVM's emulator in favor of int3_emulate_jcc() which is written in C - Replace KVM fastops with C-based stubs which avoids problems with the fastop infra related to latter not adhering to the C ABI due to their special calling convention and, more importantly, bypassing compiler control-flow integrity checking because they're written in asm - Remove wrongly used static branches and other ugliness accumulated over time in hyperv's hypercall implementation with a proper static function call to the correct hypervisor call variant - Add some fixes and modifications to allow running FRED-enabled kernels in KVM even on non-FRED hardware - Add kCFI improvements like validating indirect calls and prepare for enabling kCFI with GCC. Add cmdline params documentation and other code cleanups - Use the single-byte 0xd6 insn as the official #UD single-byte undefined opcode instruction as agreed upon by both x86 vendors - Other smaller cleanups and touchups all over the place * tag 'x86_core_for_v6.18_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits) x86,retpoline: Optimize patch_retpoline() x86,ibt: Use UDB instead of 0xEA x86/cfi: Remove __noinitretpoline and __noretpoline x86/cfi: Add "debug" option to "cfi=" bootparam x86/cfi: Standardize on common "CFI:" prefix for CFI reports x86/cfi: Document the "cfi=" bootparam options x86/traps: Clarify KCFI instruction layout compiler_types.h: Move __nocfi out of compiler-specific header objtool: Validate kCFI calls x86/fred: KVM: VMX: Always use FRED for IRQs when CONFIG_X86_FRED=y x86/fred: Play nice with invoking asm_fred_entry_from_kvm() on non-FRED hardware x86/fred: Install system vector handlers even if FRED isn't fully enabled x86/hyperv: Use direct call to hypercall-page x86/hyperv: Clean up hv_do_hypercall() KVM: x86: Remove fastops KVM: x86: Convert em_salc() to C KVM: x86: Introduce EM_ASM_3WCL KVM: x86: Introduce EM_ASM_1SRC2 KVM: x86: Introduce EM_ASM_2CL KVM: x86: Introduce EM_ASM_2W ... |