Commit Graph

1336452 Commits

Author SHA1 Message Date
Linus Torvalds b8a1c9f4b7 Merge tag 's390-6.14-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull s390 fixes from Alexander Gordeev:

 - Architecutre-specific ftrace recursion trylock tests were removed in
   favour of the generic function_graph_enter(), but s390 got missed.

   Remove this test for s390 as well.

 - Add ftrace_get_symaddr() for s390, which returns the symbol address
   from ftrace 'ip' parameter

* tag 's390-6.14-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/tracing: Define ftrace_get_symaddr() for s390
  s390/fgraph: Fix to remove ftrace_test_recursion_trylock()
2025-01-30 10:53:49 -08:00
Linus Torvalds b731bc5f49 Merge tag 's390-6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux
Pull more s390 updates from Alexander Gordeev:

 - The rework that uncoupled physical and virtual address spaces
   inadvertently prevented KASAN shadow mappings from using large pages.
   Restore large page mappings for KASAN shadows

 - Add decompressor routine physmem_alloc() that may fail, unlike
   physmem_alloc_or_die(). This allows callers to implement fallback
   paths

 - Allow falling back from large pages to smaller pages (1MB or 4KB) if
   the allocation of 2GB pages in the decompressor can not be fulfilled

 - Add to the decompressor boot print support of "%%" format string,
   width and padding hadnling, length modifiers and decimal conversion
   specifiers

 - Add to the decompressor message severity levels similar to kernel
   ones. Support command-line options that control console output
   verbosity

 - Replaces boot_printk() calls with appropriate loglevel- specific
   helpers such as boot_emerg(), boot_warn(), and boot_debug().

 - Collect all boot messages into a ring buffer independent of the
   current log level. This is particularly useful for early crash
   analysis

 - If 'earlyprintk' command line parameter is not specified, store
   decompressor boot messages in a ring buffer to be printed later by
   the kernel, once the console driver is registered

 - Add 'bootdebug' command line parameter to enable printing of
   decompressor debug messages when needed. That parameters allows
   message suppressing and filtering

 - Dump boot messages on a decompressor crash, but only if 'bootdebug'
   command line parameter is enabled

 - When CONFIG_PRINTK_TIME is enabled, add timestamps to boot messages
   in the same format as regular printk()

 - Dump physical memory tracking information on boot: online ranges,
   reserved areas and vmem allocations

 - Dump virtual memory layout and randomization details

 - Improve decompression error reporting and dump the message ring
   buffer in case the boot failed and system halted

 - Add an exception handler which handles exceptions when FPU control
   register is attempted to be set to an invalid value. Remove '.fixup'
   section as result of this change

 - Use 'A', 'O', and 'R' inline assembly format flags, which allows
   recent Clang compilers to generate better FPU code

 - Rework uaccess code so it reads better and generates more efficient
   code

 - Cleanup futex inline assembly code

 - Disable KMSAN instrumention for futex inline assemblies, which
   contain dereferenced user pointers. Otherwise, shadows for the user
   pointers would be accessed

 - PFs which are not initially configured but in standby create only a
   single-function PCI domain. If they are configured later on, sibling
   PFs and their child VFs will not be added to their PCI domain
   breaking SR-IOV expectations.

   Fix that by allowing initially configured but in standby PFs create
   multi-function PCI domains

 - Add '-std=gnu11' to decompressor and purgatory CFLAGS to avoid
   compile errors caused by kernel's own definitions of 'bool', 'false',
   and 'true' conflicting with the C23 reserved keywords

 - Fix sclp subsystem failure when a sclp console is not present

 - Fix misuse of non-NULL terminated strings in vmlogrdr driver

 - Various other small improvements, cleanups and fixes

* tag 's390-6.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: (53 commits)
  s390/vmlogrdr: Use array instead of string initializer
  s390/vmlogrdr: Use internal_name for error messages
  s390/sclp: Initialize sclp subsystem via arch_cpu_finalize_init()
  s390/tools: Use array instead of string initializer
  s390/vmem: Fix null-pointer-arithmetic warning in vmem_map_init()
  s390: Add '-std=gnu11' to decompressor and purgatory CFLAGS
  s390/bitops: Use correct constraint for arch_test_bit() inline assembly
  s390/pci: Fix SR-IOV for PFs initially in standby
  s390/futex: Avoid KMSAN instrumention for user pointers
  s390/uaccess: Rename get_put_user_noinstr_attributes to uaccess_kmsan_or_inline
  s390/futex: Cleanup futex_atomic_cmpxchg_inatomic()
  s390/futex: Generate futex atomic op functions
  s390/uaccess: Remove INLINE_COPY_FROM_USER and INLINE_COPY_TO_USER
  s390/uaccess: Use asm goto for put_user()/get_user()
  s390/uaccess: Remove usage of the oac specifier
  s390/uaccess: Replace EX_TABLE_UA_LOAD_MEM exception handling
  s390/uaccess: Cleanup noinstr __put_user()/__get_user() inline assembly constraints
  s390/uaccess: Remove __put_user_fn()/__get_user_fn() wrappers
  s390/uaccess: Move put_user() / __put_user() close to put_user() asm code
  s390/uaccess: Use asm goto for __mvc_kernel_nofault()
  ...
2025-01-30 10:48:17 -08:00
Linus Torvalds 90cb220062 Merge tag 'gpio-fixes-for-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux
Pull gpio fixes from Bartosz Golaszewski:

 - update gpio-sim selftests to not fail now that we no longer allow
   rmdir() on configfs entries of active devices

 - remove leftover code from gpio-mxc

* tag 'gpio-fixes-for-v6.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/brgl/linux:
  selftests: gpio: gpio-sim: Fix missing chip disablements
  gpio: mxc: remove dead code after switch to DT-only
2025-01-30 10:19:30 -08:00
Nathan Chancellor ee2ab467bd x86/boot: Use '-std=gnu11' to fix build with GCC 15
GCC 15 changed the default C standard version to C23, which should not
have impacted the kernel because it requests the gnu11 standard via
'-std=' in the main Makefile. However, the x86 compressed boot Makefile
uses its own set of KBUILD_CFLAGS without a '-std=' value (i.e., using
the default), resulting in errors from the kernel's definitions of bool,
true, and false in stddef.h, which are reserved keywords under C23.

  ./include/linux/stddef.h:11:9: error: expected identifier before ‘false’
     11 |         false   = 0,
  ./include/linux/types.h:35:33: error: two or more data types in declaration specifiers
     35 | typedef _Bool                   bool;

Set '-std=gnu11' in the x86 compressed boot Makefile to resolve the
error and consistently use the same C standard version for the entire
kernel.

Closes: https://lore.kernel.org/4OAhbllK7x4QJGpZjkYjtBYNLd_2whHx9oFiuZcGwtVR4hIzvduultkgfAIRZI3vQpZylu7Gl929HaYFRGeMEalWCpeMzCIIhLxxRhq4U-Y=@protonmail.com/
Closes: https://lore.kernel.org/Z4467umXR2PZ0M1H@tucnak/
Reported-by: Kostadin Shishmanov <kostadinshishmanov@protonmail.com>
Reported-by: Jakub Jelinek <jakub@redhat.com>
Signed-off-by: Nathan Chancellor <nathan@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Ard Biesheuvel <ardb@kernel.org>
Cc:stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250121-x86-use-std-consistently-gcc-15-v1-1-8ab0acf645cb%40kernel.org
2025-01-30 09:59:24 -08:00
Linus Torvalds d3d90cc289 Merge tag 'pull-revalidate' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs d_revalidate updates from Al Viro:
 "Provide stable parent and name to ->d_revalidate() instances

  Most of the filesystem methods where we care about dentry name and
  parent have their stability guaranteed by the callers;
  ->d_revalidate() is the major exception.

  It's easy enough for callers to supply stable values for expected name
  and expected parent of the dentry being validated. That kills quite a
  bit of boilerplate in ->d_revalidate() instances, along with a bunch
  of races where they used to access ->d_name without sufficient
  precautions"

* tag 'pull-revalidate' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  9p: fix ->rename_sem exclusion
  orangefs_d_revalidate(): use stable parent inode and name passed by caller
  ocfs2_dentry_revalidate(): use stable parent inode and name passed by caller
  nfs: fix ->d_revalidate() UAF on ->d_name accesses
  nfs{,4}_lookup_validate(): use stable parent inode passed by caller
  gfs2_drevalidate(): use stable parent inode and name passed by caller
  fuse_dentry_revalidate(): use stable parent inode and name passed by caller
  vfat_revalidate{,_ci}(): use stable parent inode passed by caller
  exfat_d_revalidate(): use stable parent inode passed by caller
  fscrypt_d_revalidate(): use stable parent inode passed by caller
  ceph_d_revalidate(): propagate stable name down into request encoding
  ceph_d_revalidate(): use stable parent inode passed by caller
  afs_d_revalidate(): use stable name and parent inode passed by caller
  Pass parent directory inode and expected name to ->d_revalidate()
  generic_ci_d_compare(): use shortname_storage
  ext4 fast_commit: make use of name_snapshot primitives
  dissolve external_name.u into separate members
  make take_dentry_name_snapshot() lockless
  dcache: back inline names with a struct-wrapped array of unsigned long
  make sure that DNAME_INLINE_LEN is a multiple of word size
2025-01-30 09:13:35 -08:00
Jakub Kicinski dfffaccffc Merge tag 'nf-25-01-30' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following batch contains one Netfilter fix:

1) Reject mismatching sum of field_len with set key length which allows
   to create a set without inconsistent pipapo rule width and set key
   length.

* tag 'nf-25-01-30' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
  netfilter: nf_tables: reject mismatching sum of field_len with set key length
====================

Link: https://patch.msgid.link/20250130113307.2327470-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-30 09:01:00 -08:00
Jakub Kicinski d7dda216ca MAINTAINERS: add Neal to TCP maintainers
Neal Cardwell has been indispensable in TCP reviews
and investigations, especially protocol-related.
Neal is also the author of packetdrill.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20250129191332.2526140-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-30 08:57:38 -08:00
Eric Dumazet e759e1e4a4 net: revert RTNL changes in unregister_netdevice_many_notify()
This patch reverts following changes:

83419b61d1 net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 2)
ae646f1a0b net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 1)
cfa579f666 net: no longer hold RTNL while calling flush_all_backlogs()

This caused issues in layers holding a private mutex:

cleanup_net()
  rtnl_lock();
	mutex_lock(subsystem_mutex);

	unregister_netdevice();

	   rtnl_unlock();		// LOCKDEP violation
	   rtnl_lock();

I will revisit this in next cycle, opt-in for the new behavior
from safe contexts only.

Fixes: cfa579f666 ("net: no longer hold RTNL while calling flush_all_backlogs()")
Fixes: ae646f1a0b ("net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 1)")
Fixes: 83419b61d1 ("net: reduce RTNL hold duration in unregister_netdevice_many_notify() (part 2)")
Reported-by: syzbot+5b9196ecf74447172a9a@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6789d55f.050a0220.20d369.004e.GAE@google.com/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://patch.msgid.link/20250129142726.747726-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-30 08:57:18 -08:00
Eric Dumazet 0f5697f1a3 net: hsr: fix fill_frame_info() regression vs VLAN packets
Stephan Wurm reported that my recent patch broke VLAN support.

Apparently skb->mac_len is not correct for VLAN traffic as
shown by debug traces [1].

Use instead pskb_may_pull() to make sure the expected header
is present in skb->head.

Many thanks to Stephan for his help.

[1]
kernel: skb len=170 headroom=2 headlen=170 tailroom=20
        mac=(2,14) mac_len=14 net=(16,-1) trans=-1
        shinfo(txflags=0 nr_frags=0 gso(size=0 type=0 segs=0))
        csum(0x0 start=0 offset=0 ip_summed=0 complete_sw=0 valid=0 level=0)
        hash(0x0 sw=0 l4=0) proto=0x0000 pkttype=0 iif=0
        priority=0x0 mark=0x0 alloc_cpu=0 vlan_all=0x0
        encapsulation=0 inner(proto=0x0000, mac=0, net=0, trans=0)
kernel: dev name=prp0 feat=0x0000000000007000
kernel: sk family=17 type=3 proto=0
kernel: skb headroom: 00000000: 74 00
kernel: skb linear:   00000000: 01 0c cd 01 00 01 00 d0 93 53 9c cb 81 00 80 00
kernel: skb linear:   00000010: 88 b8 00 01 00 98 00 00 00 00 61 81 8d 80 16 52
kernel: skb linear:   00000020: 45 47 44 4e 43 54 52 4c 2f 4c 4c 4e 30 24 47 4f
kernel: skb linear:   00000030: 24 47 6f 43 62 81 01 14 82 16 52 45 47 44 4e 43
kernel: skb linear:   00000040: 54 52 4c 2f 4c 4c 4e 30 24 44 73 47 6f 6f 73 65
kernel: skb linear:   00000050: 83 07 47 6f 49 64 65 6e 74 84 08 67 8d f5 93 7e
kernel: skb linear:   00000060: 76 c8 00 85 01 01 86 01 00 87 01 00 88 01 01 89
kernel: skb linear:   00000070: 01 00 8a 01 02 ab 33 a2 15 83 01 00 84 03 03 00
kernel: skb linear:   00000080: 00 91 08 67 8d f5 92 77 4b c6 1f 83 01 00 a2 1a
kernel: skb linear:   00000090: a2 06 85 01 00 83 01 00 84 03 03 00 00 91 08 67
kernel: skb linear:   000000a0: 8d f5 92 77 4b c6 1f 83 01 00
kernel: skb tailroom: 00000000: 80 18 02 00 fe 4e 00 00 01 01 08 0a 4f fd 5e d1
kernel: skb tailroom: 00000010: 4f fd 5e cd

Fixes: b9653d19e5 ("net: hsr: avoid potential out-of-bound access in fill_frame_info()")
Reported-by: Stephan Wurm <stephan.wurm@a-eberle.de>
Tested-by: Stephan Wurm <stephan.wurm@a-eberle.de>
Closes: https://lore.kernel.org/netdev/Z4o_UC0HweBHJ_cw@PC-LX-SteWu/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20250129130007.644084-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-30 08:56:59 -08:00
Kees Cook dce4aab844 kbuild: Use -fzero-init-padding-bits=all
GCC 15 introduces a regression in "= { 0 }" style initialization of
unions that Linux has depended on for eliminating uninitialized variable
contents. GCC does not seem likely to fix it[1], instead suggesting[2]
that affected projects start using -fzero-init-padding-bits=unions.

To avoid future surprises beyond just the current situation with unions,
enable -fzero-init-padding-bits=all when available (GCC 15+). This will
correctly zero padding bits in unions and structs that might have been
left uninitialized, and will make sure there is no immediate regression
in union initializations. As seen in the stackinit KUnit selftest union
cases, which were passing before, were failing under GCC 15:

    not ok 18 test_small_start_old_zero
    ok 29 test_small_start_dynamic_partial # SKIP XFAIL uninit bytes: 63
    ok 32 test_small_start_assigned_dynamic_partial # SKIP XFAIL uninit bytes: 63
    ok 67 test_small_start_static_partial # SKIP XFAIL uninit bytes: 63
    ok 70 test_small_start_static_all # SKIP XFAIL uninit bytes: 56
    ok 73 test_small_start_dynamic_all # SKIP XFAIL uninit bytes: 56
    ok 82 test_small_start_assigned_static_partial # SKIP XFAIL uninit bytes: 63
    ok 85 test_small_start_assigned_static_all # SKIP XFAIL uninit bytes: 56
    ok 88 test_small_start_assigned_dynamic_all # SKIP XFAIL uninit bytes: 56

The above all now pass again with -fzero-init-padding-bits=all added.

This also fixes the following cases for struct initialization that had
been XFAIL until now because there was no compiler support beyond the
larger "-ftrivial-auto-var-init=zero" option:

    ok 38 test_small_hole_static_all # SKIP XFAIL uninit bytes: 3
    ok 39 test_big_hole_static_all # SKIP XFAIL uninit bytes: 124
    ok 40 test_trailing_hole_static_all # SKIP XFAIL uninit bytes: 7
    ok 42 test_small_hole_dynamic_all # SKIP XFAIL uninit bytes: 3
    ok 43 test_big_hole_dynamic_all # SKIP XFAIL uninit bytes: 124
    ok 44 test_trailing_hole_dynamic_all # SKIP XFAIL uninit bytes: 7
    ok 58 test_small_hole_assigned_static_all # SKIP XFAIL uninit bytes: 3
    ok 59 test_big_hole_assigned_static_all # SKIP XFAIL uninit bytes: 124
    ok 60 test_trailing_hole_assigned_static_all # SKIP XFAIL uninit bytes: 7
    ok 62 test_small_hole_assigned_dynamic_all # SKIP XFAIL uninit bytes: 3
    ok 63 test_big_hole_assigned_dynamic_all # SKIP XFAIL uninit bytes: 124
    ok 64 test_trailing_hole_assigned_dynamic_all # SKIP XFAIL uninit bytes: 7

All of the above now pass when built under GCC 15. Tests can be seen
with:

    ./tools/testing/kunit/kunit.py run stackinit --arch=x86_64 \
        --make_option CC=gcc-15

Clang continues to fully initialize these kinds of variables[3] without
additional flags.

Suggested-by: Jakub Jelinek <jakub@redhat.com>
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118403 [1]
Link: https://lore.kernel.org/linux-toolchains/Z0hRrrNU3Q+ro2T7@tucnak/ [2]
Link: https://github.com/llvm/llvm-project/commit/7a086e1b2dc05f54afae3591614feede727601fa [3]
Reviewed-by: Nathan Chancellor <nathan@kernel.org>
Acked-by: Masahiro Yamada <masahiroy@kernel.org>
Link: https://lore.kernel.org/r/20250127191031.245214-3-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
2025-01-30 08:48:34 -08:00
Kees Cook e71a29db79 stackinit: Add union initialization to selftests
The stack initialization selftests were checking scalars, strings,
and structs, but not unions. Add union tests (which are mostly identical
setup to structs). This catches the recent union initialization behavioral
changes seen in GCC 15. Before GCC 15, this new test passes:

    ok 18 test_small_start_old_zero

With GCC 15, it fails:

    not ok 18 test_small_start_old_zero

Specifically, a union with a larger member where a smaller member is
initialized with the older "= { 0 }" syntax:

union test_small_start {
     char one:1;
     char two;
     short three;
     unsigned long four;
     struct big_struct {
             unsigned long array[8];
     } big;
};

This is a regression in compiler behavior that Linux has depended on.
GCC does not seem likely to fix it, instead suggesting that affected
projects start using -fzero-init-padding-bits=unions:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118403

Link: https://lore.kernel.org/r/20250127191031.245214-2-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
2025-01-30 08:48:28 -08:00
Kees Cook ad9f265c73 stackinit: Add old-style zero-init syntax to struct tests
The deprecated way to do a full zero init of a structure is with "= { 0 }",
but we weren't testing this style. Add it.

Link: https://lore.kernel.org/r/20250127191031.245214-1-kees@kernel.org
Signed-off-by: Kees Cook <kees@kernel.org>
2025-01-30 08:48:28 -08:00
Linus Torvalds ce335806b5 Merge tag 'ntfs3_for_6.14' of https://github.com/Paragon-Software-Group/linux-ntfs3
Pull ntfs3 fixes from Konstantin Komarov:

 - unify inode corruption marking and mark them as bad immediately upon
   detection of an error in attribute enumeration

 - folio cleanup

* tag 'ntfs3_for_6.14' of https://github.com/Paragon-Software-Group/linux-ntfs3:
  fs/ntfs3: Unify inode corruption marking with _ntfs_bad_inode()
  fs/ntfs3: Mark inode as bad as soon as error detected in mi_enum_attr()
  ntfs3: Remove an access to page->index
2025-01-30 08:47:17 -08:00
Linus Torvalds 8080ff5ac6 Merge tag 'bcachefs-2025-01-29' of git://evilpiepirate.org/bcachefs
Pull bcachefs fixes from Kent Overstreet:

 - second half of a fix for a bug that'd been causing oopses on
   filesystems using snapshots with memory pressure (key cache fills for
   snaphots btrees are tricky)

 - build fix for strange compiler configurations that double stack frame
   size

 - "journal stuck timeout" now takes into account device latency: this
   fixes some spurious warnings, and the main remaining source of SRCU
   lock hold time warnings (I'm no longer seeing this in my CI, so any
   users still seeing this should definitely ping me)

 - fix for slow/hanging unmounts (" Improve journal pin flushing")

 - some more tracepoint fixes/improvements, to chase down the "rebalance
   isn't making progress" issues

* tag 'bcachefs-2025-01-29' of git://evilpiepirate.org/bcachefs:
  bcachefs: Improve trace_move_extent_finish
  bcachefs: Fix trace_copygc
  bcachefs: Journal writes are now IOPRIO_CLASS_RT
  bcachefs: Improve journal pin flushing
  bcachefs: fix bch2_btree_node_flags
  bcachefs: rebalance, copygc enabled are runtime opts
  bcachefs: Improve decompression error messages
  bcachefs: bset_blacklisted_journal_seq is now AUTOFIX
  bcachefs: "Journal stuck" timeout now takes into account device latency
  bcachefs: Reduce stack frame size of __bch2_str_hash_check_key()
  bcachefs: Fix btree_trans_peek_key_cache()
2025-01-30 08:42:50 -08:00
Jens Axboe 8c8492ca64 io_uring/net: don't retry connect operation on EPOLLERR
If a socket is shutdown before the connection completes, POLLERR is set
in the poll mask. However, connect ignores this as it doesn't know, and
attempts the connection again. This may lead to a bogus -ETIMEDOUT
result, where it should have noticed the POLLERR and just returned
-ECONNRESET instead.

Have the poll logic check for whether or not POLLERR is set in the mask,
and if so, mark the request as failed. Then connect can appropriately
fail the request rather than retry it.

Reported-by: Sergey Galas <ssgalas@cloud.ru>
Cc: stable@vger.kernel.org
Link: https://github.com/axboe/liburing/discussions/1335
Fixes: 3fb1bd6881 ("io_uring/net: handle -EINPROGRESS correct for IORING_OP_CONNECT")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-01-30 09:41:25 -07:00
Tetsuo Handa bdc35f164b tomoyo: use better patterns for procfs in learning mode
Commit 08ae2487b2 ("tomoyo: automatically use patterns for several
situations in learning mode") replaced only $PID part of procfs pathname
with \$ pattern. But it turned out that we need to also replace $TID part
and $FD part to make this functionality useful for e.g. /bin/lsof .

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
2025-01-31 00:27:44 +09:00
Paolo Abeni 047558caf7 Merge branch 'mptcp-blackhole-only-if-1st-syn-retrans-w-o-mpc-is-accepted'
Matthieu Baerts says:

====================
mptcp: blackhole only if 1st SYN retrans w/o MPC is accepted

Here are two small fixes for issues introduced in v6.12.

- Patch 1: reset the mpc_drop mark for other SYN retransmits, to only
  consider an MPTCP blackhole when the first SYN retransmitted without
  the MPTCP options is accepted, as initially intended.

- Patch 2: also mention in the doc that the blackhole_timeout sysctl
  knob is per-netns, like all the others.

Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
====================

Link: https://patch.msgid.link/20250129-net-mptcp-blackhole-fix-v1-0-afe88e5a6d2c@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-30 14:02:22 +01:00
Matthieu Baerts (NGI0) 18da4b5d12 doc: mptcp: sysctl: blackhole_timeout is per-netns
All other sysctl entries mention it, and it is a per-namespace sysctl.

So mention it as well.

Fixes: 27069e7cb3 ("mptcp: disable active MPTCP in case of blackhole")
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-30 14:02:19 +01:00
Matthieu Baerts (NGI0) e598d8981f mptcp: blackhole only if 1st SYN retrans w/o MPC is accepted
The Fixes commit mentioned this:

> An MPTCP firewall blackhole can be detected if the following SYN
> retransmission after a fallback to "plain" TCP is accepted.

But in fact, this blackhole was detected if any following SYN
retransmissions after a fallback to TCP was accepted.

That's because 'mptcp_subflow_early_fallback()' will set 'request_mptcp'
to 0, and 'mpc_drop' will never be reset to 0 after.

This is an issue, because some not so unusual situations might cause the
kernel to detect a false-positive blackhole, e.g. a client trying to
connect to a server while the network is not ready yet, causing a few
SYN retransmissions, before reaching the end server.

Fixes: 27069e7cb3 ("mptcp: disable active MPTCP in case of blackhole")
Cc: stable@vger.kernel.org
Reviewed-by: Mat Martineau <martineau@kernel.org>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-30 14:02:19 +01:00
Takashi Iwai 8c2fa44132 ALSA: hda/realtek: Workaround for resume on Dell Venue 11 Pro 7130
It was reported that the headphone output on Dell Venue 11 Pro 7130
becomes mono after PM resume.  The cause seems to be the BIOS setting
up the codec COEF 0x0d bit 0x40 wrongly by some reason, and restoring
the original value 0x2800 fixes the problem.

This patch adds the quirk entry to perform the COEF restore.

Cc: <stable@vger.kernel.org>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=219697
Link: https://bugzilla.opensuse.org/show_bug.cgi?id=1235686
Link: https://patch.msgid.link/20250130123301.8996-1-tiwai@suse.de
Signed-off-by: Takashi Iwai <tiwai@suse.de>
2025-01-30 13:35:33 +01:00
Pablo Neira Ayuso 1b9335a800 netfilter: nf_tables: reject mismatching sum of field_len with set key length
The field length description provides the length of each separated key
field in the concatenation, each field gets rounded up to 32-bits to
calculate the pipapo rule width from pipapo_init(). The set key length
provides the total size of the key aligned to 32-bits.

Register-based arithmetics still allows for combining mismatching set
key length and field length description, eg. set key length 10 and field
description [ 5, 4 ] leading to pipapo width of 12.

Cc: stable@vger.kernel.org
Fixes: 3ce67e3793 ("netfilter: nf_tables: do not allow mismatch field size and set key length")
Reported-by: Noam Rathaus <noamr@ssd-disclosure.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-01-30 12:26:11 +01:00
Paolo Abeni 3effcc04d8 Merge branch 'fix-missing-rtnl-lock-in-suspend-path'
Kory Maincent says:

====================
Fix missing rtnl lock in suspend path

Fix the suspend path by ensuring the rtnl lock is held where required.
Calls to open, close and WOL operations must be performed under the
rtnl lock to prevent conflicts with ongoing ndo operations.

Discussion about this issue can be found here:
https://lore.kernel.org/netdev/20250120141926.1290763-1-kory.maincent@bootlin.com/

While working on the ravb fix, it was discovered that the sh_eth driver
has the same issue. This patch series addresses both drivers.

I do not have access to hardware for either of these MACs, so it would
be great if maintainers or others with the relevant boards could test
these fixes.

v2: https://lore.kernel.org/r/20250123-fix_missing_rtnl_lock_phy_disconnect-v2-0-e6206f5508ba@bootlin.com

v1: https://lore.kernel.org/r/20250122-fix_missing_rtnl_lock_phy_disconnect-v1-0-8cb9f6f88fd1@bootlin.com

Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
====================

Link: https://patch.msgid.link/20250129-fix_missing_rtnl_lock_phy_disconnect-v3-0-24c4ba185a92@bootlin.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-30 11:23:32 +01:00
Kory Maincent b95102215a net: sh_eth: Fix missing rtnl lock in suspend/resume path
Fix the suspend/resume path by ensuring the rtnl lock is held where
required. Calls to sh_eth_close, sh_eth_open and wol operations must be
performed under the rtnl lock to prevent conflicts with ongoing ndo
operations.

Fixes: b71af04676 ("sh_eth: add more PM methods")
Tested-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Sergey Shtylyov <s.shtylyov@omp.ru>
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-30 11:23:01 +01:00
Kory Maincent 2c2ebb2b49 net: ravb: Fix missing rtnl lock in suspend/resume path
Fix the suspend/resume path by ensuring the rtnl lock is held where
required. Calls to ravb_open, ravb_close and wol operations must be
performed under the rtnl lock to prevent conflicts with ongoing ndo
operations.

Without this fix, the following warning is triggered:
[   39.032969] =============================
[   39.032983] WARNING: suspicious RCU usage
[   39.033019] -----------------------------
[   39.033033] drivers/net/phy/phy_device.c:2004 suspicious
rcu_dereference_protected() usage!
...
[   39.033597] stack backtrace:
[   39.033613] CPU: 0 UID: 0 PID: 174 Comm: python3 Not tainted
6.13.0-rc7-next-20250116-arm64-renesas-00002-g35245dfdc62c #7
[   39.033623] Hardware name: Renesas SMARC EVK version 2 based on
r9a08g045s33 (DT)
[   39.033628] Call trace:
[   39.033633]  show_stack+0x14/0x1c (C)
[   39.033652]  dump_stack_lvl+0xb4/0xc4
[   39.033664]  dump_stack+0x14/0x1c
[   39.033671]  lockdep_rcu_suspicious+0x16c/0x22c
[   39.033682]  phy_detach+0x160/0x190
[   39.033694]  phy_disconnect+0x40/0x54
[   39.033703]  ravb_close+0x6c/0x1cc
[   39.033714]  ravb_suspend+0x48/0x120
[   39.033721]  dpm_run_callback+0x4c/0x14c
[   39.033731]  device_suspend+0x11c/0x4dc
[   39.033740]  dpm_suspend+0xdc/0x214
[   39.033748]  dpm_suspend_start+0x48/0x60
[   39.033758]  suspend_devices_and_enter+0x124/0x574
[   39.033769]  pm_suspend+0x1ac/0x274
[   39.033778]  state_store+0x88/0x124
[   39.033788]  kobj_attr_store+0x14/0x24
[   39.033798]  sysfs_kf_write+0x48/0x6c
[   39.033808]  kernfs_fop_write_iter+0x118/0x1a8
[   39.033817]  vfs_write+0x27c/0x378
[   39.033825]  ksys_write+0x64/0xf4
[   39.033833]  __arm64_sys_write+0x18/0x20
[   39.033841]  invoke_syscall+0x44/0x104
[   39.033852]  el0_svc_common.constprop.0+0xb4/0xd4
[   39.033862]  do_el0_svc+0x18/0x20
[   39.033870]  el0_svc+0x3c/0xf0
[   39.033880]  el0t_64_sync_handler+0xc0/0xc4
[   39.033888]  el0t_64_sync+0x154/0x158
[   39.041274] ravb 11c30000.ethernet eth0: Link is Down

Reported-by: Claudiu Beznea <claudiu.beznea.uj@bp.renesas.com>
Closes: https://lore.kernel.org/netdev/4c6419d8-c06b-495c-b987-d66c2e1ff848@tuxon.dev/
Fixes: 0184165b2f ("ravb: add sleep PM suspend/resume support")
Signed-off-by: Kory Maincent <kory.maincent@bootlin.com>
Tested-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-30 11:23:01 +01:00
Paolo Abeni da5ca229b6 Merge tag 'for-net-2025-01-29' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Luiz Augusto von Dentz says:

====================
bluetooth pull request for net:

 - btusb: mediatek: Add locks for usb_driver_claim_interface()
 - L2CAP: accept zero as a special value for MTU auto-selection
 - btusb: Fix possible infinite recursion of btusb_reset
 - Add ABI doc for sysfs reset
 - btnxpuart: Fix glitches seen in dual A2DP streaming

* tag 'for-net-2025-01-29' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth:
  Bluetooth: L2CAP: accept zero as a special value for MTU auto-selection
  Bluetooth: btnxpuart: Fix glitches seen in dual A2DP streaming
  Bluetooth: Add ABI doc for sysfs reset
  Bluetooth: Fix possible infinite recursion of btusb_reset
  Bluetooth: btusb: mediatek: Add locks for usb_driver_claim_interface()
====================

Link: https://patch.msgid.link/20250129210057.1318963-1-luiz.dentz@gmail.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-01-30 11:00:32 +01:00
Al Viro 57b314752e debugfs: Fix the missing initializations in __debugfs_file_get()
both method table pointers in debugfs_fsdata need to be initialized,
obviously, and calculating the bitmap of present methods would also
go better if we start with initialized state.

Fixes: 41a0ecc099 ("debugfs: get rid of dynamically allocation proxy_ops")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Link: https://lore.kernel.org/r/20250129191937.GR1977892@ZenIV
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2025-01-30 08:22:31 +01:00
Toke Høiland-Jørgensen f7bf624b1f selftests/net: Add test for loading devbound XDP program in generic mode
Add a test to bpf_offload.py for loading a devbound XDP program in
generic mode, checking that it fails correctly.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20250127131344.238147-2-toke@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-29 19:04:23 -08:00
Toke Høiland-Jørgensen 3595599fa8 net: xdp: Disallow attaching device-bound programs in generic mode
Device-bound programs are used to support RX metadata kfuncs. These
kfuncs are driver-specific and rely on the driver context to read the
metadata. This means they can't work in generic XDP mode. However, there
is no check to disallow such programs from being attached in generic
mode, in which case the metadata kfuncs will be called in an invalid
context, leading to crashes.

Fix this by adding a check to disallow attaching device-bound programs
in generic mode.

Fixes: 2b3486bc2d ("bpf: Introduce device-bound XDP programs")
Reported-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Closes: https://lore.kernel.org/r/dae862ec-43b5-41a0-8edf-46c59071cdda@hetzner-cloud.de
Tested-by: Marcus Wichelmann <marcus.wichelmann@hetzner-cloud.de>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://patch.msgid.link/20250127131344.238147-1-toke@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-29 19:04:23 -08:00
Jon Maloy 8c670bdfa5 tcp: correct handling of extreme memory squeeze
Testing with iperf3 using the "pasta" protocol splicer has revealed
a problem in the way tcp handles window advertising in extreme memory
squeeze situations.

Under memory pressure, a socket endpoint may temporarily advertise
a zero-sized window, but this is not stored as part of the socket data.
The reasoning behind this is that it is considered a temporary setting
which shouldn't influence any further calculations.

However, if we happen to stall at an unfortunate value of the current
window size, the algorithm selecting a new value will consistently fail
to advertise a non-zero window once we have freed up enough memory.
This means that this side's notion of the current window size is
different from the one last advertised to the peer, causing the latter
to not send any data to resolve the sitution.

The problem occurs on the iperf3 server side, and the socket in question
is a completely regular socket with the default settings for the
fedora40 kernel. We do not use SO_PEEK or SO_RCVBUF on the socket.

The following excerpt of a logging session, with own comments added,
shows more in detail what is happening:

//              tcp_v4_rcv(->)
//                tcp_rcv_established(->)
[5201<->39222]:     ==== Activating log @ net/ipv4/tcp_input.c/tcp_data_queue()/5257 ====
[5201<->39222]:     tcp_data_queue(->)
[5201<->39222]:        DROPPING skb [265600160..265665640], reason: SKB_DROP_REASON_PROTO_MEM
                       [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                       [copied_seq 259909392->260034360 (124968), unread 5565800, qlen 85, ofoq 0]
                       [OFO queue: gap: 65480, len: 0]
[5201<->39222]:     tcp_data_queue(<-)
[5201<->39222]:     __tcp_transmit_skb(->)
                        [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
[5201<->39222]:       tcp_select_window(->)
[5201<->39222]:         (inet_csk(sk)->icsk_ack.pending & ICSK_ACK_NOMEM) ? --> TRUE
                        [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
                        returning 0
[5201<->39222]:       tcp_select_window(<-)
[5201<->39222]:       ADVERTISING WIN 0, ACK_SEQ: 265600160
[5201<->39222]:     [__tcp_transmit_skb(<-)
[5201<->39222]:   tcp_rcv_established(<-)
[5201<->39222]: tcp_v4_rcv(<-)

// Receive queue is at 85 buffers and we are out of memory.
// We drop the incoming buffer, although it is in sequence, and decide
// to send an advertisement with a window of zero.
// We don't update tp->rcv_wnd and tp->rcv_wup accordingly, which means
// we unconditionally shrink the window.

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:     [new_win = 0, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]:     NOT calling tcp_send_ack()
                    [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
[5201<->39222]:   __tcp_cleanup_rbuf(<-)
                  [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                  [copied_seq 260040464->260040464 (0), unread 5559696, qlen 85, ofoq 0]
                  returning 6104 bytes
[5201<->39222]: tcp_recvmsg_locked(<-)

// After each read, the algorithm for calculating the new receive
// window in __tcp_cleanup_rbuf() finds it is too small to advertise
// or to update tp->rcv_wnd.
// Meanwhile, the peer thinks the window is zero, and will not send
// any more data to trigger an update from the interrupt mode side.

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]:     NOT calling tcp_send_ack()
                    [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
[5201<->39222]:   __tcp_cleanup_rbuf(<-)
                  [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                  [copied_seq 260099840->260171536 (71696), unread 5428624, qlen 83, ofoq 0]
                  returning 131072 bytes
[5201<->39222]: tcp_recvmsg_locked(<-)

// The above pattern repeats again and again, since nothing changes
// between the reads.

[...]

[5201<->39222]: tcp_recvmsg_locked(->)
[5201<->39222]:   __tcp_cleanup_rbuf(->) tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160
[5201<->39222]:     [new_win = 262144, win_now = 131184, 2 * win_now = 262368]
[5201<->39222]:     [new_win >= (2 * win_now) ? --> time_to_ack = 0]
[5201<->39222]:     NOT calling tcp_send_ack()
                    [tp->rcv_wup: 265469200, tp->rcv_wnd: 262144, tp->rcv_nxt 265600160]
[5201<->39222]:   __tcp_cleanup_rbuf(<-)
                  [rcv_nxt 265600160, rcv_wnd 262144, snt_ack 265469200, win_now 131184]
                  [copied_seq 265600160->265600160 (0), unread 0, qlen 0, ofoq 0]
                  returning 54672 bytes
[5201<->39222]: tcp_recvmsg_locked(<-)

// The receive queue is empty, but no new advertisement has been sent.
// The peer still thinks the receive window is zero, and sends nothing.
// We have ended up in a deadlock situation.

Note that well behaved endpoints will send win0 probes, so the problem
will not occur.

Furthermore, we have observed that in these situations this side may
send out an updated 'th->ack_seq´ which is not stored in tp->rcv_wup
as it should be. Backing ack_seq seems to be harmless, but is of
course still wrong from a protocol viewpoint.

We fix this by updating the socket state correctly when a packet has
been dropped because of memory exhaustion and we have to advertize
a zero window.

Further testing shows that the connection recovers neatly from the
squeeze situation, and traffic can continue indefinitely.

Fixes: e2142825c1 ("net: tcp: send zero-window ACK when no memory")
Cc: Menglong Dong <menglong8.dong@gmail.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Jon Maloy <jmaloy@redhat.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Link: https://patch.msgid.link/20250127231304.1465565-1-jmaloy@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-29 18:58:15 -08:00
Rafał Miłecki 752e5fcc2e bgmac: reduce max frame size to support just MTU 1500
bgmac allocates new replacement buffer before handling each received
frame. Allocating & DMA-preparing 9724 B each time consumes a lot of CPU
time. Ideally bgmac should just respect currently set MTU but it isn't
the case right now. For now just revert back to the old limited frame
size.

This change bumps NAT masquerade speed by ~95%.

Since commit 8218f62c9c ("mm: page_frag: use initial zero offset for
page_frag_alloc_align()"), the bgmac driver fails to open its network
interface successfully and runs out of memory in the following call
stack:

bgmac_open
  -> bgmac_dma_init
    -> bgmac_dma_rx_skb_for_slot
      -> netdev_alloc_frag

BGMAC_RX_ALLOC_SIZE = 10048 and PAGE_FRAG_CACHE_MAX_SIZE = 32768.

Eventually we land into __page_frag_alloc_align() with the following
parameters across multiple successive calls:

__page_frag_alloc_align: fragsz=10048, align_mask=-1, size=32768, offset=0
__page_frag_alloc_align: fragsz=10048, align_mask=-1, size=32768, offset=10048
__page_frag_alloc_align: fragsz=10048, align_mask=-1, size=32768, offset=20096
__page_frag_alloc_align: fragsz=10048, align_mask=-1, size=32768, offset=30144

So in that case we do indeed have offset + fragsz (40192) > size (32768)
and so we would eventually return NULL. Reverting to the older 1500
bytes MTU allows the network driver to be usable again.

Fixes: 8c7da63978 ("bgmac: configure MTU and add support for frames beyond 8192 byte size")
Signed-off-by: Rafał Miłecki <rafal@milecki.pl>
[florian: expand commit message about recent commits]
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: Florian Fainelli <florian.fainelli@broadcom.com>
Link: https://patch.msgid.link/20250127175159.1788246-1-florian.fainelli@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-29 18:57:42 -08:00
Jakub Kicinski 6e6c62be16 Merge branch 'vsock-transport-reassignment-and-error-handling-issues'
Michal Luczaj says:

====================
vsock: Transport reassignment and error handling issues

Series deals with two issues:
- socket reference count imbalance due to an unforgiving transport release
  (triggered by transport reassignment);
- unintentional API feature, a failing connect() making the socket
  impossible to use for any subsequent connect() attempts.

v2: https://lore.kernel.org/20250121-vsock-transport-vs-autobind-v2-0-aad6069a4e8c@rbox.co
v1: https://lore.kernel.org/20250117-vsock-transport-vs-autobind-v1-0-c802c803762d@rbox.co
====================

Link: https://patch.msgid.link/20250128-vsock-transport-vs-autobind-v3-0-1cf57065b770@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-29 18:50:41 -08:00
Michal Luczaj 4695f64e02 vsock/test: Add test for connect() retries
Deliberately fail a connect() attempt; expect error. Then verify that
subsequent attempt (using the same socket) can still succeed, rather than
fail outright.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20250128-vsock-transport-vs-autobind-v3-6-1cf57065b770@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-29 18:50:37 -08:00
Michal Luczaj 301a62dfb0 vsock/test: Add test for UAF due to socket unbinding
Fail the autobind, then trigger a transport reassign. Socket might get
unbound from unbound_sockets, which then leads to a reference count
underflow.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20250128-vsock-transport-vs-autobind-v3-5-1cf57065b770@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-29 18:50:37 -08:00
Michal Luczaj ac12b7e291 vsock/test: Introduce vsock_connect_fd()
Distill timeout-guarded vsock_connect_fd(). Adapt callers.

Suggested-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20250128-vsock-transport-vs-autobind-v3-4-1cf57065b770@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-29 18:50:37 -08:00
Michal Luczaj 852a00c428 vsock/test: Introduce vsock_bind()
Add a helper for socket()+bind(). Adapt callers.

Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20250128-vsock-transport-vs-autobind-v3-3-1cf57065b770@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-29 18:50:37 -08:00
Michal Luczaj aa388c7211 vsock: Allow retrying on connect() failure
sk_err is set when a (connectible) connect() fails. Effectively, this makes
an otherwise still healthy SS_UNCONNECTED socket impossible to use for any
subsequent connection attempts.

Clear sk_err upon trying to establish a connection.

Fixes: d021c34405 ("VSOCK: Introduce VM Sockets")
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Luigi Leonardi <leonardi@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20250128-vsock-transport-vs-autobind-v3-2-1cf57065b770@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-29 18:50:37 -08:00
Michal Luczaj fcdd2242c0 vsock: Keep the binding until socket destruction
Preserve sockets bindings; this includes both resulting from an explicit
bind() and those implicitly bound through autobind during connect().

Prevents socket unbinding during a transport reassignment, which fixes a
use-after-free:

    1. vsock_create() (refcnt=1) calls vsock_insert_unbound() (refcnt=2)
    2. transport->release() calls vsock_remove_bound() without checking if
       sk was bound and moved to bound list (refcnt=1)
    3. vsock_bind() assumes sk is in unbound list and before
       __vsock_insert_bound(vsock_bound_sockets()) calls
       __vsock_remove_bound() which does:
           list_del_init(&vsk->bound_table); // nop
           sock_put(&vsk->sk);               // refcnt=0

BUG: KASAN: slab-use-after-free in __vsock_bind+0x62e/0x730
Read of size 4 at addr ffff88816b46a74c by task a.out/2057
 dump_stack_lvl+0x68/0x90
 print_report+0x174/0x4f6
 kasan_report+0xb9/0x190
 __vsock_bind+0x62e/0x730
 vsock_bind+0x97/0xe0
 __sys_bind+0x154/0x1f0
 __x64_sys_bind+0x6e/0xb0
 do_syscall_64+0x93/0x1b0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Allocated by task 2057:
 kasan_save_stack+0x1e/0x40
 kasan_save_track+0x10/0x30
 __kasan_slab_alloc+0x85/0x90
 kmem_cache_alloc_noprof+0x131/0x450
 sk_prot_alloc+0x5b/0x220
 sk_alloc+0x2c/0x870
 __vsock_create.constprop.0+0x2e/0xb60
 vsock_create+0xe4/0x420
 __sock_create+0x241/0x650
 __sys_socket+0xf2/0x1a0
 __x64_sys_socket+0x6e/0xb0
 do_syscall_64+0x93/0x1b0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Freed by task 2057:
 kasan_save_stack+0x1e/0x40
 kasan_save_track+0x10/0x30
 kasan_save_free_info+0x37/0x60
 __kasan_slab_free+0x4b/0x70
 kmem_cache_free+0x1a1/0x590
 __sk_destruct+0x388/0x5a0
 __vsock_bind+0x5e1/0x730
 vsock_bind+0x97/0xe0
 __sys_bind+0x154/0x1f0
 __x64_sys_bind+0x6e/0xb0
 do_syscall_64+0x93/0x1b0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

refcount_t: addition on 0; use-after-free.
WARNING: CPU: 7 PID: 2057 at lib/refcount.c:25 refcount_warn_saturate+0xce/0x150
RIP: 0010:refcount_warn_saturate+0xce/0x150
 __vsock_bind+0x66d/0x730
 vsock_bind+0x97/0xe0
 __sys_bind+0x154/0x1f0
 __x64_sys_bind+0x6e/0xb0
 do_syscall_64+0x93/0x1b0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

refcount_t: underflow; use-after-free.
WARNING: CPU: 7 PID: 2057 at lib/refcount.c:28 refcount_warn_saturate+0xee/0x150
RIP: 0010:refcount_warn_saturate+0xee/0x150
 vsock_remove_bound+0x187/0x1e0
 __vsock_release+0x383/0x4a0
 vsock_release+0x90/0x120
 __sock_release+0xa3/0x250
 sock_close+0x14/0x20
 __fput+0x359/0xa80
 task_work_run+0x107/0x1d0
 do_exit+0x847/0x2560
 do_group_exit+0xb8/0x250
 __x64_sys_exit_group+0x3a/0x50
 x64_sys_call+0xfec/0x14f0
 do_syscall_64+0x93/0x1b0
 entry_SYSCALL_64_after_hwframe+0x76/0x7e

Fixes: c0cfa2d8a7 ("vsock: add multi-transports support")
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: Michal Luczaj <mhal@rbox.co>
Link: https://patch.msgid.link/20250128-vsock-transport-vs-autobind-v3-1-1cf57065b770@rbox.co
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-01-29 18:50:36 -08:00
Abel Wu c78f4afbd9 bpf: Fix deadlock when freeing cgroup storage
The following commit
bc235cdb42 ("bpf: Prevent deadlock from recursive bpf_task_storage_[get|delete]")
first introduced deadlock prevention for fentry/fexit programs attaching
on bpf_task_storage helpers. That commit also employed the logic in map
free path in its v6 version.

Later bpf_cgrp_storage was first introduced in
c4bcfb38a9 ("bpf: Implement cgroup storage available to non-cgroup-attached bpf progs")
which faces the same issue as bpf_task_storage, instead of its busy
counter, NULL was passed to bpf_local_storage_map_free() which opened
a window to cause deadlock:

	<TASK>
		(acquiring local_storage->lock)
	_raw_spin_lock_irqsave+0x3d/0x50
	bpf_local_storage_update+0xd1/0x460
	bpf_cgrp_storage_get+0x109/0x130
	bpf_prog_a4d4a370ba857314_cgrp_ptr+0x139/0x170
	? __bpf_prog_enter_recur+0x16/0x80
	bpf_trampoline_6442485186+0x43/0xa4
	cgroup_storage_ptr+0x9/0x20
		(holding local_storage->lock)
	bpf_selem_unlink_storage_nolock.constprop.0+0x135/0x160
	bpf_selem_unlink_storage+0x6f/0x110
	bpf_local_storage_map_free+0xa2/0x110
	bpf_map_free_deferred+0x5b/0x90
	process_one_work+0x17c/0x390
	worker_thread+0x251/0x360
	kthread+0xd2/0x100
	ret_from_fork+0x34/0x50
	ret_from_fork_asm+0x1a/0x30
	</TASK>

Progs:
 - A: SEC("fentry/cgroup_storage_ptr")
   - cgid (BPF_MAP_TYPE_HASH)
	Record the id of the cgroup the current task belonging
	to in this hash map, using the address of the cgroup
	as the map key.
   - cgrpa (BPF_MAP_TYPE_CGRP_STORAGE)
	If current task is a kworker, lookup the above hash
	map using function parameter @owner as the key to get
	its corresponding cgroup id which is then used to get
	a trusted pointer to the cgroup through
	bpf_cgroup_from_id(). This trusted pointer can then
	be passed to bpf_cgrp_storage_get() to finally trigger
	the deadlock issue.
 - B: SEC("tp_btf/sys_enter")
   - cgrpb (BPF_MAP_TYPE_CGRP_STORAGE)
	The only purpose of this prog is to fill Prog A's
	hash map by calling bpf_cgrp_storage_get() for as
	many userspace tasks as possible.

Steps to reproduce:
 - Run A;
 - while (true) { Run B; Destroy B; }

Fix this issue by passing its busy counter to the free procedure so
it can be properly incremented before storage/smap locking.

Fixes: c4bcfb38a9 ("bpf: Implement cgroup storage available to non-cgroup-attached bpf progs")
Signed-off-by: Abel Wu <wuyun.abel@bytedance.com>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20241221061018.37717-1-wuyun.abel@bytedance.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2025-01-29 18:38:19 -08:00
Yunhui Cui 101971298b riscv: add a warning when physical memory address overflows
The part of physical memory that exceeds the size of the linear mapping
will be discarded. When the system starts up normally, a warning message
will be printed to prevent confusion caused by the mismatch between the
system memory and the actual physical memory.

Signed-off-by: Yunhui Cui <cuiyunhui@bytedance.com>
Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Tested-by: Alexandre Ghiti <alexghiti@rivosinc.com>
Link: https://lore.kernel.org/r/20240814062625.19794-1-cuiyunhui@bytedance.com
Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2025-01-29 18:36:09 -08:00
Huacai Chen 35fcac7a7c audit: Initialize lsmctx to avoid memory allocation error
When audit is enabled in a kernel build, and there are no LSMs active
that support LSM labeling, it is possible that local variable lsmctx
in the AUDIT_SIGNAL_INFO handler in audit_receive_msg() could be used
before it is properly initialize. Then kmalloc() will try to allocate
a large amount of memory with the uninitialized length.

This patch corrects this problem by initializing the lsmctx to a safe
value when it is declared, which avoid errors like:

 WARNING: CPU: 2 PID: 443 at mm/page_alloc.c:4727 __alloc_pages_noprof
        ...
    ra: 9000000003059644 ___kmalloc_large_node+0x84/0x1e0
   ERA: 900000000304d588 __alloc_pages_noprof+0x4c8/0x1040
  CRMD: 000000b0 (PLV0 -IE -DA +PG DACF=CC DACM=CC -WE)
  PRMD: 00000004 (PPLV0 +PIE -PWE)
  EUEN: 00000007 (+FPE +SXE +ASXE -BTE)
  ECFG: 00071c1d (LIE=0,2-4,10-12 VS=7)
 ESTAT: 000c0000 [BRK] (IS= ECode=12 EsubCode=0)
  PRID: 0014c010 (Loongson-64bit, Loongson-3A5000)
 CPU: 2 UID: 0 PID: 443 Comm: auditd Not tainted 6.13.0-rc1+ #1899
        ...
 Call Trace:
 [<9000000002def6a8>] show_stack+0x30/0x148
 [<9000000002debf58>] dump_stack_lvl+0x68/0xa0
 [<9000000002e0fe18>] __warn+0x80/0x108
 [<900000000407486c>] report_bug+0x154/0x268
 [<90000000040ad468>] do_bp+0x2a8/0x320
 [<9000000002dedda0>] handle_bp+0x120/0x1c0
 [<900000000304d588>] __alloc_pages_noprof+0x4c8/0x1040
 [<9000000003059640>] ___kmalloc_large_node+0x80/0x1e0
 [<9000000003061504>] __kmalloc_noprof+0x2c4/0x380
 [<9000000002f0f7ac>] audit_receive_msg+0x764/0x1530
 [<9000000002f1065c>] audit_receive+0xe4/0x1c0
 [<9000000003e5abe8>] netlink_unicast+0x340/0x450
 [<9000000003e5ae9c>] netlink_sendmsg+0x1a4/0x4a0
 [<9000000003d9ffd0>] __sock_sendmsg+0x48/0x58
 [<9000000003da32f0>] __sys_sendto+0x100/0x170
 [<9000000003da3374>] sys_sendto+0x14/0x28
 [<90000000040ad574>] do_syscall+0x94/0x138
 [<9000000002ded318>] handle_syscall+0xb8/0x158

Fixes: 6fba89813c ("lsm: ensure the correct LSM context releaser")
Signed-off-by: Huacai Chen <chenhuacai@loongson.cn>
[PM: resolved excessive line length in the backtrace]
Signed-off-by: Paul Moore <paul@paul-moore.com>
2025-01-29 20:02:04 -05:00
Masahiro Yamada a409fc1463 kconfig: fix memory leak in sym_warn_unmet_dep()
The string allocated in sym_warn_unmet_dep() is never freed, leading
to a memory leak when an unmet dependency is detected.

Fixes: f8f69dc0b4 ("kconfig: make unmet dependency warnings readable")
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
2025-01-30 08:43:12 +09:00
Masahiro Yamada a314f52a02 kconfig: fix file name in warnings when loading KCONFIG_DEFCONFIG_LIST
Most 'make *config' commands use .config as the base configuration file.

When .config does not exist, Kconfig tries to load a file listed in
KCONFIG_DEFCONFIG_LIST instead.

However, since commit b75b0a819a ("kconfig: change defconfig_list
option to environment variable"), warning messages have displayed an
incorrect file name in such cases.

Below is a demonstration using Debian Trixie. While loading
/boot/config-6.12.9-amd64, the warning messages incorrectly show .config
as the file name.

With this commit, the correct file name is displayed in warnings.

[Before]

  $ rm -f .config
  $ make config
  #
  # using defaults found in /boot/config-6.12.9-amd64
  #
  .config:6804:warning: symbol value 'm' invalid for FB_BACKLIGHT
  .config:9895:warning: symbol value 'm' invalid for ANDROID_BINDER_IPC

[After]

  $ rm -f .config
  $ make config
  #
  # using defaults found in /boot/config-6.12.9-amd64
  #
  /boot/config-6.12.9-amd64:6804:warning: symbol value 'm' invalid for FB_BACKLIGHT
  /boot/config-6.12.9-amd64:9895:warning: symbol value 'm' invalid for ANDROID_BINDER_IPC

Fixes: b75b0a819a ("kconfig: change defconfig_list option to environment variable")
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2025-01-30 08:40:17 +09:00
Pali Rohár 78f69467cb cifs: Add mount option -o reparse=none
This new mount option allows to completely disable creating new reparse
points. When -o sfu or -o mfsymlinks or -o symlink= is not specified then
creating any special file (fifo, socket, symlink, block and char) will fail
with -EOPNOTSUPP error.

Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-29 17:35:38 -06:00
Pali Rohár 660618dde2 cifs: Add mount option -o symlink= for choosing symlink create type
Currently Linux CIFS client creates a new symlink of the first flavor which
is allowed by mount options, parsed in this order: -o (no)mfsymlinks,
-o (no)sfu, -o (no)unix (+ its aliases) and -o reparse=[type].

Introduce a new mount option -o symlink= for explicitly choosing a symlink
flavor. Possible options are:

  -o symlink=default    - The default behavior, like before this change.
  -o symlink=none       - Disallow creating a new symlinks
  -o symlink=native     - Create as native SMB symlink reparse point
  -o symlink=unix       - Create via SMB1 unix extension command
  -o symlink=mfsymlinks - Create as regular file of mfsymlinks format
  -o symlink=sfu        - Create as regular system file of SFU format
  -o symlink=nfs        - Create as NFS reparse point
  -o symlink=wsl        - Create as WSL reparse point

So for example specifying -o sfu,mfsymlinks,symlink=native will allow to
parse symlinks also of SFU and mfsymlinks types (which are disabled by
default unless mount option is explicitly specified), but new symlinks will
be created under native SMB type (which parsing is always enabled).

Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-29 17:35:32 -06:00
Pali Rohár 12b466eb52 cifs: Fix creating and resolving absolute NT-style symlinks
If the SMB symlink is stored on NT server in absolute form then it points
to the NT object hierarchy, which is different from POSIX one and needs
some conversion / mapping.

To make interoperability with Windows SMB server and WSL subsystem, reuse
its logic of mapping between NT paths and POSIX paths into Linux SMB
client.

WSL subsystem on Windows uses for -t drvfs mount option -o symlinkroot=
which specifies the POSIX path where are expected to be mounted lowercase
Windows drive letters (without colon).

Do same for Linux SMB client and add a new mount option -o symlinkroot=
which mimics the drvfs mount option of the same name. It specifies where in
the Linux VFS hierarchy is the root of the DOS / Windows drive letters, and
translates between absolute NT-style symlinks and absolute Linux VFS
symlinks. Default value of symlinkroot is "/mnt", same what is using WSL.

Note that DOS / Windows drive letter symlinks are just subset of all
possible NT-style symlinks. Drive letters live in NT subtree \??\ and
important details about NT paths and object hierarchy are in the comments
in this change.

When symlink target location from non-POSIX SMB server is in absolute form
(indicated by absence of SYMLINK_FLAG_RELATIVE) then it is converted to
Linux absolute symlink according to symlinkroot configuration.

And when creating a new symlink on non-POSIX SMB server in absolute form
then Linux absolute target is converted to NT-style according to
symlinkroot configuration.

When SMB server is POSIX, then this change does not affect neither reading
target location of symlink, nor creating a new symlink. It is expected that
POSIX SMB server works with POSIX paths where the absolute root is /.

This change improves interoperability of absolute SMB symlinks with Windows
SMB servers.

Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-29 17:35:06 -06:00
Pali Rohár 32ba03042a cifs: Simplify reparse point check in cifs_query_path_info() function
For checking if path is reparse point and setting data->reparse_point
member, it is enough to check if ATTR_REPARSE is present.

It is not required to call CIFS_open() without OPEN_REPARSE_POINT and
checking for -EOPNOTSUPP error code.

Signed-off-by: Pali Rohár <pali@kernel.org>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-29 17:20:40 -06:00
Pali Rohár 65c49767dd cifs: Remove symlink member from cifs_open_info_data union
Member 'symlink' is part of the union in struct cifs_open_info_data. Its
value is assigned on few places, but is always read through another union
member 'reparse_point'. So to make code more readable, always use only
'reparse_point' member and drop whole union structure. No function change.

Signed-off-by: Pali Rohár <pali@kernel.org>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-29 17:19:52 -06:00
Pali Rohár a46221fcdd cifs: Update description about ACL permissions
There are some incorrect information about individual SMB permission
constants like WRITE_DAC can change ownership, or incomplete information to
distinguish between ACL types (discretionary vs system) and there is
completely missing information how permissions apply for directory objects
and what is meaning of GENERIC_* bits.

Also there is missing constant for MAXIMUM_ALLOWED permission.

Fix and extend description of all SMB permission constants to match the
reality, how the reference Windows SMB / NTFS implementation handles them.

Links to official Microsoft documentation related to permissions:
https://learn.microsoft.com/en-us/windows/win32/fileio/file-access-rights-constants
https://learn.microsoft.com/en-us/windows/win32/secauthz/access-mask
https://learn.microsoft.com/en-us/windows/win32/secauthz/standard-access-rights
https://learn.microsoft.com/en-us/windows/win32/secauthz/generic-access-rights
https://learn.microsoft.com/en-us/windows/win32/api/winternl/nf-winternl-ntcreatefile
https://learn.microsoft.com/en-us/windows-hardware/drivers/ddi/ntifs/nf-ntifs-ntcreatefile

Signed-off-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-29 17:19:12 -06:00
Pali Rohár b6d002f0a3 cifs: Rename struct reparse_posix_data to reparse_nfs_data_buffer and move to common/smb2pdu.h
Function parse_reparse_posix() parses NFS-style reparse points, which are
used only by Windows NFS server since Windows Server 2012 version. This
style is not understood by Microsoft POSIX/Interix/SFU/SUA subsystems.

So make it clear that parse_reparse_posix() function and reparse_posix_data
structure are not POSIX general, but rather NFS specific.

All reparse buffer structures are defined in common/smb2pdu.h and have
_buffer suffix. So move struct reparse_posix_data from client/cifspdu.h to
common/smb2pdu.h and rename it to reparse_nfs_data_buffer for consistency.
Note that also SMB specification in [MS-FSCC] document, section 2.1.2.6
defines it under name "Network File System (NFS) Reparse Data Buffer".
So use this name for consistency.

Having this structure in common/smb2pdu.h can be useful for ksmbd server
code as NFS-style reparse points is the preferred way for implementing
support for special files.

Signed-off-by: Pali Rohár <pali@kernel.org>
Acked-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-29 17:18:55 -06:00
Pali Rohár 65ccccee4e cifs: Remove struct reparse_posix_data from struct cifs_open_info_data
Linux SMB client already supports more reparse point types but only the
reparse_posix_data is defined in union of struct cifs_open_info_data.
This union is currently used as implicit casting between point types.

With this code style, it hides information that union is used for pointer
casting, and just in mknod_nfs() and posix_reparse_to_fattr() functions.

Other reparse point buffers do not use this kind of casting. So remove
reparse_posix_data from reparse part of struct cifs_open_info_data and for
all cases of reparse buffer use just struct reparse_data_buffer *buf.

Signed-off-by: Pali Rohár <pali@kernel.org>
Reviewed-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
2025-01-29 17:12:56 -06:00