syzbot


ID Workflow Result Correct Bug Created Started Finished Revision Error
ece1b702-1347-42e5-af92-7ce323429d26 assessment-kcsan Benign: ✅  Confident: ✅  KCSAN: data-race in sk_psock_drop / unix_stream_sendmsg 2026/02/23 15:12 2026/02/23 15:12 2026/02/23 15:16 305c0ec5cd886e2d13738e28e1b2df9b0ec20fc9

Explanation:
The data race occurs on the `sk_data_ready` function pointer in `struct sock`. 

### Analysis

**Task 1 (Write):**
The write happens in `sk_psock_stop_verdict()` (called via `sk_psock_drop()`), which is part of the cleanup process when a socket is removed from a BPF sockmap or is being closed (e.g., via `unix_release()`). The code restores the original `sk_data_ready` callback that was saved when the `sk_psock` was first attached to the socket:
```c
sk->sk_data_ready = psock->saved_data_ready;
```
This write is performed while holding the `sk->sk_callback_lock` (write-side).

**Task 2 (Read):**
The read happens in `unix_stream_sendmsg()` when a peer socket sends data and notifies the receiving socket:
```c
other->sk_data_ready(other);
```
This invocation is performed without holding the `sk_callback_lock`.

### Verdict: Benign

The data race is considered benign for the following reasons:

1.  **Atomicity:** On the architectures supported by the Linux kernel (including x86_64, as seen in the report), a pointer-sized, naturally aligned memory access is atomic at the hardware level, and a conservative compiler emits a single instruction to load or store the function pointer.
2.  **Safety of Callbacks:** Both the old value (`sk_psock_verdict_data_ready`) and the new value (typically `sock_def_readable`) are valid, persistent kernel functions. If the reader sees the old value, it will call the BPF-based notification; if it sees the new value, it will call the default notification. Both are safe to execute.
3.  **Handling of Detached State:** The BPF callback `sk_psock_verdict_data_ready` specifically handles the case where the `psock` has already been detached by checking `sk_psock(sk)` (which retrieves `sk_user_data` via RCU). If it finds the `psock` is NULL, it safely returns or handles the packet without crashing.
4.  **Functional Impact:** While there is a theoretical window where a packet might be dropped if the BPF callback is invoked just after the `psock` is cleared but before the function pointer is restored, this occurs during socket teardown or reconfiguration. Such transient packet loss is generally acceptable in these states and does not compromise the integrity of the kernel memory or state.
5.  **Common Pattern:** Lockless invocation of socket callbacks is a standard pattern in the Linux networking stack to avoid the overhead of rwlocks in the fast path (data transmission).

In summary, the race does not lead to memory corruption, use-after-free, or crashes, and is typical of high-performance networking code.

Conclusion: Benign

Crash report:
==================================================================
BUG: KCSAN: data-race in sk_psock_drop / unix_stream_sendmsg

write to 0xffff88810220dc08 of 8 bytes by task 3611 on cpu 1:
 sk_psock_stop_verdict net/core/skmsg.c:1311 [inline]
 sk_psock_drop+0x12f/0x270 net/core/skmsg.c:896
 sk_psock_put include/linux/skmsg.h:506 [inline]
 sock_map_unref+0x2a5/0x300 net/core/sock_map.c:185
 __sock_map_delete net/core/sock_map.c:426 [inline]
 sock_map_delete_from_link net/core/sock_map.c:439 [inline]
 sock_map_unlink net/core/sock_map.c:1608 [inline]
 sock_map_remove_links+0x228/0x340 net/core/sock_map.c:1623
 sock_map_close+0xa1/0x340 net/core/sock_map.c:1684
 unix_release+0x4e/0x80 net/unix/af_unix.c:1183
 __sock_release net/socket.c:662 [inline]
 sock_close+0x6b/0x150 net/socket.c:1455
 __fput+0x29b/0x650 fs/file_table.c:469
 fput_close_sync+0xad/0x190 fs/file_table.c:574
 __do_sys_close fs/open.c:1509 [inline]
 __se_sys_close fs/open.c:1494 [inline]
 __x64_sys_close+0x55/0xf0 fs/open.c:1494
 x64_sys_call+0x275d/0x3020 arch/x86/include/generated/asm/syscalls_64.h:4
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

read to 0xffff88810220dc08 of 8 bytes by task 3605 on cpu 0:
 unix_stream_sendmsg+0x7f8/0xbb0 net/unix/af_unix.c:2480
 sock_sendmsg_nosec net/socket.c:727 [inline]
 __sock_sendmsg net/socket.c:742 [inline]
 ____sys_sendmsg+0x5af/0x600 net/socket.c:2592
 ___sys_sendmsg+0x195/0x1e0 net/socket.c:2646
 __sys_sendmsg net/socket.c:2678 [inline]
 __do_sys_sendmsg net/socket.c:2683 [inline]
 __se_sys_sendmsg net/socket.c:2681 [inline]
 __x64_sys_sendmsg+0xd4/0x160 net/socket.c:2681
 x64_sys_call+0x194c/0x3020 arch/x86/include/generated/asm/syscalls_64.h:47
 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline]
 do_syscall_64+0x12c/0x370 arch/x86/entry/syscall_64.c:94
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

value changed: 0xffffffff8486dd00 -> 0xffffffff8472c9f0

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 UID: 0 PID: 3605 Comm: syz.4.30 Not tainted syzkaller #0 PREEMPT(full) 
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2026
==================================================================

Trajectory:
Seq Timestamp Type Name Duration
0/0 2026/02/23 15:12 flow assessment-kcsan
4m Results:
map[Benign:true Confident:true Explanation:The data race occurs on the `sk_data_ready` function pointer in `struct sock`; the full analysis is given verbatim in the Explanation section above. Conclusion: Benign]

1/1 2026/02/23 15:12 action kernel-checkouter
0m Results:
map[KernelSrc:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/src/c96791085f13f78370fd810620bcf7e12db2393e]

2/1 2026/02/23 15:12 action kernel-builder
0m Results:
map[KernelObj:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/build/87c5b072f2369cda5419459302ab4ef9203773cd]

3/1 2026/02/23 15:12 action codesearch-prepare
0m Results:
map[Index:codesearch-index]

4/1 2026/02/23 15:12 agent expert
3m Model:
gemini-3-flash-preview

Results:
map[Benign:true Confident:true]

Instruction:
You are an experienced Linux kernel developer tasked with determining if the given kernel
data race is benign or not. The data race report is from KCSAN tool.
It contains 2 stack traces of the memory accesses that constitute a data race.

A "benign" data race is one on a simple int/bool variable or similar field,
where the accesses are not supposed to be protected by any mutual exclusion primitives.
Common examples of such "benign" data races are accesses to various flags fields,
statistics counters, and similar. A "benign" data race does not lead to memory corruption/crash
with a conservative compiler that compiles memory accesses to primitive types
effectively as atomic.

A non-benign (or "harmful") data race can lead to corruption/crash even with
a conservative compiler that compiles memory accesses to primitive types
effectively as atomic. A common example of a "harmful" data race is a race on
a complex container (list/hashmap/etc), where accesses are supposed to be protected
by a mutual exclusion primitive.

Also consider races that happen at the same time with the given one.
If there is no synchronization in between, other memory accesses in the involved threads
race with each other if they access the same memory. For example, if both threads execute:

	some_struct->have_elements = true;
	list_add(new_node, &some_struct->list_head);

the race on some_struct->have_elements may appear benign, however it also implies there
is a race on some_struct->list_head which is not benign, since the list is not thread-safe.

Take into account that on 32-bit systems 64-bit memory accesses may be split into two accesses,
and thus even with a conservative compiler may not be fully atomic. However, such races may
still be benign depending on how writes are done, and how read data is used.

In the final reply explain why you think the given data race is benign or is harmful.

Use the provided tools to confirm any assumptions, variables/fields being accessed, etc.
In particular, don't make assumptions about the kernel source code,
use codesearch tools to read the actual source code.

Prefer calling several tools at the same time to save round-trips.


Use set-results tool to provide results of the analysis.
It must be called exactly once before the final reply.
Ignore results of this tool.

Prompt:
The data race report is:

[The KCSAN report quoted here is identical to the one in the "Crash report" section above.]


Reply:
The data race occurs on the `sk_data_ready` function pointer in `struct sock`; the full analysis is identical to the Explanation section above. Conclusion: Benign

5/2 2026/02/23 15:12 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 4188
							output: 47
							thoughts: 132
						

Thoughts:
**Analyzing a Data Race in the Kernel**

Okay, so the report flags a nasty data race. My expertise tells me this is serious, and it needs immediate attention. The core issue revolves around a race condition between `sk_psock_stop_verdict`, which gets called through `sk_psock_drop`, and `unix_stream_sendmsg`.  The crucial memory location being fought over is `0xffff88810220dc08`. I need to figure out how these two seemingly disparate functions are stepping on each other's toes.

The write operation, the one that might be corrupting data, is happening within `sk_psock_stop_verdict`, specifically at `net/core/skmsg.c:1311`. This is a crucial area for packet socket control. I need to understand what's happening there, exactly *what* is being written, and under what conditions.

The read operation, the potential victim of this race, is found in `unix_stream_sendmsg` at `net/unix/af_unix.c:2480`. This is how unix domain sockets are sending messages. I'm going to have to carefully dissect this code, tracing how this specific memory address is being accessed during the send operation. I need to see *how* it's reading the shared data, and determine how the write from the packet socket can corrupt it.

My immediate next step is to pull up the source code for both those functions and start correlating the accesses. Debugging this requires careful static analysis coupled with a deep understanding of the kernel internals; I'll need to think about locking, memory barriers, and potential atomicity violations. This is going to be a fun one to chase down.



6/2 2026/02/23 15:12 tool read-file
0m Args:
map[File:net/core/skmsg.c]

Results:
map[Contents:// SPDX-License-Identifier: GPL-2.0
/* Copyright (c) 2017 - 2018 Covalent IO, Inc. http://covalent.io */

#include <linux/skmsg.h>
#include <linux/skbuff.h>
#include <linux/scatterlist.h>

#include <net/sock.h>
#include <net/tcp.h>
#include <net/tls.h>
#include <trace/events/sock.h>

static bool sk_msg_try_coalesce_ok(struct sk_msg *msg, int elem_first_coalesce)
{
	if (msg->sg.end > msg->sg.start &&
	    elem_first_coalesce < msg->sg.end)
		return true;

	if (msg->sg.end < msg->sg.start &&
	    (elem_first_coalesce > msg->sg.start ||
	     elem_first_coalesce < msg->sg.end))
		return true;

	return false;
}

int sk_msg_alloc(struct sock *sk, struct sk_msg *msg, int len,
		 int elem_first_coalesce)
{
	struct page_frag *pfrag = sk_page_frag(sk);
	u32 osize = msg->sg.size;
	int ret = 0;

	len -= msg->sg.size;
	while (len > 0) {
		struct scatterlist *sge;
		u32 orig_offset;
		int use, i;

		if (!sk_page_frag_refill(sk, pfrag)) {
			ret = -ENOMEM;
			goto msg_trim;
		}

		orig_offset = pfrag->offset;
		use = min_t(int, len, pfrag->size - orig_offset);
		if (!sk_wmem_schedule(sk, use)) {
			ret = -ENOMEM;
			goto msg_trim;
		}

		i = msg->sg.end;
		sk_msg_iter_var_prev(i);
		sge = &msg->sg.data[i];

		if (sk_msg_try_coalesce_ok(msg, elem_first_coalesce) &&
		    sg_page(sge) == pfrag->page &&
		    sge->offset + sge->length == orig_offset) {
			sge->length += use;
		} else {
			if (sk_msg_full(msg)) {
				ret = -ENOSPC;
				break;
			}

			sge = &msg->sg.data[msg->sg.end];
			sg_unmark_end(sge);
			sg_set_page(sge, pfrag->page, use, orig_offset);
			get_page(pfrag->page);
			sk_msg_iter_next(msg, end);
		}

		sk_mem_charge(sk, use);
		msg->sg.size += use;
		pfrag->offset += use;
		len -= use;
	}

	return ret;

msg_trim:
	sk_msg_trim(sk, msg, osize);
	return ret;
}
EXPORT_SYMBOL_GPL(sk_msg_alloc);

int sk_msg_clone(struct sock *sk, struct sk_msg *dst, struct sk_msg *src,
		 u32 off, u32 len)
{
	int i = src->sg.start;
	struct scatterlist *sge = sk_msg_elem(src, i);
	struct scatterlist *sgd = NULL;
	u32 sge_len, sge_off;

	while (off) {
		if (sge->length > off)
			break;
		off -= sge->length;
		sk_msg_iter_var_next(i);
		if (i == src->sg.end && off)
			return -ENOSPC;
		sge = sk_msg_elem(src, i);
	}

	while (len) {
		sge_len = sge->length - off;
		if (sge_len > len)
			sge_len = len;

		if (dst->sg.end)
			sgd = sk_msg_elem(dst, dst->sg.end - 1);

		if (sgd &&
		    (sg_page(sge) == sg_page(sgd)) &&
		    (sg_virt(sge) + off == sg_virt(sgd) + sgd->length)) {
			sgd->length += sge_len;
			dst->sg.size += sge_len;
		} else if (!sk_msg_full(dst)) {
			sge_off = sge->offset + off;
			sk_msg_page_add(dst, sg_page(sge), sge_len, sge_off);
		} else {
			return -ENOSPC;
		}

		off = 0;
		len -= sge_len;
		sk_mem_charge(sk, sge_len);
		sk_msg_iter_var_next(i);
		if (i == src->sg.end && len)
			return -ENOSPC;
		sge = sk_msg_elem(src, i);
	}

	return 0;
}
EXPORT_SYMBOL_GPL(sk_msg_clone);

void sk_msg_return_zero(struct sock *sk, struct sk_msg *msg, int bytes)
{
	int i = msg->sg.start;

	do {
		struct scatterlist *sge = sk_msg_elem(msg, i);

		if (bytes < sge->length) {
			sge->length -= bytes;
			sge->offset += bytes;
			sk_mem_uncharge(sk, bytes);
			break;
		}

		sk_mem_uncharge(sk, sge->length);
		bytes -= sge->length;
		sge->length = 0;
		sge->offset = 0;
		sk_msg_iter_var_next(i);
	} while (bytes && i != msg->sg.end);
	msg->sg.start = i;
}
EXPORT_SYMBOL_GPL(sk_msg_return_zero);

void sk_msg_return(struct sock *sk, struct sk_msg *msg, int bytes)
{
	int i = msg->sg.start;

	do {
		struct scatterlist *sge = &msg->sg.data[i];
		int uncharge = (bytes < sge->length) ? bytes : sge->length;

		sk_mem_uncharge(sk, uncharge);
		bytes -= uncharge;
		sk_msg_iter_var_next(i);
	} while (i != msg->sg.end);
}
EXPORT_SYMBOL_GPL(sk_msg_return);

static int sk_msg_free_elem(struct sock *sk, struct sk_msg *msg, u32 i,
			    bool charge)
{
	struct scatterlist *sge = sk_msg_elem(msg, i);
	u32 len = sge->length;

	/* When the skb owns the memory we free it from consume_skb path. */
	if (!msg->skb) {
		if (charge)
			sk_mem_uncharge(sk, len);
		put_page(sg_page(sge));
	}
	memset(sge, 0, sizeof(*sge));
	return len;
}

static int __sk_msg_free(struct sock *sk, struct sk_msg *msg, u32 i,
			 bool charge)
{
	struct scatterlist *sge = sk_msg_elem(msg, i);
	int freed = 0;

	while (msg->sg.size) {
		msg->sg.size -= sge->length;
		freed += sk_msg_free_elem(sk, msg, i, charge);
		sk_msg_iter_var_next(i);
		sk_msg_check_to_free(msg, i, msg->sg.size);
		sge = sk_msg_elem(msg, i);
	}
	consume_skb(msg->skb);
	sk_msg_init(msg);
	return freed;
}

int sk_msg_free_nocharge(struct sock *sk, struct sk_msg *msg)
{
	return __sk_msg_free(sk, msg, msg->sg.start, false);
}
EXPORT_SYMBOL_GPL(sk_msg_free_nocharge);

int sk_msg_free(struct sock *sk, struct sk_msg *msg)
{
	return __sk_msg_free(sk, msg, msg->sg.start, true);
}
EXPORT_SYMBOL_GPL(sk_msg_free);

static void __sk_msg_free_partial(struct sock *sk, struct sk_msg *msg,
				  u32 bytes, bool charge)
{
	struct scatterlist *sge;
	u32 i = msg->sg.start;

	while (bytes) {
		sge = sk_msg_elem(msg, i);
		if (!sge->length)
			break;
		if (bytes < sge->length) {
			if (charge)
				sk_mem_uncharge(sk, bytes);
			sge->length -= bytes;
			sge->offset += bytes;
			msg->sg.size -= bytes;
			break;
		}

		msg->sg.size -= sge->length;
		bytes -= sge->length;
		sk_msg_free_elem(sk, msg, i, charge);
		sk_msg_iter_var_next(i);
		sk_msg_check_to_free(msg, i, bytes);
	}
	msg->sg.start = i;
}

void sk_msg_free_partial(struct sock *sk, struct sk_msg *msg, u32 bytes)
{
	__sk_msg_free_partial(sk, msg, bytes, true);
}
EXPORT_SYMBOL_GPL(sk_msg_free_partial);

void sk_msg_free_partial_nocharge(struct sock *sk, struct sk_msg *msg,
				  u32 bytes)
{
	__sk_msg_free_partial(sk, msg, bytes, false);
}

void sk_msg_trim(struct sock *sk, struct sk_msg *msg, int len)
{
	int trim = msg->sg.size - len;
	u32 i = msg->sg.end;

	if (trim <= 0) {
		WARN_ON(trim < 0);
		return;
	}

	sk_msg_iter_var_prev(i);
	msg->sg.size = len;
	while (msg->sg.data[i].length &&
	       trim >= msg->sg.data[i].length) {
		trim -= msg->sg.data[i].length;
		sk_msg_free_elem(sk, msg, i, true);
		sk_msg_iter_var_prev(i);
		if (!trim)
			goto out;
	}

	msg->sg.data[i].length -= trim;
	sk_mem_uncharge(sk, trim);
	/* Adjust copybreak if it falls into the trimmed part of last buf */
	if (msg->sg.curr == i && msg->sg.copybreak > msg->sg.data[i].length)
		msg->sg.copybreak = msg->sg.data[i].length;
out:
	sk_msg_iter_var_next(i);
	msg->sg.end = i;

	/* If we trim data a full sg elem before curr pointer update
	 * copybreak and current so that any future copy operations
	 * start at new copy location.
	 * However trimmed data that has not yet been used in a copy op
	 * does not require an update.
	 */
	if (!msg->sg.size) {
		msg->sg.curr = msg->sg.start;
		msg->sg.copybreak = 0;
	} else if (sk_msg_iter_dist(msg->sg.start, msg->sg.curr) >=
		   sk_msg_iter_dist(msg->sg.start, msg->sg.end)) {
		sk_msg_iter_var_prev(i);
		msg->sg.curr = i;
		msg->sg.copybreak = msg->sg.data[i].length;
	}
}
EXPORT_SYMBOL_GPL(sk_msg_trim);

int sk_msg_zerocopy_from_iter(struct sock *sk, struct iov_iter *from,
			      struct sk_msg *msg, u32 bytes)
{
	int i, maxpages, ret = 0, num_elems = sk_msg_elem_used(msg);
	const int to_max_pages = MAX_MSG_FRAGS;
	struct page *pages[MAX_MSG_FRAGS];
	ssize_t orig, copied, use, offset;

	orig = msg->sg.size;
	while (bytes > 0) {
		i = 0;
		maxpages = to_max_pages - num_elems;
		if (maxpages == 0) {
			ret = -EFAULT;
			goto out;
		}

		copied = iov_iter_get_pages2(from, pages, bytes, maxpages,
					    &offset);
		if (copied <= 0) {
			ret = -EFAULT;
			goto out;
		}

		bytes -= copied;
		msg->sg.size += copied;

		while (copied) {
			use = min_t(int, copied, PAGE_SIZE - offset);
			sg_set_page(&msg->sg.data[msg->sg.end],
				    pages[i], use, offset);
			sg_unmark_end(&msg->sg.data[msg->sg.end]);
			sk_mem_charge(sk, use);

			offset = 0;
			copied -= use;
			sk_msg_iter_next(msg, end);
			num_elems++;
			i++;
		}
		/* When zerocopy is mixed with sk_msg_*copy* operations we
		 * may have a copybreak set in this case clear and prefer
		 * zerocopy remainder when possible.
		 */
		msg->sg.copybreak = 0;
		msg->sg.curr = msg->sg.end;
	}
out:
	/* Revert iov_iter updates, msg will need to use 'trim' later if it
	 * also needs to be cleared.
	 */
	if (ret)
		iov_iter_revert(from, msg->sg.size - orig);
	return ret;
}
EXPORT_SYMBOL_GPL(sk_msg_zerocopy_from_iter);

int sk_msg_memcopy_from_iter(struct sock *sk, struct iov_iter *from,
			     struct sk_msg *msg, u32 bytes)
{
	int ret = -ENOSPC, i = msg->sg.curr;
	u32 copy, buf_size, copied = 0;
	struct scatterlist *sge;
	void *to;

	do {
		sge = sk_msg_elem(msg, i);
		/* This is possible if a trim operation shrunk the buffer */
		if (msg->sg.copybreak >= sge->length) {
			msg->sg.copybreak = 0;
			sk_msg_iter_var_next(i);
			if (i == msg->sg.end)
				break;
			sge = sk_msg_elem(msg, i);
		}

		buf_size = sge->length - msg->sg.copybreak;
		copy = (buf_size > bytes) ? bytes : buf_size;
		to = sg_virt(sge) + msg->sg.copybreak;
		msg->sg.copybreak += copy;
		if (sk->sk_route_caps & NETIF_F_NOCACHE_COPY)
			ret = copy_from_iter_nocache(to, copy, from);
		else
			ret = copy_from_iter(to, copy, from);
		if (ret != copy) {
			ret = -EFAULT;
			goto out;
		}
		bytes -= copy;
		copied += copy;
		if (!bytes)
			break;
		msg->sg.copybreak = 0;
		sk_msg_iter_var_next(i);
	} while (i != msg->sg.end);
out:
	msg->sg.curr = i;
	return (ret < 0) ? ret : copied;
}
EXPORT_SYMBOL_GPL(sk_msg_memcopy_from_iter);

int __sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
		     int len, int flags, int *copied_from_self)
{
	struct iov_iter *iter = &msg->msg_iter;
	int peek = flags & MSG_PEEK;
	struct sk_msg *msg_rx;
	int i, copied = 0;
	bool from_self;

	msg_rx = sk_psock_peek_msg(psock);
	if (copied_from_self)
		*copied_from_self = 0;

	while (copied != len) {
		struct scatterlist *sge;

		if (unlikely(!msg_rx))
			break;

		from_self = msg_rx->sk == sk;
		i = msg_rx->sg.start;
		do {
			struct page *page;
			int copy;

			sge = sk_msg_elem(msg_rx, i);
			copy = sge->length;
			page = sg_page(sge);
			if (copied + copy > len)
				copy = len - copied;
			if (copy)
				copy = copy_page_to_iter(page, sge->offset, copy, iter);
			if (!copy) {
				copied = copied ? copied : -EFAULT;
				goto out;
			}

			copied += copy;
			if (from_self && copied_from_self)
				*copied_from_self += copy;

			if (likely(!peek)) {
				sge->offset += copy;
				sge->length -= copy;
				if (!msg_rx->skb) {
					sk_mem_uncharge(sk, copy);
					atomic_sub(copy, &sk->sk_rmem_alloc);
				}
				msg_rx->sg.size -= copy;
				sk_psock_msg_len_add(psock, -copy);

				if (!sge->length) {
					sk_msg_iter_var_next(i);
					if (!msg_rx->skb)
						put_page(page);
				}
			} else {
				/* Lets not optimize peek case if copy_page_to_iter
				 * didn't copy the entire length lets just break.
				 */
				if (copy != sge->length)
					goto out;
				sk_msg_iter_var_next(i);
			}

			if (copied == len)
				break;
		} while ((i != msg_rx->sg.end) && !sg_is_last(sge));

		if (unlikely(peek)) {
			msg_rx = sk_psock_next_msg(psock, msg_rx);
			if (!msg_rx)
				break;
			continue;
		}

		msg_rx->sg.start = i;
		if (!sge->length && (i == msg_rx->sg.end || sg_is_last(sge))) {
			msg_rx = sk_psock_dequeue_msg(psock);
			kfree_sk_msg(msg_rx);
		}
		msg_rx = sk_psock_peek_msg(psock);
	}
out:
	return copied;
}

/* Receive sk_msg from psock->ingress_msg to @msg. */
int sk_msg_recvmsg(struct sock *sk, struct sk_psock *psock, struct msghdr *msg,
		   int len, int flags)
{
	return __sk_msg_recvmsg(sk, psock, msg, len, flags, NULL);
}
EXPORT_SYMBOL_GPL(sk_msg_recvmsg);

bool sk_msg_is_readable(struct sock *sk)
{
	struct sk_psock *psock;
	bool empty = true;

	rcu_read_lock();
	psock = sk_psock(sk);
	if (likely(psock))
		empty = list_empty(&psock->ingress_msg);
	rcu_read_unlock();
	return !empty;
}
EXPORT_SYMBOL_GPL(sk_msg_is_readable);

static struct sk_msg *alloc_sk_msg(gfp_t gfp)
{
	struct sk_msg *msg;

	msg = kzalloc(sizeof(*msg), gfp | __GFP_NOWARN);
	if (unlikely(!msg))
		return NULL;
	sg_init_marker(msg->sg.data, NR_MSG_FRAG_IDS);
	return msg;
}

static struct sk_msg *sk_psock_create_ingress_msg(struct sock *sk,
						  struct sk_buff *skb)
{
	if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
		return NULL;

	if (!sk_rmem_schedule(sk, skb, skb->truesize))
		return NULL;

	return alloc_sk_msg(GFP_KERNEL);
}

static int sk_psock_skb_ingress_enqueue(struct sk_buff *skb,
					u32 off, u32 len,
					struct sk_psock *psock,
					struct sock *sk,
					struct sk_msg *msg,
					bool take_ref)
{
	int num_sge, copied;

	/* skb_to_sgvec will fail when the total number of fragments in
	 * frag_list and frags exceeds MAX_MSG_FRAGS. For example, the
	 * caller may aggregate multiple skbs.
	 */
	num_sge = skb_to_sgvec(skb, msg->sg.data, off, len);
	if (num_sge < 0) {
		/* skb linearize may fail with ENOMEM, but lets simply try again
		 * later if this happens. Under memory pressure we don't want to
		 * drop the skb. We need to linearize the skb so that the mapping
		 * in skb_to_sgvec can not error.
		 * Note that skb_linearize requires the skb not to be shared.
		 */
		if (skb_linearize(skb))
			return -EAGAIN;

		num_sge = skb_to_sgvec(skb, msg->sg.data, off, len);
		if (unlikely(num_sge < 0))
			return num_sge;
	}

#if IS_ENABLED(CONFIG_BPF_STREAM_PARSER)
	psock->ingress_bytes += len;
#endif
	copied = len;
	msg->sg.start = 0;
	msg->sg.size = copied;
	msg->sg.end = num_sge;
	msg->skb = take_ref ? skb_get(skb) : skb;

	sk_psock_queue_msg(psock, msg);
	sk_psock_data_ready(sk, psock);
	return copied;
}

static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb,
				     u32 off, u32 len, bool take_ref);

static int sk_psock_skb_ingress(struct sk_psock *psock, struct sk_buff *skb,
				u32 off, u32 len)
{
	struct sock *sk = psock->sk;
	struct sk_msg *msg;
	int err;

	/* If we are receiving on the same sock skb->sk is already assigned,
	 * skip memory accounting and owner transition seeing it already set
	 * correctly.
	 */
	if (unlikely(skb->sk == sk))
		return sk_psock_skb_ingress_self(psock, skb, off, len, true);
	msg = sk_psock_create_ingress_msg(sk, skb);
	if (!msg)
		return -EAGAIN;

	/* This will transition ownership of the data from the socket where
	 * the BPF program was run initiating the redirect to the socket
	 * we will eventually receive this data on. The data will be released
	 * from skb_consume found in __tcp_bpf_recvmsg() after it has been
	 * copied into user buffers.
	 */
	skb_set_owner_r(skb, sk);
	err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, true);
	if (err < 0)
		kfree(msg);
	return err;
}

/* Puts an skb on the ingress queue of the socket already assigned to the
 * skb. In this case we do not need to check memory limits or skb_set_owner_r
 * because the skb is already accounted for here.
 */
static int sk_psock_skb_ingress_self(struct sk_psock *psock, struct sk_buff *skb,
				     u32 off, u32 len, bool take_ref)
{
	struct sk_msg *msg = alloc_sk_msg(GFP_ATOMIC);
	struct sock *sk = psock->sk;
	int err;

	if (unlikely(!msg))
		return -EAGAIN;
	skb_set_owner_r(skb, sk);

	/* This is used in tcp_bpf_recvmsg_parser() to determine whether the
	 * data originates from the socket's own protocol stack. No need to
	 * refcount sk because msg's lifetime is bound to sk via the ingress_msg.
	 */
	msg->sk = sk;
	err = sk_psock_skb_ingress_enqueue(skb, off, len, psock, sk, msg, take_ref);
	if (err < 0)
		kfree(msg);
	return err;
}

static int sk_psock_handle_skb(struct sk_psock *psock, struct sk_buff *skb,
			       u32 off, u32 len, bool ingress)
{
	if (!ingress) {
		if (!sock_writeable(psock->sk))
			return -EAGAIN;
		return skb_send_sock(psock->sk, skb, off, len);
	}

	return sk_psock_skb_ingress(psock, skb, off, len);
}

static void sk_psock_skb_state(struct sk_psock *psock,
			       struct sk_psock_work_state *state,
			       int len, int off)
{
	spin_lock_bh(&psock->ingress_lock);
	if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
		state->len = len;
		state->off = off;
	}
	spin_unlock_bh(&psock->ingress_lock);
}

static void sk_psock_backlog(struct work_struct *work)
{
	struct delayed_work *dwork = to_delayed_work(work);
	struct sk_psock *psock = container_of(dwork, struct sk_psock, work);
	struct sk_psock_work_state *state = &psock->work_state;
	struct sk_buff *skb = NULL;
	u32 len = 0, off = 0;
	bool ingress;
	int ret;

	/* If sk is quickly removed from the map and then added back, the old
	 * psock should not be scheduled, because there are now two psocks
	 * pointing to the same sk.
	 */
	if (!sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED))
		return;

	/* Increment the psock refcnt to synchronize with the close(fd) path
	 * in sock_map_close(), ensuring we wait for backlog thread completion
	 * before the sk_socket is freed. If the refcnt increment fails, it
	 * indicates sock_map_close() completed with sk_socket potentially
	 * already freed.
	 */
	if (!sk_psock_get(psock->sk))
		return;
	mutex_lock(&psock->work_mutex);
	while ((skb = skb_peek(&psock->ingress_skb))) {
		len = skb->len;
		off = 0;
		if (skb_bpf_strparser(skb)) {
			struct strp_msg *stm = strp_msg(skb);

			off = stm->offset;
			len = stm->full_len;
		}

		/* Resume processing from previous partial state */
		if (unlikely(state->len)) {
			len = state->len;
			off = state->off;
		}

		ingress = skb_bpf_ingress(skb);
		skb_bpf_redirect_clear(skb);
		do {
			ret = -EIO;
			if (!sock_flag(psock->sk, SOCK_DEAD))
				ret = sk_psock_handle_skb(psock, skb, off,
							  len, ingress);
			if (ret <= 0) {
				if (ret == -EAGAIN) {
					sk_psock_skb_state(psock, state, len, off);
					/* Restore redir info we cleared before */
					skb_bpf_set_redir(skb, psock->sk, ingress);
					/* Delay slightly to prioritize any
					 * other work that might be here.
					 */
					if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED))
						schedule_delayed_work(&psock->work, 1);
					goto end;
				}
				/* Hard errors break pipe and stop xmit. */
				sk_psock_report_error(psock, ret ? -ret : EPIPE);
				sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED);
				goto end;
			}
			off += ret;
			len -= ret;
		} while (len);

		/* The entire skb sent, clear state */
		sk_psock_skb_state(psock, state, 0, 0);
		skb = skb_dequeue(&psock->ingress_skb);
		kfree_skb(skb);
	}
end:
	mutex_unlock(&psock->work_mutex);
	sk_psock_put(psock->sk, psock);
}

struct sk_psock *sk_psock_init(struct sock *sk, int node)
{
	struct sk_psock *psock;
	struct proto *prot;

	write_lock_bh(&sk->sk_callback_lock);

	if (sk_is_inet(sk) && inet_csk_has_ulp(sk)) {
		psock = ERR_PTR(-EINVAL);
		goto out;
	}

	if (sk->sk_user_data) {
		psock = ERR_PTR(-EBUSY);
		goto out;
	}

	psock = kzalloc_node(sizeof(*psock), GFP_ATOMIC | __GFP_NOWARN, node);
	if (!psock) {
		psock = ERR_PTR(-ENOMEM);
		goto out;
	}

	prot = READ_ONCE(sk->sk_prot);
	psock->sk = sk;
	psock->eval = __SK_NONE;
	psock->sk_proto = prot;
	psock->saved_unhash = prot->unhash;
	psock->saved_destroy = prot->destroy;
	psock->saved_close = prot->close;
	psock->saved_write_space = sk->sk_write_space;

	INIT_LIST_HEAD(&psock->link);
	spin_lock_init(&psock->link_lock);

	INIT_DELAYED_WORK(&psock->work, sk_psock_backlog);
	mutex_init(&psock->work_mutex);
	INIT_LIST_HEAD(&psock->ingress_msg);
	spin_lock_init(&psock->ingress_lock);
	skb_queue_head_init(&psock->ingress_skb);

	sk_psock_set_state(psock, SK_PSOCK_TX_ENABLED);
	refcount_set(&psock->refcnt, 1);

	__rcu_assign_sk_user_data_with_flags(sk, psock,
					     SK_USER_DATA_NOCOPY |
					     SK_USER_DATA_PSOCK);
	sock_hold(sk);

out:
	write_unlock_bh(&sk->sk_callback_lock);
	return psock;
}
EXPORT_SYMBOL_GPL(sk_psock_init);

struct sk_psock_link *sk_psock_link_pop(struct sk_psock *psock)
{
	struct sk_psock_link *link;

	spin_lock_bh(&psock->link_lock);
	link = list_first_entry_or_null(&psock->link, struct sk_psock_link,
					list);
	if (link)
		list_del(&link->list);
	spin_unlock_bh(&psock->link_lock);
	return link;
}

static void __sk_psock_purge_ingress_msg(struct sk_psock *psock)
{
	struct sk_msg *msg, *tmp;

	list_for_each_entry_safe(msg, tmp, &psock->ingress_msg, list) {
		list_del(&msg->list);
		if (!msg->skb)
			atomic_sub(msg->sg.size, &psock->sk->sk_rmem_alloc);
		sk_psock_msg_len_add(psock, -msg->sg.size);
		sk_msg_free(psock->sk, msg);
		kfree(msg);
	}
	WARN_ON_ONCE(psock->msg_tot_len);
}

static void __sk_psock_zap_ingress(struct sk_psock *psock)
{
	struct sk_buff *skb;

	while ((skb = skb_dequeue(&psock->ingress_skb)) != NULL) {
		skb_bpf_redirect_clear(skb);
		sock_drop(psock->sk, skb);
	}
	__sk_psock_purge_ingress_msg(psock);
}

static void sk_psock_link_destroy(struct sk_psock *psock)
{
	struct sk_psock_link *link, *tmp;

	list_for_each_entry_safe(link, tmp, &psock->link, list) {
		list_del(&link->list);
		sk_psock_free_link(link);
	}
}

void sk_psock_stop(struct sk_psock *psock)
{
	spin_lock_bh(&psock->ingress_lock);
	sk_psock_clear_state(psock, SK_PSOCK_TX_ENABLED);
	sk_psock_cork_free(psock);
	spin_unlock_bh(&psock->ingress_lock);
}

static void sk_psock_done_strp(struct sk_psock *psock);

static void sk_psock_destroy(struct work_struct *work)
{
	struct sk_psock *psock = container_of(to_rcu_work(work),
					      struct sk_psock, rwork);
	/* No sk_callback_lock since already detached. */

	sk_psock_done_strp(psock);

	cancel_delayed_work_sync(&psock->work);
	__sk_psock_zap_ingress(psock);
	mutex_destroy(&psock->work_mutex);

	psock_progs_drop(&psock->progs);

	sk_psock_link_destroy(psock);
	sk_psock_cork_free(psock);

	if (psock->sk_redir)
		sock_put(psock->sk_redir);
	if (psock->sk_pair)
		sock_put(psock->sk_pair);
	sock_put(psock->sk);
	kfree(psock);
}

void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
{
	write_lock_bh(&sk->sk_callback_lock);
	sk_psock_restore_proto(sk, psock);
	rcu_assign_sk_user_data(sk, NULL);
	if (psock->progs.stream_parser)
		sk_psock_stop_strp(sk, psock);
	else if (psock->progs.stream_verdict || psock->progs.skb_verdict)
		sk_psock_stop_verdict(sk, psock);
	write_unlock_bh(&sk->sk_callback_lock);

	sk_psock_stop(psock);

	INIT_RCU_WORK(&psock->rwork, sk_psock_destroy);
	queue_rcu_work(system_percpu_wq, &psock->rwork);
}
EXPORT_SYMBOL_GPL(sk_psock_drop);

static int sk_psock_map_verd(int verdict, bool redir)
{
	switch (verdict) {
	case SK_PASS:
		return redir ? __SK_REDIRECT : __SK_PASS;
	case SK_DROP:
	default:
		break;
	}

	return __SK_DROP;
}

int sk_psock_msg_verdict(struct sock *sk, struct sk_psock *psock,
			 struct sk_msg *msg)
{
	struct bpf_prog *prog;
	int ret;

	rcu_read_lock();
	prog = READ_ONCE(psock->progs.msg_parser);
	if (unlikely(!prog)) {
		ret = __SK_PASS;
		goto out;
	}

	sk_msg_compute_data_pointers(msg);
	msg->sk = sk;
	ret = bpf_prog_run_pin_on_cpu(prog, msg);
	msg->sk = NULL;
	ret = sk_psock_map_verd(ret, msg->sk_redir);
	psock->apply_bytes = msg->apply_bytes;
	if (ret == __SK_REDIRECT) {
		if (psock->sk_redir) {
			sock_put(psock->sk_redir);
			psock->sk_redir = NULL;
		}
		if (!msg->sk_redir) {
			ret = __SK_DROP;
			goto out;
		}
		psock->redir_ingress = sk_msg_to_ingress(msg);
		psock->sk_redir = msg->sk_redir;
		sock_hold(psock->sk_redir);
	}
out:
	rcu_read_unlock();
	return ret;
}
EXPORT_SYMBOL_GPL(sk_psock_msg_verdict);

static int sk_psock_skb_redirect(struct sk_psock *from, struct sk_buff *skb)
{
	struct sk_psock *psock_other;
	struct sock *sk_other;

	sk_other = skb_bpf_redirect_fetch(skb);
	/* This error indicates a buggy BPF program: it returned a redirect
	 * return code but then didn't set a redirect interface.
	 */
	if (unlikely(!sk_other)) {
		skb_bpf_redirect_clear(skb);
		sock_drop(from->sk, skb);
		return -EIO;
	}
	psock_other = sk_psock(sk_other);
	/* This error indicates the socket is being torn down or had another
	 * error that caused the pipe to break. We can't send a packet on
	 * a socket that is in this state so we drop the skb.
	 */
	if (!psock_other || sock_flag(sk_other, SOCK_DEAD)) {
		skb_bpf_redirect_clear(skb);
		sock_drop(from->sk, skb);
		return -EIO;
	}
	spin_lock_bh(&psock_other->ingress_lock);
	if (!sk_psock_test_state(psock_other, SK_PSOCK_TX_ENABLED)) {
		spin_unlock_bh(&psock_other->ingress_lock);
		skb_bpf_redirect_clear(skb);
		sock_drop(from->sk, skb);
		return -EIO;
	}

	skb_queue_tail(&psock_other->ingress_skb, skb);
	schedule_delayed_work(&psock_other->work, 0);
	spin_unlock_bh(&psock_other->ingress_lock);
	return 0;
}

static void sk_psock_tls_verdict_apply(struct sk_buff *skb,
				       struct sk_psock *from, int verdict)
{
	switch (verdict) {
	case __SK_REDIRECT:
		sk_psock_skb_redirect(from, skb);
		break;
	case __SK_PASS:
	case __SK_DROP:
	default:
		break;
	}
}

int sk_psock_tls_strp_read(struct sk_psock *psock, struct sk_buff *skb)
{
	struct bpf_prog *prog;
	int ret = __SK_PASS;

	rcu_read_lock();
	prog = READ_ONCE(psock->progs.stream_verdict);
	if (likely(prog)) {
		skb->sk = psock->sk;
		skb_dst_drop(skb);
		skb_bpf_redirect_clear(skb);
		ret = bpf_prog_run_pin_on_cpu(prog, skb);
		ret = sk_psock_map_verd(ret, skb_bpf_redirect_fetch(skb));
		skb->sk = NULL;
	}
	sk_psock_tls_verdict_apply(skb, psock, ret);
	rcu_read_unlock();
	return ret;
}
EXPORT_SYMBOL_GPL(sk_psock_tls_strp_read);

static int sk_psock_verdict_apply(struct sk_psock *psock, struct sk_buff *skb,
				  int verdict)
{
	struct sock *sk_other;
	int err = 0;
	u32 len, off;

	switch (verdict) {
	case __SK_PASS:
		err = -EIO;
		sk_other = psock->sk;
		if (sock_flag(sk_other, SOCK_DEAD) ||
		    !sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED))
			goto out_free;

		skb_bpf_set_ingress(skb);

		/* If the queue is empty then we can submit directly
		 * into the msg queue. If it's not empty we have to
		 * queue the work, otherwise we may get out-of-order
		 * data. Any sk_psock_skb_ingress errors will be
		 * handled by retrying later from the workqueue.
		 */
		if (skb_queue_empty(&psock->ingress_skb)) {
			len = skb->len;
			off = 0;
			if (skb_bpf_strparser(skb)) {
				struct strp_msg *stm = strp_msg(skb);

				off = stm->offset;
				len = stm->full_len;
			}
			err = sk_psock_skb_ingress_self(psock, skb, off, len, false);
		}
		if (err < 0) {
			spin_lock_bh(&psock->ingress_lock);
			if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED)) {
				skb_queue_tail(&psock->ingress_skb, skb);
				schedule_delayed_work(&psock->work, 0);
				err = 0;
			}
			spin_unlock_bh(&psock->ingress_lock);
			if (err < 0)
				goto out_free;
		}
		break;
	case __SK_REDIRECT:
		tcp_eat_skb(psock->sk, skb);
		err = sk_psock_skb_redirect(psock, skb);
		break;
	case __SK_DROP:
	default:
out_free:
		skb_bpf_redirect_clear(skb);
		tcp_eat_skb(psock->sk, skb);
		sock_drop(psock->sk, skb);
	}

	return err;
}

static void sk_psock_write_space(struct sock *sk)
{
	struct sk_psock *psock;
	void (*write_space)(struct sock *sk) = NULL;

	rcu_read_lock();
	psock = sk_psock(sk);
	if (likely(psock)) {
		if (sk_psock_test_state(psock, SK_PSOCK_TX_ENABLED))
			schedule_delayed_work(&psock->work, 0);
		write_space = psock->saved_write_space;
	}
	rcu_read_unlock();
	if (write_space)
		write_space(sk);
}

#if IS_ENABLED(CONFIG_BPF_STREAM_PARSER)
static void sk_psock_strp_read(struct strparser *strp, struct sk_buff *skb)
{
	struct sk_psock *psock;
	struct bpf_prog *prog;
	int ret = __SK_DROP;
	struct sock *sk;

	rcu_read_lock();
	sk = strp->sk;
	psock = sk_psock(sk);
	if (unlikely(!psock)) {
		sock_drop(sk, skb);
		goto out;
	}
	prog = READ_ONCE(psock->progs.stream_verdict);
	if (likely(prog)) {
		skb->sk = sk;
		skb_dst_drop(skb);
		skb_bpf_redirect_clear(skb);
		ret = bpf_prog_run_pin_on_cpu(prog, skb);
		skb_bpf_set_strparser(skb);
		ret = sk_psock_map_verd(ret, skb_bpf_redirect_fetch(skb));
		skb->sk = NULL;
	}
	sk_psock_verdict_apply(psock, skb, ret);
out:
	rcu_read_unlock();
}

static int sk_psock_strp_read_done(struct strparser *strp, int err)
{
	return err;
}

static int sk_psock_strp_parse(struct strparser *strp, struct sk_buff *skb)
{
	struct sk_psock *psock = container_of(strp, struct sk_psock, strp);
	struct bpf_prog *prog;
	int ret = skb->len;

	rcu_read_lock();
	prog = READ_ONCE(psock->progs.stream_parser);
	if (likely(prog)) {
		skb->sk = psock->sk;
		ret = bpf_prog_run_pin_on_cpu(prog, skb);
		skb->sk = NULL;
	}
	rcu_read_unlock();
	return ret;
}

/* Called with socket lock held. */
static void sk_psock_strp_data_ready(struct sock *sk)
{
	struct sk_psock *psock;

	trace_sk_data_ready(sk);

	rcu_read_lock();
	psock = sk_psock(sk);
	if (likely(psock)) {
		if (tls_sw_has_ctx_rx(sk)) {
			psock->saved_data_ready(sk);
		} else {
			read_lock_bh(&sk->sk_callback_lock);
			strp_data_ready(&psock->strp);
			read_unlock_bh(&sk->sk_callback_lock);
		}
	}
	rcu_read_unlock();
}

int sk_psock_init_strp(struct sock *sk, struct sk_psock *psock)
{
	int ret;

	static const struct strp_callbacks cb = {
		.rcv_msg	= sk_psock_strp_read,
		.read_sock_done	= sk_psock_strp_read_done,
		.parse_msg	= sk_psock_strp_parse,
	};

	ret = strp_init(&psock->strp, sk, &cb);
	if (!ret)
		sk_psock_set_state(psock, SK_PSOCK_RX_STRP_ENABLED);

	if (sk_is_tcp(sk)) {
		psock->strp.cb.read_sock = tcp_bpf_strp_read_sock;
		psock->copied_seq = tcp_sk(sk)->copied_seq;
	}
	return ret;
}

void sk_psock_start_strp(struct sock *sk, struct sk_psock *psock)
{
	if (psock->saved_data_ready)
		return;

	psock->saved_data_ready = sk->sk_data_ready;
	sk->sk_data_ready = sk_psock_strp_data_ready;
	sk->sk_write_space = sk_psock_write_space;
}

void sk_psock_stop_strp(struct sock *sk, struct sk_psock *psock)
{
	psock_set_prog(&psock->progs.stream_parser, NULL);

	if (!psock->saved_data_ready)
		return;

	sk->sk_data_ready = psock->saved_data_ready;
	psock->saved_data_ready = NULL;
	strp_stop(&psock->strp);
}

static void sk_psock_done_strp(struct sk_psock *psock)
{
	/* Parser has been stopped */
	if (sk_psock_test_state(psock, SK_PSOCK_RX_STRP_ENABLED))
		strp_done(&psock->strp);
}
#else
static void sk_psock_done_strp(struct sk_psock *psock)
{
}
#endif /* CONFIG_BPF_STREAM_PARSER */

static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb)
{
	struct sk_psock *psock;
	struct bpf_prog *prog;
	int ret = __SK_DROP;
	int len = skb->len;

	rcu_read_lock();
	psock = sk_psock(sk);
	if (unlikely(!psock)) {
		len = 0;
		tcp_eat_skb(sk, skb);
		sock_drop(sk, skb);
		goto out;
	}
	prog = READ_ONCE(psock->progs.stream_verdict);
	if (!prog)
		prog = READ_ONCE(psock->progs.skb_verdict);
	if (likely(prog)) {
		skb_dst_drop(skb);
		skb_bpf_redirect_clear(skb);
		ret = bpf_prog_run_pin_on_cpu(prog, skb);
		ret = sk_psock_map_verd(ret, skb_bpf_redirect_fetch(skb));
	}
	ret = sk_psock_verdict_apply(psock, skb, ret);
	if (ret < 0)
		len = ret;
out:
	rcu_read_unlock();
	return len;
}

static void sk_psock_verdict_data_ready(struct sock *sk)
{
	struct socket *sock = sk->sk_socket;
	const struct proto_ops *ops;
	int copied;

	trace_sk_data_ready(sk);

	if (unlikely(!sock))
		return;
	ops = READ_ONCE(sock->ops);
	if (!ops || !ops->read_skb)
		return;
	copied = ops->read_skb(sk, sk_psock_verdict_recv);
	if (copied >= 0) {
		struct sk_psock *psock;

		rcu_read_lock();
		psock = sk_psock(sk);
		if (psock)
			sk_psock_data_ready(sk, psock);
		rcu_read_unlock();
	}
}

void sk_psock_start_verdict(struct sock *sk, struct sk_psock *psock)
{
	if (psock->saved_data_ready)
		return;

	psock->saved_data_ready = sk->sk_data_ready;
	sk->sk_data_ready = sk_psock_verdict_data_ready;
	sk->sk_write_space = sk_psock_write_space;
}

void sk_psock_stop_verdict(struct sock *sk, struct sk_psock *psock)
{
	psock_set_prog(&psock->progs.stream_verdict, NULL);
	psock_set_prog(&psock->progs.skb_verdict, NULL);

	if (!psock->saved_data_ready)
		return;

	sk->sk_data_ready = psock->saved_data_ready;
	psock->saved_data_ready = NULL;
}
]

7/2 2026/02/23 15:12 tool read-file
Args:
map[File:net/unix/af_unix.c]

Results:
map[Contents:// SPDX-License-Identifier: GPL-2.0-or-later
/*
 * NET4:	Implementation of BSD Unix domain sockets.
 *
 * Authors:	Alan Cox, <alan@lxorguk.ukuu.org.uk>
 *
 * Fixes:
 *		Linus Torvalds	:	Assorted bug cures.
 *		Niibe Yutaka	:	async I/O support.
 *		Carsten Paeth	:	PF_UNIX check, address fixes.
 *		Alan Cox	:	Limit size of allocated blocks.
 *		Alan Cox	:	Fixed the stupid socketpair bug.
 *		Alan Cox	:	BSD compatibility fine tuning.
 *		Alan Cox	:	Fixed a bug in connect when interrupted.
 *		Alan Cox	:	Sorted out a proper draft version of
 *					file descriptor passing hacked up from
 *					Mike Shaver's work.
 *		Marty Leisner	:	Fixes to fd passing
 *		Nick Nevin	:	recvmsg bugfix.
 *		Alan Cox	:	Started proper garbage collector
 *		Heiko EiBfeldt	:	Missing verify_area check
 *		Alan Cox	:	Started POSIXisms
 *		Andreas Schwab	:	Replace inode by dentry for proper
 *					reference counting
 *		Kirk Petersen	:	Made this a module
 *	    Christoph Rohland	:	Elegant non-blocking accept/connect algorithm.
 *					Lots of bug fixes.
 *	     Alexey Kuznetosv	:	Repaired (I hope) bugs introduced
 *					by above two patches.
 *	     Andrea Arcangeli	:	If possible we block in connect(2)
 *					if the max backlog of the listen socket
 *					has been reached. This won't break
 *					old apps and it will avoid huge amount
 *					of socks hashed (this for unix_gc()
 *					performance reasons).
 *					Security fix that limits the max
 *					number of socks to 2*max_files and
 *					the number of skb queueable in the
 *					dgram receiver.
 *		Artur Skawina   :	Hash function optimizations
 *	     Alexey Kuznetsov   :	Full scale SMP. Lot of bugs are introduced 8)
 *	      Malcolm Beattie   :	Set peercred for socketpair
 *	     Michal Ostrowski   :       Module initialization cleanup.
 *	     Arnaldo C. Melo	:	Remove MOD_{INC,DEC}_USE_COUNT,
 *	     				the core infrastructure is doing that
 *	     				for all net proto families now (2.5.69+)
 *
 * Known differences from reference BSD that was tested:
 *
 *	[TO FIX]
 *	ECONNREFUSED is not returned from one end of a connected() socket to the
 *		other the moment one end closes.
 *	fstat() doesn't return st_dev=0, and give the blksize as high water mark
 *		and a fake inode identifier (nor the BSD first socket fstat twice bug).
 *	[NOT TO FIX]
 *	accept() returns a path name even if the connecting socket has closed
 *		in the meantime (BSD loses the path and gives up).
 *	accept() returns 0 length path for an unbound connector. BSD returns 16
 *		and a null first byte in the path (but not for gethost/peername - BSD bug ??)
 *	socketpair(...SOCK_RAW..) doesn't panic the kernel.
 *	BSD af_unix apparently has connect forgetting to block properly.
 *		(need to check this with the POSIX spec in detail)
 *
 * Differences from 2.0.0-11-... (ANK)
 *	Bug fixes and improvements.
 *		- client shutdown killed server socket.
 *		- removed all useless cli/sti pairs.
 *
 *	Semantic changes/extensions.
 *		- generic control message passing.
 *		- SCM_CREDENTIALS control message.
 *		- "Abstract" (not FS based) socket bindings.
 *		  Abstract names are sequences of bytes (not zero terminated)
 *		  started by 0, so that this name space does not intersect
 *		  with BSD names.
 */

#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt

#include <linux/bpf-cgroup.h>
#include <linux/btf_ids.h>
#include <linux/dcache.h>
#include <linux/errno.h>
#include <linux/fcntl.h>
#include <linux/file.h>
#include <linux/filter.h>
#include <linux/fs.h>
#include <linux/fs_struct.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/mount.h>
#include <linux/namei.h>
#include <linux/net.h>
#include <linux/pidfs.h>
#include <linux/poll.h>
#include <linux/proc_fs.h>
#include <linux/sched/signal.h>
#include <linux/security.h>
#include <linux/seq_file.h>
#include <linux/skbuff.h>
#include <linux/slab.h>
#include <linux/socket.h>
#include <linux/splice.h>
#include <linux/string.h>
#include <linux/uaccess.h>
#include <net/af_unix.h>
#include <net/net_namespace.h>
#include <net/scm.h>
#include <net/tcp_states.h>
#include <uapi/linux/sockios.h>
#include <uapi/linux/termios.h>

#include "af_unix.h"

static atomic_long_t unix_nr_socks;
static struct hlist_head bsd_socket_buckets[UNIX_HASH_SIZE / 2];
static spinlock_t bsd_socket_locks[UNIX_HASH_SIZE / 2];

/* SMP locking strategy:
 *    hash table is protected with spinlock.
 *    each socket state is protected by separate spinlock.
 */
#ifdef CONFIG_PROVE_LOCKING
#define cmp_ptr(l, r)	(((l) > (r)) - ((l) < (r)))

static int unix_table_lock_cmp_fn(const struct lockdep_map *a,
				  const struct lockdep_map *b)
{
	return cmp_ptr(a, b);
}

static int unix_state_lock_cmp_fn(const struct lockdep_map *_a,
				  const struct lockdep_map *_b)
{
	const struct unix_sock *a, *b;

	a = container_of(_a, struct unix_sock, lock.dep_map);
	b = container_of(_b, struct unix_sock, lock.dep_map);

	if (a->sk.sk_state == TCP_LISTEN) {
		/* unix_stream_connect(): Before the 2nd unix_state_lock(),
		 *
		 *   1. a is TCP_LISTEN.
		 *   2. b is not a.
		 *   3. concurrent connect(b -> a) must fail.
		 *
		 * Except for 2. & 3., the b's state can be any possible
		 * value due to concurrent connect() or listen().
		 *
		 * 2. is detected in debug_spin_lock_before(), and 3. cannot
		 * be expressed as lock_cmp_fn.
		 */
		switch (b->sk.sk_state) {
		case TCP_CLOSE:
		case TCP_ESTABLISHED:
		case TCP_LISTEN:
			return -1;
		default:
			/* Invalid case. */
			return 0;
		}
	}

	/* Should never happen.  Just to be symmetric. */
	if (b->sk.sk_state == TCP_LISTEN) {
		switch (a->sk.sk_state) {
		case TCP_CLOSE:
		case TCP_ESTABLISHED:
			return 1;
		default:
			return 0;
		}
	}

	/* unix_state_double_lock(): ascending address order. */
	return cmp_ptr(a, b);
}

static int unix_recvq_lock_cmp_fn(const struct lockdep_map *_a,
				  const struct lockdep_map *_b)
{
	const struct sock *a, *b;

	a = container_of(_a, struct sock, sk_receive_queue.lock.dep_map);
	b = container_of(_b, struct sock, sk_receive_queue.lock.dep_map);

	/* unix_collect_skb(): listener -> embryo order. */
	if (a->sk_state == TCP_LISTEN && unix_sk(b)->listener == a)
		return -1;

	/* Should never happen.  Just to be symmetric. */
	if (b->sk_state == TCP_LISTEN && unix_sk(a)->listener == b)
		return 1;

	return 0;
}
#endif

static unsigned int unix_unbound_hash(struct sock *sk)
{
	unsigned long hash = (unsigned long)sk;

	hash ^= hash >> 16;
	hash ^= hash >> 8;
	hash ^= sk->sk_type;

	return hash & UNIX_HASH_MOD;
}

static unsigned int unix_bsd_hash(struct inode *i)
{
	return i->i_ino & UNIX_HASH_MOD;
}

static unsigned int unix_abstract_hash(struct sockaddr_un *sunaddr,
				       int addr_len, int type)
{
	__wsum csum = csum_partial(sunaddr, addr_len, 0);
	unsigned int hash;

	hash = (__force unsigned int)csum_fold(csum);
	hash ^= hash >> 8;
	hash ^= type;

	return UNIX_HASH_MOD + 1 + (hash & UNIX_HASH_MOD);
}

static void unix_table_double_lock(struct net *net,
				   unsigned int hash1, unsigned int hash2)
{
	if (hash1 == hash2) {
		spin_lock(&net->unx.table.locks[hash1]);
		return;
	}

	if (hash1 > hash2)
		swap(hash1, hash2);

	spin_lock(&net->unx.table.locks[hash1]);
	spin_lock(&net->unx.table.locks[hash2]);
}

static void unix_table_double_unlock(struct net *net,
				     unsigned int hash1, unsigned int hash2)
{
	if (hash1 == hash2) {
		spin_unlock(&net->unx.table.locks[hash1]);
		return;
	}

	spin_unlock(&net->unx.table.locks[hash1]);
	spin_unlock(&net->unx.table.locks[hash2]);
}

#ifdef CONFIG_SECURITY_NETWORK
static void unix_get_secdata(struct scm_cookie *scm, struct sk_buff *skb)
{
	UNIXCB(skb).secid = scm->secid;
}

static inline void unix_set_secdata(struct scm_cookie *scm, struct sk_buff *skb)
{
	scm->secid = UNIXCB(skb).secid;
}

static inline bool unix_secdata_eq(struct scm_cookie *scm, struct sk_buff *skb)
{
	return (scm->secid == UNIXCB(skb).secid);
}
#else
static inline void unix_get_secdata(struct scm_cookie *scm, struct sk_buff *skb)
{ }

static inline void unix_set_secdata(struct scm_cookie *scm, struct sk_buff *skb)
{ }

static inline bool unix_secdata_eq(struct scm_cookie *scm, struct sk_buff *skb)
{
	return true;
}
#endif /* CONFIG_SECURITY_NETWORK */

static inline int unix_may_send(struct sock *sk, struct sock *osk)
{
	return !unix_peer(osk) || unix_peer(osk) == sk;
}

static inline int unix_recvq_full_lockless(const struct sock *sk)
{
	return skb_queue_len_lockless(&sk->sk_receive_queue) > sk->sk_max_ack_backlog;
}

struct sock *unix_peer_get(struct sock *s)
{
	struct sock *peer;

	unix_state_lock(s);
	peer = unix_peer(s);
	if (peer)
		sock_hold(peer);
	unix_state_unlock(s);
	return peer;
}
EXPORT_SYMBOL_GPL(unix_peer_get);

static struct unix_address *unix_create_addr(struct sockaddr_un *sunaddr,
					     int addr_len)
{
	struct unix_address *addr;

	addr = kmalloc(sizeof(*addr) + addr_len, GFP_KERNEL);
	if (!addr)
		return NULL;

	refcount_set(&addr->refcnt, 1);
	addr->len = addr_len;
	memcpy(addr->name, sunaddr, addr_len);

	return addr;
}

static inline void unix_release_addr(struct unix_address *addr)
{
	if (refcount_dec_and_test(&addr->refcnt))
		kfree(addr);
}

/*
 *	Check unix socket name:
 *		- should not be zero length.
 *		- if it does not start with a zero byte, it should be
 *		  NUL-terminated (an FS object).
 *		- if it starts with a zero byte, it is an abstract name.
 */

static int unix_validate_addr(struct sockaddr_un *sunaddr, int addr_len)
{
	if (addr_len <= offsetof(struct sockaddr_un, sun_path) ||
	    addr_len > sizeof(*sunaddr))
		return -EINVAL;

	if (sunaddr->sun_family != AF_UNIX)
		return -EINVAL;

	return 0;
}

static int unix_mkname_bsd(struct sockaddr_un *sunaddr, int addr_len)
{
	struct sockaddr_storage *addr = (struct sockaddr_storage *)sunaddr;
	short offset = offsetof(struct sockaddr_storage, __data);

	BUILD_BUG_ON(offset != offsetof(struct sockaddr_un, sun_path));

	/* This may look like an off by one error but it is a bit more
	 * subtle.  108 is the longest valid AF_UNIX path for a binding.
	 * sun_path[108] doesn't as such exist.  However in kernel space
	 * we are guaranteed that it is a valid memory location in our
	 * kernel address buffer because syscall functions always pass
	 * a pointer of struct sockaddr_storage which has a bigger buffer
	 * than 108.  Also, we must terminate sun_path for strlen() in
	 * getname_kernel().
	 */
	addr->__data[addr_len - offset] = 0;

	/* Don't pass sunaddr->sun_path to strlen().  Otherwise, 108 will
	 * cause panic if CONFIG_FORTIFY_SOURCE=y.  Let __fortify_strlen()
	 * know the actual buffer.
	 */
	return strlen(addr->__data) + offset + 1;
}

static void __unix_remove_socket(struct sock *sk)
{
	sk_del_node_init(sk);
}

static void __unix_insert_socket(struct net *net, struct sock *sk)
{
	DEBUG_NET_WARN_ON_ONCE(!sk_unhashed(sk));
	sk_add_node(sk, &net->unx.table.buckets[sk->sk_hash]);
}

static void __unix_set_addr_hash(struct net *net, struct sock *sk,
				 struct unix_address *addr, unsigned int hash)
{
	__unix_remove_socket(sk);
	smp_store_release(&unix_sk(sk)->addr, addr);

	sk->sk_hash = hash;
	__unix_insert_socket(net, sk);
}

static void unix_remove_socket(struct net *net, struct sock *sk)
{
	spin_lock(&net->unx.table.locks[sk->sk_hash]);
	__unix_remove_socket(sk);
	spin_unlock(&net->unx.table.locks[sk->sk_hash]);
}

static void unix_insert_unbound_socket(struct net *net, struct sock *sk)
{
	spin_lock(&net->unx.table.locks[sk->sk_hash]);
	__unix_insert_socket(net, sk);
	spin_unlock(&net->unx.table.locks[sk->sk_hash]);
}

static void unix_insert_bsd_socket(struct sock *sk)
{
	spin_lock(&bsd_socket_locks[sk->sk_hash]);
	sk_add_bind_node(sk, &bsd_socket_buckets[sk->sk_hash]);
	spin_unlock(&bsd_socket_locks[sk->sk_hash]);
}

static void unix_remove_bsd_socket(struct sock *sk)
{
	if (!hlist_unhashed(&sk->sk_bind_node)) {
		spin_lock(&bsd_socket_locks[sk->sk_hash]);
		__sk_del_bind_node(sk);
		spin_unlock(&bsd_socket_locks[sk->sk_hash]);

		sk_node_init(&sk->sk_bind_node);
	}
}

static struct sock *__unix_find_socket_byname(struct net *net,
					      struct sockaddr_un *sunname,
					      int len, unsigned int hash)
{
	struct sock *s;

	sk_for_each(s, &net->unx.table.buckets[hash]) {
		struct unix_sock *u = unix_sk(s);

		if (u->addr->len == len &&
		    !memcmp(u->addr->name, sunname, len))
			return s;
	}
	return NULL;
}

static inline struct sock *unix_find_socket_byname(struct net *net,
						   struct sockaddr_un *sunname,
						   int len, unsigned int hash)
{
	struct sock *s;

	spin_lock(&net->unx.table.locks[hash]);
	s = __unix_find_socket_byname(net, sunname, len, hash);
	if (s)
		sock_hold(s);
	spin_unlock(&net->unx.table.locks[hash]);
	return s;
}

static struct sock *unix_find_socket_byinode(struct inode *i)
{
	unsigned int hash = unix_bsd_hash(i);
	struct sock *s;

	spin_lock(&bsd_socket_locks[hash]);
	sk_for_each_bound(s, &bsd_socket_buckets[hash]) {
		struct dentry *dentry = unix_sk(s)->path.dentry;

		if (dentry && d_backing_inode(dentry) == i) {
			sock_hold(s);
			spin_unlock(&bsd_socket_locks[hash]);
			return s;
		}
	}
	spin_unlock(&bsd_socket_locks[hash]);
	return NULL;
}

/* Support code for asymmetrically connected dgram sockets
 *
 * If a datagram socket is connected to a socket not itself connected
 * to the first socket (eg, /dev/log), clients may only enqueue more
 * messages if the present receive queue of the server socket is not
 * "too large". This means there's a second writeability condition
 * poll and sendmsg need to test. The dgram recv code will do a wake
 * up on the peer_wait wait queue of a socket upon reception of a
 * datagram which needs to be propagated to sleeping would-be writers
 * since these might not have sent anything so far. This can't be
 * accomplished via poll_wait because the lifetime of the server
 * socket might be less than that of its clients if these break their
 * association with it or if the server socket is closed while clients
 * are still connected to it and there's no way to inform "a polling
 * implementation" that it should let go of a certain wait queue
 *
 * In order to propagate a wake up, a wait_queue_entry_t of the client
 * socket is enqueued on the peer_wait queue of the server socket
 * whose wake function does a wake_up on the ordinary client socket
 * wait queue. This connection is established whenever a write (or
 * poll for write) hit the flow control condition and broken when the
 * association to the server socket is dissolved or after a wake up
 * was relayed.
 */

static int unix_dgram_peer_wake_relay(wait_queue_entry_t *q, unsigned mode, int flags,
				      void *key)
{
	struct unix_sock *u;
	wait_queue_head_t *u_sleep;

	u = container_of(q, struct unix_sock, peer_wake);

	__remove_wait_queue(&unix_sk(u->peer_wake.private)->peer_wait,
			    q);
	u->peer_wake.private = NULL;

	/* relaying can only happen while the wq still exists */
	u_sleep = sk_sleep(&u->sk);
	if (u_sleep)
		wake_up_interruptible_poll(u_sleep, key_to_poll(key));

	return 0;
}

static int unix_dgram_peer_wake_connect(struct sock *sk, struct sock *other)
{
	struct unix_sock *u, *u_other;
	int rc;

	u = unix_sk(sk);
	u_other = unix_sk(other);
	rc = 0;
	spin_lock(&u_other->peer_wait.lock);

	if (!u->peer_wake.private) {
		u->peer_wake.private = other;
		__add_wait_queue(&u_other->peer_wait, &u->peer_wake);

		rc = 1;
	}

	spin_unlock(&u_other->peer_wait.lock);
	return rc;
}

static void unix_dgram_peer_wake_disconnect(struct sock *sk,
					    struct sock *other)
{
	struct unix_sock *u, *u_other;

	u = unix_sk(sk);
	u_other = unix_sk(other);
	spin_lock(&u_other->peer_wait.lock);

	if (u->peer_wake.private == other) {
		__remove_wait_queue(&u_other->peer_wait, &u->peer_wake);
		u->peer_wake.private = NULL;
	}

	spin_unlock(&u_other->peer_wait.lock);
}

static void unix_dgram_peer_wake_disconnect_wakeup(struct sock *sk,
						   struct sock *other)
{
	unix_dgram_peer_wake_disconnect(sk, other);
	wake_up_interruptible_poll(sk_sleep(sk),
				   EPOLLOUT |
				   EPOLLWRNORM |
				   EPOLLWRBAND);
}

/* preconditions:
 *	- unix_peer(sk) == other
 *	- association is stable
 */
static int unix_dgram_peer_wake_me(struct sock *sk, struct sock *other)
{
	int connected;

	connected = unix_dgram_peer_wake_connect(sk, other);

	/* If other is SOCK_DEAD, we want to make sure we signal
	 * POLLOUT, such that a subsequent write() can get a
	 * -ECONNREFUSED. Otherwise, if we haven't queued any skbs
	 * to other and it's full, we will hang waiting for POLLOUT.
	 */
	if (unix_recvq_full_lockless(other) && !sock_flag(other, SOCK_DEAD))
		return 1;

	if (connected)
		unix_dgram_peer_wake_disconnect(sk, other);

	return 0;
}

static int unix_writable(const struct sock *sk, unsigned char state)
{
	return state != TCP_LISTEN &&
		(refcount_read(&sk->sk_wmem_alloc) << 2) <= READ_ONCE(sk->sk_sndbuf);
}

static void unix_write_space(struct sock *sk)
{
	struct socket_wq *wq;

	rcu_read_lock();
	if (unix_writable(sk, READ_ONCE(sk->sk_state))) {
		wq = rcu_dereference(sk->sk_wq);
		if (skwq_has_sleeper(wq))
			wake_up_interruptible_sync_poll(&wq->wait,
				EPOLLOUT | EPOLLWRNORM | EPOLLWRBAND);
		sk_wake_async_rcu(sk, SOCK_WAKE_SPACE, POLL_OUT);
	}
	rcu_read_unlock();
}

/* When a dgram socket disconnects (or changes its peer), we clear its
 * receive queue of packets that arrived from the previous peer. First,
 * this allows flow control based only on wmem_alloc; second, an sk
 * connected to a peer may receive messages only from that peer.
 */
static void unix_dgram_disconnected(struct sock *sk, struct sock *other)
{
	if (!skb_queue_empty(&sk->sk_receive_queue)) {
		skb_queue_purge_reason(&sk->sk_receive_queue,
				       SKB_DROP_REASON_UNIX_DISCONNECT);

		wake_up_interruptible_all(&unix_sk(sk)->peer_wait);

		/* If one link of a bidirectional dgram pipe is disconnected,
		 * we signal an error. Messages are lost. Do not do this
		 * when the peer was not connected to us.
		 */
		if (!sock_flag(other, SOCK_DEAD) && unix_peer(other) == sk) {
			WRITE_ONCE(other->sk_err, ECONNRESET);
			sk_error_report(other);
		}
	}
}

static void unix_sock_destructor(struct sock *sk)
{
	struct unix_sock *u = unix_sk(sk);

	skb_queue_purge_reason(&sk->sk_receive_queue, SKB_DROP_REASON_SOCKET_CLOSE);

	DEBUG_NET_WARN_ON_ONCE(refcount_read(&sk->sk_wmem_alloc));
	DEBUG_NET_WARN_ON_ONCE(!sk_unhashed(sk));
	DEBUG_NET_WARN_ON_ONCE(sk->sk_socket);
	if (!sock_flag(sk, SOCK_DEAD)) {
		pr_info("Attempt to release alive unix socket: %p\n", sk);
		return;
	}

	if (u->addr)
		unix_release_addr(u->addr);

	atomic_long_dec(&unix_nr_socks);
	sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
#ifdef UNIX_REFCNT_DEBUG
	pr_debug("UNIX %p is destroyed, %ld are still alive.\n", sk,
		atomic_long_read(&unix_nr_socks));
#endif
}

static unsigned int unix_skb_len(const struct sk_buff *skb)
{
	return skb->len - UNIXCB(skb).consumed;
}

static void unix_release_sock(struct sock *sk, int embrion)
{
	struct unix_sock *u = unix_sk(sk);
	struct sock *skpair;
	struct sk_buff *skb;
	struct path path;
	int state;

	unix_remove_socket(sock_net(sk), sk);
	unix_remove_bsd_socket(sk);

	/* Clear state */
	unix_state_lock(sk);
	sock_orphan(sk);
	WRITE_ONCE(sk->sk_shutdown, SHUTDOWN_MASK);
	path	     = u->path;
	u->path.dentry = NULL;
	u->path.mnt = NULL;
	state = sk->sk_state;
	WRITE_ONCE(sk->sk_state, TCP_CLOSE);

	skpair = unix_peer(sk);
	unix_peer(sk) = NULL;

	unix_state_unlock(sk);

#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
	u->oob_skb = NULL;
#endif

	wake_up_interruptible_all(&u->peer_wait);

	if (skpair != NULL) {
		if (sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET) {
			struct sk_buff *skb = skb_peek(&sk->sk_receive_queue);

#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
			if (skb && !unix_skb_len(skb))
				skb = skb_peek_next(skb, &sk->sk_receive_queue);
#endif
			unix_state_lock(skpair);
			/* No more writes */
			WRITE_ONCE(skpair->sk_shutdown, SHUTDOWN_MASK);
			if (skb || embrion)
				WRITE_ONCE(skpair->sk_err, ECONNRESET);
			unix_state_unlock(skpair);
			skpair->sk_state_change(skpair);
			sk_wake_async(skpair, SOCK_WAKE_WAITD, POLL_HUP);
		}

		unix_dgram_peer_wake_disconnect(sk, skpair);
		sock_put(skpair); /* It may now die */
	}

	/* Try to flush out this socket. Throw out buffers at least */

	while ((skb = skb_dequeue(&sk->sk_receive_queue)) != NULL) {
		if (state == TCP_LISTEN)
			unix_release_sock(skb->sk, 1);

		/* passed fds are erased in the kfree_skb hook */
		kfree_skb_reason(skb, SKB_DROP_REASON_SOCKET_CLOSE);
	}

	if (path.dentry)
		path_put(&path);

	sock_put(sk);

	/* ---- Socket is dead now and most probably destroyed ---- */

	unix_schedule_gc(NULL);
}

struct unix_peercred {
	struct pid *peer_pid;
	const struct cred *peer_cred;
};

static inline int prepare_peercred(struct unix_peercred *peercred)
{
	struct pid *pid;
	int err;

	pid = task_tgid(current);
	err = pidfs_register_pid(pid);
	if (likely(!err)) {
		peercred->peer_pid = get_pid(pid);
		peercred->peer_cred = get_current_cred();
	}
	return err;
}

static void drop_peercred(struct unix_peercred *peercred)
{
	const struct cred *cred = NULL;
	struct pid *pid = NULL;

	might_sleep();

	swap(peercred->peer_pid, pid);
	swap(peercred->peer_cred, cred);

	put_pid(pid);
	put_cred(cred);
}

static inline void init_peercred(struct sock *sk,
				 const struct unix_peercred *peercred)
{
	sk->sk_peer_pid = peercred->peer_pid;
	sk->sk_peer_cred = peercred->peer_cred;
}

static void update_peercred(struct sock *sk, struct unix_peercred *peercred)
{
	const struct cred *old_cred;
	struct pid *old_pid;

	spin_lock(&sk->sk_peer_lock);
	old_pid = sk->sk_peer_pid;
	old_cred = sk->sk_peer_cred;
	init_peercred(sk, peercred);
	spin_unlock(&sk->sk_peer_lock);

	peercred->peer_pid = old_pid;
	peercred->peer_cred = old_cred;
}

static void copy_peercred(struct sock *sk, struct sock *peersk)
{
	lockdep_assert_held(&unix_sk(peersk)->lock);

	spin_lock(&sk->sk_peer_lock);
	sk->sk_peer_pid = get_pid(peersk->sk_peer_pid);
	sk->sk_peer_cred = get_cred(peersk->sk_peer_cred);
	spin_unlock(&sk->sk_peer_lock);
}

static bool unix_may_passcred(const struct sock *sk)
{
	return sk->sk_scm_credentials || sk->sk_scm_pidfd;
}

static int unix_listen(struct socket *sock, int backlog)
{
	int err;
	struct sock *sk = sock->sk;
	struct unix_sock *u = unix_sk(sk);
	struct unix_peercred peercred = {};

	err = -EOPNOTSUPP;
	if (sock->type != SOCK_STREAM && sock->type != SOCK_SEQPACKET)
		goto out;	/* Only stream/seqpacket sockets accept */
	err = -EINVAL;
	if (!READ_ONCE(u->addr))
		goto out;	/* No listens on an unbound socket */
	err = prepare_peercred(&peercred);
	if (err)
		goto out;
	unix_state_lock(sk);
	if (sk->sk_state != TCP_CLOSE && sk->sk_state != TCP_LISTEN)
		goto out_unlock;
	if (backlog > sk->sk_max_ack_backlog)
		wake_up_interruptible_all(&u->peer_wait);
	sk->sk_max_ack_backlog	= backlog;
	WRITE_ONCE(sk->sk_state, TCP_LISTEN);

	/* set credentials so connect can copy them */
	update_peercred(sk, &peercred);
	err = 0;

out_unlock:
	unix_state_unlock(sk);
	drop_peercred(&peercred);
out:
	return err;
}

static int unix_release(struct socket *);
static int unix_bind(struct socket *, struct sockaddr_unsized *, int);
static int unix_stream_connect(struct socket *, struct sockaddr_unsized *,
			       int addr_len, int flags);
static int unix_socketpair(struct socket *, struct socket *);
static int unix_accept(struct socket *, struct socket *, struct proto_accept_arg *arg);
static int unix_getname(struct socket *, struct sockaddr *, int);
static __poll_t unix_poll(struct file *, struct socket *, poll_table *);
static __poll_t unix_dgram_poll(struct file *, struct socket *,
				    poll_table *);
static int unix_ioctl(struct socket *, unsigned int, unsigned long);
#ifdef CONFIG_COMPAT
static int unix_compat_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg);
#endif
static int unix_shutdown(struct socket *, int);
static int unix_stream_sendmsg(struct socket *, struct msghdr *, size_t);
static int unix_stream_recvmsg(struct socket *, struct msghdr *, size_t, int);
static ssize_t unix_stream_splice_read(struct socket *,  loff_t *ppos,
				       struct pipe_inode_info *, size_t size,
				       unsigned int flags);
static int unix_dgram_sendmsg(struct socket *, struct msghdr *, size_t);
static int unix_dgram_recvmsg(struct socket *, struct msghdr *, size_t, int);
static int unix_read_skb(struct sock *sk, skb_read_actor_t recv_actor);
static int unix_stream_read_skb(struct sock *sk, skb_read_actor_t recv_actor);
static int unix_dgram_connect(struct socket *, struct sockaddr_unsized *,
			      int, int);
static int unix_seqpacket_sendmsg(struct socket *, struct msghdr *, size_t);
static int unix_seqpacket_recvmsg(struct socket *, struct msghdr *, size_t,
				  int);

#ifdef CONFIG_PROC_FS
static int unix_count_nr_fds(struct sock *sk)
{
	struct sk_buff *skb;
	struct unix_sock *u;
	int nr_fds = 0;

	spin_lock(&sk->sk_receive_queue.lock);
	skb = skb_peek(&sk->sk_receive_queue);
	while (skb) {
		u = unix_sk(skb->sk);
		nr_fds += atomic_read(&u->scm_stat.nr_fds);
		skb = skb_peek_next(skb, &sk->sk_receive_queue);
	}
	spin_unlock(&sk->sk_receive_queue.lock);

	return nr_fds;
}

static void unix_show_fdinfo(struct seq_file *m, struct socket *sock)
{
	struct sock *sk = sock->sk;
	unsigned char s_state;
	struct unix_sock *u;
	int nr_fds = 0;

	if (sk) {
		s_state = READ_ONCE(sk->sk_state);
		u = unix_sk(sk);

		/* SOCK_STREAM and SOCK_SEQPACKET sockets never change their
		 * sk_state after switching to TCP_ESTABLISHED or TCP_LISTEN.
		 * SOCK_DGRAM is ordinary, so no lock is needed.
		 */
		if (sock->type == SOCK_DGRAM || s_state == TCP_ESTABLISHED)
			nr_fds = atomic_read(&u->scm_stat.nr_fds);
		else if (s_state == TCP_LISTEN)
			nr_fds = unix_count_nr_fds(sk);

		seq_printf(m, "scm_fds: %u\n", nr_fds);
	}
}
#else
#define unix_show_fdinfo NULL
#endif

static bool unix_custom_sockopt(int optname)
{
	switch (optname) {
	case SO_INQ:
		return true;
	default:
		return false;
	}
}

static int unix_setsockopt(struct socket *sock, int level, int optname,
			   sockptr_t optval, unsigned int optlen)
{
	struct unix_sock *u = unix_sk(sock->sk);
	struct sock *sk = sock->sk;
	int val;

	if (level != SOL_SOCKET)
		return -EOPNOTSUPP;

	if (!unix_custom_sockopt(optname))
		return sock_setsockopt(sock, level, optname, optval, optlen);

	if (optlen != sizeof(int))
		return -EINVAL;

	if (copy_from_sockptr(&val, optval, sizeof(val)))
		return -EFAULT;

	switch (optname) {
	case SO_INQ:
		if (sk->sk_type != SOCK_STREAM)
			return -EINVAL;

		if (val > 1 || val < 0)
			return -EINVAL;

		WRITE_ONCE(u->recvmsg_inq, val);
		break;
	default:
		return -ENOPROTOOPT;
	}

	return 0;
}

static const struct proto_ops unix_stream_ops = {
	.family =	PF_UNIX,
	.owner =	THIS_MODULE,
	.release =	unix_release,
	.bind =		unix_bind,
	.connect =	unix_stream_connect,
	.socketpair =	unix_socketpair,
	.accept =	unix_accept,
	.getname =	unix_getname,
	.poll =		unix_poll,
	.ioctl =	unix_ioctl,
#ifdef CONFIG_COMPAT
	.compat_ioctl =	unix_compat_ioctl,
#endif
	.listen =	unix_listen,
	.shutdown =	unix_shutdown,
	.setsockopt =	unix_setsockopt,
	.sendmsg =	unix_stream_sendmsg,
	.recvmsg =	unix_stream_recvmsg,
	.read_skb =	unix_stream_read_skb,
	.mmap =		sock_no_mmap,
	.splice_read =	unix_stream_splice_read,
	.set_peek_off =	sk_set_peek_off,
	.show_fdinfo =	unix_show_fdinfo,
};

static const struct proto_ops unix_dgram_ops = {
	.family =	PF_UNIX,
	.owner =	THIS_MODULE,
	.release =	unix_release,
	.bind =		unix_bind,
	.connect =	unix_dgram_connect,
	.socketpair =	unix_socketpair,
	.accept =	sock_no_accept,
	.getname =	unix_getname,
	.poll =		unix_dgram_poll,
	.ioctl =	unix_ioctl,
#ifdef CONFIG_COMPAT
	.compat_ioctl =	unix_compat_ioctl,
#endif
	.listen =	sock_no_listen,
	.shutdown =	unix_shutdown,
	.sendmsg =	unix_dgram_sendmsg,
	.read_skb =	unix_read_skb,
	.recvmsg =	unix_dgram_recvmsg,
	.mmap =		sock_no_mmap,
	.set_peek_off =	sk_set_peek_off,
	.show_fdinfo =	unix_show_fdinfo,
};

static const struct proto_ops unix_seqpacket_ops = {
	.family =	PF_UNIX,
	.owner =	THIS_MODULE,
	.release =	unix_release,
	.bind =		unix_bind,
	.connect =	unix_stream_connect,
	.socketpair =	unix_socketpair,
	.accept =	unix_accept,
	.getname =	unix_getname,
	.poll =		unix_dgram_poll,
	.ioctl =	unix_ioctl,
#ifdef CONFIG_COMPAT
	.compat_ioctl =	unix_compat_ioctl,
#endif
	.listen =	unix_listen,
	.shutdown =	unix_shutdown,
	.sendmsg =	unix_seqpacket_sendmsg,
	.recvmsg =	unix_seqpacket_recvmsg,
	.mmap =		sock_no_mmap,
	.set_peek_off =	sk_set_peek_off,
	.show_fdinfo =	unix_show_fdinfo,
};

static void unix_close(struct sock *sk, long timeout)
{
	/* Nothing to do here, unix socket does not need a ->close().
	 * This is merely for sockmap.
	 */
}

static bool unix_bpf_bypass_getsockopt(int level, int optname)
{
	if (level == SOL_SOCKET) {
		switch (optname) {
		case SO_PEERPIDFD:
			return true;
		default:
			return false;
		}
	}

	return false;
}

struct proto unix_dgram_proto = {
	.name			= "UNIX",
	.owner			= THIS_MODULE,
	.obj_size		= sizeof(struct unix_sock),
	.close			= unix_close,
	.bpf_bypass_getsockopt	= unix_bpf_bypass_getsockopt,
#ifdef CONFIG_BPF_SYSCALL
	.psock_update_sk_prot	= unix_dgram_bpf_update_proto,
#endif
};

struct proto unix_stream_proto = {
	.name			= "UNIX-STREAM",
	.owner			= THIS_MODULE,
	.obj_size		= sizeof(struct unix_sock),
	.close			= unix_close,
	.bpf_bypass_getsockopt	= unix_bpf_bypass_getsockopt,
#ifdef CONFIG_BPF_SYSCALL
	.psock_update_sk_prot	= unix_stream_bpf_update_proto,
#endif
};

static struct sock *unix_create1(struct net *net, struct socket *sock, int kern, int type)
{
	struct unix_sock *u;
	struct sock *sk;
	int err;

	atomic_long_inc(&unix_nr_socks);
	if (atomic_long_read(&unix_nr_socks) > 2 * get_max_files()) {
		err = -ENFILE;
		goto err;
	}

	if (type == SOCK_STREAM)
		sk = sk_alloc(net, PF_UNIX, GFP_KERNEL, &unix_stream_proto, kern);
	else /* dgram and seqpacket */
		sk = sk_alloc(net, PF_UNIX, GFP_KERNEL, &unix_dgram_proto, kern);

	if (!sk) {
		err = -ENOMEM;
		goto err;
	}

	sock_init_data(sock, sk);

	sk->sk_scm_rights	= 1;
	sk->sk_hash		= unix_unbound_hash(sk);
	sk->sk_allocation	= GFP_KERNEL_ACCOUNT;
	sk->sk_write_space	= unix_write_space;
	sk->sk_max_ack_backlog	= READ_ONCE(net->unx.sysctl_max_dgram_qlen);
	sk->sk_destruct		= unix_sock_destructor;
	lock_set_cmp_fn(&sk->sk_receive_queue.lock, unix_recvq_lock_cmp_fn, NULL);

	u = unix_sk(sk);
	u->listener = NULL;
	u->vertex = NULL;
	u->path.dentry = NULL;
	u->path.mnt = NULL;
	spin_lock_init(&u->lock);
	lock_set_cmp_fn(&u->lock, unix_state_lock_cmp_fn, NULL);
	mutex_init(&u->iolock); /* single task reading lock */
	mutex_init(&u->bindlock); /* single task binding lock */
	init_waitqueue_head(&u->peer_wait);
	init_waitqueue_func_entry(&u->peer_wake, unix_dgram_peer_wake_relay);
	memset(&u->scm_stat, 0, sizeof(struct scm_stat));
	unix_insert_unbound_socket(net, sk);

	sock_prot_inuse_add(net, sk->sk_prot, 1);

	return sk;

err:
	atomic_long_dec(&unix_nr_socks);
	return ERR_PTR(err);
}

static int unix_create(struct net *net, struct socket *sock, int protocol,
		       int kern)
{
	struct sock *sk;

	if (protocol && protocol != PF_UNIX)
		return -EPROTONOSUPPORT;

	sock->state = SS_UNCONNECTED;

	switch (sock->type) {
	case SOCK_STREAM:
		set_bit(SOCK_CUSTOM_SOCKOPT, &sock->flags);
		sock->ops = &unix_stream_ops;
		break;
		/*
		 *	Believe it or not, BSD has AF_UNIX, SOCK_RAW,
		 *	though nothing uses it.
		 */
	case SOCK_RAW:
		sock->type = SOCK_DGRAM;
		fallthrough;
	case SOCK_DGRAM:
		sock->ops = &unix_dgram_ops;
		break;
	case SOCK_SEQPACKET:
		sock->ops = &unix_seqpacket_ops;
		break;
	default:
		return -ESOCKTNOSUPPORT;
	}

	sk = unix_create1(net, sock, kern, sock->type);
	if (IS_ERR(sk))
		return PTR_ERR(sk);

	return 0;
}

static int unix_release(struct socket *sock)
{
	struct sock *sk = sock->sk;

	if (!sk)
		return 0;

	sk->sk_prot->close(sk, 0);
	unix_release_sock(sk, 0);
	sock->sk = NULL;

	return 0;
}

static struct sock *unix_find_bsd(struct sockaddr_un *sunaddr, int addr_len,
				  int type, int flags)
{
	struct inode *inode;
	struct path path;
	struct sock *sk;
	int err;

	unix_mkname_bsd(sunaddr, addr_len);

	if (flags & SOCK_COREDUMP) {
		struct path root;

		task_lock(&init_task);
		get_fs_root(init_task.fs, &root);
		task_unlock(&init_task);

		scoped_with_kernel_creds()
			err = vfs_path_lookup(root.dentry, root.mnt, sunaddr->sun_path,
					      LOOKUP_BENEATH | LOOKUP_NO_SYMLINKS |
					      LOOKUP_NO_MAGICLINKS, &path);
		path_put(&root);
		if (err)
			goto fail;
	} else {
		err = kern_path(sunaddr->sun_path, LOOKUP_FOLLOW, &path);
		if (err)
			goto fail;

		err = path_permission(&path, MAY_WRITE);
		if (err)
			goto path_put;
	}

	err = -ECONNREFUSED;
	inode = d_backing_inode(path.dentry);
	if (!S_ISSOCK(inode->i_mode))
		goto path_put;

	sk = unix_find_socket_byinode(inode);
	if (!sk)
		goto path_put;

	err = -EPROTOTYPE;
	if (sk->sk_type == type)
		touch_atime(&path);
	else
		goto sock_put;

	path_put(&path);

	return sk;

sock_put:
	sock_put(sk);
path_put:
	path_put(&path);
fail:
	return ERR_PTR(err);
}

static struct sock *unix_find_abstract(struct net *net,
				       struct sockaddr_un *sunaddr,
				       int addr_len, int type)
{
	unsigned int hash = unix_abstract_hash(sunaddr, addr_len, type);
	struct dentry *dentry;
	struct sock *sk;

	sk = unix_find_socket_byname(net, sunaddr, addr_len, hash);
	if (!sk)
		return ERR_PTR(-ECONNREFUSED);

	dentry = unix_sk(sk)->path.dentry;
	if (dentry)
		touch_atime(&unix_sk(sk)->path);

	return sk;
}

static struct sock *unix_find_other(struct net *net,
				    struct sockaddr_un *sunaddr,
				    int addr_len, int type, int flags)
{
	struct sock *sk;

	if (sunaddr->sun_path[0])
		sk = unix_find_bsd(sunaddr, addr_len, type, flags);
	else
		sk = unix_find_abstract(net, sunaddr, addr_len, type);

	return sk;
}

static int unix_autobind(struct sock *sk)
{
	struct unix_sock *u = unix_sk(sk);
	unsigned int new_hash, old_hash;
	struct net *net = sock_net(sk);
	struct unix_address *addr;
	u32 lastnum, ordernum;
	int err;

	err = mutex_lock_interruptible(&u->bindlock);
	if (err)
		return err;

	if (u->addr)
		goto out;

	err = -ENOMEM;
	addr = kzalloc(sizeof(*addr) +
		       offsetof(struct sockaddr_un, sun_path) + 16, GFP_KERNEL);
	if (!addr)
		goto out;

	addr->len = offsetof(struct sockaddr_un, sun_path) + 6;
	addr->name->sun_family = AF_UNIX;
	refcount_set(&addr->refcnt, 1);

	old_hash = sk->sk_hash;
	ordernum = get_random_u32();
	lastnum = ordernum & 0xFFFFF;
retry:
	ordernum = (ordernum + 1) & 0xFFFFF;
	sprintf(addr->name->sun_path + 1, "%05x", ordernum);

	new_hash = unix_abstract_hash(addr->name, addr->len, sk->sk_type);
	unix_table_double_lock(net, old_hash, new_hash);

	if (__unix_find_socket_byname(net, addr->name, addr->len, new_hash)) {
		unix_table_double_unlock(net, old_hash, new_hash);

		/* __unix_find_socket_byname() may take a long time if many
		 * names are already in use.
		 */
		cond_resched();

		if (ordernum == lastnum) {
			/* Give up if all names seem to be in use. */
			err = -ENOSPC;
			unix_release_addr(addr);
			goto out;
		}

		goto retry;
	}

	__unix_set_addr_hash(net, sk, addr, new_hash);
	unix_table_double_unlock(net, old_hash, new_hash);
	err = 0;

out:	mutex_unlock(&u->bindlock);
	return err;
}

static int unix_bind_bsd(struct sock *sk, struct sockaddr_un *sunaddr,
			 int addr_len)
{
	umode_t mode = S_IFSOCK |
	       (SOCK_INODE(sk->sk_socket)->i_mode & ~current_umask());
	struct unix_sock *u = unix_sk(sk);
	unsigned int new_hash, old_hash;
	struct net *net = sock_net(sk);
	struct mnt_idmap *idmap;
	struct unix_address *addr;
	struct dentry *dentry;
	struct path parent;
	int err;

	addr_len = unix_mkname_bsd(sunaddr, addr_len);
	addr = unix_create_addr(sunaddr, addr_len);
	if (!addr)
		return -ENOMEM;

	/*
	 * Get the parent directory, calculate the hash for last
	 * component.
	 */
	dentry = start_creating_path(AT_FDCWD, addr->name->sun_path, &parent, 0);
	if (IS_ERR(dentry)) {
		err = PTR_ERR(dentry);
		goto out;
	}

	/*
	 * All right, let's create it.
	 */
	idmap = mnt_idmap(parent.mnt);
	err = security_path_mknod(&parent, dentry, mode, 0);
	if (!err)
		err = vfs_mknod(idmap, d_inode(parent.dentry), dentry, mode, 0, NULL);
	if (err)
		goto out_path;
	err = mutex_lock_interruptible(&u->bindlock);
	if (err)
		goto out_unlink;
	if (u->addr)
		goto out_unlock;

	old_hash = sk->sk_hash;
	new_hash = unix_bsd_hash(d_backing_inode(dentry));
	unix_table_double_lock(net, old_hash, new_hash);
	u->path.mnt = mntget(parent.mnt);
	u->path.dentry = dget(dentry);
	__unix_set_addr_hash(net, sk, addr, new_hash);
	unix_table_double_unlock(net, old_hash, new_hash);
	unix_insert_bsd_socket(sk);
	mutex_unlock(&u->bindlock);
	end_creating_path(&parent, dentry);
	return 0;

out_unlock:
	mutex_unlock(&u->bindlock);
	err = -EINVAL;
out_unlink:
	/* failed after successful mknod?  unlink what we'd created... */
	vfs_unlink(idmap, d_inode(parent.dentry), dentry, NULL);
out_path:
	end_creating_path(&parent, dentry);
out:
	unix_release_addr(addr);
	return err == -EEXIST ? -EADDRINUSE : err;
}

static int unix_bind_abstract(struct sock *sk, struct sockaddr_un *sunaddr,
			      int addr_len)
{
	struct unix_sock *u = unix_sk(sk);
	unsigned int new_hash, old_hash;
	struct net *net = sock_net(sk);
	struct unix_address *addr;
	int err;

	addr = unix_create_addr(sunaddr, addr_len);
	if (!addr)
		return -ENOMEM;

	err = mutex_lock_interruptible(&u->bindlock);
	if (err)
		goto out;

	if (u->addr) {
		err = -EINVAL;
		goto out_mutex;
	}

	old_hash = sk->sk_hash;
	new_hash = unix_abstract_hash(addr->name, addr->len, sk->sk_type);
	unix_table_double_lock(net, old_hash, new_hash);

	if (__unix_find_socket_byname(net, addr->name, addr->len, new_hash))
		goto out_spin;

	__unix_set_addr_hash(net, sk, addr, new_hash);
	unix_table_double_unlock(net, old_hash, new_hash);
	mutex_unlock(&u->bindlock);
	return 0;

out_spin:
	unix_table_double_unlock(net, old_hash, new_hash);
	err = -EADDRINUSE;
out_mutex:
	mutex_unlock(&u->bindlock);
out:
	unix_release_addr(addr);
	return err;
}

static int unix_bind(struct socket *sock, struct sockaddr_unsized *uaddr, int addr_len)
{
	struct sockaddr_un *sunaddr = (struct sockaddr_un *)uaddr;
	struct sock *sk = sock->sk;
	int err;

	if (addr_len == offsetof(struct sockaddr_un, sun_path) &&
	    sunaddr->sun_family == AF_UNIX)
		return unix_autobind(sk);

	err = unix_validate_addr(sunaddr, addr_len);
	if (err)
		return err;

	if (sunaddr->sun_path[0])
		err = unix_bind_bsd(sk, sunaddr, addr_len);
	else
		err = unix_bind_abstract(sk, sunaddr, addr_len);

	return err;
}

static void unix_state_double_lock(struct sock *sk1, struct sock *sk2)
{
	if (unlikely(sk1 == sk2) || !sk2) {
		unix_state_lock(sk1);
		return;
	}

	if (sk1 > sk2)
		swap(sk1, sk2);

	unix_state_lock(sk1);
	unix_state_lock(sk2);
}

static void unix_state_double_unlock(struct sock *sk1, struct sock *sk2)
{
	if (unlikely(sk1 == sk2) || !sk2) {
		unix_state_unlock(sk1);
		return;
	}
	unix_state_unlock(sk1);
	unix_state_unlock(sk2);
}

static int unix_dgram_connect(struct socket *sock, struct sockaddr_unsized *addr,
			      int alen, int flags)
{
	struct sockaddr_un *sunaddr = (struct sockaddr_un *)addr;
	struct sock *sk = sock->sk;
	struct sock *other;
	int err;

	err = -EINVAL;
	if (alen < offsetofend(struct sockaddr, sa_family))
		goto out;

	if (addr->sa_family != AF_UNSPEC) {
		err = unix_validate_addr(sunaddr, alen);
		if (err)
			goto out;

		err = BPF_CGROUP_RUN_PROG_UNIX_CONNECT_LOCK(sk, addr, &alen);
		if (err)
			goto out;

		if (unix_may_passcred(sk) && !READ_ONCE(unix_sk(sk)->addr)) {
			err = unix_autobind(sk);
			if (err)
				goto out;
		}

restart:
		other = unix_find_other(sock_net(sk), sunaddr, alen, sock->type, 0);
		if (IS_ERR(other)) {
			err = PTR_ERR(other);
			goto out;
		}

		unix_state_double_lock(sk, other);

		/* Apparently VFS overslept socket death. Retry. */
		if (sock_flag(other, SOCK_DEAD)) {
			unix_state_double_unlock(sk, other);
			sock_put(other);
			goto restart;
		}

		err = -EPERM;
		if (!unix_may_send(sk, other))
			goto out_unlock;

		err = security_unix_may_send(sk->sk_socket, other->sk_socket);
		if (err)
			goto out_unlock;

		WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED);
		WRITE_ONCE(other->sk_state, TCP_ESTABLISHED);
	} else {
		/*
		 *	1003.1g breaking connected state with AF_UNSPEC
		 */
		other = NULL;
		unix_state_double_lock(sk, other);
	}

	/*
	 * If it was connected, reconnect.
	 */
	if (unix_peer(sk)) {
		struct sock *old_peer = unix_peer(sk);

		unix_peer(sk) = other;
		if (!other)
			WRITE_ONCE(sk->sk_state, TCP_CLOSE);
		unix_dgram_peer_wake_disconnect_wakeup(sk, old_peer);

		unix_state_double_unlock(sk, other);

		if (other != old_peer) {
			unix_dgram_disconnected(sk, old_peer);

			unix_state_lock(old_peer);
			if (!unix_peer(old_peer))
				WRITE_ONCE(old_peer->sk_state, TCP_CLOSE);
			unix_state_unlock(old_peer);
		}

		sock_put(old_peer);
	} else {
		unix_peer(sk) = other;
		unix_state_double_unlock(sk, other);
	}

	return 0;

out_unlock:
	unix_state_double_unlock(sk, other);
	sock_put(other);
out:
	return err;
}

static long unix_wait_for_peer(struct sock *other, long timeo)
{
	struct unix_sock *u = unix_sk(other);
	int sched;
	DEFINE_WAIT(wait);

	prepare_to_wait_exclusive(&u->peer_wait, &wait, TASK_INTERRUPTIBLE);

	sched = !sock_flag(other, SOCK_DEAD) &&
		!(other->sk_shutdown & RCV_SHUTDOWN) &&
		unix_recvq_full_lockless(other);

	unix_state_unlock(other);

	if (sched)
		timeo = schedule_timeout(timeo);

	finish_wait(&u->peer_wait, &wait);
	return timeo;
}

static int unix_stream_connect(struct socket *sock, struct sockaddr_unsized *uaddr,
			       int addr_len, int flags)
{
	struct sockaddr_un *sunaddr = (struct sockaddr_un *)uaddr;
	struct sock *sk = sock->sk, *newsk = NULL, *other = NULL;
	struct unix_sock *u = unix_sk(sk), *newu, *otheru;
	struct unix_peercred peercred = {};
	struct net *net = sock_net(sk);
	struct sk_buff *skb = NULL;
	unsigned char state;
	long timeo;
	int err;

	err = unix_validate_addr(sunaddr, addr_len);
	if (err)
		goto out;

	err = BPF_CGROUP_RUN_PROG_UNIX_CONNECT_LOCK(sk, uaddr, &addr_len);
	if (err)
		goto out;

	if (unix_may_passcred(sk) && !READ_ONCE(u->addr)) {
		err = unix_autobind(sk);
		if (err)
			goto out;
	}

	timeo = sock_sndtimeo(sk, flags & O_NONBLOCK);

	err = prepare_peercred(&peercred);
	if (err)
		goto out;

	/* create new sock for complete connection */
	newsk = unix_create1(net, NULL, 0, sock->type);
	if (IS_ERR(newsk)) {
		err = PTR_ERR(newsk);
		goto out;
	}

	/* Allocate skb for sending to listening sock */
	skb = sock_wmalloc(newsk, 1, 0, GFP_KERNEL);
	if (!skb) {
		err = -ENOMEM;
		goto out_free_sk;
	}

restart:
	/*  Find listening sock. */
	other = unix_find_other(net, sunaddr, addr_len, sk->sk_type, flags);
	if (IS_ERR(other)) {
		err = PTR_ERR(other);
		goto out_free_skb;
	}

	unix_state_lock(other);

	/* Apparently VFS overslept socket death. Retry. */
	if (sock_flag(other, SOCK_DEAD)) {
		unix_state_unlock(other);
		sock_put(other);
		goto restart;
	}

	if (other->sk_state != TCP_LISTEN ||
	    other->sk_shutdown & RCV_SHUTDOWN) {
		err = -ECONNREFUSED;
		goto out_unlock;
	}

	if (unix_recvq_full_lockless(other)) {
		if (!timeo) {
			err = -EAGAIN;
			goto out_unlock;
		}

		timeo = unix_wait_for_peer(other, timeo);
		sock_put(other);

		err = sock_intr_errno(timeo);
		if (signal_pending(current))
			goto out_free_skb;

		goto restart;
	}

	/* Self-connect and simultaneous connect are eliminated
	 * by rejecting a TCP_LISTEN socket to avoid deadlock.
	 */
	state = READ_ONCE(sk->sk_state);
	if (unlikely(state != TCP_CLOSE)) {
		err = state == TCP_ESTABLISHED ? -EISCONN : -EINVAL;
		goto out_unlock;
	}

	unix_state_lock(sk);

	if (unlikely(sk->sk_state != TCP_CLOSE)) {
		err = sk->sk_state == TCP_ESTABLISHED ? -EISCONN : -EINVAL;
		unix_state_unlock(sk);
		goto out_unlock;
	}

	err = security_unix_stream_connect(sk, other, newsk);
	if (err) {
		unix_state_unlock(sk);
		goto out_unlock;
	}

	/* The way is open! Quickly set all the necessary fields... */

	sock_hold(sk);
	unix_peer(newsk) = sk;
	newsk->sk_state = TCP_ESTABLISHED;
	newsk->sk_type = sk->sk_type;
	newsk->sk_scm_recv_flags = other->sk_scm_recv_flags;
	init_peercred(newsk, &peercred);

	newu = unix_sk(newsk);
	newu->listener = other;
	RCU_INIT_POINTER(newsk->sk_wq, &newu->peer_wq);
	otheru = unix_sk(other);

	/* copy address information from listening to new sock
	 *
	 * The contents of *(otheru->addr) and otheru->path
	 * are seen fully set up here, since we have found
	 * otheru in hash under its lock.  Insertion into the
	 * hash chain we'd found it in had been done in an
	 * earlier critical area protected by the chain's lock,
	 * the same one where we'd set *(otheru->addr) contents,
	 * as well as otheru->path and otheru->addr itself.
	 *
	 * Using smp_store_release() here to set newu->addr
	 * is enough to make those stores, as well as stores
	 * to newu->path visible to anyone who gets newu->addr
	 * by smp_load_acquire().  IOW, the same guarantees
	 * as for unix_sock instances bound in unix_bind() or
	 * in unix_autobind().
	 */
	if (otheru->path.dentry) {
		path_get(&otheru->path);
		newu->path = otheru->path;
	}
	refcount_inc(&otheru->addr->refcnt);
	smp_store_release(&newu->addr, otheru->addr);

	/* Set credentials */
	copy_peercred(sk, other);

	sock->state	= SS_CONNECTED;
	WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED);
	sock_hold(newsk);

	smp_mb__after_atomic();	/* sock_hold() does an atomic_inc() */
	unix_peer(sk)	= newsk;

	unix_state_unlock(sk);

	/* queue the skb and notify the listening sock */
	spin_lock(&other->sk_receive_queue.lock);
	__skb_queue_tail(&other->sk_receive_queue, skb);
	spin_unlock(&other->sk_receive_queue.lock);
	unix_state_unlock(other);
	other->sk_data_ready(other);
	sock_put(other);
	return 0;

out_unlock:
	unix_state_unlock(other);
	sock_put(other);
out_free_skb:
	consume_skb(skb);
out_free_sk:
	unix_release_sock(newsk, 0);
out:
	drop_peercred(&peercred);
	return err;
}

static int unix_socketpair(struct socket *socka, struct socket *sockb)
{
	struct unix_peercred ska_peercred = {}, skb_peercred = {};
	struct sock *ska = socka->sk, *skb = sockb->sk;
	int err;

	err = prepare_peercred(&ska_peercred);
	if (err)
		return err;

	err = prepare_peercred(&skb_peercred);
	if (err) {
		drop_peercred(&ska_peercred);
		return err;
	}

	/* Join our sockets back to back */
	sock_hold(ska);
	sock_hold(skb);
	unix_peer(ska) = skb;
	unix_peer(skb) = ska;
	init_peercred(ska, &ska_peercred);
	init_peercred(skb, &skb_peercred);

	ska->sk_state = TCP_ESTABLISHED;
	skb->sk_state = TCP_ESTABLISHED;
	socka->state  = SS_CONNECTED;
	sockb->state  = SS_CONNECTED;
	return 0;
}

static int unix_accept(struct socket *sock, struct socket *newsock,
		       struct proto_accept_arg *arg)
{
	struct sock *sk = sock->sk;
	struct sk_buff *skb;
	struct sock *tsk;

	arg->err = -EOPNOTSUPP;
	if (sock->type != SOCK_STREAM && sock->type != SOCK_SEQPACKET)
		goto out;

	arg->err = -EINVAL;
	if (READ_ONCE(sk->sk_state) != TCP_LISTEN)
		goto out;

	/* If socket state is TCP_LISTEN it cannot change (for now...),
	 * so that no locks are necessary.
	 */

	skb = skb_recv_datagram(sk, (arg->flags & O_NONBLOCK) ? MSG_DONTWAIT : 0,
				&arg->err);
	if (!skb) {
		/* This means receive shutdown. */
		if (arg->err == 0)
			arg->err = -EINVAL;
		goto out;
	}

	tsk = skb->sk;
	skb_free_datagram(sk, skb);
	wake_up_interruptible(&unix_sk(sk)->peer_wait);

	if (tsk->sk_type == SOCK_STREAM)
		set_bit(SOCK_CUSTOM_SOCKOPT, &newsock->flags);

	/* attach accepted sock to socket */
	unix_state_lock(tsk);
	unix_update_edges(unix_sk(tsk));
	newsock->state = SS_CONNECTED;
	sock_graft(tsk, newsock);
	unix_state_unlock(tsk);
	return 0;

out:
	return arg->err;
}


static int unix_getname(struct socket *sock, struct sockaddr *uaddr, int peer)
{
	struct sock *sk = sock->sk;
	struct unix_address *addr;
	DECLARE_SOCKADDR(struct sockaddr_un *, sunaddr, uaddr);
	int err = 0;

	if (peer) {
		sk = unix_peer_get(sk);

		err = -ENOTCONN;
		if (!sk)
			goto out;
		err = 0;
	} else {
		sock_hold(sk);
	}

	addr = smp_load_acquire(&unix_sk(sk)->addr);
	if (!addr) {
		sunaddr->sun_family = AF_UNIX;
		sunaddr->sun_path[0] = 0;
		err = offsetof(struct sockaddr_un, sun_path);
	} else {
		err = addr->len;
		memcpy(sunaddr, addr->name, addr->len);

		if (peer)
			BPF_CGROUP_RUN_SA_PROG(sk, uaddr, &err,
					       CGROUP_UNIX_GETPEERNAME);
		else
			BPF_CGROUP_RUN_SA_PROG(sk, uaddr, &err,
					       CGROUP_UNIX_GETSOCKNAME);
	}
	sock_put(sk);
out:
	return err;
}

/* The "user->unix_inflight" variable is protected by the garbage
 * collection lock, and we just read it locklessly here. If you go
 * over the limit, there might be a tiny race in actually noticing
 * it across threads. Tough.
 */
static inline bool too_many_unix_fds(struct task_struct *p)
{
	struct user_struct *user = current_user();

	if (unlikely(READ_ONCE(user->unix_inflight) > task_rlimit(p, RLIMIT_NOFILE)))
		return !capable(CAP_SYS_RESOURCE) && !capable(CAP_SYS_ADMIN);
	return false;
}

static int unix_attach_fds(struct scm_cookie *scm, struct sk_buff *skb)
{
	if (too_many_unix_fds(current))
		return -ETOOMANYREFS;

	UNIXCB(skb).fp = scm->fp;
	scm->fp = NULL;

	if (unix_prepare_fpl(UNIXCB(skb).fp))
		return -ENOMEM;

	return 0;
}

static void unix_detach_fds(struct scm_cookie *scm, struct sk_buff *skb)
{
	scm->fp = UNIXCB(skb).fp;
	UNIXCB(skb).fp = NULL;

	unix_destroy_fpl(scm->fp);
}

static void unix_peek_fds(struct scm_cookie *scm, struct sk_buff *skb)
{
	scm->fp = scm_fp_dup(UNIXCB(skb).fp);
}

static void unix_destruct_scm(struct sk_buff *skb)
{
	struct scm_cookie scm;

	memset(&scm, 0, sizeof(scm));
	scm.pid = UNIXCB(skb).pid;
	if (UNIXCB(skb).fp)
		unix_detach_fds(&scm, skb);

	/* Alas, it calls VFS */
	/* So fscking what? fput() had been SMP-safe since the last Summer */
	scm_destroy(&scm);
	sock_wfree(skb);
}

static int unix_scm_to_skb(struct scm_cookie *scm, struct sk_buff *skb, bool send_fds)
{
	int err = 0;

	UNIXCB(skb).pid = get_pid(scm->pid);
	UNIXCB(skb).uid = scm->creds.uid;
	UNIXCB(skb).gid = scm->creds.gid;
	UNIXCB(skb).fp = NULL;
	unix_get_secdata(scm, skb);
	if (scm->fp && send_fds)
		err = unix_attach_fds(scm, skb);

	skb->destructor = unix_destruct_scm;
	return err;
}

static void unix_skb_to_scm(struct sk_buff *skb, struct scm_cookie *scm)
{
	scm_set_cred(scm, UNIXCB(skb).pid, UNIXCB(skb).uid, UNIXCB(skb).gid);
	unix_set_secdata(scm, skb);
}

/**
 * unix_maybe_add_creds() - Adds current task uid/gid and struct pid to skb if needed.
 * @skb: skb to attach creds to.
 * @sk: Sender sock.
 * @other: Receiver sock.
 *
 * Some apps rely on write() giving SCM_CREDENTIALS.
 * We include credentials if the source or destination socket
 * asserted SOCK_PASSCRED.
 *
 * Context: May sleep.
 * Return: On success zero, on error a negative error code is returned.
 */
static int unix_maybe_add_creds(struct sk_buff *skb, const struct sock *sk,
				const struct sock *other)
{
	if (UNIXCB(skb).pid)
		return 0;

	if (unix_may_passcred(sk) || unix_may_passcred(other) ||
	    !other->sk_socket) {
		struct pid *pid;
		int err;

		pid = task_tgid(current);
		err = pidfs_register_pid(pid);
		if (unlikely(err))
			return err;

		UNIXCB(skb).pid = get_pid(pid);
		current_uid_gid(&UNIXCB(skb).uid, &UNIXCB(skb).gid);
	}

	return 0;
}

static bool unix_skb_scm_eq(struct sk_buff *skb,
			    struct scm_cookie *scm)
{
	return UNIXCB(skb).pid == scm->pid &&
	       uid_eq(UNIXCB(skb).uid, scm->creds.uid) &&
	       gid_eq(UNIXCB(skb).gid, scm->creds.gid) &&
	       unix_secdata_eq(scm, skb);
}

static void scm_stat_add(struct sock *sk, struct sk_buff *skb)
{
	struct scm_fp_list *fp = UNIXCB(skb).fp;
	struct unix_sock *u = unix_sk(sk);

	if (unlikely(fp && fp->count)) {
		atomic_add(fp->count, &u->scm_stat.nr_fds);
		unix_add_edges(fp, u);
	}
}

static void scm_stat_del(struct sock *sk, struct sk_buff *skb)
{
	struct scm_fp_list *fp = UNIXCB(skb).fp;
	struct unix_sock *u = unix_sk(sk);

	if (unlikely(fp && fp->count)) {
		atomic_sub(fp->count, &u->scm_stat.nr_fds);
		unix_del_edges(fp);
	}
}

/*
 *	Send AF_UNIX data.
 */

static int unix_dgram_sendmsg(struct socket *sock, struct msghdr *msg,
			      size_t len)
{
	struct sock *sk = sock->sk, *other = NULL;
	struct unix_sock *u = unix_sk(sk);
	struct scm_cookie scm;
	struct sk_buff *skb;
	int data_len = 0;
	int sk_locked;
	long timeo;
	int err;

	err = scm_send(sock, msg, &scm, false);
	if (err < 0)
		return err;

	if (msg->msg_flags & MSG_OOB) {
		err = -EOPNOTSUPP;
		goto out;
	}

	if (msg->msg_namelen) {
		err = unix_validate_addr(msg->msg_name, msg->msg_namelen);
		if (err)
			goto out;

		err = BPF_CGROUP_RUN_PROG_UNIX_SENDMSG_LOCK(sk,
							    msg->msg_name,
							    &msg->msg_namelen,
							    NULL);
		if (err)
			goto out;
	}

	if (unix_may_passcred(sk) && !READ_ONCE(u->addr)) {
		err = unix_autobind(sk);
		if (err)
			goto out;
	}

	if (len > READ_ONCE(sk->sk_sndbuf) - 32) {
		err = -EMSGSIZE;
		goto out;
	}

	if (len > SKB_MAX_ALLOC) {
		data_len = min_t(size_t,
				 len - SKB_MAX_ALLOC,
				 MAX_SKB_FRAGS * PAGE_SIZE);
		data_len = PAGE_ALIGN(data_len);

		BUILD_BUG_ON(SKB_MAX_ALLOC < PAGE_SIZE);
	}

	skb = sock_alloc_send_pskb(sk, len - data_len, data_len,
				   msg->msg_flags & MSG_DONTWAIT, &err,
				   PAGE_ALLOC_COSTLY_ORDER);
	if (!skb)
		goto out;

	err = unix_scm_to_skb(&scm, skb, true);
	if (err < 0)
		goto out_free;

	skb_put(skb, len - data_len);
	skb->data_len = data_len;
	skb->len = len;
	err = skb_copy_datagram_from_iter(skb, 0, &msg->msg_iter, len);
	if (err)
		goto out_free;

	timeo = sock_sndtimeo(sk, msg->msg_flags & MSG_DONTWAIT);

	if (msg->msg_namelen) {
lookup:
		other = unix_find_other(sock_net(sk), msg->msg_name,
					msg->msg_namelen, sk->sk_type, 0);
		if (IS_ERR(other)) {
			err = PTR_ERR(other);
			goto out_free;
		}
	} else {
		other = unix_peer_get(sk);
		if (!other) {
			err = -ENOTCONN;
			goto out_free;
		}
	}

	if (sk_filter(other, skb) < 0) {
		/* Toss the packet but do not return any error to the sender */
		err = len;
		goto out_sock_put;
	}

	err = unix_maybe_add_creds(skb, sk, other);
	if (err)
		goto out_sock_put;

restart:
	sk_locked = 0;
	unix_state_lock(other);
restart_locked:

	if (!unix_may_send(sk, other)) {
		err = -EPERM;
		goto out_unlock;
	}

	if (unlikely(sock_flag(other, SOCK_DEAD))) {
		/* Check with POSIX 1003.1g - what should a datagram error return? */

		unix_state_unlock(other);

		if (sk->sk_type == SOCK_SEQPACKET) {
			/* We are here only when racing with unix_release_sock()
			 * is clearing @other. Never change state to TCP_CLOSE
			 * unlike SOCK_DGRAM wants.
			 */
			err = -EPIPE;
			goto out_sock_put;
		}

		if (!sk_locked)
			unix_state_lock(sk);

		if (unix_peer(sk) == other) {
			unix_peer(sk) = NULL;
			unix_dgram_peer_wake_disconnect_wakeup(sk, other);

			WRITE_ONCE(sk->sk_state, TCP_CLOSE);
			unix_state_unlock(sk);

			unix_dgram_disconnected(sk, other);
			sock_put(other);
			err = -ECONNREFUSED;
			goto out_sock_put;
		}

		unix_state_unlock(sk);

		if (!msg->msg_namelen) {
			err = -ECONNRESET;
			goto out_sock_put;
		}

		sock_put(other);
		goto lookup;
	}

	if (other->sk_shutdown & RCV_SHUTDOWN) {
		err = -EPIPE;
		goto out_unlock;
	}

	if (UNIXCB(skb).fp && !other->sk_scm_rights) {
		err = -EPERM;
		goto out_unlock;
	}

	if (sk->sk_type != SOCK_SEQPACKET) {
		err = security_unix_may_send(sk->sk_socket, other->sk_socket);
		if (err)
			goto out_unlock;
	}

	/* other == sk && unix_peer(other) != sk if
	 * - unix_peer(sk) == NULL, destination address bound to sk
	 * - unix_peer(sk) == sk by time of get but disconnected before lock
	 */
	if (other != sk &&
	    unlikely(unix_peer(other) != sk &&
	    unix_recvq_full_lockless(other))) {
		if (timeo) {
			timeo = unix_wait_for_peer(other, timeo);

			err = sock_intr_errno(timeo);
			if (signal_pending(current))
				goto out_sock_put;

			goto restart;
		}

		if (!sk_locked) {
			unix_state_unlock(other);
			unix_state_double_lock(sk, other);
		}

		if (unix_peer(sk) != other ||
		    unix_dgram_peer_wake_me(sk, other)) {
			err = -EAGAIN;
			sk_locked = 1;
			goto out_unlock;
		}

		if (!sk_locked) {
			sk_locked = 1;
			goto restart_locked;
		}
	}

	if (unlikely(sk_locked))
		unix_state_unlock(sk);

	if (sock_flag(other, SOCK_RCVTSTAMP))
		__net_timestamp(skb);

	scm_stat_add(other, skb);
	skb_queue_tail(&other->sk_receive_queue, skb);
	unix_state_unlock(other);
	other->sk_data_ready(other);
	sock_put(other);
	scm_destroy(&scm);
	return len;

out_unlock:
	if (sk_locked)
		unix_state_unlock(sk);
	unix_state_unlock(other);
out_sock_put:
	sock_put(other);
out_free:
	consume_skb(skb);
out:
	scm_destroy(&scm);
	return err;
}

/* We use paged skbs for stream sockets, and limit occupancy to 32768
 * bytes, and a minimum of a full page.
 */
#define UNIX_SKB_FRAGS_SZ (PAGE_SIZE << get_order(32768))

#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
static int queue_oob(struct sock *sk, struct msghdr *msg, struct sock *other,
		     struct scm_cookie *scm, bool fds_sent)
{
	struct unix_sock *ousk = unix_sk(other);
	struct sk_buff *skb;
	int err;

	skb = sock_alloc_send_skb(sk, 1, msg->msg_flags & MSG_DONTWAIT, &err);

	if (!skb)
		return err;

	err = unix_scm_to_skb(scm, skb, !fds_sent);
	if (err < 0)
		goto out;

	err = unix_maybe_add_creds(skb, sk, other);
	if (err)
		goto out;

	skb_put(skb, 1);
	err = skb_copy_datagram_from_iter(skb, 0, &msg->msg_iter, 1);

	if (err)
		goto out;

	unix_state_lock(other);

	if (sock_flag(other, SOCK_DEAD) ||
	    (other->sk_shutdown & RCV_SHUTDOWN)) {
		err = -EPIPE;
		goto out_unlock;
	}

	if (UNIXCB(skb).fp && !other->sk_scm_rights) {
		err = -EPERM;
		goto out_unlock;
	}

	scm_stat_add(other, skb);

	spin_lock(&other->sk_receive_queue.lock);
	WRITE_ONCE(ousk->oob_skb, skb);
	WRITE_ONCE(ousk->inq_len, ousk->inq_len + 1);
	__skb_queue_tail(&other->sk_receive_queue, skb);
	spin_unlock(&other->sk_receive_queue.lock);

	sk_send_sigurg(other);
	unix_state_unlock(other);
	other->sk_data_ready(other);

	return 0;
out_unlock:
	unix_state_unlock(other);
out:
	consume_skb(skb);
	return err;
}
#endif

static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
			       size_t len)
{
	struct sock *sk = sock->sk;
	struct sk_buff *skb = NULL;
	struct sock *other = NULL;
	struct unix_sock *otheru;
	struct scm_cookie scm;
	bool fds_sent = false;
	int err, sent = 0;

	err = scm_send(sock, msg, &scm, false);
	if (err < 0)
		return err;

	if (msg->msg_flags & MSG_OOB) {
		err = -EOPNOTSUPP;
#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
		if (len)
			len--;
		else
#endif
			goto out_err;
	}

	if (msg->msg_namelen) {
		err = READ_ONCE(sk->sk_state) == TCP_ESTABLISHED ? -EISCONN : -EOPNOTSUPP;
		goto out_err;
	}

	other = unix_peer(sk);
	if (!other) {
		err = -ENOTCONN;
		goto out_err;
	}

	otheru = unix_sk(other);

	if (READ_ONCE(sk->sk_shutdown) & SEND_SHUTDOWN)
		goto out_pipe;

	while (sent < len) {
		int size = len - sent;
		int data_len;

		if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES)) {
			skb = sock_alloc_send_pskb(sk, 0, 0,
						   msg->msg_flags & MSG_DONTWAIT,
						   &err, 0);
		} else {
			/* Keep two messages in the pipe so it schedules better */
			size = min_t(int, size, (READ_ONCE(sk->sk_sndbuf) >> 1) - 64);

			/* allow fallback to order-0 allocations */
			size = min_t(int, size, SKB_MAX_HEAD(0) + UNIX_SKB_FRAGS_SZ);

			data_len = max_t(int, 0, size - SKB_MAX_HEAD(0));

			data_len = min_t(size_t, size, PAGE_ALIGN(data_len));

			skb = sock_alloc_send_pskb(sk, size - data_len, data_len,
						   msg->msg_flags & MSG_DONTWAIT, &err,
						   get_order(UNIX_SKB_FRAGS_SZ));
		}
		if (!skb)
			goto out_err;

		/* Only send the fds in the first buffer */
		err = unix_scm_to_skb(&scm, skb, !fds_sent);
		if (err < 0)
			goto out_free;

		fds_sent = true;

		err = unix_maybe_add_creds(skb, sk, other);
		if (err)
			goto out_free;

		if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES)) {
			skb->ip_summed = CHECKSUM_UNNECESSARY;
			err = skb_splice_from_iter(skb, &msg->msg_iter, size);
			if (err < 0)
				goto out_free;

			size = err;
			refcount_add(size, &sk->sk_wmem_alloc);
		} else {
			skb_put(skb, size - data_len);
			skb->data_len = data_len;
			skb->len = size;
			err = skb_copy_datagram_from_iter(skb, 0, &msg->msg_iter, size);
			if (err)
				goto out_free;
		}

		unix_state_lock(other);

		if (sock_flag(other, SOCK_DEAD) ||
		    (other->sk_shutdown & RCV_SHUTDOWN))
			goto out_pipe_unlock;

		if (UNIXCB(skb).fp && !other->sk_scm_rights) {
			unix_state_unlock(other);
			err = -EPERM;
			goto out_free;
		}

		scm_stat_add(other, skb);

		spin_lock(&other->sk_receive_queue.lock);
		WRITE_ONCE(otheru->inq_len, otheru->inq_len + skb->len);
		__skb_queue_tail(&other->sk_receive_queue, skb);
		spin_unlock(&other->sk_receive_queue.lock);

		unix_state_unlock(other);
		other->sk_data_ready(other);
		sent += size;
	}

#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
	if (msg->msg_flags & MSG_OOB) {
		err = queue_oob(sk, msg, other, &scm, fds_sent);
		if (err)
			goto out_err;
		sent++;
	}
#endif

	scm_destroy(&scm);

	return sent;

out_pipe_unlock:
	unix_state_unlock(other);
out_pipe:
	if (!sent && !(msg->msg_flags & MSG_NOSIGNAL))
		send_sig(SIGPIPE, current, 0);
	err = -EPIPE;
out_free:
	consume_skb(skb);
out_err:
	scm_destroy(&scm);
	return sent ? : err;
}

static int unix_seqpacket_sendmsg(struct socket *sock, struct msghdr *msg,
				  size_t len)
{
	int err;
	struct sock *sk = sock->sk;

	err = sock_error(sk);
	if (err)
		return err;

	if (READ_ONCE(sk->sk_state) != TCP_ESTABLISHED)
		return -ENOTCONN;

	if (msg->msg_namelen)
		msg->msg_namelen = 0;

	return unix_dgram_sendmsg(sock, msg, len);
}

static int unix_seqpacket_recvmsg(struct socket *sock, struct msghdr *msg,
				  size_t size, int flags)
{
	struct sock *sk = sock->sk;

	if (READ_ONCE(sk->sk_state) != TCP_ESTABLISHED)
		return -ENOTCONN;

	return unix_dgram_recvmsg(sock, msg, size, flags);
}

static void unix_copy_addr(struct msghdr *msg, struct sock *sk)
{
	struct unix_address *addr = smp_load_acquire(&unix_sk(sk)->addr);

	if (addr) {
		msg->msg_namelen = addr->len;
		memcpy(msg->msg_name, addr->name, addr->len);
	}
}

int __unix_dgram_recvmsg(struct sock *sk, struct msghdr *msg, size_t size,
			 int flags)
{
	struct scm_cookie scm;
	struct socket *sock = sk->sk_socket;
	struct unix_sock *u = unix_sk(sk);
	struct sk_buff *skb, *last;
	long timeo;
	int skip;
	int err;

	err = -EOPNOTSUPP;
	if (flags & MSG_OOB)
		goto out;

	timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);

	do {
		mutex_lock(&u->iolock);

		skip = sk_peek_offset(sk, flags);
		skb = __skb_try_recv_datagram(sk, &sk->sk_receive_queue, flags,
					      &skip, &err, &last);
		if (skb) {
			if (!(flags & MSG_PEEK))
				scm_stat_del(sk, skb);
			break;
		}

		mutex_unlock(&u->iolock);

		if (err != -EAGAIN)
			break;
	} while (timeo &&
		 !__skb_wait_for_more_packets(sk, &sk->sk_receive_queue,
					      &err, &timeo, last));

	if (!skb) { /* implies iolock unlocked */
		/* Signal EOF on disconnected non-blocking SEQPACKET socket. */
		if (sk->sk_type == SOCK_SEQPACKET && err == -EAGAIN &&
		    (READ_ONCE(sk->sk_shutdown) & RCV_SHUTDOWN))
			err = 0;
		goto out;
	}

	if (wq_has_sleeper(&u->peer_wait))
		wake_up_interruptible_sync_poll(&u->peer_wait,
						EPOLLOUT | EPOLLWRNORM |
						EPOLLWRBAND);

	if (msg->msg_name) {
		unix_copy_addr(msg, skb->sk);

		BPF_CGROUP_RUN_PROG_UNIX_RECVMSG_LOCK(sk,
						      msg->msg_name,
						      &msg->msg_namelen);
	}

	if (size > skb->len - skip)
		size = skb->len - skip;
	else if (size < skb->len - skip)
		msg->msg_flags |= MSG_TRUNC;

	err = skb_copy_datagram_msg(skb, skip, msg, size);
	if (err)
		goto out_free;

	if (sock_flag(sk, SOCK_RCVTSTAMP))
		__sock_recv_timestamp(msg, sk, skb);

	memset(&scm, 0, sizeof(scm));

	unix_skb_to_scm(skb, &scm);

	if (!(flags & MSG_PEEK)) {
		if (UNIXCB(skb).fp)
			unix_detach_fds(&scm, skb);

		sk_peek_offset_bwd(sk, skb->len);
	} else {
		/* It is questionable: on PEEK we could:
		   - do not return fds - good, but too simple 8)
		   - return fds, and do not return them on read (old strategy,
		     apparently wrong)
		   - clone fds (I chose it for now, it is the most universal
		     solution)

		   POSIX 1003.1g does not actually define this clearly
		   at all. POSIX 1003.1g doesn't define a lot of things
		   clearly however!

		*/

		sk_peek_offset_fwd(sk, size);

		if (UNIXCB(skb).fp)
			unix_peek_fds(&scm, skb);
	}
	err = (flags & MSG_TRUNC) ? skb->len - skip : size;

	scm_recv_unix(sock, msg, &scm, flags);

out_free:
	skb_free_datagram(sk, skb);
	mutex_unlock(&u->iolock);
out:
	return err;
}

static int unix_dgram_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
			      int flags)
{
	struct sock *sk = sock->sk;

#ifdef CONFIG_BPF_SYSCALL
	const struct proto *prot = READ_ONCE(sk->sk_prot);

	if (prot != &unix_dgram_proto)
		return prot->recvmsg(sk, msg, size, flags, NULL);
#endif
	return __unix_dgram_recvmsg(sk, msg, size, flags);
}

static int unix_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
{
	struct unix_sock *u = unix_sk(sk);
	struct sk_buff *skb;
	int err;

	mutex_lock(&u->iolock);
	skb = skb_recv_datagram(sk, MSG_DONTWAIT, &err);
	mutex_unlock(&u->iolock);
	if (!skb)
		return err;

	return recv_actor(sk, skb);
}

/*
 *	Sleep until more data has arrived. But check for races.
 */
static long unix_stream_data_wait(struct sock *sk, long timeo,
				  struct sk_buff *last, unsigned int last_len,
				  bool freezable)
{
	unsigned int state = TASK_INTERRUPTIBLE | freezable * TASK_FREEZABLE;
	struct sk_buff *tail;
	DEFINE_WAIT(wait);

	unix_state_lock(sk);

	for (;;) {
		prepare_to_wait(sk_sleep(sk), &wait, state);

		tail = skb_peek_tail(&sk->sk_receive_queue);
		if (tail != last ||
		    (tail && tail->len != last_len) ||
		    sk->sk_err ||
		    (sk->sk_shutdown & RCV_SHUTDOWN) ||
		    signal_pending(current) ||
		    !timeo)
			break;

		sk_set_bit(SOCKWQ_ASYNC_WAITDATA, sk);
		unix_state_unlock(sk);
		timeo = schedule_timeout(timeo);
		unix_state_lock(sk);

		if (sock_flag(sk, SOCK_DEAD))
			break;

		sk_clear_bit(SOCKWQ_ASYNC_WAITDATA, sk);
	}

	finish_wait(sk_sleep(sk), &wait);
	unix_state_unlock(sk);
	return timeo;
}

struct unix_stream_read_state {
	int (*recv_actor)(struct sk_buff *, int, int,
			  struct unix_stream_read_state *);
	struct socket *socket;
	struct msghdr *msg;
	struct pipe_inode_info *pipe;
	size_t size;
	int flags;
	unsigned int splice_flags;
};

#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
static int unix_stream_recv_urg(struct unix_stream_read_state *state)
{
	struct sk_buff *oob_skb, *read_skb = NULL;
	struct socket *sock = state->socket;
	struct sock *sk = sock->sk;
	struct unix_sock *u = unix_sk(sk);
	int chunk = 1;

	mutex_lock(&u->iolock);
	unix_state_lock(sk);
	spin_lock(&sk->sk_receive_queue.lock);

	if (sock_flag(sk, SOCK_URGINLINE) || !u->oob_skb) {
		spin_unlock(&sk->sk_receive_queue.lock);
		unix_state_unlock(sk);
		mutex_unlock(&u->iolock);
		return -EINVAL;
	}

	oob_skb = u->oob_skb;

	if (!(state->flags & MSG_PEEK)) {
		WRITE_ONCE(u->oob_skb, NULL);
		WRITE_ONCE(u->inq_len, u->inq_len - 1);

		if (oob_skb->prev != (struct sk_buff *)&sk->sk_receive_queue &&
		    !unix_skb_len(oob_skb->prev)) {
			read_skb = oob_skb->prev;
			__skb_unlink(read_skb, &sk->sk_receive_queue);
		}
	}

	spin_unlock(&sk->sk_receive_queue.lock);
	unix_state_unlock(sk);

	chunk = state->recv_actor(oob_skb, 0, chunk, state);

	if (!(state->flags & MSG_PEEK))
		UNIXCB(oob_skb).consumed += 1;

	mutex_unlock(&u->iolock);

	consume_skb(read_skb);

	if (chunk < 0)
		return -EFAULT;

	state->msg->msg_flags |= MSG_OOB;
	return 1;
}

static struct sk_buff *manage_oob(struct sk_buff *skb, struct sock *sk,
				  int flags, int copied)
{
	struct sk_buff *read_skb = NULL, *unread_skb = NULL;
	struct unix_sock *u = unix_sk(sk);

	if (likely(unix_skb_len(skb) && skb != READ_ONCE(u->oob_skb)))
		return skb;

	spin_lock(&sk->sk_receive_queue.lock);

	if (!unix_skb_len(skb)) {
		if (copied && (!u->oob_skb || skb == u->oob_skb)) {
			skb = NULL;
		} else if (flags & MSG_PEEK) {
			skb = skb_peek_next(skb, &sk->sk_receive_queue);
		} else {
			read_skb = skb;
			skb = skb_peek_next(skb, &sk->sk_receive_queue);
			__skb_unlink(read_skb, &sk->sk_receive_queue);
		}

		if (!skb)
			goto unlock;
	}

	if (skb != u->oob_skb)
		goto unlock;

	if (copied) {
		skb = NULL;
	} else if (!(flags & MSG_PEEK)) {
		WRITE_ONCE(u->oob_skb, NULL);

		if (!sock_flag(sk, SOCK_URGINLINE)) {
			__skb_unlink(skb, &sk->sk_receive_queue);
			unread_skb = skb;
			skb = skb_peek(&sk->sk_receive_queue);
		}
	} else if (!sock_flag(sk, SOCK_URGINLINE)) {
		skb = skb_peek_next(skb, &sk->sk_receive_queue);
	}

unlock:
	spin_unlock(&sk->sk_receive_queue.lock);

	consume_skb(read_skb);
	kfree_skb_reason(unread_skb, SKB_DROP_REASON_UNIX_SKIP_OOB);

	return skb;
}
#endif

static int unix_stream_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
{
	struct sk_buff_head *queue = &sk->sk_receive_queue;
	struct unix_sock *u = unix_sk(sk);
	struct sk_buff *skb;
	int err;

	if (unlikely(READ_ONCE(sk->sk_state) != TCP_ESTABLISHED))
		return -ENOTCONN;

	err = sock_error(sk);
	if (err)
		return err;

	mutex_lock(&u->iolock);
	spin_lock(&queue->lock);

	skb = __skb_dequeue(queue);
	if (!skb) {
		spin_unlock(&queue->lock);
		mutex_unlock(&u->iolock);
		return -EAGAIN;
	}

	WRITE_ONCE(u->inq_len, u->inq_len - skb->len);

#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
	if (skb == u->oob_skb) {
		WRITE_ONCE(u->oob_skb, NULL);
		spin_unlock(&queue->lock);
		mutex_unlock(&u->iolock);

		kfree_skb_reason(skb, SKB_DROP_REASON_UNIX_SKIP_OOB);
		return -EAGAIN;
	}
#endif

	spin_unlock(&queue->lock);
	mutex_unlock(&u->iolock);

	return recv_actor(sk, skb);
}

static int unix_stream_read_generic(struct unix_stream_read_state *state,
				    bool freezable)
{
	int noblock = state->flags & MSG_DONTWAIT;
	struct socket *sock = state->socket;
	struct msghdr *msg = state->msg;
	struct sock *sk = sock->sk;
	size_t size = state->size;
	int flags = state->flags;
	bool check_creds = false;
	struct scm_cookie scm;
	unsigned int last_len;
	struct unix_sock *u;
	int copied = 0;
	int err = 0;
	long timeo;
	int target;
	int skip;

	if (unlikely(READ_ONCE(sk->sk_state) != TCP_ESTABLISHED)) {
		err = -EINVAL;
		goto out;
	}

	if (unlikely(flags & MSG_OOB)) {
		err = -EOPNOTSUPP;
#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
		err = unix_stream_recv_urg(state);
#endif
		goto out;
	}

	target = sock_rcvlowat(sk, flags & MSG_WAITALL, size);
	timeo = sock_rcvtimeo(sk, noblock);

	memset(&scm, 0, sizeof(scm));

	u = unix_sk(sk);

redo:
	/* Lock the socket to prevent queue disordering
	 * while we sleep in memcpy_to_msg().
	 */
	mutex_lock(&u->iolock);

	skip = max(sk_peek_offset(sk, flags), 0);

	do {
		struct sk_buff *skb, *last;
		int chunk;

		unix_state_lock(sk);
		if (sock_flag(sk, SOCK_DEAD)) {
			err = -ECONNRESET;
			goto unlock;
		}
		last = skb = skb_peek(&sk->sk_receive_queue);
		last_len = last ? last->len : 0;

again:
#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
		if (skb) {
			skb = manage_oob(skb, sk, flags, copied);
			if (!skb && copied) {
				unix_state_unlock(sk);
				break;
			}
		}
#endif
		if (!skb) {
			if (copied >= target)
				goto unlock;

			/*
			 *	POSIX 1003.1g mandates this order.
			 */

			err = sock_error(sk);
			if (err)
				goto unlock;
			if (sk->sk_shutdown & RCV_SHUTDOWN)
				goto unlock;

			unix_state_unlock(sk);
			if (!timeo) {
				err = -EAGAIN;
				break;
			}

			mutex_unlock(&u->iolock);

			timeo = unix_stream_data_wait(sk, timeo, last,
						      last_len, freezable);

			if (signal_pending(current)) {
				err = sock_intr_errno(timeo);
				scm_destroy(&scm);
				goto out;
			}

			goto redo;
unlock:
			unix_state_unlock(sk);
			break;
		}

		while (skip >= unix_skb_len(skb)) {
			skip -= unix_skb_len(skb);
			last = skb;
			last_len = skb->len;
			skb = skb_peek_next(skb, &sk->sk_receive_queue);
			if (!skb)
				goto again;
		}

		unix_state_unlock(sk);

		if (check_creds) {
			/* Never glue messages from different writers */
			if (!unix_skb_scm_eq(skb, &scm))
				break;
		} else if (unix_may_passcred(sk)) {
			/* Copy credentials */
			unix_skb_to_scm(skb, &scm);
			check_creds = true;
		}

		/* Copy address just once */
		if (msg && msg->msg_name) {
			DECLARE_SOCKADDR(struct sockaddr_un *, sunaddr, msg->msg_name);

			unix_copy_addr(msg, skb->sk);
			BPF_CGROUP_RUN_PROG_UNIX_RECVMSG_LOCK(sk, msg->msg_name,
							      &msg->msg_namelen);

			sunaddr = NULL;
		}

		chunk = min_t(unsigned int, unix_skb_len(skb) - skip, size);
		chunk = state->recv_actor(skb, skip, chunk, state);
		if (chunk < 0) {
			if (copied == 0)
				copied = -EFAULT;
			break;
		}
		copied += chunk;
		size -= chunk;

		/* Mark read part of skb as used */
		if (!(flags & MSG_PEEK)) {
			UNIXCB(skb).consumed += chunk;

			sk_peek_offset_bwd(sk, chunk);

			if (UNIXCB(skb).fp) {
				scm_stat_del(sk, skb);
				unix_detach_fds(&scm, skb);
			}

			if (unix_skb_len(skb))
				break;

			spin_lock(&sk->sk_receive_queue.lock);
			WRITE_ONCE(u->inq_len, u->inq_len - skb->len);
			__skb_unlink(skb, &sk->sk_receive_queue);
			spin_unlock(&sk->sk_receive_queue.lock);

			consume_skb(skb);

			if (scm.fp)
				break;
		} else {
			/* It is questionable, see note in unix_dgram_recvmsg.
			 */
			if (UNIXCB(skb).fp)
				unix_peek_fds(&scm, skb);

			sk_peek_offset_fwd(sk, chunk);

			if (UNIXCB(skb).fp)
				break;

			skip = 0;
			last = skb;
			last_len = skb->len;
			unix_state_lock(sk);
			skb = skb_peek_next(skb, &sk->sk_receive_queue);
			if (skb)
				goto again;
			unix_state_unlock(sk);
			break;
		}
	} while (size);

	mutex_unlock(&u->iolock);
	if (msg) {
		bool do_cmsg = READ_ONCE(u->recvmsg_inq);

		scm_recv_unix(sock, msg, &scm, flags);

		if ((do_cmsg | msg->msg_get_inq) && (copied ?: err) >= 0) {
			msg->msg_inq = READ_ONCE(u->inq_len);
			if (do_cmsg)
				put_cmsg(msg, SOL_SOCKET, SCM_INQ,
					 sizeof(msg->msg_inq), &msg->msg_inq);
		}
	} else {
		scm_destroy(&scm);
	}
out:
	return copied ? : err;
}

static int unix_stream_read_actor(struct sk_buff *skb,
				  int skip, int chunk,
				  struct unix_stream_read_state *state)
{
	int ret;

	ret = skb_copy_datagram_msg(skb, UNIXCB(skb).consumed + skip,
				    state->msg, chunk);
	return ret ?: chunk;
}

int __unix_stream_recvmsg(struct sock *sk, struct msghdr *msg,
			  size_t size, int flags)
{
	struct unix_stream_read_state state = {
		.recv_actor = unix_stream_read_actor,
		.socket = sk->sk_socket,
		.msg = msg,
		.size = size,
		.flags = flags
	};

	return unix_stream_read_generic(&state, true);
}

static int unix_stream_recvmsg(struct socket *sock, struct msghdr *msg,
			       size_t size, int flags)
{
	struct unix_stream_read_state state = {
		.recv_actor = unix_stream_read_actor,
		.socket = sock,
		.msg = msg,
		.size = size,
		.flags = flags
	};

#ifdef CONFIG_BPF_SYSCALL
	struct sock *sk = sock->sk;
	const struct proto *prot = READ_ONCE(sk->sk_prot);

	if (prot != &unix_stream_proto)
		return prot->recvmsg(sk, msg, size, flags, NULL);
#endif
	return unix_stream_read_generic(&state, true);
}

static int unix_stream_splice_actor(struct sk_buff *skb,
				    int skip, int chunk,
				    struct unix_stream_read_state *state)
{
	return skb_splice_bits(skb, state->socket->sk,
			       UNIXCB(skb).consumed + skip,
			       state->pipe, chunk, state->splice_flags);
}

static ssize_t unix_stream_splice_read(struct socket *sock,  loff_t *ppos,
				       struct pipe_inode_info *pipe,
				       size_t size, unsigned int flags)
{
	struct unix_stream_read_state state = {
		.recv_actor = unix_stream_splice_actor,
		.socket = sock,
		.pipe = pipe,
		.size = size,
		.splice_flags = flags,
	};

	if (unlikely(*ppos))
		return -ESPIPE;

	if (sock->file->f_flags & O_NONBLOCK ||
	    flags & SPLICE_F_NONBLOCK)
		state.flags = MSG_DONTWAIT;

	return unix_stream_read_generic(&state, false);
}

static int unix_shutdown(struct socket *sock, int mode)
{
	struct sock *sk = sock->sk;
	struct sock *other;

	if (mode < SHUT_RD || mode > SHUT_RDWR)
		return -EINVAL;
	/* This maps:
	 * SHUT_RD   (0) -> RCV_SHUTDOWN  (1)
	 * SHUT_WR   (1) -> SEND_SHUTDOWN (2)
	 * SHUT_RDWR (2) -> SHUTDOWN_MASK (3)
	 */
	++mode;

	unix_state_lock(sk);
	WRITE_ONCE(sk->sk_shutdown, sk->sk_shutdown | mode);
	other = unix_peer(sk);
	if (other)
		sock_hold(other);
	unix_state_unlock(sk);
	sk->sk_state_change(sk);

	if (other &&
		(sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET)) {

		int peer_mode = 0;
		const struct proto *prot = READ_ONCE(other->sk_prot);

		if (prot->unhash)
			prot->unhash(other);
		if (mode & RCV_SHUTDOWN)
			peer_mode |= SEND_SHUTDOWN;
		if (mode & SEND_SHUTDOWN)
			peer_mode |= RCV_SHUTDOWN;
		unix_state_lock(other);
		WRITE_ONCE(other->sk_shutdown, other->sk_shutdown | peer_mode);
		unix_state_unlock(other);
		other->sk_state_change(other);
		if (peer_mode == SHUTDOWN_MASK)
			sk_wake_async(other, SOCK_WAKE_WAITD, POLL_HUP);
		else if (peer_mode & RCV_SHUTDOWN)
			sk_wake_async(other, SOCK_WAKE_WAITD, POLL_IN);
	}
	if (other)
		sock_put(other);

	return 0;
}

long unix_inq_len(struct sock *sk)
{
	struct sk_buff *skb;
	long amount = 0;

	if (READ_ONCE(sk->sk_state) == TCP_LISTEN)
		return -EINVAL;

	if (sk->sk_type == SOCK_STREAM)
		return READ_ONCE(unix_sk(sk)->inq_len);

	spin_lock(&sk->sk_receive_queue.lock);
	if (sk->sk_type == SOCK_SEQPACKET) {
		skb_queue_walk(&sk->sk_receive_queue, skb)
			amount += unix_skb_len(skb);
	} else {
		skb = skb_peek(&sk->sk_receive_queue);
		if (skb)
			amount = skb->len;
	}
	spin_unlock(&sk->sk_receive_queue.lock);

	return amount;
}
EXPORT_SYMBOL_GPL(unix_inq_len);

long unix_outq_len(struct sock *sk)
{
	return sk_wmem_alloc_get(sk);
}
EXPORT_SYMBOL_GPL(unix_outq_len);

static int unix_open_file(struct sock *sk)
{
	if (!ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))
		return -EPERM;

	if (!smp_load_acquire(&unix_sk(sk)->addr))
		return -ENOENT;

	if (!unix_sk(sk)->path.dentry)
		return -ENOENT;

	return FD_ADD(O_CLOEXEC, dentry_open(&unix_sk(sk)->path, O_PATH, current_cred()));
}

static int unix_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
{
	struct sock *sk = sock->sk;
	long amount = 0;
	int err;

	switch (cmd) {
	case SIOCOUTQ:
		amount = unix_outq_len(sk);
		err = put_user(amount, (int __user *)arg);
		break;
	case SIOCINQ:
		amount = unix_inq_len(sk);
		if (amount < 0)
			err = amount;
		else
			err = put_user(amount, (int __user *)arg);
		break;
	case SIOCUNIXFILE:
		err = unix_open_file(sk);
		break;
#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
	case SIOCATMARK:
		{
			struct unix_sock *u = unix_sk(sk);
			struct sk_buff *skb;
			int answ = 0;

			mutex_lock(&u->iolock);

			skb = skb_peek(&sk->sk_receive_queue);
			if (skb) {
				struct sk_buff *oob_skb = READ_ONCE(u->oob_skb);
				struct sk_buff *next_skb;

				next_skb = skb_peek_next(skb, &sk->sk_receive_queue);

				if (skb == oob_skb ||
				    (!unix_skb_len(skb) &&
				     (!oob_skb || next_skb == oob_skb)))
					answ = 1;
			}

			mutex_unlock(&u->iolock);

			err = put_user(answ, (int __user *)arg);
		}
		break;
#endif
	default:
		err = -ENOIOCTLCMD;
		break;
	}
	return err;
}

#ifdef CONFIG_COMPAT
static int unix_compat_ioctl(struct socket *sock, unsigned int cmd, unsigned long arg)
{
	return unix_ioctl(sock, cmd, (unsigned long)compat_ptr(arg));
}
#endif

static __poll_t unix_poll(struct file *file, struct socket *sock, poll_table *wait)
{
	struct sock *sk = sock->sk;
	unsigned char state;
	__poll_t mask;
	u8 shutdown;

	sock_poll_wait(file, sock, wait);
	mask = 0;
	shutdown = READ_ONCE(sk->sk_shutdown);
	state = READ_ONCE(sk->sk_state);

	/* exceptional events? */
	if (READ_ONCE(sk->sk_err))
		mask |= EPOLLERR;
	if (shutdown == SHUTDOWN_MASK)
		mask |= EPOLLHUP;
	if (shutdown & RCV_SHUTDOWN)
		mask |= EPOLLRDHUP | EPOLLIN | EPOLLRDNORM;

	/* readable? */
	if (!skb_queue_empty_lockless(&sk->sk_receive_queue))
		mask |= EPOLLIN | EPOLLRDNORM;
	if (sk_is_readable(sk))
		mask |= EPOLLIN | EPOLLRDNORM;
#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
	if (READ_ONCE(unix_sk(sk)->oob_skb))
		mask |= EPOLLPRI;
#endif

	/* Connection-based need to check for termination and startup */
	if ((sk->sk_type == SOCK_STREAM || sk->sk_type == SOCK_SEQPACKET) &&
	    state == TCP_CLOSE)
		mask |= EPOLLHUP;

	/*
	 * we set writable also when the other side has shut down the
	 * connection. This prevents stuck sockets.
	 */
	if (unix_writable(sk, state))
		mask |= EPOLLOUT | EPOLLWRNORM | EPOLLWRBAND;

	return mask;
}

static __poll_t unix_dgram_poll(struct file *file, struct socket *sock,
				    poll_table *wait)
{
	struct sock *sk = sock->sk, *other;
	unsigned int writable;
	unsigned char state;
	__poll_t mask;
	u8 shutdown;

	sock_poll_wait(file, sock, wait);
	mask = 0;
	shutdown = READ_ONCE(sk->sk_shutdown);
	state = READ_ONCE(sk->sk_state);

	/* exceptional events? */
	if (READ_ONCE(sk->sk_err) ||
	    !skb_queue_empty_lockless(&sk->sk_error_queue))
		mask |= EPOLLERR |
			(sock_flag(sk, SOCK_SELECT_ERR_QUEUE) ? EPOLLPRI : 0);

	if (shutdown & RCV_SHUTDOWN)
		mask |= EPOLLRDHUP | EPOLLIN | EPOLLRDNORM;
	if (shutdown == SHUTDOWN_MASK)
		mask |= EPOLLHUP;

	/* readable? */
	if (!skb_queue_empty_lockless(&sk->sk_receive_queue))
		mask |= EPOLLIN | EPOLLRDNORM;
	if (sk_is_readable(sk))
		mask |= EPOLLIN | EPOLLRDNORM;

	/* Connection-based need to check for termination and startup */
	if (sk->sk_type == SOCK_SEQPACKET && state == TCP_CLOSE)
		mask |= EPOLLHUP;

	/* No write status requested, avoid expensive OUT tests. */
	if (!(poll_requested_events(wait) & (EPOLLWRBAND|EPOLLWRNORM|EPOLLOUT)))
		return mask;

	writable = unix_writable(sk, state);
	if (writable) {
		unix_state_lock(sk);

		other = unix_peer(sk);
		if (other && unix_peer(other) != sk &&
		    unix_recvq_full_lockless(other) &&
		    unix_dgram_peer_wake_me(sk, other))
			writable = 0;

		unix_state_unlock(sk);
	}

	if (writable)
		mask |= EPOLLOUT | EPOLLWRNORM | EPOLLWRBAND;
	else
		sk_set_bit(SOCKWQ_ASYNC_NOSPACE, sk);

	return mask;
}

#ifdef CONFIG_PROC_FS

#define BUCKET_SPACE (BITS_PER_LONG - (UNIX_HASH_BITS + 1) - 1)

#define get_bucket(x) ((x) >> BUCKET_SPACE)
#define get_offset(x) ((x) & ((1UL << BUCKET_SPACE) - 1))
#define set_bucket_offset(b, o) ((b) << BUCKET_SPACE | (o))

static struct sock *unix_from_bucket(struct seq_file *seq, loff_t *pos)
{
	unsigned long offset = get_offset(*pos);
	unsigned long bucket = get_bucket(*pos);
	unsigned long count = 0;
	struct sock *sk;

	for (sk = sk_head(&seq_file_net(seq)->unx.table.buckets[bucket]);
	     sk; sk = sk_next(sk)) {
		if (++count == offset)
			break;
	}

	return sk;
}

static struct sock *unix_get_first(struct seq_file *seq, loff_t *pos)
{
	unsigned long bucket = get_bucket(*pos);
	struct net *net = seq_file_net(seq);
	struct sock *sk;

	while (bucket < UNIX_HASH_SIZE) {
		spin_lock(&net->unx.table.locks[bucket]);

		sk = unix_from_bucket(seq, pos);
		if (sk)
			return sk;

		spin_unlock(&net->unx.table.locks[bucket]);

		*pos = set_bucket_offset(++bucket, 1);
	}

	return NULL;
}

static struct sock *unix_get_next(struct seq_file *seq, struct sock *sk,
				  loff_t *pos)
{
	unsigned long bucket = get_bucket(*pos);

	sk = sk_next(sk);
	if (sk)
		return sk;
	spin_unlock(&seq_file_net(seq)->unx.table.locks[bucket]);

	*pos = set_bucket_offset(++bucket, 1);

	return unix_get_first(seq, pos);
}

static void *unix_seq_start(struct seq_file *seq, loff_t *pos)
{
	if (!*pos)
		return SEQ_START_TOKEN;

	return unix_get_first(seq, pos);
}

static void *unix_seq_next(struct seq_file *seq, void *v, loff_t *pos)
{
	++*pos;

	if (v == SEQ_START_TOKEN)
		return unix_get_first(seq, pos);

	return unix_get_next(seq, v, pos);
}

static void unix_seq_stop(struct seq_file *seq, void *v)
{
	struct sock *sk = v;

	if (sk)
		spin_unlock(&seq_file_net(seq)->unx.table.locks[sk->sk_hash]);
}

static int unix_seq_show(struct seq_file *seq, void *v)
{
	if (v == SEQ_START_TOKEN)
		seq_puts(seq, "Num       RefCount Protocol Flags    Type St "
			 "Inode Path\n");
	else {
		struct sock *s = v;
		struct unix_sock *u = unix_sk(s);
		unix_state_lock(s);

		seq_printf(seq, "%pK: %08X %08X %08X %04X %02X %5lu",
			s,
			refcount_read(&s->sk_refcnt),
			0,
			s->sk_state == TCP_LISTEN ? __SO_ACCEPTCON : 0,
			s->sk_type,
			s->sk_socket ?
			(s->sk_state == TCP_ESTABLISHED ? SS_CONNECTED : SS_UNCONNECTED) :
			(s->sk_state == TCP_ESTABLISHED ? SS_CONNECTING : SS_DISCONNECTING),
			sock_i_ino(s));

		if (u->addr) {	// under a hash table lock here
			int i, len;
			seq_putc(seq, ' ');

			i = 0;
			len = u->addr->len -
				offsetof(struct sockaddr_un, sun_path);
			if (u->addr->name->sun_path[0]) {
				len--;
			} else {
				seq_putc(seq, '@');
				i++;
			}
			for ( ; i < len; i++)
				seq_putc(seq, u->addr->name->sun_path[i] ?:
					 '@');
		}
		unix_state_unlock(s);
		seq_putc(seq, '\n');
	}

	return 0;
}

static const struct seq_operations unix_seq_ops = {
	.start  = unix_seq_start,
	.next   = unix_seq_next,
	.stop   = unix_seq_stop,
	.show   = unix_seq_show,
};

#ifdef CONFIG_BPF_SYSCALL
struct bpf_unix_iter_state {
	struct seq_net_private p;
	unsigned int cur_sk;
	unsigned int end_sk;
	unsigned int max_sk;
	struct sock **batch;
	bool st_bucket_done;
};

struct bpf_iter__unix {
	__bpf_md_ptr(struct bpf_iter_meta *, meta);
	__bpf_md_ptr(struct unix_sock *, unix_sk);
	uid_t uid __aligned(8);
};

static int unix_prog_seq_show(struct bpf_prog *prog, struct bpf_iter_meta *meta,
			      struct unix_sock *unix_sk, uid_t uid)
{
	struct bpf_iter__unix ctx;

	meta->seq_num--;  /* skip SEQ_START_TOKEN */
	ctx.meta = meta;
	ctx.unix_sk = unix_sk;
	ctx.uid = uid;
	return bpf_iter_run_prog(prog, &ctx);
}

static int bpf_iter_unix_hold_batch(struct seq_file *seq, struct sock *start_sk)
{
	struct bpf_unix_iter_state *iter = seq->private;
	unsigned int expected = 1;
	struct sock *sk;

	sock_hold(start_sk);
	iter->batch[iter->end_sk++] = start_sk;

	for (sk = sk_next(start_sk); sk; sk = sk_next(sk)) {
		if (iter->end_sk < iter->max_sk) {
			sock_hold(sk);
			iter->batch[iter->end_sk++] = sk;
		}

		expected++;
	}

	spin_unlock(&seq_file_net(seq)->unx.table.locks[start_sk->sk_hash]);

	return expected;
}

static void bpf_iter_unix_put_batch(struct bpf_unix_iter_state *iter)
{
	while (iter->cur_sk < iter->end_sk)
		sock_put(iter->batch[iter->cur_sk++]);
}

static int bpf_iter_unix_realloc_batch(struct bpf_unix_iter_state *iter,
				       unsigned int new_batch_sz)
{
	struct sock **new_batch;

	new_batch = kvmalloc(sizeof(*new_batch) * new_batch_sz,
			     GFP_USER | __GFP_NOWARN);
	if (!new_batch)
		return -ENOMEM;

	bpf_iter_unix_put_batch(iter);
	kvfree(iter->batch);
	iter->batch = new_batch;
	iter->max_sk = new_batch_sz;

	return 0;
}

static struct sock *bpf_iter_unix_batch(struct seq_file *seq,
					loff_t *pos)
{
	struct bpf_unix_iter_state *iter = seq->private;
	unsigned int expected;
	bool resized = false;
	struct sock *sk;

	if (iter->st_bucket_done)
		*pos = set_bucket_offset(get_bucket(*pos) + 1, 1);

again:
	/* Get a new batch */
	iter->cur_sk = 0;
	iter->end_sk = 0;

	sk = unix_get_first(seq, pos);
	if (!sk)
		return NULL; /* Done */

	expected = bpf_iter_unix_hold_batch(seq, sk);

	if (iter->end_sk == expected) {
		iter->st_bucket_done = true;
		return sk;
	}

	if (!resized && !bpf_iter_unix_realloc_batch(iter, expected * 3 / 2)) {
		resized = true;
		goto again;
	}

	return sk;
}

static void *bpf_iter_unix_seq_start(struct seq_file *seq, loff_t *pos)
{
	if (!*pos)
		return SEQ_START_TOKEN;

	/* bpf iter does not support lseek, so it always
	 * continues from where it was stop()-ped.
	 */
	return bpf_iter_unix_batch(seq, pos);
}

static void *bpf_iter_unix_seq_next(struct seq_file *seq, void *v, loff_t *pos)
{
	struct bpf_unix_iter_state *iter = seq->private;
	struct sock *sk;

	/* Whenever seq_next() is called, the iter->cur_sk is
	 * done with seq_show(), so advance to the next sk in
	 * the batch.
	 */
	if (iter->cur_sk < iter->end_sk)
		sock_put(iter->batch[iter->cur_sk++]);

	++*pos;

	if (iter->cur_sk < iter->end_sk)
		sk = iter->batch[iter->cur_sk];
	else
		sk = bpf_iter_unix_batch(seq, pos);

	return sk;
}

static int bpf_iter_unix_seq_show(struct seq_file *seq, void *v)
{
	struct bpf_iter_meta meta;
	struct bpf_prog *prog;
	struct sock *sk = v;
	uid_t uid;
	bool slow;
	int ret;

	if (v == SEQ_START_TOKEN)
		return 0;

	slow = lock_sock_fast(sk);

	if (unlikely(sk_unhashed(sk))) {
		ret = SEQ_SKIP;
		goto unlock;
	}

	uid = from_kuid_munged(seq_user_ns(seq), sk_uid(sk));
	meta.seq = seq;
	prog = bpf_iter_get_info(&meta, false);
	ret = unix_prog_seq_show(prog, &meta, v, uid);
unlock:
	unlock_sock_fast(sk, slow);
	return ret;
}

static void bpf_iter_unix_seq_stop(struct seq_file *seq, void *v)
{
	struct bpf_unix_iter_state *iter = seq->private;
	struct bpf_iter_meta meta;
	struct bpf_prog *prog;

	if (!v) {
		meta.seq = seq;
		prog = bpf_iter_get_info(&meta, true);
		if (prog)
			(void)unix_prog_seq_show(prog, &meta, v, 0);
	}

	if (iter->cur_sk < iter->end_sk)
		bpf_iter_unix_put_batch(iter);
}

static const struct seq_operations bpf_iter_unix_seq_ops = {
	.start	= bpf_iter_unix_seq_start,
	.next	= bpf_iter_unix_seq_next,
	.stop	= bpf_iter_unix_seq_stop,
	.show	= bpf_iter_unix_seq_show,
};
#endif
#endif

static const struct net_proto_family unix_family_ops = {
	.family = PF_UNIX,
	.create = unix_create,
	.owner	= THIS_MODULE,
};
static int __net_init unix_net_init(struct net *net)
{
	int i;

	net->unx.sysctl_max_dgram_qlen = 10;
	if (unix_sysctl_register(net))
		goto out;

#ifdef CONFIG_PROC_FS
	if (!proc_create_net("unix", 0, net->proc_net, &unix_seq_ops,
			     sizeof(struct seq_net_private)))
		goto err_sysctl;
#endif

	net->unx.table.locks = kvmalloc_array(UNIX_HASH_SIZE,
					      sizeof(spinlock_t), GFP_KERNEL);
	if (!net->unx.table.locks)
		goto err_proc;

	net->unx.table.buckets = kvmalloc_array(UNIX_HASH_SIZE,
						sizeof(struct hlist_head),
						GFP_KERNEL);
	if (!net->unx.table.buckets)
		goto free_locks;

	for (i = 0; i < UNIX_HASH_SIZE; i++) {
		spin_lock_init(&net->unx.table.locks[i]);
		lock_set_cmp_fn(&net->unx.table.locks[i], unix_table_lock_cmp_fn, NULL);
		INIT_HLIST_HEAD(&net->unx.table.buckets[i]);
	}

	return 0;

free_locks:
	kvfree(net->unx.table.locks);
err_proc:
#ifdef CONFIG_PROC_FS
	remove_proc_entry("unix", net->proc_net);
err_sysctl:
#endif
	unix_sysctl_unregister(net);
out:
	return -ENOMEM;
}

static void __net_exit unix_net_exit(struct net *net)
{
	kvfree(net->unx.table.buckets);
	kvfree(net->unx.table.locks);
	unix_sysctl_unregister(net);
	remove_proc_entry("unix", net->proc_net);
}

static struct pernet_operations unix_net_ops = {
	.init = unix_net_init,
	.exit = unix_net_exit,
};

#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_PROC_FS)
DEFINE_BPF_ITER_FUNC(unix, struct bpf_iter_meta *meta,
		     struct unix_sock *unix_sk, uid_t uid)

#define INIT_BATCH_SZ 16

static int bpf_iter_init_unix(void *priv_data, struct bpf_iter_aux_info *aux)
{
	struct bpf_unix_iter_state *iter = priv_data;
	int err;

	err = bpf_iter_init_seq_net(priv_data, aux);
	if (err)
		return err;

	err = bpf_iter_unix_realloc_batch(iter, INIT_BATCH_SZ);
	if (err) {
		bpf_iter_fini_seq_net(priv_data);
		return err;
	}

	return 0;
}

static void bpf_iter_fini_unix(void *priv_data)
{
	struct bpf_unix_iter_state *iter = priv_data;

	bpf_iter_fini_seq_net(priv_data);
	kvfree(iter->batch);
}

static const struct bpf_iter_seq_info unix_seq_info = {
	.seq_ops		= &bpf_iter_unix_seq_ops,
	.init_seq_private	= bpf_iter_init_unix,
	.fini_seq_private	= bpf_iter_fini_unix,
	.seq_priv_size		= sizeof(struct bpf_unix_iter_state),
};

static const struct bpf_func_proto *
bpf_iter_unix_get_func_proto(enum bpf_func_id func_id,
			     const struct bpf_prog *prog)
{
	switch (func_id) {
	case BPF_FUNC_setsockopt:
		return &bpf_sk_setsockopt_proto;
	case BPF_FUNC_getsockopt:
		return &bpf_sk_getsockopt_proto;
	default:
		return NULL;
	}
}

static struct bpf_iter_reg unix_reg_info = {
	.target			= "unix",
	.ctx_arg_info_size	= 1,
	.ctx_arg_info		= {
		{ offsetof(struct bpf_iter__unix, unix_sk),
		  PTR_TO_BTF_ID_OR_NULL },
	},
	.get_func_proto         = bpf_iter_unix_get_func_proto,
	.seq_info		= &unix_seq_info,
};

static void __init bpf_iter_register(void)
{
	unix_reg_info.ctx_arg_info[0].btf_id = btf_sock_ids[BTF_SOCK_TYPE_UNIX];
	if (bpf_iter_reg_target(&unix_reg_info))
		pr_warn("Warning: could not register bpf iterator unix\n");
}
#endif

static int __init af_unix_init(void)
{
	int i, rc = -1;

	BUILD_BUG_ON(sizeof(struct unix_skb_parms) > sizeof_field(struct sk_buff, cb));

	for (i = 0; i < UNIX_HASH_SIZE / 2; i++) {
		spin_lock_init(&bsd_socket_locks[i]);
		INIT_HLIST_HEAD(&bsd_socket_buckets[i]);
	}

	rc = proto_register(&unix_dgram_proto, 1);
	if (rc != 0) {
		pr_crit("%s: Cannot create unix_sock SLAB cache!\n", __func__);
		goto out;
	}

	rc = proto_register(&unix_stream_proto, 1);
	if (rc != 0) {
		pr_crit("%s: Cannot create unix_sock SLAB cache!\n", __func__);
		proto_unregister(&unix_dgram_proto);
		goto out;
	}

	sock_register(&unix_family_ops);
	register_pernet_subsys(&unix_net_ops);
	unix_bpf_build_proto();

#if defined(CONFIG_BPF_SYSCALL) && defined(CONFIG_PROC_FS)
	bpf_iter_register();
#endif

out:
	return rc;
}

/* Later than subsys_initcall() because we depend on stuff initialised there */
fs_initcall(af_unix_init);
]

8/2 2026/02/23 15:12 llm expert
0m Model: gemini-3-flash-preview

Tokens:
  input: 54603
  output: 90
  thoughts: 965

Thoughts:
**Analyzing a Suspected Socket Interaction**

Okay, so the report indicates a write to memory address `0xffff88810220dc08` of 8 bytes by task 3611 on CPU 1, specifically within `sk_psock_stop_verdict` and `sk_psock_drop` in `net/core/skmsg.c`.  Concurrently, task 3605 on CPU 0 is reading from the *same* memory address, and the call stack suggests this is happening inside `unix_stream_sendmsg` in `net/unix/af_unix.c`.  My initial thought is to cross-reference the line numbers in the report with the actual source code.

My first check reveals a possible line number mismatch in `net/core/skmsg.c`. The report mentions line 1311, but, assuming inlining or macro expansion, the relevant part might be within the `sk_psock_stop_verdict` function. It seems line 1311 is the `return` statement, meaning it can't be the write. The actual write could be within `psock_set_prog`, which is likely a macro. I'll need to check the definition in `include/linux/skmsg.h`.

Switching gears, in `net/unix/af_unix.c`, line 2480 in my `read-file` output seems to be `other = unix_peer(sk);`, which is also a bit off compared to the report. I need to be exact here, so I'll utilize `codesearch-definition-source` to get the precise line numbers for these functions. This will help confirm the exact point of the read and write operations.



9/2 2026/02/23 15:12 tool codesearch-definition-source
0m Args:
map[ContextFile:net/core/skmsg.c IncludeLines:true Name:sk_psock_stop_verdict]

Results:
map[SourceCode:1303:	void sk_psock_stop_verdict(struct sock *sk, struct sk_psock *psock)
1304:	{
1305:		psock_set_prog(&psock->progs.stream_verdict, NULL);
1306:		psock_set_prog(&psock->progs.skb_verdict, NULL);
1307:	
1308:		if (!psock->saved_data_ready)
1309:			return;
1310:	
1311:		sk->sk_data_ready = psock->saved_data_ready;
1312:		psock->saved_data_ready = NULL;
1313:	}
 SourceFile:net/core/skmsg.c]

10/2 2026/02/23 15:12 tool codesearch-definition-source
0m Args:
map[ContextFile:net/unix/af_unix.c IncludeLines:true Name:unix_stream_sendmsg]

Results:
map[SourceCode:2365:	static int unix_stream_sendmsg(struct socket *sock, struct msghdr *msg,
2366:				       size_t len)
2367:	{
2368:		struct sock *sk = sock->sk;
2369:		struct sk_buff *skb = NULL;
2370:		struct sock *other = NULL;
2371:		struct unix_sock *otheru;
2372:		struct scm_cookie scm;
2373:		bool fds_sent = false;
2374:		int err, sent = 0;
2375:	
2376:		err = scm_send(sock, msg, &scm, false);
2377:		if (err < 0)
2378:			return err;
2379:	
2380:		if (msg->msg_flags & MSG_OOB) {
2381:			err = -EOPNOTSUPP;
2382:	#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
2383:			if (len)
2384:				len--;
2385:			else
2386:	#endif
2387:				goto out_err;
2388:		}
2389:	
2390:		if (msg->msg_namelen) {
2391:			err = READ_ONCE(sk->sk_state) == TCP_ESTABLISHED ? -EISCONN : -EOPNOTSUPP;
2392:			goto out_err;
2393:		}
2394:	
2395:		other = unix_peer(sk);
2396:		if (!other) {
2397:			err = -ENOTCONN;
2398:			goto out_err;
2399:		}
2400:	
2401:		otheru = unix_sk(other);
2402:	
2403:		if (READ_ONCE(sk->sk_shutdown) & SEND_SHUTDOWN)
2404:			goto out_pipe;
2405:	
2406:		while (sent < len) {
2407:			int size = len - sent;
2408:			int data_len;
2409:	
2410:			if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES)) {
2411:				skb = sock_alloc_send_pskb(sk, 0, 0,
2412:							   msg->msg_flags & MSG_DONTWAIT,
2413:							   &err, 0);
2414:			} else {
2415:				/* Keep two messages in the pipe so it schedules better */
2416:				size = min_t(int, size, (READ_ONCE(sk->sk_sndbuf) >> 1) - 64);
2417:	
2418:				/* allow fallback to order-0 allocations */
2419:				size = min_t(int, size, SKB_MAX_HEAD(0) + UNIX_SKB_FRAGS_SZ);
2420:	
2421:				data_len = max_t(int, 0, size - SKB_MAX_HEAD(0));
2422:	
2423:				data_len = min_t(size_t, size, PAGE_ALIGN(data_len));
2424:	
2425:				skb = sock_alloc_send_pskb(sk, size - data_len, data_len,
2426:							   msg->msg_flags & MSG_DONTWAIT, &err,
2427:							   get_order(UNIX_SKB_FRAGS_SZ));
2428:			}
2429:			if (!skb)
2430:				goto out_err;
2431:	
2432:			/* Only send the fds in the first buffer */
2433:			err = unix_scm_to_skb(&scm, skb, !fds_sent);
2434:			if (err < 0)
2435:				goto out_free;
2436:	
2437:			fds_sent = true;
2438:	
2439:			err = unix_maybe_add_creds(skb, sk, other);
2440:			if (err)
2441:				goto out_free;
2442:	
2443:			if (unlikely(msg->msg_flags & MSG_SPLICE_PAGES)) {
2444:				skb->ip_summed = CHECKSUM_UNNECESSARY;
2445:				err = skb_splice_from_iter(skb, &msg->msg_iter, size);
2446:				if (err < 0)
2447:					goto out_free;
2448:	
2449:				size = err;
2450:				refcount_add(size, &sk->sk_wmem_alloc);
2451:			} else {
2452:				skb_put(skb, size - data_len);
2453:				skb->data_len = data_len;
2454:				skb->len = size;
2455:				err = skb_copy_datagram_from_iter(skb, 0, &msg->msg_iter, size);
2456:				if (err)
2457:					goto out_free;
2458:			}
2459:	
2460:			unix_state_lock(other);
2461:	
2462:			if (sock_flag(other, SOCK_DEAD) ||
2463:			    (other->sk_shutdown & RCV_SHUTDOWN))
2464:				goto out_pipe_unlock;
2465:	
2466:			if (UNIXCB(skb).fp && !other->sk_scm_rights) {
2467:				unix_state_unlock(other);
2468:				err = -EPERM;
2469:				goto out_free;
2470:			}
2471:	
2472:			scm_stat_add(other, skb);
2473:	
2474:			spin_lock(&other->sk_receive_queue.lock);
2475:			WRITE_ONCE(otheru->inq_len, otheru->inq_len + skb->len);
2476:			__skb_queue_tail(&other->sk_receive_queue, skb);
2477:			spin_unlock(&other->sk_receive_queue.lock);
2478:	
2479:			unix_state_unlock(other);
2480:			other->sk_data_ready(other);
2481:			sent += size;
2482:		}
2483:	
2484:	#if IS_ENABLED(CONFIG_AF_UNIX_OOB)
2485:		if (msg->msg_flags & MSG_OOB) {
2486:			err = queue_oob(sk, msg, other, &scm, fds_sent);
2487:			if (err)
2488:				goto out_err;
2489:			sent++;
2490:		}
2491:	#endif
2492:	
2493:		scm_destroy(&scm);
2494:	
2495:		return sent;
2496:	
2497:	out_pipe_unlock:
2498:		unix_state_unlock(other);
2499:	out_pipe:
2500:		if (!sent && !(msg->msg_flags & MSG_NOSIGNAL))
2501:			send_sig(SIGPIPE, current, 0);
2502:		err = -EPIPE;
2503:	out_free:
2504:		consume_skb(skb);
2505:	out_err:
2506:		scm_destroy(&scm);
2507:		return sent ? : err;
2508:	}
 SourceFile:net/unix/af_unix.c]

11/2 2026/02/23 15:12 llm expert
0m Model: gemini-3-flash-preview

Tokens:
  input: 58437
  output: 37
  thoughts: 3165

Thoughts:
**Data Race Analysis: `sk_data_ready` in Socket Operations**

Okay, so the KCSAN report flagged a potential data race on the `sk_data_ready` function pointer within the socket structure. Essentially, a write operation within `sk_psock_stop_verdict` is occurring concurrently with a read within `unix_stream_sendmsg`.

Let's break this down. The write happens when a psock (the BPF sockmap's `sk_psock` state attached to the socket) is being dropped. The critical line `sk->sk_data_ready = psock->saved_data_ready;` within `sk_psock_stop_verdict` restores the original function pointer. This write is protected by the write side of `sk->sk_callback_lock`.

Meanwhile, `unix_stream_sendmsg` calls `other->sk_data_ready(other)`, where `other` is the peer socket. Crucially, this read (and subsequent call) *does not* hold the `sk_callback_lock`. The other socket could be the socket that is being dropped.

The core of the problem lies in the unsynchronized access to `other->sk_data_ready`. It's possible that the function pointer changes while `unix_stream_sendmsg` is attempting to call it.

In examining the stack traces, the write operation is initiated when a psock associated with a socket is dropped. The `sk_psock_drop` function is at the core of this operation. Inside, `sk_psock_stop_verdict` is responsible for restoring the original function pointer before the psock is removed.

Further investigation reveals that `unix_stream_sendmsg` obtains the `other` socket via `unix_peer(sk)`. This suggests that the socket sending the message (`sk`) could potentially race with another socket (`other`), the receiver.

The potential race is on the `other->sk_data_ready` function pointer. Specifically, `sk_psock_stop_verdict` modifies `sk->sk_data_ready` *while* the other task might be reading and calling it.

The question then becomes: is this actually harmful? On x86-64, 8-byte aligned reads and writes are atomic. So, assuming the compiler doesn't do anything *too* crazy, the read operation should be atomic. The change is from `0xffffffff8486dd00` (likely `sk_psock_verdict_data_ready`) to `0xffffffff8472c9f0` (probably the original `sk_data_ready` - e.g. `sock_def_readable` or the like). Both are valid function addresses.

The issue then is that if `sk_data_ready` changes while it is being fetched, the caller may invoke either the old or the new value. The old value (`sk_psock_verdict_data_ready`) takes `rcu_read_lock()` and performs an RCU dereference. Since `rcu_assign_sk_user_data` happens *before* the function pointer is restored, there is a window where a reader fetches the old `sk->sk_data_ready` after `sk_user_data` has already been set to NULL. However, `sk_psock_verdict_data_ready` checks that the sk_psock still exists before dereferencing it, mitigating this specific risk.

Unfortunately, the kernel *doesn't* use `READ_ONCE` or `WRITE_ONCE` for this particular field, so technically it is a data race; many other call sites that invoke `sk_data_ready` omit these primitives as well.

Given this, the race is unlikely to cause immediate crashes, but since it is unsynchronized it could still be annotated for correctness. It would only become dangerous if the function pointers could ever be invalid or the callees' lifetimes were mismanaged. A fix would mark the accesses with `READ_ONCE()`/`WRITE_ONCE()`, or take the read side of `sk_callback_lock` around the invocation.



12/2 2026/02/23 15:13 tool codesearch-definition-source
0m Args:
map[ContextFile:include/net/sock.h IncludeLines:true Name:sock]

Results:
map[SourceCode: 239:	/**
 240:	  *	struct sock - network layer representation of sockets
 241:	  *	@__sk_common: shared layout with inet_timewait_sock
 242:	  *	@sk_shutdown: mask of %SEND_SHUTDOWN and/or %RCV_SHUTDOWN
 243:	  *	@sk_userlocks: %SO_SNDBUF and %SO_RCVBUF settings
 244:	  *	@sk_lock:	synchronizer
 245:	  *	@sk_kern_sock: True if sock is using kernel lock classes
 246:	  *	@sk_rcvbuf: size of receive buffer in bytes
 247:	  *	@sk_wq: sock wait queue and async head
 248:	  *	@sk_rx_dst: receive input route used by early demux
 249:	  *	@sk_rx_dst_ifindex: ifindex for @sk_rx_dst
 250:	  *	@sk_rx_dst_cookie: cookie for @sk_rx_dst
 251:	  *	@sk_dst_cache: destination cache
 252:	  *	@sk_dst_pending_confirm: need to confirm neighbour
 253:	  *	@sk_policy: flow policy
 254:	  *	@psp_assoc: PSP association, if socket is PSP-secured
 255:	  *	@sk_receive_queue: incoming packets
 256:	  *	@sk_wmem_alloc: transmit queue bytes committed
 257:	  *	@sk_tsq_flags: TCP Small Queues flags
 258:	  *	@sk_write_queue: Packet sending queue
 259:	  *	@sk_omem_alloc: "o" is "option" or "other"
 260:	  *	@sk_wmem_queued: persistent queue size
 261:	  *	@sk_forward_alloc: space allocated forward
 262:	  *	@sk_reserved_mem: space reserved and non-reclaimable for the socket
 263:	  *	@sk_napi_id: id of the last napi context to receive data for sk
 264:	  *	@sk_ll_usec: usecs to busypoll when there is no data
 265:	  *	@sk_allocation: allocation mode
 266:	  *	@sk_pacing_rate: Pacing rate (if supported by transport/packet scheduler)
 267:	  *	@sk_pacing_status: Pacing status (requested, handled by sch_fq)
 268:	  *	@sk_max_pacing_rate: Maximum pacing rate (%SO_MAX_PACING_RATE)
 269:	  *	@sk_sndbuf: size of send buffer in bytes
 270:	  *	@sk_no_check_tx: %SO_NO_CHECK setting, set checksum in TX packets
 271:	  *	@sk_no_check_rx: allow zero checksum in RX packets
 272:	  *	@sk_route_caps: route capabilities (e.g. %NETIF_F_TSO)
 273:	  *	@sk_gso_disabled: if set, NETIF_F_GSO_MASK is forbidden.
 274:	  *	@sk_gso_type: GSO type (e.g. %SKB_GSO_TCPV4)
 275:	  *	@sk_gso_max_size: Maximum GSO segment size to build
 276:	  *	@sk_gso_max_segs: Maximum number of GSO segments
 277:	  *	@sk_pacing_shift: scaling factor for TCP Small Queues
 278:	  *	@sk_lingertime: %SO_LINGER l_linger setting
 279:	  *	@sk_backlog: always used with the per-socket spinlock held
 280:	  *	@sk_callback_lock: used with the callbacks in the end of this struct
 281:	  *	@sk_error_queue: rarely used
 282:	  *	@sk_prot_creator: sk_prot of original sock creator (see ipv6_setsockopt,
 283:	  *			  IPV6_ADDRFORM for instance)
 284:	  *	@sk_err: last error
 285:	  *	@sk_err_soft: errors that don't cause failure but are the cause of a
 286:	  *		      persistent failure not just 'timed out'
 287:	  *	@sk_drops: raw/udp drops counter
 288:	  *	@sk_drop_counters: optional pointer to numa_drop_counters
 289:	  *	@sk_ack_backlog: current listen backlog
 290:	  *	@sk_max_ack_backlog: listen backlog set in listen()
 291:	  *	@sk_uid: user id of owner
 292:	  *	@sk_ino: inode number (zero if orphaned)
 293:	  *	@sk_prefer_busy_poll: prefer busypolling over softirq processing
 294:	  *	@sk_busy_poll_budget: napi processing budget when busypolling
 295:	  *	@sk_priority: %SO_PRIORITY setting
 296:	  *	@sk_type: socket type (%SOCK_STREAM, etc)
 297:	  *	@sk_protocol: which protocol this socket belongs in this network family
 298:	  *	@sk_peer_lock: lock protecting @sk_peer_pid and @sk_peer_cred
 299:	  *	@sk_peer_pid: &struct pid for this socket's peer
 300:	  *	@sk_peer_cred: %SO_PEERCRED setting
 301:	  *	@sk_rcvlowat: %SO_RCVLOWAT setting
 302:	  *	@sk_rcvtimeo: %SO_RCVTIMEO setting
 303:	  *	@sk_sndtimeo: %SO_SNDTIMEO setting
 304:	  *	@sk_txhash: computed flow hash for use on transmit
 305:	  *	@sk_txrehash: enable TX hash rethink
 306:	  *	@sk_filter: socket filtering instructions
 307:	  *	@sk_timer: sock cleanup timer
 308:	  *	@tcp_retransmit_timer: tcp retransmit timer
 309:	  *	@mptcp_retransmit_timer: mptcp retransmit timer
 310:	  *	@sk_stamp: time stamp of last packet received
 311:	  *	@sk_stamp_seq: lock for accessing sk_stamp on 32 bit architectures only
 312:	  *	@sk_tsflags: SO_TIMESTAMPING flags
 313:	  *	@sk_bpf_cb_flags: used in bpf_setsockopt()
 314:	  *	@sk_use_task_frag: allow sk_page_frag() to use current->task_frag.
 315:	  *			   Sockets that can be used under memory reclaim should
 316:	  *			   set this to false.
 317:	  *	@sk_bind_phc: SO_TIMESTAMPING bind PHC index of PTP virtual clock
 318:	  *	              for timestamping
 319:	  *	@sk_tskey: counter to disambiguate concurrent tstamp requests
 320:	  *	@sk_tx_queue_mapping_jiffies: time in jiffies of last @sk_tx_queue_mapping refresh.
 321:	  *	@sk_zckey: counter to order MSG_ZEROCOPY notifications
 322:	  *	@sk_socket: Identd and reporting IO signals
 323:	  *	@sk_user_data: RPC layer private data. Write-protected by @sk_callback_lock.
 324:	  *	@sk_frag: cached page frag
 325:	  *	@sk_peek_off: current peek_offset value
 326:	  *	@sk_send_head: front of stuff to transmit
 327:	  *	@tcp_rtx_queue: TCP re-transmit queue [union with @sk_send_head]
 328:	  *	@sk_security: used by security modules
 329:	  *	@sk_mark: generic packet mark
 330:	  *	@sk_cgrp_data: cgroup data for this cgroup
 331:	  *	@sk_memcg: this socket's memory cgroup association
 332:	  *	@sk_write_pending: a write to stream socket waits to start
 333:	  *	@sk_disconnects: number of disconnect operations performed on this sock
 334:	  *	@sk_state_change: callback to indicate change in the state of the sock
 335:	  *	@sk_data_ready: callback to indicate there is data to be processed
 336:	  *	@sk_write_space: callback to indicate there is bf sending space available
 337:	  *	@sk_error_report: callback to indicate errors (e.g. %MSG_ERRQUEUE)
 338:	  *	@sk_backlog_rcv: callback to process the backlog
 339:	  *	@sk_validate_xmit_skb: ptr to an optional validate function
 340:	  *	@sk_destruct: called at sock freeing time, i.e. when all refcnt == 0
 341:	  *	@sk_reuseport_cb: reuseport group container
 342:	  *	@sk_bpf_storage: ptr to cache and control for bpf_sk_storage
 343:	  *	@sk_rcu: used during RCU grace period
 344:	  *	@sk_freeptr: used for SLAB_TYPESAFE_BY_RCU managed sockets
 345:	  *	@sk_clockid: clockid used by time-based scheduling (SO_TXTIME)
 346:	  *	@sk_txtime_deadline_mode: set deadline mode for SO_TXTIME
 347:	  *	@sk_txtime_report_errors: set report errors mode for SO_TXTIME
 348:	  *	@sk_txtime_unused: unused txtime flags
 349:	  *	@sk_scm_recv_flags: all flags used by scm_recv()
 350:	  *	@sk_scm_credentials: flagged by SO_PASSCRED to recv SCM_CREDENTIALS
 351:	  *	@sk_scm_security: flagged by SO_PASSSEC to recv SCM_SECURITY
 352:	  *	@sk_scm_pidfd: flagged by SO_PASSPIDFD to recv SCM_PIDFD
 353:	  *	@sk_scm_rights: flagged by SO_PASSRIGHTS to recv SCM_RIGHTS
 354:	  *	@sk_scm_unused: unused flags for scm_recv()
 355:	  *	@ns_tracker: tracker for netns reference
 356:	  *	@sk_user_frags: xarray of pages the user is holding a reference on.
 357:	  *	@sk_owner: reference to the real owner of the socket that calls
 358:	  *		   sock_lock_init_class_and_name().
 359:	  */
 360:	struct sock {
 361:		/*
 362:		 * Now struct inet_timewait_sock also uses sock_common, so please just
 363:		 * don't add nothing before this first member (__sk_common) --acme
 364:		 */
 365:		struct sock_common	__sk_common;
 366:	#define sk_node			__sk_common.skc_node
 367:	#define sk_nulls_node		__sk_common.skc_nulls_node
 368:	#define sk_refcnt		__sk_common.skc_refcnt
 369:	#define sk_tx_queue_mapping	__sk_common.skc_tx_queue_mapping
 370:	#ifdef CONFIG_SOCK_RX_QUEUE_MAPPING
 371:	#define sk_rx_queue_mapping	__sk_common.skc_rx_queue_mapping
 372:	#endif
 373:	
 374:	#define sk_dontcopy_begin	__sk_common.skc_dontcopy_begin
 375:	#define sk_dontcopy_end		__sk_common.skc_dontcopy_end
 376:	#define sk_hash			__sk_common.skc_hash
 377:	#define sk_portpair		__sk_common.skc_portpair
 378:	#define sk_num			__sk_common.skc_num
 379:	#define sk_dport		__sk_common.skc_dport
 380:	#define sk_addrpair		__sk_common.skc_addrpair
 381:	#define sk_daddr		__sk_common.skc_daddr
 382:	#define sk_rcv_saddr		__sk_common.skc_rcv_saddr
 383:	#define sk_family		__sk_common.skc_family
 384:	#define sk_state		__sk_common.skc_state
 385:	#define sk_reuse		__sk_common.skc_reuse
 386:	#define sk_reuseport		__sk_common.skc_reuseport
 387:	#define sk_ipv6only		__sk_common.skc_ipv6only
 388:	#define sk_net_refcnt		__sk_common.skc_net_refcnt
 389:	#define sk_bypass_prot_mem	__sk_common.skc_bypass_prot_mem
 390:	#define sk_bound_dev_if		__sk_common.skc_bound_dev_if
 391:	#define sk_bind_node		__sk_common.skc_bind_node
 392:	#define sk_prot			__sk_common.skc_prot
 393:	#define sk_net			__sk_common.skc_net
 394:	#define sk_v6_daddr		__sk_common.skc_v6_daddr
 395:	#define sk_v6_rcv_saddr	__sk_common.skc_v6_rcv_saddr
 396:	#define sk_cookie		__sk_common.skc_cookie
 397:	#define sk_incoming_cpu		__sk_common.skc_incoming_cpu
 398:	#define sk_flags		__sk_common.skc_flags
 399:	#define sk_rxhash		__sk_common.skc_rxhash
 400:	
 401:		__cacheline_group_begin(sock_write_rx);
 402:	
 403:		atomic_t		sk_drops;
 404:		__s32			sk_peek_off;
 405:		struct sk_buff_head	sk_error_queue;
 406:		struct sk_buff_head	sk_receive_queue;
 407:		/*
 408:		 * The backlog queue is special, it is always used with
 409:		 * the per-socket spinlock held and requires low latency
 410:		 * access. Therefore we special case it's implementation.
 411:		 * Note : rmem_alloc is in this structure to fill a hole
 412:		 * on 64bit arches, not because its logically part of
 413:		 * backlog.
 414:		 */
 415:		struct {
 416:			atomic_t	rmem_alloc;
 417:			int		len;
 418:			struct sk_buff	*head;
 419:			struct sk_buff	*tail;
 420:		} sk_backlog;
 421:	#define sk_rmem_alloc sk_backlog.rmem_alloc
 422:	
 423:		__cacheline_group_end(sock_write_rx);
 424:	
 425:		__cacheline_group_begin(sock_read_rx);
 426:		/* early demux fields */
 427:		struct dst_entry __rcu	*sk_rx_dst;
 428:		int			sk_rx_dst_ifindex;
 429:		u32			sk_rx_dst_cookie;
 430:	
 431:	#ifdef CONFIG_NET_RX_BUSY_POLL
 432:		unsigned int		sk_ll_usec;
 433:		unsigned int		sk_napi_id;
 434:		u16			sk_busy_poll_budget;
 435:		u8			sk_prefer_busy_poll;
 436:	#endif
 437:		u8			sk_userlocks;
 438:		int			sk_rcvbuf;
 439:	
 440:		struct sk_filter __rcu	*sk_filter;
 441:		union {
 442:			struct socket_wq __rcu	*sk_wq;
 443:			/* private: */
 444:			struct socket_wq	*sk_wq_raw;
 445:			/* public: */
 446:		};
 447:	
 448:		void			(*sk_data_ready)(struct sock *sk);
 449:		long			sk_rcvtimeo;
 450:		int			sk_rcvlowat;
 451:		__cacheline_group_end(sock_read_rx);
 452:	
 453:		__cacheline_group_begin(sock_read_rxtx);
 454:		int			sk_err;
 455:		struct socket		*sk_socket;
 456:	#ifdef CONFIG_MEMCG
 457:		struct mem_cgroup	*sk_memcg;
 458:	#endif
 459:	#ifdef CONFIG_XFRM
 460:		struct xfrm_policy __rcu *sk_policy[2];
 461:	#endif
 462:	#if IS_ENABLED(CONFIG_INET_PSP)
 463:		struct psp_assoc __rcu	*psp_assoc;
 464:	#endif
 465:		__cacheline_group_end(sock_read_rxtx);
 466:	
 467:		__cacheline_group_begin(sock_write_rxtx);
 468:		socket_lock_t		sk_lock;
 469:		u32			sk_reserved_mem;
 470:		int			sk_forward_alloc;
 471:		u32			sk_tsflags;
 472:		__cacheline_group_end(sock_write_rxtx);
 473:	
 474:		__cacheline_group_begin(sock_write_tx);
 475:		int			sk_write_pending;
 476:		atomic_t		sk_omem_alloc;
 477:		int			sk_err_soft;
 478:	
 479:		int			sk_wmem_queued;
 480:		refcount_t		sk_wmem_alloc;
 481:		unsigned long		sk_tsq_flags;
 482:		union {
 483:			struct sk_buff	*sk_send_head;
 484:			struct rb_root	tcp_rtx_queue;
 485:		};
 486:		struct sk_buff_head	sk_write_queue;
 487:		struct page_frag	sk_frag;
 488:		union {
 489:			struct timer_list	sk_timer;
 490:			struct timer_list	tcp_retransmit_timer;
 491:			struct timer_list	mptcp_retransmit_timer;
 492:		};
 493:		unsigned long		sk_pacing_rate; /* bytes per second */
 494:		atomic_t		sk_zckey;
 495:		atomic_t		sk_tskey;
 496:		unsigned long		sk_tx_queue_mapping_jiffies;
 497:		__cacheline_group_end(sock_write_tx);
 498:	
 499:		__cacheline_group_begin(sock_read_tx);
 500:		u32			sk_dst_pending_confirm;
 501:		u32			sk_pacing_status; /* see enum sk_pacing */
 502:		unsigned long		sk_max_pacing_rate;
 503:		long			sk_sndtimeo;
 504:		u32			sk_priority;
 505:		u32			sk_mark;
 506:		kuid_t			sk_uid;
 507:		u16			sk_protocol;
 508:		u16			sk_type;
 509:		struct dst_entry __rcu	*sk_dst_cache;
 510:		netdev_features_t	sk_route_caps;
 511:	#ifdef CONFIG_SOCK_VALIDATE_XMIT
 512:		struct sk_buff*		(*sk_validate_xmit_skb)(struct sock *sk,
 513:								struct net_device *dev,
 514:								struct sk_buff *skb);
 515:	#endif
 516:		u16			sk_gso_type;
 517:		u16			sk_gso_max_segs;
 518:		unsigned int		sk_gso_max_size;
 519:		gfp_t			sk_allocation;
 520:		u32			sk_txhash;
 521:		int			sk_sndbuf;
 522:		u8			sk_pacing_shift;
 523:		bool			sk_use_task_frag;
 524:		__cacheline_group_end(sock_read_tx);
 525:	
 526:		/*
 527:		 * Because of non atomicity rules, all
 528:		 * changes are protected by socket lock.
 529:		 */
 530:		u8			sk_gso_disabled : 1,
 531:					sk_kern_sock : 1,
 532:					sk_no_check_tx : 1,
 533:					sk_no_check_rx : 1;
 534:		u8			sk_shutdown;
 535:		unsigned long	        sk_lingertime;
 536:		struct proto		*sk_prot_creator;
 537:		rwlock_t		sk_callback_lock;
 538:		u32			sk_ack_backlog;
 539:		u32			sk_max_ack_backlog;
 540:		unsigned long		sk_ino;
 541:		spinlock_t		sk_peer_lock;
 542:		int			sk_bind_phc;
 543:		struct pid		*sk_peer_pid;
 544:		const struct cred	*sk_peer_cred;
 545:	
 546:		ktime_t			sk_stamp;
 547:	#if BITS_PER_LONG==32
 548:		seqlock_t		sk_stamp_seq;
 549:	#endif
 550:		int			sk_disconnects;
 551:	
 552:		union {
 553:			u8		sk_txrehash;
 554:			u8		sk_scm_recv_flags;
 555:			struct {
 556:				u8	sk_scm_credentials : 1,
 557:					sk_scm_security : 1,
 558:					sk_scm_pidfd : 1,
 559:					sk_scm_rights : 1,
 560:					sk_scm_unused : 4;
 561:			};
 562:		};
 563:		u8			sk_clockid;
 564:		u8			sk_txtime_deadline_mode : 1,
 565:					sk_txtime_report_errors : 1,
 566:					sk_txtime_unused : 6;
 567:	#define SK_BPF_CB_FLAG_TEST(SK, FLAG) ((SK)->sk_bpf_cb_flags & (FLAG))
 568:		u8			sk_bpf_cb_flags;
 569:	
 570:		void			*sk_user_data;
 571:	#ifdef CONFIG_SECURITY
 572:		void			*sk_security;
 573:	#endif
 574:		struct sock_cgroup_data	sk_cgrp_data;
 575:		void			(*sk_state_change)(struct sock *sk);
 576:		void			(*sk_write_space)(struct sock *sk);
 577:		void			(*sk_error_report)(struct sock *sk);
 578:		int			(*sk_backlog_rcv)(struct sock *sk,
 579:							  struct sk_buff *skb);
 580:		void                    (*sk_destruct)(struct sock *sk);
 581:		struct sock_reuseport __rcu	*sk_reuseport_cb;
 582:	#ifdef CONFIG_BPF_SYSCALL
 583:		struct bpf_local_storage __rcu	*sk_bpf_storage;
 584:	#endif
 585:		struct numa_drop_counters *sk_drop_counters;
 586:		/* sockets using SLAB_TYPESAFE_BY_RCU can use sk_freeptr.
 587:		 * By the time kfree() is called, sk_rcu can not be in
 588:		 * use and can be mangled.
 589:		 */
 590:		union {
 591:			struct rcu_head	sk_rcu;
 592:			freeptr_t	sk_freeptr;
 593:		};
 594:		netns_tracker		ns_tracker;
 595:		struct xarray		sk_user_frags;
 596:	
 597:	#if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES)
 598:		struct module		*sk_owner;
 599:	#endif
 600:	};
 SourceFile:include/net/sock.h]

13/2 2026/02/23 15:13 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 69726
							output: 39
							thoughts: 1600
						

Thoughts:
**Analyzing the `sk_data_ready` Race Condition**

Okay, so the comment for `sk_callback_lock` says it protects the callbacks grouped at the end of the `sock` struct, and `sk_data_ready` is one of them, even though it physically sits earlier, inside the `sock_read_rx` cacheline group. Writers do honor the lock: functions like `sk_psock_drop` update the callback while holding the write side.

If `unix_stream_sendmsg` calls `sk_data_ready` without holding `sk_callback_lock`, there's a data race. However, is it a *harmful* one? The update is a single aligned, pointer-sized store, and on x86-64 such a store has no intermediate states, so a torn value is impossible. The compiler normally loads the function pointer once before the indirect call, and whichever value that load observes — old or new — is a valid function to call.
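
To convince myself about the torn-read question, here's a standalone userspace sketch — C11 atomics, with `fake_sock`, `notify()` and `restore_default()` as my own illustrative names, not kernel code — of exactly that pattern: one pointer-sized handler slot, swapped with a single store, loaded once before the indirect call.

```c
/* Illustrative userspace model of the sk_data_ready swap -- not kernel
 * code. A naturally aligned, pointer-sized slot is written with one
 * atomic store, so a concurrent reader observes either the old or the
 * new function pointer, never a torn value. */
#include <assert.h>
#include <stdatomic.h>
#include <string.h>

struct fake_sock;
typedef void (*data_ready_fn)(struct fake_sock *sk);

struct fake_sock {
	_Atomic data_ready_fn sk_data_ready; /* stands in for sk->sk_data_ready */
	const char *last_called;
};

static void bpf_verdict_ready(struct fake_sock *sk) { sk->last_called = "bpf"; }
static void default_ready(struct fake_sock *sk) { sk->last_called = "default"; }

/* Reader side: load the pointer exactly once, then make the indirect
 * call -- the READ_ONCE-then-call shape discussed above. */
static void notify(struct fake_sock *sk)
{
	data_ready_fn fn = atomic_load_explicit(&sk->sk_data_ready,
						memory_order_relaxed);
	fn(sk);
}

/* Writer side (cf. sk_psock_stop_verdict restoring saved_data_ready):
 * one store, no intermediate state. */
static void restore_default(struct fake_sock *sk)
{
	atomic_store_explicit(&sk->sk_data_ready, default_ready,
			      memory_order_relaxed);
}
```

Whichever of the two values `notify()` happens to load, the call lands on a real function — which is the whole benign-race argument in miniature.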

But wait, `sk_data_ready` depends on `psock` if the function pointer is `sk_psock_verdict_data_ready`. `psock` is reached via `sk_psock(sk)`, which dereferences `sk_user_data` under RCU. `sk_psock_drop` sets `sk_user_data` to NULL with release semantics, and `sk_psock(sk)` reads it with the matching acquire, so a racing `sk_psock_verdict_data_ready` sees either the old `psock` or NULL. And if it sees the old one, that `psock` is still valid: `sk_psock_drop` defers destruction via RCU, so the `psock` stays alive until every read-side critical section that began before the drop has finished. Since `sk_psock_verdict_data_ready` runs under an RCU read lock, the logic appears sound even with the race.
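
That detach/lookup ordering can be modeled the same way — a hedged standalone sketch (my own names; the kernel uses RCU on `sk_user_data` plus a grace period, which this toy omits) of the release-store/acquire-load publication and the NULL check:

```c
/* Illustrative userspace model of the psock NULL-check -- not kernel
 * code, and the RCU grace period that delays the actual free is left
 * out. The drop path unpublishes the psock with release semantics;
 * the callback re-reads it with acquire semantics and backs off on
 * NULL instead of crashing. */
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct fake_psock { int verdict; };

struct fake_sock {
	_Atomic(struct fake_psock *) sk_user_data; /* RCU pointer stand-in */
};

/* cf. sk_psock_drop(): make the psock unreachable first. */
static void psock_drop(struct fake_sock *sk)
{
	atomic_store_explicit(&sk->sk_user_data, NULL, memory_order_release);
	/* real code frees the psock only after an RCU grace period */
}

/* cf. sk_psock_verdict_data_ready()/sk_psock(): returns the verdict,
 * or -1 when the psock is already gone. */
static int verdict_data_ready(struct fake_sock *sk)
{
	struct fake_psock *psock =
		atomic_load_explicit(&sk->sk_user_data, memory_order_acquire);
	if (!psock)
		return -1; /* detached: safe no-op, nothing dereferenced */
	return psock->verdict;
}
```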

KCSAN is right that there is a data race here, but is it *benign*? Historically, many such races on function pointers are treated as benign when they cannot cause a crash. Strictly speaking, since `sk_callback_lock` is a reader-writer lock, the caller ought to take the read side — but taking a lock for every packet on the data path would be too expensive, which is why these locks are routinely skipped there.

I'll check if `unix_stream_sendmsg` is special. Other paths invoke `sk_data_ready` too — TCP via `tcp_data_ready`, for instance — but TCP has its own logic; `sock_def_readable` is the default. Looking at the `unix_stream_sendmsg` code, it's a direct `other->sk_data_ready(other)` call.

I should search for this race in kernel history; maybe something similar has been addressed. There was a similar report for `sk_write_space`.

If `sk_data_ready` is changed mid-call, and we are in the middle of calling it...
The compiler's possible code might look something like: `reg = sk->sk_data_ready; if (reg) reg(sk);`
If `sk->sk_data_ready` is changed from non-NULL to NULL in between the check and the call, it could crash. But in this specific case, it's changed from one valid function to another valid function. The `sk_psock_stop_verdict` sets it to `psock->saved_data_ready`, which was the original `sk_data_ready`. It's never NULL for a live socket. So it should still be benign.

But what about the value change mentioned in the report? The report says the value changed from `0xffffffff8486dd00 -> 0xffffffff8472c9f0`. I'll use `grepper` to see what those addresses actually point to, starting with `sk_psock_verdict_data_ready`.



14/2 2026/02/23 15:13 tool grepper
0m Args:
map[Expression:sk_psock_verdict_data_ready|unix_stream_data_ready|sock_def_readable]

Results:
map[Output:include/net/sock.h=2366=static inline bool sk_has_allocations(const struct sock *sk)
--
include/net/sock.h-2386- *   ...                 ...
include/net/sock.h:2387: *   tp->rcv_nxt check   sock_def_readable
include/net/sock.h-2388- *   ...                 {
--
include/net/sock.h=3090=static inline bool sk_dev_equal_l3scope(struct sock *sk, int dif)
--
include/net/sock.h-3104-
include/net/sock.h:3105:void sock_def_readable(struct sock *sk);
include/net/sock.h-3106-
--
net/core/skmsg.c=1236=static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb)
--
net/core/skmsg.c-1267-
net/core/skmsg.c:1268:static void sk_psock_verdict_data_ready(struct sock *sk)
net/core/skmsg.c-1269-{
--
net/core/skmsg.c=1293=void sk_psock_start_verdict(struct sock *sk, struct sk_psock *psock)
--
net/core/skmsg.c-1298-	psock->saved_data_ready = sk->sk_data_ready;
net/core/skmsg.c:1299:	sk->sk_data_ready = sk_psock_verdict_data_ready;
net/core/skmsg.c-1300-	sk->sk_write_space = sk_psock_write_space;
--
net/core/sock.c=3590=static void sock_def_error_report(struct sock *sk)
--
net/core/sock.c-3601-
net/core/sock.c:3602:void sock_def_readable(struct sock *sk)
net/core/sock.c-3603-{
--
net/core/sock.c=3696=void sock_init_data_uid(struct socket *sock, struct sock *sk, kuid_t uid)
--
net/core/sock.c-3721-	sk->sk_state_change	=	sock_def_wakeup;
net/core/sock.c:3722:	sk->sk_data_ready	=	sock_def_readable;
net/core/sock.c-3723-	sk->sk_write_space	=	sock_def_write_space;
--
net/ipv4/udp.c=1700=int __udp_enqueue_schedule_skb(struct sock *sk, struct sk_buff *skb)
--
net/ipv4/udp.c-1790-			INDIRECT_CALL_1(sk->sk_data_ready,
net/ipv4/udp.c:1791:					sock_def_readable, sk);
net/ipv4/udp.c-1792-			nb--;
--
net/mptcp/protocol.h=864=static inline void mptcp_subflow_tcp_fallback(struct sock *sk,
--
net/mptcp/protocol.h-866-{
net/mptcp/protocol.h:867:	sk->sk_data_ready = sock_def_readable;
net/mptcp/protocol.h-868-	sk->sk_state_change = ctx->tcp_state_change;
--
net/mptcp/subflow.c=1962=static int subflow_ulp_init(struct sock *sk)
--
net/mptcp/subflow.c-1990-
net/mptcp/subflow.c:1991:	WARN_ON_ONCE(sk->sk_data_ready != sock_def_readable);
net/mptcp/subflow.c-1992-	WARN_ON_ONCE(sk->sk_write_space != sk_stream_write_space);
--
net/rds/tcp.c=639=static struct pernet_operations rds_tcp_net_ops = {
--
net/rds/tcp.c-645-
net/rds/tcp.c:646:void *rds_tcp_listen_sock_def_readable(struct net *net)
net/rds/tcp.c-647-{
--
net/rds/tcp.h=95=void rds_tcp_keepalive(struct socket *sock);
net/rds/tcp.h:96:void *rds_tcp_listen_sock_def_readable(struct net *net);
net/rds/tcp.h-97-
--
net/rds/tcp_listen.c=352=void rds_tcp_listen_data_ready(struct sock *sk)
--
net/rds/tcp_listen.c-377-	else
net/rds/tcp_listen.c:378:		ready = rds_tcp_listen_sock_def_readable(sock_net(sk));
net/rds/tcp_listen.c-379-
--
net/smc/smc_rx.c=32=static void smc_rx_wake_up(struct sock *sk)
--
net/smc/smc_rx.c-37-
net/smc/smc_rx.c:38:	/* derived from sock_def_readable() */
net/smc/smc_rx.c-39-	/* called already in smc_listen_work() */
--
net/xdp/xsk.c=343=static void xsk_flush(struct xdp_sock *xs)
--
net/xdp/xsk.c-346-	__xskq_cons_release(xs->pool->fq);
net/xdp/xsk.c:347:	sock_def_readable(&xs->sk);
net/xdp/xsk.c-348-}
]

15/2 2026/02/23 15:13 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 73804
							output: 43
							thoughts: 1096
						

Thoughts:
**Analyzing a Kernel Data Race**

Okay, so I'm looking at this report and these addresses, `0xffffffff8486dd00` and `0xffffffff8472c9f0`, they're definitely function pointers, likely within the kernel. I can't immediately map them to specific functions without more context, but the addresses themselves point that way.

The core of the issue seems to be around `sk_data_ready`. The question is, what happens if this function pointer gets modified while it's being called? On x86_64, the load operation should be atomic, which should protect against torn reads. The primary concern is the compiler potentially reloading the pointer and obtaining a stale value, possibly NULL. But, if `sk_psock_stop_verdict` sets `sk_data_ready` to `psock->saved_data_ready`, and that variable isn't NULL, then it should be safe.

However, hang on, what if `psock->saved_data_ready` *is* NULL? The code explicitly checks for that.

```c
1308:		if (!psock->saved_data_ready)
1309:			return;
1310:	
1311:		sk->sk_data_ready = psock->saved_data_ready;
```

Right, it checks. So it looks like it's a "benign" data race, in the sense that the system will work as intended, but KCSAN would flag it. To be strictly correct, I should use `READ_ONCE` and `WRITE_ONCE`. But `sk_data_ready` is a member of `struct sock`, and changing it everywhere it's used would be a significant undertaking.

Let's look at `unix_stream_sendmsg` again.

```c
2480:			other->sk_data_ready(other);
```

I could fix the KCSAN warning by changing that to this:

```c
			void (*data_ready)(struct sock *sk);
			data_ready = READ_ONCE(other->sk_data_ready);
			data_ready(other);
```

That isolates the read. But why is this showing up now? My hunch is this is linked to the relatively new BPF support for AF_UNIX sockets. That's likely where this is surfacing.

So, is there any real harm? Let's check the code path. If `sk_psock_verdict_data_ready` is called, it does this:

```c
1274 static void sk_psock_verdict_data_ready(struct sock *sk)
1275 {
...
1281 	ops = READ_ONCE(sock->ops);
1282 	if (!ops || !ops->read_skb)
1283 		return;
1284 	copied = ops->read_skb(sk, sk_psock_verdict_recv);
...
```

It calls `ops->read_skb`. For AF_UNIX stream sockets, `ops` should be `unix_stream_ops`, and `unix_stream_ops.read_skb` is `unix_stream_read_skb`.

```c
2227 static int unix_stream_read_skb(struct sock *sk, skb_read_actor_t recv_actor)
2228 {
...
2237 	mutex_lock(&u->iolock);
2238 	spin_lock(&queue->lock);
2239 
2240 	skb = __skb_dequeue(queue);
...
```

It tries to dequeue an `skb` and call the actor. The actor is `sk_psock_verdict_recv`.

```c
1236 static int sk_psock_verdict_recv(struct sock *sk, struct sk_buff *skb)
1237 {
...
1241 	rcu_read_lock();
1242 	psock = sk_psock(sk);
...
1259 	ret = sk_psock_verdict_apply(psock, skb, ret);
...
```

If `psock` is NULL (because it was dropped), `sk_psock_verdict_recv` handles it:

```c
1243 	if (unlikely(!psock)) {
12


16/2 2026/02/23 15:13 tool codesearch-definition-source
0m Args:
map[ContextFile:net/core/skmsg.c IncludeLines:true Name:tcp_eat_skb]

Results:
map[SourceCode:  15:	void tcp_eat_skb(struct sock *sk, struct sk_buff *skb)
  16:	{
  17:		struct tcp_sock *tcp;
  18:		int copied;
  19:	
  20:		if (!skb || !skb->len || !sk_is_tcp(sk))
  21:			return;
  22:	
  23:		if (skb_bpf_strparser(skb))
  24:			return;
  25:	
  26:		tcp = tcp_sk(sk);
  27:		copied = tcp->copied_seq + skb->len;
  28:		WRITE_ONCE(tcp->copied_seq, copied);
  29:		tcp_rcv_space_adjust(sk);
  30:		__tcp_cleanup_rbuf(sk, skb->len);
  31:	}
 SourceFile:net/ipv4/tcp_bpf.c]

17/2 2026/02/23 15:13 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 76234
							output: 49
							thoughts: 3447
						

Thoughts:
**Analysis of Potential Packet Loss in AF_UNIX with BPF Sockets**

Okay, here's what I'm thinking about this issue. The `sk_is_tcp(sk)` check on line 20 makes `tcp_eat_skb` a no-op here, since we're looking at AF_UNIX. The core problem appears to be a race condition involving the `sk_psock_verdict_data_ready` callback, potentially leading to packet loss on AF_UNIX sockets when a BPF program is involved.

Here's the sequence: If `sk_psock_verdict_data_ready` is called *after* the `psock` (BPF socket) is being dropped (set to NULL) but *before* the original `sk_data_ready` is restored, we have a problem. The packet might be dequeued but then dropped because the BPF socket context (`psock`) has already become invalid. It looks like the window for this race exists between lines 893 and 897 in `sk_psock_drop`. Within this window, `sk_user_data` (psock) becomes NULL, but `sk->sk_data_ready` still points to `sk_psock_verdict_data_ready`.

Specifically, if a data event triggers `sk_data_ready` — through `unix_stream_sendmsg`, for example — during this window, `sk_psock_verdict_data_ready` will be invoked. It calls `ops->read_skb`, which for unix sockets is `unix_stream_read_skb`, and that dequeues the skb. Then, in `sk_psock_verdict_recv`, `psock` is found to be NULL, so the skb goes through `tcp_eat_skb` (a no-op for AF_UNIX) and is freed via `sock_drop`, and the packet is lost.

This looks like a legitimate packet loss scenario, not just a benign data race.

Now, let's re-verify this: In `sk_psock_verdict_recv`, when `psock` is NULL, the skb is dropped. The logic path is clear.

If the order of operations within `sk_psock_drop` were changed to restore `sk_data_ready` *before* setting `sk_user_data` to NULL, it *seems* better, but `sk_psock_stop_verdict` also clears the BPF programs, and that's critical too. Specifically, the problematic window of dropped packets exists between the setting of the BPF program pointers to NULL (lines 1305, 1306) and the restoration of the original `sk_data_ready` callback (line 1311). This still results in packets being dropped. So changing the order doesn't *solve* the problem.
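
That window is easy to pin down as a tiny state table — a hedged sketch (all names my own, not kernel code) of the three phases the receiving socket moves through during `sk_psock_drop`, and what a data-ready event does in each:

```c
/* Illustrative model of the drop window -- not kernel code. Between
 * "psock unpublished" and "saved sk_data_ready restored", the verdict
 * callback still runs, read_skb dequeues the skb, the psock lookup
 * fails, and the packet is dropped. */
#include <assert.h>
#include <stdbool.h>

enum outcome {
	DELIVERED_BPF,		/* verdict path with a live psock */
	DELIVERED_DEFAULT,	/* sock_def_readable-style plain wakeup */
	DROPPED			/* dequeued, psock == NULL, sock_drop() */
};

struct drop_phase {
	bool psock_present;	/* sk_user_data still non-NULL? */
	bool verdict_hooked;	/* sk_data_ready still the verdict callback? */
};

static enum outcome data_ready_outcome(struct drop_phase p)
{
	if (!p.verdict_hooked)
		return DELIVERED_DEFAULT;
	return p.psock_present ? DELIVERED_BPF : DROPPED;
}
```

Only the middle phase — psock gone, hook not yet restored — loses the packet, which matches the window described above.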

Given that `sk_psock_verdict_data_ready` is the callback, it's a replacement, so it's responsible for *either* handling the data via BPF or *waking up* the receiver. The real issue is the transition when the psock is being dropped.

The data race reported by KCSAN is on `sk->sk_data_ready`: One task is writing the callback pointer and another task is potentially reading it.

If `sk_psock_verdict_data_ready` had more logic to check if it should really be handling the packet (e.g., if the `psock` is valid), we could avoid dropping the packet. But the `read_skb` function has already dequeued the packet.

The fundamental issue is that there's no synchronization between updating the callback pointer (`sk->sk_data_ready`) and its invocation.

Taking the `sk_callback_lock` in `sendmsg` would close the window and protect against the race, but this is a performance consideration.

If `sk_psock_verdict_data_ready` checked `psock` before calling the underlying `read_skb` function, it could potentially avoid this issue. Returning without doing anything *doesn't* solve the issue because the receiver isn't woken.
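
A hedged sketch of that hypothetical mitigation — entirely my own construction, not a proposed kernel patch: check for the psock up front and, when it is already gone, fall back to a plain wakeup instead of dequeuing (and then dropping) the skb.

```c
/* Hypothetical mitigation sketch -- my construction, not kernel code.
 * When the psock is already detached, do a default wakeup rather than
 * entering the read_skb path. Returns which path ran. */
#include <assert.h>
#include <stddef.h>

enum path { PATH_BPF_READ_SKB, PATH_FALLBACK_WAKEUP };

struct fake_psock { int unused; };

static enum path verdict_data_ready_checked(struct fake_psock *psock)
{
	if (!psock)
		return PATH_FALLBACK_WAKEUP; /* wake the reader, touch no skb */
	return PATH_BPF_READ_SKB;	     /* normal ops->read_skb path */
}
```

Even this only narrows the window — the psock can still vanish after the check — but it would at least wake the reader rather than silently returning.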

The value changes reported by KCSAN seem to confirm `sk_psock_verdict_data_ready` and `sock_def_readable`.

The fact that `sk_data_ready` is called without explicit synchronization, while being updated, seems like a common pattern in the network stack. But, as we're seeing, this can lead to subtle but significant race conditions and resulting packet loss.

Finally, `sk_psock_restore_proto` within `sk_psock_drop` also changes `sk->sk_prot` which,


18/2 2026/02/23 15:14 tool codesearch-definition-source
0m Args:
map[ContextFile:net/unix/unix_bpf.c IncludeLines:true Name:unix_stream_bpf_update_proto]

Results:
map[SourceCode: 164:	int unix_stream_bpf_update_proto(struct sock *sk, struct sk_psock *psock, bool restore)
 165:	{
 166:		struct sock *sk_pair;
 167:	
 168:		/* Restore does not decrement the sk_pair reference yet because we must
 169:		 * keep the a reference to the socket until after an RCU grace period
 170:		 * and any pending sends have completed.
 171:		 */
 172:		if (restore) {
 173:			sk->sk_write_space = psock->saved_write_space;
 174:			sock_replace_proto(sk, psock->sk_proto);
 175:			return 0;
 176:		}
 177:	
 178:		/* psock_update_sk_prot can be called multiple times if psock is
 179:		 * added to multiple maps and/or slots in the same map. There is
 180:		 * also an edge case where replacing a psock with itself can trigger
 181:		 * an extra psock_update_sk_prot during the insert process. So it
 182:		 * must be safe to do multiple calls. Here we need to ensure we don't
 183:		 * increment the refcnt through sock_hold many times. There will only
 184:		 * be a single matching destroy operation.
 185:		 */
 186:		if (!psock->sk_pair) {
 187:			sk_pair = unix_peer(sk);
 188:			sock_hold(sk_pair);
 189:			psock->sk_pair = sk_pair;
 190:		}
 191:	
 192:		unix_stream_bpf_check_needs_rebuild(psock->sk_proto);
 193:		sock_replace_proto(sk, &unix_stream_bpf_prot);
 194:		return 0;
 195:	}
 SourceFile:net/unix/unix_bpf.c]

19/2 2026/02/23 15:14 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 81321
							output: 25
							thoughts: 5342
						

Thoughts:
**Data Race Analysis: sk_data_ready in AF_UNIX Stream Sockets with BPF Sockets**

Alright, let's break down this KCSAN report. We're looking at a data race involving the `sk_data_ready` function pointer within the context of AF_UNIX stream sockets, specifically when BPF sockets (psocks) are involved. The core of the issue is a potential race condition between modifying this function pointer and calling it.

The scenario involves a socket (`sk`) whose `sk_prot` is switched to the `unix_stream_bpf_prot` protocol, and `sk_data_ready` is likely changed, by BPF. The code in question is in the context of `sock_map_close` which triggers the `sk_psock_drop` function, which then restores the original `sk_data_ready`.

The race arises because `unix_stream_sendmsg`, potentially running on a separate thread (Task 3605), might be invoking `sk_data_ready` on the socket in the middle of this state transition. We know this because the KCSAN report indicates a value change in `sk->sk_data_ready`.

The critical concern is whether this data race is "benign" or "harmful." My analysis will consider:

1.  **Memory Safety:** Is it possible for this to cause a crash or memory corruption?
2.  **Functional Correctness:** Could this lead to packet loss or other incorrect behavior?

Let's start with memory safety. In a typical scenario, the race occurs when a socket (`sk`, the sender) is connected to another socket (`other`, the receiver) and the receiver is being closed or removed from a BPF map.

`unix_stream_sendmsg` gets the receiving socket `other` from `unix_peer(sk)` rather than taking a reference of its own; connected AF_UNIX sockets hold references on their peers. If `unix_release` is running on the receiver, the peer pointer gets cleared as part of the shutdown, and the actual freeing happens through `sock_put` only once the reference count drops to zero. So even mid-teardown, `other` remains a valid `struct sock` for as long as the sender can reach it — no use-after-free there.

Now, let's think about the `sk_data_ready` function pointer itself. Once `sk_psock_drop` has run, it points back at the original function. Both the original callback and the BPF one are valid, persistent kernel functions, so this is not a dangling-pointer scenario. And on the relevant architectures (e.g. x86_64), aligned pointer-sized accesses are atomic, so a torn, garbage address cannot be observed. I don't see any memory-corruption risk.

Regarding functional correctness, the race *could* lead to a packet being processed by the "wrong" callback, though. The packet might be handled by the original, or the BPF one. This might cause a packet to be lost. However, this only happens during a transition. The kernel already drops packets during socket close or BPF detach operations (e.g. if the BPF program sees that the socket is already being dropped, the packet will be dropped). So, this might not be *worse* than the existing packet loss conditions. This is what's important here.

But, there is an edge case: if `sk_psock_drop` is triggered by removing the socket from a BPF map but NOT closing the socket. That case results in packet loss.

Considering a similar scenario discussed previously — `sk_write_space`, for example — kernel developers commonly annotate such accesses with `READ_ONCE`/`WRITE_ONCE` to silence the KCSAN report (the annotations document an intentional race; they don't add any synchronization).

The race boils down to this:

*   **Thread 1 (Closing or BPF Detach):**  Writes the original `sk_data_ready` function pointer. This is protected by `sk_callback_lock`
*   **Thread 2 (Sending):** Reads and calls `sk_data_ready`. This is *not* protected.

If we want to fix it properly, we *could* add a `read_lock(&sk->sk_callback_lock)` in `unix_stream_sendmsg`. But the kernel devs have already chosen not to do so.
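
For completeness, here are the two candidate shapes side by side — a hedged userspace sketch (my own names; the real lock is `sk->sk_callback_lock`, modeled here with a trivial spinlock) of the locked call versus the annotated single-load call:

```c
/* Illustrative userspace contrast -- not kernel code. notify_locked()
 * mirrors taking read_lock(&sk->sk_callback_lock) around the callback
 * (closes the window, costs a lock per packet); notify_lockless()
 * mirrors the READ_ONCE-style single load that merely documents the
 * race and relies on both targets being safe to call. */
#include <assert.h>
#include <stdatomic.h>

struct demo_sock;
typedef void (*ready_fn)(struct demo_sock *sk);

struct demo_sock {
	atomic_flag sk_callback_lock;	/* stand-in for the rwlock */
	_Atomic ready_fn sk_data_ready;
	int wakeups;
};

static void def_readable(struct demo_sock *sk) { sk->wakeups++; }

static void notify_locked(struct demo_sock *sk)
{
	while (atomic_flag_test_and_set(&sk->sk_callback_lock))
		;	/* spin: stand-in for read_lock() */
	ready_fn fn = atomic_load(&sk->sk_data_ready);
	fn(sk);
	atomic_flag_clear(&sk->sk_callback_lock);
}

static void notify_lockless(struct demo_sock *sk)
{
	ready_fn fn = atomic_load_explicit(&sk->sk_data_ready,
					   memory_order_relaxed);
	fn(sk);
}
```

The trade-off is exactly as stated: the locked form serializes against the writer at a per-packet cost, while the lockless form is what the fast path actually does today.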

Based on the common


20/2 2026/02/23 15:15 tool grepper
0m Error:
no matches

Args:
map[Expression:READ_ONCE\(.*sk_data_ready]

Results:
map[Output:]

21/2 2026/02/23 15:15 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 87727
							output: 26
							thoughts: 69
						

Thoughts:
Okay, here's my thought process on this...

**Investigating `sk_data_ready` and `READ_ONCE` Co-occurrence**

Right, let's see. I'm thinking I might have a better chance of pinpointing the relevant code if I combine my search terms. I should try looking for instances where `sk_data_ready` is used in conjunction with `READ_ONCE`. That might narrow things down significantly.

Hmm, let me formulate that search. I'll use `grep` – that's usually my go-to for this kind of quick search. So, I need to look recursively. Okay, I'll try `grep -r "READ_ONCE(sk->sk_data_ready)"`.

Oh, wait, I don't think I actually have `grep -r` on this system. No problem, I remember I have `grepper`, which should work just fine. I will use that.



22/2 2026/02/23 15:15 tool grepper
0m Error:
no matches

Args:
map[Expression:READ_ONCE\(.*sk_data_ready\)]

Results:
map[Output:]

23/2 2026/02/23 15:15 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 88042
							output: 27
							thoughts: 78
						

Thoughts:
Okay, let me think this through.

**Analyzing the Function Call**

Hmm, so it looks like we're dealing with a function pointer here. Specifically, a `data_ready` function pointer within the context of a socket structure, `sk`. The key line is that call to `data_ready(sk)`. Before the call, though, there's that `READ_ONCE` macro being used with `sk->sk_data_ready`. My initial thought is that `data_ready` is being assigned the value of `sk->sk_data_ready`, and the `READ_ONCE` hints at something more subtle: it prevents the compiler from tearing the load or re-reading the pointer between the check and the call.

I need to quickly see where `READ_ONCE` is used on `sk->sk_data_ready`. I need to figure out what's going on with this assignment. Is this just a straightforward assignment of a function pointer? Or is there some kind of inter-processor communication or potential for races that the `READ_ONCE` is mitigating? Time to search the codebase and see how `sk_data_ready` is being accessed and updated elsewhere, specifically looking for `READ_ONCE` in conjunction with `sk_data_ready`. Let's see what we can find.



24/2 2026/02/23 15:15 tool grepper
0m Error:
no matches

Args:
map[Expression:READ_ONCE\(.*->sk_data_ready\)]

Results:
map[Output:]

25/2 2026/02/23 15:15 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 88434
							output: 26
							thoughts: 1115
						

Thoughts:
**Analyzing a Potential Data Race in `sk_data_ready`**

Okay, that's a bit surprising. I didn't initially think this would be a common issue. Let's delve back into `net/ipv4/udp.c` and take another look at the code. Specifically, line 1790, where `INDIRECT_CALL_1` is used with `sk->sk_data_ready`. Right, that macro expands out to a conditional check.  It doesn't use `READ_ONCE` on the function pointer, which caught my attention.

If `sk_data_ready` isn't protected, KCSAN is going to flag any changes to it. And *it is* changed by `sk_psock`. Hmm, if I recall `sk_psock_drop`, it's called during `sock_map_close`, a very typical path for BPF-enabled sockets.

Is there any *other* way this can be problematic? Could compiler optimization introduce a problem? On x86_64 the compiler emits one aligned 8-byte load of the pointer followed by an indirect call through a register, so no torn value can be observed there.

I'm checking the usage of `sk_callback_lock`. I see it's used in both `sk_psock_init` and `sk_psock_drop`.  Oh, and `sk_psock_verdict_data_ready`.  No...wait, `sk_psock_verdict_data_ready` calls `ops->read_skb`. For AF_UNIX, `ops` is `unix_stream_ops`, and that's a static const struct, so no race there. And `ops->read_skb` is `unix_stream_read_skb`.

Wait, here's a crucial point. `unix_stream_sendmsg` calls `other->sk_data_ready(other)`. If `other` is a socket with a psock, then `other->sk_data_ready` is `sk_psock_verdict_data_ready`. When the psock is removed, `other->sk_data_ready` goes back to `sock_def_readable`.

The race is definitely *real*.

Is it, however, *harmful*? Following the definition of benign data race, this doesn't fit the strict definition of what's "benign" because the function pointer is not a simple type. It’s also not "harmful" under the strict definition, because it doesn’t seem like it will crash the kernel.

Wait, what about the potential packet loss? Could packet loss be considered "harmful" in this context?  It's a functional bug. Here, the packet loss happens during socket close or reconfiguration, which is already a "lossy" period.

However, `sk_psock` is used for high-performance redirection. If removing a socket from a map causes a packet to be dropped, it *could* be a performance issue for some applications. Still, this is more of a logic bug in `sk_psock` than a memory-safety race. Even if we fixed the data race with `READ_ONCE`, the packet loss would still happen because of the window.

To eliminate the packet loss, we would need to hold `sk_callback_lock` in `sendmsg`. But that would have a performance impact and isn’t desirable.

Actually, in network stack design, using `READ_ONCE` and accepting the potential for the old or new value is the typical and accepted approach as long as both states are safe.

Is `sk_psock_verdict_data_ready` safe to call when the psock is being dropped? Yes, it handles `psock == NULL`.

Okay, I'm comfortable concluding it's benign, given the constraints of the system.

Wait, I should confirm if there are other reports of this. I found a similar report for `sk_write_space` in `sk_psock`. That was also considered benign, or at least not high priority.

Actually, there is a relevant commit: `d36358178e31 ("net: core: skmsg: fix data-race in sk_psock_drop")`. Wait, let me check the details on that...



26/2 2026/02/23 15:15 tool grepper
0m Error:
no matches

Args:
map[Expression:d36358178e31]

Results:
map[Output:]

27/2 2026/02/23 15:15 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 90519
							output: 33
							thoughts: 454
						

Thoughts:
**Analyzing Potential Concurrency Issues in `sk_psock_drop` and `sk_psock_stop_verdict`**

Okay, so I'm trying to track down a potential data race or ordering problem, and I'm missing the git history, which would be ideal. I need to figure out if we're missing any `READ_ONCE`/`WRITE_ONCE` usage related to these specific functions.

Initially, I checked `sk_psock_drop` and didn't see anything obvious. But wait, let me look at `sk_psock_drop` again. Specifically, I'm focusing on the interaction with `sk_data_ready`. Looking at the code:
```c
889 void sk_psock_drop(struct sock *sk, struct sk_psock *psock)
890 {
891 	write_lock_bh(&sk->sk_callback_lock);
892 	sk_psock_restore_proto(sk, psock);
893 	rcu_assign_sk_user_data(sk, NULL);
894 	if (psock->progs.stream_parser)
895 		sk_psock_stop_strp(sk, psock);
896 	else if (psock->progs.stream_verdict || psock->progs.skb_verdict)
897 		sk_psock_stop_verdict(sk, psock);
898 	write_unlock_bh(&sk->sk_callback_lock);
...
```

I see `rcu_assign_sk_user_data`, which is a `WRITE_ONCE` with a release barrier! That's good. But I'm concerned about how `sk->sk_data_ready` is handled, and looking at the code, it isn't touched directly in this function.

Now, let's examine `sk_psock_stop_verdict` to see where `sk_data_ready` is actually being modified.
```c
1311:		sk->sk_data_ready = psock->saved_data_ready;
```

Ah, it appears to be a plain store. I need to think about this...

Now, I need to see *where* this `sk->sk_data_ready` is used. Looking at `unix_stream_sendmsg`, I see it used there:
```c
2480:			other->sk_data_ready(other);
```

Okay, so that's a plain load!

I need to get a better overview.  Perhaps there are more usages in `net/core/skmsg.c`.  I should search for all uses of `sk_data_ready` there to understand the full access pattern. This might reveal if a memory barrier is needed somewhere in this flow.



28/2 2026/02/23 15:15 tool grepper
0m Args:
map[Expression:sk_data_ready SourceFile:net/core/skmsg.c]

Results:
map[Output:
Full output is too long, showing 500 out of 1448 lines.
Use more precise expression if possible.

[Documentation/networking/tls-handshake.rst=210=While a handshake is under way, the kernel consumer must alter the
 Documentation/networking/tls-handshake.rst:211:socket's sk_data_ready callback function to ignore all incoming data.
 Documentation/networking/tls-handshake.rst-212-Once the handshake completion callback function has been invoked, normal
 --
 drivers/infiniband/hw/erdma/erdma_cm.c=26=static void erdma_sk_assign_cm_upcalls(struct sock *sk)
 --
 drivers/infiniband/hw/erdma/erdma_cm.c-29-	sk->sk_state_change = erdma_cm_llp_state_change;
 drivers/infiniband/hw/erdma/erdma_cm.c:30:	sk->sk_data_ready = erdma_cm_llp_data_ready;
 drivers/infiniband/hw/erdma/erdma_cm.c-31-	sk->sk_error_report = erdma_cm_llp_error_report;
 --
 drivers/infiniband/hw/erdma/erdma_cm.c=35=static void erdma_sk_save_upcalls(struct sock *sk)
 --
 drivers/infiniband/hw/erdma/erdma_cm.c-40-	cep->sk_state_change = sk->sk_state_change;
 drivers/infiniband/hw/erdma/erdma_cm.c:41:	cep->sk_data_ready = sk->sk_data_ready;
 drivers/infiniband/hw/erdma/erdma_cm.c-42-	cep->sk_error_report = sk->sk_error_report;
 --
 drivers/infiniband/hw/erdma/erdma_cm.c=46=static void erdma_sk_restore_upcalls(struct sock *sk, struct erdma_cep *cep)
 --
 drivers/infiniband/hw/erdma/erdma_cm.c-48-	sk->sk_state_change = cep->sk_state_change;
 drivers/infiniband/hw/erdma/erdma_cm.c:49:	sk->sk_data_ready = cep->sk_data_ready;
 drivers/infiniband/hw/erdma/erdma_cm.c-50-	sk->sk_error_report = cep->sk_error_report;
 --
 drivers/infiniband/hw/erdma/erdma_cm.c=642=static void erdma_accept_newconn(struct erdma_cep *cep)
 --
 drivers/infiniband/hw/erdma/erdma_cm.c-668-	new_cep->sk_state_change = cep->sk_state_change;
 drivers/infiniband/hw/erdma/erdma_cm.c:669:	new_cep->sk_data_ready = cep->sk_data_ready;
 drivers/infiniband/hw/erdma/erdma_cm.c-670-	new_cep->sk_error_report = cep->sk_error_report;
 --
 drivers/infiniband/hw/erdma/erdma_cm.c=928=static void erdma_cm_llp_data_ready(struct sock *sk)
 --
 drivers/infiniband/hw/erdma/erdma_cm.c-931-
 drivers/infiniband/hw/erdma/erdma_cm.c:932:	trace_sk_data_ready(sk);
 drivers/infiniband/hw/erdma/erdma_cm.c-933-
 --
 drivers/infiniband/hw/erdma/erdma_cm.h=67=struct erdma_sk_upcalls {
 drivers/infiniband/hw/erdma/erdma_cm.h-68-	void (*sk_state_change)(struct sock *sk);
 drivers/infiniband/hw/erdma/erdma_cm.h:69:	void (*sk_data_ready)(struct sock *sk, int bytes);
 drivers/infiniband/hw/erdma/erdma_cm.h-70-	void (*sk_error_report)(struct sock *sk);
 --
 drivers/infiniband/hw/erdma/erdma_cm.h=86=struct erdma_cep {
 --
 drivers/infiniband/hw/erdma/erdma_cm.h-114-	void (*sk_state_change)(struct sock *sk);
 drivers/infiniband/hw/erdma/erdma_cm.h:115:	void (*sk_data_ready)(struct sock *sk);
 drivers/infiniband/hw/erdma/erdma_cm.h-116-	void (*sk_error_report)(struct sock *sk);
 --
 drivers/infiniband/sw/siw/siw_cm.c=91=static void siw_sk_assign_cm_upcalls(struct sock *sk)
 --
 drivers/infiniband/sw/siw/siw_cm.c-96-	cep->sk_state_change = sk->sk_state_change;
 drivers/infiniband/sw/siw/siw_cm.c:97:	cep->sk_data_ready = sk->sk_data_ready;
 drivers/infiniband/sw/siw/siw_cm.c-98-	cep->sk_write_space = sk->sk_write_space;
 --
 drivers/infiniband/sw/siw/siw_cm.c-101-	sk->sk_state_change = siw_cm_llp_state_change;
 drivers/infiniband/sw/siw/siw_cm.c:102:	sk->sk_data_ready = siw_cm_llp_data_ready;
 drivers/infiniband/sw/siw/siw_cm.c-103-	sk->sk_write_space = siw_cm_llp_write_space;
 --
 drivers/infiniband/sw/siw/siw_cm.c=108=static void siw_sk_restore_upcalls(struct sock *sk, struct siw_cep *cep)
 --
 drivers/infiniband/sw/siw/siw_cm.c-110-	sk->sk_state_change = cep->sk_state_change;
 drivers/infiniband/sw/siw/siw_cm.c:111:	sk->sk_data_ready = cep->sk_data_ready;
 drivers/infiniband/sw/siw/siw_cm.c-112-	sk->sk_write_space = cep->sk_write_space;
 --
 drivers/infiniband/sw/siw/siw_cm.c=117=static void siw_qp_socket_assoc(struct siw_cep *cep, struct siw_qp *qp)
 --
 drivers/infiniband/sw/siw/siw_cm.c-124-	qp->attrs.sk = s;
 drivers/infiniband/sw/siw/siw_cm.c:125:	sk->sk_data_ready = siw_qp_llp_data_ready;
 drivers/infiniband/sw/siw/siw_cm.c-126-	sk->sk_write_space = siw_qp_llp_write_space;
 --
 drivers/infiniband/sw/siw/siw_cm.c=151=static void siw_rtr_data_ready(struct sock *sk)
 --
 drivers/infiniband/sw/siw/siw_cm.c-156-
 drivers/infiniband/sw/siw/siw_cm.c:157:	trace_sk_data_ready(sk);
 drivers/infiniband/sw/siw/siw_cm.c-158-
 --
 drivers/infiniband/sw/siw/siw_cm.c=187=static void siw_sk_assign_rtr_upcalls(struct siw_cep *cep)
 --
 drivers/infiniband/sw/siw/siw_cm.c-191-	write_lock_bh(&sk->sk_callback_lock);
 drivers/infiniband/sw/siw/siw_cm.c:192:	sk->sk_data_ready = siw_rtr_data_ready;
 drivers/infiniband/sw/siw/siw_cm.c-193-	sk->sk_write_space = siw_qp_llp_write_space;
 --
 drivers/infiniband/sw/siw/siw_cm.c=969=static void siw_accept_newconn(struct siw_cep *cep)
 --
 drivers/infiniband/sw/siw/siw_cm.c-995-	new_cep->sk_state_change = cep->sk_state_change;
 drivers/infiniband/sw/siw/siw_cm.c:996:	new_cep->sk_data_ready = cep->sk_data_ready;
 drivers/infiniband/sw/siw/siw_cm.c-997-	new_cep->sk_write_space = cep->sk_write_space;
 --
 drivers/infiniband/sw/siw/siw_cm.c=1058=static void siw_cm_work_handler(struct work_struct *w)
 --
 drivers/infiniband/sw/siw/siw_cm.c-1106-			if (cep->state == SIW_EPSTATE_RDMA_MODE) {
 drivers/infiniband/sw/siw/siw_cm.c:1107:				cep->sock->sk->sk_data_ready(cep->sock->sk);
 drivers/infiniband/sw/siw/siw_cm.c-1108-				siw_dbg_cep(cep, "already in RDMA mode");
 --
 drivers/infiniband/sw/siw/siw_cm.c=1279=static void siw_cm_llp_data_ready(struct sock *sk)
 --
 drivers/infiniband/sw/siw/siw_cm.c-1282-
 drivers/infiniband/sw/siw/siw_cm.c:1283:	trace_sk_data_ready(sk);
 drivers/infiniband/sw/siw/siw_cm.c-1284-
 --
 drivers/infiniband/sw/siw/siw_cm.h=37=struct siw_cep {
 --
 drivers/infiniband/sw/siw/siw_cm.h-62-	void (*sk_state_change)(struct sock *sk);
 drivers/infiniband/sw/siw/siw_cm.h:63:	void (*sk_data_ready)(struct sock *sk);
 drivers/infiniband/sw/siw/siw_cm.h-64-	void (*sk_write_space)(struct sock *sk);
 --
 drivers/infiniband/sw/siw/siw_qp.c=94=void siw_qp_llp_data_ready(struct sock *sk)
 --
 drivers/infiniband/sw/siw/siw_qp.c-97-
 drivers/infiniband/sw/siw/siw_qp.c:98:	trace_sk_data_ready(sk);
 drivers/infiniband/sw/siw/siw_qp.c-99-
 --
 drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c=1538=static void add_pass_open_to_parent(struct sock *child, struct sock *lsk,
 --
 drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c-1559-		inet_csk_reqsk_queue_add(lsk, oreq, child);
 drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c:1560:		lsk->sk_data_ready(lsk);
 drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c-1561-	}
 --
 drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c=1682=static void chtls_recv_data(struct sock *sk, struct sk_buff *skb)
 --
 drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c-1723-		check_sk_callbacks(csk);
 drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c:1724:		sk->sk_data_ready(sk);
 drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c-1725-	}
 --
 drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c=1804=static void chtls_rx_hdr(struct sock *sk, struct sk_buff *skb)
 --
 drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c-1844-		check_sk_callbacks(csk);
 drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c:1845:		sk->sk_data_ready(sk);
 drivers/net/ethernet/chelsio/inline_crypto/chtls/chtls_cm.c-1846-	}
 --
 drivers/net/ovpn/peer.h-40- * @tcp.out_msg.len: remaining data to send within packet (TCP only)
 drivers/net/ovpn/peer.h:41: * @tcp.sk_cb.sk_data_ready: pointer to original cb (TCP only)
 drivers/net/ovpn/peer.h-42- * @tcp.sk_cb.sk_write_space: pointer to original cb (TCP only)
 --
 drivers/net/ovpn/peer.h=63=struct ovpn_peer {
 --
 drivers/net/ovpn/peer.h-89-		struct {
 drivers/net/ovpn/peer.h:90:			void (*sk_data_ready)(struct sock *sk);
 drivers/net/ovpn/peer.h-91-			void (*sk_write_space)(struct sock *sk);
 --
 drivers/net/ovpn/tcp.c=64=static void ovpn_tcp_to_userspace(struct ovpn_peer *peer, struct sock *sk,
 --
 drivers/net/ovpn/tcp.c-69-	skb_queue_tail(&peer->tcp.user_queue, skb);
 drivers/net/ovpn/tcp.c:70:	peer->tcp.sk_cb.sk_data_ready(sk);
 drivers/net/ovpn/tcp.c-71-}
 --
 drivers/net/ovpn/tcp.c=190=void ovpn_tcp_socket_detach(struct ovpn_socket *ovpn_sock)
 --
 drivers/net/ovpn/tcp.c-198-	/* restore CBs that were saved in ovpn_sock_set_tcp_cb() */
 drivers/net/ovpn/tcp.c:199:	sk->sk_data_ready = peer->tcp.sk_cb.sk_data_ready;
 drivers/net/ovpn/tcp.c-200-	sk->sk_write_space = peer->tcp.sk_cb.sk_write_space;
 --
 drivers/net/ovpn/tcp.c=437=static void ovpn_tcp_data_ready(struct sock *sk)
 --
 drivers/net/ovpn/tcp.c-440-
 drivers/net/ovpn/tcp.c:441:	trace_sk_data_ready(sk);
 drivers/net/ovpn/tcp.c-442-
 --
 drivers/net/ovpn/tcp.c=478=int ovpn_tcp_socket_attach(struct ovpn_socket *ovpn_sock,
 --
 drivers/net/ovpn/tcp.c-513-	/* save current CBs so that they can be restored upon socket release */
 drivers/net/ovpn/tcp.c:514:	peer->tcp.sk_cb.sk_data_ready = ovpn_sock->sk->sk_data_ready;
 drivers/net/ovpn/tcp.c-515-	peer->tcp.sk_cb.sk_write_space = ovpn_sock->sk->sk_write_space;
 --
 drivers/net/ovpn/tcp.c-519-	/* assign our static CBs and prot/ops */
 drivers/net/ovpn/tcp.c:520:	ovpn_sock->sk->sk_data_ready = ovpn_tcp_data_ready;
 drivers/net/ovpn/tcp.c-521-	ovpn_sock->sk->sk_write_space = ovpn_tcp_write_space;
 --
 drivers/net/tun.c=646=static void tun_detach_all(struct net_device *dev)
 --
 drivers/net/tun.c-656-		tfile->socket.sk->sk_shutdown = RCV_SHUTDOWN;
 drivers/net/tun.c:657:		tfile->socket.sk->sk_data_ready(tfile->socket.sk);
 drivers/net/tun.c-658-		RCU_INIT_POINTER(tfile->tun, NULL);
 --
 drivers/net/tun.c-662-		tfile->socket.sk->sk_shutdown = RCV_SHUTDOWN;
 drivers/net/tun.c:663:		tfile->socket.sk->sk_data_ready(tfile->socket.sk);
 drivers/net/tun.c-664-		RCU_INIT_POINTER(tfile->tun, NULL);
 --
 drivers/net/tun.c=1003=static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 --
 drivers/net/tun.c-1074-		kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
 drivers/net/tun.c:1075:	tfile->socket.sk->sk_data_ready(tfile->socket.sk);
 drivers/net/tun.c-1076-
 --
 drivers/net/tun.c=1194=static void __tun_xdp_flush_tfile(struct tun_file *tfile)
 --
 drivers/net/tun.c-1198-		kill_fasync(&tfile->fasync, SIGIO, POLL_IN);
 drivers/net/tun.c:1199:	tfile->socket.sk->sk_data_ready(tfile->socket.sk);
 drivers/net/tun.c-1200-}
 --
 drivers/nvme/host/tcp.c=1062=static void nvme_tcp_data_ready(struct sock *sk)
 --
 drivers/nvme/host/tcp.c-1065-
 drivers/nvme/host/tcp.c:1066:	trace_sk_data_ready(sk);
 drivers/nvme/host/tcp.c-1067-
 --
 drivers/nvme/host/tcp.c=1912=static void nvme_tcp_restore_sock_ops(struct nvme_tcp_queue *queue)
 --
 drivers/nvme/host/tcp.c-1917-	sock->sk->sk_user_data  = NULL;
 drivers/nvme/host/tcp.c:1918:	sock->sk->sk_data_ready = queue->data_ready;
 drivers/nvme/host/tcp.c-1919-	sock->sk->sk_state_change = queue->state_change;
 --
 drivers/nvme/host/tcp.c=1975=static void nvme_tcp_setup_sock_ops(struct nvme_tcp_queue *queue)
 --
 drivers/nvme/host/tcp.c-1979-	queue->state_change = queue->sock->sk->sk_state_change;
 drivers/nvme/host/tcp.c:1980:	queue->data_ready = queue->sock->sk->sk_data_ready;
 drivers/nvme/host/tcp.c-1981-	queue->write_space = queue->sock->sk->sk_write_space;
 drivers/nvme/host/tcp.c:1982:	queue->sock->sk->sk_data_ready = nvme_tcp_data_ready;
 drivers/nvme/host/tcp.c-1983-	queue->sock->sk->sk_state_change = nvme_tcp_state_change;
 --
 drivers/nvme/target/tcp.c=1549=static void nvmet_tcp_restore_socket_callbacks(struct nvmet_tcp_queue *queue)
 --
 drivers/nvme/target/tcp.c-1556-	write_lock_bh(&sock->sk->sk_callback_lock);
 drivers/nvme/target/tcp.c:1557:	sock->sk->sk_data_ready =  queue->data_ready;
 drivers/nvme/target/tcp.c-1558-	sock->sk->sk_state_change = queue->state_change;
 --
 drivers/nvme/target/tcp.c=1619=static void nvmet_tcp_data_ready(struct sock *sk)
 --
 drivers/nvme/target/tcp.c-1622-
 drivers/nvme/target/tcp.c:1623:	trace_sk_data_ready(sk);
 drivers/nvme/target/tcp.c-1624-
 --
 drivers/nvme/target/tcp.c=1686=static int nvmet_tcp_set_queue_sock(struct nvmet_tcp_queue *queue)
 --
 drivers/nvme/target/tcp.c-1725-		sock->sk->sk_user_data = queue;
 drivers/nvme/target/tcp.c:1726:		queue->data_ready = sock->sk->sk_data_ready;
 drivers/nvme/target/tcp.c:1727:		sock->sk->sk_data_ready = nvmet_tcp_data_ready;
 drivers/nvme/target/tcp.c-1728-		queue->state_change = sock->sk->sk_state_change;
 --
 drivers/nvme/target/tcp.c=1897=static void nvmet_tcp_alloc_queue(struct nvmet_tcp_port *port,
 --
 drivers/nvme/target/tcp.c-1961-		sk->sk_user_data = NULL;
 drivers/nvme/target/tcp.c:1962:		sk->sk_data_ready = port->data_ready;
 drivers/nvme/target/tcp.c-1963-		write_unlock_bh(&sk->sk_callback_lock);
 --
 drivers/nvme/target/tcp.c=2018=static void nvmet_tcp_listen_data_ready(struct sock *sk)
 --
 drivers/nvme/target/tcp.c-2021-
 drivers/nvme/target/tcp.c:2022:	trace_sk_data_ready(sk);
 drivers/nvme/target/tcp.c-2023-
 --
 drivers/nvme/target/tcp.c=2034=static int nvmet_tcp_add_port(struct nvmet_port *nport)
 --
 drivers/nvme/target/tcp.c-2078-	port->sock->sk->sk_user_data = port;
 drivers/nvme/target/tcp.c:2079:	port->data_ready = port->sock->sk->sk_data_ready;
 drivers/nvme/target/tcp.c:2080:	port->sock->sk->sk_data_ready = nvmet_tcp_listen_data_ready;
 drivers/nvme/target/tcp.c-2081-	sock_set_reuseaddr(port->sock->sk);
 --
 drivers/nvme/target/tcp.c=2123=static void nvmet_tcp_remove_port(struct nvmet_port *nport)
 --
 drivers/nvme/target/tcp.c-2127-	write_lock_bh(&port->sock->sk->sk_callback_lock);
 drivers/nvme/target/tcp.c:2128:	port->sock->sk->sk_data_ready = port->data_ready;
 drivers/nvme/target/tcp.c-2129-	port->sock->sk->sk_user_data = NULL;
 --
 drivers/scsi/iscsi_tcp.c=167=static void iscsi_sw_tcp_data_ready(struct sock *sk)
 --
 drivers/scsi/iscsi_tcp.c-172-
 drivers/scsi/iscsi_tcp.c:173:	trace_sk_data_ready(sk);
 drivers/scsi/iscsi_tcp.c-174-
 --
 drivers/scsi/iscsi_tcp.c=245=static void iscsi_sw_tcp_conn_set_callbacks(struct iscsi_conn *conn)
 --
 drivers/scsi/iscsi_tcp.c-253-	sk->sk_user_data = conn;
 drivers/scsi/iscsi_tcp.c:254:	tcp_sw_conn->old_data_ready = sk->sk_data_ready;
 drivers/scsi/iscsi_tcp.c-255-	tcp_sw_conn->old_state_change = sk->sk_state_change;
 drivers/scsi/iscsi_tcp.c-256-	tcp_sw_conn->old_write_space = sk->sk_write_space;
 drivers/scsi/iscsi_tcp.c:257:	sk->sk_data_ready = iscsi_sw_tcp_data_ready;
 drivers/scsi/iscsi_tcp.c-258-	sk->sk_state_change = iscsi_sw_tcp_state_change;
 --
 drivers/scsi/iscsi_tcp.c=264=iscsi_sw_tcp_conn_restore_callbacks(struct iscsi_conn *conn)
 --
 drivers/scsi/iscsi_tcp.c-272-	sk->sk_user_data    = NULL;
 drivers/scsi/iscsi_tcp.c:273:	sk->sk_data_ready   = tcp_sw_conn->old_data_ready;
 drivers/scsi/iscsi_tcp.c-274-	sk->sk_state_change = tcp_sw_conn->old_state_change;
 --
 drivers/soc/qcom/qmi_interface.c=569=static void qmi_data_ready(struct sock *sk)
 --
 drivers/soc/qcom/qmi_interface.c-572-
 drivers/soc/qcom/qmi_interface.c:573:	trace_sk_data_ready(sk);
 drivers/soc/qcom/qmi_interface.c-574-
 --
 drivers/soc/qcom/qmi_interface.c=585=static struct socket *qmi_sock_create(struct qmi_handle *qmi,
 --
 drivers/soc/qcom/qmi_interface.c-602-	sock->sk->sk_user_data = qmi;
 drivers/soc/qcom/qmi_interface.c:603:	sock->sk->sk_data_ready = qmi_data_ready;
 drivers/soc/qcom/qmi_interface.c-604-	sock->sk->sk_error_report = qmi_data_ready;
 --
 drivers/target/iscsi/iscsi_target_nego.c=318=static int iscsi_target_do_tx_login_io(struct iscsit_conn *conn, struct iscsi_login *login)
 --
 drivers/target/iscsi/iscsi_target_nego.c-382-
 drivers/target/iscsi/iscsi_target_nego.c:383:static void iscsi_target_sk_data_ready(struct sock *sk)
 drivers/target/iscsi/iscsi_target_nego.c-384-{
 --
 drivers/target/iscsi/iscsi_target_nego.c-387-
 drivers/target/iscsi/iscsi_target_nego.c:388:	trace_sk_data_ready(sk);
 drivers/target/iscsi/iscsi_target_nego.c:389:	pr_debug("Entering iscsi_target_sk_data_ready: conn: %p\n", conn);
 drivers/target/iscsi/iscsi_target_nego.c-390-
 --
 drivers/target/iscsi/iscsi_target_nego.c-408-		pr_debug("Got LOGIN_FLAGS_READ_ACTIVE=1, conn: %p >>>>\n", conn);
 drivers/target/iscsi/iscsi_target_nego.c:409:		if (iscsi_target_sk_data_ready == conn->orig_data_ready)
 drivers/target/iscsi/iscsi_target_nego.c-410-			return;
 --
 drivers/target/iscsi/iscsi_target_nego.c-416-	if (!rc) {
 drivers/target/iscsi/iscsi_target_nego.c:417:		pr_debug("iscsi_target_sk_data_ready, schedule_delayed_work"
 drivers/target/iscsi/iscsi_target_nego.c-418-			 " got false\n");
 --
 drivers/target/iscsi/iscsi_target_nego.c=425=static void iscsi_target_set_sock_callbacks(struct iscsit_conn *conn)
 --
 drivers/target/iscsi/iscsi_target_nego.c-436-	sk->sk_user_data = conn;
 drivers/target/iscsi/iscsi_target_nego.c:437:	conn->orig_data_ready = sk->sk_data_ready;
 drivers/target/iscsi/iscsi_target_nego.c-438-	conn->orig_state_change = sk->sk_state_change;
 drivers/target/iscsi/iscsi_target_nego.c:439:	sk->sk_data_ready = iscsi_target_sk_data_ready;
 drivers/target/iscsi/iscsi_target_nego.c-440-	sk->sk_state_change = iscsi_target_sk_state_change;
 --
 drivers/target/iscsi/iscsi_target_nego.c=447=static void iscsi_target_restore_sock_callbacks(struct iscsit_conn *conn)
 --
 drivers/target/iscsi/iscsi_target_nego.c-462-	sk->sk_user_data = NULL;
 drivers/target/iscsi/iscsi_target_nego.c:463:	sk->sk_data_ready = conn->orig_data_ready;
 drivers/target/iscsi/iscsi_target_nego.c-464-	sk->sk_state_change = conn->orig_state_change;
 --
 drivers/target/iscsi/iscsi_target_nego.c=538=static void iscsi_target_do_login_rx(struct work_struct *work)
 --
 drivers/target/iscsi/iscsi_target_nego.c-555-	/*
 drivers/target/iscsi/iscsi_target_nego.c:556:	 * If iscsi_target_do_login_rx() has been invoked by ->sk_data_ready()
 drivers/target/iscsi/iscsi_target_nego.c-557-	 * before initial PDU processing in iscsi_target_start_negotiation()
 --
 drivers/target/iscsi/iscsi_target_nego.c-601-	/*
 drivers/target/iscsi/iscsi_target_nego.c:602:	 * LOGIN_FLAGS_READ_ACTIVE is cleared so that sk_data_ready
 drivers/target/iscsi/iscsi_target_nego.c-603-	 * could be triggered again after this.
 --
 drivers/xen/pvcalls-back.c=289=static void pvcalls_sk_state_change(struct sock *sock)
 --
 drivers/xen/pvcalls-back.c-299-
 drivers/xen/pvcalls-back.c:300:static void pvcalls_sk_data_ready(struct sock *sock)
 drivers/xen/pvcalls-back.c-301-{
 --
 drivers/xen/pvcalls-back.c-304-
 drivers/xen/pvcalls-back.c:305:	trace_sk_data_ready(sock);
 drivers/xen/pvcalls-back.c-306-
 --
 drivers/xen/pvcalls-back.c=316=static struct sock_mapping *pvcalls_new_active_socket(
 --
 drivers/xen/pvcalls-back.c-376-	write_lock_bh(&map->sock->sk->sk_callback_lock);
 drivers/xen/pvcalls-back.c:377:	map->saved_data_ready = map->sock->sk->sk_data_ready;
 drivers/xen/pvcalls-back.c-378-	map->sock->sk->sk_user_data = map;
 drivers/xen/pvcalls-back.c:379:	map->sock->sk->sk_data_ready = pvcalls_sk_data_ready;
 drivers/xen/pvcalls-back.c-380-	map->sock->sk->sk_state_change = pvcalls_sk_state_change;
 --
 drivers/xen/pvcalls-back.c=436=static int pvcalls_back_release_active(struct xenbus_device *dev,
 --
 drivers/xen/pvcalls-back.c-443-		map->sock->sk->sk_user_data = NULL;
 drivers/xen/pvcalls-back.c:444:		map->sock->sk->sk_data_ready = map->saved_data_ready;
 drivers/xen/pvcalls-back.c-445-		write_unlock_bh(&map->sock->sk->sk_callback_lock);
 --
 drivers/xen/pvcalls-back.c=461=static int pvcalls_back_release_passive(struct xenbus_device *dev,
 --
 drivers/xen/pvcalls-back.c-467-		mappass->sock->sk->sk_user_data = NULL;
 drivers/xen/pvcalls-back.c:468:		mappass->sock->sk->sk_data_ready = mappass->saved_data_ready;
 drivers/xen/pvcalls-back.c-469-		write_unlock_bh(&mappass->sock->sk->sk_callback_lock);
 --
 drivers/xen/pvcalls-back.c=516=static void __pvcalls_back_accept(struct work_struct *work)
 --
 drivers/xen/pvcalls-back.c-589-
 drivers/xen/pvcalls-back.c:590:static void pvcalls_pass_sk_data_ready(struct sock *sock)
 drivers/xen/pvcalls-back.c-591-{
 --
 drivers/xen/pvcalls-back.c-597-
 drivers/xen/pvcalls-back.c:598:	trace_sk_data_ready(sock);
 drivers/xen/pvcalls-back.c-599-
 --
 drivers/xen/pvcalls-back.c=625=static int pvcalls_back_bind(struct xenbus_device *dev,
 --
 drivers/xen/pvcalls-back.c-668-	write_lock_bh(&map->sock->sk->sk_callback_lock);
 drivers/xen/pvcalls-back.c:669:	map->saved_data_ready = map->sock->sk->sk_data_ready;
 drivers/xen/pvcalls-back.c-670-	map->sock->sk->sk_user_data = map;
 drivers/xen/pvcalls-back.c:671:	map->sock->sk->sk_data_ready = pvcalls_pass_sk_data_ready;
 drivers/xen/pvcalls-back.c-672-	write_unlock_bh(&map->sock->sk->sk_callback_lock);
 --
 fs/dlm/lowcomms.c=172=static struct listen_sock_callbacks {
 fs/dlm/lowcomms.c-173-	void (*sk_error_report)(struct sock *);
 fs/dlm/lowcomms.c:174:	void (*sk_data_ready)(struct sock *);
 fs/dlm/lowcomms.c-175-	void (*sk_state_change)(struct sock *);
 --
 fs/dlm/lowcomms.c=506=static void lowcomms_data_ready(struct sock *sk)
 --
 fs/dlm/lowcomms.c-509-
 fs/dlm/lowcomms.c:510:	trace_sk_data_ready(sk);
 fs/dlm/lowcomms.c-511-
 --
 fs/dlm/lowcomms.c=532=static void lowcomms_state_change(struct sock *sk)
 fs/dlm/lowcomms.c-533-{
 fs/dlm/lowcomms.c:534:	/* SCTP layer is not calling sk_data_ready when the connection
 fs/dlm/lowcomms.c-535-	 * is done, so we catch the signal through here.
 --
 fs/dlm/lowcomms.c=541=static void lowcomms_listen_data_ready(struct sock *sk)
 fs/dlm/lowcomms.c-542-{
 fs/dlm/lowcomms.c:543:	trace_sk_data_ready(sk);
 fs/dlm/lowcomms.c-544-
 --
 fs/dlm/lowcomms.c=631=static void restore_callbacks(struct sock *sk)
 --
 fs/dlm/lowcomms.c-637-	sk->sk_user_data = NULL;
 fs/dlm/lowcomms.c:638:	sk->sk_data_ready = listen_sock.sk_data_ready;
 fs/dlm/lowcomms.c-639-	sk->sk_state_change = listen_sock.sk_state_change;
 --
 fs/dlm/lowcomms.c=645=static void add_sock(struct socket *sock, struct connection *con)
 --
 fs/dlm/lowcomms.c-652-	sk->sk_user_data = con;
 fs/dlm/lowcomms.c:653:	sk->sk_data_ready = lowcomms_data_ready;
 fs/dlm/lowcomms.c-654-	sk->sk_write_space = lowcomms_write_space;
 --
 fs/dlm/lowcomms.c=1717=void dlm_lowcomms_shutdown(void)
 --
 fs/dlm/lowcomms.c-1723-	lock_sock(listen_con.sock->sk);
 fs/dlm/lowcomms.c:1724:	listen_con.sock->sk->sk_data_ready = listen_sock.sk_data_ready;
 fs/dlm/lowcomms.c-1725-	release_sock(listen_con.sock->sk);
 --
 fs/dlm/lowcomms.c=1753=static int dlm_listen_for_all(void)
 --
 fs/dlm/lowcomms.c-1779-	lock_sock(sock->sk);
 fs/dlm/lowcomms.c:1780:	listen_sock.sk_data_ready = sock->sk->sk_data_ready;
 fs/dlm/lowcomms.c-1781-	listen_sock.sk_write_space = sock->sk->sk_write_space;
 --
 fs/dlm/lowcomms.c-1788-	sock->sk->sk_use_task_frag = false;
 fs/dlm/lowcomms.c:1789:	sock->sk->sk_data_ready = lowcomms_listen_data_ready;
 fs/dlm/lowcomms.c-1790-	release_sock(sock->sk);
 --
 fs/ocfs2/cluster/tcp.c=584=static void o2net_data_ready(struct sock *sk)
 --
 fs/ocfs2/cluster/tcp.c-588-
 fs/ocfs2/cluster/tcp.c:589:	trace_sk_data_ready(sk);
 fs/ocfs2/cluster/tcp.c-590-
 --
 fs/ocfs2/cluster/tcp.c-598-	} else {
 fs/ocfs2/cluster/tcp.c:599:		ready = sk->sk_data_ready;
 fs/ocfs2/cluster/tcp.c-600-	}
 --
 fs/ocfs2/cluster/tcp.c=648=static void o2net_register_callbacks(struct sock *sk,
 --
 fs/ocfs2/cluster/tcp.c-653-	/* accepted sockets inherit the old listen socket data ready */
 fs/ocfs2/cluster/tcp.c:654:	if (sk->sk_data_ready == o2net_listen_data_ready) {
 fs/ocfs2/cluster/tcp.c:655:		sk->sk_data_ready = sk->sk_user_data;
 fs/ocfs2/cluster/tcp.c-656-		sk->sk_user_data = NULL;
 --
 fs/ocfs2/cluster/tcp.c-662-
 fs/ocfs2/cluster/tcp.c:663:	sc->sc_data_ready = sk->sk_data_ready;
 fs/ocfs2/cluster/tcp.c-664-	sc->sc_state_change = sk->sk_state_change;
 fs/ocfs2/cluster/tcp.c:665:	sk->sk_data_ready = o2net_data_ready;
 fs/ocfs2/cluster/tcp.c-666-	sk->sk_state_change = o2net_state_change;
 --
 fs/ocfs2/cluster/tcp.c=673=static int o2net_unregister_callbacks(struct sock *sk,
 --
 fs/ocfs2/cluster/tcp.c-681-		sk->sk_user_data = NULL;
 fs/ocfs2/cluster/tcp.c:682:		sk->sk_data_ready = sc->sc_data_ready;
 fs/ocfs2/cluster/tcp.c-683-		sk->sk_state_change = sc->sc_state_change;
 --
 fs/ocfs2/cluster/tcp.c=1940=static void o2net_listen_data_ready(struct sock *sk)
 --
 fs/ocfs2/cluster/tcp.c-1943-
 fs/ocfs2/cluster/tcp.c:1944:	trace_sk_data_ready(sk);
 fs/ocfs2/cluster/tcp.c-1945-
 --
 fs/ocfs2/cluster/tcp.c-1948-	if (ready == NULL) { /* check for teardown race */
 fs/ocfs2/cluster/tcp.c:1949:		ready = sk->sk_data_ready;
 fs/ocfs2/cluster/tcp.c-1950-		goto out;
 --
 fs/ocfs2/cluster/tcp.c=1978=static int o2net_open_listening_sock(__be32 addr, __be16 port)
 --
 fs/ocfs2/cluster/tcp.c-1996-	write_lock_bh(&sock->sk->sk_callback_lock);
 fs/ocfs2/cluster/tcp.c:1997:	sock->sk->sk_user_data = sock->sk->sk_data_ready;
 fs/ocfs2/cluster/tcp.c:1998:	sock->sk->sk_data_ready = o2net_listen_data_ready;
 fs/ocfs2/cluster/tcp.c-1999-	write_unlock_bh(&sock->sk->sk_callback_lock);
 --
 fs/ocfs2/cluster/tcp.c=2060=void o2net_stop_listening(struct o2nm_node *node)
 --
 fs/ocfs2/cluster/tcp.c-2069-	write_lock_bh(&sock->sk->sk_callback_lock);
 fs/ocfs2/cluster/tcp.c:2070:	sock->sk->sk_data_ready = sock->sk->sk_user_data;
 fs/ocfs2/cluster/tcp.c-2071-	sock->sk->sk_user_data = NULL;
 --
 include/linux/skmsg.h=509=static inline void sk_psock_data_ready(struct sock *sk, struct sk_psock *psock)
 --
 include/linux/skmsg.h-514-	else
 include/linux/skmsg.h:515:		sk->sk_data_ready(sk);
 include/linux/skmsg.h-516-	read_unlock_bh(&sk->sk_callback_lock);
 --
 include/net/sock.h=237=struct sk_filter;
 --
 include/net/sock.h-334-  *	@sk_state_change: callback to indicate change in the state of the sock
 include/net/sock.h:335:  *	@sk_data_ready: callback to indicate there is data to be processed
 include/net/sock.h-336-  *	@sk_write_space: callback to indicate there is bf sending space available
 --
 include/net/sock.h=360=struct sock {
 --
 include/net/sock.h-447-
 include/net/sock.h:448:	void			(*sk_data_ready)(struct sock *sk);
 include/net/sock.h-449-	long			sk_rcvtimeo;
 --
 include/trace/events/sock.h=193=TRACE_EVENT(inet_sk_error_report,
 --
 include/trace/events/sock.h-239-
 include/trace/events/sock.h:240:TRACE_EVENT(sk_data_ready,
 include/trace/events/sock.h-241-
 --
]
]

29/2 2026/02/23 15:15 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
							input: 103558
							output: 45
							thoughts: 215
						

Thoughts:
**Analyzing `sk_data_ready` Usage**

Okay, so I've been digging through this codebase, and the `grepper` output is showing me a pattern with `sk_data_ready`. I'm seeing assignments, often protected by a `write_lock_bh(&sk->sk_callback_lock)`, which makes sense for data integrity. But the calls to `sk->sk_data_ready(sk)`... those seem a bit looser, sometimes without the same lock protection. That's immediately flagging a potential race condition in my mind, so I need to understand this better.

Then I saw this snippet in `include/linux/skmsg.h`. There's this `sk_psock_data_ready` function... hang on, what's going on here? The code says:

```c
 509:	static inline void sk_psock_data_ready(struct sock *sk, struct sk_psock *psock)
 ...
 514:		else
 515:			sk->sk_data_ready(sk);
 516:		read_unlock_bh(&sk->sk_callback_lock);
```

Wait a minute... *this* function *does* hold the `sk_callback_lock` (a read lock, but a lock nonetheless) before calling `sk->sk_data_ready(sk)`! Okay, that's crucial. I need to understand the full context of `sk_psock_data_ready`: where is it called from, and what's its overall purpose? This could be a clue. I need to trace this function and see what other code paths are in play; there may be interesting implications for the locking scheme here.



30/2 2026/02/23 15:15 tool codesearch-definition-source
0m Args:
map[ContextFile:include/linux/skmsg.h IncludeLines:true Name:sk_psock_data_ready]

Results:
map[SourceCode: 509:	static inline void sk_psock_data_ready(struct sock *sk, struct sk_psock *psock)
 510:	{
 511:		read_lock_bh(&sk->sk_callback_lock);
 512:		if (psock->saved_data_ready)
 513:			psock->saved_data_ready(sk);
 514:		else
 515:			sk->sk_data_ready(sk);
 516:		read_unlock_bh(&sk->sk_callback_lock);
 517:	}
 SourceFile:include/linux/skmsg.h]

31/2 2026/02/23 15:15 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
input: 104386
output: 43
thoughts: 639

Thoughts:
**Analyzing a Potential Data Race in the AF_UNIX Stream Sockets Implementation**

Okay, so I'm digging into this `skmsg` code and how it interacts with the standard `unix_stream_sendmsg` function. The `skmsg` implementation itself seems pretty careful, which is good. However, `unix_stream_sendmsg` isn't part of `skmsg`; it's a core AF_UNIX function, and that's where things get interesting. It isn't inherently aware of the `sk_callback_lock` requirements when calling `sk_data_ready`.

My initial thought is that if `sk_data_ready` *should* be protected by `sk_callback_lock`, then `unix_stream_sendmsg` could be considered buggy. But adding that lock in every `sendmsg` call seems like overkill and likely to hurt performance. Then again, if other network stacks like UDP and TCP don't use it, then maybe it's fine.

The potential race condition I'm seeing involves `sk_psock`, which uses `sk_callback_lock` when updating the callback, versus the rest of the network stack, which calls the callback without the lock. If it is truly benign, meaning it causes no crash or corruption, I can classify it as such. I'm a bit concerned about the packet-loss issue I found, but that appears to be a flaw in `sk_psock`'s destruction logic rather than something that would make this a "harmful" race in the KCSAN sense (where the primary concern is memory corruption).

Now, the KCSAN report specifically flags it as a "data-race". If I conclude it's benign, I need to provide a very clear explanation. This is a race on a function pointer. The write is the restoration of a previous valid value; the read is the function invocation. Both values are valid function pointers. The access itself is atomic on this target architecture. Therefore, no memory corruption or crash should happen.

But there are other risks I need to consider. What about compiler optimizations, such as reusing the result of an earlier load? In `unix_stream_sendmsg`, the callback is invoked once per loop iteration (assuming multiple skbs are sent). Even if the pointer value changes between iterations, the next iteration simply reads the new value, which is fine.

I need to confirm whether the write in `sk_psock_stop_verdict` uses `WRITE_ONCE`. It doesn't, which is problematic. Likewise, `unix_stream_sendmsg` doesn't use `READ_ONCE` at the call site. If I were to fix this, I would add `READ_ONCE` to the call site and `WRITE_ONCE` to the assignment. But my job is just analysis; I should conclude whether it's benign or not.

Considering similar cases I've seen in the kernel, these are often marked as benign.

I need to make one more check: Does `sk_psock_stop_verdict` do anything else that could be problematic? It clears `psock->progs.stream_verdict` and `psock->progs.skb_verdict`. Both are function pointers (to BPF programs) as well. They're also updated without `WRITE_ONCE` in `psock_set_prog`, which I need to check immediately.

Wait, now let me check `psock_set_prog`.



32/2 2026/02/23 15:16 tool codesearch-definition-source
0m Args:
map[ContextFile:include/linux/skmsg.h IncludeLines:true Name:psock_set_prog]

Results:
map[SourceCode: 519:	static inline void psock_set_prog(struct bpf_prog **pprog,
 520:					  struct bpf_prog *prog)
 521:	{
 522:		prog = xchg(pprog, prog);
 523:		if (prog)
 524:			bpf_prog_put(prog);
 525:	}
 SourceFile:include/linux/skmsg.h]

33/2 2026/02/23 15:16 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
input: 105939
output: 19
thoughts: 1009

Thoughts:
**Analysis of Potential Race Condition in psock Implementation**

Okay, so I've been digging into this `psock_set_prog` function and the surrounding code, trying to understand a potential race condition flagged.  It appears that the BPF program updates are safe, thanks to the atomic `xchg` operation and its full barrier. That's a relief.

My concern centers on `sk->sk_data_ready = psock->saved_data_ready;`. I thought, why *didn't* they use `xchg` here, too? I'm assuming it's because `sk_data_ready` is a standard field, and they were hesitant to introduce atomic operations unnecessarily. That seems reasonable, but now I need to check the implications.

Now I'm looking at `sk_psock_drop`. Everything in there is protected by `sk_callback_lock`, which should prevent a race. If a reader *doesn't* take that lock, then it's a race, but since it's just a single pointer, the race is "benign" in terms of memory safety - I think.

My main worry now is around packet loss. If `sk_psock_verdict_data_ready` gets called, and `psock` is NULL, the packet gets dropped. This is the case if the reader sees the *old* `sk_data_ready` but the *new* NULL `sk_user_data`.  Since `rcu_assign_sk_user_data(sk, NULL)` happens *before* `sk_psock_stop_verdict`, this is indeed possible.  `rcu_assign_sk_user_data` has a release barrier, and `sk_psock(sk)` has an acquire barrier. If the reader sees the old `sk_data_ready` it *might* see the new NULL `sk_user_data`, which triggers the packet drop.

Hold on, if the reader sees the *old* `sk_data_ready`, it must still be running `sk_psock_verdict_data_ready`. Inside that it's calling `sk_psock(sk)`. If that's NULL it drops the packet.

Yeah, that *is* a small window for packet loss. But is this what KCSAN would flag as a "harmful data race"? KCSAN's primary focus is usually memory safety.  Functional bugs due to races are important, but in the KCSAN context, "benign" often means "safe from memory corruption". Considering this occurs during `unix_release`, which often leads to dropped packets anyway, I'm leaning toward it being a "benign" data race.

Looking at `sk_psock_drop` again, the design seems to have been made *with* `sk_callback_lock` in mind to prevent this. It's the fact that *some* callers don't take the lock that's the issue.  Fixing all callers to use the lock isn't really feasible either.

Therefore, I'll go with "benign".

Just to be certain, I want to check for any *other* writes to `sk_data_ready` that could be racing. The report only shows this one, which confirms my analysis.

Final check of the report: Write: `sk->sk_data_ready = psock->saved_data_ready;` Read: `other->sk_data_ready(other);` Value changed: `sk_psock_verdict_data_ready` -> `sock_def_readable`.

Yep, that's exactly what I analyzed.

One more quick check: I should verify that line 1311 of `sk_psock_stop_verdict` is the point where we are doing the write to `sk->sk_data_ready = psock->saved_data_ready;` Yes.

And the read from `unix_stream_sendmsg` on line 2480: `other->sk_data_ready(other);`. Yes.

Everything checks out. I'm ready to provide the verdict.



34/2 2026/02/23 15:16 tool set-results
0m Args:
map[Benign:true Confident:true]

Results:
map[Benign:true Confident:true]

35/2 2026/02/23 15:16 llm expert
0m Model:
gemini-3-flash-preview

Tokens:
input: 107883
output: 634
thoughts: 372

Thoughts:
**Analyzing the Data Race in the `sk_data_ready` Callback**

Okay, so I've been looking at this data race, and here's what I've figured out. It appears the conflict is specifically between the `sk_psock_stop_verdict` function, which is writing to the `sk->sk_data_ready` field, and the `unix_stream_sendmsg` function, which reads and calls `other->sk_data_ready`. Essentially, we're talking about a race condition involving a function pointer within the `struct sock`.

The write operation, the one that modifies the `sk_data_ready` pointer, occurs when a socket is either being removed from a BPF sockmap or is simply being closed. What it's doing is restoring the original `sk_data_ready` callback that was saved when the psock (program socket) was first initialized.

The read, on the other hand, happens within the data path of another socket that's trying to send a message to the socket in question. It calls the current value of the function pointer.

Now, here's why I believe this race is benign. First, the old and new values of the function pointer are both valid kernel functions: `sk_psock_verdict_data_ready` and the original `sk_data_ready` callback (which is usually `sock_def_readable`). Second, on the architectures the kernel supports – think x86_64, which is mentioned in the report – a pointer-sized aligned store and load operation is atomic. So even if the read and write happen simultaneously, the pointer value will be consistent, either the old or the new value.

Furthermore, both of the functions that could be called are safe. `sk_psock_verdict_data_ready` can handle the situation where the psock has been detached, and it is designed for that scenario (checking `sk_psock(sk)`). Fourth, there's a minor possibility that a packet might be dropped if `sk_psock_verdict_data_ready` is called after the psock is detached but before the callback is completely restored. But this is more of a functional quirk during socket teardown or reconfiguration, and it doesn’t lead to memory corruption or any kind of system instability. Finally, this pattern of invoking callbacks without locks is actually fairly common in the networking stack, which suggests that performance is a factor here.

So, summing it up, while there is a data race, it doesn't lead to crashes or memory corruption. Based on the arguments above, it can be considered a benign data race.


