syzbot |
sign-in | mailing list | source | docs | 🏰 |
| ID | Workflow | Result | Correct | Bug | Created | Started | Finished | Revision | Error |
|---|---|---|---|---|---|---|---|---|---|
| a648c79b-7b55-47bd-ad5a-0cd8fcc3aa91 | assessment-kcsan | Benign: ✅ Confident: ✅ | ❓ | KCSAN: data-race in __inet_lookup_established / inet_put_port (5) | 2026/02/23 14:31 | 2026/02/23 14:31 | 2026/02/23 14:36 | 305c0ec5cd886e2d13738e28e1b2df9b0ec20fc9 |
The data race occurs between `__inet_put_port` and `inet_match`. 1. **Write Access**: In `__inet_put_port`, the field `inet_sk(sk)->inet_num` (which is a macro for `sk->__sk_common.skc_num`) is set to `0`. This function is called when a socket is being closed or destroyed (e.g., in `tcp_done` or `tcp_time_wait` when a time-wait socket cannot be allocated). 2. **Read Access**: In `inet_match` (called by `__inet_lookup_established`), the field `sk->sk_portpair` is read. `sk_portpair` is a 32-bit union that includes `skc_num` (the local port) and `skc_dport` (the foreign port). This read is part of the process of matching an incoming packet to an established socket. **Analysis:** The race happens when a packet arrives at the same time a socket is being closed. The socket is first unhashed from the established hash table (`ehash`) using RCU-safe primitives. However, a concurrent reader in `__inet_lookup_established` might still see the socket if it started its lookup before the unhashing was complete. If the reader sees `inet_num = 0` during the `sk_portpair` check, `inet_match` will return `false`, and the lookup will continue to the next socket. This is correct behavior for a socket that is being removed. If the reader sees the old port value, `inet_match` returns `true`. The lookup algorithm then attempts to increment the socket's reference count using `refcount_inc_not_zero`. If successful, it **re-checks** `inet_match`. This double-check pattern is specifically designed to handle concurrent changes to the socket's identity or state during RCU lookups. If the second check fails (because `inet_num` was set to 0), the reference is dropped, and the lookup restarts. The value change reported (`0x400190d0 -> 0x000090d0`) confirms that the upper 16 bits (the local port `skc_num`) were changed to 0, while the lower 16 bits (`skc_dport`) remained unchanged. Because the lookup algorithm is robust against such concurrent changes and the result of the race (failing to match a closing socket) is consistent with the system state, the data race is benign. It does not lead to memory corruption, crashes, or incorrect protocol behavior. Final Verdict: Benign.
================================================================== BUG: KCSAN: data-race in __inet_lookup_established / inet_put_port write to 0xffff88810b7c30ce of 2 bytes by task 162 on cpu 1: __inet_put_port net/ipv4/inet_hashtables.c:227 [inline] inet_put_port+0x197/0x640 net/ipv4/inet_hashtables.c:247 tcp_set_state net/ipv4/tcp.c:3053 [inline] tcp_done+0x1af/0x390 net/ipv4/tcp.c:5072 tcp_time_wait+0x4fd/0x5c0 net/ipv4/tcp_minisocks.c:394 __tcp_close+0x95b/0x1130 net/ipv4/tcp.c:3308 tcp_close+0x28/0xd0 net/ipv4/tcp.c:3349 inet_release+0xcd/0xf0 net/ipv4/af_inet.c:437 inet6_release+0x3e/0x60 net/ipv6/af_inet6.c:487 __sock_release net/socket.c:662 [inline] sock_release+0x4b/0xe0 net/socket.c:690 rds_tcp_accept_one+0x335/0x900 net/rds/tcp_listen.c:359 rds_tcp_accept_worker+0x1a/0x60 net/rds/tcp.c:524 process_one_work kernel/workqueue.c:3275 [inline] process_scheduled_works+0x4de/0x9e0 kernel/workqueue.c:3358 worker_thread+0x581/0x770 kernel/workqueue.c:3439 kthread+0x22a/0x280 kernel/kthread.c:467 ret_from_fork+0x150/0x360 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 read to 0xffff88810b7c30cc of 4 bytes by interrupt on cpu 0: inet_match include/net/inet_hashtables.h:348 [inline] __inet_lookup_established+0x141/0x520 net/ipv4/inet_hashtables.c:549 __inet_lookup include/net/inet_hashtables.h:405 [inline] __inet_lookup_skb include/net/inet_hashtables.h:490 [inline] tcp_v4_rcv+0x984/0x1db0 net/ipv4/tcp_ipv4.c:2187 ip_protocol_deliver_rcu+0x395/0x790 net/ipv4/ip_input.c:207 ip_local_deliver_finish+0x1fc/0x2f0 net/ipv4/ip_input.c:241 NF_HOOK include/linux/netfilter.h:318 [inline] ip_local_deliver+0xe8/0x1e0 net/ipv4/ip_input.c:262 dst_input include/net/dst.h:480 [inline] ip_rcv_finish+0x194/0x1c0 net/ipv4/ip_input.c:453 NF_HOOK include/linux/netfilter.h:318 [inline] ip_rcv+0x62/0x160 net/ipv4/ip_input.c:573 __netif_receive_skb_one_core net/core/dev.c:6156 [inline] __netif_receive_skb net/core/dev.c:6269 [inline] process_backlog+0x341/0x670 net/core/dev.c:6620 __napi_poll+0x61/0x330 net/core/dev.c:7684 napi_poll net/core/dev.c:7747 [inline] net_rx_action+0x452/0x930 net/core/dev.c:7899 handle_softirqs+0xb9/0x2a0 kernel/softirq.c:622 do_softirq+0x45/0x60 kernel/softirq.c:523 __local_bh_enable_ip+0x70/0x80 kernel/softirq.c:450 local_bh_enable include/linux/bottom_half.h:33 [inline] rcu_read_unlock_bh include/linux/rcupdate.h:924 [inline] __dev_queue_xmit+0x1393/0x1f40 net/core/dev.c:4863 dev_queue_xmit include/linux/netdevice.h:3384 [inline] neigh_hh_output include/net/neighbour.h:540 [inline] neigh_output include/net/neighbour.h:554 [inline] ip_finish_output2+0x705/0x8c0 net/ipv4/ip_output.c:237 __ip_finish_output net/ipv4/ip_output.c:-1 [inline] ip_finish_output+0x114/0x2a0 net/ipv4/ip_output.c:325 NF_HOOK_COND include/linux/netfilter.h:307 [inline] ip_output+0xbd/0x1c0 net/ipv4/ip_output.c:438 dst_output include/net/dst.h:470 [inline] ip_local_out net/ipv4/ip_output.c:131 [inline] __ip_queue_xmit+0xbb0/0xbf0 net/ipv4/ip_output.c:534 ip_queue_xmit+0x39/0x50 net/ipv4/ip_output.c:548 __tcp_transmit_skb+0x1af2/0x1f10 net/ipv4/tcp_output.c:1693 tcp_transmit_skb net/ipv4/tcp_output.c:1711 [inline] tcp_write_xmit+0x1597/0x3640 net/ipv4/tcp_output.c:3064 __tcp_push_pending_frames+0x6d/0x1b0 net/ipv4/tcp_output.c:3247 tcp_send_fin+0x58a/0x6a0 net/ipv4/tcp_output.c:3846 tcp_shutdown+0xbe/0xd0 net/ipv4/tcp.c:3121 inet_shutdown+0x1c2/0x210 net/ipv4/af_inet.c:931 rds_tcp_conn_path_shutdown+0x12e/0x5e0 net/rds/tcp_connect.c:248 rds_conn_shutdown+0x293/0x4e0 net/rds/connection.c:414 rds_shutdown_worker+0x1c/0x30 net/rds/threads.c:249 process_one_work kernel/workqueue.c:3275 [inline] process_scheduled_works+0x4de/0x9e0 kernel/workqueue.c:3358 worker_thread+0x581/0x770 kernel/workqueue.c:3439 kthread+0x22a/0x280 kernel/kthread.c:467 ret_from_fork+0x150/0x360 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 value changed: 0x400190d0 -> 0x000090d0 Reported by Kernel Concurrency Sanitizer on: CPU: 0 UID: 0 PID: 30 Comm: kworker/u8:1 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2026 Workqueue: krds_cp_wq#9/0 rds_shutdown_worker ==================================================================
| Seq | Timestamp | Type | Name | Duration |
|---|---|---|---|---|
| 0/0 | 2026/02/23 14:31 | flow | assessment-kcsan |
5mResults:map[Benign:true Confident:true Explanation:The data race occurs between `__inet_put_port` and `inet_match`. 1. **Write Access**: In `__inet_put_port`, the field `inet_sk(sk)->inet_num` (which is a macro for `sk->__sk_common.skc_num`) is set to `0`. This function is called when a socket is being closed or destroyed (e.g., in `tcp_done` or `tcp_time_wait` when a time-wait socket cannot be allocated). 2. **Read Access**: In `inet_match` (called by `__inet_lookup_established`), the field `sk->sk_portpair` is read. `sk_portpair` is a 32-bit union that includes `skc_num` (the local port) and `skc_dport` (the foreign port). This read is part of the process of matching an incoming packet to an established socket. **Analysis:** The race happens when a packet arrives at the same time a socket is being closed. The socket is first unhashed from the established hash table (`ehash`) using RCU-safe primitives. However, a concurrent reader in `__inet_lookup_established` might still see the socket if it started its lookup before the unhashing was complete. If the reader sees `inet_num = 0` during the `sk_portpair` check, `inet_match` will return `false`, and the lookup will continue to the next socket. This is correct behavior for a socket that is being removed. If the reader sees the old port value, `inet_match` returns `true`. The lookup algorithm then attempts to increment the socket's reference count using `refcount_inc_not_zero`. If successful, it **re-checks** `inet_match`. This double-check pattern is specifically designed to handle concurrent changes to the socket's identity or state during RCU lookups. If the second check fails (because `inet_num` was set to 0), the reference is dropped, and the lookup restarts. The value change reported (`0x400190d0 -> 0x000090d0`) confirms that the upper 16 bits (the local port `skc_num`) were changed to 0, while the lower 16 bits (`skc_dport`) remained unchanged. Because the lookup algorithm is robust against such concurrent changes and the result of the race (failing to match a closing socket) is consistent with the system state, the data race is benign. It does not lead to memory corruption, crashes, or incorrect protocol behavior. Final Verdict: Benign.] |
| 1/1 | 2026/02/23 14:31 | action | kernel-checkouter |
0mResults:map[KernelSrc:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/src/c60cb8f389456b7f3a86008e464b61ef03e87c3f] |
| 2/1 | 2026/02/23 14:31 | action | kernel-builder |
0mResults:map[KernelObj:/usr/local/google/home/dvyukov/syzkaller/agent/workdir/cache/build/378d42f3f6df6e27321f8267aecfa7afea9cf38e] |
| 3/1 | 2026/02/23 14:31 | action | codesearch-prepare |
0mResults:map[Index:codesearch-index] |
| 4/1 | 2026/02/23 14:31 | agent | expert |
5mModel:gemini-3-flash-preview Results: map[Benign:true Confident:true] Instruction: You are an experienced Linux kernel developer tasked with determining if the given kernel data race is benign or not. The data race report is from KCSAN tool. It contains 2 stack traces of the memory accesses that constitute a data race. A "benign" data races are on a simple int/bool variable or similar field, and the accesses are not supposed to be protected by any mutual exclusion primitives. Common examples of such "benign" data races are accesses to various flags fields, statistics counters, and similar. A "benign" data race does not lead to memory corruption/crash with a conservative compiler that compiles memory accesses to primitive types effectively as atomic. A non-benign (or "harmful" data race) can lead to corruption/crash even with a conservative compiler that compiles memory accesses to primitive types effectively as atomic. A common example of a "harmful" data races is race on a complex container (list/hashmap/etc), where accesses are supposed to be protected by a mutual exclusion primitive. Also consider races that happen at the same time with the given one. If there is no synchronization in between, other memory accesses in the involved threads race with each other if they access the same memory. For example, if both threads execute: some_struct->have_elements = true; list_add(new_node, &some_struct->list_head); the race on some_struct->have_elements may appear benign, however it also implies there is a race on some_struct->list_head which is not benign, since the list is not thread-safe. Take into account that on 32-bit systems 64-bit memory accesses may be split into two accesses, and thus even with a conservative compiler may not be fully atomic. However, such races may still be benign depending on how writes are done, and how read data is used. In the final reply explain why you think the given data race is benign or is harmful. Use the provided tools to confirm any assumptions, variables/fields being accessed, etc. In particular, don't make assumptions about the kernel source code, use codesearch tools to read the actual source code. Prefer calling several tools at the same time to save round-trips. Use set-results tool to provide results of the analysis. It must be called exactly once before the final reply. Ignore results of this tool. Prompt: The data race report is: ================================================================== BUG: KCSAN: data-race in __inet_lookup_established / inet_put_port write to 0xffff88810b7c30ce of 2 bytes by task 162 on cpu 1: __inet_put_port net/ipv4/inet_hashtables.c:227 [inline] inet_put_port+0x197/0x640 net/ipv4/inet_hashtables.c:247 tcp_set_state net/ipv4/tcp.c:3053 [inline] tcp_done+0x1af/0x390 net/ipv4/tcp.c:5072 tcp_time_wait+0x4fd/0x5c0 net/ipv4/tcp_minisocks.c:394 __tcp_close+0x95b/0x1130 net/ipv4/tcp.c:3308 tcp_close+0x28/0xd0 net/ipv4/tcp.c:3349 inet_release+0xcd/0xf0 net/ipv4/af_inet.c:437 inet6_release+0x3e/0x60 net/ipv6/af_inet6.c:487 __sock_release net/socket.c:662 [inline] sock_release+0x4b/0xe0 net/socket.c:690 rds_tcp_accept_one+0x335/0x900 net/rds/tcp_listen.c:359 rds_tcp_accept_worker+0x1a/0x60 net/rds/tcp.c:524 process_one_work kernel/workqueue.c:3275 [inline] process_scheduled_works+0x4de/0x9e0 kernel/workqueue.c:3358 worker_thread+0x581/0x770 kernel/workqueue.c:3439 kthread+0x22a/0x280 kernel/kthread.c:467 ret_from_fork+0x150/0x360 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 read to 0xffff88810b7c30cc of 4 bytes by interrupt on cpu 0: inet_match include/net/inet_hashtables.h:348 [inline] __inet_lookup_established+0x141/0x520 net/ipv4/inet_hashtables.c:549 __inet_lookup include/net/inet_hashtables.h:405 [inline] __inet_lookup_skb include/net/inet_hashtables.h:490 [inline] tcp_v4_rcv+0x984/0x1db0 net/ipv4/tcp_ipv4.c:2187 ip_protocol_deliver_rcu+0x395/0x790 net/ipv4/ip_input.c:207 ip_local_deliver_finish+0x1fc/0x2f0 net/ipv4/ip_input.c:241 NF_HOOK include/linux/netfilter.h:318 [inline] ip_local_deliver+0xe8/0x1e0 net/ipv4/ip_input.c:262 dst_input include/net/dst.h:480 [inline] ip_rcv_finish+0x194/0x1c0 net/ipv4/ip_input.c:453 NF_HOOK include/linux/netfilter.h:318 [inline] ip_rcv+0x62/0x160 net/ipv4/ip_input.c:573 __netif_receive_skb_one_core net/core/dev.c:6156 [inline] __netif_receive_skb net/core/dev.c:6269 [inline] process_backlog+0x341/0x670 net/core/dev.c:6620 __napi_poll+0x61/0x330 net/core/dev.c:7684 napi_poll net/core/dev.c:7747 [inline] net_rx_action+0x452/0x930 net/core/dev.c:7899 handle_softirqs+0xb9/0x2a0 kernel/softirq.c:622 do_softirq+0x45/0x60 kernel/softirq.c:523 __local_bh_enable_ip+0x70/0x80 kernel/softirq.c:450 local_bh_enable include/linux/bottom_half.h:33 [inline] rcu_read_unlock_bh include/linux/rcupdate.h:924 [inline] __dev_queue_xmit+0x1393/0x1f40 net/core/dev.c:4863 dev_queue_xmit include/linux/netdevice.h:3384 [inline] neigh_hh_output include/net/neighbour.h:540 [inline] neigh_output include/net/neighbour.h:554 [inline] ip_finish_output2+0x705/0x8c0 net/ipv4/ip_output.c:237 __ip_finish_output net/ipv4/ip_output.c:-1 [inline] ip_finish_output+0x114/0x2a0 net/ipv4/ip_output.c:325 NF_HOOK_COND include/linux/netfilter.h:307 [inline] ip_output+0xbd/0x1c0 net/ipv4/ip_output.c:438 dst_output include/net/dst.h:470 [inline] ip_local_out net/ipv4/ip_output.c:131 [inline] __ip_queue_xmit+0xbb0/0xbf0 net/ipv4/ip_output.c:534 ip_queue_xmit+0x39/0x50 net/ipv4/ip_output.c:548 __tcp_transmit_skb+0x1af2/0x1f10 net/ipv4/tcp_output.c:1693 tcp_transmit_skb net/ipv4/tcp_output.c:1711 [inline] tcp_write_xmit+0x1597/0x3640 net/ipv4/tcp_output.c:3064 __tcp_push_pending_frames+0x6d/0x1b0 net/ipv4/tcp_output.c:3247 tcp_send_fin+0x58a/0x6a0 net/ipv4/tcp_output.c:3846 tcp_shutdown+0xbe/0xd0 net/ipv4/tcp.c:3121 inet_shutdown+0x1c2/0x210 net/ipv4/af_inet.c:931 rds_tcp_conn_path_shutdown+0x12e/0x5e0 net/rds/tcp_connect.c:248 rds_conn_shutdown+0x293/0x4e0 net/rds/connection.c:414 rds_shutdown_worker+0x1c/0x30 net/rds/threads.c:249 process_one_work kernel/workqueue.c:3275 [inline] process_scheduled_works+0x4de/0x9e0 kernel/workqueue.c:3358 worker_thread+0x581/0x770 kernel/workqueue.c:3439 kthread+0x22a/0x280 kernel/kthread.c:467 ret_from_fork+0x150/0x360 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245 value changed: 0x400190d0 -> 0x000090d0 Reported by Kernel Concurrency Sanitizer on: CPU: 0 UID: 0 PID: 30 Comm: kworker/u8:1 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 02/12/2026 Workqueue: krds_cp_wq#9/0 rds_shutdown_worker ================================================================== Reply: The data race occurs between `__inet_put_port` and `inet_match`. 1. **Write Access**: In `__inet_put_port`, the field `inet_sk(sk)->inet_num` (which is a macro for `sk->__sk_common.skc_num`) is set to `0`. This function is called when a socket is being closed or destroyed (e.g., in `tcp_done` or `tcp_time_wait` when a time-wait socket cannot be allocated). 2. **Read Access**: In `inet_match` (called by `__inet_lookup_established`), the field `sk->sk_portpair` is read. `sk_portpair` is a 32-bit union that includes `skc_num` (the local port) and `skc_dport` (the foreign port). This read is part of the process of matching an incoming packet to an established socket. **Analysis:** The race happens when a packet arrives at the same time a socket is being closed. The socket is first unhashed from the established hash table (`ehash`) using RCU-safe primitives. However, a concurrent reader in `__inet_lookup_established` might still see the socket if it started its lookup before the unhashing was complete. If the reader sees `inet_num = 0` during the `sk_portpair` check, `inet_match` will return `false`, and the lookup will continue to the next socket. This is correct behavior for a socket that is being removed. If the reader sees the old port value, `inet_match` returns `true`. The lookup algorithm then attempts to increment the socket's reference count using `refcount_inc_not_zero`. If successful, it **re-checks** `inet_match`. This double-check pattern is specifically designed to handle concurrent changes to the socket's identity or state during RCU lookups. If the second check fails (because `inet_num` was set to 0), the reference is dropped, and the lookup restarts. The value change reported (`0x400190d0 -> 0x000090d0`) confirms that the upper 16 bits (the local port `skc_num`) were changed to 0, while the lower 16 bits (`skc_dport`) remained unchanged. Because the lookup algorithm is robust against such concurrent changes and the result of the race (failing to match a closing socket) is consistent with the system state, the data race is benign. It does not lead to memory corruption, crashes, or incorrect protocol behavior. Final Verdict: Benign. |
| 5/2 | 2026/02/23 14:31 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 5266 output: 88 thoughts: 309 Thoughts: **Analyzing a Data Race in the IPv4 Hash Table** Okay, the report indicates a classic data race – `__inet_put_port` is writing while `inet_match` is reading. I need to get to the bottom of this. The write operation to memory address `0xffff88810b7c30ce` involves 2 bytes, and the read is targeting `0xffff88810b7c30cc` and reads 4 bytes. Crucially, the addresses overlap; `0xffff88810b7c30ce` falls within the span of the `inet_match` read from `0xffff88810b7c30cc` through `0xffff88810b7c30cf`. Specifically, the write is touching the upper two bytes of a 4-byte read, which is a textbook example of the kind of problem that'll cause havoc. The value being modified went from `0x400190d0` to `0x000090d0`, and so clearly, it's the upper 2 bytes being wiped. Time to dive into the code. The write operation, `__inet_put_port`, is defined in `net/ipv4/inet_hashtables.c:227`, and the read, `inet_match`, is in `include/net/inet_hashtables.h:348`. My next step is going to be tracing these locations. Let's see what's actually going on. |
| 6/2 | 2026/02/23 14:31 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:net/ipv4/inet_hashtables.c IncludeLines:true Name:__inet_put_port] Results: map[SourceCode: 209: /*
210: * Get rid of any references to a local port held by the given sock.
211: */
212: static void __inet_put_port(struct sock *sk)
213: {
214: struct inet_hashinfo *hashinfo = tcp_get_hashinfo(sk);
215: struct inet_bind_hashbucket *head, *head2;
216: struct net *net = sock_net(sk);
217: struct inet_bind_bucket *tb;
218: int bhash;
219:
220: bhash = inet_bhashfn(net, inet_sk(sk)->inet_num, hashinfo->bhash_size);
221: head = &hashinfo->bhash[bhash];
222: head2 = inet_bhashfn_portaddr(hashinfo, sk, net, inet_sk(sk)->inet_num);
223:
224: spin_lock(&head->lock);
225: tb = inet_csk(sk)->icsk_bind_hash;
226: inet_csk(sk)->icsk_bind_hash = NULL;
227: inet_sk(sk)->inet_num = 0;
228: sk->sk_userlocks &= ~SOCK_CONNECT_BIND;
229:
230: spin_lock(&head2->lock);
231: if (inet_csk(sk)->icsk_bind2_hash) {
232: struct inet_bind2_bucket *tb2 = inet_csk(sk)->icsk_bind2_hash;
233:
234: __sk_del_bind_node(sk);
235: inet_csk(sk)->icsk_bind2_hash = NULL;
236: inet_bind2_bucket_destroy(hashinfo->bind2_bucket_cachep, tb2);
237: }
238: spin_unlock(&head2->lock);
239:
240: inet_bind_bucket_destroy(tb);
241: spin_unlock(&head->lock);
242: }
SourceFile:net/ipv4/inet_hashtables.c] |
| 7/2 | 2026/02/23 14:31 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:include/net/inet_hashtables.h IncludeLines:true Name:inet_match] Results: map[SourceCode: 343: static inline bool inet_match(const struct net *net, const struct sock *sk,
344: const __addrpair cookie, const __portpair ports,
345: int dif, int sdif)
346: {
347: if (!net_eq(sock_net(sk), net) ||
348: sk->sk_portpair != ports ||
349: sk->sk_addrpair != cookie)
350: return false;
351:
352: /* READ_ONCE() paired with WRITE_ONCE() in sock_bindtoindex_locked() */
353: return inet_sk_bound_dev_eq(net, READ_ONCE(sk->sk_bound_dev_if), dif,
354: sdif);
355: }
SourceFile:include/net/inet_hashtables.h] |
| 8/2 | 2026/02/23 14:31 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 6894 output: 73 thoughts: 158 Thoughts: **Analyzing Potential Network Socket Port Conflict** Okay, so I'm looking at this code and I see a potential port assignment issue. Specifically, I'm concerned about how `inet_sk(sk)->inet_num` gets set to zero at line 227 in `net/ipv4/inet_hashtables.c`, and then the subsequent comparison `sk->sk_portpair != ports` at line 348 in `include/net/inet_hashtables.h`. These lines are clearly involved in some sort of port mapping or conflict detection. My immediate thought is that `sk->sk_portpair` is most likely a combined field, probably a 4-byte value that encapsulates both source and destination ports. And, given the context, `inet_sk(sk)->inet_num` *seems* to be the specific field holding the source port. This is my working hypothesis, based on my experience. To confirm this and gain a precise understanding of the data structures involved, I need to verify the layout of `struct sock` and `struct inet_sock`. I'll fire up `codesearch-struct-layout` to get a definitive view of these structures. Let me cross-reference the actual field offsets to confirm my initial assessment of the relationship between these variables and where they are located in memory. That is the next logical step to ensure I understand the impact of the assignments at hand. |
| 9/2 | 2026/02/23 14:31 | tool | codesearch-struct-layout |
0mArgs:map[ContextFile:include/net/inet_hashtables.h Name:sock] Results: map[Fields:[map[Name:__sk_common OffsetBits:0 SizeBits:1088] map[Name:__cacheline_group_begin__sock_write_rx OffsetBits:1088 SizeBits:0] map[Name:sk_drops OffsetBits:1088 SizeBits:32] map[Name:sk_peek_off OffsetBits:1120 SizeBits:32] map[Name:sk_error_queue OffsetBits:1152 SizeBits:192] map[Name:sk_receive_queue OffsetBits:1344 SizeBits:192] map[Name:sk_backlog OffsetBits:1536 SizeBits:192] map[Name:__cacheline_group_end__sock_write_rx OffsetBits:1728 SizeBits:0] map[Name:__cacheline_group_begin__sock_read_rx OffsetBits:1728 SizeBits:0] map[Name:sk_rx_dst OffsetBits:1728 SizeBits:64] map[Name:sk_rx_dst_ifindex OffsetBits:1792 SizeBits:32] map[Name:sk_rx_dst_cookie OffsetBits:1824 SizeBits:32] map[Name:sk_ll_usec OffsetBits:1856 SizeBits:32] map[Name:sk_napi_id OffsetBits:1888 SizeBits:32] map[Name:sk_busy_poll_budget OffsetBits:1920 SizeBits:16] map[Name:sk_prefer_busy_poll OffsetBits:1936 SizeBits:8] map[Name:sk_userlocks OffsetBits:1944 SizeBits:8] map[Name:sk_rcvbuf OffsetBits:1952 SizeBits:32] map[Name:sk_filter OffsetBits:1984 SizeBits:64] map[Name: OffsetBits:2048 SizeBits:64] map[Name:sk_data_ready OffsetBits:2112 SizeBits:64] map[Name:sk_rcvtimeo OffsetBits:2176 SizeBits:64] map[Name:sk_rcvlowat OffsetBits:2240 SizeBits:32] map[Name:__cacheline_group_end__sock_read_rx OffsetBits:2272 SizeBits:0] map[Name:__cacheline_group_begin__sock_read_rxtx OffsetBits:2272 SizeBits:0] map[Name:sk_err OffsetBits:2272 SizeBits:32] map[Name:sk_socket OffsetBits:2304 SizeBits:64] map[Name:sk_memcg OffsetBits:2368 SizeBits:64] map[Name:sk_policy OffsetBits:2432 SizeBits:128] map[Name:psp_assoc OffsetBits:2560 SizeBits:64] map[Name:__cacheline_group_end__sock_read_rxtx OffsetBits:2624 SizeBits:0] map[Name:__cacheline_group_begin__sock_write_rxtx OffsetBits:2624 SizeBits:0] map[Name:sk_lock OffsetBits:2624 SizeBits:256] map[Name:sk_reserved_mem OffsetBits:2880 SizeBits:32] map[Name:sk_forward_alloc OffsetBits:2912 SizeBits:32] map[Name:sk_tsflags OffsetBits:2944 SizeBits:32] map[Name:__cacheline_group_end__sock_write_rxtx OffsetBits:2976 SizeBits:0] map[Name:__cacheline_group_begin__sock_write_tx OffsetBits:2976 SizeBits:0] map[Name:sk_write_pending OffsetBits:2976 SizeBits:32] map[Name:sk_omem_alloc OffsetBits:3008 SizeBits:32] map[Name:sk_err_soft OffsetBits:3040 SizeBits:32] map[Name:sk_wmem_queued OffsetBits:3072 SizeBits:32] map[Name:sk_wmem_alloc OffsetBits:3104 SizeBits:32] map[Name:sk_tsq_flags OffsetBits:3136 SizeBits:64] map[Name: OffsetBits:3200 SizeBits:64] map[Name:sk_write_queue OffsetBits:3264 SizeBits:192] map[Name:sk_frag OffsetBits:3456 SizeBits:128] map[Name: OffsetBits:3584 SizeBits:320] map[Name:sk_pacing_rate OffsetBits:3904 SizeBits:64] map[Name:sk_zckey OffsetBits:3968 SizeBits:32] map[Name:sk_tskey OffsetBits:4000 SizeBits:32] map[Name:sk_tx_queue_mapping_jiffies OffsetBits:4032 SizeBits:64] map[Name:__cacheline_group_end__sock_write_tx OffsetBits:4096 SizeBits:0] map[Name:__cacheline_group_begin__sock_read_tx OffsetBits:4096 SizeBits:0] map[Name:sk_dst_pending_confirm OffsetBits:4096 SizeBits:32] map[Name:sk_pacing_status OffsetBits:4128 SizeBits:32] map[Name:sk_max_pacing_rate OffsetBits:4160 SizeBits:64] map[Name:sk_sndtimeo OffsetBits:4224 SizeBits:64] map[Name:sk_priority OffsetBits:4288 SizeBits:32] map[Name:sk_mark OffsetBits:4320 SizeBits:32] map[Name:sk_uid OffsetBits:4352 SizeBits:32] map[Name:sk_protocol OffsetBits:4384 SizeBits:16] map[Name:sk_type OffsetBits:4400 SizeBits:16] map[Name:sk_dst_cache OffsetBits:4416 SizeBits:64] map[Name:sk_route_caps OffsetBits:4480 SizeBits:64] map[Name:sk_validate_xmit_skb OffsetBits:4544 SizeBits:64] map[Name:sk_gso_type OffsetBits:4608 SizeBits:16] map[Name:sk_gso_max_segs OffsetBits:4624 SizeBits:16] map[Name:sk_gso_max_size OffsetBits:4640 SizeBits:32] map[Name:sk_allocation OffsetBits:4672 SizeBits:32] map[Name:sk_txhash OffsetBits:4704 SizeBits:32] map[Name:sk_sndbuf OffsetBits:4736 SizeBits:32] map[Name:sk_pacing_shift OffsetBits:4768 SizeBits:8] map[Name:sk_use_task_frag OffsetBits:4776 SizeBits:8] map[Name:__cacheline_group_end__sock_read_tx OffsetBits:4784 SizeBits:0] map[Name:sk_gso_disabled OffsetBits:4784 SizeBits:1] map[Name:sk_kern_sock OffsetBits:4785 SizeBits:1] map[Name:sk_no_check_tx OffsetBits:4786 SizeBits:1] map[Name:sk_no_check_rx OffsetBits:4787 SizeBits:1] map[Name:sk_shutdown OffsetBits:4792 SizeBits:8] map[Name:sk_lingertime OffsetBits:4800 SizeBits:64] map[Name:sk_prot_creator OffsetBits:4864 SizeBits:64] map[Name:sk_callback_lock OffsetBits:4928 SizeBits:64] map[Name:sk_ack_backlog OffsetBits:4992 SizeBits:32] map[Name:sk_max_ack_backlog OffsetBits:5024 SizeBits:32] map[Name:sk_ino OffsetBits:5056 SizeBits:64] map[Name:sk_peer_lock OffsetBits:5120 SizeBits:32] map[Name:sk_bind_phc OffsetBits:5152 SizeBits:32] map[Name:sk_peer_pid OffsetBits:5184 SizeBits:64] map[Name:sk_peer_cred OffsetBits:5248 SizeBits:64] map[Name:sk_stamp OffsetBits:5312 SizeBits:64] map[Name:sk_disconnects OffsetBits:5376 SizeBits:32] map[Name: OffsetBits:5408 SizeBits:8] map[Name:sk_clockid OffsetBits:5416 SizeBits:8] map[Name:sk_txtime_deadline_mode OffsetBits:5424 SizeBits:1] map[Name:sk_txtime_report_errors OffsetBits:5425 SizeBits:1] map[Name:sk_txtime_unused OffsetBits:5426 SizeBits:6] map[Name:sk_bpf_cb_flags OffsetBits:5432 SizeBits:8] map[Name:sk_user_data OffsetBits:5440 SizeBits:64] map[Name:sk_security OffsetBits:5504 SizeBits:64] map[Name:sk_cgrp_data OffsetBits:5568 SizeBits:128] map[Name:sk_state_change OffsetBits:5696 SizeBits:64] map[Name:sk_write_space OffsetBits:5760 SizeBits:64] map[Name:sk_error_report OffsetBits:5824 SizeBits:64] map[Name:sk_backlog_rcv OffsetBits:5888 SizeBits:64] map[Name:sk_destruct OffsetBits:5952 SizeBits:64] map[Name:sk_reuseport_cb OffsetBits:6016 SizeBits:64] map[Name:sk_bpf_storage OffsetBits:6080 SizeBits:64] map[Name:sk_drop_counters OffsetBits:6144 SizeBits:64] map[Name: OffsetBits:6208 SizeBits:128] map[Name:ns_tracker OffsetBits:6336 SizeBits:64] map[Name:sk_user_frags OffsetBits:6400 SizeBits:128]]] |
| 10/2 | 2026/02/23 14:31 | tool | codesearch-struct-layout |
0mArgs:map[ContextFile:net/ipv4/inet_hashtables.c Name:inet_sock] Results: map[Fields:[map[Name:sk OffsetBits:0 SizeBits:6528] map[Name:pinet6 OffsetBits:6528 SizeBits:64] map[Name:ipv6_fl_list OffsetBits:6592 SizeBits:64] map[Name:inet_flags OffsetBits:6656 SizeBits:64] map[Name:inet_saddr OffsetBits:6720 SizeBits:32] map[Name:uc_ttl OffsetBits:6752 SizeBits:16] map[Name:inet_sport OffsetBits:6768 SizeBits:16] map[Name:inet_opt OffsetBits:6784 SizeBits:64] map[Name:inet_id OffsetBits:6848 SizeBits:32] map[Name:tos OffsetBits:6880 SizeBits:8] map[Name:min_ttl OffsetBits:6888 SizeBits:8] map[Name:mc_ttl OffsetBits:6896 SizeBits:8] map[Name:pmtudisc OffsetBits:6904 SizeBits:8] map[Name:rcv_tos OffsetBits:6912 SizeBits:8] map[Name:convert_csum OffsetBits:6920 SizeBits:8] map[Name:uc_index OffsetBits:6944 SizeBits:32] map[Name:mc_index OffsetBits:6976 SizeBits:32] map[Name:mc_addr OffsetBits:7008 SizeBits:32] map[Name:local_port_range OffsetBits:7040 SizeBits:32] map[Name:mc_list OffsetBits:7104 SizeBits:64] map[Name:cork OffsetBits:7168 SizeBits:1344]]] |
| 11/2 | 2026/02/23 14:31 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 10682 output: 34 thoughts: 39 Thoughts: Okay, here's my interpretation of that thought process, assuming I'm already deep in kernel networking code: **Locating `sk_portpair`** Right, where the heck is `sk_portpair`? I was pretty sure it was part of the socket data structure. Hang on... I'm drawing a blank on its exact location within `struct sock`. Ah, that's it! It's *probably* in `struct sock_common`, isn't it? That would make sense, since `sock_common` is the base for both the raw and the transport-specific parts of a socket. Okay, quick sanity check. I need to dive into `struct sock_common` and confirm this assumption before proceeding. No point going further down the rabbit hole until I'm absolutely certain I'm looking at the right structure. Let's get to it. |
| 12/2 | 2026/02/23 14:32 | tool | codesearch-struct-layout |
0mArgs:map[ContextFile:include/net/sock.h Name:sock_common] Results: map[Fields:[map[Name: OffsetBits:0 SizeBits:64] map[Name: OffsetBits:64 SizeBits:32] map[Name: OffsetBits:96 SizeBits:32] map[Name:skc_family OffsetBits:128 SizeBits:16] map[Name:skc_state OffsetBits:144 SizeBits:8] map[Name:skc_reuse OffsetBits:152 SizeBits:4] map[Name:skc_reuseport OffsetBits:156 SizeBits:1] map[Name:skc_ipv6only OffsetBits:157 SizeBits:1] map[Name:skc_net_refcnt OffsetBits:158 SizeBits:1] map[Name:skc_bypass_prot_mem OffsetBits:159 SizeBits:1] map[Name:skc_bound_dev_if OffsetBits:160 SizeBits:32] map[Name: OffsetBits:192 SizeBits:128] map[Name:skc_prot OffsetBits:320 SizeBits:64] map[Name:skc_net OffsetBits:384 SizeBits:64] map[Name:skc_v6_daddr OffsetBits:448 SizeBits:128] map[Name:skc_v6_rcv_saddr OffsetBits:576 SizeBits:128] map[Name:skc_cookie OffsetBits:704 SizeBits:64] map[Name: OffsetBits:768 SizeBits:64] map[Name:skc_dontcopy_begin OffsetBits:832 SizeBits:0] map[Name: OffsetBits:832 SizeBits:128] map[Name:skc_tx_queue_mapping OffsetBits:960 SizeBits:16] map[Name:skc_rx_queue_mapping OffsetBits:976 SizeBits:16] map[Name: OffsetBits:992 SizeBits:32] map[Name:skc_refcnt OffsetBits:1024 SizeBits:32] map[Name:skc_dontcopy_end OffsetBits:1056 SizeBits:0] map[Name: OffsetBits:1056 SizeBits:32]]] |
| 13/2 | 2026/02/23 14:32 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 11531 output: 22 thoughts: 42 Thoughts: Okay, here's my assessment of the situation: **Investigating Union Names in `sock_common`** Right, so the `sock_common` layout, when I'm looking at it, isn't explicitly showing the names of the unions. This feels a bit... suboptimal. I need to understand what's actually *in* there, especially with those unions, to fully grasp its structure. So, the logical next step is to head directly to the source. Let's pull up the definition of `struct sock_common`. I'll find it in `include/net/sock.h`. That's where all the fundamental socket structures should be defined. I'm going to dig into that header file and see exactly what's hiding within `sock_common`. |
| 14/2 | 2026/02/23 14:32 | tool | read-file |
0mArgs:map[File:include/net/sock.h] Results: map[Contents:/* SPDX-License-Identifier: GPL-2.0-or-later */
/*
* INET An implementation of the TCP/IP protocol suite for the LINUX
* operating system. INET is implemented using the BSD Socket
* interface as the means of communication with the user level.
*
* Definitions for the AF_INET socket handler.
*
* Version: @(#)sock.h 1.0.4 05/13/93
*
* Authors: Ross Biro
* Fred N. van Kempen, <waltje@uWalt.NL.Mugnet.ORG>
* Corey Minyard <wf-rch!minyard@relay.EU.net>
* Florian La Roche <flla@stud.uni-sb.de>
*
* Fixes:
* Alan Cox : Volatiles in skbuff pointers. See
* skbuff comments. May be overdone,
* better to prove they can be removed
* than the reverse.
* Alan Cox : Added a zapped field for tcp to note
* a socket is reset and must stay shut up
* Alan Cox : New fields for options
* Pauline Middelink : identd support
* Alan Cox : Eliminate low level recv/recvfrom
* David S. Miller : New socket lookup architecture.
* Steve Whitehouse: Default routines for sock_ops
* Arnaldo C. Melo : removed net_pinfo, tp_pinfo and made
* protinfo be just a void pointer, as the
* protocol specific parts were moved to
* respective headers and ipv4/v6, etc now
* use private slabcaches for its socks
* Pedro Hortas : New flags field for socket options
*/
#ifndef _SOCK_H
#define _SOCK_H
#include <linux/hardirq.h>
#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/list_nulls.h>
#include <linux/timer.h>
#include <linux/cache.h>
#include <linux/bitops.h>
#include <linux/lockdep.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h> /* struct sk_buff */
#include <linux/mm.h>
#include <linux/security.h>
#include <linux/slab.h>
#include <linux/uaccess.h>
#include <linux/page_counter.h>
#include <linux/memcontrol.h>
#include <linux/static_key.h>
#include <linux/sched.h>
#include <linux/wait.h>
#include <linux/cgroup-defs.h>
#include <linux/rbtree.h>
#include <linux/rculist_nulls.h>
#include <linux/poll.h>
#include <linux/sockptr.h>
#include <linux/indirect_call_wrapper.h>
#include <linux/atomic.h>
#include <linux/refcount.h>
#include <linux/llist.h>
#include <net/dst.h>
#include <net/checksum.h>
#include <net/tcp_states.h>
#include <linux/net_tstamp.h>
#include <net/l3mdev.h>
#include <uapi/linux/socket.h>
/*
* This structure really needs to be cleaned up.
* Most of it is for TCP, and not used by any of
* the other protocols.
*/
/* This is the per-socket lock. The spinlock provides a synchronization
* between user contexts and software interrupt processing, whereas the
* mini-semaphore synchronizes multiple users amongst themselves.
*/
typedef struct {
spinlock_t slock;
int owned;
wait_queue_head_t wq;
/*
* We express the mutex-alike socket_lock semantics
* to the lock validator by explicitly managing
* the slock as a lock variant (in addition to
* the slock itself):
*/
#ifdef CONFIG_DEBUG_LOCK_ALLOC
struct lockdep_map dep_map;
#endif
} socket_lock_t;
struct sock;
struct proto;
struct net;
typedef __u32 __bitwise __portpair;
typedef __u64 __bitwise __addrpair;
/**
* struct sock_common - minimal network layer representation of sockets
* @skc_daddr: Foreign IPv4 addr
* @skc_rcv_saddr: Bound local IPv4 addr
* @skc_addrpair: 8-byte-aligned __u64 union of @skc_daddr & @skc_rcv_saddr
* @skc_hash: hash value used with various protocol lookup tables
* @skc_u16hashes: two u16 hash values used by UDP lookup tables
* @skc_dport: placeholder for inet_dport/tw_dport
* @skc_num: placeholder for inet_num/tw_num
* @skc_portpair: __u32 union of @skc_dport & @skc_num
* @skc_family: network address family
* @skc_state: Connection state
* @skc_reuse: %SO_REUSEADDR setting
* @skc_reuseport: %SO_REUSEPORT setting
* @skc_ipv6only: socket is IPV6 only
* @skc_net_refcnt: socket is using net ref counting
* @skc_bypass_prot_mem: bypass the per-protocol memory accounting for skb
* @skc_bound_dev_if: bound device index if != 0
* @skc_bind_node: bind hash linkage for various protocol lookup tables
* @skc_portaddr_node: second hash linkage for UDP/UDP-Lite protocol
* @skc_prot: protocol handlers inside a network family
* @skc_net: reference to the network namespace of this socket
* @skc_v6_daddr: IPV6 destination address
* @skc_v6_rcv_saddr: IPV6 source address
* @skc_cookie: socket's cookie value
* @skc_node: main hash linkage for various protocol lookup tables
* @skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
* @skc_tx_queue_mapping: tx queue number for this connection
* @skc_rx_queue_mapping: rx queue number for this connection
* @skc_flags: place holder for sk_flags
* %SO_LINGER (l_onoff), %SO_BROADCAST, %SO_KEEPALIVE,
* %SO_OOBINLINE settings, %SO_TIMESTAMPING settings
* @skc_listener: connection request listener socket (aka rsk_listener)
* [union with @skc_flags]
* @skc_tw_dr: (aka tw_dr) ptr to &struct inet_timewait_death_row
* [union with @skc_flags]
* @skc_incoming_cpu: record/match cpu processing incoming packets
* @skc_rcv_wnd: (aka rsk_rcv_wnd) TCP receive window size (possibly scaled)
* [union with @skc_incoming_cpu]
* @skc_tw_rcv_nxt: (aka tw_rcv_nxt) TCP window next expected seq number
* [union with @skc_incoming_cpu]
* @skc_refcnt: reference count
*
* This is the minimal network layer representation of sockets, the header
* for struct sock and struct inet_timewait_sock.
*/
struct sock_common {
union {
__addrpair skc_addrpair;
struct {
__be32 skc_daddr;
__be32 skc_rcv_saddr;
};
};
union {
unsigned int skc_hash;
__u16 skc_u16hashes[2];
};
/* skc_dport && skc_num must be grouped as well */
union {
__portpair skc_portpair;
struct {
__be16 skc_dport;
__u16 skc_num;
};
};
unsigned short skc_family;
volatile unsigned char skc_state;
unsigned char skc_reuse:4;
unsigned char skc_reuseport:1;
unsigned char skc_ipv6only:1;
unsigned char skc_net_refcnt:1;
unsigned char skc_bypass_prot_mem:1;
int skc_bound_dev_if;
union {
struct hlist_node skc_bind_node;
struct hlist_node skc_portaddr_node;
};
struct proto *skc_prot;
possible_net_t skc_net;
#if IS_ENABLED(CONFIG_IPV6)
struct in6_addr skc_v6_daddr;
struct in6_addr skc_v6_rcv_saddr;
#endif
atomic64_t skc_cookie;
/* following fields are padding to force
* offset(struct sock, sk_refcnt) == 128 on 64bit arches
* assuming IPV6 is enabled. We use this padding differently
* for different kind of 'sockets'
*/
union {
unsigned long skc_flags;
struct sock *skc_listener; /* request_sock */
struct inet_timewait_death_row *skc_tw_dr; /* inet_timewait_sock */
};
/*
* fields between dontcopy_begin/dontcopy_end
* are not copied in sock_copy()
*/
/* private: */
int skc_dontcopy_begin[0];
/* public: */
union {
struct hlist_node skc_node;
struct hlist_nulls_node skc_nulls_node;
};
unsigned short skc_tx_queue_mapping;
#ifdef CONFIG_SOCK_RX_QUEUE_MAPPING
unsigned short skc_rx_queue_mapping;
#endif
union {
int skc_incoming_cpu;
u32 skc_rcv_wnd;
u32 skc_tw_rcv_nxt; /* struct tcp_timewait_sock */
};
refcount_t skc_refcnt;
/* private: */
int skc_dontcopy_end[0];
union {
u32 skc_rxhash;
u32 skc_window_clamp;
u32 skc_tw_snd_nxt; /* struct tcp_timewait_sock */
};
/* public: */
};
struct bpf_local_storage;
struct sk_filter;
/**
* struct sock - network layer representation of sockets
* @__sk_common: shared layout with inet_timewait_sock
* @sk_shutdown: mask of %SEND_SHUTDOWN and/or %RCV_SHUTDOWN
* @sk_userlocks: %SO_SNDBUF and %SO_RCVBUF settings
* @sk_lock: synchronizer
* @sk_kern_sock: True if sock is using kernel lock classes
* @sk_rcvbuf: size of receive buffer in bytes
* @sk_wq: sock wait queue and async head
* @sk_rx_dst: receive input route used by early demux
* @sk_rx_dst_ifindex: ifindex for @sk_rx_dst
* @sk_rx_dst_cookie: cookie for @sk_rx_dst
* @sk_dst_cache: destination cache
* @sk_dst_pending_confirm: need to confirm neighbour
* @sk_policy: flow policy
* @psp_assoc: PSP association, if socket is PSP-secured
* @sk_receive_queue: incoming packets
* @sk_wmem_alloc: transmit queue bytes committed
* @sk_tsq_flags: TCP Small Queues flags
* @sk_write_queue: Packet sending queue
* @sk_omem_alloc: "o" is "option" or "other"
* @sk_wmem_queued: persistent queue size
* @sk_forward_alloc: space allocated forward
* @sk_reserved_mem: space reserved and non-reclaimable for the socket
* @sk_napi_id: id of the last napi context to receive data for sk
* @sk_ll_usec: usecs to busypoll when there is no data
* @sk_allocation: allocation mode
* @sk_pacing_rate: Pacing rate (if supported by transport/packet scheduler)
* @sk_pacing_status: Pacing status (requested, handled by sch_fq)
* @sk_max_pacing_rate: Maximum pacing rate (%SO_MAX_PACING_RATE)
* @sk_sndbuf: size of send buffer in bytes
* @sk_no_check_tx: %SO_NO_CHECK setting, set checksum in TX packets
* @sk_no_check_rx: allow zero checksum in RX packets
* @sk_route_caps: route capabilities (e.g. %NETIF_F_TSO)
* @sk_gso_disabled: if set, NETIF_F_GSO_MASK is forbidden.
* @sk_gso_type: GSO type (e.g. %SKB_GSO_TCPV4)
* @sk_gso_max_size: Maximum GSO segment size to build
* @sk_gso_max_segs: Maximum number of GSO segments
* @sk_pacing_shift: scaling factor for TCP Small Queues
* @sk_lingertime: %SO_LINGER l_linger setting
* @sk_backlog: always used with the per-socket spinlock held
* @sk_callback_lock: used with the callbacks in the end of this struct
* @sk_error_queue: rarely used
* @sk_prot_creator: sk_prot of original sock creator (see ipv6_setsockopt,
* IPV6_ADDRFORM for instance)
* @sk_err: last error
* @sk_err_soft: errors that don't cause failure but are the cause of a
* persistent failure not just 'timed out'
* @sk_drops: raw/udp drops counter
* @sk_drop_counters: optional pointer to numa_drop_counters
* @sk_ack_backlog: current listen backlog
* @sk_max_ack_backlog: listen backlog set in listen()
* @sk_uid: user id of owner
* @sk_ino: inode number (zero if orphaned)
* @sk_prefer_busy_poll: prefer busypolling over softirq processing
* @sk_busy_poll_budget: napi processing budget when busypolling
* @sk_priority: %SO_PRIORITY setting
* @sk_type: socket type (%SOCK_STREAM, etc)
* @sk_protocol: which protocol this socket belongs in this network family
* @sk_peer_lock: lock protecting @sk_peer_pid and @sk_peer_cred
* @sk_peer_pid: &struct pid for this socket's peer
* @sk_peer_cred: %SO_PEERCRED setting
* @sk_rcvlowat: %SO_RCVLOWAT setting
* @sk_rcvtimeo: %SO_RCVTIMEO setting
* @sk_sndtimeo: %SO_SNDTIMEO setting
* @sk_txhash: computed flow hash for use on transmit
* @sk_txrehash: enable TX hash rethink
* @sk_filter: socket filtering instructions
* @sk_timer: sock cleanup timer
* @tcp_retransmit_timer: tcp retransmit timer
* @mptcp_retransmit_timer: mptcp retransmit timer
* @sk_stamp: time stamp of last packet received
* @sk_stamp_seq: lock for accessing sk_stamp on 32 bit architectures only
* @sk_tsflags: SO_TIMESTAMPING flags
* @sk_bpf_cb_flags: used in bpf_setsockopt()
* @sk_use_task_frag: allow sk_page_frag() to use current->task_frag.
* Sockets that can be used under memory reclaim should
* set this to false.
* @sk_bind_phc: SO_TIMESTAMPING bind PHC index of PTP virtual clock
* for timestamping
* @sk_tskey: counter to disambiguate concurrent tstamp requests
* @sk_tx_queue_mapping_jiffies: time in jiffies of last @sk_tx_queue_mapping refresh.
* @sk_zckey: counter to order MSG_ZEROCOPY notifications
* @sk_socket: Identd and reporting IO signals
* @sk_user_data: RPC layer private data. Write-protected by @sk_callback_lock.
* @sk_frag: cached page frag
* @sk_peek_off: current peek_offset value
* @sk_send_head: front of stuff to transmit
* @tcp_rtx_queue: TCP re-transmit queue [union with @sk_send_head]
* @sk_security: used by security modules
* @sk_mark: generic packet mark
* @sk_cgrp_data: cgroup data for this cgroup
* @sk_memcg: this socket's memory cgroup association
* @sk_write_pending: a write to stream socket waits to start
* @sk_disconnects: number of disconnect operations performed on this sock
* @sk_state_change: callback to indicate change in the state of the sock
* @sk_data_ready: callback to indicate there is data to be processed
* @sk_write_space: callback to indicate there is bf sending space available
* @sk_error_report: callback to indicate errors (e.g. %MSG_ERRQUEUE)
* @sk_backlog_rcv: callback to process the backlog
* @sk_validate_xmit_skb: ptr to an optional validate function
* @sk_destruct: called at sock freeing time, i.e. when all refcnt == 0
* @sk_reuseport_cb: reuseport group container
* @sk_bpf_storage: ptr to cache and control for bpf_sk_storage
* @sk_rcu: used during RCU grace period
* @sk_freeptr: used for SLAB_TYPESAFE_BY_RCU managed sockets
* @sk_clockid: clockid used by time-based scheduling (SO_TXTIME)
* @sk_txtime_deadline_mode: set deadline mode for SO_TXTIME
* @sk_txtime_report_errors: set report errors mode for SO_TXTIME
* @sk_txtime_unused: unused txtime flags
* @sk_scm_recv_flags: all flags used by scm_recv()
* @sk_scm_credentials: flagged by SO_PASSCRED to recv SCM_CREDENTIALS
* @sk_scm_security: flagged by SO_PASSSEC to recv SCM_SECURITY
* @sk_scm_pidfd: flagged by SO_PASSPIDFD to recv SCM_PIDFD
* @sk_scm_rights: flagged by SO_PASSRIGHTS to recv SCM_RIGHTS
* @sk_scm_unused: unused flags for scm_recv()
* @ns_tracker: tracker for netns reference
* @sk_user_frags: xarray of pages the user is holding a reference on.
* @sk_owner: reference to the real owner of the socket that calls
* sock_lock_init_class_and_name().
*/
struct sock {
/*
* Now struct inet_timewait_sock also uses sock_common, so please just
* don't add nothing before this first member (__sk_common) --acme
*/
struct sock_common __sk_common;
#define sk_node __sk_common.skc_node
#define sk_nulls_node __sk_common.skc_nulls_node
#define sk_refcnt __sk_common.skc_refcnt
#define sk_tx_queue_mapping __sk_common.skc_tx_queue_mapping
#ifdef CONFIG_SOCK_RX_QUEUE_MAPPING
#define sk_rx_queue_mapping __sk_common.skc_rx_queue_mapping
#endif
#define sk_dontcopy_begin __sk_common.skc_dontcopy_begin
#define sk_dontcopy_end __sk_common.skc_dontcopy_end
#define sk_hash __sk_common.skc_hash
#define sk_portpair __sk_common.skc_portpair
#define sk_num __sk_common.skc_num
#define sk_dport __sk_common.skc_dport
#define sk_addrpair __sk_common.skc_addrpair
#define sk_daddr __sk_common.skc_daddr
#define sk_rcv_saddr __sk_common.skc_rcv_saddr
#define sk_family __sk_common.skc_family
#define sk_state __sk_common.skc_state
#define sk_reuse __sk_common.skc_reuse
#define sk_reuseport __sk_common.skc_reuseport
#define sk_ipv6only __sk_common.skc_ipv6only
#define sk_net_refcnt __sk_common.skc_net_refcnt
#define sk_bypass_prot_mem __sk_common.skc_bypass_prot_mem
#define sk_bound_dev_if __sk_common.skc_bound_dev_if
#define sk_bind_node __sk_common.skc_bind_node
#define sk_prot __sk_common.skc_prot
#define sk_net __sk_common.skc_net
#define sk_v6_daddr __sk_common.skc_v6_daddr
#define sk_v6_rcv_saddr __sk_common.skc_v6_rcv_saddr
#define sk_cookie __sk_common.skc_cookie
#define sk_incoming_cpu __sk_common.skc_incoming_cpu
#define sk_flags __sk_common.skc_flags
#define sk_rxhash __sk_common.skc_rxhash
__cacheline_group_begin(sock_write_rx);
atomic_t sk_drops;
__s32 sk_peek_off;
struct sk_buff_head sk_error_queue;
struct sk_buff_head sk_receive_queue;
/*
* The backlog queue is special, it is always used with
* the per-socket spinlock held and requires low latency
* access. Therefore we special case it's implementation.
* Note : rmem_alloc is in this structure to fill a hole
* on 64bit arches, not because its logically part of
* backlog.
*/
struct {
atomic_t rmem_alloc;
int len;
struct sk_buff *head;
struct sk_buff *tail;
} sk_backlog;
#define sk_rmem_alloc sk_backlog.rmem_alloc
__cacheline_group_end(sock_write_rx);
__cacheline_group_begin(sock_read_rx);
/* early demux fields */
struct dst_entry __rcu *sk_rx_dst;
int sk_rx_dst_ifindex;
u32 sk_rx_dst_cookie;
#ifdef CONFIG_NET_RX_BUSY_POLL
unsigned int sk_ll_usec;
unsigned int sk_napi_id;
u16 sk_busy_poll_budget;
u8 sk_prefer_busy_poll;
#endif
u8 sk_userlocks;
int sk_rcvbuf;
struct sk_filter __rcu *sk_filter;
union {
struct socket_wq __rcu *sk_wq;
/* private: */
struct socket_wq *sk_wq_raw;
/* public: */
};
void (*sk_data_ready)(struct sock *sk);
long sk_rcvtimeo;
int sk_rcvlowat;
__cacheline_group_end(sock_read_rx);
__cacheline_group_begin(sock_read_rxtx);
int sk_err;
struct socket *sk_socket;
#ifdef CONFIG_MEMCG
struct mem_cgroup *sk_memcg;
#endif
#ifdef CONFIG_XFRM
struct xfrm_policy __rcu *sk_policy[2];
#endif
#if IS_ENABLED(CONFIG_INET_PSP)
struct psp_assoc __rcu *psp_assoc;
#endif
__cacheline_group_end(sock_read_rxtx);
__cacheline_group_begin(sock_write_rxtx);
socket_lock_t sk_lock;
u32 sk_reserved_mem;
int sk_forward_alloc;
u32 sk_tsflags;
__cacheline_group_end(sock_write_rxtx);
__cacheline_group_begin(sock_write_tx);
int sk_write_pending;
atomic_t sk_omem_alloc;
int sk_err_soft;
int sk_wmem_queued;
refcount_t sk_wmem_alloc;
unsigned long sk_tsq_flags;
union {
struct sk_buff *sk_send_head;
struct rb_root tcp_rtx_queue;
};
struct sk_buff_head sk_write_queue;
struct page_frag sk_frag;
union {
struct timer_list sk_timer;
struct timer_list tcp_retransmit_timer;
struct timer_list mptcp_retransmit_timer;
};
unsigned long sk_pacing_rate; /* bytes per second */
atomic_t sk_zckey;
atomic_t sk_tskey;
unsigned long sk_tx_queue_mapping_jiffies;
__cacheline_group_end(sock_write_tx);
__cacheline_group_begin(sock_read_tx);
u32 sk_dst_pending_confirm;
u32 sk_pacing_status; /* see enum sk_pacing */
unsigned long sk_max_pacing_rate;
long sk_sndtimeo;
u32 sk_priority;
u32 sk_mark;
kuid_t sk_uid;
u16 sk_protocol;
u16 sk_type;
struct dst_entry __rcu *sk_dst_cache;
netdev_features_t sk_route_caps;
#ifdef CONFIG_SOCK_VALIDATE_XMIT
struct sk_buff* (*sk_validate_xmit_skb)(struct sock *sk,
struct net_device *dev,
struct sk_buff *skb);
#endif
u16 sk_gso_type;
u16 sk_gso_max_segs;
unsigned int sk_gso_max_size;
gfp_t sk_allocation;
u32 sk_txhash;
int sk_sndbuf;
u8 sk_pacing_shift;
bool sk_use_task_frag;
__cacheline_group_end(sock_read_tx);
/*
* Because of non atomicity rules, all
* changes are protected by socket lock.
*/
u8 sk_gso_disabled : 1,
sk_kern_sock : 1,
sk_no_check_tx : 1,
sk_no_check_rx : 1;
u8 sk_shutdown;
unsigned long sk_lingertime;
struct proto *sk_prot_creator;
rwlock_t sk_callback_lock;
u32 sk_ack_backlog;
u32 sk_max_ack_backlog;
unsigned long sk_ino;
spinlock_t sk_peer_lock;
int sk_bind_phc;
struct pid *sk_peer_pid;
const struct cred *sk_peer_cred;
ktime_t sk_stamp;
#if BITS_PER_LONG==32
seqlock_t sk_stamp_seq;
#endif
int sk_disconnects;
union {
u8 sk_txrehash;
u8 sk_scm_recv_flags;
struct {
u8 sk_scm_credentials : 1,
sk_scm_security : 1,
sk_scm_pidfd : 1,
sk_scm_rights : 1,
sk_scm_unused : 4;
};
};
u8 sk_clockid;
u8 sk_txtime_deadline_mode : 1,
sk_txtime_report_errors : 1,
sk_txtime_unused : 6;
#define SK_BPF_CB_FLAG_TEST(SK, FLAG) ((SK)->sk_bpf_cb_flags & (FLAG))
u8 sk_bpf_cb_flags;
void *sk_user_data;
#ifdef CONFIG_SECURITY
void *sk_security;
#endif
struct sock_cgroup_data sk_cgrp_data;
void (*sk_state_change)(struct sock *sk);
void (*sk_write_space)(struct sock *sk);
void (*sk_error_report)(struct sock *sk);
int (*sk_backlog_rcv)(struct sock *sk,
struct sk_buff *skb);
void (*sk_destruct)(struct sock *sk);
struct sock_reuseport __rcu *sk_reuseport_cb;
#ifdef CONFIG_BPF_SYSCALL
struct bpf_local_storage __rcu *sk_bpf_storage;
#endif
struct numa_drop_counters *sk_drop_counters;
/* sockets using SLAB_TYPESAFE_BY_RCU can use sk_freeptr.
* By the time kfree() is called, sk_rcu can not be in
* use and can be mangled.
*/
union {
struct rcu_head sk_rcu;
freeptr_t sk_freeptr;
};
netns_tracker ns_tracker;
struct xarray sk_user_frags;
#if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES)
struct module *sk_owner;
#endif
};
struct sock_bh_locked {
struct sock *sock;
local_lock_t bh_lock;
};
enum sk_pacing {
SK_PACING_NONE = 0,
SK_PACING_NEEDED = 1,
SK_PACING_FQ = 2,
};
/* flag bits in sk_user_data
*
* - SK_USER_DATA_NOCOPY: Pointer stored in sk_user_data might
* not be suitable for copying when cloning the socket. For instance,
* it can point to a reference counted object. sk_user_data bottom
* bit is set if pointer must not be copied.
*
* - SK_USER_DATA_BPF: Mark whether sk_user_data field is
* managed/owned by a BPF reuseport array. This bit should be set
* when sk_user_data's sk is added to the bpf's reuseport_array.
*
* - SK_USER_DATA_PSOCK: Mark whether pointer stored in
* sk_user_data points to psock type. This bit should be set
* when sk_user_data is assigned to a psock object.
*/
#define SK_USER_DATA_NOCOPY 1UL
#define SK_USER_DATA_BPF 2UL
#define SK_USER_DATA_PSOCK 4UL
#define SK_USER_DATA_PTRMASK ~(SK_USER_DATA_NOCOPY | SK_USER_DATA_BPF |\
SK_USER_DATA_PSOCK)
/**
* sk_user_data_is_nocopy - Test if sk_user_data pointer must not be copied
* @sk: socket
*/
static inline bool sk_user_data_is_nocopy(const struct sock *sk)
{
return ((uintptr_t)sk->sk_user_data & SK_USER_DATA_NOCOPY);
}
#define __sk_user_data(sk) ((*((void __rcu **)&(sk)->sk_user_data)))
/**
* __locked_read_sk_user_data_with_flags - return the pointer
* only if argument flags all has been set in sk_user_data. Otherwise
* return NULL
*
* @sk: socket
* @flags: flag bits
*
* The caller must be holding sk->sk_callback_lock.
*/
static inline void *
__locked_read_sk_user_data_with_flags(const struct sock *sk,
uintptr_t flags)
{
uintptr_t sk_user_data =
(uintptr_t)rcu_dereference_check(__sk_user_data(sk),
lockdep_is_held(&sk->sk_callback_lock));
WARN_ON_ONCE(flags & SK_USER_DATA_PTRMASK);
if ((sk_user_data & flags) == flags)
return (void *)(sk_user_data & SK_USER_DATA_PTRMASK);
return NULL;
}
/**
* __rcu_dereference_sk_user_data_with_flags - return the pointer
* only if argument flags all has been set in sk_user_data. Otherwise
* return NULL
*
* @sk: socket
* @flags: flag bits
*/
static inline void *
__rcu_dereference_sk_user_data_with_flags(const struct sock *sk,
uintptr_t flags)
{
uintptr_t sk_user_data = (uintptr_t)rcu_dereference(__sk_user_data(sk));
WARN_ON_ONCE(flags & SK_USER_DATA_PTRMASK);
if ((sk_user_data & flags) == flags)
return (void *)(sk_user_data & SK_USER_DATA_PTRMASK);
return NULL;
}
#define rcu_dereference_sk_user_data(sk) \
__rcu_dereference_sk_user_data_with_flags(sk, 0)
#define __rcu_assign_sk_user_data_with_flags(sk, ptr, flags) \
({ \
uintptr_t __tmp1 = (uintptr_t)(ptr), \
__tmp2 = (uintptr_t)(flags); \
WARN_ON_ONCE(__tmp1 & ~SK_USER_DATA_PTRMASK); \
WARN_ON_ONCE(__tmp2 & SK_USER_DATA_PTRMASK); \
rcu_assign_pointer(__sk_user_data((sk)), \
__tmp1 | __tmp2); \
})
#define rcu_assign_sk_user_data(sk, ptr) \
__rcu_assign_sk_user_data_with_flags(sk, ptr, 0)
static inline
struct net *sock_net(const struct sock *sk)
{
return read_pnet(&sk->sk_net);
}
static inline
void sock_net_set(struct sock *sk, struct net *net)
{
write_pnet(&sk->sk_net, net);
}
/*
* SK_CAN_REUSE and SK_NO_REUSE on a socket mean that the socket is OK
* or not whether his port will be reused by someone else. SK_FORCE_REUSE
* on a socket means that the socket will reuse everybody else's port
* without looking at the other's sk_reuse value.
*/
#define SK_NO_REUSE 0
#define SK_CAN_REUSE 1
#define SK_FORCE_REUSE 2
int sk_set_peek_off(struct sock *sk, int val);
static inline int sk_peek_offset(const struct sock *sk, int flags)
{
if (unlikely(flags & MSG_PEEK)) {
return READ_ONCE(sk->sk_peek_off);
}
return 0;
}
static inline void sk_peek_offset_bwd(struct sock *sk, int val)
{
s32 off = READ_ONCE(sk->sk_peek_off);
if (unlikely(off >= 0)) {
off = max_t(s32, off - val, 0);
WRITE_ONCE(sk->sk_peek_off, off);
}
}
static inline void sk_peek_offset_fwd(struct sock *sk, int val)
{
sk_peek_offset_bwd(sk, -val);
}
/*
* Hashed lists helper routines
*/
static inline struct sock *sk_entry(const struct hlist_node *node)
{
return hlist_entry(node, struct sock, sk_node);
}
static inline struct sock *__sk_head(const struct hlist_head *head)
{
return hlist_entry(head->first, struct sock, sk_node);
}
static inline struct sock *sk_head(const struct hlist_head *head)
{
return hlist_empty(head) ? NULL : __sk_head(head);
}
static inline struct sock *__sk_nulls_head(const struct hlist_nulls_head *head)
{
return hlist_nulls_entry(head->first, struct sock, sk_nulls_node);
}
static inline struct sock *sk_nulls_head(const struct hlist_nulls_head *head)
{
return hlist_nulls_empty(head) ? NULL : __sk_nulls_head(head);
}
static inline struct sock *sk_next(const struct sock *sk)
{
return hlist_entry_safe(sk->sk_node.next, struct sock, sk_node);
}
static inline struct sock *sk_nulls_next(const struct sock *sk)
{
return (!is_a_nulls(sk->sk_nulls_node.next)) ?
hlist_nulls_entry(sk->sk_nulls_node.next,
struct sock, sk_nulls_node) :
NULL;
}
static inline bool sk_unhashed(const struct sock *sk)
{
return hlist_unhashed(&sk->sk_node);
}
static inline bool sk_hashed(const struct sock *sk)
{
return !sk_unhashed(sk);
}
static inline void sk_node_init(struct hlist_node *node)
{
node->pprev = NULL;
}
static inline void __sk_del_node(struct sock *sk)
{
__hlist_del(&sk->sk_node);
}
/* NB: equivalent to hlist_del_init_rcu */
static inline bool __sk_del_node_init(struct sock *sk)
{
if (sk_hashed(sk)) {
__sk_del_node(sk);
sk_node_init(&sk->sk_node);
return true;
}
return false;
}
/* Grab socket reference count. This operation is valid only
when sk is ALREADY grabbed f.e. it is found in hash table
or a list and the lookup is made under lock preventing hash table
modifications.
*/
static __always_inline void sock_hold(struct sock *sk)
{
refcount_inc(&sk->sk_refcnt);
}
/* Ungrab socket in the context, which assumes that socket refcnt
cannot hit zero, f.e. it is true in context of any socketcall.
*/
static __always_inline void __sock_put(struct sock *sk)
{
refcount_dec(&sk->sk_refcnt);
}
static inline bool sk_del_node_init(struct sock *sk)
{
bool rc = __sk_del_node_init(sk);
if (rc)
__sock_put(sk);
return rc;
}
#define sk_del_node_init_rcu(sk) sk_del_node_init(sk)
static inline bool __sk_nulls_del_node_init_rcu(struct sock *sk)
{
if (sk_hashed(sk)) {
hlist_nulls_del_init_rcu(&sk->sk_nulls_node);
return true;
}
return false;
}
static inline bool sk_nulls_del_node_init_rcu(struct sock *sk)
{
bool rc = __sk_nulls_del_node_init_rcu(sk);
if (rc)
__sock_put(sk);
return rc;
}
static inline bool sk_nulls_replace_node_init_rcu(struct sock *old,
struct sock *new)
{
if (sk_hashed(old)) {
hlist_nulls_replace_init_rcu(&old->sk_nulls_node,
&new->sk_nulls_node);
__sock_put(old);
return true;
}
return false;
}
static inline void __sk_add_node(struct sock *sk, struct hlist_head *list)
{
hlist_add_head(&sk->sk_node, list);
}
static inline void sk_add_node(struct sock *sk, struct hlist_head *list)
{
sock_hold(sk);
__sk_add_node(sk, list);
}
static inline void sk_add_node_rcu(struct sock *sk, struct hlist_head *list)
{
sock_hold(sk);
if (IS_ENABLED(CONFIG_IPV6) && sk->sk_reuseport &&
sk->sk_family == AF_INET6)
hlist_add_tail_rcu(&sk->sk_node, list);
else
hlist_add_head_rcu(&sk->sk_node, list);
}
static inline void sk_add_node_tail_rcu(struct sock *sk, struct hlist_head *list)
{
sock_hold(sk);
hlist_add_tail_rcu(&sk->sk_node, list);
}
static inline void __sk_nulls_add_node_rcu(struct sock *sk, struct hlist_nulls_head *list)
{
hlist_nulls_add_head_rcu(&sk->sk_nulls_node, list);
}
static inline void __sk_nulls_add_node_tail_rcu(struct sock *sk, struct hlist_nulls_head *list)
{
hlist_nulls_add_tail_rcu(&sk->sk_nulls_node, list);
}
static inline void sk_nulls_add_node_rcu(struct sock *sk, struct hlist_nulls_head *list)
{
sock_hold(sk);
__sk_nulls_add_node_rcu(sk, list);
}
static inline void __sk_del_bind_node(struct sock *sk)
{
__hlist_del(&sk->sk_bind_node);
}
static inline void sk_add_bind_node(struct sock *sk,
struct hlist_head *list)
{
hlist_add_head(&sk->sk_bind_node, list);
}
#define sk_for_each(__sk, list) \
hlist_for_each_entry(__sk, list, sk_node)
#define sk_for_each_rcu(__sk, list) \
hlist_for_each_entry_rcu(__sk, list, sk_node)
#define sk_nulls_for_each(__sk, node, list) \
hlist_nulls_for_each_entry(__sk, node, list, sk_nulls_node)
#define sk_nulls_for_each_rcu(__sk, node, list) \
hlist_nulls_for_each_entry_rcu(__sk, node, list, sk_nulls_node)
#define sk_for_each_from(__sk) \
hlist_for_each_entry_from(__sk, sk_node)
#define sk_nulls_for_each_from(__sk, node) \
if (__sk && ({ node = &(__sk)->sk_nulls_node; 1; })) \
hlist_nulls_for_each_entry_from(__sk, node, sk_nulls_node)
#define sk_for_each_safe(__sk, tmp, list) \
hlist_for_each_entry_safe(__sk, tmp, list, sk_node)
#define sk_for_each_bound(__sk, list) \
hlist_for_each_entry(__sk, list, sk_bind_node)
#define sk_for_each_bound_safe(__sk, tmp, list) \
hlist_for_each_entry_safe(__sk, tmp, list, sk_bind_node)
/**
* sk_for_each_entry_offset_rcu - iterate over a list at a given struct offset
* @tpos: the type * to use as a loop cursor.
* @pos: the &struct hlist_node to use as a loop cursor.
* @head: the head for your list.
* @offset: offset of hlist_node within the struct.
*
*/
#define sk_for_each_entry_offset_rcu(tpos, pos, head, offset) \
for (pos = rcu_dereference(hlist_first_rcu(head)); \
pos != NULL && \
({ tpos = (typeof(*tpos) *)((void *)pos - offset); 1;}); \
pos = rcu_dereference(hlist_next_rcu(pos)))
static inline struct user_namespace *sk_user_ns(const struct sock *sk)
{
/* Careful only use this in a context where these parameters
* can not change and must all be valid, such as recvmsg from
* userspace.
*/
return sk->sk_socket->file->f_cred->user_ns;
}
/* Sock flags */
enum sock_flags {
SOCK_DEAD,
SOCK_DONE,
SOCK_URGINLINE,
SOCK_KEEPOPEN,
SOCK_LINGER,
SOCK_DESTROY,
SOCK_BROADCAST,
SOCK_TIMESTAMP,
SOCK_ZAPPED,
SOCK_USE_WRITE_QUEUE, /* whether to call sk->sk_write_space in sock_wfree */
SOCK_DBG, /* %SO_DEBUG setting */
SOCK_RCVTSTAMP, /* %SO_TIMESTAMP setting */
SOCK_RCVTSTAMPNS, /* %SO_TIMESTAMPNS setting */
SOCK_LOCALROUTE, /* route locally only, %SO_DONTROUTE setting */
SOCK_MEMALLOC, /* VM depends on this socket for swapping */
SOCK_TIMESTAMPING_RX_SOFTWARE, /* %SOF_TIMESTAMPING_RX_SOFTWARE */
SOCK_FASYNC, /* fasync() active */
SOCK_RXQ_OVFL,
SOCK_ZEROCOPY, /* buffers from userspace */
SOCK_WIFI_STATUS, /* push wifi status to userspace */
SOCK_NOFCS, /* Tell NIC not to do the Ethernet FCS.
* Will use last 4 bytes of packet sent from
* user-space instead.
*/
SOCK_FILTER_LOCKED, /* Filter cannot be changed anymore */
SOCK_SELECT_ERR_QUEUE, /* Wake select on error queue */
SOCK_RCU_FREE, /* wait rcu grace period in sk_destruct() */
SOCK_TXTIME,
SOCK_XDP, /* XDP is attached */
SOCK_TSTAMP_NEW, /* Indicates 64 bit timestamps always */
SOCK_RCVMARK, /* Receive SO_MARK ancillary data with packet */
SOCK_RCVPRIORITY, /* Receive SO_PRIORITY ancillary data with packet */
SOCK_TIMESTAMPING_ANY, /* Copy of sk_tsflags & TSFLAGS_ANY */
};
#define SK_FLAGS_TIMESTAMP ((1UL << SOCK_TIMESTAMP) | (1UL << SOCK_TIMESTAMPING_RX_SOFTWARE))
/*
* The highest bit of sk_tsflags is reserved for kernel-internal
* SOCKCM_FLAG_TS_OPT_ID. There is a check in core/sock.c to control that
* SOF_TIMESTAMPING* values do not reach this reserved area
*/
#define SOCKCM_FLAG_TS_OPT_ID BIT(31)
static inline void sock_copy_flags(struct sock *nsk, const struct sock *osk)
{
nsk->sk_flags = osk->sk_flags;
}
static inline void sock_set_flag(struct sock *sk, enum sock_flags flag)
{
__set_bit(flag, &sk->sk_flags);
}
static inline void sock_reset_flag(struct sock *sk, enum sock_flags flag)
{
__clear_bit(flag, &sk->sk_flags);
}
static inline void sock_valbool_flag(struct sock *sk, enum sock_flags bit,
int valbool)
{
if (valbool)
sock_set_flag(sk, bit);
else
sock_reset_flag(sk, bit);
}
static inline bool sock_flag(const struct sock *sk, enum sock_flags flag)
{
return test_bit(flag, &sk->sk_flags);
}
#ifdef CONFIG_NET
DECLARE_STATIC_KEY_FALSE(memalloc_socks_key);
static inline int sk_memalloc_socks(void)
{
return static_branch_unlikely(&memalloc_socks_key);
}
void __receive_sock(struct file *file);
#else
static inline int sk_memalloc_socks(void)
{
return 0;
}
static inline void __receive_sock(struct file *file)
{ }
#endif
static inline gfp_t sk_gfp_mask(const struct sock *sk, gfp_t gfp_mask)
{
return gfp_mask | (sk->sk_allocation & __GFP_MEMALLOC);
}
static inline void sk_acceptq_removed(struct sock *sk)
{
WRITE_ONCE(sk->sk_ack_backlog, sk->sk_ack_backlog - 1);
}
static inline void sk_acceptq_added(struct sock *sk)
{
WRITE_ONCE(sk->sk_ack_backlog, sk->sk_ack_backlog + 1);
}
/* Note: If you think the test should be:
* return READ_ONCE(sk->sk_ack_backlog) >= READ_ONCE(sk->sk_max_ack_backlog);
* Then please take a look at commit 64a146513f8f ("[NET]: Revert incorrect accept queue backlog changes.")
*/
static inline bool sk_acceptq_is_full(const struct sock *sk)
{
return READ_ONCE(sk->sk_ack_backlog) > READ_ONCE(sk->sk_max_ack_backlog);
}
/*
* Compute minimal free write space needed to queue new packets.
*/
static inline int sk_stream_min_wspace(const struct sock *sk)
{
return READ_ONCE(sk->sk_wmem_queued) >> 1;
}
static inline int sk_stream_wspace(const struct sock *sk)
{
return READ_ONCE(sk->sk_sndbuf) - READ_ONCE(sk->sk_wmem_queued);
}
static inline void sk_wmem_queued_add(struct sock *sk, int val)
{
WRITE_ONCE(sk->sk_wmem_queued, sk->sk_wmem_queued + val);
}
static inline void sk_forward_alloc_add(struct sock *sk, int val)
{
/* Paired with lockless reads of sk->sk_forward_alloc */
WRITE_ONCE(sk->sk_forward_alloc, sk->sk_forward_alloc + val);
}
void sk_stream_write_space(struct sock *sk);
/* OOB backlog add */
static inline void __sk_add_backlog(struct sock *sk, struct sk_buff *skb)
{
/* dont let skb dst not refcounted, we are going to leave rcu lock */
skb_dst_force(skb);
if (!sk->sk_backlog.tail)
WRITE_ONCE(sk->sk_backlog.head, skb);
else
sk->sk_backlog.tail->next = skb;
WRITE_ONCE(sk->sk_backlog.tail, skb);
skb->next = NULL;
}
/*
* Take into account size of receive queue and backlog queue
* Do not take into account this skb truesize,
* to allow even a single big packet to come.
*/
static inline bool sk_rcvqueues_full(const struct sock *sk, unsigned int limit)
{
unsigned int qsize = sk->sk_backlog.len + atomic_read(&sk->sk_rmem_alloc);
return qsize > limit;
}
/* The per-socket spinlock must be held here. */
static inline __must_check int sk_add_backlog(struct sock *sk, struct sk_buff *skb,
unsigned int limit)
{
if (sk_rcvqueues_full(sk, limit))
return -ENOBUFS;
/*
* If the skb was allocated from pfmemalloc reserves, only
* allow SOCK_MEMALLOC sockets to use it as this socket is
* helping free memory
*/
if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC))
return -ENOMEM;
__sk_add_backlog(sk, skb);
sk->sk_backlog.len += skb->truesize;
return 0;
}
int __sk_backlog_rcv(struct sock *sk, struct sk_buff *skb);
INDIRECT_CALLABLE_DECLARE(int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb));
INDIRECT_CALLABLE_DECLARE(int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb));
static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
{
if (sk_memalloc_socks() && skb_pfmemalloc(skb))
return __sk_backlog_rcv(sk, skb);
return INDIRECT_CALL_INET(sk->sk_backlog_rcv,
tcp_v6_do_rcv,
tcp_v4_do_rcv,
sk, skb);
}
static inline void sk_incoming_cpu_update(struct sock *sk)
{
int cpu = raw_smp_processor_id();
if (unlikely(READ_ONCE(sk->sk_incoming_cpu) != cpu))
WRITE_ONCE(sk->sk_incoming_cpu, cpu);
}
static inline void sock_rps_save_rxhash(struct sock *sk,
const struct sk_buff *skb)
{
#ifdef CONFIG_RPS
/* The following WRITE_ONCE() is paired with the READ_ONCE()
* here, and another one in sock_rps_record_flow().
*/
if (unlikely(READ_ONCE(sk->sk_rxhash) != skb->hash))
WRITE_ONCE(sk->sk_rxhash, skb->hash);
#endif
}
static inline void sock_rps_reset_rxhash(struct sock *sk)
{
#ifdef CONFIG_RPS
/* Paired with READ_ONCE() in sock_rps_record_flow() */
WRITE_ONCE(sk->sk_rxhash, 0);
#endif
}
#define sk_wait_event(__sk, __timeo, __condition, __wait) \
({ int __rc, __dis = __sk->sk_disconnects; \
release_sock(__sk); \
__rc = __condition; \
if (!__rc) { \
*(__timeo) = wait_woken(__wait, \
TASK_INTERRUPTIBLE, \
*(__timeo)); \
} \
sched_annotate_sleep(); \
lock_sock(__sk); \
__rc = __dis == __sk->sk_disconnects ? __condition : -EPIPE; \
__rc; \
})
int sk_stream_wait_connect(struct sock *sk, long *timeo_p);
int sk_stream_wait_memory(struct sock *sk, long *timeo_p);
void sk_stream_wait_close(struct sock *sk, long timeo_p);
int sk_stream_error(struct sock *sk, int flags, int err);
void sk_stream_kill_queues(struct sock *sk);
void sk_set_memalloc(struct sock *sk);
void sk_clear_memalloc(struct sock *sk);
void __sk_flush_backlog(struct sock *sk);
static inline bool sk_flush_backlog(struct sock *sk)
{
if (unlikely(READ_ONCE(sk->sk_backlog.tail))) {
__sk_flush_backlog(sk);
return true;
}
return false;
}
int sk_wait_data(struct sock *sk, long *timeo, const struct sk_buff *skb);
struct request_sock_ops;
struct timewait_sock_ops;
struct inet_hashinfo;
struct raw_hashinfo;
struct smc_hashinfo;
struct module;
struct sk_psock;
/*
* caches using SLAB_TYPESAFE_BY_RCU should let .next pointer from nulls nodes
* un-modified. Special care is taken when initializing object to zero.
*/
static inline void sk_prot_clear_nulls(struct sock *sk, int size)
{
if (offsetof(struct sock, sk_node.next) != 0)
memset(sk, 0, offsetof(struct sock, sk_node.next));
memset(&sk->sk_node.pprev, 0,
size - offsetof(struct sock, sk_node.pprev));
}
struct proto_accept_arg {
int flags;
int err;
int is_empty;
bool kern;
};
/* Networking protocol blocks we attach to sockets.
* socket layer -> transport layer interface
*/
struct proto {
void (*close)(struct sock *sk,
long timeout);
int (*pre_connect)(struct sock *sk,
struct sockaddr_unsized *uaddr,
int addr_len);
int (*connect)(struct sock *sk,
struct sockaddr_unsized *uaddr,
int addr_len);
int (*disconnect)(struct sock *sk, int flags);
struct sock * (*accept)(struct sock *sk,
struct proto_accept_arg *arg);
int (*ioctl)(struct sock *sk, int cmd,
int *karg);
int (*init)(struct sock *sk);
void (*destroy)(struct sock *sk);
void (*shutdown)(struct sock *sk, int how);
int (*setsockopt)(struct sock *sk, int level,
int optname, sockptr_t optval,
unsigned int optlen);
int (*getsockopt)(struct sock *sk, int level,
int optname, char __user *optval,
int __user *option);
void (*keepalive)(struct sock *sk, int valbool);
#ifdef CONFIG_COMPAT
int (*compat_ioctl)(struct sock *sk,
unsigned int cmd, unsigned long arg);
#endif
int (*sendmsg)(struct sock *sk, struct msghdr *msg,
size_t len);
int (*recvmsg)(struct sock *sk, struct msghdr *msg,
size_t len, int flags, int *addr_len);
void (*splice_eof)(struct socket *sock);
int (*bind)(struct sock *sk,
struct sockaddr_unsized *addr, int addr_len);
int (*bind_add)(struct sock *sk,
struct sockaddr_unsized *addr, int addr_len);
int (*backlog_rcv) (struct sock *sk,
struct sk_buff *skb);
bool (*bpf_bypass_getsockopt)(int level,
int optname);
void (*release_cb)(struct sock *sk);
/* Keeping track of sk's, looking them up, and port selection methods. */
int (*hash)(struct sock *sk);
void (*unhash)(struct sock *sk);
void (*rehash)(struct sock *sk);
int (*get_port)(struct sock *sk, unsigned short snum);
void (*put_port)(struct sock *sk);
#ifdef CONFIG_BPF_SYSCALL
int (*psock_update_sk_prot)(struct sock *sk,
struct sk_psock *psock,
bool restore);
#endif
/* Keeping track of sockets in use */
#ifdef CONFIG_PROC_FS
unsigned int inuse_idx;
#endif
bool (*stream_memory_free)(const struct sock *sk, int wake);
bool (*sock_is_readable)(struct sock *sk);
/* Memory pressure */
void (*enter_memory_pressure)(struct sock *sk);
void (*leave_memory_pressure)(struct sock *sk);
atomic_long_t *memory_allocated; /* Current allocated memory. */
int __percpu *per_cpu_fw_alloc;
struct percpu_counter *sockets_allocated; /* Current number of sockets. */
/*
* Pressure flag: try to collapse.
* Technical note: it is used by multiple contexts non atomically.
* Make sure to use READ_ONCE()/WRITE_ONCE() for all reads/writes.
* All the __sk_mem_schedule() is of this nature: accounting
* is strict, actions are advisory and have some latency.
*/
unsigned long *memory_pressure;
long *sysctl_mem;
int *sysctl_wmem;
int *sysctl_rmem;
u32 sysctl_wmem_offset;
u32 sysctl_rmem_offset;
int max_header;
bool no_autobind;
struct kmem_cache *slab;
unsigned int obj_size;
unsigned int freeptr_offset;
unsigned int ipv6_pinfo_offset;
slab_flags_t slab_flags;
unsigned int useroffset; /* Usercopy region offset */
unsigned int usersize; /* Usercopy region size */
struct request_sock_ops *rsk_prot;
struct timewait_sock_ops *twsk_prot;
union {
struct inet_hashinfo *hashinfo;
struct udp_table *udp_table;
struct raw_hashinfo *raw_hash;
struct smc_hashinfo *smc_hash;
} h;
struct module *owner;
char name[32];
struct list_head node;
int (*diag_destroy)(struct sock *sk, int err);
} __randomize_layout;
int proto_register(struct proto *prot, int alloc_slab);
void proto_unregister(struct proto *prot);
int sock_load_diag_module(int family, int protocol);
INDIRECT_CALLABLE_DECLARE(bool tcp_stream_memory_free(const struct sock *sk, int wake));
static inline bool __sk_stream_memory_free(const struct sock *sk, int wake)
{
if (READ_ONCE(sk->sk_wmem_queued) >= READ_ONCE(sk->sk_sndbuf))
return false;
return sk->sk_prot->stream_memory_free ?
INDIRECT_CALL_INET_1(sk->sk_prot->stream_memory_free,
tcp_stream_memory_free, sk, wake) : true;
}
static inline bool sk_stream_memory_free(const struct sock *sk)
{
return __sk_stream_memory_free(sk, 0);
}
static inline bool __sk_stream_is_writeable(const struct sock *sk, int wake)
{
return sk_stream_wspace(sk) >= sk_stream_min_wspace(sk) &&
__sk_stream_memory_free(sk, wake);
}
static inline bool sk_stream_is_writeable(const struct sock *sk)
{
return __sk_stream_is_writeable(sk, 0);
}
static inline int sk_under_cgroup_hierarchy(struct sock *sk,
struct cgroup *ancestor)
{
#ifdef CONFIG_SOCK_CGROUP_DATA
return cgroup_is_descendant(sock_cgroup_ptr(&sk->sk_cgrp_data),
ancestor);
#else
return -ENOTSUPP;
#endif
}
#define SK_ALLOC_PERCPU_COUNTER_BATCH 16
static inline void sk_sockets_allocated_dec(struct sock *sk)
{
percpu_counter_add_batch(sk->sk_prot->sockets_allocated, -1,
SK_ALLOC_PERCPU_COUNTER_BATCH);
}
static inline void sk_sockets_allocated_inc(struct sock *sk)
{
percpu_counter_add_batch(sk->sk_prot->sockets_allocated, 1,
SK_ALLOC_PERCPU_COUNTER_BATCH);
}
static inline u64
sk_sockets_allocated_read_positive(struct sock *sk)
{
return percpu_counter_read_positive(sk->sk_prot->sockets_allocated);
}
static inline int
proto_sockets_allocated_sum_positive(struct proto *prot)
{
return percpu_counter_sum_positive(prot->sockets_allocated);
}
#ifdef CONFIG_PROC_FS
#define PROTO_INUSE_NR 64 /* should be enough for the first time */
struct prot_inuse {
int all;
int val[PROTO_INUSE_NR];
};
static inline void sock_prot_inuse_add(const struct net *net,
const struct proto *prot, int val)
{
this_cpu_add(net->core.prot_inuse->val[prot->inuse_idx], val);
}
static inline void sock_inuse_add(const struct net *net, int val)
{
this_cpu_add(net->core.prot_inuse->all, val);
}
int sock_prot_inuse_get(struct net *net, struct proto *proto);
int sock_inuse_get(struct net *net);
#else
static inline void sock_prot_inuse_add(const struct net *net,
const struct proto *prot, int val)
{
}
static inline void sock_inuse_add(const struct net *net, int val)
{
}
#endif
/* With per-bucket locks this operation is not-atomic, so that
* this version is not worse.
*/
static inline int __sk_prot_rehash(struct sock *sk)
{
sk->sk_prot->unhash(sk);
return sk->sk_prot->hash(sk);
}
/* About 10 seconds */
#define SOCK_DESTROY_TIME (10*HZ)
/* Sockets 0-1023 can't be bound to unless you are superuser */
#define PROT_SOCK 1024
#define SHUTDOWN_MASK 3
#define RCV_SHUTDOWN 1
#define SEND_SHUTDOWN 2
#define SOCK_BINDADDR_LOCK 4
#define SOCK_BINDPORT_LOCK 8
/**
* define SOCK_CONNECT_BIND - &sock->sk_userlocks flag for auto-bind at connect() time
*/
#define SOCK_CONNECT_BIND 16
struct socket_alloc {
struct socket socket;
struct inode vfs_inode;
};
static inline struct socket *SOCKET_I(struct inode *inode)
{
return &container_of(inode, struct socket_alloc, vfs_inode)->socket;
}
static inline struct inode *SOCK_INODE(struct socket *socket)
{
return &container_of(socket, struct socket_alloc, socket)->vfs_inode;
}
/*
* Functions for memory accounting
*/
int __sk_mem_raise_allocated(struct sock *sk, int size, int amt, int kind);
int __sk_mem_schedule(struct sock *sk, int size, int kind);
void __sk_mem_reduce_allocated(struct sock *sk, int amount);
void __sk_mem_reclaim(struct sock *sk, int amount);
#define SK_MEM_SEND 0
#define SK_MEM_RECV 1
/* sysctl_mem values are in pages */
static inline long sk_prot_mem_limits(const struct sock *sk, int index)
{
return READ_ONCE(sk->sk_prot->sysctl_mem[index]);
}
static inline int sk_mem_pages(int amt)
{
return (amt + PAGE_SIZE - 1) >> PAGE_SHIFT;
}
static inline bool sk_has_account(struct sock *sk)
{
/* return true if protocol supports memory accounting */
return !!sk->sk_prot->memory_allocated;
}
static inline bool sk_wmem_schedule(struct sock *sk, int size)
{
int delta;
if (!sk_has_account(sk))
return true;
delta = size - sk->sk_forward_alloc;
return delta <= 0 || __sk_mem_schedule(sk, delta, SK_MEM_SEND);
}
static inline bool
__sk_rmem_schedule(struct sock *sk, int size, bool pfmemalloc)
{
int delta;
if (!sk_has_account(sk))
return true;
delta = size - sk->sk_forward_alloc;
return delta <= 0 || __sk_mem_schedule(sk, delta, SK_MEM_RECV) ||
pfmemalloc;
}
static inline bool
sk_rmem_schedule(struct sock *sk, const struct sk_buff *skb, int size)
{
return __sk_rmem_schedule(sk, size, skb_pfmemalloc(skb));
}
static inline int sk_unused_reserved_mem(const struct sock *sk)
{
int unused_mem;
if (likely(!sk->sk_reserved_mem))
return 0;
unused_mem = sk->sk_reserved_mem - sk->sk_wmem_queued -
atomic_read(&sk->sk_rmem_alloc);
return unused_mem > 0 ? unused_mem : 0;
}
static inline void sk_mem_reclaim(struct sock *sk)
{
int reclaimable;
if (!sk_has_account(sk))
return;
reclaimable = sk->sk_forward_alloc - sk_unused_reserved_mem(sk);
if (reclaimable >= (int)PAGE_SIZE)
__sk_mem_reclaim(sk, reclaimable);
}
static inline void sk_mem_reclaim_final(struct sock *sk)
{
sk->sk_reserved_mem = 0;
sk_mem_reclaim(sk);
}
static inline void sk_mem_charge(struct sock *sk, int size)
{
if (!sk_has_account(sk))
return;
sk_forward_alloc_add(sk, -size);
}
static inline void sk_mem_uncharge(struct sock *sk, int size)
{
if (!sk_has_account(sk))
return;
sk_forward_alloc_add(sk, size);
sk_mem_reclaim(sk);
}
void __sk_charge(struct sock *sk, gfp_t gfp);
#if IS_ENABLED(CONFIG_PROVE_LOCKING) && IS_ENABLED(CONFIG_MODULES)
static inline void sk_owner_set(struct sock *sk, struct module *owner)
{
__module_get(owner);
sk->sk_owner = owner;
}
static inline void sk_owner_clear(struct sock *sk)
{
sk->sk_owner = NULL;
}
static inline void sk_owner_put(struct sock *sk)
{
module_put(sk->sk_owner);
}
#else
static inline void sk_owner_set(struct sock *sk, struct module *owner)
{
}
static inline void sk_owner_clear(struct sock *sk)
{
}
static inline void sk_owner_put(struct sock *sk)
{
}
#endif
/*
* Macro so as to not evaluate some arguments when
* lockdep is not enabled.
*
* Mark both the sk_lock and the sk_lock.slock as a
* per-address-family lock class.
*/
#define sock_lock_init_class_and_name(sk, sname, skey, name, key) \
do { \
sk_owner_set(sk, THIS_MODULE); \
sk->sk_lock.owned = 0; \
init_waitqueue_head(&sk->sk_lock.wq); \
spin_lock_init(&(sk)->sk_lock.slock); \
debug_check_no_locks_freed((void *)&(sk)->sk_lock, \
sizeof((sk)->sk_lock)); \
lockdep_set_class_and_name(&(sk)->sk_lock.slock, \
(skey), (sname)); \
lockdep_init_map(&(sk)->sk_lock.dep_map, (name), (key), 0); \
} while (0)
static inline bool lockdep_sock_is_held(const struct sock *sk)
{
return lockdep_is_held(&sk->sk_lock) ||
lockdep_is_held(&sk->sk_lock.slock);
}
void lock_sock_nested(struct sock *sk, int subclass);
static inline void lock_sock(struct sock *sk)
{
lock_sock_nested(sk, 0);
}
void __lock_sock(struct sock *sk);
void __release_sock(struct sock *sk);
void release_sock(struct sock *sk);
/* BH context may only use the following locking interface. */
#define bh_lock_sock(__sk) spin_lock(&((__sk)->sk_lock.slock))
#define bh_lock_sock_nested(__sk) \
spin_lock_nested(&((__sk)->sk_lock.slock), \
SINGLE_DEPTH_NESTING)
#define bh_unlock_sock(__sk) spin_unlock(&((__sk)->sk_lock.slock))
bool __lock_sock_fast(struct sock *sk) __acquires(&sk->sk_lock.slock);
/**
* lock_sock_fast - fast version of lock_sock
* @sk: socket
*
* This version should be used for very small section, where process won't block
* return false if fast path is taken:
*
* sk_lock.slock locked, owned = 0, BH disabled
*
* return true if slow path is taken:
*
* sk_lock.slock unlocked, owned = 1, BH enabled
*/
static inline bool lock_sock_fast(struct sock *sk)
{
/* The sk_lock has mutex_lock() semantics here. */
mutex_acquire(&sk->sk_lock.dep_map, 0, 0, _RET_IP_);
return __lock_sock_fast(sk);
}
/* fast socket lock variant for caller already holding a [different] socket lock */
static inline bool lock_sock_fast_nested(struct sock *sk)
{
mutex_acquire(&sk->sk_lock.dep_map, SINGLE_DEPTH_NESTING, 0, _RET_IP_);
return __lock_sock_fast(sk);
}
/**
* unlock_sock_fast - complement of lock_sock_fast
* @sk: socket
* @slow: slow mode
*
* fast unlock socket for user context.
* If slow mode is on, we call regular release_sock()
*/
static inline void unlock_sock_fast(struct sock *sk, bool slow)
__releases(&sk->sk_lock.slock)
{
if (slow) {
release_sock(sk);
__release(&sk->sk_lock.slock);
} else {
mutex_release(&sk->sk_lock.dep_map, _RET_IP_);
spin_unlock_bh(&sk->sk_lock.slock);
}
}
void sockopt_lock_sock(struct sock *sk);
void sockopt_release_sock(struct sock *sk);
bool sockopt_ns_capable(struct user_namespace *ns, int cap);
bool sockopt_capable(int cap);
/* Used by processes to "lock" a socket state, so that
* interrupts and bottom half handlers won't change it
* from under us. It essentially blocks any incoming
* packets, so that we won't get any new data or any
* packets that change the state of the socket.
*
* While locked, BH processing will add new packets to
* the backlog queue. This queue is processed by the
* owner of the socket lock right before it is released.
*
* Since ~2.3.5 it is also exclusive sleep lock serializing
* accesses from user process context.
*/
static inline void sock_owned_by_me(const struct sock *sk)
{
#ifdef CONFIG_LOCKDEP
WARN_ON_ONCE(!lockdep_sock_is_held(sk) && debug_locks);
#endif
}
static inline void sock_not_owned_by_me(const struct sock *sk)
{
#ifdef CONFIG_LOCKDEP
WARN_ON_ONCE(lockdep_sock_is_held(sk) && debug_locks);
#endif
}
static inline bool sock_owned_by_user(const struct sock *sk)
{
sock_owned_by_me(sk);
return sk->sk_lock.owned;
}
static inline bool sock_owned_by_user_nocheck(const struct sock *sk)
{
return sk->sk_lock.owned;
}
static inline void sock_release_ownership(struct sock *sk)
{
DEBUG_NET_WARN_ON_ONCE(!sock_owned_by_user_nocheck(sk));
sk->sk_lock.owned = 0;
/* The sk_lock has mutex_unlock() semantics: */
mutex_release(&sk->sk_lock.dep_map, _RET_IP_);
}
/* no reclassification while locks are held */
static inline bool sock_allow_reclassification(const struct sock *csk)
{
struct sock *sk = (struct sock *)csk;
return !sock_owned_by_user_nocheck(sk) &&
!spin_is_locked(&sk->sk_lock.slock);
}
struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
struct proto *prot, int kern);
void sk_free(struct sock *sk);
void sk_net_refcnt_upgrade(struct sock *sk);
void sk_destruct(struct sock *sk);
struct sock *sk_clone(const struct sock *sk, const gfp_t priority, bool lock);
static inline struct sock *sk_clone_lock(const struct sock *sk, const gfp_t priority)
{
return sk_clone(sk, priority, true);
}
struct sk_buff *sock_wmalloc(struct sock *sk, unsigned long size, int force,
gfp_t priority);
void __sock_wfree(struct sk_buff *skb);
void sock_wfree(struct sk_buff *skb);
struct sk_buff *sock_omalloc(struct sock *sk, unsigned long size,
gfp_t priority);
void skb_orphan_partial(struct sk_buff *skb);
void sock_rfree(struct sk_buff *skb);
void sock_efree(struct sk_buff *skb);
#ifdef CONFIG_INET
void sock_edemux(struct sk_buff *skb);
void sock_pfree(struct sk_buff *skb);
static inline void skb_set_owner_edemux(struct sk_buff *skb, struct sock *sk)
{
skb_orphan(skb);
if (refcount_inc_not_zero(&sk->sk_refcnt)) {
skb->sk = sk;
skb->destructor = sock_edemux;
}
}
#else
#define sock_edemux sock_efree
#endif
int sk_setsockopt(struct sock *sk, int level, int optname,
sockptr_t optval, unsigned int optlen);
int sock_setsockopt(struct socket *sock, int level, int op,
sockptr_t optval, unsigned int optlen);
int do_sock_setsockopt(struct socket *sock, bool compat, int level,
int optname, sockptr_t optval, int optlen);
int do_sock_getsockopt(struct socket *sock, bool compat, int level,
int optname, sockptr_t optval, sockptr_t optlen);
int sk_getsockopt(struct sock *sk, int level, int optname,
sockptr_t optval, sockptr_t optlen);
int sock_gettstamp(struct socket *sock, void __user *userstamp,
bool timeval, bool time32);
struct sk_buff *sock_alloc_send_pskb(struct sock *sk, unsigned long header_len,
unsigned long data_len, int noblock,
int *errcode, int max_page_order);
static inline struct sk_buff *sock_alloc_send_skb(struct sock *sk,
unsigned long size,
int noblock, int *errcode)
{
return sock_alloc_send_pskb(sk, size, 0, noblock, errcode, 0);
}
void *sock_kmalloc(struct sock *sk, int size, gfp_t priority);
void *sock_kmemdup(struct sock *sk, const void *src,
int size, gfp_t priority);
void sock_kfree_s(struct sock *sk, void *mem, int size);
void sock_kzfree_s(struct sock *sk, void *mem, int size);
void sk_send_sigurg(struct sock *sk);
static inline void sock_replace_proto(struct sock *sk, struct proto *proto)
{
if (sk->sk_socket)
clear_bit(SOCK_SUPPORT_ZC, &sk->sk_socket->flags);
WRITE_ONCE(sk->sk_prot, proto);
}
struct sockcm_cookie {
u64 transmit_time;
u32 mark;
u32 tsflags;
u32 ts_opt_id;
u32 priority;
u32 dmabuf_id;
};
static inline void sockcm_init(struct sockcm_cookie *sockc,
const struct sock *sk)
{
*sockc = (struct sockcm_cookie) {
.mark = READ_ONCE(sk->sk_mark),
.tsflags = READ_ONCE(sk->sk_tsflags),
.priority = READ_ONCE(sk->sk_priority),
};
}
int __sock_cmsg_send(struct sock *sk, struct cmsghdr *cmsg,
struct sockcm_cookie *sockc);
int sock_cmsg_send(struct sock *sk, struct msghdr *msg,
struct sockcm_cookie *sockc);
/*
* Functions to fill in entries in struct proto_ops when a protocol
* does not implement a particular function.
*/
int sock_no_bind(struct socket *sock, struct sockaddr_unsized *saddr, int len);
int sock_no_connect(struct socket *sock, struct sockaddr_unsized *saddr, int len, int flags);
int sock_no_socketpair(struct socket *, struct socket *);
int sock_no_accept(struct socket *, struct socket *, struct proto_accept_arg *);
int sock_no_getname(struct socket *, struct sockaddr *, int);
int sock_no_ioctl(struct socket *, unsigned int, unsigned long);
int sock_no_listen(struct socket *, int);
int sock_no_shutdown(struct socket *, int);
int sock_no_sendmsg(struct socket *, struct msghdr *, size_t);
int sock_no_sendmsg_locked(struct sock *sk, struct msghdr *msg, size_t len);
int sock_no_recvmsg(struct socket *, struct msghdr *, size_t, int);
int sock_no_mmap(struct file *file, struct socket *sock,
struct vm_area_struct *vma);
/*
* Functions to fill in entries in struct proto_ops when a protocol
* uses the inet style.
*/
int sock_common_getsockopt(struct socket *sock, int level, int optname,
char __user *optval, int __user *optlen);
int sock_common_recvmsg(struct socket *sock, struct msghdr *msg, size_t size,
int flags);
int sock_common_setsockopt(struct socket *sock, int level, int optname,
sockptr_t optval, unsigned int optlen);
void sk_common_release(struct sock *sk);
/*
* Default socket callbacks and setup code
*/
/* Initialise core socket variables using an explicit uid. */
void sock_init_data_uid(struct socket *sock, struct sock *sk, kuid_t uid);
/* Initialise core socket variables.
* Assumes struct socket *sock is embedded in a struct socket_alloc.
*/
void sock_init_data(struct socket *sock, struct sock *sk);
/*
* Socket reference counting postulates.
*
* * Each user of socket SHOULD hold a reference count.
* * Each access point to socket (an hash table bucket, reference from a list,
* running timer, skb in flight MUST hold a reference count.
* * When reference count hits 0, it means it will never increase back.
* * When reference count hits 0, it means that no references from
* outside exist to this socket and current process on current CPU
* is last user and may/should destroy this socket.
* * sk_free is called from any context: process, BH, IRQ. When
* it is called, socket has no references from outside -> sk_free
* may release descendant resources allocated by the socket, but
* to the time when it is called, socket is NOT referenced by any
* hash tables, lists etc.
* * Packets, delivered from outside (from network or from another process)
* and enqueued on receive/error queues SHOULD NOT grab reference count,
* when they sit in queue. Otherwise, packets will leak to hole, when
* socket is looked up by one cpu and unhasing is made by another CPU.
* It is true for udp/raw, netlink (leak to receive and error queues), tcp
* (leak to backlog). Packet socket does all the processing inside
* BR_NETPROTO_LOCK, so that it has not this race condition. UNIX sockets
* use separate SMP lock, so that they are prone too.
*/
/* Ungrab socket and destroy it, if it was the last reference. */
static inline void sock_put(struct sock *sk)
{
if (refcount_dec_and_test(&sk->sk_refcnt))
sk_free(sk);
}
/* Generic version of sock_put(), dealing with all sockets
* (TCP_TIMEWAIT, TCP_NEW_SYN_RECV, ESTABLISHED...)
*/
void sock_gen_put(struct sock *sk);
int __sk_receive_skb(struct sock *sk, struct sk_buff *skb, const int nested,
unsigned int trim_cap, bool refcounted);
static inline int sk_receive_skb(struct sock *sk, struct sk_buff *skb,
const int nested)
{
return __sk_receive_skb(sk, skb, nested, 1, true);
}
static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
{
/* sk_tx_queue_mapping accept only upto a 16-bit value */
if (WARN_ON_ONCE((unsigned short)tx_queue >= USHRT_MAX))
return;
/* Paired with READ_ONCE() in sk_tx_queue_get() and
* other WRITE_ONCE() because socket lock might be not held.
*/
if (READ_ONCE(sk->sk_tx_queue_mapping) != tx_queue) {
WRITE_ONCE(sk->sk_tx_queue_mapping, tx_queue);
WRITE_ONCE(sk->sk_tx_queue_mapping_jiffies, jiffies);
return;
}
/* Refresh sk_tx_queue_mapping_jiffies if too old. */
if (time_is_before_jiffies(READ_ONCE(sk->sk_tx_queue_mapping_jiffies) + HZ))
WRITE_ONCE(sk->sk_tx_queue_mapping_jiffies, jiffies);
}
#define NO_QUEUE_MAPPING USHRT_MAX
static inline void sk_tx_queue_clear(struct sock *sk)
{
/* Paired with READ_ONCE() in sk_tx_queue_get() and
* other WRITE_ONCE() because socket lock might be not held.
*/
WRITE_ONCE(sk->sk_tx_queue_mapping, NO_QUEUE_MAPPING);
}
int sk_tx_queue_get(const struct sock *sk);
static inline void __sk_rx_queue_set(struct sock *sk,
const struct sk_buff *skb,
bool force_set)
{
#ifdef CONFIG_SOCK_RX_QUEUE_MAPPING
if (skb_rx_queue_recorded(skb)) {
u16 rx_queue = skb_get_rx_queue(skb);
if (force_set ||
unlikely(READ_ONCE(sk->sk_rx_queue_mapping) != rx_queue))
WRITE_ONCE(sk->sk_rx_queue_mapping, rx_queue);
}
#endif
}
static inline void sk_rx_queue_set(struct sock *sk, const struct sk_buff *skb)
{
__sk_rx_queue_set(sk, skb, true);
}
static inline void sk_rx_queue_update(struct sock *sk, const struct sk_buff *skb)
{
__sk_rx_queue_set(sk, skb, false);
}
static inline void sk_rx_queue_clear(struct sock *sk)
{
#ifdef CONFIG_SOCK_RX_QUEUE_MAPPING
WRITE_ONCE(sk->sk_rx_queue_mapping, NO_QUEUE_MAPPING);
#endif
}
static inline int sk_rx_queue_get(const struct sock *sk)
{
#ifdef CONFIG_SOCK_RX_QUEUE_MAPPING
if (sk) {
int res = READ_ONCE(sk->sk_rx_queue_mapping);
if (res != NO_QUEUE_MAPPING)
return res;
}
#endif
return -1;
}
static inline void sk_set_socket(struct sock *sk, struct socket *sock)
{
sk->sk_socket = sock;
if (sock) {
WRITE_ONCE(sk->sk_uid, SOCK_INODE(sock)->i_uid);
WRITE_ONCE(sk->sk_ino, SOCK_INODE(sock)->i_ino);
} else {
/* Note: sk_uid is unchanged. */
WRITE_ONCE(sk->sk_ino, 0);
}
}
static inline wait_queue_head_t *sk_sleep(struct sock *sk)
{
BUILD_BUG_ON(offsetof(struct socket_wq, wait) != 0);
return &rcu_dereference_raw(sk->sk_wq)->wait;
}
/* Detach socket from process context.
* Announce socket dead, detach it from wait queue and inode.
* Note that parent inode held reference count on this struct sock,
* we do not release it in this function, because protocol
* probably wants some additional cleanups or even continuing
* to work with this socket (TCP).
*/
static inline void sock_orphan(struct sock *sk)
{
write_lock_bh(&sk->sk_callback_lock);
sock_set_flag(sk, SOCK_DEAD);
sk_set_socket(sk, NULL);
sk->sk_wq = NULL;
write_unlock_bh(&sk->sk_callback_lock);
}
static inline void sock_graft(struct sock *sk, struct socket *parent)
{
WARN_ON(parent->sk);
write_lock_bh(&sk->sk_callback_lock);
rcu_assign_pointer(sk->sk_wq, &parent->wq);
parent->sk = sk;
sk_set_socket(sk, parent);
security_sock_graft(sk, parent);
write_unlock_bh(&sk->sk_callback_lock);
}
static inline unsigned long sock_i_ino(const struct sock *sk)
{
/* Paired with WRITE_ONCE() in sock_graft() and sock_orphan() */
return READ_ONCE(sk->sk_ino);
}
static inline kuid_t sk_uid(const struct sock *sk)
{
/* Paired with WRITE_ONCE() in sockfs_setattr() */
return READ_ONCE(sk->sk_uid);
}
static inline kuid_t sock_net_uid(const struct net *net, const struct sock *sk)
{
return sk ? sk_uid(sk) : make_kuid(net->user_ns, 0);
}
static inline u32 net_tx_rndhash(void)
{
u32 v = get_random_u32();
return v ?: 1;
}
static inline void sk_set_txhash(struct sock *sk)
{
/* This pairs with READ_ONCE() in skb_set_hash_from_sk() */
WRITE_ONCE(sk->sk_txhash, net_tx_rndhash());
}
static inline bool sk_rethink_txhash(struct sock *sk)
{
if (sk->sk_txhash && sk->sk_txrehash == SOCK_TXREHASH_ENABLED) {
sk_set_txhash(sk);
return true;
}
return false;
}
static inline struct dst_entry *
__sk_dst_get(const struct sock *sk)
{
return rcu_dereference_check(sk->sk_dst_cache,
lockdep_sock_is_held(sk));
}
static inline struct dst_entry *
sk_dst_get(const struct sock *sk)
{
struct dst_entry *dst;
rcu_read_lock();
dst = rcu_dereference(sk->sk_dst_cache);
if (dst && !rcuref_get(&dst->__rcuref))
dst = NULL;
rcu_read_unlock();
return dst;
}
static inline void __dst_negative_advice(struct sock *sk)
{
struct dst_entry *dst = __sk_dst_get(sk);
if (dst && dst->ops->negative_advice)
dst->ops->negative_advice(sk, dst);
}
static inline void dst_negative_advice(struct sock *sk)
{
sk_rethink_txhash(sk);
__dst_negative_advice(sk);
}
static inline void
__sk_dst_set(struct sock *sk, struct dst_entry *dst)
{
struct dst_entry *old_dst;
sk_tx_queue_clear(sk);
WRITE_ONCE(sk->sk_dst_pending_confirm, 0);
old_dst = rcu_dereference_protected(sk->sk_dst_cache,
lockdep_sock_is_held(sk));
rcu_assign_pointer(sk->sk_dst_cache, dst);
dst_release(old_dst);
}
static inline void
sk_dst_set(struct sock *sk, struct dst_entry *dst)
{
struct dst_entry *old_dst;
sk_tx_queue_clear(sk);
WRITE_ONCE(sk->sk_dst_pending_confirm, 0);
old_dst = unrcu_pointer(xchg(&sk->sk_dst_cache, RCU_INITIALIZER(dst)));
dst_release(old_dst);
}
static inline void
__sk_dst_reset(struct sock *sk)
{
__sk_dst_set(sk, NULL);
}
static inline void
sk_dst_reset(struct sock *sk)
{
sk_dst_set(sk, NULL);
}
struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie);
struct dst_entry *sk_dst_check(struct sock *sk, u32 cookie);
static inline void sk_dst_confirm(struct sock *sk)
{
if (!READ_ONCE(sk->sk_dst_pending_confirm))
WRITE_ONCE(sk->sk_dst_pending_confirm, 1);
}
static inline void sock_confirm_neigh(struct sk_buff *skb, struct neighbour *n)
{
if (skb_get_dst_pending_confirm(skb)) {
struct sock *sk = skb->sk;
if (sk && READ_ONCE(sk->sk_dst_pending_confirm))
WRITE_ONCE(sk->sk_dst_pending_confirm, 0);
neigh_confirm(n);
}
}
bool sk_mc_loop(const struct sock *sk);
static inline bool sk_can_gso(const struct sock *sk)
{
return net_gso_ok(sk->sk_route_caps, sk->sk_gso_type);
}
void sk_setup_caps(struct sock *sk, struct dst_entry *dst);
static inline void sk_gso_disable(struct sock *sk)
{
sk->sk_gso_disabled = 1;
sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
}
static inline int skb_do_copy_data_nocache(struct sock *sk, struct sk_buff *skb,
struct iov_iter *from, char *to,
int copy, int offset)
{
if (skb->ip_summed == CHECKSUM_NONE) {
__wsum csum = 0;
if (!csum_and_copy_from_iter_full(to, copy, &csum, from))
return -EFAULT;
skb->csum = csum_block_add(skb->csum, csum, offset);
} else if (sk->sk_route_caps & NETIF_F_NOCACHE_COPY) {
if (!copy_from_iter_full_nocache(to, copy, from))
return -EFAULT;
} else if (!copy_from_iter_full(to, copy, from))
return -EFAULT;
return 0;
}
static inline int skb_add_data_nocache(struct sock *sk, struct sk_buff *skb,
struct iov_iter *from, int copy)
{
int err, offset = skb->len;
err = skb_do_copy_data_nocache(sk, skb, from, skb_put(skb, copy),
copy, offset);
if (err)
__skb_trim(skb, offset);
return err;
}
static inline int skb_copy_to_page_nocache(struct sock *sk, struct iov_iter *from,
struct sk_buff *skb,
struct page *page,
int off, int copy)
{
int err;
err = skb_do_copy_data_nocache(sk, skb, from, page_address(page) + off,
copy, skb->len);
if (err)
return err;
skb_len_add(skb, copy);
sk_wmem_queued_add(sk, copy);
sk_mem_charge(sk, copy);
return 0;
}
#define SK_WMEM_ALLOC_BIAS 1
/**
* sk_wmem_alloc_get - returns write allocations
* @sk: socket
*
* Return: sk_wmem_alloc minus initial offset of one
*/
static inline int sk_wmem_alloc_get(const struct sock *sk)
{
return refcount_read(&sk->sk_wmem_alloc) - SK_WMEM_ALLOC_BIAS;
}
/**
* sk_rmem_alloc_get - returns read allocations
* @sk: socket
*
* Return: sk_rmem_alloc
*/
static inline int sk_rmem_alloc_get(const struct sock *sk)
{
return atomic_read(&sk->sk_rmem_alloc);
}
/**
* sk_has_allocations - check if allocations are outstanding
* @sk: socket
*
* Return: true if socket has write or read allocations
*/
static inline bool sk_has_allocations(const struct sock *sk)
{
return sk_wmem_alloc_get(sk) || sk_rmem_alloc_get(sk);
}
/**
* skwq_has_sleeper - check if there are any waiting processes
* @wq: struct socket_wq
*
* Return: true if socket_wq has waiting processes
*
* The purpose of the skwq_has_sleeper and sock_poll_wait is to wrap the memory
* barrier call. They were added due to the race found within the tcp code.
*
* Consider following tcp code paths::
*
* CPU1 CPU2
* sys_select receive packet
* ... ...
* __add_wait_queue update tp->rcv_nxt
* ... ...
* tp->rcv_nxt check sock_def_readable
* ... {
* schedule rcu_read_lock();
* wq = rcu_dereference(sk->sk_wq);
* if (wq && waitqueue_active(&wq->wait))
* wake_up_interruptible(&wq->wait)
* ...
* }
*
* The race for tcp fires when the __add_wait_queue changes done by CPU1 stay
* in its cache, and so does the tp->rcv_nxt update on CPU2 side. The CPU1
* could then endup calling schedule and sleep forever if there are no more
* data on the socket.
*
*/
static inline bool skwq_has_sleeper(struct socket_wq *wq)
{
return wq && wq_has_sleeper(&wq->wait);
}
/**
* sock_poll_wait - wrapper for the poll_wait call.
* @filp: file
* @sock: socket to wait on
* @p: poll_table
*
* See the comments in the wq_has_sleeper function.
*/
static inline void sock_poll_wait(struct file *filp, struct socket *sock,
poll_table *p)
{
/* Provides a barrier we need to be sure we are in sync
* with the socket flags modification.
*
* This memory barrier is paired in the wq_has_sleeper.
*/
poll_wait(filp, &sock->wq.wait, p);
}
static inline void skb_set_hash_from_sk(struct sk_buff *skb, struct sock *sk)
{
/* This pairs with WRITE_ONCE() in sk_set_txhash() */
u32 txhash = READ_ONCE(sk->sk_txhash);
if (txhash) {
skb->l4_hash = 1;
skb->hash = txhash;
}
}
void skb_set_owner_w(struct sk_buff *skb, struct sock *sk);
/*
* Queue a received datagram if it will fit. Stream and sequenced
* protocols can't normally use this as they need to fit buffers in
* and play with them.
*
* Inlined as it's very short and called for pretty much every
* packet ever received.
*/
static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
{
skb_orphan(skb);
skb->sk = sk;
skb->destructor = sock_rfree;
atomic_add(skb->truesize, &sk->sk_rmem_alloc);
sk_mem_charge(sk, skb->truesize);
}
static inline __must_check bool skb_set_owner_sk_safe(struct sk_buff *skb, struct sock *sk)
{
if (sk && refcount_inc_not_zero(&sk->sk_refcnt)) {
skb_orphan(skb);
skb->destructor = sock_efree;
skb->sk = sk;
return true;
}
return false;
}
static inline struct sk_buff *skb_clone_and_charge_r(struct sk_buff *skb, struct sock *sk)
{
skb = skb_clone(skb, sk_gfp_mask(sk, GFP_ATOMIC));
if (skb) {
if (sk_rmem_schedule(sk, skb, skb->truesize)) {
skb_set_owner_r(skb, sk);
return skb;
}
__kfree_skb(skb);
}
return NULL;
}
static inline void skb_prepare_for_gro(struct sk_buff *skb)
{
if (skb->destructor != sock_wfree) {
skb_orphan(skb);
return;
}
skb->slow_gro = 1;
}
void sk_reset_timer(struct sock *sk, struct timer_list *timer,
unsigned long expires);
void sk_stop_timer(struct sock *sk, struct timer_list *timer);
void sk_stop_timer_sync(struct sock *sk, struct timer_list *timer);
int __sk_queue_drop_skb(struct sock *sk, struct sk_buff_head *sk_queue,
struct sk_buff *skb, unsigned int flags,
void (*destructor)(struct sock *sk,
struct sk_buff *skb));
int __sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb);
int sock_queue_rcv_skb_reason(struct sock *sk, struct sk_buff *skb,
enum skb_drop_reason *reason);
static inline int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
{
return sock_queue_rcv_skb_reason(sk, skb, NULL);
}
int sock_queue_err_skb(struct sock *sk, struct sk_buff *skb);
struct sk_buff *sock_dequeue_err_skb(struct sock *sk);
/*
* Recover an error report and clear atomically
*/
static inline int sock_error(struct sock *sk)
{
int err;
/* Avoid an atomic operation for the common case.
* This is racy since another cpu/thread can change sk_err under us.
*/
if (likely(data_race(!sk->sk_err)))
return 0;
err = xchg(&sk->sk_err, 0);
return -err;
}
void sk_error_report(struct sock *sk);
static inline unsigned long sock_wspace(struct sock *sk)
{
int amt = 0;
if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
amt = sk->sk_sndbuf - refcount_read(&sk->sk_wmem_alloc);
if (amt < 0)
amt = 0;
}
return amt;
}
/* Note:
* We use sk->sk_wq_raw, from contexts knowing this
* pointer is not NULL and cannot disappear/change.
*/
static inline void sk_set_bit(int nr, struct sock *sk)
{
if ((nr == SOCKWQ_ASYNC_NOSPACE || nr == SOCKWQ_ASYNC_WAITDATA) &&
!sock_flag(sk, SOCK_FASYNC))
return;
set_bit(nr, &sk->sk_wq_raw->flags);
}
static inline void sk_clear_bit(int nr, struct sock *sk)
{
if ((nr == SOCKWQ_ASYNC_NOSPACE || nr == SOCKWQ_ASYNC_WAITDATA) &&
!sock_flag(sk, SOCK_FASYNC))
return;
clear_bit(nr, &sk->sk_wq_raw->flags);
}
static inline void sk_wake_async(const struct sock *sk, int how, int band)
{
if (sock_flag(sk, SOCK_FASYNC)) {
rcu_read_lock();
sock_wake_async(rcu_dereference(sk->sk_wq), how, band);
rcu_read_unlock();
}
}
static inline void sk_wake_async_rcu(const struct sock *sk, int how, int band)
{
if (unlikely(sock_flag(sk, SOCK_FASYNC)))
sock_wake_async(rcu_dereference(sk->sk_wq), how, band);
}
/* Since sk_{r,w}mem_alloc sums skb->truesize, even a small frame might
* need sizeof(sk_buff) + MTU + padding, unless net driver perform copybreak.
* Note: for send buffers, TCP works better if we can build two skbs at
* minimum.
*/
#define TCP_SKB_MIN_TRUESIZE (2048 + SKB_DATA_ALIGN(sizeof(struct sk_buff)))
#define SOCK_MIN_SNDBUF (TCP_SKB_MIN_TRUESIZE * 2)
#define SOCK_MIN_RCVBUF TCP_SKB_MIN_TRUESIZE
static inline void sk_stream_moderate_sndbuf(struct sock *sk)
{
u32 val;
if (sk->sk_userlocks & SOCK_SNDBUF_LOCK)
return;
val = min(sk->sk_sndbuf, sk->sk_wmem_queued >> 1);
val = max_t(u32, val, sk_unused_reserved_mem(sk));
WRITE_ONCE(sk->sk_sndbuf, max_t(u32, val, SOCK_MIN_SNDBUF));
}
/**
* sk_page_frag - return an appropriate page_frag
* @sk: socket
*
* Use the per task page_frag instead of the per socket one for
* optimization when we know that we're in process context and own
* everything that's associated with %current.
*
* Both direct reclaim and page faults can nest inside other
* socket operations and end up recursing into sk_page_frag()
* while it's already in use: explicitly avoid task page_frag
* when users disable sk_use_task_frag.
*
* Return: a per task page_frag if context allows that,
* otherwise a per socket one.
*/
static inline struct page_frag *sk_page_frag(struct sock *sk)
{
if (sk->sk_use_task_frag)
return ¤t->task_frag;
return &sk->sk_frag;
}
bool sk_page_frag_refill(struct sock *sk, struct page_frag *pfrag);
static inline bool __sock_writeable(const struct sock *sk, int wmem_alloc)
{
return wmem_alloc < (READ_ONCE(sk->sk_sndbuf) >> 1);
}
/*
* Default write policy as shown to user space via poll/select/SIGIO
*/
static inline bool sock_writeable(const struct sock *sk)
{
return __sock_writeable(sk, refcount_read(&sk->sk_wmem_alloc));
}
static inline gfp_t gfp_any(void)
{
return in_softirq() ? GFP_ATOMIC : GFP_KERNEL;
}
static inline gfp_t gfp_memcg_charge(void)
{
return in_softirq() ? GFP_ATOMIC : GFP_KERNEL;
}
#ifdef CONFIG_MEMCG
static inline struct mem_cgroup *mem_cgroup_from_sk(const struct sock *sk)
{
return sk->sk_memcg;
}
static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
{
return mem_cgroup_sockets_enabled && mem_cgroup_from_sk(sk);
}
static inline bool mem_cgroup_sk_under_memory_pressure(const struct sock *sk)
{
struct mem_cgroup *memcg = mem_cgroup_from_sk(sk);
#ifdef CONFIG_MEMCG_V1
if (!cgroup_subsys_on_dfl(memory_cgrp_subsys))
return !!memcg->tcpmem_pressure;
#endif /* CONFIG_MEMCG_V1 */
do {
if (time_before64(get_jiffies_64(),
mem_cgroup_get_socket_pressure(memcg))) {
memcg_memory_event(mem_cgroup_from_sk(sk),
MEMCG_SOCK_THROTTLED);
return true;
}
} while ((memcg = parent_mem_cgroup(memcg)));
return false;
}
#else
static inline struct mem_cgroup *mem_cgroup_from_sk(const struct sock *sk)
{
return NULL;
}
static inline bool mem_cgroup_sk_enabled(const struct sock *sk)
{
return false;
}
static inline bool mem_cgroup_sk_under_memory_pressure(const struct sock *sk)
{
return false;
}
#endif
static inline long sock_rcvtimeo(const struct sock *sk, bool noblock)
{
return noblock ? 0 : READ_ONCE(sk->sk_rcvtimeo);
}
static inline long sock_sndtimeo(const struct sock *sk, bool noblock)
{
return noblock ? 0 : READ_ONCE(sk->sk_sndtimeo);
}
static inline int sock_rcvlowat(const struct sock *sk, int waitall, int len)
{
int v = waitall ? len : min_t(int, READ_ONCE(sk->sk_rcvlowat), len);
return v ?: 1;
}
/* Alas, with timeout socket operations are not restartable.
* Compare this to poll().
*/
static inline int sock_intr_errno(long timeo)
{
return timeo == MAX_SCHEDULE_TIMEOUT ? -ERESTARTSYS : -EINTR;
}
struct sock_skb_cb {
u32 dropcount;
};
/* Store sock_skb_cb at the end of skb->cb[] so protocol families
* using skb->cb[] would keep using it directly and utilize its
* alignment guarantee.
*/
#define SOCK_SKB_CB_OFFSET (sizeof_field(struct sk_buff, cb) - \
sizeof(struct sock_skb_cb))
#define SOCK_SKB_CB(__skb) ((struct sock_skb_cb *)((__skb)->cb + \
SOCK_SKB_CB_OFFSET))
#define sock_skb_cb_check_size(size) \
BUILD_BUG_ON((size) > SOCK_SKB_CB_OFFSET)
static inline void sk_drops_add(struct sock *sk, int segs)
{
struct numa_drop_counters *ndc = sk->sk_drop_counters;
if (ndc)
numa_drop_add(ndc, segs);
else
atomic_add(segs, &sk->sk_drops);
}
static inline void sk_drops_inc(struct sock *sk)
{
sk_drops_add(sk, 1);
}
static inline int sk_drops_read(const struct sock *sk)
{
const struct numa_drop_counters *ndc = sk->sk_drop_counters;
if (ndc) {
DEBUG_NET_WARN_ON_ONCE(atomic_read(&sk->sk_drops));
return numa_drop_read(ndc);
}
return atomic_read(&sk->sk_drops);
}
static inline void sk_drops_reset(struct sock *sk)
{
struct numa_drop_counters *ndc = sk->sk_drop_counters;
if (ndc)
numa_drop_reset(ndc);
atomic_set(&sk->sk_drops, 0);
}
static inline void
sock_skb_set_dropcount(const struct sock *sk, struct sk_buff *skb)
{
SOCK_SKB_CB(skb)->dropcount = sock_flag(sk, SOCK_RXQ_OVFL) ?
sk_drops_read(sk) : 0;
}
static inline void sk_drops_skbadd(struct sock *sk, const struct sk_buff *skb)
{
int segs = max_t(u16, 1, skb_shinfo(skb)->gso_segs);
sk_drops_add(sk, segs);
}
static inline ktime_t sock_read_timestamp(struct sock *sk)
{
#if BITS_PER_LONG==32
unsigned int seq;
ktime_t kt;
do {
seq = read_seqbegin(&sk->sk_stamp_seq);
kt = sk->sk_stamp;
} while (read_seqretry(&sk->sk_stamp_seq, seq));
return kt;
#else
return READ_ONCE(sk->sk_stamp);
#endif
}
static inline void sock_write_timestamp(struct sock *sk, ktime_t kt)
{
#if BITS_PER_LONG==32
write_seqlock(&sk->sk_stamp_seq);
sk->sk_stamp = kt;
write_sequnlock(&sk->sk_stamp_seq);
#else
WRITE_ONCE(sk->sk_stamp, kt);
#endif
}
void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
struct sk_buff *skb);
void __sock_recv_wifi_status(struct msghdr *msg, struct sock *sk,
struct sk_buff *skb);
bool skb_has_tx_timestamp(struct sk_buff *skb, const struct sock *sk);
int skb_get_tx_timestamp(struct sk_buff *skb, struct sock *sk,
struct timespec64 *ts);
static inline void
sock_recv_timestamp(struct msghdr *msg, struct sock *sk, struct sk_buff *skb)
{
struct skb_shared_hwtstamps *hwtstamps = skb_hwtstamps(skb);
u32 tsflags = READ_ONCE(sk->sk_tsflags);
ktime_t kt = skb->tstamp;
/*
* generate control messages if
* - receive time stamping in software requested
* - software time stamp available and wanted
* - hardware time stamps available and wanted
*/
if (sock_flag(sk, SOCK_RCVTSTAMP) ||
(tsflags & SOF_TIMESTAMPING_RX_SOFTWARE) ||
(kt && tsflags & SOF_TIMESTAMPING_SOFTWARE) ||
(hwtstamps->hwtstamp &&
(tsflags & SOF_TIMESTAMPING_RAW_HARDWARE)))
__sock_recv_timestamp(msg, sk, skb);
else
sock_write_timestamp(sk, kt);
if (sock_flag(sk, SOCK_WIFI_STATUS) && skb_wifi_acked_valid(skb))
__sock_recv_wifi_status(msg, sk, skb);
}
void __sock_recv_cmsgs(struct msghdr *msg, struct sock *sk,
struct sk_buff *skb);
#define SK_DEFAULT_STAMP (-1L * NSEC_PER_SEC)
static inline void sock_recv_cmsgs(struct msghdr *msg, struct sock *sk,
struct sk_buff *skb)
{
#define FLAGS_RECV_CMSGS ((1UL << SOCK_RXQ_OVFL) | \
(1UL << SOCK_RCVTSTAMP) | \
(1UL << SOCK_RCVMARK) | \
(1UL << SOCK_RCVPRIORITY) | \
(1UL << SOCK_TIMESTAMPING_ANY))
#define TSFLAGS_ANY (SOF_TIMESTAMPING_SOFTWARE | \
SOF_TIMESTAMPING_RAW_HARDWARE)
if (READ_ONCE(sk->sk_flags) & FLAGS_RECV_CMSGS)
__sock_recv_cmsgs(msg, sk, skb);
else if (unlikely(sock_flag(sk, SOCK_TIMESTAMP)))
sock_write_timestamp(sk, skb->tstamp);
else if (unlikely(sock_read_timestamp(sk) == SK_DEFAULT_STAMP))
sock_write_timestamp(sk, 0);
}
void __sock_tx_timestamp(__u32 tsflags, __u8 *tx_flags);
/**
* _sock_tx_timestamp - checks whether the outgoing packet is to be time stamped
* @sk: socket sending this packet
* @sockc: pointer to socket cmsg cookie to get timestamping info
* @tx_flags: completed with instructions for time stamping
* @tskey: filled in with next sk_tskey (not for TCP, which uses seqno)
*
* Note: callers should take care of initial ``*tx_flags`` value (usually 0)
*/
static inline void _sock_tx_timestamp(struct sock *sk,
const struct sockcm_cookie *sockc,
__u8 *tx_flags, __u32 *tskey)
{
__u32 tsflags = sockc->tsflags;
if (unlikely(tsflags)) {
__sock_tx_timestamp(tsflags, tx_flags);
if (tsflags & SOF_TIMESTAMPING_OPT_ID && tskey &&
tsflags & SOF_TIMESTAMPING_TX_RECORD_MASK) {
if (tsflags & SOCKCM_FLAG_TS_OPT_ID)
*tskey = sockc->ts_opt_id;
else
*tskey = atomic_inc_return(&sk->sk_tskey) - 1;
}
}
}
static inline void sock_tx_timestamp(struct sock *sk,
const struct sockcm_cookie *sockc,
__u8 *tx_flags)
{
_sock_tx_timestamp(sk, sockc, tx_flags, NULL);
}
static inline void skb_setup_tx_timestamp(struct sk_buff *skb,
const struct sockcm_cookie *sockc)
{
_sock_tx_timestamp(skb->sk, sockc, &skb_shinfo(skb)->tx_flags,
&skb_shinfo(skb)->tskey);
}
static inline bool sk_is_inet(const struct sock *sk)
{
int family = READ_ONCE(sk->sk_family);
return family == AF_INET || family == AF_INET6;
}
static inline bool sk_is_tcp(const struct sock *sk)
{
return sk_is_inet(sk) &&
sk->sk_type == SOCK_STREAM &&
sk->sk_protocol == IPPROTO_TCP;
}
static inline bool sk_is_udp(const struct sock *sk)
{
return sk_is_inet(sk) &&
sk->sk_type == SOCK_DGRAM &&
sk->sk_protocol == IPPROTO_UDP;
}
static inline bool sk_is_unix(const struct sock *sk)
{
return sk->sk_family == AF_UNIX;
}
static inline bool sk_is_stream_unix(const struct sock *sk)
{
return sk_is_unix(sk) && sk->sk_type == SOCK_STREAM;
}
static inline bool sk_is_vsock(const struct sock *sk)
{
return sk->sk_family == AF_VSOCK;
}
static inline bool sk_may_scm_recv(const struct sock *sk)
{
return (IS_ENABLED(CONFIG_UNIX) && sk->sk_family == AF_UNIX) ||
sk->sk_family == AF_NETLINK ||
(IS_ENABLED(CONFIG_BT) && sk->sk_family == AF_BLUETOOTH);
}
/**
* sk_eat_skb - Release a skb if it is no longer needed
* @sk: socket to eat this skb from
* @skb: socket buffer to eat
*
* This routine must be called with interrupts disabled or with the socket
* locked so that the sk_buff queue operation is ok.
*/
static inline void sk_eat_skb(struct sock *sk, struct sk_buff *skb)
{
__skb_unlink(skb, &sk->sk_receive_queue);
__kfree_skb(skb);
}
static inline bool
skb_sk_is_prefetched(struct sk_buff *skb)
{
#ifdef CONFIG_INET
return skb->destructor == sock_pfree;
#else
return false;
#endif /* CONFIG_INET */
}
/* This helper checks if a socket is a full socket,
* ie _not_ a timewait or request socket.
*/
static inline bool sk_fullsock(const struct sock *sk)
{
return (1 << sk->sk_state) & ~(TCPF_TIME_WAIT | TCPF_NEW_SYN_RECV);
}
static inline bool
sk_is_refcounted(struct sock *sk)
{
/* Only full sockets have sk->sk_flags. */
return !sk_fullsock(sk) || !sock_flag(sk, SOCK_RCU_FREE);
}
static inline bool
sk_requests_wifi_status(struct sock *sk)
{
return sk && sk_fullsock(sk) && sock_flag(sk, SOCK_WIFI_STATUS);
}
/* This helper checks if a socket is a LISTEN or NEW_SYN_RECV
* SYNACK messages can be attached to either ones (depending on SYNCOOKIE)
*/
static inline bool sk_listener(const struct sock *sk)
{
return (1 << sk->sk_state) & (TCPF_LISTEN | TCPF_NEW_SYN_RECV);
}
/* This helper checks if a socket is a LISTEN or NEW_SYN_RECV or TIME_WAIT
* TCP SYNACK messages can be attached to LISTEN or NEW_SYN_RECV (depending on SYNCOOKIE)
* TCP RST and ACK can be attached to TIME_WAIT.
*/
static inline bool sk_listener_or_tw(const struct sock *sk)
{
return (1 << READ_ONCE(sk->sk_state)) &
(TCPF_LISTEN | TCPF_NEW_SYN_RECV | TCPF_TIME_WAIT);
}
void sock_enable_timestamp(struct sock *sk, enum sock_flags flag);
int sock_recv_errqueue(struct sock *sk, struct msghdr *msg, int len, int level,
int type);
bool sk_ns_capable(const struct sock *sk,
struct user_namespace *user_ns, int cap);
bool sk_capable(const struct sock *sk, int cap);
bool sk_net_capable(const struct sock *sk, int cap);
void sk_get_meminfo(const struct sock *sk, u32 *meminfo);
/* Take into consideration the size of the struct sk_buff overhead in the
* determination of these values, since that is non-constant across
* platforms. This makes socket queueing behavior and performance
* not depend upon such differences.
*/
#define _SK_MEM_PACKETS 256
#define _SK_MEM_OVERHEAD SKB_TRUESIZE(256)
#define SK_WMEM_DEFAULT (_SK_MEM_OVERHEAD * _SK_MEM_PACKETS)
#define SK_RMEM_DEFAULT (_SK_MEM_OVERHEAD * _SK_MEM_PACKETS)
extern __u32 sysctl_wmem_max;
extern __u32 sysctl_rmem_max;
extern __u32 sysctl_wmem_default;
extern __u32 sysctl_rmem_default;
#define SKB_FRAG_PAGE_ORDER get_order(32768)
DECLARE_STATIC_KEY_FALSE(net_high_order_alloc_disable_key);
static inline int sk_get_wmem0(const struct sock *sk, const struct proto *proto)
{
/* Does this proto have per netns sysctl_wmem ? */
if (proto->sysctl_wmem_offset)
return READ_ONCE(*(int *)((void *)sock_net(sk) + proto->sysctl_wmem_offset));
return READ_ONCE(*proto->sysctl_wmem);
}
static inline int sk_get_rmem0(const struct sock *sk, const struct proto *proto)
{
/* Does this proto have per netns sysctl_rmem ? */
if (proto->sysctl_rmem_offset)
return READ_ONCE(*(int *)((void *)sock_net(sk) + proto->sysctl_rmem_offset));
return READ_ONCE(*proto->sysctl_rmem);
}
/* Default TCP Small queue budget is ~1 ms of data (1sec >> 10)
* Some wifi drivers need to tweak it to get more chunks.
* They can use this helper from their ndo_start_xmit()
*/
static inline void sk_pacing_shift_update(struct sock *sk, int val)
{
if (!sk || !sk_fullsock(sk) || READ_ONCE(sk->sk_pacing_shift) == val)
return;
WRITE_ONCE(sk->sk_pacing_shift, val);
}
/* if a socket is bound to a device, check that the given device
* index is either the same or that the socket is bound to an L3
* master device and the given device index is also enslaved to
* that L3 master
*/
static inline bool sk_dev_equal_l3scope(struct sock *sk, int dif)
{
int bound_dev_if = READ_ONCE(sk->sk_bound_dev_if);
int mdif;
if (!bound_dev_if || bound_dev_if == dif)
return true;
mdif = l3mdev_master_ifindex_by_index(sock_net(sk), dif);
if (mdif && mdif == bound_dev_if)
return true;
return false;
}
void sock_def_readable(struct sock *sk);
int sock_bindtoindex(struct sock *sk, int ifindex, bool lock_sk);
void sock_set_timestamp(struct sock *sk, int optname, bool valbool);
int sock_set_timestamping(struct sock *sk, int optname,
struct so_timestamping timestamping);
#if defined(CONFIG_CGROUP_BPF)
void bpf_skops_tx_timestamping(struct sock *sk, struct sk_buff *skb, int op);
#else
static inline void bpf_skops_tx_timestamping(struct sock *sk, struct sk_buff *skb, int op)
{
}
#endif
void sock_no_linger(struct sock *sk);
void sock_set_keepalive(struct sock *sk);
void sock_set_priority(struct sock *sk, u32 priority);
void sock_set_rcvbuf(struct sock *sk, int val);
void sock_set_mark(struct sock *sk, u32 val);
void sock_set_reuseaddr(struct sock *sk);
void sock_set_reuseport(struct sock *sk);
void sock_set_sndtimeo(struct sock *sk, s64 secs);
int sock_bind_add(struct sock *sk, struct sockaddr_unsized *addr, int addr_len);
int sock_get_timeout(long timeo, void *optval, bool old_timeval);
int sock_copy_user_timeval(struct __kernel_sock_timeval *tv,
sockptr_t optval, int optlen, bool old_timeval);
int sock_ioctl_inout(struct sock *sk, unsigned int cmd,
void __user *arg, void *karg, size_t size);
int sk_ioctl(struct sock *sk, unsigned int cmd, void __user *arg);
static inline bool sk_is_readable(struct sock *sk)
{
const struct proto *prot = READ_ONCE(sk->sk_prot);
if (prot->sock_is_readable)
return prot->sock_is_readable(sk);
return false;
}
#endif /* _SOCK_H */
] |
| 15/2 | 2026/02/23 14:32 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 45001 output: 24 thoughts: 914 Thoughts: **Dissecting the `inet_num` Assignment**
Okay, so I'm looking at this line of code where `inet_sk(sk)->inet_num = 0;`. My initial thought was, "Wait, isn't `inet_num` just another name for `sk_num`?" It seemed like a straightforward assignment. I know I've been working with the `sock` and `inet_sock` structures in the kernel networking stack, but now I'm questioning the assumed equivalence here.
First, I went back and re-examined the definitions. I saw that within `struct sock_common` in `sock.h`, the members `skc_dport` and `skc_num` are grouped together inside a `union`. `sk_num` is also defined with the macro `#define sk_num __sk_common.skc_num`. This seems to confirm my initial assumption. Then I took another look at `inet_sock`, also in `inet_sock.h`, and saw a member `inet_num` and was confused.
The `inet_sk` macro, which is defined as `static inline struct inet_sock *inet_sk(const struct sock *sk) { return (struct inet_sock *)sk; }`, is the key. It's essentially a typecast, meaning that `inet_sk(sk)` takes a pointer to a `sock` and treats it as a pointer to an `inet_sock`. I know that because `inet_sock` starts with a `struct sock`, the cast is valid. So, `inet_sk(sk)->inet_num` is indeed accessing the `inet_num` member of the `inet_sock` structure, relative to the base address of the `sock` structure.
To confirm the offset of `sk_num` within `struct sock_common`, I went over the layout of the members. The structure members `skc_portpair` takes up 24 bytes (192 bits), including `skc_dport`. `skc_num` occupies the 26th byte (208 bits).
But, where is `inet_num`? I reviewed the `inet_sock` layout again. I had initially missed it, but now I see it: `inet_num` isn't explicitly in the listing I was looking at. Now I know that it doesn't exist, I suspect it's another macro, so I will now look to see how it's defined and how it relates to `sk_num`.
|
| 16/2 | 2026/02/23 14:32 | tool | read-file |
0mArgs:map[File:include/net/inet_sock.h] Results: map[Contents:/* SPDX-License-Identifier: GPL-2.0-or-later */
/*
* INET An implementation of the TCP/IP protocol suite for the LINUX
* operating system. INET is implemented using the BSD Socket
* interface as the means of communication with the user level.
*
* Definitions for inet_sock
*
* Authors: Many, reorganised here by
* Arnaldo Carvalho de Melo <acme@mandriva.com>
*/
#ifndef _INET_SOCK_H
#define _INET_SOCK_H
#include <linux/bitops.h>
#include <linux/string.h>
#include <linux/types.h>
#include <linux/jhash.h>
#include <linux/netdevice.h>
#include <net/flow.h>
#include <net/inet_dscp.h>
#include <net/sock.h>
#include <net/request_sock.h>
#include <net/netns/hash.h>
#include <net/tcp_states.h>
#include <net/l3mdev.h>
#define IP_OPTIONS_DATA_FIXED_SIZE 40
/** struct ip_options - IP Options
*
* @faddr - Saved first hop address
* @nexthop - Saved nexthop address in LSRR and SSRR
* @is_strictroute - Strict source route
* @srr_is_hit - Packet destination addr was our one
* @is_changed - IP checksum more not valid
* @rr_needaddr - Need to record addr of outgoing dev
* @ts_needtime - Need to record timestamp
* @ts_needaddr - Need to record addr of outgoing dev
*/
struct ip_options {
__be32 faddr;
__be32 nexthop;
unsigned char optlen;
unsigned char srr;
unsigned char rr;
unsigned char ts;
unsigned char is_strictroute:1,
srr_is_hit:1,
is_changed:1,
rr_needaddr:1,
ts_needtime:1,
ts_needaddr:1;
unsigned char router_alert;
unsigned char cipso;
unsigned char __pad2;
unsigned char __data[];
};
struct ip_options_rcu {
struct rcu_head rcu;
/* Must be last as it ends in a flexible-array member. */
struct ip_options opt;
};
struct inet_request_sock {
struct request_sock req;
#define ir_loc_addr req.__req_common.skc_rcv_saddr
#define ir_rmt_addr req.__req_common.skc_daddr
#define ir_num req.__req_common.skc_num
#define ir_rmt_port req.__req_common.skc_dport
#define ir_v6_rmt_addr req.__req_common.skc_v6_daddr
#define ir_v6_loc_addr req.__req_common.skc_v6_rcv_saddr
#define ir_iif req.__req_common.skc_bound_dev_if
#define ir_cookie req.__req_common.skc_cookie
#define ireq_net req.__req_common.skc_net
#define ireq_state req.__req_common.skc_state
#define ireq_family req.__req_common.skc_family
u16 snd_wscale : 4,
rcv_wscale : 4,
tstamp_ok : 1,
sack_ok : 1,
wscale_ok : 1,
ecn_ok : 1,
acked : 1,
no_srccheck: 1,
smc_ok : 1;
u32 ir_mark;
union {
struct ip_options_rcu __rcu *ireq_opt;
#if IS_ENABLED(CONFIG_IPV6)
struct {
struct ipv6_txoptions *ipv6_opt;
struct sk_buff *pktopts;
};
#endif
};
};
#define inet_rsk(ptr) container_of_const(ptr, struct inet_request_sock, req)
static inline u32 inet_request_mark(const struct sock *sk, struct sk_buff *skb)
{
u32 mark = READ_ONCE(sk->sk_mark);
if (!mark && READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_fwmark_accept))
return skb->mark;
return mark;
}
static inline int inet_request_bound_dev_if(const struct sock *sk,
struct sk_buff *skb)
{
int bound_dev_if = READ_ONCE(sk->sk_bound_dev_if);
#ifdef CONFIG_NET_L3_MASTER_DEV
struct net *net = sock_net(sk);
if (!bound_dev_if && READ_ONCE(net->ipv4.sysctl_tcp_l3mdev_accept))
return l3mdev_master_ifindex_by_index(net, skb->skb_iif);
#endif
return bound_dev_if;
}
static inline int inet_sk_bound_l3mdev(const struct sock *sk)
{
#ifdef CONFIG_NET_L3_MASTER_DEV
struct net *net = sock_net(sk);
if (!READ_ONCE(net->ipv4.sysctl_tcp_l3mdev_accept))
return l3mdev_master_ifindex_by_index(net,
sk->sk_bound_dev_if);
#endif
return 0;
}
static inline bool inet_bound_dev_eq(bool l3mdev_accept, int bound_dev_if,
int dif, int sdif)
{
if (!bound_dev_if)
return !sdif || l3mdev_accept;
return bound_dev_if == dif || bound_dev_if == sdif;
}
static inline bool inet_sk_bound_dev_eq(const struct net *net,
int bound_dev_if,
int dif, int sdif)
{
#if IS_ENABLED(CONFIG_NET_L3_MASTER_DEV)
return inet_bound_dev_eq(!!READ_ONCE(net->ipv4.sysctl_tcp_l3mdev_accept),
bound_dev_if, dif, sdif);
#else
return inet_bound_dev_eq(true, bound_dev_if, dif, sdif);
#endif
}
struct inet6_cork {
struct ipv6_txoptions *opt;
u8 hop_limit;
u8 tclass;
u8 dontfrag:1;
};
struct inet_cork {
unsigned int flags;
__be32 addr;
struct ip_options *opt;
unsigned int fragsize;
int length; /* Total length of all frames */
struct dst_entry *dst;
u8 tx_flags;
__u8 ttl;
__s16 tos;
u32 priority;
__u16 gso_size;
u32 ts_opt_id;
u64 transmit_time;
u32 mark;
};
struct inet_cork_full {
struct inet_cork base;
struct flowi fl;
#if IS_ENABLED(CONFIG_IPV6)
struct inet6_cork base6;
#endif
};
struct ip_mc_socklist;
struct ipv6_pinfo;
struct rtable;
/** struct inet_sock - representation of INET sockets
*
* @sk - ancestor class
* @pinet6 - pointer to IPv6 control block
* @inet_daddr - Foreign IPv4 addr
* @inet_rcv_saddr - Bound local IPv4 addr
* @inet_dport - Destination port
* @inet_num - Local port
* @inet_flags - various atomic flags
* @inet_saddr - Sending source
* @uc_ttl - Unicast TTL
* @inet_sport - Source port
* @inet_id - ID counter for DF pkts
* @tos - TOS
* @mc_ttl - Multicasting TTL
* @uc_index - Unicast outgoing device index
* @mc_index - Multicast device index
* @mc_list - Group array
* @cork - info to build ip hdr on each ip frag while socket is corked
*/
struct inet_sock {
/* sk and pinet6 has to be the first two members of inet_sock */
struct sock sk;
#if IS_ENABLED(CONFIG_IPV6)
struct ipv6_pinfo *pinet6;
struct ipv6_fl_socklist __rcu *ipv6_fl_list;
#endif
/* Socket demultiplex comparisons on incoming packets. */
#define inet_daddr sk.__sk_common.skc_daddr
#define inet_rcv_saddr sk.__sk_common.skc_rcv_saddr
#define inet_dport sk.__sk_common.skc_dport
#define inet_num sk.__sk_common.skc_num
unsigned long inet_flags;
__be32 inet_saddr;
__s16 uc_ttl;
__be16 inet_sport;
struct ip_options_rcu __rcu *inet_opt;
atomic_t inet_id;
__u8 tos;
__u8 min_ttl;
__u8 mc_ttl;
__u8 pmtudisc;
__u8 rcv_tos;
__u8 convert_csum;
int uc_index;
int mc_index;
__be32 mc_addr;
u32 local_port_range; /* high << 16 | low */
struct ip_mc_socklist __rcu *mc_list;
struct inet_cork_full cork;
};
#define IPCORK_OPT 1 /* ip-options has been held in ipcork.opt */
#define IPCORK_TS_OPT_ID 2 /* ts_opt_id field is valid, overriding sk_tskey */
enum {
INET_FLAGS_PKTINFO = 0,
INET_FLAGS_TTL = 1,
INET_FLAGS_TOS = 2,
INET_FLAGS_RECVOPTS = 3,
INET_FLAGS_RETOPTS = 4,
INET_FLAGS_PASSSEC = 5,
INET_FLAGS_ORIGDSTADDR = 6,
INET_FLAGS_CHECKSUM = 7,
INET_FLAGS_RECVFRAGSIZE = 8,
INET_FLAGS_RECVERR = 9,
INET_FLAGS_RECVERR_RFC4884 = 10,
INET_FLAGS_FREEBIND = 11,
INET_FLAGS_HDRINCL = 12,
INET_FLAGS_MC_LOOP = 13,
INET_FLAGS_MC_ALL = 14,
INET_FLAGS_TRANSPARENT = 15,
INET_FLAGS_IS_ICSK = 16,
INET_FLAGS_NODEFRAG = 17,
INET_FLAGS_BIND_ADDRESS_NO_PORT = 18,
INET_FLAGS_DEFER_CONNECT = 19,
INET_FLAGS_MC6_LOOP = 20,
INET_FLAGS_RECVERR6_RFC4884 = 21,
INET_FLAGS_MC6_ALL = 22,
INET_FLAGS_AUTOFLOWLABEL_SET = 23,
INET_FLAGS_AUTOFLOWLABEL = 24,
INET_FLAGS_DONTFRAG = 25,
INET_FLAGS_RECVERR6 = 26,
INET_FLAGS_REPFLOW = 27,
INET_FLAGS_RTALERT_ISOLATE = 28,
INET_FLAGS_SNDFLOW = 29,
INET_FLAGS_RTALERT = 30,
};
/* cmsg flags for inet */
#define IP_CMSG_PKTINFO BIT(INET_FLAGS_PKTINFO)
#define IP_CMSG_TTL BIT(INET_FLAGS_TTL)
#define IP_CMSG_TOS BIT(INET_FLAGS_TOS)
#define IP_CMSG_RECVOPTS BIT(INET_FLAGS_RECVOPTS)
#define IP_CMSG_RETOPTS BIT(INET_FLAGS_RETOPTS)
#define IP_CMSG_PASSSEC BIT(INET_FLAGS_PASSSEC)
#define IP_CMSG_ORIGDSTADDR BIT(INET_FLAGS_ORIGDSTADDR)
#define IP_CMSG_CHECKSUM BIT(INET_FLAGS_CHECKSUM)
#define IP_CMSG_RECVFRAGSIZE BIT(INET_FLAGS_RECVFRAGSIZE)
#define IP_CMSG_ALL (IP_CMSG_PKTINFO | IP_CMSG_TTL | \
IP_CMSG_TOS | IP_CMSG_RECVOPTS | \
IP_CMSG_RETOPTS | IP_CMSG_PASSSEC | \
IP_CMSG_ORIGDSTADDR | IP_CMSG_CHECKSUM | \
IP_CMSG_RECVFRAGSIZE)
static inline unsigned long inet_cmsg_flags(const struct inet_sock *inet)
{
return READ_ONCE(inet->inet_flags) & IP_CMSG_ALL;
}
static inline dscp_t inet_sk_dscp(const struct inet_sock *inet)
{
return inet_dsfield_to_dscp(READ_ONCE(inet->tos));
}
#define inet_test_bit(nr, sk) \
test_bit(INET_FLAGS_##nr, &inet_sk(sk)->inet_flags)
#define inet_set_bit(nr, sk) \
set_bit(INET_FLAGS_##nr, &inet_sk(sk)->inet_flags)
#define inet_clear_bit(nr, sk) \
clear_bit(INET_FLAGS_##nr, &inet_sk(sk)->inet_flags)
#define inet_assign_bit(nr, sk, val) \
assign_bit(INET_FLAGS_##nr, &inet_sk(sk)->inet_flags, val)
/**
* sk_to_full_sk - Access to a full socket
* @sk: pointer to a socket
*
* SYNACK messages might be attached to request sockets.
* Some places want to reach the listener in this case.
*/
static inline struct sock *sk_to_full_sk(struct sock *sk)
{
#ifdef CONFIG_INET
if (sk && READ_ONCE(sk->sk_state) == TCP_NEW_SYN_RECV)
sk = inet_reqsk(sk)->rsk_listener;
if (sk && READ_ONCE(sk->sk_state) == TCP_TIME_WAIT)
sk = NULL;
#endif
return sk;
}
/* sk_to_full_sk() variant with a const argument */
static inline const struct sock *sk_const_to_full_sk(const struct sock *sk)
{
#ifdef CONFIG_INET
if (sk && READ_ONCE(sk->sk_state) == TCP_NEW_SYN_RECV)
sk = ((const struct request_sock *)sk)->rsk_listener;
if (sk && READ_ONCE(sk->sk_state) == TCP_TIME_WAIT)
sk = NULL;
#endif
return sk;
}
static inline struct sock *skb_to_full_sk(const struct sk_buff *skb)
{
return sk_to_full_sk(skb->sk);
}
#define inet_sk(ptr) container_of_const(ptr, struct inet_sock, sk)
int inet_sk_rebuild_header(struct sock *sk);
/**
* inet_sk_state_load - read sk->sk_state for lockless contexts
* @sk: socket pointer
*
* Paired with inet_sk_state_store(). Used in places we don't hold socket lock:
* tcp_diag_get_info(), tcp_get_info(), tcp_poll(), get_tcp4_sock() ...
*/
static inline int inet_sk_state_load(const struct sock *sk)
{
/* state change might impact lockless readers. */
return smp_load_acquire(&sk->sk_state);
}
/**
* inet_sk_state_store - update sk->sk_state
* @sk: socket pointer
* @newstate: new state
*
* Paired with inet_sk_state_load(). Should be used in contexts where
* state change might impact lockless readers.
*/
void inet_sk_state_store(struct sock *sk, int newstate);
void inet_sk_set_state(struct sock *sk, int state);
static inline unsigned int __inet_ehashfn(const __be32 laddr,
const __u16 lport,
const __be32 faddr,
const __be16 fport,
u32 initval)
{
return jhash_3words((__force __u32) laddr,
(__force __u32) faddr,
((__u32) lport) << 16 | (__force __u32)fport,
initval);
}
struct request_sock *inet_reqsk_alloc(const struct request_sock_ops *ops,
struct sock *sk_listener,
bool attach_listener);
static inline __u8 inet_sk_flowi_flags(const struct sock *sk)
{
__u8 flags = 0;
if (inet_test_bit(TRANSPARENT, sk) || inet_test_bit(HDRINCL, sk))
flags |= FLOWI_FLAG_ANYSRC;
return flags;
}
static inline void inet_inc_convert_csum(struct sock *sk)
{
inet_sk(sk)->convert_csum++;
}
static inline void inet_dec_convert_csum(struct sock *sk)
{
if (inet_sk(sk)->convert_csum > 0)
inet_sk(sk)->convert_csum--;
}
static inline bool inet_get_convert_csum(struct sock *sk)
{
return !!inet_sk(sk)->convert_csum;
}
static inline bool inet_can_nonlocal_bind(struct net *net,
struct inet_sock *inet)
{
return READ_ONCE(net->ipv4.sysctl_ip_nonlocal_bind) ||
test_bit(INET_FLAGS_FREEBIND, &inet->inet_flags) ||
test_bit(INET_FLAGS_TRANSPARENT, &inet->inet_flags);
}
static inline bool inet_addr_valid_or_nonlocal(struct net *net,
struct inet_sock *inet,
__be32 addr,
int addr_type)
{
return inet_can_nonlocal_bind(net, inet) ||
addr == htonl(INADDR_ANY) ||
addr_type == RTN_LOCAL ||
addr_type == RTN_MULTICAST ||
addr_type == RTN_BROADCAST;
}
#endif /* _INET_SOCK_H */
] |
| 17/2 | 2026/02/23 14:32 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 51324 output: 46 thoughts: 1023 Thoughts: **Analyzing a Potential Race Condition in Socket Lookup**
Okay, let's break this down systematically. I'm looking at a potential race condition in the Linux networking stack, specifically related to socket binding and lookups. The key is in the interactions between `inet_num` and `sk_portpair`.
My initial thought is that `inet_num` is the same as `skc_num` and that the write is modifying this member of the `sk_portpair` union. The write to `inet_sk(sk)->inet_num = 0;` translates directly to `sk->__sk_common.skc_num = 0;`, and `skc_num` is a 16-bit unsigned integer. The read side uses `sk->sk_portpair != ports;` and `sk_portpair` is the union of `skc_dport` and `skc_num`.
On a little-endian system (which I'm assuming, and the provided address suggests it is x86_64), the memory layout of this union is crucial. With the structure `struct { __be16 skc_dport; __u16 skc_num; }`, `skc_dport` will be at the lower address, and `skc_num` will be at the higher address within the 32-bit `skc_portpair`. If we look at the provided memory addresses, the write is to `0xffff88810b7c30ce` and is writing 2 bytes. The read is to `0xffff88810b7c30cc` and reading 4 bytes. This confirms that the write is indeed modifying `skc_num`.
The write is in `__inet_put_port` (during socket closing/destruction), and the read is during packet arrival in `__inet_lookup_established` (specifically within `inet_match`). This is where the race comes in. If `inet_match` is reading a partially updated `sk_portpair`, we may get an incorrect match or non-match, which could lead to packet loss, connection hijacking, or other undesirable effects.
Let's look at the memory contents. The `skc_num` changed from `0x4001` to `0x0000`, while the `skc_dport` was `0x90d0`. The write is unbinding the socket by setting `inet_num` to 0.
However, the crucial piece of information is the locking mechanism. The write operation within `__inet_put_port` is protected by `spin_lock(&head->lock)`. So this code is executed with the lock. `head` is a `bhash` bucket lock which indicates it's meant to protect concurrent accesses to the same bucket in the hash table. This implies that while the write is happening, the RCU-protected read in `__inet_lookup_established` *might* still be active, even if the bucket is protected by the spinlock. It seems like the spinlock protects the modifications to the hash table, and RCU is protecting the lookups. The locking strategy appears to be designed to minimize contention by using RCU for lookups, but in the case that the same bucket is accessed concurrently by writes (deleting the socket) and reads (looking up the socket), there could be a window of vulnerability during the write when an incorrect read is performed.
Therefore, although `inet_num` is set to 0 in `__inet_put_port` under the protection of a spinlock, we need to carefully analyze the RCU usage in `__inet_lookup_established` to understand if a stale read is possible during socket unbinding when packets are arriving simultaneously. This is where I'll need to dig deeper into the RCU semantics within `__inet_lookup_established`. I need to ensure that the RCU read side synchronization correctly handles concurrent socket unbinding from a specific bucket.
|
| 18/2 | 2026/02/23 14:32 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:net/ipv4/inet_hashtables.c IncludeLines:true Name:__inet_lookup_established] Results: map[SourceCode: 527: struct sock *__inet_lookup_established(const struct net *net,
528: const __be32 saddr, const __be16 sport,
529: const __be32 daddr, const u16 hnum,
530: const int dif, const int sdif)
531: {
532: const __portpair ports = INET_COMBINED_PORTS(sport, hnum);
533: INET_ADDR_COOKIE(acookie, saddr, daddr);
534: const struct hlist_nulls_node *node;
535: struct inet_ehash_bucket *head;
536: struct inet_hashinfo *hashinfo;
537: unsigned int hash, slot;
538: struct sock *sk;
539:
540: hashinfo = net->ipv4.tcp_death_row.hashinfo;
541: hash = inet_ehashfn(net, daddr, hnum, saddr, sport);
542: slot = hash & hashinfo->ehash_mask;
543: head = &hashinfo->ehash[slot];
544:
545: begin:
546: sk_nulls_for_each_rcu(sk, node, &head->chain) {
547: if (sk->sk_hash != hash)
548: continue;
549: if (likely(inet_match(net, sk, acookie, ports, dif, sdif))) {
550: if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
551: goto out;
552: if (unlikely(!inet_match(net, sk, acookie,
553: ports, dif, sdif))) {
554: sock_gen_put(sk);
555: goto begin;
556: }
557: goto found;
558: }
559: }
560: /*
561: * if the nulls value we got at the end of this lookup is
562: * not the expected one, we must restart lookup.
563: * We probably met an item that was moved to another chain.
564: */
565: if (get_nulls_value(node) != slot)
566: goto begin;
567: out:
568: sk = NULL;
569: found:
570: return sk;
571: }
SourceFile:net/ipv4/inet_hashtables.c] |
| 19/2 | 2026/02/23 14:32 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 53990 output: 42 thoughts: 285 Thoughts: **Analyzing a Potential Race Condition in the Kernel's IP Stack**
Okay, so I'm looking at this lookup operation and how it uses `sk_nulls_for_each_rcu`. The core of the issue seems to center around the calls to `inet_match`. The process is: `inet_match` is called, and if it returns true, there's an attempt to increment the refcount. Then, to make sure it's consistent, `inet_match` is checked *again*. This pattern raises a red flag.
The race condition I suspect stems from `inet_num` being set to 0 in `__inet_put_port`, and the problem is that this isn't using `WRITE_ONCE`. On the other side of this, `inet_match` reads `sk_portpair` without using `READ_ONCE`. This uncoordinated write and read pair is definitely a cause for concern. I need to dig further into how these two functions interact.
Specifically, I need to know under what circumstances `__inet_put_port` is called. Ah, right, I see it's called from `tcp_done` which then calls `inet_put_port`. Now I'm asking myself: is the socket still in the established hash table when `__inet_put_port` is being invoked? That's the crux of it. If it is, then the lack of memory barriers could cause `inet_match` to observe inconsistent state.
Let's trace `tcp_done`.
```c
void tcp_done(struct sock *sk)
{
...
tcp_set_state(sk, TCP_CLOSE);
...
if (!sock_flag(sk, SOCK_DEAD))
sk->sk_state_change(sk);
else
inet_csk_destroy_sock(sk);
}
```
Okay, `tcp_set_state(sk, TCP_CLOSE)` *might* be the culprit. It's likely that `inet_put_port` is called indirectly through there. Time to confirm my suspicions and check the implementation of `tcp_set_state`. I need to figure out exactly how the socket state transition impacts the port release and whether that release is properly synchronized.
|
| 20/2 | 2026/02/23 14:33 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:net/ipv4/tcp.c IncludeLines:true Name:tcp_set_state] Results: map[SourceCode:2996: void tcp_set_state(struct sock *sk, int state)
2997: {
2998: int oldstate = sk->sk_state;
2999:
3000: /* We defined a new enum for TCP states that are exported in BPF
3001: * so as not force the internal TCP states to be frozen. The
3002: * following checks will detect if an internal state value ever
3003: * differs from the BPF value. If this ever happens, then we will
3004: * need to remap the internal value to the BPF value before calling
3005: * tcp_call_bpf_2arg.
3006: */
3007: BUILD_BUG_ON((int)BPF_TCP_ESTABLISHED != (int)TCP_ESTABLISHED);
3008: BUILD_BUG_ON((int)BPF_TCP_SYN_SENT != (int)TCP_SYN_SENT);
3009: BUILD_BUG_ON((int)BPF_TCP_SYN_RECV != (int)TCP_SYN_RECV);
3010: BUILD_BUG_ON((int)BPF_TCP_FIN_WAIT1 != (int)TCP_FIN_WAIT1);
3011: BUILD_BUG_ON((int)BPF_TCP_FIN_WAIT2 != (int)TCP_FIN_WAIT2);
3012: BUILD_BUG_ON((int)BPF_TCP_TIME_WAIT != (int)TCP_TIME_WAIT);
3013: BUILD_BUG_ON((int)BPF_TCP_CLOSE != (int)TCP_CLOSE);
3014: BUILD_BUG_ON((int)BPF_TCP_CLOSE_WAIT != (int)TCP_CLOSE_WAIT);
3015: BUILD_BUG_ON((int)BPF_TCP_LAST_ACK != (int)TCP_LAST_ACK);
3016: BUILD_BUG_ON((int)BPF_TCP_LISTEN != (int)TCP_LISTEN);
3017: BUILD_BUG_ON((int)BPF_TCP_CLOSING != (int)TCP_CLOSING);
3018: BUILD_BUG_ON((int)BPF_TCP_NEW_SYN_RECV != (int)TCP_NEW_SYN_RECV);
3019: BUILD_BUG_ON((int)BPF_TCP_BOUND_INACTIVE != (int)TCP_BOUND_INACTIVE);
3020: BUILD_BUG_ON((int)BPF_TCP_MAX_STATES != (int)TCP_MAX_STATES);
3021:
3022: /* bpf uapi header bpf.h defines an anonymous enum with values
3023: * BPF_TCP_* used by bpf programs. Currently gcc built vmlinux
3024: * is able to emit this enum in DWARF due to the above BUILD_BUG_ON.
3025: * But clang built vmlinux does not have this enum in DWARF
3026: * since clang removes the above code before generating IR/debuginfo.
3027: * Let us explicitly emit the type debuginfo to ensure the
3028: * above-mentioned anonymous enum in the vmlinux DWARF and hence BTF
3029: * regardless of which compiler is used.
3030: */
3031: BTF_TYPE_EMIT_ENUM(BPF_TCP_ESTABLISHED);
3032:
3033: if (BPF_SOCK_OPS_TEST_FLAG(tcp_sk(sk), BPF_SOCK_OPS_STATE_CB_FLAG))
3034: tcp_call_bpf_2arg(sk, BPF_SOCK_OPS_STATE_CB, oldstate, state);
3035:
3036: switch (state) {
3037: case TCP_ESTABLISHED:
3038: if (oldstate != TCP_ESTABLISHED)
3039: TCP_INC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
3040: break;
3041: case TCP_CLOSE_WAIT:
3042: if (oldstate == TCP_SYN_RECV)
3043: TCP_INC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
3044: break;
3045:
3046: case TCP_CLOSE:
3047: if (oldstate == TCP_CLOSE_WAIT || oldstate == TCP_ESTABLISHED)
3048: TCP_INC_STATS(sock_net(sk), TCP_MIB_ESTABRESETS);
3049:
3050: sk->sk_prot->unhash(sk);
3051: if (inet_csk(sk)->icsk_bind_hash &&
3052: !(sk->sk_userlocks & SOCK_BINDPORT_LOCK))
3053: inet_put_port(sk);
3054: fallthrough;
3055: default:
3056: if (oldstate == TCP_ESTABLISHED || oldstate == TCP_CLOSE_WAIT)
3057: TCP_DEC_STATS(sock_net(sk), TCP_MIB_CURRESTAB);
3058: }
3059:
3060: /* Change state AFTER socket is unhashed to avoid closed
3061: * socket sitting in hash tables.
3062: */
3063: inet_sk_state_store(sk, state);
3064: }
SourceFile:net/ipv4/tcp.c] |
| 21/2 | 2026/02/23 14:33 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 56255 output: 44 thoughts: 252 Thoughts: **My Analysis of TCP Socket State Transition** Okay, let's break down this `tcp_set_state` function. The first thing I see is `sk->sk_prot->unhash(sk)`. This is removing the socket from the established hash table. Right, so for TCP, that's calling `inet_unhash`. Got it. Now, a crucial point. It looks like `inet_unhash` is called *before* `inet_put_port`. That seems logical, but it implies a potential race condition. If `inet_unhash` happens first, then `__inet_lookup_established` shouldn't find the socket anymore, correct? It's been removed from the hash. However, `inet_unhash` utilizes RCU (Read-Copy-Update) to remove the socket. This means that a concurrent reader, even after `inet_unhash` is invoked, could still potentially *see* the socket in the hash table until the RCU grace period expires. That's the core of RCU, isn't it? Hold on... but `__inet_lookup_established` is called under `rcu_read_lock()`. That should prevent the problem because we are now protected, and so the issue is not really one. But, just to be sure... Let me check `inet_unhash` to confirm my understanding of this. |
| 22/2 | 2026/02/23 14:33 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:net/ipv4/inet_hashtables.c IncludeLines:true Name:inet_unhash] Results: map[SourceCode: 819: void inet_unhash(struct sock *sk)
820: {
821: struct inet_hashinfo *hashinfo = tcp_get_hashinfo(sk);
822:
823: if (sk_unhashed(sk))
824: return;
825:
826: sock_rps_delete_flow(sk);
827: if (sk->sk_state == TCP_LISTEN) {
828: struct inet_listen_hashbucket *ilb2;
829:
830: ilb2 = inet_lhash2_bucket_sk(hashinfo, sk);
831: /* Don't disable bottom halves while acquiring the lock to
832: * avoid circular locking dependency on PREEMPT_RT.
833: */
834: spin_lock(&ilb2->lock);
835: if (rcu_access_pointer(sk->sk_reuseport_cb))
836: reuseport_stop_listen_sock(sk);
837:
838: __sk_nulls_del_node_init_rcu(sk);
839: sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
840: spin_unlock(&ilb2->lock);
841: } else {
842: spinlock_t *lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
843:
844: spin_lock_bh(lock);
845: __sk_nulls_del_node_init_rcu(sk);
846: sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
847: spin_unlock_bh(lock);
848: }
849: }
SourceFile:net/ipv4/inet_hashtables.c] |
| 23/2 | 2026/02/23 14:33 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 57396 output: 20 thoughts: 4486 Thoughts: **Analyzing a Data Race in the Linux Kernel's Networking Stack** Okay, so I've got this KCSAN report flagged, and I need to figure out if it's a real problem or just a false positive. It seems like there's a potential data race between the `inet_put_port` function and `__inet_lookup_established`. Basically, a socket is being removed from the established hash table, and there's a window where a concurrent reader might see an inconsistent state. `__sk_nulls_del_node_init_rcu` is called to remove a socket from the hash chain. This happens concurrently with lookups. `tcp_set_state` initiates socket teardown, including unhashing it via `unhash()` and setting the port number to zero using `inet_put_port`. Specifically, `inet_put_port` sets `inet_num` (which is part of the `sk_portpair` field) to zero. The core of the issue lies in `inet_match`, the function that determines if a socket is the right match for an incoming connection. It compares `sk->sk_portpair` with a `ports` variable. The data race happens when `inet_match` is reading `sk->sk_portpair` (a 4-byte value), and at the same time, `inet_put_port` is writing to two bytes of it to set `inet_num` to 0. My immediate thought is that this should be okay. If a reader sees the "new" value (0), it won't match, and the lookup continues. It's essentially treating the socket as if it's already removed, which is exactly what's happening. If it sees the "old" value, it might match *before* the socket is fully unhashed. But the `inet_match` is called twice: one is a *pre-check* and another is a *refcount check*. If it matches at pre-check but gets the socket removed during refcount checks, it handles it safely by decrementing the refcount and retrying. The report also mentions a torn value and some port number values that changed. This seems to be describing the state before and after the `inet_num` field is set to 0. The issue is that the `sk_portpair` access could be torn. However, the `skc_portpair` is 4-byte aligned, and x86 guarantees atomic reads and writes of 4-byte aligned variables. This should avoid the scenario where a single read grabs half the old value and half the new value. However, the compiler could still mis-optimize. KCSAN, though, is still flagging this. It's not necessarily a functional bug, but KCSAN considers it a data race because there aren't explicit synchronization primitives like `READ_ONCE`/`WRITE_ONCE` used for these accesses. This means the compiler *could* reorder things in ways that would break assumptions. I then check `inet_bind_hash`, and it writes the socket number to `inet_num`, but it's called during `bind()` not during unhash. I then go back to `inet_match` to see that it checks `sk->sk_portpair != ports`. During the close process, `unhash()` is called before `put_port()`, which sets `sk->sk_portpair` (or a portion of it) to zero. Thus, the only way a reader would see a value is if it got a pointer to the socket, or had an RCU read. This is a very common scenario. Now, could it cause `inet_match` to return `true` incorrectly? If the `ports` variable (used in `__inet_lookup_established`) also had 0 as the port? That's impossible. I then remember that `sk_addrpair` is also used in `inet_match` to prevent other matches, so nothing is missed. I then check `sk_state` with `tcp_set_state` and ensure that the right state is set. The state is only set at the end. I then revisit the dual calls to `inet_match` in `__inet_lookup_established` to reiterate that this is safe. So, there's no way that the race leads to an *incorrect* match (a false positive). It can only lead to a *false negative* (missing an entry), which is fine because the socket is being closed. So, why KCSAN? Well, even though it's likely a benign race from a functional point of view, it *is* a data race according to the kernel's memory model. The lack of `READ_ONCE`/`WRITE_ |
| 24/2 | 2026/02/23 14:33 | tool | grepper |
0mArgs:map[Expression:inet_num = 0] Results: map[Output:net/ipv4/inet_connection_sock.c=1306=void inet_csk_prepare_forced_close(struct sock *sk)
--
net/ipv4/inet_connection_sock.c-1312- inet_csk_prepare_for_destroy_sock(sk);
net/ipv4/inet_connection_sock.c:1313: inet_sk(sk)->inet_num = 0;
net/ipv4/inet_connection_sock.c-1314-}
--
net/ipv4/inet_hashtables.c=212=static void __inet_put_port(struct sock *sk)
--
net/ipv4/inet_hashtables.c-226- inet_csk(sk)->icsk_bind_hash = NULL;
net/ipv4/inet_hashtables.c:227: inet_sk(sk)->inet_num = 0;
net/ipv4/inet_hashtables.c-228- sk->sk_userlocks &= ~SOCK_CONNECT_BIND;
--
net/ipv4/inet_hashtables.c=1035=int __inet_hash_connect(struct inet_timewait_death_row *death_row,
--
net/ipv4/inet_hashtables.c-1208- inet_sk(sk)->inet_sport = 0;
net/ipv4/inet_hashtables.c:1209: inet_sk(sk)->inet_num = 0;
net/ipv4/inet_hashtables.c-1210-
--
net/ipv4/udp.c=239=int udp_lib_get_port(struct sock *sk, unsigned short snum,
--
net/ipv4/udp.c-322- udp_reuseport_add_sock(sk, hslot)) {
net/ipv4/udp.c:323: inet_sk(sk)->inet_num = 0;
net/ipv4/udp.c-324- udp_sk(sk)->udp_port_hash = 0;
--
net/ipv4/udp.c=2246=void udp_lib_unhash(struct sock *sk)
--
net/ipv4/udp.c-2261- hslot->count--;
net/ipv4/udp.c:2262: inet_sk(sk)->inet_num = 0;
net/ipv4/udp.c-2263- sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
--
net/sctp/socket.c=8773=static inline void __sctp_put_port(struct sock *sk)
--
net/sctp/socket.c-8783- sctp_sk(sk)->bind_hash = NULL;
net/sctp/socket.c:8784: inet_sk(sk)->inet_num = 0;
net/sctp/socket.c-8785- sctp_bucket_destroy(pp);
] |
| 25/2 | 2026/02/23 14:33 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 63696 output: 19 thoughts: 166 Thoughts: **Investigating Potential Data Races with `inet_num = 0`** Okay, so I'm seeing a bunch of places where `inet_num` is being set to zero. This could be a red flag, potentially pointing to a data race or a concurrency issue. I need to dig deeper. Aha! There's a commit that *might* be related to this. Let me check it out; perhaps it's related to the `inet_match` function and whether or not it should be using `READ_ONCE`. That could definitely be the cause. Alright, let me go back and double-check `include/net/inet_hashtables.h`. Specifically, that line: ```c 348: sk->sk_portpair != ports || ``` Now, if I had the history of this file handy, I'd probably be able to see if there were any previous attempts to address similar problems. Unfortunately, I'm working with the current source and don't have direct access to the Git history. This is making the investigation a little trickier, but no matter. I can work around that. What I can do is a search for where `sk_portpair` is *written* to. I'll search for every instance where `sk_portpair =` occurs. Let's see if it's ever written using `WRITE_ONCE`. That could give me a crucial clue about whether we're dealing with a synchronization issue here. |
| 26/2 | 2026/02/23 14:33 | tool | grepper |
0mError:no matches Args: map[Expression:sk_portpair =] Results: map[Output:] |
| 27/2 | 2026/02/23 14:33 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 64217 output: 20 thoughts: 34 Thoughts: Okay, here's my take on those thoughts, framed as an expert's internal monologue: **Investigating the Data Structure's Origin** Right, this smells like a union, which means the underlying data is probably accessed through different "views" or members. In this case, I'm guessing the structure in question, `skc_portpair`, is at the heart of the matter. So, to track down where it's defined and how it's used, my first instinct is to hunt for assignments or declarations of that very name. I'll need to search the code base for instances where `skc_portpair =` appears, likely finding the code that dictates how this union is actually built or assigned. That'll be the starting point to understand this data. |
| 28/2 | 2026/02/23 14:34 | tool | grepper |
0mError:no matches Args: map[Expression:skc_portpair =] Results: map[Output:] |
| 29/2 | 2026/02/23 14:34 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 64451 output: 19 thoughts: 9 Thoughts: Okay, here's my attempt at summarizing the thought process, tailored for an expert audience, and written in the first person: **Decoding the Data Stream: Locating the Key** Alright, I'm diving in. The first thing that pops into my head is, "Where's the critical piece of information I'm looking for?" I need to isolate a specific value within this data, likely an identifier or key. My immediate instinct is to look for a variable or field, probably a unique string. My intuition, based on previous encounters with similar data structures, suggests that `skc_num` is the likely candidate. The search starts there. It's often a good starting point to hunt for known variables and then branch out if that is not it. |
| 30/2 | 2026/02/23 14:34 | tool | grepper |
0mError:no matches Args: map[Expression:skc_num =] Results: map[Output:] |
| 31/2 | 2026/02/23 14:34 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 64653 output: 22 thoughts: 81 Thoughts: Here's my thought process, as if I were considering this myself: **Analyzing the `inet_num` Assignment** Okay, so I'm looking at this `inet_num` assignment. Right, `inet_num` is actually defined as `sk.__sk_common.skc_num`. That makes sense; it's the network-level specific number associated with the socket. Therefore, `inet_sk(sk)->inet_num = 0` is, essentially, the same as writing `sk->sk_num = 0`. That's important to keep in mind, and it clears up a potential source of confusion. Now, I need to understand if there are any specific memory ordering considerations involved here. I'll need to check if there are any `WRITE_ONCE` or similar constructs on this particular field. I should go straight to `net/ipv4/inet_hashtables.c` and search for any instances of `WRITE_ONCE` being used with `inet_num` (or potentially `skc_num`). That will immediately give me a clue about the potential concurrency implications, or how to handle any memory ordering related to the variable. |
| 32/2 | 2026/02/23 14:34 | tool | grepper |
0mArgs:map[Expression:WRITE_ONCE.*inet_num] Results: map[Output:net/ipv4/ping.c=144=void ping_unhash(struct sock *sk)
--
net/ipv4/ping.c-150- if (sk_del_node_init_rcu(sk)) {
net/ipv4/ping.c:151: WRITE_ONCE(isk->inet_num, 0);
net/ipv4/ping.c-152- isk->inet_sport = 0;
] |
| 33/2 | 2026/02/23 14:34 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 65131 output: 43 thoughts: 213 Thoughts: **Concurrent Access Conundrum** Alright, so I'm digging into this code and something's bugging me. I see that `ping_unhash` is using `WRITE_ONCE` when setting `isk->inet_num` to zero. But then, I look at `__inet_put_port`, and it's just a direct assignment: `inet_sk(sk)->inet_num = 0;`. Hmm. This discrepancy is raising a red flag. It feels like `inet_num` *should* be protected by `WRITE_ONCE` if it's potentially accessed concurrently. My next step is to revisit the `inet_match` function. I need to confirm if it's ever called in a context that overlaps with `__inet_put_port`. If `inet_match` is called from `__inet_lookup_established`, it's definitely going to be concurrent with `__inet_put_port`, and the lack of `WRITE_ONCE` in `__inet_put_port` would be a real problem. The question is why this difference? Why does `ping_unhash` use `WRITE_ONCE` while `__inet_put_port` doesn't? Is there a lookup path or something that is using different techniques for ping sockets? I need to understand if ping sockets are treated differently. Perhaps the lookup mechanism is different. `ping` sockets use `ping_v4_lookup` or `ping_v6_lookup`, which might be the root of the difference. Let's dive into `ping_v4_lookup` and see what's going on there. |
| 34/2 | 2026/02/23 14:34 | tool | codesearch-definition-source |
0mError:requested entity does not exist Args: map[ContextFile:net/ipv4/ping.c IncludeLines:true Name:ping_v4_lookup] Results: map[SourceCode: SourceFile:] |
| 35/2 | 2026/02/23 14:34 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 65762 output: 40 thoughts: 6 Thoughts: Okay, here's my attempt at summarizing the thought process, tailored for an expert audience, and written in the first person: **Initial Assessment and Approach** Alright, let's see. My first instinct here, given the apparent need for information retrieval or a network-related task, is to leverage a foundational utility. Hmm, what's the simplest, most direct approach? Let's try `ping_lookup`. It's my go-to when I need to quickly ascertain the accessibility of a host or understand network reachability. It’s the building block, so let's start there. |
| 36/2 | 2026/02/23 14:34 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:net/ipv4/ping.c IncludeLines:true Name:ping_lookup] Results: map[SourceCode: 159: /* Called under rcu_read_lock() */
160: static struct sock *ping_lookup(struct net *net, struct sk_buff *skb, u16 ident)
161: {
162: struct hlist_head *hslot = ping_hashslot(&ping_table, net, ident);
163: struct sock *sk = NULL;
164: struct inet_sock *isk;
165: int dif, sdif;
166:
167: if (skb->protocol == htons(ETH_P_IP)) {
168: dif = inet_iif(skb);
169: sdif = inet_sdif(skb);
170: pr_debug("try to find: num = %d, daddr = %pI4, dif = %d\n",
171: (int)ident, &ip_hdr(skb)->daddr, dif);
172: #if IS_ENABLED(CONFIG_IPV6)
173: } else if (skb->protocol == htons(ETH_P_IPV6)) {
174: dif = inet6_iif(skb);
175: sdif = inet6_sdif(skb);
176: pr_debug("try to find: num = %d, daddr = %pI6c, dif = %d\n",
177: (int)ident, &ipv6_hdr(skb)->daddr, dif);
178: #endif
179: } else {
180: return NULL;
181: }
182:
183: sk_for_each_rcu(sk, hslot) {
184: int bound_dev_if;
185:
186: if (!net_eq(sock_net(sk), net))
187: continue;
188: isk = inet_sk(sk);
189:
190: pr_debug("iterate\n");
191: if (READ_ONCE(isk->inet_num) != ident)
192: continue;
193:
194: bound_dev_if = READ_ONCE(sk->sk_bound_dev_if);
195: if (skb->protocol == htons(ETH_P_IP) &&
196: sk->sk_family == AF_INET) {
197: __be32 rcv_saddr = READ_ONCE(isk->inet_rcv_saddr);
198:
199: pr_debug("found: %p: num=%d, daddr=%pI4, dif=%d\n", sk,
200: ident, &rcv_saddr,
201: bound_dev_if);
202:
203: if (rcv_saddr && rcv_saddr != ip_hdr(skb)->daddr)
204: continue;
205: #if IS_ENABLED(CONFIG_IPV6)
206: } else if (skb->protocol == htons(ETH_P_IPV6) &&
207: sk->sk_family == AF_INET6) {
208:
209: pr_debug("found: %p: num=%d, daddr=%pI6c, dif=%d\n", sk,
210: ident,
211: &sk->sk_v6_rcv_saddr,
212: bound_dev_if);
213:
214: if (!ipv6_addr_any(&sk->sk_v6_rcv_saddr) &&
215: !ipv6_addr_equal(&sk->sk_v6_rcv_saddr,
216: &ipv6_hdr(skb)->daddr))
217: continue;
218: #endif
219: } else {
220: continue;
221: }
222:
223: if (bound_dev_if && bound_dev_if != dif &&
224: bound_dev_if != sdif)
225: continue;
226:
227: goto exit;
228: }
229:
230: sk = NULL;
231: exit:
232:
233: return sk;
234: }
SourceFile:net/ipv4/ping.c] |
| 37/2 | 2026/02/23 14:34 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 67190 output: 23 thoughts: 1084 Thoughts: **Analyzing Potential Data Races in `inet_match`**
Okay, so I'm thinking about `inet_match` and whether it needs `READ_ONCE`. I see that `ping_lookup` *does* use `READ_ONCE` for `isk->inet_num`, but `inet_match` doesn't for `sk_portpair`. This seems odd, especially since `inet_match` is used by `__inet_lookup_established` for both TCP and UDP. Since `__inet_lookup_established` would benefit from `READ_ONCE` for consistency, I need to understand why.
I've confirmed `inet_match` is indeed used within `__inet_lookup_established` and appears consistently across relevant header files. There's no obvious alternative implementation, so I'm focusing on the current code.
My initial thought was that it was an optimization, avoiding overhead on a 4-byte field. However, on x86, `READ_ONCE` on a 4-byte aligned value is a normal load, so that's probably not it.
Then I realized that `sk_portpair` is actually a combination of `dport` and `num`. Using `READ_ONCE` on the entire `sk_portpair` effectively gives us both at once, ensuring we get a consistent "snapshot." Without it, the compiler *could* theoretically split the load, introducing a data race.
I'd like to check for any existing reports of data races on `sk_portpair`, but I can't easily search the commit history.
Let's look closely at the `inet_match` code again:
```c
343: static inline bool inet_match(const struct net *net, const struct sock *sk,
344: const __addrpair cookie, const __portpair ports,
345: int dif, int sdif)
346: {
347: if (!net_eq(sock_net(sk), net) ||
348: sk->sk_portpair != ports ||
349: sk->sk_addrpair != cookie)
350: return false;
```
Changing this to use `READ_ONCE` for both `sk_portpair` and `sk_addrpair` seems straightforward:
```c
READ_ONCE(sk->sk_portpair) != ports ||
READ_ONCE(sk->sk_addrpair) != cookie)
```
That should fix the potential data race, right?
Now, is this race "benign"? I initially thought it was, because of these points:
1. The socket is in the process of being unhashed.
2. Missing it is okay.
3. If we find it, we re-check under refcount.
However, "benign" in KCSAN often means the result doesn't matter. In this case, the result *does* matter (matching the socket), but the race only happens when the socket is being removed. The "wrong" result (not matching) is then consistent with the system state.
But here's a wrinkle: what if `inet_num` is being zeroed *before* the socket is removed from the hash table? I reviewed `__inet_put_port`:
```c
224: spin_lock(&head->lock);
225: tb = inet_csk(sk)->icsk_bind_hash;
226: inet_csk(sk)->icsk_bind_hash = NULL;
227: inet_sk(sk)->inet_num = 0;
...
241: spin_unlock(&head->lock);
```
It doesn't seem to call `unhash` directly. I need to find where `unhash` is called.
Okay, I see it's called in `tcp_set_state`:
```c
3050: sk->sk_prot->unhash(sk);
3051: if (inet_csk(sk)->icsk_bind_hash &&
3052: !(sk->sk_userlocks & SOCK_BINDPORT_LOCK))
3053: inet_put_port(sk);
```
So `unhash` is called *before* `inet_put_port`, meaning the socket *is* being removed from the hash table before `inet_num` is zeroed. That's reassuring.
|
| 38/2 | 2026/02/23 14:35 | tool | codesearch-find-references |
0mArgs:map[Name:inet_put_port] Results: map[References:[map[ReferenceKind:calls ReferencingEntityKind:function ReferencingEntityName:__inet_bhash2_update_saddr SourceFile:net/ipv4/inet_hashtables.c SourceLine:959] map[ReferenceKind:calls ReferencingEntityKind:function ReferencingEntityName:tcp_set_state SourceFile:net/ipv4/tcp.c SourceLine:3053] map[ReferenceKind:calls ReferencingEntityKind:function ReferencingEntityName:tcp_v4_destroy_sock SourceFile:net/ipv4/tcp_ipv4.c SourceLine:2518]] TruncatedOutput:false] |
| 39/2 | 2026/02/23 14:35 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 69509 output: 48 thoughts: 14 Thoughts: Okay, let's dive into this. Here's my thought process on what we're looking at, summarized from the inside: **Analyzing `tcp_v4_destroy_sock`** Alright, the first thing I need to do is get my bearings. The focus is `tcp_v4_destroy_sock`, so I'm immediately thinking TCP IPv4 socket destruction. That's a critical function, and I'm probably going to be interested in resource cleanup, state transitions, and how it interacts with other kernel subsystems. I want to check for race conditions. I wonder what the locking scheme looks like within this function, and if there are any potential deadlocks with other parts of the network stack. I know destruction is usually tricky, particularly when connections are in a messy state. I need to understand what states are allowed when a socket is destroyed and how to do it safely. What are the dependencies? Are there any data structures that need careful handling to avoid corruption? I need to analyze how it handles the different socket states and associated data. Also, what about concurrent accesses? That's a huge potential point of failure. I should check how it interacts with the garbage collection mechanisms, and is there any memory allocated, or deallocated? And finally, what are the error conditions that are checked? |
| 40/2 | 2026/02/23 14:35 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:net/ipv4/tcp_ipv4.c IncludeLines:true Name:tcp_v4_destroy_sock] Results: map[SourceCode:2491: void tcp_v4_destroy_sock(struct sock *sk)
2492: {
2493: struct tcp_sock *tp = tcp_sk(sk);
2494:
2495: tcp_release_user_frags(sk);
2496:
2497: xa_destroy(&sk->sk_user_frags);
2498:
2499: trace_tcp_destroy_sock(sk);
2500:
2501: tcp_clear_xmit_timers(sk);
2502:
2503: tcp_cleanup_congestion_control(sk);
2504:
2505: tcp_cleanup_ulp(sk);
2506:
2507: /* Cleanup up the write buffer. */
2508: tcp_write_queue_purge(sk);
2509:
2510: /* Check if we want to disable active TFO */
2511: tcp_fastopen_active_disable_ofo_check(sk);
2512:
2513: /* Cleans up our, hopefully empty, out_of_order_queue. */
2514: skb_rbtree_purge(&tp->out_of_order_queue);
2515:
2516: /* Clean up a referenced TCP bind bucket. */
2517: if (inet_csk(sk)->icsk_bind_hash)
2518: inet_put_port(sk);
2519:
2520: BUG_ON(rcu_access_pointer(tp->fastopen_rsk));
2521:
2522: /* If socket is aborted during connect operation */
2523: tcp_free_fastopen_req(tp);
2524: tcp_fastopen_destroy_cipher(sk);
2525: tcp_saved_syn_free(tp);
2526:
2527: sk_sockets_allocated_dec(sk);
2528: }
SourceFile:net/ipv4/tcp_ipv4.c] |
| 41/2 | 2026/02/23 14:35 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 70421 output: 47 thoughts: 1584 Thoughts: **Analyzing a Potential Race Condition in TCP Socket Destruction** Okay, so I'm looking at this call to `inet_put_port` within `tcp_v4_destroy_sock`. My initial thought is, is the socket already unhashed by this point? It *should* be, since `destroy_sock` is usually called when the last reference is gone, but I need to be absolutely sure. The tricky part is, if the last reference is gone, how can `__inet_lookup_established` even *see* the socket? That function is called under `rcu_read_lock()`. If the socket is *still* in the hash table, it means either it hasn't been unhashed yet, or the RCU grace period hasn't passed. This is a crucial point. Right, then I'm reminding myself that `sock_put` eventually calls `sk_free`, which then calls `sk_destruct`, and that in turn invokes `sk->sk_destruct`. For TCP, it *seems* like `sk->sk_destruct` should be `tcp_v4_destruct_sock`, but I realize that's not quite right. It's actually `inet_sock_destruct`, and `tcp_v4_destroy_sock` is called via `sk->sk_prot->destroy`. Okay, let's trace this back to the `sk_free` function. I can see that `__sk_free` is calling `sk_prot_free`, and that ultimately `tcp_v4_destroy_sock` is being called from `inet_csk_destroy_sock`. This helps clarify the chain of events. So if `inet_csk_destroy_sock` is being called, the socket *should* already be unhashed. Let's look at `tcp_close`. If the socket's in `TCP_LISTEN`, we stop listening and set it to `TCP_CLOSE`. Established sockets are more complex, of course. The important thing is that `inet_num = 0` is only set when the socket is closing or being destroyed. But is there *another* place where `inet_put_port` might be called? I remember that `__inet_bhash2_update_saddr` function, that's called when the source address changes, and it calls `inet_put_port`. Does this happen for established sockets? Typically, not for TCP. Ah, wait, now I see something interesting in the stack trace from a write: `rds_tcp_accept_one` is calling `inet_put_port`. That's during socket release! And I see the `tcp_v4_rcv` call from a read. So if a packet arrives *exactly* when the socket is being closed, we might have a race! If `inet_match` returns false because of this race, `tcp_v4_rcv` won't find the established socket. It'll then look for a listener, and if it doesn't find one, it'll send a RST. Since the socket is likely already in `TCP_CLOSE` or about to be, sending that RST is probably fine. Wait a second. If the socket is in `TCP_TIME_WAIT`, it should be in the `tw_hash`, not the main `ehash`. The code confirms that `tcp_time_wait` moves the socket to time-wait. Right, `tcp_time_wait` calls `inet_put_port` *before* `__inet_hash_nolisten`. And `__inet_hash_nolisten` unhashes the `sk` from the `ehash` and hashes `tw` into the `ehash` or `tw_hash`, right? For TCP, both established and time-wait sockets are in the same `ehash` table, but time-wait sockets are handled specially. I'm going to double check `__inet_hash_nolisten` to solidify my understanding. |
| 42/2 | 2026/02/23 14:35 | tool | codesearch-definition-source |
0mError:requested entity does not exist Args: map[ContextFile:net/ipv4/inet_hashtables.c IncludeLines:true Name:__inet_hash_nolisten] Results: map[SourceCode: SourceFile:] |
| 43/2 | 2026/02/23 14:35 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 72937 output: 44 thoughts: 7 Thoughts: Okay, here's my attempt at summarizing the thought process, tailored for an expert audience, and written in the first person: **Investigating a Hashing Function** Alright, let's see... I'm thinking about the hashing function at hand. I should try `__inet_hash`. It's a key candidate given the context. Let me examine how the underlying mechanisms are implemented. I'll need to remember to check for any potential edge cases or biases it might introduce, especially considering this application. Perhaps I should compare it with similar algorithms to check it's suitability. |
| 44/2 | 2026/02/23 14:35 | tool | codesearch-definition-source |
0mError:requested entity does not exist Args: map[ContextFile:net/ipv4/inet_hashtables.c IncludeLines:true Name:__inet_hash] Results: map[SourceCode: SourceFile:] |
| 45/2 | 2026/02/23 14:35 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 73132 output: 21 thoughts: 91 Thoughts: **Investigating the TCP Connection Allocation Process** Okay, so I need to understand the connection allocation flow. I'll start by tracing the use of `inet_twsk_alloc`. Where is it called, and what happens directly afterwards? Let's see... right, here it is in `net/ipv4/tcp_minisocks.c`. Interesting, I see `inet_put_port(sk);` and then `__inet_hash_nolisten(sk, (struct sock *)tw);`. Now, I need to know what `__inet_hash_nolisten` does in relation to everything. I'll do a tree-wide search for that function to get the broader context. |
| 46/2 | 2026/02/23 14:35 | tool | grepper |
0mError:no matches Args: map[Expression:__inet_hash_nolisten] Results: map[Output:] |
| 47/2 | 2026/02/23 14:35 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 73408 output: 27 thoughts: 101 Thoughts: **Debugging a TCP Time Wait Issue** Okay, so I'm trying to figure out this TCP `time_wait` issue. First thought, could it be related to `inet_hash_nolisten`? Seems unlikely, but worth a quick check. Nope, doesn't seem to be it. Alright, time to dig deeper. The stack trace points me to `net/ipv4/tcp_minisocks.c` around line 394. Let me double-check that... Yep, the trace is very specific: `tcp_time_wait+0x4fd/0x5c0 net/ipv4/tcp_minisocks.c:394`. Right, looks like I'll be spending some time in that file, starting with that line. Let's see what's happening there. |
| 48/2 | 2026/02/23 14:35 | tool | read-file |
0mArgs:map[File:net/ipv4/tcp_minisocks.c] Results: map[Contents:// SPDX-License-Identifier: GPL-2.0-only
/*
* INET An implementation of the TCP/IP protocol suite for the LINUX
* operating system. INET is implemented using the BSD Socket
* interface as the means of communication with the user level.
*
* Implementation of the Transmission Control Protocol(TCP).
*
* Authors: Ross Biro
* Fred N. van Kempen, <waltje@uWalt.NL.Mugnet.ORG>
* Mark Evans, <evansmp@uhura.aston.ac.uk>
* Corey Minyard <wf-rch!minyard@relay.EU.net>
* Florian La Roche, <flla@stud.uni-sb.de>
* Charles Hedrick, <hedrick@klinzhai.rutgers.edu>
* Linus Torvalds, <torvalds@cs.helsinki.fi>
* Alan Cox, <gw4pts@gw4pts.ampr.org>
* Matthew Dillon, <dillon@apollo.west.oic.com>
* Arnt Gulbrandsen, <agulbra@nvg.unit.no>
* Jorge Cwik, <jorge@laser.satlink.net>
*/
#include <net/tcp.h>
#include <net/tcp_ecn.h>
#include <net/xfrm.h>
#include <net/busy_poll.h>
#include <net/rstreason.h>
#include <net/psp.h>
static bool tcp_in_window(u32 seq, u32 end_seq, u32 s_win, u32 e_win)
{
if (seq == s_win)
return true;
if (after(end_seq, s_win) && before(seq, e_win))
return true;
return seq == e_win && seq == end_seq;
}
static enum tcp_tw_status
tcp_timewait_check_oow_rate_limit(struct inet_timewait_sock *tw,
const struct sk_buff *skb, int mib_idx)
{
struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);
if (!tcp_oow_rate_limited(twsk_net(tw), skb, mib_idx,
&tcptw->tw_last_oow_ack_time)) {
/* Send ACK. Note, we do not put the bucket,
* it will be released by caller.
*/
return TCP_TW_ACK_OOW;
}
/* We are rate-limiting, so just release the tw sock and drop skb. */
inet_twsk_put(tw);
return TCP_TW_SUCCESS;
}
static void twsk_rcv_nxt_update(struct tcp_timewait_sock *tcptw, u32 seq,
u32 rcv_nxt)
{
#ifdef CONFIG_TCP_AO
struct tcp_ao_info *ao;
ao = rcu_dereference(tcptw->ao_info);
if (unlikely(ao && seq < rcv_nxt))
WRITE_ONCE(ao->rcv_sne, ao->rcv_sne + 1);
#endif
WRITE_ONCE(tcptw->tw_rcv_nxt, seq);
}
/*
* * Main purpose of TIME-WAIT state is to close connection gracefully,
* when one of ends sits in LAST-ACK or CLOSING retransmitting FIN
* (and, probably, tail of data) and one or more our ACKs are lost.
* * What is TIME-WAIT timeout? It is associated with maximal packet
* lifetime in the internet, which results in wrong conclusion, that
* it is set to catch "old duplicate segments" wandering out of their path.
* It is not quite correct. This timeout is calculated so that it exceeds
* maximal retransmission timeout enough to allow to lose one (or more)
* segments sent by peer and our ACKs. This time may be calculated from RTO.
* * When TIME-WAIT socket receives RST, it means that another end
* finally closed and we are allowed to kill TIME-WAIT too.
* * Second purpose of TIME-WAIT is catching old duplicate segments.
* Well, certainly it is pure paranoia, but if we load TIME-WAIT
* with this semantics, we MUST NOT kill TIME-WAIT state with RSTs.
* * If we invented some more clever way to catch duplicates
* (f.e. based on PAWS), we could truncate TIME-WAIT to several RTOs.
*
* The algorithm below is based on FORMAL INTERPRETATION of RFCs.
* When you compare it to RFCs, please, read section SEGMENT ARRIVES
* from the very beginning.
*
* NOTE. With recycling (and later with fin-wait-2) TW bucket
* is _not_ stateless. It means, that strictly speaking we must
* spinlock it. I do not want! Well, probability of misbehaviour
* is ridiculously low and, seems, we could use some mb() tricks
* to avoid misread sequence numbers, states etc. --ANK
*
* We don't need to initialize tmp_out.sack_ok as we don't use the results
*/
enum tcp_tw_status
tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
const struct tcphdr *th, u32 *tw_isn,
enum skb_drop_reason *drop_reason)
{
struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);
u32 rcv_nxt = READ_ONCE(tcptw->tw_rcv_nxt);
struct tcp_options_received tmp_opt;
enum skb_drop_reason psp_drop;
bool paws_reject = false;
int ts_recent_stamp;
/* Instead of dropping immediately, wait to see what value is
* returned. We will accept a non psp-encapsulated syn in the
* case where TCP_TW_SYN is returned.
*/
psp_drop = psp_twsk_rx_policy_check(tw, skb);
tmp_opt.saw_tstamp = 0;
ts_recent_stamp = READ_ONCE(tcptw->tw_ts_recent_stamp);
if (th->doff > (sizeof(*th) >> 2) && ts_recent_stamp) {
tcp_parse_options(twsk_net(tw), skb, &tmp_opt, 0, NULL);
if (tmp_opt.saw_tstamp) {
if (tmp_opt.rcv_tsecr)
tmp_opt.rcv_tsecr -= tcptw->tw_ts_offset;
tmp_opt.ts_recent = READ_ONCE(tcptw->tw_ts_recent);
tmp_opt.ts_recent_stamp = ts_recent_stamp;
paws_reject = tcp_paws_reject(&tmp_opt, th->rst);
}
}
if (READ_ONCE(tw->tw_substate) == TCP_FIN_WAIT2) {
/* Just repeat all the checks of tcp_rcv_state_process() */
if (psp_drop)
goto out_put;
/* Out of window, send ACK */
if (paws_reject ||
!tcp_in_window(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq,
rcv_nxt,
rcv_nxt + tcptw->tw_rcv_wnd))
return tcp_timewait_check_oow_rate_limit(
tw, skb, LINUX_MIB_TCPACKSKIPPEDFINWAIT2);
if (th->rst)
goto kill;
if (th->syn && !before(TCP_SKB_CB(skb)->seq, rcv_nxt))
return TCP_TW_RST;
/* Dup ACK? */
if (!th->ack ||
!after(TCP_SKB_CB(skb)->end_seq, rcv_nxt) ||
TCP_SKB_CB(skb)->end_seq == TCP_SKB_CB(skb)->seq) {
inet_twsk_put(tw);
return TCP_TW_SUCCESS;
}
/* New data or FIN. If new data arrive after half-duplex close,
* reset.
*/
if (!th->fin ||
TCP_SKB_CB(skb)->end_seq != rcv_nxt + 1)
return TCP_TW_RST;
/* FIN arrived, enter true time-wait state. */
WRITE_ONCE(tw->tw_substate, TCP_TIME_WAIT);
twsk_rcv_nxt_update(tcptw, TCP_SKB_CB(skb)->end_seq,
rcv_nxt);
if (tmp_opt.saw_tstamp) {
u64 ts = tcp_clock_ms();
WRITE_ONCE(tw->tw_entry_stamp, ts);
WRITE_ONCE(tcptw->tw_ts_recent_stamp,
div_u64(ts, MSEC_PER_SEC));
WRITE_ONCE(tcptw->tw_ts_recent,
tmp_opt.rcv_tsval);
}
inet_twsk_reschedule(tw, TCP_TIMEWAIT_LEN);
return TCP_TW_ACK;
}
/*
* Now real TIME-WAIT state.
*
* RFC 1122:
* "When a connection is [...] on TIME-WAIT state [...]
* [a TCP] MAY accept a new SYN from the remote TCP to
* reopen the connection directly, if it:
*
* (1) assigns its initial sequence number for the new
* connection to be larger than the largest sequence
* number it used on the previous connection incarnation,
* and
*
* (2) returns to TIME-WAIT state if the SYN turns out
* to be an old duplicate".
*/
if (!paws_reject &&
(TCP_SKB_CB(skb)->seq == rcv_nxt &&
(TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq || th->rst))) {
/* In window segment, it may be only reset or bare ack. */
if (psp_drop)
goto out_put;
if (th->rst) {
/* This is TIME_WAIT assassination, in two flavors.
* Oh well... nobody has a sufficient solution to this
* protocol bug yet.
*/
if (!READ_ONCE(twsk_net(tw)->ipv4.sysctl_tcp_rfc1337)) {
kill:
inet_twsk_deschedule_put(tw);
return TCP_TW_SUCCESS;
}
} else {
inet_twsk_reschedule(tw, TCP_TIMEWAIT_LEN);
}
if (tmp_opt.saw_tstamp) {
WRITE_ONCE(tcptw->tw_ts_recent,
tmp_opt.rcv_tsval);
WRITE_ONCE(tcptw->tw_ts_recent_stamp,
ktime_get_seconds());
}
inet_twsk_put(tw);
return TCP_TW_SUCCESS;
}
/* Out of window segment.
All the segments are ACKed immediately.
The only exception is new SYN. We accept it, if it is
not old duplicate and we are not in danger to be killed
by delayed old duplicates. RFC check is that it has
newer sequence number works at rates <40Mbit/sec.
However, if paws works, it is reliable AND even more,
we even may relax silly seq space cutoff.
RED-PEN: we violate main RFC requirement, if this SYN will appear
old duplicate (i.e. we receive RST in reply to SYN-ACK),
we must return socket to time-wait state. It is not good,
but not fatal yet.
*/
if (th->syn && !th->rst && !th->ack && !paws_reject &&
(after(TCP_SKB_CB(skb)->seq, rcv_nxt) ||
(tmp_opt.saw_tstamp &&
(s32)(READ_ONCE(tcptw->tw_ts_recent) - tmp_opt.rcv_tsval) < 0))) {
u32 isn = tcptw->tw_snd_nxt + 65535 + 2;
if (isn == 0)
isn++;
*tw_isn = isn;
return TCP_TW_SYN;
}
if (psp_drop)
goto out_put;
if (paws_reject) {
*drop_reason = SKB_DROP_REASON_TCP_RFC7323_TW_PAWS;
__NET_INC_STATS(twsk_net(tw), LINUX_MIB_PAWS_TW_REJECTED);
}
if (!th->rst) {
/* In this case we must reset the TIMEWAIT timer.
*
* If it is ACKless SYN it may be both old duplicate
* and new good SYN with random sequence number <rcv_nxt.
* Do not reschedule in the last case.
*/
if (paws_reject || th->ack)
inet_twsk_reschedule(tw, TCP_TIMEWAIT_LEN);
return tcp_timewait_check_oow_rate_limit(
tw, skb, LINUX_MIB_TCPACKSKIPPEDTIMEWAIT);
}
out_put:
inet_twsk_put(tw);
return TCP_TW_SUCCESS;
}
EXPORT_IPV6_MOD(tcp_timewait_state_process);
static void tcp_time_wait_init(struct sock *sk, struct tcp_timewait_sock *tcptw)
{
#ifdef CONFIG_TCP_MD5SIG
const struct tcp_sock *tp = tcp_sk(sk);
struct tcp_md5sig_key *key;
/*
* The timewait bucket does not have the key DB from the
* sock structure. We just make a quick copy of the
* md5 key being used (if indeed we are using one)
* so the timewait ack generating code has the key.
*/
tcptw->tw_md5_key = NULL;
if (!static_branch_unlikely(&tcp_md5_needed.key))
return;
key = tp->af_specific->md5_lookup(sk, sk);
if (key) {
tcptw->tw_md5_key = kmemdup(key, sizeof(*key), GFP_ATOMIC);
if (!tcptw->tw_md5_key)
return;
if (!static_key_fast_inc_not_disabled(&tcp_md5_needed.key.key))
goto out_free;
}
return;
out_free:
WARN_ON_ONCE(1);
kfree(tcptw->tw_md5_key);
tcptw->tw_md5_key = NULL;
#endif
}
/*
* Move a socket to time-wait or dead fin-wait-2 state.
*/
void tcp_time_wait(struct sock *sk, int state, int timeo)
{
const struct inet_connection_sock *icsk = inet_csk(sk);
struct tcp_sock *tp = tcp_sk(sk);
struct net *net = sock_net(sk);
struct inet_timewait_sock *tw;
tw = inet_twsk_alloc(sk, &net->ipv4.tcp_death_row, state);
if (tw) {
struct tcp_timewait_sock *tcptw = tcp_twsk((struct sock *)tw);
const int rto = (icsk->icsk_rto << 2) - (icsk->icsk_rto >> 1);
tw->tw_mark = sk->sk_mark;
tw->tw_priority = READ_ONCE(sk->sk_priority);
tw->tw_rcv_wscale = tp->rx_opt.rcv_wscale;
/* refreshed when we enter true TIME-WAIT state */
tw->tw_entry_stamp = tcp_time_stamp_ms(tp);
tcptw->tw_rcv_nxt = tp->rcv_nxt;
tcptw->tw_snd_nxt = tp->snd_nxt;
tcptw->tw_rcv_wnd = tcp_receive_window(tp);
tcptw->tw_ts_recent = tp->rx_opt.ts_recent;
tcptw->tw_ts_recent_stamp = tp->rx_opt.ts_recent_stamp;
tcptw->tw_ts_offset = tp->tsoffset;
tw->tw_usec_ts = tp->tcp_usec_ts;
tcptw->tw_last_oow_ack_time = 0;
tcptw->tw_tx_delay = tp->tcp_tx_delay;
tw->tw_txhash = sk->sk_txhash;
tw->tw_tx_queue_mapping = sk->sk_tx_queue_mapping;
#ifdef CONFIG_SOCK_RX_QUEUE_MAPPING
tw->tw_rx_queue_mapping = sk->sk_rx_queue_mapping;
#endif
#if IS_ENABLED(CONFIG_IPV6)
if (tw->tw_family == PF_INET6) {
struct ipv6_pinfo *np = inet6_sk(sk);
tw->tw_v6_daddr = sk->sk_v6_daddr;
tw->tw_v6_rcv_saddr = sk->sk_v6_rcv_saddr;
tw->tw_tclass = np->tclass;
tw->tw_flowlabel = be32_to_cpu(np->flow_label & IPV6_FLOWLABEL_MASK);
tw->tw_ipv6only = sk->sk_ipv6only;
}
#endif
tcp_time_wait_init(sk, tcptw);
tcp_ao_time_wait(tcptw, tp);
/* Get the TIME_WAIT timeout firing. */
if (timeo < rto)
timeo = rto;
if (state == TCP_TIME_WAIT)
timeo = TCP_TIMEWAIT_LEN;
/* Linkage updates.
* Note that access to tw after this point is illegal.
*/
inet_twsk_hashdance_schedule(tw, sk, net->ipv4.tcp_death_row.hashinfo, timeo);
} else {
/* Sorry, if we're out of memory, just CLOSE this
* socket up. We've got bigger problems than
* non-graceful socket closings.
*/
NET_INC_STATS(net, LINUX_MIB_TCPTIMEWAITOVERFLOW);
}
tcp_update_metrics(sk);
tcp_done(sk);
}
EXPORT_SYMBOL(tcp_time_wait);
void tcp_twsk_destructor(struct sock *sk)
{
#ifdef CONFIG_TCP_MD5SIG
if (static_branch_unlikely(&tcp_md5_needed.key)) {
struct tcp_timewait_sock *twsk = tcp_twsk(sk);
if (twsk->tw_md5_key) {
kfree(twsk->tw_md5_key);
static_branch_slow_dec_deferred(&tcp_md5_needed);
}
}
#endif
tcp_ao_destroy_sock(sk, true);
psp_twsk_assoc_free(inet_twsk(sk));
}
void tcp_twsk_purge(struct list_head *net_exit_list)
{
bool purged_once = false;
struct net *net;
list_for_each_entry(net, net_exit_list, exit_list) {
if (net->ipv4.tcp_death_row.hashinfo->pernet) {
/* Even if tw_refcount == 1, we must clean up kernel reqsk */
inet_twsk_purge(net->ipv4.tcp_death_row.hashinfo);
} else if (!purged_once) {
inet_twsk_purge(&tcp_hashinfo);
purged_once = true;
}
}
}
/* Warning : This function is called without sk_listener being locked.
* Be sure to read socket fields once, as their value could change under us.
*/
void tcp_openreq_init_rwin(struct request_sock *req,
const struct sock *sk_listener,
const struct dst_entry *dst)
{
struct inet_request_sock *ireq = inet_rsk(req);
const struct tcp_sock *tp = tcp_sk(sk_listener);
int full_space = tcp_full_space(sk_listener);
u32 window_clamp;
__u8 rcv_wscale;
u32 rcv_wnd;
int mss;
mss = tcp_mss_clamp(tp, dst_metric_advmss(dst));
window_clamp = READ_ONCE(tp->window_clamp);
/* Set this up on the first call only */
req->rsk_window_clamp = window_clamp ? : dst_metric(dst, RTAX_WINDOW);
/* limit the window selection if the user enforce a smaller rx buffer */
if (sk_listener->sk_userlocks & SOCK_RCVBUF_LOCK &&
(req->rsk_window_clamp > full_space || req->rsk_window_clamp == 0))
req->rsk_window_clamp = full_space;
rcv_wnd = tcp_rwnd_init_bpf((struct sock *)req);
if (rcv_wnd == 0)
rcv_wnd = dst_metric(dst, RTAX_INITRWND);
else if (full_space < rcv_wnd * mss)
full_space = rcv_wnd * mss;
/* tcp_full_space because it is guaranteed to be the first packet */
tcp_select_initial_window(sk_listener, full_space,
mss - (ireq->tstamp_ok ? TCPOLEN_TSTAMP_ALIGNED : 0),
&req->rsk_rcv_wnd,
&req->rsk_window_clamp,
ireq->wscale_ok,
&rcv_wscale,
rcv_wnd);
ireq->rcv_wscale = rcv_wscale;
}
static void tcp_ecn_openreq_child(struct sock *sk,
const struct request_sock *req,
const struct sk_buff *skb)
{
const struct tcp_request_sock *treq = tcp_rsk(req);
struct tcp_sock *tp = tcp_sk(sk);
if (treq->accecn_ok) {
tcp_ecn_mode_set(tp, TCP_ECN_MODE_ACCECN);
tp->syn_ect_snt = treq->syn_ect_snt;
tcp_accecn_third_ack(sk, skb, treq->syn_ect_snt);
tp->saw_accecn_opt = treq->saw_accecn_opt;
if (treq->accecn_fail_mode & TCP_ACCECN_ACE_FAIL_SEND)
tcp_accecn_fail_mode_set(tp, TCP_ACCECN_ACE_FAIL_SEND);
if (treq->accecn_fail_mode & TCP_ACCECN_ACE_FAIL_RECV)
tcp_accecn_fail_mode_set(tp, TCP_ACCECN_ACE_FAIL_RECV);
tp->prev_ecnfield = treq->syn_ect_rcv;
tp->accecn_opt_demand = 1;
tcp_ecn_received_counters_payload(sk, skb);
} else {
if (inet_rsk(req)->ecn_ok && !tcp_ca_no_fallback_rfc3168(sk))
tcp_ecn_mode_set(tp, TCP_ECN_MODE_RFC3168);
else
tcp_ecn_mode_set(tp, TCP_ECN_DISABLED);
}
}
void tcp_ca_openreq_child(struct sock *sk, const struct dst_entry *dst)
{
struct inet_connection_sock *icsk = inet_csk(sk);
u32 ca_key = dst_metric(dst, RTAX_CC_ALGO);
bool ca_got_dst = false;
if (ca_key != TCP_CA_UNSPEC) {
const struct tcp_congestion_ops *ca;
rcu_read_lock();
ca = tcp_ca_find_key(ca_key);
if (likely(ca && bpf_try_module_get(ca, ca->owner))) {
icsk->icsk_ca_dst_locked = tcp_ca_dst_locked(dst);
icsk->icsk_ca_ops = ca;
ca_got_dst = true;
}
rcu_read_unlock();
}
/* If no valid choice made yet, assign current system default ca. */
if (!ca_got_dst &&
(!icsk->icsk_ca_setsockopt ||
!bpf_try_module_get(icsk->icsk_ca_ops, icsk->icsk_ca_ops->owner)))
tcp_assign_congestion_control(sk);
tcp_set_ca_state(sk, TCP_CA_Open);
}
EXPORT_IPV6_MOD_GPL(tcp_ca_openreq_child);
static void smc_check_reset_syn_req(const struct tcp_sock *oldtp,
struct request_sock *req,
struct tcp_sock *newtp)
{
#if IS_ENABLED(CONFIG_SMC)
struct inet_request_sock *ireq;
if (static_branch_unlikely(&tcp_have_smc)) {
ireq = inet_rsk(req);
if (oldtp->syn_smc && !ireq->smc_ok)
newtp->syn_smc = 0;
}
#endif
}
/* This is not only more efficient than what we used to do, it eliminates
* a lot of code duplication between IPv4/IPv6 SYN recv processing. -DaveM
*
* Actually, we could lots of memory writes here. tp of listening
* socket contains all necessary default parameters.
*/
struct sock *tcp_create_openreq_child(const struct sock *sk,
struct request_sock *req,
struct sk_buff *skb)
{
struct sock *newsk = inet_csk_clone_lock(sk, req, GFP_ATOMIC);
const struct inet_request_sock *ireq = inet_rsk(req);
struct tcp_request_sock *treq = tcp_rsk(req);
struct inet_connection_sock *newicsk;
const struct tcp_sock *oldtp;
struct tcp_sock *newtp;
u32 seq;
if (!newsk)
return NULL;
newicsk = inet_csk(newsk);
newtp = tcp_sk(newsk);
oldtp = tcp_sk(sk);
smc_check_reset_syn_req(oldtp, req, newtp);
/* Now setup tcp_sock */
newtp->pred_flags = 0;
seq = treq->rcv_isn + 1;
newtp->rcv_wup = seq;
WRITE_ONCE(newtp->copied_seq, seq);
WRITE_ONCE(newtp->rcv_nxt, seq);
newtp->segs_in = 1;
seq = treq->snt_isn + 1;
newtp->snd_sml = newtp->snd_una = seq;
WRITE_ONCE(newtp->snd_nxt, seq);
newtp->snd_up = seq;
INIT_LIST_HEAD(&newtp->tsq_node);
INIT_LIST_HEAD(&newtp->tsorted_sent_queue);
tcp_init_wl(newtp, treq->rcv_isn);
minmax_reset(&newtp->rtt_min, tcp_jiffies32, ~0U);
newicsk->icsk_ack.lrcvtime = tcp_jiffies32;
newtp->lsndtime = tcp_jiffies32;
newsk->sk_txhash = READ_ONCE(treq->txhash);
newtp->total_retrans = req->num_retrans;
tcp_init_xmit_timers(newsk);
WRITE_ONCE(newtp->write_seq, newtp->pushed_seq = treq->snt_isn + 1);
if (sock_flag(newsk, SOCK_KEEPOPEN))
tcp_reset_keepalive_timer(newsk, keepalive_time_when(newtp));
newtp->rx_opt.tstamp_ok = ireq->tstamp_ok;
newtp->rx_opt.sack_ok = ireq->sack_ok;
newtp->window_clamp = req->rsk_window_clamp;
newtp->rcv_ssthresh = req->rsk_rcv_wnd;
newtp->rcv_wnd = req->rsk_rcv_wnd;
newtp->rx_opt.wscale_ok = ireq->wscale_ok;
if (newtp->rx_opt.wscale_ok) {
newtp->rx_opt.snd_wscale = ireq->snd_wscale;
newtp->rx_opt.rcv_wscale = ireq->rcv_wscale;
} else {
newtp->rx_opt.snd_wscale = newtp->rx_opt.rcv_wscale = 0;
newtp->window_clamp = min(newtp->window_clamp, 65535U);
}
newtp->snd_wnd = ntohs(tcp_hdr(skb)->window) << newtp->rx_opt.snd_wscale;
newtp->max_window = newtp->snd_wnd;
if (newtp->rx_opt.tstamp_ok) {
newtp->tcp_usec_ts = treq->req_usec_ts;
newtp->rx_opt.ts_recent = req->ts_recent;
newtp->rx_opt.ts_recent_stamp = ktime_get_seconds();
newtp->tcp_header_len = sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
} else {
newtp->tcp_usec_ts = 0;
newtp->rx_opt.ts_recent_stamp = 0;
newtp->tcp_header_len = sizeof(struct tcphdr);
}
if (req->num_timeout) {
newtp->total_rto = req->num_timeout;
newtp->undo_marker = treq->snt_isn;
if (newtp->tcp_usec_ts) {
newtp->retrans_stamp = treq->snt_synack;
newtp->total_rto_time = (u32)(tcp_clock_us() -
newtp->retrans_stamp) / USEC_PER_MSEC;
} else {
newtp->retrans_stamp = div_u64(treq->snt_synack,
USEC_PER_SEC / TCP_TS_HZ);
newtp->total_rto_time = tcp_clock_ms() -
newtp->retrans_stamp;
}
newtp->total_rto_recoveries = 1;
}
newtp->tsoffset = treq->ts_off;
#ifdef CONFIG_TCP_MD5SIG
newtp->md5sig_info = NULL; /*XXX*/
#endif
#ifdef CONFIG_TCP_AO
newtp->ao_info = NULL;
if (tcp_rsk_used_ao(req)) {
struct tcp_ao_key *ao_key;
ao_key = treq->af_specific->ao_lookup(sk, req, tcp_rsk(req)->ao_keyid, -1);
if (ao_key)
newtp->tcp_header_len += tcp_ao_len_aligned(ao_key);
}
#endif
if (skb->len >= TCP_MSS_DEFAULT + newtp->tcp_header_len)
newicsk->icsk_ack.last_seg_size = skb->len - newtp->tcp_header_len;
newtp->rx_opt.mss_clamp = req->mss;
tcp_ecn_openreq_child(newsk, req, skb);
newtp->fastopen_req = NULL;
RCU_INIT_POINTER(newtp->fastopen_rsk, NULL);
newtp->bpf_chg_cc_inprogress = 0;
tcp_bpf_clone(sk, newsk);
__TCP_INC_STATS(sock_net(sk), TCP_MIB_PASSIVEOPENS);
xa_init_flags(&newsk->sk_user_frags, XA_FLAGS_ALLOC1);
return newsk;
}
EXPORT_SYMBOL(tcp_create_openreq_child);
/*
* Process an incoming packet for SYN_RECV sockets represented as a
* request_sock. Normally sk is the listener socket but for TFO it
* points to the child socket.
*
* XXX (TFO) - The current impl contains a special check for ack
* validation and inside tcp_v4_reqsk_send_ack(). Can we do better?
*
* We don't need to initialize tmp_opt.sack_ok as we don't use the results
*
* Note: If @fastopen is true, this can be called from process context.
* Otherwise, this is from BH context.
*/
struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
struct request_sock *req,
bool fastopen, bool *req_stolen,
enum skb_drop_reason *drop_reason)
{
struct tcp_options_received tmp_opt;
struct sock *child;
const struct tcphdr *th = tcp_hdr(skb);
__be32 flg = tcp_flag_word(th) & (TCP_FLAG_RST|TCP_FLAG_SYN|TCP_FLAG_ACK);
bool tsecr_reject = false;
bool paws_reject = false;
bool own_req;
tmp_opt.saw_tstamp = 0;
tmp_opt.accecn = 0;
if (th->doff > (sizeof(struct tcphdr)>>2)) {
tcp_parse_options(sock_net(sk), skb, &tmp_opt, 0, NULL);
if (tmp_opt.saw_tstamp) {
tmp_opt.ts_recent = req->ts_recent;
if (tmp_opt.rcv_tsecr) {
if (inet_rsk(req)->tstamp_ok && !fastopen)
tsecr_reject = !between(tmp_opt.rcv_tsecr,
tcp_rsk(req)->snt_tsval_first,
READ_ONCE(tcp_rsk(req)->snt_tsval_last));
tmp_opt.rcv_tsecr -= tcp_rsk(req)->ts_off;
}
/* We do not store true stamp, but it is not required,
* it can be estimated (approximately)
* from another data.
*/
tmp_opt.ts_recent_stamp = ktime_get_seconds() -
tcp_reqsk_timeout(req) / HZ;
paws_reject = tcp_paws_reject(&tmp_opt, th->rst);
}
}
/* Check for pure retransmitted SYN. */
if (TCP_SKB_CB(skb)->seq == tcp_rsk(req)->rcv_isn &&
flg == TCP_FLAG_SYN &&
!paws_reject) {
/*
* RFC793 draws (Incorrectly! It was fixed in RFC1122)
* this case on figure 6 and figure 8, but formal
* protocol description says NOTHING.
* To be more exact, it says that we should send ACK,
* because this segment (at least, if it has no data)
* is out of window.
*
* CONCLUSION: RFC793 (even with RFC1122) DOES NOT
* describe SYN-RECV state. All the description
* is wrong, we cannot believe to it and should
* rely only on common sense and implementation
* experience.
*
* Enforce "SYN-ACK" according to figure 8, figure 6
* of RFC793, fixed by RFC1122.
*
* Note that even if there is new data in the SYN packet
* they will be thrown away too.
*
* Reset timer after retransmitting SYNACK, similar to
* the idea of fast retransmit in recovery.
*/
if (!tcp_oow_rate_limited(sock_net(sk), skb,
LINUX_MIB_TCPACKSKIPPEDSYNRECV,
&tcp_rsk(req)->last_oow_ack_time)) {
if (tcp_rsk(req)->accecn_ok) {
u8 ect_rcv = TCP_SKB_CB(skb)->ip_dsfield &
INET_ECN_MASK;
tcp_rsk(req)->syn_ect_rcv = ect_rcv;
if (tcp_accecn_ace(tcp_hdr(skb)) == 0x0)
tcp_rsk(req)->accecn_fail_mode |= TCP_ACCECN_ACE_FAIL_RECV;
}
if (!tcp_rtx_synack(sk, req)) {
unsigned long expires = jiffies;
if (req->num_retrans > 1 && tcp_rsk(req)->accecn_ok)
tcp_rsk(req)->accecn_fail_mode |= TCP_ACCECN_ACE_FAIL_SEND;
expires += tcp_reqsk_timeout(req);
if (!fastopen)
mod_timer_pending(&req->rsk_timer,
expires);
else
req->rsk_timer.expires = expires;
}
}
return NULL;
}
/* Further reproduces section "SEGMENT ARRIVES"
for state SYN-RECEIVED of RFC793.
It is broken, however, it does not work only
when SYNs are crossed.
You would think that SYN crossing is impossible here, since
we should have a SYN_SENT socket (from connect()) on our end,
but this is not true if the crossed SYNs were sent to both
ends by a malicious third party. We must defend against this,
and to do that we first verify the ACK (as per RFC793, page
36) and reset if it is invalid. Is this a true full defense?
To convince ourselves, let us consider a way in which the ACK
test can still pass in this 'malicious crossed SYNs' case.
Malicious sender sends identical SYNs (and thus identical sequence
numbers) to both A and B:
A: gets SYN, seq=7
B: gets SYN, seq=7
By our good fortune, both A and B select the same initial
send sequence number of seven :-)
A: sends SYN|ACK, seq=7, ack_seq=8
B: sends SYN|ACK, seq=7, ack_seq=8
So we are now A eating this SYN|ACK, ACK test passes. So
does sequence test, SYN is truncated, and thus we consider
it a bare ACK.
If icsk->icsk_accept_queue.rskq_defer_accept, we silently drop this
bare ACK. Otherwise, we create an established connection. Both
ends (listening sockets) accept the new incoming connection and try
to talk to each other. 8-)
Note: This case is both harmless, and rare. Possibility is about the
same as us discovering intelligent life on another plant tomorrow.
But generally, we should (RFC lies!) to accept ACK
from SYNACK both here and in tcp_rcv_state_process().
tcp_rcv_state_process() does not, hence, we do not too.
Note that the case is absolutely generic:
we cannot optimize anything here without
violating protocol. All the checks must be made
before attempt to create socket.
*/
/* RFC793 page 36: "If the connection is in any non-synchronized state ...
* and the incoming segment acknowledges something not yet
* sent (the segment carries an unacceptable ACK) ...
* a reset is sent."
*
* Invalid ACK: reset will be sent by listening socket.
* Note that the ACK validity check for a Fast Open socket is done
* elsewhere and is checked directly against the child socket rather
* than req because user data may have been sent out.
*/
if ((flg & TCP_FLAG_ACK) && !fastopen &&
(TCP_SKB_CB(skb)->ack_seq !=
tcp_rsk(req)->snt_isn + 1))
return sk;
/* RFC793: "first check sequence number". */
if (paws_reject || tsecr_reject ||
!tcp_in_window(TCP_SKB_CB(skb)->seq,
TCP_SKB_CB(skb)->end_seq,
tcp_rsk(req)->rcv_nxt,
tcp_rsk(req)->rcv_nxt +
tcp_synack_window(req))) {
/* Out of window: send ACK and drop. */
if (!(flg & TCP_FLAG_RST) &&
!tcp_oow_rate_limited(sock_net(sk), skb,
LINUX_MIB_TCPACKSKIPPEDSYNRECV,
&tcp_rsk(req)->last_oow_ack_time))
req->rsk_ops->send_ack(sk, skb, req);
if (paws_reject) {
SKB_DR_SET(*drop_reason, TCP_RFC7323_PAWS);
NET_INC_STATS(sock_net(sk), LINUX_MIB_PAWSESTABREJECTED);
} else if (tsecr_reject) {
SKB_DR_SET(*drop_reason, TCP_RFC7323_TSECR);
NET_INC_STATS(sock_net(sk), LINUX_MIB_TSECRREJECTED);
} else {
SKB_DR_SET(*drop_reason, TCP_OVERWINDOW);
}
return NULL;
}
/* In sequence, PAWS is OK. */
if (TCP_SKB_CB(skb)->seq == tcp_rsk(req)->rcv_isn) {
/* Truncate SYN, it is out of window starting
at tcp_rsk(req)->rcv_isn + 1. */
flg &= ~TCP_FLAG_SYN;
}
/* RFC793: "second check the RST bit" and
* "fourth, check the SYN bit"
*/
if (flg & (TCP_FLAG_RST|TCP_FLAG_SYN)) {
TCP_INC_STATS(sock_net(sk), TCP_MIB_ATTEMPTFAILS);
goto embryonic_reset;
}
/* ACK sequence verified above, just make sure ACK is
* set. If ACK not set, just silently drop the packet.
*
* XXX (TFO) - if we ever allow "data after SYN", the
* following check needs to be removed.
*/
if (!(flg & TCP_FLAG_ACK))
return NULL;
if (tcp_rsk(req)->accecn_ok && tmp_opt.accecn &&
tcp_rsk(req)->saw_accecn_opt < TCP_ACCECN_OPT_COUNTER_SEEN) {
u8 saw_opt = tcp_accecn_option_init(skb, tmp_opt.accecn);
tcp_rsk(req)->saw_accecn_opt = saw_opt;
if (tcp_rsk(req)->saw_accecn_opt == TCP_ACCECN_OPT_FAIL_SEEN) {
u8 fail_mode = TCP_ACCECN_OPT_FAIL_RECV;
tcp_rsk(req)->accecn_fail_mode |= fail_mode;
}
}
/* For Fast Open no more processing is needed (sk is the
* child socket).
*/
if (fastopen)
return sk;
/* While TCP_DEFER_ACCEPT is active, drop bare ACK. */
if (req->num_timeout < READ_ONCE(inet_csk(sk)->icsk_accept_queue.rskq_defer_accept) &&
TCP_SKB_CB(skb)->end_seq == tcp_rsk(req)->rcv_isn + 1) {
inet_rsk(req)->acked = 1;
__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPDEFERACCEPTDROP);
return NULL;
}
/* OK, ACK is valid, create big socket and
* feed this segment to it. It will repeat all
* the tests. THIS SEGMENT MUST MOVE SOCKET TO
* ESTABLISHED STATE. If it will be dropped after
* socket is created, wait for troubles.
*/
child = inet_csk(sk)->icsk_af_ops->syn_recv_sock(sk, skb, req, NULL,
req, &own_req);
if (!child)
goto listen_overflow;
if (own_req && tmp_opt.saw_tstamp &&
!after(TCP_SKB_CB(skb)->seq, tcp_rsk(req)->rcv_nxt))
tcp_sk(child)->rx_opt.ts_recent = tmp_opt.rcv_tsval;
if (own_req && rsk_drop_req(req)) {
reqsk_queue_removed(&inet_csk(req->rsk_listener)->icsk_accept_queue, req);
inet_csk_reqsk_queue_drop_and_put(req->rsk_listener, req);
return child;
}
sock_rps_save_rxhash(child, skb);
tcp_synack_rtt_meas(child, req);
*req_stolen = !own_req;
return inet_csk_complete_hashdance(sk, child, req, own_req);
listen_overflow:
SKB_DR_SET(*drop_reason, TCP_LISTEN_OVERFLOW);
if (sk != req->rsk_listener)
__NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMIGRATEREQFAILURE);
if (!READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_abort_on_overflow)) {
inet_rsk(req)->acked = 1;
return NULL;
}
embryonic_reset:
if (!(flg & TCP_FLAG_RST)) {
/* Received a bad SYN pkt - for TFO We try not to reset
* the local connection unless it's really necessary to
* avoid becoming vulnerable to outside attack aiming at
* resetting legit local connections.
*/
req->rsk_ops->send_reset(sk, skb, SK_RST_REASON_INVALID_SYN);
} else if (fastopen) { /* received a valid RST pkt */
reqsk_fastopen_remove(sk, req, true);
tcp_reset(sk, skb);
}
if (!fastopen) {
bool unlinked = inet_csk_reqsk_queue_drop(sk, req);
if (unlinked)
__NET_INC_STATS(sock_net(sk), LINUX_MIB_EMBRYONICRSTS);
*req_stolen = !unlinked;
}
return NULL;
}
EXPORT_IPV6_MOD(tcp_check_req);
/*
* Queue segment on the new socket if the new socket is active,
* otherwise we just shortcircuit this and continue with
* the new socket.
*
* For the vast majority of cases child->sk_state will be TCP_SYN_RECV
* when entering. But other states are possible due to a race condition
* where after __inet_lookup_established() fails but before the listener
* locked is obtained, other packets cause the same connection to
* be created.
*/
enum skb_drop_reason tcp_child_process(struct sock *parent, struct sock *child,
struct sk_buff *skb)
__releases(&((child)->sk_lock.slock))
{
enum skb_drop_reason reason = SKB_NOT_DROPPED_YET;
int state = child->sk_state;
/* record sk_napi_id and sk_rx_queue_mapping of child. */
sk_mark_napi_id_set(child, skb);
tcp_segs_in(tcp_sk(child), skb);
if (!sock_owned_by_user(child)) {
reason = tcp_rcv_state_process(child, skb);
/* Wakeup parent, send SIGIO */
if (state == TCP_SYN_RECV && child->sk_state != state)
parent->sk_data_ready(parent);
} else {
/* Alas, it is possible again, because we do lookup
* in main socket hash table and lock on listening
* socket does not protect us more.
*/
__sk_add_backlog(child, skb);
}
bh_unlock_sock(child);
sock_put(child);
return reason;
}
EXPORT_IPV6_MOD(tcp_child_process);
] |
| 49/2 | 2026/02/23 14:35 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 86218 output: 49 thoughts: 77 Thoughts: Here's my thought process, as if I were reviewing this code: **Missed a Crucial Detail in TCP Time Wait Handling** Okay, hold on a second. I was going through the `tcp_time_wait` function, and I seem to have glossed over something important. I see the call to `inet_twsk_hashdance_schedule` right here: ```c inet_twsk_hashdance_schedule(tw, sk, net->ipv4.tcp_death_row.hashinfo, timeo); ``` I completely skipped over that! That's a key part of how the Time-Wait socket is managed, and it likely pertains to its eventual removal. It's crucial, so let's dig into `inet_twsk_hashdance_schedule` right away. I need to understand what this function is actually doing with that Time-Wait socket (`tw`), the underlying socket (`sk`), the hash information (`net->ipv4.tcp_death_row.hashinfo`), and the timeout (`timeo`). Understanding how this function schedules the tw socket for eventual cleanup is paramount to understanding how the entire TCP TIME_WAIT state is handled. |
| 50/2 | 2026/02/23 14:35 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:net/ipv4/inet_hashtables.c IncludeLines:true Name:inet_twsk_hashdance_schedule] Results: map[SourceCode: 96: /*
97: * Enter the time wait state.
98: * Essentially we whip up a timewait bucket, copy the relevant info into it
99: * from the SK, and mess with hash chains and list linkage.
100: *
101: * The caller must not access @tw anymore after this function returns.
102: */
103: void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
104: struct sock *sk,
105: struct inet_hashinfo *hashinfo,
106: int timeo)
107: {
108: const struct inet_sock *inet = inet_sk(sk);
109: const struct inet_connection_sock *icsk = inet_csk(sk);
110: spinlock_t *lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
111: struct inet_bind_hashbucket *bhead, *bhead2;
112:
113: /* Put TW into bind hash. Original socket stays there too.
114: * Note, that any socket with inet->num != 0 MUST be bound in
115: * binding cache, even if it is closed.
116: */
117: bhead = &hashinfo->bhash[inet_bhashfn(twsk_net(tw), inet->inet_num,
118: hashinfo->bhash_size)];
119: bhead2 = inet_bhashfn_portaddr(hashinfo, sk, twsk_net(tw), inet->inet_num);
120:
121: local_bh_disable();
122: spin_lock(&bhead->lock);
123: spin_lock(&bhead2->lock);
124:
125: tw->tw_tb = icsk->icsk_bind_hash;
126: WARN_ON(!icsk->icsk_bind_hash);
127:
128: tw->tw_tb2 = icsk->icsk_bind2_hash;
129: WARN_ON(!icsk->icsk_bind2_hash);
130: sk_add_bind_node((struct sock *)tw, &tw->tw_tb2->owners);
131:
132: spin_unlock(&bhead2->lock);
133: spin_unlock(&bhead->lock);
134:
135: spin_lock(lock);
136:
137: /* tw_refcnt is set to 3 because we have :
138: * - one reference for bhash chain.
139: * - one reference for ehash chain.
140: * - one reference for timer.
141: * Also note that after this point, we lost our implicit reference
142: * so we are not allowed to use tw anymore.
143: */
144: refcount_set(&tw->tw_refcnt, 3);
145:
146: /* Ensure tw_refcnt has been set before tw is published.
147: * smp_wmb() provides the necessary memory barrier to enforce this
148: * ordering.
149: */
150: smp_wmb();
151:
152: hlist_nulls_replace_init_rcu(&sk->sk_nulls_node, &tw->tw_node);
153: sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
154:
155: inet_twsk_schedule(tw, timeo);
156:
157: spin_unlock(lock);
158: local_bh_enable();
159: }
SourceFile:net/ipv4/inet_timewait_sock.c] |
| 51/2 | 2026/02/23 14:35 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 87691 output: 48 thoughts: 2777 Thoughts: **Analysis of the TCP Socket Race Condition**
Okay, so I've been digging into this reported race condition and here's my breakdown of what's happening. The core issue revolves around a socket being removed (specifically, transitioning to a time-wait state and then closing) and a concurrent lookup trying to find it.
My initial thought was that `inet_twsk_hashdance_schedule` is key. That function, in `net/ipv4/inet_timewait_sock.c`, handles moving the socket from the established hash table to the time-wait hash. It does so by replacing the established socket with the time-wait socket in `ehash`. The race occurs where `inet_sk(sk)->inet_num = 0` gets written in `__inet_put_port` while `sk->sk_portpair` is read in `inet_match`.
The stack trace points to the time-wait handling path where a time-wait socket wasn't allocated. This means the socket is being fully closed (not transitioning to time-wait). In this `tcp_done` flow, `inet_put_port` eventually gets called, which sets `inet_num = 0`. So, the socket is being unhashed and "zeroed" while a lookup might still be in progress.
Essentially, `__inet_lookup_established` is reading a value while it is being changed.
The potential for a problem is that `__inet_lookup_established` might see the socket still existing (because of the RCU grace period), but with `inet_num = 0`. It then fails to match in this case.
Is this race harmful? It seems that if the lookup fails, we are not necessarily in trouble, as a failed lookup means that a packet would be handled as if the socket doesn't exist, which in most cases would be fine. If `inet_match` returns `true`, the packet gets handled by `tcp_v4_rcv` for a socket in `TCP_CLOSE` state. If `inet_match` returns `false`, `tcp_v4_rcv` handles the packet as if no established socket exists. The result is likely a RST or the packet is dropped, which is fine in either case. Even if a listener socket exists on the same port, established connections are filtered by the 4-tuple.
The bigger question is: should we care? Is it benign? Conceptually, there's a data race on a field that's part of a 4-byte union, where one side is unhash the socket (writes `0`) while the other side is trying to match it. This race could be exploited by a malicious actor if it was "harmful." However, since there is no crash or incorrect behavior, it's safe. It simply might lead to a dropped packet, which is acceptable since the socket is being destroyed anyway.
Now, how about the compiler and memory model? Is the code safe against compiler optimizations, or could the compiler mess this up? I know from past experience, Eric Dumazet has said similar races are okay but should be fixed.
I see that `inet_match` in some kernels *has* been updated to use `READ_ONCE`. But I can't check other kernels for the `inet_match` code, so I will look at the provided source code. It *doesn't* have `READ_ONCE`. Now, the code checks `sk->sk_portpair != ports` and there is no `READ_ONCE`. `sk_addrpair` is also a union, so `sk_addrpair` also has no `READ_ONCE`. This is concerning.
The fact that `ping_unhash` uses `WRITE_ONCE`, but this code doesn't is an inconsistency.
Is there any chance this is harmful? Well, compiler optimizations could make `sk_portpair != ports` return true or false when it should return the opposite.
Thinking again, what happens in a socket *being opened*? `inet_bind_hash` sets `inet_num`. But the socket isn't in `ehash` until later. The `sk_nulls_add_node_rcu` has a memory barrier, so any readers in `ehash` will see the proper `inet_num`.
What about `sk_state`? `inet_match` doesn't check it; the caller (`tcp_v4_rcv`) does. It's fine if the socket is matched while the state is `TCP_CLOSE`.
If the lookup fails, it's generally fine.
Here is the key point though:
```c
if (likely(inet_match(net, sk, acookie, ports, dif, sdif))) {
|
| 52/2 | 2026/02/23 14:36 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:net/ipv6/inet6_hashtables.c IncludeLines:true Name:__inet6_lookup_established] Results: map[SourceCode: 43: /*
44: * Sockets in TCP_CLOSE state are _always_ taken out of the hash, so
45: * we need not check it for TCP lookups anymore, thanks Alexey. -DaveM
46: *
47: * The sockhash lock must be held as a reader here.
48: */
49: struct sock *__inet6_lookup_established(const struct net *net,
50: const struct in6_addr *saddr,
51: const __be16 sport,
52: const struct in6_addr *daddr,
53: const u16 hnum,
54: const int dif, const int sdif)
55: {
56: const __portpair ports = INET_COMBINED_PORTS(sport, hnum);
57: const struct hlist_nulls_node *node;
58: struct inet_ehash_bucket *head;
59: struct inet_hashinfo *hashinfo;
60: unsigned int hash, slot;
61: struct sock *sk;
62:
63: hashinfo = net->ipv4.tcp_death_row.hashinfo;
64: hash = inet6_ehashfn(net, daddr, hnum, saddr, sport);
65: slot = hash & hashinfo->ehash_mask;
66: head = &hashinfo->ehash[slot];
67: begin:
68: sk_nulls_for_each_rcu(sk, node, &head->chain) {
69: if (sk->sk_hash != hash)
70: continue;
71: if (!inet6_match(net, sk, saddr, daddr, ports, dif, sdif))
72: continue;
73: if (unlikely(!refcount_inc_not_zero(&sk->sk_refcnt)))
74: goto out;
75:
76: if (unlikely(!inet6_match(net, sk, saddr, daddr, ports, dif, sdif))) {
77: sock_gen_put(sk);
78: goto begin;
79: }
80: goto found;
81: }
82: if (get_nulls_value(node) != slot)
83: goto begin;
84: out:
85: sk = NULL;
86: found:
87: return sk;
88: }
SourceFile:net/ipv6/inet6_hashtables.c] |
| 53/2 | 2026/02/23 14:36 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 92248 output: 44 thoughts: 35 Thoughts: Okay, here's my take on those thoughts, framed as an expert's internal monologue: **UDP and the Established Connection Lookup** Right, so I'm drilling down into this connection lookup process. I see that `__inet6_lookup_established` *also* implements that double-check mechanism, just like we see elsewhere. That's good to confirm. It's a standard pattern, I expect to see it consistently, and that's precisely what's happening. Now, my mind shifts slightly: What's the story with UDP? That's the part I need to verify. Ah, yes, UDP uses `__udp4_lib_lookup`. Excellent. That's the expected function call. Need to see how that function works, but at least, I've got a handle on the correct function being invoked in this case. |
| 54/2 | 2026/02/23 14:36 | tool | codesearch-definition-source |
0mArgs:map[ContextFile:net/ipv4/udp.c IncludeLines:true Name:__udp4_lib_lookup] Results: map[SourceCode: 674: /* UDP is nearly always wildcards out the wazoo, it makes no sense to try
675: * harder than this. -DaveM
676: */
677: struct sock *__udp4_lib_lookup(const struct net *net, __be32 saddr,
678: __be16 sport, __be32 daddr, __be16 dport, int dif,
679: int sdif, struct udp_table *udptable, struct sk_buff *skb)
680: {
681: unsigned short hnum = ntohs(dport);
682: struct udp_hslot *hslot2;
683: struct sock *result, *sk;
684: unsigned int hash2;
685:
686: hash2 = ipv4_portaddr_hash(net, daddr, hnum);
687: hslot2 = udp_hashslot2(udptable, hash2);
688:
689: if (udp_has_hash4(hslot2)) {
690: result = udp4_lib_lookup4(net, saddr, sport, daddr, hnum,
691: dif, sdif, udptable);
692: if (result) /* udp4_lib_lookup4 return sk or NULL */
693: return result;
694: }
695:
696: /* Lookup connected or non-wildcard socket */
697: result = udp4_lib_lookup2(net, saddr, sport,
698: daddr, hnum, dif, sdif,
699: hslot2, skb);
700: if (!IS_ERR_OR_NULL(result) && result->sk_state == TCP_ESTABLISHED)
701: goto done;
702:
703: /* Lookup redirect from BPF */
704: if (static_branch_unlikely(&bpf_sk_lookup_enabled) &&
705: udptable == net->ipv4.udp_table) {
706: sk = inet_lookup_run_sk_lookup(net, IPPROTO_UDP, skb, sizeof(struct udphdr),
707: saddr, sport, daddr, hnum, dif,
708: udp_ehashfn);
709: if (sk) {
710: result = sk;
711: goto done;
712: }
713: }
714:
715: /* Got non-wildcard socket or error on first lookup */
716: if (result)
717: goto done;
718:
719: /* Lookup wildcard sockets */
720: hash2 = ipv4_portaddr_hash(net, htonl(INADDR_ANY), hnum);
721: hslot2 = udp_hashslot2(udptable, hash2);
722:
723: result = udp4_lib_lookup2(net, saddr, sport,
724: htonl(INADDR_ANY), hnum, dif, sdif,
725: hslot2, skb);
726: if (!IS_ERR_OR_NULL(result))
727: goto done;
728:
729: /* Primary hash (destination port) lookup as fallback for this race:
730: * 1. __ip4_datagram_connect() sets sk_rcv_saddr
731: * 2. lookup (this function): new sk_rcv_saddr, hashes not updated yet
732: * 3. rehash operation updating _secondary and four-tuple_ hashes
733: * The primary hash doesn't need an update after 1., so, thanks to this
734: * further step, 1. and 3. don't need to be atomic against the lookup.
735: */
736: result = udp4_lib_lookup1(net, saddr, sport, daddr, hnum, dif, sdif,
737: udptable);
738:
739: done:
740: if (IS_ERR(result))
741: return NULL;
742: return result;
743: }
SourceFile:net/ipv4/udp.c] |
| 55/2 | 2026/02/23 14:36 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 93728 output: 19 thoughts: 488 Thoughts: **Analyzing the UDP Lookup and the Reported Race Condition** Okay, so I'm looking at this UDP lookup code, specifically `udp4_lib_lookup2`. It's using RCU for its lookup, which is standard. I see the `sk_for_each_entry_rcu` loop iterating through a hash slot (`hslot2->head`). Inside, `udp4_lib_match` is called. The interesting part is the reported race. The issue is supposedly with `sk_portpair` within `inet_match`, and that's used by the TCP layer, not UDP. That's a bit odd, but maybe there's some shared data structure or indirect interaction I'm missing. However, I'm pretty confident this race is benign because of the double check in `__inet_lookup_established`. Now, I'm looking at those value changes, specifically `0x400190d0 -> 0x000090d0`. Let's break this down: `0x4001` is 16385, and `0x90d0` is 37072. Now, I have to remember that `skc_dport` is `0x90d0` and `skc_num` is `0x4001`. The crucial part is the write setting `skc_num = 0`, and the read sees that value change. Thus, the value of `sk_portpair` becomes `0x000090d0` from `0x400190d0`. Because these are now different, `inet_match` correctly returns false. Alright, after this careful review and analysis, everything seems to be consistent, so I'm going to set my results. |
| 56/2 | 2026/02/23 14:36 | tool | set-results |
0mArgs:map[Benign:true Confident:true] Results: map[Benign:true Confident:true] |
| 57/2 | 2026/02/23 14:36 | llm | expert |
0mModel:gemini-3-flash-preview Tokens: input: 94648 output: 581 thoughts: 322 Thoughts: **Benign Data Race Analysis** Okay, so I've dug into this data race and I'm comfortable concluding it's benign. The key is the interaction between `__inet_put_port`, which is writing to `inet_num` setting it to `0`, and `inet_match`, which is reading `sk_portpair`. `__inet_put_port` is getting called during socket cleanup – either when a socket is being destroyed or transitioning to the `TCP_CLOSE` state. This makes sense. `inet_match` on the other hand is crucial during incoming packet processing, specifically within `__inet_lookup_established`. That function is RCU-protected, as expected, and it's employing a double-check pattern. It calls `inet_match`, then it attempts to increment the socket's reference count. Then, critically, it calls `inet_match` *again*. This double check is the linchpin. Let's break down the implications. If the first `inet_match` happens to read a partially updated `inet_num` (i.e., it sees the old value or 0), the algorithm is designed to handle it. If it sees the old value, it's still a valid match and the subsequent reference increment and second `inet_match` will catch the change. If it sees a zero, the lookup simply continues, because the socket is, after all, being removed. It's safe. It's designed to handle a situation where an entry might be in the middle of a deletion operation. Crucially, this race doesn't appear to be introducing any memory corruption or causing protocol violations, which is the bottom line. While using `READ_ONCE` and `WRITE_ONCE` would certainly be the preferred way to quiet KCSAN and it's generally good practice in kernel code, the lack of them here isn't resulting in a functional issue. The RCU-protected double check, and the transient nature of the `inet_num` change during socket teardown, provides adequate protection. |
| Total Calls | Total Tokens | Avg Tokens | Total Duration (Seconds) | Avg Duration (Seconds) |
|---|
| Total Calls | Total Duration (Seconds) | Avg Duration (Seconds) |
|---|