.. SPDX-License-Identifier: GPL-2.0

=====================================
Network Devices, the Kernel, and You!
=====================================


Introduction
============
The following is a random collection of documentation regarding
network devices. It is intended for driver developers.

struct net_device lifetime rules
================================
Network device structures need to persist even after the module is unloaded
and must be allocated with alloc_netdev_mqs() and friends.
If a device has registered successfully, it will be freed on last use
by free_netdev(). This is required to handle the pathological case cleanly
(example: ``rmmod mydriver </sys/class/net/myeth/mtu``)

alloc_netdev_mqs() / alloc_netdev() reserve extra space for driver
private data which gets freed when the network device is freed. If
separately allocated data is attached to the network device
(netdev_priv()) then it is up to the module exit handler to free that.

There are two groups of APIs for registering struct net_device.
The first group can be used in normal contexts where ``rtnl_lock`` is not
already held: register_netdev(), unregister_netdev().
The second group can be used when ``rtnl_lock`` is already held:
register_netdevice(), unregister_netdevice(), free_netdevice().

Simple drivers
--------------

Most drivers (especially device drivers) handle the lifetime of
struct net_device in a context where ``rtnl_lock`` is not held
(e.g. driver probe and remove paths).

In that case the struct net_device registration is done using
the register_netdev() and unregister_netdev() functions:

.. code-block:: c

  int probe()
  {
    struct my_device_priv *priv;
    int err;

    dev = alloc_netdev_mqs(...);
    if (!dev)
      return -ENOMEM;
    priv = netdev_priv(dev);

    /* ... do all device setup before calling register_netdev() ...
     */

    err = register_netdev(dev);
    if (err)
      goto err_undo;

    /* net_device is visible to the user! */
    return 0;

  err_undo:
    /* ... undo the device setup ... */
    free_netdev(dev);
    return err;
  }

  void remove()
  {
    unregister_netdev(dev);
    free_netdev(dev);
  }

Note that after calling register_netdev() the device is visible in the system.
Users can open it and start sending / receiving traffic immediately,
or run any other callback, so all initialization must be done prior to
registration.

unregister_netdev() closes the device and waits for all users to be done
with it. The memory of struct net_device itself may still be referenced
by sysfs but all operations on that device will fail.

free_netdev() can be called after unregister_netdev() returns or when
register_netdev() failed.

Device management under RTNL
----------------------------

Registering struct net_device while in a context which already holds
the ``rtnl_lock`` requires extra care. In those scenarios most drivers
will want to make use of struct net_device's ``needs_free_netdev``
and ``priv_destructor`` members for freeing of state.

Example flow of netdev handling under ``rtnl_lock``:

.. code-block:: c

  static void my_setup(struct net_device *dev)
  {
    dev->needs_free_netdev = true;
  }

  static void my_destructor(struct net_device *dev)
  {
    struct my_device_priv *priv = netdev_priv(dev);

    some_obj_destroy(priv->obj);
    some_uninit(priv);
  }

  int create_link()
  {
    struct my_device_priv *priv;
    int err;

    ASSERT_RTNL();

    dev = alloc_netdev(sizeof(*priv), "net%d", NET_NAME_UNKNOWN, my_setup);
    if (!dev)
      return -ENOMEM;
    priv = netdev_priv(dev);

    /* Implicit constructor */
    err = some_init(priv);
    if (err)
      goto err_free_dev;

    priv->obj = some_obj_create();
    if (!priv->obj) {
      err = -ENOMEM;
      goto err_some_uninit;
    }
    /* End of constructor, set the destructor: */
    dev->priv_destructor = my_destructor;

    err = register_netdevice(dev);
    if (err)
      /* register_netdevice() calls destructor on failure */
      goto err_free_dev;

    /* If anything fails now unregister_netdevice() (or unregister_netdev())
     * will take care of calling my_destructor and free_netdev().
     */

    return 0;

  err_some_uninit:
    some_uninit(priv);
  err_free_dev:
    free_netdev(dev);
    return err;
  }

If struct net_device.priv_destructor is set it will be called by the core
some time after unregister_netdevice(); it will also be called if
register_netdevice() fails. The callback may be invoked with or without
``rtnl_lock`` held.

There is no explicit constructor callback; the driver "constructs" the private
netdev state after allocating it and before registration.

Setting struct net_device.needs_free_netdev makes core call free_netdevice()
automatically after unregister_netdevice() when all references to the device
are gone. It only takes effect after a successful call to register_netdevice()
so if register_netdevice() fails the driver is responsible for calling
free_netdev().
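
The ownership rules above can be illustrated with a small user-space model.
This is only a sketch: ``toy_netdev`` and the ``toy_*`` functions are
hypothetical stand-ins invented for this example, not kernel APIs. They model
nothing beyond who calls ``priv_destructor`` and who is expected to free the
device in the failure and success cases described in this section.

.. code-block:: c

  #include <stdbool.h>

  /* Toy stand-in for struct net_device (hypothetical, user space only). */
  struct toy_netdev {
    bool needs_free_netdev;
    void (*priv_destructor)(struct toy_netdev *dev);
    bool registered;
    bool freed;
    int destructor_calls;
  };

  /* Example destructor: only counts how often the "core" invoked it. */
  void toy_destructor(struct toy_netdev *dev)
  {
    dev->destructor_calls++;
  }

  /* Models free_netdev(): marks the device as released. */
  void toy_free_netdev(struct toy_netdev *dev)
  {
    dev->freed = true;
  }

  /* Models register_netdevice(); 'fail' forces the error path. */
  int toy_register_netdevice(struct toy_netdev *dev, bool fail)
  {
    if (fail) {
      /* The destructor runs on failure, but the device is not freed. */
      if (dev->priv_destructor)
        dev->priv_destructor(dev);
      return -1;
    }
    dev->registered = true;
    return 0;
  }

  /* Models unregister_netdevice() once all references are gone. */
  void toy_unregister_netdevice(struct toy_netdev *dev)
  {
    dev->registered = false;
    if (dev->priv_destructor)
      dev->priv_destructor(dev);
    /* needs_free_netdev only matters after a successful registration. */
    if (dev->needs_free_netdev)
      toy_free_netdev(dev);
  }

In this model a failed registration leaves ``freed`` false, so the caller has
to call ``toy_free_netdev()`` itself, while a registered device is released as
part of ``toy_unregister_netdevice()``.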

free_netdev() is safe to call on error paths right after unregister_netdevice()
or when register_netdevice() fails. Parts of netdev (de)registration process
happen after ``rtnl_lock`` is released, therefore in those cases free_netdev()
will defer some of the processing until ``rtnl_lock`` is released.

Devices spawned from struct rtnl_link_ops should never free the
struct net_device directly.

.ndo_init and .ndo_uninit
~~~~~~~~~~~~~~~~~~~~~~~~~

``.ndo_init`` and ``.ndo_uninit`` callbacks are called during net_device
registration and de-registration, under ``rtnl_lock``. Drivers can use
those e.g. when parts of their init process need to run under ``rtnl_lock``.

``.ndo_init`` runs before the device is visible in the system, ``.ndo_uninit``
runs during de-registration after the device is closed but other subsystems
may still have outstanding references to the netdevice.

MTU
===
Each network device has a Maximum Transfer Unit. The MTU does not
include any link layer protocol overhead. Upper layer protocols must
not pass a socket buffer (skb) to a device to transmit with more data
than the MTU. For example, on Ethernet with the standard MTU of
1500 bytes, the actual skb will contain up to 1514 bytes because of
the Ethernet header. Devices should allow for the 4 byte VLAN header
as well.

Segmentation Offload (GSO, TSO) is an exception to this rule. The
upper layer protocol may pass a large socket buffer to the device
transmit routine, and the device will break that up into separate
packets based on the current MTU.

MTU is symmetrical and applies both to receive and transmit. A device
must be able to receive at least the maximum size packet allowed by
the MTU. A network device may use the MTU as a mechanism to size receive
buffers, but the device should allow packets with a VLAN header. With
the standard Ethernet MTU of 1500 bytes, the device should allow up to
1518 byte packets (1500 + 14 header + 4 tag). The device may drop,
truncate, or pass up oversize packets, but dropping oversize
packets is preferred.


struct net_device synchronization rules
=======================================
ndo_open:
        Synchronization: rtnl_lock() semaphore. In addition, netdev instance
        lock if the driver implements queue management or shaper API.
        Context: process

ndo_stop:
        Synchronization: rtnl_lock() semaphore. In addition, netdev instance
        lock if the driver implements queue management or shaper API.
        Context: process
        Note: netif_running() is guaranteed false

ndo_do_ioctl:
        Synchronization: rtnl_lock() semaphore.

        This is only called by network subsystems internally,
        not by user space calling ioctl as it was before
        linux-5.14.

ndo_siocbond:
        Synchronization: rtnl_lock() semaphore. In addition, netdev instance
        lock if the driver implements queue management or shaper API.
        Context: process

        Used by the bonding driver for the SIOCBOND family of
        ioctl commands.

ndo_siocwandev:
        Synchronization: rtnl_lock() semaphore. In addition, netdev instance
        lock if the driver implements queue management or shaper API.
        Context: process

        Used by the drivers/net/wan framework to handle
        the SIOCWANDEV ioctl with the if_settings structure.

ndo_siocdevprivate:
        Synchronization: rtnl_lock() semaphore. In addition, netdev instance
        lock if the driver implements queue management or shaper API.
        Context: process

        This is used to implement SIOCDEVPRIVATE ioctl helpers.
        These should not be added to new drivers, so don't use.

ndo_eth_ioctl:
        Synchronization: rtnl_lock() semaphore. In addition, netdev instance
        lock if the driver implements queue management or shaper API.
        Context: process

ndo_get_stats:
        Synchronization: RCU (can be called concurrently with the stats
        update path).
        Context: atomic (can't sleep under RCU)

ndo_start_xmit:
        Synchronization: __netif_tx_lock spinlock.

        When the driver sets dev->lltx this will be
        called without holding netif_tx_lock. In this case the driver
        has to lock by itself when needed.
        The locking there should also properly protect against
        set_rx_mode. WARNING: use of dev->lltx is deprecated.
        Don't use it for new drivers.

        Context: Process with BHs disabled or BH (timer),
        will be called with interrupts disabled by netconsole.

        Return codes:

        * NETDEV_TX_OK everything ok.
        * NETDEV_TX_BUSY Cannot transmit packet, try later
          Usually a bug, means queue start/stop flow control is broken in
          the driver. Note: the driver must NOT put the skb in its DMA ring.

ndo_tx_timeout:
        Synchronization: netif_tx_lock spinlock; all TX queues frozen.
        Context: BHs disabled
        Notes: netif_queue_stopped() is guaranteed true

ndo_set_rx_mode:
        Synchronization: netif_addr_lock spinlock.
        Context: BHs disabled
        Notes: Deprecated in favor of ndo_set_rx_mode_async which runs
        in process context.

ndo_set_rx_mode_async:
        Synchronization: rtnl_lock() semaphore. In addition, netdev instance
        lock if the driver implements queue management or shaper API.
        Context: process (from a work queue)
        Notes: Async version of ndo_set_rx_mode which runs in process
        context. Receives snapshots of the unicast and multicast address lists.

ndo_change_rx_flags:
        Synchronization: rtnl_lock() semaphore. In addition, netdev instance
        lock if the driver implements queue management or shaper API.

ndo_setup_tc:
        ``TC_SETUP_BLOCK`` and ``TC_SETUP_FT`` are running under NFT locks
        (i.e. no ``rtnl_lock`` and no device instance lock). The rest of
        ``tc_setup_type`` types run under netdev instance lock if the driver
        implements queue management or shaper API.

Most ndo callbacks not specified in the list above are running
under ``rtnl_lock``. In addition, netdev instance lock is taken as well if
the driver implements queue management or shaper API.

struct napi_struct synchronization rules
========================================
napi->poll:
        Synchronization:
                NAPI_STATE_SCHED bit in napi->state. Device
                driver's ndo_stop method will invoke napi_disable() on
                all NAPI instances which will do a sleeping poll on the
                NAPI_STATE_SCHED napi->state bit, waiting for all pending
                NAPI activity to cease.

        Context:
                softirq
                will be called with interrupts disabled by netconsole.

netdev instance lock
====================

Historically, all networking control operations were protected by a single
global lock known as ``rtnl_lock``. There is an ongoing effort to replace this
global lock with separate locks for each network namespace. Additionally,
properties of individual netdev are increasingly protected by per-netdev locks.

For device drivers that implement shaping or queue management APIs, all control
operations will be performed under the netdev instance lock.
Drivers can also explicitly request instance lock to be held during ops
by setting ``request_ops_lock`` to true. Code comments and docs refer
to drivers which have ops called under the instance lock as "ops locked".
See also the documentation of the ``lock`` member of struct net_device.

There is also a case of taking two per-netdev locks in sequence when netdev
queues are leased, that is, the netdev-scope lock is taken for both the
virtual and the physical device. To prevent deadlocks, the virtual device's
lock must always be acquired before the physical device's (see
``netdev_nl_queue_create_doit``).
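
The ordering rule for leased queues can be sketched in user space with two
mutexes. This is an illustration only: ``toy_lease_dev`` and the
``toy_lease_*`` helpers are hypothetical names invented for this example, and
pthread mutexes stand in for the netdev instance lock. The point is that every
path takes the virtual device's lock first and the physical device's lock
second, so two paths can never wait on each other in opposite order.

.. code-block:: c

  #include <pthread.h>

  /* Toy stand-in for a netdev with an instance lock (hypothetical). */
  struct toy_lease_dev {
    pthread_mutex_t lock;
    int leased_queues;
  };

  /* Always lock the virtual device first, then the physical one. */
  void toy_lease_lock(struct toy_lease_dev *virt, struct toy_lease_dev *phys)
  {
    pthread_mutex_lock(&virt->lock);
    pthread_mutex_lock(&phys->lock);
  }

  /* Release in the reverse order of acquisition. */
  void toy_lease_unlock(struct toy_lease_dev *virt, struct toy_lease_dev *phys)
  {
    pthread_mutex_unlock(&phys->lock);
    pthread_mutex_unlock(&virt->lock);
  }

  /* Update both devices while both locks are held. */
  int toy_lease_queue(struct toy_lease_dev *virt, struct toy_lease_dev *phys)
  {
    toy_lease_lock(virt, phys);
    virt->leased_queues++;
    phys->leased_queues++;
    toy_lease_unlock(virt, phys);
    return virt->leased_queues;
  }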

In the future, there will be an option for individual
drivers to opt out of using ``rtnl_lock`` and instead perform their control
operations directly under the netdev instance lock.

Device drivers are encouraged to rely on the instance lock where possible.

For the (mostly software) drivers that need to interact with the core stack,
there are two sets of interfaces: ``dev_xxx``/``netdev_xxx`` and ``netif_xxx``
(e.g., ``dev_set_mtu`` and ``netif_set_mtu``). The ``dev_xxx``/``netdev_xxx``
functions handle acquiring the instance lock themselves, while the
``netif_xxx`` functions assume that the driver has already acquired
the instance lock.

struct net_device_ops
---------------------

``ndos`` are called without holding the instance lock for most drivers.

"Ops locked" drivers will have most of the ``ndos`` invoked under
the instance lock.

struct ethtool_ops
------------------

Similarly to ``ndos`` the instance lock is only held for select drivers.
For "ops locked" drivers all ethtool ops without exceptions should
be called under the instance lock.

struct netdev_stat_ops
----------------------

"qstat" ops are invoked under the instance lock for "ops locked" drivers,
and under rtnl_lock for all other drivers.

struct net_shaper_ops
---------------------

All net shaper callbacks are invoked while holding the netdev instance
lock. ``rtnl_lock`` may or may not be held.

Note that supporting net shapers automatically enables "ops locking".

struct netdev_queue_mgmt_ops
----------------------------

All queue management callbacks are invoked while holding the netdev instance
lock. ``rtnl_lock`` may or may not be held.

Note that supporting struct netdev_queue_mgmt_ops automatically enables
"ops locking".
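
Taken together, this section names three ways a device becomes "ops locked":
setting ``request_ops_lock``, supporting net shapers, or supporting
struct netdev_queue_mgmt_ops. The user-space sketch below restates that rule
as a predicate. ``toy_ops_dev``, its members and ``toy_is_ops_locked()`` are
hypothetical names chosen for this example. It is not the kernel's
implementation, only a summary of the conditions listed in the text.

.. code-block:: c

  #include <stdbool.h>
  #include <stddef.h>

  /* Empty placeholders for the two ops tables (hypothetical). */
  struct toy_shaper_ops { int unused; };
  struct toy_queue_mgmt_ops { int unused; };

  /* Toy stand-in for the relevant parts of a netdev (hypothetical). */
  struct toy_ops_dev {
    bool request_ops_lock;
    const struct toy_shaper_ops *net_shaper_ops;
    const struct toy_queue_mgmt_ops *queue_mgmt_ops;
  };

  /* True if any of the three documented conditions holds. */
  bool toy_is_ops_locked(const struct toy_ops_dev *dev)
  {
    return dev->request_ops_lock ||
           dev->net_shaper_ops != NULL ||
           dev->queue_mgmt_ops != NULL;
  }

A device with none of the three set is treated as a regular driver in this
model, whose ops run under ``rtnl_lock`` only.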

Notifiers and netdev instance lock
----------------------------------

For device drivers that implement shaping or queue management APIs,
some of the notifiers (``enum netdev_cmd``) are running under the netdev
instance lock.

The following netdev notifiers are always run under the instance lock:

* ``NETDEV_XDP_FEAT_CHANGE``

For devices with locked ops, currently only the following notifiers are
running under the lock:

* ``NETDEV_CHANGE``
* ``NETDEV_REGISTER``
* ``NETDEV_UP``

The following notifiers are running without the lock:

* ``NETDEV_UNREGISTER``

There are no clear expectations for the remaining notifiers. Notifiers not on
the list may run with or without the instance lock, potentially even invoking
the same notifier type with and without the lock from different code paths.
The goal is to eventually ensure that all (or most, with a few documented
exceptions) notifiers run under the instance lock. Please extend this
documentation whenever you make an explicit assumption about the lock being
held from a notifier.

NETDEV_INTERNAL symbol namespace
================================

Symbols exported as NETDEV_INTERNAL can only be used in networking
core and drivers which exclusively flow via the main networking list and trees.
Note that the inverse is not true: most symbols outside of NETDEV_INTERNAL
are not expected to be used by random code outside netdev either.
Symbols may lack the designation because they predate the namespaces,
or simply due to an oversight.