Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

[PATCH] delay accounting taskstats interface send tgid once

Send per-tgid data only once during exit of a thread group instead of once
with each member thread exit.

Currently, when a thread exits, besides its per-tid data, the per-tgid data
of its thread group is also sent out, if its thread group is non-empty.
The per-tgid data sent consists of the sum of per-tid stats for all
*remaining* threads of the thread group.

This patch modifies this sending in two ways:

- the per-tgid data is sent only when the last thread of a thread group
exits. This cuts down heavily on the overhead of sending/receiving
per-tgid data, especially when other exploiters of the taskstats
interface aren't interested in per-tgid stats

- the semantics of the per-tgid data sent are changed. Instead of being
the sum of per-tid data for remaining threads, the value now sent is the
true total accumulated statistics for all threads that are/were part of
the thread group.
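
The two changes above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel code: `demo_stats`, `demo_group` and `demo_thread_exit` are hypothetical names, a single counter stands in for `struct taskstats`, and locking is left out. It also ignores the detail that the patch allocates the per-tgid structure only once a group gains a second thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct taskstats: one counter is enough here. */
struct demo_stats {
	uint64_t cpu_delay_total;
};

/* Hypothetical stand-in for the per-thread-group state. */
struct demo_group {
	int live;			/* threads that have not exited yet */
	struct demo_stats accumulated;	/* running total of exited threads */
};

/*
 * Account for one exiting thread. Returns true when the caller should
 * also emit the per-tgid record, i.e. when this was the last thread.
 * On true, *tgid_out holds the total over every thread that was ever
 * part of the group, not just the ones still remaining.
 */
static bool demo_thread_exit(struct demo_group *grp,
			     const struct demo_stats *tid,
			     struct demo_stats *tgid_out)
{
	assert(grp->live > 0);

	/* Fold the exiting thread's stats into the group total. */
	grp->accumulated.cpu_delay_total += tid->cpu_delay_total;
	grp->live--;

	if (grp->live > 0)
		return false;	/* only the per-tid record goes out */

	*tgid_out = grp->accumulated;
	return true;
}
```

With three threads, the first two calls return false and only the third hands back the group total, which is the "send once" behaviour described above.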

The patch also addresses a minor issue where the failure of one accounting
subsystem to fill in the taskstats structure prevented the taskstats from
being sent at all.

The patch has been tested for stability and has run under cerberus for over
4 hours on an SMP machine.

[akpm@osdl.org: bugfixes]
Signed-off-by: Shailabh Nagar <nagar@watson.ibm.com>
Signed-off-by: Balbir Singh <balbir@in.ibm.com>
Cc: Jay Lan <jlan@engr.sgi.com>
Signed-off-by: Andrew Morton <akpm@osdl.org>
Signed-off-by: Linus Torvalds <torvalds@osdl.org>

Authored by Shailabh Nagar and committed by Linus Torvalds
ad4ecbcb 25890454

+161 -80
+4 -9
Documentation/accounting/delay-accounting.txt
···
 experienced by the task waiting for the corresponding resource
 in that interval.
 
-When a task exits, records containing the per-task and per-process statistics
-are sent to userspace without requiring a command. More details are given in
-the taskstats interface description.
+When a task exits, records containing the per-task statistics
+are sent to userspace without requiring a command. If it is the last exiting
+task of a thread group, the per-tgid statistics are also sent. More details
+are given in the taskstats interface description.
 
 The getdelays.c userspace utility in this directory allows simple commands to
 be run and the corresponding delay statistics to be displayed. It also serves
···
 0 0
 MEM count delay total
 0 0
-
-
-
-
-
-
+12 -19
Documentation/accounting/taskstats.txt
···
 statistics for all tasks of the process (if tgid is specified).
 
 To obtain statistics for tasks which are exiting, userspace opens a multicast
-netlink socket. Each time a task exits, two records are sent by the kernel to
-each listener on the multicast socket. The first the per-pid task's statistics
-and the second is the sum for all tasks of the process to which the task
-belongs (the task does not need to be the thread group leader). The need for
-per-tgid stats to be sent for each exiting task is explained in the per-tgid
-stats section below.
+netlink socket. Each time a task exits, its per-pid statistics is always sent
+by the kernel to each listener on the multicast socket. In addition, if it is
+the last thread exiting its thread group, an additional record containing the
+per-tgid stats are also sent. The latter contains the sum of per-pid stats for
+all threads in the thread group, both past and present.
 
 getdelays.c is a simple utility demonstrating usage of the taskstats interface
 for reporting delay accounting statistics.
···
 of atomicity).
 
 However, maintaining per-process, in addition to per-task stats, within the
-kernel has space and time overheads. Hence the taskstats implementation
-dynamically sums up the per-task stats for each task belonging to a process
-whenever per-process stats are needed.
+kernel has space and time overheads. To address this, the taskstats code
+accumalates each exiting task's statistics into a process-wide data structure.
+When the last task of a process exits, the process level data accumalated also
+gets sent to userspace (along with the per-task data).
 
-Not maintaining per-tgid stats creates a problem when userspace is interested
-in getting these stats when the process dies i.e. the last thread of
-a process exits. It isn't possible to simply return some aggregated per-process
-statistic from the kernel.
-
-The approach taken by taskstats is to return the per-tgid stats *each* time
-a task exits, in addition to the per-pid stats for that task. Userspace can
-maintain task<->process mappings and use them to maintain the per-process stats
-in userspace, updating the aggregate appropriately as the tasks of a process
-exit.
+When a user queries to get per-tgid data, the sum of all other live threads in
+the group is added up and added to the accumalated total for previously exited
+threads of the same thread group.
 
 Extending taskstats
 -------------------
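
The query path described in the last added paragraph of taskstats.txt (accumulated total of exited threads plus the stats of threads still alive) might be modelled as follows. This is a hedged sketch, not the kernel's fill_tgid(): `demo_stats`, `demo_thread` and `demo_query_tgid` are hypothetical, a single field stands in for `struct taskstats`, and the tasklist and stats locking are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct taskstats. */
struct demo_stats {
	uint64_t blkio_delay_total;
};

/* Hypothetical view of one thread in the group. */
struct demo_thread {
	int exited;		/* nonzero once folded into the dead total */
	struct demo_stats stats;
};

/*
 * Build the per-tgid answer for a query: start from the total already
 * accumulated for exited threads (may be NULL if none was ever kept),
 * then add every thread that is still alive. Exited threads are skipped
 * so that they are not counted twice.
 */
static struct demo_stats demo_query_tgid(const struct demo_stats *dead_total,
					 const struct demo_thread *threads,
					 size_t nthreads)
{
	struct demo_stats sum = { 0 };
	size_t i;

	assert(threads != NULL || nthreads == 0);

	if (dead_total)
		sum = *dead_total;

	for (i = 0; i < nthreads; i++) {
		if (threads[i].exited)
			continue;	/* already part of dead_total */
		sum.blkio_delay_total += threads[i].stats.blkio_delay_total;
	}
	return sum;
}
```

Skipping already-counted threads plays the same role as the zombie thread group leader check that the patch adds to fill_tgid().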
+12
MAINTAINERS
···
 L: netdev@vger.kernel.org
 S: Maintained
 
+PER-TASK DELAY ACCOUNTING
+P: Shailabh Nagar
+M: nagar@watson.ibm.com
+L: linux-kernel@vger.kernel.org
+S: Maintained
+
 PERSONALITY HANDLING
 P: Christoph Hellwig
 M: hch@infradead.org
···
 TI OMAP RANDOM NUMBER GENERATOR SUPPORT
 P: Deepak Saxena
 M: dsaxena@plexity.net
+S: Maintained
+
+TASKSTATS STATISTICS INTERFACE
+P: Shailabh Nagar
+M: nagar@watson.ibm.com
+L: linux-kernel@vger.kernel.org
 S: Maintained
 
 TI PARALLEL LINK CABLE DRIVER
+4
include/linux/sched.h
···
 #ifdef CONFIG_BSD_PROCESS_ACCT
 	struct pacct_struct pacct;	/* per-process accounting information */
 #endif
+#ifdef CONFIG_TASKSTATS
+	spinlock_t stats_lock;
+	struct taskstats *stats;
+#endif
 };
 
 /* Context switch must be unlocked if interrupts are to be enabled */
+55 -16
include/linux/taskstats_kern.h
···
 extern kmem_cache_t *taskstats_cache;
 extern struct mutex taskstats_exit_mutex;
 
-static inline void taskstats_exit_alloc(struct taskstats **ptidstats,
-					struct taskstats **ptgidstats)
+static inline void taskstats_exit_alloc(struct taskstats **ptidstats)
 {
 	*ptidstats = kmem_cache_zalloc(taskstats_cache, SLAB_KERNEL);
-	*ptgidstats = kmem_cache_zalloc(taskstats_cache, SLAB_KERNEL);
 }
 
-static inline void taskstats_exit_free(struct taskstats *tidstats,
-					struct taskstats *tgidstats)
+static inline void taskstats_exit_free(struct taskstats *tidstats)
 {
 	if (tidstats)
 		kmem_cache_free(taskstats_cache, tidstats);
-	if (tgidstats)
-		kmem_cache_free(taskstats_cache, tgidstats);
 }
 
-extern void taskstats_exit_send(struct task_struct *, struct taskstats *,
-				struct taskstats *);
-extern void taskstats_init_early(void);
+static inline void taskstats_tgid_init(struct signal_struct *sig)
+{
+	spin_lock_init(&sig->stats_lock);
+	sig->stats = NULL;
+}
 
+static inline void taskstats_tgid_alloc(struct signal_struct *sig)
+{
+	struct taskstats *stats;
+	unsigned long flags;
+
+	stats = kmem_cache_zalloc(taskstats_cache, SLAB_KERNEL);
+	if (!stats)
+		return;
+
+	spin_lock_irqsave(&sig->stats_lock, flags);
+	if (!sig->stats) {
+		sig->stats = stats;
+		stats = NULL;
+	}
+	spin_unlock_irqrestore(&sig->stats_lock, flags);
+
+	if (stats)
+		kmem_cache_free(taskstats_cache, stats);
+}
+
+static inline void taskstats_tgid_free(struct signal_struct *sig)
+{
+	struct taskstats *stats = NULL;
+	unsigned long flags;
+
+	spin_lock_irqsave(&sig->stats_lock, flags);
+	if (sig->stats) {
+		stats = sig->stats;
+		sig->stats = NULL;
+	}
+	spin_unlock_irqrestore(&sig->stats_lock, flags);
+	if (stats)
+		kmem_cache_free(taskstats_cache, stats);
+}
+
+extern void taskstats_exit_send(struct task_struct *, struct taskstats *, int);
+extern void taskstats_init_early(void);
+extern void taskstats_tgid_alloc(struct signal_struct *);
 #else
-static inline void taskstats_exit_alloc(struct taskstats **ptidstats,
-					struct taskstats **ptgidstats)
+static inline void taskstats_exit_alloc(struct taskstats **ptidstats)
 {}
-static inline void taskstats_exit_free(struct taskstats *ptidstats,
-					struct taskstats *ptgidstats)
+static inline void taskstats_exit_free(struct taskstats *ptidstats)
 {}
 static inline void taskstats_exit_send(struct task_struct *tsk,
-					struct taskstats *tidstats,
-					struct taskstats *tgidstats)
+					struct taskstats *tidstats,
+					int group_dead)
+{}
+static inline void taskstats_tgid_init(struct signal_struct *sig)
+{}
+static inline void taskstats_tgid_alloc(struct signal_struct *sig)
+{}
+static inline void taskstats_tgid_free(struct signal_struct *sig)
 {}
 static inline void taskstats_init_early(void)
 {}
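
The allocation pattern in taskstats_tgid_alloc() above (allocate outside the lock, install under the lock only if the slot is still empty, free the spare copy otherwise) can be tried out in userspace. The sketch below is an approximation under stated assumptions: all `demo_*` names are hypothetical, calloc()/free() replace the slab cache, and a C11 `atomic_flag` spinlock replaces `spinlock_t` with interrupt saving.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct taskstats. */
struct demo_stats {
	unsigned long long cpu_delay_total;
};

/* Hypothetical stand-in for the two fields added to signal_struct. */
struct demo_signal {
	atomic_flag lock;		/* plays the role of stats_lock */
	struct demo_stats *stats;	/* NULL until a second thread appears */
};

static void demo_lock(atomic_flag *l)
{
	while (atomic_flag_test_and_set_explicit(l, memory_order_acquire))
		;	/* spin until the holder clears the flag */
}

static void demo_unlock(atomic_flag *l)
{
	atomic_flag_clear_explicit(l, memory_order_release);
}

static void demo_tgid_init(struct demo_signal *sig)
{
	assert(sig != NULL);
	atomic_flag_clear(&sig->lock);
	sig->stats = NULL;
}

static void demo_tgid_alloc(struct demo_signal *sig)
{
	/* Allocate before taking the lock: allocation may be slow. */
	struct demo_stats *stats = calloc(1, sizeof(*stats));

	if (!stats)
		return;

	demo_lock(&sig->lock);
	if (!sig->stats) {
		sig->stats = stats;	/* we won: install our copy */
		stats = NULL;
	}
	demo_unlock(&sig->lock);

	free(stats);	/* no-op if installed, frees the spare otherwise */
}

static void demo_tgid_free(struct demo_signal *sig)
{
	struct demo_stats *stats;

	demo_lock(&sig->lock);
	stats = sig->stats;
	sig->stats = NULL;
	demo_unlock(&sig->lock);

	free(stats);
}
```

Calling demo_tgid_alloc() a second time leaves the first structure in place, which mirrors how repeated CLONE_THREAD forks share one per-tgid accumulator.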
+4 -4
kernel/exit.c
···
 fastcall NORET_TYPE void do_exit(long code)
 {
 	struct task_struct *tsk = current;
-	struct taskstats *tidstats, *tgidstats;
+	struct taskstats *tidstats;
 	int group_dead;
 
 	profile_task_exit(tsk);
···
 				current->comm, current->pid,
 				preempt_count());
 
-	taskstats_exit_alloc(&tidstats, &tgidstats);
+	taskstats_exit_alloc(&tidstats);
 
 	acct_update_integrals(tsk);
 	if (tsk->mm) {
···
 #endif
 	if (unlikely(tsk->audit_context))
 		audit_free(tsk);
-	taskstats_exit_send(tsk, tidstats, tgidstats);
-	taskstats_exit_free(tidstats, tgidstats);
+	taskstats_exit_send(tsk, tidstats, group_dead);
+	taskstats_exit_free(tidstats);
 	delayacct_tsk_exit(tsk);
 
 	exit_mm(tsk);
+4
kernel/fork.c
···
 #include <linux/acct.h>
 #include <linux/cn_proc.h>
 #include <linux/delayacct.h>
+#include <linux/taskstats_kern.h>
 
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
···
 	if (clone_flags & CLONE_THREAD) {
 		atomic_inc(&current->signal->count);
 		atomic_inc(&current->signal->live);
+		taskstats_tgid_alloc(current->signal);
 		return 0;
 	}
 	sig = kmem_cache_alloc(signal_cachep, GFP_KERNEL);
···
 	INIT_LIST_HEAD(&sig->cpu_timers[0]);
 	INIT_LIST_HEAD(&sig->cpu_timers[1]);
 	INIT_LIST_HEAD(&sig->cpu_timers[2]);
+	taskstats_tgid_init(sig);
 
 	task_lock(current->group_leader);
 	memcpy(sig->rlim, current->signal->rlim, sizeof sig->rlim);
···
 void __cleanup_signal(struct signal_struct *sig)
 {
 	exit_thread_group_keys(sig);
+	taskstats_tgid_free(sig);
 	kmem_cache_free(signal_cachep, sig);
 }
 
+66 -32
kernel/taskstats.c
···
 static int fill_tgid(pid_t tgid, struct task_struct *tgidtsk,
 		struct taskstats *stats)
 {
-	int rc;
 	struct task_struct *tsk, *first;
+	unsigned long flags;
 
+	/*
+	 * Add additional stats from live tasks except zombie thread group
+	 * leaders who are already counted with the dead tasks
+	 */
 	first = tgidtsk;
-	read_lock(&tasklist_lock);
 	if (!first) {
+		read_lock(&tasklist_lock);
 		first = find_task_by_pid(tgid);
 		if (!first) {
 			read_unlock(&tasklist_lock);
 			return -ESRCH;
 		}
-	}
+		get_task_struct(first);
+		read_unlock(&tasklist_lock);
+	} else
+		get_task_struct(first);
+
+	/* Start with stats from dead tasks */
+	spin_lock_irqsave(&first->signal->stats_lock, flags);
+	if (first->signal->stats)
+		memcpy(stats, first->signal->stats, sizeof(*stats));
+	spin_unlock_irqrestore(&first->signal->stats_lock, flags);
+
 	tsk = first;
+	read_lock(&tasklist_lock);
 	do {
+		if (tsk->exit_state == EXIT_ZOMBIE && thread_group_leader(tsk))
+			continue;
 		/*
-		 * Each accounting subsystem adds calls its functions to
+		 * Accounting subsystem can call its functions here to
 		 * fill in relevant parts of struct taskstsats as follows
 		 *
-		 *	rc = per-task-foo(stats, tsk);
-		 *	if (rc)
-		 *		break;
+		 *	per-task-foo(stats, tsk);
 		 */
-
-		rc = delayacct_add_tsk(stats, tsk);
-		if (rc)
-			break;
+		delayacct_add_tsk(stats, tsk);
 
 	} while_each_thread(first, tsk);
 	read_unlock(&tasklist_lock);
 	stats->version = TASKSTATS_VERSION;
 
-
 	/*
-	 * Accounting subsytems can also add calls here if they don't
-	 * wish to aggregate statistics for per-tgid stats
+	 * Accounting subsytems can also add calls here to modify
+	 * fields of taskstats.
 	 */
 
-	return rc;
+	return 0;
 }
+
+
+static void fill_tgid_exit(struct task_struct *tsk)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&tsk->signal->stats_lock, flags);
+	if (!tsk->signal->stats)
+		goto ret;
+
+	/*
+	 * Each accounting subsystem calls its functions here to
+	 * accumalate its per-task stats for tsk, into the per-tgid structure
+	 *
+	 *	per-task-foo(tsk->signal->stats, tsk);
+	 */
+	delayacct_add_tsk(tsk->signal->stats, tsk);
+ret:
+	spin_unlock_irqrestore(&tsk->signal->stats_lock, flags);
+	return;
+}
+
 
 static int taskstats_send_stats(struct sk_buff *skb, struct genl_info *info)
 {
···
 
 /* Send pid data out on exit */
 void taskstats_exit_send(struct task_struct *tsk, struct taskstats *tidstats,
-			struct taskstats *tgidstats)
+			int group_dead)
 {
 	int rc;
 	struct sk_buff *rep_skb;
···
 	size_t size;
 	int is_thread_group;
 	struct nlattr *na;
+	unsigned long flags;
 
 	if (!family_registered || !tidstats)
 		return;
 
-	is_thread_group = !thread_group_empty(tsk);
-	rc = 0;
+	spin_lock_irqsave(&tsk->signal->stats_lock, flags);
+	is_thread_group = tsk->signal->stats ? 1 : 0;
+	spin_unlock_irqrestore(&tsk->signal->stats_lock, flags);
 
+	rc = 0;
 	/*
 	 * Size includes space for nested attributes
 	 */
···
 			*tidstats);
 	nla_nest_end(rep_skb, na);
 
-	if (!is_thread_group || !tgidstats) {
-		send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
-		goto ret;
-	}
+	if (!is_thread_group)
+		goto send;
 
-	rc = fill_tgid(tsk->pid, tsk, tgidstats);
 	/*
-	 * If fill_tgid() failed then one probable reason could be that the
-	 * thread group leader has exited. fill_tgid() will fail, send out
-	 * the pid statistics collected earlier.
+	 * tsk has/had a thread group so fill the tsk->signal->stats structure
+	 * Doesn't matter if tsk is the leader or the last group member leaving
 	 */
-	if (rc < 0) {
-		send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
-		goto ret;
-	}
+
+	fill_tgid_exit(tsk);
+	if (!group_dead)
+		goto send;
 
 	na = nla_nest_start(rep_skb, TASKSTATS_TYPE_AGGR_TGID);
 	NLA_PUT_U32(rep_skb, TASKSTATS_TYPE_TGID, (u32)tsk->tgid);
+	/* No locking needed for tsk->signal->stats since group is dead */
 	NLA_PUT_TYPE(rep_skb, struct taskstats, TASKSTATS_TYPE_STATS,
-			*tgidstats);
+			*tsk->signal->stats);
 	nla_nest_end(rep_skb, na);
 
+send:
 	send_reply(rep_skb, 0, TASKSTATS_MSG_MULTICAST);
-	goto ret;
+	return;
 
 nla_put_failure:
 	genlmsg_cancel(rep_skb, reply);