Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

Merge tag 'timers-urgent-2024-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull timer migration updates from Thomas Gleixner:
"Fixes and minor updates for the timer migration code:

- Stop testing the group->parent pointer as it is not guaranteed to
be stable over a chain of operations by design.

This includes a warning which would be nice to have but it produces
false positives due to the racy nature of the check.

- Plug a race between CPUs going in and out of idle and a CPU hotplug
operation. The latter can create and connect a new hierarchy level
which is missed in the concurrent updates of CPUs which go into
idle. As a result the events of such a CPU might not be processed
and timers go stale.

Cure it by splitting the hotplug operation into a prepare and
online callback. The prepare callback is guaranteed to run on an
online and therefore active CPU. This CPU updates the hierarchy and
being online ensures that there is always at least one migrator
active which handles the modified hierarchy correctly when going
idle. The online callback which runs on the incoming CPU then just
marks the CPU active and brings it into operation.

- Improve tracing and polish the code further so it is more obvious
what's going on"
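
The prepare/online split described in the message above can be pictured with a small user-space model. This is only a sketch under stated assumptions: `model_cpu`, `model_group`, `model_prepare()` and `model_online()` are hypothetical names invented for illustration, not kernel APIs, and the locking, the cpuhp state machine and the multi-level hierarchy of the real code are all left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NR_CPUS 4

/* Hypothetical per-CPU state, loosely modelled on struct tmigr_cpu. */
struct model_cpu {
	bool initialized;	/* set up by the prepare step */
	bool online;		/* set by the online step */
	uint8_t groupmask;	/* this CPU's bit in its (single) group */
};

/* Hypothetical single-level "hierarchy": one group with a child counter. */
struct model_group {
	unsigned int num_children;
	uint8_t active;		/* one bit per active child */
};

static struct model_cpu cpus[MODEL_NR_CPUS];
static struct model_group group;

/*
 * Prepare step: runs on @running_cpu on behalf of @target_cpu and wires the
 * target into the group. Except for the boot case (a CPU preparing itself),
 * the running CPU must already be online. Returns 0 on success, -1 on error.
 */
static int model_prepare(unsigned int running_cpu, unsigned int target_cpu)
{
	struct model_cpu *tmc;

	if (running_cpu >= MODEL_NR_CPUS || target_cpu >= MODEL_NR_CPUS)
		return -1;
	if (running_cpu != target_cpu && !cpus[running_cpu].online)
		return -1;

	tmc = &cpus[target_cpu];
	if (tmc->initialized)	/* not the first online attempt */
		return 0;

	tmc->groupmask = (uint8_t)(1u << group.num_children++);
	tmc->initialized = true;
	return 0;
}

/*
 * Online step: runs on the incoming CPU. It only marks the CPU active;
 * all structural setup must already have happened in the prepare step.
 */
static int model_online(unsigned int cpu)
{
	struct model_cpu *tmc;

	if (cpu >= MODEL_NR_CPUS)
		return -1;

	tmc = &cpus[cpu];
	if (!tmc->initialized)
		return -1;

	group.active |= tmc->groupmask;
	tmc->online = true;
	return 0;
}
```

In this model, calling `model_online(1)` before `model_prepare(0, 1)` fails, which mirrors the ordering the commit message relies on: the hierarchy is changed by a CPU that is already online, and the incoming CPU only flips its own state afterwards.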

* tag 'timers-urgent-2024-07-26' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers/migration: Fix grammar in comment
timers/migration: Spare write when nothing changed
timers/migration: Rename childmask by groupmask to make naming more obvious
timers/migration: Read childmask and parent pointer in a single place
timers/migration: Use a single struct for hierarchy walk data
timers/migration: Improve tracing
timers/migration: Move hierarchy setup into cpuhotplug prepare callback
timers/migration: Do not rely always on group->parent

+224 -213
+1
include/linux/cpuhotplug.h
@@ -122,6 +122,7 @@
 	CPUHP_KVM_PPC_BOOK3S_PREPARE,
 	CPUHP_ZCOMP_PREPARE,
 	CPUHP_TIMERS_PREPARE,
+	CPUHP_TMIGR_PREPARE,
 	CPUHP_MIPS_SOC_PREPARE,
 	CPUHP_BP_PREPARE_DYN,
 	CPUHP_BP_PREPARE_DYN_END = CPUHP_BP_PREPARE_DYN + 20,
+8 -8
include/trace/events/timer_migration.h
@@ -43,7 +43,7 @@
 		__field( unsigned int, lvl )
 		__field( unsigned int, numa_node )
 		__field( unsigned int, num_children )
-		__field( u32, childmask )
+		__field( u32, groupmask )
 	),
 
 	TP_fast_assign(
@@ -52,11 +52,11 @@
 		__entry->lvl = child->parent->level;
 		__entry->numa_node = child->parent->numa_node;
 		__entry->num_children = child->parent->num_children;
-		__entry->childmask = child->childmask;
+		__entry->groupmask = child->groupmask;
 	),
 
-	TP_printk("group=%p childmask=%0x parent=%p lvl=%d numa=%d num_children=%d",
-		  __entry->child, __entry->childmask, __entry->parent,
+	TP_printk("group=%p groupmask=%0x parent=%p lvl=%d numa=%d num_children=%d",
+		  __entry->child, __entry->groupmask, __entry->parent,
 		  __entry->lvl, __entry->numa_node, __entry->num_children)
 );
 
@@ -72,7 +72,7 @@
 		__field( unsigned int, lvl )
 		__field( unsigned int, numa_node )
 		__field( unsigned int, num_children )
-		__field( u32, childmask )
+		__field( u32, groupmask )
 	),
 
 	TP_fast_assign(
@@ -81,11 +81,11 @@
 		__entry->lvl = tmc->tmgroup->level;
 		__entry->numa_node = tmc->tmgroup->numa_node;
 		__entry->num_children = tmc->tmgroup->num_children;
-		__entry->childmask = tmc->childmask;
+		__entry->groupmask = tmc->groupmask;
 	),
 
-	TP_printk("cpu=%d childmask=%0x parent=%p lvl=%d numa=%d num_children=%d",
-		  __entry->cpu, __entry->childmask, __entry->parent,
+	TP_printk("cpu=%d groupmask=%0x parent=%p lvl=%d numa=%d num_children=%d",
+		  __entry->cpu, __entry->groupmask, __entry->parent,
 		  __entry->lvl, __entry->numa_node, __entry->num_children)
 );
 
+197 -196
kernel/time/timer_migration.c
@@ -475,9 +475,54 @@
 	return bitmap_weight(&active, BIT_CNT) <= 1;
 }
 
-typedef bool (*up_f)(struct tmigr_group *, struct tmigr_group *, void *);
+/**
+ * struct tmigr_walk - data required for walking the hierarchy
+ * @nextexp:		Next CPU event expiry information which is handed into
+ *			the timer migration code by the timer code
+ *			(get_next_timer_interrupt())
+ * @firstexp:		Contains the first event expiry information when
+ *			hierarchy is completely idle. When CPU itself was the
+ *			last going idle, information makes sure, that CPU will
+ *			be back in time. When using this value in the remote
+ *			expiry case, firstexp is stored in the per CPU tmigr_cpu
+ *			struct of CPU which expires remote timers. It is updated
+ *			in top level group only. Be aware, there could occur a
+ *			new top level of the hierarchy between the 'top level
+ *			call' in tmigr_update_events() and the check for the
+ *			parent group in walk_groups(). Then @firstexp might
+ *			contain a value != KTIME_MAX even if it was not the
+ *			final top level. This is not a problem, as the worst
+ *			outcome is a CPU which might wake up a little early.
+ * @evt:		Pointer to tmigr_event which needs to be queued (of idle
+ *			child group)
+ * @childmask:		groupmask of child group
+ * @remote:		Is set, when the new timer path is executed in
+ *			tmigr_handle_remote_cpu()
+ * @basej:		timer base in jiffies
+ * @now:		timer base monotonic
+ * @check:		is set if there is the need to handle remote timers;
+ *			required in tmigr_requires_handle_remote() only
+ * @tmc_active:		this flag indicates, whether the CPU which triggers
+ *			the hierarchy walk is !idle in the timer migration
+ *			hierarchy. When the CPU is idle and the whole hierarchy is
+ *			idle, only the first event of the top level has to be
+ *			considered.
+ */
+struct tmigr_walk {
+	u64 nextexp;
+	u64 firstexp;
+	struct tmigr_event *evt;
+	u8 childmask;
+	bool remote;
+	unsigned long basej;
+	u64 now;
+	bool check;
+	bool tmc_active;
+};
 
-static void __walk_groups(up_f up, void *data,
+typedef bool (*up_f)(struct tmigr_group *, struct tmigr_group *, struct tmigr_walk *);
+
+static void __walk_groups(up_f up, struct tmigr_walk *data,
 			  struct tmigr_cpu *tmc)
 {
 	struct tmigr_group *child = NULL, *group = tmc->tmgroup;
@@ -490,63 +535,16 @@
 
 		child = group;
 		group = group->parent;
+		data->childmask = child->groupmask;
 	} while (group);
 }
 
-static void walk_groups(up_f up, void *data, struct tmigr_cpu *tmc)
+static void walk_groups(up_f up, struct tmigr_walk *data, struct tmigr_cpu *tmc)
 {
 	lockdep_assert_held(&tmc->lock);
 
 	__walk_groups(up, data, tmc);
 }
-
-/**
- * struct tmigr_walk - data required for walking the hierarchy
- * @nextexp:		Next CPU event expiry information which is handed into
- *			the timer migration code by the timer code
- *			(get_next_timer_interrupt())
- * @firstexp:		Contains the first event expiry information when last
- *			active CPU of hierarchy is on the way to idle to make
- *			sure CPU will be back in time.
- * @evt:		Pointer to tmigr_event which needs to be queued (of idle
- *			child group)
- * @childmask:		childmask of child group
- * @remote:		Is set, when the new timer path is executed in
- *			tmigr_handle_remote_cpu()
- */
-struct tmigr_walk {
-	u64 nextexp;
-	u64 firstexp;
-	struct tmigr_event *evt;
-	u8 childmask;
-	bool remote;
-};
-
-/**
- * struct tmigr_remote_data - data required for remote expiry hierarchy walk
- * @basej:		timer base in jiffies
- * @now:		timer base monotonic
- * @firstexp:		returns expiry of the first timer in the idle timer
- *			migration hierarchy to make sure the timer is handled in
- *			time; it is stored in the per CPU tmigr_cpu struct of
- *			CPU which expires remote timers
- * @childmask:		childmask of child group
- * @check:		is set if there is the need to handle remote timers;
- *			required in tmigr_requires_handle_remote() only
- * @tmc_active:		this flag indicates, whether the CPU which triggers
- *			the hierarchy walk is !idle in the timer migration
- *			hierarchy. When the CPU is idle and the whole hierarchy is
- *			idle, only the first event of the top level has to be
- *			considered.
- */
-struct tmigr_remote_data {
-	unsigned long basej;
-	u64 now;
-	u64 firstexp;
-	u8 childmask;
-	bool check;
-	bool tmc_active;
-};
 
 /*
  * Returns the next event of the timerqueue @group->events
@@ -618,10 +616,9 @@
 
 static bool tmigr_active_up(struct tmigr_group *group,
 			    struct tmigr_group *child,
-			    void *ptr)
+			    struct tmigr_walk *data)
 {
 	union tmigr_state curstate, newstate;
-	struct tmigr_walk *data = ptr;
 	bool walk_done;
 	u8 childmask;
 
@@ -649,8 +646,7 @@
 
 	} while (!atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state));
 
-	if ((walk_done == false) && group->parent)
-		data->childmask = group->childmask;
+	trace_tmigr_group_set_cpu_active(group, newstate, childmask);
 
 	/*
 	 * The group is active (again). The group event might be still queued
@@ -666,8 +662,6 @@
 	 */
 	group->groupevt.ignore = true;
 
-	trace_tmigr_group_set_cpu_active(group, newstate, childmask);
-
 	return walk_done;
 }
 
@@ -675,7 +669,7 @@
 {
 	struct tmigr_walk data;
 
-	data.childmask = tmc->childmask;
+	data.childmask = tmc->groupmask;
 
 	trace_tmigr_cpu_active(tmc);
 
@@ -860,10 +854,8 @@
 
 static bool tmigr_new_timer_up(struct tmigr_group *group,
 			       struct tmigr_group *child,
-			       void *ptr)
+			       struct tmigr_walk *data)
 {
-	struct tmigr_walk *data = ptr;
-
 	return tmigr_update_events(group, child, data);
 }
 
@@ -995,9 +987,8 @@
 
 static bool tmigr_handle_remote_up(struct tmigr_group *group,
 				   struct tmigr_group *child,
-				   void *ptr)
+				   struct tmigr_walk *data)
 {
-	struct tmigr_remote_data *data = ptr;
 	struct tmigr_event *evt;
 	unsigned long jif;
 	u8 childmask;
@@ -1034,12 +1025,10 @@
 	}
 
 	/*
-	 * Update of childmask for the next level and keep track of the expiry
-	 * of the first event that needs to be handled (group->next_expiry was
-	 * updated by tmigr_next_expired_groupevt(), next was set by
-	 * tmigr_handle_remote_cpu()).
+	 * Keep track of the expiry of the first event that needs to be handled
+	 * (group->next_expiry was updated by tmigr_next_expired_groupevt(),
+	 * next was set by tmigr_handle_remote_cpu()).
 	 */
-	data->childmask = group->childmask;
 	data->firstexp = group->next_expiry;
 
 	raw_spin_unlock_irq(&group->lock);
@@ -1055,12 +1044,12 @@
 void tmigr_handle_remote(void)
 {
 	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
-	struct tmigr_remote_data data;
+	struct tmigr_walk data;
 
 	if (tmigr_is_not_available(tmc))
 		return;
 
-	data.childmask = tmc->childmask;
+	data.childmask = tmc->groupmask;
 	data.firstexp = KTIME_MAX;
 
 	/*
@@ -1068,7 +1057,7 @@
 	 * in tmigr_handle_remote_up() anyway. Keep this check to speed up the
 	 * return when nothing has to be done.
 	 */
-	if (!tmigr_check_migrator(tmc->tmgroup, tmc->childmask)) {
+	if (!tmigr_check_migrator(tmc->tmgroup, tmc->groupmask)) {
 		/*
 		 * If this CPU was an idle migrator, make sure to clear its wakeup
 		 * value so it won't chase timers that have already expired elsewhere.
@@ -1097,9 +1086,8 @@
 
 static bool tmigr_requires_handle_remote_up(struct tmigr_group *group,
 					    struct tmigr_group *child,
-					    void *ptr)
+					    struct tmigr_walk *data)
 {
-	struct tmigr_remote_data *data = ptr;
 	u8 childmask;
 
 	childmask = data->childmask;
@@ -1118,7 +1106,7 @@
 	 * group before reading the next_expiry value.
 	 */
 	if (group->parent && !data->tmc_active)
-		goto out;
+		return false;
 
 	/*
 	 * The lock is required on 32bit architectures to read the variable
@@ -1143,9 +1131,6 @@
 		raw_spin_unlock(&group->lock);
 	}
 
-out:
-	/* Update of childmask for the next level */
-	data->childmask = group->childmask;
 	return false;
 }
 
@@ -1157,7 +1142,7 @@
 bool tmigr_requires_handle_remote(void)
 {
 	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
-	struct tmigr_remote_data data;
+	struct tmigr_walk data;
 	unsigned long jif;
 	bool ret = false;
 
@@ -1165,7 +1150,7 @@
 		return ret;
 
 	data.now = get_jiffies_update(&jif);
-	data.childmask = tmc->childmask;
+	data.childmask = tmc->groupmask;
 	data.firstexp = KTIME_MAX;
 	data.tmc_active = !tmc->idle;
 	data.check = false;
@@ -1230,14 +1215,13 @@
 		if (nextexp != tmc->cpuevt.nextevt.expires ||
 		    tmc->cpuevt.ignore) {
 			ret = tmigr_new_timer(tmc, nextexp);
+			/*
+			 * Make sure the reevaluation of timers in idle path
+			 * will not miss an event.
+			 */
+			WRITE_ONCE(tmc->wakeup, ret);
 		}
 	}
-	/*
-	 * Make sure the reevaluation of timers in idle path will not miss an
-	 * event.
-	 */
-	WRITE_ONCE(tmc->wakeup, ret);
-
 	trace_tmigr_cpu_new_timer_idle(tmc, nextexp);
 	raw_spin_unlock(&tmc->lock);
 	return ret;
@@ -1245,10 +1229,9 @@
 
 static bool tmigr_inactive_up(struct tmigr_group *group,
 			      struct tmigr_group *child,
-			      void *ptr)
+			      struct tmigr_walk *data)
 {
 	union tmigr_state curstate, newstate, childstate;
-	struct tmigr_walk *data = ptr;
 	bool walk_done;
 	u8 childmask;
 
@@ -1299,9 +1282,10 @@
 
 		WARN_ON_ONCE((newstate.migrator != TMIGR_NONE) && !(newstate.active));
 
-		if (atomic_try_cmpxchg(&group->migr_state, &curstate.state,
-				       newstate.state))
+		if (atomic_try_cmpxchg(&group->migr_state, &curstate.state, newstate.state)) {
+			trace_tmigr_group_set_cpu_inactive(group, newstate, childmask);
 			break;
+		}
 
 		/*
 		 * The memory barrier is paired with the cmpxchg() in
@@ -1317,22 +1301,6 @@
 	/* Event Handling */
 	tmigr_update_events(group, child, data);
 
-	if (group->parent && (walk_done == false))
-		data->childmask = group->childmask;
-
-	/*
-	 * data->firstexp was set by tmigr_update_events() and contains the
-	 * expiry of the first global event which needs to be handled. It
-	 * differs from KTIME_MAX if:
-	 * - group is the top level group and
-	 * - group is idle (which means CPU was the last active CPU in the
-	 *   hierarchy) and
-	 * - there is a pending event in the hierarchy
-	 */
-	WARN_ON_ONCE(data->firstexp != KTIME_MAX && group->parent);
-
-	trace_tmigr_group_set_cpu_inactive(group, newstate, childmask);
-
 	return walk_done;
 }
 
@@ -1341,7 +1309,7 @@
 	struct tmigr_walk data = { .nextexp = nextexp,
 				   .firstexp = KTIME_MAX,
 				   .evt = &tmc->cpuevt,
-				   .childmask = tmc->childmask };
+				   .childmask = tmc->groupmask };
 
 	/*
 	 * If nextexp is KTIME_MAX, the CPU event will be ignored because the
@@ -1400,7 +1368,7 @@
  *		  the only one in the level 0 group; and if it is the
  *		  only one in level 0 group, but there are more than a
  *		  single group active on the way to top level)
- * * nextevt	- when CPU is offline and has to handle timer on his own
+ * * nextevt	- when CPU is offline and has to handle timer on its own
  *		  or when on the way to top in every group only a single
  *		  child is active but @nextevt is before the lowest
  *		  next_expiry encountered while walking up to top level.
@@ -1419,7 +1387,7 @@
 	if (WARN_ON_ONCE(tmc->idle))
 		return nextevt;
 
-	if (!tmigr_check_migrator_and_lonely(tmc->tmgroup, tmc->childmask))
+	if (!tmigr_check_migrator_and_lonely(tmc->tmgroup, tmc->groupmask))
 		return KTIME_MAX;
 
 	do {
@@ -1440,6 +1408,66 @@
 	} while (group);
 
 	return KTIME_MAX;
+}
+
+/*
+ * tmigr_trigger_active() - trigger a CPU to become active again
+ *
+ * This function is executed on a CPU which is part of cpu_online_mask, when the
+ * last active CPU in the hierarchy is offlining. With this, it is ensured that
+ * the other CPU is active and takes over the migrator duty.
+ */
+static long tmigr_trigger_active(void *unused)
+{
+	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+
+	WARN_ON_ONCE(!tmc->online || tmc->idle);
+
+	return 0;
+}
+
+static int tmigr_cpu_offline(unsigned int cpu)
+{
+	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+	int migrator;
+	u64 firstexp;
+
+	raw_spin_lock_irq(&tmc->lock);
+	tmc->online = false;
+	WRITE_ONCE(tmc->wakeup, KTIME_MAX);
+
+	/*
+	 * CPU has to handle the local events on his own, when on the way to
+	 * offline; Therefore nextevt value is set to KTIME_MAX
+	 */
+	firstexp = __tmigr_cpu_deactivate(tmc, KTIME_MAX);
+	trace_tmigr_cpu_offline(tmc);
+	raw_spin_unlock_irq(&tmc->lock);
+
+	if (firstexp != KTIME_MAX) {
+		migrator = cpumask_any_but(cpu_online_mask, cpu);
+		work_on_cpu(migrator, tmigr_trigger_active, NULL);
+	}
+
+	return 0;
+}
+
+static int tmigr_cpu_online(unsigned int cpu)
+{
+	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+
+	/* Check whether CPU data was successfully initialized */
+	if (WARN_ON_ONCE(!tmc->tmgroup))
+		return -EINVAL;
+
+	raw_spin_lock_irq(&tmc->lock);
+	trace_tmigr_cpu_online(tmc);
+	tmc->idle = timer_base_is_idle();
+	if (!tmc->idle)
+		__tmigr_cpu_activate(tmc);
+	tmc->online = true;
+	raw_spin_unlock_irq(&tmc->lock);
+	return 0;
 }
 
 static void tmigr_init_group(struct tmigr_group *group, unsigned int lvl,
@@ -1514,20 +1542,24 @@
 }
 
 static void tmigr_connect_child_parent(struct tmigr_group *child,
-				       struct tmigr_group *parent)
+				       struct tmigr_group *parent,
+				       bool activate)
 {
-	union tmigr_state childstate;
+	struct tmigr_walk data;
 
 	raw_spin_lock_irq(&child->lock);
 	raw_spin_lock_nested(&parent->lock, SINGLE_DEPTH_NESTING);
 
 	child->parent = parent;
-	child->childmask = BIT(parent->num_children++);
+	child->groupmask = BIT(parent->num_children++);
 
 	raw_spin_unlock(&parent->lock);
 	raw_spin_unlock_irq(&child->lock);
 
 	trace_tmigr_connect_child_parent(child);
+
+	if (!activate)
+		return;
 
 	/*
 	 * To prevent inconsistent states, active children need to be active in
@@ -1544,21 +1576,24 @@
 	 *   child to the new parent. So tmigr_connect_child_parent() is
 	 *   executed with the formerly top level group (child) and the newly
 	 *   created group (parent).
+	 *
+	 * * It is ensured that the child is active, as this setup path is
+	 *   executed in hotplug prepare callback. This is exectued by an
+	 *   already connected and !idle CPU. Even if all other CPUs go idle,
+	 *   the CPU executing the setup will be responsible up to current top
+	 *   level group. And the next time it goes inactive, it will release
+	 *   the new childmask and parent to subsequent walkers through this
+	 *   @child. Therefore propagate active state unconditionally.
 	 */
-	childstate.state = atomic_read(&child->migr_state);
-	if (childstate.migrator != TMIGR_NONE) {
-		struct tmigr_walk data;
+	data.childmask = child->groupmask;
 
-		data.childmask = child->childmask;
-
-		/*
-		 * There is only one new level per time. When connecting the
-		 * child and the parent and set the child active when the parent
-		 * is inactive, the parent needs to be the uppermost
-		 * level. Otherwise there went something wrong!
-		 */
-		WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent);
-	}
+	/*
+	 * There is only one new level per time (which is protected by
+	 * tmigr_mutex). When connecting the child and the parent and set the
+	 * child active when the parent is inactive, the parent needs to be the
+	 * uppermost level. Otherwise there went something wrong!
+	 */
+	WARN_ON(!tmigr_active_up(parent, child, &data) && parent->parent);
 }
 
 static int tmigr_setup_groups(unsigned int cpu, unsigned int node)
@@ -1611,12 +1646,12 @@
 		 * Update tmc -> group / child -> group connection
 		 */
 		if (i == 0) {
-			struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
+			struct tmigr_cpu *tmc = per_cpu_ptr(&tmigr_cpu, cpu);
 
 			raw_spin_lock_irq(&group->lock);
 
 			tmc->tmgroup = group;
-			tmc->childmask = BIT(group->num_children++);
+			tmc->groupmask = BIT(group->num_children++);
 
 			raw_spin_unlock_irq(&group->lock);
 
@@ -1626,7 +1661,8 @@
 			continue;
 		} else {
 			child = stack[i - 1];
-			tmigr_connect_child_parent(child, group);
+			/* Will be activated at online time */
+			tmigr_connect_child_parent(child, group, false);
 		}
 
 		/* check if uppermost level was newly created */
@@ -1637,12 +1673,21 @@
 
 		lvllist = &tmigr_level_list[top];
 		if (group->num_children == 1 && list_is_singular(lvllist)) {
+			/*
+			 * The target CPU must never do the prepare work, except
+			 * on early boot when the boot CPU is the target. Otherwise
+			 * it may spuriously activate the old top level group inside
+			 * the new one (nevertheless whether old top level group is
+			 * active or not) and/or release an uninitialized childmask.
+			 */
+			WARN_ON_ONCE(cpu == raw_smp_processor_id());
+
 			lvllist = &tmigr_level_list[top - 1];
 			list_for_each_entry(child, lvllist, list) {
 				if (child->parent)
 					continue;
 
-				tmigr_connect_child_parent(child, group);
+				tmigr_connect_child_parent(child, group, true);
 			}
 		}
 	}
@@ -1664,80 +1709,31 @@
 	return ret;
 }
 
-static int tmigr_cpu_online(unsigned int cpu)
+static int tmigr_cpu_prepare(unsigned int cpu)
 {
-	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
-	int ret;
+	struct tmigr_cpu *tmc = per_cpu_ptr(&tmigr_cpu, cpu);
+	int ret = 0;
 
-	/* First online attempt? Initialize CPU data */
-	if (!tmc->tmgroup) {
-		raw_spin_lock_init(&tmc->lock);
+	/* Not first online attempt? */
+	if (tmc->tmgroup)
+		return ret;
 
-		ret = tmigr_add_cpu(cpu);
-		if (ret < 0)
-			return ret;
-
-		if (tmc->childmask == 0)
-			return -EINVAL;
-
-		timerqueue_init(&tmc->cpuevt.nextevt);
-		tmc->cpuevt.nextevt.expires = KTIME_MAX;
-		tmc->cpuevt.ignore = true;
-		tmc->cpuevt.cpu = cpu;
-
-		tmc->remote = false;
-		WRITE_ONCE(tmc->wakeup, KTIME_MAX);
-	}
-	raw_spin_lock_irq(&tmc->lock);
-	trace_tmigr_cpu_online(tmc);
-	tmc->idle = timer_base_is_idle();
-	if (!tmc->idle)
-		__tmigr_cpu_activate(tmc);
-	tmc->online = true;
-	raw_spin_unlock_irq(&tmc->lock);
-	return 0;
-}
-
-/*
- * tmigr_trigger_active() - trigger a CPU to become active again
- *
- * This function is executed on a CPU which is part of cpu_online_mask, when the
- * last active CPU in the hierarchy is offlining. With this, it is ensured that
- * the other CPU is active and takes over the migrator duty.
- */
-static long tmigr_trigger_active(void *unused)
-{
-	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
-
-	WARN_ON_ONCE(!tmc->online || tmc->idle);
-
-	return 0;
-}
-
-static int tmigr_cpu_offline(unsigned int cpu)
-{
-	struct tmigr_cpu *tmc = this_cpu_ptr(&tmigr_cpu);
-	int migrator;
-	u64 firstexp;
-
-	raw_spin_lock_irq(&tmc->lock);
-	tmc->online = false;
+	raw_spin_lock_init(&tmc->lock);
+	timerqueue_init(&tmc->cpuevt.nextevt);
+	tmc->cpuevt.nextevt.expires = KTIME_MAX;
+	tmc->cpuevt.ignore = true;
+	tmc->cpuevt.cpu = cpu;
+	tmc->remote = false;
 	WRITE_ONCE(tmc->wakeup, KTIME_MAX);
 
-	/*
-	 * CPU has to handle the local events on his own, when on the way to
-	 * offline; Therefore nextevt value is set to KTIME_MAX
-	 */
-	firstexp = __tmigr_cpu_deactivate(tmc, KTIME_MAX);
-	trace_tmigr_cpu_offline(tmc);
-	raw_spin_unlock_irq(&tmc->lock);
+	ret = tmigr_add_cpu(cpu);
+	if (ret < 0)
+		return ret;
 
-	if (firstexp != KTIME_MAX) {
-		migrator = cpumask_any_but(cpu_online_mask, cpu);
-		work_on_cpu(migrator, tmigr_trigger_active, NULL);
-	}
+	if (tmc->groupmask == 0)
+		return -EINVAL;
 
-	return 0;
+	return ret;
 }
 
 static int __init tmigr_init(void)
@@ -1796,6 +1792,11 @@
 		tmigr_hierarchy_levels, TMIGR_CHILDREN_PER_GROUP,
 		tmigr_crossnode_level);
 
+	ret = cpuhp_setup_state(CPUHP_TMIGR_PREPARE, "tmigr:prepare",
+				tmigr_cpu_prepare, NULL);
+	if (ret)
+		goto err;
+
 	ret = cpuhp_setup_state(CPUHP_AP_TMIGR_ONLINE, "tmigr:online",
 				tmigr_cpu_online, tmigr_cpu_offline);
 	if (ret)
@@ -1807,4 +1808,4 @@
 	pr_err("Timer migration setup failed\n");
 	return ret;
 }
-late_initcall(tmigr_init);
+early_initcall(tmigr_init);
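
The change to `__walk_groups()` in this file moves the childmask update out of the individual callbacks and into the walker itself, and replaces the `void *` argument with a typed walk structure. The following user-space sketch shows that shape in isolation. It rests on assumptions: `model_group`, `model_walk`, `model_walk_groups()` and `model_record_up()` are hypothetical stand-ins written for this illustration, with no locking and none of the kernel types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, stripped-down stand-in for struct tmigr_group. */
struct model_group {
	struct model_group *parent;
	uint8_t groupmask;	/* this group's bit in its parent */
	uint8_t seen;		/* child bits recorded by the callback */
};

/* Hypothetical, stripped-down stand-in for struct tmigr_walk. */
struct model_walk {
	uint8_t childmask;	/* mask of the child the walk arrived from */
	unsigned int levels;	/* number of groups visited so far */
};

/* Typed callback: no void pointer, so no cast inside each callback. */
typedef bool (*model_up_f)(struct model_group *group, struct model_walk *data);

static void model_walk_groups(model_up_f up, struct model_walk *data,
			      struct model_group *group)
{
	do {
		struct model_group *child;

		/* A callback returning true ends the walk early. */
		if (up(group, data))
			break;

		/*
		 * The parent pointer and the mask for the next level are
		 * read once per level, here and nowhere else.
		 */
		child = group;
		group = group->parent;
		data->childmask = child->groupmask;
	} while (group);
}

/* Example callback: record which child bit reached each group. */
static bool model_record_up(struct model_group *group, struct model_walk *data)
{
	group->seen |= data->childmask;
	data->levels++;
	return false;
}
```

Because the walker owns the update of `childmask`, a callback in this model cannot forget it or read the parent pointer a second time, which is the property the first bullet of the commit message is after.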
+18 -9
kernel/time/timer_migration.h
@@ -22,7 +22,17 @@
  * struct tmigr_group - timer migration hierarchy group
  * @lock:		Lock protecting the event information and group hierarchy
  *			information during setup
- * @parent:		Pointer to the parent group
+ * @parent:		Pointer to the parent group. Pointer is updated when a
+ *			new hierarchy level is added because of a CPU coming
+ *			online the first time. Once it is set, the pointer will
+ *			not be removed or updated. When accessing parent pointer
+ *			lock less to decide whether to abort a propagation or
+ *			not, it is not a problem. The worst outcome is an
+ *			unnecessary/early CPU wake up. But do not access parent
+ *			pointer several times in the same 'action' (like
+ *			activation, deactivation, check for remote expiry,...)
+ *			without holding the lock as it is not ensured that value
+ *			will not change.
  * @groupevt:		Next event of the group which is only used when the
  *			group is !active. The group event is then queued into
  *			the parent timer queue.
@@ -41,9 +51,8 @@
  * @num_children:	Counter of group children to make sure the group is only
  *			filled with TMIGR_CHILDREN_PER_GROUP; Required for setup
  *			only
- * @childmask:		childmask of the group in the parent group; is set
- *			during setup and will never change; can be read
- *			lockless
+ * @groupmask:		mask of the group in the parent group; is set during
+ *			setup and will never change; can be read lockless
  * @list:		List head that is added to the per level
  *			tmigr_level_list; is required during setup when a
  *			new group needs to be connected to the existing
@@ -59,7 +68,7 @@
 	unsigned int level;
 	int numa_node;
 	unsigned int num_children;
-	u8 childmask;
+	u8 groupmask;
 	struct list_head list;
 };
 
@@ -79,7 +88,7 @@
  *			hierarchy
  * @remote:		Is set when timers of the CPU are expired remotely
  * @tmgroup:		Pointer to the parent group
- * @childmask:		childmask of tmigr_cpu in the parent group
+ * @groupmask:		mask of tmigr_cpu in the parent group
  * @wakeup:		Stores the first timer when the timer migration
  *			hierarchy is completely idle and remote expiry was done;
  *			is returned to timer code in the idle path and is only
@@ -92,7 +101,7 @@
 	bool idle;
 	bool remote;
 	struct tmigr_group *tmgroup;
-	u8 childmask;
+	u8 groupmask;
 	u64 wakeup;
 	struct tmigr_event cpuevt;
 };
@@ -108,8 +117,8 @@
 	u32 state;
 	/**
 	 * struct - split state of tmigr_group
-	 * @active:	Contains each childmask bit of the active children
-	 * @migrator:	Contains childmask of the child which is migrator
+	 * @active:	Contains each mask bit of the active children
+	 * @migrator:	Contains mask of the child which is migrator
 	 * @seq:	Sequence counter needs to be increased when an update
 	 *		to the tmigr_state is done. It prevents a race when
 	 *		updates in the child groups are propagated in changed
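
The split state documented in the last hunk (`@active`, `@migrator`, `@seq` packed next to a single `u32 state`) lends itself to a compare-and-exchange update loop like the one visible in `tmigr_active_up()` above. Below is a rough user-space sketch of that idea using C11 atomics. Several things here are assumptions made for illustration: the names `model_state`, `MODEL_NONE` and `model_set_active()` are hypothetical, the field widths and the 0xFF "no migrator" value are chosen for the model rather than taken from this diff, and the kernel's own atomics and memory-ordering rules are not reproduced.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed "no migrator" marker for this model only. */
#define MODEL_NONE 0xFF

/* Hypothetical user-space copy of a split group state. */
union model_state {
	uint32_t state;
	struct {
		uint8_t active;		/* one mask bit per active child */
		uint8_t migrator;	/* mask of the child acting as migrator */
		uint16_t seq;		/* bumped on every update */
	};
};

/*
 * Mark the child identified by @childmask active in @migr_state.
 * Returns true when a migrator was already assigned, meaning the
 * activation does not need to be propagated to the next level.
 */
static bool model_set_active(_Atomic uint32_t *migr_state, uint8_t childmask)
{
	union model_state cur, next;
	bool walk_done;

	cur.state = atomic_load(migr_state);
	do {
		next = cur;
		walk_done = true;

		if (next.migrator == MODEL_NONE) {
			next.migrator = childmask;
			walk_done = false;
		}

		next.active |= childmask;
		next.seq++;
		/* On failure, cur.state is reloaded and the loop recomputes. */
	} while (!atomic_compare_exchange_weak(migr_state, &cur.state, next.state));

	return walk_done;
}
```

Keeping all three fields inside one 32-bit word is what allows a single compare-and-exchange to update them together, so a concurrent updater either sees the whole old state or the whole new one.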