Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

i387: re-introduce FPU state preloading at context switch time

After all the FPU state cleanups and finally finding the problem that
caused all our FPU save/restore problems, this re-introduces the
preloading of FPU state that was removed in commit b3b0870ef3ff ("i387:
do not preload FPU state at task switch time").

However, instead of simply reverting the removal, this reimplements
preloading with several fixes, most notably

- properly abstracted as a true FPU state switch, rather than as
open-coded save and restore with various hacks.

In particular, implementing it as a proper FPU state switch allows us
to optimize the CR0.TS flag accesses: there is no reason to set the
TS bit only to then almost immediately clear it again. CR0 accesses
are quite slow and expensive, so don't flip the bit back and forth for
no good reason.

- Make sure that the same model works for both x86-32 and x86-64, so
that there are no gratuitous differences between the two due to the
way they save and restore segment state differently due to
architectural differences that really don't matter to the FPU state.

- Avoid exposing the "preload" state to the context switch routines,
and in particular allow the concept of lazy state restore: if nothing
else has used the FPU in the meantime, and the process is still on
the same CPU, we can avoid restoring state from memory entirely, just
re-expose the state that is still in the FPU unit.

That optimized lazy restore isn't actually implemented here, but the
infrastructure is set up for it. Of course, older CPUs that use
'fnsave' to save the state cannot take advantage of this, since the
state saving also trashes the state.
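
The lazy restore described in the last bullet is not implemented by this commit. A minimal user-space sketch of the bookkeeping it would need, assuming a per-CPU "last owner" pointer and a per-task "last CPU" field as the FIXME in the patch suggests. All names here (`sketch_task`, `fpu_owner`, `last_fpu_cpu`, `SKETCH_NR_CPUS`, and the `lazy_*` helpers) are hypothetical and do not appear in the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_NR_CPUS 4	/* hypothetical CPU count, for the sketch only */

struct sketch_task {
	/* CPU whose FPU registers last held this task's state, -1 if none */
	int last_fpu_cpu;
};

/* Per CPU: the task whose state is still sitting in that CPU's FPU registers */
static struct sketch_task *fpu_owner[SKETCH_NR_CPUS];

/*
 * Record that tsk's state was saved without destroying the register
 * contents (the case where the save routine reports "still intact").
 */
static void lazy_state_intact(struct sketch_task *tsk, int cpu)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	fpu_owner[cpu] = tsk;
	tsk->last_fpu_cpu = cpu;
}

/*
 * Anything else that clobbers the registers on this CPU (for example a
 * destructive fnsave) has to drop the ownership record.
 */
static void lazy_state_invalidate(int cpu)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	fpu_owner[cpu] = NULL;
}

/* True if the restore from memory could be skipped for tsk on this CPU */
static bool lazy_restore_ok(const struct sketch_task *tsk, int cpu)
{
	assert(cpu >= 0 && cpu < SKETCH_NR_CPUS);
	return fpu_owner[cpu] == tsk && tsk->last_fpu_cpu == cpu;
}
```

Both checks are needed: the per-CPU pointer catches another task having used the FPU on this CPU in the meantime, and the per-task field catches the task having run on a different CPU since.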

In other words, there is now an actual _design_ to the FPU state saving,
rather than just random historical baggage. Hopefully it's easier to
follow as a result.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

+139 -48
+93 -17
arch/x86/include/asm/i387.h
···
 extern void fpu_init(void);
 extern void mxcsr_feature_mask_init(void);
 extern int init_fpu(struct task_struct *child);
+extern void __math_state_restore(struct task_struct *);
 extern void math_state_restore(void);
 extern int dump_fpu(struct pt_regs *, struct user_i387_struct *);

···
 #endif	/* CONFIG_X86_64 */

 /*
- * These must be called with preempt disabled
+ * These must be called with preempt disabled. Returns
+ * 'true' if the FPU state is still intact.
  */
-static inline void fpu_save_init(struct fpu *fpu)
+static inline int fpu_save_init(struct fpu *fpu)
 {
 	if (use_xsave()) {
 		fpu_xsave(fpu);
···
 		 * xsave header may indicate the init state of the FP.
 		 */
 		if (!(fpu->state->xsave.xsave_hdr.xstate_bv & XSTATE_FP))
-			return;
+			return 1;
 	} else if (use_fxsr()) {
 		fpu_fxsave(fpu);
 	} else {
 		asm volatile("fnsave %[fx]; fwait"
 			     : [fx] "=m" (fpu->state->fsave));
-		return;
+		return 0;
 	}

-	if (unlikely(fpu->state->fxsave.swd & X87_FSW_ES))
+	/*
+	 * If exceptions are pending, we need to clear them so
+	 * that we don't randomly get exceptions later.
+	 *
+	 * FIXME! Is this perhaps only true for the old-style
+	 * irq13 case? Maybe we could leave the x87 state
+	 * intact otherwise?
+	 */
+	if (unlikely(fpu->state->fxsave.swd & X87_FSW_ES)) {
 		asm volatile("fnclex");
+		return 0;
+	}
+	return 1;
 }

-static inline void __save_init_fpu(struct task_struct *tsk)
+static inline int __save_init_fpu(struct task_struct *tsk)
 {
-	fpu_save_init(&tsk->thread.fpu);
+	return fpu_save_init(&tsk->thread.fpu);
 }

 static inline int fpu_fxrstor_checking(struct fpu *fpu)
···
 }

 /*
+ * FPU state switching for scheduling.
+ *
+ * This is a two-stage process:
+ *
+ *  - switch_fpu_prepare() saves the old state and
+ *    sets the new state of the CR0.TS bit. This is
+ *    done within the context of the old process.
+ *
+ *  - switch_fpu_finish() restores the new state as
+ *    necessary.
+ */
+typedef struct { int preload; } fpu_switch_t;
+
+/*
+ * FIXME! We could do a totally lazy restore, but we need to
+ * add a per-cpu "this was the task that last touched the FPU
+ * on this CPU" variable, and the task needs to have a "I last
+ * touched the FPU on this CPU" and check them.
+ *
+ * We don't do that yet, so "fpu_lazy_restore()" always returns
+ * false, but some day..
+ */
+#define fpu_lazy_restore(tsk) (0)
+#define fpu_lazy_state_intact(tsk) do { } while (0)
+
+static inline fpu_switch_t switch_fpu_prepare(struct task_struct *old, struct task_struct *new)
+{
+	fpu_switch_t fpu;
+
+	fpu.preload = tsk_used_math(new) && new->fpu_counter > 5;
+	if (__thread_has_fpu(old)) {
+		if (__save_init_fpu(old))
+			fpu_lazy_state_intact(old);
+		__thread_clear_has_fpu(old);
+		old->fpu_counter++;
+
+		/* Don't change CR0.TS if we just switch! */
+		if (fpu.preload) {
+			__thread_set_has_fpu(new);
+			prefetch(new->thread.fpu.state);
+		} else
+			stts();
+	} else {
+		old->fpu_counter = 0;
+		if (fpu.preload) {
+			if (fpu_lazy_restore(new))
+				fpu.preload = 0;
+			else
+				prefetch(new->thread.fpu.state);
+			__thread_fpu_begin(new);
+		}
+	}
+	return fpu;
+}
+
+/*
+ * By the time this gets called, we've already cleared CR0.TS and
+ * given the process the FPU if we are going to preload the FPU
+ * state - all we need to do is to conditionally restore the register
+ * state itself.
+ */
+static inline void switch_fpu_finish(struct task_struct *new, fpu_switch_t fpu)
+{
+	if (fpu.preload)
+		__math_state_restore(new);
+}
+
+/*
  * Signal frame handlers...
  */
 extern int save_i387_xstate(void __user *buf);
 extern int restore_i387_xstate(void __user *buf);
-
-static inline void __unlazy_fpu(struct task_struct *tsk)
-{
-	if (__thread_has_fpu(tsk)) {
-		__save_init_fpu(tsk);
-		__thread_fpu_end(tsk);
-	} else
-		tsk->fpu_counter = 0;
-}

 static inline void __clear_fpu(struct task_struct *tsk)
 {
···
 static inline void unlazy_fpu(struct task_struct *tsk)
 {
 	preempt_disable();
-	__unlazy_fpu(tsk);
+	if (__thread_has_fpu(tsk)) {
+		__save_init_fpu(tsk);
+		__thread_fpu_end(tsk);
+	} else
+		tsk->fpu_counter = 0;
 	preempt_enable();
 }

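
To see why the two-stage switch avoids redundant CR0 accesses, the decision logic of switch_fpu_prepare() and switch_fpu_finish() can be modelled in user space. The following is only a sketch: `struct model_task`, the `cr0_ts` flag, and the `cr0_writes` and `restores` counters are hypothetical stand-ins, the register save and restore are omitted, and it assumes that `__thread_fpu_begin()` clears TS via `clts()` before marking the task as FPU owner.

```c
#include <assert.h>
#include <stdbool.h>

struct model_task {
	bool used_math;		/* stands in for tsk_used_math() */
	bool has_fpu;		/* stands in for thread.has_fpu */
	unsigned fpu_counter;
};

static bool cr0_ts;		/* modelled CR0.TS bit */
static unsigned cr0_writes;	/* how often the model touched CR0 */
static unsigned restores;	/* how often register state was "restored" */

static void stts(void) { cr0_ts = true;  cr0_writes++; }
static void clts(void) { cr0_ts = false; cr0_writes++; }

typedef struct { bool preload; } fpu_switch_t;

static fpu_switch_t switch_fpu_prepare(struct model_task *old,
				       struct model_task *new)
{
	fpu_switch_t fpu;

	fpu.preload = new->used_math && new->fpu_counter > 5;
	if (old->has_fpu) {
		/* the real code saves old's registers here */
		old->has_fpu = false;
		old->fpu_counter++;

		if (fpu.preload)
			new->has_fpu = true;	/* TS is already clear: no CR0 access */
		else
			stts();
	} else {
		old->fpu_counter = 0;
		if (fpu.preload) {
			clts();			/* assumed __thread_fpu_begin() behaviour */
			new->has_fpu = true;
		}
	}
	return fpu;
}

static void switch_fpu_finish(struct model_task *new, fpu_switch_t fpu)
{
	if (fpu.preload) {
		/* precondition stated in the patch: owner set, TS clear */
		assert(new->has_fpu && !cr0_ts);
		restores++;
	}
}
```

In this model, switching between two tasks that both use the FPU heavily performs no CR0 access at all, whereas saving with `__unlazy_fpu()` and then preloading would have set TS and cleared it again on every such switch.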
+4 -1
arch/x86/kernel/process_32.c
···
 				 *next = &next_p->thread;
 	int cpu = smp_processor_id();
 	struct tss_struct *tss = &per_cpu(init_tss, cpu);
+	fpu_switch_t fpu;

 	/* never put a printk in __switch_to... printk() calls wake_up*() indirectly */

-	__unlazy_fpu(prev_p);
+	fpu = switch_fpu_prepare(prev_p, next_p);

 	/*
 	 * Reload esp0.
···
 	 */
 	if (prev->gs | next->gs)
 		lazy_load_gs(next->gs);
+
+	switch_fpu_finish(next_p, fpu);

 	percpu_write(current_task, next_p);

+4 -1
arch/x86/kernel/process_64.c
···
 	int cpu = smp_processor_id();
 	struct tss_struct *tss = &per_cpu(init_tss, cpu);
 	unsigned fsindex, gsindex;
+	fpu_switch_t fpu;

-	__unlazy_fpu(prev_p);
+	fpu = switch_fpu_prepare(prev_p, next_p);

 	/*
 	 * Reload esp0, LDT and the page table pointer:
···
 	if (next->gs)
 		wrmsrl(MSR_KERNEL_GS_BASE, next->gs);
 	prev->gsindex = gsindex;
+
+	switch_fpu_finish(next_p, fpu);

 	/*
 	 * Switch the PDA and FPU contexts.
+38 -29
arch/x86/kernel/traps.c
···
 }

 /*
- * 'math_state_restore()' saves the current math information in the
- * old math state array, and gets the new ones from the current task
- *
- * Careful.. There are problems with IBM-designed IRQ13 behaviour.
- * Don't touch unless you *really* know how it works.
- *
- * Must be called with kernel preemption disabled (eg with local
- * local interrupts as in the case of do_device_not_available).
+ * This gets called with the process already owning the
+ * FPU state, and with CR0.TS cleared. It just needs to
+ * restore the FPU register state.
  */
-void math_state_restore(void)
+void __math_state_restore(struct task_struct *tsk)
 {
-	struct task_struct *tsk = current;
-
 	/* We need a safe address that is cheap to find and that is already
-	   in L1. We're just bringing in "tsk->thread.has_fpu", so use that */
+	   in L1. We've just brought in "tsk->thread.has_fpu", so use that */
 #define safe_address (tsk->thread.has_fpu)
-
-	if (!tsk_used_math(tsk)) {
-		local_irq_enable();
-		/*
-		 * does a slab alloc which can sleep
-		 */
-		if (init_fpu(tsk)) {
-			/*
-			 * ran out of memory!
-			 */
-			do_group_exit(SIGKILL);
-			return;
-		}
-		local_irq_disable();
-	}
-
-	__thread_fpu_begin(tsk);

 	/* AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception
 	   is pending. Clear the x87 state here by setting it to fixed
···
 		force_sig(SIGSEGV, tsk);
 		return;
 	}
+}
+
+/*
+ * 'math_state_restore()' saves the current math information in the
+ * old math state array, and gets the new ones from the current task
+ *
+ * Careful.. There are problems with IBM-designed IRQ13 behaviour.
+ * Don't touch unless you *really* know how it works.
+ *
+ * Must be called with kernel preemption disabled (eg with local
+ * local interrupts as in the case of do_device_not_available).
+ */
+void math_state_restore(void)
+{
+	struct task_struct *tsk = current;
+
+	if (!tsk_used_math(tsk)) {
+		local_irq_enable();
+		/*
+		 * does a slab alloc which can sleep
+		 */
+		if (init_fpu(tsk)) {
+			/*
+			 * ran out of memory!
+			 */
+			do_group_exit(SIGKILL);
+			return;
+		}
+		local_irq_disable();
+	}
+
+	__thread_fpu_begin(tsk);
+	__math_state_restore(tsk);

 	tsk->fpu_counter++;
 }