fix: ensure callosum drains queue before shutdown and improve task cleanup

Fix a race condition in CallosumConnection where exit events could be lost
when stop() was called immediately after emit(). Refactor _run_loop to exit
only when stop_event is set AND the queue is empty, guaranteeing that all
queued messages are sent before shutdown. Reduce the join timeout from 2s to
0.5s, since proper queue draining makes long waits unnecessary.
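
For reviewers, the drain-before-exit condition can be sketched roughly as
follows. This is a minimal illustration, not the actual CallosumConnection
code: the class name DrainingSender, the send callback, and the 0.05s poll
interval are assumptions made for the example.

```python
import queue
import threading


class DrainingSender:
    """Hypothetical stand-in for a connection that sends queued messages."""

    def __init__(self, send):
        self._send = send  # callable that delivers one message
        self._queue = queue.Queue()
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run_loop, daemon=True)
        self._thread.start()

    def emit(self, message):
        # Enqueue only; the background thread does the actual sending.
        self._queue.put(message)

    def _run_loop(self):
        # Exit only when stop was requested AND nothing is left to send.
        while not (self._stop_event.is_set() and self._queue.empty()):
            try:
                message = self._queue.get(timeout=0.05)
            except queue.Empty:
                continue  # nothing queued yet; re-check the exit condition
            self._send(message)

    def stop(self, timeout=0.5):
        # Messages emitted before this call are already in the queue, so the
        # loop will see a non-empty queue and keep draining before it exits.
        self._stop_event.set()
        self._thread.join(timeout=timeout)
```

Because emit() enqueues before stop() sets the event, a message emitted just
before stop() is still visible to the loop when it re-checks its exit
condition.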

Replace timeout-based task cleanup with PID validity checking in the service
manager. Remove the arbitrary 5-minute timeout, which could kill slow
processes. Instead, detect dead tasks by checking whether the PID still exists
via psutil, and clean up tasks that crashed, were killed, or became zombies.
Run the check every 5s for faster cleanup.
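
A PID validity check along these lines might look like the sketch below. It
assumes tasks are tracked as a dict mapping ref to pid; the function names
pid_is_alive and reap_dead_tasks are hypothetical and not taken from the
service manager. The psutil calls used (pid_exists, Process.status,
STATUS_ZOMBIE) are part of psutil's public API.

```python
from typing import Callable, Dict, List


def pid_is_alive(pid: int) -> bool:
    """Return True if pid refers to a running, non-zombie process."""
    # Third-party dependency; imported here so the reaping logic below can be
    # exercised with a stub predicate when psutil is not installed.
    import psutil

    if not psutil.pid_exists(pid):
        return False
    try:
        return psutil.Process(pid).status() != psutil.STATUS_ZOMBIE
    except psutil.NoSuchProcess:
        return False  # the process exited between the two checks
    except psutil.AccessDenied:
        return True  # it exists, but we are not allowed to inspect it


def reap_dead_tasks(
    tasks: Dict[str, int],
    is_alive: Callable[[int], bool] = pid_is_alive,
) -> List[str]:
    """Remove tasks whose PID is gone or a zombie; return the removed refs."""
    dead = [ref for ref, pid in tasks.items() if not is_alive(pid)]
    for ref in dead:
        del tasks[ref]
    return dead
```

Calling reap_dead_tasks on a timer (every 5s in this change) removes crashed,
killed, and zombie tasks without putting a time limit on slow ones.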

Add resilience to missed exec events by detecting new tasks from logs/line
events. If the manager starts after processes are already running, it picks
them up from their first log output using the ref, name, and pid fields.
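
The discovery path can be illustrated with a small sketch. The event shape
here (a dict with "tract" and "event" keys alongside ref, name, and pid) and
the function name handle_event are assumptions for the example; only the ref,
name, and pid fields come from this change.

```python
from typing import Any, Dict


def handle_event(tasks: Dict[str, Dict[str, Any]], event: Dict[str, Any]) -> bool:
    """Register a task from an exec or line event; return True if it is new."""
    # Assumed event shape: {"tract": "logs", "event": "line", "ref": ..., ...}
    if event.get("tract") != "logs" or event.get("event") not in ("exec", "line"):
        return False

    ref, name, pid = event.get("ref"), event.get("name"), event.get("pid")
    if ref is None or name is None or pid is None:
        return False  # not enough information to track the task
    if ref in tasks:
        return False  # already known, e.g. from an earlier exec event

    tasks[ref] = {"name": name, "pid": pid}
    return True
```

With this shape, a missed exec event costs nothing beyond a short delay: the
first line event carrying the same three fields registers the task instead.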

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>