Linux kernel mirror (for testing) git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git

xen/xenbus: better handle backend crash

When the backend domain crashes, coordinated device cleanup is not
possible (as it involves waiting for the backend state change). In that
case, the toolstack forcefully removes the frontend xenstore entries.
xenbus_dev_changed() handles this case and triggers device cleanup.
It's possible that the toolstack manages to connect a new device in that
place before xenbus_dev_changed() notices the old one is missing. If
that happens, the new one won't be probed and will forever remain in
XenbusStateInitialising.

Fix this by checking the frontend's state in Xenstore. In case it has
been reset to XenbusStateInitialising by Xen tools, consider this
to be the result of an unplug+plug operation.
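
The detection rule can be modelled in plain C outside the kernel. This is only an illustrative sketch: struct frontend_model and frontend_was_replaced() are hypothetical names, not kernel API. The additional backend domid comparison is taken from the xenbus_probe.c hunk of this patch, and the enum values are assumed to follow Xen's public io/xenbus.h header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Subset of the xenbus states, numbered as in Xen's public io/xenbus.h. */
enum xenbus_state {
	XenbusStateUnknown      = 0,
	XenbusStateInitialising = 1,
	XenbusStateInitWait     = 2,
	XenbusStateInitialised  = 3,
	XenbusStateConnected    = 4,
	XenbusStateClosing      = 5,
	XenbusStateClosed       = 6,
};

/* Hypothetical stand-in for the struct xenbus_device fields used here. */
struct frontend_model {
	enum xenbus_state state;  /* state the kernel last recorded */
	unsigned int otherend_id; /* backend domid the kernel last saw */
};

/*
 * Return true when the values read from Xenstore suggest that the node
 * now describes a new device instance: the state key is back at
 * Initialising while the kernel either remembers a different state or
 * remembers a different backend domid.
 */
bool frontend_was_replaced(const struct frontend_model *dev,
			   enum xenbus_state xs_state,
			   unsigned int xs_backend_id)
{
	assert(dev != NULL);

	return xs_state == XenbusStateInitialising &&
	       (xs_state != dev->state || xs_backend_id != dev->otherend_id);
}
```

With a device last seen as Connected, a Xenstore state of Initialising is treated as a replacement. A device that was itself still Initialising is only treated as replaced when the backend domid differs, which mirrors the early-crash case described in the patch comment.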

It's important that cleanup on such unplug doesn't modify Xenstore
entries (especially the "state" key) as they belong to the new device
to be probed - changing them would derail establishing the connection to
the new backend (most likely, closing the device before it was even
connected). Handle this case by setting the new xenbus_device->vanished
flag to true and checking it before changing the state entry.

And even if xenbus_dev_changed() correctly detects the device was
forcefully removed, the cleanup handling is still racy. Since this whole
handling doesn't happen in a single Xenstore transaction, it's possible
that the toolstack might have put a new device there already. Avoid
re-creating the state key (which in the case of losing the race would
actually close the newly attached device).
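
The guards described in the two paragraphs above can be sketched as a small userspace model. This is not the kernel implementation: struct model_dev, struct model_xs and model_switch_state() are hypothetical, the Xenstore "state" key is reduced to a struct with a presence flag, and the transaction is assumed to succeed. The skip on a missing key corresponds to the unchanged "err != 1" context line in the xenbus_client.c hunk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* State numbers assumed to match Xen's public io/xenbus.h. */
enum { STATE_INITIALISING = 1, STATE_CONNECTED = 4, STATE_CLOSED = 6 };

/* Hypothetical stand-in for the relevant struct xenbus_device fields. */
struct model_dev {
	int state;     /* state cached by the kernel */
	bool vanished; /* set once the device was found to be replaced */
};

/* Hypothetical model of the "state" key of the frontend node. */
struct model_xs {
	bool present; /* false if the key was removed by the toolstack */
	int state;    /* value of the key when present */
};

/*
 * Try to switch the device to new_state.
 * Returns true only if the modelled Xenstore key was actually written.
 */
bool model_switch_state(struct model_dev *dev, struct model_xs *xs,
			int new_state)
{
	assert(dev != NULL && xs != NULL);

	/* Nothing to do, or the entries no longer belong to this device. */
	if (new_state == dev->state || dev->vanished)
		return false;

	/* Key is gone: do not re-create it. */
	if (!xs->present)
		return false;

	/* Key was reset behind our back: it belongs to a new device. */
	if (xs->state != dev->state && xs->state == STATE_INITIALISING)
		return false;

	xs->state = new_state;
	dev->state = new_state;
	return true;
}
```

In this model, an attempt to move a Connected device to Closed leaves the key untouched whenever the key is missing, has been reset to Initialising, or the device is already marked as vanished, so a freshly attached device is never closed by the cleanup of its predecessor.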

The problem does not apply to a frontend domain crash, as that case
involves coordinated cleanup.

Problem originally reported at
https://lore.kernel.org/xen-devel/aOZvivyZ9YhVWDLN@mail-itl/T/#t,
including reproduction steps.

Based-on-patch-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Message-ID: <20260218095205.453657-3-jgross@suse.com>

+48 -2
+11 -2
drivers/xen/xenbus/xenbus_client.c
···
 	struct xenbus_transaction xbt;
 	int current_state;
 	int err, abort;
+	bool vanished = false;
 
-	if (state == dev->state)
+	if (state == dev->state || dev->vanished)
 		return 0;
 
 again:
···
 	err = xenbus_scanf(xbt, dev->nodename, "state", "%d", &current_state);
 	if (err != 1)
 		goto abort;
+	if (current_state != dev->state && current_state == XenbusStateInitialising) {
+		vanished = true;
+		goto abort;
+	}
 
 	err = xenbus_printf(xbt, dev->nodename, "state", "%d", state);
 	if (err) {
···
 		if (err == -EAGAIN && !abort)
 			goto again;
 		xenbus_switch_fatal(dev, depth, err, "ending transaction");
-	} else
+	} else if (!vanished)
 		dev->state = state;
 
 	return 0;
···
 					   const char *path)
 {
 	enum xenbus_state result;
+
+	if (dev && dev->vanished)
+		return XenbusStateUnknown;
+
 	int err = xenbus_gather(XBT_NIL, path, "state", "%d", &result, NULL);
 	if (err)
 		result = XenbusStateUnknown;
+36
drivers/xen/xenbus/xenbus_probe.c
···
 		info.dev = NULL;
 		bus_for_each_dev(bus, NULL, &info, cleanup_dev);
 		if (info.dev) {
+			dev_warn(&info.dev->dev,
+				 "device forcefully removed from xenstore\n");
+			info.dev->vanished = true;
 			device_unregister(&info.dev->dev);
 			put_device(&info.dev->dev);
 		}
···
 		return;
 
 	dev = xenbus_device_find(root, &bus->bus);
+	/*
+	 * Backend domain crash results in not coordinated frontend removal,
+	 * without going through XenbusStateClosing. If this is a new instance
+	 * of the same device Xen tools will have reset the state to
+	 * XenbusStateInitializing.
+	 * It might be that the backend crashed early during the init phase of
+	 * device setup, in which case the known state would have been
+	 * XenbusStateInitializing. So test the backend domid to match the
+	 * saved one. In case the new backend happens to have the same domid as
+	 * the old one, we can just carry on, as there is no inconsistency
+	 * resulting in this case.
+	 */
+	if (dev && !strcmp(bus->root, "device")) {
+		enum xenbus_state state = xenbus_read_driver_state(dev, dev->nodename);
+		unsigned int backend = xenbus_read_unsigned(root, "backend-id",
+							    dev->otherend_id);
+
+		if (state == XenbusStateInitialising &&
+		    (state != dev->state || backend != dev->otherend_id)) {
+			/*
+			 * State has been reset, assume the old one vanished
+			 * and new one needs to be probed.
+			 */
+			dev_warn(&dev->dev,
+				 "state reset occurred, reconnecting\n");
+			dev->vanished = true;
+		}
+		if (dev->vanished) {
+			device_unregister(&dev->dev);
+			put_device(&dev->dev);
+			dev = NULL;
+		}
+	}
 	if (!dev)
 		xenbus_probe_node(bus, type, root);
 	else
+1
include/xen/xenbus.h
···
 	const char *devicetype;
 	const char *nodename;
 	const char *otherend;
+	bool vanished;
 	int otherend_id;
 	struct xenbus_watch otherend_watch;
 	struct device dev;