feat: add degradded ssds post · dunkirk.sh/zera@3465265

+121

1 changed file

content/blog/2025-02-02_degraded-zpool-proxmox.md

+++
title = "Fixing a degraded zpool on proxmox"
date = 2025-01-31
slug = "degraded-zpool-proxmox"
description = "replacing a failed drive in a proxmox zpool"

[taxonomies]
tags = ["homelab", "tutorial"]

[extra]
has_toc = true
+++

I finally decided to fix the network issues with my Proxmox server (it had an old static IP and used VLANs which I hadn't set up on the new switch and router) as I had some time today. After fixing that fairly easily, I discovered that my main `2.23 TB` zpool had a drive failure. Thankfully I had managed to stuff 3 disks into the case before, so losing one meant no data loss (thankfully 😬; all my projects from the last 5 years as well as my entire video archive are on this pool). I still have 3 more disks of the same type so I can swap in a new one 2 more times after this.

{{ img(id="https://cloud-n6m4bt2xl-hack-club-bot.vercel.app/2image.png" alt="the zpool reporting a downed disk" caption="That really scared the pants off me when I first saw it 😂") }}

## Actually fixing it

First I had to find the affected disk physically in my case. Because I was stupid I didn't bother to label the drives, but thankfully each one has its serial number on a sticker, so that wasn't terrible.

{{ img(id="https://cloud-pi335w1l0-hack-club-bot.vercel.app/0image_from_ios.jpg" alt="chick-fil-a macaroni and cheese with 2 nuggets and some ketchup" caption="(By this point I had spent 30 minutes moaning so I went to lunch)") }}

Now we can run `lsblk -o +MODEL,SERIAL` to find the serial number of our new drive.

> root@thespia:~# lsblk -o +MODEL,SERIAL
```bash
NAME                        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS MODEL                   SERIAL
sda                         8:0      0 698.6G  0 disk             ST3750640NS             3QD0BG0J
├─sda1                      8:1      0 698.6G  0 part
└─sda9                      8:9      0     8M  0 part
sdb                         8:16     0 698.6G  0 disk             ST3750640NS             3QD0BN6V
sdc                         8:32     0 698.6G  0 disk             ST3750640NS             3QD0BQ5G
├─sdc1                      8:33     0 698.6G  0 part
└─sdc9                      8:41     0     8M  0 part
sdd                         8:48     1 111.8G  0 disk             Hitachi HTS543212L9SA02 090130FBEB00LGGJ35RF
├─sdd1                      8:49     1  1007K  0 part
├─sdd2                      8:50     1   512M  0 part /boot/efi
└─sdd3                      8:51     1 111.3G  0 part
  ├─pve-swap                253:0    0     8G  0 lvm  [SWAP]
  ├─pve-root                253:1    0  37.8G  0 lvm  /
  ├─pve-data_tmeta          253:2    0     1G  0 lvm
  │ └─pve-data-tpool        253:4    0  49.6G  0 lvm
  │   ├─pve-data            253:5    0  49.6G  1 lvm
  │   ├─pve-vm--100--cloudinit
  │   │                     253:6    0     4M  0 lvm
  │   ├─pve-vm--101--cloudinit
  │   │                     253:7    0     4M  0 lvm
  │   ├─pve-vm--103--disk--0
  │   │                     253:8    0     4M  0 lvm
  │   └─pve-vm--103--disk--1
  │                         253:9    0    32G  0 lvm
  └─pve-data_tdata          253:3    0  49.6G  0 lvm
    └─pve-data-tpool        253:4    0  49.6G  0 lvm
      ├─pve-data            253:5    0  49.6G  1 lvm
      ├─pve-vm--100--cloudinit
      │                     253:6    0     4M  0 lvm
      ├─pve-vm--101--cloudinit
      │                     253:7    0     4M  0 lvm
      ├─pve-vm--103--disk--0
      │                     253:8    0     4M  0 lvm
      └─pve-vm--103--disk--1
                            253:9    0    32G  0 lvm
sde                         8:64     0 465.8G  0 disk             WDC WD5000AAKS-65YGA0   WD-WCAS83511331
├─sde1                      8:65     0 465.8G  0 part
└─sde9                      8:73     0     8M  0 part
sdf                         8:80     1     0B  0 disk             Multi-Card              20120926571200000
zd0                         230:0    0    32G  0 disk
├─zd0p1                     230:1    0   100M  0 part
├─zd0p2                     230:2    0    16M  0 part
├─zd0p3                     230:3    0  31.4G  0 part
└─zd0p4                     230:4    0   522M  0 part
zd16                        230:16   0    80G  0 disk
├─zd16p1                    230:17   0     1M  0 part
└─zd16p2                    230:18   0    80G  0 part
zd32                        230:32   0     4M  0 disk
zd48                        230:48   0    80G  0 disk
├─zd48p1                    230:49   0     1M  0 part
└─zd48p2                    230:50   0    80G  0 part
zd64                        230:64   0    32G  0 disk
├─zd64p1                    230:65   0   512K  0 part
└─zd64p2                    230:66   0    32G  0 part
zd80                        230:80   0     1M  0 disk
```

Our two current drives are `3QD0BG0J` and `3QD0BQ5G`, as we can see in Proxmox. We can also see that they have partitions while `sdb`/`3QD0BN6V` does not, so that's our target drive. Now we can find the disk by id with `ls /dev/disk/by-id | grep 3QD0BN6V` which gives us:

> ls /dev/disk/by-id | grep 3QD0BN6V
```bash
ata-ST3750640NS_3QD0BN6V
```

{{ img(id="https://cloud-d0bjeue06-hack-club-bot.vercel.app/0image_from_ios.jpg" alt="chick-fil-a macaroni and cheese with 2 nuggets and some ketchup" caption="My case situation is a bit of a mess and I'm using old 7200rpm server drives for pretty much everything; the dream is a 3 drive 2 TB each m.2 nvme ssd setup, maybe someday 🤷") }}

We are going to go with the first id, so now we move on to the ZFS part. Running `zpool status vault-of-the-eldunari` we can get the status of the pool:

> zpool status vault-of-the-eldunari
```bash
  pool: vault-of-the-eldunari
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid. Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: https://openzfs.github.io/openzfs-docs/msg/ZFS-8000-4J
  scan: resilvered 8.33G in 00:48:26 with 0 errors on Thu Nov 14 18:38:03 2024
config:

        NAME                          STATE     READ WRITE CKSUM
        vault-of-the-eldunari         DEGRADED     0     0     0
          raidz1-0                    DEGRADED     0     0     0
            9201394420428878514       UNAVAIL      0     0     0  was /dev/disk/by-id/ata-ST3750640NS_3QD0BM29-part1
            ata-ST3750640NS_3QD0BQ5G  ONLINE       0     0     0
            ata-ST3750640NS_3QD0BG0J  ONLINE       0     0     0

errors: No known data errors
```

We can add our new disk with `zpool replace vault-of-the-eldunari 9201394420428878514 ata-ST3750640NS_3QD0BN6V`, but first we wipe the disk from the Disks tab on our Proxmox node to make sure it's all clean before we restore the pool. After that we also initialize a new GPT table. Now we are ready to replace the disk. Running the replace command can take quite a while and it doesn't output anything, so sit tight. After waiting a few minutes Proxmox reported that resilvering would take 1:49 minutes and it was 5% done already!

I hope this helped at least one other person, but I'm mainly writing this to remind myself how to do this when it inevitably happens again :)

{{ img(id="https://cloud-n6m4bt2xl-hack-club-bot.vercel.app/0image.png" alt="the zpool reporting a downed disk" caption="It's slow but faster than I expected for HDDs") }}
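
For future me, the lookup and replace steps above can be sketched as two small shell helpers. This is a rough sketch, not what I actually ran: the function names `find_by_id` and `replace_disk` and the `RUN` switch are made up for illustration, and `replace_disk` only prints the command unless `RUN=1` is set, so nothing touches the pool by accident.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not from the original post) wrapping the manual steps.

# Print the first /dev/disk/by-id entry matching a serial number,
# skipping the "-partN" links that point at partitions.
# The directory is a parameter so this can be tried on a scratch folder.
find_by_id() {
    local serial="$1" dir="${2:-/dev/disk/by-id}"
    ls "$dir" | grep -F -- "$serial" | grep -v -- '-part[0-9][0-9]*$' | head -n 1
}

# Print the zpool replace command for review.
# It is only executed when the (assumed) RUN=1 switch is set.
replace_disk() {
    local pool="$1" old="$2" new="$3"
    echo "zpool replace $pool $old $new"
    if [ "${RUN:-0}" = "1" ]; then
        zpool replace "$pool" "$old" "$new"
    fi
}
```

With the values from this post, `find_by_id 3QD0BN6V` should print `ata-ST3750640NS_3QD0BN6V`, and `replace_disk vault-of-the-eldunari 9201394420428878514 ata-ST3750640NS_3QD0BN6V` prints the same replace command I typed by hand. Once the replace has actually been run, `zpool status vault-of-the-eldunari` shows the resilver progress.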
