Server – Somewhere Out There

Sometime around the new year, one of the drives in my server appeared to have died. I had some issues with it in the past, but usually unseating and reseating it seemed to fix whatever problems it was presenting. But not this time.

My server has 7 500GB HDDs, set-up in a RAID 5 configuration. It gives me about 2.7TB of storage space. These are just consumer-level WD Blue 7200rpm drives. It’s a Homelab that’s mainly experimental; I’m not into spending big money on it. Not yet, anyway.

While I’ve since heard that RAID 5 isn’t great, I’m OK with this since this is just a Homelab. Anyway, in RAID5, one drive can die and the array will still function. Which is exactly what happened here.

However, I began tempting fate by not immediately swapping the failed drive. I didn’t have any spares at home, but more importantly, I was being cheap. So I let it run in a degraded state for a month or two months. This was very dangerous as I don’t backup the VMs or ESXi. I only backup my main Windows Server instance via Windows Server backup to an external HDD. Even then, I’ve committed the common cardinal sin of backups: I’ve yet to test a single WS backup. So using something like Veeam is probably worth looking into for backing up full VMs. And of course testing my Windows Server full bare-metal backups.

Luckily, fates were on my side and no other drive failures were reported. I finally got around to replacing the drive about month ago. Got my hands on a similar WD Blue 500GB drive; a used one at that. It was pretty straightforward. I swapped the drives, went into the RAID configuration in the system BIOS, designated the drive as part of the array, and then had it rebuild. I think it took at least 10hrs.

While it was rebuilding, everything else was down. All VMs were down, because ESXi was down. Thought it best to rebuild while nothing else was happening. Who knows how long it would’ve taken otherwise and if I’d run into other issues. I wonder how this is done on real-life production servers.

Afterwards, the RAID controller reported that everything was in tip-top shape.

But of course, I wanted more. More storage, that is. I ended up getting two of the 500GB WD Blue HDDs: one for the replacement and the other as an additional disk to the array.

Unfortunately, Dell does not make it easy to add additional drives to an existing array. I couldn’t do it directly on the RAID controller (pressing Ctrl+R during boot), nor in Dell’s GUI-based BIOS or Lifecycle Controller. IDRAC didn’t allow it either.

Looking around online, it seemed that the only way to do it would be via something called OpenManage, some kind of remote system controller from Dell. But I couldn’t get it to work no matter what I did. The instructions on what I needed to install, how to install it once I figured out what to install, nor how to actually use it once I determined how to install it, were poor. Thanks, Dell.

In the end, after spending at least a few hours researching and experimenting, it didn’t seem worth it for 500GB more of storage space. I did add the 8th drive in, but as a hot spare. I may even take it out and use it as cold spare.

But yeah, I can now say that I’ve dealt with a failed drive in a RAID configuration. Hopefully it never goes further than that.

One of my plans at work is to properly remove an older physical servers from the network. This server once functioned as the primary – and only – domain controller, DNS, fileserver, print server, VPN server, Exchange server, etc. It was replaced in 2018, but was never really offlined. It existed in limbo; sometimes on, sometimes off. During the pandemic, my “successor/predecessor” turned it back on so staff could VPN in to the office from home.

Long story short, it’s time to take it down. To start, I want to remove it’s DC role. But I’ve never done that before. I’ve added DCs, but never taken one out of the network. So that’s why I did this.

I started by creating a new Win2016 VM in ESXi. This would be my third Windows Server instance, and I named it appropriately: DC03.

I set a static IP and added the domain controller role to it via Server Manager. The installation went off without a hitch, so I completed the post-installation wizard and added it as a third domain controller. Again, no issues. In a command prompt, I used the command repadmin /replsummary to verify that links to the other two DCs were up and that replication was occurring. After that, I checked that DNS settings had replicated. All DNS entries were present, including the DNS Forwarders.

Wait, what?

In a moment of serendipity, I had a couple weeks prior created an impromptu experiment setup. I added DNS forwarders to DC01 after DC02 was added as a DC. I had seen guides and best practices saying that DNS settings either coming from a router via DHCP or statically put on a workstation shouldn’t mix internal and external servers. So DNS1 shouldn’t be an internal DNS server, while DNS2 points to a public DNS like Google’s 8.8.8.8. So that’s how I found out about DNS fowarders in Windows DNS mananger.

I expected the DNS forwarders to eventually replicate from DC01 to DC02, but they never did, even after multiple forced replications. At the time, I didn’t understand why that was the case. In the end, I manually added the forwarders to DC02.

And then a few days after that, I added another forwarder on DC01, but not to DC02. And of course, that last entry didn’t replicate, leaving a discrepancy.

Apparently, DNS forwarders are local only and they don’t replicate. Conditional forwarders will, but not full-on external forwarders. This has something to do with the fact that DCs in the real world may be in different geographical locations, with different ISPs, that require the use of separate external DNS forwarders at each location.

So imagine my surprise when DC03 automatically had the DNS forwarders that I had placed on DC01. But I quickly stumbled upon the answer:

By [adding DNS roles], the server automatically pulled the forwarders’ list from the original DNS servers, and it placed these settings in the new DNS server role. This behavior is by default and cannot be changed.
Self-Replicating DNS Forwarders Problems in Windows Server 2008/2012 | Petri IT Knowledgebase

That’s why DC03 had the DNS forwarders. When a new DC is added that has a DNS role, it will do a one-time pull from the other DNS server; in this case, my “main” DC. But after that, DC03’s forwarders will forever be local.

Case closed!

With the new DC03 in place, with its proper roles, I left it for 24hrs. Just to see if anything weird would happen.

And wouldn’t you know it, nothing weird happened. Sweet!

I ran nslookup on a few different computers on my network, including domain- and non-domain joined ones.

It looked like that on all the computers. All three DCs/DNSs were present.

After confirming that everything was OK, I started removing the newest DC from the environment. I attempted to remove the role via Server Manager, but was prompted to run dcpromo.exe first. Since it wasn’t the last DC, I made sure not to check the box asking if it was last DC in the domain. Once again, everything went smoothly.

To confirm that DC03 was no longer an actual DC, I did another nslookup on various computers. The IP address of DC03 was no longer showing. In addition, I checked DNS Manager on DC01 (and DC02) and saw that DC03 was no longer a nameserver. Though a static host (A) record was still present, as was a PTR in the reverse lookup zone; both expected results. I left the AD role on the server, but I could completely remove it if I wanted.

Pretty simple and straightforward.

This gave me the confidence to do this at work. Consequently, I removed the DC role from the old server last week with no issues whatsoever. No one even knows it happened. Which is all a sysadmin can ask for!

Tag: Server

Homelab Chronicles 05: Dead Disk Drive

Homelab Chronicles 02 – Admin Giveth and Taketh Away…the Domain Controller