Consider the following scenario: You have a physical IBM server running Windows 2008 R2 with an LSI SAS controller PCIe card connected to an IBM LTO5 tape library and controlled via NetBackup 7.6. You wish to retire the Windows 2008 server and move the SAS card into a new IBM x3650 M4 server running ESXi 5.5. Once that’s done, you wish to configure DirectPath to map the physical card to your new Windows 2012 R2 NetBackup VM. At this point, nothing in the story should sound terribly difficult. Physically move the card into the new ESXi host, configure DirectPath to allow the card for passthru, reboot and you’re done. Well, that WOULD be true… but only so long as you pick the right check box. It turns out the VMware DirectPath box can be much more dangerous than I, at least, first thought. What do I mean? Here is the screen I was presented with after installing the LSI SAS card. The keywords in that sentence are LSI and SAS.
See the keywords “LSI” and “SAS”? So did I. So I picked that one and pressed OK.
I was told I had to restart the host for the change to take effect. So I did. That turned out to be one of the stupider decisions I’ve ever made in my life. Can you guess why? Before I explain that, let me show you what I was greeted with after I rebooted the server. ESXi booted fine, except this is what I saw when I connected to it:
Excuse me, what? What happened to my 11 Terabytes of VMs??? This is where I had my first heart attack of the night. I tried to go back to Advanced Settings and uncheck the card and reboot, but that unfortunately did absolutely nothing.
So, do you know what happened? In the screenshot above, you’ll note that “LSI Logic / Symbios Logic LSI2008” is selected, but that’s only because I took the screenshot once everything was working again. The first time around, however, I selected “LSI / Symbios Logic MegaRAID SAS Fusion Controller.” Why? Because it said LSI and SAS in the name and that seemed good enough for me.
What ultimately happened is that I told VMware to enable passthru on the primary RAID controller that manages the primary datastore. Dutifully, it did exactly that. Then, once the server rebooted, since the controller was remapped, ESXi couldn’t connect to any of the datastores and thus presented the warning above. As you might imagine, it was at this point that I got on the phone with VMware technical support. We went through and started troubleshooting. This process in itself was painful, as each change required a reboot, which on this server took close to 11 minutes. (More on that later.)
Eventually, we found ourselves looking inside the file /etc/vmware/esx.conf. (As an aside, it’s funny how hard it is to change a product name in code. By all accounts this file should likely be called esxi.conf, but I digress.)
As we looked through the file, we found the line /device/000:022:00:0/owner = "passthru"
Interesting. I wonder what the 22 means? Let’s go back to the DirectPath configuration and check the SAS controller configuration:
Ah ha! Bus 22 is the RAID controller we want, but it’s currently configured for passthru! Another way to figure this out is to use the command esxcfg-scsidevs -a. However, during the actual failure, this command was not returning the controller at all; it only started showing up once the issue was corrected (which is when the screenshot was taken). I’m not sure why the GUI and CLI produced different results here. Maybe vCenter remembered?
So we’ve found the problem and it looks like a simple fix. All we need to do is change the keyword “passthru” to “vmkernel”, save the file, reboot and we should be back in business. So we do that, reboot and…
“Oh sh!@”, I think. It took us a while to figure out what was going on. But the clue should have been “wait a minute, if ESXi can’t see its storage controller, how the heck is ESXi booting in the first place since the OS lives on the same array?”
After talking with an escalation engineer at VMware, I have an explanation. I likely have the following terms wrong, but I’m sure the concepts are valid. When ESXi first boots, it runs in a kind of kernel mode. This gives it low-level access to the disks, among other things. It then boots up and copies everything it needs into RAM. Importantly for our purposes, this boot process happens before esx.conf is loaded. This allows ESXi to fully boot, at which point it passes control over to a kind of user mode. Once this context switch takes place, any disk configuration (such as passthru) that esx.conf requests is implemented. In other words, in my scenario, ESXi booted and then (because I told it to) effectively disconnected itself from its own storage but kept working because it was relying entirely on its in-memory copy of the OS and configuration.
See the issue? We kept making modifications to correct this problem from both the VI client and by editing the esx.conf file directly, but since these changes require a reboot to take effect, they kept getting lost each time we did. Below is a quote from a VMware KB article on this subject that we’ll reference more later explaining this as well:
At this point I’m going to jump ahead to what appears to be the easy solution to this issue. However, it was not the solution we implemented, as we only realized we could do it after we fixed this the hard way. You see, it turns out that the code and configuration that ESXi boots from is collectively called the “Bootbank”. The engineers who wrote ESXi obviously realized that people like me were going to come around every once in a while and make really stupid configuration changes that would prevent ESXi from booting. So what they do is keep a primary and an alternate bootbank.
When you make any changes to ESXi, those changes are committed only to the in-memory configuration and thus will not persist after a reboot. To combat this, VMware has a shell script called /sbin/auto-backup.sh that runs automatically. What this script does is take all of the collective configuration files (including esx.conf) and store them in a compressed file called local.tgz. That file is then compressed again and saved as state.tgz. Two copies of this file exist on two different partitions on the local file system, each from a different point in time. Therefore, to correct the issue above, it appears all I needed to do was reboot the server and, while ESXi was booting, press Shift-R to enter recovery mode and select the alternate bootbank.
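To make the nesting concrete, here is a minimal Python sketch that lists what is stored in local.tgz inside a state.tgz. It is only an illustration under assumptions: the member names ("local.tgz" at the top level, paths such as "etc/vmware/esx.conf" inside it) are what I would expect rather than something verified on every ESXi build, and the function and helper names are made up for this example. The demo builds a throwaway archive in memory so nothing on a real host is touched.

```python
import io
import tarfile


def list_saved_config(state_tgz_bytes):
    """Return the file names stored in local.tgz, which sits inside state.tgz."""
    # Outer layer: state.tgz (assumed to contain a member named local.tgz).
    with tarfile.open(fileobj=io.BytesIO(state_tgz_bytes), mode="r:gz") as outer:
        local_bytes = outer.extractfile("local.tgz").read()
    # Inner layer: local.tgz holds the actual configuration files.
    with tarfile.open(fileobj=io.BytesIO(local_bytes), mode="r:gz") as inner:
        return sorted(inner.getnames())


def make_tgz(files):
    """Demo helper: build an in-memory .tgz from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# Fake state.tgz for illustration only; the contents are invented.
demo_state = make_tgz({
    "local.tgz": make_tgz({
        "etc/vmware/esx.conf": b'/device/000:022:00:0/owner = "passthru"\n',
        "etc/hosts": b"127.0.0.1 localhost\n",
    })
})
print(list_saved_config(demo_state))
```

On a real copy you would read the bytes from the state.tgz file itself; if the member names differ on your build, the lookup will raise a KeyError rather than silently return the wrong thing.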
However, it’s worth noting I just tried this in my lab and this is what I got:
I suspect this is because this was a net new deployment of ESXi and hasn’t run long enough to generate the alternate bootbank. Unfortunately I can’t reboot the production server in question to find out if it was an option there for hopefully obvious reasons.
That’s ok, we can do it the hard way. It turns out that while VMware uses its proprietary VMFS file system for VMs, it still relies on FAT16 for its configuration data. This means it’s technically readable by any Linux OS. Thankfully, our server has a remote management card that supports virtual ISO mounting. So we downloaded the Knoppix Recovery Live CD, which I ended up finding via Google as the VMware guys were unable to recommend one.
Once booted, you can open GParted, which the developers kindly linked right on the desktop. From here we can see that two 250MB partitions are present, otherwise known as Hypervisor1 and Hypervisor2, which in my case are on sda5 and sda6.
What we need to do now to correct our issue is:
- Mount the two hypervisor partitions (we need to look in both as we don’t know which one is currently the active one)
- Find the file state.tgz on both partitions and determine which one is newer (aka active)
- Extract the files inside of state.tgz using tar and specifically edit the file called esx.conf.
- Find the entry that is configuring passthrough for your RAID controller, in my case this line: /device/000:022:00:0/owner = "passthru"
- Replace the keyword passthru with vmkernel and save the file
- Using tar, recompress the files back into state.tgz and copy the new file over the existing one
- Reboot and magically your datastore should return!
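The steps above can be sketched in Python, assuming the live CD has a Python 3 interpreter available. This is a rough illustration rather than the procedure we actually followed (we used tar and a text editor by hand). The function names, the "etc/vmware/esx.conf" member path, the .bak backup copy, and the second device entry in the demo are all assumptions made for the example. The demo runs against throwaway files in a temporary directory standing in for the two mounted hypervisor partitions.

```python
import io
import os
import re
import shutil
import tarfile
import tempfile

CONF = "etc/vmware/esx.conf"  # assumed member path inside local.tgz


def newest_state(mount_points):
    """Return the most recently modified state.tgz among the mount points."""
    paths = [os.path.join(m, "state.tgz") for m in mount_points]
    paths = [p for p in paths if os.path.isfile(p)]
    if not paths:
        raise FileNotFoundError("no state.tgz found")
    return max(paths, key=os.path.getmtime)


def read_member(tgz, name):
    with tarfile.open(fileobj=io.BytesIO(tgz), mode="r:gz") as tar:
        return tar.extractfile(name).read()


def replace_member(tgz, name, data):
    """Copy a .tgz, swapping in new contents for one member."""
    out = io.BytesIO()
    with tarfile.open(fileobj=io.BytesIO(tgz), mode="r:gz") as src:
        with tarfile.open(fileobj=out, mode="w:gz") as dst:
            for info in src.getmembers():
                if info.name == name:
                    info.size = len(data)
                    dst.addfile(info, io.BytesIO(data))
                elif info.isfile():
                    dst.addfile(info, src.extractfile(info))
                else:
                    dst.addfile(info)  # directories and links: header only
    return out.getvalue()


def set_owner(conf_text, device, owner="vmkernel"):
    """Rewrite only the owner line of one /device/ entry."""
    pattern = re.compile(
        r'^(/device/' + re.escape(device) + r'/owner\s*=\s*)"[^"]*"', re.MULTILINE)
    fixed, hits = pattern.subn(lambda m: m.group(1) + '"' + owner + '"', conf_text)
    if hits != 1:
        raise ValueError("expected exactly one owner line for " + device)
    return fixed


def fix_state(state_path, device):
    """Hand `device` back to the vmkernel inside state.tgz, keeping a .bak copy."""
    with open(state_path, "rb") as f:
        state = f.read()
    local = read_member(state, "local.tgz")
    conf = set_owner(read_member(local, CONF).decode("utf-8"), device)
    local = replace_member(local, CONF, conf.encode("utf-8"))
    shutil.copy2(state_path, state_path + ".bak")
    with open(state_path, "wb") as f:
        f.write(replace_member(state, "local.tgz", local))


def _tgz(files):
    """Demo helper: build an in-memory .tgz from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


# Demo on throwaway files. The second device entry is made up, to show that
# only the requested device is changed.
bad = ('/device/000:022:00:0/owner = "passthru"\n'
       '/device/000:099:00:0/owner = "passthru"\n')
root = tempfile.mkdtemp()
mounts = [os.path.join(root, "hypervisor1"), os.path.join(root, "hypervisor2")]
for age, mount in zip((1000, 2000), mounts):
    os.mkdir(mount)
    path = os.path.join(mount, "state.tgz")
    with open(path, "wb") as f:
        f.write(_tgz({"local.tgz": _tgz({CONF: bad.encode("utf-8")})}))
    os.utime(path, (age, age))  # make the hypervisor2 copy the newer one

active = newest_state(mounts)
fix_state(active, "000:022:00:0")
with open(active, "rb") as f:
    fixed = read_member(read_member(f.read(), "local.tgz"), CONF).decode("utf-8")
print(fixed, end="")
```

On the live CD you would pass the real mount points of the two hypervisor partitions (for example, wherever you mounted sda5 and sda6) to newest_state, and the device ID from your own esx.conf to fix_state. Matching the single device line rather than replacing every "passthru" in the file matters here, because any card you genuinely want passed through should be left alone.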
Below is an example of what this looks like as I just tried it again in my lab.
You can find more complete instructions in this VMware KB article: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2043048 (which I only found this afternoon, well after the issue was resolved)
And there you have it. I ended up spending over 7 hours on the phone with VMware and 3 technicians to identify and resolve a problem that started by innocently selecting the wrong LSI Controller for DirectPath.
Moral of the story: If you’re making configuration changes to a server, read every word of the name of the device you’re changing and make sure it’s the device you expect! In this case, I probably should have gone as far as to match up the bus ID VMware saw to the one the BIOS reported since the names, as we have seen, can be confusing.
As an aside, it’s worth noting that when the primary storage was unavailable due to my mistake, ESXi was taking around 11 minutes to boot. The bulk of this time was spent trying to load the NFSClient module (upwards of 6 minutes!). That was especially odd since this server relied on local storage exclusively and did not use NFS. Once the storage connectivity was restored, this module loaded in about 11 seconds.
Lastly, during my conversation with VMware, I was informed that VMware no longer supports running tape drive controllers using DirectPath. That’s not to say it won’t work but just that if it doesn’t work for you, they are not going to provide you any troubleshooting assistance. I’ve been using this approach in some capacity for years and it’s always worked really well so I was surprised to hear that. They must have had a lot of support calls on this.