Matt Pson Just another dead sysadmin blog

28Mar/11Off

Using a Debian server as NFS storage for VMware ESXi

When installing a new VMware environment at work I thought up the idea of not using expensive SAN disks to provide various ISO files for installing virtual machines or software. In my environment I have access to several physical linux servers running Debian (Squeeze) so why not use one of those for a (readonly) share of ISOs via NFS, especially since we have way too much unused storage in most of those servers anyway.

After picking a suitable server, scp'ing some ISO files to a directory - time to install a NFS server. Really easy, just:

# apt-get install nfs-kernel-server

and after installing some additional packages, there it was ready to use. A quick edit to the /etc/exports file to share the directory /export/iso as read-only to hosts on my backend ESXi network:

/export/iso         10.10.10.0/24(ro,sync,no_subtree_check)

and reloading the NFS server to enable the new config. Brilliant, that should be all (I thought).

Trying to mount the NFS share on one of my ESXi hosts it failed and gave a quite cryptic error message ("Unable to connect to NAS volume mynfs: Unable to complete Sysinfo operation. Please see the VMkernel log file for more details").

Ok, a quick look at the ESXi logs produced another message, "MOUNT RPC failed with RPC status 9 (RPC program version mismatch)". That's pretty odd. I know ESXi wants to use NFS version 3 over TCP but I also know that Linux has been NFSv3 capable since a long time by now. Time to see what the Linux server thinks about this. Glancing at the logfiles on the Linux side gave nothing useful. A rcpinfo maybe?

$ rpcinfo -p  | grep nfs

100003    2   tcp   2049  nfs
100003    3   tcp   2049  nfs
100003    4   tcp   2049  nfs

So I clearly see a NFS version 3 over TCP up there. Something else must be wrong, after checking all configuration again to make sure that there wasn't any weird stuff anywhere. Mounting the share from another Linux server worked (of course) so the functionality was there. Insert some headscratching here and a new cup of tea.

After looking at the output of "nfsstat" I realized that the NFS mount by the other Linux server generated statistics under "Server nfs v2" and not under "Server nfs v3" (all zeroes there). So the Linux-Linux mount used version 2 and not 3. Why? A "ps auxwww" later I noticed that the mountd process looked quite odd:

/usr/sbin/rpc.mountd --manage-gids --no-nfs-version 3

So mountd has version 3 disabled? No configuration I saw included that "--no-nfs-version 3" option so it had to come from somewhere else. A quick read of the /etc/init.d/nfs-kernel-server file gave the answer:

$PREFIX/bin/rpcinfo -u localhost nfs 3 >/dev/null 2>&1 ||                    RPCMOUNTDOPTS="$RPCMOUNTDOPTS --no-nfs-version 3"

The startup script looks for NFS version 3 over UDP (the -u flag) and if it doesn't find it, it adds the "--no-nfs-version 3" flag. The "rpcinfo" command above did not list any NFS over UDP so it would always add that flag.

A quick edit of the /etc/init.d/nfs-kernel-server file to change the '-u' to '-t' to look for TCP instead of UDP and a restart of the NFS service via:

# /etc/init.d/nfs-kernel-server restart

and voila! - it works. The ESXi host mounted the directory without a problem. Problem solved.

So in short, Debian Squeeze provides a NFS version 3 service over TCP but always seems to disable it in the start-up script as there is no such service over UDP. I guess there is a reason for that somewhere but it would have been nice to see a bit more flexible start-up script, it would certainly have saved me some time.

Edit: when googling on the above problem I came across posts mentioning DNS problems and systemclocks that was out of sync so I guess in theory that those things can cause troubles too.