An Open Access Peon

19 August 2006

NFS Server Gone Away

So your servers are happily serving away when somebody knocks over your (inaccessible) NFS server. What happens next seems to depend on the mount settings used, but the bottom line is that any process that attempts to open a file on the NFS mount will hang in 'D' state (most likely in nfs_wai(t)). 'D' is uninterruptible sleep: the process is blocked inside a kernel call, so signals (including kill -9) have no effect until that call returns. And you certainly can't unmount (umount -f) the file system because 1) there are files held open by those unkillable processes and 2) unmounting an NFS mount requires contacting the NFS server to confirm there are no file locks open (except it isn't there!).
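To see how many processes are stuck, something like the following sketch may help. It assumes a Linux procps-style ps; the function names filter_d_state and list_d_state are made up for this example. The filter reads ps-style text on standard input, so it can also be tried on saved output.

```shell
#!/bin/sh
# Keep the header line plus every row whose second column (STAT)
# starts with D, i.e. uninterruptible sleep.
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Live usage (assumes procps ps): PID, state, kernel wait channel, command.
# The wchan column is where a name such as nfs_wait would show up.
list_d_state() {
    ps -eo pid,stat,wchan:20,comm | filter_d_state
}
```

Running list_d_state on an affected client should list the hung processes with their wait channel; an empty result (header only) means nothing is currently in 'D' state.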

How do you recover an NFS client that has been overrun with 'D' state processes? The answer is remarkably quick and simple, if you know how. You can't kill 'D' state processes and you can't restart the machine cleanly (the shutdown will hang while trying to kill them). It hangs because NFS is trying to contact the NFS server, not getting a response, timing out and trying again. But, bizarrely, if you give the NFS server's IP address to another machine, the NFS client will connect, find the share is no longer there and give up. So, on that other machine:

ifconfig eth0 add <NFS server IP address>

This should cause all your nfs_wait processes to return, so you can kill any dead processes and unmount the file system. Removing the IP address again requires the sub-interface name, because the interface now carries two IP addresses and the added one lives on an alias:

ifconfig eth0:0 del <NFS server IP address>

You probably need NFS running on the temporary machine too, so that the clients get an answer rather than another timeout.
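Put together, the whole sequence might look like the sketch below. It is a dry run: it only prints the commands and notes which machine each one belongs on. The function name nfs_recovery_plan and its three arguments (interface, server IP, mount point) are placeholders, and the :0 alias assumes the interface had no other aliases before the address was added.

```shell
#!/bin/sh
# Dry-run sketch: prints the recovery steps instead of executing them.
# Usage: nfs_recovery_plan <interface> <NFS server IP> <mount point>
nfs_recovery_plan() {
    iface=$1
    server_ip=$2
    mount_point=$3
    # Stand-in machine: take over the dead server's address.
    echo "standin# ifconfig ${iface} add ${server_ip}"
    # Stuck client: once the hung calls have returned, unmount.
    echo "client#  umount -f ${mount_point}"
    # Stand-in machine: drop the alias again (assumed to be :0).
    echo "standin# ifconfig ${iface}:0 del ${server_ip}"
}
```

For example, nfs_recovery_plan eth0 192.0.2.10 /mnt/data prints the three commands with those values filled in, ready to be checked and run by hand as root.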