Error: Login to iSCSI target iqn.####-##.com.######## on vmhba## @ ### failed. The iSCSI initiator could not establish a network connection to the target.

The error “Login to iSCSI target iqn.####-##.com.######## on vmhba## @ ### failed. The iSCSI initiator could not establish a network connection to the target” is not an absolute error. Meaning, it is not that ESX was never able to connect at all; it simply means that one attempt failed. For example, we have an ESX host connected to a VNX 5300; we receive that error once when we re-scan the HBAs, and then the LUNs connect up just fine, with latency around 1-5 ms under a 512 read/write Iometer load. The latency is measured on the SAN, in Iometer, and on the ESX host.
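
For context, the one failed attempt shows up when we rescan the HBAs. A rescan can be triggered from the ESXi shell as well as from the vSphere Client; the adapter name below is only a placeholder for the software iSCSI adapter on your host, and the command returns silently on success.

[root@vsphere01:~] esxcli storage core adapter rescan --adapter vmhba33
[root@vsphere01:~] esxcli storage core adapter rescan --all

Watching /var/log/vmkernel.log (or the vSphere Client events) during the rescan is where the login error shows up.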

 

So is the error a bogus error? Is it something to worry about? What does it actually mean? I set out to answer just those questions. We opened a case with VMware, we opened a case with EMC, and we dove into the copper-based iSCSI networking side of things. Below are all the things we went through to ensure we had a healthy host and datastore.

 

High level

  1. Made sure all drivers and firmware on all devices were up to date (of course this is the first thing support goes to, but it is also the first thing to become invalid, sometimes within days of the supposed golden “up to date,” since new releases of both come out all the time).
  2. Confirmed that an MTU of 8792 passes through all iSCSI connections, that iSCSI is on a physically separate network, and that there are no dropped packets.
  3. Installed Iometer on the ESX hosts and performed read/write tests, measuring latency on the LUN, in Iometer, and on the ESX host.
  4. EMC support health review – EMC had nothing much to come back with other than that the SAN was healthy, but suggested we might flip the D-ACK (Delayed ACK) parameter on ESX, which delays TCP acknowledgments on the iSCSI connections (checking the current value is shown in the example after this list). EMC said it was an iSCSI best practice for VMware ESX hosts; VMware said nothing about this best practice. I researched D-ACK, and the article states that Delayed ACK is useful to enable when you have a lot of congestion. We have almost zero congestion when we get the error, so logic says EMC was pinning the tail on the donkey (an idiom meaning they were guessing and trying anything).
  5. VMware support feedback – they focused overly on the technical details and had a hard time listening. They immediately pointed out that we are running a driver that is a version or two behind the latest, but in the same paragraph said we should not upgrade to ESX 6.5 until at least a few more releases come out.
  6. Made sure jumbo frames are set end to end using vmkping -d -s 8792 192.168.12.104 (see the example after this list). The -d flag sets do-not-fragment, so if there is any point in the path where jumbo frames are not enabled, the ping will fail. Make sure you ping from the host to the SAN iSCSI NIC.
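
As a concrete example, here is roughly what the jumbo frame test from item 6 looks like from the ESXi shell in our environment. The vmk interface names are placeholders for your iSCSI VMkernel ports; -I forces the ping out a specific VMkernel port so each path gets tested, and -d with -s 8792 sends a payload that may not be fragmented.

[root@vsphere01:~] vmkping -I vmk1 -d -s 8792 192.168.12.104
[root@vsphere01:~] vmkping -I vmk2 -d -s 8792 192.168.12.105

While in the shell we also checked the Delayed ACK setting from item 4. On our build it is listed with the rest of the software iSCSI adapter parameters (again, the adapter name is a placeholder):

[root@vsphere01:~] esxcli iscsi adapter param get --adapter vmhba33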

 

Iometer test

We ran an IO Analyzer test with the SQL Server 64K workload for 480 seconds against the Storage Pool B LUN called DS1. Looking at latency on both the VNX and ESX, here are the results. They show that the SAN is responsive between the time it receives a request and hands back the data, but that ESX sees much worse latency between the time it sends out a request and receives the response.

VNX – showed a latency of under 5 ms

ESX – showed a read latency of 10-39 ms and a write latency of 10-41 ms
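
For the host-side numbers we leaned on esxtop, which reports the DAVG/KAVG/GAVG device latencies; batch mode makes it easy to save a record for the support cases. The delay, sample count, and output path below are just an example sized to match the 480 second run.

[root@vsphere01:~] esxtop -b -d 5 -n 96 > /tmp/esxtop-ds1-run.csv

The resulting CSV loads into Windows perfmon or a spreadsheet, which makes it straightforward to line the ESX latency up against the VNX Analyzer data.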

 

Confirm iSCSI connectivity

Using the netcat (nc) tool, you can shell into an ESX host and run a port-scan style connect and disconnect against an iSCSI target, to confirm the host can reach the target, establish a session, and tear that session down.

usage: nc [-46DdhklnrStUuvzC] [-i interval] [-p source_port]
[-s source_ip_address] [-T ToS] [-w timeout] [-X proxy_version]
[-x proxy_address[:port]] [hostname] [port[s]]
[root@vsphere01:~] nc -z 192.168.12.104 3260
Connection to 192.168.12.104 3260 port [tcp/*] succeeded!
[root@vsphere01:~] nc -z 192.168.12.105 3260
Connection to 192.168.12.105 3260 port [tcp/*] succeeded!

 

So the big question is: what is the difference between netcat, which connects just fine, and the ESX iSCSI initiator, which complains about the same target that “The iSCSI initiator could not establish a network connection to the target”? You would expect that if one fails, the other fails too. One relevant distinction is that netcat only proves a TCP connection to port 3260 can be opened, while the initiator also has to complete the iSCSI login on top of that connection within its own timeouts.
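
One way to see that difference on the wire is to capture the iSCSI traffic on the VMkernel port while triggering a rescan. The interface name and output path below are placeholders; tcpdump-uw ships with ESXi and only sees traffic on the vmk interface you point it at.

[root@vsphere01:~] tcpdump-uw -i vmk1 -nn -w /tmp/iscsi-login.pcap port 3260

Opening the capture in Wireshark shows whether it is the TCP handshake, the iSCSI login request, or the login response that never completes when the error fires.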

 

So now I add a timeout and also specify the source address, so the connection leaves via a specific VMkernel port. I try both VMkernel port addresses, and both succeed with a 1 second timeout (the -w flag takes whole seconds, so 1 is the lowest value you can set).

[root@vsphere01:~] nc -z -w 1 -s 192.168.12.31 192.168.12.105 3260
Connection to 192.168.12.105 3260 port [tcp/*] succeeded!
[root@vsphere01:~] nc -z -w 1 -s 192.168.12.32 192.168.12.105 3260
Connection to 192.168.12.105 3260 port [tcp/*] succeeded!
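
As a final sanity check, we confirmed the initiator really does end up with sessions on both targets after the rescan. The adapter name below is again a placeholder for the software iSCSI adapter.

[root@vsphere01:~] esxcli iscsi session list --adapter vmhba33

Every target that completed its login shows up here with its target IQN and ISID, which matches what we see in practice: the LUNs do connect, despite the one-off login error during the rescan.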