NetApp ONTAP errors on network ports

Error Description

The error appears in the Event Log in ONTAP System Manager. If email notifications are configured, it is also sent to us by email. The content looks like this:

Node: AFF-01
Time: Thu, Oct 21 17:36:40 2021 +0200
Severity: ALERT

Message: vifmgr.cluscheck.hwerrors: Port e2b on node AFF-01 is reporting a high number (at least 1 per 1000 packets) of observed hardware errors (CRC, length, alignment, dropped).

Description: This message occurs when a network device reports a high number of observed hardware errors, such as CRC errors, length errors, alignment errors, or dropped frames.

Corrective Action: The errors could be originating from the specified port, a remote port, or a port on another component of the network. Check the statistics for both the port and the switch. Contact NetApp technical support for assistance and specific instructions.

Source: vifmgr
Sequence#: 143803

Viewing Interface (Port) Statistics

From the node shell, the ifstat command displays the port statistics, including counters for the various error types and other data.

Displaying a Single Port

system node run -node <nodename> -command ifstat <interface>

AFF::> system node run -node AFF-02 -command ifstat e2c

-- interface  e2c  (18 days, 14 hours, 17 minutes, 57 seconds) --

RECEIVE
 Total frames:      890m | Frames/second:     554  | Total bytes:      3354g
 Bytes/second:     2088k | Total errors:     1148  | Errors/minute:       0 
 Total discards:      0  | Discards/minute:     0  | Multi/broadcast:  1515k
 Non-primary u/c:     0  | Errored frames:      0  | Unsupported Op:      0 
 CRC errors:        534  | Runt frames:         0  | Fragment:            0 
 Long frames:        43  | Jabber:              0  | Length errors:      37 
 Alignment errors:    0  | No buffer:           0  | Pause:               0 
 Jumbo:             411m | Error symbol:      534  | Bus overruns:        0 
 Queue drops:         0  | LRO segments:      737m | LRO bytes:        3342g
 LRO6 segments:       0  | LRO6 bytes:          0  | Bad UDP cksum:       0 
 Bad UDP6 cksum:      0  | Bad TCP cksum:       0  | Bad TCP6 cksum:      0 
 Mcast v6 solicit:    0  | Lagg errors:         0  | Lacp errors:         0 
 Lacp PDU errors:     0 
TRANSMIT
 Total frames:     1041m | Frames/second:     648  | Total bytes:      6336g
 Bytes/second:     3943k | Total errors:        0  | Errors/minute:       0 
 Total discards:      0  | Queue overflow:      0  | Multi/broadcast:   107k
 Collisions:          0  | Pause:               0  | Jumbo:             760m
 Cfg Up to Downs:     0  | TSO segments:      101m | TSO bytes:        5792g
 TSO6 segments:       0  | TSO6 bytes:          0  | HW UDP cksums:       0 
 HW UDP6 cksums:      0  | HW TCP cksums:       0  | HW TCP6 cksums:      0 
 Mcast v6 solicit:    0  | Lagg drops:          0  | Lagg no buffer:      0 
 Lagg no entries:     0 
DEVICE
 Mcast addresses:     3  | Rx MBuf Sz:       9216 
LINK INFO
 Speed:           10000M | Duplex:            full | Flowcontrol:      full
 Media state:     active | Up to downs:          2 | HW assist:        5655

The output shows the total number of errors for the period given in the header, followed by a breakdown by error type. This port recorded CRC errors (534), Long frames (43), Error symbol (534), and Length errors (37), which together make up the 1148 total receive errors. Alignment errors are another type that can appear.
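The alert message refers to a rate of at least 1 error per 1000 packets, so it can be useful to turn the cumulative counters into the same unit. The following is a minimal Python sketch that parses one section of pasted ifstat output. The function names are hypothetical, and it assumes that the k/m/g suffixes are rounded decimal multipliers, which the source does not confirm.

```python
# Assumption: ifstat's k/m/g suffixes are rounded decimal multipliers.
SUFFIX = {"k": 10**3, "m": 10**6, "g": 10**9}


def parse_counter(text):
    """Convert an ifstat value such as '890m' or '1148' to an integer."""
    text = text.strip()
    if text[-1:].lower() in SUFFIX:
        return int(text[:-1]) * SUFFIX[text[-1].lower()]
    return int(text)


def parse_section(output, section):
    """Return {counter name: value} for one section, e.g. 'RECEIVE'."""
    counters = {}
    current = None
    for line in output.splitlines():
        stripped = line.strip()
        if ":" not in stripped:
            # Non-empty lines without a colon are the banner or a section header.
            if stripped:
                current = stripped
            continue
        if current != section:
            continue
        for cell in stripped.split("|"):
            name, _, value = cell.partition(":")
            try:
                counters[name.strip()] = parse_counter(value)
            except ValueError:
                pass  # skip non-numeric values such as 'full' or 'active'
    return counters


def errors_per_1000(counters):
    """Average number of errors per 1000 frames over the whole period."""
    frames = counters["Total frames"]
    return 1000 * counters["Total errors"] / frames if frames else 0.0


# A few lines copied from the ifstat output for e2c.
SAMPLE = """\
RECEIVE
 Total frames:      890m | Frames/second:     554  | Total bytes:      3354g
 Bytes/second:     2088k | Total errors:     1148  | Errors/minute:       0
 CRC errors:        534  | Runt frames:         0  | Fragment:            0
TRANSMIT
 Total frames:     1041m | Frames/second:     648  | Total bytes:      6336g
"""

rx = parse_section(SAMPLE, "RECEIVE")
rate = errors_per_1000(rx)
print(f"{rate:.4f} errors per 1000 received frames")
```

For the RECEIVE counters of e2c this gives roughly 0.0013 errors per 1000 frames, averaged over the whole period. An average over many days can hide short bursts, so clearing the counters and sampling again after a shorter interval is probably more informative.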

Displaying All Ports

We can display the statistics for all ports at once.

system node run -node <nodename> -command ifstat -a

Clearing Port Statistics

To make it easier to monitor the statistics after a change, we can clear the counters on the port.

system node run -node <nodename> -command ifstat -z <interface>

AFF::> system node run -node AFF-02 -command ifstat -z e2c
-- interface  e2c  (23 days, 14 hours, 10 minutes, 55 seconds) --

Possible Causes of Port Errors

The first step is probably to check the active components (switches), which in many cases also show errors on their ports. That can help identify the port where the errors originate. Things are harder when the switches show no errors. The usual checks then cover cabling, SFP modules, and so on. Another option is to verify the MTU on the devices in the (SAN) network.

Later, I found a number of articles in the NetApp KB that describe various causes of these errors and ways to address them.

Different Flowcontrol Setting on Array and Switch

The first article describes CRC errors that appear when replacing controllers. More important than that scenario is its point that Flowcontrol must be set the same on the NetApp node ports and on the switch ports they are connected to (and generally throughout the network). The ifstat output shown earlier also includes the Flowcontrol setting. It can be Flowcontrol: full, which has been the NetApp default for some time, or Flowcontrol: none.

I had never dealt with this before. I looked at the switches, which are Cisco Nexus for SAN and Cisco Catalyst for LAN, and on both flow-control is disabled.

iSCSI1# sh int Eth1/50/1 | inc flow
  Input flow-control is off, output flow-control is off

LAN1#sh int Gi1/0/47 | inc flow
  input flow-control is off, output flow-control is unsupported

The other articles offer differing opinions on whether it is better to have Flowcontrol enabled or disabled. The main thing is that it is set the same throughout the network. Since it is disabled on the switches, we can disable it on the NetApp as well. This resets the port, which means a short outage on that port. We should have redundancy in place anyway, so that shouldn't be a problem.

net port modify -node <node that owns port> -port <port> -flowcontrol-admin none

AFF::> network port modify -node AFF-01 -port e2c -flowcontrol-admin none

Warning: This command will cause a several second interruption of service on this network
         port.
Do you want to continue? {y|n}: y
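To confirm that both ends agree, the switch output can be compared with the Flowcontrol value reported by ifstat. The following is a minimal Python sketch. The function names are hypothetical, and the comparison is deliberately simplified: it only handles the ONTAP values none and full, and it treats a switch direction as enabled only when it reads "on".

```python
import re

# Matches lines such as the Cisco output quoted above (case differs by platform).
PATTERN = re.compile(
    r"input flow-control is (\w+), output flow-control is (\w+)", re.IGNORECASE
)


def switch_flowcontrol(line):
    """Extract the (input, output) flow-control states from a switch line."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError(f"no flow-control state in: {line!r}")
    return match.group(1).lower(), match.group(2).lower()


def is_consistent(netapp_flowcontrol, switch_line):
    """Simplified check: 'none' needs both directions not 'on', 'full' needs both 'on'."""
    rx, tx = switch_flowcontrol(switch_line)
    if netapp_flowcontrol == "none":
        return rx != "on" and tx != "on"
    if netapp_flowcontrol == "full":
        return rx == "on" and tx == "on"
    raise ValueError(f"unhandled ONTAP setting: {netapp_flowcontrol!r}")


nexus = "  Input flow-control is off, output flow-control is off"
catalyst = "  input flow-control is off, output flow-control is unsupported"

print(is_consistent("full", nexus))     # before the change on the NetApp port
print(is_consistent("none", nexus))     # after the change
print(is_consistent("none", catalyst))
```

With the switch lines from this article, the check fails for Flowcontrol: full and passes once the NetApp port is set to none.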

CRC Errors - Component Failure

CRC errors are media errors. They can be caused by a faulty cable or SFP module, and they can also be propagated from the network. We need to check the connection between the port reporting the error and the next connected device, check the port itself, and try replacing the SFP.

Long Frames - Large MTU

NetApp KB article: Ifstat output reports long frames

If we see Long frames in the port statistics, it means that frames are arriving that are larger than the Maximum Transmission Unit (MTU) set on the given port. We need to go through the servers that connect to the array and check whether any of them has a larger value set.
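Going through the servers amounts to comparing each host's MTU with the MTU of the array port. The following is a minimal Python sketch of that comparison. The host names, the MTU values, and the port MTU of 9000 are hypothetical and only illustrate the check.

```python
def hosts_exceeding_mtu(port_mtu, host_mtus):
    """Return the hosts whose MTU is larger than the array port's MTU, sorted by name."""
    return sorted(host for host, mtu in host_mtus.items() if mtu > port_mtu)


# Hypothetical inventory: host name -> MTU configured on its storage interface.
hosts = {"esx01": 9000, "esx02": 1500, "backup01": 9216}

# Assumed MTU of the array port; any host above it can produce Long frames.
print(hosts_exceeding_mtu(9000, hosts))
```

Here only backup01 is reported, because its MTU of 9216 exceeds the assumed port MTU.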

Error Symbol - Component Failure

NetApp KB article: Error symbol and Illegal symbol count incrementing on a NIC port

If Error symbol appears in the statistics, NetApp indicates that this is a hardware component failure. The error occurs during transmission from the physically connected device and cannot be propagated from elsewhere in the network. We need to check the network card and SFP on the NetApp, the same on the connected device (switch), the connecting cable, and that the cable is connected properly.

Length Errors

The first article applies only to certain types of interfaces or cards (X1143A), but perhaps its conclusion, that a small number of these errors can be ignored, applies more generally. Another article mentions an incompatible twinax cable as a cause.