Error Description
The error appears in the Event Log in ONTAP System Manager. If email notifications are configured, it is also sent to us by email. The content looks like this:
Node: AFF-01
Time: Thu, Oct 21 17:36:40 2021 +0200
Severity: ALERT
Message: vifmgr.cluscheck.hwerrors: Port e2b on node AFF-01 is reporting a high number (at least 1 per 1000 packets) of observed hardware errors (CRC, length, alignment, dropped).
Description: This message occurs when a network device reports a high number of observed hardware errors, such as CRC errors, length errors, alignment errors, or dropped frames.
Corrective Action: The errors could be originating from the specified port, a remote port, or a port on another component of the network. Check the statistics for both the port and the switch. Contact NetApp technical support for assistance and specific instructions.
Source: vifmgr
Sequence#: 143803
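The threshold quoted in the message (at least 1 error per 1000 packets) can be written as a simple ratio check. The following is a minimal sketch only: the function name and the sample numbers are assumptions for illustration, and the interval over which ONTAP actually evaluates this ratio is not stated in the message.

```python
def exceeds_error_threshold(errors: int, packets: int, per_packets: int = 1000) -> bool:
    """Return True if there is at least 1 error per `per_packets` packets."""
    if packets <= 0:
        return False  # no traffic, nothing to evaluate
    # Same as errors / packets >= 1 / per_packets, written without division
    return errors * per_packets >= packets


# Hypothetical values for one measurement interval
print(exceeds_error_threshold(errors=12, packets=10_000))  # True
print(exceeds_error_threshold(errors=3, packets=10_000))   # False
```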
Viewing Interface (Port) Statistics
From the command line, using the node shell, we can view the port statistics, which include counters for the various error types and other data.
Displaying a Single Port
system node run -node <nodename> -command ifstat <interface>

AFF::> system node run -node AFF-02 -command ifstat e2c

-- interface e2c (18 days, 14 hours, 17 minutes, 57 seconds) --

RECEIVE
Total frames: 890m | Frames/second: 554 | Total bytes: 3354g
Bytes/second: 2088k | Total errors: 1148 | Errors/minute: 0
Total discards: 0 | Discards/minute: 0 | Multi/broadcast: 1515k
Non-primary u/c: 0 | Errored frames: 0 | Unsupported Op: 0
CRC errors: 534 | Runt frames: 0 | Fragment: 0
Long frames: 43 | Jabber: 0 | Length errors: 37
Alignment errors: 0 | No buffer: 0 | Pause: 0
Jumbo: 411m | Error symbol: 534 | Bus overruns: 0
Queue drops: 0 | LRO segments: 737m | LRO bytes: 3342g
LRO6 segments: 0 | LRO6 bytes: 0 | Bad UDP cksum: 0
Bad UDP6 cksum: 0 | Bad TCP cksum: 0 | Bad TCP6 cksum: 0
Mcast v6 solicit: 0 | Lagg errors: 0 | Lacp errors: 0
Lacp PDU errors: 0
TRANSMIT
Total frames: 1041m | Frames/second: 648 | Total bytes: 6336g
Bytes/second: 3943k | Total errors: 0 | Errors/minute: 0
Total discards: 0 | Queue overflow: 0 | Multi/broadcast: 107k
Collisions: 0 | Pause: 0 | Jumbo: 760m
Cfg Up to Downs: 0 | TSO segments: 101m | TSO bytes: 5792g
TSO6 segments: 0 | TSO6 bytes: 0 | HW UDP cksums: 0
HW UDP6 cksums: 0 | HW TCP cksums: 0 | HW TCP6 cksums: 0
Mcast v6 solicit: 0 | Lagg drops: 0 | Lagg no buffer: 0
Lagg no entries: 0
DEVICE
Mcast addresses: 3 | Rx MBuf Sz: 9216
LINK INFO
Speed: 10000M | Duplex: full | Flowcontrol: full
Media state: active | Up to downs: 2 | HW assist: 5655
The output shows the total number of errors for the given period, followed by a breakdown by error type. The errors recorded here are CRC errors (534), Long frames (43), Length errors (37), and Error symbol (534), which together make up the total of 1148. Other possible errors include Alignment errors.
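If the output is saved as text, the nonzero error counters can be picked out automatically. The sketch below is one possible way to do it, not a NetApp tool: the function names, the list of counters treated as errors, and the abridged sample (including its column spacing) are assumptions based on the output above.

```python
SECTIONS = {"RECEIVE", "TRANSMIT", "DEVICE", "LINK INFO"}

# Counters treated as receive errors in this sketch (names from the output above)
ERROR_COUNTERS = ["CRC errors", "Runt frames", "Fragment", "Long frames",
                  "Jabber", "Length errors", "Alignment errors", "Error symbol"]

# Abridged copy of the ifstat output; spacing is an assumption
SAMPLE = """\
-- interface e2c (18 days, 14 hours, 17 minutes, 57 seconds) --
RECEIVE
 Total frames:   890m | Frames/second:  554 | Total bytes:   3354g
 Bytes/second:  2088k | Total errors:  1148 | Errors/minute:     0
 CRC errors:      534 | Runt frames:      0 | Fragment:          0
 Long frames:      43 | Jabber:           0 | Length errors:    37
 Jumbo:          411m | Error symbol:   534 | Bus overruns:      0
TRANSMIT
 Total frames:  1041m | Frames/second:  648 | Total bytes:   6336g
 Bytes/second:  3943k | Total errors:     0 | Errors/minute:     0
"""


def parse_ifstat(text):
    """Parse ifstat text into {section: {counter name: raw value string}}."""
    stats = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("--"):
            continue  # skip blank lines and the interface header
        if line in SECTIONS:
            section = line
            stats[section] = {}
            continue
        if section is None:
            continue
        for field in line.split("|"):
            name, sep, value = field.rpartition(":")
            if sep:
                stats[section][name.strip()] = value.strip()
    return stats


def nonzero_errors(receive):
    """Return the error counters whose raw value is not "0"."""
    return {name: receive[name] for name in ERROR_COUNTERS
            if receive.get(name, "0") != "0"}


stats = parse_ifstat(SAMPLE)
errors = nonzero_errors(stats["RECEIVE"])
print(errors)
# Compare the sum of the breakdown with the reported total
print(sum(int(v) for v in errors.values()), stats["RECEIVE"]["Total errors"])
```

Values are kept as raw strings because large counters are abbreviated (890m, 3354g); the sum only works here because the error counters are plain integers.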
Displaying All Ports
We can display the statistics for all ports at once.
system node run -node <nodename> -command ifstat -a
Clearing Port Statistics
To make the statistics easier to monitor after a change, we can clear the counters on the port.
system node run -node <nodename> -command ifstat -z <interface>

AFF::> system node run -node AFF-02 -command ifstat -z e2c

-- interface e2c (23 days, 14 hours, 10 minutes, 55 seconds) --
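When clearing the counters is not desirable, a similar view can be obtained by comparing two snapshots of the counters taken before and after a change. This is only a sketch of that idea: the function name and the snapshot values are hypothetical, and the counters are assumed to be already converted to integers.

```python
def counter_deltas(before: dict, after: dict) -> dict:
    """Return the counters that increased between two snapshots, with the increase."""
    deltas = {}
    for name, new_value in after.items():
        growth = new_value - before.get(name, 0)
        if growth > 0:
            deltas[name] = growth
    return deltas


# Hypothetical snapshots taken before and after a change
before = {"CRC errors": 534, "Long frames": 43, "Length errors": 37}
after = {"CRC errors": 540, "Long frames": 43, "Length errors": 38}
print(counter_deltas(before, after))  # {'CRC errors': 6, 'Length errors': 1}
```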
Possible Causes of Port Errors
The first step is usually to check the active components (switches), which in many cases also report errors on their ports. This can help identify the port where the errors originate. The situation is harder when the switches show no errors. Common next steps are to check the cabling, SFP modules, and so on. Another option is to verify the MTU on the devices in the (SAN) network.
Later, I found a number of articles in the NetApp KB that suggest various possible causes of these errors.
Different Flowcontrol Setting on Array and Switch
- CRC Errors seen on data ports after a head upgrade
- What are the flow control best practices for Ethernet?
- What is the potential impact of PAUSE frames on a network connection?
- Configuring Link Level Flow Control
- To flow or not to flow? - Cisco Blogs
The first article describes CRC errors appearing after a controller replacement. More important than that is its point that Flowcontrol must be set the same on the NetApp node ports and on the switch ports they are connected to (and generally throughout the network). The port statistics command shown earlier also displays the Flowcontrol setting. It can be Flowcontrol: full, which has been the NetApp default for some time, or Flowcontrol: none.
I had never dealt with this before. I looked at the switches, which are Cisco Nexus for SAN and Cisco Catalyst for LAN, and on both flow-control is disabled.
iSCSI1# sh int Eth1/50/1 | inc flow
  Input flow-control is off, output flow-control is off

LAN1#sh int Gi1/0/47 | inc flow
  input flow-control is off, output flow-control is unsupported
The other articles discuss various opinions on whether it is better to have Flowcontrol enabled or disabled. The main point is that it should be set consistently throughout the network, so we can disable it on the NetApp side. This resets the port, which means a short outage on that port. With redundant paths in place, as there should be, that is not a problem.
net port modify -node <node that owns port> -port <port> -flowcontrol-admin none
AFF::> network port modify -node AFF-01 -port e2c -flowcontrol-admin none
Warning: This command will cause a several second interruption of service on this network port.
Do you want to continue? {y|n}: y
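The comparison between the NetApp setting and the switch output can also be scripted. The sketch below is an illustration with hypothetical function names; it handles only the NetApp values none and full, and it assumes that the switch reports an enabled direction as "on".

```python
import re

# Matches lines such as "Input flow-control is off, output flow-control is off"
_SWITCH_RE = re.compile(
    r"input flow-control is (\w+), output flow-control is (\w+)", re.IGNORECASE)


def switch_flow_control(line):
    """Return (input_state, output_state) from a switch 'flow-control' line."""
    match = _SWITCH_RE.search(line)
    if match is None:
        raise ValueError("no flow-control information found in: %r" % line)
    return match.group(1).lower(), match.group(2).lower()


def flow_control_matches(netapp_setting, switch_line):
    """Check whether a NetApp Flowcontrol value agrees with the switch port."""
    rx, tx = switch_flow_control(switch_line)
    if netapp_setting == "none":
        # "off" and "unsupported" both count as not enabled
        return rx != "on" and tx != "on"
    if netapp_setting == "full":
        return rx == "on" and tx == "on"
    raise ValueError("only 'none' and 'full' are handled in this sketch")


nexus = "Input flow-control is off, output flow-control is off"
catalyst = "input flow-control is off, output flow-control is unsupported"
print(flow_control_matches("full", nexus))     # False, the mismatch described above
print(flow_control_matches("none", nexus))     # True
print(flow_control_matches("none", catalyst))  # True
```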
CRC Errors - Component Failure
- CRC errors received on a single NIC port
- NIC port seeing CRC errors in ifstat
- Cluster Network Degraded alerts reported multiple times due to errors on cluster ports
CRC errors are media errors. They can be caused by a faulty cable or SFP module, but they can also be propagated from elsewhere in the network. We need to check the link between the port reporting the errors and the directly connected device, check the port itself, and try replacing the SFP.
Long Frames - Large MTU
If we see Long frames in the port statistics, frames are arriving that are larger than the Maximum Transmission Unit (MTU) set on the given port. We need to go through the servers that connect to the array and check whether any of them has a larger MTU configured.
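Once the MTU values of the connecting servers have been collected, finding the offenders is a simple comparison. The host names and MTU values below are hypothetical, and the port MTU of 9000 is an assumption for the example, not a value taken from the output above.

```python
def hosts_with_larger_mtu(port_mtu, host_mtus):
    """Return the sorted names of hosts whose MTU exceeds the storage port MTU."""
    return sorted(host for host, mtu in host_mtus.items() if mtu > port_mtu)


# Hypothetical inventory: host name -> configured MTU
host_mtus = {"esx01": 9000, "esx02": 9000, "backup01": 9216}
print(hosts_with_larger_mtu(9000, host_mtus))  # ['backup01']
```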
Error Symbol - Component Failure
If Error symbol appears in the statistics, NetApp indicates that this is a hardware component failure. The error occurs during transmission from the physically connected device and cannot be propagated from the network. We need to check the network card and SFP on both the NetApp and the connected device (switch), the cable between them, and that the cable is properly seated.
Length Errors
- ONTAP reports length errors in ifstat output
- Length Errors Incrementing on Data Network Ports
- Unknown length error counts up
- Incrementing frame length errors on port interface
The first article applies only to certain types of interfaces or cards (X1143A). Still, it may be reasonable to take from it that a small number of these errors can be ignored. Another article mentions an incompatible twinax cable.
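The checks described in the sections above can be collected into a small lookup table keyed by the ifstat counter name. This is only a summary of the notes in this article in code form, with a hypothetical function name; it is not an official NetApp troubleshooting procedure.

```python
# Suggested checks per counter, summarizing the sections above
CHECKS = {
    "CRC errors": "check cable, SFP and port; compare Flowcontrol on array and switch",
    "Long frames": "look for servers with a larger MTU than the port",
    "Error symbol": "check NIC, SFP and cable on both ends of the physical link",
    "Length errors": "may be ignorable in small numbers; check for an incompatible twinax cable",
}


def suggest_checks(counters):
    """Return (counter, suggestion) pairs for every known counter above zero."""
    return [(name, CHECKS[name]) for name in CHECKS if counters.get(name, 0) > 0]


# Values from the ifstat example earlier in the article
counters = {"CRC errors": 534, "Long frames": 43, "Length errors": 37,
            "Error symbol": 534, "Alignment errors": 0}
for name, hint in suggest_checks(counters):
    print(f"{name}: {hint}")
```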