Common Network Fault Diagnosis and Analysis Courseware.ppt
Device Maintenance and Troubleshooting

Router Maintenance and Troubleshooting

Show Interface

Buffers and Queues (What Is an Ignore?)
- The ignored counter increments when no buffer is available to store an incoming frame.
- The queue limit can be exceeded in order to handle bursts of traffic.
[Figure: interface cards (Ethernet, Token Ring 1) with packet buffers on the interface bus, the global pool of system buffers, and the CPU]

Buffers and Queues (Input Drops)
- The input drops counter increments when not enough system buffers are available.
- System buffers are used for all process-switched packets and for packets generated by the router itself.
[Figure: interface bus, interface buffers, and CPU; an incoming Ethernet frame is dropped ("NO VACANCY")]

Buffers and Queues (Output Drops)
- The output drops counter increments when not enough interface buffers are available for an outgoing frame.
[Figure: interface bus, interface buffers, and CPU; a new frame bound for Token Ring is dropped ("NO VACANCY")]

Buffers and Queues (Output Drops)
- The output drops counter also increments when the queue limit of the output interface is exceeded.
[Figure: interface bus, interface buffers, and CPU; a new frame bound for Token Ring 1]

Interface Buffers

    HoStage# show controller cbus
    cBus 0, controller type 6.0, microcode version 10.0
      512 Kbytes of main memory, 128 Kbytes cache memory
      134 1520 byte buffers, 65 4496 byte buffers
      Restarts: 0 line down, 0 hung output, 0 controller error
     MEC 0, controller type 5.1, microcode version 10.0
      Interface 0 - Ethernet0, station address 0000.0c06.4ae0 (bia 0000.0c06.4ae0)
        11 buffer RX queue threshold, 18 buffer TX queue limit, buffer size 1520
        ift 0000, rql 11, tq 0000 0000, tql 18
        Transmitter delay is 0 microseconds
     CTR 1, controller type 9.0, microcode version 10.1
      Interface 8 - TokenRing0, station address 0000.3060.3219 (bia 0000.3060.3219)
        13 buffer RX queue threshold, 31 buffer TX queue limit, buffer size 4496
        ift 0005, rql 13, tq 0000 0000, tql 31
        Transmitter delay is 0 microseconds
     FDDI-T 3, controller type 7.2, microcode version 10.1
      Interface 24 - Fddi0, station address 0000.0c06.36d7 (bia 0000.0c06.36d7)
        13 buffer RX queue threshold, 32 buffer TX queue limit, buffer size 4496
        ift 0006, rql 9, tq 0000 0000, tql 32

System Buffers

    HoStage# show buffers
    Buffer elements:
         500 in free list (500 max allowed)
         51640224 hits, 0 misses, 0 created
    Small buffers, 104 bytes (total 121, permanent 121):
         119 in free list (20 min, 250 max allowed)
         19229201 hits, 0 misses, 0 trims, 0 created
    Middle buffers, 600 bytes (total 90,
    permanent 90):
         89 in free list (10 min, 200 max allowed)
         20513359 hits, 91 misses, 115 trims, 115 created
    Big buffers, 1524 bytes (total 90, permanent 90):
         90 in free list (5 min, 300 max allowed)
         7160285 hits, 0 misses, 0 trims, 0 created
    Large buffers, 5024 bytes (total 5, permanent 5):
         5 in free list (0 min, 30 max allowed)
         233295 hits, 0 misses, 0 trims, 0 created
    Huge buffers, 18024 bytes (total 0, permanent 0):
         0 in free list (0 min, 4 max allowed)
         0 hits, 0 misses, 0 trims, 0 created

The show process cpu Command

    CPU utilization for five seconds: X%/Y%; one minute: Z%; five minutes: W%
     PID  Runtime(ms)  Invoked  uSecs  5Sec  1Min  5Min  TTY  Process

The following table explains the meaning of each field in the command output:

High Utilization Analysis

    router-5#show process cpu
    CPU utilization for five seconds: 83%/21%; one minute: 79%; five minutes: 84%
     PID  Runtime(ms)  Invoked  uSecs   5Sec   1Min   5Min  TTY  Process
       1          104     3707     28  0.00%  0.00%  0.00%    0  Load Meter
       2        10208    15222    670  0.00%  0.01%  0.00%    0  OSPF Hello
       3        34620      579  59792  0.00%  0.20%  0.17%    0  Check heaps

- Total utilization is 83%, of which 21% is spent at interrupt level.
- 83 - 21 = 62% is spent on running processes.

High CPU Utilization Caused by Interrupts
- The main cause of CPU interrupts is fast switching of data traffic.
- Voice ports are configured on the router.
- The router has active Asynchronous Transfer Mode (ATM) ports.
- An incorrect switching path is configured on the router.
- The CPU is performing memory alignment corrections.
- The router is overloaded with traffic; check the output of the interface show commands.
- The IOS software has a bug.

High CPU Utilization Caused by Processes
If a single process consumes a large share of the CPU, check the log messages. Unusual activity in a process usually produces error messages in the log. The following processes can cause high CPU utilization:
- IP Input
- HyBridge Input
- IP Simple Network Management Protocol (SNMP)
- Virtual EXEC
- TCP Timer
- VTEMPLATE Backgr
- Other processes

Show memory

                  Head   Total(b)   Used(b)    Free(b)  Lowest(b)  Largest(b)
    Processor 815D3828   15440232   6758292    8681940    2827144     3901312
          I/O  2400000   12582912   1625120   10957792   10858580    10871708

              Processor memory

     Address      Bytes     Prev     Next Ref  PrevF  NextF Alloc PC  what
    815D8674 0000001500 815D3828 815D8C7C 001    -      -   801E14DC  List Elements
    815D8C7C 0000005000 815D8674 815DA030 001    -      -   801E1518  List Headers
    815DA030 0000000044 815D8C7C 815DA088 001    -      -   80D6FCCC  *Init*
    815DA088 0000001500 815DA030 815DA690 001    -      -   801EBD04  messages

Callouts: "Processor" is main memory; "I/O" is packet-switching memory; each row under "Processor memory" is one allocated memory block.

Memory Problems
- MALLOC failures
- Memory leaks
- Fragmentation
- Alignment errors
- Spurious accesses
- Memory
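  corruption
- Processor memory parity errors

Show process memory

    Router>show processes memory
    Total: 3149760, Used: 2334300, Free: 815460
     PID  TTY  Allocated    Freed  Holding  Getbufs  Retbufs  Process
       0    0     226548     1252  1804376        0        0  *Initialization*
       0    0        320  5422288      320        0        0  *Scheduler*
       0    0    5663692  2173356        0  1856100        0  *Dead*
       1    0        264      264     3784        0        0  Load Meter
       2    2       5700     5372    13124        0        0  Virtual Exec
       3    0          0        0     6784        0        0  Check heaps
       4    0         96        0     6880        0        0  Pool Manager

- Allocated = total bytes allocated to the process since the router booted.
- Freed = total bytes released by the process.
- Holding = total bytes currently held by the process. This is the amount of memory the process actually owns, and it is the most important value for troubleshooting. It does not necessarily equal Allocated minus Freed, because a block allocated by one process can later be returned to the free pool by another process.

Switch Maintenance and Troubleshooting

Autonegotiation Summary

Checking for Duplex Mismatch
- CDP raises a warning when the link first comes up.
- Full duplex means collision detection is disabled: a full-duplex device transmits frames without first checking whether the medium is idle.
- Symptoms of a duplex mismatch:
  - FCS errors (seen on the full-duplex side)
  - Align errors (seen on the full-duplex side)
  - Runts (seen on the full-duplex side)
  - Excessive collisions (seen on the half-duplex side)
  - Late collisions (seen on the half-duplex side)

Summary: 10/100 Mbps Autonegotiation
- Prefer: auto to auto, or fixed speed/duplex to fixed speed/duplex.
- Avoid: auto to fixed speed/duplex.

Gigabit Ethernet Negotiation Issues
- Some devices do not support Gigabit negotiation, or support it only partially.
- If the link has trouble coming up, disable negotiation on the Gigabit link.
- Negotiation must be either enabled on both ends or disabled on both ends.

Gigabit Ethernet Negotiation Issues
Two switches, A and B, are connected via Gigabit Ethernet.

Monitoring on CatOS
- sh port: shows port status and some error counters.
- sh mac: shows Rx and Tx traffic statistics for the port.
- sh counters mod/port: shows all counters.
- sh top: shows the 10 ports carrying the most traffic over 30 seconds; a quick way to find the source of a broadcast storm or the ports involved in a loop.

Monitoring on XL Switches
- sh interface: shows port information, similar to the sh int command in IOS.
- sh controller ethernet fast|gig x/x: shows additional port counters.

Common Port Problems
- Bad cable (failed or wrong type): the link does not come up, or the port shows FCS, runt, and align errors.
  (GigE: a mode-conditioning cable is required for an LX/LH GBIC with MM cable, distance 300 m.)
  (The ZX GBIC is for extended distances: minimum distance is 10 km with an 8 dB attenuator and 40 km without an attenuator.)
- Duplex mismatch: CDP usually reports a duplex mismatch but does not correct it automatically.
- Port inactive on CatOS (status LED is orange): usually because the port is assigned to a VLAN that does not exist (VTP problems are covered later).

Common Port Problems: err-disable
- A port enters the err-disable state (CatOS only) because of:
  - A large number of errors on the port
  - An EtherChannel misconfiguration
  - A BPDU guard violation on the port
  - Other causes
- If ports go into err-disable, you can set the errport option to enable ("errport enable"). This keeps all ports from entering the err-disable state; use it only when needed.
- It is better to find the cause of the error first!

Common Port Problems: err-disable
- An err-disabled port must be re-enabled manually.
- Alternatively, err-disabled ports can be configured to re-enable after x seconds:

    Taras (enable) set errdisable-timeout
    Usage: set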
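    errdisable-timeout
           set errdisable-timeout interval
           (reason = bpdu-guard, channel-misconfig, duplex-mismatch, udld, other, all
            interval = 30..86400 seconds)

Five Trunking Modes
- off: no trunk is formed.
- auto (the default): the port responds to trunk negotiation but does not initiate it. It will not trunk on its own, but it takes part in negotiation if the other end asks for a trunk.
- desirable: the port negotiates and actively tries to form a trunk.

Trunking
- on: the port trunks and sends DTP packets (the encapsulation, ISL or dot1q, must be set).
- nonegotiate: the port trunks but does not send DTP packets. Use nonegotiate temporarily when the link flaps between trunk and non-trunk, or when the peer does not support DTP, for example a router or an XL-series switch.

Trunking
- Recommended between core and edge: desirable to auto, or desirable to desirable.
- If the link must be a trunk, configure on to on.

Trunking Summary
When a trunk or link has problems:
- Verify that the port has connectivity in non-trunk mode.
- Verify that at least one end is in desirable mode, or that both ends are on, then check that the encapsulation matches on both ends.
- Run show mac and check that the in-discards counter is no longer incrementing.
- Check the VTP domain name.
- For the TAC, capture the following:
  - sh trunk (or sh int x/x switchport)
  - sh spant x/x (or sh spanning int x/x)
  - sh config

FEC / PAgP
- FEC also has the on, off, auto, and desirable modes, with only slightly different meanings.
- on: the port joins the channel but does not run PAgP (Port Aggregation Protocol).

FEC
- auto: the port responds to negotiation (it listens for PAgP) but does not actively form a channel.
- desirable: the port sends and receives PAgP and actively forms a channel.

FEC
- Warning: auto to on does not work.
- Warning: on to on works, but PAgP does not run.
- Recommended: desirable to desirable, if both ends support PAgP.
- Use on when connecting to a device that does not support PAgP (a router or an XL-series switch).

FEC
- If FEC is misconfigured, or if FEC detects a spanning tree loop, the ports are put into the ERR-DISABLE state.
- All ports in a channel must share the same configuration: same speed, duplex, trunking status, DTP config, allowed VLANs, and so on.

EtherChannel Summary and Key Points
- sh port capa shows a port's channeling capability.
- Check the mode on both ends (for example, desirable to desirable or on to on is best; avoid auto to auto and other mixed combinations).
- If the channel does not come up, try the following:
  - Turn off trunking on all ports.
  - Verify that speed, duplex, and native VLAN match, then put the ports into the channel.
  - Once the channel is up, configure trunking on the first port; the trunk configuration is then copied to all the other ports.

Spanning Tree Troubleshooting

What Causes Loops?
1) Configuration problems
- Spantree disabled
- Spantree enabled on some switches but not on others
- Speed/duplex mismatches
- Portfast enabled on ports connected to hubs or switches
- Router or multiport NIC configured for bridging
- Using different spantree protocols within the same VLAN
- Misconfigured or buggy trunk- or channel-capable NIC
- Loops with hubs or switches
- Port channeling misconfiguration

What Causes Loops?
2) Design issues
- Too large a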
  switched network
- Bridging over the WAN (delay problems)

What Causes Loops?
3) Software issues
- Software bugs
- Forwarding traffic across blocked ports
- UplinkFast/BackboneFast
- Etc.
- Loss of management communication to line cards

What Causes Loops?
4) Hardware issues
- Layer 1 links that are bad (i.e., CRCs, other input errors)
- Unidirectional links
- Data corruption (BPDUs dropped)
- Port stuck (BPDUs dropped)
- NMP stops listening to spanning tree (stuck inband)
- Loss of management communication to line cards

Detecting Spanning Tree Loops
1) Network is EXTREMELY slow for all nodes
2) Network outage
3) High system utilization on the switch
   - System utilization in "show system" above 20% usually indicates a loop
   - Above 7% indicates a possible transitory loop
   - Depends on network traffic and hardware (Cat5000 Sup1 vs. Cat6000 Sup2, etc.)
4) System LED indicators on the switch utilization bar
5) High amount of In-lost and Out-lost in "show mac"
6) HSRP, OSPF, etc. report a duplicate IP address
7) Unicast flooding

Detecting Spanning Tree Loops
- Check spantree blocked and root ports for errors using "show port", "show mac", and "show counters".
- Set up a syslog server and set logging for the "spantree" facility to level 6, which shows port transitions through the spantree states (listening, learning, etc.).
- Use "show inband" to check for "RsrcErrors" (a BPDU could be dropped if the supervisor is unable to process it).
- Use "show spantree summary" to check whether you are exceeding the supported number of spanning tree instances.

During an Event
- Remove redundant Ethernet segments from the network.
  - Start with connections between core switches.
  - Begin with EtherChannels, if used.
  - Wait 30-60 seconds for the network to recover before removing another link.
  - If the network does not recover, continue methodically removing redundancy until the network stabilizes.
- Avoid rebooting or powering off switches.
  - If you do, you'll lose the logging buffer and the spantree statistics on the switch.
  - Syslog to a server cannot necessarily be trusted during a network failure.

Finding the
Smoking Gun
- Use "show system" to find switches with high backplane utilization.
- Use "show mac" and look for large amounts of broadcast/multicast traffic received and transmitted.
- Use "show spantree statistics" to follow the problem through the network.
  - On the root, check the "topology change initiator" to see which bridge last generated a TCN.
  - Look for "msg age expiry count" on blocked ports to see whether a BPDU expired on the port (MaxAge was reached).
  - Look for "tcn bpdus xmitted" to see whether a bridge sent many TCNs.
  - Look for "forward trans count" to see how many times the port transitioned into the forwarding state.

Preparing for the Next Time
- Take proactive measures (perform these tasks before another event occurs).
- Set the spantree logging level on the switches to 6 ("set logging level spantree 6 default") to see state transitions and TCNs (also log to a server).
- On switches running IOS, use "debug spanning events".
- Enter "clear counters" on all switches.

Finding the Root
- Verify the location of the root.
- The customer might have failed to set the root deterministically.
- The root might have moved because of a new bridge in the network or a bridge priority change.

    esc-cat6500-a (enable) show spantree 5
    VLAN 5
    Spanning tree enabled
    Spanning tree type          ieee
    Designated Root             00-d0-06-26-f4-04
    Designated Root Priority    8192
    Designated Root Cost        3
    Designated Root Port        2/1-2 (agPort 13/33)
    Root Max Age 20 sec   Hello Time 2 sec   Forward Delay 15 sec
    Bridge ID MAC ADDR          00-d0-bb-01-30-04
    Bridge ID Priority          32768
    Bridge Max Age 20 sec   Hello Time 2 sec   Forward Delay 15 sec
    Port    Vlan  Port-State  Cost  Priority  Portfast  Channel_id
    ------  ----  ----------  ----  --------  --------  ----------
    2/1-2   5     forwarding     3        32  disabled  801
    15/1    5     forwarding     4        32  enabled   0

Callouts: "Designated Root" is the bridge ID of the root bridge; "Designated Root Port" is the root port (the port used to reach the root bridge).

    esc-6500-b (enable) show spantree 5
    VLAN 5
    Spanning tree enabled
    Spanning tree type          ieee
    Designated Root             00-d0-06-26-f4-04
    Designated Root Priority    8192
    Designated Root Cost        0
    Designated Root Port        1/0
    Root Max Age 20 sec   Hello Time 2 sec   Forward Delay 15
    sec
    Bridge ID MAC ADDR          00-d0-06-26-f4-04
    Bridge ID Priority          8192
    Bridge Max Age 20 sec   Hello Time 2 sec   Forward Delay 15 sec
    Port    Vlan  Port-State  Cost  Priority  Portfast  Channel_id
    ------  ----  ----------  ----  --------  --------  ----------
    4/1-2   5     forwarding     3        32  disabled  865

    esc-6500-b (enable) show spantree summary
    Root switch for vlans: 4-10.

Finding the Root
- The root ID and the bridge ID match on the root bridge.
- The designated root cost on the root is always "0".
- In 5.4 and later, use "show spantree summary" to see for which VLANs the switch is root.

    esc-6500-b (enable) show spantree summary
    Summary of connected spanning tree ports by vlan
    Vlan   Blocking  Listening  Learning  Forwarding  STP Active
    -----  --------  ---------  --------  ----------  ----------
        1         2          0         0           4           6
        4         0          0         0           2           2
        5         0          0         0           6           6
        6         0          0         0           4           4
        7         0          0         0           4           4
        8         0          0         0           4           4
        9         0          0         0           4           4
       10         0          0         0           4           4
           Blocking  Listening  Learning  Forwarding  STP Active
    -----  --------  ---------  --------  ----------  ----------
    Total         2          0         0          32          34

Finding Active and Blocked Ports
- "Blocking" total: total blocking ports on the switch.
- "STP Active" total: total ports in the spanning tree (do not exceed the limits specified for your supervisor engine in the Release Notes).

Viewing Blocked Ports

    esc-6500-b (enable) show spantree blocked
    T = trunk
    g = group
    Ports      Vlans
    -----      -----
    8/23 (T)   1
    8/24 (T)   1
    Number of blocked ports (segments) in the system : 2

Ports 8/23 and 8/24 are blocking for VLAN 1.

Monitoring Blocked & Root Ports

    esc-6500-b (enable) show spantree stat 8/23 1
    Port 8/23 VLAN 1
    SpanningTree enabled for vlanNo = 1
            BPDU-related parameters
    port spanning tree                   enabled
    state                                blocking
    port_id                              0x836c
    port number                          0x36c
    path cost                            12
    message age (port/VLAN)              3(20)
    designated_root                      00-30-94-93-e5-80
    designated_cost                      19
    designated_bridge                    00-50-53-59-a0-00
    designated_port                      0x8001
    top_change_ack                       FALSE
    config_pending                       FALSE
    port_inconsistency                   none
            PORT based information & statistics
    config bpdus xmitted (port/VLAN)     36(698871)
    config bpdus received (port/VLAN)    215843(608891)
    tcn bpdus xmitted (port/VLAN)        0(7)

- Blocked and root ports should receive BPDUs every 2 seconds.
- Monitor blocked and root ports to see whether they are receiving config BPDUs every 2 seconds.
- Check for errors on blocked or root
  ports, which might cause a blocked port to transition out of blocking mode, or a root bridge change.

Callouts:
- Port 8/23 is blocking for VLAN 1.
- Make sure the "config bpdus received" counter is incrementing on the port approximately every 2 seconds.

Monitoring Blocked & Root Ports
If BPDUs are not being received every 2 seconds (or at all) on the port, check for errors using:
- show port counters: check for Layer 1 errors (Align, FCS, etc.).
- show mac: make sure the "Rcv-Multicast" counter is incrementing and the "In-Discard" counter is not incrementing.
- show counters: check for any errors on the receive side.
- show inband: look for "RsrcErrors".
- show cam system: make sure 01-80-c2-00-00-00 (the IEEE 802.1d BPDU MAC address) is listed as a system entry for the VLAN.

Monitoring Spanning Tree

    Console (enable) show spantree 3/47
    Port    Vlan  Port-State  Cost  Priority  Portfast  Channel_id
    ------  ----  ----------  ----  --------  --------  ----------
    3/47    1     blocking    3019        32  disabled  0
    3/47    2     blocking    3019        32  disabled  0
    3/47    3     blocking    3019        32  disabled  0
    3/47    4     forwarding  3019        32  disabled  0
    3/47    5     forwardin