-----------------------------------------------------------------------
NOV-100.DOC -- 19980304 -- Email thread on NetWare 4.x 100% Utilization
-----------------------------------------------------------------------
Feel free to add or edit this document and then email it back to
faq@jelyon.com

Date: Mon, 23 Oct 1995 12:15:22 GMT0BST
From: Kens Mail List Host
Subject: NW4.1 High CPU Utilization a CURE! (sort of)

We've recently been plagued by slow network response time, always
accompanied by very high server utilization (80-100%). We have various
NICs and grades of machine, but mostly 486s and above. Using the Norton
SI program, the data throughput was reported as 160KB/s on a 486 with a
3Com 3C509 card. This was obviously not good.

After much testing using different configurations we eventually
discovered the source of our problem. It was packet signing. When packet
signing was removed, network throughput rose to 460KB/s on the same
machine! At best we were losing 40-50% of our cable bandwidth on a 486
to packet signing. At worst, one of our 386s was down to 16KB/s with
packet signing and is now at 250KB/s without it - an approximate 94%
loss of bandwidth!

I knew that packet signing was going to affect the throughput of the
network, but I didn't think it would effectively halve it! Surely this
is a broken piece of software. My advice: don't use it unless you
really, really need to!

Ian Kennedy
------------------------------
Another idea is to turn off Pburst in the NW shell. Reports are that
under not-so-good conditions Pburst can negotiate itself into a corner
and slow down a great deal. Yet another is to have a look at the server
to ensure it's not wedged in some manner. Overdoing disk writes to a
not-very-strong disk system (and/or one with no free space, so stealing
from \deleted.sav must occur) can block traffic. Similarly, a
client-class lan adapter in a server can be stomped upon by too much
traffic and temporarily go bananas. The old 8-bit 3C503 boards were
famous for such behavior. If you have Novell's Lanalyzer or equivalent
then watching the traffic in detail might yield the cause.
Joe D.
------------------------------
Date: Wed, 25 Oct 1995 21:51:06 PDT
From: Luke Mitchell
Subject: Server utilization-100%

One solution I have not seen posted for the utilization issue: after all
the patches are applied and you have a Pentium server, make sure MAXIMUM
SERVER PROCESSES is set all the way up to 1000. A Novell tech indicated
to me that my Proliant 1500 5/100 server could handle that. Also turn
the minimum time to add a new process all the way down. I have not had a
problem since those steps and patches. This server runs GMHS and serves
related apps for 500 users.
------------------------------
From: nicholas cline eggleston

One other culprit of intermittent 100% utilization under NetWare 4.1 is
compression. If you've got it turned on, the server sits around and
compresses files at night, like it should. However, when your users go
to read those compressed files, the server will spawn a very high
priority decompression thread. This keeps the server busy enough that
client connections can be lost. This problem is especially noticeable on
servers with a 486/66 or less.

Nick Eggleston
------------------------------
Date: Sun, 22 Oct 1995 12:15:53 -0600
From: Joe Doupnik
Subject: Re: NW 4.1 High Utilization 100% - Solutions

>=> Fellow Network Engineers,
>=>
>=> Since I first replied to this problem on the list, I have gotten a
>=> lot of inquiries.
>=> So I am posting this to help anyone that may be experiencing high
>=> utilization problems with NetWare 4.1 NDS.
-------
We'll have to do some careful collecting of information on the problem
because NW 4 is very prone to the 100% utilization problem. Sundry
revisions to nds.nlm and dsrepair and so on have helped but not
eliminated it. Novell is actively working on the NDS database robustness
problem, so I expect we will eventually get good solutions to the
multitude of causes.
Joe D.
------------------------------
Date: Fri, 20 Oct 1995 15:07:00 PDT
From: "Funderburg, Karl - G6"
Subject: NW 4.1 High Utilization 100% - Solutions

Since I first replied to this problem on the list, I have gotten a lot
of inquiries. So I am posting this to help anyone that may be
experiencing high utilization problems with NetWare 4.1 NDS. Perhaps it
can go to the FAQ; I am not sure how that process works.

High utilization is a known problem in the NetWare 4.x environment. I
have found that there is not one simple answer to fix it. There are
several things that can and should be done to help alleviate the
problem. I worked through most of what I am about to tell you with the
help of Novell engineers (we are a premium service account, so our
problem got escalated to the actual engineers at Novell).

I. Patches and updates - First of all, apply all Novell patch kits and
updates to your server. These files are all available from Netwire
through the Web or CompuServe. Three files in particular are 410IT4.EXE,
410PT2.EXE and DSENH.EXE. Novell will not help you further until you
patch to this level. DSENH.EXE will give you the most recent DS.NLM
v4.89a and DSREPAIR.NLM v4.26b. Version 4.89a of DS is supposed to fix a
lot of the high utilization problems. You also get a neat utility called
DSMAINT.NLM that will let you copy NDS to a file on disk. This is useful
if you are planning to bring a server down for an upgrade.

II. Set parameters - In 410IT4.EXE you get SVRPRSFX.NLM, which increases
the maximum allowable service processes from 100 to 1000, and
DSPRCSFX.NLM, which limits NDS use of service processes to 50%. The
following parameters should be set if you have over 100 connections.

Service Processes
  Set New Service Process Wait Time = 0.3
    (speeds the allocation of additional service processes)
  Set Maximum Service Processes = number (number = 5 to 1000, see below)

  No. of Client     Recommended Maximum
  Connections       Service Processes
  1 - 100           2 - 200 (don't use less than the default)
  101 - 250         200 - 500
  251 - 500         500 - 1000
  501 - 1000        1000

Directory Cache Buffers
  Set Minimum Directory Cache Buffers = number (number = 10 - 2000, see below)
  Set Maximum Directory Cache Buffers = number (number = 20 - 4000, see below)
  Set Directory Cache Allocation Wait Time = 0.5
    (speeds the allocation of additional DCBs)

  No. of Client     Recommended Directory Cache Buffers
  Connections       minimum           maximum
  1 - 100           2 - 200           4000
  101 - 250         200 - 500         4000
  251 - 500         500 - 1000        4000
  501 - 1000        1000 - 2000       4000

Packet Receive Buffers
  Set Minimum Packet Receive Buffers = number (number = 10 - 2000, see below)
  Set Maximum Packet Receive Buffers = number (number = 20 - 4000, see below)
  Set New Packet Receive Buffer Wait Time = 0.1
    (speeds the allocation of additional PRBs)

  No. of Client     Recommended Packet Receive Buffers
  Connections       minimum           maximum
  1 - 100           2 - 200           4000
  101 - 250         200 - 500         4000
  251 - 500         500 - 1000        4000
  501 - 1000        1000 - 2000       4000
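As a worked example of section II, here is roughly what those settings
look like gathered together for a server in the 251-500 connection
range, using the values straight from the tables above. Treat it as a
starting point only, and note that on some NetWare versions one or two
of these (the minimum packet receive buffers in particular) may have to
go in STARTUP.NCF rather than AUTOEXEC.NCF or the console:

  Set New Service Process Wait Time = 0.3
  Set Maximum Service Processes = 1000
  Set Directory Cache Allocation Wait Time = 0.5
  Set Minimum Directory Cache Buffers = 500
  Set Maximum Directory Cache Buffers = 4000
  Set New Packet Receive Buffer Wait Time = 0.1
  Set Minimum Packet Receive Buffers = 500
  Set Maximum Packet Receive Buffers = 4000

For other connection counts, pick the minimums from the matching row
above; the wait times and the 4000 maximums stay the same.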
III. NDS partitions - Limit the number of replicas of a partition to 3
servers. If you must do bindery emulation on all servers (this requires
a replica of the partition containing the bindery objects), then keep
the partition containing the bindery objects as small as possible.

IV. ArcServe tip - Load TSANDS.NLM only on servers containing master
replicas. There is no need to make a backup of each replica of NDS, only
the masters. It helps to have a stable (normal utilization) network when
running ArcServe backups. Do steps I, II & III.

My network is a WAN consisting of 14 Novell 4.1 servers, 2000 nodes, a
dedicated backup server running ArcServe 5.01g, and an Exabyte DAT tape
changer. Topology is Ethernet using UTP cabling and an FDDI backbone.
Routers and hubs are SynOptics (Cisco). Our NDS tree now contains 3
partitions. One partition is very small, containing only the bindery
objects required for bindery emulation.

Prior to taking the above steps I experienced frequent and prolonged
periods of 100% utilization, to the point where servers would lock up
and have to be powered off. There were frequent crashes, and if you ever
had to bring a server back from a down state the entire network was
brought to its knees while the new server synched back up to the NDS
tree. After implementing the above steps I now get to sleep at night and
even enjoy my weekends. I will still occasionally see a 100% spike, but
nothing like it used to be.
------------------------------
Date: Mon, 30 Oct 1995 10:53:51 MST7MDT
From: Matt Zufelt

>>I recall that when Netware 4.x was first released that Novell's gurus
>>recommended that a container object should hold no more than 500 leaf
>>objects (ie users). Does this still hold true with NW4.1? Has anyone any
>>experience on this?
>
>I have 4000 accounts in one container on my 4.1 server. It doesn't seem
>to cause any major problems. NWADMIN even handles it ok. Whether or not
>there are performance penalties is another question. I have had and still
>have some performance problems on my 4.1 servers...

We have one container on a 4.1 network here that has almost 3000 objects
in it, and it has become a major source of heartburn. When the NDS
janitor process kicks in, the server essentially grinds to a halt. It is
extremely slow for the active users until the process finishes (which
can sometimes take 15-20 min.). Also, anytime we try to make a new
replica (or remove a replica) from another server on the tree, things
seem to stop. We tried to remove a replica of this context off a server
two weekends ago; after running all day Sunday and Monday, it still was
not removed. The server was still munching on it. We finally reset the
server. Now we have a "dying" replica on the server. It takes up disk
space, but at least it is no longer receiving updates--another process
that took an inordinate amount of time.
------------------------------
Server utilization will become very large if the server is handling
directly attached printers via polling (non-interrupt fashion).
Joe D.
------------------------------
Date: Wed, 8 Nov 1995 10:44:06 MST7MDT
From: "Timothy D. Porter"

We are having a real problem with some 4.1 servers here when the NDS
Janitor/Flatcleaner process starts. When it starts, the server
utilization goes to 100%, and response times for people logged in, as
well as other processes running on the server, grind to a halt. It takes
the janitor between 5 and 10 minutes to complete its run. Is there some
way to set the scheduling priority of this process lower so everything
doesn't stop while it runs? It doesn't show up in the scheduling
information screen in the monitor.
We have all of the latest patches installed. One of the machines that is
having the problem has 132MB RAM and a P90 processor.
------------------------------
Date: Thu, 9 Nov 1995 02:47:10 GMT
From: Rich Silva
Subject: Re: NDS weirdness "resolved"

Todd W Herring wrote:
>Our four server, NW 4.1 network had been experiencing some rather
>perplexing difficulties recently. From 100% processor utilization -
>to NDS replicas that weren't synching - to the inability to login in
>bindery emulation. For the first three months after upgrading to
>4.1, we didn't have any of these problems. They snuck up on us and
>pounced rather suddenly.
>
>The final straw, and the most puzzling, was the 100% processor
>utilization problem. Our servers, one by one, were attracting Get
>Bindery Object NCP packets from print servers all over our campus.
>This, we thought, was causing the 100% util. After applying a filter
>to our router, the packets stopped coming in, but the 100% util did
>not. Weird...
>
>We did all the normal things you'd do to resolve the problems -- ran
>DSREPAIR, VREPAIR, downing the server, unloading NLMs, updating to
>the newest patch, etc. We even called Novell. Novell techs wanted
>to dial in to one of our servers and do some teeth-clinching things
>to our NDS. We declined the offer, not knowing what kooky things may
>happen after they were done. So we bit the bullet, gathered all the
>information about trustee rights that we could, and re-installed
>NetWare on each of our servers. We began by removing the replica
>from a server, then renamed the SYS volume, downed the server, and
>re-installed NW 4.1 into a brand new tree. All went well (Thank God)
>and we rebuilt trustee assignments using batch files.
>
>Looking back it would have been nice to pin down the exact cause. We
>can only assume it was an NDS corruption. What caused the corruption
>we don't know. We suspect it may have been imported from the upgrade
>of a 3.12 server into the tree, although this upgrade did not
>immediately precede the corruption. That particular server had
>several unknown objects show up when running BINDFIXes, and the
>bindery objects were pulled into the tree during upgrade. Most of
>those objects were summarily deleted from our new tree because they
>already existed (we had Netsynced with a redundant server to pull the
>bindery into our tree). The bottom line is we don't know what caused
>the havoc.
>
>I know what you're saying. Couldn't we have restored NDS from tape?
>No, because in all our mess ARCserve got screwed up. Its bindery
>queue object was corrupted when I downgraded to ARCserve 4.0 to run
>an experiment. I couldn't re-install ARCserve 5.01g because I
>couldn't login in bindery emulation!! Ahh! I shiver when I even
>think of it!
>
>My advice if you haven't upgraded to 4.1 - learn as much as you can
>before you do; NDS is nothing like 3.x's bindery. Take classes, ask
>people who have upgraded, and when you do upgrade set up a test
>server and hammer it for a few weeks at least (test backups and
>restores, etc). My advice for those who have already upgraded -
>learn as much as you can before something goes wrong; NDS is nothing
>like 3.x's bindery. I would also suggest you find the resources to
>set up a test tree and server and test your backups and restores of
>NDS. You'll want to have the experience before it becomes necessary.

What happened is that the replicas should have been removed before the
server was brought down. The servers left behind get bombarded with
searches trying to update the replicas.
Sometimes it may never get back in sync, leaving you with 100%
utilization. I think this only happens if the server's IP address or its
name has changed, or something similar. But if this happens, the only
way I know to fix it is to reinstall. The moral of the story is: pay
attention to where you put your replicas. Only use them where you need
them, and when a server is going to be brought down, remove any replicas
from that server first.
------------------------------
Date: Fri, 10 Nov 1995 12:50:22 -0800
From: Jonn Martell
Subject: Utilization at 100%

This (long) message contains a description of two server utilization
tools and a question regarding how to track individual process CPU
utilization.
....
We had a problem with one of our servers this week. The utilization went
to 100% and stuck there (for several agonizing minutes - felt like
hours). Although it's a 4.1 server, it's a single server tree, so the
old NDS bug probably isn't the problem. Nothing on the server had been
added or deleted and we aren't running CDROM.NLM.

This is the first time this problem has happened and it's quite scary,
because at 100% utilization the server stops accepting connections,
users can't access the server and the console "feels" stuck. After
trying to identify the rogue process in monitor (without success) I
decided to unload monitor, and that seemed to unlock it; the utilization
stabilized back to normal (300 connections with an average of 15% CPU
utilization). The console log shows "A scheduled "Work To Do" took over
one minute to run." and the system error log shows nothing. The server
has been running fine since.

In trying to locate tools that would allow me to isolate the problem I
found two that display server information (including utilization)
graphically over time. Both are 3rd-party commercial products, and
evaluation copies are available over the net.

The first is Nconsole by Avanti Technology. The NLM provides Monitor
information in a much better format. It not only has current utilization
(like monitor) but also displays average and peak. The screen saver is a
histogram that shows utilization (current, average and peak). The
Windows client shows utilization and trends (although the Windows client
would probably fail if the server hits 100% again). Available from
http://www.avanti-tech.com/~prodinfo/ They also have an SNMP version
which looks much better than the expensive NMS agents by Novell. Tech
support is very responsive.

The second is NetTune Pro by Hawknet, available from
http://www.cts.com/~netinfo/nettune2.html. NetTune is much more powerful,
although I found several bugs with the more advanced features
(documentation and set parameter tuning). It only runs as client-server
with a Windows front-end and an NLM back-end. It's a great tool to
document your server!

Now for my question: Nconsole shows me that there is a process that
sometimes makes utilization jump to 100% (for a fraction of a second).
This would be very hard to pick out in monitor, but Nconsole makes it
very apparent. There is no pattern that I can see except that it does it
at least a few times per hour. Does anyone know of any tools that can
isolate which process is making utilization jump to 100% for a fraction
of a second? The Avanti-Tech folks said they are working on individual
process CPU utilization monitoring but they don't have anything right
now.
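A small aside before moving on, echoed by a report near the end of this
document: when a server seems pinned at 100% but is otherwise behaving,
it is worth ruling out a stuck MONITOR reading before doing anything
drastic. At the console:

  unload monitor
  load monitor

In the message above, unloading MONITOR appeared to free the server
itself; in the later report only the displayed number was wrong. Either
way it is a cheap first check.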
------------------------------
Date: Tue, 21 Nov 1995 17:06:30 -0800
From: Charles Martini
Subject: Re: NDS over WAN

>I would be very interested in your replicating strategy and experience with
>traffic generated by NDS synchronization over slow WAN links. Would it
>be feasible to open a WAN link only temporarily, say four times of 15
>minutes each a day, just to force NDS to do its sync-work?

TO THE BEST OF MY KNOWLEDGE (and I may be wrong on this, but I don't
think so), there is no way to control when NDS does its syncs. BUT, I do
have some data from Novell on when NDS syncs happen and how much traffic
they generate.

"High Convergence" activities, such as user/object creation or deletion,
are synced every 10 seconds. "Low Convergence" activities, such as
updating Login Time & Addresses, sync every 30 minutes. There's an NDS
heartbeat that syncs every 30 minutes as well. NDS verifies backlinks
and external references every 25 hours.

Sample traffic loads (bytes):
  Replica heartbeat: 750
  Sync ten users, 2 replicas: 6286
  Sync ten users, 3 replicas: 14108
  Create user, workstation to server: 13000
  Create user, server to server sync, 3 replicas: 15796

Tree walking: you'll also be stymied in trying to schedule NDS
connections if any of your users need to authenticate to remote servers.

Bottom line: NDS sync traffic is pretty minimal, so it shouldn't be too
costly if you have only a few remote servers & slow WAN links. If,
however, you have a lot of remote servers & remote users, and want to
share resources across the entire WAN, you'll obviously need to invest
in faster links.
------------------------------
Date: Fri, 24 Nov 1995 09:48:03 GMT0BST
From: Kens Mail List Host
Subject: Re: IPXRTR NCP Work To Do

>>Some of our Netware 4.1 servers occasionally jump up to 100% CPU
>>utilization and stay there for some time. The console reveals that
>>almost all the work is being done by the 'IPXRTR NCP Work to do'
>>process. We are not routing IPX on the server and it is not heavily
>>loaded with users. Does anyone know what this means?
>
>Sounds like you have a serious NDS problem. Go to the server console and type:
>   SET DSTRACE=ON
>   SET DSTRACE=+ALL
>   SET DSTRACE=+SYNC
>   SET DSTRACE=*H
>Switch to the DSTRACE screen and watch for any unsuccessful updates
>and/or errors. If you do have problems, I would recommend contacting a
>NASC or Novell Tech Support directly, unless you are an NDS
>troubleshooting expert.

This is not an NDS problem. We've just spent several months getting to
the bottom of this one. It's related to packet burst and the VLMs.
Novell have a patch called PBRSTOFF which disables packet burst on the
server. It took our utilisation from 80-100% down to 0-30%. I'm not sure
the patch is generally available, but they will give it out on demand.

Ian Kennedy
------------------------------
Date: Sat, 30 Dec 1995 14:09:12 -6
From: "Mike Avery"
To: netw4-l@bgu.edu
Subject: Re: Two network Cards

>>Following up on this interesting topic with another question...
>>What about the scenario of an exclusively Cisco-routed network with
>>multiple local 16Mbps token-ring segments. May a NetWare 4.1
>>server improve throughput to clients on the segments (IP or IPXng)
>>by adding NICs which directly connect to a unique segment?

>It depends on where your bottleneck is. If you've got 80-90%
>utilization on the server's ring, you'd benefit.
>How much depends on your server hardware, # of users and traffic
>particulars, but on a modern server with EISA/PCI/MicroChannel and a
>sufficient # of disks you should be able to increase your throughput
>several times. Way back when 386/16s were new, a pretty complete test
>showed double performance with a second NIC, with little improvement
>beyond that - but that was 2.15!

The book "Optimizing NetWare Networks" by Rick Sant'Angelo (M&T Books)
covers this topic pretty well.

The history of data processing and computer science could be viewed as a
matter of moving the bottlenecks. Yesterday's solution becomes today's
problem. More NICs can be added, but each NIC will increase processor
overhead. At a certain point, adding more cards will actually start to
decrease performance. Some cards use more system resources than others,
and that is due to a combination of hardware and software. 3Com's
Ethernet cards deliver high performance, but at a high price - as much
as 1/3 of a 486 server's CPU resources can be spent servicing the NIC,
according to some reports.

The point of diminishing returns can be reached rather quickly. As the
network load goes up, it sometimes makes more sense to put a high-speed
backbone in place connecting the servers to routers, let the users
connect to other segments (or rings, depending on your topology), and
let the routers handle the routing services.

In one case, we had a Compaq file server with three 10Mbps Ethernet NICs
in it and over 200 users. More were routing through the server. We
removed two NICs, cabled the server to a router, and had all the users
go through the router. The performance was greatly improved. It was
further improved when we removed the Ethernet NIC and put in an FDDI
card. (I inherited the original server configuration....) The
performance of 100Mbps Ethernet may well be comparable to that of FDDI
for most applications, at a considerable savings.

All in all, routing seems to be more expensive in terms of CPU
requirements than one might think... and getting rid of the contention
for the Ethernet segments probably also helped.
------------------------------
Date: Mon, 1 Jan 1996 01:55:05 -0800
From: rgrein@halcyon.com (Randy Grein)
To: netw4-l@bgu.edu
Subject: Re: Two network Cards

>I'd like to explore the routing issue more.
>
>My opinion is that peak client and overall system throughput would be enhanced
>by supporting demanding local segments with direct access to shared server(s)
>via multiple NICs. I've seen PERFORM3 benchmark volume throughput (e.g.
>delivering server-located applications) on a local segment come out twice or
>more faster for a client than reaching out through a Cisco router backplane
>to servers on different segments.

People will whine and argue, but the fact remains that you are correct -
crossing routers does take time, especially if you're not using packet
burst. My boss wrote an article about the subject several years ago,
comparing the performance of routers vs. bridges. This is appropriate in
the current "switch" debate, as a switch is really nothing more complex
than a jumped-up bridge. I've not seen quite this extreme a performance
differential, and there are a couple of caveats:

1. The penalty is largely negated if you use packet burst.
2. Routers do much more than bridges/switches; make sure you'll not need
   the extra functionality.
3. While the speed reduction is measurable and important, it may not
   matter in many situations.
   Aggregate throughput will be essentially the same (modern routers and
   switches both forward at or nearly at wire speed), and unless the
   user is moving many megabytes of information the difference in time
   is usually not noticeable.
4. Be careful using the PERFORM series to draw performance conclusions.
   It generates a VERY artificial load which is only valid for
   preliminary analysis. Other, more complex tools are available, more
   accurate but harder to use.

>Establishing more than one logical routing path between clients and servers
>looks like a problem. Is it still reasonable/workable to disable routing on a
>NetWare 4.1 server to enhance throughput with multiple NICs for selected
>NetWare client segments while avoiding an unsupported mesh routing situation?

The "mesh" or "web" routing network IS supported - it bears some
advantages in basic fault tolerance. In fact, up to a certain size it's
advantageous to place each server on each segment by installing an
additional NIC. Large WANs use this concept to provide fault tolerance,
although I believe they use OSPF instead of RIP to resolve paths and
reroute around downed links. You can disable routing if you wish, but
it's not necessary - I wouldn't recommend it unless you had a specific
need.
------------------------------
Date: Thu, 4 Jan 1996 21:38:00 -0800
From: rgrein@halcyon.com (Randy Grein)
To: netw4-l@bgu.edu
Subject: Re: 100% Utilization crashed the server

>Our 4.1 server crashed three times yesterday; the utilization
>was up at 100% in all instances. The console was frozen and
>therefore I couldn't tell what open files caused the crash.
>Does anyone know of a way to track this resource hog?
>Any utility out on the market that may do the job?

I hate to tell you this, but open files do not cause a server to crash!
However, what you are looking for (a tracking utility) is more or less
possible, but reconstruction is difficult at best.

1. Load conlog MANUALLY after the server mounts. It will then write any
   error messages to the console log, which will NOT be overwritten on
   reboot. It overwrites the log file when reloaded.
2. NOTHING will track reliably during a server crash like this - no
   software, anyway. The trouble is that the instrument to be monitored
   (the OS and CPU) is being used to perform the diagnostics. This is
   invariably limiting. The closest you can come to the type of tracing
   you're looking for is purchasing server machines from Compaq, HP or
   IBM. These, in addition to using ECC memory, also have diagnostic
   circuitry built in that can at least detect hardware problems
   independent of the CPU or OS.
3. What you really need to do is diagnose the utilization problem before
   it gets out of hand. Look for things like backup, DS.NLM less than
   version 3.89, no patches, incorrect DS partitioning, or NT servers on
   the network using MS IPX emulation. There have been verified reports
   of the latter abending SFTIII servers because Microsoft used some of
   the wrong communications sockets; it's remotely possible this could
   be a problem.
------------------------------
Date: Mon, 8 Jan 1996 06:02:30 -6
From: "Mike Avery"
To: netw4-l@bgu.edu
Subject: Re: 100% Utilization crashed the server

>>>Our 4.1 server crashed three times yesterday; the utilization
>>>was up at 100% in all instances. The console was frozen and
>>>therefore I couldn't tell what open files caused the crash.
>>>Does anyone know of a way to track this resource hog?
>>>Any utility out on the market that may do the job?

>>Have you applied ALL the patches and fixes?
>I also have a sick 4.1 server. Anytime you attempt to load a
>console utility, i.e. INSTALL, the processor util goes to 100%.
>At first we thought the DISCPORT NLMs were at fault. (I had
>problems with this just after upgrading the server to 4.1.) I have
>applied the latest (fall last year) patches to the server. As soon
>as time allows I will be checking for later patches.

If you check the current Novell patches, a number of them address the
100% utilization issue. However, the documentation also makes it pretty
clear that 100% utilization, by itself, is nothing to get excited about.
If it stays at 100% AND the performance of other tasks is impaired, then
it is an issue.

The latest revision of the DiscPort software is 4.10a. There are some
significant improvements over the earlier versions, especially in a
NetWare 4.10 environment.
------------------------------
Date: Fri, 19 Jan 1996 08:30:39 +0100
From: Henno Keers
Subject: Re: Cause of CPU utilization increase? [4]

>>>Background:
>>>We are running NetWare 3.12 on a 486-33 EISA box in an educational
>>>setting. Our 100 user license is just about maxed out by 3 labs and
>>>a number of office machines. We do lots of remote booting, and most
>>>applications are loaded from the server.
>>>
>>>Question:
>>>How can we tell what processes running on the server are heavy
>>>contributors to an increase in CPU utilization? We have always had
>>>spikes of high utilization, but these could be attributed to specific
>>>temporary causes (people copying files to/from local drives, a bunch
>>>of machines booting up or loading software at the beginning of a lab
>>>period, or similar things). What we have noticed recently is
>>>utilization ramping up and staying there for a while. The spikes
>>>don't concern us much (should they?) but longer periods of high
>>>utilization do. Depending upon what the cause is there may not be
>>>much we can do, but it would be nice to know specifically what's
>>>causing them anyway.
>>>
>>>To quantify things a bit: our server idles most of the time in the
>>>teens or even single digits. Spikes have been up to the 60-70% range
>>>for possibly a few seconds and certainly less than a minute. Our
>>>recent high utilization has been hovering in ranges from 40% to 90%
>>>for 5-10-20 minutes at a time (possibly longer, but that's as long as I
>>>actually watched it).
>>>
>>>To answer the obvious question "if it's a recent problem, what have
>>>you changed recently?", the only significant recent change is
>>>Internet access. Our LAN is attached to the organization WAN. The
>>>server is running Mercury as a mail agent, and people are starting to
>>>use Web browsers. However, this is really only in its infancy. We
>>>don't have labs full of 25 people all trying to run Netscape at the
>>>same time...yet. Could 3 or 4 people running Netscape generate that
>>>significant a load for the server? (An additional 50% CPU
>>>utilization?)
>>>
>>>A related question:
>>>If the answer to the first question is "get some network
>>>management/monitoring software", is there anything available in the
>>>shareware/freeware category? (Even if it only generates simple
>>>reports on CPU utilization, concurrent logins, network traffic...)
>>>Educational budgets won't support much these days...but you already
>>>knew that. :)
>>>
>>>Mark Holland
>>---------------
>> What a well written report! Terrific.
>> It just so happens that I have a 486-33 EISA NW 3.12 server in
>>a public lab environment which the other day showed 100% utilization for
>>many many minutes on end, according to my student lab consultants on duty.
>> Normally printing to a printer attached to a server is either a
>>small activity if using interrupts or an intensive polling activity
>>otherwise. But that brings server utilization up to the 40-60% level, more
>>or less, and it's obvious what's going on (from the noise alone).
>> We don't see 100% utilization from normal activities. Something
>>strange was occurring, but what? (As you ask too.) I wasn't there to see.
>> I looked at MONITOR and discovered one lan adapter had far more
>>received bytes than it ought, by about 1GB in a day. Ah ha! Some student
>>was pulling in bytes by the truckload, with no place to store them (diskless
>>clients). What could that person be doing? Hard to tell. Putting up a
>>temporary IRC/MUD relay station is a guaranteed way of using all available
>>bandwidth, and we have killed off a number of those things. Running a
>>home-grown networking program is another way of bringing a server to its
>>knees (sit in a tight loop hammering on the server). EISA machines kneel
>>but don't submit to such abuse, and thus "they keep on running and running,"
>>battery bunny style, no matter what the load.
>> We had a recent experience with a lan adapter test program, shipped
>>on floppy with the Ethernet board, putting illegal packets on the wire
>>as fast as it could go. I told the list about this situation (pkts spread
>>throughout the campus and caused quite a flap). A generic no-name NE-2000,
>>in case you are interested. Anyone can run such programs in a lab just
>>by stuffing a disk in a floppy drive. The identifier, in this case, is
>>server lan adapter statistics.
>> I presume it is possible for a commercial programmer to get it
>>wrong and hammer the daylights out of a server while trying to print
>>or write a file while that attempt fails. Again, lan adapter stats should
>>show the traffic, but that's about all we would see.
>> Finally, most of us have a look at recently added NLMs with the
>>feeling that not all may be well in NLM-land. It happens. Unloading what
>>we can is certainly worth doing during server saturation. Tape backup
>>programs are candidates, as are virus scanning programs, and whatever
>>other goodies are present (Pmail in your case, though it's always been
>>well behaved in my environment). Software metering programs are candidates
>>too. The problem isolation thought here is: if lan stats are normal then
>>activity is occurring within the server itself, and unloading is the
>>quickest way of turning off internal items.
>> My US$0.02 on the matter.
>> Joe D.
>
>We all have our horror stories, I hope you don't mind mine...
>
>We, too, found our 5-30% utilization jump steadily one day to 60%,
>70%... up to 100% utilization.
>
>LANAlyzer for Windows found an 'unrecognizable bunch of stuff'
>occupying more and more of our bandwidth. We were able to isolate it to a
>general location in one of our buildings.
>
>A 25pin-to-RJ45 adapter had been used on a PC to allow a network print
>TSR to run a serial printer located across the hall. Cat 3 cable was run
>from the PC to the printer.
>
>A user moved the PC with all its cabling from one location to another.
>It seemed to him that the 25pin-to-RJ45 adapter was a 'network connector'.
>He plugged it into the wall jack, along with the patch cable for the
>NIC.
>When he booted his PC, the TSR for network printing fired up [along
>with standard DOS serial stuff], flooding the network.
>
>Daniel E. Cullinan

A couple of years ago I was working as a field service engineer at a
large PC shop (which went financially flat on its face). I was in a
branch office working with a colleague. He was moving an original IBM AT
with a monochrome display adapter and a token-ring card from one place
to another, where he hooked the token-ring (NetWare) network up to the
MDA DB9 port ... and switched on the PC. The token-ring net went totally
silent the moment he flipped the switch; needless to say, the users in
the rooms down the hall were not totally silent after this incident {;-)
------------------------------
Date: Tue, 23 Jan 1996 17:37:40 EST
From: Paul Massue-Monat
Subject: High CPU Utilization - Servman.nlm - Netware Connection

In the latest Netware Connection magazine (January/February 1996 =
volume 7, number 1), one can find an interesting article on servman (as
in "load servman.nlm" on the server console). The magazine is published
by NetWare Users International (1-801-221-9634); Novell and/or
knowledgeable people contribute to it regularly. (To join a user group,
call 1-800-228-4nui.)

Anyway ... since there's a thread running at this time on "high CPU
utilization", I want to repeat what is said in the article written by
David Doering (74106.1551@compuserve.com), a senior analyst with
Technical Services, a Provo, Utah-based consulting company.

Before I start, let me say that personally I've noticed "high util" when
I load NLMs in the autoexec.ncf. Depending on the order I load the NLMs,
I sometimes can get an eternal 100%. The server is doing something and
it gets caught. I shuffle the NLMs in the autoexec, down the server
gracefully and re-boot. I have now found a stable order and I'll try to
keep it that way. (I'm still testing third-party products on a
production server: ugly ugly ...)

May I also add that the article informs you about things that are NOT in
the red books, of course. There's info on free blocks, on directory
entries, etc. Very useful for understanding NetWare. In the hope of
helping someone, here goes:

--------- start of quote
High util is often caused by a server process that is consuming an
inordinate amount of CPU time. Using the monitor utility, you can see
how much time the CPU spends servicing each server process. In monitor,
choose the Processor Utilization option, then press F3. You will see a
list of processes along with each one's load percentage.

If the process that is using the most CPU time is one with a low
priority (such as the monitor utility itself), you can change how the
CPU services that process. To do this, choose the Scheduling Information
option from the monitor's main menu and select the process you want to
change. Then press the + or - sign. For example, if you increase the
value from 1 to 2, the CPU will handle the process half as frequently as
it did before.

High util can also be caused if NetWare is using suballocation blocks to
save small files and the volume does not have enough free space. NetWare
4's suballocation feature conserves space by saving files that are
smaller than the physical disk block size in suballocation blocks. These
blocks allow for files as small as 512 bytes. A single 64KB physical
disk block can hold as many as 128 suballocation blocks.
If the amount of free space on a volume falls below 10-20 percent, the
suballocation routine does not have enough space to perform this task
efficiently.

Finally, high util can be caused by servers that include power
consumption firmware in the CMOS. If your system has a CMOS-based power
saver, you should verify that it is off when you boot the server.
Otherwise, NetWare spends time preventing the power-down feature from
activating when the server is idle.
----------- end of quote
------------------------------
Date: Wed, 24 Jan 1996 22:54:07 GMT
From: Kevin Kinnell
Subject: Re: Novell 4.1 Server crashing with network overload...help!

>You're not backing up at the same time as compression is taking place
>(default 12 midnight to 6 am) are you?
>
>Decompressing isn't a high utilization issue -- in fact, you can barely tell
>the difference in loading a file if it has to be decompressed (I can tell,
>but I'm looking for it!) If you want to test the compression angle, use the
>SET command to turn compression OFF for the server. That will keep the files
>from recompressing and you should have the most common ones decompressed
>within a pretty short time with that many users.
>There have been some problems with high utilization, but the largest problems
>I see concern continuous high utilization, not spiking. What type of system
>are you running?

Isn't there a static patch for the loader that addresses the utilization
spike? Seems like the original code would peak the utilization during a
load if there was an interrupt generated.

If I had to theorize, I'd guess that Aaron is doing exactly what you
suggested with the compression (compression is occurring at the same
time as backup) plus the un-patched loader, which is causing the usage
spike.

Aaron, are you using ArcServe?
------------------------------
Date: Thu, 25 Jan 1996 21:00:15 -0600
From: Joe Doupnik
Subject: Re: Slow Server

>I need some suggestions as to why my network seems to slow down at some
>times during the day.
>
>Currently we are running Novell 3.12 BNC with 3 NE2000 cards and 6 loops.
>During peak operation in the day (noon to 3) a lot of users are on the
>server and some workstations seem to get bogged down. They run real slow.
>But towards the end of the day things get back to normal speed. The server
>does not have a lot of free storage capacity and fluctuates from 80 to
>100 meg of free storage.
------------
Hmmm, that one will require a little head scratching. There isn't a
direct answer that we can offer from here because there are lots of
candidates. But let me suggest a start to your own investigations.

Use MONITOR and examine lan traffic on each board. If a wire is pretty
well saturated, say by running > 1000 pkts/sec on average, then time on
the wire is a bottleneck. If you have limited packet receive buffers,
and the "no ECB" count keeps going up, then the comms channel in the
server is a problem.

Often the problem is in the server disk system, and we don't know what
that looks like. I certainly would look into that part before jumping to
other conclusions. I run a busy server with three NE-3200s and it does
not bog down even under heavy load. A 486-33 server is handling 49
Pentium 90s being used by engineering students and others, and
surprisingly it does the job well.

A very crude and quick estimator of server saturation is watching the
number of server processes over time on NW 3.x. If it keeps edging up
then the server is really being pushed hard.
The one above levels off at about 13 (MONITOR shows only the peak value
and hangs on to it). A better measurement is to employ Lanalyzer for
Windows by Novell and look for server overload packets. That analyzer is
a good tool for sizing up your network.

Another part of the equation is how strong the clients are. If they drop
lots of packets under load then matters crawl. LZFW can help one see
that too (see repeated replies, particularly when Pburst is used).
Finally, the wires may carry other traffic and have bridges and routers
in series. What happens to those boxes can have a major impact on your
IPX services.
Joe D.
------------------------------
Date: Sat, 3 Feb 1996 16:10:58 -0500
From: Daniel Tran
Subject: Re: help: high cpu utilization under 3.12

>We are running Novell 3.12 and recently we began experiencing a
>problem that suddenly monitor would indicate 40-50% utilization and
>all disk requests are very slow.

load monitor -p

...with the -p parameter, it will give you the processor utilization.
When you get high utilization, go to monitor, go to processor
utilization, hit F3. At this point you can see what process is eating up
your CPU cycles.
------------------------------
Date: Thu, 22 Feb 1996 01:07:37 -0500
From: ATL1DDJ@aol.com
To: netw4-l@bgu.edu
Subject: Re: High Server utilization

I stayed at work until 6:00 am one day going crazy about the very same
thing. I turned off every workstation and printer and unloaded every NLM
except for monitor, and the utilization was still 98%.

If you go into processor information from monitor, expand the window to
full screen (I think F3 or F6) and monitor every process. You will
notice a large percentage of process time in the idle loop (like 90%).
This is Novell's way of keeping the processor busy while it's not doing
anything significant. As soon as another process starts up, Novell
displays the *true* utilization.

BTW, if you look at the bottom of this processor resource screen, you
will see the result of subtracting the idle loop utilization from the
server utilization. Why Novell did not display this % on the main
monitor screen I do not know.

P.S. In monitor use the help feature (F1). It explains the above in more
detail.
-David
------------------------------
Date: Thu, 22 Aug 1996 22:12:53 -0400
From: Glenn Fund
Subject: Re: Fine tuning Servers

Get hold of BMC's NETtune Pro and let it discover your server(s) and
make recommendations for optimization. Great server monitoring and
optimization tool.
------------------------------
Date: Wed, 23 Oct 1996 20:48:06 -0600
From: Joe Doupnik
Subject: Packets/sec story, with numbers

I just got back to my office tonight after being called with the
classical message "The network is slow!" in one of my student labs. What
happened is sort of interesting.

The NW 4.11 file server (a 486-33 EISA bus machine) registered 100% cpu
utilization, dstrace said not fully synchronized, users were grumbling
audibly. Monitor also said a couple of lan adapters had very high "No
ECB available" counts, climbing quickly. The packets sent and received
numbers were also climbing rapidly. All were in the many millions.
Normally there are no "No ECB available" counts. What the heck was going
on?

I took another sip of coffee, thought, and poked the server keyboard.
No, the wiring (coax) was just fine. Plenty of server memory. The
directly attached printers were going fine, but user apps were extremely
sluggish. The server did look ok except for the ECB loss rate and the
cpu utilization value. Oh.
There is someone running Lanalyzer for Windows, and wow! Look at the
packets per second dial, over in the red zone! 3000 pkts/sec!!! Yikes,
no wonder things were slow with that traffic rate as competition.
Capture a few thousand packets (while grumbles increase in volume) and
they were tinygrams, 64B guys, zillions of them, TCP/IP Telnet, going
from one wire in the room through the server to another wire in the
room.

Ah ha! I know what is happening. It's my grad Computer Networks class
doing their homework assignment. That was to measure throughput versus
packet size (TCP MSS) for various situations, with MS-DOS Kermit acting
as the TCP app at both ends of the connection. Sure enough, someone had
tried an MSS (the TCP payload) of 16 bytes, generating tinygrams. ECBs
were not available because the packet rate was high enough to exceed the
rate at which the server could supply buffers to the NE-3200 boards, and
some packets were consequently lost. (No problem, TCP repeats them and
thus keeps adding load.) Once we stopped the file transfer test
everything was perfectly normal again.

The overhead of handling tinygrams with a bus master board is often
greater than with a simpler port i/o or shared memory board because of
the busy work setting up a block transfer. Hence a simpler board, say an
NE-2000 flavor, would have done less work moving tinygrams than the
better board, and the other way around with larger pkts.

A general rule of thumb on 10Mbps Ethernet is that 1000 packets per
second is a hefty load on machines. Here we were at triple that rate,
and in this case the server was acting as an IP router rather than as a
file server, so we did not have delays while the server's disk drives
were accessed. MS-DOS Kermit has rather efficient Kermit and TCP/IP
protocol stack code, in this case altogether too efficient. By the way,
the throughput was only 684 file data bytes/sec, compared to about
160KB/sec with normal-sized IP packets. All that overhead of headers
with just a few data bytes, plus repeats for lost frames.

If the server were less strong, or some other item in the server were
consuming lots of resources (say running a tape drive for backups), then
the same loss of ECBs would arise simply because the server could not
attend to every arriving packet. In this case the server was healthy but
the packet rate was way out of bounds. And we see the server is not an
IP router which routes packets "at wire speed" comfortably.

It certainly can be handy to be the system manager as well as the course
instructor, because I didn't have to blame anyone else (this time).
Joe D.
------------------------------
Date: Thu, 31 Oct 1996 08:48:17 +1100
From: Adrian Moore
Subject: Re: 4.1 server utilization at 100%

Check out TID 2905856, which recommends preallocation values for service
processes and receive buffers. This article is titled "Additional Notes
for High Utilization Issues". It is a supplement to TID 1005963.

Alternatively: did you recently put LANDR8 on? If so, are you using an
ODI 3.3 spec LAN driver? If not... look to the vendor web site for the
ODI 3.3 .lan driver. I've had one report where high utilization was
caused by using the ODI 3.3 ETHERTSM, MSM, NBI combination with a
pre-ODI 3.3 LAN driver. There was an executable which would check this
for you, but I do not recall which kit it was shipped with. A quick
search of LANDR8 and Client32 hasn't jogged my memory.
------------------------------
Date: Wed, 30 Oct 1996 19:12:53 +0100
From: Urban Svensson
Subject: NW 4 and 100 % utilization

>Jackie D. Firkins had problems with NW 4.1 and 100% utilization.
Have you checked this:

1. There is a newer MSM and ETHERTSM in LANDR9/8.
2. What is Upgrade Low Priority Threads set to? It should be OFF.
3. Are you really using 410PT6?
4. Is Maximum Service Processes hitting the upper limit?
5. Is Maximum Directory Cache Buffers high / hitting the limit?
6. Not yet too familiar with the 3Com XL, but is this not an ISA card?
   Could this cause the server to kneel?
7. Everything else OK? Memory - enough? Disk space at least 15 per cent
   free, and so on?
------------------------------
Date: Tue, 12 Nov 1996 21:51:14 +0100
From: Hakan_Andersson
Subject: Re: 100% utilization -Reply

>>>Every now and then, about once an hour, the utilization of my 4.10
>>>server (486DX50, Adaptec 2742T EISA, Seagate Barracuda 2GB) shoots
>>>up to 100% and the drive goes frantically for several minutes.

This sounds very much like "aggressive mode", which can be triggered by
a volume being 90% full or by a disk not having more than 1000 blocks
free (hardcoded). Note: blocks free does not include the purgeable
blocks, so even though you have maybe 50% free there could be less than
1000 blocks free on a disk.
---------
Date: Tue, 12 Nov 1996 16:35:24 -0500
From: "Martin C. Mueller"
Subject: Re: 100% utilization - Solved

>>>Every now and then, about once an hour, the utilization of my 4.10
>>>server (486DX50, Adaptec 2742T EISA, Seagate Barracuda 2GB) shoots up to
>>>100% and the drive goes frantically for several minutes. Checking the
>>
>>Just a thought, block suballocation cleanup?
>
> Never heard of such a thing. In fact, I can't imagine such a thing.
>And it's never been seen here either.
>
> Seeing NDS activity is usually simplicity itself: console SET DSTRACE
>ON. Watch its screen. In addition, snooping the wire is very useful. Novell's
>Lanalyzer does a good job with this.
> Joe D.

The solution lay in the direction of compression. The cause was telling
the server to immediately compress deleted files (deemed useful at times
of low disk space and never revoked). It turned out that a handful of
users deleted substantial amounts with a certain abundance and
periodicity. This triggered the high utilization and the disk activity
in spite of the server not serving requests. I still don't understand
why the compressing thread's CPU share couldn't be seen in the process
table.

Best of all - it's been mentioned in the books! In this case, the fine
text UTILIZAT.TXT within 410pt6.exe by Rich Jardine. Again I skipped a
seemingly not-so-important technical document to "get work done" >:-(
At least, I remembered there was something :-) . (BTW, the text also
mentions suballocation and NDS as possible causes, among others.)
---------
Date: Wed, 13 Nov 1996 14:26:30 +1300
From: "Baird, John"
Subject: Re: 100% utilization -Reply

>>>Every now and then, about once an hour, the utilization of my 4.10
>>>server (486DX50, Adaptec 2742T EISA, Seagate Barracuda 2GB) shoots up to
>>>100% and the drive goes frantically for several minutes. Checking the
>>
>>Just a thought, block suballocation cleanup?
>
> Never heard of such a thing. In fact, I can't imagine such a thing.
>And it's never been seen here either.
>
> Joe D.

Yup, it does happen from time to time, but I don't know at what priority
it is done under normal circumstances and whether it has a noticeable
impact on server performance. File fragments are stored in chains of
suballoc blocks, with separate chains for each fragment size, i.e.
< 512, 513-1024, etc.
As files are expanded or deleted, holes appear in the chains, and
NetWare will occasionally compact the chains and release unused blocks
for 'normal' storage.
------------------------------
Date: Thu, 14 Nov 1996 09:39:58 -0500
From: Craig Lyndes
Subject: SOLVED! New Epoch creates 100% utilization -Reply

I thought I would share this solution, and warning, with the list.
"Repair time stamps and declare a new epoch" (DSREPAIR) is apparently
very processor and resource intensive. The Novell tech who finally
talked me through the repair said it can create problems in partitions
that otherwise appear to be fine.

When doing directory and partition operations, the Transaction Tracking
System (TTS) must be active or everything comes to a halt. What had
happened to me is that while doing the new epoch, TTS had run out of
resources (64 Megs RAM in an 8 GIG, 250 user 4.1 server - Joe, could
this be where Novell comes up with their rather excessive memory
requirements?) and shut down. No matter how long I waited, nothing was
going to happen. When I powered off, I corrupted the DS databases on
that server so badly that whenever they were loaded the server would go
into an infinite loop. The repair then became a matter of deleting that
replica and the DS databases on that server and recreating them without
losing any objects' attributes - a $200.00 Novell call with the tech
using lots of undocumented switches and commands.

The solution for next time: whenever TTS shuts down or you get -621
errors in dstrace, restart TTS. It will finish what it was working on
and go after the problem again. This may have to be repeated. I also
upped the server RAM to 128 Megs.
------------------------------
Date: Thu, 6 Feb 1997 15:38:32 +1000
From: Mark Cramer
Subject: Suballocation & upgrade to 4.1

>After the upgrade from Netware 3.12 to Netware 4.1, if I like to use the
>suballocation feature on the volumes I have to set it manually and remount
>them. However if I want the suballocation to become *effective* the only
>way I can figure is to backup, delete and restore the entire volume
>contents. Somebody have a better idea?

After the discussions on suballocation here in the last few weeks, I've
been doing a bit of background testing to see what happens with
suballocation's aggressive mode (when a volume goes under 1000 free
blocks). You see some surprising things!

When you're above 1000 free blocks, files start at the beginning of a
block and the file tails go into the suballocation chains. When you go
under 1000... I took a 1gig, 64K block volume down to 590 free blocks
(~40M) and copied a number (~200) of small files to the volume, and the
free blocks went UP! not down. The suballocation process shuffled small
files into the suballocation chains, freeing up full blocks. It
obviously shuffled preexisting files into the chains as well. After
copying about 2000 160-byte files to the disk, I had 623 free blocks (as
reported by John Baird's Vol_Info program).

I then filled the volume with large files; free disk space went to 0
bytes (as reported by DIR and NDIR), but lots of free sectors were left
in the suballocation chains. So I copied another 800 small files to the
drive, no problems, free disk space still at 0.
At no point did utilization on the server become excessive (P133, 32M,
Adaptec 2940UW and an old Fujitsu 1Gig 5.25-inch full-height drive,
NW4.11 2-user test box); it hovered around 10-20%, though the disk
channel audibly was taking a hammering for short periods (tens of
seconds, not minutes).

After you go under 1000 free blocks, NW seems to have no problems
starting files in the suballocation chains and moving other files
around. You will get extra (noticeable) disk activity, and some
performance degradation, but it appears to handle the situation well.
YMMV; this wasn't a rigorous test, just interesting.
------------------------------
Date: Mon, 10 Feb 1997 12:42:36 -0500
From: "Brien K. Meehan"
Subject: Re: SERVER-4.10-3227 error (file compression)

>SERVER-4.10-3227
>Severity=2 locus-2 Class=20
>Compressed files are not being committed decompressed on volume
> SYS due to lack of space.
>
>Could anybody tell me why this message is being displayed ?
>Is 80Mb left on SYS not enough for something ?

Netware 4.1 comes with automatic file compression. It's "on" by default
when you install a server, but you can turn it off as you're creating
the volumes during custom installation. So, if you didn't do that (and
it looks like you ran the "Quick Setup"), it's on.

There's a bunch of settings related to compression. One of them is "Days
Untouched Before Compression", and the default value is 7. If you don't
touch a file in the span of 7 days, Netware compresses it and stores it
on the disk compressed, and passes the savings on to you! When you
"touch" the file after that 7 days, Netware opens it, presents it
decompressed, and writes it back to the volume decompressed (depending
on some other settings...).

You, Sara, are running into another setting, namely "Decompress Percent
Disk Space Free To Allow Commit," for which the default value is 10.
That is, when you open a compressed file, there has to be 10% free space
on the volume in order for Netware to write the file back to the volume
decompressed... and your 80MB free is pretty close to the 10% mark.

Have a look at the documentation regarding compression, or at least have
a look at SERVMAN to see what your settings are, and what your general
compression situation is.
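If compression is a suspect on your own server, the settings described
above can be reviewed from the console before changing anything; on
NetWare 4 servers, entering a SET parameter name with no value should
simply display its current setting (SERVMAN shows the same values). A
minimal sketch:

  set days untouched before compression
  set decompress percent disk space free to allow commit

And, as suggested earlier in this thread, turning compression off
altogether is a single setting (the exact parameter name below is from
memory, so verify it in SERVMAN first):

  set enable file compression = off

Files that are already compressed stay compressed until they are next
decompressed and rewritten, so the change takes a while to work through
the volume.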
------------------------------
Date: Wed, 19 Feb 1997 21:12:36 -0600
From: Scott Hasse
Subject: Re: NCP Requests & 100% Utilization

I have seen 100% utilization, with 70%-80% of the processor time in the
'IPXRTR NCP work to do' process, on a NetWare 4.1 server with approx. 400
users. Unfortunately, these problems are hard to troubleshoot since they
are so sporadic. Novell documentation seems to say that if you get high
utilization in the NCP work-to-do process, it may be passed there from
some other process. If you are loading IPXRTR.NLM and can unload it
without causing a major outage, do so. That may show you the process that
is actually causing your high utilization.

One question to ask is: are you experiencing any significant response time
problems during these high-utilization sessions?

Another question is: do you have a large number of bindery contexts set,
or a large number of bindery printers? Reducing these may help. Bindery
work usually shows up in the DS AES process, but it may cause NCP
work-to-do as well.

Another question: do you have any OS/2 machines? OS/2 may cause high
utilization if you browse printers through the network folder.

If you can, unload third-party NLMs temporarily to try to narrow down the
problem.

Try this: at the server console, type SET COMPRESS SCREEN = ON, then
change to the compress screen. Watch to see whether you are having a
larger than usual number of decompressions, or the decompression of a very
large file. If all is well, you should have no compressions during the
day. I have had compress/decompress utilization show up in the NCP
processor thread instead of the compress/decompress threads.

Also, try this: download the NLM that disables packet burst on the file
server and load it. Fast PCs with packet burst can at times thrash a
server, although you would notice a significant performance decrease if
this were the case.

The best thing you can do, though, is to get a LANalyzer or a Sniffer and
put it on the same subnet as the file server. Then, when one of these 100%
utilization sessions starts, you can see exactly what is going on. Don't
get too excited about any one individual sending a large number of
packets. Watch for users that have a consistent, but not necessarily high,
number of packets to and from the server. Then capture their traffic and
see what it is. If it is what I had, you will see a large number of
repeated bindery requests from an OS/2 station.
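
A minimal illustration of the bindery context point above. The container
names here are invented; the idea is simply to keep the list down to the
contexts that bindery-based clients and printers actually need, since each
context in the list adds bindery emulation work for the server:

    SET BINDERY CONTEXT = OU=ACCT.O=WIDGETCO;OU=MFG.O=WIDGETCO

Multiple contexts are separated by semicolons, and the setting can be
placed in AUTOEXEC.NCF so it survives a restart.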
---------
Date: Wed, 19 Feb 1997 20:32:35 -0600
From: Joe Doupnik
Subject: Re: NCP Requests & 100% Utilization

>I have seen 100% utilization, with 70%-80% of the processor time in the
>'IPXRTR NCP work to do' process, on a NetWare 4.1 server with approx. 400
>users. Unfortunately, these problems are hard to troubleshoot since they
>are so sporadic. Novell documentation seems to say that if you get high
>utilization in the NCP work-to-do process, it may be passed there from
>some other process.
-----------
Supplementing your good comments. NDS itself can, and often does, run the
CPU to very high utilization values without much network traffic
occurring. Apparently digesting the sundry NDS files in _NETWARE is CPU
intensive. Novell is aware of this behavior, and last spring one of the
guys in the NDS group said they were experimenting with ways of reducing
the quadratic-time sort/merge aspects of some NDS updates. We may infer
that the larger and more varied the NDS files, the longer the CPU peaks
may last. If there are enough replicas needing synchronization on top of
this, and the system can't finish the update job in the allotted 22
minutes (as I recall), then NDS life becomes difficult for mere servers.

By itself, high CPU utilization is not a problem. If packets can't be
processed, however, then we get worried. My casual observations suggest
that packets do get processed, but folks with more heavily loaded servers
and vastly larger NDS databases are better judges of this.
        Joe D.
------------------------------
Date: Sat, 22 Feb 1997 14:22:22 -0600
From: "Alan L. Welsh"
Subject: lockups going to DOS mode

>One of our servers locks up in 100% utilization for about 4-5 minutes
>and drops user connections when trying to access the DOS partition.

Although I'm not the engineer most familiar with this problem, a year ago
I believe we encountered, confirmed, and then reconfirmed with Adaptec
that they have a bug that sometimes causes this when the server goes to
real mode (the DOS partition) while communicating with their ASPI device
driver.

>This could happen when saving Startup.Ncf or choosing "Product
>Options" from Install.Nlm.
>It actually seems to be stuck in DOS because it's possible to restart
>the server by pressing "Ctrl-Alt-Del", even though the NetWare console
>appears on the screen.
>
>I have gone through the adapter configuration with Storage Dimensions
>tech support and it seems OK. They couldn't help us and have assigned
>the case to a higher support level, but things seem to happen slowly
>there.

If you'd like, have them call me directly so the head device driver
engineers can speak directly to our head of engineering to reconfirm (and
maybe fix) this problem. We no longer expose this 'feature', but would be
happy to help resolve it.

>The latest operating system patches and disk drivers are applied.
>This issue is now exposed because I'm rebuilding the server with
>NetWare 4.11. The problem also existed under NW 3.12, but then I
>related it to the NW FAQ, where someone wrote about a similar problem
>due to a conflict between Intel LANDesk Virus Protect (PSCAN.NLM)
>and the NW patch kit 312PT6, which then applied to our environment.
>
>I read somewhere on Usenet that Adaptec controllers could have
>problems while switching between protected and real mode, related to a
>BIOS issue.

It is a NetWare ASPI driver issue, not a (ROM) BIOS issue.

>Other ideas are that it could be a PCI compatibility issue between
>the server and the controllers, or maybe the fact that the controllers
>have different BIOS revision levels.
>
>This is a server that is planned to be moved across the country to a
>location where we don't have regular support, so this issue has to be
>resolved before we can move it there.

(My only plug) Go to our website at http://www.cdp.com to find out how
remote NetWare novices could easily restore your server or individual
files if it does crash.

Alan L. Welsh, president, Columbia Data Products, Inc.
snapback@ix.netcom.com  1800-613-6288  http://www.cdp.com
------------------------------
Date: Tue, 4 Mar 1997 11:28:12 -0700
From: "Steven W. Smith"
Subject: 100% utilization, MONITOR.NLM 4.34

IMHO, the first thing to do when diagnosing a 100% utilization condition
is to unload and then reload MONITOR.NLM. On one of our fully-patched 4.11
servers using MONITOR.NLM 4.34, I've seen it stuck at 100% utilization,
usually when BackupExec 7.11 is having some sort of difficulty. In all
cases so far (4 times in the last couple of weeks), unloading/reloading
MONITOR "solved" the problem, with the utilization number dropping to
something realistic.

Some time back, one of our 3.11 servers had a habit of becoming stuck at
89% "utilization". Unload/load fixed that as well. I don't recall seeing
this discussed here and didn't find anything that seemed to relate to it
by searching the TIDs. Mostly, I'm just curious whether others have
noticed this phenomenon.
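
For reference, the console sequence being described is simply:

    UNLOAD MONITOR
    LOAD MONITOR

If the utilization figure immediately drops to something realistic after
the reload, the 100% reading was a stuck display in MONITOR rather than a
genuinely saturated CPU; if it stays pegged, keep troubleshooting the
server itself.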
------------------------------
Date: Sat, 22 Mar 1997 12:13:55 GMT
From: Teo Kirkinen
Subject: Re: NCP Requests & 100% Utilization

>LRU sitting times run from 40 seconds to 2 minutes during these times.
>Cache buffers were at roughly 78%. My compression start and end times
>are in the 3-6am range. We're running a Compaq 1500R with 11 GB and
>160 MB of RAM -- compression and BSA on all but the SYS volume (600 MB)
>and no additional name space (yet).

I would add more memory to the server until the LRU sitting time stays
above 15 minutes. This has made our servers more responsive during peak
work hours. The magic number comes from some NDS update interval; I
learned it at last year's BrainShare Europe.
------------------------------
Date: Mon, 14 Jul 1997 09:18:41 +0100
From: "Erik Bos, AMC afd. PC-LAN"
Subject: 100% util

We got a lot of help from the 'Troubleshooting High Utilisation on NW4.x'
document on Novell's web site. After checking everything in that document,
70% of those problems were gone. After we upgraded to NW 4.11 there were
no problems like this left.
------------------------------
Date: Sun, 1 Mar 1998 12:46:47 +0100
From: Camaszotisz Gyorgy
Subject: Re: How to limit NLM CPU usage

>Does somebody know whether it is possible to limit the CPU usage of an
>IntranetWare NLM program? This would be very useful on some occasions,
>when an NLM goes into an abend taking 100% of the CPU power and hanging
>up the whole server.

Go to Monitor.nlm, Scheduling Information, search for the thread taking
too much CPU time, then press grey + several times. There is also a
command-line equivalent; I don't remember the name, maybe LOAD SCHDELAY?

But from a developer's point of view, this setting only affects the sleep
time before the ThreadSwitchWithDelay() API call returns to the caller. So
if your NLM does not use this API call, you will have no success. And that
is the case almost every time, because the SDK documentation suggests
using this call only when your NLM is waiting for a resource to become
available.
------------------------------
Date: Wed, 4 Mar 1998 07:23:17 -0500
From: "David G. Pile"
Subject: Re: Intermittent high utilization

>I'm back to the list with another strange problem: a NetWare 3.12
>server with intermittent high utilization of up to 70%, which then drops
>to the norm of 1% to 5% after about 2 to 5 minutes. It causes
>workstation apps to crash.

Any chance someone has RCONSOLE open and minimized in the background? This
is a problem particularly from a Win 3.1x machine - not so much from 95,
but it can still peg servers.
------------------------------
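
A note on the ThreadSwitchWithDelay() point in the "How to limit NLM CPU
usage" message above: the MONITOR grey-+ / SCHDELAY delay only takes
effect where an NLM's own code yields through that call. The sketch below
is a hypothetical illustration of a CLIB NLM worker loop written that way;
the header name and the DoOneChunkOfWork() routine are assumptions based
on the old CLIB SDK and are not from this thread, so treat it as a sketch
rather than working production code.

    /* yieldloop.c - hypothetical CLIB NLM worker loop.
       Compile with the NetWare CLIB SDK and link as an NLM. */

    #include <stdio.h>
    #include <nwthread.h>   /* assumed CLIB header declaring
                               ThreadSwitchWithDelay() */

    /* Stand-in for whatever the NLM really does per iteration. */
    static void DoOneChunkOfWork(void)
    {
        /* ... real work would go here ... */
    }

    int main(void)
    {
        long i;

        for (i = 0; i < 1000000L; i++)
        {
            DoOneChunkOfWork();

            /* Yield after every chunk. Because this is the ...WithDelay()
               call, any delay imposed on the thread from MONITOR's
               Scheduling Information screen (grey +) or via SCHDELAY is
               honoured here; a plain ThreadSwitch() would only yield. */
            ThreadSwitchWithDelay();
        }

        printf("Worker loop finished.\n");
        return 0;
    }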