-----------------------------------------------------------------------
NOV-100.DOC -- 19980304 -- Email thread on NetWare 4.x 100% Utilization
-----------------------------------------------------------------------
Feel free to add or edit this document and then email it back to
faq@jelyon.com

Date: Mon, 23 Oct 1995 12:15:22 GMT0BST
From: Kens Mail List Host
Subject: NW4.1 High CPU Utilization a CURE! (sort of)

We've recently been plagued by slow network response time, always
accompanied by very high server utilization (80-100%). We have various
NICs and grades of machine, but mostly 486s and above. Using the Norton
SI program, the data throughput was reported as 160KB/s on a 486 with a
3Com 3C509 card. This was obviously not good.

After much testing using different configurations we eventually
discovered the source of our problem. It was packet signing. When packet
signing was removed, network throughput rose to 460KB/s on the same
machine! At best we were losing 40-50% of our cable bandwidth on a 486
to packet signing. At worst, one of our 386s was down to 16KB/s with
packet signing and is now at 250KB/s without it - an approximate 94%
loss of bandwidth!

I knew that packet signing was going to affect the throughput of the
network, but I didn't think it would effectively halve it! Surely this
is a broken piece of software. My advice: don't use it unless you
really, really need to!

Ian Kennedy
------------------------------
Another idea is to turn off Pburst in the NW shell. Reports are that
under not-so-good conditions Pburst can negotiate itself into a corner
and slow down a great deal. Yet another is to have a look at the server
to ensure it's not wedged in some manner. Overdoing disk writes to a
not-very-strong disk system (and/or one with no free space, so stealing
from \deleted.sav must occur) can block traffic. Similarly, a
client-class lan adapter in a server can be stomped upon by too much
traffic and temporarily go bananas. The old 8-bit 3C503 boards were
famous for such behavior. If you have Novell's Lanalyzer or equivalent
then watching the traffic in detail might yield the cause.
Joe D.
------------------------------
Date: Wed, 25 Oct 1995 21:51:06 PDT
From: Luke Mitchell
Subject: Server utilization-100%

One solution I have not seen posted for the utilization issue: after all
the patches are applied and you have a Pentium server, make sure MAXIMUM
SERVER PROCESSES is set all the way up to 1000. A Novell tech indicated
to me that my Proliant 1500 5/100 server could handle that. Also turn
the minimum time to add a new process all the way down. I have not had a
problem since those steps and patches. This server runs GMHS and serves
related apps for 500 users.
------------------------------
From: nicholas cline eggleston

One other culprit of intermittent 100% utilization under NetWare 4.1 is
compression. If you've got it turned on, the server sits around and
compresses files at night, like it should. However, when your users go
to read those compressed files, the server will spawn a very high
priority decompression thread. This keeps the server busy enough that
client connections can be lost. This problem is especially noticeable on
servers with a 486/66 or less.

Nick Eggleston
------------------------------
Date: Sun, 22 Oct 1995 12:15:53 -0600
From: Joe Doupnik
Subject: Re: NW 4.1 High Utilization 100% - Solutions

>=> Fellow Network Engineers,
>=>
>=> Since I first replied to this problem on the list, I have gotten a
>=> lot of inquiries.
>=> So I am posting this to help anyone that may be experiencing high
>=> utilization problems with NetWare 4.1 NDS.
-------
We'll have to do some careful collecting of information on the problem
because NW 4 is very prone to the 100% utilization problem. Sundry
revisions to nds.nlm and dsrepair and so on have helped but not
eliminated it. Novell is actively working on the NDS database robustness
problem, so I expect we will eventually get good solutions to the
multitude of causes.
Joe D.
------------------------------
Date: Fri, 20 Oct 1995 15:07:00 PDT
From: "Funderburg, Karl - G6"
Subject: NW 4.1 High Utilization 100% - Solutions

Since I first replied to this problem on the list, I have gotten a lot
of inquiries. So I am posting this to help anyone that may be
experiencing high utilization problems with NetWare 4.1 NDS. Perhaps it
can go to the FAQ; I am not sure how that process works.

High utilization is a known problem in the NetWare 4.x environment. I
have found that there is not one simple answer to fix it. There are
several things that can and should be done to help alleviate the
problem. I worked through most of what I am about to tell you with the
help of Novell engineers (we are a premium service account, so our
problem got escalated to the actual engineers at Novell).

I. Patches and updates - First of all, apply all Novell patch kits and
updates to your server. These files are all available from Netwire
through the Web or CompuServe. Three files in particular are 410IT4.EXE,
410PT2.EXE and DSENH.EXE. Novell will not help you further until you
patch to this level. DSENH.EXE will give you the most recent DS.NLM
v4.89a and DSREPAIR.NLM v4.26b. Version 4.89a of DS is supposed to fix a
lot of the high utilization problems. You also get a neat utility called
DSMAINT.NLM that will let you copy NDS to a file on disk. This is useful
if you are planning to bring a server down for an upgrade.

II. Set parameters - In 410IT4.EXE you get SVRPRSFX.NLM, which increases
the maximum allowable service processes from 100 to 1000, and
DSPRCSFX.NLM, which limits NDS use of service processes to 50%. The
following parameters should be set if you have over 100 connections.

Service Processes
  Set New Service Process Wait Time = 0.3
    (speeds the allocation of additional service processes)
  Set Maximum Service Processes = number (number = 5 to 1000, see below)

  No. of Client     Recommended Maximum
  Connections       Service Processes
  1 - 100           2 - 200 (don't use less than the default)
  101 - 250         200 - 500
  251 - 500         500 - 1000
  501 - 1000        1000

Directory Cache Buffers
  Set Minimum Directory Cache Buffers = number (number = 10 - 2000, see below)
  Set Maximum Directory Cache Buffers = number (number = 20 - 4000, see below)
  Set Directory Cache Allocation Wait Time = 0.5
    (speeds the allocation of additional DCBs)

  No. of Client     Recommended Directory Cache Buffers
  Connections       minimum           maximum
  1 - 100           2 - 200           4000
  101 - 250         200 - 500         4000
  251 - 500         500 - 1000        4000
  501 - 1000        1000 - 2000       4000

Packet Receive Buffers
  Set Minimum Packet Receive Buffers = number (number = 10 - 2000, see below)
  Set Maximum Packet Receive Buffers = number (number = 20 - 4000, see below)
  Set New Packet Receive Buffer Wait Time = 0.1
    (speeds the allocation of additional PRBs)

  No. of Client     Recommended Packet Receive Buffers
  Connections       minimum           maximum
  1 - 100           2 - 200           4000
  101 - 250         200 - 500         4000
  251 - 500         500 - 1000        4000
  501 - 1000        1000 - 2000       4000
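As a worked example of section II, here is roughly what those settings
look like gathered together for a server in the 251-500 connection
range, using the values straight from the tables above. Treat it as a
starting point only, and note that on some NetWare versions one or two
of these (the minimum packet receive buffers in particular) may have to
go in STARTUP.NCF rather than AUTOEXEC.NCF or the console:

  Set New Service Process Wait Time = 0.3
  Set Maximum Service Processes = 1000
  Set Directory Cache Allocation Wait Time = 0.5
  Set Minimum Directory Cache Buffers = 500
  Set Maximum Directory Cache Buffers = 4000
  Set New Packet Receive Buffer Wait Time = 0.1
  Set Minimum Packet Receive Buffers = 500
  Set Maximum Packet Receive Buffers = 4000

For other connection counts, pick the minimums from the matching row
above; the wait times and the 4000 maximums stay the same.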
III. NDS partitions - Limit the number of replicas of a partition to 3
servers. If you must do bindery emulation on all servers (this requires
a replica of the partition containing the bindery objects), then keep
the partition containing the bindery objects as small as possible.

IV. ArcServe tip - Load TSANDS.NLM only on servers containing master
replicas. There is no need to make a backup of each replica of NDS, only
the masters. It helps to have a stable (normal utilization) network when
running ArcServe backups. Do steps I, II & III.

My network is a WAN consisting of 14 Novell 4.1 servers, 2000 nodes, a
dedicated backup server running ArcServe 5.01g, and an Exabyte DAT tape
changer. Topology is Ethernet using UTP cabling and an FDDI backbone.
Routers and hubs are SynOptics (Cisco). Our NDS tree now contains 3
partitions. One partition is very small, containing only the bindery
objects required for bindery emulation.

Prior to taking the above steps I experienced frequent and prolonged
periods of 100% utilization, to the point where servers would lock up
and have to be powered off. There were frequent crashes, and if you ever
had to bring a server back from a down state the entire network was
brought to its knees while the new server synched back up to the NDS
tree. After implementing the above steps I now get to sleep at night and
even enjoy my weekends. I will still occasionally see a 100% spike, but
nothing like it used to be.
------------------------------
Date: Mon, 30 Oct 1995 10:53:51 MST7MDT
From: Matt Zufelt

>>I recall that when Netware 4.x was first released that Novell's gurus
>>recommended that a container object should hold no more than 500 leaf
>>objects (ie users). Does this still hold true with NW4.1? Has anyone any
>>experience on this?
>
>I have 4000 accounts in one container on my 4.1 server. It doesn't seem
>to cause any major problems. NWADMIN even handles it ok. Whether or not
>there are performance penalties is another question. I have had and still
>have some performance problems on my 4.1 servers...

We have one container on a 4.1 network here that has almost 3000 objects
in it, and it has become a major source of heartburn. When the NDS
janitor process kicks in, the server essentially grinds to a halt. It is
extremely slow for the active users until the process finishes (which
can sometimes take 15-20 min.). Also, anytime we try to make a new
replica (or remove a replica) from another server on the tree, things
seem to stop. We tried to remove a replica of this context off a server
two weekends ago; after running all day Sunday and Monday, it still was
not removed. The server was still munching on it. We finally reset the
server. Now we have a "dying" replica on the server. It takes up disk
space, but at least it is no longer receiving updates--another process
that took an inordinate amount of time.
------------------------------
Server utilization will become very large if the server is handling
directly attached printers via polling (non-interrupt fashion).
Joe D.
------------------------------
Date: Wed, 8 Nov 1995 10:44:06 MST7MDT
From: "Timothy D. Porter"

We are having a real problem with some 4.1 servers here when the NDS
Janitor/Flatcleaner process starts. When it starts, the server
utilization goes to 100%, and response times for people logged in, as
well as other processes running on the server, grind to a halt. It takes
the janitor between 5 and 10 minutes to complete its run. Is there some
way to set the scheduling priority of this process lower so everything
doesn't stop while it runs? It doesn't show up in the scheduling
information screen in the monitor.
We have all of the latest patches installed. One of the machines that is
having the problem has 132MB RAM and a P90 processor.
------------------------------
Date: Thu, 9 Nov 1995 02:47:10 GMT
From: Rich Silva
Subject: Re: NDS weirdness "resolved"

Todd W Herring wrote:
>Our four server, NW 4.1 network had been experiencing some rather
>perplexing difficulties recently. From 100% processor utilization -
>to NDS replicas that weren't synching - to the inability to login in
>bindery emulation. For the first three months after upgrading to
>4.1, we didn't have any of these problems. They snuck up on us and
>pounced rather suddenly.
>
>The final straw, and the most puzzling, was the 100% processor
>utilization problem. Our servers, one by one, were attracting Get
>Bindery Object NCP packets from print servers all over our campus.
>This, we thought, was causing the 100% util. After applying a filter
>to our router, the packets stopped coming in, but the 100% util did
>not. Weird...
>
>We did all the normal things you'd do to resolve the problems -- ran
>DSREPAIR, VREPAIR, downing the server, unloading NLMs, updating to
>the newest patch, etc. We even called Novell. Novell techs wanted
>to dial in to one of our servers and do some teeth-clinching things
>to our NDS. We declined the offer, not knowing what kooky things may
>happen after they were done. So we bit the bullet, gathered all the
>information about trustee rights that we could, and re-installed
>NetWare on each of our servers. We began by removing the replica
>from a server, then renamed the SYS volume, downed the server, and
>re-installed NW 4.1 into a brand new tree. All went well (Thank God)
>and we rebuilt trustee assignments using batch files.
>
>Looking back it would have been nice to pin down the exact cause. We
>can only assume it was an NDS corruption. What caused the corruption
>we don't know. We suspect it may have been imported from the upgrade
>of a 3.12 server into the tree, although this upgrade did not
>immediately precede the corruption. That particular server had
>several unknown objects show up when running BINDFIXes, and the
>bindery objects were pulled into the tree during upgrade. Most of
>those objects were summarily deleted from our new tree because they
>already existed (we had Netsynced with a redundant server to pull the
>bindery into our tree). The bottom line is we don't know what caused
>the havoc.
>
>I know what you're saying. Couldn't we have restored NDS from tape?
>No, because in all our mess ARCserve got screwed up. Its bindery
>queue object was corrupted when I downgraded to ARCserve 4.0 to run
>an experiment. I couldn't re-install ARCserve 5.01g because I
>couldn't login in bindery emulation!! Ahh! I shiver when I even
>think of it!
>
>My advice if you haven't upgraded to 4.1 - learn as much as you can
>before you do; NDS is nothing like 3.x's bindery. Take classes, ask
>people who have upgraded, and when you do upgrade set up a test
>server and hammer it for a few weeks at least (test backups and
>restores, etc). My advice for those who have already upgraded -
>learn as much as you can before something goes wrong; NDS is nothing
>like 3.x's bindery. I would also suggest you find the resources to
>set up a test tree and server and test your backups and restores of
>NDS. You'll want to have the experience before it becomes necessary.

What happened is that the replicas should have been removed before the
server was brought down. The servers left behind get bombarded with
searches trying to update the replicas.
Sometimes it may never get back in sync, leaving you with 100%
utilization. I think this only happens if the server's IP address or its
name has changed, or something similar. But if this happens, the only
way I know to fix it is to reinstall. The moral of the story is: pay
attention to where you put your replicas. Only use them where you need
them, and when a server is going to be brought down, remove any replicas
from that server first.
------------------------------
Date: Fri, 10 Nov 1995 12:50:22 -0800
From: Jonn Martell
Subject: Utilization at 100%

This (long) message contains a description of two server utilization
tools and a question regarding how to track individual process CPU
utilization.
....
We had a problem with one of our servers this week. The utilization went
to 100% and stuck there (for several agonizing minutes - felt like
hours). Although it's a 4.1 server, it's a single server tree, so the
old NDS bug probably isn't the problem. Nothing on the server had been
added or deleted and we aren't running CDROM.NLM.

This is the first time this problem has happened and it's quite scary,
because at 100% utilization the server stops accepting connections,
users can't access the server and the console "feels" stuck. After
trying to identify the rogue process in monitor (without success) I
decided to unload monitor, and that seemed to unlock it; the utilization
stabilized back to normal (300 connections with an average of 15% CPU
utilization). The console log shows "A scheduled "Work To Do" took over
one minute to run." and the system error log shows nothing. The server
has been running fine since.

In trying to locate tools that would allow me to isolate the problem I
found two that display server information (including utilization)
graphically over time. Both are 3rd-party commercial products, and
evaluation copies are available over the net.

The first is Nconsole by Avanti Technology. The NLM provides Monitor
information in a much better format. It not only has current utilization
(like monitor) but also displays average and peak. The screen saver is a
histogram that shows utilization (current, average and peak). The
Windows client shows utilization and trends (although the Windows client
would probably fail if the server hits 100% again). Available from
http://www.avanti-tech.com/~prodinfo/ They also have an SNMP version
which looks much better than the expensive NMS agents by Novell. Tech
support is very responsive.

The second is NetTune Pro by Hawknet, available from
http://www.cts.com/~netinfo/nettune2.html. NetTune is much more powerful,
although I found several bugs with the more advanced features
(documentation and set parameter tuning). It only runs as client-server
with a Windows front-end and an NLM back-end. It's a great tool to
document your server!

Now for my question: Nconsole shows me that there is a process that
sometimes makes utilization jump to 100% (for a fraction of a second).
This would be very hard to pick out in monitor, but Nconsole makes it
very apparent. There is no pattern that I can see except that it does it
at least a few times per hour. Does anyone know of any tools that can
isolate which process is making utilization jump to 100% for a fraction
of a second? The Avanti-Tech folks said they are working on individual
process CPU utilization monitoring but they don't have anything right
now.
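A small aside before moving on, echoed by a report near the end of this
document: when a server seems pinned at 100% but is otherwise behaving,
it is worth ruling out a stuck MONITOR reading before doing anything
drastic. At the console:

  unload monitor
  load monitor

In the message above, unloading MONITOR appeared to free the server
itself; in the later report only the displayed number was wrong. Either
way it is a cheap first check.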
------------------------------
Date: Tue, 21 Nov 1995 17:06:30 -0800
From: Charles Martini
Subject: Re: NDS over WAN

>I would be very interested in your replicating strategy and experience with
>traffic generated by NDS synchronization over slow WAN links. Would it
>be feasible to open a WAN link only temporarily, say four times of 15
>minutes each a day, just to force NDS to do its sync-work?

TO THE BEST OF MY KNOWLEDGE (and I may be wrong on this, but I don't
think so), there is no way to control when NDS does its syncs. BUT, I do
have some data from Novell on when NDS syncs happen and how much traffic
they generate.

"High Convergence" activities, such as user/object creation or deletion,
are synced every 10 seconds. "Low Convergence" activities, such as
updating Login Time & Addresses, sync every 30 minutes. There's an NDS
heartbeat that syncs every 30 minutes as well. NDS verifies backlinks
and external references every 25 hours.

Sample traffic loads (bytes):
  Replica heartbeat: 750
  Sync ten users, 2 replicas: 6286
  Sync ten users, 3 replicas: 14108
  Create user, workstation to server: 13000
  Create user, server to server sync, 3 replicas: 15796

Tree walking: you'll also be stymied in trying to schedule NDS
connections if any of your users need to authenticate to remote servers.

Bottom line: NDS sync traffic is pretty minimal, so it shouldn't be too
costly if you have only a few remote servers & slow WAN links. If,
however, you have a lot of remote servers & remote users, and want to
share resources across the entire WAN, you'll obviously need to invest
in faster links.
------------------------------
Date: Fri, 24 Nov 1995 09:48:03 GMT0BST
From: Kens Mail List Host
Subject: Re: IPXRTR NCP Work To Do

>>Some of our Netware 4.1 servers occasionally jump up to 100% CPU
>>utilization and stay there for some time. The console reveals that
>>almost all the work is being done by the 'IPXRTR NCP Work to do'
>>process. We are not routing IPX on the server and it is not heavily
>>loaded with users. Does anyone know what this means?
>
>Sounds like you have a serious NDS problem. Go to the server console and type:
>   SET DSTRACE=ON
>   SET DSTRACE=+ALL
>   SET DSTRACE=+SYNC
>   SET DSTRACE=*H
>Switch to the DSTRACE screen and watch for any unsuccessful updates
>and/or errors. If you do have problems, I would recommend contacting a
>NASC or Novell Tech Support directly, unless you are an NDS
>troubleshooting expert.

This is not an NDS problem. We've just spent several months getting to
the bottom of this one. It's related to packet burst and the VLMs.
Novell have a patch called PBRSTOFF which disables packet burst on the
server. It took our utilisation from 80-100% down to 0-30%. I'm not sure
the patch is generally available, but they will give it out on demand.

Ian Kennedy
------------------------------
Date: Sat, 30 Dec 1995 14:09:12 -6
From: "Mike Avery"
To: netw4-l@bgu.edu
Subject: Re: Two network Cards

>>Following up on this interesting topic with another question...
>>What about the scenario of an exclusively Cisco-routed network with
>>multiple local 16Mbps token-ring segments. May a NetWare 4.1
>>server improve throughput to clients on the segments (IP or IPXng)
>>by adding NICs which directly connect to a unique segment?

>It depends on where your bottleneck is. If you've got 80-90%
>utilization on the server's ring, you'd benefit.
>How much depends on your server hardware, # of users and traffic
>particulars, but on a modern server with EISA/PCI/MicroChannel and a
>sufficient # of disks you should be able to increase your throughput
>several times. Way back when 386/16s were new, a pretty complete test
>showed double performance with a second NIC, with little improvement
>beyond that - but that was 2.15!

The book "Optimizing NetWare Networks" by Rick Sant'Angelo (M&T Books)
covers this topic pretty well.

The history of data processing and computer science could be viewed as a
matter of moving the bottlenecks. Yesterday's solution becomes today's
problem. More NICs can be added, but each NIC will increase processor
overhead. At a certain point, adding more cards will actually start to
decrease performance. Some cards use more system resources than others,
and that is due to a combination of hardware and software. 3Com's
Ethernet cards deliver high performance, but at a high price - as much
as 1/3 of a 486 server's CPU resources can be spent servicing the NIC,
according to some reports.

The point of diminishing returns can be reached rather quickly. As the
network load goes up, it sometimes makes more sense to put a high-speed
backbone in place connecting the servers to routers, let the users
connect to other segments (or rings, depending on your topology), and
let the routers handle the routing services.

In one case, we had a Compaq file server with three 10Mbps Ethernet NICs
in it and over 200 users. More were routing through the server. We
removed two NICs, cabled the server to a router, and had all the users
go through the router. The performance was greatly improved. It was
further improved when we removed the Ethernet NIC and put in an FDDI
card. (I inherited the original server configuration....) The
performance of 100Mbps Ethernet may well be comparable to that of FDDI
for most applications, at a considerable savings.

All in all, routing seems to be more expensive in terms of CPU
requirements than one might think... and getting rid of the contention
for the Ethernet segments probably also helped.
------------------------------
Date: Mon, 1 Jan 1996 01:55:05 -0800
From: rgrein@halcyon.com (Randy Grein)
To: netw4-l@bgu.edu
Subject: Re: Two network Cards

>I'd like to explore the routing issue more.
>
>My opinion is that peak client and overall system throughput would be enhanced
>by supporting demanding local segments with direct access to shared server(s)
>via multiple NICs. I've seen PERFORM3 benchmark volume throughput (e.g.
>delivering server-located applications) on a local segment come out twice or
>more faster for a client than reaching out through a Cisco router backplane
>to servers on different segments.

People will whine and argue, but the fact remains that you are correct -
crossing routers does take time, especially if you're not using packet
burst. My boss wrote an article about the subject several years ago,
comparing the performance of routers vs. bridges. This is appropriate in
the current "switch" debate, as a switch is really nothing more complex
than a jumped-up bridge. I've not seen quite this extreme a performance
differential, and there are a couple of caveats:

1. The penalty is largely negated if you use packet burst.
2. Routers do much more than bridges/switches; make sure you'll not need
   the extra functionality.
3. While the speed reduction is measurable and important, it may not
   matter in many situations.
   Aggregate throughput will be essentially the same (modern routers and
   switches both forward at or nearly at wire speed), and unless the
   user is moving many megabytes of information the difference in time
   is usually not noticeable.
4. Be careful using the PERFORM series to draw performance conclusions.
   It generates a VERY artificial load which is only valid for
   preliminary analysis. Other, more complex tools are available, more
   accurate but harder to use.

>Establishing more than one logical routing path between clients and servers
>looks like a problem. Is it still reasonable/workable to disable routing on a
>NetWare 4.1 server to enhance throughput with multiple NICs for selected
>NetWare client segments while avoiding an unsupported mesh routing situation?

The "mesh" or "web" routing network IS supported - it bears some
advantages in basic fault tolerance. In fact, up to a certain size it's
advantageous to place each server on each segment by installing an
additional NIC. Large WANs use this concept to provide fault tolerance,
although I believe they use OSPF instead of RIP to resolve paths and
reroute around downed links. You can disable routing if you wish, but
it's not necessary - I wouldn't recommend it unless you had a specific
need.
------------------------------
Date: Thu, 4 Jan 1996 21:38:00 -0800
From: rgrein@halcyon.com (Randy Grein)
To: netw4-l@bgu.edu
Subject: Re: 100% Utilization crashed the server

>Our 4.1 server crashed three times yesterday; the utilization
>was up at 100% in all instances. The console was frozen and
>therefore I couldn't tell what open files caused the crash.
>Does anyone know of a way to track this resource hog?
>Any utility out on the market that may do the job?

I hate to tell you this, but open files do not cause a server to crash!
However, what you are looking for (a tracking utility) is more or less
possible, but reconstruction is difficult at best.

1. Load conlog MANUALLY after the server mounts. It will then write any
   error messages to the console log, which will NOT be overwritten on
   reboot. It overwrites the log file when reloaded.
2. NOTHING will track reliably during a server crash like this - no
   software, anyway. The trouble is that the instrument to be monitored
   (the OS and CPU) is being used to perform the diagnostics. This is
   invariably limiting. The closest you can come to the type of tracing
   you're looking for is purchasing server machines from Compaq, HP or
   IBM. These, in addition to using ECC memory, also have diagnostic
   circuitry built in that can at least detect hardware problems
   independent of the CPU or OS.
3. What you really need to do is diagnose the utilization problem before
   it gets out of hand. Look for things like backup, DS.NLM less than
   version 3.89, no patches, incorrect DS partitioning, or NT servers on
   the network using MS IPX emulation. There have been verified reports
   of the latter abending SFTIII servers because Microsoft used some of
   the wrong communications sockets; it's remotely possible this could
   be a problem.
------------------------------
Date: Mon, 8 Jan 1996 06:02:30 -6
From: "Mike Avery"
To: netw4-l@bgu.edu
Subject: Re: 100% Utilization crashed the server

>>>Our 4.1 server crashed three times yesterday; the utilization
>>>was up at 100% in all instances. The console was frozen and
>>>therefore I couldn't tell what open files caused the crash.
>>>Does anyone know of a way to track this resource hog?
>>>Any utility out on the market that may do the job?

>>Have you applied ALL the patches and fixes?
>I also have a sick 4.1 server. Anytime you attempt to load a
>console utility, i.e. INSTALL, the processor util goes to 100%.
>At first we thought the DISCPORT NLMs were at fault. (I had
>problems with this just after upgrading the server to 4.1.) I have
>applied the latest (fall last year) patches to the server. As soon
>as time allows I will be checking for later patches.

If you check the current Novell patches, a number of them address the
100% utilization issue. However, the documentation also makes it pretty
clear that 100% utilization, by itself, is nothing to get excited about.
If it stays at 100% AND the performance of other tasks is impaired, then
it is an issue.

The latest revision of the DiscPort software is 4.10a. There are some
significant improvements over the earlier versions, especially in a
NetWare 4.10 environment.
------------------------------
Date: Fri, 19 Jan 1996 08:30:39 +0100
From: Henno Keers
Subject: Re: Cause of CPU utilization increase? [4]

>>>Background:
>>>We are running NetWare 3.12 on a 486-33 EISA box in an educational
>>>setting. Our 100 user license is just about maxed out by 3 labs and
>>>a number of office machines. We do lots of remote booting, and most
>>>applications are loaded from the server.
>>>
>>>Question:
>>>How can we tell what processes running on the server are heavy
>>>contributors to an increase in CPU utilization? We have always had
>>>spikes of high utilization, but these could be attributed to specific
>>>temporary causes (people copying files to/from local drives, a bunch
>>>of machines booting up or loading software at the beginning of a lab
>>>period, or similar things). What we have noticed recently is
>>>utilization ramping up and staying there for a while. The spikes
>>>don't concern us much (should they?) but longer periods of high
>>>utilization do. Depending upon what the cause is there may not be
>>>much we can do, but it would be nice to know specifically what's
>>>causing them anyway.
>>>
>>>To quantify things a bit: our server idles most of the time in the
>>>teens or even single digits. Spikes have been up to the 60-70% range
>>>for possibly a few seconds and certainly less than a minute. Our
>>>recent high utilization has been hovering in ranges from 40% to 90%
>>>for 5-10-20 minutes at a time (possibly longer, but that's as long as I
>>>actually watched it).
>>>
>>>To answer the obvious question "if it's a recent problem, what have
>>>you changed recently?", the only significant recent change is
>>>Internet access. Our LAN is attached to the organization WAN. The
>>>server is running Mercury as a mail agent, and people are starting to
>>>use Web browsers. However, this is really only in its infancy. We
>>>don't have labs full of 25 people all trying to run Netscape at the
>>>same time...yet. Could 3 or 4 people running Netscape generate that
>>>significant a load for the server? (An additional 50% CPU
>>>utilization?)
>>>
>>>A related question:
>>>If the answer to the first question is "get some network
>>>management/monitoring software", is there anything available in the
>>>shareware/freeware category? (Even if it only generates simple
>>>reports on CPU utilization, concurrent logins, network traffic...)
>>>Educational budgets won't support much these days...but you already
>>>knew that. :)
>>>
>>>Mark Holland
>>---------------
>> What a well written report! Terrific.
>> It just so happens that I have a 486-33 EISA NW 3.12 server in
>>a public lab environment which the other day showed 100% utilization for
>>many many minutes on end, according to my student lab consultants on duty.
>> Normally printing to a printer attached to a server is either a
>>small activity if using interrupts or an intensive polling activity
>>otherwise. But that brings server utilization up to the 40-60% level, more
>>or less, and it's obvious what's going on (from the noise alone).
>> We don't see 100% utilization from normal activities. Something
>>strange was occurring, but what? (As you ask too.) I wasn't there to see.
>> I looked at MONITOR and discovered one lan adapter had far more
>>received bytes than it ought, by about 1GB in a day. Ah ha! Some student
>>was pulling in bytes by the truckload, with no place to store them (diskless
>>clients). What could that person be doing? Hard to tell. Putting up a
>>temporary IRC/MUD relay station is a guaranteed way of using all available
>>bandwidth, and we have killed off a number of those things. Running a
>>home-grown networking program is another way of bringing a server to its
>>knees (sit in a tight loop hammering on the server). EISA machines kneel
>>but don't submit to such abuse, and thus "they keep on running and running,"
>>battery bunny style, no matter what the load.
>> We had a recent experience with a lan adapter test program, shipped
>>on floppy with the Ethernet board, putting illegal packets on the wire
>>as fast as it could go. I told the list about this situation (pkts spread
>>throughout the campus and caused quite a flap). A generic no-name NE-2000,
>>in case you are interested. Anyone can run such programs in a lab just
>>by stuffing a disk in a floppy drive. The identifier, in this case, is
>>server lan adapter statistics.
>> I presume it is possible for a commercial programmer to get it
>>wrong and hammer the daylights out of a server while trying to print
>>or write a file while that attempt fails. Again, lan adapter stats should
>>show the traffic, but that's about all we would see.
>> Finally, most of us have a look at recently added NLMs with the
>>feeling that not all may be well in NLM-land. It happens. Unloading what
>>we can is certainly worth doing during server saturation. Tape backup
>>programs are candidates, as are virus scanning programs, and whatever
>>other goodies are present (Pmail in your case, though it's always been
>>well behaved in my environment). Software metering programs are candidates
>>too. The problem isolation thought here is: if lan stats are normal then
>>activity is occurring within the server itself, and unloading is the
>>quickest way of turning off internal items.
>> My US$0.02 on the matter.
>> Joe D.
>
>We all have our horror stories, I hope you don't mind mine...
>
>We, too, found our 5-30% utilization jump steadily one day to 60%,
>70%... up to 100% utilization.
>
>LANAlyzer for Windows found an 'unrecognizable bunch of stuff'
>occupying more and more of our bandwidth. We were able to isolate it to a
>general location in one of our buildings.
>
>A 25pin-to-RJ45 adapter had been used on a PC to allow a network print
>TSR to run a serial printer located across the hall. Cat 3 cable was run
>from the PC to the printer.
>
>A user moved the PC with all its cabling from one location to another.
>It seemed to him that the 25pin-to-RJ45 adapter was a 'network connector'.
>He plugged it into the wall jack, along with the patch cable for the
>NIC.
>When he booted his PC, the TSR for network printing fired up [along
>with standard DOS serial stuff], flooding the network.
>
>Daniel E. Cullinan

A couple of years ago I was working as a field service engineer at a
large PC shop (which went financially flat on its face). I was in a
branch office working with a colleague. He was moving an original IBM AT
with a monochrome display adapter and a token-ring card from one place
to another, where he hooked the token-ring (NetWare) network up to the
MDA DB9 port ... and switched on the PC. The token-ring net went totally
silent the moment he flipped the switch; needless to say, the users in
the rooms down the hall were not totally silent after this incident {;-)
------------------------------
Date: Tue, 23 Jan 1996 17:37:40 EST
From: Paul Massue-Monat
Subject: High CPU Utilization - Servman.nlm - Netware Connection

In the latest Netware Connection magazine (January/February 1996 =
volume 7, number 1), one can find an interesting article on servman (as
in "load servman.nlm" on the server console). The magazine is published
by NetWare Users International (1-801-221-9634); Novell and/or
knowledgeable people contribute to it regularly. (To join a user group,
call 1-800-228-4nui.)

Anyway ... since there's a thread running at this time on "high CPU
utilization", I want to repeat what is said in the article written by
David Doering (74106.1551@compuserve.com), a senior analyst with
Technical Services, a Provo, Utah-based consulting company.

Before I start, let me say that personally I've noticed "high util" when
I load NLMs in the autoexec.ncf. Depending on the order I load the NLMs,
I sometimes can get an eternal 100%. The server is doing something and
it gets caught. I shuffle the NLMs in the autoexec, down the server
gracefully and re-boot. I have now found a stable order and I'll try to
keep it that way. (I'm still testing third-party products on a
production server: ugly ugly ...)

May I also add that the article informs you about things that are NOT in
the red books, of course. There's info on free blocks, on directory
entries, etc. Very useful for understanding NetWare. In the hope of
helping someone, here goes:

--------- start of quote
High util is often caused by a server process that is consuming an
inordinate amount of CPU time. Using the monitor utility, you can see
how much time the CPU spends servicing each server process. In monitor,
choose the Processor Utilization option, then press F3. You will see a
list of processes along with each one's load percentage.

If the process that is using the most CPU time is one with a low
priority (such as the monitor utility itself), you can change how the
CPU services that process. To do this, choose the Scheduling Information
option from the monitor's main menu and select the process you want to
change. Then press the + or - sign. For example, if you increase the
value from 1 to 2, the CPU will handle the process half as frequently as
it did before.

High util can also be caused if NetWare is using suballocation blocks to
save small files and the volume does not have enough free space. NetWare
4's suballocation feature conserves space by saving files that are
smaller than the physical disk block size in suballocation blocks. These
blocks allow for files as small as 512 bytes. A single 64KB physical
disk block can hold as many as 128 suballocation blocks.
If the amount of free space on a volume falls below 10-20 percent, the
suballocation routine does not have enough space to perform this task
efficiently.

Finally, high util can be caused by servers that include power
consumption firmware in the CMOS. If your system has a CMOS-based power
saver, you should verify that it is off when you boot the server.
Otherwise, NetWare spends time preventing the power-down feature from
activating when the server is idle.
----------- end of quote
------------------------------
Date: Wed, 24 Jan 1996 22:54:07 GMT
From: Kevin Kinnell
Subject: Re: Novell 4.1 Server crashing with network overload...help!

>You're not backing up at the same time as compression is taking place
>(default 12 midnight to 6 am) are you?
>
>Decompressing isn't a high utilization issue -- in fact, you can barely tell
>the difference in loading a file if it has to be decompressed (I can tell,
>but I'm looking for it!) If you want to test the compression angle, use the
>SET command to turn compression OFF for the server. That will keep the files
>from recompressing and you should have the most common ones decompressed
>within a pretty short time with that many users.
>There have been some problems with high utilization, but the largest problems
>I see concern continuous high utilization, not spiking. What type of system
>are you running?

Isn't there a static patch for the loader that addresses the utilization
spike? Seems like the original code would peak the utilization during a
load if there was an interrupt generated.

If I had to theorize, I'd guess that Aaron is doing exactly what you
suggested with the compression (compression is occurring at the same
time as backup) plus the un-patched loader, which is causing the usage
spike.

Aaron, are you using ArcServe?
------------------------------
Date: Thu, 25 Jan 1996 21:00:15 -0600
From: Joe Doupnik
Subject: Re: Slow Server

>I need some suggestions as to why my network seems to slow down at some
>times during the day.
>
>Currently we are running Novell 3.12 BNC with 3 NE2000 cards and 6 loops.
>During peak operation in the day (noon to 3) a lot of users are on the
>server and some workstations seem to get bogged down. They run real slow.
>But towards the end of the day things get back to normal speed. The server
>does not have a lot of free storage capacity and fluctuates from 80 to
>100 meg of free storage.
------------
Hmmm, that one will require a little head scratching. There isn't a
direct answer that we can offer from here because there are lots of
candidates. But let me suggest a start to your own investigations.

Use MONITOR and examine lan traffic on each board. If a wire is pretty
well saturated, say by running > 1000 pkts/sec on average, then time on
the wire is a bottleneck. If you have limited packet receive buffers,
and the "no ECB" count keeps going up, then the comms channel in the
server is a problem.

Often the problem is in the server disk system, and we don't know what
that looks like. I certainly would look into that part before jumping to
other conclusions. I run a busy server with three NE-3200s and it does
not bog down even under heavy load. A 486-33 server is handling 49
Pentium 90s being used by engineering students and others, and
surprisingly it does the job well.

A very crude and quick estimator of server saturation is watching the
number of server processes over time on NW 3.x. If it keeps edging up
then the server is really being pushed hard.
The one above levels off at about 13 (MONITOR shows only the peak value
and hangs on to it). A better measurement is to employ Lanalyzer for
Windows by Novell and look for server overload packets. That analyzer is
a good tool for sizing up your network.

Another part of the equation is how strong the clients are. If they drop
lots of packets under load then matters crawl. LZFW can help one see
that too (see repeated replies, particularly when Pburst is used).
Finally, the wires may carry other traffic and have bridges and routers
in series. What happens to those boxes can have a major impact on your
IPX services.
Joe D.
------------------------------
Date: Sat, 3 Feb 1996 16:10:58 -0500
From: Daniel Tran
Subject: Re: help: high cpu utilization under 3.12

>We are running Novell 3.12 and recently we began experiencing a
>problem that suddenly monitor would indicate 40-50% utilization and
>all disk requests are very slow.

load monitor -p

...with the -p parameter, it will give you the processor utilization.
When you get high utilization, go to monitor, go to processor
utilization, hit F3. At this point you can see what process is eating up
your CPU cycles.
------------------------------
Date: Thu, 22 Feb 1996 01:07:37 -0500
From: ATL1DDJ@aol.com
To: netw4-l@bgu.edu
Subject: Re: High Server utilization

I stayed at work until 6:00 am one day going crazy about the very same
thing. I turned off every workstation and printer and unloaded every NLM
except for monitor, and the utilization was still 98%.

If you go into processor information from monitor, expand the window to
full screen (I think F3 or F6) and monitor every process. You will
notice a large percentage of process time in the idle loop (like 90%).
This is Novell's way of keeping the processor busy while it's not doing
anything significant. As soon as another process starts up, Novell
displays the *true* utilization.

BTW, if you look at the bottom of this processor resource screen, you
will see the result of subtracting the idle loop utilization from the
server utilization. Why Novell did not display this % on the main
monitor screen I do not know.

P.S. In monitor use the help feature (F1). It explains the above in more
detail.
-David
------------------------------
Date: Thu, 22 Aug 1996 22:12:53 -0400
From: Glenn Fund
Subject: Re: Fine tuning Servers

Get hold of BMC's NETtune Pro and let it discover your server(s) and
make recommendations for optimization. Great server monitoring and
optimization tool.
------------------------------
Date: Wed, 23 Oct 1996 20:48:06 -0600
From: Joe Doupnik
Subject: Packets/sec story, with numbers

I just got back to my office tonight after being called with the
classical message "The network is slow!" in one of my student labs. What
happened is sort of interesting.

The NW 4.11 file server (a 486-33 EISA bus machine) registered 100% cpu
utilization, dstrace said not fully synchronized, users were grumbling
audibly. Monitor also said a couple of lan adapters had very high "No
ECB available" counts, climbing quickly. The packets sent and received
numbers were also climbing rapidly. All were in the many millions.
Normally there are no "No ECB available" counts. What the heck was going
on?

I took another sip of coffee, thought, and poked the server keyboard.
No, the wiring (coax) was just fine. Plenty of server memory. The
directly attached printers were going fine, but user apps were extremely
sluggish. The server did look ok except for the ECB loss rate and the
cpu utilization value. Oh.
There is someone running Lanalyzer for Windows, and wow! Look at the
packets per second dial, over in the red zone! 3000 pkts/sec!!! Yikes,
no wonder things were slow with that traffic rate as competition.
Capture a few thousand packets (while grumbles increase in volume) and
they were tinygrams, 64B guys, zillions of them, TCP/IP Telnet, going
from one wire in the room through the server to another wire in the
room.

Ah ha! I know what is happening. It's my grad Computer Networks class
doing their homework assignment. That was to measure throughput versus
packet size (TCP MSS) for various situations, with MS-DOS Kermit acting
as the TCP app at both ends of the connection. Sure enough, someone had
tried an MSS (the TCP payload) of 16 bytes, generating tinygrams. ECBs
were not available because the packet rate was high enough to exceed the
rate at which the server could supply buffers to the NE-3200 boards, and
some packets were consequently lost. (No problem, TCP repeats them and
thus keeps adding load.) Once we stopped the file transfer test
everything was perfectly normal again.

The overhead of handling tinygrams with a bus master board is often
greater than with a simpler port i/o or shared memory board because of
the busy work setting up a block transfer. Hence a simpler board, say an
NE-2000 flavor, would have done less work moving tinygrams than the
better board, and the other way around with larger pkts.

A general rule of thumb on 10Mbps Ethernet is that 1000 packets per
second is a hefty load on machines. Here we were at triple that rate,
and in this case the server was acting as an IP router rather than as a
file server, so we did not have delays while the server's disk drives
were accessed. MS-DOS Kermit has rather efficient Kermit and TCP/IP
protocol stack code, in this case altogether too efficient. By the way,
the throughput was only 684 file data bytes/sec, compared to about
160KB/sec with normal-sized IP packets. All that overhead of headers
with just a few data bytes, plus repeats for lost frames.

If the server were less strong, or some other item in the server were
consuming lots of resources (say running a tape drive for backups), then
the same loss of ECBs would arise simply because the server could not
attend to every arriving packet. In this case the server was healthy but
the packet rate was way out of bounds. And we see the server is not an
IP router which routes packets "at wire speed" comfortably.

It certainly can be handy to be the system manager as well as the course
instructor, because I didn't have to blame anyone else (this time).
Joe D.
------------------------------
Date: Thu, 31 Oct 1996 08:48:17 +1100
From: Adrian Moore
Subject: Re: 4.1 server utilization at 100%

Check out TID 2905856, which recommends preallocation values for service
processes and receive buffers. This article is titled "Additional Notes
for High Utilization Issues". It is a supplement to TID 1005963.

Alternatively: did you recently put LANDR8 on? If so, are you using an
ODI 3.3 spec LAN driver? If not... look to the vendor web site for the
ODI 3.3 .lan driver. I've had one report where high utilization was
caused by using the ODI 3.3 ETHERTSM, MSM, NBI combination with a
pre-ODI 3.3 LAN driver. There was an executable which would check this
for you, but I do not recall which kit it was shipped with. A quick
search of LANDR8 and Client32 hasn't jogged my memory.
------------------------------
Date: Wed, 30 Oct 1996 19:12:53 +0100
From: Urban Svensson
Subject: NW 4 and 100 % utilization

>Jackie D. Firkins had problems with NW 4.1 and 100% utilization.
Have you checked this:

1. There is a newer MSM and ETHERTSM in LANDR9/8.
2. What is Upgrade Low Priority Threads set to? It should be OFF.
3. Are you really using 410PT6?
4. Is Maximum Service Processes hitting the upper limit?
5. Is Maximum Directory Cache Buffers high / hitting the limit?
6. Not yet too familiar with the 3Com XL, but is this not an ISA card?
   Could this cause the server to kneel?
7. Everything else OK? Memory - enough? Disk space at least 15 per cent
   free, and so on?
------------------------------
Date: Tue, 12 Nov 1996 21:51:14 +0100
From: Hakan_Andersson
Subject: Re: 100% utilization -Reply

>>>Every now and then, about once an hour, the utilization of my 4.10
>>>server (486DX50, Adaptec 2742T EISA, Seagate Barracuda 2GB) shoots
>>>up to 100% and the drive goes frantically for several minutes.

This sounds very much like "aggressive mode", which can be triggered by
a volume being 90% full or by a disk not having more than 1000 blocks
free (hardcoded). Note: blocks free does not include the purgeable
blocks, so even though you have maybe 50% free there could be less than
1000 blocks free on a disk.
---------
Date: Tue, 12 Nov 1996 16:35:24 -0500
From: "Martin C. Mueller"
Subject: Re: 100% utilization - Solved

>>>Every now and then, about once an hour, the utilization of my 4.10
>>>server (486DX50, Adaptec 2742T EISA, Seagate Barracuda 2GB) shoots up to
>>>100% and the drive goes frantically for several minutes. Checking the
>>
>>Just a thought, block suballocation cleanup?
>
> Never heard of such a thing. In fact, I can't imagine such a thing.
>And it's never been seen here either.
>
> Seeing NDS activity is usually simplicity itself: console SET DSTRACE
>ON. Watch its screen. In addition, snooping the wire is very useful. Novell's
>Lanalyzer does a good job with this.
> Joe D.

The solution lay in the direction of compression. The cause was telling
the server to immediately compress deleted files (deemed useful at times
of low disk space and never revoked). It turned out that a handful of
users deleted substantial amounts with a certain abundance and
periodicity. This triggered the high utilization and the disk activity
in spite of the server not serving requests. I still don't understand
why the compressing thread's CPU share couldn't be seen in the process
table.

Best of all - it's been mentioned in the books! In this case, the fine
text UTILIZAT.TXT within 410pt6.exe by Rich Jardine. Again I skipped a
seemingly not-so-important technical document to "get work done" >:-(
At least, I remembered there was something :-) . (BTW, the text also
mentions suballocation and NDS as possible causes, among others.)
---------
Date: Wed, 13 Nov 1996 14:26:30 +1300
From: "Baird, John"
Subject: Re: 100% utilization -Reply

>>>Every now and then, about once an hour, the utilization of my 4.10
>>>server (486DX50, Adaptec 2742T EISA, Seagate Barracuda 2GB) shoots up to
>>>100% and the drive goes frantically for several minutes. Checking the
>>
>>Just a thought, block suballocation cleanup?
>
> Never heard of such a thing. In fact, I can't imagine such a thing.
>And it's never been seen here either.
>
> Joe D.

Yup, it does happen from time to time, but I don't know at what priority
it is done under normal circumstances and whether it has a noticeable
impact on server performance. File fragments are stored in chains of
suballoc blocks, with separate chains for each fragment size, i.e.
< 512, 513-1024, etc.
As files are expanded or deleted, holes appear in the chains, and
NetWare will occasionally compact the chains and release unused blocks
for 'normal' storage.
------------------------------
Date: Thu, 14 Nov 1996 09:39:58 -0500
From: Craig Lyndes
Subject: SOLVED! New Epoch creates 100% utilization -Reply

I thought I would share this solution, and warning, with the list.
"Repair time stamps and declare a new epoch" (DSREPAIR) is apparently
very processor and resource intensive. The Novell tech who finally
talked me through the repair said it can create problems in partitions
that otherwise appear to be fine.

When doing directory and partition operations, the Transaction Tracking
System (TTS) must be active or everything comes to a halt. What had
happened to me is that while doing the new epoch, TTS had run out of
resources (64 Megs RAM in an 8 GIG, 250 user 4.1 server - Joe, could
this be where Novell comes up with their rather excessive memory
requirements?) and shut down. No matter how long I waited, nothing was
going to happen. When I powered off, I corrupted the DS databases on
that server so badly that whenever they were loaded the server would go
into an infinite loop. The repair then became a matter of deleting that
replica and the DS databases on that server and recreating them without
losing any objects' attributes - a $200.00 Novell call with the tech
using lots of undocumented switches and commands.

The solution for next time: whenever TTS shuts down or you get -621
errors in dstrace, restart TTS. It will finish what it was working on
and go after the problem again. This may have to be repeated. I also
upped the server RAM to 128 Megs.
------------------------------
Date: Thu, 6 Feb 1997 15:38:32 +1000
From: Mark Cramer
Subject: Suballocation & upgrade to 4.1

>After the upgrade from Netware 3.12 to Netware 4.1, if I like to use the
>suballocation feature on the volumes I have to set it manually and remount
>them. However if I want the suballocation to become *effective* the only
>way I can figure is to backup, delete and restore the entire volume
>contents. Somebody have a better idea?

After the discussions on suballocation here in the last few weeks, I've
been doing a bit of background testing to see what happens with
suballocation's aggressive mode (when a volume goes under 1000 free
blocks). You see some surprising things!

When you're above 1000 free blocks, files start at the beginning of a
block and the file tails go into the suballocation chains. When you go
under 1000... I took a 1gig, 64K block volume down to 590 free blocks
(~40M) and copied a number (~200) of small files to the volume, and the
free blocks went UP! not down. The suballocation process shuffled small
files into the suballocation chains, freeing up full blocks. It
obviously shuffled preexisting files into the chains as well. After
copying about 2000 160-byte files to the disk, I had 623 free blocks (as
reported by John Baird's Vol_Info program).

I then filled the volume with large files; free disk space went to 0
bytes (as reported by DIR and NDIR), but lots of free sectors were left
in the suballocation chains. So I copied another 800 small files to the
drive, no problems, free disk space still at 0.
At no point did utilization on the server become excessive (P133, 32M,
Adaptec 2940UW and an old Fujitsu 1Gig 5.25-inch full-height drive,
NW4.11 2-user test box); it hovered around 10-20%, though the disk
channel audibly was taking a hammering for short periods (tens of
seconds, not minutes).

After you go under 1000 free blocks, NW seems to have no problems
starting files in the suballocation chains and moving other files
around. You will get extra (noticeable) disk activity, and some
performance degradation, but it appears to handle the situation well.
YMMV; this wasn't a rigorous test, just interesting.
------------------------------
Date: Mon, 10 Feb 1997 12:42:36 -0500
From: "Brien K. Meehan"
Subject: Re: SERVER-4.10-3227 error (file compression)

>SERVER-4.10-3227
>Severity=2 locus-2 Class=20
>Compressed files are not being committed decompressed on volume
> SYS due to lack of space.
>
>Could anybody tell me why this message is being displayed ?
>Is 80Mb left on SYS not enough for something ?

Netware 4.1 comes with automatic file compression. It's "on" by default
when you install a server, but you can turn it off as you're creating
the volumes during custom installation. So, if you didn't do that (and
it looks like you ran the "Quick Setup"), it's on.

There's a bunch of settings related to compression. One of them is "Days
Untouched Before Compression", and the default value is 7. If you don't
touch a file in the span of 7 days, Netware compresses it and stores it
on the disk compressed, and passes the savings on to you! When you
"touch" the file after that 7 days, Netware opens it, presents it
decompressed, and writes it back to the volume decompressed (depending
on some other settings...).

You, Sara, are running into another setting, namely "Decompress Percent
Disk Space Free To Allow Commit," for which the default value is 10.
That is, when you open a compressed file, there has to be 10% free space
on the volume in order for Netware to write the file back to the volume
decompressed... and your 80MB free is pretty close to the 10% mark.

Have a look at the documentation regarding compression, or at least have
a look at SERVMAN to see what your settings are, and what your general
compression situation is.
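If compression is a suspect on your own server, the settings described
above can be reviewed from the console before changing anything; on
NetWare 4 servers, entering a SET parameter name with no value should
simply display its current setting (SERVMAN shows the same values). A
minimal sketch:

  set days untouched before compression
  set decompress percent disk space free to allow commit

And, as suggested earlier in this thread, turning compression off
altogether is a single setting (the exact parameter name below is from
memory, so verify it in SERVMAN first):

  set enable file compression = off

Files that are already compressed stay compressed until they are next
decompressed and rewritten, so the change takes a while to work through
the volume.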
------------------------------
Date: Wed, 19 Feb 1997 21:12:36 -0600
From: Scott Hasse
Subject: Re: NCP Requests & 100% Utilization

I have seen 100% utilization, with 70%-80% of the processor time in the
'IPXRTR NCP work to do' process, on a NetWare 4.1 server with approx. 400
users. Unfortunately, these problems are hard to troubleshoot since they
are so sporadic. Novell documentation seems to say that if you get high
utilization in the NCP work-to-do process, it may be passed there from
some other process. If you are loading IPXRTR.NLM and can unload it
without causing a major outage, do so. That may show you the process that
is actually causing your high utilization.

One question to ask is: are you experiencing any significant response time
problems during these high-utilization sessions?

Another question is: do you have a large number of bindery contexts set,
or a large number of bindery printers? Reducing these may help. Bindery
work usually shows up in the DS AES process, but it may cause NCP
work-to-do as well.

Another question: do you have any OS/2 machines? OS/2 may cause high
utilization if you browse printers through the network folder.

If you can, unload third-party NLMs temporarily to try to narrow down the
problem.

Try this: at the server console, type SET COMPRESS SCREEN = ON, then
change to the compress screen. Watch to see whether you are having a
larger than usual number of decompressions, or the decompression of a very
large file. If all is well, you should have no compressions during the
day. I have had compress/decompress utilization show up in the NCP
processor thread instead of the compress/decompress threads.

Also, try this: download the NLM that disables packet burst on the file
server and load it. Fast PCs with packet burst can at times thrash a
server, although you would notice a significant performance decrease if
this were the case.

The best thing you can do, though, is to get a LANalyzer or a Sniffer and
put it on the same subnet as the file server. Then, when one of these 100%
utilization sessions starts, you can see exactly what is going on. Don't
get too excited about any one individual sending a large number of
packets. Watch for users that have a consistent, but not necessarily high,
number of packets to and from the server. Then capture their traffic and
see what it is. If it is what I had, you will see a large number of
repeated bindery requests from an OS/2 station.
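
A minimal illustration of the bindery context point above. The container
names here are invented; the idea is simply to keep the list down to the
contexts that bindery-based clients and printers actually need, since each
context in the list adds bindery emulation work for the server:

    SET BINDERY CONTEXT = OU=ACCT.O=WIDGETCO;OU=MFG.O=WIDGETCO

Multiple contexts are separated by semicolons, and the setting can be
placed in AUTOEXEC.NCF so it survives a restart.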
---------
Date: Wed, 19 Feb 1997 20:32:35 -0600
From: Joe Doupnik
Subject: Re: NCP Requests & 100% Utilization

>I have seen 100% utilization, with 70%-80% of the processor time in the
>'IPXRTR NCP work to do' process, on a NetWare 4.1 server with approx. 400
>users. Unfortunately, these problems are hard to troubleshoot since they
>are so sporadic. Novell documentation seems to say that if you get high
>utilization in the NCP work-to-do process, it may be passed there from
>some other process.
-----------
Supplementing your good comments. NDS itself can, and often does, run the
CPU to very high utilization values without much network traffic
occurring. Apparently digesting the sundry NDS files in _NETWARE is CPU
intensive. Novell is aware of this behavior, and last spring one of the
guys in the NDS group said they were experimenting with ways of reducing
the quadratic-time sort/merge aspects of some NDS updates. We may infer
that the larger and more varied the NDS files, the longer the CPU peaks
may last. If there are enough replicas needing synchronization on top of
this, and the system can't finish the update job in the allotted 22
minutes (as I recall), then NDS life becomes difficult for mere servers.

By itself, high CPU utilization is not a problem. If packets can't be
processed, however, then we get worried. My casual observations suggest
that packets do get processed, but folks with more heavily loaded servers
and vastly larger NDS databases are better judges of this.
        Joe D.
------------------------------
Date: Sat, 22 Feb 1997 14:22:22 -0600
From: "Alan L. Welsh"
Subject: lockups going to DOS mode

>One of our servers locks up in 100% utilization for about 4-5 minutes
>and drops user connections when trying to access the DOS partition.

Although I'm not the engineer most familiar with this problem, a year ago
I believe we encountered, confirmed, and then reconfirmed with Adaptec
that they have a bug that sometimes causes this when the server goes to
real mode (the DOS partition) while communicating with their ASPI device
driver.

>This could happen when saving Startup.Ncf or choosing "Product
>Options" from Install.Nlm.
>It actually seems to be stuck in DOS because it's possible to restart
>the server by pressing "Ctrl-Alt-Del", even though the NetWare console
>appears on the screen.
>
>I have gone through the adapter configuration with Storage Dimensions
>tech support and it seems OK. They couldn't help us and have assigned
>the case to a higher support level, but things seem to happen slowly
>there.

If you'd like, have them call me directly so the head device driver
engineers can speak directly to our head of engineering to reconfirm (and
maybe fix) this problem. We no longer expose this 'feature', but would be
happy to help resolve it.

>The latest operating system patches and disk drivers are applied.
>This issue is now exposed because I'm rebuilding the server with
>NetWare 4.11. The problem also existed under NW 3.12, but then I
>related it to the NW FAQ, where someone wrote about a similar problem
>due to a conflict between Intel LANDesk Virus Protect (PSCAN.NLM)
>and the NW patch kit 312PT6, which then applied to our environment.
>
>I read somewhere on Usenet that Adaptec controllers could have
>problems while switching between protected and real mode, related to a
>BIOS issue.

It is a NetWare ASPI driver issue, not a (ROM) BIOS issue.

>Other ideas are that it could be a PCI compatibility issue between
>the server and the controllers, or maybe the fact that the controllers
>have different BIOS revision levels.
>
>This is a server that is planned to be moved across the country to a
>location where we don't have regular support, so this issue has to be
>resolved before we can move it there.

(My only plug) Go to our website at http://www.cdp.com to find out how
remote NetWare novices could easily restore your server or individual
files if it does crash.

Alan L. Welsh, president, Columbia Data Products, Inc.
snapback@ix.netcom.com  1800-613-6288  http://www.cdp.com
------------------------------
Date: Tue, 4 Mar 1997 11:28:12 -0700
From: "Steven W. Smith"
Subject: 100% utilization, MONITOR.NLM 4.34

IMHO, the first thing to do when diagnosing a 100% utilization condition
is to unload and then reload MONITOR.NLM. On one of our fully-patched 4.11
servers using MONITOR.NLM 4.34, I've seen it stuck at 100% utilization,
usually when BackupExec 7.11 is having some sort of difficulty. In all
cases so far (4 times in the last couple of weeks), unloading/reloading
MONITOR "solved" the problem, with the utilization number dropping to
something realistic.

Some time back, one of our 3.11 servers had a habit of becoming stuck at
89% "utilization". Unload/load fixed that as well. I don't recall seeing
this discussed here and didn't find anything that seemed to relate to it
by searching the TIDs. Mostly, I'm just curious whether others have
noticed this phenomenon.
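
For reference, the console sequence being described is simply:

    UNLOAD MONITOR
    LOAD MONITOR

If the utilization figure immediately drops to something realistic after
the reload, the 100% reading was a stuck display in MONITOR rather than a
genuinely saturated CPU; if it stays pegged, keep troubleshooting the
server itself.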
------------------------------
Date: Sat, 22 Mar 1997 12:13:55 GMT
From: Teo Kirkinen
Subject: Re: NCP Requests & 100% Utilization

>LRU sitting times run from 40 seconds to 2 minutes during these times.
>Cache buffers were at roughly 78%. My compression start and end times
>are in the 3-6am range. We're running a Compaq 1500R with 11 GB and
>160 MB of RAM -- compression and BSA on all but the SYS volume (600 MB)
>and no additional name space (yet).

I would add more memory to the server until the LRU sitting time stays
above 15 minutes. This has made our servers more responsive during peak
work hours. The magic number comes from some NDS update interval; I
learned it at last year's BrainShare Europe.
------------------------------
Date: Mon, 14 Jul 1997 09:18:41 +0100
From: "Erik Bos, AMC afd. PC-LAN"
Subject: 100% util

We got a lot of help from the 'Troubleshooting High Utilisation on NW4.x'
document on Novell's web site. After checking everything in that document,
70% of those problems were gone. After we upgraded to NW 4.11 there were
no problems like this left.
------------------------------
Date: Sun, 1 Mar 1998 12:46:47 +0100
From: Camaszotisz Gyorgy
Subject: Re: How to limit NLM CPU usage

>Does somebody know whether it is possible to limit the CPU usage of an
>IntranetWare NLM program? This would be very useful on some occasions,
>when an NLM goes into an abend taking 100% of the CPU power and hanging
>up the whole server.

Go to Monitor.nlm, Scheduling Information, search for the thread taking
too much CPU time, then press grey + several times. There is also a
command-line equivalent; I don't remember the name, maybe LOAD SCHDELAY?

But from a developer's point of view, this setting only affects the sleep
time before the ThreadSwitchWithDelay() API call returns to the caller. So
if your NLM does not use this API call, you will have no success. And that
is the case almost every time, because the SDK documentation suggests
using this call only when your NLM is waiting for a resource to become
available.
------------------------------
Date: Wed, 4 Mar 1998 07:23:17 -0500
From: "David G. Pile"
Subject: Re: Intermittent high utilization

>I'm back to the list with another strange problem: a NetWare 3.12
>server with intermittent high utilization of up to 70%, which then drops
>to the norm of 1% to 5% after about 2 to 5 minutes. It causes
>workstation apps to crash.

Any chance someone has RCONSOLE open and minimized in the background? This
is a problem particularly from a Win 3.1x machine - not so much from 95,
but it can still peg servers.
------------------------------
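
A note on the ThreadSwitchWithDelay() point in the "How to limit NLM CPU
usage" message above: the MONITOR grey-+ / SCHDELAY delay only takes
effect where an NLM's own code yields through that call. The sketch below
is a hypothetical illustration of a CLIB NLM worker loop written that way;
the header name and the DoOneChunkOfWork() routine are assumptions based
on the old CLIB SDK and are not from this thread, so treat it as a sketch
rather than working production code.

    /* yieldloop.c - hypothetical CLIB NLM worker loop.
       Compile with the NetWare CLIB SDK and link as an NLM. */

    #include <stdio.h>
    #include <nwthread.h>   /* assumed CLIB header declaring
                               ThreadSwitchWithDelay() */

    /* Stand-in for whatever the NLM really does per iteration. */
    static void DoOneChunkOfWork(void)
    {
        /* ... real work would go here ... */
    }

    int main(void)
    {
        long i;

        for (i = 0; i < 1000000L; i++)
        {
            DoOneChunkOfWork();

            /* Yield after every chunk. Because this is the ...WithDelay()
               call, any delay imposed on the thread from MONITOR's
               Scheduling Information screen (grey +) or via SCHDELAY is
               honoured here; a plain ThreadSwitch() would only yield. */
            ThreadSwitchWithDelay();
        }

        printf("Worker loop finished.\n");
        return 0;
    }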