-----------------------------------------------------------------
NOVABEND.DOC -- 19980324 -- Email thread on NetWare ABnormal ENDs
-----------------------------------------------------------------

	Feel free to add or edit this document and then email
	it back to faq@jelyon.com




Date: Thu, 2 Nov 1995 20:39:45 GMT
From: Edwin Cleton <ecl@INTER.NL.NET>
Subject: Re: Netware 4.1 crashes with NETX

>A little more info, I got to see the screen this time after the abend:

>   Page Fault Processor Exception (error code 01700000)
>   Running Process: Interrupt Service Routine (nested count 2)

TechDoc  : Abend -> Polling process
Composer : Ecl@inter.NL.net (LESi)
Targets  : NetWare *.*
Keywords : Polling process abends
Revision : Fri  04-08-1995

Reasons for an Abend on Polling process (i.e. Running Process: Interrupt
Service Routine)

1) Not all required patches loaded or loaded in the wrong .ncf file
2) Insufficient server memory
3) Even with enough memory, fragmented memory pools due to SYS: mounting
   before all memory was registered (>16 MB)
4) Energy-saving feature(s) enabled in the server's PC BIOS
5) Buggy keyboard BIOS (a different chip than the system BIOS)
6) Using a flaky/invalid IRQ on the server NIC (i.e. irq 2/9 - irq 14/15)
7) A flaky third party (backup?) nlm
8) Overheating (inside server or harddisk(s) case(s))
9) Static interference via extended cables
10) Outdated pserver.nlm
11) Server NIC going bad
12) Server has been downed and started without a hard reset

And of course, a real new OS bug is also possible.

------------------------------

Date: Thu, 18 Jan 1996 20:03:13 GMT+0001
From: "Szekeres Bela, Jr." <SZEKERES@EVT.BME.HU>
Subject: Re: Debugger step "2" -sometimes not enough

>>WHAT IS THE KEY SEQUENCE?? I'm going mad looking for it.. Help!
>
><FAQ snippet>
> b. Drop into the debugger using the <Left shft><alt><Right
> shft><esc> key sequence.

It happens quite often, that you can use only one of the two Alts.
(Which one depends on the computer.)

But on a DTK 486 even the sequence and speed were important. It was
<Left shft><Right shft><Right alt><ESC> ...in this sequence and
within about 1 second.

Do not ask why, it did not work otherwise.

------------------------------

From:	Teo Kirkinen <kirkinen@cc.helsinki.fi>
Subject: Re: Debugging after an Abend
Date:	Fri, 19 Jan 1996 18:07:16 +0200 (EET)

>Teo, are you serious about "386debug" above?

Yes, I'm serious. It doesn't work when the server has NOT
abended but when it has abended, it works also in situations
where shift-shift-alt-esc just reboots the server. It has
been very useful when we diagnosed PFPE where the running
process was Server xx and the "normal" way of getting in
the debugger didn't work.

I learned it from somebody from England, most likely the owner
of the SOFTRACK mailing list but I can't recall his name. If
I remember correctly 386debug is also documented in some older
version of the server SDK.

------------------------------

Date: Fri, 12 Apr 1996 00:20:00 PDT
From: "Randy G Hutchings, Contr CZE" <HutchiRG@GPS1.LAAFB.AF.MIL>
Subject: Great Info!  ABEND Recovery

I had an ABEND yesterday and this information helped me
restore my 4.1 server in seconds. This information was taken
from the April edition of LAN TIMES.

IF the ABEND is NOT a Processor Exception variety, follow these steps:

Now enable the server's internal NetWare debugger by simultaneously
 pressing both <shift>keys, the <Alt>key, and the <Esc>key on
the server console.
NOTE: If your server was locked using the SECURE CONSOLE
 command, you will not be able to do this.

Type the case-sensitive command
EIP=CSleepUntilInterrupt  (that is a lowercase L and a capital I)
Type G <Enter>, and the server should now appear to come back to life.
As soon as users have saved their data, use the DOWN
command to bring the server down gracefully and then reboot it.


IF the ABEND is a Processor Exception type, follow these steps:

If the ABEND message has the words NMI or Machine Check in it,
then your chance of restarting the server and bringing it down are
somewhat reduced because failed hardware may be the cause of
the problem. Here are the steps to follow:

Now enable the server's internal NetWare debugger by simultaneously
 pressing both <shift>keys, the <Alt>key, and the <Esc>key on the
server console.
NOTE: If your server was locked using the SECURE CONSOLE command,
you will not be able to do this.

Type T <Enter>. The server will display some debugging information
and return to the # prompt.

Type G <Enter>. The server should now appear to come back to life.
 Again, any client connections that are sitting at an Abort, Retry, Ignore
message should be retried. Users should be able to save their files
and exit their Application. Use the DOWN command to bring the server
down and then reboot it. If the server comes back up with your data intact,
you are ready to start trouble shooting.


IF the ABEND message has GPPE or Page Fault in it, follow these steps:

Type .R <Enter> to display the name of the running process.
Write this information down for later trouble-shooting.

Type ? <Enter> to display the NLM and function names that the server
was executing during the ABEND. Write this down too.

Type .M <Enter> to display the currently loaded modules and their address.
 Write down the starting address and length of the module that failed.

Type RC <Enter> to display the contents of the control registers.
 Write down the value of the CR2 register, which contains the address
that produced the page fault.

Type the case-sensitive command
EIP=CSleepUntilInterrupt  (that is a lowercase L and a capital I)
Type G <Enter>, and the server should now appear to come back to life.
 As soon as users have saved their data, use the DOWN command to
bring the server down gracefully and then reboot it.

The server should come back up with your data intact; you are then
 ready to start trouble shooting. A good place to start will be the
NLM that caused the problem: remove it and contact its author.
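
[Sketch: one way to use the notes above afterwards. Given the module
start addresses and lengths copied from .M, the function below reports
which module contains a given address (for example the EIP shown by the
debugger, or the CR2 value) and the offset into it. The module names and
addresses are made up for illustration, and Python is used only as a
convenient notation; nothing here runs on the server itself.]

```python
from typing import List, NamedTuple, Optional, Tuple


class Module(NamedTuple):
    name: str
    start: int   # starting address as written down from .M
    length: int  # module length in bytes as written down from .M


def find_module(modules: List[Module],
                address: int) -> Optional[Tuple[Module, int]]:
    """Return (module, offset into module) for the module whose address
    range contains `address`, or None if no noted module covers it."""
    for mod in modules:
        if mod.start <= address < mod.start + mod.length:
            return mod, address - mod.start
    return None


# Hypothetical notes; real values come from the server's own .M listing.
modules = [
    Module("EXAMPLE1.NLM", 0x00F20000, 0x8000),
    Module("EXAMPLE2.NLM", 0x00F40000, 0x2400),
]

hit = find_module(modules, 0x00F41234)
if hit is not None:
    mod, offset = hit
    print(f"{mod.name} + {offset:#x}")
```

An address that falls outside every noted module (0, for instance)
returns None, which is itself worth writing down.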

------------------------------

Date: Mon, 3 Jun 1996 11:33:23 +-100
From: Stephen Knight <stephenk@FIREFOX.CO.UK>
Subject: Server Freezes

The following may be of use to some of you who've been talking about server
freezes etc. It is a technique one of our dev. guys brought to my
attention.  It has got me out of ALL of the freezes I have had by doing a
controlled abend of the server by forcing a memory parity error when I
couldn't use the normal keyboard method.

Now what use is that?  Well, you can then hopefully bring the server
back up using the debug commands before getting users out and taking it
down neatly...

Now I don't suggest you run this in your production server permanently but
if you need to catch a particular problem and are confident in what you are
doing then you can wait for the lock to occur and it may help.... just
don't come to me if you destroy your brand new server!

Someone who knows more than me about the debugger (not hard) can help you
more with what you can get out of the debugger once you are in there but it
may be possible to identify which NLM it was stuck in, and more importantly,
get the server going again with the commands shown on the list a while ago:

EIP=CSleepUntilInterrupt
G

The hardware needed is as follows :

----------------------------------------- | Back of PC
8 bit ISA Slot (or 1/2 a 16 bit slot)   A |-----O----+
----------------------------------------- |     1k   / push
					B |----------+ switch

This is a view of the *ISA* slots when looking down on the motherboard.
Pin B is the NMI line normally triggered by a memory parity error (you
need to have this turned on in the BIOS in some PC's) and Pin A is 0v.
If you momentarily ground the NMI line (through a 1k resistor) then an
abend occurs...  I built mine onto an old Dragon cartridge (Hi Roy!) but
an 8 bit network card cut down to remove all but the edge connector or a
proper prototyping card would do just as well...

Doesn't work on all servers / PC's, possibly doesn't work for all freezes
but hope it works for you.

Credit must go to Carl Young who told me about all this stuff...

------------------------------

Date: Mon, 8 Jul 1996 16:46:45 +0100
From: John Bazeley <johnb@JSB.CO.UK>
Subject: Re: Debuggers, Abends and NLM's

>I've seen some recent mail about switching to the debugger after an
>ABEND, in order to identify the offending process and send it to
>sleep.  Being a newbie to Novell, I've only just discovered the
>debugger.  Since I've no info on the commands, I've currently left it
>alone, however I would be interested if anyone can give me a few
>pointers.  I especially want to know about stopping an NLM process
>after an ABEND.  The reason being we currently have a 4.1
>server which periodically (3 - 4 Weeks) falls over.

I think this is in the FAQ [Floyd: Yup, section H.54.2 & this file],
but here goes.

To enter the debugger, do alt, both shifts and esc at the same time,
OR type 386debug (only works from abend)

You'll be presented with some hex stuff which, unless you wrote
the crashing NLM, you can probably ignore.

Type EIP=CSleepUntilInterrupt
Type G

You'll get your server back, probably. You should probably down it ASAP.

Other interesting debug commands:

v: cycles through all screens that were active at the time of the abend
.p: gives you a list of all running threads and what they were up to at
   the time of the crash
?: gives you the address of the crash and the 2 nearest exported symbols.
h: help on commands
.h: help on dot commands

------------------------------

Date: Fri, 20 Sep 1996 04:36:06 -0400
From: Mark Snape <MARKS@MICROLISE.CO.UK>
Subject: An easy ABEND

Those of you that like crashing NW4.1 servers might like to try the
following:

Hold down a key on the console keyboard........

Until it starts beeping (buffer full)

Press Enter.

Whoops!

I suggest that you don't do this in a user environment unless it's your
last day.

I have never come across a documented fix for this, but if
anyone knows of one etc, etc...

We came across this when one of our guys leaned the keyboard against
the system unit. We were searching for minutes trying to find the
source of the beeping, when I picked up the keyboard the beeping
stopped and I unfortunately pressed Enter to clear the command line.
Following that, we have tried it on a number of boxes, and got the same
result in each case.

------------------------------

Date: Thu, 10 Oct 1996 16:10:39 -0600
From: Joe Doupnik <JRD@CC.USU.EDU>
Subject: Re: Warning: Netware Web Server is *very* buggy

<snip>
----------
	As Mr. Andersson of Volvo remarked, there is a Web server v2.51.
As I looked at my CD-ROM collection I found it and just installed it on
a NW 4.11 server. Now this is part of NW4.11/beta etc rather than the
regular distribution channel, but the thought is there may be fixes in
there for your situation (though the docs do not mention anything like
that). How you might get v2.51 out of Novell is beyond my understanding.
	Server abends from wild pointers are nasty things. I wonder if,
by chance, you have read & write fault emulation turned on (I have them
turned off, but notification turned on). That's SET 2 (memory) at the
console. Also, "allow invalid pointers=off" on my machinery, same SET
area. There is one other parameter of interest in this group, "alloc
memory check flag=on/off", which is off here but you may wish to turn
on for safety's sake.
	I would not be surprised to learn wild pointers come from the Perl
NLM, given the things Perl does with memory.
	Lastly, one needs to pick and choose the version of the underlying
tcpip.nlm, and there are a number in the archives. I'm at version 3.00
because the latest stuff blocks RDATE.
	Joe D.

------------------------------

Date: Mon, 14 Oct 96 18:39:19 -0700
From: Randy Grein <rgrein@halcyon.com>
To: "NetWare 4 list" <netw4-l@bgu.edu>
Subject: Re: Console.log problem

>I have a NetWare 4.1 server that's rebooting by itself every couple of days
>at random times. I checked the vol$log.err and it showed this message.
>
>volume mount had the following errors that were fixed:
>proble with the file console.log, length extended.

Wrong order: the crash and recovery cause this message; it's not the
message causing the crash. If you get really stuck check out Alexander
Lan's excellent Server Protection kit. It traps most crashes, and
provides detailed information on nearly all the ones it can't stop.
Check out http://www.alexander.com.

------------------------------

Date: Mon, 28 Oct 1996 16:15:12 +1100
From: Adrian Moore <C-Moore@MAIL.DEC.COM>
Subject: Re: DEBUG HELP

>Can someone please explain to me how does one EFFECTIVELY use the
>debug utility after an abend?

A couple of things I use:

Use: ?  to show the running process <- if it's always the same it
could be the cause of the problem

Use: dds  to show the contents of the stack (decodes addresses, so you
can see what else has been running recently). Ignore things like
MONITOR and CONSOLE ;]

Use: ? <4 byte address> to find out which code segment of which NLM
the address corresponds to

See the app notes: Resolving Critical Server Issues (Feb 1995) & Abend
Recovery Techniques for NetWare 3 and 4 Servers (June 1995).
TABND2.EXE off http://support.novell.com/ contains the latter app note.

If anyone has any cute tricks regarding the debugger I would also love
to hear them. There are some interesting titled books in the FAQ for
this list which I am yet to track down, though.

In particular, on Windows NT you can run the debug utilities on an NT
crash dump to pull out a text file of the most important information,
like running processes, the state of all threads, loaded files, etc.
Has anyone written/seen a utility like this? Novell don't have
something similar for general distribution...

------------------------------

Date: Mon, 28 Oct 1996 09:54:48 -0600
From: James Federline <federlin@CHEM.UMN.EDU>
Subject: Re: DEBUG HELP

>Can someone please explain to me how does one EFFECTIVELY use the debug
>utility after an abend ?

Novell's AppNote entitled "Abend Recovery Techniques for NetWare 3 and 4
Servers" describes this in detail. It's 30 pages long - I won't try to
transcribe it here...

Topically, you must first analyze the type of abend before you can decide
what things in the debugger are helpful to do. My goals going into the
debugger are to 1) gather as much data on the abended state of the
machine as possible, 2) attempt to restart the server thru thread
quarantine (except for an NMI, then trace-and-go) for a graceful shutdown.

If you are struggling with a particular type of abend (Page Fault, GPPE,
NMI, Machine Check, invalid opcode, or a software exception of some
sort), some of us might be able to help you follow the right path.

Here's some basics:

- the Abend message will tell you the breed of abend you have experienced,
	write it down
- to redisplay the abend message, once in the debugger, type .a
- the R command displays registers and flags (not immediately useful,
	except for EIP - the CPU's instruction pointer)
- the ? command will display the NLM and function in the NLM that EIP
	is pointing at. Since the server froze its state for you
	this is the most likely thing that crashed your server. Now
	whether or not this module is bad is another story - another
	module could have passed it bad data, for instance.

	Also pay attention to the functions (previous, current and next)
	if shown. It probably won't be shown if EIP is at SERVER.NLM.
- the .R command shows the running process and info on that process, as
	well as a couple three lines of core dump around the instruction
	pointer, with ASCII translation from hex for more insight.
- the RC command works only in NetWare 4. The only thing I can use of
	this is the CR2 register - if this is listed as 00000000, you
	could have buggy software. The appnote says that this address is
	frequently used by software that mistakenly dereferences a null
	pointer. Novell says this is a very common problem.


To recover from a Page fault, GPPE, invalid opcode, or a software
exception, try this, and then down and restart the server. This might
quite possibly give you a chance to let NetWare write any cached data to
disk and let any important processes finish up. Of course, the following
quarantine method assumes you don't need the process that abended the
server to continue running (at least in a hobbled state).

1) gather info with commands above.
2) enter this:
	EIP = CSleepUntilInterrupt
3) enter  G , and the server will attempt to continue to run from the
	point of the abend, minus the thread you just quarantined.
4) have any clients connected attempt to close files and let NW flush
	it to disk.
5) issue the DOWN command and then restart your server and do something with
	the information you gathered (call tech support, apply patches,
	buy more RAM, etc...)

I've used these methods to deal with two ARCserve bugs. It makes a real
difference when you can provide the tech support person with debugger
output - in both ARCserve cases, the support rep was able to tell me
exactly what was buggy and why, and what patch to download and apply.

An NMI or machine check requires a method called "trace and go", quite a
bit different from thread quarantine, since the machine state is not
completely preserved, and it's iffy if it will work.
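
[Sketch: the choice between the two recovery methods described in this
file, written out as a small Python function. It assumes the abend type
can be recognised from keywords in the abend message, which is a
simplification. The T then G sequence for trace and go is the one quoted
from LAN TIMES earlier in this file; the function name is hypothetical.]

```python
import re
from typing import List, Tuple

# Keywords taken from the messages in this file. Matching on them is an
# assumption; real abend messages may be worded differently.
HARDWARE_PATTERN = re.compile(r"\bNMI\b|machine check", re.IGNORECASE)


def suggest_recovery(abend_message: str) -> Tuple[str, List[str]]:
    """Return (method name, debugger commands) for an abend message."""
    if HARDWARE_PATTERN.search(abend_message):
        # Machine state is not completely preserved: trace, then go.
        return "trace and go", ["T", "G"]
    # Page fault, GPPE, invalid opcode or software exception:
    # park the offending thread, then let the server continue.
    return "thread quarantine", ["EIP=CSleepUntilInterrupt", "G"]


method, commands = suggest_recovery(
    "Page Fault Processor Exception (error code 01700000)")
print(method, commands)
```

Either way, gather the debugger information first and DOWN the server
as soon as users have saved their work.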

The above is always preferable to just power cycling the box in my
opinion - it could mean the difference between an intact filesystem, or a
scribbled mess. I personally hate using VREPAIR... :)

------------------------------

Date: Tue, 12 Nov 1996 09:44:18 +1000
From: Greg J Priestley <Greg_J_Priestley%PKF@PKF.COM.AU>
Subject: Re: Forcing Abends

You can try the Novell Consulting site
	http://www.novell.com/corp/programs/ncs/toolkit/main.html

I believe they have a utility.

------------------------------

Date: Tue, 10 Dec 1996 17:07:35 -0000
From: John Bazeley <johnb@JSB.CO.UK>
Subject: Re: Abend Troubleshooter Request

>Does anyone have a book or product they recommend for Server
>Abend Troubleshooting ?

1. The list FAQ. See Floyd Maxwell's post yesterday for where
   and how to get it.

2. support.novell.com: search for file tabnd2.exe.

3. 3.12a 'system messages' manual (not too useful, just a list).

4. 4.11 'supervising the network' manual.

5. SDK docs. Plenty info on the best ways to cause abends.

6. NetWare Application Notes, June 1995  "Abend Recovery Techniques
   for NetWare 3 and 4 Servers"

You may be interested to know that 4.11 has an automatic
abend recovery setting, which seems to work OK.

------------------------------

Date: Tue, 7 Jan 1997 20:57:46 -0600
From: Joe Doupnik <JRD@CC.USU.EDU>
Subject: Re: Server Freezing

>>I have 4 Netware 4.1 servers and over the last month or so the main server
>>has been freezing about twice a week and the only way to clear it is to
>>cycle the power. I have been investigating this as a hardware problem
>>replacing memory etc.
>>
>>Today two of the other servers also froze and again could only be cleared
>>by cycling the power.
>
>This sounds familiar.  In my case its three identical 3.11 servers, so
>your problem may be unrelated but it may be worth a shot.  Try changing
>the network card.
	<details omitted>
>What I'm guessing is happening is that either the 3C595 or its driver
>does not handle errors properly.  I may be way out on this one so anyone
>is welcome to jump in and point out my errors, but for now I'm in search
>of a new 10/100 card.  Although we don't run 100 Mbit yet, I'd rather not
>leave a 509 in a server any longer than needed.
----------
	I think you have it right John. Mangled frames can do wondrous
things to a board and/or its driver if the designers didn't prepare for
such eventualities. Way back in the dark ages (NW 2.0a and such) when
monolithic IPX was built from .obj modules and Ethernet_802.3 was the
only choice we (my place) had machine crashes left and right when servers
were placed on the new backbone. It turned out the illegal frame tripped
up "real" machines and legal frames tripped up the NW servers of that era.
My NW 2.0a server would live about twenty minutes on the backbone.
	Time passed. I binary edited .obj files to use Ethernet_II frames
for IPX, and so on. Eventually more robustness was designed into various
drivers and machines began to cohabit the wires again. To this day we forbid
Ethernet_802.3 on the main wires. Mind you, this is with otherwise rational
frames except for the missing 802.2 interior on IEEE-802.3 style Ethernet
frames. Given wider variations of construction, some not legit, anything
can happen. [Story is in ethernet.txt, netlab2.usu.edu, cd misc, and in the
list's FAQ.]
	Checking costs time and memory, and hence is a pain in the butt. But
that's the price of robustness. There is also the well known machine test of
putting 3 year olds at a keyboard. They have killed many big machines, by
simply holding down a repeating key. To this day a LOT of systems are opened
by accepting strings from abroad which exceed buffer lengths (the basis of
the infamous Internet worm of several years ago).
	It could well be some device on the net is putting out "sensitive"
frames which kill the drivers. Some "self test" programs of boards emit a
fast paced stream of broadcasts, etc. Broken boards emit all kinds of things.
A packet monitor might help spot suspicious stuff.
	Joe D.

------------------------------

Date: Tue, 28 Jan 1997 13:08:47 -0600
From: "Mike Avery" <mavery@mail.otherwhen.com>
To: netw4-l@bgu.edu
Subject: Re: couple questions...

>>Also, you should have as much free space on your DOS partition as
>>you have memory in your server so you can do a core dump to disk
>>quickly and efficiently.

>Could you elaborate a bit about the circumstances which cause a
>memory dump, capabilities of dump analyzer software, and does the
>dump space in the DOS partition need to be contiguous? Are there
>any serious consequences of not allowing a sufficient space to
>contain the entire core image?

When a NetWare server abends (at least up to 4.11... I haven't
installed that yet) you are offered the chance to dump memory to disk
or to shut down the server.

If you have an abend every 3 months or so, it's probably not a big
deal.  Maybe you'll update your drivers and patches and see if it
goes away.  If you abend often enough that your threshold of
tolerance is exceeded, you'll probably hassle your vendors.  So if
the message suggests "Abend in module DripBack", you'll call the
dripback help line.  If the problem continues long enough, they'll
probably ask you for "a dump next time you abend".

If you dump to floppy disk, it will take a while.  About 10 minutes
per meg, plus whatever wasted time results from not having the next
diskette ready to go.  The NetWare diskette handler is remarkably
inefficient.
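
[Sketch: the floppy arithmetic above, worked through. It assumes the
dump is about as large as installed RAM (per the quoted advice on DOS
partition size) and that "1.44 MB" diskettes holding 1440 KB are used;
neither figure comes from Novell. The 10 minutes per meg is the estimate
given above.]

```python
import math
from typing import Tuple

MINUTES_PER_MB = 10   # estimate quoted in the message above
DISKETTE_KB = 1440    # assumption: "1.44 MB" 3.5-inch diskettes


def floppy_dump_estimate(ram_mb: int) -> Tuple[int, int]:
    """Return (minutes, diskettes) for dumping `ram_mb` megabytes of
    server memory to floppy, assuming dump size equals RAM size."""
    minutes = ram_mb * MINUTES_PER_MB
    diskettes = math.ceil(ram_mb * 1024 / DISKETTE_KB)
    return minutes, diskettes


minutes, diskettes = floppy_dump_estimate(64)
print(f"{minutes} minutes, {diskettes} diskettes")
```

For a 64 MB server this works out to 640 minutes (over ten hours) and
46 diskettes, before counting any time lost swapping disks.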

By being able to dump the memory image to hard disk a lot of time is
saved.  Then you copy the abend core file to another server (assuming
you have more than one server), laplink it to another computer, or
somehow get it off the server's DOS partition.  Then PKZIP will
slash the size of the file, and it can be put on diskettes and
mailed, or emailed directly to your vendor.

As to what the dump is for, the vendor can examine it and see
everything in your server's memory.   Register contents, data at
pointers, stack levels, what was stacked, what drivers are loaded,
what versions of what drivers are loaded, and on and on and on.

There are some utilities around to let you take a dump apart, but
they tend to require programming experience to unravel the contents
of memory.

The dump is just a DOS file, so it doesn't need to be contiguous.
And the consequence of not having the space is that if you need to
capture a memory dump, you'll be using diskettes.

(Note - you CAN dump to another server, if you connected, logged
into the other server, and mapped a drive to a volume where space for
the memory dump is available  before you launched NetWare on the
server in trouble.  However, in practice I've found that this doesn't
seem to really work.  By the time I have my abend, I've lost the
connection to the other server.)

---------

Date: Tue, 28 Jan 97 01:38:59 -0800
From: Randy Grein <rgrein@halcyon.com>
To: "NetWare 4 list" <netw4-l@bgu.edu>
Subject: Re: couple questions... -Reply

>There is now a utility to allow you to dump to the dos partition and
>start the server back up and then copy the dump up to a volume on the
>server.  Not sure of the name, but it is on the web site.

I can go you one better. Check out Alexander Lan's SPK. It provides very
good memory protection, trapping a large range of crash types, provides
auto restart features (both similar to what's available in 4.11; not
surprising because Novell bought the technology from them), provides a
much smaller crash file option of around 6 megs (strips off the cache
buffers and other relatively unimportant information), and finally
provides an excellent diagnostic utility for debugging the crash files!

BTW, regarding excessive dirty cache buffers - under some circumstances
simply increasing the max concurrent disk cache writes will significantly
improve write performance, dropping the number of dirty cache buffers.
I've had to increase it as high as 800 on mid-sized systems, and along
with other tuning parameters was rewarded with significant improvements
in throughput - 30-50% at times. Of course, this is HIGHLY dependent on
the environment and data access patterns, so the usual disclaimers apply.

---------

Date: Wed, 29 Jan 97 08:45 EST
From: Ed Marczak 
To: netw4 l <netw4-l@bgu.edu>
Subject: Re[2]: couple questions... -Reply

>There is now a utility to allow you to dump to the dos partition and
>start the server back up and then copy the dump up to a volume on the
>server.  Not sure of the name, but it is on the web site.

It's called imgcpy.exe - it's in the 41 patches directory.

------------------------------

Date: Thu, 30 Jan 1997 01:01:28 +0100
From: "Arthur B." <arthur-b@zeelandnet.nl>
To: <netw4-l@bgu.edu>
Subject: Re: Read from a non-present page

>I have been receiving the following message on my console for a while
>now.  Once I get this message my 4.1 server usually crashes shortly after.
>
> 1-28-97  12:02:51 am:    SERVER-4.10-748 
>     Read from a nonpresent page.
>     Process: Server 16.
>     Module: NetWare Server Operating System
>     Code offset in module: 0001C824h.
>     Access address: 00008000h.
>
>I mostly understand what the error message means but how do I find out
>which NLM is going off the deep end.  According to the Red book, the
>Process: line should tell you which NLM is causing the problem but in my
>case, I just get Server #.  The Server # changes from time to time also.
>Sometimes I get Server 06, Server 13, Server 19, Server 16.  Does anyone
>know how to figure out what Server # means.

Maybe this would help:
LOAD CONLOG
SET ALLOW INVALID POINTERS=OFF
SET READ FAULT NOTIFICATION=ON
SET READ FAULT EMULATION=OFF
SET UPGRADE LOW PRIORITY THREADS=ON
SET DISPLAY RELINQUISH CONTROL ALERTS=ON

Then, after a crash go to SYS:ETC and salvage CONSOLE.LOG.
Maybe that will tell you more.
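
[Sketch: once CONSOLE.LOG has been salvaged, repeated faults can be
tallied by module and code offset, since the "Server NN" process name
changes from crash to crash. This Python fragment assumes the log holds
entries laid out like the console message quoted above; only the "Read
from a nonpresent page" form is handled, and the second sample entry is
invented for the example.]

```python
import re
from collections import Counter

# Pattern modelled on the single console message quoted above; other
# fault messages are assumed to differ and are simply not matched.
FAULT_RE = re.compile(
    r"Read from a nonpresent page\.\s*"
    r"Process:\s*(?P<process>.+?)\.\s*"
    r"Module:\s*(?P<module>.+?)\s*"
    r"Code offset in module:\s*(?P<offset>[0-9A-Fa-f]+)h\.\s*"
    r"Access address:\s*(?P<address>[0-9A-Fa-f]+)h\."
)


def tally_faults(log_text: str) -> Counter:
    """Count read faults per (module name, code offset) pair."""
    counts = Counter()
    for match in FAULT_RE.finditer(log_text):
        key = (match.group("module"), match.group("offset").upper())
        counts[key] += 1
    return counts


# First entry is the one from the message above; second is hypothetical.
SAMPLE = """\
 1-28-97  12:02:51 am:    SERVER-4.10-748
     Read from a nonpresent page.
     Process: Server 16.
     Module: NetWare Server Operating System
     Code offset in module: 0001C824h.
     Access address: 00008000h.

 1-29-97   3:15:07 pm:    SERVER-4.10-748
     Read from a nonpresent page.
     Process: Server 06.
     Module: NetWare Server Operating System
     Code offset in module: 0001C824h.
     Access address: 00008000h.
"""

for (module, offset), count in tally_faults(SAMPLE).items():
    print(f"{count} x {module} at offset {offset}h")
```

The same module and offset turning up under different "Server NN"
names suggests the process number is incidental and the code location
is the thing to report.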

---------

Date: Thu, 30 Jan 1997 10:13:04 +0800
From: BLooney@comtech.com.au
To: Netw4-l@bgu.edu
Subject: Read from a non-present page

>I just get Server #.  The Server # changes from time to
>time also.  Sometimes I get Server 06, Server 13, Server
>19, Server 16.  Does anyone know how to figure out what
>Server # means.

Service processes are threads on a server that handle incoming packets. As
the server gets busier, more of these processes are allocated to deal with
requests from clients - up to the maximum (which is a settable parameter).

Server 00, Server 01 ...... Server 20 are all service processes. Most times
I've seen this problem (ie. they are abending the server) it is a problem
with either the LAN card driver or the card itself. There have been some
obscure exceptions though...

------------------------------

Date: Fri, 31 Jan 1997 22:56:19 +0100
From: "Arthur B." <arthur-b@ZEELANDNET.NL>
Subject: Re: EASY QUESTION! - emergency disk set

>Sam Martin <Sam_Martin@G1.COM> wrote:

> server -ns -na ; no startup, autoexec.ncf
> load disk drivers
> load install
> create the sys volume.
> install tape backup software if necessary.
> restore the bindery ; ** important **
> restore SYS
> down exit
> server
> hopefully, voila
> ... get to work on your emergency disk set.

Most people that have 3.1x forget to restore the bindery
before restoring the volume. The same goes for having an
*up-to-date* emergency disk set. Write Sam's list down, people.

An emergency disk set should contain:

1. A DOS bootable diskette with (make a backup of it also):
- a copy of the bootsector, CMOS and partition table and
  an util that can restore it all
- a tool to lowlevel format the HDU if allowed
- FDISK, SYS, FORMAT and a diskeditor tool (Norton DE).
- a small diagnostic tool for checking IRQ and such
- updated SERVER.EXE and all drivers needed that are called during
  execution of STARTUP.NCF and AUTOEXEC.NCF
- driver and configuration software for  hardware that starts before
  CONFIG.SYS is executed (like Adaptec cards)
- STARTUP.NCF and AUTOEXEC.NCF
- a tool to hack the SUPERVISOR password
- a copy of the bindery files
- updated VREPAIR and supporting files (name spaces!)
- updated INSTALL.NLM and supporting files
(you may be in need of several floppies)

2. A DOS bootable diskette with:
- anti-virus scanner and cleaner

3. A DOS diskette with:
- everything you need to get your backup program up again
  without reconfiguring it from scratch

4. A DOS diskette with:
- all currently used patches and updates from everything
(may need more than one floppy)

5. A hardcopy of your startup files

6. A hardcopy of your CMOS settings and other BIOS settings
   (like Adaptec)

7. Procedures describing what to do if... in detail

8. A list of people/organizations that can help when all fails
   and how to reach them from what time to what time and
   how much it will cost

9. Paper describing how to get hold of original software
   diskettes/CD-ROM's

10. Paper describing what software versions should be on the server
    and how to determine them

11. Paper describing directory structure

12. Paper describing the bindery (rights and such)

13. Paper describing the minimum needed hardware and software
    to completely replace the server

Well. Not so much at all really.

In short everything the milkman needs to rebuild the server as if
it has been gone for only one day even if the whole building goes
up in smoke and the admin was last seen entering the toilet...

I hope you never meet the day you'll thank me for this info.

------------------------------

Date: Thu, 30 Jan 97 07:18:02 -0800
From: Randy Grein <rgrein@halcyon.com>
To: "NetWare 4 list" <netw4-l@bgu.edu>
Subject: Re: couple questions...

>Real answer: Removal of DOS partition _is_ possible, forcing
>the need to boot from floppy.  No gains that I can see,
>drawbacks include a very slow reboot (when users are not at
>their most patient) and adding another device that can fail
>(and is highly prone to do so) -- i.e. floppies.  Since I
>started working here all new installs have a C: partition...
>and the larger the better (although not for "core dumps"...
>has anyone actually learned something _useful_ from one of these?)

Seems to me that people used to remove the DOS partition in the mistaken
belief that it enhanced security; the boot floppies were kept locked
elsewhere. Dumb, dumb, dumb. I THINK there's a procedure to boot 4.10
from floppies with no C: drive, but doubt it's possible to do so with
4.11. There's 15 megs worth of drivers in the startup directory!

Oh, BTW Floyd, I HAVE managed to extract useful information from a dump,
but I used Alexander Lan's SPK. It provides REAL post crash diagnostics,
as well as automated recovery and crash file maintenance. Check 'em out!

------------------------------

Date: Mon, 17 Mar 1997 20:17:58 -0700
From: "Mau, Steve" <steven.mau@ASU.EDU>
Subject: Re: Down a server remotely?

>>>Anybody know of any other way to then down the server when you
>>>can't get to a console prompt.  Like the old fconsole perhaps - is
>>>that OK with 4.1?
>
>3.1x FCONSOLE should solve your problem.
>
>If not. Visit www.novell.com and look for the TID that contains
>a small debug script that will do the job for you.
>
>* Arthur B.

I believe that Arthur is referring to TID 1001120 on
http://support.novell.com.  It gives instructions for doing
a down via debug commands.  I have a couple of questions
for those of you who have already tried this:

1) Did it work?  Did it do an orderly down, as far as you know,
in regards to flushing all file buffers to the server's hard disks,
etc.?

2) Would we be better off using fconsole if we can dig up a
copy of it?

------------------------------

Date: Tue, 1 Apr 1997 11:30:18 +0100
From: Richard Letts <R.J.Letts@SALFORD.AC.UK>
Subject: Re: A Schedule "Work To Do" error message.

>Our 150 user network froze this afternoon for no obvious reason.
>I came to check on my server and found this message on the console:
>
>Server-4.10-911
>A scheduled "Work To Do" took over one minute to be run.
>
>After a few minutes, the server was back to normal (the console screen
>did not freeze on me anymore and the utilization dropped from a steady
>100%), but a whole lot of users had to reboot their systems because
>they lost network drive connections.
>
>Please advise on this error message and how I can prevent it.  I still
>don't know what the error message means as of this time.  Thanks in
>advance for your help.

I can't help with the symptoms, but here's some education on the internals
of NetWare 4.x

There are three priority levels in 4.x:
 - low priority threads (eg compression, sub-allocation)
 - normal priority threads (eg pserver, monitor, server xx process)
 - workers

Worker processes are managed by threads rather than the kernel; the
control structures for them are created by the thread that owns them
rather than the kernel. They are designed to carry out tasks of defined,
short duration, and are scheduled /before/ all of the others. They do
things like transfer data between memory and controllers, etc.
An NLM doesn't need to use worker processes to do this; they look awful
to program and I've never needed them.

If one of these took too long to run, you can see that none of the other
processes got a chance to run. Since the other processes include the
server xx processes, users [could/]would lose their connections.
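
A toy model of that starvation, offered as a rough sketch rather than
as NetWare's actual scheduler: it assumes every task runs to completion
once started and that work-to-do items always go first, as described
above. The level names, task names and durations are made up for
illustration.

```python
# Toy model only: a run-to-completion scheduler with three levels.
# Level names, task names and durations are invented for illustration.
RANK = {"work-to-do": 0, "normal": 1, "low": 2}


def start_times(tasks):
    """Return [(name, start_second)] for tasks given as (name, level, seconds).

    Tasks are ordered by level (work-to-do first) and each one runs to
    completion before the next starts; ties keep their submitted order.
    """
    clock = 0
    schedule = []
    for name, level, seconds in sorted(tasks, key=lambda t: RANK[t[1]]):
        schedule.append((name, clock))
        clock += seconds
    return schedule


tasks = [
    ("server 01", "normal", 1),
    ("compression", "low", 5),
    ("disk work-to-do", "work-to-do", 70),
]

for name, start in start_times(tasks):
    print("%-16s starts at %3d s" % (name, start))
```

In this sketch the 70 second work-to-do item holds off "server 01" for
over a minute, which is the kind of situation the console message
complains about.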

[further aside:
A thread belongs to a thread-group; when an NLM is loaded it is given an
initial thread and thread-group context. It can create additional threads
and thread-groups if it wants. Thread-group contexts are used by some
libraries; for example, inet_ntoa() and similar functions use static memory
referenced off the thread group context. Forgetting this leads to
unexpected output from different threads in the same context! ]

Back onto the track of your problem.  I'd consider the disk
controller/drives the most likely suspects.  Worker threads are most often
used here.

------------------------------

Date: Tue, 1 Apr 1997 22:02:44 -0600
From: John Bezy <JBezy@HCI-OMAHA.COM>
Subject: Server Abend

<snip>

To further expand, yes, you can change the ABEND defaults.  Can be
done either thru SERVMAN or the SET command.  The ABEND options
are contained in SERVMAN/SERVER Parameters/ERROR HANDLING, or in
SET, option 12...  The two applicable options are

     Auto Restart After Abend Delay Time      2
     Auto Restart After Abend                 1

The values above are the defaults- 2 minutes for the first line (values are
from 2 to 60 minutes), and 0, 1, or 2 for the second line (0 means do not
try to recover; 1 means that for certain types of Abends, try to recover;
and 2 means for all hardware and software Abends, attempt to recover,
down the server in the configured amount of time, and restart the OS).

For certain types of errors, it is advisable to either let the server go down
and restart itself, or to shut it down immediately (after users have had a
chance to save their files and logout).  As your error message says,
data structures may have become corrupt.  If you allow the server to
continue to function, you have a good chance of totally corrupting things.
So it's up to you as how you elect to have it handled.  I would follow the
Novell guidance for now and make sure the server goes down...  If it's
not a critical error, the server will continue to operate, allowing the users
to stay working, giving you the opportunity to down it later...
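
As a rough sketch of the two options above, the helper below checks a
chosen pair of values against the ranges given in this message and
prints the matching console SET lines. The function name and the mode
descriptions are illustrative only, and the exact parameter spelling
is assumed to match what SERVMAN shows on your server, so verify it
there before putting anything in an .NCF file.

```python
# Meanings of "Auto Restart After Abend" as described in the message above.
RESTART_MODES = {
    0: "do not try to recover",
    1: "try to recover from certain types of Abends",
    2: "attempt to recover from all Abends, down the server after the "
       "delay, and restart the OS",
}


def abend_set_lines(mode=1, delay_minutes=2):
    """Return console SET lines for the two Abend options (hypothetical helper).

    Defaults are the ones quoted above: mode 1 and a delay of 2 minutes.
    """
    if mode not in RESTART_MODES:
        raise ValueError("mode must be 0, 1 or 2")
    if not 2 <= delay_minutes <= 60:
        raise ValueError("delay must be from 2 to 60 minutes")
    return [
        "SET Auto Restart After Abend = %d" % mode,
        "SET Auto Restart After Abend Delay Time = %d" % delay_minutes,
    ]


for line in abend_set_lines(mode=2, delay_minutes=5):
    print(line)
print("Mode 2 means:", RESTART_MODES[2])
```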

---------

Date: Wed, 2 Apr 1997 21:17:02 +0200
From: "Arthur B." <arthur-b@ZEELANDNET.NL>
Subject: Re: Server Abend

<snip>

Maybe something has changed in Netware, but an ABEND means that something
bad has happened from the NOS point of view. In other words, data integrity
can't be guaranteed.

So the server is halted. And it's up to a knowledgeable human to decide
if it's allowed to do the EIP=CSleepUntilInterrupt routine (or lookalikes)
and let users save their work (although this is more a job for the autosave
feature to the local hard disk, IMO, which many applications have implemented).

Most of the time I just view what 386DEBUG has to say and then
down the server, without allowing users to close normally,
just to be sure that being able to save some data doesn't result
in even more data loss. After all, it takes some time to find all
the users, tell them to save their data and wait for me to give
them the green light again. All that time the server is unstable.

Another thing with an auto-reboot gadget can be that you get stuck in
some sort of reboot loop (I don't have hands-on experience with INW...
yet).

------------------------------

Date: Tue, 01 Apr 1997 21:59:59 -0600
From: Darwin Collins <dcollins@fastlane.net>
To: netw4-l@ecnet.net
Subject: Re: Disappearing Netware server prompt

>I have seen this on quite a few occasions, but when an NLM becomes
>unstable, and you try to unload it, the process never ends, and the
>server prompt is never returned from the offending NLM. In the past I
>have tried Fconsole (on both 4.x and 3.x), debug, and several other
>utilities with limited success. I have always had to resort to powering
>the server off without downing it. Anyone have a solution or suggestions
>for this under 4.1?

This may 'not' be any real solution... but, with 4.11, you can do a
SHIFT CTRL ESC     and it will display a prompt to 'down the server'.

(I learned this from Brainshare 97)

------------------------------

Date: Thu, 29 May 1997 19:18:10 -0400
From: "Brien K. Meehan" <MEEHANB@DETROITEDISON.COM>
Subject: Re: RAID 5 with Netware

>A problem doesn't happen unless you change something, generally, so
>when you change something, you always allow yourself a stable state
>to go back to.

Well, out in the real world, things change every day, and not always with
your consent.  You can't fall back forever, sooner or later you have to
put on your "Computing in the '90's" hat and figure out how to move
forward and make it work.

>Picking through a Novell core dump is probably the greatest waste of
>time I can imagine.

The Novell engineers who have asked me to send them a core dump didn't
seem to think so.

>If necessary to implement a new module or component, I can understand
>the need, but the system wouldn't be available for login and general
>use, would it?  Whatever resources, and hardware, would also be
>available at that time.

Core dumps are usually reviewed offline, AFTER getting a server back on
its feet.  At least, wiser server administrators would do it that way.

>Don't forget you were rude to me first...

No, sir, YOU were rude first.  It was YOU who posted the "nugget" of
information that a 5MB DOS partition is sufficient to run Netware.
That, sir, is a very rude piece of misinformation.  Like I said, it
goes against conventional wisdom, Novell's recommendations, AND Netware
documentation!  It is simply untrue!  It was VERY rude of YOU to post
that "opinion" on this list, in front of almost 3800 readers.  Many
readers of this list are not that experienced, and might have mistaken
you for a competent source of information.  Being irresponsible in
your own environment is one thing, but spreading that sort of "wisdom"
is very, very rude.

>Rather than figure things out yourself, you're probably a stuffed shirt
>who calls and asks someone else to do it, and spends someone else's
>money.

An interesting speculation, especially because you have no basis for it.
As it turns out, your speculation is as wrong and as backwards as your
understanding of DOS partition requirements.

I'm the one other people call when they can't figure it out.  People call
me with problems like, "I need to install Netware 4.11, but my DOS
partition is too small, isn't 10MB enough?" or, "I loaded something on
my server, and now it's crashing.  Here's 128 floppy disks so you can get
a core dump."  That is, I get out my mop and clean up the messes other
people make by believing bad information and not reading the
documentation.

------------------------------

Date: Thu, 26 Jun 1997 15:47:11 +0200
From: Camaszotisz Gyorgy <webmaster@KERORG.HU>
Subject: TID finder

For those of you running into trouble finding a TID by number at Novell,
I suggest a quick way:

	http://support.novell.com/cgi-bin/search/tidfinder.cgi?2927123

The number at the end of the line should be the requested TID's number.
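
If you look up TIDs often, the same pattern is easy to script. A
minimal sketch, assuming only the URL layout shown above; the function
name tid_url is made up for illustration.

```python
# Base address taken from the message above; the TID number follows the "?".
TID_FINDER = "http://support.novell.com/cgi-bin/search/tidfinder.cgi"


def tid_url(tid):
    """Return the TID finder URL for a TID given as an int or a digit string."""
    tid = str(tid).strip()
    if not tid.isdigit():
        raise ValueError("TID must be all digits: %r" % tid)
    return "%s?%s" % (TID_FINDER, tid)


print(tid_url(2927123))
print(tid_url("1001120"))
```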

------------------------------

Date: Tue, 16 Dec 1997 17:30:34 -0600
From: Joe Doupnik <JRD@CC.USU.EDU>
Subject: Re: How much memory??

>We have a NetWare 4.11 server (SP 4A) running on an IBM PC 704 Server 30GB
>disk space 256MB RAM.  It has anywhere from 300 - 500 users and approx.
>6,000 - 8,000 files open.  With these specs we had about 64% cache buffers
>which was a little low so we have now upgraded to 512MB RAM which has
>pumped our cache buffers up to 82%.  The machine is connected to a
>100Mbps Ethernet port and also connected to 16Mbps Token-Ring.  We have
>checked Network Utilisation and while it is high, it is not excessive.
>The machine is just doing file and print serving with no fancy apps or
>anything on board.  Bare bones NetWare.
>
>We can have x amount of users and the machine can run fine and then
>user x+1 gets on and the machine just appears to die and the whole
>thing will just go down in a downward spiral and eventually it will
>just stop all the users from working.
>
>We have been monitoring stats like dirty cache buffers, LRU sitting time
>and these appear to be within spec.
---------
	You do have more than sufficient memory, and even Monitor agrees.
The cliff-edge drop in performance is likely due to other factors than just
memory. One is running out of directory cache buffers so accesses grind
to a crawl. Another is the LAN adapter and the rest of the LAN being
overwhelmed and giving up, and that is a very likely fault. If the number of
processes grows and grows then the machine itself is blocked internally.
Of course there is the rest of the server, the disk farm, which might
collapse under stress.
	Every report I have seen says NW servers degrade slowly & gracefully
when overloaded, and don't behave in the n=ok, n+1=bad fashion you describe.
	If you have Novell's Lanalyzer product put it on the ring and see
what happens.
	Joe D.

------------------------------

Date: Wed, 7 Jan 1998 19:38:38 +0100
From: Hauke de Vries <H.de.Vries@PHILOS.RUG.NL>
Subject: Re: Use of Netware 3.12 Debugger

>Could somebody please let me know if/where I can get hold of any
>documentation on how to use the Netware 3.12 debugger to analyse
>server crashes?
>
>Can I, for instance, use the debugger to find why a server crashes
>out on an INT3 Breakpoint?

http://developer.novell.com/research/appnotes.htm

"IntranetWare Server Automated Abend Recovery",
Novell AppNotes, March 1997, p. 32

"Resolving Critical Server Issues",
Novell AppNotes, February 1995, p. 35

"Abend Recovery Techniques for NetWare 3 and 4 Servers",
Novell AppNotes, June 1995, p. 75

The last one probably is the most interesting, but appnotes before
1996 seem not to be online :-(

Maybe it's possible to back order those.

------------------------------

Date: Fri, 9 Jan 1998 10:01:00 -0700
From: Hansang Bae <hbae@PRIMENET.COM>
Subject: Re: Use of Netware 3.12 Debugger

>>>Could somebody please let me know if/where I can get hold of any
>>>documentation on how to use the Netware 3.12 debugger to analyse
>>>server crashes?

June 96 Application Notes.
Network Management TIP section on abend debugging at www.avanti-tech.com

------------------------------

Date: Wed, 14 Jan 1998 13:45:40 -0700
From: Joe Doupnik <JRD@CC.USU.EDU>
Subject: Re: Error writing FAT

>I've been having problems all day long on my NetWare 4.1 server.  Came
>in this morning to find the DATA volume dismounted and error messages
>saying that the system was unable to write the FAT table, so the volume
>was dismounted.  I downed the system and brought it back up only to see
>250MB available.  Within about 15 minutes, I was back down to zero
>disk space available.  So, I added in another 4.5GB disk and started the
>server back up.  I monitored the disk space for about an hour and
>nothing unusual occurred.  However, I once again began to get the
>same 'unable to write FAT' message.
-------
	Yup, agreed, this is a serious problem indeed. From experience
here on the same matter:
	Bring up the server, load Vrepair, toggle round and dismount
volume DATA, run Vrepair on it a few times. Remount it.
	Run PURGE /ALL from the root of the volume.
	Check SCSI cables and termination, and see that DOS translation is
turned off in the SCSI controller. Stuff like that.
	Check for overheating (easy with today's cheap fans), flakey p/s.
	Check for inadequate memory (type MEMORY at the console prompt),
and heed the abundant advice in the FAQ about registering memory upon need.
	It is possible that the drive holding DATA: is on its way out. That
has occurred to most of us. It is possible your memory chips are unhappy.
	Finally, if disk compression is turned on, good luck fella. There
could be a crunch in decompressing material. I never use disk compression.
	I try to keep in mind that what I think is wrong is probably
incorrect and something else is sneaking up on me. It is difficult to do
that, but now and then it has been true.
	Joe D.

------------------------------

Date: Tue, 24 Mar 1998 14:58:29 -0700
From: The Abundant One <sfinney@CCIT.ARIZONA.EDU>
Subject: Re: Abend

>Abend: Machine Check Processor Exception (Error Code 00000009)
>
>Memory Parity Error NMI

Novell has a really good CBT that explains ABENDS pretty well.  URL is:
	http://support.novell.com/cbt/cbt1001/

You will need Real Audio to listen to the CBT.

To summarize one portion of the CBT, they say that the Machine Check and
NMI processor exception Abends are always hardware related.  NMI is
typically associated with memory parity checks so it's likely that your
server has some bad ram.

------------------------------