-----------------------------------------------------------------
NOVABEND.DOC -- 19980324 -- Email thread on NetWare ABnormal ENDs
-----------------------------------------------------------------

Feel free to add or edit this document and then email it back to faq@jelyon.com

Date: Thu, 2 Nov 1995 20:39:45 GMT
From: Edwin Cleton
Subject: Re: Netware 4.1 crashes with NETX

>A little more info, I got to see the screen this time after the abend:
> Page Fault Processor Exception (error code 01700000)
> Running Process: Interrupt Service Routine (nested count 2)

TechDoc  : Abend -> Polling process
Composer : Ecl@inter.NL.net (LESi)
Targets  : NetWare *.*
Keywords : Polling process abends
Revision : Fri 04-08-1995

Reasons for an Abend on Polling process (i.e. Running Process: Interrupt Service Routine)

1) Not all required patches loaded, or loaded in the wrong .ncf file
2) Insufficient server memory
3) Even with enough memory, fragmented memory pools due to SYS: mounting before all memory was registered (>16 MB)
4) Energy saving feature(s) enabled in the server's PC BIOS
5) Buggy keyboard BIOS (it is a different chip than the system BIOS)
6) Using a flaky/invalid IRQ on the server NIC (i.e. irq 2/9 - irq 14/15)
7) A flaky third party (backup?) NLM
8) Overheating (inside the server or hard disk case(s))
9) Static interference via extended cables
10) Outdated pserver.nlm
11) Server NIC going bad
12) Server has been downed and started without a hard reset

And of course, a real new OS bug is also possible.

------------------------------

Date: Thu, 18 Jan 1996 20:03:13 GMT+0001
From: "Szekeres Bela, Jr."
Subject: Re: Debugger step "2" -sometimes not enough

>>WHAT IS THE KEY SEQUENCE?? I'm going mad looking for it.. Help!
>
>
> b. Drop into the debugger using the shift-shift-alt-esc key sequence.

It happens quite often that you can use only one of the two Alts. (Which one depends on the computer.) But on a DTK 486 even the sequence and speed were important.
It was ...in this sequence and within about 1 second. Do not ask why, it did not work otherwise.

------------------------------

From: Teo Kirkinen
Subject: Re: Debugging after an Abend
Date: Fri, 19 Jan 1996 18:07:16 +0200 (EET)

>Teo, are you serious about "386debug" above?

Yes, I'm serious. It doesn't work when the server has NOT abended, but once it has abended it works even in situations where shift-shift-alt-esc just reboots the server. It has been very useful when we diagnosed a PFPE where the running process was Server xx and the "normal" way of getting into the debugger didn't work.

I learned it from somebody from England, most likely the owner of the SOFTRACK mailing list, but I can't recall his name. If I remember correctly, 386debug is also documented in some older version of the server SDK.

------------------------------

Date: Fri, 12 Apr 1996 00:20:00 PDT
From: "Randy G Hutchings, Contr CZE"
Subject: Great Info! ABEND Recovery

I had an ABEND yesterday and this information helped me restore my 4.1 server in seconds. It was taken from the April edition of LAN TIMES.

IF the ABEND is NOT a Processor Exception variety, follow these steps:

Now enable the server's internal NetWare debugger by simultaneously pressing both Shift keys, the Alt key, and the Esc key on the server console. NOTE: If your server was locked using the SECURE CONSOLE command, you will not be able to do this.

Type the case-sensitive command EIP=CSleepUntilInterrupt (that is a lowercase L and a capital I).

Type G, and the server should now appear to come back to life. As soon as users have saved their data, use the DOWN command to bring the server down gracefully and then reboot it.

IF the ABEND is a Processor Exception type, follow these steps:

If the ABEND message has the words NMI or Machine Check in it, then your chances of restarting the server and bringing it down gracefully are somewhat reduced, because failed hardware may be the cause of the problem.
Here are the steps to follow:

Now enable the server's internal NetWare debugger by simultaneously pressing both Shift keys, the Alt key, and the Esc key on the server console. NOTE: If your server was locked using the SECURE CONSOLE command, you will not be able to do this.

Type T. The server will display some debugging information and return to the # prompt.

Type G. The server should now appear to come back to life. Again, any client connections that are sitting at an Abort, Retry, Ignore message should be retried. Users should be able to save their files and exit their applications. Use the DOWN command to bring the server down and then reboot it. If the server comes back up with your data intact, you are ready to start troubleshooting.

IF the ABEND message has GPPE or Page Fault in it, follow these steps:

Type .R to display the name of the running process. Write this information down for later troubleshooting.

Type ? to display the NLM and function names that the server was executing during the ABEND. Write this down too.

Type .M to display the currently loaded modules and their addresses. Write down the starting address and length of the module that failed.

Type RC to display the contents of the control registers. Write down the value of the CR2 register, which contains the address that produced the page fault.

Type the case-sensitive command EIP=CSleepUntilInterrupt (that is a lowercase L and a capital I).

Type G, and the server should now appear to come back to life. As soon as users have saved their data, use the DOWN command to bring the server down gracefully and then reboot it. The server should come back up with your data intact; you are then ready to start troubleshooting. A good place to start is the NLM that caused the problem: remove it and contact its author.

------------------------------

Date: Mon, 3 Jun 1996 11:33:23 +-100
From: Stephen Knight
Subject: Server Freezes

The following may be of use to some of you who've been talking about server freezes etc.
and is a technique one of our dev. guys brought to my attention. It has got me out of ALL of the freezes I have had, by doing a controlled abend of the server (forcing a memory parity error) when I couldn't use the normal keyboard method.

Now what use is that? Well, you can then hopefully bring the server back up using the debug commands before getting users out and taking it down neatly...

Now I don't suggest you run this in your production server permanently, but if you need to catch a particular problem and are confident in what you are doing, then you can wait for the lock to occur and it may help.... just don't come to me if you destroy your brand new server!

Someone who knows more than me about the debugger (not hard) can help you more with what you can get out of the debugger once you are in there, but it may be possible to identify which NLM it was stuck in and, more importantly, get the server going again with the commands shown on the list a while ago:

EIP=CSleepUntilInterrupt
G

The hardware needed is as follows:

-----------------------------------------
| Back of PC   8 bit ISA Slot (or 1/2 a 16 bit slot)   A |-----O----+
-----------------------------------------                           |  1k / push
                                                       B |----------+  switch

This is a view of the *ISA* slots when looking down on the motherboard. Pin B is the NMI line normally triggered by a memory parity error (you need to have this turned on in the BIOS in some PC's) and Pin A is 0v. If you momentarily ground the NMI line (through a 1k resistor) then an abend occurs...

I built mine onto an old Dragon cartridge (Hi Roy!) but an 8 bit network card cut down to remove all but the edge connector, or a proper prototyping card, would do just as well...

Doesn't work on all servers / PC's, and possibly doesn't work for all freezes, but I hope it works for you. Credit must go to Carl Young who told me about all this stuff...
------------------------------

Date: Mon, 8 Jul 1996 16:46:45 +0100
From: John Bazeley
Subject: Re: Debuggers, Abends and NLM's

>I seen some recent mail about switching to the debugger after an
>ABEND, in order to identify the offending process and send it to
>sleep. Being a newbie to Novell, I've only just discovered the
>debugger. Since I've no info on the commands, I've currently left it
>alone, however I would be interested if anyone can give me a few
>pointers. I especially want to know about stopping an NLM process
>after an ABEND. The reason being we currently have a 4.1
>server which periodically (3 - 4 Weeks) falls over.

I think this is in the FAQ [Floyd: Yup, section H.54.2 & this file], but here goes.

To enter the debugger, press Alt, both Shifts and Esc at the same time, OR type 386debug (only works from an abend).

You'll be presented with some hex stuff which, unless you wrote the crashing NLM, you can probably ignore.

Type EIP=CSleepUntilInterrupt
Type G

You'll get your server back, probably. You should probably down it ASAP.

Other interesting debug commands:

v:  cycles through all screens that were active at the time of the abend
.p: gives you a list of all running threads and what they were up to at the time of the crash
?:  gives you the address of the crash and the 2 nearest exported symbols
h:  help on commands
.h: help on dot commands

------------------------------

Date: Fri, 20 Sep 1996 04:36:06 -0400
From: Mark Snape
Subject: An easy ABEND

Those of you that like crashing NW4.1 servers might like to try the following:

Hold down a key on the console keyboard........
Until it starts beeping (buffer full)
Press Enter.
Whoops!

I suggest that you don't do this in a user environment unless it's your last day. I have never come across a documented fix for this, but if anyone knows of one etc, etc...

We came across this when one of our guys leaned the keyboard against the system unit.
We were searching for minutes trying to find the source of the beeping; when I picked up the keyboard the beeping stopped, and I unfortunately pressed Enter to clear the command line. Following that, we have tried it on a number of boxes, and got the same result in each case.

------------------------------

Date: Thu, 10 Oct 1996 16:10:39 -0600
From: Joe Doupnik
Subject: Re: Warning: Netware Web Server is *very* buggy

----------
As Mr. Andersson of Volvo remarked, there is a Web server v2.51. As I looked at my CD-ROM collection I found it and just installed it on a NW 4.11 server. Now this is part of NW4.11/beta etc rather than the regular distribution channel, but the thought is there may be fixes in there for your situation (though the docs do not mention anything like that). How you might get v2.51 out of Novell is beyond my understanding.

Server abends from wild pointers are nasty things. I wonder if, by chance, you have read & write fault emulation turned on (I have them turned off, but notification turned on). That's SET 2 (memory) at the console. Also, "allow invalid pointers=off" on my machinery, same SET area. There is one other parameter of interest in this group, "alloc memory check flag=on/off", which is off here but which you may wish to turn on for safety's sake.

I would not be surprised to learn wild pointers come from the Perl NLM, given the things Perl does with memory.

Lastly, one needs to pick and choose the version of the underlying tcpip.nlm, and there are a number in the archives. I'm at version 3.00 because the latest stuff blocks RDATE.
Joe D.

------------------------------

Date: Mon, 14 Oct 96 18:39:19 -0700
From: Randy Grein
To: "NetWare 4 list"
Subject: Re: Console.log problem

>I've have a Netware 4.1 that's rebooting by itself every couple of days
>at random times. I checked the vol$log.err and it showed this message.
>
>volume mount had the following errors that were fixed:
>problem with the file console.log, length extended.
Wrong order: the crash and recovery cause this message; it's not the message causing the crash. If you get really stuck check out Alexander Lan's excellent Server Protection kit. It traps most crashes, and provides detailed information on nearly all the ones it can't stop. Check out http://www.alexander.com.

------------------------------

Date: Mon, 28 Oct 1996 16:15:12 +1100
From: Adrian Moore
Subject: Re: DEBUG HELP

>Can someone please explain to me how does one EFFECTIVLY use the
>debug utility after an abend?

A couple of things I use:

Use: ? to show the running process <- if it's always the same it could be the cause of the problem
Use: dds to show the contents of the stack (decodes addresses, so you can see what else has been running recently). Ignore things like MONITOR and CONSOLE ;]
Use: ? <4 byte address> to find out which code segment of which NLM the address corresponds to

See the app notes: Resolving Critical Server Issues (Feb 1995) & Abend Recovery Techniques for NetWare 3 and 4 Servers (June 1995). TABND2.EXE off http://support.novell.com/ contains the latter app note.

If anyone has any cute tricks regarding the debugger I would also love to hear them. There are some interestingly titled books in the FAQ for this list which I am yet to track down, though.

In particular, on Windows NT you can run the debug utilities on an NT crash dump to pull out a text file of the most important information, like running processes, the state of all threads, loaded files, etc. Has anyone written/seen a utility like this? Novell don't have something similar for general distribution...

------------------------------

Date: Mon, 28 Oct 1996 09:54:48 -0600
From: James Federline
Subject: Re: DEBUG HELP

>Can someone please explain to me how does one EFFECTIVLY use the debug
>utility after an abend ?

Novell's AppNote entitled "Abend Recovery Techniques for NetWare 3 and 4 Servers" describes this in detail. It's 30 pages long - I won't try to transcribe it here...
Topically, you must first analyze the type of abend before you can decide what things in the debugger are helpful to do. My goals going into the debugger are to 1) gather as much data on the abended state of the machine as possible, 2) attempt to restart the server thru thread quarantine (except for an NMI, then trace-and-go) for a graceful shutdown.

If you are struggling with a particular type of abend (Page Fault, GPPE, NMI, Machine Check, invalid opcode, or a software exception of some sort), some of us might be able to help you follow the right path.

Here's some basics:

- the Abend message will tell you the breed of abend you have experienced; write it down
- to redisplay the abend message, once in the debugger, type .a
- the R command displays registers and flags (not immediately useful, except for EIP - the CPU's instruction pointer)
- the ? command will display the NLM, and the function in the NLM, that EIP is pointing at. Since the server froze its state for you, this is most likely the thing that crashed your server. Now whether or not this module is bad is another story - another module could have passed it bad data, for instance. Also pay attention to the functions (previous, current and next) if shown. They probably won't be shown if EIP is at SERVER.NLM.
- the .R command shows the running process and info on that process, as well as two or three lines of core dump around the instruction pointer, with ASCII translation from hex for more insight.
- the RC command works only in NetWare 4. The only thing I can use from this is the CR2 register - if this is listed as 00000000, you could have buggy software. The appnote says that this address is frequently used by software that mistakenly dereferences a null pointer. Novell says this is a very common problem.

To recover from a Page Fault, GPPE, invalid opcode, or a software exception, try this, and then down and restart the server.
This might quite possibly give you a chance to let NetWare write any cached data to disk and let any important processes finish up. Of course, the following quarantine method assumes you don't need the process that abended the server to continue running (at least in a hobbled state).

1) gather info with the commands above.
2) enter this: EIP = CSleepUntilInterrupt
3) enter G, and the server will attempt to continue to run from the point of the abend, minus the thread you just quarantined.
4) have any connected clients attempt to close their files and let NW flush them to disk.
5) issue the DOWN command and then restart your server and do something with the information you gathered (call tech support, apply patches, buy more RAM, etc...)

I've used these methods to deal with two ARCserve bugs. It makes a real difference when you can provide the tech support person with debugger output - in both ARCserve cases, the support rep was able to tell me exactly what was buggy and why, and what patch to download and apply.

An NMI or machine check requires a method called "trace and go", quite a bit different from thread quarantine, since the machine state is not completely preserved, and it's iffy if it will work.

The above is always preferable to just power cycling the box, in my opinion - it could mean the difference between an intact filesystem and a scribbled mess. I personally hate using VREPAIR... :)

------------------------------

Date: Tue, 12 Nov 1996 09:44:18 +1000
From: Greg J Priestley
Subject: Re: Forcing Abends

You can try the Novell Consulting site http://www.novell.com/corp/programs/ncs/toolkit/main.html
I believe they have a utility.

------------------------------

Date: Tue, 10 Dec 1996 17:07:35 -0000
From: John Bazeley
Subject: Re: Abend Troubleshooter Request

>Does anyone have a book or product they recommend for Server
>Abend Troubleshooting ?

1. The list FAQ. See Floyd Maxwell's post yesterday for where and how to get it.
2.
support.novell.com: search for file tabnd2.exe.
3. 3.12a 'system messages' manual (not too useful, just a list).
4. 4.11 'supervising the network' manual.
5. SDK docs. Plenty of info on the best ways to cause abends.
6. NetWare Application Notes, June 1995 "Abend Recovery Techniques for NetWare 3 and 4 Servers"

You may be interested to know that 4.11 has an automatic abend recovery setting, which seems to work OK.

------------------------------

Date: Tue, 7 Jan 1997 20:57:46 -0600
From: Joe Doupnik
Subject: Re: Server Freezing

>>I have 4 Netware 4.1 servers and over the last month or so the main server
>>has been freezing about twice a week and the only way to clear it is to
>>cycle the power. I have been investigating this as a hardware problem
>>replacing memory etc.
>>
>>Today two of the other servers also froze and again could only be cleared
>>by cycling the power.
>
>This sounds familiar. In my case it's three identical 3.11 servers, so
>your problem may be unrelated but it may be worth a shot. Try changing
>the network card.
>What I'm guessing is happening is that either the 3C595 or its driver
>does not handle errors properly. I may be way out on this one so anyone
>is welcome to jump in and point out my errors, but for now I'm in search
>of a new 10/100 card. Although we don't run 100 Mbit yet, I'd rather not
>leave a 509 in a server any longer than needed.
----------
I think you have it right, John. Mangled frames can do wondrous things to a board and/or its driver if the designers didn't prepare for such eventualities.

Way back in the dark ages (NW 2.0a and such), when monolithic IPX was built from .obj modules and Ethernet_802.3 was the only choice, we (my place) had machine crashes left and right when servers were placed on the new backbone. It turned out the illegal frame tripped up "real" machines and legal frames tripped up the NW servers of that era. My NW 2.0a server would live about twenty minutes on the backbone. Time passed. I binary edited .obj files to use Ethernet_II frames for IPX, and so on. Eventually more robustness was designed into various drivers and machines began to cohabit the wires again. To this day we forbid Ethernet_802.3 on the main wires. Mind you, this is with otherwise rational frames except for the missing 802.2 interior on IEEE-802.3 style Ethernet frames. Given wider variations of construction, some not legit, anything can happen.
[Story is in ethernet.txt, netlab2.usu.edu, cd misc, and in the list's FAQ.]

Checking costs time and memory, and hence is a pain in the butt. But that's the price of robustness. There is also the well known machine test of putting 3 year olds at a keyboard. They have killed many big machines, by simply holding down a repeating key. To this day a LOT of systems are opened by accepting strings from abroad which exceed buffer lengths (the basis of the infamous Internet worm of several years ago).

It could well be some device on the net is putting out "sensitive" frames which kill the drivers.
Some "self test" programs of boards emit a fast paced stream of broadcasts, etc. Broken boards emit all kinds of things. A packet monitor might help spot suspicious stuff.
Joe D.

------------------------------

Date: Tue, 28 Jan 1997 13:08:47 -0600
From: "Mike Avery"
To: netw4-l@bgu.edu
Subject: Re: couple questions...

>>Also, you should have as much free space on your DOS partition as
>>you have memory in your server so you can do a core dump to disk
>>quickly and efficiently.

>Could you elaborate a bit about the circumstances which cause a
>memory dump, capabilities of dump analyzer software, and does the
>dump space in the DOS partition need to be contiguous? Are there
>any serious consequences of not allowing a sufficient space to
>contain the entire core image?

When a NetWare server abends (at least up to 4.11... I haven't installed that yet) you are offered the chance to dump memory to disk or to shut down the server.

If you have an abend every 3 months or so, it's probably not a big deal. Maybe you'll update your drivers and patches and see if it goes away.

If you abend often enough that your threshold of tolerance is exceeded, you'll probably hassle your vendors. So if the message suggests "Abend in module DripBack", you'll call the dripback help line. If the problem continues long enough, they'll probably ask you for "a dump next time you abend".

If you dump to floppy disk, it will take a while. About 10 minutes per meg, plus whatever wasted time results from not having the next diskette ready to go. The NetWare diskette handler is remarkably inefficient.

By being able to dump the memory image to hard disk, a lot of time is saved. Then you copy the abend core file to another server (assuming you have more than one server), laplink it to another computer, or somehow get it off the server's DOS partition. Then PKZIP will slash the size of the file, and it can be put on diskettes and mailed, or emailed directly to your vendor.
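
To put rough numbers on the floppy route described above, here is a minimal sketch of the arithmetic. The 10 minutes per meg figure and the "free DOS space at least equal to server memory" rule of thumb come from this thread; the 1.44 MB diskette size and both function names are assumptions made for illustration only, and real dumps will not split quite this neatly.

```python
import math

MINUTES_PER_MB = 10      # floppy dump rate quoted in the message above
DISKETTE_MB = 1.44       # ASSUMPTION: 3.5" high-density diskettes


def floppy_dump_estimate(ram_mb):
    """Return (diskettes, hours) to dump ram_mb of server memory to floppy."""
    diskettes = math.ceil(ram_mb / DISKETTE_MB)
    hours = ram_mb * MINUTES_PER_MB / 60
    return diskettes, hours


def dos_partition_can_hold_dump(ram_mb, free_dos_mb):
    """Rule of thumb from the thread: free DOS space >= installed memory."""
    return free_dos_mb >= ram_mb


if __name__ == "__main__":
    # Hypothetical server sizes, chosen only to show the trend.
    for ram in (16, 32, 64):
        disks, hours = floppy_dump_estimate(ram)
        print(f"{ram} MB RAM: about {disks} diskettes, {hours:.1f} hours")
```

Even allowing for the roughness of the estimate, the hours grow linearly with installed memory, which is the practical argument for keeping enough free space on the DOS partition.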
As to what the dump is for, the vendor can examine it and see everything in your server's memory. Register contents, data at pointers, stack levels, what was stacked, what drivers are loaded, what versions of what drivers are loaded, and on and on and on.

There are some utilities around to let you take a dump apart, but they tend to require programming experience to unravel the contents of memory.

The dump is just a DOS file, so it doesn't need to be contiguous.

And the consequence of not having the space is that if you need to capture a memory dump, you'll be using diskettes.

(Note - you CAN dump to another server, if you connected, logged into the other server, and mapped a drive to a volume where space for the memory dump is available before you launched NetWare on the server in trouble. However, in practice I've found that this doesn't seem to really work. By the time I have my abend, I've lost the connection to the other server.)

---------

Date: Tue, 28 Jan 97 01:38:59 -0800
From: Randy Grein
To: "NetWare 4 list"
Subject: Re: couple questions... -Reply

>There is now a utility to allow you to dump to the dos partition and
>start the server back up and then copy the dump up to a volume on the
>server. Not sure of the name, but it is on the web site.

I can go you one better. Check out Alexander Lan's SPK. It provides very good memory protection, trapping a large range of crash types, provides auto restart features (both similar to what's available in 4.11; not surprising because Novell bought the technology from them), provides a much smaller crash file option of around 6 megs (strips off the cache buffers and other relatively unimportant information), and finally provides an excellent diagnostic utility for debugging the crash files!

BTW, regarding excessive dirty cache buffers - under some circumstances simply increasing the max concurrent disk cache writes will significantly improve write performance, dropping the number of dirty cache buffers.
I've had to increase it as high as 800 on mid-sized systems, and along with other tuning parameters was rewarded with significant improvements in throughput - 30-50% at times. Of course, this is HIGHLY dependent on the environment and data access patterns, so the usual disclaimers apply.

---------

Date: Wed, 29 Jan 97 08:45 EST
From: Ed Marczak
To: netw4 l
Subject: Re[2]: couple questions... -Reply

>There is now a utility to allow you to dump to the dos partition and
>start the server back up and then copy the dump up to a volume on the
>server. Not sure of the name, but it is on the web site.

It's called imgcpy.exe - it's in the 41 patches directory.

------------------------------

Date: Thu, 30 Jan 1997 01:01:28 +0100
From: "Arthur B."
To:
Subject: Re: Read from a non-present page

>I have been receiving the following message on my console for a while
>now. Once I get this message my 4.1 server usually crashes shortly after.
>
> 1-28-97 12:02:51 am: SERVER-4.10-748
> Read from a nonpresent page.
> Process: Server 16.
> Module: NetWare Server Operating System
> Code offset in module: 0001C824h.
> Access address: 00008000h.
>
>I mostly understand what the error message means but how do I find out
>which NLM is going off the deep end. According to the Red book, the
>Process: line should tell you which NLM is causing the problem but in my
>case, I just get Server #. The Server # changes from time to time also.
>Sometimes I get Server 06, Server 13, Server 19, Server 16. Does anyone
>know how to figure out what Server # means.

Maybe this would help:

LOAD CONLOG
SET ALLOW INVALID POINTERS=OFF
SET READ FAULT NOTIFICATION=ON
SET READ FAULT EMULATION=OFF
SET UPGRADE LOW PRIORITY THREADS=ON
SET DISPLAY RELINQUISH CONTROL ALERTS=ON

Then, after a crash go to SYS:ETC and salvage CONSOLE.LOG. Maybe that will tell you more.

---------

Date: Thu, 30 Jan 1997 10:13:04 +0800
From: BLooney@comtech.com.au
To: Netw4-l@bgu.edu
Subject: Read from a non-present page

>I just get Server #.
The Server # changes from time to
>time also. Sometimes I get Server 06, Server 13, Server
>19, Server 16. Does anyone know how to figure out what
>Server # means.

Service processes are threads on a server that handle incoming packets. As the server gets busier, more of these processes are allocated to deal with requests from clients - up to the maximum (which is a settable parameter). Server 00, Server 01 ...... Server 20 are all service processes.

Most times I've seen this problem (i.e. they are abending the server) it is a problem with either the LAN card driver or the card itself. There have been some obscure exceptions though...

------------------------------

Date: Fri, 31 Jan 1997 22:56:19 +0100
From: "Arthur B."
Subject: Re: EASY QUESTION! - emergency disk set

>Sam Martin wrote:
> server -ns -na ; no startup, autoexec.ncf
> load disk drivers
> load install
> create the sys volume.
> install tape backup software if necessary.
> restore the bindery ; ** important **
> restore SYS
> down exit
> server
> hopefully, voila
> ... get to work on your emergency disk set.

Most people that have 3.1x forget to restore the bindery before restoring the volume. The same goes for having an *up-to-date* emergency disk set. Write Sam's list down, people.

An emergency disk set should contain:

1. A DOS bootable diskette with (make a backup of it also):
   - a copy of the bootsector, CMOS and partition table and a util that can restore it all
   - a tool to lowlevel format the HDU if allowed
   - FDISK, SYS, FORMAT and a diskeditor tool (Norton DE).
   - a small diagnostic tool for checking IRQ and such
   - updated SERVER.EXE and all drivers needed that are called during execution of STARTUP.NCF and AUTOEXEC.NCF
   - driver and configuration software for hardware that starts before CONFIG.SYS is executed (like Adaptec cards)
   - STARTUP.NCF and AUTOEXEC.NCF
   - a tool to hack the SUPERVISOR password
   - a copy of the bindery files
   - updated VREPAIR and supporting files (name spaces!)
   - updated INSTALL.NLM and supporting files (you may be in need of several floppies)
2. A DOS bootable diskette with:
   - anti-virus scanner and cleaner
3. A DOS diskette with:
   - everything you need to get your backup program up again without reconfiguring it from scratch
4. A DOS diskette with:
   - all currently used patches and updates for everything (may need more than one floppy)
5. A hardcopy of your startup files
6. A hardcopy of your CMOS settings and other BIOS settings (like Adaptec)
7. Procedures describing what to do if... in detail
8. A list of people/organizations that can help when all fails and how to reach them, from what time to what time, and how much it will cost
9. Paper describing how to get hold of original software diskettes/CD-ROM's
10. Paper describing what software versions should be on the server and how to determine them
11. Paper describing directory structure
12. Paper describing the bindery (rights and such)
13. Paper describing the minimum needed hardware and software to completely replace the server

Well. Not so much at all really. In short, everything the milkman needs to rebuild the server as if it has been gone for only one day, even if the whole building goes up in smoke and the admin was last seen entering the toilet...

I hope you never meet the day you'll thank me for this info.

------------------------------

Date: Thu, 30 Jan 97 07:18:02 -0800
From: Randy Grein
To: "NetWare 4 list"
Subject: Re: couple questions...

>Real answer: Removal of DOS partition _is_ possible, forcing
>the need to boot from floppy. No gains that I can see,
>drawbacks include a very slow reboot (when users are not at
>their most patient) and adding another device that can fail
>(and is highly prone to do so) -- i.e. floppies. Since I
>started working here all new installs have a C: partition...
>and the larger the better (although not for "core dumps"...
>has anyone actually learned something _useful_ from one of these?)
Seems to me that people used to remove the DOS partition in the mistaken belief that it enhanced security; the boot floppies were kept locked elsewhere. Dumb, dumb, dumb. I THINK there's a procedure to boot 4.10 from floppies with no C: drive, but doubt it's possible to do so with 4.11. There's 15 megs worth of drivers in the startup directory!

Oh, BTW Floyd, I HAVE managed to extract useful information from a dump, but I used Alexander Lan's SPK. It provides REAL post crash diagnostics, as well as automated recovery and crash file maintenance. Check 'em out!

------------------------------

Date: Mon, 17 Mar 1997 20:17:58 -0700
From: "Mau, Steve"
Subject: Re: Down a server remotely?

>>>Anybody know of any other way to then down the server when you
>>>can't get to a console prompt. Like the old fconsole perhaps - is
>>>that OK with 4.1?
>
>3.1x FCONSOLE should solve your problem.
>
>If not. Visit www.novell.com and look for the TID that contains
>a small debug script that will do the job for you.
>
>* Arthur B.

I believe that Arthur is referring to TID 1001120 on http://support.novell.com. It gives instructions for doing a down via debug commands. I have a couple of questions for those of you who have already tried this:

1) Did it work? Did it do an orderly down, as far as you know, in regards to flushing all file buffers to the server's hard disks, etc.?

2) Would we be better off using fconsole if we can dig up a copy of it?

------------------------------

Date: Tue, 1 Apr 1997 11:30:18 +0100
From: Richard Letts
Subject: Re: A Schedule "Work To Do" error message.

>Our 150 user network froze for this afternoon for no obvious reason.
>I came check on my server and found this message on the console:
>
>Server-4.10-911
>A scheduled "Work To Do" took over one minute to be run.
>
>After a few minutes, the server was back to normal (the console screen
>did not freeze on me anymore and the utilization dropped from a steady
>100%), but a whole lot of users had to reboot their systems because
>they lost network drive connections.
>
>Please advise on this error message and how I can prevent it. I still
>don't know what the error message means as of this time. Thanks in
>advance for your help.

I can't help with the symptoms, but here's some education on the
internals of NetWare 4.x.

There are three priority levels in 4.x:
- low priority threads (eg compression, sub-allocation)
- normal priority threads (eg pserver, monitor, server xx process)
- worker processes

Worker processes are managed by threads rather than the kernel; the
control structures for them are created by the thread that owns them
rather than the kernel. They are designed to carry out tasks of
defined, short duration, and are scheduled /before/ all of the others.
They do things like transfer data between memory and controllers, etc.
An NLM doesn't need to use worker processes to do this; they look awful
to program and I've never needed them.

If one of these took too long to run, you can see that none of the
other processes got a chance to run. Since the other processes include
the server xx processes, users [could/]would lose their connection.

[further aside:
A thread belongs to a thread-group; when an NLM is loaded it is given
an initial thread and thread-group context. It can create additional
threads and thread-groups if it wants. Thread-group contexts are used
by some libraries; for example, the inet_ntoa(), etc. functions use
static memory referenced off the thread-group context. Forgetting this
leads to unexpected output from different threads in the same context!
]

Back onto the track of your problem. I'd consider the disk
controller/drives the most suspect. Worker threads are most often used
here.
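
The starvation effect described above can be illustrated with a toy
model. This is a sketch only, not NetWare code: the function name
simulate(), the tick-based timing, and the thread names are assumptions
made for illustration. It models just the one rule stated in the
message, that queued worker items run before any normal-priority
thread gets a turn.

```python
from collections import deque


def simulate(work_items, normal_threads, total_ticks):
    """Return how many ticks each normal-priority thread received.

    work_items     -- tick counts, one per queued worker item (hypothetical)
    normal_threads -- thread names, served round-robin when no work is queued
    total_ticks    -- length of the simulated interval
    """
    # Ignore empty work items so a zero-length item cannot block the queue.
    pending = deque(t for t in work_items if t > 0)
    ready = deque(normal_threads)
    ticks_received = {name: 0 for name in normal_threads}

    for _ in range(total_ticks):
        if pending:
            # Worker items always run first; normal threads wait.
            pending[0] -= 1
            if pending[0] == 0:
                pending.popleft()
            continue
        if ready:
            # No worker items queued: give one tick to the next thread.
            name = ready.popleft()
            ticks_received[name] += 1
            ready.append(name)
    return ticks_received


# Short worker items leave most of the interval to the normal threads.
print(simulate([2, 2], ["server 1", "monitor"], 10))
# One long worker item takes the whole interval for itself.
print(simulate([60], ["server 1", "monitor"], 60))
```

In the second call the normal threads receive no ticks at all, which
is the situation in which the "server xx" processes cannot service
their connections.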

------------------------------

Date: Tue, 1 Apr 1997 22:02:44 -0600
From: John Bezy
Subject: Server Abend

To further expand, yes, you can change the ABEND defaults. It can be
done either thru SERVMAN or the SET command. The ABEND options are
contained in SERVMAN/SERVER Parameters/ERROR HANDLING, or in SET,
option 12...

The two applicable options are

   Auto Restart After Abend Delay Time   2
   Auto Restart After Abend              1

The values above are the defaults: 2 minutes for the first line
(values are from 2 to 60 minutes), and 0, 1, or 2 for the second line
(0 means do not try to recover; 1 means that for certain types of
Abends, try to recover; and 2 means for all hardware and software
Abends, attempt to recover, down the server in the configured amount
of time, and restart the OS).

For certain types of errors, it is advisable to either let the server
go down and restart itself, or to shut it down immediately (after
users have had a chance to save their files and log out). As your
error message says, data structures may have become corrupt. If you
allow the server to continue to function, you have a good chance of
totally corrupting things. So it's up to you how you elect to have it
handled. I would follow the Novell guidance for now and make sure the
server goes down...

If it's not a critical error, the server will continue to operate,
allowing the users to stay working, giving you the opportunity to
down it later...

---------

Date: Wed, 2 Apr 1997 21:17:02 +0200
From: "Arthur B."
Subject: Re: Server Abend

Maybe something has changed in NetWare, but an ABEND means that
something bad has happened from the NOS's point of view. In other
words, data integrity can't be guaranteed. So the server is halted.
And it's up to a knowledgeable human to decide if it's allowed to do
the EIP=CSleepUntilInterrupt routine (or lookalikes) and let users
save their work (although this is more for the autosave feature to
the local hard disk, IMO, which many applications have implemented).

Most of the time I just view what 386DEBUG has to say and then down
the server, not allowing users to close normally, just to be sure
that being able to save some data doesn't result in even more data
loss. It does take some time to find all the users, tell them to
save their data, and have them wait for me to give them the green
light again. All the time that takes, the server is unstable.

Another thing with an auto-reboot gadget can be that you get stuck in
some sort of reboot loop (I don't have hands-on experience with
INW... yet).

------------------------------

Date: Tue, 01 Apr 1997 21:59:59 -0600
From: Darwin Collins
To: netw4-l@ecnet.net
Subject: Re: Disappearing Netware server prompt

>I have seen this on quite a few occasions, but when an NLM becomes
>unstable, and you try to unload it, the process never ends, and the
>server prompt is never returned from the offending NLM. In the past I
>have tried Fconsole (on both 4.x and 3.x), debug, and several other
>utilities with limited success. I have always had to resort to powering
>the server off without downing it. Anyone have a solution or suggestions
>for this under 4.1?

This may 'not' be any real solution... but, with 4.11, you can do a
SHIFT CTRL ESC and it will display a prompt to 'down the server'.
(I learned this from Brainshare 97)

------------------------------

Date: Thu, 29 May 1997 19:18:10 -0400
From: "Brien K. Meehan"
Subject: Re: RAID 5 with Netware

>A problem doesn't happen unless you change something, generally, so
>when you change something, you always allow yourself a stable state
>to go back to.

Well, out in the real world, things change every day, and not always
with your consent. You can't fall back forever; sooner or later you
have to put on your "Computing in the '90's" hat and figure out how
to move forward and make it work.

>Picking through a Novell core dump is probably the greatest waste of
>time I can imagine.

The Novell engineers who have asked me to send them a core dump
didn't seem to think so.

>If necessary to implement a new module or component, I can understand
>the need, but the system wouldn't be available for login and general
>use, would it? Whatever resources, and hardware, would also be
>available at that time.

Core dumps are usually reviewed offline, AFTER getting a server back
on its feet. At least, wiser server administrators would do it that
way.

>Don't forget you were rude to me first...

No, sir, YOU were rude first. It was YOU who posted the "nugget" of
information that a 5MB DOS partition is sufficient to run Netware.
That, sir, is a very rude piece of misinformation. Like I said, it
goes against conventional wisdom, Novell's recommendations, AND
Netware documentation! It is simply untrue!

It was VERY rude of YOU to post that "opinion" on this list, in front
of almost 3800 readers. Many readers of this list are not that
experienced, and might have mistaken you for a competent source of
information. Being irresponsible in your own environment is one
thing, but spreading that sort of "wisdom" is very, very rude.

>Rather than figure things out yourself, you're probably a stuffed shirt
>who calls and asks someone else to do it, and spends someone else's
>money.

An interesting speculation, especially because you have no basis for
it. As it turns out, your speculation is as wrong and as backwards as
your understanding of DOS partition requirements.

I'm the one other people call when they can't figure it out. People
call me with problems like, "I need to install Netware 4.11, but my
DOS partition is too small, isn't 10MB enough?" or, "I loaded
something on my server, and now it's crashing. Here's 128 floppy
disks so you can get a core dump."

That is, I get out my mop and clean up the messes other people make
by believing bad information and not reading the documentation.

------------------------------

Date: Thu, 26 Jun 1997 15:47:11 +0200
From: Camaszotisz Gyorgy
Subject: TID finder

For those of you running into trouble finding a TID by number at
Novell, I suggest a quick way:

   http://support.novell.com/cgi-bin/search/tidfinder.cgi?2927123

The number at the end of the line should be the requested TID's
number.

------------------------------

Date: Tue, 16 Dec 1997 17:30:34 -0600
From: Joe Doupnik
Subject: Re: How much memory??

>We have a NetWare 4.11 server (SP 4A) running on a IBM PC 704 Server 30GB
>disk space 256MB RAM. It has anywhere from 300 - 500 users and approx.
>6,000 - 8,000 files open. With these specs we had about 64% cache buffers
>which was a little low so we have now upgraded to 512MB RAM which has
>pumped our cache buffers up to 82%. The machine is connected to a
>100MBs Ethernet port and also connected to 16MB Token-Ring. We have
>checked Network Utilisation and while it is high, it is not excessive.
>The machine is just doing file and print serving with no fancy apps or
>anything on board. Bare bones NetWare.
>
>We can have x amount of users and the machine can run fine and then
>user x+1 gets on and the machine just appears to die and the whole
>thing will just go down in a downward spiral and eventually it will
>just stop all the users from working.
>
>We have been monitoring stats like dirty cache buffers, LRU sitting time
>and these appear to be within spec.
---------
You do have more than sufficient memory, and even Monitor agrees.
The cliff-edge drop in performance is likely due to other factors
than just memory. One is running out of directory cache buffers so
accesses grind to a crawl. Another is the LAN adapter stuff and the
rest of the LAN is overwhelmed and gives up, and that is a very
likely fault. If the number of processes grows and grows then the
machine itself is blocked internally. Of course there is the rest of
the server, the disk farm, which might collapse under stress.
Every report I have seen says NW servers degrade slowly & gracefully
when overloaded, and don't behave in the n=ok, n+1=bad fashion you
describe. If you have Novell's Lanalyzer product, put it on the ring
and see what happens.
        Joe D.

------------------------------

Date: Wed, 7 Jan 1998 19:38:38 +0100
From: Hauke de Vries
Subject: Re: Use of Netware 3.12 Debugger

>Could somebody please let me know if/where I can get hold of any
>documentation on how to use the Netware 3.12 debugger to analyse
>server crashes?
>
>Can I, for instance, use the debugger to find why a server crashes
>out on an INT3 Breakpoint?

http://developer.novell.com/research/appnotes.htm

"IntranetWare Server Automated Abend Recovery",
Novell AppNotes, March 1997, p. 32

"Resolving Critical Server Issues",
Novell AppNotes, February 1995, p. 35

"Abend Recovery Techniques for NetWare 3 and 4 Servers",
Novell AppNotes, June 1995, p. 75

The last one is probably the most interesting, but AppNotes from
before 1996 seem not to be online :-( Maybe it's possible to back
order those.

------------------------------

Date: Fri, 9 Jan 1998 10:01:00 -0700
From: Hansang Bae
Subject: Re: Use of Netware 3.12 Debugger

>>>Could somebody please let me know if/where I can get hold of any
>>>documentation on how to use the Netware 3.12 debugger to analyse
>>>server crashes?

June 96 Application Notes. Network Management TIP section on abend
debugging at www.avanti-tech.com

------------------------------

Date: Wed, 14 Jan 1998 13:45:40 -0700
From: Joe Doupnik
Subject: Re: Error writing FAT

>I've been having problems all day long on my NetWare 4.1 server. Came
>in this morning to find the DATA volume dismounted and error messages
>saying that the system was unable to write the FAT table, so the volume
>was dismounted. I downed the system and brought it back up only to see
>250MB available. Within about 15 minutes, I was back down to zero
>disk space available. So, I added in another 4.5GB disk and started the
>server back up.
>I monitored the disk space for about an hour and
>nothing unusual occurred. However, I once again began to get the
>same 'unable to write FAT' message.
-------
Yup, agreed, this is a serious problem indeed. From experience
here on the same matter:

Bring up the server, load Vrepair, toggle round and dismount volume
DATA, run Vrepair on it a few times. Remount it. Run PURGE /ALL from
the root of the volume.

Check SCSI cables and termination, and see that DOS translation is
turned off in the SCSI controller. Stuff like that.

Check for overheating (easy with today's cheap fans), flakey p/s.

Check for inadequate memory (type MEMORY at the console prompt), and
heed the abundant advice in the FAQ about registering memory upon
need.

It is possible that the drive holding DATA: is on its way out. That
has occurred to most of us. It is possible your memory chips are
unhappy.

Finally, if disk compression is turned on, good luck fella. There
could be a crunch in decompressing material. I never use disk
compression.

I try to keep in mind that what I think is wrong is probably
incorrect and something else is sneaking up on me. It is difficult
to do that, but now and then it has been true.
        Joe D.

------------------------------

Date: Tue, 24 Mar 1998 14:58:29 -0700
From: The Abundant One
Subject: Re: Abend

>Abend: Machine Check Processor Exception (Error Code 00000009)
>
>Memory Parity Error NMI

Novell has a really good CBT that explains ABENDs pretty well.
URL is:

   http://support.novell.com/cbt/cbt1001/

You will need Real Audio to listen to the CBT.

To summarize one portion of the CBT, they say that the Machine Check
and NMI processor exception Abends are always hardware related. NMI
is typically associated with memory parity checks, so it's likely
that your server has some bad RAM.

------------------------------
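
The rule of thumb quoted from the CBT (Machine Check and NMI abends
point at hardware) can be written down as a small check for sorting
saved abend messages. This is a minimal sketch: the function name
looks_hardware_related() is hypothetical, and matching on the words
"NMI" and "Machine Check" is an assumption based only on the wording
in this thread, not on any Novell tool.

```python
import re

# Words that, per the CBT summary above, indicate a hardware-related abend.
HARDWARE_PATTERN = re.compile(r"\b(nmi|machine check)\b", re.IGNORECASE)


def looks_hardware_related(abend_text):
    """Return True if the abend text mentions NMI or Machine Check."""
    return HARDWARE_PATTERN.search(abend_text) is not None


# Messages copied from this thread.
for message in (
    "Abend: Machine Check Processor Exception (Error Code 00000009)",
    "Memory Parity Error NMI",
    "Page Fault Processor Exception (error code 01700000)",
):
    verdict = "check hardware first" if looks_hardware_related(message) else "not flagged"
    print(f"{message} -> {verdict}")
```

A message that is not flagged can still have a hardware cause; the
check only captures the one rule stated above.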