OBSOLETE

Patch-ID# 101945-07
Keywords: sockmod RDBMS segmapdev savecore hang crash fcntl prgetstatus strrput
Synopsis: SunOS 5.4: jumbo patch for kernel (includes libsocket procfs kadb nfs)
Date: Oct/03/1994

Install Requirements: None
Solaris Release: 2.4
SunOS Release: 5.4
Unbundled Product:
Unbundled Release:
Xref: This patch available for x86 as patch 101946
Topic: SunOS 5.4: jumbo patch for kernel (includes libsocket procfs kadb hypersparc nfs)
Relevant Architectures: sparc

BugId's fixed with this patch: 1120225 1150556 1151364 1152710 1152922 1157053 1160112 1165687 1166779 1166848 1167235 1168398 1169686 1169909 1170862 1171478 1172009 1172242 1172243 1172243 1172245 1172979 1172998 1173626 1173969 1174572 1174738 1174830 1174847 1175829 1175968 1176467 1177091 1177572 1177578
Changes incorporated in this version: 1177091 1177578 1176467 1172243
Patches accumulated and obsoleted by this patch: 101918-01
Patches which conflict with this patch:
Patches required with this patch:
Obsoleted by:

Files included with this patch:

/kadb
/kernel/drv/cn
/kernel/drv/tl
/kernel/fs/autofs
/kernel/fs/nfs
/kernel/fs/nfs
/kernel/fs/procfs
/kernel/strmod/sockmod
/kernel/sys/nfs
/kernel/unix
/usr/kvm/crash
/usr/kvm/prtdiag
/usr/lib/libsocket.a
/usr/lib/libsocket.so.1

Problem Description:

1177091 prgetstatus can generate pagefault holding p_lock, can deadlock if freemem is 0
1177578 strmakemsg/strgeterr causes panic in strrput due to NULL mblk ptr
1176467 fcntl system call fails in process run by rcmd
1172243 Customer runs application from dumb terminal and system crashes.

The system can freeze under heavy swapping pressure due to procfs holding a critical lock when it takes a page fault.

Doing I_SETSIG on a console window through serial line and exiting the process could cause a system panic.

Kernel panic in putnext/ptcwrite.

A socket endpoint not created through the socket library (by dup() of a socket endpoint for example) may experience some failures on fcntl()/ioctl() calls.
(This bug is limited to the 2.4 release.)

(from 101945-06)

1174847 SS5 running 4.1.3U1 - running customer application - HARD HANGS
1177572 Installing Solaris 2.4 ON patch 101945-05 and running OW causes machine to panic

The patch for bug ID 1151364 broke OW's consolidation. This happened because releasef() was changed to take an extra argument. OW shouldn't have been dependent on releasef(), which is private to the ON consolidation. Since this problem was not discovered until after the patch was made, it made more sense for ON to produce a new patch which restores releasef() to its old interface. The interface changed for kaio. A new interface called areleasef() is added, which is used only by kaio.

This is an enhancement to the workaround created for bug 1161592. The change is local to sun4m/swift cpu code and has NO impact on other non-swift platforms.

(from 101945-05)

1174830 savecore on diskless machine didn't generate unix, vmcore is trash
1174738 segmapdev uses condition variables with spin-type mutexes
1172998 x86: auto_lookup(): assertion failure in mutex_exit() on non-existent fs
1175829 booting continuously a diskless machine over network hangs a single CPU machine
1151364 asynchronous I/O in the user level hurts RDBMS performance

This is a performance improvement for applications that use libaio for async IO to raw files or devices. There are no API changes; only a new version of libaio.so.1 is installed. One side benefit of this fix is that async IO to tape should now work. (The patch for bug 1151364 requires installation of libaio/kaio patch 102020-01 or later.)

1175829
--------
During booting diskless over the net, a single CPU system hangs. The cause is that the kernel decides to let PROM handle the profiling timer (L14) interrupt after programming the timer with default values.
The PROM waits for the profiling timer (L14) to interrupt, but that never happens because the timer was mistakenly stopped when it was programmed with default values.

1174738
-------
The problem is that lp is a spin-type mutex (the devctx lock) passed by segmapdev_fault. It is illegal, but probably not well documented, to pass a spin-type mutex to cv_wait(). It is also illegal to pass 1 as the arg to mutex_init() for spin-type mutexes. That is a machine-dependent spl argument, and 1 isn't a valid choice (it needs to be an SPL above lock level).

1172998
--------
The panic:

panic: mutex_adaptive_exit: mutex not owned by thread, lp f58936 70 owner e0000000 lock 0 waiters 0 curthread f5ce6360

Kernel crash dumps generated on diskless sun4m, sun4d or i86pc systems are not complete.

(from 101945-04)

1175968 non-master cpu network interfaces broken on SS1000
1172243 Customer runs application from dumb terminal and system crashes.
1169686 4.1.3 system on network goes down, hangs 2.3 system

The problem shows up when a "ps" thread is running through the virtual memory area to get the address space size for a mapped file. The address space lock is held and a get attributes function is called. This initiates an nfs get attribute request. If the machine that the request is made to is not responding, the nfs request will block. The address space lock held by the blocked ps thread might then block other processes on the local machine.

Typically when a server goes down all nfs file system activity is blocked on any clients. The nfs operation resumes once the server comes up. In this situation a server is powered down and causes a client to hang. The hang is due to a process pile-up. The client is doing a ps and its thread is holding the address space lock (as_lock) for a running process, call it A. Process A has a file mapped from the server.
The client ps thread path has reached rm_assize(), which needs to get the file size, so it calls VOP_GETATTR(), which goes across the wire to the server. This operation goes nowhere because the server is not running. The as_lock held by the ps process is blocking other processes such as init.

The solution is not to go over the wire but to return a cached entry for the file size. The change is to define a new attribute flag in vnode.h called ATTR_HINT. The rm_assize() function will use this flag when it calls VOP_GETATTR(). The nfs getattr function will see that the size of the file is requested and that the passed in flag is ATTR_HINT. It will return the file size from the rnode rather than make a request to the server.
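
The ATTR_HINT behaviour described above can be illustrated with a small user-level model. This is only a sketch under stated assumptions, not the kernel source: every name prefixed model_ or MODEL_ is hypothetical, the flag and mask values are invented for illustration, and a down server is modeled as an error return where the real request would block.

```c
#include <assert.h>

/* Hypothetical values; the README names ATTR_HINT but not its value. */
#define MODEL_ATTR_HINT 0x01   /* caller accepts a cached (hint) answer */
#define MODEL_AT_SIZE   0x08   /* attribute mask bit: file size wanted  */

/* Stand-in for the NFS rnode: holds the last size seen from the server. */
struct model_rnode {
    long cached_size;
    int  server_up;
    long server_size;
};

/* Stand-in for the attribute request structure. */
struct model_vattr {
    unsigned mask;
    long     size;
};

/* Over-the-wire request. The real call would block while the server is
 * down; this model returns -1 instead so the example terminates. */
int model_otw_getattr(struct model_rnode *rp, struct model_vattr *vap)
{
    if (!rp->server_up)
        return -1;
    rp->cached_size = rp->server_size;
    vap->size = rp->server_size;
    return 0;
}

/* getattr entry point: when the size is requested and the hint flag is
 * set, answer from the rnode and never contact the server. */
int model_nfs_getattr(struct model_rnode *rp, struct model_vattr *vap,
                      int flags)
{
    if ((flags & MODEL_ATTR_HINT) && (vap->mask & MODEL_AT_SIZE)) {
        vap->size = rp->cached_size;
        return 0;
    }
    return model_otw_getattr(rp, vap);
}
```

In this model a caller in the position of rm_assize() passes MODEL_ATTR_HINT and gets the cached size even while the server is down, so it never waits on the network while holding a lock.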
Running applications that do I_SETSIG on the console, when the console is the serial port (i.e. not the frame buffer), causes the system to crash when attempting to send a signal to a process.

Support for SC2000E and SS1000E was patched in the 2.3 and 2.4 releases, and integrated into the 2.5 release. This fix introduced a bug which causes non-zero system boards to have tpe-link-test set incorrectly. This has the effect of rendering the additional le interfaces non-functional.

(from 101945-03)

1169909 Running xlib code in Realtime class causes code to block in poll()
1167235 panic data fault in strioctl - apparently doing TIOCSPGRP
1150556 System becomes "panic: Overflow of asynchronous faults".

This change is for proper handling of memory ECC errors. Previously, an attempt to enqueue an error when the async fault queue was full resulted in the panic "Overflow of asynchronous faults".

The new functionality is: when the queue is full, discard the entry and disable correctable error interrupt generation. Schedule re-enable of interrupt generation (via timeout) after a period of 30 minutes. Message generation is enabled to log information regarding SIMM and faulting address. An additional message is output:

    Excessive Asynchronous Faults: Possible Memory Deterioration

An uncorrectable error occurring while the async fault queue is full results in an immediate panic.

In addition to queue overflow handling, the rate of error occurrence is also monitored. If 256 errors are reported in less than 1 second, ce interrupts are disabled. Re-enable of the ce interrupts is scheduled for 30 minutes later (via timeout).

Protect with a mutex the testing and setting of the session and controlling terminal related flags in the streamhead.

Real time stream threads will block in a poll.
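
The asynchronous fault handling described above for 1150556 can be modeled in user-level C. This is a sketch, not the kernel code: all model_/MODEL_ names are hypothetical, the queue capacity is invented, time is passed in as milliseconds, and the rate check uses a simple fixed one-second window that counts every reported error. Only the 256-error threshold and the 30-minute re-enable delay come from the text.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_QUEUE_LEN   8                    /* hypothetical capacity  */
#define MODEL_RATE_LIMIT  256                  /* errors per second      */
#define MODEL_REENABLE_MS (30L * 60L * 1000L)  /* 30 minutes             */

enum model_result { MODEL_QUEUED, MODEL_DISCARDED, MODEL_PANIC };

struct model_afq {
    int  depth;            /* entries currently queued                  */
    bool ce_enabled;       /* correctable-error interrupts enabled?     */
    long reenable_at_ms;   /* when the timeout would re-enable them     */
    long window_start_ms;  /* start of the current rate window          */
    int  window_count;     /* errors seen in the current window         */
};

/* Disable ce interrupts and record when they should come back. */
void model_disable_ce(struct model_afq *q, long now_ms)
{
    q->ce_enabled = false;
    q->reenable_at_ms = now_ms + MODEL_REENABLE_MS;
}

/* Remove one entry, as the fault-processing code would. */
void model_dequeue(struct model_afq *q)
{
    if (q->depth > 0)
        q->depth--;
}

/* Report one error at time now_ms. */
enum model_result model_report(struct model_afq *q, bool correctable,
                               long now_ms)
{
    /* Rate monitoring: start a new window once a second has passed. */
    if (now_ms - q->window_start_ms >= 1000) {
        q->window_start_ms = now_ms;
        q->window_count = 0;
    }
    if (++q->window_count >= MODEL_RATE_LIMIT)
        model_disable_ce(q, now_ms);

    /* Overflow handling: panic only for uncorrectable errors. */
    if (q->depth >= MODEL_QUEUE_LEN) {
        if (!correctable)
            return MODEL_PANIC;
        model_disable_ce(q, now_ms);
        return MODEL_DISCARDED;
    }
    q->depth++;
    return MODEL_QUEUED;
}
```

The design point the model tries to capture is that a full queue no longer panics for correctable errors; the entry is dropped and the interrupt source is silenced for a fixed period instead.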
(from 101945-02)

1174572 Viking workaround enabled on parts that do not need it
1172979 spurious SIGALRM received in test program that forks child processes
1172009 recv() on sockets should return the error only once for SunOS 4.X compatibility
1170862 kadb hangs on MP configuration
1173626 Race condition in ross625_mp_mmu_writepte() where ref/mod bits can be lost
1172242 HyperSPARC Ross_625 A.2/A.3 has bcopy error if destination page is read-only
1172245 iflush code need to be more intelligent for HyperSPARC-MP
1168398 MP CPU start up causes machine to lock up when booting from net
1166848 L1 A and then sync locks up machine
1166779 Add support for dragon+ dual power supply
1152922 prtdiag(1M) should display SBus clock frequency
1165687 reads on acceptor sockets not non-blocking under Solaris 2 when listener is
1160112 socket library accidentally closes file descriptor on error
1120225 recv() returns EPIPE when called with MSG_PEEK
1152710 socket lib in 2.3/2.2 have problems with not clearing bad connections and errno
1171478 socket recv() calls fail with EINVAL due to bad fix in 494

AF_UNIX and AF_INET sockets can sometimes get EPIPE errors for recv(MSG_PEEK). When the socket library sees the EPIPE error it will in some cases close the file descriptor, causing the application to get EBADF errors for subsequent operations.

An AF_UNIX listening socket can get into a permanent error state (returning EPIPE or ECONNRESET) for any operation until the socket is closed.

The non-blocking attribute of a socket endpoint is not transferred from a non-blocking listener endpoint to an accepting endpoint. This causes some non-blocking socket programs to block. This patch fixes the problem by setting the non-blocking attribute on the accepting endpoint if the listener was non-blocking.

Add dual power supply support to SC2000 and SC2000E systems. Systems with dual power supplies will receive warnings on the system console when one of the redundant power supplies fails.
Modify prtdiag(1M) to indicate SBus clock frequency. The SC2000E and SS1000E run with a 25 MHz SBus clock frequency. The SC2000 and SS1000 run with a 20 MHz SBus clock frequency. This change to prtdiag(1M) makes it easy to determine the SBus clock frequency on the system.

For versions greater than 2.10 of the Open Boot Prom, L1-A followed by "sync" will sometimes hang.

Summary: Patch to support Hypersparc CPU (Colorado) Modules.

Below is a brief description of each bug:

1170862
-------
On a Colorado MP machine, if you set up break-points in the kernel and try to resume from there, the machine sometimes hangs. Neither L1-A nor taking the keyboard helps. One has to power cycle the machine.

1173626
-------
There is a small window in "ross625_mp_mmu_writepte()" where the reference or modified bits can be lost before cpus are captured.

1172242
-------
This is needed for a Ross625 A0-A3 bug where, under some circumstances, the hardware bcopy can write into the destination even if the destination is write protected.

1172245
-------
The iflush code for Ross 625 (virtual address-cache cpu) needed per-cpu local flush support instead of always doing a global broadcast to flush all cpus.

1168398
-------
We have two processors in the system: CPU 0 and CPU 2. CPU 2 executed pause_cpus(), pause_cpus() created a pause_thread for CPU 0, and CPU 2 was spinning on safe_list[0] waiting for the pause thread for CPU 0 to set safe_list[0] to 1. But CPU 0 never executed its pause thread. Instead, CPU 0 took a level 14 interrupt, dropped into the PROM and never re-surfaced from the PROM.

In SunOS 4.X sockets, when a read() or recv*() call returns an error the application can do another read()/recv*() and get an EOF. This patch applies this subtle aspect of socket semantics to SunOS 5.X.
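
The "error once, then EOF" behaviour described above can be shown with a tiny model. It is a sketch of the semantics only, not the socket library code: model_sock and model_recv are hypothetical names, and the pending error is simply stored in a struct.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a socket endpoint with a latched error. */
struct model_sock {
    int pending_error;   /* 0 when no error is waiting to be reported */
};

/* Model of a read()/recv*() call on an endpoint with no more data.
 * Returns -1 and stores the error in *err_out the first time, then
 * returns 0 (EOF) on later calls because the error has been consumed. */
long model_recv(struct model_sock *so, int *err_out)
{
    if (so->pending_error != 0) {
        *err_out = so->pending_error;
        so->pending_error = 0;   /* report the error only once */
        return -1;
    }
    *err_out = 0;
    return 0;                    /* EOF */
}
```

With a pending ECONNRESET, the first model_recv() call fails with that error and the second returns 0, which is the sequence a SunOS 4.X application would expect.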
This specification of signal actions from the signal(5) manual page was being violated:

    Setting a signal action to SIG_IGN for a signal that is pending causes the pending signal to be discarded, whether or not it is blocked. Any queued values pending are also discarded, and the resources used to queue them are released and made available to queue other signals.

The condition under which the pending signal was not being discarded was the specific case of SIGALRM signals generated by the setitimer(ITIMER_REAL) interface. The malfunction happens in a narrow race condition, which is triggered by intensively alternating between setting a signal handler and setting the action to SIG_IGN while the itimer is active.

SunOS 5.4 sometimes enables a bug workaround on systems that do not need it.

(from 101945-01)

1173969 MT process doesn't stop on multi processor systems

dbx appears to malfunction when controlling a multithreaded process that does many fork1()s. The bug is in the system, not dbx. Also, stopping dbx with a job control signal from the terminal, ^Z, while it is controlling a multithreaded process will cause the multithreaded process to become permanently stopped.

(from 101918-01)

1157053 ESC8146 System panics when doing a copy to NFS file system mounted across FDDI-S

The problem is caused by non-aligned transfers. The memory address alignment trap happened in xdr_writeargs() when copying data in a loop. The address was not on a long word boundary; it was on a word boundary. nfs_feedback() can adjust the transfer address and size for a request, such as for a retransmission. xdr_writeargs() can make use of bcopy(). xdr_writeargs() is in the file nfs_xdr.c. There are a few other functions in this file that do a similar copy operation and should be changed to use bcopy().

Patch Installation Instructions:
--------------------------------
Generic 'installpatch' and 'backoutpatch' scripts are provided within each patch package with instructions appended to this section.
Other specific or unique installation instructions may also be necessary and should be described below.

Special Install Instructions:
-----------------------------
none

README -- Last modified date: Tuesday, January 7, 2003