OBSOLETE Patch-ID# 101945-07
Keywords: sockmod RDBMS segmapdev savecore hang crash fcntl prgetstatus strrput
Synopsis: SunOS 5.4: jumbo patch for kernel (includes libsocket procfs kadb nfs)
Date: Oct/03/1994

Install Requirements: None                      
Solaris Release: 2.4

SunOS Release: 5.4

Unbundled Product: 

Unbundled Release: 

Xref: This patch available for x86 as patch 101946

Topic: SunOS 5.4: jumbo patch for kernel (includes libsocket procfs kadb hypersparc nfs)

Relevant Architectures: sparc

BugId's fixed with this patch: 1120225 1150556 1151364 1152710 1152922 1157053 1160112 1165687 1166779 1166848 1167235 1168398 1169686 1169909 1170862 1171478 1172009 1172242 1172243 1172245 1172979 1172998 1173626 1173969 1174572 1174738 1174830 1174847 1175829 1175968 1176467 1177091 1177572 1177578

Changes incorporated in this version: 1177091 1177578 1176467 1172243

Patches accumulated and obsoleted by this patch: 101918-01

Patches which conflict with this patch: 

Patches required with this patch: 

Obsoleted by: 

Files included with this patch: 

/kadb
/kernel/drv/cn
/kernel/drv/tl
/kernel/fs/autofs
/kernel/fs/nfs
/kernel/fs/procfs
/kernel/strmod/sockmod
/kernel/sys/nfs
/kernel/unix
/usr/kvm/crash
/usr/kvm/prtdiag
/usr/lib/libsocket.a
/usr/lib/libsocket.so.1

Problem Description:

1177091 prgetstatus can generate pagefault holding p_lock, can deadlock if freemem is 0
1177578 strmakemsg/strgeterr causes panic in strrput due to NULL mblk ptr
1176467 fcntl system call fails in process run by rcmd
1172243 Customer runs application from dumb terminal and system crashes.
The system can freeze under heavy swapping pressure
due to procfs holding a critical lock when it takes
a page fault.
Doing I_SETSIG on a console window through serial line and exiting the process could cause a system panic.
Kernel panic in putnext/ptcwrite.
A socket endpoint not created through the socket library (by dup()
of a socket endpoint for example) may experience some failures
on fcntl()/ioctl() calls. (This bug is only limited to 2.4 release)
(from 101945-06)
1174847 SS5 running 4.1.3U1 - running customer application - HARD HANGS
1177572 Installing Solaris 2.4 ON patch 101945-05 and running OW causes machine to panic
The patch to bug ID 1151364 broke OW's consolidation. This happened
because releasef() changed to have an extra argument. OW shouldn't
have been dependent on releasef() which is private to the ON
consolidation. Since this problem was not discovered until after
the patch was made, it made more sense for ON to produce a new
patch which restores releasef() to have its old interface. The
interface changed for kaio. A new interface is added called areleasef()
which is only used by kaio.
This is an enhancement to the workaround created for bug 1161592.
Change is local to sun4m/swift cpu code and has NO impact on other
non-swift platforms.
(from 101945-05)
1174830 savecore on diskless machine didn't generate unix, vmcore is trash
1174738 segmapdev uses condition variables with spin-type mutexes
1172998 x86: auto_lookup(): assertion failure in mutex_exit() on non-existent fs
1175829 booting continuously a diskless machine over network hangs a single CPU machine
1151364 asynchronous I/O in the user level hurts RDBMS performance
This is a performance improvement for applications that are using libaio
for doing async IO to raw files or devices. There are no API changes, only
a new version of libaio.so.1 is installed. One side benefit of this fix
is that async IO to tape should now work. (This patch to bug 1151364
requires installation of libaio/kaio patch 102020-01 or later.)
1175829
--------
While booting diskless over the net, a single CPU system hangs. The cause is
that the kernel decides to let the PROM handle the profiling timer (L14)
interrupt after programming the timer with default values. The PROM waits
for the profiling timer (L14) to interrupt, but that never happens because the
timer was mistakenly stopped when it was programmed with default values.
1174738
-------
The problem is that lp is a spin-type mutex (the devctx lock) passed by
segmapdev_fault.
It is illegal, but probably not well documented, to pass a spin-type mutex
to cv_wait().
It is also illegal to pass 1 as the arg to mutex_init() for spin-type mutexes.
That is a machine-dependent spl argument, and 1 isn't a valid choice (it needs
to be an SPL above lock level).
1172998
--------
The panic:
panic: mutex_adaptive_exit: mutex not owned by thread, lp f58936
	70 owner e0000000 lock 0 waiters 0 curthread f5ce6360
Kernel crash dumps generated on diskless sun4m, sun4d or i86pc systems are
not complete.
(from 101945-04)
1175968 non-master cpu network interfaces broken on SS1000
1172243 Customer runs application from dumb terminal and system crashes.
1169686 4.1.3 system on network goes down, hangs 2.3 system
The problem shows up when a "ps" thread is running through the virtual memory
area to get the address space size for a mapped file. The address space lock is
held and a get attributes function is called. This initiates an nfs get 
attribute request. If the machine that the request is made to is not responding
the nfs request will block. The address space lock which is held by the blocked
ps thread might block other processes on the local machine.
Typically when a server goes down all nfs file system activity is blocked
on any clients. The nfs operation resumes once the server comes up. In this
situation a server is powered down and causes a client to hang. The hang is
due to a process pile-up. The client is doing a ps and its thread is holding
the address space lock (as_lock) for a running process, call it A. Process A
is a mapped file from the server. The client ps thread path has reached
rm_assize() which needs to get the file size so it calls VOP_GETATTR()
which goes across the wire to the server. This operation goes nowhere because
the server is not running. The as_lock held by the ps process is blocking
other processes such as init.
The solution is not to go over the wire but to return a cached entry for the
file size. The change is to define a new attribute flag in vnode.h called
ATTR_HINT. The rm_assize() function will use this flag when it
calls VOP_GETATTR(). The nfs getattr function will see that the size of the
file is requested and that the passed in flag is ATTR_HINT. It will return the
file size from the rnode rather than make a request to the server.
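The ATTR_HINT behaviour described above can be modelled in user space. The
following is only a sketch under stated assumptions: the flag value, the
model_rnode structure and the model_getattr() function are hypothetical and
are not the actual kernel code. It shows a getattr routine that answers from
the cached size when the hint flag is passed and never goes over the wire.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical value; the README names ATTR_HINT but does not give one. */
#define ATTR_HINT 0x1

/* Simplified stand-in for an NFS rnode holding cached attributes. */
struct model_rnode {
    long cached_size;   /* last file size learned from the server */
    long server_size;   /* what the server would report now */
    int  server_up;     /* 0 models a server that is not responding */
    int  rpc_calls;     /* counts simulated over-the-wire requests */
};

/*
 * Model of a getattr routine. Returns 0 on success and -1 where the real
 * call would have blocked on an unresponsive server.
 */
int model_getattr(struct model_rnode *rp, int flags, long *sizep)
{
    assert(rp != NULL && sizep != NULL);

    if (flags & ATTR_HINT) {
        /* Caller accepts a possibly stale size: answer from the cache. */
        *sizep = rp->cached_size;
        return 0;
    }

    rp->rpc_calls++;            /* simulated request to the server */
    if (!rp->server_up)
        return -1;
    rp->cached_size = rp->server_size;
    *sizep = rp->cached_size;
    return 0;
}
```

In this model a caller that holds a lock other processes need, such as the ps
path holding as_lock, would pass ATTR_HINT and so could not be stalled by a
dead server.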
Running applications that do I_SETSIG on console, when console
is the serial port (i.e not the frame buffer), causes system
to crash, when attempting to send signal to a process.
Support for SC2000E and SS1000E was patched in the 2.3 and 2.4 releases,
and integrated into the 2.5 release. This fix introduced a bug which
causes non-zero system boards to have tpe-link-test turned to the
incorrect setting. This has the effect of rendering the additional
le interfaces non-functional.
(from 101945-03)
1169909 Running xlib code in Realtime class causes code to block in poll()
1167235 panic data fault in strioctl - apparently doing TIOCSPGRP
1150556 System becomes "panic: Overflow of asynchronous faults".
          This change is for proper handling of memory ECC errors.  Previously,
          an attempt to enqueue an error when the async fault queue was full,
          resulted in panic: "Overflow of asynchronous faults"
          The new functionality is:
          When the queue is full, discard the entry and disable correctable
          error interrupt generation.  Schedule re-enable of interrupt
          generation (via timeout) after a period of 30 minutes.
          Message generation is enabled to log information regarding SIMM and
          faulting address.  An additional message is output:
          Excessive Asynchronous Faults: Possible Memory Deterioration
          An uncorrectable error occurring while the async fault queue is full
          results in immediate panic.
          In addition to queue overflow handling, the rate of error
          occurrence is also monitored.  If the rate of errors is such
          that 256 errors are reported in less than 1 second, ce interrupts
          are disabled.  Re-enable of the ce interrupts is scheduled for
          30 minutes (via timeout).
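The rate check described above can be sketched as a small state machine. This
is an illustrative model only, not the kernel implementation: the structure
and function names are hypothetical, time is passed in as milliseconds, and a
fixed one-second window is assumed because the README does not say how the
rate is measured. Only the 256-error threshold and the 30-minute re-enable
delay come from the text.

```c
#include <stdint.h>

#define CE_BURST_COUNT   256u                 /* threshold from the README */
#define CE_BURST_WINDOW  1000u                /* 1 second, in milliseconds */
#define CE_REENABLE_MS   (30u * 60u * 1000u)  /* 30 minutes, in milliseconds */

struct ce_monitor {
    uint64_t window_start;  /* time of the first error in the current window */
    unsigned count;         /* errors counted in the current window */
    int      enabled;       /* 1 while CE interrupts are enabled */
    uint64_t reenable_at;   /* meaningful only while enabled == 0 */
};

void ce_monitor_init(struct ce_monitor *m)
{
    m->window_start = 0;
    m->count = 0;
    m->enabled = 1;
    m->reenable_at = 0;
}

/* Models the timeout firing: re-enable once the delay has elapsed. */
void ce_monitor_tick(struct ce_monitor *m, uint64_t now_ms)
{
    if (!m->enabled && now_ms >= m->reenable_at) {
        m->enabled = 1;
        m->count = 0;
    }
}

/*
 * Report one correctable error at time now_ms. Returns 1 if the error was
 * accepted, 0 if CE interrupts are currently disabled.
 */
int ce_monitor_report(struct ce_monitor *m, uint64_t now_ms)
{
    ce_monitor_tick(m, now_ms);
    if (!m->enabled)
        return 0;

    /* Start a new window if none is open or the old one has expired. */
    if (m->count == 0 || now_ms - m->window_start >= CE_BURST_WINDOW) {
        m->window_start = now_ms;
        m->count = 0;
    }

    m->count++;
    if (m->count >= CE_BURST_COUNT) {
        /* 256 errors inside one window: disable and schedule re-enable. */
        m->enabled = 0;
        m->reenable_at = now_ms + CE_REENABLE_MS;
    }
    return 1;
}
```

With this model, errors arriving more slowly than 256 per second keep
restarting the window and never trip the threshold.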
Protect with mutex the testing and setting of the session and controlling
terminal related flags in the streamhead. 
Real time stream threads will block in a poll.
(from 101945-02)
1174572 Viking workaround enabled on parts that do not need it
1172979 spurious SIGALRM received in test program that forks child processes
1172009 recv() on sockets should return the error only once for SunOS 4.X compatibility
1170862 kadb hangs on MP configuration
1173626 Race condition in ross625_mp_mmu_writepte() where ref/mod bits can be lost
1172242 HyperSPARC Ross_625 A.2/A.3 has bcopy error if destination page is read-only
1172245 iflush code need to be more intelligent for HyperSPARC-MP
1168398 MP CPU start up causes machine to lock up when booting from net
1166848 L1 A and then sync locks up machine
1166779 Add support for dragon+ dual power supply
1152922 prtdiag(1M) should display SBus clock frequency
1165687 reads on acceptor sockets not non-blocking under Solaris 2 when listener is
1160112 socket library accidentally closes file descriptor on error
1120225 recv() returns EPIPE when called with MSG_PEEK
1152710 socket lib in 2.3/2.2 have problems with not clearing bad connections and errno
1171478 socket recv() calls fail with EINVAL due to bad fix in 494
AF_UNIX and AF_INET sockets can sometimes get EPIPE errors for recv(MSG_PEEK).
When the socket library sees the EPIPE error it will in some cases close
the file descriptor causing the application to get EBADF errors for subsequent
operations.
An AF_UNIX listening socket can get into a permanent error state
(returning EPIPE or ECONNRESET) for any operation until the socket is closed.
The non-blocking attribute of a socket endpoint is not transferred
from a non-blocking listener endpoint to an accepting endpoint.
This causes some socket non-blocking programs to block. This
patch fixes the problem by setting the accepting endpoint non-blocking
attribute if the listener was non-blocking.
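An application can express the same idea portably on systems where accept()
does not carry the attribute over. The sketch below is not the patched
library code: copy_nonblock_flag() and accept_inherit_nonblock() are
hypothetical helper names, and only standard fcntl() and accept() calls are
used.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * If descriptor `from` is non-blocking, make descriptor `to` non-blocking
 * as well. Returns 0 on success and -1 on failure with errno set.
 */
int copy_nonblock_flag(int from, int to)
{
    int src = fcntl(from, F_GETFL, 0);
    int dst = fcntl(to, F_GETFL, 0);

    if (src == -1 || dst == -1)
        return -1;
    if (!(src & O_NONBLOCK))
        return 0;               /* listener is blocking: nothing to copy */
    return fcntl(to, F_SETFL, dst | O_NONBLOCK) == -1 ? -1 : 0;
}

/*
 * Accept a connection and give the new endpoint the listener's
 * non-blocking attribute. Returns the new descriptor, or -1 on failure.
 */
int accept_inherit_nonblock(int listener)
{
    int conn = accept(listener, NULL, NULL);

    if (conn == -1)
        return -1;
    if (copy_nonblock_flag(listener, conn) == -1) {
        close(conn);
        return -1;
    }
    return conn;
}
```

A server loop would call accept_inherit_nonblock(listener) in place of
accept() so that reads on the accepted endpoint behave like the listener.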
Add dual power supply support to SC2000 and SC2000E systems. Systems
with dual power supplies will receive warnings on system console when
one of the redundant power supplies fails.
Modify prtdiag(1M) to indicate SBus Clock frequency. The SC2000E and
SS1000E run with 25 MHz SBus clock frequency. The SC2000 and SS1000
run with 20 MHz SBus clock frequency. This change to prtdiag(1M) 
makes it easy to determine the SBus clock frequency on the system.
For versions greater than 2.10 of the Open Boot Prom, L1-A followed
by "sync" will sometimes hang.
Summary: Patch to support Hypersparc CPU (Colorado) Modules. 
Below is a brief description of each bug:
1170862
-------
On a Colorado MP machine, if you set up break-points in the kernel and
try to resume from there, the machine sometimes hangs. Neither L1-A
nor taking the keyboard helps. One has to power cycle the machine.
1173626
-------
There is a small window in "ross625_mp_mmu_writepte()" where the reference
or modified bits can be lost before the CPUs are captured.
1172242
-------
This is needed for a Ross625 A0-A3 bug where it's possible
for the hardware bcopy to write into the destination if the 
destination is write protected under some circumstances.
1172245:
-------
The iflush code for Ross 625 (virtual address-cache CPU) needed per-CPU
local flush support instead of always doing a global broadcast to flush
all CPUs.
1168398
-------
We have two processors in the system: CPU 0 and CPU 2. CPU 2 executed
pause_cpus(), which created a pause thread for CPU 0, and CPU 2 was spinning
on safe_list[0] waiting for the pause thread for CPU 0 to set safe_list[0] to
1. But CPU 0 never executed its pause thread. Instead, CPU 0 took a level 14
interrupt, dropped into the PROM, and never resurfaced from the PROM.
In SunOS 4.X sockets when a read() or recv*() call returns an error the 
application can do another read()/recv*() and get an EOF. This patch applies 
this subtle aspect of socket semantics to SunOS 5.X.
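The "error once, then EOF" behaviour can be illustrated with a tiny model of
a socket endpoint. This is a sketch only: struct model_sock and both
functions are hypothetical and stand in for the real stream head and socket
library state, which the README does not describe.

```c
#include <errno.h>

/* Minimal model of the error state kept for one socket endpoint. */
struct model_sock {
    int pending_error;  /* errno value waiting to be reported, 0 if none */
    int at_eof;         /* 1 once the connection is known to be finished */
};

/* Model a fatal connection error arriving, for example ECONNRESET. */
void model_sock_fail(struct model_sock *s, int err)
{
    s->pending_error = err;
    s->at_eof = 1;
}

/*
 * Model of read()/recv() with no data queued. The first call after a
 * failure returns -1 with errno set; later calls return 0 (EOF).
 */
long model_sock_recv(struct model_sock *s)
{
    if (s->pending_error != 0) {
        errno = s->pending_error;
        s->pending_error = 0;   /* report the error only once */
        return -1;
    }
    if (s->at_eof)
        return 0;               /* end of file on every later call */
    errno = EAGAIN;             /* healthy endpoint with nothing to read */
    return -1;
}
```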
This specification of signal actions from the signal(5)
manual page was being violated:
                Setting  a  signal action to SIG_IGN for a signal
     that is pending causes the pending signal to  be  discarded,
     whether or not it is blocked.  Any queued values pending are
     also discarded, and the resources used  to  queue  them  are
     released and made available to queue other signals.
The condition under which the pending signal was not being
discarded was the specific case of SIGALRM signals generated
by the setitimer(ITIMER_REAL) interface.  The malfunction
happens in a narrow race condition which will be triggered
under intensive setting of a signal handler and setting it
to SIG_IGN while the itimer is active.
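The signal(5) rule quoted above can be checked from user space. The sketch
below does not reproduce the setitimer() race itself; it raises SIGALRM
directly while the signal is blocked, then sets the action to SIG_IGN and
looks at the pending set. The function name is hypothetical, and a
single-threaded process is assumed because sigprocmask() is used.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

/*
 * Returns 0 if a pending, blocked SIGALRM was discarded when its action
 * was set to SIG_IGN, 1 if it is still pending, and -1 if the signal
 * could not be made pending in the first place.
 */
int sigalrm_pending_after_ignore(void)
{
    sigset_t block, old, pending;
    struct sigaction act, saved;
    int was_pending, result = -1;

    sigemptyset(&block);
    sigaddset(&block, SIGALRM);
    if (sigprocmask(SIG_BLOCK, &block, &old) == -1)
        return -1;

    /* Use the default action first so the raised signal is not ignored. */
    act.sa_flags = 0;
    sigemptyset(&act.sa_mask);
    act.sa_handler = SIG_DFL;
    if (sigaction(SIGALRM, &act, &saved) == -1) {
        sigprocmask(SIG_SETMASK, &old, NULL);
        return -1;
    }

    raise(SIGALRM);                     /* blocked, so it stays pending */
    sigpending(&pending);
    was_pending = sigismember(&pending, SIGALRM);

    act.sa_handler = SIG_IGN;           /* should discard the pending signal */
    sigaction(SIGALRM, &act, NULL);
    sigpending(&pending);
    if (was_pending == 1)
        result = sigismember(&pending, SIGALRM);

    /* Unblock while SIG_IGN is in effect, then restore the old action. */
    sigprocmask(SIG_SETMASK, &old, NULL);
    sigaction(SIGALRM, &saved, NULL);
    return result;
}
```

On a system that follows the specification, the function is expected to
return 0.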
SunOS 5.4 sometimes enables a bug workaround on systems
that do not need it.
(from 101945-01)
1173969 MT process doesn't stop on multi processor systems
dbx appears to malfunction when controlling a multithreaded
process that does many fork1()s.  The bug is in the system, not dbx.
Also, stopping dbx with a jobcontrol signal from the terminal, ^Z,
while it is controlling a multithreaded process will cause the
multithreaded process to become permanently stopped.
(from 101918-01)
1157053 ESC8146 System panics when doing a copy to NFS file system mounted across FDDI-S
The problem is caused by non-aligned transfers. The memory address alignment
trap happened in xdr_writeargs() when copying data in a loop. The address was
not on a long word boundary; it was on a word boundary. nfs_feedback() can
adjust the transfer address and size for a request, such as for a
retransmission. xdr_writeargs() can make use of bcopy(). xdr_writeargs() is in
the file nfs_xdr.c. A few other functions in this file do a similar copy
operation and should be changed to use bcopy().
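The alignment issue can be illustrated in portable C. This is a sketch and
not the nfs_xdr.c code: copy_words_any_alignment() is a hypothetical helper,
and memcpy() is used as the standard counterpart of bcopy(). A byte-wise copy
makes no assumption about the source address, whereas a loop that
dereferences the address as a pointer to a wider type can trap on SPARC when
the address is not suitably aligned.

```c
#include <stdint.h>
#include <string.h>

/*
 * Copy nwords 32-bit values from a buffer that may start on any byte
 * boundary. memcpy() handles misaligned sources; a loop that reads
 * through a cast such as *(const uint32_t *)src would not be safe here.
 */
void copy_words_any_alignment(uint32_t *dst, const unsigned char *src,
                              size_t nwords)
{
    memcpy(dst, src, nwords * sizeof(uint32_t));
}
```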

Patch Installation Instructions:
--------------------------------
Generic 'installpatch' and 'backoutpatch' scripts are provided
within each patch package with instructions appended to this section.
Other specific or unique installation instructions may also be
necessary and should be described below.

Special Install Instructions:
-----------------------------
none

README -- Last modified date:  Tuesday, January 7, 2003