Overview
This course describes and explains what can go wrong in an IBM z Systems environment, and what you can do about it as an operator or systems programmer. It looks at failure situations from many points of view: the physical computer rooms, hardware problems and the software environment.
The software environment is further examined by looking at the Recovery Termination Manager (RTM), the 'cleaning-up' function of z/OS, and its ABEND concept.
All the different reports that come out of a z/OS system in conjunction with failures (messages, dumps, traces, etc.) are also discussed. The most common reasons for system ABENDs, and how you can analyze the information coming out of the system when they occur, are also covered.

This course is also available for one-company, on-site presentations and for live presentation over the Internet, via the Virtual Classroom Environment service.
Prerequisites
An understanding of z/OS generally and of operational concepts in a z/OS system in particular, as taught in the course z/OS & JES2 Operations.
Delegates will learn how to
- explain the background of error messages
- diagnose operational central system problems
- suggest solutions to operational problems
- describe the diagnosis tools
- report problems and communicate with applications personnel and systems programmers
- understand what MVS’s Recovery Termination Manager (RTM) does when programs fail
- understand the concept of an ABEND
- analyze ABEND situations
- resolve ABEND situations.
Outline
What is an Operational z/OS problem?
The z/OS mainframe - a large system; what can go wrong?: the operational view, the application view; the "hole in the ground"; loss of electric power: mains supply, power inside the system, preparing for power failures; hardware problems - total loss of critical components; critical system software failure; MVS or DFP problems; VTAM or TCP/IP problems; TSO problems; problems in Database or Transaction Management systems; partial loss of hardware; CPUs; I/O channel paths; disk subsystems and DASD volumes; sections of the network; partial loss of system software; JES2 problems; JES3 problems; SMF problems; switching GLOBAL JES3; preparing for DSI; performing the DSI; ACTIVE - NOACTIVE; LASTDS; NOBUFFS; application systems failures; performance degradation; non-operational components; badly tuned system; humans or hardware?; actions: act in "real time" to attempt recovery, analyze afterwards; summary; review questions.
The Hardware - CPU and Storage
A mainframe installation - a lot of hardware; the hardware components; Central Processing Unit (CPU): real (central) storage; expanded storage; Channel Subsystem (CSS) and peripheral devices; Virtual Storage; CPU modes; controlling the modes - PSW; PSW control bits; where do you find the PSW?; the real thing in each CPU; partitioning creates multiple logical CPUs; copies saved by software; Disabled Wait; MVS has decided to stop the system; an incorrect PSW: was accidentally loaded, was deliberately loaded; Enabled Wait; Enabled Loop; Disabled Loop; review questions.
The Hardware - Input/Output Processing
I/O devices; Control Units; I/O processing in principle; Defining the I/O Configuration; the Hardware System Area (HSA); the MVS configuration; Hardware Configuration Definition (HCD); the I/O users in MVS; review questions.
Hardware Errors & Recovery
What is System Recovery?; hardware error types; soft errors; hard errors; terminating errors; Machine Check processing and MCIC; masking MC interrupts; external damage code; hardware error areas; CPU errors; storage errors; Channel Subsystem errors (I/O errors); soft CPU errors; System Recovery (SR); Degradation (DG); soft CPU error reporting; hard CPU errors; System Damage (SD); Instruction Processing Damage (PD); Information in PSW or Registers is valid (IV); Timing Facility Damage; the effect of hard CPU errors; terminating CPU errors; processing terminating CPU errors; Service Processor Damage; soft storage errors; MVS action after soft errors; hard storage errors; effect of hard storage errors; Channel Subsystem error reporting; Channel Path recovery; Terminal Error Condition; outstanding RESERVEs; Permanent Error Condition; Initialized Condition; I/O related errors; device/Control Unit errors (I/O errors); no path available; device status errors; Subchannel status errors; Hot I/O conditions; Hot I/O recovery; Hot I/O messages (non-DASD); Hot I/O messages (DASD); response to Hot I/O message; using IECIOSxx for Hot I/O processing; HIO options in IECIOSxx; example of IECIOSxx parameters; missing Interrupts; missing Interrupt intervals; special considerations for MIH intervals; Missing Interrupt messages; I/O Timing Facility; I/O Timing Messages; review questions.
z/OS MVS Software Environment
The z/OS environment - a lot of programs; software categories; the mission of an Operating System; workload in MVS; asking for MVS services; asynchronous MVS activities; asynchronous "unwelcome" MVS activities; summary; review questions.
Recovery Termination Manager (RTM)
Normal Program Termination; EXIT (SVC 3); abnormal program termination; Program Checks; system forced ABEND; program ABEND; why abnormal termination?; logical application error; program incomplete; application detected software error; system detected software error; hardware detected software error; PC FLIH and ABENDs; hardware detected software error example; Program Checks in the Supervisor; hardware problems; RTM actions; recovery; Functional Recovery Routines (FRRs); Extended Specify Task Abnormal Exit (ESTAE); system breakdown; software problem types; review questions.
MVS Error Reporting & Dumps
System error reporting; MVS dumps; Stand-Alone Dump (SADUMP); SVC dumps; user ABEND dumps; SYSUDUMP; SYSABEND; SYSMDUMP; CEEDUMP; generating a user ABEND dump; system generated ABEND dump; snap dumps; symptom dumps; review questions.
ABEND Analysis
What is ABEND?; the MVS ABEND service; why ABEND?; allows for recovery routines; task termination; tasks in an Address Space; how RTM is invoked; program checks; ABEND; how to trigger an ABEND; ABEND macro and SVC 13; CALLRTM macro; why not normal end?; application detected software errors; system detected software errors; all the system ABEND codes; where do you see the ABEND codes?; the NOTIFY message; the SYSLOG; the job log; the symptom dump; ABEND dumps; SVC dumps; Stand-Alone dumps; the symptom dump in the SYSLOG; the symptom dump in the job log; explanations of ABEND and reason codes; IBM z/OS manuals on the web; Quickref and similar tools; analysis approach; examples of ABEND code explanation; system messages - a good information source; system message prefix; message level; standard message types; alternative message types; message identifier and MVS components; examples of system messages; explanation of system messages; common system ABEND codes; system ABEND code numbers; common SVCs and their macros; the x22 codes - caused by outside events; the x13 codes - OPEN problems; other x13 codes; example of S013-18; 806 - Program not found; sequence of events; example of S806-04; 804, 80A, 878, 822 and DC2 - virtual storage problems; the Virtual Address Space; "above the bar"; traditional address space areas; the need for managing virtual storage; storage for the program code; storage obtained outside the program; Virtual Storage requests; limitations on Virtual Storage; ABEND and reason codes; requests for storage below 2 GB (GETMAIN and STORAGE OBTAIN); requests for storage above 2 GB (IARV64 GETSTOR); the REGION limit; the effects of different REGION values; example of ABEND S822; the MEMLIMIT parameter; example of ABEND SDC2; the 0Cx codes; the Program Check Interrupt; running RTM1; PC FLIH and ABENDs; the meaning of Program Checks; common ABENDs from Program Checks; ABEND S0C4; Storage Protect Keys; virtual address protection; reasons for translation exceptions; address truly invalid; address valid - new area; address valid - old area; other S0Cx ABENDs; PIC 0001 Operation Exception (ABEND S0C1); PIC 0002 Privileged Operation Exception (ABEND S0C2); PIC 0007 Data Exception (ABEND S0C7); the S0E0 and 0Dx codes; miscellaneous problems; problems with translations; Linkage Stack problems; the Sx37 and SB14 codes; Sx37; EOV processing; how disk data sets are allocated; Physical Sequential (PS) data sets; problems when allocating a PS data set; initial allocation; primary allocation failure; data set full; no secondary allocation (SD37-04); secondary allocations (SB37-04); example of unavailable primary allocation; example of SD37-04; message IEC031I; example of ABEND SB37-04; message IEC030I; Partitioned Data Sets (PDS); problems when allocating a PDS; initial allocation; data set full; no secondary allocation (SD37-04); secondary allocations (SE37-04); directory full (SB14-0C); example of ABEND SE37-04; message IEC032I; example of ABEND SB14; message IEC217I; Partitioned Data Sets Extended (PDSE); problems when allocating a PDSE; summary of common system ABEND codes; other ABEND codes; MVS system codes (Sxxx); user ABEND codes (Uxxxx).
LOGREC and EREP
The Error Recording Data Set (ERDS) of MVS; LOGREC in MVS; LOGREC contents; LOGREC Event Record types; re-initializing LOGREC with IFCDIP00; re-allocating LOGREC with IFCDIP00; the EREP program; EREP reports; controlling EREP.
Generalized Trace Facility (GTF)
Traces in MVS; what is GTF?; how to obtain a GTF trace; the GTF JCL procedure; starting GTF; traceable events; GTF parameters - I/O events; examples of I/O parameters; CCW tracing example; CCW tracing output; dispatcher events; external interrupts; program interrupts; GTF-tracing of VTAM activity; SVC interrupts; recovery routines and SLIP events; parameter summary.