Network outages to Atlanta...

If we lose network connectivity to Atlanta for significant periods, usually because of FLR or NLR maintenance, the TSM infrastructure can paint itself into a corner from which it can't recover, stalling offsite processing until someone intervenes. OSG would learn about this because tsm3 would fail in Nagios, resulting in a call to the on-call staff person. If this happens, we may need to cancel some jobs that will not recover on their own. Here's how to check for this state, and how to resolve it.

This procedure can be done from the ATLCOPY server, or from the CTRL server using command redirection.

First, run Q SESS against ATLCOPY. The interesting symptoms will be the lowest-numbered (first few in the report) sessions. If they are in Receive Wait (RecvW) or Send Wait (SendW) for a duration approximately dating to the beginning of the outage, that's a definite problem. If there's a session in Idle Wait (IdleW) for a similar duration, that's a likely problem too.

The most precise test is Q SESS [session number] F=D. For example:

 

       tsm: ATLCOPY>q sess 24587 f=d

       Sess Number: 24,587
       Comm. Method: Tcp/Ip
       Sess State: IdleW
       Wait Time: 1.1 M
       Bytes Sent: 866
       Bytes Recvd: 4.0 G
       Sess Type: Node
       Platform: AIX-RS/6000
       Client Name: CTRL
       Media Access Status: Current output volume(s): U00117,(122 Seconds)
       User Name: CTRL
       Date/Time First Data Sent: 07/25/08 06:02:46
       Proxy By Storage Agent:

 

Any session whose wait time dates to the beginning of the outage, and that is also holding a tape drive (see the Media Access Status line), is likely a problem.
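If you have saved the detailed output to a file, the check above can be roughed out in a few lines of Python. This is only a sketch under assumptions: the function names, the 5-minute tolerance, and the unit letters other than M (S, H, D) are not from this procedure, and the field labels are taken from the sample output above.

```python
# Sketch: parse "q sess N f=d" output and test the tape-plus-long-wait rule.
# Assumptions: unit letters S/M/H/D for Wait Time, and a 5-minute tolerance.

SAMPLE = """\
Sess Number: 24,587
Sess State: IdleW
Wait Time: 1.1 M
Client Name: CTRL
Media Access Status: Current output volume(s): U00117,(122 Seconds)
Proxy By Storage Agent:
"""

UNIT_SECONDS = {"S": 1, "M": 60, "H": 3600, "D": 86400}


def parse_session(text):
    """Split each 'Label: value' line on its first colon into a dict."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def wait_seconds(wait_time):
    """Convert a Wait Time value such as '1.1 M' into seconds."""
    number, unit = wait_time.split()
    return float(number.replace(",", "")) * UNIT_SECONDS[unit.upper()]


def looks_stuck(fields, outage_seconds, tolerance_seconds=300):
    """True if the session holds a volume and has waited about as long
    as the outage has lasted (within the assumed tolerance)."""
    holds_tape = bool(fields.get("Media Access Status", ""))
    waited = wait_seconds(fields["Wait Time"])
    return holds_tape and waited >= outage_seconds - tolerance_seconds


if __name__ == "__main__":
    session = parse_session(SAMPLE)
    # The sample session has only waited about a minute, so a one-hour
    # outage does not implicate it.
    print(session["Sess Number"], looks_stuck(session, outage_seconds=3600))
```

The sample session above would not be flagged for an hour-long outage, since it has waited only about a minute. Treat the result as a hint for which sessions to look at, not as a decision.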

Here are some guidelines for which sessions might need attention:

  • SendW -- Send Wait: usually zero, sometimes a few minutes if the remote side is changing tapes. Long times are a warning flag.
  • RecvW -- Receive Wait: usually zero, sometimes a few minutes if the remote side is changing tapes. Long times are a warning flag.
  • MediaW -- Media Wait: any process waiting for a tape drive might sit in MediaW for a long time.
  • IdleW -- Idle Wait: sessions to ATLCTRL might be in IdleW for a long time and not be an issue. Sessions to other servers that have waited longer than 20 minutes might be questionable.
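The guidelines above can be written down as a small decision function, which may help when scanning a long session list. This is a sketch, not a rule: the 20-minute IdleW figure comes from the list, but the 10-minute cutoff for SendW and RecvW is an assumed stand-in for "a few minutes", and matching ATLCTRL against the session's client name is also an assumption.

```python
# Sketch of the per-state guidelines as a yes/no check.
# Assumed values: SEND_RECV_LIMIT_MIN, and comparing against client name.

SEND_RECV_LIMIT_MIN = 10      # assumption: "a few minutes" plus margin
IDLE_LIMIT_MIN = 20           # from the guideline for IdleW
IDLE_EXEMPT = {"ATLCTRL"}     # long IdleW here is not treated as an issue


def needs_attention(state, wait_minutes, client_name):
    """Return True if the session deserves a closer look."""
    if state in ("SendW", "RecvW"):
        return wait_minutes > SEND_RECV_LIMIT_MIN
    if state == "IdleW":
        if client_name.upper() in IDLE_EXEMPT:
            return False
        return wait_minutes > IDLE_LIMIT_MIN
    # MediaW (and any other state) can wait a long time legitimately.
    return False


if __name__ == "__main__":
    print(needs_attention("RecvW", 95, "CTRL"))      # long receive wait
    print(needs_attention("IdleW", 95, "ATLCTRL"))   # exempt server
    print(needs_attention("MediaW", 240, "CTRL"))    # waiting for a drive
```

A True result only means the session is worth checking with the detailed query; compare its wait time with the start of the outage before cancelling anything.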

Once you've identified the problem sessions, cancel each one with CANCEL SESSION [session number].
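When several sessions need cancelling, it can be convenient to generate the commands and paste them into the admin console. The sketch below only builds command strings; it does not talk to TSM. The function name is hypothetical, and the optional "SERVER:" prefix assumes the command redirection from CTRL uses the usual server-name prefix form. Note that the detailed report prints session numbers with a thousands separator (24,587), which is stripped here.

```python
# Sketch: build CANCEL SESSION commands from a list of session numbers.
# Assumption: redirection from CTRL is written as "ATLCOPY: <command>".

def cancel_commands(session_numbers, server=None):
    """Return one CANCEL SESSION command string per session number."""
    commands = []
    for raw in session_numbers:
        # Accept ints or strings such as "24,587" copied from the report.
        number = int(str(raw).replace(",", ""))
        command = f"CANCEL SESSION {number}"
        if server:
            command = f"{server}: {command}"
        commands.append(command)
    return commands


if __name__ == "__main__":
    # Run directly on ATLCOPY:
    for line in cancel_commands(["24,587"]):
        print(line)
    # Redirected from CTRL:
    for line in cancel_commands(["24,587"], server="ATLCOPY"):
        print(line)
```

Review the generated list against the Q SESS report before running any of it.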