
How can I find out whether the server encountered a disk problem before it rebooted, and who rebooted it?
HP-UX 11.31

asked 10/14/2011 09:35

LindaC ♦♦


12 Answers:
A disk problem should be logged to syslog and hopefully monitored by the event log.  You can also use cstm/stm to look at the drives.

To see who rebooted a machine, you need to enable auditing. This can be configured using SAM, but there are also a slew of commands (named aud*) to control the audit log system from the command line.

The audit log system is disabled by default. When it is enabled, it can log detailed information about the user's actions (the names of files read/written, the starting or stopping of processes and the like). The full list of audit action categories is available via the command "man 5 audit".

The file /.secure/etc/audnames lists the names of the audit logfiles, if the audit log system is enabled. The audit logfiles are in a binary format: you must use the "audisp" command to view them.
 
answered

paulc

Thank you.

I have the following entries in OLDsyslog.log, which I have attached to this entry.

answered 10/14/11 07:38 PM

LindaC

You need to look above that snippet, but excessive I/O like that can cause a crash.  Is the VG external to the server?  Check the connections, such as the FC cables and GBICs.  Most arrays have some built-in drive health checks you might consider running.

answered 10/14/11 07:38 PM

paulc

Some disks are on the SAN.

This is the list of filesystems (I am an Oracle DBA):

Filesystem          kbytes      used      avail  %used  Mounted on
/dev/vg00/lvol3    2097152    237912    1844824    11%  /
/dev/vg00/lvol1    1835008    180592    1641584    10%  /stand
/dev/vg00/lvol8   18432000   1681728   16621024     9%  /var
/dev/vg00/lvol7   10240000   4211992    5981024    41%  /usr
/dev/vg04/lvol1   50176000  36302270   13006719    74%  /u06
/dev/vg03/lvol1   70656000  33531106   34804791    49%  /u04
/dev/vg02/lvol1  245760000  54511994  179295126    23%  /u03
/dev/vg01/lvol1  101376000  31103631   66045976    32%  /u01
/dev/vg00/lvol6    5636096   3188720    2430128    57%  /tmp
/dev/vg06/lvol1   40894464   1114537   37293817     3%  /prod/ARCHIVE
/dev/vg00/lvol5   12288000   4289008    7936632    35%  /opt
/dev/vg00/lvol4    1048576     69288     971704     7%  /home
/dev/vg07/lvol1   19922944    712798   18009628     4%  /home/oracle
/dev/vg05/lvol1  102400000  27891973   69851281    29%  /exports

answered 10/14/11 08:23 PM

LindaC

OK.  What was happening when the I/O timeouts started?  For instance, I can create this problem by deleting LUNs on my VA, because LUN deletion is a foreground process, especially if the server is busy.

What else is in syslog?  A typical disk problem looks like this:

hostname vmunix: SCSI: Async write error -- dev: b 31 0x022000, errno: 126, resid: 8192,
hostname vmunix:   blkno: 45699672, sectno: 91399344, offset: 3846791168, bcount: 8192.
hostname vmunix:   blkno: 45699128, sectno: 91398256, offset: 3846234112, bcount: 8192.
hostname vmunix: SCSI: Read error -- dev: b 31 0x022000, errno: 126, resid: 1024,
hostname vmunix: SCSI: Async write error -- dev: b 31 0x022000, errno: 126, resid: 8192,
hostname vmunix:   blkno: 8, sectno: 16, offset: 8192, bcount: 1024.
hostname vmunix: LVM: VG 64 0x000000: PVLink 31 0x022000 Failed! The PV is not accessible.
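
To check a whole log for entries of this kind, a short script can help. The following is only a minimal sketch in Python, not an HP tool: it assumes the log file has been copied to a machine where Python is available, and the patterns are inferred from the sample lines above, so real messages may differ. The name `find_disk_errors` is made up for illustration.

```python
import re

# Patterns inferred from the sample messages above; real logs may differ.
SCSI_ERROR = re.compile(
    r"SCSI: (?P<kind>.*?error) -- dev: (?P<dev>b \d+ 0x[0-9a-fA-F]+), errno: (?P<errno>\d+)"
)
LVM_FAILURE = re.compile(
    r"LVM: VG (?P<vg>\d+ 0x[0-9a-fA-F]+): PVLink (?P<pv>\d+ 0x[0-9a-fA-F]+) Failed!"
)


def find_disk_errors(lines):
    """Return a list of (line_number, summary) for disk-related messages."""
    hits = []
    for number, line in enumerate(lines, start=1):
        scsi = SCSI_ERROR.search(line)
        if scsi:
            hits.append(
                (number, f"{scsi['kind']} on dev {scsi['dev']} (errno {scsi['errno']})")
            )
            continue
        lvm = LVM_FAILURE.search(line)
        if lvm:
            hits.append(
                (number, f"LVM path failure: VG {lvm['vg']}, PVLink {lvm['pv']}")
            )
    return hits
```

Passing an open file (for example `find_disk_errors(open("OLDsyslog.log", errors="replace"))`) returns one entry per matching line. An empty list only means these particular patterns were not found, not that the disks are healthy.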

answered 10/14/11 08:26 PM

paulc

Do you know where the reboot log is?  Why is it not in syslog?

answered 10/14/11 08:58 PM

LindaC

Could it have crashed today at 11:45 am without registering anything?

It is now 12:26 am (Saturday).

uptime

12:26am  up 12:55,  1 user,  load average: 0.22, 0.25, 0.23
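
The boot time can be worked out from this output by subtracting the uptime from the current time. A minimal sketch of the arithmetic in Python, assuming the date is Saturday, 15 October 2011 (taken from the dates on this thread) and keeping in mind that uptime only reports whole minutes:

```python
from datetime import datetime, timedelta

# Values copied from the uptime output above. The calendar date is an
# assumption based on the post ("Saturday") and the thread's timestamps.
now = datetime(2011, 10, 15, 0, 26)       # 12:26am
up = timedelta(hours=12, minutes=55)      # "up 12:55"

boot_time = now - up
print(boot_time.isoformat(sep=" ", timespec="minutes"))  # 2011-10-14 11:31
```

So the server came up at roughly 11:31 am on 14 October, give or take a minute.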

answered 10/14/11 09:12 PM

LindaC

I found the restart in syslog.log, but I don't know whether it was shut down or it crashed:

Oct 14 11:32:15 ebsprdb syslogd: restart
Oct 14 11:32:15 ebsprdb vmunix:
Oct 14 11:32:15 ebsprdb vmunix: MFS is defined: base= 0xe00000010205e000  size= 39928 KB
Oct 14 11:32:15 ebsprdb vmunix: Loaded ACPI revision 2.0 tables.
Oct 14 11:32:15 ebsprdb vmunix: MCA recovery subsystem disabled, not supported on this platform.
Oct 14 11:32:15 ebsprdb vmunix: montecito_proc_features: PROC_GET_FEATURES returned 0xfffffffffffffff8
Oct 14 11:32:15 ebsprdb vmunix: Using /stand/ext_ioconfig
Oct 14 11:32:15 ebsprdb vmunix: Memory Class Setup
Oct 14 11:32:15 ebsprdb vmunix: -------------------------------------------------------------------------
Oct 14 11:32:15 ebsprdb vmunix: Class     Physmem              Lockmem              Swapmem
Oct 14 11:32:15 ebsprdb vmunix:
Oct 14 11:32:15 ebsprdb  above message repeats 3 times
Oct 14 11:32:15 ebsprdb vmunix: System :  16552 MB             16552 MB             16552 MB
Oct 14 11:32:15 ebsprdb vmunix: Kernel :  16552 MB             16552 MB             16552 MB
Oct 14 11:32:15 ebsprdb vmunix: User   :  15243 MB             13665 MB             13718 MB
syslog.log (6%)

answered 10/14/11 09:26 PM

LindaC

A restart is not a crash; it means someone or some process did it.  Look above the restart for any messages that led up to it.  If there had been a crash, you should have files from the 14th in /var/adm/crash.
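
As a rough way to "look above the restart", the lines preceding each restart marker can be pulled out with a few lines of script. This is a minimal sketch in Python under some assumptions: the function name `lines_before_restart` is made up, the marker text is taken from the syslog excerpt posted above, and the logs are assumed to have been copied somewhere Python is available. HP-UX normally saves the previous log as OLDsyslog.log at boot, so the messages leading up to a reboot may sit at the end of that file rather than in syslog.log.

```python
def lines_before_restart(lines, context=20, marker="syslogd: restart"):
    """For each line containing `marker`, return (marker_line, preceding_lines).

    `preceding_lines` holds up to `context` lines that came just before it.
    """
    lines = [line.rstrip("\n") for line in lines]
    results = []
    for index, line in enumerate(lines):
        if marker in line:
            start = max(0, index - context)
            results.append((line, lines[start:index]))
    return results
```

To see what was logged just before the boot, one option is to read OLDsyslog.log followed by syslog.log into a single list and pass that list in.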

answered 10/14/11 09:29 PM

paulc

Thank you.  No crash log.

The thing is, we know that the other test servers belonging to this particular production server were affected by work that external IBM personnel did on the SAN disks yesterday.
This production server, however, did not appear to have any disk issue yesterday, October 14, and it may have been restarted by mistake.  We need to know whether this server had any kind of disk error.  I have uploaded the syslog and the OLDsyslog.log as attachments.

But no crash, and the test server has no crash either.

answered 10/15/11 10:10 AM

LindaC

I don't see those files, but I'd be happy to look at them.  Like I said, if this machine is attached to that SAN, then it is very likely that the work caused the excessive I/O, and that can cause a crash.  I had this happen to my 8420s earlier this year when the DBAs were migrating LUNs to different EMC arrays.  In the previous syslog excerpts you posted there are no indications of disk issues.

Take a look at the logtool in cstm for any other issues.  Cstm can also be used to test the system.  xstm is the graphical version.

http://h20000.www2.hp.com/bc/docs/support/SupportManual/c02620764/c02620764.pdf
http://h20000.www2.hp.com/bc/docs/support/SupportManual/c02620756/c02620756.pdf

answered 10/15/11 10:26 AM

paulc

Thank you so much for your help and valuable information!

answered 10/15/11 12:01 PM

LindaC

Asked: 10/14/2011 09:35

Seen: 454 times

Last updated: 10/15/2011 08:06