| Last reviewed: 05/20/2016 | |
| HPE iLO NMI Watchdog Driver | |
| NMI sourcing for iLO based ProLiant Servers | |
| Documentation and Driver by | |
| Thomas Mingarelli | |
| The HPE iLO NMI Watchdog driver is a kernel module that provides basic | |
| watchdog functionality and the added benefit of NMI sourcing. Both the | |
| watchdog functionality and the NMI sourcing capability need to be enabled | |
| by the user. Remember that the two modes are not dependent on one another. | |
| A user can have the NMI sourcing without the watchdog timer and vice-versa. | |
| All references to iLO in this document imply it also works on iLO2 and all | |
| subsequent generations. | |
| Watchdog functionality is enabled like any other common watchdog driver. That | |
| is, an application needs to be started that kicks off the watchdog timer. A | |
| basic application exists in tools/testing/selftests/watchdog/ named | |
| watchdog-test.c. Simply compile the C file and kick it off. If the system | |
| gets into a bad state and hangs, the HPE ProLiant iLO timer register will | |
| not be updated in a timely fashion and a hardware system reset (also known as | |
| an Automatic Server Recovery (ASR)) event will occur. | |
| The hpwdt driver also has three (3) module parameters. They are the following: | |
| soft_margin - allows the user to set the watchdog timer value. | |
| Default value is 30 seconds. | |
| allow_kdump - allows the user to save off a kernel dump image after an NMI. | |
| Default value is 1/ON | |
| nowayout - basic watchdog parameter that does not allow the timer to | |
| be restarted or an impending ASR to be escaped. | |
| Default value is set when compiling the kernel. If it is set | |
| to "Y", then there is no way of disabling the watchdog once | |
| it has been started. | |
| NOTE: More information about watchdog drivers in general, including the ioctl | |
| interface to /dev/watchdog can be found in | |
| Documentation/watchdog/watchdog-api.txt and Documentation/IPMI.txt. | |
| The NMI sourcing capability is disabled by default due to the inability to | |
| distinguish between "NMI Watchdog Ticks" and "HW generated NMI events" in the | |
| Linux kernel. What this means is that the hpwdt nmi handler code is called | |
| each time the NMI signal fires off. This could amount to several thousands of | |
| NMIs in a matter of seconds. If a user sees the Linux kernel's "dazed and | |
| confused" message in the logs or if the system gets into a hung state, then | |
| the hpwdt driver can be reloaded. | |
| 1. If the kernel has not been booted with nmi_watchdog turned off then | |
| edit and place the nmi_watchdog=0 at the end of the currently booting | |
| kernel line. Depending on your Linux distribution and platform setup: | |
| For non-UEFI systems | |
| /boot/grub/grub.conf or | |
| /boot/grub/menu.lst | |
| For UEFI systems | |
| /boot/efi/EFI/distroname/grub.conf or | |
| /boot/efi/efi/distroname/elilo.conf | |
| 2. reboot the sever | |
| 3. Once the system comes up perform a modprobe -r hpwdt | |
| 4. modprobe /lib/modules/`uname -r`/kernel/drivers/watchdog/hpwdt.ko | |
| Now, the hpwdt can successfully receive and source the NMI and provide a log | |
| message that details the reason for the NMI (as determined by the HPE BIOS). | |
| Below is a list of NMIs the HPE BIOS understands along with the associated | |
| code (reason): | |
| No source found 00h | |
| Uncorrectable Memory Error 01h | |
| ASR NMI 1Bh | |
| PCI Parity Error 20h | |
| NMI Button Press 27h | |
| SB_BUS_NMI 28h | |
| ILO Doorbell NMI 29h | |
| ILO IOP NMI 2Ah | |
| ILO Watchdog NMI 2Bh | |
| Proc Throt NMI 2Ch | |
| Front Side Bus NMI 2Dh | |
| PCI Express Error 2Fh | |
| DMA controller NMI 30h | |
| Hypertransport/CSI Error 31h | |
| -- Tom Mingarelli |