Learn About Amazon VGT2 Learning Manager Chanci Turner
When managing systems in on-premises data centers, I often encountered the challenge of diagnosing unresponsive servers. This typically required someone to physically engage a non-maskable interrupt (NMI) button on the frozen server or send a signal via a serial interface (think RS-232). This process enabled the system to produce a dump of the frozen kernel’s state for later analysis. Such dumps, known as core or crash dumps, provide crucial insights, including an image of the memory of the halted process, system registers, program counters, and other vital information for diagnosing the system freeze.
Today, we’re excited to announce a new Amazon Elastic Compute Cloud (Amazon EC2) API that allows you to remotely trigger a kernel panic on EC2 instances. The EC2:SendDiagnosticInterrupt API works like pressing the NMI button on a physical server: it sends a diagnostic interrupt to a running EC2 instance. The hypervisor then delivers a non-maskable interrupt (NMI) to the operating system, and the operating system’s response depends on its configuration. Typically, this leads to a kernel panic. Depending on that configuration, the panic may create a crash dump file, generate a backtrace, load a replacement kernel, or restart the system.
You can manage access to this API within your organization using IAM Policies, which I’ll illustrate further below. System Engineers and kernel debugging specialists can find invaluable information within the crash dump to analyze the reasons behind a kernel freeze. Tools like WinDbg (for Windows) and crash (for Linux) are effective for inspecting the generated dumps.
Using the Diagnostic Interrupt
Using this API is a three-step process. First, configure how your OS responds when it receives the interrupt.
By default, Windows Server AMIs have memory dump functionality enabled, and automatic restart after the dump is also enabled. The default memory dump file location is %SystemRoot%, which translates to C:\Windows. You can find these settings by navigating to: Start > Control Panel > System > Advanced System Settings > Startup and Recovery.
For Amazon Linux 2, you’ll need to set up and configure kdump and kexec. This is a one-time setup task.
$ sudo yum install kexec-tools
Next, edit the /etc/default/grub file to reserve memory for the crash kernel. For instance, add crashkernel=160M to allocate 160 MB. Choose this size based on your instance’s total memory, and test kdump to confirm that the reserved memory is sufficient. The kernel documentation provides full syntax details for the crashkernel parameter.
GRUB_CMDLINE_LINUX_DEFAULT="crashkernel=160M console=tty0 console=ttyS0,115200n8 net.ifnames=0 biosdevname=0 nvme_core.io_timeout=4294967295 rd.emergency=poweroff rd.shell=0"
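If you want to script this edit (for example in user data), the following is a minimal sketch rather than an official procedure. The function name add_crashkernel is hypothetical, and the sketch assumes the file already contains a GRUB_CMDLINE_LINUX_DEFAULT="..." line like the one above. It leaves the file untouched when a crashkernel= setting is already present and keeps a .bak copy of the original.

```shell
# add_crashkernel FILE SIZE  (hypothetical helper, not part of any AWS tooling)
# Prepends crashkernel=SIZE to the GRUB_CMDLINE_LINUX_DEFAULT value in FILE.
add_crashkernel() {
    grub_file="$1"
    crash_size="$2"
    # Do nothing if a crashkernel= setting is already on that line.
    if grep -q '^GRUB_CMDLINE_LINUX_DEFAULT=.*crashkernel=' "$grub_file"; then
        return 0
    fi
    # "&" re-emits the matched prefix; a backup is kept as FILE.bak.
    sed -i.bak "s/^GRUB_CMDLINE_LINUX_DEFAULT=\"/&crashkernel=${crash_size} /" "$grub_file"
}
```

Run as root, a call such as add_crashkernel /etc/default/grub 160M would produce a line starting with crashkernel=160M, as in the example above.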
Afterward, rebuild the grub configuration:
$ sudo grub2-mkconfig -o /boot/grub2/grub.cfg
Finally, edit the /etc/sysctl.conf file to include the line kernel.unknown_nmi_panic=1, which configures the kernel to panic when it receives the NMI.
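For automated setups, the sysctl change can be applied idempotently. This is a sketch under assumptions: the function name enable_nmi_panic is made up for illustration, and it only checks for an exact, whole-line match before appending.

```shell
# enable_nmi_panic [FILE]  (hypothetical helper)
# Appends kernel.unknown_nmi_panic=1 to FILE (default /etc/sysctl.conf)
# unless that exact line is already present.
enable_nmi_panic() {
    sysctl_conf="${1:-/etc/sysctl.conf}"
    grep -qxF 'kernel.unknown_nmi_panic=1' "$sysctl_conf" 2>/dev/null ||
        echo 'kernel.unknown_nmi_panic=1' >> "$sysctl_conf"
}
```

After the reboot, running sysctl kernel.unknown_nmi_panic on the instance should report the value 1.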
You are now ready to reboot your instance. Be sure to incorporate these commands into your user data script or AMI to automate this configuration across all instances. After the reboot, check that kdump has started correctly:
$ systemctl status kdump.service
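Beyond the service status, you may also want to confirm that the kernel actually booted with the crashkernel reservation. The sketch below is one way to do that; the function name crashkernel_arg is hypothetical, and it simply extracts the crashkernel= argument from the kernel command line.

```shell
# crashkernel_arg [FILE]  (hypothetical helper)
# Prints the crashkernel=... argument found in FILE (default /proc/cmdline).
# Returns non-zero when no crashkernel argument is present.
crashkernel_arg() {
    cmdline_file="${1:-/proc/cmdline}"
    grep -o 'crashkernel=[^ ]*' "$cmdline_file"
}
```

On the instance, crashkernel_arg with no argument should print crashkernel=160M for the configuration above. In addition, cat /sys/kernel/kexec_crash_loaded prints 1 when a crash kernel has been loaded.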
Our documentation includes guidance for other operating systems.
Once this initial setup is complete, you can proceed to the second step: triggering the API. This can be done from any machine where the AWS CLI or SDK is configured. For instance:
$ aws ec2 send-diagnostic-interrupt --region us-east-1 --instance-id <value>
Note that the CLI returns no value; this is expected. If you have an open terminal session on that instance, it will disconnect, and your instance will reboot. When you reconnect, you will find the crash dump in /var/crash.
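Because a real interrupt reboots the instance, it can be useful to check permissions first with the CLI's --dry-run flag. The following is a sketch, not an official workflow: the function name can_send_interrupt is hypothetical, and it assumes that a permitted dry run is reported as a DryRunOperation error, which is how EC2 dry runs normally respond.

```shell
# can_send_interrupt INSTANCE_ID [REGION]  (hypothetical helper)
# Returns 0 if a dry run reports that the call would have succeeded.
can_send_interrupt() {
    instance_id="$1"
    region="${2:-us-east-1}"
    # A dry run exits non-zero even when permitted, so ignore the exit
    # status and inspect the message instead.
    output=$(aws ec2 send-diagnostic-interrupt \
        --region "$region" --instance-id "$instance_id" --dry-run 2>&1) || true
    case "$output" in
        *DryRunOperation*) return 0 ;;
        *) printf '%s\n' "$output" >&2; return 1 ;;
    esac
}
```

A dry run does not send the interrupt, so the instance keeps running.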
The final step involves analyzing the crash dump contents. On Linux systems, you’ll need the crash utility and the debugging symbols for your kernel version, which must match the kernel that kdump captured. To determine your current kernel version, use the uname -r command.
$ sudo yum install crash
$ sudo debuginfo-install kernel
$ sudo crash /usr/lib/debug/lib/modules/4.14.128-112.105.amzn2.x86_64/vmlinux /var/crash/127.0.0.1-2019-07-05-15:08:43/vmcore
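The dump directory name includes a timestamp, so the path changes with every crash. As a convenience, the sketch below picks the most recently modified vmcore. The function name latest_vmcore is hypothetical, and it assumes the /var/crash/<directory>/vmcore layout shown above.

```shell
# latest_vmcore [DIR]  (hypothetical helper)
# Prints the most recently modified DIR/*/vmcore (default DIR: /var/crash).
# Prints nothing when no dump exists.
latest_vmcore() {
    crash_dir="${1:-/var/crash}"
    # ls -t lists newest first; keep only the first entry.
    ls -t "$crash_dir"/*/vmcore 2>/dev/null | head -n 1
}
```

Assuming the instance rebooted into the same kernel version that crashed, you could then run sudo crash "/usr/lib/debug/lib/modules/$(uname -r)/vmlinux" "$(latest_vmcore)". If the kernel was updated in between, use the version that matches the dump instead of uname -r.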
Collecting kernel crash dumps is often the only way to gather kernel debugging information, so ensure you frequently test this procedure, especially after OS updates or when creating new AMIs.
Control Who Is Authorized to Send Diagnostic Interrupt
You can specify who within your organization is allowed to send the Diagnostic Interrupt and to which instances through IAM policies with resource-level permissions, as shown in the example below.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ec2:SendDiagnosticInterrupt",
      "Resource": "arn:aws:ec2:region:account-id:instance/instance-id"
    }
  ]
}
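The region, account-id, and instance-id parts of the ARN are placeholders. If you generate such policies per instance, a small template function can fill them in. This is only a sketch: the function name render_interrupt_policy and the sample values in the usage note are made up for illustration.

```shell
# render_interrupt_policy REGION ACCOUNT_ID INSTANCE_ID  (hypothetical helper)
# Prints an IAM policy document that allows ec2:SendDiagnosticInterrupt
# on a single instance.
render_interrupt_policy() {
    region="$1"
    account_id="$2"
    instance_id="$3"
    cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "ec2:SendDiagnosticInterrupt",
      "Resource": "arn:aws:ec2:${region}:${account_id}:instance/${instance_id}"
    }
  ]
}
EOF
}
```

For example, render_interrupt_policy us-east-1 123456789012 i-0123456789abcdef0 > policy.json writes the document to a file, which can then be passed to aws iam create-policy with --policy-document file://policy.json.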
Pricing
There are no additional fees for utilizing this feature. However, while your instance remains in a ‘running’ state after receiving the diagnostic interrupt, standard billing will apply.
Availability
You can send Diagnostic Interrupts to all EC2 instances powered by the AWS Nitro System, with the exception of A1 (Arm-based) instances. This includes C5, C5d, C5n, i3.metal, I3en, M5, M5a, M5ad, M5d, p3dn.24xlarge, R5, R5a, R5ad, R5d, T3, T3a, and Z1d as of this writing.
The Diagnostic Interrupt API is currently available across all public AWS Regions and GovCloud (US), enabling you to start using it today.