Project:Infrastructure/Developer Machines/ia64

From Gentoo Wiki

ia64 Admin Notes

These are various notes mainly aimed at people administering Gentoo dev machines, although most of them are probably useful in general. They are not general "how do I administer a Gentoo box" notes.

Machine-specific Notes

dolphin

Host: dolphin.ia64.dev.gentoo.org

HP RX2600, CD writer. Donated by HP in 2003. This machine has been powered off for two years (as of this writing) to save power and cooling resources.

iLO is accessible on port 1 of console2. I used to access it with ssh armin76:dolphin-iLO@console2.gentoo.osuosl.org, but console2 no longer seems to answer.

ttyS0 is accessible on port 2 of console2: ssh armin76:dolphin-ttyS0@console2.gentoo.osuosl.org

This machine had 4GB of RAM, 2x 900MHz processors, 1x 36GB 80-pin SCSI HDD and 2x 72GB 80-pin SCSI HDDs. No RAID.

This machine should still be in Gentoo's rack at OSL, on top of bender. It does not have rails.

beluga

HP RX2620, CD/DVD reader only. Donated by HP in 2012; it previously lived in HP's datacenter. It is stored at OSL, but not in Gentoo's rack. It was shipped as-is from HP, so iLO is probably configured with the wrong network parameters, and the OS probably still has a static IP that is wrong for OSL as well. I think it had a hardware RAID5 of 72GB HDDs; I cannot remember how many drives, probably 4 or 5. It had 2x 1.6GHz processors and 12GB of RAM.

It is kept as a spare in case guppy fails and no other option is left.

guppy

HP RX3600, DVD/CD writer (IIRC). It used to be in HP's datacenter but was sent to OSL when HP pulled the plug on that datacenter. iLO is accessible on port 5 of console2. Once logged in, you can also access the remote console.

Admin notes

Hostnames

These are the systems currently available. See the machine-specific notes above for more details.

Machine Name   IP                DNS Hostnames               Console Server   Console Account
guppy          140.211.166.179   guppy.ia64.dev.gentoo.org   ??               ??

Console Access

iLO2 is accessible over telnet and SSH from the dev.gentoo.org box (SSH needs some legacy ciphers enabled on the client side). Ask infra@ for credentials and the IP address.

You can use this to:

  • Interact with the EFI (e.g. to select recovery kernel, boot from plugged Gentoo DVD, change boot order)
  • Log in directly over ttyS1 to recover
  • Reboot machine
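
Current OpenSSH clients no longer offer the old key-exchange, host-key and cipher algorithms that iLO2-era firmware expects, so a plain ssh usually fails to negotiate. The wrapper below is a minimal sketch that re-enables a few such algorithms for one connection only. The exact set is an assumption (diffie-hellman-group1-sha1, ssh-rsa and aes128-cbc are typical for that generation); if ssh still reports "no matching ... found", use the algorithms listed in the "Their offer:" part of the error instead. ilo_ssh is a hypothetical helper name, and the user and address are whatever infra@ gives you.

```shell
# ilo_ssh [SSH-ARGS...] USER@HOST: hypothetical wrapper that opens an SSH
# session to the iLO2 MP with legacy algorithms re-enabled. The "+" prefix
# appends them to the client defaults instead of replacing the defaults.
# Assumption: the MP offers exactly these algorithms; adjust if it does not.
ilo_ssh() {
    ssh \
        -o KexAlgorithms=+diffie-hellman-group1-sha1 \
        -o HostKeyAlgorithms=+ssh-rsa \
        -o Ciphers=+aes128-cbc \
        "$@"
}
```

Usage: ilo_ssh USER@ILO_ADDRESS, with both placeholders filled in from infra@. Keeping the options in a function (or in a per-host ~/.ssh/config block) avoids weakening the defaults for every other host.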
Hardware notes

List devices from the MP console with 'CM' > 'DF'.

PSU status

PSU status can be checked from the MP console with 'CM' > 'PS':

   Power supplies                State
   -----------------------------------
   Power Supply 0                Fault
   Power Supply 1                Normal

Here we see that PSU-0 needs to be swapped. Tracked (and fixed) at bug #671420.
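
The MP console is interactive, so these notes have no scripted way to query it. If you save the 'CM' > 'PS' output to a file (for example with your terminal's logging), a small filter like the sketch below can flag every supply that is not in the Normal state. It assumes the two-column layout shown above; psu_faults is a hypothetical helper name.

```shell
# psu_faults: read saved MP 'CM' > 'PS' output on stdin and print every
# "Power Supply N" row whose last column (State) is not "Normal".
# Returns 1 if at least one such supply was found, 0 otherwise.
psu_faults() {
    awk '
        BEGIN { bad = 0 }
        /^[ \t]*Power Supply [0-9]/ {
            if ($NF != "Normal") { print $1, $2, $3 ": " $NF; bad = 1 }
        }
        END { exit bad }
    '
}
```

For the output above, psu_faults < ps-output.txt prints Power Supply 0: Fault and returns 1.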

HDD status

The disk array has to be checked from the operating system:

root #cciss_vol_status -V /dev/sda
Controller: Smart Array P600
  Board ID: 0x3225103c
  Logical drives: 0
  Running firmware: 1.52
  ROM firmware: 1.52
/dev/cciss/c0d0: (Smart Array P600) RAID 5 Volume 0 status: Using interim recovery mode. 
  Failed drives:
         connector 1I box 1 bay 6                 HP      DH072ABAA6                           3PD0YA8B00009816N8B5     HPD4

    Total of 1 failed physical drives detected on this logical drive.
  Physical drives: 7
         connector 1I box 1 bay 8                 HP      DG072A8B54                           3LB0RFWF00007703FJ9Y     HPD7 OK
         connector 1I box 1 bay 7                 HP      DG072A9BB7                               B365P6A072YP0641     HPD0 OK
         connector 1I box 1 bay 5                 HP      DG072A9BB7                               B365P6A074CF0641     HPD0 OK
         connector 2I box 1 bay 4                 HP      DG072A9BB7                               B365P6A073U40641     HPD0 OK
         connector 2I box 1 bay 3                 HP      DG072A9BB7                               B365P6A073KC0641     HPD0 OK
         connector 2I box 1 bay 2                 HP      DG072A9BB7                               B365P6904NHC0635     HPD0 OK
         connector 2I box 1 bay 1                 HP      DG072A9BB7                               B365P6A072RM0641     HPD0 OK
/dev/cciss/c0d0(Smart Array P600:0): Non-Volatile Cache status:
                   Cache configured: Yes
                 Total cache memory: 224 MiB
                        Cache Ratio: 50% Read / 50% Write
                  Read cache memory: 112 MiB
                 Write cache memory: 112 MiB
                Write cache enabled: No
   Write cache temporarily disabled
           Temporary disable condition. Posted write operations have
been disabled due to the fact that less than 75% of the
battery packs are at the sufficient voltage level.

Here we see that HDD-6 needs to be swapped. Tracked (and fixed) at bug #671420. Leaving the error example here for posterity.
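
To spot a degraded array without reading the whole report, the per-volume status line can be filtered. This is a minimal sketch, assuming a healthy volume is reported as "status: OK." (the failure above reports "Using interim recovery mode." instead). It deliberately looks only at the logical-volume status lines and ignores the cache and per-drive lines; raid_check is a hypothetical helper name.

```shell
# raid_check: filter cciss_vol_status output read from stdin.
# Prints every logical-volume status line that does not say "OK" and
# returns 1 if at least one such line was found, 0 otherwise.
# Assumption: healthy volumes look like "... Volume N status: OK."
raid_check() {
    awk '
        BEGIN { bad = 0 }
        /Volume [0-9]+ status:/ {
            if ($0 !~ /status: OK/) { print; bad = 1 }
        }
        END { exit bad }
    '
}
```

Usage: cciss_vol_status -V /dev/sda | raid_check || echo "array needs attention"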

The cache batteries are also dead. It is not clear how many batteries there are: one per controller or one per SAS I/O card. TODO: find out how to check those as well.

Common iLO commands

  • Get remote console output (ttyS1): CO
  • Get interactive console (to log in and recover the system on ttyS1): CO, then Ctrl-E c f
  • Reboot main machine: RS
  • Power cycle main machine and RAID: PC -cycle
  • Manage iLO users: UC
  • Get builtin help: HE

Other stuff

There are a few concepts to keep in mind when using iLO:

  • MP (iLO): a board separate from the main machine that accepts telnet and SSH connections, issues commands to the main machine over the BMC interface, and can do I/O on ttyS1
  • BMC: an FPGA on the main machine's motherboard that accepts commands from the MP. It can reboot the machine, list hardware parts, report health status, etc.
  • main machine itself: a few ia64 CPUs, RAM and so on.
 user ---<telnet>---> MP --> [ BMC <-> ia64-machine ].

Typical problems

Mysterious hangups on reboot

Sometimes the BMC hangs when the main machine reboots. The cause is unclear.

You can usually still access the MP, but I have not figured out how to reboot the machine in this state without physical help, so I end up asking infra/on-site staff to reboot it.

This makes each reboot a challenge.

Kernel Management

ia64 systems are EFI systems. guppy uses a standard GRUB 2 setup for the ia64-efi platform.

To update a kernel:

  • build the kernel in /usr/src/linux
  • install it with make install && make modules_install
  • boot-test the new kernel over iLO by pointing the boot entry at the new vmlinux
  • once it boots, regenerate the GRUB config with grub-mkconfig --output=/boot/grub/grub.cfg
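
The steps above can be collected into a small script. This is a sketch, not the procedure actually used on guppy: it assumes the sources are already configured in /usr/src/linux and that the GRUB config lives at /boot/grub/grub.cfg. The boot test over iLO stays manual, so regenerating the GRUB config is a separate function to call only after the new kernel has booted. Setting DRY_RUN=1 prints the commands instead of running them; run, build_and_install_kernel and regenerate_grub_cfg are hypothetical helper names.

```shell
# run CMD [ARGS...]: execute a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# build_and_install_kernel [SRC]: build and install the kernel from SRC
# (default /usr/src/linux), in the order given in the notes. Stops at the
# first failing step. Boot-testing over iLO is left as a manual step.
build_and_install_kernel() {
    src=${1:-/usr/src/linux}
    run make -C "$src" -j"$(nproc)" || return
    run make -C "$src" install || return
    run make -C "$src" modules_install || return
}

# regenerate_grub_cfg: run only after the new kernel has booted successfully.
regenerate_grub_cfg() {
    run grub-mkconfig --output=/boot/grub/grub.cfg
}
```

Usage: DRY_RUN=1 build_and_install_kernel to review the commands first, then build_and_install_kernel, boot-test over iLO, and finally regenerate_grub_cfg.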

Needed patches/configs

  • bug #808405: the stack canary has to be removed, as it assumes that one of the stack tops is unused
  • bug #808408: VM_FLUSH_RESET_PERMS has to be ignored, as it sometimes breaks BPF and kernel module loading by corrupting vmalloc() state.
  • boot with hardened_usercopy=0 on the kernel command line (or set CONFIG_HARDENED_USERCOPY=n), because the usercopy check does not account for the linear mapping and produces false positives on practically every buffer check.
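
In the kernel tree, the config side of the last item can be applied with scripts/config --disable HARDENED_USERCOPY (or through menuconfig). To double-check a system, the sketch below succeeds if either the kernel .config does not enable CONFIG_HARDENED_USERCOPY or the kernel command line carries hardened_usercopy=0. It assumes you have the .config of the kernel in question (for example /usr/src/linux/.config); usercopy_disabled is a hypothetical helper name. The two patches cannot be checked this way.

```shell
# usercopy_disabled CONFIG [CMDLINE]: hypothetical check for the
# hardened_usercopy requirement.
#   returns 0 if CONFIG lacks CONFIG_HARDENED_USERCOPY=y, or if CMDLINE
#             (default /proc/cmdline) contains the word hardened_usercopy=0
#   returns 1 if usercopy hardening is still active
#   returns 2 if CONFIG cannot be read
usercopy_disabled() {
    config=$1
    cmdline=${2:-/proc/cmdline}
    [ -r "$config" ] || return 2
    if ! grep -qx 'CONFIG_HARDENED_USERCOPY=y' "$config"; then
        return 0
    fi
    grep -qw 'hardened_usercopy=0' "$cmdline"
}
```

Usage: usercopy_disabled /usr/src/linux/.config || echo "hardened usercopy is still on"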

Sample Config Files

Recovery notes

The iLO (CM > CO) serial console runs on ttyS1; ttyS0 appears to be wired to the physical console (unconfirmed).

Console is configured in EFI as P Serial Acpi(HWP0002,PNP0A03,0)/Pci(1|2) Vt100+ 115200.

EFI shell

In the interactive EFI boot menu, pick EFI Shell [Built-in] and run the DVD kernel:

# inspect cdrom
fs0:\> ls fs0:\efi\boot
Directory of: fs0:\efi\boot

  09/27/09  08:42p <DIR>          2,048  .
  09/27/09  08:42p <DIR>          2,048  ..
  09/27/09  08:42p                  698  elilo.conf
  09/27/09  08:42p            7,020,793  gentoo
  09/27/09  08:42p              374,212  bootia64.efi
  09/27/09  08:42p            6,092,363  gentoo.igz
  09/27/09  08:42p                  380  elilo.msg

# run the kernel with custom arguments (the CD-ROM's defaults are not very suitable)
fs0:\> fs0:\efi\boot\bootia64.efi -i gentoo.igz gentoo initrd=gentoo.igz root=/dev/ram0 init=/linuxrc dokeymap looptype=squashfs loop=/image.squashfs cdroot console=ttyS1,115200n8
...
livecd ~ # uname -r
2.6.30-gentoo-r6

Alternatively, if you need non-standard arguments, you can boot directly from the HDD:

Shell> fs1:\EFI\gentoo\elilo.efi boot\vmlinuz-4.9.72-gentoo root=/dev/cciss!c0d0p3

With newer kernels (4.19+), the devices were renamed from /dev/cciss!c0d0p${N} to /dev/sda${N}:

Shell> fs1:\EFI\gentoo\elilo.efi boot\vmlinuz-4.19.86-gentoo root=/dev/sda3
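
Because the right root= value depends on the kernel version, a small helper can pick it when you prepare the boot arguments from a running Linux system (it cannot run in the EFI shell itself). This is a sketch that uses the 4.19 cut-over stated above and partition 3 from the examples as defaults, and relies on sort -V from GNU coreutils; root_arg is a hypothetical name.

```shell
# root_arg KVER [PART]: hypothetical helper printing the root= argument for
# guppy's Smart Array root partition (default partition 3).
# Kernels older than 4.19 use /dev/cciss!c0d0pN; 4.19 and newer use /dev/sdaN.
root_arg() {
    kver=$1
    part=${2:-3}
    # sort -V orders version strings; if 4.19 sorts first, KVER >= 4.19.
    oldest=$(printf '%s\n' 4.19 "$kver" | sort -V | head -n 1)
    if [ "$oldest" = 4.19 ]; then
        printf 'root=/dev/sda%s\n' "$part"
    else
        # single quotes keep "!" literal even in interactive bash
        printf 'root=/dev/cciss!c0d0p%s\n' "$part"
    fi
}
```

For the two examples above, root_arg 4.9.72-gentoo prints root=/dev/cciss!c0d0p3 and root_arg 4.19.86-gentoo prints root=/dev/sda3.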

To set up networking, configure the addresses manually (see /etc/conf.d/net below for the up-to-date setup):

# ip addr add 140.211.166.179/27 dev eth1
# ip link set up dev eth1
# ip r add default via 140.211.166.161 dev eth1

eth1 is a NIC with MAC ..:..:..:51:cf:57.
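
Interface names can differ between kernels and live media, so it can help to look up the NIC by the known tail of its MAC address instead of trusting the name eth1. A minimal sketch that scans sysfs; iface_by_mac_suffix is a hypothetical name, and the optional second argument (an alternative sysfs directory) exists only to make it testable.

```shell
# iface_by_mac_suffix SUFFIX [SYSFS_NET]: print the name of every network
# interface whose MAC address ends in SUFFIX (case-insensitive).
# SYSFS_NET defaults to /sys/class/net.
iface_by_mac_suffix() {
    suffix=$(printf '%s' "$1" | tr 'A-F' 'a-f')
    net=${2:-/sys/class/net}
    for dev in "$net"/*; do
        [ -r "$dev/address" ] || continue
        mac=$(tr 'A-F' 'a-f' < "$dev/address")
        case $mac in
            *"$suffix") printf '%s\n' "${dev##*/}" ;;
        esac
    done
}
```

Usage: iface_by_mac_suffix 51:cf:57 should print the interface that carries the address (eth1, per the note above), which you can then use in the ip commands.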

ELILO shell

Press TAB at the ELILO boot: prompt to interrupt the boot process.

TODO: actual syntax to load initrd

Config snippets

Config snippets from the plugged-in Gentoo-2009 CD-ROM:

/etc/inittab
...
# TERMINALS
c1:12345:respawn:/sbin/agetty 38400 tty1 linux
c2:2345:respawn:/sbin/agetty 38400 tty2 linux
c3:2345:respawn:/sbin/agetty 38400 tty3 linux
c4:2345:respawn:/sbin/agetty 38400 tty4 linux
c5:2345:respawn:/sbin/agetty 38400 tty5 linux
c6:2345:respawn:/sbin/agetty 38400 tty6 linux

# SERIAL CONSOLES
#s0:12345:respawn:/sbin/agetty 9600 ttyS0 vt100
#s1:12345:respawn:/sbin/agetty 9600 ttyS1 vt100
...
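
On this CD both serial gettys are commented out and set to 9600 baud, while the iLO console runs on ttyS1 at 115200 (see the EFI console setting above). A sketch for enabling a login on ttyS1 from the running livecd, assuming the s1 line looks exactly like the one shown; enable_ttyS1_getty is a hypothetical name. Afterwards, telinit q should make sysvinit re-read /etc/inittab.

```shell
# enable_ttyS1_getty INITTAB: hypothetical helper that uncomments the s1
# serial getty and switches it to 115200 baud (the iLO console speed).
# All other lines, including the s0 entry, are left untouched.
enable_ttyS1_getty() {
    sed -i \
        's|^#s1:\([0-9]*\):respawn:/sbin/agetty [0-9]* ttyS1 vt100$|s1:\1:respawn:/sbin/agetty 115200 ttyS1 vt100|' \
        "$1"
}
```

Usage: enable_ttyS1_getty /etc/inittab && telinit q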

elilo.conf
prompt
message=/efi/boot/elilo.msg
chooser=simple
timeout=50
relocatable

image=/efi/boot/gentoo
  label=gentoo
  append="initrd=gentoo.igz root=/dev/ram0 init=/linuxrc dokeymap looptype=squashfs loop=/image.squashfs cdroot"
  initrd=/efi/boot/gentoo.igz

image=/efi/boot/gentoo
  label=gentoo-serial
  append="initrd=gentoo.igz root=/dev/ram0 init=/linuxrc dokeymap looptype=squashfs loop=/image.squashfs cdroot console=tty0 console=ttyS0,9600"
  initrd=/efi/boot/gentoo.igz

image=/efi/boot/gentoo
  label=gentoo-sgi
  append="initrd=gentoo.igz root=/dev/ram0 init=/linuxrc dokeymap looptype=squashfs loop=/image.squashfs cdroot console=tty0 console=ttySG0,115200"
  initrd=/efi/boot/gentoo.igz

/etc/conf.d/net

Useful on the livecd, as DHCP does not obtain an address:

config_eth1="140.211.166.179/27"
routes_eth1="default via 140.211.166.161"