kexec is a set of system calls that allows you to load another kernel from the currently executing Linux kernel. The current implementation has only been tested, and had the kinks worked out, on x86, but the generic code should work on any architecture.
Could I get some feedback on where this works and where it breaks? With kexec-tools now mature enough to skip BIOS calls, I expect a new Linux kernel to load for most people, though I also expect some device drivers will not reinitialize after the reboot.
The patch is archived at: http://www.xmission.com/~ebiederm/files/kexec/
It is currently kept in two pieces. The pure system call: http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.48.x86kexec.diff
And the set of hardware fixes known to help kexec: http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.48.x86kexec-hwfixes....
A compatible user space is at: http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.7.tar.gz This code boots either a static ELF executable or a bzImage.
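For readers wondering how a loader might tell the two image types apart, here is a minimal sketch. It is not taken from kexec-tools. It relies only on published file signatures: the ELF magic at offset 0 and, for x86 Linux boot images, the 0x55 0xAA boot-sector signature at offset 0x1FE, the "HdrS" magic at offset 0x202, and bit 0 of the loadflags byte at offset 0x211, which marks an image that is loaded high (a bzImage). The names classify_image and enum image_kind are hypothetical, and the sketch makes no attempt to check that an ELF file is statically linked.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical classification result; not a kexec-tools type. */
enum image_kind { IMAGE_UNKNOWN, IMAGE_ELF, IMAGE_ZIMAGE, IMAGE_BZIMAGE };

/* Classify a kernel image from the first len bytes of the file. */
static enum image_kind classify_image(const uint8_t *buf, size_t len)
{
	/* ELF files start with 0x7f 'E' 'L' 'F'. */
	if (len >= 4 && memcmp(buf, "\177ELF", 4) == 0)
		return IMAGE_ELF;

	/* x86 Linux boot protocol 2.00+: boot-sector signature at 0x1FE,
	 * "HdrS" at 0x202, loadflags at 0x211. */
	if (len >= 0x212 &&
	    buf[0x1FE] == 0x55 && buf[0x1FF] == 0xAA &&
	    memcmp(buf + 0x202, "HdrS", 4) == 0) {
		/* Bit 0 of loadflags is set when the image is loaded high. */
		return (buf[0x211] & 0x01) ? IMAGE_BZIMAGE : IMAGE_ZIMAGE;
	}

	return IMAGE_UNKNOWN;
}
```

A caller would read the first few hundred bytes of the file and pick a loader based on the result, falling back to an error for IMAGE_UNKNOWN.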
As of version 1.6, /sbin/kexec works much more like /sbin/reboot. It is recommended that you place /sbin/kexec -e in /etc/init.d/reboot just before the call to /sbin/reboot. If you haven't previously called /sbin/kexec to load a kernel, it will fail, and you can then call /sbin/reboot. Given the similarity, the plan is now to merge reboot via kexec into /sbin/reboot.
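The load-then-execute split can be pictured with a small user-space model. This is only a sketch: struct model_image and the model_* functions are made up for illustration, and nothing here reboots anything. The real work happens in sys_kexec_load and in the LINUX_REBOOT_CMD_KEXEC case of sys_reboot in the patch below, where the staged image is taken with xchg and -EINVAL is returned when none was loaded.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct kimage. */
struct model_image {
	unsigned long start;	/* entry point of the new kernel */
};

/* Models the global kexec_image pointer from the patch. */
static struct model_image *staged_image = NULL;

/* Models sys_kexec_load(): stage a new image (or NULL to unload)
 * and hand back whatever was staged before. */
static struct model_image *model_kexec_load(struct model_image *image)
{
	struct model_image *old = staged_image;

	staged_image = image;
	return old;
}

/* Models the LINUX_REBOOT_CMD_KEXEC case: fails with -EINVAL when
 * nothing is staged, otherwise consumes the image and reports its
 * entry point through *start. */
static int model_reboot_kexec(unsigned long *start)
{
	struct model_image *image = model_kexec_load(NULL);

	if (!image)
		return -EINVAL;
	*start = image->start;
	return 0;
}

/* Models the reboot script: returns 0 when the kexec path is taken,
 * 1 when it falls through to the ordinary /sbin/reboot. */
static int model_reboot_script(unsigned long *start)
{
	if (model_reboot_kexec(start) == 0)
		return 0;
	return 1;
}
```

The point of the design is that the failure of the kexec step is harmless: with no kernel staged, the script simply continues to the normal firmware reboot.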
One bug was fixed in the move to 2.5.48. Previously I had failed to clear PAE and PSE in the kernel. This caused reboot failures when CONFIG_HIGHMEM_64G was enabled: because these bits remained set, the new kernel would fail when enabling paging. Is %cr4 present on all 386+ Intel CPUs, or do I need to conditionalize the code that accesses it?
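As a rough illustration of the fix, the control-register handling can be written as plain bit arithmetic. The cr0 mask mirrors the one in relocate_kernel.S below, and the bit positions are those given in the IA-32 architecture manuals (CR4.PSE is bit 4, CR4.PAE is bit 5). The helper names are hypothetical, and this is user-space arithmetic only: it does not read or write the real registers.

```c
#include <stdint.h>

/* Control-register bits touched by the patch. */
#define CR0_PE	(1u << 0)	/* protected mode enable */
#define CR0_EM	(1u << 2)	/* FP software emulation */
#define CR0_TS	(1u << 3)	/* task switched */
#define CR0_WP	(1u << 16)	/* write protect */
#define CR0_AM	(1u << 18)	/* alignment check mask */
#define CR0_PG	(1u << 31)	/* paging */
#define CR4_PSE	(1u << 4)	/* page size extensions */
#define CR4_PAE	(1u << 5)	/* physical address extension */

/* Same transformation as the andl/orl pair applied to %cr0. */
static uint32_t cr0_known_state(uint32_t cr0)
{
	cr0 &= ~(CR0_PG | CR0_AM | CR0_WP | CR0_TS | CR0_EM);
	cr0 |= CR0_PE;
	return cr0;
}

/* The narrow version of the fix: drop only PSE and PAE. */
static uint32_t cr4_clear_paging_extensions(uint32_t cr4)
{
	return cr4 & ~(CR4_PSE | CR4_PAE);
}

/* What the patch actually does to %cr4: clear every bit. */
static uint32_t cr4_known_state(uint32_t cr4)
{
	return cr4 & 0;
}
```

For example, a cr0 value of 0x8005003B comes out as 0x00000033, and a cr4 value of 0x6B0 loses bits 4 and 5 to become 0x680 under the narrow version.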
As of version 1.6, /sbin/kexec, when presented with a bzImage, by default avoids all BIOS calls and jumps directly to the kernel's 32-bit entry point. The information it would usually get from the BIOS is instead collected from the current kernel. Accurately getting things like the BIOS memory map from the current kernel is a challenge that still needs to be addressed. Safe defaults have been provided for the cases where I do not currently have good code to gather the information from the running kernel.
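One conceivable way to approximate a memory map from a running kernel is to read the "System RAM" ranges out of /proc/iomem. That choice is an assumption made for illustration; nothing above says this is how kexec-tools 1.7 gathers its information. The sketch below only parses a single line of that format, and struct ram_range and parse_system_ram are hypothetical names.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical holder for one physical address range (inclusive). */
struct ram_range {
	unsigned long long start;
	unsigned long long end;
};

/* Parse one /proc/iomem-style line such as
 *   "00100000-1fffffff : System RAM"
 * Returns 1 and fills *out for a System RAM line, 0 for anything else. */
static int parse_system_ram(const char *line, struct ram_range *out)
{
	unsigned long long start, end;
	int consumed = 0;

	/* %n records how far the fixed part of the line extended. */
	if (sscanf(line, " %llx-%llx : %n", &start, &end, &consumed) < 2)
		return 0;
	if (consumed == 0)
		return 0;	/* the " : " separator was missing */
	if (strncmp(line + consumed, "System RAM", 10) != 0)
		return 0;
	if (end < start)
		return 0;

	out->start = start;
	out->end = end;
	return 1;
}
```

A caller would feed it each line of the file and collect the accepted ranges; lines for other resources, such as video memory or PCI windows, are rejected.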
In bug reports please include the serial console output of kexec kexec_test. kexec_test exercises most of the interesting code paths that are needed to load a kernel (mainly BIOS calls) with lots of debugging print statements, so hangs can easily be detected.
Eric
 MAINTAINERS                        |    7
 arch/i386/Kconfig                  |   17
 arch/i386/kernel/Makefile          |    1
 arch/i386/kernel/entry.S           |    2
 arch/i386/kernel/machine_kexec.c   |  142 ++++++++
 arch/i386/kernel/relocate_kernel.S |  107 ++++++
 include/asm-i386/kexec.h           |   25 +
 include/asm-i386/unistd.h          |    2
 include/linux/kexec.h              |   45 ++
 include/linux/reboot.h             |    2
 kernel/Makefile                    |    1
 kernel/kexec.c                     |  640 +++++++++++++++++++++++++++++++++++++
 kernel/sys.c                       |   23 +
 13 files changed, 1012 insertions, 2 deletions
diff -uNr linux-2.5.48/MAINTAINERS linux-2.5.48.x86kexec/MAINTAINERS
--- linux-2.5.48/MAINTAINERS Mon Nov 11 00:22:33 2002
+++ linux-2.5.48.x86kexec/MAINTAINERS Sun Nov 17 22:53:09 2002
@@ -968,6 +968,13 @@
 W: http://www.cse.unsw.edu.au/~neilb/patches/linux-devel/
 S: Maintained
 
+KEXEC
+P: Eric Biederman
+M: ebiederm@xmission.com
+M: ebiederman@lnxi.com
+L: linux-kernel@vger.kernel.org
+S: Maintained
+
 LANMEDIA WAN CARD DRIVER
 P: Andrew Stanley-Jones
 M: asj@lanmedia.com
diff -uNr linux-2.5.48/arch/i386/Kconfig linux-2.5.48.x86kexec/arch/i386/Kconfig
--- linux-2.5.48/arch/i386/Kconfig Sun Nov 17 22:51:14 2002
+++ linux-2.5.48.x86kexec/arch/i386/Kconfig Sun Nov 17 22:53:09 2002
@@ -784,6 +784,23 @@
 	depends on (SMP || PREEMPT) && X86_CMPXCHG
 	default y
 
+config KEXEC
+	bool "kexec system call (EXPERIMENTAL)"
+	depends on EXPERIMENTAL
+	help
+	  kexec is a system call that implements the ability to shutdown your
+	  current kernel, and to start another kernel. It is like a reboot
+	  but it is indepedent of the system firmware. And like a reboot
+	  you can start any kernel with it not just Linux.
+
+	  The name comes from the similiarity to the exec system call.
+
+	  It is on an going process to be certain the hardware in a machine
+	  is properly shutdown, so do not be surprised if this code does not
+	  initially work for you. It may help to enable device hotplugging
+	  support. As of this writing the exact hardware interface is
+	  strongly in flux, so no good recommendation can be made.
+
 endmenu
 
diff -uNr linux-2.5.48/arch/i386/kernel/Makefile linux-2.5.48.x86kexec/arch/i386/kernel/Makefile
--- linux-2.5.48/arch/i386/kernel/Makefile Sun Nov 17 22:51:14 2002
+++ linux-2.5.48.x86kexec/arch/i386/kernel/Makefile Sun Nov 17 22:53:09 2002
@@ -24,6 +24,7 @@
 obj-$(CONFIG_X86_MPPARSE) += mpparse.o
 obj-$(CONFIG_X86_LOCAL_APIC) += apic.o nmi.o
 obj-$(CONFIG_X86_IO_APIC) += io_apic.o
+obj-$(CONFIG_KEXEC) += machine_kexec.o relocate_kernel.o
 obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o suspend_asm.o
 obj-$(CONFIG_X86_NUMAQ) += numaq.o
 obj-$(CONFIG_PROFILING) += profile.o
diff -uNr linux-2.5.48/arch/i386/kernel/entry.S linux-2.5.48.x86kexec/arch/i386/kernel/entry.S
--- linux-2.5.48/arch/i386/kernel/entry.S Sun Nov 17 22:51:14 2002
+++ linux-2.5.48.x86kexec/arch/i386/kernel/entry.S Sun Nov 17 22:56:43 2002
@@ -768,7 +768,7 @@
 	.long sys_epoll_wait
 	.long sys_remap_file_pages
 	.long sys_set_tid_address
-
+	.long sys_kexec_load
.rept NR_syscalls-(.-sys_call_table)/4 .long sys_ni_syscall diff -uNr linux-2.5.48/arch/i386/kernel/machine_kexec.c linux-2.5.48.x86kexec/arch/i386/kernel/machine_kexec.c --- linux-2.5.48/arch/i386/kernel/machine_kexec.c Wed Dec 31 17:00:00 1969 +++ linux-2.5.48.x86kexec/arch/i386/kernel/machine_kexec.c Sun Nov 17 22:53:09 2002 @@ -0,0 +1,142 @@ +#include <linux/config.h> +#include <linux/mm.h> +#include <linux/kexec.h> +#include <linux/delay.h> +#include <asm/pgtable.h> +#include <asm/pgalloc.h> +#include <asm/tlbflush.h> +#include <asm/io.h> +#include <asm/apic.h> + + +/* + * machine_kexec + * ======================= + */ + + +static void set_idt(void *newidt, __u16 limit) +{ + unsigned char curidt[6]; + + /* ia32 supports unaliged loads & stores */ + (*(__u16 *)(curidt)) = limit; + (*(__u32 *)(curidt +2)) = (unsigned long)(newidt); + + __asm__ __volatile__ ( + "lidt %0\n" + : "=m" (curidt) + ); +}; + + +static void set_gdt(void *newgdt, __u16 limit) +{ + unsigned char curgdt[6]; + + /* ia32 supports unaliged loads & stores */ + (*(__u16 *)(curgdt)) = limit; + (*(__u32 *)(curgdt +2)) = (unsigned long)(newgdt); + + __asm__ __volatile__ ( + "lgdt %0\n" + : "=m" (curgdt) + ); +}; + +static void load_segments(void) +{ +#define __STR(X) #X +#define STR(X) __STR(X) + + __asm__ __volatile__ ( + "\tljmp $"STR(__KERNEL_CS)",$1f\n" + "\t1:\n" + "\tmovl $"STR(__KERNEL_DS)",%eax\n" + "\tmovl %eax,%ds\n" + "\tmovl %eax,%es\n" + "\tmovl %eax,%fs\n" + "\tmovl %eax,%gs\n" + "\tmovl %eax,%ss\n" + ); +#undef STR +#undef __STR +} + +static void identity_map_page(unsigned long address) +{ + /* This code is x86 specific... + * general purpose code must be more carful + * of caches and tlbs... 
+ */ + pgd_t *pgd; + pmd_t *pmd; + struct mm_struct *mm = current->mm; + spin_lock(&mm->page_table_lock); + + pgd = pgd_offset(mm, address); + pmd = pmd_alloc(mm, pgd, address); + + if (pmd) { + pte_t *pte = pte_alloc_map(mm, pmd, address); + if (pte) { + set_pte(pte, + mk_pte(virt_to_page(phys_to_virt(address)), + PAGE_SHARED)); + __flush_tlb_one(address); + } + } + spin_unlock(&mm->page_table_lock); +} + + +typedef void (*relocate_new_kernel_t)( + unsigned long indirection_page, unsigned long reboot_code_buffer, + unsigned long start_address); + +const extern unsigned char relocate_new_kernel[]; +extern void relocate_new_kernel_end(void); +const extern unsigned int relocate_new_kernel_size; + +void machine_kexec(struct kimage *image) +{ + unsigned long *indirection_page; + void *reboot_code_buffer; + relocate_new_kernel_t rnk; + + /* Interrupts aren't acceptable while we reboot */ + local_irq_disable(); + reboot_code_buffer = image->reboot_code_buffer; + indirection_page = phys_to_virt(image->head & PAGE_MASK); + + identity_map_page(virt_to_phys(reboot_code_buffer)); + + /* copy it out */ + memcpy(reboot_code_buffer, relocate_new_kernel, + relocate_new_kernel_size); + + /* The segment registers are funny things, they are + * automatically loaded from a table, in memory wherever you + * set them to a specific selector, but this table is never + * accessed again you set the segment to a different selector. + * + * The more common model is are caches where the behide + * the scenes work is done, but is also dropped at arbitrary + * times. + * + * I take advantage of this here by force loading the + * segments, before I zap the gdt with an invalid value. + */ + load_segments(); + /* The gdt & idt are now invalid. + * If you want to load them you must set up your own idt & gdt. 
+ */ + set_gdt(phys_to_virt(0),0); + set_idt(phys_to_virt(0),0); + + /* now call it */ + rnk = (relocate_new_kernel_t) virt_to_phys(reboot_code_buffer); + (*rnk)(virt_to_phys(indirection_page), virt_to_phys(reboot_code_buffer), + image->start); +} + diff -uNr linux-2.5.48/arch/i386/kernel/relocate_kernel.S linux-2.5.48.x86kexec/arch/i386/kernel/relocate_kernel.S --- linux-2.5.48/arch/i386/kernel/relocate_kernel.S Wed Dec 31 17:00:00 1969 +++ linux-2.5.48.x86kexec/arch/i386/kernel/relocate_kernel.S Sun Nov 17 23:58:29 2002 @@ -0,0 +1,107 @@ +#include <linux/config.h> +#include <linux/linkage.h> + + /* Must be relocatable PIC code callable as a C function, that once + * it starts can not use the previous processes stack. + * + */ + .globl relocate_new_kernel +relocate_new_kernel: + /* read the arguments and say goodbye to the stack */ + movl 4(%esp), %ebx /* indirection_page */ + movl 8(%esp), %ebp /* reboot_code_buffer */ + movl 12(%esp), %edx /* start address */ + + /* zero out flags, and disable interrupts */ + pushl $0 + popfl + + /* set a new stack at the bottom of our page... */ + lea 4096(%ebp), %esp + + /* store the parameters back on the stack */ + pushl %edx /* store the start address */ + + /* Set cr0 to a known state: + * 31 0 == Paging disabled + * 18 0 == Alignment check disabled + * 16 0 == Write protect disabled + * 3 0 == No task switch + * 2 0 == Don't do FP software emulation. + * 0 1 == Proctected mode enabled + */ + movl %cr0, %eax + andl $~((1<<31)|(1<<18)|(1<<16)|(1<<3)|(1<<2)), %eax + orl $(1<<0), %eax + movl %eax, %cr0 + + /* Set cr4 to a known state: + * Setting everything to zero seems safe. + */ + movl %cr4, %eax + andl $0, %eax + movl %eax, %cr4 + + jmp 1f +1: + + /* Flush the TLB (needed?) 
*/ + xorl %eax, %eax + movl %eax, %cr3 + + /* Do the copies */ + cld +0: /* top, read another word for the indirection page */ + movl %ebx, %ecx + movl (%ebx), %ecx + addl $4, %ebx + testl $0x1, %ecx /* is it a destination page */ + jz 1f + movl %ecx, %edi + andl $0xfffff000, %edi + jmp 0b +1: + testl $0x2, %ecx /* is it an indirection page */ + jz 1f + movl %ecx, %ebx + andl $0xfffff000, %ebx + jmp 0b +1: + testl $0x4, %ecx /* is it the done indicator */ + jz 1f + jmp 2f +1: + testl $0x8, %ecx /* is it the source indicator */ + jz 0b /* Ignore it otherwise */ + movl %ecx, %esi /* For every source page do a copy */ + andl $0xfffff000, %esi + + movl $1024, %ecx + rep ; movsl + jmp 0b + +2: + + /* To be certain of avoiding problems with self modifying code + * I need to execute a serializing instruction here. + * So I flush the TLB, it's handy, and not processor dependent. + */ + xorl %eax, %eax + movl %eax, %cr3 + + /* set all of the registers to known values */ + /* leave %esp alone */ + + xorl %eax, %eax + xorl %ebx, %ebx + xorl %ecx, %ecx + xorl %edx, %edx + xorl %esi, %esi + xorl %edi, %edi + xorl %ebp, %ebp + ret +relocate_new_kernel_end: + + .globl relocate_new_kernel_size +relocate_new_kernel_size: + .long relocate_new_kernel_end - relocate_new_kernel diff -uNr linux-2.5.48/include/asm-i386/kexec.h linux-2.5.48.x86kexec/include/asm-i386/kexec.h --- linux-2.5.48/include/asm-i386/kexec.h Wed Dec 31 17:00:00 1969 +++ linux-2.5.48.x86kexec/include/asm-i386/kexec.h Sun Nov 17 22:53:09 2002 @@ -0,0 +1,25 @@ +#ifndef _I386_KEXEC_H +#define _I386_KEXEC_H + +#include <asm/fixmap.h> + +/* + * KEXEC_SOURCE_MEMORY_LIMIT maximum page get_free_page can return. + * I.e. Maximum page that is mapped directly into kernel memory, + * and kmap is not required. + * + * Someone correct me if FIXADDR_START - PAGEOFFSET is not the correct + * calculation for the amount of memory directly mappable into the + * kernel memory space. 
+ */
+
+/* Maximum physical address we can use pages from */
+#define KEXEC_SOURCE_MEMORY_LIMIT (FIXADDR_START - PAGE_OFFSET)
+/* Maximum address we can reach in physical address mode */
+#define KEXEC_DESTINATION_MEMORY_LIMIT (-1UL)
+
+#define KEXEC_REBOOT_CODE_SIZE 4096
+#define KEXEC_REBOOT_CODE_ALIGN 0
+
+
+#endif /* _I386_KEXEC_H */
diff -uNr linux-2.5.48/include/asm-i386/unistd.h linux-2.5.48.x86kexec/include/asm-i386/unistd.h
--- linux-2.5.48/include/asm-i386/unistd.h Sun Nov 17 22:51:25 2002
+++ linux-2.5.48.x86kexec/include/asm-i386/unistd.h Sun Nov 17 22:54:03 2002
@@ -263,7 +263,7 @@
 #define __NR_sys_epoll_wait 256
 #define __NR_remap_file_pages 257
 #define __NR_set_tid_address 258
-
+#define __NR_sys_kexec_load 259
 
 /* user-visible error numbers are in the range -1 - -124: see <asm-i386/errno.h> */
 
diff -uNr linux-2.5.48/include/linux/kexec.h linux-2.5.48.x86kexec/include/linux/kexec.h
--- linux-2.5.48/include/linux/kexec.h Wed Dec 31 17:00:00 1969
+++ linux-2.5.48.x86kexec/include/linux/kexec.h Sun Nov 17 22:53:09 2002
@@ -0,0 +1,45 @@
+#ifndef LINUX_KEXEC_H
+#define LINUX_KEXEC_H
+
+#if CONFIG_KEXEC
+#include <linux/types.h>
+#include <asm/kexec.h>
+
+/*
+ * This structure is used to hold the arguments that are used when loading
+ * kernel binaries.
+ */
+
+typedef unsigned long kimage_entry_t;
+#define IND_DESTINATION 0x1
+#define IND_INDIRECTION 0x2
+#define IND_DONE 0x4
+#define IND_SOURCE 0x8
+
+struct kimage {
+	kimage_entry_t head;
+	kimage_entry_t *entry;
+	kimage_entry_t *last_entry;
+
+	unsigned long destination;
+	unsigned long offset;
+
+	unsigned long start;
+	void *reboot_code_buffer;
+};
+
+struct kexec_segment {
+	void *buf;
+	size_t bufsz;
+	void *mem;
+	size_t memsz;
+};
+
+/* kexec interface functions */
+extern void machine_kexec(struct kimage *image);
+extern asmlinkage long sys_kexec(unsigned long entry, long nr_segments,
+	struct kexec_segment *segments);
+extern struct kimage *kexec_image;
+#endif
+#endif /* LINUX_KEXEC_H */
+
diff -uNr linux-2.5.48/include/linux/reboot.h linux-2.5.48.x86kexec/include/linux/reboot.h
--- linux-2.5.48/include/linux/reboot.h Fri Oct 11 22:22:47 2002
+++ linux-2.5.48.x86kexec/include/linux/reboot.h Sun Nov 17 22:53:09 2002
@@ -21,6 +21,7 @@
  * POWER_OFF Stop OS and remove all power from system, if possible.
  * RESTART2 Restart system using given command string.
  * SW_SUSPEND Suspend system using Software Suspend if compiled in
+ * KEXEC Restart the system using a different kernel.
  */
 
 #define LINUX_REBOOT_CMD_RESTART 0x01234567
@@ -30,6 +31,7 @@
 #define LINUX_REBOOT_CMD_POWER_OFF 0x4321FEDC
 #define LINUX_REBOOT_CMD_RESTART2 0xA1B2C3D4
 #define LINUX_REBOOT_CMD_SW_SUSPEND 0xD000FCE2
+#define LINUX_REBOOT_CMD_KEXEC 0x45584543
 
 #ifdef __KERNEL__
diff -uNr linux-2.5.48/kernel/Makefile linux-2.5.48.x86kexec/kernel/Makefile
--- linux-2.5.48/kernel/Makefile Sun Nov 17 22:51:26 2002
+++ linux-2.5.48.x86kexec/kernel/Makefile Sun Nov 17 22:53:09 2002
@@ -21,6 +21,7 @@
 obj-$(CONFIG_CPU_FREQ) += cpufreq.o
 obj-$(CONFIG_BSD_PROCESS_ACCT) += acct.o
 obj-$(CONFIG_SOFTWARE_SUSPEND) += suspend.o
+obj-$(CONFIG_KEXEC) += kexec.o
 
ifneq ($(CONFIG_IA64),y) # According to Alan Modra alan@linuxcare.com.au, the -fno-omit-frame-pointer is diff -uNr linux-2.5.48/kernel/kexec.c linux-2.5.48.x86kexec/kernel/kexec.c --- linux-2.5.48/kernel/kexec.c Wed Dec 31 17:00:00 1969 +++ linux-2.5.48.x86kexec/kernel/kexec.c Sun Nov 17 22:53:09 2002 @@ -0,0 +1,640 @@ +#include <linux/mm.h> +#include <linux/file.h> +#include <linux/slab.h> +#include <linux/fs.h> +#include <linux/version.h> +#include <linux/compile.h> +#include <linux/kexec.h> +#include <linux/spinlock.h> +#include <net/checksum.h> +#include <asm/page.h> +#include <asm/uaccess.h> +#include <asm/io.h> +#include <asm/system.h> + +/* As designed kexec can only use the memory that you don't + * need to use kmap to access. Memory that you can use virt_to_phys() + * on an call get_free_page to allocate. + * + * In the best case you need one page for the transition from + * virtual to physical memory. And this page must be identity + * mapped. Which pretty much leaves you with pages < PAGE_OFFSET + * as you can only mess with user pages. + * + * As the only subset of memory that it is easy to restrict allocation + * to is the physical memory mapped into the kernel, I do that + * with get_free_page and hope it is enough. 
+ * + * I don't know of a good way to do this calcuate which pages get_free_page + * will return independent of architecture so I depend on + * <asm/kexec.h> to properly set + * KEXEC_SOURCE_MEMORY_LIMIT and KEXEC_DESTINATION_MEMORY_LIMIT + * + */ + +static struct kimage *kimage_alloc(void) +{ + struct kimage *image; + image = kmalloc(sizeof(*image), GFP_KERNEL); + if (!image) + return 0; + memset(image, 0, sizeof(*image)); + image->head = 0; + image->entry = &image->head; + image->last_entry = &image->head; + return image; +} +static int kimage_add_entry(struct kimage *image, kimage_entry_t entry) +{ + if (image->offset != 0) { + image->entry++; + } + if (image->entry == image->last_entry) { + kimage_entry_t *ind_page; + ind_page = (void *)__get_free_page(GFP_KERNEL); + if (!ind_page) { + return -ENOMEM; + } + *image->entry = virt_to_phys(ind_page) | IND_INDIRECTION; + image->entry = ind_page; + image->last_entry = + ind_page + ((PAGE_SIZE/sizeof(kimage_entry_t)) - 1); + } + *image->entry = entry; + image->entry++; + image->offset = 0; + return 0; +} + +static int kimage_verify_destination(unsigned long destination) +{ + int result; + + /* Assume the page is bad unless we pass the checks */ + result = -EADDRNOTAVAIL; + + if (destination >= KEXEC_DESTINATION_MEMORY_LIMIT) { + goto out; + } + + /* NOTE: The caller is responsible for making certain we + * don't attempt to load the new image into invalid or + * reserved areas of RAM. 
+ */ + result = 0; +out: + return result; +} + +static int kimage_set_destination( + struct kimage *image, unsigned long destination) +{ + int result; + destination &= PAGE_MASK; + result = kimage_verify_destination(destination); + if (result) { + return result; + } + result = kimage_add_entry(image, destination | IND_DESTINATION); + if (result == 0) { + image->destination = destination; + } + return result; +} + + +static int kimage_add_page(struct kimage *image, unsigned long page) +{ + int result; + page &= PAGE_MASK; + result = kimage_verify_destination(image->destination); + if (result) { + return result; + } + result = kimage_add_entry(image, page | IND_SOURCE); + if (result == 0) { + image->destination += PAGE_SIZE; + } + return result; +} + + +static int kimage_terminate(struct kimage *image) +{ + int result; + result = kimage_add_entry(image, IND_DONE); + if (result == 0) { + /* Point at the terminating element */ + image->entry--; + } + return result; +} + +#define for_each_kimage_entry(image, ptr, entry) \ + for (ptr = &image->head; (entry = *ptr) && !(entry & IND_DONE); \ + ptr = (entry & IND_INDIRECTION)? \ + phys_to_virt((entry & PAGE_MASK)): ptr +1) + +static void kimage_free(struct kimage *image) +{ + kimage_entry_t *ptr, entry; + kimage_entry_t ind = 0; + if (!image) + return; + for_each_kimage_entry(image, ptr, entry) { + if (entry & IND_INDIRECTION) { + /* Free the previous indirection page */ + if (ind & IND_INDIRECTION) { + free_page((unsigned long)phys_to_virt(ind & PAGE_MASK)); + } + /* Save this indirection page until we are + * done with it. 
+ */ + ind = entry; + } + else if (entry & IND_SOURCE) { + free_page((unsigned long)phys_to_virt(entry & PAGE_MASK)); + } + } + kfree(image); +} + +static int kimage_is_destination_page( + struct kimage *image, unsigned long page) +{ + kimage_entry_t *ptr, entry; + unsigned long destination; + destination = 0; + page &= PAGE_MASK; + for_each_kimage_entry(image, ptr, entry) { + if (entry & IND_DESTINATION) { + destination = entry & PAGE_MASK; + } + else if (entry & IND_SOURCE) { + if (page == destination) { + return 1; + } + destination += PAGE_SIZE; + } + } + return 0; +} + +static int kimage_get_unused_area( + struct kimage *image, unsigned long size, unsigned long align, + unsigned long *area) +{ + /* Walk through mem_map and find the first chunk of + * ununsed memory that is at least size bytes long. + */ + /* Since the kernel plays with Page_Reseved mem_map is less + * than ideal for this purpose, but it will give us a correct + * conservative estimate of what we need to do. + */ + /* For now we take advantage of the fact that all kernel pages + * are marked with PG_resereved to allocate a large + * contiguous area for the reboot code buffer. + */ + unsigned long addr; + unsigned long start, end; + unsigned long mask; + mask = ((1 << align) -1); + start = end = PAGE_SIZE; + for(addr = PAGE_SIZE; addr < KEXEC_SOURCE_MEMORY_LIMIT; addr += PAGE_SIZE) { + struct page *page; + unsigned long aligned_start; + page = virt_to_page(phys_to_virt(addr)); + if (PageReserved(page) || + kimage_is_destination_page(image, addr)) { + /* The current page is reserved so the start & + * end of the next area must be atleast at the + * next page. + */ + start = end = addr + PAGE_SIZE; + } + else { + /* O.k. The current page isn't reserved + * so push up the end of the area. 
+ */ + end = addr; + } + aligned_start = (start + mask) & ~mask; + if (aligned_start > start) { + continue; + } + if (aligned_start > end) { + continue; + } + if (end - aligned_start >= size) { + *area = aligned_start; + return 0; + } + } + *area = 0; + return -ENOSPC; +} + +static kimage_entry_t *kimage_dst_conflict( + struct kimage *image, unsigned long page, kimage_entry_t *limit) +{ + kimage_entry_t *ptr, entry; + unsigned long destination = 0; + for_each_kimage_entry(image, ptr, entry) { + if (ptr == limit) { + return 0; + } + else if (entry & IND_DESTINATION) { + destination = entry & PAGE_MASK; + } + else if (entry & IND_SOURCE) { + if (page == destination) { + return ptr; + } + destination += PAGE_SIZE; + } + } + return 0; +} + +static kimage_entry_t *kimage_src_conflict( + struct kimage *image, unsigned long destination, kimage_entry_t *limit) +{ + kimage_entry_t *ptr, entry; + for_each_kimage_entry(image, ptr, entry) { + unsigned long page; + if (ptr == limit) { + return 0; + } + else if (entry & IND_DESTINATION) { + /* nop */ + } + else if (entry & IND_DONE) { + /* nop */ + } + else { + /* SOURCE & INDIRECTION */ + page = entry & PAGE_MASK; + if (page == destination) { + return ptr; + } + } + } + return 0; +} + +static int kimage_get_off_destination_pages(struct kimage *image) +{ + kimage_entry_t *ptr, *cptr, entry; + unsigned long buffer, page; + unsigned long destination = 0; + + /* Here we implement safe guards to insure that + * a source page is not copied to it's destination + * page before the data on the destination page is + * no longer useful. + * + * To make it work we actually wind up with a + * stronger condition. For every page considered + * it is either it's own destination page or it is + * not a destination page of any page considered. + * + * Invariants + * 1. buffer is not a destination of a previous page. + * 2. page is not a destination of a previous page. + * 3. destination is not a previous source page. 
+ * + * Result: Either a source page and a destination page + * are the same or the page is not a destination page. + * + * These checks could be done when we allocate the pages, + * but doing it as a final pass allows us more freedom + * on how we allocate pages. + * + * Also while the checks are necessary, in practice nothing + * happens. The destination kernel wants to sit in the + * same physical addresses as the current kernel so we never + * actually allocate a destination page. + * + * BUGS: This is a O(N^2) algorithm. + */ + + + buffer = __get_free_page(GFP_KERNEL); + if (!buffer) { + return -ENOMEM; + } + buffer = virt_to_phys((void *)buffer); + for_each_kimage_entry(image, ptr, entry) { + /* Here we check to see if an allocated page */ + kimage_entry_t *limit; + if (entry & IND_DESTINATION) { + destination = entry & PAGE_MASK; + } + else if (entry & IND_INDIRECTION) { + /* Indirection pages must include all of their + * contents in limit checking. + */ + limit = phys_to_virt(page + PAGE_SIZE - sizeof(*limit)); + } + if (!((entry & IND_SOURCE) | (entry & IND_INDIRECTION))) { + continue; + } + page = entry & PAGE_MASK; + limit = ptr; + + /* See if a previous page has the current page as it's + * destination. + * i.e. invariant 2 + */ + cptr = kimage_dst_conflict(image, page, limit); + if (cptr) { + unsigned long cpage; + kimage_entry_t centry; + centry = *cptr; + cpage = centry & PAGE_MASK; + memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE); + memcpy(phys_to_virt(page), phys_to_virt(cpage), PAGE_SIZE); + *cptr = page | (centry & ~PAGE_MASK); + *ptr = buffer | (entry & ~PAGE_MASK); + buffer = cpage; + } + if (!(entry & IND_SOURCE)) { + continue; + } + + /* See if a previous page is our destination page. + * If so claim it now. + * i.e. 
invariant 3 + */ + cptr = kimage_src_conflict(image, destination, limit); + if (cptr) { + unsigned long cpage; + kimage_entry_t centry; + centry = *cptr; + cpage = centry & PAGE_MASK; + memcpy(phys_to_virt(buffer), phys_to_virt(cpage), PAGE_SIZE); + memcpy(phys_to_virt(cpage), phys_to_virt(page), PAGE_SIZE); + *cptr = buffer | (centry & ~PAGE_MASK); + *ptr = cpage | ( entry & ~PAGE_MASK); + buffer = page; + } + /* If the buffer is my destination page do the copy now + * i.e. invariant 3 & 1 + */ + if (buffer == destination) { + memcpy(phys_to_virt(buffer), phys_to_virt(page), PAGE_SIZE); + *ptr = buffer | (entry & ~PAGE_MASK); + buffer = page; + } + } + free_page((unsigned long)phys_to_virt(buffer)); + return 0; +} + +static int kimage_add_empty_pages(struct kimage *image, + unsigned long len) +{ + unsigned long pos; + int result; + for(pos = 0; pos < len; pos += PAGE_SIZE) { + char *page; + result = -ENOMEM; + page = (void *)__get_free_page(GFP_KERNEL); + if (!page) { + goto out; + } + result = kimage_add_page(image, virt_to_phys(page)); + if (result) { + goto out; + } + } + result = 0; + out: + return result; +} + + +static int kimage_load_segment(struct kimage *image, + struct kexec_segment *segment) +{ + unsigned long mstart; + int result; + unsigned long offset; + unsigned long offset_end; + unsigned char *buf; + + result = 0; + buf = segment->buf; + mstart = (unsigned long)segment->mem; + + offset_end = segment->memsz; + + result = kimage_set_destination(image, mstart); + if (result < 0) { + goto out; + } + for(offset = 0; offset < segment->memsz; offset += PAGE_SIZE) { + char *page; + size_t size, leader; + page = (char *)__get_free_page(GFP_KERNEL); + if (page == 0) { + result = -ENOMEM; + goto out; + } + result = kimage_add_page(image, virt_to_phys(page)); + if (result < 0) { + goto out; + } + if (segment->bufsz < offset) { + /* We are past the end zero the whole page */ + memset(page, 0, PAGE_SIZE); + continue; + } + size = PAGE_SIZE; + leader = 0; + if 
((offset == 0)) { + leader = mstart & ~PAGE_MASK; + } + if (leader) { + /* We are on the first page zero the unused portion */ + memset(page, 0, leader); + size -= leader; + page += leader; + } + if (size > (segment->bufsz - offset)) { + size = segment->bufsz - offset; + } + result = copy_from_user(page, buf + offset, size); + if (result) { + result = (result < 0)?result : -EIO; + goto out; + } + if (size < (PAGE_SIZE - leader)) { + /* zero the trailing part of the page */ + memset(page + size, 0, (PAGE_SIZE - leader) - size); + } + } + out: + return result; +} + + +/* do_kexec executes a new kernel + */ +static int do_kexec(unsigned long start, unsigned long nr_segments, + struct kexec_segment *arg_segments, struct kimage *image) +{ + struct kexec_segment *segments; + size_t segment_bytes; + int i; + + int result; + unsigned long reboot_code_buffer; + kimage_entry_t *end; + + /* Initialize variables */ + segments = 0; + + segment_bytes = nr_segments * sizeof(*segments); + segments = kmalloc(GFP_KERNEL, segment_bytes); + if (segments == 0) { + result = -ENOMEM; + goto out; + } + result = copy_from_user(segments, arg_segments, segment_bytes); + if (result) { + goto out; + } + + /* Read in the data from user space */ + image->start = start; + for(i = 0; i < nr_segments; i++) { + result = kimage_load_segment(image, &segments[i]); + if (result) { + goto out; + } + } + + /* Terminate early so I can get a place holder. */ + result = kimage_terminate(image); + if (result) + goto out; + end = image->entry; + + /* Usage of the reboot code buffer is subtle. We first + * find a continguous area of ram, that is not one + * of our destination pages. We do not allocate the ram. + * + * The algorithm to make certain we do not have address + * conflicts requires each destination region to have some + * backing store so we allocate abitrary source pages. 
+ * + * Later in machine_kexec when we copy data to the + * reboot_code_buffer it still may be allocated for other + * purposes, but we do know there are no source or destination + * pages in that area. And since the rest of the kernel + * is already shutdown those pages are free for use, + * regardless of their page->count values. + * + * The kernel mapping is of the reboot code buffer is passed to + * the machine dependent code. If it needs something else + * it is free to set that up. + */ + result = kimage_get_unused_area( + image, KEXEC_REBOOT_CODE_SIZE, KEXEC_REBOOT_CODE_ALIGN, + &reboot_code_buffer); + if (result) + goto out; + + /* Allocating pages we should never need is silly but the + * code won't work correctly unless we have dummy pages to + * work with. + */ + result = kimage_set_destination(image, reboot_code_buffer); + if (result) + goto out; + result = kimage_add_empty_pages(image, KEXEC_REBOOT_CODE_SIZE); + if (result) + goto out; + image->reboot_code_buffer = phys_to_virt(reboot_code_buffer); + + result = kimage_terminate(image); + if (result) + goto out; + + result = kimage_get_off_destination_pages(image); + if (result) + goto out; + + /* Now hide the extra source pages for the reboot code buffer. + */ + image->entry = end; + result = kimage_terminate(image); + if (result) + goto out; + + result = 0; + out: + /* cleanup and exit */ + if (segments) kfree(segments); + return result; +} + + +/* + * Exec Kernel system call: for obvious reasons only root may call it. + * + * This call breaks up into three pieces. + * - A generic part which loads the new kernel from the current + * address space, and very carefully places the data in the + * allocated pages. + * + * - A generic part that interacts with the kernel and tells all of + * the devices to shut down. Preventing on-going dmas, and placing + * the devices in a consistent state so a later kernel can + * reinitialize them. 
+ * + * - A machine specific part that includes the syscall number + * and the copies the image to it's final destination. And + * jumps into the image at entry. + * + * kexec does not sync, or unmount filesystems so if you need + * that to happen you need to do that yourself. + */ +struct kimage *kexec_image = 0; + +asmlinkage long sys_kexec_load(unsigned long entry, unsigned long nr_segments, + struct kexec_segment *segments, unsigned long flags) +{ + /* Am I using to much stack space here? */ + struct kimage *image, *old_image; + int result; + + /* We only trust the superuser with rebooting the system. */ + if (!capable(CAP_SYS_ADMIN)) + return -EPERM; + + /* In case we need just a little bit of special behavior for + * reboot on panic + */ + if (flags != 0) + return -EINVAL; + + image = 0; + if (nr_segments > 0) { + image = kimage_alloc(); + if (!image) { + return -ENOMEM; + } + result = do_kexec(entry, nr_segments, segments, image); + if (result) { + kimage_free(image); + return result; + } + } + + old_image = xchg(&kexec_image, image); + + kimage_free(old_image); + return 0; +} diff -uNr linux-2.5.48/kernel/sys.c linux-2.5.48.x86kexec/kernel/sys.c --- linux-2.5.48/kernel/sys.c Sun Nov 17 22:51:26 2002 +++ linux-2.5.48.x86kexec/kernel/sys.c Sun Nov 17 22:53:09 2002 @@ -16,6 +16,7 @@ #include <linux/init.h> #include <linux/highuid.h> #include <linux/fs.h> +#include <linux/kexec.h> #include <linux/workqueue.h> #include <linux/device.h> #include <linux/times.h> @@ -206,6 +207,7 @@ cond_syscall(sys_lookup_dcookie) cond_syscall(sys_swapon) cond_syscall(sys_swapoff) +cond_syscall(sys_kexec_load) cond_syscall(sys_init_module) cond_syscall(sys_delete_module)
@@ -416,6 +418,27 @@
 		machine_restart(buffer);
 		break;
 
+#ifdef CONFIG_KEXEC
+	case LINUX_REBOOT_CMD_KEXEC:
+	{
+		struct kimage *image;
+		if (arg) {
+			unlock_kernel();
+			return -EINVAL;
+		}
+		image = xchg(&kexec_image, 0);
+		if (!image) {
+			unlock_kernel();
+			return -EINVAL;
+		}
+		notifier_call_chain(&reboot_notifier_list, SYS_RESTART, NULL);
+		system_running = 0;
+		device_shutdown();
+		printk(KERN_EMERG "Starting new kernel\n");
+		machine_kexec(image);
+		break;
+	}
+#endif
 #ifdef CONFIG_SOFTWARE_SUSPEND
 	case LINUX_REBOOT_CMD_SW_SUSPEND:
 		if (!software_suspend_enabled) {
On Mon, 2002-11-18 at 00:53, Eric W. Biederman wrote:
kexec is a set of systems call that allows you to load another kernel from the currently executing Linux kernel. The current implementation has only been tested, and had the kinks worked out on x86, but the generic code should work on any architecture.
Great News, Eric. For the first time *ever* I got a kexec reboot to work on my most troublesome machine (see below).
Current .config settings:
# CONFIG_SMP is not set
CONFIG_X86_GOOD_APIC=y
# CONFIG_X86_UP_APIC is not set
CONFIG_KEXEC=y

Oddly, kexec_test still hangs.

# ./kexec-1.7 --force ./kexec_test-1.7
FIXME assuming 6Synchronizing SCSI caches: 4M of ram

Shutting down devices
Starting new kernel
kexec_test 1.7 starting...
eax: 0E1FB007 ebx: 0000111C ecx: 00000000 edx: 00000000
esi: 00000000 edi: 00000000 esp: 00000000 ebp: 00000000
idt: 00000000 C0000000
gdt: 0000006F 000010A0
Switching descriptors.
Descriptors changed.
Legacy pic setup.
In real mode.
<hang>
Complete kernel boot-up log attached below. I'm going to try to find my other 576MB of RAM with the right command-line magic... ;^)
For those looking to replicate:
0. apply these two patches to 2.5.48 (bk Changeset 1.842)
   http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.48.x86kexec.diff
   http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.48.x86kexec-hwfixes....

2. compile this:
   http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.7.tar.gz

3. my recipe for rebooting:
   a) I have a script that I execute by hand after "init 1" to unmount my filesystems and then remount / and /boot read-only.
   b) I have the kexec binary installed in /boot.
   c) ./kexec-1.7 --force --debug "--command-line=ro root=805 console=ttyS0,9600n8" ./linux-2.5
Thanks, Eric!
Andy
# ./kexec-1.7 --force --debug "--command-line=ro root=805 console=ttyS0,9600n8" ./linux-2.5
FIXME assuming 64M of ram
setup16_end: 00091b1f
FIXME assuming 64M of ram
Synchronizing SCSI caches:
Shutting down devices
Starting new kernel
Linux version 2.5.48 (andyp@joe) (gcc version 2.95.3 20010315 (SuSE)) #1 Mon Nov 18 15:03:14 PST 2002
Video mode to be used for restore is ffff
BIOS-provided physical RAM map:
BIOS-e820: 0000000000001000 - 000000000009ffff (usable)
BIOS-e820: 0000000000100000 - 0000000003ffffff (usable)
63MB LOWMEM available.
hm, page 00000000 reserved twice.
On node 0 totalpages: 16383
DMA zone: 4096 pages, LIFO batch:1
Normal zone: 12287 pages, LIFO batch:2
HighMem zone: 0 pages, LIFO batch:1
IBM machine detected. Enabling interrupts during APM calls.
IBM machine detected. Disabling SMBus accesses.
Building zonelist for node : 0
Kernel command line: ro root=805 console=ttyS0,9600n8
Initializing CPU#0
Detected 799.717 MHz processor.
Console: colour VGA+ 80x25
Calibrating delay loop... 1581.05 BogoMIPS
Memory: 60868k/65532k available (2087k kernel code, 4204k reserved, 825k data, 304k init, 0k highmem)
Security Scaffold v1.0.0 initialized
Dentry cache hash table entries: 8192 (order: 4, 65536 bytes)
Inode-cache hash table entries: 4096 (order: 3, 32768 bytes)
Mount-cache hash table entries: 512 (order: 0, 4096 bytes)
-> /dev
-> /dev/console
-> /root
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 256K
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU: Intel Pentium III (Coppermine) stepping 0a
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.

POSIX conformance testing by UNIFIX
Linux NET4.0 for Linux 2.4
Based upon Swansea University Computer Society NET3.039
Initializing RT netlink socket
mtrr: v2.0 (20020519)
Linux Plug and Play Support v0.9 (c) Adam Belay
PCI: PCI BIOS revision 2.10 entry at 0xfd5dc, last bus=1
PCI: Using configuration type 1
BIO: pool of 256 setup, 14Kb (56 bytes/bio)
biovec pool[0]: 1 bvecs: 116 entries (12 bytes)
biovec pool[1]: 4 bvecs: 116 entries (48 bytes)
biovec pool[2]: 16 bvecs: 58 entries (192 bytes)
biovec pool[3]: 64 bvecs: 29 entries (768 bytes)
biovec pool[4]: 128 bvecs: 14 entries (1536 bytes)
biovec pool[5]: 256 bvecs: 7 entries (3072 bytes)
block request queues:
112 requests per read queue
112 requests per write queue
8 requests per batch
enter congestion at 27
exit congestion at 29
isapnp: Scanning for PnP cards...
isapnp: No Plug & Play device found
drivers/usb/core/usb.c: registered new driver usbfs
drivers/usb/core/usb.c: registered new driver hub
PCI: Probing PCI hardware
PCI: Probing PCI hardware (bus 00)
PCI: Discovered peer bus 01
Starting kswapd
aio_setup: sizeof(struct page) = 40
[c3fb2040] eventpoll: successfully initialized.
Journalled Block Device driver loaded
Installing knfsd (copyright (C) 1996 okir@monad.swb.de).
udf: registering filesystem
Capability LSM initialized
Serial: 8250/16550 driver $Revision: 1.90 $ IRQ sharing disabled
ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
parport0: PC-style at 0x378 [PCSPP]
pty: 256 Unix98 ptys configured
lp0: using parport0 (polling).
Linux agpgart interface v0.99 (c) Jeff Hartmann
agpgart: Maximum main memory to use for agp memory: 27M
agpgart: unable to determine aperture size.
agpgart: Maximum main memory to use for agp memory: 27M
agpgart: unable to determine aperture size.

[drm] Initialized radeon 1.7.0 20020828 on minor 0
Floppy drive(s): fd0 is 1.44M
FDC 0 is a National Semiconductor PC87306
Intel(R) PRO/100 Network Driver - version 2.1.24-k2
Copyright (c) 2002 Intel Corporation

e100: eth0: Intel(R) PRO/100+ Server Adapter (PILA8470B)
Mem:0xfeb7f000 IRQ:11 Speed:0 Mbps Dx:N/A
Hardware receive checksums enabled
cpu cycle saver enabled

Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
hda: LG CD-ROM CRD-8484B, ATAPI CD/DVD-ROM drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
hda: ATAPI 48X CD-ROM drive, 128kB Cache
Uniform CD-ROM driver Revision: 3.12
end_request: I/O error, dev hda, sector 0
SCSI subsystem driver Revision: 1.00
PCI: Enabling device 01:03.0 (0156 -> 0157)
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.4
<Adaptec aic7892 Ultra160 SCSI adapter>
aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs

(scsi0:A:0): 160.000MB/s transfers (80.000MHz DT, offset 31, 16bit)
Vendor: IBM-PSG Model: ST318436LC !# Rev: 3281
Type: Direct-Access ANSI SCSI revision: 03
(scsi0:A:1): 160.000MB/s transfers (80.000MHz DT, offset 31, 16bit)
Vendor: IBM-PSG Model: ST318436LC !# Rev: 3281
Type: Direct-Access ANSI SCSI revision: 03
Vendor: IBM Model: YGLv3 S2 Rev: 0
Type: Processor ANSI SCSI revision: 02
scsi0:A:0:0: Tagged Queuing enabled. Depth 64
SCSI device sda: drive cache: write through
SCSI device sda: 35548320 512-byte hdwr sectors (18201 MB)
sda: sda1 sda2 < sda5 sda6 sda7 sda8 sda9 sda10 >
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
scsi0:A:1:0: Tagged Queuing enabled. Depth 64
SCSI device sdb: drive cache: write through
SCSI device sdb: 35548320 512-byte hdwr sectors (18201 MB)
sdb: sdb1
Attached scsi disk sdb at scsi0, channel 0, id 1, lun 0
Attached scsi generic sg2 at scsi0, channel 0, id 8, lun 0, type 3
Initializing USB Mass Storage driver...
drivers/usb/core/usb.c: registered new driver usb-storage
USB Mass Storage support registered.
mice: PS/2 mouse device common for all mice
input: ImPS/2 Generic Wheel Mouse on isa0060/serio1
serio: i8042 AUX port at 0x60,0x64 irq 12
input: AT Set 2 keyboard on isa0060/serio0
serio: i8042 KBD port at 0x60,0x64 irq 1
Advanced Linux Sound Architecture Driver Version 0.9.0rc5 (Sun Nov 10 19:48:18 2002 UTC).
request_module[snd-card-0]: not ready
request_module[snd-card-1]: not ready
request_module[snd-card-2]: not ready
request_module[snd-card-3]: not ready
request_module[snd-card-4]: not ready
request_module[snd-card-5]: not ready
request_module[snd-card-6]: not ready
request_module[snd-card-7]: not ready
ALSA device list:
No soundcards found.
NET4: Linux TCP/IP 1.0 for NET4.0
IP: routing cache hash table of 512 buckets, 4Kbytes
TCP: Hash tables configured (established 4096 bind 4096)
NET4: Unix domain sockets 1.0/SMP for Linux NET4.0.
kjournald starting. Commit interval 5 seconds
EXT3-fs: mounted filesystem with ordered data mode.
VFS: Mounted root (ext3 filesystem) readonly.
Freeing unused kernel memory: 304k freed
INIT: version 2.82 booting
Running /etc/init.d/boot
Mounting /proc device done
Mounting /dev/ptsblogd: console=/dev/console, stdin=/dev/console, must differ, boot logging disabled
showconsole: Warning: the ioctl TIOCGDEV is not known by the kerAdding 530104k swap on /dev/sda6. Priority:42 extents:1
nel
Activating swap-devices in /etc/fstab... done
showconsole: Warning: the ioctl TIOCGDEV is not known by the kernel
Checking file systems...
fsck 1.26 (3-Feb-2002)
/dev/sda5: clean, 16935/66264 files, 104836/265041 blocks
/dev/sda1: clean, 55/10040 files, 24115/40131 blocks
/dev/sdb1: clean, 11/2223872 files, 78008/4441964 blocks
/dev/sda10: clean, 523256/1198208 files, 2052639/2393677 blocks
/dev/sda9: clean, 51895/263296 files, 310582/526120 blocks
/dev/sda8: clean, 140195/525888 files, 590977/1050241 blocks
/dev/sda7: clean, EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,5), 2747/131616 fileinternal journal
s, 111363/263056 blocks
done
Setting up /lib/modules/2.5.48 failed
Mounting local file systems...
kjournald starting. Commit interval 5 seconds
proc on /proc tyEXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,17), pe proc (rw)
deinternal journal
vpts on /dev/ptsEXT3-fs: mounted filesystem with ordered data mode.
type devpts (rw,mode=0620,gid=5)
/dev/sdb1 on /2nd type ext3 (kjournald starting. Commit interval 5 seconds
rw)
/dev/sda1 oEXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,10), n /boot type extinternal journal
2 (rw)
EXT3-fs: mounted filesystem with ordered data mode.
/dev/sda10 on /home type ext3 (rw)
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,9), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
/dev/sda9 on /opt type ext3 (rw)
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,8), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
/dev/sda8 on /usr type ext3 (rw)
kjournald starting. Commit interval 5 seconds
EXT3 FS 2.4-0.9.16, 02 Dec 2001 on sd(8,7), internal journal
EXT3-fs: mounted filesystem with ordered data mode.
/dev/sda7 on /var type ext3 (rw)
done
Restore device permissions done
Activating remaining swap-devices in /etc/fstab... done
Setting up the CMOS clock done
Setting up timezone data done
Configuring serial ports...
ttyS0 at 0x03f8 (irq = 4) is a 16550A
ttyS1 at 0x02f8 (irq = 3) is a 16550A
Configured serial ports done
Setting up hostname 'joe' done
Setting up loopback interface done
Creating /var/log/boot.msg done
showconsole: Warning: the ioctl TIOCGDEV is not known by the kernel
INIT: Entering runlevel: 5
blogd: console=/dev/console, stdin=/dev/console, must differ, boot logging disabled
Master Resource Control: previous runlevel: N, switching to runlevel:5
Starting personal-firewall (initial) [not active] unused
Initializing random number generator done
Setting up network interfaces:
lo done
eth0 (DHCP) IP address: 172.20.1.38 done
Starting syslog services done
Starting hotplugging services [ net pci usb ] failed
Starting hardware scan on boote100: eth0 NIC Link is Up 100 Mbps Full duplex
done
Starting RPC portmap daemon done
Starting SSH daemon done
Starting sound driver: already running done
Starting service at daemon done
Initializing SMTP port (sendmail) done
Loading keymap qwerty/us.map.gz done
Loading compose table winkeys shiftctrl latin1.add done
Loading console font lat1-16.psfu done
Loading screenmap none done
Setting up console ttys done
Starting service kdm done
Starting CRON daemon done
Starting Name Service Cache Daemon done
Starting inetd done
Starting personal-firewall (final) [not active] unused
Master Resource Control: runlevel 5 has been reached
Failed services in runlevel 5: hotplug
Skipped services in runlevel 5: personal-firewall.initial splash personal-firewall.final
Eric W. Biederman wrote:
kexec is a set of systems call that allows you to load another kernel from the currently executing Linux kernel. The current implementation has only been tested, and had the kinks worked out on x86, but the generic code should work on any architecture.
Could I get some feed back on where this work and where this breaks. With the maturation of kexec-tools to skip attempting bios calls, I expect a new the linux kernel to load for most people. Though I also expect some device drivers will not reinitialize after the reboot.
I give it a big thumbs-up. Between the NUMAQs and the big xSeries machines, we have a lot of slow rebooters. The 16GB Intel boxes take about 5 minutes to get back to the bootloader after a reboot, and the 4- and 8-quad NUMAQs take closer to 10.
The IBM machines I've tried it on are a 4-way and 8-way PIII. They both have aic7xxx cards and the 8-way has a ServeRAID 4 controller. They have a collection of acenic, e1000, pcnet32 and eepro100 net cards. All seem to work just fine.
The NUMAQ is another story, though. I get nothing after "Starting new kernel". But, I wasn't expecting much. The NUMAQ is pretty weird hardware and god knows what is actually happening. I'll try it some more when I'm more confident in what I'm doing.
What's the deal with "FIXME assuming 64M of ram"? I was a little surprised when my 16GB machine started to OOM as I did a "make -j8 bzImage" :) Why is it that you need the memory size at load time?
On Mon, Nov 18, 2002 at 05:10:38PM -0800, Andy Pfiffer wrote:
On Mon, 2002-11-18 at 00:53, Eric W. Biederman wrote:
kexec is a set of systems call that allows you to load another kernel from the currently executing Linux kernel. The current implementation has only been tested, and had the kinks worked out on x86, but the generic code should work on any architecture.
Great News, Eric. For the first time *ever* I got a kexec reboot to work on my most troublesome machine (see below).
Same here - preloading the new kernel and issuing kexec -e after init 1 works on the troublesome SMP system I'd been mailing you about earlier. Bootimg used to work on this setup, so bypassing the BIOS calls had the expected effect.
If I issue the call earlier though, it runs into trouble with aic7xxx reporting interrupts during setup. Guess you know why we are looking at that case - eventually need to be able to transition directly at dump time without a chance to go through user-space shutdown ...
Regards Suparna
For those looking to replicate:
0. apply these two patches to 2.5.48 (bk Changeset 1.842)
   http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.48.x86kexec.diff
   http://www.xmission.com/~ebiederm/files/kexec/linux-2.5.48.x86kexec-hwfixes.diff

2. compile this:
   http://www.xmission.com/~ebiederm/files/kexec/kexec-tools-1.7.tar.gz

3. my recipe for rebooting:
   a) I have a script that I execute by hand after "init 1" to unmount my filesystems and then remount / and /boot read-only.
   b) I have the kexec binary installed in /boot.
   c) ./kexec-1.7 --force --debug "--command-line=ro root=805 console=ttyS0,9600n8" ./linux-2.5
Thanks, Eric!
Andy