-*-Mode: outline-*-

                Light-weight System Calls for IA-64
                -----------------------------------

                        Started: 13-Jan-2003
                    Last update: 27-Sep-2003

                      David Mosberger-Tang
                      <>

Using the "epc" instruction effectively introduces a new mode of
execution to the ia64 linux kernel.  We call this mode the
"fsys-mode".  To recap, the normal states of execution are:

  - kernel mode:
        Both the register stack and the memory stack have been
        switched over to kernel memory.  The user-level state is saved
        in a pt-regs structure at the top of the kernel memory stack.

  - user mode:
        Both the register stack and the memory stack are in
        user memory.  The user-level state is contained in the
        CPU registers.

  - bank 0 interruption-handling mode:
        This is the non-interruptible state in which all
        interruption-handlers start execution.  The user-level
        state remains in the CPU registers and some kernel state may
        be stored in bank 0 of registers r16-r31.

In contrast, fsys-mode has the following special properties:

  - execution is at privilege level 0 (most-privileged)

  - CPU registers may contain a mixture of user-level and kernel-level
    state (it is the responsibility of the kernel to ensure that no
    security-sensitive kernel-level state is leaked back to
    user-level)

  - execution is interruptible and preemptible (an fsys-mode handler
    can disable interrupts and avoid all other interruption-sources
    to avoid preemption)

  - neither the memory-stack nor the register-stack can be trusted while
    in fsys-mode (they point to the user-level stacks, which may
    be invalid, or completely bogus addresses)

In summary, fsys-mode is much more similar to running in user-mode
than it is to running in kernel-mode.  Of course, given that the
privilege level is at level 0, this means that fsys-mode requires some
care (see below).

* How to tell fsys-mode

Linux operates in fsys-mode when (a) the privilege level is 0 (most
privileged) and (b) the stacks have NOT been switched to kernel memory
yet.  For convenience, the header file <asm-ia64/ptrace.h> provides
three macros:

        user_mode(regs)
        user_stack(task,regs)
        fsys_mode(task,regs)

The "regs" argument is a pointer to a pt_regs structure.  The "task"
argument is a pointer to the task structure to which the "regs"
pointer belongs.  user_mode() returns TRUE if the CPU state pointed
to by "regs" was executing in user mode (privilege level 3).
user_stack() returns TRUE if the state pointed to by "regs" was
executing on the user-level stack(s).  Finally, fsys_mode() returns
TRUE if the CPU state pointed to by "regs" was executing in fsys-mode.
The fsys_mode() macro is equivalent to the expression:

        !user_mode(regs) && user_stack(task,regs)

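As a rough illustration of how the three predicates combine, the
following self-contained C model may help.  It is only a sketch: the
struct and the model_*() names are hypothetical stand-ins invented for
this example, the "task" argument is dropped, and none of this is the
real definition from <asm-ia64/ptrace.h>.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for the saved CPU state. */
struct model_regs {
        unsigned cpl;           /* saved privilege level: 0 = most privileged, 3 = user */
        bool on_user_stacks;    /* true if the stacks had not been switched to kernel memory */
};

/* Models user_mode(): was the saved state running at privilege level 3? */
static bool model_user_mode(const struct model_regs *regs)
{
        return regs->cpl == 3;
}

/* Models user_stack(): was the saved state still on the user-level stacks? */
static bool model_user_stack(const struct model_regs *regs)
{
        return regs->on_user_stacks;
}

/* Models fsys_mode(): privileged, but still on the user-level stacks. */
static bool model_fsys_mode(const struct model_regs *regs)
{
        return !model_user_mode(regs) && model_user_stack(regs);
}
```

In this model, only the combination "cpl 0 on user stacks" counts as
fsys-mode; plain user mode and full kernel mode both yield false.
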
* How to write an fsyscall handler

The file arch/ia64/kernel/fsys.S contains a table of fsyscall-handlers
(fsyscall_table).  This table contains one entry for each system call.
By default, a system call is handled by fsys_fallback_syscall().  This
routine takes care of entering (full) kernel mode and calling the
normal Linux system call handler.  For performance-critical system
calls, it is possible to write a hand-tuned fsyscall-handler.  For
example, fsys.S contains fsys_getpid(), which is a hand-tuned version
of the getpid() system call.

The entry and exit states of an fsyscall handler are as follows:

** Machine state on entry to fsyscall handler:

 - r10    = 0
 - r11    = saved ar.pfs (a user-level value)
 - r15    = system call number
 - r16    = "current" task pointer (in normal kernel-mode, this is in r13)
 - r32-r39 = system call arguments
 - b6     = return address (a user-level value)
 - ar.pfs = previous frame-state (a user-level value)
 - PSR.be = cleared to zero (i.e., little-endian byte order is in effect)
 - all other registers may contain values passed in from user-mode

** Required machine state on exit from fsyscall handler:

 - r11    = saved ar.pfs (as passed into the fsyscall handler)
 - r15    = system call number (as passed into the fsyscall handler)
 - r32-r39 = system call arguments (as passed into the fsyscall handler)
 - b6     = return address (as passed into the fsyscall handler)
 - ar.pfs = previous frame-state (as passed into the fsyscall handler)

Fsyscall handlers can execute with very little overhead, but with that
speed comes a set of restrictions:

 o Fsyscall-handlers MUST check for any pending work in the flags
   member of the thread-info structure and if any of the
   TIF_ALLWORK_MASK flags are set, the handler needs to fall back on
   doing a full system call (by calling fsys_fallback_syscall).

 o Fsyscall-handlers MUST preserve incoming arguments (r32-r39, r11,
   r15, b6, and ar.pfs) because they will be needed in case of a
   system call restart.  Of course, all "preserved" registers also
   must be preserved, in accordance with the normal calling conventions.

 o Fsyscall-handlers MUST check whether argument registers contain a
   NaT value before using them in any way that could trigger a
   NaT-consumption fault.  If a system call argument is found to
   contain a NaT value, an fsyscall-handler may return immediately
   with r8=EINVAL, r10=-1.

 o Fsyscall-handlers MUST NOT use the "alloc" instruction or perform
   any other operation that would trigger mandatory RSE
   (register-stack engine) traffic.

 o Fsyscall-handlers MUST NOT write to any stacked registers because
   it is not safe to assume that user-level called a handler with the
   proper number of arguments.

 o Fsyscall-handlers need to be careful when accessing per-CPU variables:
   unless proper safe-guards are taken (e.g., interruptions are avoided),
   execution may be pre-empted and resumed on another CPU at any given
   time.

 o Fsyscall-handlers must be careful not to leak sensitive kernel
   information back to user-level.  In particular, before returning to
   user-level, care needs to be taken to clear any scratch registers
   that could contain sensitive information (note that the current
   task pointer is not considered sensitive: it's already exposed
   through ar.k6).

 o Fsyscall-handlers MUST NOT access user-memory without first
   validating access-permission (this can typically be done via
   probe.r.fault and/or probe.w.fault) and without guarding against
   memory access exceptions (this can be done with the EX() macros
   defined by asmmacro.h).

The above restrictions may seem draconian, but remember that it's
possible to relax some of the restrictions by paying a slightly
higher overhead.  For example, if an fsyscall-handler could benefit
from the shadow register bank, it could temporarily clear PSR.i and
PSR.ic, switch to bank 0 (bsw.0) and then use the shadow registers as
needed.  In other words, following the above rules yields extremely
fast system call execution (while fully preserving system call
semantics), but there is also a lot of flexibility in handling more
complicated cases.

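Two of the restrictions listed earlier in this section (the NaT check
on arguments and the pending-work check) amount to a small decision
made at the top of every handler.  The C model below sketches that
decision only.  Real handlers are written in assembly in fsys.S; the
model_*() names, the MODEL_TIF_ALLWORK_MASK value and the order of the
two checks are assumptions made purely for illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical mask value; the real TIF_ALLWORK_MASK is defined by the kernel. */
#define MODEL_TIF_ALLWORK_MASK 0x1fUL

enum model_action {
        MODEL_FAST_PATH,        /* handle the call entirely in fsys-mode */
        MODEL_FALLBACK,         /* branch to fsys_fallback_syscall */
        MODEL_RETURN_EINVAL     /* return to user-level with an error */
};

struct model_result {
        enum model_action action;
        long r8;                /* return value / error code */
        long r10;               /* 0 on success, -1 on error */
};

/* Decide what a handler should do before touching its arguments. */
static struct model_result model_fsys_entry(unsigned long thread_flags,
                                            bool arg_is_nat)
{
        struct model_result res = { MODEL_FAST_PATH, 0, 0 };

        if (arg_is_nat) {
                /* A NaT argument must not be consumed: fail with EINVAL. */
                res.action = MODEL_RETURN_EINVAL;
                res.r8 = EINVAL;
                res.r10 = -1;
                return res;
        }
        if (thread_flags & MODEL_TIF_ALLWORK_MASK) {
                /* Pending work: do a full system call instead. */
                res.action = MODEL_FALLBACK;
                return res;
        }
        return res;
}
```

Only when neither condition holds does the model stay on the fast
path, which mirrors the intent of the rules above.
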
* Signal handling

The delivery of (asynchronous) signals must be delayed until fsys-mode
is exited.  This is accomplished with the help of the lower-privilege
transfer trap: arch/ia64/kernel/process.c:do_notify_resume_user()
checks whether the interrupted task was in fsys-mode and, if so, sets
PSR.lp and returns immediately.  When fsys-mode is exited via the
"br.ret" instruction that lowers the privilege level, a trap will
occur.  The trap handler clears PSR.lp again and returns immediately.
The kernel exit path then checks for and delivers any pending signals.

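The deferral sequence can be pictured as a tiny state machine.  The C
sketch below is a toy model of the steps just described, not kernel
code: struct model_cpu and both functions are hypothetical names, and
the "delivered" counter merely stands in for actual signal delivery.

```c
#include <stdbool.h>

/* Hypothetical model of the state relevant to signal deferral. */
struct model_cpu {
        bool fsys_mode;         /* task is currently in fsys-mode */
        bool psr_lp;            /* lower-privilege transfer trap armed */
        bool signal_pending;    /* an asynchronous signal is waiting */
        int delivered;          /* number of signals delivered so far */
};

/* Models the check made by do_notify_resume_user(). */
static void model_notify_resume(struct model_cpu *cpu)
{
        if (cpu->fsys_mode) {
                /* Defer: arm the trap and return immediately. */
                cpu->psr_lp = true;
                return;
        }
        if (cpu->signal_pending) {
                cpu->signal_pending = false;
                cpu->delivered++;
        }
}

/* Models the "br.ret" that lowers the privilege level. */
static void model_br_ret(struct model_cpu *cpu)
{
        cpu->fsys_mode = false;
        if (cpu->psr_lp) {
                /* Trap handler clears PSR.lp; the kernel exit path
                   then checks for pending signals. */
                cpu->psr_lp = false;
                model_notify_resume(cpu);
        }
}
```
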
* PSR Handling

The "epc" instruction doesn't change the contents of PSR at all.  This
is in contrast to a regular interruption, which clears almost all
bits.  Because of that, some care needs to be taken to ensure things
work as expected.  The following discussion describes how each PSR bit
is handled.

PSR.be  Cleared when entering fsys-mode.  A srlz.d instruction is used
        to ensure the CPU is in little-endian mode before the first
        load/store instruction is executed.  PSR.be is normally NOT
        restored upon return from an fsys-mode handler.  In other
        words, user-level code must not rely on PSR.be being preserved
        across a system call.
PSR.up  Unchanged.
PSR.ac  Unchanged.
PSR.mfl Unchanged.  Note: fsys-mode handlers must not write floating-point registers!
PSR.mfh Unchanged.  Note: fsys-mode handlers must not write floating-point registers!
PSR.ic  Unchanged.  Note: fsys-mode handlers can clear the bit, if needed.
PSR.i   Unchanged.  Note: fsys-mode handlers can clear the bit, if needed.
PSR.pk  Unchanged.
PSR.dt  Unchanged.
PSR.dfl Unchanged.  Note: fsys-mode handlers must not write floating-point registers!
PSR.dfh Unchanged.  Note: fsys-mode handlers must not write floating-point registers!
PSR.sp  Unchanged.
PSR.pp  Unchanged.
PSR.di  Unchanged.
PSR.si  Unchanged.
PSR.db  Unchanged.  The kernel prevents user-level from setting a hardware
        breakpoint that triggers at any privilege level other than 3 (user-mode).
PSR.lp  Unchanged.
PSR.tb  Lazy redirect.  If a taken-branch trap occurs while in
        fsys-mode, the trap-handler modifies the saved machine state
        such that execution resumes in the gate page at
        syscall_via_break(), with privilege level 3.  Note: the
        taken branch would occur on the branch invoking the
        fsyscall-handler, at which point, by definition, a syscall
        restart is still safe.  If the system call number is invalid,
        the fsys-mode handler will return directly to user-level.  This
        return will trigger a taken-branch trap, but since the trap is
        taken _after_ restoring the privilege level, the CPU has already
        left fsys-mode, so no special treatment is needed.
PSR.rt  Unchanged.
PSR.cpl Cleared to 0.
PSR.is  Unchanged (guaranteed to be 0 on entry to the gate page).
PSR.mc  Unchanged.
PSR.it  Unchanged (guaranteed to be 1).
PSR.id  Unchanged.  Note: the ia64 linux kernel never sets this bit.
PSR.da  Unchanged.  Note: the ia64 linux kernel never sets this bit.
PSR.dd  Unchanged.  Note: the ia64 linux kernel never sets this bit.
PSR.ss  Lazy redirect.  If set, "epc" will cause a Single Step Trap to
        be taken.  The trap handler then modifies the saved machine
        state such that execution resumes in the gate page at
        syscall_via_break(), with privilege level 3.
PSR.ri  Unchanged.
PSR.ed  Unchanged.  Note: This bit could only have an effect if an fsys-mode
        handler performed a speculative load that gets NaTted.  If so, this
        would be the normal & expected behavior, so no special treatment is
        needed.
PSR.bn  Unchanged.  Note: fsys-mode handlers may clear the bit, if needed.
        Doing so requires clearing PSR.i and PSR.ic as well.
PSR.ia  Unchanged.  Note: the ia64 linux kernel never sets this bit.

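To make the net effect concrete, the C sketch below models the PSR
value an fsys-mode handler sees, given the user-level PSR: the
byte-order bit (PSR.be) and the privilege-level field (PSR.cpl) are
cleared and everything else is carried over.  The bit positions are
taken from the IA-64 architecture definition of the PSR; the MODEL_*
macros and function names are hypothetical, and the lazily redirected
bits (PSR.tb, PSR.ss) are deliberately not modelled.

```c
#include <stdint.h>

/* PSR bit positions (IA-64 architecture); the names are local to this sketch. */
#define MODEL_PSR_BE   (UINT64_C(1) << 1)    /* big-endian byte order */
#define MODEL_PSR_IC   (UINT64_C(1) << 13)   /* interruption collection */
#define MODEL_PSR_I    (UINT64_C(1) << 14)   /* interrupts enabled */
#define MODEL_PSR_CPL  (UINT64_C(3) << 32)   /* current privilege level (2 bits) */

/* PSR as seen by an fsys-mode handler: be and cpl cleared, rest unchanged. */
static uint64_t model_psr_on_fsys_entry(uint64_t user_psr)
{
        return user_psr & ~(MODEL_PSR_BE | MODEL_PSR_CPL);
}

/* Extract the privilege level (0 = most privileged, 3 = user). */
static unsigned model_psr_cpl(uint64_t psr)
{
        return (unsigned)((psr >> 32) & 3);
}
```

Note that PSR.i and PSR.ic pass through untouched in this model, which
is why a handler that needs them cleared has to do so itself.
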
* Using fast system calls

To use fast system calls, userspace applications simply need to call
__kernel_syscall_via_epc().  For example:

-- example fgettimeofday() call --
-- fgettimeofday.S --

#include <asm/asmmacro.h>

GLOBAL_ENTRY(fgettimeofday)
.prologue
.save ar.pfs, r11
mov r11 = ar.pfs
.body

movl r2 = 0xa000000000020660;; // gate address
                               // found by inspection of System.map for the
                               // __kernel_syscall_via_epc() function.  See
                               // below for how to do this for real.

mov b7 = r2
mov r15 = 1087                 // gettimeofday syscall
;;
br.call.sptk.many b6 = b7
;;

.restore sp

mov ar.pfs = r11
br.ret.sptk.many rp;;         // return to caller
END(fgettimeofday)

-- end fgettimeofday.S --

In reality, getting the gate address is accomplished by two extra
values passed via the ELF auxiliary vector (include/asm-ia64/elf.h):

 o AT_SYSINFO : is the address of __kernel_syscall_via_epc()
 o AT_SYSINFO_EHDR : is the address of the kernel gate ELF DSO

The ELF DSO is a pre-linked library that is mapped in by the kernel at
the gate page.  It is a proper ELF shared object so, with a dynamic
loader that recognises the library, you should be able to make calls to
the exported functions within it as with any other shared library.
AT_SYSINFO points into the kernel DSO at the
__kernel_syscall_via_epc() function for historical reasons (it was
used before the kernel DSO) and as a convenience.
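
As a minimal sketch of reading these two values from user space, the
C fragment below uses getauxval() from <sys/auxv.h>.  This assumes a
C library that provides getauxval() (glibc does so since version 2.16,
which is newer than this document); the function names are
hypothetical.  getauxval() returns 0 when an entry is absent, for
instance AT_SYSINFO on architectures that do not pass it.

```c
#include <elf.h>
#include <stddef.h>
#include <string.h>
#include <sys/auxv.h>

/* Address of __kernel_syscall_via_epc(), or 0 if not provided. */
static unsigned long gate_syscall_entry(void)
{
        return getauxval(AT_SYSINFO);
}

/* Base address of the kernel gate ELF DSO, or 0 if not provided. */
static unsigned long gate_dso_base(void)
{
        return getauxval(AT_SYSINFO_EHDR);
}

/* Returns 1 if the DSO base is present and starts with the ELF magic. */
static int gate_dso_is_elf(void)
{
        const void *base = (const void *)gate_dso_base();

        return base != NULL && memcmp(base, ELFMAG, SELFMAG) == 0;
}
```

A program would typically check that gate_syscall_entry() is non-zero
before branching to it, and fall back to the normal system call path
otherwise.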