linux/Documentation/x86/exception-tables.txt
     Kernel level exception handling in Linux
Commentary by Joerg Pommnitz <joerg@raleigh.ibm.com>

When a process runs in kernel mode, it often has to access user
mode memory whose address has been passed by an untrusted program.
To protect itself the kernel has to verify this address.

In older versions of Linux this was done with the
int verify_area(int type, const void * addr, unsigned long size)
function (which has since been replaced by access_ok()).

This function verified that the memory area starting at address
'addr' and of size 'size' was accessible for the operation specified
in type (read or write). To do this, verify_area had to look up the
virtual memory area (vma) that contained the address addr. In the
normal case (correctly working program), this test was successful.
It only failed for a few buggy programs. In some kernel profiling
tests, this normally unneeded verification used up a considerable
amount of time.

To overcome this situation, Linus decided to let the virtual memory
hardware present in every Linux-capable CPU handle this test.

How does this work?

Whenever the kernel tries to access an address that is currently not
accessible, the CPU generates a page fault exception and calls the
page fault handler

void do_page_fault(struct pt_regs *regs, unsigned long error_code)

in arch/x86/mm/fault.c. The parameters on the stack are set up by
the low level assembly glue in arch/x86/kernel/entry_32.S. The parameter
regs is a pointer to the saved registers on the stack, error_code
contains a reason code for the exception.

do_page_fault first obtains the inaccessible address from the CPU
control register CR2. If the address is within the virtual address
space of the process, the fault probably occurred because the page
was not swapped in, write protected or something similar. However,
we are interested in the other case: the address is not valid, there
is no vma that contains this address. In this case, the kernel jumps
to the bad_area label.

There it uses the address of the instruction that caused the exception
(i.e. regs->eip) to find an address where the execution can continue
(fixup). If this search is successful, the fault handler modifies the
return address (again regs->eip) and returns. The execution will
continue at the address in fixup.

Where does fixup point to?

Since we jump to the contents of fixup, fixup obviously points
to executable code. This code is hidden inside the user access macros.
I have picked the get_user macro defined in arch/x86/include/asm/uaccess.h
as an example. The definition is somewhat hard to follow, so let's peek at
the code generated by the preprocessor and the compiler. I selected
the get_user call in drivers/char/sysrq.c for a detailed examination.

The original code in sysrq.c line 587:
      get_user(c, buf);

The preprocessor output (edited to become somewhat readable):

(
  {
    long __gu_err = - 14 , __gu_val = 0;
    const __typeof__(*( (  buf ) )) *__gu_addr = ((buf));
    if (((((0 + current_set[0])->tss.segment) == 0x18 )  ||
       (((sizeof(*(buf))) <= 0xC0000000UL) &&
       ((unsigned long)(__gu_addr ) <= 0xC0000000UL - (sizeof(*(buf)))))))
      do {
        __gu_err  = 0;
        switch ((sizeof(*(buf)))) {
          case 1:
            __asm__ __volatile__(
              "1:      mov" "b" " %2,%" "b" "1\n"
              "2:\n"
              ".section .fixup,\"ax\"\n"
              "3:      movl %3,%0\n"
              "        xor" "b" " %" "b" "1,%" "b" "1\n"
              "        jmp 2b\n"
              ".section __ex_table,\"a\"\n"
              "        .align 4\n"
              "        .long 1b,3b\n"
              ".text"        : "=r"(__gu_err), "=q" (__gu_val): "m"((*(struct __large_struct *)
                            (   __gu_addr   )) ), "i"(- 14 ), "0"(  __gu_err  )) ;
              break;
          case 2:
            __asm__ __volatile__(
              "1:      mov" "w" " %2,%" "w" "1\n"
              "2:\n"
              ".section .fixup,\"ax\"\n"
              "3:      movl %3,%0\n"
              "        xor" "w" " %" "w" "1,%" "w" "1\n"
              "        jmp 2b\n"
              ".section __ex_table,\"a\"\n"
              "        .align 4\n"
              "        .long 1b,3b\n"
              ".text"        : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
                            (   __gu_addr   )) ), "i"(- 14 ), "0"(  __gu_err  ));
              break;
          case 4:
            __asm__ __volatile__(
              "1:      mov" "l" " %2,%" "" "1\n"
              "2:\n"
              ".section .fixup,\"ax\"\n"
              "3:      movl %3,%0\n"
              "        xor" "l" " %" "" "1,%" "" "1\n"
              "        jmp 2b\n"
              ".section __ex_table,\"a\"\n"
              "        .align 4\n"
              "        .long 1b,3b\n"
              ".text"        : "=r"(__gu_err), "=r" (__gu_val) : "m"((*(struct __large_struct *)
                            (   __gu_addr   )) ), "i"(- 14 ), "0"(__gu_err));
              break;
          default:
            (__gu_val) = __get_user_bad();
        }
      } while (0) ;
    ((c)) = (__typeof__(*((buf))))__gu_val;
    __gu_err;
  }
);

WOW! Black GCC/assembly magic. This is impossible to follow, so let's
see what code gcc generates:

 >         xorl %edx,%edx
 >         movl current_set,%eax
 >         cmpl $24,788(%eax)
 >         je .L1424
 >         cmpl $-1073741825,64(%esp)
 >         ja .L1423
 > .L1424:
 >         movl %edx,%eax
 >         movl 64(%esp),%ebx
 > #APP
 > 1:      movb (%ebx),%dl                /* this is the actual user access */
 > 2:
 > .section .fixup,"ax"
 > 3:      movl $-14,%eax
 >         xorb %dl,%dl
 >         jmp 2b
 > .section __ex_table,"a"
 >         .align 4
 >         .long 1b,3b
 > .text
 > #NO_APP
 > .L1423:
 >         movzbl %dl,%esi

The optimizer does a good job and gives us something we can actually
understand. Can we? The actual user access is quite obvious. Thanks
to the unified address space we can just access the address in user
memory. But what does the .section stuff do?????

To understand this we have to look at the final kernel:

 > objdump --section-headers vmlinux
 >
 > vmlinux:     file format elf32-i386
 >
 > Sections:
 > Idx Name          Size      VMA       LMA       File off  Algn
 >   0 .text         00098f40  c0100000  c0100000  00001000  2**4
 >                   CONTENTS, ALLOC, LOAD, READONLY, CODE
 >   1 .fixup        000016bc  c0198f40  c0198f40  00099f40  2**0
 >                   CONTENTS, ALLOC, LOAD, READONLY, CODE
 >   2 .rodata       0000f127  c019a5fc  c019a5fc  0009b5fc  2**2
 >                   CONTENTS, ALLOC, LOAD, READONLY, DATA
 >   3 __ex_table    000015c0  c01a9724  c01a9724  000aa724  2**2
 >                   CONTENTS, ALLOC, LOAD, READONLY, DATA
 >   4 .data         0000ea58  c01abcf0  c01abcf0  000abcf0  2**4
 >                   CONTENTS, ALLOC, LOAD, DATA
 >   5 .bss          00018e21  c01ba748  c01ba748  000ba748  2**2
 >                   ALLOC
 >   6 .comment      00000ec4  00000000  00000000  000ba748  2**0
 >                   CONTENTS, READONLY
 >   7 .note         00001068  00000ec4  00000ec4  000bb60c  2**0
 >                   CONTENTS, READONLY

There are obviously 2 non standard ELF sections in the generated object
file. But first we want to find out what happened to our code in the
final kernel executable:

 > objdump --disassemble --section=.text vmlinux
 >
 > c017e785 <do_con_write+c1> xorl   %edx,%edx
 > c017e787 <do_con_write+c3> movl   0xc01c7bec,%eax
 > c017e78c <do_con_write+c8> cmpl   $0x18,0x314(%eax)
 > c017e793 <do_con_write+cf> je     c017e79f <do_con_write+db>
 > c017e795 <do_con_write+d1> cmpl   $0xbfffffff,0x40(%esp,1)
 > c017e79d <do_con_write+d9> ja     c017e7a7 <do_con_write+e3>
 > c017e79f <do_con_write+db> movl   %edx,%eax
 > c017e7a1 <do_con_write+dd> movl   0x40(%esp,1),%ebx
 > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl
 > c017e7a7 <do_con_write+e3> movzbl %dl,%esi

The whole user memory access is reduced to 10 x86 machine instructions.
The instructions bracketed in the .section directives are no longer
in the normal execution path. They are located in a different section
of the executable file:

 > objdump --disassemble --section=.fixup vmlinux
 >
 > c0199ff5 <.fixup+10b5> movl   $0xfffffff2,%eax
 > c0199ffa <.fixup+10ba> xorb   %dl,%dl
 > c0199ffc <.fixup+10bc> jmp    c017e7a7 <do_con_write+e3>

And finally:
 > objdump --full-contents --section=__ex_table vmlinux
 >
 >  c01aa7c4 93c017c0 e09f19c0 97c017c0 99c017c0  ................
 >  c01aa7d4 f6c217c0 e99f19c0 a5e717c0 f59f19c0  ................
 >  c01aa7e4 080a18c0 01a019c0 0a0a18c0 04a019c0  ................

or in human readable byte order:

 >  c01aa7c4 c017c093 c0199fe0 c017c097 c017c099  ................
 >  c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5  ................
                               ^^^^^^^^^^^^^^^^^
                               this is the interesting part!
 >  c01aa7e4 c0180a08 c019a001 c0180a0a c019a004  ................
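
The byte swap between the two dumps can be reproduced in a few lines of
C. This is only a sketch: the names ex_entry, le32, decode_ex_table and
decode_example are made up for this illustration and are not kernel
identifiers. It assumes the layout produced by ".long 1b,3b" on 32-bit
x86, i.e. two little-endian 32-bit words per entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of one table entry as emitted by ".long 1b,3b". */
struct ex_entry {
    uint32_t insn;   /* address of the instruction that may fault */
    uint32_t fixup;  /* address of the recovery code */
};

/* Assemble a 32-bit value from four bytes stored in little-endian order. */
static uint32_t le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode n entries (8 bytes each) from a raw section dump. */
void decode_ex_table(const unsigned char *raw, size_t n, struct ex_entry *out)
{
    for (size_t i = 0; i < n; i++) {
        out[i].insn  = le32(raw + 8 * i);
        out[i].fixup = le32(raw + 8 * i + 4);
    }
}

/* Usage: decode the row of the dump that starts at c01aa7d4. */
void decode_example(void)
{
    static const unsigned char row[16] = {
        0xf6, 0xc2, 0x17, 0xc0, 0xe9, 0x9f, 0x19, 0xc0,
        0xa5, 0xe7, 0x17, 0xc0, 0xf5, 0x9f, 0x19, 0xc0,
    };
    struct ex_entry e[2];

    decode_ex_table(row, 2, e);
    assert(e[1].insn  == 0xc017e7a5u);  /* the faulting movb */
    assert(e[1].fixup == 0xc0199ff5u);  /* its code in .fixup */
}
```

The second entry of that row is the pair marked as the interesting part
in the dump.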

What happened? The assembly directives

.section .fixup,"ax"
.section __ex_table,"a"

told the assembler to move the following code to the specified
sections in the ELF object file. So the instructions
3:      movl $-14,%eax
        xorb %dl,%dl
        jmp 2b
ended up in the .fixup section of the object file and the addresses
        .long 1b,3b
ended up in the __ex_table section of the object file. 1b and 3b
are local labels. The local label 1b (1b stands for next label 1
backward) is the address of the instruction that might fault, i.e.
in our case the address of the label 1 is c017e7a5:
the original assembly code: > 1:      movb (%ebx),%dl
and linked in vmlinux     : > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl

The local label 3 (backwards again) is the address of the code to handle
the fault, in our case the actual value is c0199ff5:
the original assembly code: > 3:      movl $-14,%eax
and linked in vmlinux     : > c0199ff5 <.fixup+10b5> movl   $0xfffffff2,%eax

The assembly code
 > .section __ex_table,"a"
 >         .align 4
 >         .long 1b,3b

becomes the value pair
 >  c01aa7d4 c017c2f6 c0199fe9 c017e7a5 c0199ff5  ................
                               ^this is ^this is
                               1b       3b
c017e7a5,c0199ff5 in the exception table of the kernel.

So, what actually happens if a fault from kernel mode with no suitable
vma occurs?

1.) access to invalid address:
 > c017e7a5 <do_con_write+e1> movb   (%ebx),%dl
2.) MMU generates exception
3.) CPU calls do_page_fault
4.) do_page_fault calls search_exception_table (regs->eip == c017e7a5);
5.) search_exception_table looks up the address c017e7a5 in the
    exception table (i.e. the contents of the ELF section __ex_table)
    and returns the address of the associated fault handle code c0199ff5.
6.) do_page_fault modifies its own return address to point to the fault
    handle code and returns.
7.) execution continues in the fault handling code.
8.) 8a) EAX becomes -EFAULT (== -14)
    8b) DL  becomes zero (the value we "read" from user space)
    8c) execution continues at local label 2 (address of the
        instruction immediately after the faulting user access).

The steps 8a to 8c in a certain way emulate the faulting instruction.
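
Steps 4 to 6 can be modelled in user space. The following is a sketch
under stated assumptions, not the kernel's code: struct fake_regs,
search_ex_table, fixup_fault and sample_table are names invented here,
the lookup is assumed to be a binary search over a table sorted by
ascending instruction address, and 0 is assumed to mean "no entry".
The table holds the six pairs from the dump above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct ex_entry {
    uint32_t insn;   /* address of the instruction that may fault */
    uint32_t fixup;  /* address to continue at after a fault */
};

/* Hypothetical stand-in for the saved registers; only eip is modelled. */
struct fake_regs {
    uint32_t eip;
};

/* Binary search over a table sorted by insn; returns the fixup or 0. */
uint32_t search_ex_table(const struct ex_entry *tbl, size_t n, uint32_t addr)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tbl[mid].insn == addr)
            return tbl[mid].fixup;
        if (tbl[mid].insn < addr)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}

/* Steps 4 to 6: look up regs->eip and redirect it. Returns 1 if fixed up. */
int fixup_fault(struct fake_regs *regs, const struct ex_entry *tbl, size_t n)
{
    uint32_t fixup = search_ex_table(tbl, n, regs->eip);

    if (fixup == 0)
        return 0;        /* no entry: the fault is not recoverable this way */
    regs->eip = fixup;   /* execution resumes in the .fixup code */
    return 1;
}

/* The six pairs from the human readable dump, in table order. */
const struct ex_entry sample_table[] = {
    { 0xc017c093u, 0xc0199fe0u },
    { 0xc017c097u, 0xc017c099u },
    { 0xc017c2f6u, 0xc0199fe9u },
    { 0xc017e7a5u, 0xc0199ff5u },
    { 0xc0180a08u, 0xc019a001u },
    { 0xc0180a0au, 0xc019a004u },
};
```

Calling fixup_fault with eip set to c017e7a5 rewrites eip to c0199ff5,
which is the behaviour described in step 6. An address without a table
entry, such as c017e7a7, is left untouched.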

That's it, mostly. If you look at our example, you might ask why
we set EAX to -EFAULT in the exception handler code. Well, the
get_user macro actually returns a value: 0, if the user access was
successful, -EFAULT on failure. Our original code did not test this
return value, however the inline assembly code in get_user tries to
return -EFAULT. GCC selected EAX to return this value.
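
A caller that does test the return value might look like the sketch
below. It cannot use the real get_user outside the kernel, so
mock_get_user_byte is a hypothetical user-space stand-in that treats a
NULL source pointer as "the access faulted" and then mimics steps 8a
and 8b. read_command is likewise an invented caller.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for get_user() on one byte.
 * Assumption: a NULL src plays the role of a faulting user address.
 */
int mock_get_user_byte(unsigned char *dst, const unsigned char *src)
{
    if (src == NULL) {
        *dst = 0;          /* 8b: the value "read" becomes zero */
        return -EFAULT;    /* 8a: the error code */
    }
    *dst = *src;
    return 0;
}

/* Invented caller: propagates the error instead of using the zeroed byte. */
int read_command(const unsigned char *buf, unsigned char *c)
{
    int err = mock_get_user_byte(c, buf);

    if (err)
        return err;
    return 0;
}
```

Without the check, the caller would silently continue with the zero
byte from step 8b.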

NOTE:
Due to the way that the exception table is built and needs to be ordered,
only use exceptions for code in the .text section.  Any other section
will cause the exception table to not be sorted correctly, and the
exceptions will fail.
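
The ordering requirement can be made concrete with a small check. This
is an illustration only: ex_table_is_sorted and ex_table_sort are
hypothetical helpers, and sorting in user space with qsort is an
assumption made for the example, not a description of how the kernel
builds its table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct ex_entry {
    uint32_t insn;   /* address of the instruction that may fault */
    uint32_t fixup;  /* address of the recovery code */
};

/* Returns 1 if entries are in ascending insn order, 0 otherwise. */
int ex_table_is_sorted(const struct ex_entry *tbl, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (tbl[i - 1].insn > tbl[i].insn)
            return 0;
    return 1;
}

/* qsort comparator on insn; avoids overflow from subtracting addresses. */
static int cmp_insn(const void *a, const void *b)
{
    const struct ex_entry *x = a;
    const struct ex_entry *y = b;

    return (x->insn > y->insn) - (x->insn < y->insn);
}

/* Put the table into the order a sorted lookup expects. */
void ex_table_sort(struct ex_entry *tbl, size_t n)
{
    qsort(tbl, n, sizeof(*tbl), cmp_insn);
}
```

If an entry for code at a lower address lands behind one for code at a
higher address, ex_table_is_sorted returns 0, and a lookup that relies
on the ordering can miss that entry.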