Started Jan 2000 by Kanoj Sarcar <>

Memory balancing is needed for non __GFP_WAIT as well as for non
__GFP_IO allocations.

There are two reasons to request non __GFP_WAIT allocations: the caller
cannot sleep (typically interrupt context), or it does not want to incur
the overhead of page stealing and possible swap I/O.

Non __GFP_IO allocation requests are made to prevent file system deadlocks.

In the absence of non-sleepable allocation requests, proactive balancing
seems detrimental. Page reclamation can be kicked off lazily, that is,
only when needed (when a zone's free memory reaches 0), instead of
making it a proactive process.

That being said, the kernel should try to fulfill requests for direct
mapped pages from the direct mapped pool, instead of falling back on
the dma pool, so as to keep the dma pool filled for dma requests (atomic
or not). A similar argument applies to highmem and direct mapped pages.
On the other hand, if there are a lot of free dma pages, it is preferable
to satisfy regular memory requests by allocating from the dma pool,
instead of incurring the overhead of regular zone balancing.

In 2.2, memory balancing/page reclamation would kick off only when the
_total_ number of free pages fell below 1/64th of total memory. With the
right ratio of dma and regular memory, it is quite possible that balancing
would not be done even when the dma zone was completely empty. 2.2 has
been running production machines of varying memory sizes, and seems to be
doing fine even with this problem present. In 2.3, due to HIGHMEM, this
problem is aggravated.
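
The 2.2 problem described above can be illustrated with a small sketch.
This is not kernel code: struct zone_info and global_needs_balance() are
hypothetical names introduced here, and the only rule modeled is the
"total free pages below 1/64th of total memory" test from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified zone description; not the kernel's zone struct. */
struct zone_info {
    unsigned long total_pages;
    unsigned long free_pages;
};

/*
 * 2.2-style check as described in the text: balance only when the total
 * number of free pages across all zones is below 1/64th of total memory.
 */
static bool global_needs_balance(const struct zone_info *zones, size_t nr_zones)
{
    unsigned long total = 0, free_pages = 0;

    assert(zones != NULL && nr_zones > 0);
    for (size_t i = 0; i < nr_zones; i++) {
        total += zones[i].total_pages;
        free_pages += zones[i].free_pages;
    }
    return free_pages < total / 64;
}
```

With a 4096-page dma zone that is completely empty and a 126976-page
regular zone holding 4000 free pages, the threshold is 2048 pages, so the
check reports that no balancing is needed even though the dma zone has
nothing left.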

In 2.3, zone balancing can be done in one of two ways. In the first,
depending on the zone size (and possibly on the sizes of lower class
zones), we decide at init time how many free pages we should aim for
while balancing any zone. The good part is that, while balancing, we do
not need to look at the sizes of lower class zones; the bad part is that
we might balance too frequently because we ignore possibly lower usage
in the lower class zones. Also, with a slight change in the allocation
routine, it is possible to reduce the memclass() macro to a simple
equality.
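
A rough sketch of this first approach follows. The text does not say how
the init-time target would be computed; using 1/64th of the zone's own
size is purely an assumption made for illustration, and struct
zone_target, init_zone_targets() and zone_below_target() are hypothetical
names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-zone record with a target fixed at init time. */
struct zone_target {
    unsigned long total_pages;
    unsigned long free_pages;
    unsigned long balance_target;   /* free pages to aim for */
};

/* Assumption: the target is 1/64th of the zone's own size. */
static void init_zone_targets(struct zone_target *zones, size_t nr_zones)
{
    assert(zones != NULL && nr_zones > 0);
    for (size_t i = 0; i < nr_zones; i++)
        zones[i].balance_target = zones[i].total_pages / 64;
}

/* The balancing decision looks only at the zone itself. */
static bool zone_below_target(const struct zone_target *z)
{
    assert(z != NULL);
    return z->free_pages < z->balance_target;
}
```

Because each zone is judged in isolation, the check is cheap, but a zone
can be balanced even while lower class zones still have plenty of free
pages, which is the drawback noted above.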

Another possible solution is that we balance only when the free memory
of a zone _and_ all its lower class zones falls below 1/64th of the
total memory in the zone and its lower class zones. This fixes the 2.2
balancing problem, and stays as close to 2.2 behavior as possible. Also,
the balancing algorithm works the same way on the various architectures,
which have different numbers and types of zones. If we wanted to get
fancy, we could assign different weights to free pages in different
zones in the future.
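
The cumulative rule above can be sketched as follows. This is a minimal
illustration, not the patch itself: struct zone_info and
zone_needs_balance_cumulative() are hypothetical names, and the zones
array is assumed to be ordered from the lowest class (dma) upward.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified zone description; not the kernel's zone struct. */
struct zone_info {
    unsigned long total_pages;
    unsigned long free_pages;
};

/*
 * Balance zone idx when the free pages of that zone plus all lower class
 * zones are below 1/64th of the total pages of those same zones.
 * Assumption: zones[0] is the lowest class zone.
 */
static bool zone_needs_balance_cumulative(const struct zone_info *zones,
                                          size_t nr_zones, size_t idx)
{
    unsigned long total = 0, free_pages = 0;

    assert(zones != NULL && idx < nr_zones);
    for (size_t i = 0; i <= idx; i++) {
        total += zones[i].total_pages;
        free_pages += zones[i].free_pages;
    }
    return free_pages < total / 64;
}
```

For an empty 4096-page dma zone and a 126976-page regular zone with 4000
free pages, the dma zone (idx 0) is flagged for balancing while the
regular zone (idx 1) is not, which is the case the 2.2 global check
misses.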

Note that if the size of the regular zone is huge compared to the dma
zone, it becomes less significant to consider the free dma pages while
deciding whether to balance the regular zone. The first solution
becomes more attractive then.

The appended patch implements the second solution. It also "fixes" two
problems: first, kswapd is woken up as in 2.2 on low memory conditions
for non-sleepable allocations. Second, the HIGHMEM zone is also balanced,
so as to give a fighting chance for replace_with_highmem() to get a
HIGHMEM page, as well as to ensure that HIGHMEM allocations do not
fall back into the regular zone. This also makes sure that HIGHMEM pages
are not leaked (for example, in situations where a HIGHMEM page is in
the swapcache but is not being used by anyone).

kswapd also needs to know about the zones it should balance. kswapd is
primarily needed in a situation where balancing cannot be done,
probably because all allocation requests are coming from interrupt
context and all process contexts are sleeping. For 2.3, kswapd does not
really need to balance the highmem zone, since interrupt context does
not request highmem pages. kswapd looks at the zone_wake_kswapd field in
the zone structure to decide whether a zone needs balancing.

Page stealing from process memory and shm is done if stealing the page would
alleviate memory pressure on any zone in the page's node that has fallen below
its watermark.

watermark[WMARK_MIN/WMARK_LOW/WMARK_HIGH]/low_on_memory/zone_wake_kswapd: These
are per-zone fields, used to determine when a zone needs to be balanced. When
the number of free pages falls below watermark[WMARK_MIN], the hysteresis field
low_on_memory gets set. This stays set until the number of free pages reaches
watermark[WMARK_HIGH]. When low_on_memory is set, page allocation requests will
try to free some pages in the zone (provided __GFP_WAIT is set in the request).
Orthogonal to this is the decision to poke kswapd to free some zone pages.
That decision is not hysteresis based, and is made when the number of free
pages is below watermark[WMARK_LOW], in which case zone_wake_kswapd is also set.
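
The interaction of the three watermarks and the two flags can be sketched
as below. The field names follow the text, but struct zone_state and
update_zone_flags() are hypothetical. The text does not say when
zone_wake_kswapd is cleared; recomputing it on every call is an
assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>

enum { WMARK_MIN, WMARK_LOW, WMARK_HIGH, NR_WMARK };

/* Hypothetical, simplified per-zone state; not the kernel's zone struct. */
struct zone_state {
    unsigned long free_pages;
    unsigned long watermark[NR_WMARK];
    bool low_on_memory;      /* hysteresis flag */
    bool zone_wake_kswapd;   /* no hysteresis */
};

static void update_zone_flags(struct zone_state *z)
{
    assert(z->watermark[WMARK_MIN] <= z->watermark[WMARK_LOW] &&
           z->watermark[WMARK_LOW] <= z->watermark[WMARK_HIGH]);

    /* Set below MIN, cleared only once free pages reach HIGH. */
    if (z->free_pages < z->watermark[WMARK_MIN])
        z->low_on_memory = true;
    else if (z->free_pages >= z->watermark[WMARK_HIGH])
        z->low_on_memory = false;

    /* Assumption: simply tracks whether free pages are below LOW. */
    z->zone_wake_kswapd = z->free_pages < z->watermark[WMARK_LOW];
}
```

Between WMARK_MIN and WMARK_HIGH, low_on_memory keeps whatever value it
had, which is what makes it hysteretic; zone_wake_kswapd depends only on
the current free page count.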

(Good) Ideas that I have heard:
1. Dynamic experience should influence balancing: number of failed requests
for a zone can be tracked and fed into the balancing scheme (
2. Implement a replace_with_highmem()-like replace_with_regular() to preserve
dma pages. (