The call graph for this function is shown in Figure 8.3. This function is responsible for the creation of a new cache and will be dealt with in chunks due to its size. The chunks roughly are: perform basic sanity checks for bad usage; perform debugging checks if CONFIG_SLAB_DEBUG is set; allocate a kmem_cache_t from the cache_cache slab cache; align the object size to the word size and, if requested, to the hardware cache; calculate how many objects will fit on a slab; calculate the colour offsets; initialise the remaining fields in the cache descriptor; and add the new cache to the cache chain.
621 kmem_cache_t *
622 kmem_cache_create (const char *name, size_t size,
623 size_t offset, unsigned long flags,
void (*ctor)(void*, kmem_cache_t *, unsigned long),
624 void (*dtor)(void*, kmem_cache_t *, unsigned long))
625 {
626 const char *func_nm = KERN_ERR "kmem_create: ";
627 size_t left_over, align, slab_size;
628 kmem_cache_t *cachep = NULL;
629
633 if ((!name) ||
634 ((strlen(name) >= CACHE_NAMELEN - 1)) ||
635 in_interrupt() ||
636 (size < BYTES_PER_WORD) ||
637 (size > (1<<MAX_OBJ_ORDER)*PAGE_SIZE) ||
638 (dtor && !ctor) ||
639 (offset < 0 || offset > size))
640 BUG();
641
Perform basic sanity checks for bad usage
642 #if DEBUG
643 if ((flags & SLAB_DEBUG_INITIAL) && !ctor) {
645 printk("%sNo con, but init state check
requested - %s\n", func_nm, name);
646 flags &= ~SLAB_DEBUG_INITIAL;
647 }
648
649 if ((flags & SLAB_POISON) && ctor) {
651 printk("%sPoisoning requested, but con given - %s\n",
func_nm, name);
652 flags &= ~SLAB_POISON;
653 }
654 #if FORCED_DEBUG
655 if ((size < (PAGE_SIZE>>3)) &&
!(flags & SLAB_MUST_HWCACHE_ALIGN))
660 flags |= SLAB_RED_ZONE;
661 if (!ctor)
662 flags |= SLAB_POISON;
663 #endif
664 #endif
670 BUG_ON(flags & ~CREATE_MASK);
This block performs debugging checks if CONFIG_SLAB_DEBUG is set
673 cachep =
(kmem_cache_t *) kmem_cache_alloc(&cache_cache,
SLAB_KERNEL);
674 if (!cachep)
675 goto opps;
676 memset(cachep, 0, sizeof(kmem_cache_t));
Allocate a kmem_cache_t from the cache_cache slab cache.
682 if (size & (BYTES_PER_WORD-1)) {
683 size += (BYTES_PER_WORD-1);
684 size &= ~(BYTES_PER_WORD-1);
685 printk("%sForcing size word alignment
- %s\n", func_nm, name);
686 }
687
688 #if DEBUG
689 if (flags & SLAB_RED_ZONE) {
694 flags &= ~SLAB_HWCACHE_ALIGN;
695 size += 2*BYTES_PER_WORD;
696 }
697 #endif
698 align = BYTES_PER_WORD;
699 if (flags & SLAB_HWCACHE_ALIGN)
700 align = L1_CACHE_BYTES;
701
703 if (size >= (PAGE_SIZE>>3))
708 flags |= CFLGS_OFF_SLAB;
709
710 if (flags & SLAB_HWCACHE_ALIGN) {
714 while (size < align/2)
715 align /= 2;
716 size = (size+align-1)&(~(align-1));
717 }
Align the object size to the word size of the architecture and, if SLAB_HWCACHE_ALIGN was requested, to the hardware cache. The while loop halves the alignment while the object is small enough for several objects to be packed into one cache line; for example, with 32-byte cache lines an 8-byte object is aligned to 16 bytes so that two objects share a line.
724 do {
725 unsigned int break_flag = 0;
726 cal_wastage:
727 kmem_cache_estimate(cachep->gfporder,
size, flags,
728 &left_over,
&cachep->num);
729 if (break_flag)
730 break;
731 if (cachep->gfporder >= MAX_GFP_ORDER)
732 break;
733 if (!cachep->num)
734 goto next;
735 if (flags & CFLGS_OFF_SLAB &&
cachep->num > offslab_limit) {
737 cachep->gfporder--;
738 break_flag++;
739 goto cal_wastage;
740 }
741
746 if (cachep->gfporder >= slab_break_gfp_order)
747 break;
748
749 if ((left_over*8) <= (PAGE_SIZE<<cachep->gfporder))
750 break;
751 next:
752 cachep->gfporder++;
753 } while (1);
754
755 if (!cachep->num) {
756 printk("kmem_cache_create: couldn't
create cache %s.\n", name);
757 kmem_cache_free(&cache_cache, cachep);
758 cachep = NULL;
759 goto opps;
760 }
Calculate how many objects will fit on a slab and adjust the slab size as necessary
761 slab_size = L1_CACHE_ALIGN(
cachep->num*sizeof(kmem_bufctl_t) +
sizeof(slab_t));
762
767 if (flags & CFLGS_OFF_SLAB && left_over >= slab_size) {
768 flags &= ~CFLGS_OFF_SLAB;
769 left_over -= slab_size;
770 }
Align the slab descriptor size to the L1 cache and, if there is enough space left over, keep the slab descriptor on-slab by clearing CFLGS_OFF_SLAB
773 offset += (align-1);
774 offset &= ~(align-1);
775 if (!offset)
776 offset = L1_CACHE_BYTES;
777 cachep->colour_off = offset;
778 cachep->colour = left_over/offset;
Calculate colour offsets. The requested offset is rounded up to the alignment and colour is the number of different offsets that fit in the left-over space; for example, 100 bytes left over with a 32-byte colour_off allows three colours.
781 if (!cachep->gfporder && !(flags & CFLGS_OFF_SLAB))
782 flags |= CFLGS_OPTIMIZE;
783
784 cachep->flags = flags;
785 cachep->gfpflags = 0;
786 if (flags & SLAB_CACHE_DMA)
787 cachep->gfpflags |= GFP_DMA;
788 spin_lock_init(&cachep->spinlock);
789 cachep->objsize = size;
790 INIT_LIST_HEAD(&cachep->slabs_full);
791 INIT_LIST_HEAD(&cachep->slabs_partial);
792 INIT_LIST_HEAD(&cachep->slabs_free);
793
794 if (flags & CFLGS_OFF_SLAB)
795 cachep->slabp_cache =
kmem_find_general_cachep(slab_size,0);
796 cachep->ctor = ctor;
797 cachep->dtor = dtor;
799 strcpy(cachep->name, name);
800
801 #ifdef CONFIG_SMP
802 if (g_cpucache_up)
803 enable_cpucache(cachep);
804 #endif
Initialise remaining fields in cache descriptor
806 down(&cache_chain_sem);
807 {
808 struct list_head *p;
809
810 list_for_each(p, &cache_chain) {
811 kmem_cache_t *pc = list_entry(p,
kmem_cache_t, next);
812
814 if (!strcmp(pc->name, name))
815 BUG();
816 }
817 }
818
822 list_add(&cachep->next, &cache_chain);
823 up(&cache_chain_sem);
824 opps:
825 return cachep;
826 }
Check that no cache of the same name already exists and add the new cache to the cache chain
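To put the interface in context, the following is a minimal sketch of how a 2.4-era driver might create a cache with this function and allocate objects from it. The object type, cache name and error handling are hypothetical; only the slab calls themselves are taken from the code commented in this section.

#include <linux/slab.h>
#include <linux/errno.h>

struct my_object {                      /* hypothetical object type */
        int id;
        char data[60];
};

static kmem_cache_t *my_cachep;

static int my_cache_setup(void)
{
        /* No constructor or destructor, objects aligned to the L1 cache */
        my_cachep = kmem_cache_create("my_object_cache",
                                      sizeof(struct my_object), 0,
                                      SLAB_HWCACHE_ALIGN, NULL, NULL);
        return my_cachep ? 0 : -ENOMEM;
}

static struct my_object *my_object_alloc(void)
{
        /* SLAB_KERNEL: the allocation may sleep while a new slab is grown */
        return kmem_cache_alloc(my_cachep, SLAB_KERNEL);
}

static void my_object_free(struct my_object *obj)
{
        kmem_cache_free(my_cachep, obj);
}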
During cache creation, it is determined how many objects can be stored in a slab and how much wastage there will be. The following function calculates how many objects may be stored, taking into account whether the slab descriptor and the bufctls must be stored on-slab.
388 static void kmem_cache_estimate (unsigned long gfporder,
size_t size,
389 int flags, size_t *left_over, unsigned int *num)
390 {
391 int i;
392 size_t wastage = PAGE_SIZE<<gfporder;
393 size_t extra = 0;
394 size_t base = 0;
395
396 if (!(flags & CFLGS_OFF_SLAB)) {
397 base = sizeof(slab_t);
398 extra = sizeof(kmem_bufctl_t);
399 }
400 i = 0;
401 while (i*size + L1_CACHE_ALIGN(base+i*extra) <= wastage)
402 i++;
403 if (i > 0)
404 i--;
405
406 if (i > SLAB_LIMIT)
407 i = SLAB_LIMIT;
408
409 *num = i;
410 wastage -= i*size;
411 wastage -= L1_CACHE_ALIGN(base+i*extra);
412 *left_over = wastage;
413 }
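As a worked example of this calculation, the following user-space sketch runs the same loop with assumed values: a 4KiB page (gfporder 0), 100-byte objects with on-slab management, sizeof(slab_t) taken as 32 bytes, sizeof(kmem_bufctl_t) as 4 bytes and 32-byte L1 cache lines. The sizes are illustrative rather than those of any particular kernel build.

#include <stdio.h>

#define L1_ALIGN(x) (((x) + 31UL) & ~31UL)   /* assumed 32-byte cache lines */

int main(void)
{
        unsigned long wastage = 4096;   /* PAGE_SIZE << gfporder */
        unsigned long size = 100;       /* object size */
        unsigned long base = 32;        /* assumed sizeof(slab_t), on-slab */
        unsigned long extra = 4;        /* assumed sizeof(kmem_bufctl_t) */
        unsigned long i = 0;

        while (i * size + L1_ALIGN(base + i * extra) <= wastage)
                i++;
        if (i > 0)
                i--;

        wastage -= i * size;
        wastage -= L1_ALIGN(base + i * extra);

        /* Prints: num = 39 objects, left_over = 4 bytes */
        printf("num = %lu objects, left_over = %lu bytes\n", i, wastage);
        return 0;
}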
The call graph for kmem_cache_shrink() is shown in Figure 8.5. Two varieties of shrink functions are provided. kmem_cache_shrink() removes all slabs from slabs_free and returns the number of pages freed as a result. __kmem_cache_shrink() frees all slabs from slabs_free and then verifies that slabs_partial and slabs_full are empty. This is important during cache destruction when it doesn't matter how many pages are freed, just that the cache is empty.
This function performs basic debugging checks and then acquires the cache descriptor lock before freeing slabs. At one time, it also used to call drain_cpu_caches() to free up objects in the per-CPU caches. It is curious that this was removed because it is possible slabs could not be freed due to an object being held in a per-CPU cache while not actually in use.
966 int kmem_cache_shrink(kmem_cache_t *cachep)
967 {
968 int ret;
969
970 if (!cachep || in_interrupt() ||
!is_chained_kmem_cache(cachep))
971 BUG();
972
973 spin_lock_irq(&cachep->spinlock);
974 ret = __kmem_cache_shrink_locked(cachep);
975 spin_unlock_irq(&cachep->spinlock);
976
977 return ret << cachep->gfporder;
978 }
This function is identical to kmem_cache_shrink() except that it reports whether the cache still contains objects, rather than the number of pages freed. This is important during cache destruction, when it is not important how much memory was freed, just that it is safe to delete the cache and not leak memory.
945 static int __kmem_cache_shrink(kmem_cache_t *cachep)
946 {
947 int ret;
948
949 drain_cpu_caches(cachep);
950
951 spin_lock_irq(&cachep->spinlock);
952 __kmem_cache_shrink_locked(cachep);
953 ret = !list_empty(&cachep->slabs_full) ||
954 !list_empty(&cachep->slabs_partial);
955 spin_unlock_irq(&cachep->spinlock);
956 return ret;
957 }
This does the dirty work of freeing slabs. It keeps destroying them until the growing flag gets set, indicating the cache is in use, or until there are no more slabs left in slabs_free.
917 static int __kmem_cache_shrink_locked(kmem_cache_t *cachep)
918 {
919 slab_t *slabp;
920 int ret = 0;
921
923 while (!cachep->growing) {
924 struct list_head *p;
925
926 p = cachep->slabs_free.prev;
927 if (p == &cachep->slabs_free)
928 break;
929
930 slabp = list_entry(cachep->slabs_free.prev,
slab_t, list);
931 #if DEBUG
932 if (slabp->inuse)
933 BUG();
934 #endif
935 list_del(&slabp->list);
936
937 spin_unlock_irq(&cachep->spinlock);
938 kmem_slab_destroy(cachep, slabp);
939 ret++;
940 spin_lock_irq(&cachep->spinlock);
941 }
942 return ret;
943 }
When a module is unloaded, it is responsible for destroying any cache it has created because, during module loading, it is ensured that no two caches have the same name. Core kernel code often does not destroy its caches as their existence persists for the life of the system. The steps taken to destroy a cache are: delete it from the cache chain, shrink it to delete all slabs, free any per-CPU caches with kfree() and finally free the cache descriptor back to the cache_cache.
The call graph for this function is shown in Figure 8.7.
997 int kmem_cache_destroy (kmem_cache_t * cachep)
998 {
999 if (!cachep || in_interrupt() || cachep->growing)
1000 BUG();
1001
1002 /* Find the cache in the chain of caches. */
1003 down(&cache_chain_sem);
1004 /* the chain is never empty, cache_cache is never destroyed */
1005 if (clock_searchp == cachep)
1006 clock_searchp = list_entry(cachep->next.next,
1007 kmem_cache_t, next);
1008 list_del(&cachep->next);
1009 up(&cache_chain_sem);
1010
1011 if (__kmem_cache_shrink(cachep)) {
1012 printk(KERN_ERR
"kmem_cache_destroy: Can't free all objects %p\n",
1013 cachep);
1014 down(&cache_chain_sem);
1015 list_add(&cachep->next,&cache_chain);
1016 up(&cache_chain_sem);
1017 return 1;
1018 }
1019 #ifdef CONFIG_SMP
1020 {
1021 int i;
1022 for (i = 0; i < NR_CPUS; i++)
1023 kfree(cachep->cpudata[i]);
1024 }
1025 #endif
1026 kmem_cache_free(&cache_cache, cachep);
1027
1028 return 0;
1029 }
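A sketch of the unload path described above, assuming the module created my_cachep at load time with kmem_cache_create(); the module and cache names are hypothetical.

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/slab.h>

extern kmem_cache_t *my_cachep;         /* created at module init */

static void __exit my_module_exit(void)
{
        /* Every object must already have been freed back to the cache */
        if (kmem_cache_destroy(my_cachep))
                printk(KERN_ERR "my_module: cache still has active objects\n");
}

module_exit(my_module_exit);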
The call graph for this function is shown in Figure 8.4. Because of the size of this function, it will be broken up into three separate sections. The first is the simple function preamble, the second is the selection of a cache to reap and the third is the freeing of the slabs. The basic tasks were described in Section 8.1.7.
1738 int kmem_cache_reap (int gfp_mask)
1739 {
1740 slab_t *slabp;
1741 kmem_cache_t *searchp;
1742 kmem_cache_t *best_cachep;
1743 unsigned int best_pages;
1744 unsigned int best_len;
1745 unsigned int scan;
1746 int ret = 0;
1747
1748 if (gfp_mask & __GFP_WAIT)
1749 down(&cache_chain_sem);
1750 else
1751 if (down_trylock(&cache_chain_sem))
1752 return 0;
1753
1754 scan = REAP_SCANLEN;
1755 best_len = 0;
1756 best_pages = 0;
1757 best_cachep = NULL;
1758 searchp = clock_searchp;
1759 do {
1760 unsigned int pages;
1761 struct list_head* p;
1762 unsigned int full_free;
1763
1765 if (searchp->flags & SLAB_NO_REAP)
1766 goto next;
1767 spin_lock_irq(&searchp->spinlock);
1768 if (searchp->growing)
1769 goto next_unlock;
1770 if (searchp->dflags & DFLGS_GROWN) {
1771 searchp->dflags &= ~DFLGS_GROWN;
1772 goto next_unlock;
1773 }
1774 #ifdef CONFIG_SMP
1775 {
1776 cpucache_t *cc = cc_data(searchp);
1777 if (cc && cc->avail) {
1778 __free_block(searchp, cc_entry(cc),
cc->avail);
1779 cc->avail = 0;
1780 }
1781 }
1782 #endif
1783
1784 full_free = 0;
1785 p = searchp->slabs_free.next;
1786 while (p != &searchp->slabs_free) {
1787 slabp = list_entry(p, slab_t, list);
1788 #if DEBUG
1789 if (slabp->inuse)
1790 BUG();
1791 #endif
1792 full_free++;
1793 p = p->next;
1794 }
1795
1801 pages = full_free * (1<<searchp->gfporder);
1802 if (searchp->ctor)
1803 pages = (pages*4+1)/5;
1804 if (searchp->gfporder)
1805 pages = (pages*4+1)/5;
1806 if (pages > best_pages) {
1807 best_cachep = searchp;
1808 best_len = full_free;
1809 best_pages = pages;
1810 if (pages >= REAP_PERFECT) {
1811 clock_searchp =
list_entry(searchp->next.next,
1812 kmem_cache_t,next);
1813 goto perfect;
1814 }
1815 }
1816 next_unlock:
1817 spin_unlock_irq(&searchp->spinlock);
1818 next:
1819 searchp =
list_entry(searchp->next.next,kmem_cache_t,next);
1820 } while (--scan && searchp != clock_searchp);
This block examines REAP_SCANLEN caches to select one to free. The count of freeable pages is discounted for caches with constructors and for caches using high-order allocations; for example, 10 free order-1 slabs count as 20 pages, reduced first to 16 and then to 13 by the two adjustments.
1822 clock_searchp = searchp;
1823
1824 if (!best_cachep)
1826 goto out;
1827
1828 spin_lock_irq(&best_cachep->spinlock);
1829 perfect:
1830 /* free only 50% of the free slabs */
1831 best_len = (best_len + 1)/2;
1832 for (scan = 0; scan < best_len; scan++) {
1833 struct list_head *p;
1834
1835 if (best_cachep->growing)
1836 break;
1837 p = best_cachep->slabs_free.prev;
1838 if (p == &best_cachep->slabs_free)
1839 break;
1840 slabp = list_entry(p,slab_t,list);
1841 #if DEBUG
1842 if (slabp->inuse)
1843 BUG();
1844 #endif
1845 list_del(&slabp->list);
1846 STATS_INC_REAPED(best_cachep);
1847
1848 /* Safe to drop the lock. The slab is no longer
1849 * linked to the cache.
1850 */
1851 spin_unlock_irq(&best_cachep->spinlock);
1852 kmem_slab_destroy(best_cachep, slabp);
1853 spin_lock_irq(&best_cachep->spinlock);
1854 }
1855 spin_unlock_irq(&best_cachep->spinlock);
1856 ret = scan * (1 << best_cachep->gfporder);
1857 out:
1858 up(&cache_chain_sem);
1859 return ret;
1860 }
This block will free half of the slabs from the selected cache
This function will either allocate space to keep the slab descriptor off-slab or reserve enough space at the beginning of the slab for the descriptor and the bufctls.
1032 static inline slab_t * kmem_cache_slabmgmt (
kmem_cache_t *cachep,
1033 void *objp,
int colour_off,
int local_flags)
1034 {
1035 slab_t *slabp;
1036
1037 if (OFF_SLAB(cachep)) {
1039 slabp = kmem_cache_alloc(cachep->slabp_cache,
local_flags);
1040 if (!slabp)
1041 return NULL;
1042 } else {
1047 slabp = objp+colour_off;
1048 colour_off += L1_CACHE_ALIGN(cachep->num *
1049 sizeof(kmem_bufctl_t) +
sizeof(slab_t));
1050 }
1051 slabp->inuse = 0;
1052 slabp->colouroff = colour_off;
1053 slabp->s_mem = objp+colour_off;
1054
1055 return slabp;
1056 }
If the slab descriptor is to be kept off-slab, this function, called during cache creation, will find the appropriate sizes cache to use; a pointer to it is stored in the cache descriptor field slabp_cache. For example, a slab descriptor requiring 600 bytes would be allocated from the size-1024 cache, assuming the usual power-of-two geometry of the sizes caches.
1620 kmem_cache_t * kmem_find_general_cachep (size_t size,
int gfpflags)
1621 {
1622 cache_sizes_t *csizep = cache_sizes;
1623
1628 for ( ; csizep->cs_size; csizep++) {
1629 if (size > csizep->cs_size)
1630 continue;
1631 break;
1632 }
1633 return (gfpflags & GFP_DMA) ? csizep->cs_dmacachep :
csizep->cs_cachep;
1634 }
The call graph for this function is shown in Figure 8.11. The basic tasks for this function are: perform basic sanity checks to guard against bad usage; calculate the colour offset for objects in this slab; allocate memory for the slab and acquire a slab descriptor; link the pages used for the slab to the slab and cache descriptors; initialise the objects in the slab; and add the slab to the cache.
1105 static int kmem_cache_grow (kmem_cache_t * cachep, int flags)
1106 {
1107 slab_t *slabp;
1108 struct page *page;
1109 void *objp;
1110 size_t offset;
1111 unsigned int i, local_flags;
1112 unsigned long ctor_flags;
1113 unsigned long save_flags;
Basic declarations. The parameters of the function are cachep, the cache to grow by one slab, and flags, the flags used for the allocation.
1118 if (flags & ~(SLAB_DMA|SLAB_LEVEL_MASK|SLAB_NO_GROW))
1119 BUG();
1120 if (flags & SLAB_NO_GROW)
1121 return 0;
1122
1129 if (in_interrupt() &&
(flags & SLAB_LEVEL_MASK) != SLAB_ATOMIC)
1130 BUG();
1131
1132 ctor_flags = SLAB_CTOR_CONSTRUCTOR;
1133 local_flags = (flags & SLAB_LEVEL_MASK);
1134 if (local_flags == SLAB_ATOMIC)
1139 ctor_flags |= SLAB_CTOR_ATOMIC;
Perform basic sanity checks to guard against bad usage. The checks are made here rather than in kmem_cache_alloc() to protect the speed-critical path. There is no point checking the flags every time an object needs to be allocated.
1142 spin_lock_irqsave(&cachep->spinlock, save_flags);
1143
1145 offset = cachep->colour_next;
1146 cachep->colour_next++;
1147 if (cachep->colour_next >= cachep->colour)
1148 cachep->colour_next = 0;
1149 offset *= cachep->colour_off;
1150 cachep->dflags |= DFLGS_GROWN;
1151
1152 cachep->growing++;
1153 spin_unlock_irqrestore(&cachep->spinlock, save_flags);
Calculate the colour offset for objects in this slab. colour_next cycles from 0 up to colour - 1 so that successive slabs start their objects at different offsets within the hardware cache.
1165 if (!(objp = kmem_getpages(cachep, flags)))
1166 goto failed;
1167
1169 if (!(slabp = kmem_cache_slabmgmt(cachep,
objp, offset,
local_flags)))
1170 goto opps1;
Allocate memory for the slab and acquire a slab descriptor
1173 i = 1 << cachep->gfporder;
1174 page = virt_to_page(objp);
1175 do {
1176 SET_PAGE_CACHE(page, cachep);
1177 SET_PAGE_SLAB(page, slabp);
1178 PageSetSlab(page);
1179 page++;
1180 } while (--i);
Link the pages used for the slab to the slab and cache descriptors
1182 kmem_cache_init_objs(cachep, slabp, ctor_flags);
1184 spin_lock_irqsave(&cachep->spinlock, save_flags);
1185 cachep->growing--;
1186
1188 list_add_tail(&slabp->list, &cachep->slabs_free);
1189 STATS_INC_GROWN(cachep);
1190 cachep->failures = 0;
1191
1192 spin_unlock_irqrestore(&cachep->spinlock, save_flags);
1193 return 1;
Add the slab to the cache
1194 opps1:
1195 kmem_freepages(cachep, objp);
1196 failed:
1197 spin_lock_irqsave(&cachep->spinlock, save_flags);
1198 cachep->growing--;
1199 spin_unlock_irqrestore(&cachep->spinlock, save_flags);
1200 return 0;
1201 }
Error handling
The call graph for this function is shown in Figure 8.13. For readability, the debugging sections have been omitted from this function, but they are almost identical to the debugging sections during object allocation. See Section H.3.1.1 for how the markers and poison pattern are checked.
555 static void kmem_slab_destroy (kmem_cache_t *cachep, slab_t *slabp)
556 {
557 if (cachep->dtor
561 ) {
562 int i;
563 for (i = 0; i < cachep->num; i++) {
564 void* objp = slabp->s_mem+cachep->objsize*i;
565-574 DEBUG: Check red zone markers
575 if (cachep->dtor)
576 (cachep->dtor)(objp, cachep, 0);
577-584 DEBUG: Check poison pattern
585 }
586 }
587
588 kmem_freepages(cachep, slabp->s_mem-slabp->colouroff);
589 if (OFF_SLAB(cachep))
590 kmem_cache_free(cachep->slabp_cache, slabp);
591 }
This section will cover how objects are managed. At this point, most of the real hard work has been completed by either the cache or slab managers.
The vast majority of this function is involved with debugging, so we will start with the function without the debugging and explain that in detail before handling the debugging part. The two debugging sections are marked in the code excerpt below as Part 1 and Part 2.
1058 static inline void kmem_cache_init_objs (kmem_cache_t * cachep,
1059 slab_t * slabp, unsigned long ctor_flags)
1060 {
1061 int i;
1062
1063 for (i = 0; i < cachep->num; i++) {
1064 void* objp = slabp->s_mem+cachep->objsize*i;
1065-1072 /* Debugging Part 1 */
1079 if (cachep->ctor)
1080 cachep->ctor(objp, cachep, ctor_flags);
1081-1094 /* Debugging Part 2 */
1095 slab_bufctl(slabp)[i] = i+1;
1096 }
1097 slab_bufctl(slabp)[i-1] = BUFCTL_END;
1098 slabp->free = 0;
1099 }
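To make the bufctl handling concrete, the following user-space sketch models the free list built by the loop above for a slab of four objects. BUFCTL_END is assumed to be an all-ones end-of-list marker; the array entries form a singly linked list of free object indices and slabp->free holds the head.

#include <stdio.h>

#define BUFCTL_END 0xffffffffU          /* assumed end-of-list marker */
#define NUM        4                    /* objects per slab in this example */

int main(void)
{
        unsigned int bufctl[NUM];
        unsigned int free, i;

        /* Mirror of the initialisation loop: each entry points to the next */
        for (i = 0; i < NUM; i++)
                bufctl[i] = i + 1;
        bufctl[NUM - 1] = BUFCTL_END;
        free = 0;                       /* slabp->free: first free object */

        /* Allocation follows the list, as kmem_cache_alloc_one_tail() does */
        while (free != BUFCTL_END) {
                printf("allocated object %u\n", free);
                free = bufctl[free];
        }
        return 0;
}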
That covers the core of initialising objects. Next, the first debugging part will be covered.
1065 #if DEBUG
1066 if (cachep->flags & SLAB_RED_ZONE) {
1067 *((unsigned long*)(objp)) = RED_MAGIC1;
1068 *((unsigned long*)(objp + cachep->objsize -
1069 BYTES_PER_WORD)) = RED_MAGIC1;
1070 objp += BYTES_PER_WORD;
1071 }
1072 #endif
1081 #if DEBUG
1082 if (cachep->flags & SLAB_RED_ZONE)
1083 objp -= BYTES_PER_WORD;
1084 if (cachep->flags & SLAB_POISON)
1086 kmem_poison_obj(cachep, objp);
1087 if (cachep->flags & SLAB_RED_ZONE) {
1088 if (*((unsigned long*)(objp)) != RED_MAGIC1)
1089 BUG();
1090 if (*((unsigned long*)(objp + cachep->objsize -
1091 BYTES_PER_WORD)) != RED_MAGIC1)
1092 BUG();
1093 }
1094 #endif
This is the debugging block that takes place after the constructor, if it exists, has been called.
The call graph for this function is shown in Figure 8.14. This trivial function simply calls __kmem_cache_alloc().
1529 void * kmem_cache_alloc (kmem_cache_t *cachep, int flags)
1531 {
1532 return __kmem_cache_alloc(cachep, flags);
1533 }
This shows the parts of the function specific to the UP case. The SMP case will be dealt with in the next section.
1338 static inline void * __kmem_cache_alloc (kmem_cache_t *cachep,
int flags)
1339 {
1340 unsigned long save_flags;
1341 void* objp;
1342
1343 kmem_cache_alloc_head(cachep, flags);
1344 try_again:
1345 local_irq_save(save_flags);
1367 objp = kmem_cache_alloc_one(cachep);
1369 local_irq_restore(save_flags);
1370 return objp;
1371 alloc_new_slab:
1376 local_irq_restore(save_flags);
1377 if (kmem_cache_grow(cachep, flags))
1381 goto try_again;
1382 return NULL;
1383 }
This is what the function looks like in the SMP case.
1338 static inline void * __kmem_cache_alloc (kmem_cache_t *cachep,
int flags)
1339 {
1340 unsigned long save_flags;
1341 void* objp;
1342
1343 kmem_cache_alloc_head(cachep, flags);
1344 try_again:
1345 local_irq_save(save_flags);
1347 {
1348 cpucache_t *cc = cc_data(cachep);
1349
1350 if (cc) {
1351 if (cc->avail) {
1352 STATS_INC_ALLOCHIT(cachep);
1353 objp = cc_entry(cc)[--cc->avail];
1354 } else {
1355 STATS_INC_ALLOCMISS(cachep);
1356 objp =
kmem_cache_alloc_batch(cachep,cc,flags);
1357 if (!objp)
1358 goto alloc_new_slab_nolock;
1359 }
1360 } else {
1361 spin_lock(&cachep->spinlock);
1362 objp = kmem_cache_alloc_one(cachep);
1363 spin_unlock(&cachep->spinlock);
1364 }
1365 }
1366 local_irq_restore(save_flags);
1370 return objp;
1371 alloc_new_slab:
1373 spin_unlock(&cachep->spinlock);
1374 alloc_new_slab_nolock:
1375 local_irq_restore(save_flags);
1377 if (kmem_cache_grow(cachep, flags))
1381 goto try_again;
1382 return NULL;
1383 }
This simple function ensures the right combination of slab and GFP flags are used for allocation from a slab. If a cache is for DMA use, this function will make sure the caller does not accidentally request normal memory, and vice versa.
1231 static inline void kmem_cache_alloc_head(kmem_cache_t *cachep,
int flags)
1232 {
1233 if (flags & SLAB_DMA) {
1234 if (!(cachep->gfpflags & GFP_DMA))
1235 BUG();
1236 } else {
1237 if (cachep->gfpflags & GFP_DMA)
1238 BUG();
1239 }
1240 }
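As an illustration of the pairing being checked, the following sketch creates a DMA cache and allocates from it with SLAB_DMA set; the cache name and object size are made up. Passing plain SLAB_KERNEL to this cache, or SLAB_DMA to an ordinary cache, would trigger the BUG() above.

#include <linux/slab.h>
#include <linux/errno.h>

static kmem_cache_t *dma_cachep;

static int dma_pool_init(void)
{
        /* SLAB_CACHE_DMA means the cache is backed by ZONE_DMA pages,
         * so cachep->gfpflags has GFP_DMA set at creation time */
        dma_cachep = kmem_cache_create("dma_buffers", 512, 0,
                                       SLAB_CACHE_DMA, NULL, NULL);
        return dma_cachep ? 0 : -ENOMEM;
}

static void *dma_buffer_alloc(void)
{
        /* The allocation flags must also carry SLAB_DMA to satisfy
         * kmem_cache_alloc_head() */
        return kmem_cache_alloc(dma_cachep, SLAB_KERNEL | SLAB_DMA);
}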
This is a preprocessor macro. It may seem strange not to make this an inline function, but the macro contains a goto to the label alloc_new_slab, which is declared in the caller __kmem_cache_alloc(). This only works because the preprocessor expands the code textually into the caller (see Section H.3.2.2). A sketch of the technique follows the macro.
1283 #define kmem_cache_alloc_one(cachep) \
1284 ({ \
1285 struct list_head * slabs_partial, * entry; \
1286 slab_t *slabp; \
1287 \
1288 slabs_partial = &(cachep)->slabs_partial; \
1289 entry = slabs_partial->next; \
1290 if (unlikely(entry == slabs_partial)) { \
1291 struct list_head * slabs_free; \
1292 slabs_free = &(cachep)->slabs_free; \
1293 entry = slabs_free->next; \
1294 if (unlikely(entry == slabs_free)) \
1295 goto alloc_new_slab; \
1296 list_del(entry); \
1297 list_add(entry, slabs_partial); \
1298 } \
1299 \
1300 slabp = list_entry(entry, slab_t, list); \
1301 kmem_cache_alloc_one_tail(cachep, slabp); \
1302 })
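To illustrate why a macro is required, the following user-space sketch (using the same GCC statement-expression extension as the kernel) contains a goto whose target label lives in the calling function, which is exactly what kmem_cache_alloc_one() relies on and what an inline function could not do. The names take_one, refill and try_again are invented for the illustration.

#include <stdio.h>

#define take_one(counter)                                         \
({                                                                \
        if ((counter) == 0)                                       \
                goto refill;            /* label in the caller */ \
        --(counter);                                              \
})

int main(void)
{
        int available = 0;

try_again:
        take_one(available);
        printf("got one, %d left\n", available);
        return 0;

refill:
        available = 3;                  /* pretend a new slab was grown */
        goto try_again;
}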
This function is responsible for the allocation of one object from a slab. Much of it is debugging code.
1242 static inline void * kmem_cache_alloc_one_tail (
kmem_cache_t *cachep,
1243 slab_t *slabp)
1244 {
1245 void *objp;
1246
1247 STATS_INC_ALLOCED(cachep);
1248 STATS_INC_ACTIVE(cachep);
1249 STATS_SET_HIGH(cachep);
1250
1252 slabp->inuse++;
1253 objp = slabp->s_mem + slabp->free*cachep->objsize;
1254 slabp->free=slab_bufctl(slabp)[slabp->free];
1255
1256 if (unlikely(slabp->free == BUFCTL_END)) {
1257 list_del(&slabp->list);
1258 list_add(&slabp->list, &cachep->slabs_full);
1259 }
1260 #if DEBUG
1261 if (cachep->flags & SLAB_POISON)
1262 if (kmem_check_poison_obj(cachep, objp))
1263 BUG();
1264 if (cachep->flags & SLAB_RED_ZONE) {
1266 if (xchg((unsigned long *)objp, RED_MAGIC2) !=
1267 RED_MAGIC1)
1268 BUG();
1269 if (xchg((unsigned long *)(objp+cachep->objsize -
1270 BYTES_PER_WORD), RED_MAGIC2) != RED_MAGIC1)
1271 BUG();
1272 objp += BYTES_PER_WORD;
1273 }
1274 #endif
1275 return objp;
1276 }
This function allocates a batch of objects to a CPU cache of objects. It is only used in the SMP case. In many ways it is very similar to kmem_cache_alloc_one() (see Section H.3.2.5).
1305 void* kmem_cache_alloc_batch(kmem_cache_t* cachep,
cpucache_t* cc, int flags)
1306 {
1307 int batchcount = cachep->batchcount;
1308
1309 spin_lock(&cachep->spinlock);
1310 while (batchcount--) {
1311 struct list_head * slabs_partial, * entry;
1312 slab_t *slabp;
1313 /* Get slab alloc is to come from. */
1314 slabs_partial = &(cachep)->slabs_partial;
1315 entry = slabs_partial->next;
1316 if (unlikely(entry == slabs_partial)) {
1317 struct list_head * slabs_free;
1318 slabs_free = &(cachep)->slabs_free;
1319 entry = slabs_free->next;
1320 if (unlikely(entry == slabs_free))
1321 break;
1322 list_del(entry);
1323 list_add(entry, slabs_partial);
1324 }
1325
1326 slabp = list_entry(entry, slab_t, list);
1327 cc_entry(cc)[cc->avail++] =
1328 kmem_cache_alloc_one_tail(cachep, slabp);
1329 }
1330 spin_unlock(&cachep->spinlock);
1331
1332 if (cc->avail)
1333 return cc_entry(cc)[--cc->avail];
1334 return NULL;
1335 }
The call graph for this function is shown in Figure 8.15.
1576 void kmem_cache_free (kmem_cache_t *cachep, void *objp)
1577 {
1578 unsigned long flags;
1579 #if DEBUG
1580 CHECK_PAGE(virt_to_page(objp));
1581 if (cachep != GET_PAGE_CACHE(virt_to_page(objp)))
1582 BUG();
1583 #endif
1584
1585 local_irq_save(flags);
1586 __kmem_cache_free(cachep, objp);
1587 local_irq_restore(flags);
1588 }
This covers what the function looks like in the UP case. Clearly, it simply releases the object to the slab.
1493 static inline void __kmem_cache_free (kmem_cache_t *cachep,
void* objp)
1494 {
1517 kmem_cache_free_one(cachep, objp);
1519 }
This case is slightly more interesting. The object is returned to the per-CPU cache if one exists; if the per-CPU cache is already full, a batch of objects is first freed back to the slabs with free_block() to make room.
1493 static inline void __kmem_cache_free (kmem_cache_t *cachep,
void* objp)
1494 {
1496 cpucache_t *cc = cc_data(cachep);
1497
1498 CHECK_PAGE(virt_to_page(objp));
1499 if (cc) {
1500 int batchcount;
1501 if (cc->avail < cc->limit) {
1502 STATS_INC_FREEHIT(cachep);
1503 cc_entry(cc)[cc->avail++] = objp;
1504 return;
1505 }
1506 STATS_INC_FREEMISS(cachep);
1507 batchcount = cachep->batchcount;
1508 cc->avail -= batchcount;
1509 free_block(cachep,
1510 &cc_entry(cc)[cc->avail],batchcount);
1511 cc_entry(cc)[cc->avail++] = objp;
1512 return;
1513 } else {
1514 free_block(cachep, &objp, 1);
1515 }
1519 }
1414 static inline void kmem_cache_free_one(kmem_cache_t *cachep,
void *objp)
1415 {
1416 slab_t* slabp;
1417
1418 CHECK_PAGE(virt_to_page(objp));
1425 slabp = GET_PAGE_SLAB(virt_to_page(objp));
1426
1427 #if DEBUG
1428 if (cachep->flags & SLAB_DEBUG_INITIAL)
1433 cachep->ctor(objp, cachep,
SLAB_CTOR_CONSTRUCTOR|SLAB_CTOR_VERIFY);
1434
1435 if (cachep->flags & SLAB_RED_ZONE) {
1436 objp -= BYTES_PER_WORD;
1437 if (xchg((unsigned long *)objp, RED_MAGIC1) !=
RED_MAGIC2)
1438 BUG();
1440 if (xchg((unsigned long *)(objp+cachep->objsize -
1441 BYTES_PER_WORD), RED_MAGIC1) !=
RED_MAGIC2)
1443 BUG();
1444 }
1445 if (cachep->flags & SLAB_POISON)
1446 kmem_poison_obj(cachep, objp);
1447 if (kmem_extra_free_checks(cachep, slabp, objp))
1448 return;
1449 #endif
1450 {
1451 unsigned int objnr = (objp-slabp->s_mem)/cachep->objsize;
1452
1453 slab_bufctl(slabp)[objnr] = slabp->free;
1454 slabp->free = objnr;
1455 }
1456 STATS_DEC_ACTIVE(cachep);
1457
1459 {
1460 int inuse = slabp->inuse;
1461 if (unlikely(!--slabp->inuse)) {
1462 /* Was partial or full, now empty. */
1463 list_del(&slabp->list);
1464 list_add(&slabp->list, &cachep->slabs_free);
1465 } else if (unlikely(inuse == cachep->num)) {
1466 /* Was full. */
1467 list_del(&slabp->list);
1468 list_add(&slabp->list, &cachep->slabs_partial);
1469 }
1470 }
1471 }
This function is only used in the SMP case when the per-CPU cache gets too full. It is used to free a batch of objects in bulk.
1481 static void free_block (kmem_cache_t* cachep, void** objpp,
int len)
1482 {
1483 spin_lock(&cachep->spinlock);
1484 __free_block(cachep, objpp, len);
1485 spin_unlock(&cachep->spinlock);
1486 }
This function is responsible for freeing each of the objects in the per-CPU array objpp.
1474 static inline void __free_block (kmem_cache_t* cachep,
1475 void** objpp, int len)
1476 {
1477 for ( ; len > 0; len--, objpp++)
1478 kmem_cache_free_one(cachep, *objpp);
1479 }
This function is responsible for creating pairs of caches for small memory buffers suitable for either normal or DMA memory.
436 void __init kmem_cache_sizes_init(void)
437 {
438 cache_sizes_t *sizes = cache_sizes;
439 char name[20];
440
444 if (num_physpages > (32 << 20) >> PAGE_SHIFT)
445 slab_break_gfp_order = BREAK_GFP_ORDER_HI;
446 do {
452 snprintf(name, sizeof(name), "size-%Zd",
sizes->cs_size);
453 if (!(sizes->cs_cachep =
454 kmem_cache_create(name, sizes->cs_size,
455 0, SLAB_HWCACHE_ALIGN, NULL, NULL))) {
456 BUG();
457 }
458
460 if (!(OFF_SLAB(sizes->cs_cachep))) {
461 offslab_limit = sizes->cs_size-sizeof(slab_t);
462 offslab_limit /= 2;
463 }
464 snprintf(name, sizeof(name), "size-%Zd(DMA)",
sizes->cs_size);
465 sizes->cs_dmacachep = kmem_cache_create(name,
sizes->cs_size, 0,
466 SLAB_CACHE_DMA|SLAB_HWCACHE_ALIGN,
NULL, NULL);
467 if (!sizes->cs_dmacachep)
468 BUG();
469 sizes++;
470 } while (sizes->cs_size);
471 }
The call graph for this function is shown in Figure 8.16.
1555 void * kmalloc (size_t size, int flags)
1556 {
1557 cache_sizes_t *csizep = cache_sizes;
1558
1559 for (; csizep->cs_size; csizep++) {
1560 if (size > csizep->cs_size)
1561 continue;
1562 return __kmem_cache_alloc(flags & GFP_DMA ?
1563 csizep->cs_dmacachep :
csizep->cs_cachep, flags);
1564 }
1565 return NULL;
1566 }
The call graph for this function is shown in Figure 8.17. It is worth noting that the work this function does is almost identical to the function kmem_cache_free() with debugging enabled (See Section H.3.3.1).
1597 void kfree (const void *objp)
1598 {
1599 kmem_cache_t *c;
1600 unsigned long flags;
1601
1602 if (!objp)
1603 return;
1604 local_irq_save(flags);
1605 CHECK_PAGE(virt_to_page(objp));
1606 c = GET_PAGE_CACHE(virt_to_page(objp));
1607 __kmem_cache_free(c, (void*)objp);
1608 local_irq_restore(flags);
1609 }
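A minimal sketch of the kmalloc()/kfree() pair discussed in this section; the function and buffer usage are hypothetical.

#include <linux/slab.h>
#include <linux/string.h>
#include <linux/errno.h>

static int copy_to_heap(const char *src, size_t len)
{
        char *buf;

        /* Served from the smallest sizes cache that can hold len bytes */
        buf = kmalloc(len, GFP_KERNEL);
        if (!buf)
                return -ENOMEM;

        memcpy(buf, src, len);
        /* ... use buf ... */

        /* kfree() finds the originating cache from the page descriptor */
        kfree(buf);
        return 0;
}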
The structure of the Per-CPU object cache and how objects are added or removed from them is covered in detail in Sections 8.5.1 and 8.5.2.
Figure H.1: Call Graph: enable_all_cpucaches()
This function locks the cache chain and enables the cpucache for every cache. This is important after the cache_cache and the sizes caches have been set up.
1714 static void enable_all_cpucaches (void)
1715 {
1716 struct list_head* p;
1717
1718 down(&cache_chain_sem);
1719
1720 p = &cache_cache.next;
1721 do {
1722 kmem_cache_t* cachep = list_entry(p, kmem_cache_t, next);
1723
1724 enable_cpucache(cachep);
1725 p = cachep->next.next;
1726 } while (p != &cache_cache.next);
1727
1728 up(&cache_chain_sem);
1729 }
This function calculates what size the cpucache should be based on the size of the objects the cache contains before calling kmem_tune_cpucache(), which does the actual allocation. For example, a cache of 512-byte objects is given a limit of 124 and a batchcount of 62.
1693 static void enable_cpucache (kmem_cache_t *cachep)
1694 {
1695 int err;
1696 int limit;
1697
1699 if (cachep->objsize > PAGE_SIZE)
1700 return;
1701 if (cachep->objsize > 1024)
1702 limit = 60;
1703 else if (cachep->objsize > 256)
1704 limit = 124;
1705 else
1706 limit = 252;
1707
1708 err = kmem_tune_cpucache(cachep, limit, limit/2);
1709 if (err)
1710 printk(KERN_ERR
"enable_cpucache failed for %s, error %d.\n",
1711 cachep->name, -err);
1712 }
This function is responsible for allocating memory for the cpucaches. For each CPU on the system, kmalloc() allocates a block of memory large enough for one cpucache and fills a ccupdate_struct_t struct. The function smp_call_function_all_cpus() then calls do_ccupdate_local(), which swaps the new information with the old information in the cache descriptor.
1639 static int kmem_tune_cpucache (kmem_cache_t* cachep,
int limit, int batchcount)
1640 {
1641 ccupdate_struct_t new;
1642 int i;
1643
1644 /*
1645 * These are admin-provided, so we are more graceful.
1646 */
1647 if (limit < 0)
1648 return -EINVAL;
1649 if (batchcount < 0)
1650 return -EINVAL;
1651 if (batchcount > limit)
1652 return -EINVAL;
1653 if (limit != 0 && !batchcount)
1654 return -EINVAL;
1655
1656 memset(&new.new,0,sizeof(new.new));
1657 if (limit) {
1658 for (i = 0; i< smp_num_cpus; i++) {
1659 cpucache_t* ccnew;
1660
1661 ccnew = kmalloc(sizeof(void*)*limit+
1662 sizeof(cpucache_t),
GFP_KERNEL);
1663 if (!ccnew)
1664 goto oom;
1665 ccnew->limit = limit;
1666 ccnew->avail = 0;
1667 new.new[cpu_logical_map(i)] = ccnew;
1668 }
1669 }
1670 new.cachep = cachep;
1671 spin_lock_irq(&cachep->spinlock);
1672 cachep->batchcount = batchcount;
1673 spin_unlock_irq(&cachep->spinlock);
1674
1675 smp_call_function_all_cpus(do_ccupdate_local, (void *)&new);
1676
1677 for (i = 0; i < smp_num_cpus; i++) {
1678 cpucache_t* ccold = new.new[cpu_logical_map(i)];
1679 if (!ccold)
1680 continue;
1681 local_irq_disable();
1682 free_block(cachep, cc_entry(ccold), ccold->avail);
1683 local_irq_enable();
1684 kfree(ccold);
1685 }
1686 return 0;
1687 oom:
1688 for (i--; i >= 0; i--)
1689 kfree(new.new[cpu_logical_map(i)]);
1690 return -ENOMEM;
1691 }
This calls the function func() for all CPUs. In the context of the slab allocator, the function is do_ccupdate_local() and the argument is a ccupdate_struct_t.
859 static void smp_call_function_all_cpus(void (*func) (void *arg),
void *arg)
860 {
861 local_irq_disable();
862 func(arg);
863 local_irq_enable();
864
865 if (smp_call_function(func, arg, 1, 1))
866 BUG();
867 }
This function swaps the cpucache information in the cache descriptor with the information in info for this CPU.
874 static void do_ccupdate_local(void *info)
875 {
876 ccupdate_struct_t *new = (ccupdate_struct_t *)info;
877 cpucache_t *old = cc_data(new->cachep);
878
879 cc_data(new->cachep) = new->new[smp_processor_id()];
880 new->new[smp_processor_id()] = old;
881 }
This function is called to drain all objects in a per-CPU cache. It is called when a cache needs to be shrunk so that slabs can be freed. A slab would not be freeable if an object were sitting in the per-CPU cache, even though it is not in use.
885 static void drain_cpu_caches(kmem_cache_t *cachep)
886 {
887 ccupdate_struct_t new;
888 int i;
889
890 memset(&new.new,0,sizeof(new.new));
891
892 new.cachep = cachep;
893
894 down(&cache_chain_sem);
895 smp_call_function_all_cpus(do_ccupdate_local, (void *)&new);
896
897 for (i = 0; i < smp_num_cpus; i++) {
898 cpucache_t* ccold = new.new[cpu_logical_map(i)];
899 if (!ccold || (ccold->avail == 0))
900 continue;
901 local_irq_disable();
902 free_block(cachep, cc_entry(ccold), ccold->avail);
903 local_irq_enable();
904 ccold->avail = 0;
905 }
906 smp_call_function_all_cpus(do_ccupdate_local, (void *)&new);
907 up(&cache_chain_sem);
908 }
This function will initialise the cache chain and its semaphore, use kmem_cache_estimate() to calculate the number of objects that will fit on a cache_cache slab and then calculate how much colouring is available for cache_cache.
416 void __init kmem_cache_init(void)
417 {
418 size_t left_over;
419
420 init_MUTEX(&cache_chain_sem);
421 INIT_LIST_HEAD(&cache_chain);
422
423 kmem_cache_estimate(0, cache_cache.objsize, 0,
424 &left_over, &cache_cache.num);
425 if (!cache_cache.num)
426 BUG();
427
428 cache_cache.colour = left_over/cache_cache.colour_off;
429 cache_cache.colour_next = 0;
430 }
This allocates pages for the slab allocator
486 static inline void * kmem_getpages (kmem_cache_t *cachep,
unsigned long flags)
487 {
488 void *addr;
495 flags |= cachep->gfpflags;
496 addr = (void*) __get_free_pages(flags, cachep->gfporder);
503 return addr;
504 }
This frees pages for the slab allocator. Before it calls the buddy allocator API, it will remove the PG_slab bit from the page flags.
507 static inline void kmem_freepages (kmem_cache_t *cachep, void *addr)
508 {
509 unsigned long i = (1<<cachep->gfporder);
510 struct page *page = virt_to_page(addr);
511
517 while (i--) {
518 PageClearSlab(page);
519 page++;
520 }
521 free_pages((unsigned long)addr, cachep->gfporder);
522 }