Demand-Attach File-Server (DAFS)
OpenAFS 1.5 contains the Demand-Attach File-Server (DAFS). DAFS is a significant departure from the traditional AFS file-server, and this document details those changes.
Why Demand-Attach File-Server (DAFS)?
On a traditional file-server, volumes are attached at start-up and detached only at shutdown. Any attached volume can be modified, and changes are periodically flushed to disk or written on shutdown. When a file-server isn't shut down cleanly, the integrity of every attached volume has to be verified by the salvager, whether the volume had been modified or not. As file-servers grow larger (and the number of volumes increases), the length of time required to salvage and attach volumes increases; for example, it takes around two hours for a file-server housing 512 GB of data to salvage and attach its volumes!
On a Demand-Attach File-Server (DAFS), volumes are attached only when accessed by clients. On start-up, the file-server reads only the volume headers to determine what volumes reside on what partitions. When accessed by clients, the volumes are attached. After some period of inactivity, volumes are automatically detached. This dramatically improves start-up and shutdown times. A demand-attach file-server can be restarted in seconds compared to hours for the same traditional file-server.
The primary objective of the demand-attach file-server was to dramatically reduce the amount of time required to restart an AFS file-server.
Large portions of this document were taken from, or influenced by, the presentation entitled Demand Attach / Fast-Restart Fileserver given by Tom Keiser at the AFS and Kerberos Best Practices Workshop in 2006.
An Overview of Demand-Attach File-Server
Demand-attach necessitated a significant re-design of certain aspects of the AFS code; the traditional design suffers from a number of limitations:
- the volume package has a number of severe limitations
  - a single global lock, leading to poor scaling
  - the lock is held across high-latency operations, e.g. disk I/O
  - no notion of state for concurrently accessed objects
- the vnode package suffers from the same limitations
- breaking callbacks is time consuming
- salvaging requires the file-server to be taken offline
The changes implemented for demand-attach include:
- volume finite-state automata
- volumes are attached on demand
- volume garbage collector to detach unused volumes
- notion of volume state means read-only volumes aren't salvaged
- vnode finite-state automata
- global lock is only held when required and never held across high-latency operations
- automatic salvaging of volumes
- shutdown is done in parallel (maximum number of threads utilized)
- callbacks are no longer broken on shutdown
  - instead, host / callback state is preserved across restarts
The Gory Details of the Demand-Attach File-Server
Bos Configuration
A traditional file-server uses the bnode type fs and has a definition similar to
bnode fs fs 1
parm /usr/afs/bin/fileserver -p 123 -L -busyat 200 -rxpck 2000 -cb 4000000
parm /usr/afs/bin/volserver -p 127 -log
parm /usr/afs/bin/salvager -parallel all32
end
Since an additional component was required for the demand-attach file-server, a new bnode type (dafs) was introduced. The definition should be similar to
bnode dafs dafs 1
parm /usr/afs/bin/dafileserver -p 123 -L -busyat 200 -rxpck 2000 -cb 4000000 -vattachpar 128 -vlruthresh 1440 -vlrumax 8 -vhashsize 11
parm /usr/afs/bin/davolserver -p 64 -log
parm /usr/afs/bin/salvageserver
parm /usr/afs/bin/dasalvager -parallel all32
end
The instance for a demand-attach file-server is therefore dafs instead of fs. For a complete list of configuration options, see the dafileserver man page.
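A minimal sketch of creating such an instance with bos create follows; the binary paths, fileserver arguments and the use of -localauth are assumptions and should be adapted to the local installation:

# Sketch only: create the dafs bnode on a file-server machine (run with admin credentials).
bos create <server> dafs dafs \
    "/usr/afs/bin/dafileserver -p 123 -L -busyat 200" \
    "/usr/afs/bin/davolserver -p 64 -log" \
    /usr/afs/bin/salvageserver \
    "/usr/afs/bin/dasalvager -parallel all32" \
    -localauth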
File-server Start-up / Shutdown Sequence
The table below compares the start-up sequence for a traditional file-server and a demand-attach file-server.
Traditional | Demand-Attach |
---|---|
 | host / callback state restored |
 | host / callback state consistency verified |
build vice partition list | build vice partition list |
volumes are attached | volume headers read |
 | volumes placed into pre-attached state |
The host / callback state is covered later. The pre-attached state indicates that the file-server has read the volume headers and is aware that the volume exists, but that it has not been attached (and hence is not on-line).
The shutdown sequence for both file-server types is:
Traditional | Demand-Attach |
---|---|
break callbacks | quiesce host / callback state |
shutdown volumes | shutdown on-line volumes |
 | verify host / callback state consistency |
 | save host / callback state |
On a traditional file-server, volumes are off-lined (detached) serially. With demand-attach, as many threads as possible are used to detach volumes, which is possible because each volume has an associated state.
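Because start-up and shutdown are so much faster, a demand-attach file-server can be bounced with minimal disruption. A hedged illustration (the server name is a placeholder):

# Restart only the demand-attach file-server instance; with DAFS this
# typically completes in seconds rather than hours.
bos restart <server> -instance dafs -localauth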
Volume Finite-State Automata
The volume finite-state automata is available in the source tree under doc/arch/dafs-fsa.dot. See the fssync-debug section below for information on debugging the volume package.
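The .dot file can be rendered into a diagram with Graphviz, e.g. (assuming Graphviz is installed):

# Render the volume finite-state automata diagram from the source tree.
dot -Tpng doc/arch/dafs-fsa.dot -o dafs-fsa.png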
Volume Least Recently Used (VLRU) Queues
The Volume Least Recently Used (VLRU) is a garbage collection facility which automatically off-lines volumes in the background. The purpose of this facility is to pro-actively off-line infrequently used volumes to improve shutdown and salvage times. The process of off-lining a volume from the "attached" state to the "pre-attached" state is called soft detachment.
VLRU works in a manner similar to a generational garbage collector. There are five queues on which volumes can reside.
Queue | Meaning |
---|---|
candidate | Volumes which have not been accessed recently and hence are candidates for soft detachment. |
held | Volumes which are administratively prevented from VLRU activity, i.e. will never be detached. |
intermediate (mid) | Volumes transitioning from new -> old (see the state transitions table below for details). |
new | Volumes which have been accessed. See the state transitions table below for details. |
old | Volumes which are continually accessed. See the state transitions table below for details. |
The state of the various VLRU queues is dumped with the file-server state and at shutdown.
The VLRU queues new, mid (intermediate) and old are generational queues for active volumes. State transitions are controlled by inactivity timers, as summarized below:
Transition | Timeout (multiple of vlruthresh) | Actual Timeout (with suggested vlruthresh of 1440 minutes) | Reason (since last transition) |
---|---|---|---|
candidate->new | - | - | new activity |
new->candidate | 1 * vlruthresh | 24 hrs | no activity |
new->mid | 2 * vlruthresh | 48 hrs | activity |
mid->old | 4 * vlruthresh | 96 hrs | activity |
old->mid | 2 * vlruthresh | 48 hrs | no activity |
mid->new | 1 * vlruthresh | 24 hrs | no activity |
The suggested vlruthresh has been optimized for read-only file-servers, where volumes are typically accessed once a day and soft detachment has little benefit (read-only volumes are not salvaged, and avoiding salvage is one of the main reasons for soft detaching).
Vnode Finite-State Automata
The vnode finite-state automata is available in the source tree under doc/arch/dafs-vnode-fsa.dot.
/usr/afs/bin/fssync-debug provides low-level inspection and control of the file-server volume package. Indiscriminate use of fssync-debug can lead to extremely bad things occurring. Use with care.
Demand Salvaging
Demand salvaging is implemented by the salvageserver. The actual code for salvaging a volume remains largely unchanged; however, the method for invoking salvaging with demand-attach has changed:
- the file-server automatically requests volumes be salvaged as required, i.e. they are marked as requiring salvaging when attached.
- manual initiation of salvaging may be required when access is through the volserver (this may be addressed at some later date). bos salvage requires the -forceDAFS flag to initiate salvaging with DAFS; however, salvaging should not be initiated using this method (an example appears at the end of this section).
- infinite salvage, attach, salvage, ... loops are possible. There is therefore a hard limit on the number of times a volume will be salvaged, which is reset when the volume is removed or the file-server is restarted.
- volumes are salvaged in parallel; this is controlled by the -Parallel argument to the salvageserver (defaults to 4).
- the salvageserver and the inode file-server are incompatible:
  - because volumes are inter-mingled on a partition (rather than being separated), a lock on the entire partition on which the volume is located is held throughout. Both the fileserver and volserver will block if they require this lock, e.g. to restore / dump a volume located on the partition.
  - inodes for a particular volume can be located anywhere on a partition. Salvaging therefore results in every inode on a partition having to be read to determine whether it belongs to the volume. This is extremely I/O intensive and leads to horrendous salvaging performance.

/usr/afs/bin/salvsync-debug provides low-level inspection and control over the salvageserver. Indiscriminate use of salvsync-debug can lead to extremely bad things occurring. Use with care. See the salvsync-debug section below for information on debugging problems with the salvageserver.
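For completeness, a sketch of the bos salvage invocation with -forceDAFS is shown below; as noted above, this method is discouraged on demand-attach file-servers (prefer salvsync-debug, described later). The server, partition and volume names are placeholders:

# Discouraged on DAFS; shown only to illustrate the -forceDAFS flag.
bos salvage -server <server> -partition /vicepb -volume user.af -forceDAFS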
File-Server Host / Callback State
Host / callback information is persistent across restarts with demand-attach. On shutdown, the file-server writes the data to /usr/afs/local/fsstate.dat. The contents of this file are read and verified at start-up, and hence it is unnecessary to break callbacks on shutdown with demand-attach.
The contents of fsstate.dat can be inspected using /usr/afs/bin/state_analyzer.
File-Server Arguments (relating to Demand-Attach)
These are available in the man-pages (section 8) for the fileserver; some details are provided here for convenience:
Arguments controlling the host / callback state:
Parameter | Options | Default | Suggested Value | Meaning |
---|---|---|---|---|
-fs-state-dont-save | n/a | state saved | - | fileserver state will not be saved during shutdown |
-fs-state-dont-restore | n/a | state restored | - | fileserver state will not be restored during startup |
-fs-state-verify | none, save, restore, both | both | - | Controls the behavior of the state verification mechanism. Before saving or restoring the fileserver state information, the internal host and callback data structures are verified. A value of 'none' turns off all verification. A value of 'save' only performs the verification steps prior to saving state to disk. A value of 'restore' only performs the verification steps after restoring state from disk. A value of 'both' performs all verification steps both prior to saving and after restoring state. |
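For illustration, any of these flags is simply appended to the dafileserver parameters in BosConfig; a sketch, re-using the arguments from the example above:

parm /usr/afs/bin/dafileserver -p 123 -L -busyat 200 -fs-state-verify both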
Arguments controlling the VLRU
Parameter | Options | Default | Suggested Value | Meaning |
---|---|---|---|---|
-vattachpar | positive integer | 1 | 128 | Controls the parallelism of the volume package start-up and shutdown routines. On start-up, vice partitions are scanned for volumes to pre-attach using a number of worker threads, the number of which is the minimum of -vattachpar and the number of vice partitions. On shutdown, -vattachpar worker threads are used to detach volumes. The shutdown code is MP-scalable well beyond the number of vice partitions; Tom Keiser (from SNA) found 128 threads for a single vice partition gave a statistically significant performance improvement over 64 threads. |
-vhashsize | positive integer | 8 | 11 | Controls the size of the volume hash table, which will contain 2^(vhashsize) entries. Hash bucket utilization statistics are given in the fileserver state information as well as on shutdown. |
-vlrudisable | n/a | enabled | - | Disables the Volume Least Recently Used (VLRU) cache. |
-vlruthresh | positive integer | 120 minutes | 1440 (24 hrs) | Minutes of inactivity before a volume is eligible for soft detachment. |
-vlruinterval | positive integer | 120 seconds | - | Number of seconds between VLRU candidate queue scans. |
-vlrumax | positive integer | 8 | 8 | Maximum number of volumes which will be soft detached in a single pass of the scanner. |
Tools for Debugging Demand-Attach File-Server
Several tools aid debugging problems with demand-attach file-servers. They operate at an extremely low level and hence require a detailed knowledge of the architecture / code.
fssync-debug
Indiscriminate use of fssync-debug can have extremely dire consequences. Use with care.
fssync-debug provides low-level inspection and control over the volume package of the file-server. It can be used to display the file-server information associated with a volume, e.g.
ozsaw7 2# vos exam user.af -cell w.ln
user.af 537119916 RW 2123478 K On-line
ln1qaf01 /vicepb
RWrite 537119916 ROnly 0 Backup 537119917
MaxQuota 3200000 K
Creation Wed Sep 17 17:48:17 2003
Copy Thu Dec 11 18:01:37 2008
Backup Thu Jun 25 01:49:20 2009
Last Update Thu Jun 25 16:17:35 2009
85271 accesses in the past day (i.e., vnode references)
RWrite: 537119916 Backup: 537119917
number of sites -> 1
server ln1qaf01 partition /vicepb RW Site
ozsaw7 3# /usr/afs/bin/fssync-debug query -vol 537119916 -part /vicepb
calling FSYNC_VolOp with command code 65543 (FSYNC_VOL_QUERY)
FSYNC_VolOp returned 0 (SYNC_OK)
protocol response code was 0 (SYNC_OK)
protocol reason code was 0 (0)
volume = {
hashid = 537119916
header = 0xf93f160
device = 1
partition = 0xf90dfb8
linkHandle = 0x10478400
nextVnodeUnique = 2259017
diskDataHandle = 0x104783d0
vnodeHashOffset = 14
shuttingDown = 0
goingOffline = 0
cacheCheck = 49167
nUsers = 0
needsPutBack = 0
specialStatus = 0
updateTime = 1245943107
vnodeIndex[vSmall] = {
handle = 0x104783a0
bitmap = 0x13f44df0
bitmapSize = 2792
bitmapOffset = 64
}
vnodeIndex[vLarge] = {
handle = 0x10478370
bitmap = 0x13c96040
bitmapSize = 296
bitmapOffset = 252
}
updateTime = 1245943107
attach_state = VOL_STATE_ATTACHED
attach_flags = VOL_HDR_ATTACHED | VOL_HDR_LOADED | VOL_HDR_IN_LRU | VOL_IN_HASH | VOL_ON_VBYP_LIST | VOL_ON_VLRU
nWaiters = 0
chainCacheCheck = 14
salvage = {
prio = 0
reason = 0
requested = 0
scheduled = 0
}
stats = {
hash_lookups = {
hi = 0
lo = 459560
}
hash_short_circuits = {
hi = 0
lo = 30
}
hdr_loads = {
hi = 0
lo = 32
}
hdr_gets = {
hi = 0
lo = 456011
}
attaches = 32
soft_detaches = 0
salvages = 1
vol_ops = 32
last_attach = 1245891030
last_get = 1245943107
last_promote = 1245891030
last_hdr_get = 1245943107
last_hdr_load = 1245891030
last_salvage = 1242508846
last_salvage_req = 1242508846
last_vol_op = 1245890958
}
vlru = {
idx = 0 (VLRU_QUEUE_NEW)
}
pending_vol_op = 0x0
}
Note that the volumeid argument must be the numeric ID and the partition argument must be the exact partition name (not an abbreviation). An explanation of all these values is beyond the scope of this document. The important fields are:
- attach_state, which is usually one of:
  - VOL_STATE_PREATTACHED, meaning the volume headers have been read but the volume is not attached
  - VOL_STATE_ATTACHED, meaning the volume is fully attached
  - VOL_STATE_ERROR, indicating that the volume cannot be attached
- attach_flags:
  - VOL_HDR_ATTACHED means the volume headers have been read (and hence the file-server is aware of the volume's existence)
  - VOL_HDR_LOADED means the volume headers are resident in memory
  - VOL_HDR_IN_LRU means the volume headers are on the least-recently used queue
  - VOL_IN_HASH indicates that the volume has been added to the volume linked-list
  - VOL_ON_VBYP_LIST indicates that the volume is linked off the partition list
  - VOL_ON_VLRU means the volume is on a VLRU queue
- the salvage structure
- the stats structure, particularly the volume operation times (last_*)
- the vlru structure, particularly the VLRU queue
An understanding of the volume finite-state machine is required before the state of a volume is manipulated.
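For example, a volume can be taken offline within the volume package and brought back online. The subcommand names below are assumptions; confirm them against the fssync-debug help output before use:

# Assumed subcommands; verify with "fssync-debug help" first.
/usr/afs/bin/fssync-debug offline -vol 537119916 -part /vicepb
/usr/afs/bin/fssync-debug online -vol 537119916 -part /vicepb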
salvsync-debug
Indiscriminate use of salvsync-debug can have extremely dire consequences. Use with care.
salvsync-debug provides low-level inspection and control of the salvageserver process, including the scheduling order of volumes.
salvsync-debug can be used to query the current salvage status of a volume, e.g.
ozsaw7 4# /usr/afs/bin/salvsync-debug query -vol 537119916 -part /vicepb
calling SALVSYNC_SalvageVolume with command code 65540 (SALVSYNC_QUERY)
SALVSYNC_SalvageVolume returned 0 (SYNC_OK)
protocol response code was 0 (SYNC_OK)
protocol reason code was 0 (**UNKNOWN**)
state = {
state = 4 (SALVSYNC_STATE_DONE)
prio = 0
sq_len = 0
pq_len = 0
}
To initiate the salvaging of a volume
ozsaw7 5# /usr/afs/bin/salvsync-debug salvage -vol 537119916 -part /vicepb
calling SALVSYNC_SalvageVolume with command code 65537 (SALVSYNC_SALVAGE)
SALVSYNC_SalvageVolume returned 0 (SYNC_OK)
protocol response code was 0 (SYNC_OK)
protocol reason code was 0 (**UNKNOWN**)
state = {
state = 1 (SALVSYNC_STATE_QUEUED)
prio = 0
sq_len = 1
pq_len = 0
}
ozsaw7 6# /usr/afs/bin/salvsync-debug query -vol 537119916 -part /vicepb
calling SALVSYNC_SalvageVolume with command code 65540 (SALVSYNC_QUERY)
SALVSYNC_SalvageVolume returned 0 (SYNC_OK)
protocol response code was 0 (SYNC_OK)
protocol reason code was 0 (**UNKNOWN**)
state = {
state = 2 (SALVSYNC_STATE_SALVAGING)
prio = 0
sq_len = 0
pq_len = 1
}
This is the method that should be used on demand-attach file-servers to initiate the manual salvage of volumes. It should be used with care.
Under normal circumstances, the priority (prio) of a salvage request is the number of times the volume has been requested by clients. Modifying the priority (and hence the order in which volumes are salvaged) under heavy demand-salvaging usually leads to extremely bad things happening. To modify the priority of a request, use
salvsync-debug priority -vol 537119916 -part /vicepb -priority 999999
(where priority is a 32-bit integer).
state_analyzer
state_analyzer allows the contents of the host / callback state file (/usr/afs/local/fsstate.dat) to be inspected.
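A sketch of starting it (passing the state file path as an argument is an assumption; consult the man page if the invocation differs):

/usr/afs/bin/state_analyzer /usr/afs/local/fsstate.dat

This drops into the interactive prompt (fs state analyzer>) used in the examples below.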
Header Information
Header information is gleaned through the hdr command:
fs state analyzer> hdr
loading structure from address 0xfed80000 (offset 0)
fs_state_header = {
stamp = {
magic = 0x62fa841c
version = 2
}
timestamp = "Tue Jun 23 11:51:49 2009"
sys_name = 941
server_uuid = "002e9712-ae67-1a2e-8a-42-900e866eaa77"
valid = 0
endianness = 1
stats_detailed = 1
h_offset = {
hi = 0
lo = 1024
}
cb_offset = {
hi = 0
lo = 93928
}
server_version_string = "@(#) OpenAFS 1.4.6-22 built 2009-04-18 "
}
fs state analyzer>
Host Information
Host information can be gleaned through the h command, e.g.
fs state analyzer> h
fs state analyzer: h(0)> this
loading structure from address 0xfed80500 (offset 1280)
host_state_entry_header = {
magic = 0xa8b9cadb
len = 104
interfaces = 2
hcps = 0
}
hostDiskEntry = {
host = "161.144.167.187"
port = 7001
hostFlags = 0x144
Console = 0
hcpsfailed = 0
hcps_valid = 0
InSameNetwork = 0
hcps_len = 0
LastCall = "Tue Jun 23 11:51:45 2009"
ActiveCall = "Tue Jun 23 11:51:45 2009"
cpsCall = "Tue Jun 23 11:51:45 2009"
cblist = 21133
index = 373
}
Interface = {
numberOfInterfaces = 2
uuid = "aae8a851-1d54-4b83-ad-17-db967bd89e1b"
interface[0] = {
addr = "161.144.167.187"
port = 7001
}
interface[1] = {
addr = "192.168.8.4"
port = 7001
}
}
fs state analyzer: h(0)> next
loading structure from address 0xfed80568 (offset 1384)
host_state_entry_header = {
magic = 0xa8b9cadb
len = 120
interfaces = 4
hcps = 0
}
hostDiskEntry = {
host = "10.181.34.134"
port = 7001
hostFlags = 0x144
Console = 0
hcpsfailed = 0
hcps_valid = 0
InSameNetwork = 0
hcps_len = 0
LastCall = "Tue Jun 23 11:51:08 2009"
ActiveCall = "Tue Jun 23 11:51:08 2009"
cpsCall = "Tue Jun 23 11:51:08 2009"
cblist = 8422
index = 369
}
Interface = {
numberOfInterfaces = 4
uuid = "00107e94-794d-1a3d-ae-e2-0ab52421aa77"
interface[0] = {
addr = "10.181.36.33"
port = 7001
}
interface[1] = {
addr = "10.181.36.31"
port = 7001
}
interface[2] = {
addr = "10.181.32.134"
port = 7001
}
interface[3] = {
addr = "10.181.34.134"
port = 7001
}
}
fs state analyzer: h(1)>
Callback Information
Callback information is available through the cb command, e.g.
fs state analyzer> cb
fs state analyzer: fe(0):cb(0)> dump
loading structure from address 0xfed97b6c (offset 97132)
CBDiskEntry = {
cb = {
cnext = 0
fhead = 19989
thead = 103
status = 1
hhead = 224
tprev = 12276
tnext = 22836
hprev = 12276
hnext = 22836
}
index = 6774
}
The dump command (as opposed to the this command) displays all call-backs for the current file-entry. Moving to the next file-entry can be achieved by
fs state analyzer: fe(0):cb(0)> quit
fs state analyzer: fe(0)> next
loading structure from address 0xfed97b90 (offset 97168)
callback_state_entry_header = {
magic = 0x54637281
len = 104
nCBs = 1
}
FEDiskEntry = {
fe = {
vnode = 46426
unique = 125874
volid = 537156174
fnext = 8880
ncbs = 1
firstcb = 23232
status = 0
}
index = 21352
}
fs state analyzer: fe(1)> cb
fs state analyzer: fe(1):cb(0)> dump
loading structure from address 0xfed97bd4 (offset 97236)
CBDiskEntry = {
cb = {
cnext = 0
fhead = 21352
thead = 103
status = 1
hhead = 382
tprev = 23751
tnext = 22772
hprev = 23751
hnext = 7824
}
index = 23232
}