nfs: move more to Documentation/filesystems/nfs

Oops: I missed two files in the first commit that created this
directory.

Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
This commit is contained in:
J. Bruce Fields
2009-11-06 13:59:43 -05:00
parent 8c10cbdb4a
commit ea4878a24d
4 changed files with 4 additions and 2 deletions
+4
View File
@@ -2,6 +2,8 @@
- this file (nfs-related documentation).
Exporting
- explanation of how to make filesystems exportable.
knfsd-stats.txt
- statistics which the NFS server makes available to user space.
nfs.txt
- nfs client, and DNS resolution for fs_locations.
nfs41-server.txt
@@ -10,3 +12,5 @@ nfs-rdma.txt
- how to install and setup the Linux NFS/RDMA client and server software
nfsroot.txt
- short guide on setting up a diskless box with NFS root filesystem.
rpc-cache.txt
- introduction to the caching mechanisms in the sunrpc layer.
@@ -0,0 +1,159 @@
Kernel NFS Server Statistics
============================
This document describes the format and semantics of the statistics
which the kernel NFS server makes available to userspace. These
statistics are available in several text form pseudo files, each of
which is described separately below.
In most cases you don't need to know these formats, as the nfsstat(8)
program from the nfs-utils distribution provides a helpful command-line
interface for extracting and printing them.
All the files described here are formatted as a sequence of text lines,
separated by newline '\n' characters. Lines beginning with a hash
'#' character are comments intended for humans and should be ignored
by parsing routines. All other lines contain a sequence of fields
separated by whitespace.
/proc/fs/nfsd/pool_stats
------------------------
This file is available in kernels from 2.6.30 onwards, if the
/proc/fs/nfsd filesystem is mounted (it almost always should be).
The first line is a comment which describes the fields present in
all the other lines. The other lines present the following data as
a sequence of unsigned decimal numeric fields. One line is shown
for each NFS thread pool.
All counters are 64 bits wide and wrap naturally. There is no way
to zero these counters, instead applications should do their own
rate conversion.
pool
The id number of the NFS thread pool to which this line applies.
This number does not change.
Thread pool ids are a contiguous set of small integers starting
at zero. The maximum value depends on the thread pool mode, but
currently cannot be larger than the number of CPUs in the system.
Note that in the default case there will be a single thread pool
which contains all the nfsd threads and all the CPUs in the system,
and thus this file will have a single line with a pool id of "0".
packets-arrived
Counts how many NFS packets have arrived. More precisely, this
is the number of times that the network stack has notified the
sunrpc server layer that new data may be available on a transport
(e.g. an NFS or UDP socket or an NFS/RDMA endpoint).
Depending on the NFS workload patterns and various network stack
effects (such as Large Receive Offload) which can combine packets
on the wire, this may be either more or less than the number
of NFS calls received (which statistic is available elsewhere).
However this is a more accurate and less workload-dependent measure
of how much CPU load is being placed on the sunrpc server layer
due to NFS network traffic.
sockets-enqueued
Counts how many times an NFS transport is enqueued to wait for
an nfsd thread to service it, i.e. no nfsd thread was considered
available.
The circumstance this statistic tracks indicates that there was NFS
network-facing work to be done but it couldn't be done immediately,
thus introducing a small delay in servicing NFS calls. The ideal
rate of change for this counter is zero; significantly non-zero
values may indicate a performance limitation.
This can happen either because there are too few nfsd threads in the
thread pool for the NFS workload (the workload is thread-limited),
or because the NFS workload needs more CPU time than is available in
the thread pool (the workload is CPU-limited). In the former case,
configuring more nfsd threads will probably improve the performance
of the NFS workload. In the latter case, the sunrpc server layer is
already choosing not to wake idle nfsd threads because there are too
many nfsd threads which want to run but cannot, so configuring more
nfsd threads will make no difference whatsoever. The overloads-avoided
statistic (see below) can be used to distinguish these cases.
threads-woken
Counts how many times an idle nfsd thread is woken to try to
receive some data from an NFS transport.
This statistic tracks the circumstance where incoming
network-facing NFS work is being handled quickly, which is a good
thing. The ideal rate of change for this counter will be close
to but less than the rate of change of the packets-arrived counter.
overloads-avoided
Counts how many times the sunrpc server layer chose not to wake an
nfsd thread, despite the presence of idle nfsd threads, because
too many nfsd threads had been recently woken but could not get
enough CPU time to actually run.
This statistic counts a circumstance where the sunrpc layer
heuristically avoids overloading the CPU scheduler with too many
runnable nfsd threads. The ideal rate of change for this counter
is zero. Significant non-zero values indicate that the workload
is CPU limited. Usually this is associated with heavy CPU usage
on all the CPUs in the nfsd thread pool.
If a sustained large overloads-avoided rate is detected on a pool,
the top(1) utility should be used to check for the following
pattern of CPU usage on all the CPUs associated with the given
nfsd thread pool.
- %us ~= 0 (as you're *NOT* running applications on your NFS server)
- %wa ~= 0
- %id ~= 0
- %sy + %hi + %si ~= 100
If this pattern is seen, configuring more nfsd threads will *not*
improve the performance of the workload. If this patten is not
seen, then something more subtle is wrong.
threads-timedout
Counts how many times an nfsd thread triggered an idle timeout,
i.e. was not woken to handle any incoming network packets for
some time.
This statistic counts a circumstance where there are more nfsd
threads configured than can be used by the NFS workload. This is
a clue that the number of nfsd threads can be reduced without
affecting performance. Unfortunately, it's only a clue and not
a strong indication, for a couple of reasons:
- Currently the rate at which the counter is incremented is quite
slow; the idle timeout is 60 minutes. Unless the NFS workload
remains constant for hours at a time, this counter is unlikely
to be providing information that is still useful.
- It is usually a wise policy to provide some slack,
i.e. configure a few more nfsds than are currently needed,
to allow for future spikes in load.
Note that incoming packets on NFS transports will be dealt with in
one of three ways. An nfsd thread can be woken (threads-woken counts
this case), or the transport can be enqueued for later attention
(sockets-enqueued counts this case), or the packet can be temporarily
deferred because the transport is currently being used by an nfsd
thread. This last case is not very interesting and is not explicitly
counted, but can be inferred from the other counters thus:
packets-deferred = packets-arrived - ( sockets-enqueued + threads-woken )
More
----
Descriptions of the other statistics file should go here.
Greg Banks <gnb@sgi.com>
26 Mar 2009
+202
View File
@@ -0,0 +1,202 @@
This document gives a brief introduction to the caching
mechanisms in the sunrpc layer that is used, in particular,
for NFS authentication.
CACHES
======
The caching replaces the old exports table and allows for
a wide variety of values to be caches.
There are a number of caches that are similar in structure though
quite possibly very different in content and use. There is a corpus
of common code for managing these caches.
Examples of caches that are likely to be needed are:
- mapping from IP address to client name
- mapping from client name and filesystem to export options
- mapping from UID to list of GIDs, to work around NFS's limitation
of 16 gids.
- mappings between local UID/GID and remote UID/GID for sites that
do not have uniform uid assignment
- mapping from network identify to public key for crypto authentication.
The common code handles such things as:
- general cache lookup with correct locking
- supporting 'NEGATIVE' as well as positive entries
- allowing an EXPIRED time on cache items, and removing
items after they expire, and are no longer in-use.
- making requests to user-space to fill in cache entries
- allowing user-space to directly set entries in the cache
- delaying RPC requests that depend on as-yet incomplete
cache entries, and replaying those requests when the cache entry
is complete.
- clean out old entries as they expire.
Creating a Cache
----------------
1/ A cache needs a datum to store. This is in the form of a
structure definition that must contain a
struct cache_head
as an element, usually the first.
It will also contain a key and some content.
Each cache element is reference counted and contains
expiry and update times for use in cache management.
2/ A cache needs a "cache_detail" structure that
describes the cache. This stores the hash table, some
parameters for cache management, and some operations detailing how
to work with particular cache items.
The operations requires are:
struct cache_head *alloc(void)
This simply allocates appropriate memory and returns
a pointer to the cache_detail embedded within the
structure
void cache_put(struct kref *)
This is called when the last reference to an item is
dropped. The pointer passed is to the 'ref' field
in the cache_head. cache_put should release any
references create by 'cache_init' and, if CACHE_VALID
is set, any references created by cache_update.
It should then release the memory allocated by
'alloc'.
int match(struct cache_head *orig, struct cache_head *new)
test if the keys in the two structures match. Return
1 if they do, 0 if they don't.
void init(struct cache_head *orig, struct cache_head *new)
Set the 'key' fields in 'new' from 'orig'. This may
include taking references to shared objects.
void update(struct cache_head *orig, struct cache_head *new)
Set the 'content' fileds in 'new' from 'orig'.
int cache_show(struct seq_file *m, struct cache_detail *cd,
struct cache_head *h)
Optional. Used to provide a /proc file that lists the
contents of a cache. This should show one item,
usually on just one line.
int cache_request(struct cache_detail *cd, struct cache_head *h,
char **bpp, int *blen)
Format a request to be send to user-space for an item
to be instantiated. *bpp is a buffer of size *blen.
bpp should be moved forward over the encoded message,
and *blen should be reduced to show how much free
space remains. Return 0 on success or <0 if not
enough room or other problem.
int cache_parse(struct cache_detail *cd, char *buf, int len)
A message from user space has arrived to fill out a
cache entry. It is in 'buf' of length 'len'.
cache_parse should parse this, find the item in the
cache with sunrpc_cache_lookup, and update the item
with sunrpc_cache_update.
3/ A cache needs to be registered using cache_register(). This
includes it on a list of caches that will be regularly
cleaned to discard old data.
Using a cache
-------------
To find a value in a cache, call sunrpc_cache_lookup passing a pointer
to the cache_head in a sample item with the 'key' fields filled in.
This will be passed to ->match to identify the target entry. If no
entry is found, a new entry will be create, added to the cache, and
marked as not containing valid data.
The item returned is typically passed to cache_check which will check
if the data is valid, and may initiate an up-call to get fresh data.
cache_check will return -ENOENT in the entry is negative or if an up
call is needed but not possible, -EAGAIN if an upcall is pending,
or 0 if the data is valid;
cache_check can be passed a "struct cache_req *". This structure is
typically embedded in the actual request and can be used to create a
deferred copy of the request (struct cache_deferred_req). This is
done when the found cache item is not uptodate, but the is reason to
believe that userspace might provide information soon. When the cache
item does become valid, the deferred copy of the request will be
revisited (->revisit). It is expected that this method will
reschedule the request for processing.
The value returned by sunrpc_cache_lookup can also be passed to
sunrpc_cache_update to set the content for the item. A second item is
passed which should hold the content. If the item found by _lookup
has valid data, then it is discarded and a new item is created. This
saves any user of an item from worrying about content changing while
it is being inspected. If the item found by _lookup does not contain
valid data, then the content is copied across and CACHE_VALID is set.
Populating a cache
------------------
Each cache has a name, and when the cache is registered, a directory
with that name is created in /proc/net/rpc
This directory contains a file called 'channel' which is a channel
for communicating between kernel and user for populating the cache.
This directory may later contain other files of interacting
with the cache.
The 'channel' works a bit like a datagram socket. Each 'write' is
passed as a whole to the cache for parsing and interpretation.
Each cache can treat the write requests differently, but it is
expected that a message written will contain:
- a key
- an expiry time
- a content.
with the intention that an item in the cache with the give key
should be create or updated to have the given content, and the
expiry time should be set on that item.
Reading from a channel is a bit more interesting. When a cache
lookup fails, or when it succeeds but finds an entry that may soon
expire, a request is lodged for that cache item to be updated by
user-space. These requests appear in the channel file.
Successive reads will return successive requests.
If there are no more requests to return, read will return EOF, but a
select or poll for read will block waiting for another request to be
added.
Thus a user-space helper is likely to:
open the channel.
select for readable
read a request
write a response
loop.
If it dies and needs to be restarted, any requests that have not been
answered will still appear in the file and will be read by the new
instance of the helper.
Each cache should define a "cache_parse" method which takes a message
written from user-space and processes it. It should return an error
(which propagates back to the write syscall) or 0.
Each cache should also define a "cache_request" method which
takes a cache item and encodes a request into the buffer
provided.
Note: If a cache has no active readers on the channel, and has had not
active readers for more than 60 seconds, further requests will not be
added to the channel but instead all lookups that do not find a valid
entry will fail. This is partly for backward compatibility: The
previous nfs exports table was deemed to be authoritative and a
failed lookup meant a definite 'no'.
request/response format
-----------------------
While each cache is free to use it's own format for requests
and responses over channel, the following is recommended as
appropriate and support routines are available to help:
Each request or response record should be printable ASCII
with precisely one newline character which should be at the end.
Fields within the record should be separated by spaces, normally one.
If spaces, newlines, or nul characters are needed in a field they
much be quoted. two mechanisms are available:
1/ If a field begins '\x' then it must contain an even number of
hex digits, and pairs of these digits provide the bytes in the
field.
2/ otherwise a \ in the field must be followed by 3 octal digits
which give the code for a byte. Other characters are treated
as them selves. At the very least, space, newline, nul, and
'\' must be quoted in this way.