v9fs: Plan 9 Resource Sharing for Linux
=======================================

ABOUT
=====

v9fs is a Unix implementation of the Plan 9 9p remote filesystem protocol, designed by Ken Thompson, the original author of Unix. Just as UTF-8 was Ken's Plan 9 solution to character encoding, 9p was his network filesystem protocol.

v9fs provides a simple network filesystem that lets you run a server and go:

  mount -t 9p 127.0.0.1:/path /mnt

Like Samba, 9p uses one connection per mount point and the server is a normal userspace process. Unlike Samba, the protocol going across the wire is simple and easily supports POSIX semantics.

Like NFS, this offers a way to remotely mount filesystems. Unlike NFS, v9fs does not reimplement half the VFS layer, the server is not running in kernel space, it doesn't have any sort of RPC layer, re-exporting a v9fs mount with another v9fs server should just work, you shouldn't get strange errors from things like "mkdir sub; rm -rf sub > sub/log.txt", it doesn't care what the underlying filesystem format is, and so on.

The 9p protocol can operate across any bidirectional serial transport. The v9fs driver includes support for TCP/IP, virtio, RDMA, named pipe, and file descriptor transports. A TCP/IP server named "Distributed I/O Daemon" is available at https://code.google.com/p/diod and the "virtfs" subsystem of QEMU and KVM is a virtio 9p server.

PROTOCOL:
=========

This driver supports three versions of the 9p protocol: the legacy 9p2000 and 9p2000.u versions, and the current (default) 9p2000.L.

In 2000 Ken Thompson retired from Bell Labs, and the last version of the Plan 9 file sharing protocol he worked on was called "9p2000".
This version had 12 basic operations, and is documented at:

  http://plan9.bell-labs.com/magic/man2html/5/0intro

Around 2003 Plan 9 was open sourced, which led to the development of an extended version of the protocol, "9p2000.u" (for Unix), providing full POSIX semantics (by adding concepts the Plan 9 OS didn't have, such as the suid bit and UID numbers instead of names). The first v9fs driver for Linux was merged in 2005, based on 9p2000.u and described in the paper "Grave Robbers From Outer Space":

  http://www.usenix.org/events/usenix05/tech/freenix/hensbergen.html

The current protocol version, "9p2000.L" (for Linux), was developed for v9fs after a few years of real world experience. It tailors the protocol to Linux with support for ACLs, file locking, more efficient directory listing, corner cases in things like file deletion, and so on. This new version was introduced around 2009 and became the new default in 2011. The 9p2000.L protocol is described at:

  https://code.google.com/p/diod/wiki/protocol

QUICK START:
============

If you just want to play around and get a feel for 9p, grab the diod (9p over IPv4) server from https://code.google.com/p/diod/ , compile it (./configure; make) and run it as a normal user (this invocation does not require root access):

  diod/diod -f -n -S -l 127.0.0.1:9999 -e $PWD

Then fire up a qemu instance with the standard emulated network (where 10.0.2.2 tunnels through to the host's loopback) and run:

  mount -t 9p -o port=9999,aname=$PWD,version=9p2000.L 10.0.2.2 /mnt

where $PWD is the same path diod served. (Note that diod currently requires absolute paths for exports.) There's a diod.8 man page in the same directory as the executable, and man pages for diod.conf and diodmount (allowing user/password logins) elsewhere in the source.

The QEMU/KVM documentation for setting up virtfs (their built-in 9p server over the virtio transport) is at:

  http://wiki.qemu.org/Documentation/9psetup

QEMU also usually runs its server as a normal user on the host.
It offers a "mapped" mode that saves ownership and device node information in extended attributes.

CACHE MODES
===========

By default, 9p operates uncached, letting the server handle concurrency. On a modern LAN this is as fast or faster than talking to local disk, and scales surprisingly well (about as well as web servers). Back before cache support was even implemented, v9fs was tested with ~2000 simultaneous clients running against the same server, and performed acceptably. (Run the server as root so setrlimit can remove the per-process filehandle restriction, give the server plenty of RAM, and plug it into a gigabit switch.)

The "-o cache=loose" mode enables Linux's local VFS cache but makes no attempt to handle multi-user concurrency beyond not panicking the kernel: updates are written back when the client's VFS gets around to flushing them, and the last writer wins. File locking and fsync/fdatasync are available if an application wants to handle its own concurrency.

Loose caching works well for read-only mounts (allowing scalable fanout in clusters, with intermediate servers re-exporting read-only v9fs mounts to more clients), or for mounts with nonconcurrent users (including only one client mounting a directory, or user home directories under a common directory).

The "-o cache=fscache" mode uses Linux's fscache subsystem to provide persistent local caching (which doesn't help concurrency at all). See Documentation/filesystems/caching/fscache.txt for details.

This code makes no attempt to handle the full range of caching corner cases other protocols wrestle with; v9fs just doesn't go there.
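As a concrete sketch of the read-only fanout case described above, an intermediate machine could mount a tree with loose caching and re-export it. This is only an illustration; the server address 10.0.0.1, port 9999, and the /srv/tree path are hypothetical:

```shell
# hypothetical fanout sketch: mount an upstream 9p export read-only with
# loose caching (safe here because nothing on this host writes to it) ...
mount -t 9p -o version=9p2000.L,cache=loose,ro 10.0.0.1 /srv/tree

# ... then re-export it to downstream clients with diod
diod -f -n -S -l 0.0.0.0:9999 -e /srv/tree
```

Both commands need root (mount) and a reachable upstream server, so treat this as a shape to adapt rather than something to paste verbatim.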
The old saying is "reliability, performance, concurrency: pick two" for a reason. Uncached mode provides reliability and concurrency; cached mode provides performance and one other (your choice which). Even with caching, multiple users of the same mount on a client are fine; the potential conflict is that if multiple client systems mount the same directory from a server and modify the same files under it, a client's cache won't be notified of updates from other clients before naturally expiring. The client pulls data from the server; the server cannot asynchronously push unrequested updates to the client. In 9p the server only responds to client requests, it never initiates transactions.

SECURITY
========

Security is implemented as a wrapper around the serial protocol; once the connection is established, v9fs supplies a username or UID (-o uname=) to the server and operations succeed or fail based on that user's credentials. Diod's diodmount uses trans=fd under the covers, allowing the mount and server programs to verify credentials first, then hand the connection off to the kernel to mount the directory. QEMU's virtfs does not implement any login mechanism because virtio data isn't visible from userspace, mounting a directory requires root access, and the client OS is the only user of the virtfs server.

You can use ssh port forwarding to encrypt the protocol. To forward the remote system's port 564 to localhost's port 9999:

  ssh -L 9999:localhost:564 user@server sleep 999999999 &

Then mount from 127.0.0.1:9999.

USING AUTOMOUNT TO RECONNECT
============================

If a Samba server crashes and is restarted, clients will automatically reconnect. The v9fs driver doesn't do this, in part because the driver is not involved in any security aspects of the login process, so it doesn't have enough information to restart the session. In theory you can use automount to do this. I have no idea how.
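The ssh port-forwarding approach from the SECURITY section, written out end to end as one sketch. The user@server login and the /mnt mount point are placeholders, and 9p2000.L is assumed:

```shell
# forward the remote system's 9p port (564) to local port 9999;
# the long sleep just keeps the tunnel process alive in the background
ssh -L 9999:localhost:564 user@server sleep 999999999 &

# mount through the encrypted tunnel (requires root on the client)
mount -t 9p -o port=9999,version=9p2000.L 127.0.0.1 /mnt
```

When you're done, unmount and kill the backgrounded ssh process to tear the tunnel down.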
USAGE
=====

For a remote file server:

  mount -t 9p 10.10.1.2 /mnt/9

For Plan 9 From User Space applications (http://swtch.com/plan9):

  mount -t 9p `namespace`/acme /mnt/9 -o trans=unix,uname=$USER

For a server running on a QEMU host with virtio transport:

  mount -t 9p -o trans=virtio <mount_tag> /mnt/9

where mount_tag is the tag associated by the server to each of the exported mount points. Each 9P export is seen by the client as a virtio device with an associated "mount_tag" property. Available mount tags can be seen by reading /sys/bus/virtio/drivers/9pnet_virtio/virtio<n>/mount_tag files.

OPTIONS
=======

  trans=name    select an alternative transport. Valid options are currently:
                  unix   - specifying a named pipe mount point
                  tcp    - specifying a normal TCP/IP connection
                  fd     - use passed file descriptors for connection
                           (see rfdno and wfdno)
                  virtio - connect to the next virtio channel available
                           (from QEMU with trans_virtio module)
                  rdma   - connect to a specified RDMA channel

  uname=name    user name to attempt mount as on the remote server. The
                server may override or ignore this value. Certain user
                names may require authentication.

  aname=name    aname specifies the file tree to access when the server is
                offering several exported file systems.

  cache=mode    specifies a caching policy. By default, no caches are used.
                  loose   = no attempts are made at consistency, intended
                            for exclusive, read-only mounts
                  fscache = use FS-Cache for a persistent, read-only cache
                            backend

  debug=n       specifies debug level. The debug level is a bitmask.
                  0x01  = display verbose error messages
                  0x02  = developer debug (DEBUG_CURRENT)
                  0x04  = display 9p trace
                  0x08  = display VFS trace
                  0x10  = display marshalling debug
                  0x20  = display RPC debug
                  0x40  = display transport debug
                  0x80  = display allocation debug
                  0x100 = display protocol message debug
                  0x200 = display Fid debug
                  0x400 = display packet debug
                  0x800 = display fscache tracing debug

  rfdno=n       the file descriptor for reading with trans=fd

  wfdno=n       the file descriptor for writing with trans=fd

  msize=n       the number of bytes to use for 9p packet payload

  port=n        port to connect to on the remote server

  noextend      force legacy mode (no 9p2000.u or 9p2000.L semantics)

  version=name  select 9P protocol version. Valid options are:
                  9p2000   - legacy mode (same as noextend)
                  9p2000.u - use 9P2000.u protocol
                  9p2000.L - use 9P2000.L protocol

  dfltuid       attempt to mount as a particular uid

  dfltgid       attempt to mount with a particular gid

  afid          security channel - used by Plan 9 authentication protocols

  nodevmap      do not map special files - represent them as normal files.
                This can be used to share devices/named pipes/sockets
                between hosts. This functionality will be expanded in later
                versions.

  access        there are four access modes.
                  user   = if a user tries to access a file on a v9fs
                           filesystem for the first time, v9fs sends an
                           attach command (Tattach) for that user. This is
                           the default mode.
                  <uid>  = allows only the user with uid=<uid> to access
                           the files on the mounted filesystem
                  any    = v9fs does a single attach and performs all
                           operations as one user
                  client = ACL based access check on the 9p client side
                           for access validation

  cachetag      cache tag to use the specified persistent cache. Cache tags
                for existing cache sessions can be listed at
                /sys/fs/9p/caches. (applies only to cache=fscache)

RESOURCES
=========

This software was originally developed by Ron Minnich and Maya Gokhale. Additional development by Greg Watson and most recently Eric Van Hensbergen, Latchesar Ionkov and Russ Cox.
Protocol specifications are maintained on github:

  http://ericvh.github.com/9p-rfc/

9p client and server implementations are listed on:

  http://9p.cat-v.org/implementations

A 9p2000.L server is being developed by LLNL and can be found at:

  http://code.google.com/p/diod/

There are user and developer mailing lists available through the v9fs project on sourceforge (http://sourceforge.net/projects/v9fs). News and other information is maintained on a wiki (http://sf.net/apps/mediawiki/v9fs/index.php). Bug reports may be issued through the kernel.org bugzilla (http://bugzilla.kernel.org).

For more information on the Plan 9 Operating System check out:

  http://plan9.bell-labs.com/plan9

For information on Plan 9 from User Space (Plan 9 applications and libraries ported to Linux/BSD/OSX/etc) check out:

  http://swtch.com/plan9

FURTHER READING:
================

  * XCPU & Clustering
    http://xcpu.org/papers/xcpu-talk.pdf

  * KVMFS: control file system for KVM
    http://xcpu.org/papers/kvmfs.pdf

  * CellFS: A New Programming Model for the Cell BE
    http://xcpu.org/papers/cellfs-talk.pdf

  * PROSE I/O: Using 9p to enable Application Partitions
    http://plan9.escet.urjc.es/iwp9/cready/PROSE_iwp9_2006.pdf

  * VirtFS: A Virtualization Aware File System pass-through
    http://goo.gl/3WPDg

STATUS
======

9p2000.L is feature complete as of 2.6.38.

PLEASE USE THE KERNEL BUGZILLA TO REPORT PROBLEMS. (http://bugzilla.kernel.org)