NFSv4                                                         S. Shepler
Internet-Draft                                                 M. Eisler
Intended status: Standards Track                               D. Noveck
Expires: September 5, 2007                                       Editors
                                                           March 4, 2007

                         NFSv4 Minor Version 1
                 draft-ietf-nfsv4-minorversion1-10.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on September 5, 2007.

Copyright Notice

   Copyright (C) The IETF Trust (2007).

Abstract

   This Internet-Draft describes NFSv4 minor version one, including
   features retained from the base protocol and protocol extensions made
   subsequently. The current draft includes description of the major
   extensions, Sessions, Directory Delegations, and parallel NFS (pNFS).
   This Internet-Draft is an active work item of the NFSv4 working
   group. Active and resolved issues may be found in the issue tracker
   at: http://www.nfsv4-editor.org/cgi-bin/roundup/nfsv4. New issues
   related to this document should be raised with the NFSv4 Working
   Group nfsv4@ietf.org and logged in the issue tracker.

Shepler, et al.         Expires September 5, 2007               [Page 1]
Internet-Draft            NFSv4 Minor Version 1               March 2007
Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [1].

Table of Contents

   1. Introduction
     1.1. The NFSv4.1 Protocol
     1.2. NFS Version 4 Goals
     1.3. Minor Version 1 Goals
     1.4. Overview of NFS version 4.1 Features
       1.4.1. RPC and Security
       1.4.2. Protocol Structure
       1.4.3. File System Model
       1.4.4. Locking Facilities
     1.5. General Definitions
     1.6. Differences from NFSv4.0
   2. Core Infrastructure
     2.1. Introduction
     2.2. RPC and XDR
       2.2.1. RPC-based Security
     2.3. COMPOUND and CB_COMPOUND
     2.4. Client Identifiers and Client Owners
       2.4.1. Server Release of Client ID
       2.4.2. Handling Client Owner Conflicts
     2.5. Server Owners
     2.6. Security Service Negotiation
       2.6.1. NFSv4 Security Tuples
       2.6.2. SECINFO and SECINFO_NO_NAME
       2.6.3. Security Error
     2.7. Minor Versioning
     2.8. Non-RPC-based Security Services
       2.8.1. Authorization
       2.8.2. Auditing
       2.8.3. Intrusion Detection
     2.9. Transport Layers
       2.9.1. Required and Recommended Properties of Transports
       2.9.2. Client and Server Transport Behavior
       2.9.3. Ports
     2.10. Session
       2.10.1. Motivation and Overview
       2.10.2. NFSv4 Integration
       2.10.3. Channels
       2.10.4. Exactly Once Semantics
       2.10.5. RDMA Considerations
       2.10.6. Sessions Security
       2.10.7. Session Mechanics - Steady State
       2.10.8. Session Mechanics - Recovery
       2.10.9. Parallel NFS and Sessions
   3. Protocol Data Types
     3.1. Basic Data Types
     3.2. Structured Data Types
   4. Filehandles
     4.1. Obtaining the First Filehandle
       4.1.1. Root Filehandle
       4.1.2. Public Filehandle
     4.2. Filehandle Types
       4.2.1. General Properties of a Filehandle
       4.2.2. Persistent Filehandle
       4.2.3. Volatile Filehandle
     4.3. One Method of Constructing a Volatile Filehandle
     4.4. Client Recovery from Filehandle Expiration
   5. File Attributes
     5.1. Mandatory Attributes
     5.2. Recommended Attributes
     5.3. Named Attributes
     5.4. Classification of Attributes
     5.5. Mandatory Attributes - Definitions
     5.6. Recommended Attributes - Definitions
     5.7. Time Access
     5.8. Interpreting owner and owner_group
     5.9. Character Case Attributes
     5.10. Quota Attributes
     5.11. mounted_on_fileid
     5.12. Directory Notification Attributes
       5.12.1. dir_notif_delay
       5.12.2. dirent_notif_delay
     5.13. PNFS Attributes
       5.13.1. fs_layout_type
       5.13.2. layout_alignment
       5.13.3. layout_blksize
       5.13.4. layout_hint
       5.13.5. layout_type
       5.13.6. mdsthreshold
     5.14. Retention Attributes
   6. Access Control Lists
     6.1. Goals
     6.2. File Attributes Discussion
       6.2.1. ACL Attribute
       6.2.2. dacl and sacl Attributes
       6.2.3. mode Attribute
       6.2.4. mode_set_masked Attribute
     6.3. Common Methods
       6.3.1. Interpreting an ACL
       6.3.2. Computing a Mode Attribute from an ACL
     6.4. Requirements
       6.4.1. Setting the mode and/or ACL Attributes
       6.4.2. Retrieving the mode and/or ACL Attributes
       6.4.3. Creating New Objects
   7. Single-server Name Space
     7.1. Server Exports
     7.2. Browsing Exports
     7.3. Server Pseudo File System
     7.4. Multiple Roots
     7.5. Filehandle Volatility
     7.6. Exported Root
     7.7. Mount Point Crossing
     7.8. Security Policy and Name Space Presentation
   8. File Locking and Share Reservations
     8.1. Locking
       8.1.1. Client and Session ID
       8.1.2. State-owner Definition
       8.1.3. Stateid Definition
       8.1.4. Use of the Stateid and Locking
     8.2. Lock Ranges
     8.3. Upgrading and Downgrading Locks
     8.4. Blocking Locks
     8.5. Lease Renewal
     8.6. Crash Recovery
       8.6.1. Client Failure and Recovery
       8.6.2. Server Failure and Recovery
       8.6.3. Network Partitions and Recovery
     8.7. Server Revocation of Locks
     8.8. Share Reservations
     8.9. OPEN/CLOSE Operations
     8.10. Open Upgrade and Downgrade
     8.11. Short and Long Leases
     8.12. Clocks, Propagation Delay, and Calculating Lease Expiration
     8.13. Vestigial Locking Infrastructure From V4.0
   9. Client-Side Caching
     9.1. Performance Challenges for Client-Side Caching
     9.2. Delegation and Callbacks
       9.2.1. Delegation Recovery
     9.3. Data Caching
       9.3.1. Data Caching and OPENs
       9.3.2. Data Caching and File Locking
       9.3.3. Data Caching and Mandatory File Locking
       9.3.4. Data Caching and File Identity
     9.4. Open Delegation
       9.4.1. Open Delegation and Data Caching
       9.4.2. Open Delegation and File Locks
       9.4.3. Handling of CB_GETATTR
       9.4.4. Recall of Open Delegation
       9.4.5. Clients that Fail to Honor Delegation Recalls
       9.4.6. Delegation Revocation
     9.5. Data Caching and Revocation
       9.5.1. Revocation Recovery for Write Open Delegation
     9.6. Attribute Caching
     9.7. Data and Metadata Caching and Memory Mapped Files
     9.8. Name Caching
     9.9. Directory Caching
   10. Multi-Server Name Space
     10.1. Location attributes
     10.2. File System Presence or Absence
     10.3. Getting Attributes for an Absent File System
       10.3.1. GETATTR Within an Absent File System
       10.3.2. READDIR and Absent File Systems
     10.4. Uses of Location Information
       10.4.1. File System Replication
       10.4.2. File System Migration
       10.4.3. Referrals
     10.5. Additional Client-side Considerations
     10.6. Effecting File System Transitions
       10.6.1. File System Transitions and Simultaneous Access
       10.6.2. Simultaneous Use and Transparent Transitions
       10.6.3. Filehandles and File System Transitions
       10.6.4. Fileid's and File System Transitions
       10.6.5. Fsids and File System Transitions
       10.6.6. The Change Attribute and File System Transitions
       10.6.7. Lock State and File System Transitions
       10.6.8. Write Verifiers and File System Transitions
     10.7. Effecting File System Referrals
       10.7.1. Referral Example (LOOKUP)
       10.7.2. Referral Example (READDIR)
     10.8. The Attribute fs_absent
     10.9. The Attribute fs_locations
     10.10. The Attribute fs_locations_info
       10.10.1. The fs_locations_server4 Structure
       10.10.2. The fs_locations_info4 Structure
       10.10.3. The fs_locations_item4 Structure
     10.11. The Attribute fs_status
   11. Directory Delegations
     11.1. Introduction to Directory Delegations
     11.2. Directory Delegation Design
     11.3. Attributes in Support of Directory Notifications
     11.4. Delegation Recall
     11.5. Directory Delegation Recovery
   12. Parallel NFS (pNFS)
     12.1. Introduction
     12.2. PNFS Definitions
       12.2.1. Metadata
       12.2.2. Metadata Server
       12.2.3. Client
       12.2.4. Storage Device
       12.2.5. Data Server
       12.2.6. Storage Protocol or Data Protocol
       12.2.7. Control Protocol
       12.2.8. Layout
       12.2.9. Layout Types
       12.2.10. Layout Iomode
       12.2.11. Layout Segment
       12.2.12. Device IDs
     12.3. PNFS Operations
     12.4. PNFS Attributes
     12.5. Layout Semantics
       12.5.1. Guarantees Provided by Layouts
       12.5.2. Getting a Layout
       12.5.3. Committing a Layout
       12.5.4. Recalling a Layout
       12.5.5. Metadata Server Write Propagation
     12.6. PNFS Mechanics
     12.7. Recovery
       12.7.1. Client Recovery
       12.7.2. Dealing with Lease Expiration on the Client
       12.7.3. Dealing with Loss of Layout State on the Metadata Server
       12.7.4. Recovery from Metadata Server Restart
       12.7.5. Operations During Metadata Server Grace Period
       12.7.6. Storage Device Recovery
     12.8. Metadata and Storage Device Roles
     12.9. Security Considerations
   13. PNFS: NFSv4.1 File Layout Type
     13.1. Session Considerations
     13.2. File Layout Definitions
     13.3. File Layout Data Types
     13.4. Interpreting the File Layout
     13.5. Sparse and Dense Stripe Unit Packing
     13.6. Data Server Multipathing
     13.7. Operations Issued to NFSv4.1 Data Servers
     13.8. COMMIT Through Metadata Server
     13.9. Global Stateid Requirements
     13.10. The Layout Iomode
     13.11. Data Server State Propagation
       13.11.1. Lock State Propagation
       13.11.2. Open-mode Validation
       13.11.3. File Attributes
     13.12. Data Server Component File Size
     13.13. Recovery Considerations
     13.14. Security Considerations for the File Layout Type
   14. Internationalization
     14.1. Stringprep profile for the utf8str_cs type
     14.2. Stringprep profile for the utf8str_cis type
     14.3. Stringprep profile for the utf8str_mixed type
     14.4. UTF-8 Related Errors
   15. Error Values
     15.1. Error Definitions
     15.2. Operations and their valid errors
     15.3. Callback operations and their valid errors
     15.4. Errors and the operations that use them
   16. NFS version 4.1 Procedures
     16.1. Procedure 0: NULL - No Operation
     16.2. Procedure 1: COMPOUND - Compound Operations
   17. NFS version 4.1 Operations
     17.1. Operation 3: ACCESS - Check Access Rights
     17.2. Operation 4: CLOSE - Close File
     17.3. Operation 5: COMMIT - Commit Cached Data
     17.4. Operation 6: CREATE - Create a Non-Regular File Object
     17.5. Operation 7: DELEGPURGE - Purge Delegations Awaiting Recovery
     17.6. Operation 8: DELEGRETURN - Return Delegation
     17.7. Operation 9: GETATTR - Get Attributes
     17.8. Operation 10: GETFH - Get Current Filehandle
     17.9. Operation 11: LINK - Create Link to a File
     17.10. Operation 12: LOCK - Create Lock
     17.11. Operation 13: LOCKT - Test For Lock
     17.12. Operation 14: LOCKU - Unlock File
     17.13. Operation 15: LOOKUP - Lookup Filename
     17.14. Operation 16: LOOKUPP - Lookup Parent Directory
     17.15. Operation 17: NVERIFY - Verify Difference in Attributes
     17.16. Operation 18: OPEN - Open a Regular File
     17.17. Operation 19: OPENATTR - Open Named Attribute Directory
     17.18. Operation 21: OPEN_DOWNGRADE - Reduce Open File Access
     17.19. Operation 22: PUTFH - Set Current Filehandle
     17.20. Operation 23: PUTPUBFH - Set Public Filehandle
     17.21. Operation 24: PUTROOTFH - Set Root Filehandle
     17.22. Operation 25: READ - Read from File
     17.23. Operation 26: READDIR - Read Directory
     17.24. Operation 27: READLINK - Read Symbolic Link
     17.25. Operation 28: REMOVE - Remove File System Object
     17.26. Operation 29: RENAME - Rename Directory Entry
     17.27. Operation 31: RESTOREFH - Restore Saved Filehandle
     17.28. Operation 32: SAVEFH - Save Current Filehandle
     17.29. Operation 33: SECINFO - Obtain Available Security
     17.30. Operation 34: SETATTR - Set Attributes
     17.31. Operation 37: VERIFY - Verify Same Attributes
     17.32. Operation 38: WRITE - Write to File
     17.33. Operation 40: BACKCHANNEL_CTL - Backchannel control
     17.34. Operation 41: BIND_CONN_TO_SESSION
     17.35. Operation 42: EXCHANGE_ID - Instantiate Client ID
     17.36. Operation 43: CREATE_SESSION - Create New Session and Confirm Client ID
     17.37. Operation 44: DESTROY_SESSION - Destroy existing session
     17.38. Operation 45: FREE_STATEID - Free stateid with no locks
     17.39. Operation 46: GET_DIR_DELEGATION - Get a directory delegation
     17.40. Operation 47: GETDEVICEINFO - Get Device Information
     17.41. Operation 48: GETDEVICELIST
     17.42. Operation 49: LAYOUTCOMMIT - Commit writes made using a layout
     17.43. Operation 50: LAYOUTGET - Get Layout Information
     17.44. Operation 51: LAYOUTRETURN - Release Layout Information
     17.45. Operation 52: SECINFO_NO_NAME - Get Security on Unnamed Object
     17.46. Operation 53: SEQUENCE - Supply per-procedure sequencing and control
     17.47. Operation 54: SET_SSV
     17.48. Operation 55: TEST_STATEID - Test stateids for validity
     17.49. Operation 56: WANT_DELEGATION
     17.50. Operation 57: DESTROY_CLIENTID - Destroy existing client ID
     17.51. Operation 10044: ILLEGAL - Illegal operation
   18. NFS version 4.1 Callback Procedures
     18.1. Procedure 0: CB_NULL - No Operation
     18.2. Procedure 1: CB_COMPOUND - Compound Operations
   19. NFS version 4.1 Callback Operations
     19.1. Operation 3: CB_GETATTR - Get Attributes
     19.2. Operation 4: CB_RECALL - Recall an Open Delegation
     19.3. Operation 5: CB_LAYOUTRECALL
     19.4. Operation 6: CB_NOTIFY - Notify directory changes
     19.5. Operation 7: CB_PUSH_DELEG
     19.6. Operation 8: CB_RECALL_ANY - Keep any N delegations
     19.7. Operation 9: CB_RECALLABLE_OBJ_AVAIL
     19.8. Operation 10: CB_RECALL_SLOT - change flow control limits
     19.9. Operation 11: CB_SEQUENCE - Supply callback channel sequencing and control
     19.10. Operation 12: CB_WANTS_CANCELLED
     19.11. Operation 13: CB_NOTIFY_LOCK - Notify of possible lock availability
     19.12. Operation 10044: CB_ILLEGAL - Illegal Callback Operation
   20. Security Considerations
   21. IANA Considerations
     21.1. Defining new layout types
   22. References
     22.1. Normative References
     22.2. Informative References
   Appendix A. Acknowledgments
   Authors' Addresses
   Intellectual Property and Copyright Statements

1. Introduction

1.1. The NFSv4.1 Protocol

   The NFSv4.1 protocol is a minor version of the NFSv4 protocol
   described in [2]. It generally follows the guidelines for the minor
   versioning model laid out in Section 10 of RFC 3530. However, it
   diverges from guidelines 11 ("a client and server that supports minor
   version X must support minor versions 0 through X-1") and 12 ("no
   features may be introduced as mandatory in a minor version"). These
   divergences are due to the introduction of the sessions model for
   managing non-idempotent operations and the RECLAIM_COMPLETE
   operation. These two new features are infrastructural in nature and
   simplify implementation of existing and other new features. Making
   them optional would add undue complexity to protocol definition and
   implementation.
   NFSv4.1 accordingly updates the Minor Versioning guidelines
   (Section 2.7).

   NFSv4.1, as a minor version, is consistent with the overall goals for
   NFS Version 4, but extends the protocol so as to better meet those
   goals, based on experiences with NFSv4.0. In addition, NFSv4.1 has
   adopted some additional goals, which motivate some of the major
   extensions in minor version 1.

1.2. NFS Version 4 Goals

   The NFS version 4 protocol is a further revision of the NFS protocol
   defined already by versions 2 [17] and 3 [18]. It retains the
   essential characteristics of previous versions: design for easy
   recovery; independence of transport protocols, operating systems,
   and file systems; simplicity; and good performance. The NFS version 4
   revision has the following goals:

   o  Improved access and good performance on the Internet. The protocol
      is designed to transit firewalls easily, perform well where
      latency is high and bandwidth is low, and scale to very large
      numbers of clients per server.

   o  Strong security with negotiation built into the protocol. The
      protocol builds on the work of the ONCRPC working group in
      supporting the RPCSEC_GSS protocol. Additionally, the NFS version
      4 protocol provides a mechanism that allows clients and servers to
      negotiate security, and requires clients and servers to support a
      minimal set of security schemes.

   o  Good cross-platform interoperability. The protocol features a file
      system model that provides a useful, common set of features that
      does not unduly favor one file system or operating system over
      another.

   o  Designed for protocol extensions. The protocol is designed to
      accept standard extensions within a framework that enables and
      encourages backward compatibility.

1.3. Minor Version 1 Goals

   Minor version one has the following goals, within the framework
   established by the overall version 4 goals.
   o  To correct significant structural weaknesses and oversights
      discovered in the base protocol.

   o  To add clarity and specificity to areas left unaddressed or not
      addressed in sufficient detail in the base protocol.

   o  To add specific features based on experience with the existing
      protocol and recent industry developments.

   o  To provide protocol support to take advantage of clustered server
      deployments, including the ability to provide scalable parallel
      access to files distributed among multiple servers.

1.4. Overview of NFS version 4.1 Features

   To provide a reasonable context for the reader, the major features of
   the NFS version 4.1 protocol will be reviewed in brief. This will be
   done to provide an appropriate context for both the reader who is
   familiar with the previous versions of the NFS protocol and the
   reader who is new to the NFS protocols. For the reader new to the NFS
   protocols, some fundamental knowledge is still expected. The reader
   should be familiar with the XDR and RPC protocols as described in [3]
   and [4]. A basic knowledge of file systems and distributed file
   systems is expected as well.

   This description of version 4.1 features will not distinguish those
   added in minor version one from those present in the base protocol
   but will treat minor version 1 as a unified whole. See Section 1.6
   for a description of the differences between the two minor versions.

1.4.1. RPC and Security

   As with previous versions of NFS, the External Data Representation
   (XDR) and Remote Procedure Call (RPC) mechanisms used for the NFS
   version 4.1 protocol are those defined in [3] and [4]. To meet
   end-to-end security requirements, the RPCSEC_GSS framework [5] will
   be used to extend the basic RPC security. With the use of RPCSEC_GSS,
   various mechanisms can be provided to offer authentication,
   integrity, and privacy to the NFS version 4 protocol.
   Kerberos V5 will be used as described in [6] to provide one security
   framework. The LIPKEY and SPKM-3 GSS-API mechanisms described in [7]
   will be used to provide for the use of user passwords and
   client/server public key certificates by the NFS version 4 protocol.
   With the use of RPCSEC_GSS, other mechanisms may also be specified
   and used for NFS version 4.1 security.

   To enable in-band security negotiation, the NFS version 4.1 protocol
   has operations which provide the client with a method of querying the
   server about its policies regarding which security mechanisms must be
   used for access to the server's file system resources. With this, the
   client can securely match the security mechanism that meets the
   policies specified at both the client and server.

1.4.2. Protocol Structure

1.4.2.1. Core Protocol

   Unlike NFS Versions 2 and 3, which used a series of ancillary
   protocols (e.g., NLM, NSM, MOUNT), all minor versions of NFS version
   4 use only a single RPC protocol to make requests of the server.
   Facilities that had been separate protocols, such as locking, are now
   integrated within a single unified protocol.

1.4.2.2. Parallel Access

   Minor version one supports high-performance data access to a
   clustered server implementation by enabling a separation of metadata
   access and data access, with the latter done to multiple servers in
   parallel. Such parallel data access is controlled by recallable
   objects known as "layouts", which are integrated into the protocol
   locking model. Clients direct requests for data access to a set of
   data servers specified by the layout via a data storage protocol
   which may be NFSv4.1 or may be another protocol.

1.4.3. File System Model

   The general file system model used for the NFS version 4.1 protocol
   is the same as that of previous versions.
   The server file system is hierarchical, with the regular files
   contained within being treated as opaque octet streams. In a slight
   departure, file and directory names are encoded with UTF-8 to deal
   with the basics of internationalization.

   The NFS version 4.1 protocol does not require a separate protocol to
   provide for the initial mapping between path name and filehandle. All
   file systems exported by a server are presented as a tree so that all
   file systems are reachable from a special per-server global root
   filehandle. This allows LOOKUP operations to be used to perform
   functions previously provided by the MOUNT protocol. The server
   provides any necessary pseudo file systems to bridge the unexported
   gaps between exported file systems.

1.4.3.1. Filehandles

   As in previous versions of the NFS protocol, opaque filehandles are
   used to identify individual files and directories. Lookup-type and
   create operations are used to go from file and directory names to the
   filehandle, which is then used to identify the object to subsequent
   operations.

   The NFS version 4.1 protocol provides support for persistent
   filehandles, guaranteed to be valid for the lifetime of the file
   system object designated. In addition, it allows servers to provide
   filehandles with more limited validity guarantees, called volatile
   filehandles.

1.4.3.2. File Attributes

   The NFS version 4.1 protocol has a rich and extensible attribute
   structure. Only a small set of the defined attributes are mandatory
   and must be provided by all server implementations. The other
   attributes are known as "recommended" attributes.

   One significant recommended file attribute is the Access Control List
   (ACL) attribute. This attribute provides for directory and file
   access control beyond the model used in NFS Versions 2 and 3. The ACL
   definition allows specific sets of permissions to be specified for
   individual users and groups.
In addition, ACL inheritance allows propagation of access permissions and restrictions down a directory tree as file system objects are created.

One other type of attribute is the named attribute. A named attribute is an opaque octet stream that is associated with a directory or file and referred to by a string name. Named attributes are meant to be used by client applications as a method to associate application-specific data with a regular file or directory.

1.4.3.3. Multi-server Namespace

NFS Version 4.1 contains a number of features to allow implementation of namespaces that cross server boundaries and that allow and facilitate a non-disruptive transfer of support for individual file systems between servers. They are all based upon attributes that allow one file system to specify alternate or new locations for that file system.

These attributes may be used together with the concept of absent file systems, which provide specifications for additional locations but no actual file system content. This allows a number of important facilities:

o Location attributes may be used with absent file systems to implement referrals whereby one server may direct the client to a file system provided by another server. This allows extensive multi-server namespaces to be constructed.

o Location attributes may be provided for present file systems to provide the locations of alternate file system instances or replicas to be used in the event that the current file system instance becomes unavailable.

o Location attributes may be provided when a previously present file system becomes absent. This allows non-disruptive migration of file systems to alternate servers.

1.4.4. Locking Facilities

As mentioned previously, NFSv4.1 is a single protocol which includes locking facilities.
These locking facilities include support for many types of locks, including a number of sorts of recallable locks. Recallable locks such as delegations allow the client to be assured that certain events will not occur so long as that lock is held. When circumstances change, the lock is recalled via a callback request. The assurances provided by delegations allow more extensive caching to be done safely when circumstances allow it.

The types of locks are:

o Share reservations as established by OPEN operations.

o Byte-range locks.

o File delegations, which are recallable locks that assure the holder that inconsistent opens and file changes cannot occur so long as the delegation is held.

o Directory delegations, which are recallable delegations that assure the holder that inconsistent directory modifications cannot occur so long as the delegation is held.

o Layouts, which are recallable objects that assure the holder that direct access to the file data may be performed by the client and that no change to the data's location inconsistent with that access may be made so long as the layout is held.

All locks for a given client are tied together under a single client-wide lease. All requests made on sessions associated with the client renew that lease. When leases are not promptly renewed, locks are subject to revocation. In the event of server reinitialization, clients have the opportunity to safely reclaim their locks within a special grace period.

1.5. General Definitions

The following definitions are provided for the purpose of providing an appropriate context for the reader.

Client  The "client" is the entity that accesses the NFS server's resources. The client may be an application which contains the logic to access the NFS server directly. The client may also be the traditional operating system client that provides remote file system services for a set of applications.
A client is uniquely identified by a Client Owner.

In the case of file locking, the client is the entity that maintains a set of locks on behalf of one or more applications. This client is responsible for crash or failure recovery for those locks it manages.

Note that multiple clients may share the same transport and connection, and multiple clients may exist on the same network node.

Client ID  A 64-bit quantity used as a unique, short-hand reference to a client-supplied verifier and client owner. The server is responsible for supplying the client ID.

Client Owner  The client owner is a unique string, opaque to the server, which identifies a client. Multiple network connections and source network addresses originating those connections may share a client owner. The server is expected to treat requests from connections with the same client owner as coming from the same client.

Lease  An interval of time defined by the server for which the client is irrevocably granted a lock. At the end of a lease period the lock may be revoked if the lease has not been extended. The lock must be revoked if a conflicting lock has been granted after the lease interval.

All leases granted by a server have the same fixed interval. Note that the fixed interval was chosen to alleviate the expense a server would have in maintaining state about variable length leases across server failures.

Lock  The term "lock" is used to refer to any of record (octet-range) locks, share reservations, delegations or layouts unless specifically stated otherwise.

Server  The "Server" is the entity responsible for coordinating client access to a set of file systems. A server can span multiple network addresses.
In NFSv4.1, a server is a two-tiered entity, which allows servers consisting of multiple components the flexibility to tightly or loosely couple their components without requiring tight synchronization among the components. Every server has a "Server Owner" which reflects the two tiers of a server entity.

Server Owner  The "Server Owner" identifies the server to the client. The server owner consists of a major and minor identifier. When the client has two connections each to a peer with the same major and minor identifier, the client assumes both peers are the same server (the server namespace is the same via each connection), and further assumes session and lock state is sharable across both connections. When each peer has the same major identifier but a different minor identifier, the client assumes both peers can serve the same namespace, but session and lock state is not sharable across both connections.

Stable Storage  NFS version 4 servers must be able to recover without data loss from multiple power failures (including cascading power failures, that is, several power failures in quick succession), operating system failures, and hardware failure of components other than the storage medium itself (for example, disk, nonvolatile RAM).

Some examples of stable storage that are allowable for an NFS server include:

1. Media commit of data, that is, the modified data has been successfully written to the disk media, for example, the disk platter.

2. An immediate reply disk drive with battery-backed on-drive intermediate storage or uninterruptible power system (UPS).

3. Server commit of data with battery-backed intermediate storage and recovery software.

4. Cache commit with uninterruptible power system (UPS) and recovery software.
Stateid  A 128-bit quantity returned by a server that uniquely defines the open and locking state provided by the server for a specific open or lock owner for a specific file and type of lock.

Verifier  A 64-bit quantity generated by the client that the server can use to determine if the client has restarted and lost all previous lock state.

1.6. Differences from NFSv4.0

The following summarizes the differences between minor version one and the base protocol:

o Implementation of the sessions model.

o Support for parallel access to data.

o Addition of the RECLAIM_COMPLETE operation to better structure the lock reclamation process.

o Support for delegations on directories and other file types in addition to regular files.

o Operations to re-obtain a delegation.

o Support for client and server implementation IDs.

2. Core Infrastructure

2.1. Introduction

NFS version 4.1 (NFSv4.1) relies on core infrastructure common to nearly every operation. This core infrastructure is described in the remainder of this section.

2.2. RPC and XDR

The NFS version 4.1 (NFSv4.1) protocol is a Remote Procedure Call (RPC) application that uses RPC version 2 and the corresponding eXternal Data Representation (XDR) as defined in RFC1831 [4] and RFC4506 [3].

2.2.1. RPC-based Security

Previous NFS versions have been thought of as having a host-based authentication model, where the NFS server authenticates the NFS client, and trusts the client to authenticate all users. Actually, NFS has always depended on RPC for authentication. The first form of RPC authentication required a host-based authentication approach. NFSv4 also depends on RPC for basic security services, and mandates RPC support for a user-based authentication model. The user-based authentication model has user principals authenticated by a server, and in turn the server authenticated by user principals.
RPC provides some basic security services which are used by NFSv4.

2.2.1.1. RPC Security Flavors

As described in section 7.2 "Authentication" of [4], RPC security is encapsulated in the RPC header, via a security or authentication flavor, and information specific to the specification of the security flavor. Every RPC header conveys information used to identify and authenticate a client and server. As discussed in Section 2.2.1.1.1, some security flavors provide additional security services.

NFSv4 clients and servers MUST implement RPCSEC_GSS. (This requirement to implement is not a requirement to use.) Other flavors, such as AUTH_NONE and AUTH_SYS, MAY be implemented as well.

2.2.1.1.1. RPCSEC_GSS and Security Services

RPCSEC_GSS ([5]) uses the functionality of GSS-API (RFC2743 [8]). This allows for the use of various security mechanisms by the RPC layer without the additional implementation overhead of adding RPC security flavors.

2.2.1.1.1.1. Identification, Authentication, Integrity, Privacy

Via the GSS-API, RPCSEC_GSS can be used to identify and authenticate users on clients to servers, and servers to users. It can also perform integrity checking on the entire RPC message, including the RPC header, and the arguments or results. Finally, privacy, usually via encryption, is a service available with RPCSEC_GSS. Privacy is performed on the arguments and results. Note that if privacy is selected, integrity, authentication, and identification are enabled. If privacy is not selected, but integrity is selected, authentication and identification are enabled. If integrity and privacy are not selected, but authentication is enabled, identification is enabled. RPCSEC_GSS does not provide identification as a separate service.
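The service dependencies above (privacy implies integrity, which implies authentication, which implies identification) can be sketched as a small lookup. This is only an illustration: the GssService enum and the enabled_properties function are hypothetical names that appear in neither this document nor any RPCSEC_GSS library.

```python
from enum import Enum


class GssService(Enum):
    """The three RPCSEC_GSS services named in the text (labels are illustrative)."""
    AUTHENTICATION = "authentication"
    INTEGRITY = "integrity"
    PRIVACY = "privacy"


def enabled_properties(service):
    """Return the set of security properties in effect for a selected service."""
    # Identification is never a separate service; it always accompanies
    # authentication, which every RPCSEC_GSS service includes.
    props = {"identification", "authentication"}
    if service in (GssService.INTEGRITY, GssService.PRIVACY):
        props.add("integrity")
    if service is GssService.PRIVACY:
        props.add("privacy")
    return props
```

Selecting GssService.PRIVACY therefore yields all four properties, while GssService.AUTHENTICATION yields only identification and authentication.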
Although GSS-API has an authentication service distinct from its privacy and integrity services, GSS-API's authentication service is not used for RPCSEC_GSS's authentication service. Instead, each RPC request and response header is integrity protected with the GSS-API integrity service, and this allows RPCSEC_GSS to offer per-RPC authentication and identity. See [5] for more information.

NFSv4 clients and servers MUST support RPCSEC_GSS's integrity and authentication service. NFSv4.1 servers MUST support RPCSEC_GSS's privacy service.

2.2.1.1.1.2. Security mechanisms for NFS version 4

RPCSEC_GSS, via GSS-API, normalizes access to mechanisms that provide security services. Therefore NFSv4 clients and servers MUST support three security mechanisms: Kerberos V5, SPKM-3, and LIPKEY.

The use of RPCSEC_GSS requires selection of: mechanism, quality of protection (QOP), and service (authentication, integrity, privacy). For the mandated security mechanisms, NFSv4 specifies that a QOP of zero (0) is used, leaving it up to the mechanism or the mechanism's configuration to use an appropriate level of protection that QOP zero maps to. Each mandated mechanism specifies a minimum set of cryptographic algorithms for implementing integrity and privacy. NFSv4 clients and servers MUST be implemented on operating environments that comply with the mandatory cryptographic algorithms of each mandated mechanism.

2.2.1.1.1.2.1. Kerberos V5

The Kerberos V5 GSS-API mechanism as described in RFC1964 [6] ( [[Comment.1: need new Kerberos RFC]] ) MUST be implemented with the RPCSEC_GSS services as specified in the following table:

   column descriptions:

   1 == number of pseudo flavor
   2 == name of pseudo flavor
   3 == mechanism's OID
   4 == RPCSEC_GSS service
   5 == NFSv4.1 clients MUST support
   6 == NFSv4.1 servers MUST support

   1      2        3                    4                     5   6
   ------------------------------------------------------------------
   390003 krb5     1.2.840.113554.1.2.2 rpc_gss_svc_none      yes yes
   390004 krb5i    1.2.840.113554.1.2.2 rpc_gss_svc_integrity yes yes
   390005 krb5p    1.2.840.113554.1.2.2 rpc_gss_svc_privacy   no  yes

Note that the number and name of the pseudo flavor is presented here as a mapping aid to the implementor. Because the NFSv4 protocol includes a method to negotiate security and it understands the GSS-API mechanism, the pseudo flavor is not needed. The pseudo flavor is needed for NFS version 3 since the security negotiation is done via the MOUNT protocol as described in [19].

2.2.1.1.1.2.2. LIPKEY

The LIPKEY GSS-API mechanism as described in [7] MUST be implemented with the RPCSEC_GSS services as specified in the following table:

   1      2        3                    4                     5   6
   ------------------------------------------------------------------
   390006 lipkey   1.3.6.1.5.5.9        rpc_gss_svc_none      yes yes
   390007 lipkey-i 1.3.6.1.5.5.9        rpc_gss_svc_integrity yes yes
   390008 lipkey-p 1.3.6.1.5.5.9        rpc_gss_svc_privacy   no  yes

2.2.1.1.1.2.3. SPKM-3 as a security triple

The SPKM-3 GSS-API mechanism as described in [7] MUST be implemented with the RPCSEC_GSS services as specified in the following table:

   1      2        3                    4                     5   6
   ------------------------------------------------------------------
   390009 spkm3    1.3.6.1.5.5.1.3      rpc_gss_svc_none      yes yes
   390010 spkm3i   1.3.6.1.5.5.1.3      rpc_gss_svc_integrity yes yes
   390011 spkm3p   1.3.6.1.5.5.1.3      rpc_gss_svc_privacy   no  yes

2.2.1.1.1.3.
GSS Server Principal

Regardless of what security mechanism under RPCSEC_GSS is being used, the NFS server MUST identify itself in GSS-API via a GSS_C_NT_HOSTBASED_SERVICE name type. GSS_C_NT_HOSTBASED_SERVICE names are of the form:

   service@hostname

For NFS, the "service" element is

   nfs

Implementations of security mechanisms will convert nfs@hostname to various different forms. For Kerberos V5, LIPKEY, and SPKM-3, the following form is RECOMMENDED:

   nfs/hostname

2.3. COMPOUND and CB_COMPOUND

A significant departure from the versions of the NFS protocol before version 4 is the introduction of the COMPOUND procedure. For the NFSv4 protocol, in all minor versions, there are exactly two RPC procedures, NULL and COMPOUND. The COMPOUND procedure is defined as a series of individual operations and these operations perform the sorts of functions performed by traditional NFS procedures.

The operations combined within a COMPOUND request are evaluated in order by the server, without any atomicity guarantees. A limited set of facilities exists to pass results from one operation to another. Once an operation returns a failing result, the evaluation ends and the results of all evaluated operations are returned to the client.

With the use of the COMPOUND procedure, the client is able to build simple or complex requests. These COMPOUND requests allow for a reduction in the number of RPCs needed for logical file system operations. For example, multi-component lookup requests can be constructed by combining multiple LOOKUP operations. Those can be further combined with operations such as GETATTR, READDIR, or OPEN plus READ to do more complicated sets of operations without incurring additional latency.

NFSv4 also contains a considerable set of callback operations in which the server makes an RPC directed at the client. Callback RPCs have a similar structure to that of the normal server requests.
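The in-order, stop-at-first-failure evaluation of COMPOUND described above can be sketched with a toy evaluator. This is a simplified model, not a server implementation: evaluate_compound and the toy operations are hypothetical, and a string stands in for the current filehandle as one example of a facility that passes results between operations. The status values follow the protocol's NFS4_OK (0) and NFS4ERR_NOENT (2).

```python
NFS4_OK = 0          # success status
NFS4ERR_NOENT = 2    # "no such file or directory" status


def evaluate_compound(ops, current_fh=""):
    """Evaluate (name, op) pairs in order; stop after the first failure.

    Each op takes the current filehandle and returns (status, new_fh).
    Returns the results of every operation that was actually evaluated.
    """
    results = []
    for name, op in ops:
        status, current_fh = op(current_fh)
        results.append((name, status))
        if status != NFS4_OK:
            break  # later operations are not evaluated; nothing is rolled back
    return results


def putrootfh(_fh):
    """Toy PUTROOTFH: set the current filehandle to the root."""
    return NFS4_OK, "/"


def lookup(name, exists=True):
    """Toy LOOKUP factory: descend into 'name', or fail if it does not exist."""
    def op(fh):
        if not exists:
            return NFS4ERR_NOENT, fh
        return NFS4_OK, fh.rstrip("/") + "/" + name
    return op


def getattr_op(fh):
    """Toy GETATTR: succeeds without changing the current filehandle."""
    return NFS4_OK, fh
```

In a request of PUTROOTFH, LOOKUP, LOOKUP, GETATTR where the second LOOKUP fails, the returned results cover only the first three operations, and the GETATTR is never evaluated.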
For the NFS version 4 protocol callbacks in all minor versions, there are two RPC procedures, NULL and CB_COMPOUND. The CB_COMPOUND procedure is defined in a fashion analogous to that of COMPOUND, with its own set of callback operations.

The addition of new server and callback operations within the COMPOUND and CB_COMPOUND request framework provides a means of extending the protocol in subsequent minor versions.

Except for a small number of operations needed for session creation, server requests and callback requests are performed within the context of a session. Sessions provide a client context for every request and support robust replay protection for non-idempotent requests.

2.4. Client Identifiers and Client Owners

For each operation that obtains or depends on locking state, the specific client must be determinable by the server. In NFSv4, each distinct client instance is represented by a client ID, which is a 64-bit identifier that identifies a specific client at a given time and which is changed whenever the client or the server re-initializes. Client IDs are used to support lock identification and crash recovery.

In NFSv4.1, during steady state operation, the client ID associated with each operation is derived from the session (see Section 2.10) on which the operation is issued. Each session is associated with a specific client ID at session creation and that client ID then becomes the client ID associated with all requests issued using it. Therefore, unlike NFSv4.0, the only NFSv4.1 operations possible before a client ID is established are those directly connected with establishing the client ID.

A sequence of an EXCHANGE_ID operation followed by a CREATE_SESSION operation using that client ID (eir_clientid as returned from EXCHANGE_ID) is required to establish the identification on the server.
Establishment of identification by a new incarnation of the client also has the effect of immediately releasing any locking state that a previous incarnation of that same client might have had on the server. Such released state would include all lock, share reservation, and, where the server is not supporting the CLAIM_DELEGATE_PREV claim type, all delegation state associated with the same client with the same identity. For discussion of delegation state recovery, see Section 9.2.1.

Releasing such state requires that the server be able to determine that one client instance is the successor of another. Where this cannot be done, for any of a number of reasons, the locking state will remain for a time subject to lease expiration (see Section 8.5) and the new client will need to wait for such state to be removed, if it makes conflicting lock requests.

Client identification is encapsulated in the following Client Owner structure:

   struct client_owner4 {
           verifier4       co_verifier;
           opaque          co_ownerid;
   };

The first field, co_verifier, is a client incarnation verifier that is used to detect client reboots. Only if the co_verifier is different from that which the server had previously recorded for the client (as identified by the second field of the structure, co_ownerid) does the server start the process of canceling the client's leased state.

The second field, co_ownerid, is a variable length string that uniquely defines the client so that subsequent instances of the same client bear the same co_ownerid with a different verifier.

There are several considerations for how the client generates the co_ownerid string:

o The string should be unique so that multiple clients do not present the same string. The consequences of two clients presenting the same string range from one client getting an error to one client having its leased state abruptly and unexpectedly canceled.
o The string should be selected so that subsequent incarnations (e.g. reboots) of the same client cause the client to present the same string. The implementor is cautioned against an approach that requires the string to be recorded in a local file because this precludes the use of the implementation in an environment where there is no local disk and all file access is from an NFS version 4 server.

o The string should be the same for every server network address that the client accesses, rather than different for each server network address (note: the precise opposite was advised in RFC3530). This way, if a server has multiple interfaces, the client can trunk traffic over multiple network paths as described in Section 2.10.3.4.1.

o The algorithm for generating the string should not assume that the client's network address will not change, unless the client implementation knows it is using statically assigned network addresses. This includes changes between client incarnations and even changes while the client is still running in its current incarnation. This means that if the client includes just the client's network address in the co_ownerid string, there is a real risk, with dynamic address assignment, that after the client gives up the network address, another client, using a similar algorithm for generating the co_ownerid string, would generate a conflicting co_ownerid string.

Given the above considerations, an example of a well generated co_ownerid string is one that includes:

o If applicable, the client's statically assigned network address.

o Additional information that tends to be unique, such as one or more of:

   * The client machine's serial number (for privacy reasons, it is best to perform some one way function on the serial number).

   * A MAC address (again, a one way function should be performed).
   * The timestamp of when the NFS version 4 software was first installed on the client (though this is subject to the previously mentioned caution about using information that is stored in a file, because the file might only be accessible over NFS version 4).

   * A true random number. However, since this number ought to be the same between client incarnations, this shares the same problem as that of using the timestamp of the software installation.

o For a user level NFS version 4 client, it should contain additional information to distinguish the client from other user level clients running on the same host, such as a process identifier or other unique sequence.

As a security measure, the server MUST NOT cancel a client's leased state if the principal that established the state for a given co_ownerid string is not the same as the principal issuing the EXCHANGE_ID.

A server may compare a client_owner4 in an EXCHANGE_ID with an nfs_client_id4 established using SETCLIENTID using NFSv4 minor version 0, so that an NFSv4.1 client is not forced to delay until lease expiration for locking state established by the earlier client using minor version 0. This requires the client_owner4 be constructed the same way as the nfs_client_id4. If the latter's contents included the server's network address, and the NFSv4.1 client does not wish to use a client ID that prevents trunking, it should issue two EXCHANGE_ID operations. The first EXCHANGE_ID will have a client_owner4 equal to the nfs_client_id4. This will clear the state created by the NFSv4.0 client. The second EXCHANGE_ID will not have the server's network address. The state created for the second EXCHANGE_ID will not have to wait for lease expiration, because there will be no state to expire.
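The co_ownerid guidance and the verifier and principal checks above can be sketched as follows. Everything here is a hypothetical illustration under stated assumptions: make_co_ownerid, make_co_verifier, ClientRecord, and ToyServer are invented names, SHA-256 is one arbitrary choice of one-way function, a random value is one arbitrary choice of verifier, and the toy server ignores leases and sessions entirely.

```python
import hashlib
import os
from dataclasses import dataclass


def make_co_ownerid(serial_number, mac_address, instance_tag=""):
    """Build a co_ownerid that is stable across reboots of the same client."""
    digest = hashlib.sha256()  # one-way function, so raw identifiers stay private
    for part in (serial_number, mac_address, instance_tag):
        digest.update(part.encode("utf-8"))
        digest.update(b"\0")   # separator keeps ("ab", "c") distinct from ("a", "bc")
    # The prefix is an arbitrary example label, not a protocol requirement.
    return b"example-nfs41-client:" + digest.hexdigest().encode("ascii")


def make_co_verifier():
    """Return a fresh 64-bit verifier; it must differ after every client restart."""
    return os.urandom(8)


@dataclass
class ClientRecord:
    verifier: bytes
    principal: str
    state: list


class ToyServer:
    """Toy model of how a server might react to EXCHANGE_ID (no leases)."""

    def __init__(self):
        self.clients = {}

    def exchange_id(self, co_ownerid, co_verifier, principal):
        rec = self.clients.get(co_ownerid)
        if rec is None:
            self.clients[co_ownerid] = ClientRecord(co_verifier, principal, [])
            return "new"
        if rec.principal != principal:
            return "refused"   # never cancel state for a different principal
        if rec.verifier != co_verifier:
            rec.state.clear()  # new incarnation: release the old leased state
            rec.verifier = co_verifier
            return "rebooted"
        return "same"
```

The instance_tag parameter is where a user level client could mix in a process identifier or similar value to distinguish itself from other clients on the same host.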
Once an EXCHANGE_ID has been done, and the resulting client ID established as associated with a session, all requests made on that session implicitly identify that client ID, which in turn designates the client specified using the long-form client_owner4 structure. The shorthand client identifier (a client ID) is assigned by the server (the eir_clientid result from EXCHANGE_ID) and should be chosen so that it will not conflict with a client ID previously assigned by the server. This applies across server restarts or reboots.

In the event of a server restart, a client may find out that its current client ID is no longer valid when it receives an NFS4ERR_STALE_CLIENTID error. The precise circumstances depend on the characteristics of the sessions involved, specifically whether the session is persistent (see Section 2.10.4.5).

When a session is not persistent, the client will need to create a new session. When the existing client ID is presented to a server as part of creating a session and that client ID is not recognized, as would happen after a server reboot, the server will reject the request with the error NFS4ERR_STALE_CLIENTID. When this happens, the client must obtain a new client ID by use of the EXCHANGE_ID operation, use that client ID as the basis of a new session, and then proceed to any other necessary recovery for the server reboot case (see Section 8.6.2).

In the case of the session being persistent, the client will re-establish communication using the existing session after the reboot. This session will be associated with a client ID that has had state revoked (but the persistent session is never associated with a stale client ID, because if the session is persistent, the client ID MUST persist), and the client will receive an indication of that fact in the sr_status_flags field returned by the SEQUENCE operation (see Section 17.46.4).
The client can then use the existing session to do whatever operations are necessary to determine the status of requests outstanding at the time of reboot, while avoiding issuing new requests, particularly any involving locking on that session. Such requests would fail with an NFS4ERR_STALE_STATEID error, if attempted.

See the detailed descriptions of EXCHANGE_ID (Section 17.35) and CREATE_SESSION (Section 17.36) for a complete specification of these operations.

2.4.1. Server Release of Client ID

NFSv4.1 introduces a new operation called DESTROY_CLIENTID (Section 17.50) which the client SHOULD use to destroy a client ID it no longer needs. This permits graceful, bilateral release of a client ID.

If the server determines that the client holds no associated state for its client ID (including sessions, opens, locks, delegations, layouts, and wants), the server may choose to unilaterally release the client ID. The server may make this choice for an inactive client so that resources are not consumed by those intermittently active clients. If the client contacts the server after this release, the server must ensure the client receives the appropriate error so that it will use the EXCHANGE_ID/CREATE_SESSION sequence to establish a new identity. It should be clear that the server must be very hesitant to release a client ID since the resulting work on the client to recover from such an event will be the same burden as if the server had failed and restarted. Typically a server would not release a client ID unless there had been no activity from that client for many minutes. As long as there are sessions, opens, locks, delegations, layouts, or wants, the server MUST NOT release the client ID. See Section 2.10.8.1.4 for discussion on releasing inactive sessions.

2.4.2.
Handling Client Owner Conflicts

If the co_ownerid string in an EXCHANGE_ID request is properly constructed, and if the client takes care to use the same principal for each successive use of EXCHANGE_ID, then, barring an active denial of service attack, conflicts are not possible.

However, client bugs, server bugs, or perhaps a deliberate change of the principal owner of the co_ownerid string (such as the case of a client that changes security flavors, and under the new flavor, there is no mapping to the previous owner) will in rare cases result in a conflict.

When the server gets an EXCHANGE_ID for a client owner that currently has no state, or if it has state, but the lease has expired, the server MUST allow the EXCHANGE_ID, and confirm the new client ID if followed by the appropriate CREATE_SESSION.

When the server gets an EXCHANGE_ID for a client owner that currently has state, or an unexpired lease, and the principal that issues the EXCHANGE_ID is different from the principal that previously established the client owner, the server MUST NOT destroy any state that currently exists for the client owner. Regardless, the server has two choices. First, it can return NFS4ERR_CLID_INUSE. Second, it can allow the EXCHANGE_ID, and simply treat the client owner as consisting of both the co_ownerid and the principal that issued the EXCHANGE_ID.

2.5. Server Owners

The Server Owner is somewhat similar to a Client Owner (Section 2.4), but unlike the Client Owner, there is no shorthand serverid. The Server Owner is defined in the following structure:

   struct server_owner4 {
           uint64_t        so_minor_id;
           opaque          so_major_id;
   };

The Server Owner is returned in the results of EXCHANGE_ID. When the so_major_id fields are the same in two EXCHANGE_ID results, the connections that each EXCHANGE_ID was sent over can be assumed to address the same Server (as defined in Section 1.5).
If the so_minor_id fields are also the same, then not only do both connections connect to the same server, but the session and other state can be shared across both connections. The reader is cautioned that multiple servers may deliberately or accidentally claim to have the same so_major_id or so_major_id/so_minor_id; the reader should examine Section 2.10.3.4.1 and Section 17.35.

The considerations for generating an so_major_id are similar to those for generating a co_ownerid string (see Section 2.4). The consequences of two servers generating conflicting so_major_id values are less dire than they are for co_ownerid conflicts because the client can use RPCSEC_GSS to compare the authenticity of each server (see Section 2.10.3.4.1).

2.6. Security Service Negotiation

With the NFS version 4 server potentially offering multiple security mechanisms, the client needs a method to determine or negotiate which mechanism is to be used for its communication with the server. The NFS server may have multiple points within its file system namespace that are available for use by NFS clients. These points can be considered security policy boundaries, and in some NFS implementations are tied to NFS export points. In turn the NFS server may be configured such that each of these security policy boundaries may have different or multiple security mechanisms in use.

The security negotiation between client and server must be done with a secure channel to eliminate the possibility of a third party intercepting the negotiation sequence and forcing the client and server to choose a lower level of security than required or desired. See Section 20 for further discussion.

2.6.1. NFSv4 Security Tuples

An NFS server can assign one or more "security tuples" to each security policy boundary in its namespace.
Each security tuple consists of a security flavor (see Section 2.2.1.1), and if the flavor is RPCSEC_GSS, a GSS-API mechanism OID, a GSS-API quality of protection, and an RPCSEC_GSS service.

2.6.2. SECINFO and SECINFO_NO_NAME

The SECINFO and SECINFO_NO_NAME operations allow the client to determine, on a per filehandle basis, what security tuple is to be used for server access. In general, the client will not have to use either operation except during initial communication with the server or when the client crosses security policy boundaries at the server. It is possible that the server's policies change during the client's interaction, therefore forcing the client to negotiate a new security tuple.

Where the use of different security tuples would affect the type of access that would be allowed if a request was issued over the same connection used for the SECINFO or SECINFO_NO_NAME operation (e.g. read-only vs. read-write access), security tuples that allow greater access should be presented first. Where the general level of access is the same and different security flavors limit the range of principals whose privileges are recognized (e.g. allowing or disallowing root access), flavors supporting the greatest range of principals should be listed first.

2.6.3. Security Error

Based on the assumption that each NFS version 4 client and server must support a minimum set of security (i.e., LIPKEY, SPKM-3, and Kerberos-V5 all under RPCSEC_GSS), the NFS client will initiate file access to the server with one of the minimal security tuples. During communication with the server, the client may receive an NFS error of NFS4ERR_WRONGSEC. This error allows the server to notify the client that the security tuple currently being used contravenes the server's security policy.
The client is then responsible for determining (see Section 2.6.3.1) what security tuples are available at the server and choosing one which is appropriate for the client.

2.6.3.1. Using NFS4ERR_WRONGSEC, SECINFO, and SECINFO_NO_NAME

This section explains the mechanics of NFSv4.1 security negotiation. The term "put filehandle operation" refers to PUTROOTFH, PUTPUBFH, PUTFH, and RESTOREFH.

2.6.3.1.1. Put Filehandle Operation + SAVEFH

The client is saving a filehandle for a future RESTOREFH. The server MUST NOT return NFS4ERR_WRONGSEC to either the put filehandle operation or SAVEFH.

2.6.3.1.2. Two or More Put Filehandle Operations

For a series of N put filehandle operations, the server MUST NOT return NFS4ERR_WRONGSEC to the first N-1 put filehandle operations. The Nth put filehandle operation is handled as if it is the first in a series of operations, and the second in the series of operations is not a put filehandle operation. For example, if the server received PUTFH, PUTROOTFH, LOOKUP, then the PUTFH is ignored for NFS4ERR_WRONGSEC purposes, and the PUTROOTFH, LOOKUP subseries is processed according to Section 2.6.3.1.3.

2.6.3.1.3. Put Filehandle Operation + LOOKUP (or OPEN by Name)

This situation also applies to a put filehandle operation followed by a LOOKUP or an OPEN operation that specifies a component name.

In this situation, the client is potentially crossing a security policy boundary, and the set of security tuples the parent directory supports may differ from those of the child. The server implementation may decide whether to impose any restrictions on security policy administration. There are at least three approaches (sec_policy_child is the tuple set of the child export, sec_policy_parent is that of the parent).

a) sec_policy_child <= sec_policy_parent (<= for subset). This means that the set of security tuples specified on the security policy of a child directory is always a subset of that of its parent directory.

b) sec_policy_child ^ sec_policy_parent != {} (^ for intersection, {} for the empty set). This means that the set of security tuples specified on the security policy of a child directory always has a non-empty intersection with that of the parent.

c) sec_policy_child ^ sec_policy_parent == {}. This means that the set of tuples specified on the security policy of a child directory may not intersect with that of the parent. In other words, there are no restrictions on how the system administrator may set up these tuples.

For a server to support approach (b) (when the client chooses a flavor that is not a member of sec_policy_parent) and (c), the put filehandle operation must NOT return NFS4ERR_WRONGSEC in case of security mismatch. Instead, it should be returned from the LOOKUP (or OPEN by component name) that follows.

Since the above guideline does not contradict approach (a), it should be followed in general. Even if approach (a) is implemented, it is possible for the security tuple used to be acceptable for the target of LOOKUP but not for the filehandles used in the put filehandle operation. The put filehandle operation could be a PUTROOTFH or PUTPUBFH, where the client cannot know the security tuples for the root or public filehandle. Or the security policy for the filehandle used by the put filehandle operation could have changed since the time the filehandle was obtained.

Therefore, an NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC in response to the put filehandle operation if the operation is immediately followed by a LOOKUP or an OPEN by component name.

2.6.3.1.4. Put Filehandle Operation + LOOKUPP

Since SECINFO only works its way down, there is no way LOOKUPP can return NFS4ERR_WRONGSEC without SECINFO_NO_NAME. SECINFO_NO_NAME solves this issue because, via style SECINFO_STYLE4_PARENT, it works in the opposite direction to SECINFO.
As with Section 2.6.3.1.3, the put filehandle operation must not return NFS4ERR_WRONGSEC whenever it is followed by LOOKUPP. If the server does not support SECINFO_NO_NAME, the client's only recourse is to issue the put filehandle operation, LOOKUPP, GETFH sequence of operations with every security tuple it supports.

Regardless of whether SECINFO_NO_NAME is supported, an NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC in response to a put filehandle operation if the operation is immediately followed by a LOOKUPP.

2.6.3.1.5. Put Filehandle Operation + SECINFO/SECINFO_NO_NAME

A security sensitive client is allowed to choose a strong security tuple when querying a server to determine a file object's permitted security tuples. The security tuple chosen by the client does not have to be included in the tuple list of the security policy of either the parent directory indicated in the put filehandle operation, or the child file object indicated in SECINFO (or any parent directory indicated in SECINFO_NO_NAME). Of course, the server has to be configured for whatever security tuple the client selects; otherwise the request will fail at the RPC layer with an appropriate authentication error.

In theory, there is no connection between the security flavor used by SECINFO or SECINFO_NO_NAME and those supported by the security policy. But in practice, the client may start looking for strong flavors from those supported by the security policy, followed by those in the mandatory set.

The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC to a put filehandle operation whenever it is immediately followed by SECINFO or SECINFO_NO_NAME.

The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC from SECINFO or SECINFO_NO_NAME.

2.6.3.1.6. Put Filehandle Operation + Nothing

The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC.

2.6.3.1.7. Put Filehandle Operation + Anything Else

"Anything Else" includes OPEN by filehandle.

The security policy enforcement applies to the filehandle specified in the put filehandle operation. Therefore PUTFH must return NFS4ERR_WRONGSEC in case of a security tuple mismatch. This avoids the complexity of adding NFS4ERR_WRONGSEC as an allowable error to every other operation.

A COMPOUND containing the series put filehandle operation + SECINFO_NO_NAME (style SECINFO_STYLE4_CURRENT_FH) is an efficient way for the client to recover from NFS4ERR_WRONGSEC.

The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC to any operation other than a put filehandle operation, LOOKUP, LOOKUPP, and OPEN (by component name).

2.7. Minor Versioning

To address the requirement of an NFS protocol that can evolve as the need arises, the NFS version 4 protocol contains the rules and framework to allow for future minor changes or versioning.

The base assumption with respect to minor versioning is that any future accepted minor version must follow the IETF process and be documented in a standards track RFC. Therefore, each minor version number will correspond to an RFC. Minor version zero of the NFS version 4 protocol is represented by [2], and minor version one is represented by this document [[Comment.2: change "document" to "RFC" when we publish]]. The COMPOUND and CB_COMPOUND procedures support the encoding of the minor version being requested by the client.

The following items represent the basic rules for the development of minor versions. Note that a future minor version may decide to modify or add to the following rules as part of the minor version definition.

1. Procedures are not added or deleted

   To maintain the general RPC model, NFS version 4 minor versions will not add to or delete procedures from the NFS program.

2. Minor versions may add operations to the COMPOUND and CB_COMPOUND procedures.

   The addition of operations to the COMPOUND and CB_COMPOUND procedures does not affect the RPC model.

   * Minor versions may append attributes to GETATTR4args, bitmap4, and GETATTR4res.

     This allows for the expansion of the attribute model to allow for future growth or adaptation.

   * Minor version X must append any new attributes after the last documented attribute.

     Since attribute results are specified as an opaque array of per-attribute XDR encoded results, the complexity of adding new attributes in the midst of the current definitions would be too burdensome.

3. Minor versions must not modify the structure of an existing operation's arguments or results.

   Again the complexity of handling multiple structure definitions for a single operation is too burdensome. New operations should be added instead of modifying existing structures for a minor version.

   This rule does not preclude the following adaptations in a minor version.

   * adding bits to flag fields such as new attributes to GETATTR's bitmap4 data type

   * adding bits to existing attributes like ACLs that have flag words

   * extending enumerated types (including NFS4ERR_*) with new values

4. Minor versions may not modify the structure of existing attributes.

5. Minor versions may not delete operations.

   This prevents the potential reuse of a particular operation "slot" in a future minor version.

6. Minor versions may not delete attributes.

7. Minor versions may not delete flag bits or enumeration values.

8. Minor versions may declare an operation as mandatory to NOT implement.

   Specifying an operation as "mandatory to not implement" is equivalent to obsoleting an operation. For the client, it means that the operation should not be sent to the server.
   For the server, an NFS error can be returned as opposed to "dropping" the request as an XDR decode error. This approach allows for the obsolescence of an operation while maintaining its structure so that a future minor version can reintroduce the operation.

   1. Minor versions may declare attributes mandatory to NOT implement.

   2. Minor versions may declare flag bits or enumeration values as mandatory to NOT implement.

9. Minor versions may downgrade features from mandatory to recommended, or recommended to optional.

10. Minor versions may upgrade features from optional to recommended or recommended to mandatory.

11. A client and server that support minor version X should support minor versions 0 (zero) through X-1 as well.

12. Except for infrastructural changes, no new features may be introduced as mandatory in a minor version.

    This rule allows for the introduction of new functionality and forces the use of implementation experience before designating a feature as mandatory. On the other hand, some classes of features are infrastructural and have broad effects. Allowing such features to not be mandatory complicates implementation of the minor version.

13. A client MUST NOT attempt to use a stateid, filehandle, or similar returned object from the COMPOUND procedure with minor version X for another COMPOUND procedure with minor version Y, where X != Y.

2.8. Non-RPC-based Security Services

As described in Section 2.2.1.1.1.1, NFSv4 relies on RPC for identification, authentication, integrity, and privacy. NFSv4 itself provides additional security services as described in the next several subsections.

2.8.1. Authorization

Authorization to access a file object via an NFSv4 operation is ultimately determined by the NFSv4 server. A client can predetermine its access to a file object via the OPEN (Section 17.16) and the ACCESS (Section 17.1) operations.
Principals with appropriate access rights can modify the authorization on a file object via the SETATTR (Section 17.30) operation. Four attributes that affect access rights are: mode, owner, owner_group, and acl. See Section 5.

2.8.2. Auditing

NFSv4 provides auditing on a per file object basis, via the ACL attribute as described in Section 6. It is outside the scope of this specification to specify audit log formats or management policies.

2.8.3. Intrusion Detection

NFSv4 provides alarm control on a per file object basis, via the ACL attribute as described in Section 6. Alarms may serve as the basis for intrusion detection. It is outside the scope of this specification to specify heuristics for detecting intrusion via alarms.

2.9. Transport Layers

2.9.1. Required and Recommended Properties of Transports

NFSv4 works over RDMA and non-RDMA-based transports with the following attributes:

o The transport supports reliable delivery of data, which NFSv4 requires but neither NFSv4 nor RPC has facilities for ensuring. [20]

o The transport delivers data in the order it was sent. Ordered delivery simplifies detection of transmit errors, and simplifies the sending of arbitrary sized requests and responses, via the record marking protocol [4].

Where an NFS version 4 implementation supports operation over the IP network protocol, any transport used between NFS and IP MUST be among the IETF-approved congestion control transport protocols. At the time this document was written, the only two transports that had the above attributes were TCP and SCTP. To enhance the possibilities for interoperability, an NFS version 4 implementation MUST support operation over the TCP transport protocol.

Even if NFS version 4 is used over a non-IP network protocol, it is RECOMMENDED that the transport support congestion control.
It is permissible for a connectionless transport to be used under NFSv4.1; however, reliable and in-order delivery of data by the connectionless transport is still required. NFSv4.1 assumes that a client transport address and server transport address used to send data over a transport together constitute a connection, even if the underlying transport eschews the concept of a connection.

2.9.2. Client and Server Transport Behavior

If a connection-oriented transport (e.g. TCP) is used, the client and server SHOULD use long lived connections for at least three reasons:

1. This will prevent the weakening of the transport's congestion control mechanisms via short lived connections.

2. This will improve performance for the WAN environment by eliminating the need for connection setup handshakes.

3. The NFSv4.1 callback model differs from NFSv4.0, and requires the client and server to maintain a client-created channel (see Section 2.10.3.4) for the server to use.

In order to reduce congestion, if a connection-oriented transport is used, and the request is not the NULL procedure,

o A requester MUST NOT retry a request unless the connection the request was issued over was disconnected before the reply was received.

o A replier MUST NOT silently drop a request, even if the request is a retry. (The silent drop behavior of RPCSEC_GSS [5] does not apply because this behavior happens at the RPCSEC_GSS layer, a lower layer in the request processing.) Instead, the replier SHOULD return an appropriate error (see Section 2.10.4.1) or it MAY disconnect the connection.

When using RDMA transports there are other reasons for not tolerating retries over the same connection:

o RDMA transports use "credits" to enforce flow control, where a credit is a right to a peer to transmit a message. If one peer were to retransmit a request (or reply), it would consume an additional credit.
  If the replier retransmitted a reply, it would certainly result in an RDMA connection loss, since the requester would typically only post a single receive buffer for each request. If the requester retransmitted a request, the additional credit consumed on the server might lead to RDMA connection failure unless the client accounted for it and decreased its available credit, leading to wasted resources.

o RDMA credits present a new issue to the reply cache in NFSv4.1. The reply cache may be used when a connection within a session is lost, such as after the client reconnects. Credit information is a dynamic property of the RDMA connection, and stale values must not be replayed from the cache. This implies that the reply cache contents must not be blindly used when replies are issued from it, and credit information appropriate to the channel must be refreshed by the RPC layer.

In addition, the NFSv4.1 requester is not allowed to stop waiting for a reply, as described in Section 2.10.4.2.

2.9.3. Ports

Historically, NFS version 2 and version 3 servers have resided on port 2049. The registered port 2049 RFC3232 [21] for the NFS protocol should be the default configuration. NFSv4 clients SHOULD NOT use the RPC binding protocols as described in RFC1833 [22].

2.10. Session

2.10.1. Motivation and Overview

Previous versions and minor versions of NFS have suffered from the following:

o Lack of support for exactly once semantics (EOS). This includes lack of support for EOS through server failure and recovery.

o Limited callback support, including no support for sending callbacks through firewalls, and races between responses from normal requests, and callbacks.

o Limited trunking over multiple network paths.

o Requiring machine credentials for fully secure operation.
Through the introduction of a session, NFSv4.1 addresses the above shortfalls with practical solutions:

o EOS is enabled by a reply cache with a bounded size, making it feasible to keep on persistent storage and enable EOS through server failure and recovery. One reason that previous revisions of NFS did not support EOS was that some EOS approaches often limited parallelism. As will be explained in Section 2.10.4, NFSv4.1 supports both EOS and unlimited parallelism.

o The NFSv4.1 client creates transport connections and gives them to the server for sending callbacks, thus solving the firewall issue (Section 17.34). Races between responses from client requests, and callbacks caused by the requests are detected via the session's sequencing properties which are a byproduct of EOS (Section 2.10.4.3).

o The NFSv4.1 client can add an arbitrary number of connections to the session, and thus provide trunking (Section 2.10.3.4.1).

o The NFSv4.1 session produces a session key independent of client and server machine credentials which can be used to compute a digest for protecting key session management operations (Section 2.10.6.3).

o The NFSv4.1 client can also create secure RPCSEC_GSS contexts for use by the session's callback channel that do not require the server to authenticate to a client machine principal (Section 2.10.6.2).

A session is a dynamically created, long-lived server object created by a client, used over time from one or more transport connections. Its function is to maintain the server's state relative to the connection(s) belonging to a client instance. This state is entirely independent of the connection itself, and indeed the state exists whether the connection exists or not (though locks, delegations, etc. generally expire in the extended absence of an open connection).
The session in effect becomes the object representing an active client on a set of zero or more connections.

2.10.2. NFSv4 Integration

Sessions are part of NFSv4.1 and not NFSv4.0. Normally, a major infrastructure change like sessions would require a new major version number to an RPC program like NFS. However, because NFSv4 encapsulates its functionality in a single procedure, COMPOUND, and because COMPOUND can support an arbitrary number of operations, sessions are almost trivially added. COMPOUND includes a minor version number field, and for NFSv4.1 this minor version is set to 1. When the NFSv4 server processes a COMPOUND with the minor version set to 1, it expects a different set of operations than it does for NFSv4.0. One operation it expects is the SEQUENCE operation, which is required for every COMPOUND that operates over an established session.

2.10.2.1. SEQUENCE and CB_SEQUENCE

In NFSv4.1, when the SEQUENCE operation is present, it is always the first operation in the COMPOUND procedure. The primary purpose of SEQUENCE is to carry the session identifier. The session identifier associates all other operations in the COMPOUND procedure with a particular session. SEQUENCE also contains required information for maintaining EOS (see Section 2.10.4). Session-enabled NFSv4.1 COMPOUND requests thus have the form:

   +-----+--------------+-----------+------------+-----------+----
   | tag | minorversion | numops    |SEQUENCE op | op + args | ...
   |     | (== 1)       | (limited) | + args     |           |
   +-----+--------------+-----------+------------+-----------+----

and the reply's structure is:

   +------------+-----+--------+-------------------------------+--//
   |last status | tag | numres |status + SEQUENCE op + results | //
   +------------+-----+--------+-------------------------------+--//
           //-----------------------+----
           // status + op + results | ...
           //-----------------------+----

A CB_COMPOUND procedure request and reply have a similar form, but instead of a SEQUENCE operation, there is a CB_SEQUENCE operation, and there is an additional field called "callback_ident", which is superfluous in NFSv4.1. CB_SEQUENCE has the same information as SEQUENCE, but includes other information needed to solve callback races (Section 2.10.4.3).

2.10.2.2. Client ID and Session Association

Sessions are subordinate to the client ID (Section 2.4). Each client ID can have zero or more active sessions. A client ID, and a session bound to it, are required to do anything useful in NFSv4.1. Each time a session is used, the state leased to its associated client ID is automatically renewed.

State such as share reservations, locks, delegations, and layouts (Section 1.4.4) is tied to the client ID, not the sessions of the client ID. Successive state changing operations from a given state owner can go over different sessions, as long as each session is associated with the same client ID. Callbacks can arrive over a different session than the session that sent the operation that acquired the state that the callback is for. For example, if session A is used to acquire a delegation, a request to recall the delegation can arrive over session B.

2.10.3. Channels

Each session has one or two channels: the "operation" or "fore" channel used for ordinary requests from client to server, and the "back" channel, used for callback requests from server to client. The session allocates resources for each channel, including separate reply caches (see Section 2.10.4.1). These resources are for the most part specified at the time the session is created.

2.10.3.1. Operation Channel

The operation channel carries COMPOUND requests and responses. A session always has an operation channel.

2.10.3.2. Backchannel

The backchannel carries CB_COMPOUND requests and responses.
Whether there is a backchannel or not is a decision of the client; NFSv4.1 servers MUST support backchannels.

2.10.3.3. Session and Channel Association

Because there are at most two channels per session, and because each channel has a distinct purpose, channels are not assigned identifiers. The operation and backchannel are implicitly created and associated when the session is created.

2.10.3.4. Connection and Channel Association

Each channel is associated with zero or more transport connections. A connection can be bound to one channel or both channels of a session; the client and server negotiate whether a connection will carry traffic for one channel or both channels via the CREATE_SESSION (Section 17.36) and the BIND_CONN_TO_SESSION (Section 17.34) operations. When a session is created via CREATE_SESSION, the connection that carried the CREATE_SESSION request is automatically bound to the operation channel, and optionally the backchannel. If the client does not specify connection binding enforcement when the session is created, then additional connections are automatically bound to the operation channel when they are used with a SEQUENCE operation that has the session's sessionid.

A connection MAY be bound to the channels of other sessions. The client decides, and the NFSv4.1 server MUST allow it. A connection MAY be bound to the channels of other sessions of other clientids. Again, the client decides, and the server MUST allow it.

It is permissible for connections of multiple types to be bound to the same channel. For example, a TCP and RDMA connection can be bound to the operation channel. In the event an RDMA and non-RDMA connection are bound to the same channel, the maximum number of slots must be at least one more than the total number of credits. This way, if all RDMA credits are used, the non-RDMA connection can have at least one outstanding request. It is permissible for a connection of one type to be bound to the operation channel, and another type bound to the backchannel.

2.10.3.4.1. Trunking

A client is allowed to issue EXCHANGE_ID multiple times to the same server. The client may be unaware that two different server network addresses refer to the same server. The use of EXCHANGE_ID allows a client to become aware that an additional network address refers to a server for which the client already has an established client ID and session. The eir_server_owner and eir_server_scope results from EXCHANGE_ID give a client a hint that the server it is connected to may be the same as the server it is connected to via another connection.

When EXCHANGE_ID is issued over two different connections, and each returns the same eir_server_owner.so_major_id and eir_server_scope, the client treats the connections as connected to the same server (subject to verification, as described later in this section (Paragraph 2)), even if the destination network addresses are different. As long as two unrelated servers have not selected and returned a conflicting pair of so_major_id and eir_server_scope, and unless the client has used different co_ownerid values in each EXCHANGE_ID request or the server has lost client ID state (e.g. the server has rebooted), the server MUST return the same eir_clientid result. In that case, the client and server use the common eir_clientid to identify the client.

The eir_server_owner.so_minor_id field allows the server to control binding of connections to sessions. When two connections have a matching eir_server_scope, so_major_id and so_minor_id, the client may bind both connections to a common session; this is session trunking. When two connections have a matching so_major_id and eir_server_scope, but different so_minor_id, the client will need to create a new session for the client ID in order to use the connection; this is client ID trunking. In either session or client ID trunking, the bandwidth capacity can scale with the number of connections.
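
As an illustration only, the comparison rules for session trunking and client ID trunking described above can be sketched in a few lines of Python. The names ExchangeIdResult, Trunking, and classify_trunking are hypothetical and are not defined by this protocol; the sketch assumes the client has already decoded the eir_server_owner and eir_server_scope results of two EXCHANGE_ID replies. The outcome is only a hint: the client would still be expected to verify the servers' claims as the remainder of this section describes.

```python
from dataclasses import dataclass
from enum import Enum


class Trunking(Enum):
    NONE = "none"            # treat the two connections as different servers
    CLIENT_ID = "client_id"  # same server, but a new session is needed
    SESSION = "session"      # same server, and the session can be shared


@dataclass(frozen=True)
class ExchangeIdResult:
    """Hypothetical holder for the EXCHANGE_ID results named in the text."""
    so_major_id: bytes   # eir_server_owner.so_major_id
    so_minor_id: int     # eir_server_owner.so_minor_id
    server_scope: bytes  # eir_server_scope


def classify_trunking(a: ExchangeIdResult, b: ExchangeIdResult) -> Trunking:
    """Compare the results of EXCHANGE_ID issued over two connections."""
    # Without a matching so_major_id and eir_server_scope there is no
    # indication that both connections reach the same server.
    if a.so_major_id != b.so_major_id or a.server_scope != b.server_scope:
        return Trunking.NONE
    # A matching so_minor_id additionally permits binding both
    # connections to a common session.
    if a.so_minor_id == b.so_minor_id:
        return Trunking.SESSION
    return Trunking.CLIENT_ID
```

A client might call classify_trunking once per newly discovered server address and create a new session only when the result is Trunking.CLIENT_ID.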
When two servers over two connections claim matching or partially matching eir_server_owner, eir_server_scope, and eir_clientid values, the client does not have to trust the servers' claims. The client may verify these claims before trunking traffic in the following ways:

o For session trunking, clients and servers can reliably verify if connections between different network paths are in fact bound to the same NFSv4.1 server and usable on the same session. The SET_SSV (Section 17.47) operation allows a client and server to establish a unique, shared key value (the SSV). When a new connection is bound to the session (via the BIND_CONN_TO_SESSION operation, see Section 17.34), the client offers a digest that is based on the SSV. If the client mistakenly tries to bind a connection to a session of a wrong server, the server will either reject the attempt because it is not aware of the session identifier of the BIND_CONN_TO_SESSION arguments, or it will reject the attempt because the digest for the SSV does not match what the server expects. Even if the server mistakenly or maliciously accepts the connection bind attempt, the digest it computes in the response will fail verification by the client, so the client will know it cannot use the connection for trunking the specified channel.

o In the case of client ID trunking, the client can use RPCSEC_GSS to verify that each connection is aimed at the same server. When the client invokes EXCHANGE_ID, it should use RPCSEC_GSS. If each RPCSEC_GSS context over each connection has the same server principal, then, barring a compromise of the server's GSS credentials, the servers at the end of each connection are the same.

2.10.4. Exactly Once Semantics

Via the session, NFSv4.1 offers exactly once semantics (EOS) for requests sent over a channel. EOS is supported on both the operation and back channels.
Each COMPOUND or CB_COMPOUND request that is issued with a leading SEQUENCE or CB_SEQUENCE operation MUST be executed by the receiver exactly once. This requirement holds regardless of whether the request is issued with reply caching specified (see Section 2.10.4.1.2). The requirement holds even if the requester is issuing the request over a session created between a pNFS data client and pNFS data server. The rationale for this requirement is understood by categorizing requests into three classifications:

o Nonidempotent requests.

o Idempotent modifying requests.

o Idempotent non-modifying requests.

An example of a nonidempotent request is RENAME. It is obvious that if a replier executes the same RENAME request twice, and the first execution succeeds, the re-execution will fail. If the replier returns the result from the re-execution, this result is incorrect. Therefore, EOS is required for nonidempotent requests.

An example of an idempotent modifying request is a COMPOUND request containing a WRITE operation. Repeated execution of the same WRITE has the same effect as execution of that write once. Nevertheless, enforcing EOS for WRITEs and other idempotent modifying requests is necessary to avoid data corruption.

Suppose a client issues WRITEs A, B, C to a noncompliant server that does not enforce EOS, and receives no response, perhaps due to a network partition. The client reconnects to the server and re-issues all three WRITEs. Now, the server has outstanding two instances of each of A, B, and C. The server can be in a situation in which it executes and replies to the retries of A, B, and C while the first A, B, and C are still waiting around in the server's I/O system for some resource.
Upon receiving the replies to the second attempts of WRITEs A, B, and C, the client believes its writes are done, so it is free to issue WRITE D, which overlaps the range of one or more of A, B, C. If any of A, B, or C is subsequently executed for the second time, then what has been written by D can be overwritten and thus corrupted. Note that the server is not required to cache the reply to the modifying operation in order to avoid data corruption (but if the client specified the reply to be cached, the server must cache it).

An example of an idempotent non-modifying request is a COMPOUND containing SEQUENCE, PUTFH, READLINK and nothing else. The re-execution of such a request will not cause data corruption or produce an incorrect result. Nonetheless, for simplicity, the replier MUST enforce EOS for such requests.

2.10.4.1. Slot Identifiers and Reply Cache

The RPC layer provides a transaction ID (xid), which, while required to be unique, is not especially convenient for tracking requests. The xid is only meaningful to the requester; it cannot be interpreted at the replier except to test for equality with previously issued requests. Because RPC operations may be completed by the replier in any order, many transaction IDs may be outstanding at any time. The requester may therefore perform a computationally expensive lookup operation in the process of demultiplexing each reply.

In NFSv4.1, there is a limit to the number of active requests. This immediately enables a computationally efficient index for each request, which is designated as a Slot Identifier, or slotid.

When the requester issues a new request, it selects a slotid in the range 0..N-1, where N is the replier's current "outstanding requests" limit granted to the requester on the session over which the request is to be issued.
The value of N starts out as the value of ca_maxrequests (Section 17.36), but can be adjusted by the response to SEQUENCE or CB_SEQUENCE as described later in this section. The slotid must be unused by any of the requests which the requester already has active on the session. "Unused" here means the requester has no outstanding request for that slotid. Because the slotid is always an integer in the range 0..N-1, requester implementations can use the slotid from a replier response to efficiently match responses with outstanding requests, for example by using the slotid to index into an outstanding request array. This can be used to avoid expensive hashing and lookup functions in the performance-critical receive path.

The sequenceid, which accompanies the slotid in each request, supports an important check at the server: the server must be able to determine efficiently whether a request using a certain slotid is a retransmit or a new, never-before-seen request. It is not feasible to implement this by having the client assert that it is retransmitting, because for any given request the client cannot know the server has seen it unless the server actually replies. Of course, if the client has seen the server's reply, the client would not retransmit.

The sequenceid MUST increase monotonically for each new transmit of a given slotid, and MUST remain unchanged for any retransmission. The server must in turn compare each newly received request's sequenceid with the last one previously received for that slotid, to see if the new request is:

o A new request, in which the sequenceid is one greater than that previously seen in the slot (accounting for sequence wraparound). The replier proceeds to execute the new request.

o A retransmitted request, in which the sequenceid is equal to that last seen in the slot.
Note that this request may be either complete, or in progress. The replier performs replay processing in these cases.

o A misordered replay, in which the sequenceid is less than (accounting for sequence wraparound) that previously seen in the slot. The replier MUST return NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or CB_SEQUENCE).

o A misordered new request, in which the sequenceid is two or more greater than (accounting for sequence wraparound) that previously seen in the slot. Note that because the sequenceid must wrap around once it reaches 0xFFFFFFFF, a misordered new request and a misordered replay cannot be distinguished. Thus, the replier MUST return NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or CB_SEQUENCE).

Unlike the XID, the slotid is always within a specific range; this has two implications. The first implication is that for a given session, the replier need only cache the results of a limited number of COMPOUND requests. The second implication derives from the first, which is that, unlike XID-indexed reply caches (also known as duplicate request caches, or DRCs), the slotid-based reply cache cannot be overflowed. Through use of the sequenceid to identify retransmitted requests, the replier does not need to actually cache the request itself, reducing the storage requirements of the reply cache further. These new facilities make it practical to maintain all the required entries for an effective reply cache.

The slotid and sequenceid therefore take over the traditional role of the XID and port number in the replier reply cache implementation, and the session replaces the IP address. This approach is considerably more portable and completely robust - it is not subject to the frequent reassignment of ports as clients reconnect over IP networks.
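The sequenceid comparison described above can be sketched as follows. This is an illustration only, not part of the protocol definition: the names Slot, Disposition, classify, and accept_new are hypothetical, the sequenceid is assumed to be an unsigned 32-bit counter, and the starting value of a slot's sequenceid is an assumption of the sketch.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

SEQ_MOD = 1 << 32  # assumption: sequenceid is an unsigned 32-bit counter


class Disposition(Enum):
    NEW = "new request"
    RETRANSMIT = "retransmitted request"
    MISORDERED = "NFS4ERR_SEQ_MISORDERED"


@dataclass
class Slot:
    """One hypothetical reply cache entry, indexed by slotid."""
    last_seqid: int = 0                   # starting value is an assumption
    cached_reply: Optional[bytes] = None  # None while a request is in progress


def classify(slot: Slot, seqid: int) -> Disposition:
    """Compare a received sequenceid with the one last seen in the slot."""
    if seqid == slot.last_seqid:
        return Disposition.RETRANSMIT
    if seqid == (slot.last_seqid + 1) % SEQ_MOD:
        return Disposition.NEW
    # Lower values and values two or more ahead cannot be told apart
    # once wraparound is allowed, so both map to the same error.
    return Disposition.MISORDERED


def accept_new(slot: Slot, seqid: int) -> None:
    """Advance the slot for a new request; its reply is not yet available."""
    slot.last_seqid = seqid
    slot.cached_reply = None
```

A replier built along these lines would call classify on every SEQUENCE or CB_SEQUENCE it receives and call accept_new only when the result is NEW, so that the slot's sequenceid is left unchanged whenever an error is returned.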
In addition, the RPC XID is not used in the reply cache, enhancing robustness of the cache in the face of any rapid reuse of XIDs by the client. [[Comment.3: We need to discuss the requirements of the client for changing the XID.]]

The slotid information is included in each request, without violating the minor versioning rules of the NFSv4.0 specification, by encoding it in the SEQUENCE operation within each NFSv4.1 COMPOUND and CB_COMPOUND procedure. The operation easily piggybacks within existing messages. [[Comment.4: Need a better term than piggyback]]

The receipt of a new sequenced request arriving on any valid slot is an indication that the previous reply cache contents of that slot may be discarded.

The SEQUENCE (and CB_SEQUENCE) operation also carries a "highest_slotid" value, which conveys additional client slot usage information. The requester must always provide a slotid representing the outstanding request with the highest-numbered slot value. The requester should in all cases provide the most conservative value possible, although it can be increased somewhat above the actual instantaneous usage to maintain some minimum or optimal level. This provides a way for the requester to yield unused request slots back to the replier, which in turn can use the information to reallocate resources.

The replier responds with both a new target highest_slotid, and an enforced highest_slotid, described as follows:

o The target highest_slotid is an indication to the requester of the highest_slotid the replier wishes the requester to be using. This permits the replier to withdraw (or add) resources from a requester that has been found to not be using them, in order to more fairly share resources among a varying level of demand from other requesters. The requester must always comply with the replier's value updates, since they indicate newly established hard limits on the requester's access to session resources. However, because of request pipelining, the requester may have active requests in flight reflecting prior values; therefore, the replier must not immediately require the requester to comply.

o The enforced highest_slotid indicates the highest slotid the requester is permitted to use on a subsequent SEQUENCE or CB_SEQUENCE operation.

The requester is required to use the lowest available slot when issuing a new request. This way, the replier may be able to retire slot entries faster. However, where the replier is actively adjusting its granted maximum request count (i.e., the highest_slotid) to the requester, it will not be able to rely on just the slotid and highest_slotid received in the request. Neither the slotid nor the highest_slotid used in a request may reflect the replier's current idea of the requester's session limit, because the request may have been sent from the requester before the update was received. Therefore, in the downward adjustment case, the replier may have to retain a number of reply cache entries at least as large as the old value of maximum requests outstanding, until operation sequencing rules allow it to infer that the requester has seen its reply.

2.10.4.1.1. Errors from SEQUENCE and CB_SEQUENCE

Any time SEQUENCE or CB_SEQUENCE returns an error, the sequenceid of the slot MUST NOT change. The replier MUST NOT modify the reply cache entry for the slot whenever an error is returned from SEQUENCE or CB_SEQUENCE.

2.10.4.1.2. Optional Reply Caching

On a per-request basis, the requester can choose to direct the replier to cache the reply to all operations after the first operation (SEQUENCE or CB_SEQUENCE) via the sa_cachethis or csa_cachethis fields of the arguments to SEQUENCE or CB_SEQUENCE.
A requester might choose not to direct the replier to cache the entire reply when the request is composed entirely of idempotent operations [20]. Caching the reply may offer little benefit, and if the reply is too large (see Section 2.10.4.4), it may not be cacheable anyway.

Whether or not the requester requests the reply to be cached has no effect on the slot processing. If the results of SEQUENCE or CB_SEQUENCE are NFS4_OK, then the slot's sequenceid MUST be incremented by one. If a requester does not direct the replier to cache the reply, the replier MUST do one of the following:

o The replier can cache the entire original reply. Even though sa_cachethis or csa_cachethis are FALSE, the replier is always free to cache. It may choose this approach in order to simplify implementation.

o The replier enters into its reply cache a reply consisting of the original results to the SEQUENCE or CB_SEQUENCE operation, followed by the error NFS4ERR_RETRY_UNCACHED_REP. Thus if the requester later retries the request, it will get NFS4ERR_RETRY_UNCACHED_REP.

2.10.4.1.3. Multiple Connections and Sharing the Reply Cache

Multiple connections can be bound to a session's channel, hence the connections share the same table of slotids. For connections over non-RDMA transports like TCP, there are no particular considerations. Considerations for multiple RDMA connections sharing a slot table are discussed in Section 2.10.5.1. [[Comment.5: Also need to discuss when RDMA and non-RDMA share a slot table.]]

2.10.4.2. Retry and Replay

A client MUST NOT retry a request, unless the connection it used to send the request disconnects. The client can then reconnect and resend the request, or it can resend the request over a different connection.
In the case of the server resending over the backchannel, it cannot reconnect, and either resends the request over another connection that the client has bound to the backchannel, or if there is no other backchannel connection, waits for the client to bind a connection to the backchannel.

A client MUST wait for a reply to a request before using the slot for another request. If it does not wait for a reply, then the client does not know what sequenceid to use for the slot on its next request. For example, suppose a client sends a request with sequenceid 1, and does not wait for the response. The next time it uses the slot, it sends the new request with sequenceid 2. If the server has not seen the request with sequenceid 1, then the server is not expecting sequenceid 2, and rejects the client's new request with NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or CB_SEQUENCE).

RDMA fabrics do not guarantee that the memory handles (Steering Tags) within each RDMA three-tuple are valid on a scope outside that of a single connection. [[Comment.6: What is a three-tuple?]] Therefore, handles used by the direct operations become invalid after connection loss. The server must ensure that any RDMA operations which must be replayed from the reply cache use the newly provided handle(s) from the most recent request.

2.10.4.3. Resolving server callback races with sessions

It is possible for server callbacks to arrive at the client before the reply from related forward channel operations. For example, a client may have been granted a delegation to a file it has opened, but the reply to the OPEN (informing the client of the granting of the delegation) may be delayed in the network. If a conflicting operation arrives at the server, it will recall the delegation using the callback channel, which may be on a different transport connection, perhaps even a different network.
In NFSv4.0, if the callback request arrives before the related reply, the client may reply to the server with an error.

The presence of a session between client and server alleviates this issue. When a session is in place, each client request is uniquely identified by its { slotid, sequenceid } pair. By the rules under which slot entries (reply cache entries) are retired, the server knows whether the client has "seen" each of the server's replies. The server can therefore provide sufficient information to the client to allow it to disambiguate between an erroneous or conflicting callback and a race condition.

For each client operation which might result in some sort of server callback, the server should "remember" the { slotid, sequenceid } pair of the client request until the slotid retirement rules allow the server to determine that the client has, in fact, seen the server's reply. Until the time the { slotid, sequenceid } request pair can be retired, any recalls of the associated object MUST carry an array of these referring identifiers (in the CB_SEQUENCE operation's arguments), for the benefit of the client. After this time, it is not necessary for the server to provide this information in related callbacks, since it is certain that a race condition can no longer occur.

The CB_SEQUENCE operation which begins each server callback carries a list of "referring" { slotid, sequenceid } tuples. If the client finds the request corresponding to the referring slotid and sequenceid to be currently outstanding (i.e., the server's reply has not been seen by the client), it can determine that the callback has raced the reply, and act accordingly.

The client must not simply wait forever for the expected server reply to arrive on any of the session's operations channels, because it is possible that they will be delayed indefinitely. However, it should wait for a period of time, and if the time expires it can provide a more meaningful error such as NFS4ERR_DELAY.
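The client-side check against the referring { slotid, sequenceid } tuples can be sketched as follows. This is a minimal illustration under stated assumptions: the Outstanding mapping and the function callback_raced_reply are hypothetical names, and the client is assumed to track, per slot, the sequenceid of the request still awaiting a reply.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical client-side view: slotid -> sequenceid of the request that is
# still awaiting a reply on that slot of the operations channel.
Outstanding = Dict[int, int]


def callback_raced_reply(outstanding: Outstanding,
                         referring: Iterable[Tuple[int, int]]) -> bool:
    """Return True if any referring { slotid, sequenceid } pair names a
    request whose reply the client has not yet seen."""
    return any(outstanding.get(slotid) == seqid for slotid, seqid in referring)
```

When the function returns True, the client knows the callback refers to a request whose reply is still in flight, and it can wait for a bounded period for that reply rather than treating the callback as erroneous.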
[[Comment.7: We need to consider the clients' options here, and describe them... NFS4ERR_DELAY has been discussed as a legal reply to CB_RECALL?]]

There are other scenarios under which callbacks may race replies, among them pNFS layout recalls, described in Section 12.5.4.2. [[Comment.8: fill in the blanks w/others, etc...]]

2.10.4.4. COMPOUND and CB_COMPOUND Construction Issues

Very large requests and replies may pose both buffer management issues (especially with RDMA) and reply cache issues. When the session is created (Section 17.36), the client and server negotiate the maximum sized request they will send or process (ca_maxrequestsize), the maximum sized reply they will return or process (ca_maxresponsesize), and the maximum sized reply they will store in the reply cache (ca_maxresponsesize_cached).

If a request exceeds ca_maxrequestsize, the reply will have the status NFS4ERR_REQ_TOO_BIG. A replier may return NFS4ERR_REQ_TOO_BIG as the status for the first operation (SEQUENCE or CB_SEQUENCE) in the request, or it may choose to return it on a subsequent operation.

If a reply exceeds ca_maxresponsesize, the reply will have the status NFS4ERR_REP_TOO_BIG. A replier may return NFS4ERR_REP_TOO_BIG as the status for the first operation (SEQUENCE or CB_SEQUENCE) in the request, or it may choose to return it on a subsequent operation.

If sa_cachethis or csa_cachethis are TRUE, then the replier MUST cache a reply, except if an error is returned by the SEQUENCE or CB_SEQUENCE operation (see Section 2.10.4.1.1). If the reply exceeds ca_maxresponsesize_cached (and sa_cachethis or csa_cachethis are TRUE), then the server MUST return NFS4ERR_REP_TOO_BIG_TO_CACHE. Even if NFS4ERR_REP_TOO_BIG_TO_CACHE (or any other error for that matter) is returned on an operation other than the first operation (SEQUENCE or CB_SEQUENCE), the reply MUST be cached if sa_cachethis or csa_cachethis are TRUE.
For example, if a COMPOUND has eleven operations, including SEQUENCE, the fifth operation is a RENAME, and the tenth operation is a READ for one million bytes, the server may return NFS4ERR_REP_TOO_BIG_TO_CACHE on the tenth operation. Since the server executed several operations, especially the non-idempotent RENAME, the client's request to cache the reply needs to be honored for correct operation of exactly once semantics. If the client retries the request, the server will have cached a reply that contains results for ten of the eleven requested operations, with the tenth operation having a status of NFS4ERR_REP_TOO_BIG_TO_CACHE.

When sending operations that change the current filehandle (except for PUTFH, PUTPUBFH, and PUTROOTFH), a client needs to take care that it does not exceed the maximum reply buffer before the GETFH operation. Otherwise the client will have to retry the operation that changed the current filehandle, in order to obtain the desired filehandle. For the OPEN operation (see Section 17.16), retry is not always available as an option. The following guidelines for the handling of filehandle changing operations are advised:

o A client SHOULD issue GETFH immediately after a current filehandle changing operation. This is especially important after any current filehandle changing non-idempotent operation. It is critical to issue GETFH immediately after OPEN.

o A server MAY return NFS4ERR_REP_TOO_BIG or NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a filehandle changing operation if the reply would be too large on the next operation.

o A server SHOULD return NFS4ERR_REP_TOO_BIG or NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a filehandle changing non-idempotent operation if the reply would be too large on the next operation, especially if the operation is OPEN.
o A server MAY return NFS4ERR_UNSAFE_COMPOUND if it looks at the next operation after a non-idempotent current filehandle changing operation, and finds it is not GETFH. The server would do this if it is unable to determine in advance whether the total response