NFSv4                                                         S. Shepler
Internet-Draft                                                 M. Eisler
Intended status: Standards Track                               D. Noveck
Expires: February 23, 2009                                       Editors
                                                         August 22, 2008


                     NFS Version 4 Minor Version 1
                 draft-ietf-nfsv4-minorversion1-25.txt

Status of this Memo

   By submitting this Internet-Draft, each author represents that any
   applicable patent or other IPR claims of which he or she is aware
   have been or will be disclosed, and any of which he or she becomes
   aware will be disclosed, in accordance with Section 6 of BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as Internet-
   Drafts.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

   This Internet-Draft will expire on February 23, 2009.

Abstract

   This Internet-Draft describes NFS version 4 minor version one,
   including features retained from the base protocol and protocol
   extensions made subsequently.  Major extensions introduced in NFS
   version 4 minor version one include: Sessions, Directory Delegations,
   and parallel NFS (pNFS).

Requirements Language

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [1].

Shepler, et al.
Expires February 23, 2009                                       [Page 1]

Internet-Draft                   NFSv4.1                     August 2008


Table of Contents

   1.  Introduction
     1.1.  The NFS Version 4 Minor Version 1 Protocol
     1.2.  Scope of this Document
     1.3.  NFSv4 Goals
     1.4.  NFSv4.1 Goals
     1.5.  General Definitions
     1.6.  Overview of NFSv4.1 Features
       1.6.1.  RPC and Security
       1.6.2.  Protocol Structure
       1.6.3.  File System Model
       1.6.4.  Locking Facilities
     1.7.  Differences from NFSv4.0
   2.  Core Infrastructure
     2.1.  Introduction
     2.2.  RPC and XDR
       2.2.1.  RPC-based Security
     2.3.  COMPOUND and CB_COMPOUND
     2.4.  Client Identifiers and Client Owners
       2.4.1.  Upgrade from NFSv4.0 to NFSv4.1
       2.4.2.  Server Release of Client ID
       2.4.3.  Resolving Client Owner Conflicts
     2.5.  Server Owners
     2.6.  Security Service Negotiation
       2.6.1.  NFSv4.1 Security Tuples
       2.6.2.  SECINFO and SECINFO_NO_NAME
       2.6.3.  Security Error
     2.7.  Minor Versioning
     2.8.  Non-RPC-based Security Services
       2.8.1.  Authorization
       2.8.2.  Auditing
       2.8.3.  Intrusion Detection
     2.9.  Transport Layers
       2.9.1.  REQUIRED and RECOMMENDED Properties of Transports
       2.9.2.  Client and Server Transport Behavior
       2.9.3.  Ports
     2.10.  Session
       2.10.1.  Motivation and Overview
       2.10.2.  NFSv4 Integration
       2.10.3.  Channels
       2.10.4.  Trunking
       2.10.5.  Exactly Once Semantics
       2.10.6.  RDMA Considerations
       2.10.7.  Sessions Security
       2.10.8.  The SSV GSS Mechanism
       2.10.9.  Session Mechanics - Steady State
       2.10.10.  Session Inactivity Timer
       2.10.11.  Session Mechanics - Recovery
       2.10.12.  Parallel NFS and Sessions
   3.  Protocol Constants and Data Types
     3.1.  Basic Constants
     3.2.  Basic Data Types
     3.3.  Structured Data Types
   4.  Filehandles
     4.1.  Obtaining the First Filehandle
       4.1.1.  Root Filehandle
       4.1.2.  Public Filehandle
     4.2.  Filehandle Types
       4.2.1.  General Properties of a Filehandle
       4.2.2.  Persistent Filehandle
       4.2.3.  Volatile Filehandle
     4.3.  One Method of Constructing a Volatile Filehandle
     4.4.  Client Recovery from Filehandle Expiration
   5.  File Attributes
     5.1.  REQUIRED Attributes
     5.2.  RECOMMENDED Attributes
     5.3.  Named Attributes
     5.4.  Classification of Attributes
     5.5.  Set-Only and Get-Only Attributes
     5.6.  REQUIRED Attributes - List and Definition References
     5.7.  RECOMMENDED Attributes - List and Definition References
     5.8.  Attribute Definitions
       5.8.1.  Definitions of REQUIRED Attributes
       5.8.2.  Definitions of Uncategorized RECOMMENDED Attributes
     5.9.  Interpreting owner and owner_group
     5.10.  Character Case Attributes
     5.11.  Directory Notification Attributes
     5.12.  pNFS Attribute Definitions
     5.13.  Retention Attributes
   6.  Access Control Attributes
     6.1.  Goals
     6.2.  File Attributes Discussion
       6.2.1.  Attribute 12: acl
       6.2.2.  Attribute 58: dacl
       6.2.3.  Attribute 59: sacl
       6.2.4.  Attribute 33: mode
       6.2.5.  Attribute 74: mode_set_masked
     6.3.  Common Methods
       6.3.1.  Interpreting an ACL
       6.3.2.  Computing a Mode Attribute from an ACL
     6.4.  Requirements
       6.4.1.  Setting the mode and/or ACL Attributes
       6.4.2.  Retrieving the mode and/or ACL Attributes
       6.4.3.  Creating New Objects
   7.  Single-server Namespace
     7.1.  Server Exports
     7.2.  Browsing Exports
     7.3.  Server Pseudo File System
     7.4.  Multiple Roots
     7.5.  Filehandle Volatility
     7.6.  Exported Root
     7.7.  Mount Point Crossing
     7.8.  Security Policy and Namespace Presentation
   8.  State Management
     8.1.  Client and Session ID
     8.2.  Stateid Definition
       8.2.1.  Stateid Types
       8.2.2.  Stateid Structure
       8.2.3.  Special Stateids
       8.2.4.  Stateid Lifetime and Validation
       8.2.5.  Stateid Use for I/O Operations
       8.2.6.  Stateid Use for SETATTR Operations
     8.3.  Lease Renewal
     8.4.  Crash Recovery
       8.4.1.  Client Failure and Recovery
       8.4.2.  Server Failure and Recovery
       8.4.3.  Network Partitions and Recovery
     8.5.  Server Revocation of Locks
     8.6.  Short and Long Leases
     8.7.  Clocks, Propagation Delay, and Calculating Lease Expiration
     8.8.  Obsolete Locking Infrastructure From NFSv4.0
   9.  File Locking and Share Reservations
     9.1.  Opens and Byte-Range Locks
       9.1.1.  State-owner Definition
       9.1.2.  Use of the Stateid and Locking
     9.2.  Lock Ranges
     9.3.  Upgrading and Downgrading Locks
     9.4.  Stateid Seqid Values and Byte-Range Locks
     9.5.  Issues with Multiple Open-Owners
     9.6.  Blocking Locks
     9.7.  Share Reservations
     9.8.  OPEN/CLOSE Operations
     9.9.  Open Upgrade and Downgrade
     9.10.  Parallel OPENs
     9.11.  Reclaim of Open and Byte-Range Locks
   10.  Client-Side Caching
     10.1.  Performance Challenges for Client-Side Caching
     10.2.  Delegation and Callbacks
       10.2.1.  Delegation Recovery
     10.3.  Data Caching
       10.3.1.  Data Caching and OPENs
       10.3.2.  Data Caching and File Locking
       10.3.3.  Data Caching and Mandatory File Locking
       10.3.4.  Data Caching and File Identity
     10.4.  Open Delegation
       10.4.1.  Open Delegation and Data Caching
       10.4.2.  Open Delegation and File Locks
       10.4.3.  Handling of CB_GETATTR
       10.4.4.  Recall of Open Delegation
       10.4.5.  Clients that Fail to Honor Delegation Recalls
       10.4.6.  Delegation Revocation
       10.4.7.  Delegations via WANT_DELEGATION
     10.5.  Data Caching and Revocation
       10.5.1.  Revocation Recovery for Write Open Delegation
     10.6.  Attribute Caching
     10.7.  Data and Metadata Caching and Memory Mapped Files
     10.8.  Name and Directory Caching without Directory Delegations
       10.8.1.  Name Caching
       10.8.2.  Directory Caching
     10.9.  Directory Delegations
       10.9.1.  Introduction to Directory Delegations
       10.9.2.  Directory Delegation Design
       10.9.3.  Attributes in Support of Directory Notifications
       10.9.4.  Directory Delegation Recall
       10.9.5.  Directory Delegation Recovery
   11.  Multi-Server Namespace
     11.1.  Location Attributes
     11.2.  File System Presence or Absence
     11.3.  Getting Attributes for an Absent File System
       11.3.1.  GETATTR Within an Absent File System
       11.3.2.  READDIR and Absent File Systems
     11.4.  Uses of Location Information
       11.4.1.  File System Replication
       11.4.2.  File System Migration
       11.4.3.  Referrals
     11.5.  Location Entries and Server Identity
     11.6.  Additional Client-side Considerations
     11.7.  Effecting File System Transitions
       11.7.1.  File System Transitions and Simultaneous Access
       11.7.2.  Simultaneous Use and Transparent Transitions
       11.7.3.  Filehandles and File System Transitions
       11.7.4.  Fileids and File System Transitions
       11.7.5.  Fsids and File System Transitions
       11.7.6.  The Change Attribute and File System Transitions
       11.7.7.  Lock State and File System Transitions
       11.7.8.  Write Verifiers and File System Transitions
       11.7.9.  Readdir Cookies and Verifiers and File System
                Transitions
       11.7.10.  File System Data and File System Transitions
     11.8.  Effecting File System Referrals
       11.8.1.  Referral Example (LOOKUP)
       11.8.2.  Referral Example (READDIR)
     11.9.  The Attribute fs_locations
     11.10.  The Attribute fs_locations_info
       11.10.1.  The fs_locations_server4 Structure
       11.10.2.  The fs_locations_info4 Structure
       11.10.3.  The fs_locations_item4 Structure
     11.11.  The Attribute fs_status
   12.  Parallel NFS (pNFS)
     12.1.  Introduction
     12.2.  pNFS Definitions
       12.2.1.  Metadata
       12.2.2.  Metadata Server
       12.2.3.  pNFS Client
       12.2.4.  Storage Device
       12.2.5.  Storage Protocol
       12.2.6.  Control Protocol
       12.2.7.  Layout Types
       12.2.8.  Layout
       12.2.9.  Layout Iomode
       12.2.10.  Device IDs
     12.3.  pNFS Operations
     12.4.  pNFS Attributes
     12.5.  Layout Semantics
       12.5.1.  Guarantees Provided by Layouts
       12.5.2.  Getting a Layout
       12.5.3.  Layout Stateid
       12.5.4.  Committing a Layout
       12.5.5.  Recalling a Layout
       12.5.6.  Revoking Layouts
       12.5.7.  Metadata Server Write Propagation
     12.6.  pNFS Mechanics
     12.7.  Recovery
       12.7.1.  Recovery from Client Restart
       12.7.2.  Dealing with Lease Expiration on the Client
       12.7.3.  Dealing with Loss of Layout State on the Metadata
                Server
       12.7.4.  Recovery from Metadata Server Restart
       12.7.5.  Operations During Metadata Server Grace Period
       12.7.6.  Storage Device Recovery
     12.8.  Metadata and Storage Device Roles
     12.9.  Security Considerations for pNFS
   13.  PNFS: NFSv4.1 File Layout Type
     13.1.  Client ID and Session Considerations
       13.1.1.  Sessions Considerations for Data Servers
     13.2.  File Layout Definitions
     13.3.  File Layout Data Types
     13.4.  Interpreting the File Layout
       13.4.1.  Determining the Stripe Unit Number
       13.4.2.  Interpreting the File Layout Using Sparse Packing
       13.4.3.  Interpreting the File Layout Using Dense Packing
       13.4.4.  Sparse and Dense Stripe Unit Packing
     13.5.  Data Server Multipathing
     13.6.  Operations Sent to NFSv4.1 Data Servers
     13.7.  COMMIT Through Metadata Server
     13.8.  The Layout Iomode
     13.9.  Metadata and Data Server State Coordination
       13.9.1.  Global Stateid Requirements
       13.9.2.  Data Server State Propagation
     13.10.  Data Server Component File Size
     13.11.  Layout Revocation and Fencing
     13.12.  Security Considerations for the File Layout Type
   14.  Internationalization
     14.1.  Stringprep profile for the utf8str_cs type
     14.2.  Stringprep profile for the utf8str_cis type
     14.3.  Stringprep profile for the utf8str_mixed type
     14.4.  UTF-8 Capabilities
     14.5.  UTF-8 Related Errors
   15.  Error Values
     15.1.  Error Definitions
       15.1.1.  General Errors
       15.1.2.  Filehandle Errors
       15.1.3.  Compound Structure Errors
       15.1.4.  File System Errors
       15.1.5.  State Management Errors
       15.1.6.  Security Errors
       15.1.7.  Name Errors
       15.1.8.  Locking Errors
       15.1.9.  Reclaim Errors
       15.1.10.  pNFS Errors
       15.1.11.  Session Use Errors
       15.1.12.  Session Management Errors
       15.1.13.  Client Management Errors
       15.1.14.  Delegation Errors
       15.1.15.  Attribute Handling Errors
       15.1.16.  Obsoleted Errors
     15.2.  Operations and their valid errors
     15.3.  Callback operations and their valid errors
     15.4.  Errors and the operations that use them
   16.  NFSv4.1 Procedures
     16.1.  Procedure 0: NULL - No Operation
     16.2.  Procedure 1: COMPOUND - Compound Operations
   17.  Operations: REQUIRED, RECOMMENDED, or OPTIONAL
   18.  NFSv4.1 Operations
     18.1.  Operation 3: ACCESS - Check Access Rights
     18.2.  Operation 4: CLOSE - Close File
     18.3.  Operation 5: COMMIT - Commit Cached Data
     18.4.  Operation 6: CREATE - Create a Non-Regular File Object
     18.5.  Operation 7: DELEGPURGE - Purge Delegations Awaiting
            Recovery
     18.6.  Operation 8: DELEGRETURN - Return Delegation
     18.7.  Operation 9: GETATTR - Get Attributes
     18.8.  Operation 10: GETFH - Get Current Filehandle
     18.9.  Operation 11: LINK - Create Link to a File
     18.10.  Operation 12: LOCK - Create Lock
     18.11.  Operation 13: LOCKT - Test For Lock
     18.12.  Operation 14: LOCKU - Unlock File
     18.13.  Operation 15: LOOKUP - Lookup Filename
     18.14.  Operation 16: LOOKUPP - Lookup Parent Directory
     18.15.  Operation 17: NVERIFY - Verify Difference in Attributes
     18.16.  Operation 18: OPEN - Open a Regular File
     18.17.  Operation 19: OPENATTR - Open Named Attribute Directory
     18.18.  Operation 21: OPEN_DOWNGRADE - Reduce Open File Access
     18.19.  Operation 22: PUTFH - Set Current Filehandle
     18.20.  Operation 23: PUTPUBFH - Set Public Filehandle
     18.21.  Operation 24: PUTROOTFH - Set Root Filehandle
     18.22.  Operation 25: READ - Read from File
     18.23.  Operation 26: READDIR - Read Directory
     18.24.  Operation 27: READLINK - Read Symbolic Link
     18.25.  Operation 28: REMOVE - Remove File System Object
     18.26.  Operation 29: RENAME - Rename Directory Entry
     18.27.  Operation 31: RESTOREFH - Restore Saved Filehandle
     18.28.  Operation 32: SAVEFH - Save Current Filehandle
     18.29.  Operation 33: SECINFO - Obtain Available Security
     18.30.  Operation 34: SETATTR - Set Attributes
     18.31.  Operation 37: VERIFY - Verify Same Attributes
     18.32.  Operation 38: WRITE - Write to File
     18.33.  Operation 40: BACKCHANNEL_CTL - Backchannel control
     18.34.  Operation 41: BIND_CONN_TO_SESSION
     18.35.  Operation 42: EXCHANGE_ID - Instantiate Client ID
     18.36.  Operation 43: CREATE_SESSION - Create New Session and
             Confirm Client ID
     18.37.  Operation 44: DESTROY_SESSION - Destroy existing session
     18.38.  Operation 45: FREE_STATEID - Free stateid with no locks
     18.39.  Operation 46: GET_DIR_DELEGATION - Get a directory
             delegation
     18.40.  Operation 47: GETDEVICEINFO - Get Device Information
     18.41.  Operation 48: GETDEVICELIST - Get All Device Mappings
             for a File System
     18.42.  Operation 49: LAYOUTCOMMIT - Commit writes made using
             a layout
     18.43.  Operation 50: LAYOUTGET - Get Layout Information
     18.44.  Operation 51: LAYOUTRETURN - Release Layout Information
     18.45.  Operation 52: SECINFO_NO_NAME - Get Security on
             Unnamed Object
     18.46.  Operation 53: SEQUENCE - Supply per-procedure
             sequencing and control
     18.47.  Operation 54: SET_SSV - Update SSV for a Client ID
     18.48.  Operation 55: TEST_STATEID - Test stateids for validity
     18.49.  Operation 56: WANT_DELEGATION - Request Delegation
     18.50.  Operation 57: DESTROY_CLIENTID - Destroy existing
             client ID
     18.51.  Operation 58: RECLAIM_COMPLETE - Indicates Reclaims
             Finished
     18.52.  Operation 10044: ILLEGAL - Illegal operation
   19.  NFSv4.1 Callback Procedures
     19.1.  Procedure 0: CB_NULL - No Operation
     19.2.  Procedure 1: CB_COMPOUND - Compound Operations
   20.  NFSv4.1 Callback Operations
     20.1.  Operation 3: CB_GETATTR - Get Attributes
     20.2.  Operation 4: CB_RECALL - Recall a Delegation
     20.3.  Operation 5: CB_LAYOUTRECALL - Recall Layout from Client
     20.4.  Operation 6: CB_NOTIFY - Notify directory changes
     20.5.  Operation 7: CB_PUSH_DELEG - Offer Delegation to Client
     20.6.  Operation 8: CB_RECALL_ANY - Keep any N recallable
            objects
     20.7.  Operation 9: CB_RECALLABLE_OBJ_AVAIL - Signal
            Resources for Recallable Objects
     20.8.  Operation 10: CB_RECALL_SLOT - change flow control
            limits
     20.9.  Operation 11: CB_SEQUENCE - Supply backchannel
            sequencing and control
     20.10.  Operation 12: CB_WANTS_CANCELLED - Cancel Pending
             Delegation Wants
     20.11.  Operation 13: CB_NOTIFY_LOCK - Notify of possible
             lock availability
     20.12.  Operation 14: CB_NOTIFY_DEVICEID - Notify device ID
             changes
     20.13.  Operation 10044: CB_ILLEGAL - Illegal Callback
             Operation
   21.  Security Considerations
   22.  IANA Considerations
     22.1.  Named Attribute Definitions
       22.1.1.  Initial Registry
       22.1.2.  Updating Registrations
     22.2.  Device ID Notifications
       22.2.1.  Initial Registry
       22.2.2.  Updating Registrations
     22.3.  Object Recall Types
       22.3.1.  Initial Registry
       22.3.2.  Updating Registrations
     22.4.  Layout Types
       22.4.1.  Initial Registry
       22.4.2.  Updating Registrations
       22.4.3.  Guidelines for Writing Layout Type Specifications
     22.5.  Path Variable Definitions
       22.5.1.  Path Variables Registry
       22.5.2.  Values for the ${ietf.org:CPU_ARCH} Variable
       22.5.3.  Values for the ${ietf.org:OS_TYPE} Variable
   23.  References
     23.1.  Normative References
     23.2.  Informative References
   Appendix A.  Acknowledgments
   Appendix B.  RFC Editor Notes
   Authors' Addresses
   Intellectual Property and Copyright Statements


1.  Introduction

1.1.  The NFS Version 4 Minor Version 1 Protocol

   The NFS version 4 minor version 1 (NFSv4.1) protocol is the second
   minor version of the NFS version 4 (NFSv4) protocol.  The first minor
   version, NFSv4.0, is described in [20].  NFSv4.1 generally follows
   the guidelines for the minor versioning model listed in Section 10 of
   RFC 3530.  However, it diverges from guidelines 11 ("a client and
   server that supports minor version X must support minor versions 0
   through X-1") and 12 ("no features may be introduced as mandatory in
   a minor version").  These divergences are due to the introduction of
   the sessions model for managing non-idempotent operations and the
   RECLAIM_COMPLETE operation.
   These two new features are infrastructural in nature and simplify
   implementation of existing and other new features.  Making them
   anything but REQUIRED would add undue complexity to protocol
   definition and implementation.  NFSv4.1 accordingly updates the Minor
   Versioning guidelines (Section 2.7).

   As a minor version, NFSv4.1 is consistent with the overall goals for
   NFSv4, but extends the protocol so as to better meet those goals,
   based on experiences with NFSv4.0.  NFSv4.1 has also adopted some
   additional goals, which motivate some of the major extensions in
   NFSv4.1.

1.2.  Scope of this Document

   This document describes the NFSv4.1 protocol.  With respect to
   NFSv4.0, this document does not:

   o  describe the NFSv4.0 protocol, except where needed to contrast
      with NFSv4.1.

   o  modify the specification of the NFSv4.0 protocol.

   o  clarify the NFSv4.0 protocol.

1.3.  NFSv4 Goals

   The NFSv4 protocol is a further revision of the NFS protocol already
   defined by NFSv3 [21].  It retains the essential characteristics of
   previous versions: easy recovery; independence of transport
   protocols, operating systems and file systems; simplicity; and good
   performance.  NFSv4 has the following goals:

   o  Improved access and good performance on the Internet.

      The protocol is designed to transit firewalls easily, perform well
      where latency is high and bandwidth is low, and scale to very
      large numbers of clients per server.

   o  Strong security with negotiation built into the protocol.

      The protocol builds on the work of the ONCRPC working group in
      supporting the RPCSEC_GSS protocol.
      Additionally, the NFSv4.1 protocol provides a mechanism that
      allows clients and servers to negotiate security and requires
      clients and servers to support a minimal set of security schemes.

   o  Good cross-platform interoperability.

      The protocol features a file system model that provides a useful,
      common set of features that does not unduly favor one file system
      or operating system over another.

   o  Designed for protocol extensions.

      The protocol is designed to accept standard extensions within a
      framework that enables and encourages backward compatibility.

1.4.  NFSv4.1 Goals

   NFSv4.1 has the following goals, within the framework established by
   the overall NFSv4 goals.

   o  To correct significant structural weaknesses and oversights
      discovered in the base protocol.

   o  To add clarity and specificity to areas left unaddressed or not
      addressed in sufficient detail in the base protocol. However, as
      stated in Section 1.2, it is not a goal to clarify the NFSv4.0
      protocol in the NFSv4.1 specification.

   o  To add specific features based on experience with the existing
      protocol and recent industry developments.

   o  To provide protocol support to take advantage of clustered server
      deployments including the ability to provide scalable parallel
      access to files distributed among multiple servers.

1.5.  General Definitions

   The following definitions are provided for the purpose of providing
   an appropriate context for the reader.

   Byte  This document defines a byte as an octet, i.e. a datum exactly
      8 bits in length.

   Client  The "client" is the entity that accesses the NFS server's
      resources. The client may be an application which contains the
      logic to access the NFS server directly.
      The client may also be the traditional operating system client
      that provides remote file system services for a set of
      applications.

      A client is uniquely identified by a Client Owner.

      With reference to file locking, the client is also the entity that
      maintains a set of locks on behalf of one or more applications.
      This client is responsible for crash or failure recovery for those
      locks it manages.

      Note that multiple clients may share the same transport and
      connection and multiple clients may exist on the same network
      node.

   Client ID  A 64-bit quantity used as a unique, short-hand reference
      to a client-supplied verifier and client owner. The server is
      responsible for supplying the client ID.

   Client Owner  The client owner is a unique string, opaque to the
      server, which identifies a client. Multiple network connections
      and source network addresses originating from those connections
      may share a client owner. The server is expected to treat
      requests from connections with the same client owner as coming
      from the same client.

   File System  The collection of objects on a server (as identified by
      the major identifier of a Server Owner, which is defined later in
      this section), that share the same fsid attribute (see
      Section 5.8.1.9).

   Lease  An interval of time defined by the server for which the client
      is irrevocably granted a lock. At the end of a lease period the
      lock may be revoked if the lease has not been extended. The lock
      must be revoked if a conflicting lock has been granted after the
      lease interval.

      All leases granted by a server have the same fixed interval. Note
      that the fixed interval was chosen to alleviate the expense a
      server would have in maintaining state about variable length
      leases across server failures.

   Lock  The term "lock" is used to refer to byte-range (in UNIX
      environments, also known as record) locks, share reservations,
      delegations, or layouts unless specifically stated otherwise.

   Server  The "server" is the entity responsible for coordinating
      client access to a set of file systems and is identified by a
      Server Owner. A server can span multiple network addresses.

   Server Owner  The "Server Owner" identifies the server to the client.
      The server owner consists of a major and minor identifier. When
      the client has two connections each to a peer with the same major
      identifier, the client assumes both peers are the same server (the
      server namespace is the same via each connection), and assumes
      that lock state is sharable across both connections. When each
      peer has both the same major and minor identifier, the client
      assumes each connection might be associable with the same session.

   Stable Storage  NFSv4.1 servers must be able to recover without data
      loss from multiple power failures (including cascading power
      failures, that is, several power failures in quick succession),
      operating system failures, and hardware failure of components
      other than the storage medium itself (for example, disk,
      nonvolatile RAM).

      Some examples of stable storage that are allowable for an NFS
      server include:

      1.  Media commit of data, that is, the modified data has been
          successfully written to the disk media, for example, the disk
          platter.

      2.  An immediate reply disk drive with battery-backed on-drive
          intermediate storage or uninterruptible power system (UPS).

      3.  Server commit of data with battery-backed intermediate storage
          and recovery software.

      4.  Cache commit with uninterruptible power system (UPS) and
          recovery software.

   Stateid  A 128-bit quantity returned by a server that uniquely
      defines the open and locking state provided by the server for a
      specific open-owner or lock-owner/open-owner pair for a specific
      file and type of lock.

   Verifier  A 64-bit quantity generated by the client that the server
      can use to determine if the client has restarted and lost all
      previous lock state.

1.6.  Overview of NFSv4.1 Features

   To provide a reasonable context for the reader, the major features of
   the NFSv4.1 protocol will be reviewed in brief. This will be done to
   provide an appropriate context for both the reader who is familiar
   with the previous versions of the NFS protocol and the reader who is
   new to the NFS protocols. For the reader new to the NFS protocols,
   there is still a set of fundamental knowledge that is expected. The
   reader should be familiar with the XDR and RPC protocols as described
   in [2] and [3]. A basic knowledge of file systems and distributed
   file systems is expected as well.

   In general this specification of NFSv4.1 will not distinguish those
   features added in minor version one from those present in the base
   protocol but will treat NFSv4.1 as a unified whole. See Section 1.7
   for a summary of the differences between NFSv4.0 and NFSv4.1.

1.6.1.  RPC and Security

   As with previous versions of NFS, the External Data Representation
   (XDR) and Remote Procedure Call (RPC) mechanisms used for the NFSv4.1
   protocol are those defined in [2] and [3]. To meet end-to-end
   security requirements, the RPCSEC_GSS framework [4] will be used to
   extend the basic RPC security. With the use of RPCSEC_GSS, various
   mechanisms can be provided to offer authentication, integrity, and
   privacy to the NFSv4 protocol.
   Kerberos V5 will be used as described in [5] to provide one security
   framework. The LIPKEY and SPKM-3 GSS-API mechanisms described in [6]
   will be used to provide for the use of user password and
   client/server public key certificates by the NFSv4 protocol. With
   the use of RPCSEC_GSS, other mechanisms may also be specified and
   used for NFSv4.1 security.

   To enable in-band security negotiation, the NFSv4.1 protocol has
   operations which provide the client a method of querying the server
   about its policies regarding which security mechanisms must be used
   for access to the server's file system resources. With this, the
   client can securely match the security mechanism that meets the
   policies specified at both the client and server.

1.6.2.  Protocol Structure

1.6.2.1.  Core Protocol

   Unlike NFSv3, which used a series of ancillary protocols (e.g. NLM,
   NSM, MOUNT), within all minor versions of NFSv4 a single RPC protocol
   is used to make requests to the server. Facilities that had been
   separate protocols, such as locking, are now integrated within a
   single unified protocol.

1.6.2.2.  Parallel Access

   Minor version one supports high-performance data access to a
   clustered server implementation by enabling a separation of metadata
   access and data access, with the latter done to multiple servers in
   parallel.

   Such parallel data access is controlled by recallable objects known
   as "layouts", which are integrated into the protocol locking model.
   Clients direct requests for data access to a set of data servers
   specified by the layout via a data storage protocol which may be
   NFSv4.1 or may be another protocol.

1.6.3.  File System Model

   The general file system model used for the NFSv4.1 protocol is the
   same as previous versions. The server file system is hierarchical
   with the regular files contained within being treated as opaque byte
   streams. In a slight departure, file and directory names are encoded
   with UTF-8 to deal with the basics of internationalization.

   The NFSv4.1 protocol does not require a separate protocol to provide
   for the initial mapping between path name and filehandle. All file
   systems exported by a server are presented as a tree so that all file
   systems are reachable from a special per-server global root
   filehandle. This allows LOOKUP operations to be used to perform
   functions previously provided by the MOUNT protocol. The server
   provides any necessary pseudo file systems to bridge any gaps that
   arise due to unexported gaps between exported file systems.

1.6.3.1.  Filehandles

   As in previous versions of the NFS protocol, opaque filehandles are
   used to identify individual files and directories. Lookup-type and
   create operations translate file and directory names to filehandles
   which are then used to identify objects in subsequent operations.

   The NFSv4.1 protocol provides support for persistent filehandles,
   guaranteed to be valid for the lifetime of the file system object
   designated. In addition it provides support to servers to provide
   filehandles with more limited validity guarantees, called volatile
   filehandles.

1.6.3.2.  File Attributes

   The NFSv4.1 protocol has a rich and extensible file object attribute
   structure, which is divided into REQUIRED, RECOMMENDED, and named
   attributes (see Section 5).

   Several (but not all) of the REQUIRED attributes are derived from the
   attributes of NFSv3 (see definition of the fattr3 data type in [21]).
   An example of a REQUIRED attribute is the file object's type
   (Section 5.8.1.2) so that regular files can be distinguished from
   directories (also known as folders in some operating environments)
   and other types of objects. REQUIRED attributes are discussed in
   Section 5.1.

   Three examples of RECOMMENDED attributes are acl, sacl, and dacl.
   These attributes define an Access Control List (ACL) on a file object
   (Section 6). An ACL provides directory and file access control
   beyond the model used in NFSv3. The ACL definition allows for
   specification of specific sets of permissions for individual users
   and groups. In addition, ACL inheritance allows propagation of
   access permissions and restrictions down a directory tree as file
   system objects are created. RECOMMENDED attributes are discussed in
   Section 5.2.

   A named attribute is an opaque byte stream that is associated with a
   directory or file and referred to by a string name. Named attributes
   are meant to be used by client applications as a method to associate
   application-specific data with a regular file or directory. NFSv4.1
   modifies named attributes relative to NFSv4.0 by tightening the
   allowed operations in order to prevent the development of
   non-interoperable implementations. Named attributes are discussed in
   Section 5.3.

1.6.3.3.  Multi-server Namespace

   NFSv4.1 contains a number of features to allow implementation of
   namespaces that cross server boundaries and that allow and facilitate
   a non-disruptive transfer of support for individual file systems
   between servers. They are all based upon attributes that allow one
   file system to specify alternate or new locations for that file
   system.

   These attributes may be used together with the concept of absent file
   systems, which provide specifications for additional locations but no
   actual file system content. This allows a number of important
   facilities:

   o  Location attributes may be used with absent file systems to
      implement referrals whereby one server may direct the client to a
      file system provided by another server. This allows extensive
      multi-server namespaces to be constructed.

   o  Location attributes may be provided for present file systems to
      provide the locations of alternate file system instances or
      replicas to be used in the event that the current file system
      instance becomes unavailable.

   o  Location attributes may be provided when a previously present file
      system becomes absent. This allows non-disruptive migration of
      file systems to alternate servers.

1.6.4.  Locking Facilities

   As mentioned previously, NFSv4.1 is a single protocol which includes
   locking facilities. These locking facilities include support for
   many types of locks including a number of sorts of recallable locks.
   Recallable locks such as delegations allow the client to be assured
   that certain events will not occur so long as that lock is held.
   When circumstances change, the lock is recalled via a callback
   request. The assurances provided by delegations allow more extensive
   caching to be done safely when circumstances allow it.

   The types of locks are:

   o  Share reservations as established by OPEN operations.

   o  Byte-range locks.

   o  File delegations, which are recallable locks that assure the
      holder that inconsistent opens and file changes cannot occur so
      long as the delegation is held.

   o  Directory delegations, which are recallable locks that assure the
      holder that inconsistent directory modifications cannot occur so
      long as the delegation is held.

   o  Layouts, which are recallable objects that assure the holder that
      the client may access the file data directly and that no change to
      the data's location inconsistent with that access may be made so
      long as the layout is held.

   All locks for a given client are tied together under a single
   client-wide lease. All requests made on sessions associated with the
   client renew that lease. When leases are not promptly renewed, locks
   are subject to revocation. In the event of server restart, clients
   have the opportunity to safely reclaim their locks within a special
   grace period.

1.7.  Differences from NFSv4.0

   The following summarizes the major differences between minor version
   one and the base protocol:

   o  Implementation of the sessions model (Section 2.10).

   o  Parallel access to data (Section 12).

   o  Addition of the RECLAIM_COMPLETE operation to better structure the
      lock reclamation process (Section 18.51).

   o  Enhanced delegation support as follows.

      *  Delegations on directories and other file types in addition to
         regular files (Section 18.39, Section 18.49).

      *  Operations to optimize acquisition of recalled or denied
         delegations (Section 18.49, Section 20.5, Section 20.7).

      *  Notifications of changes to files and directories
         (Section 18.39, Section 20.4).

      *  A method to allow a server to indicate it is recalling one or
         more delegations for resource management reasons, and thus a
         method to allow the client to pick which delegations to return
         (Section 20.6).

   o  Attributes can be set atomically during exclusive file create via
      the OPEN operation (see the new EXCLUSIVE4_1 creation method in
      Section 18.16).

   o  Open files can be preserved if removed and the hard link count
      goes to zero thus obviating the need for clients to rename deleted
      files to partially hidden names -- colloquially called "silly
      rename" (see the new OPEN4_RESULT_PRESERVE_UNLINKED reply flag in
      Section 18.16).

   o  Improved compatibility with Microsoft Windows for Access Control
      Lists (Section 6.2.3, Section 6.2.2, Section 6.4.3.2).

   o  Data retention (Section 5.13).

   o  Identification of the implementation of the NFS client and server
      (Section 18.35).

   o  Support for notification of the availability of byte-range locks
      (see the new OPEN4_RESULT_MAY_NOTIFY_LOCK reply flag in
      Section 18.16 and see Section 20.11).

2.  Core Infrastructure

2.1.  Introduction

   NFSv4.1 relies on core infrastructure common to nearly every
   operation. This core infrastructure is described in the remainder of
   this section.

2.2.  RPC and XDR

   The NFSv4.1 protocol is a Remote Procedure Call (RPC) application
   that uses RPC version 2 and the corresponding eXternal Data
   Representation (XDR) as defined in [3] and [2].

2.2.1.  RPC-based Security

   Previous NFS versions have been thought of as having a host-based
   authentication model, where the NFS server authenticates the NFS
   client, and trusts the client to authenticate all users. Actually,
   NFS has always depended on RPC for authentication. One of the first
   forms of RPC authentication, AUTH_SYS, had no strong authentication,
   and required a host-based authentication approach.
   NFSv4.1 also depends on RPC for basic security services, and mandates
   RPC support for a user-based authentication model. The user-based
   authentication model has user principals authenticated by a server,
   and in turn the server authenticated by user principals. RPC
   provides some basic security services which are used by NFSv4.1.

2.2.1.1.  RPC Security Flavors

   As described in section 7.2 "Authentication" of [3], RPC security is
   encapsulated in the RPC header, via a security or authentication
   flavor, and information specific to the specified security flavor.
   Every RPC header conveys information used to identify and
   authenticate a client and server. As discussed in Section 2.2.1.1.1,
   some security flavors provide additional security services.

   NFSv4.1 clients and servers MUST implement RPCSEC_GSS. (This
   requirement to implement is not a requirement to use.) Other
   flavors, such as AUTH_NONE and AUTH_SYS, MAY be implemented as well.

2.2.1.1.1.  RPCSEC_GSS and Security Services

   RPCSEC_GSS ([4]) uses the functionality of GSS-API [7]. This allows
   for the use of various security mechanisms by the RPC layer without
   the additional implementation overhead of adding RPC security
   flavors.

2.2.1.1.1.1.  Identification, Authentication, Integrity, Privacy

   Via the GSS-API, RPCSEC_GSS can be used to identify and authenticate
   users on clients to servers, and servers to users. It can also
   perform integrity checking on the entire RPC message, including the
   RPC header, and the arguments or results. Finally, privacy, usually
   via encryption, is a service available with RPCSEC_GSS. Privacy is
   performed on the arguments and results.
   Note that if privacy is selected, integrity, authentication, and
   identification are enabled. If privacy is not selected, but
   integrity is selected, authentication and identification are enabled.
   If integrity and privacy are not selected, but authentication is
   enabled, identification is enabled. RPCSEC_GSS does not provide
   identification as a separate service.

   Although GSS-API has an authentication service distinct from its
   privacy and integrity services, GSS-API's authentication service is
   not used for RPCSEC_GSS's authentication service. Instead, each RPC
   request and response header is integrity protected with the GSS-API
   integrity service, and this allows RPCSEC_GSS to offer per-RPC
   authentication and identity. See [4] for more information.

   NFSv4.1 clients and servers MUST support RPCSEC_GSS's integrity and
   authentication service. NFSv4.1 servers MUST support RPCSEC_GSS's
   privacy service. NFSv4.1 clients SHOULD support RPCSEC_GSS's privacy
   service.

2.2.1.1.1.2.  Security mechanisms for NFSv4.1

   RPCSEC_GSS, via GSS-API, normalizes access to mechanisms that provide
   security services. Therefore NFSv4.1 clients and servers MUST
   support three security mechanisms: Kerberos V5, SPKM-3, and LIPKEY.

   The use of RPCSEC_GSS requires selection of: mechanism, quality of
   protection (QOP), and service (authentication, integrity, privacy).
   For the mandated security mechanisms, NFSv4.1 specifies that a QOP of
   zero (0) is used, leaving it up to the mechanism or the mechanism's
   configuration to use an appropriate level of protection that QOP zero
   maps to. Each mandated mechanism specifies a minimum set of
   cryptographic algorithms for implementing integrity and privacy.
   NFSv4.1 clients and servers MUST be implemented on operating
   environments that comply with the REQUIRED cryptographic algorithms
   of each REQUIRED mechanism.

2.2.1.1.1.2.1.  Kerberos V5

   The Kerberos V5 GSS-API mechanism as described in [5] MUST be
   implemented with the RPCSEC_GSS services as specified in the
   following table:

      column descriptions:
      1 == number of pseudo flavor
      2 == name of pseudo flavor
      3 == mechanism's OID
      4 == RPCSEC_GSS service
      5 == NFSv4.1 clients MUST support
      6 == NFSv4.1 servers MUST support

      1      2     3                    4                     5   6
      ------------------------------------------------------------------
      390003 krb5  1.2.840.113554.1.2.2 rpc_gss_svc_none      yes yes
      390004 krb5i 1.2.840.113554.1.2.2 rpc_gss_svc_integrity yes yes
      390005 krb5p 1.2.840.113554.1.2.2 rpc_gss_svc_privacy    no yes

   Note that the number and name of the pseudo flavor is presented here
   as a mapping aid to the implementor. Because the NFSv4.1 protocol
   includes a method to negotiate security and it understands the
   GSS-API mechanism, the pseudo flavor is not needed. The pseudo
   flavor is needed for NFSv3 since the security negotiation is done via
   the MOUNT protocol as described in [22].

2.2.1.1.1.2.2.  LIPKEY

   The LIPKEY GSS-API mechanism as described in [6] MUST be implemented
   with the RPCSEC_GSS services as specified in the following table:

      1      2        3             4                     5   6
      ------------------------------------------------------------------
      390006 lipkey   1.3.6.1.5.5.9 rpc_gss_svc_none      yes yes
      390007 lipkey-i 1.3.6.1.5.5.9 rpc_gss_svc_integrity yes yes
      390008 lipkey-p 1.3.6.1.5.5.9 rpc_gss_svc_privacy    no yes

2.2.1.1.1.2.3.  SPKM-3 as a security triple

   The SPKM-3 GSS-API mechanism as described in [6] MUST be implemented
   with the RPCSEC_GSS services as specified in the following table:

      1      2      3               4                     5   6
      ------------------------------------------------------------------
      390009 spkm3  1.3.6.1.5.5.1.3 rpc_gss_svc_none      yes yes
      390010 spkm3i 1.3.6.1.5.5.1.3 rpc_gss_svc_integrity yes yes
      390011 spkm3p 1.3.6.1.5.5.1.3 rpc_gss_svc_privacy    no yes

2.2.1.1.1.3.  GSS Server Principal

   Regardless of what security mechanism under RPCSEC_GSS is being used,
   the NFS server MUST identify itself in GSS-API via a
   GSS_C_NT_HOSTBASED_SERVICE name type. GSS_C_NT_HOSTBASED_SERVICE
   names are of the form:

      service@hostname

   For NFS, the "service" element is

      nfs

   Implementations of security mechanisms will convert nfs@hostname to
   various different forms. For Kerberos V5, LIPKEY, and SPKM-3, the
   following form is RECOMMENDED:

      nfs/hostname

2.3.  COMPOUND and CB_COMPOUND

   A significant departure from the versions of the NFS protocol before
   NFSv4 is the introduction of the COMPOUND procedure. For the NFSv4
   protocol, in all minor versions, there are exactly two RPC
   procedures, NULL and COMPOUND. The COMPOUND procedure is defined as
   a series of individual operations and these operations perform the
   sorts of functions performed by traditional NFS procedures.

   The operations combined within a COMPOUND request are evaluated in
   order by the server, without any atomicity guarantees. A limited set
   of facilities exist to pass results from one operation to another.
   Once an operation returns a failing result, the evaluation ends and
   the results of all evaluated operations are returned to the client.
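
   The evaluation rule for COMPOUND (in order, no atomicity, stop at the
   first failing operation, return the results gathered so far) can be
   illustrated with a minimal sketch. This is not an implementation of
   the protocol: the Operation tuple, the evaluate_compound function,
   and the handlers are hypothetical names chosen for illustration.
   Only the status values NFS4_OK (0) and NFS4ERR_NOENT (2) are taken
   from NFSv4.

```python
from typing import Callable, List, Tuple

NFS4_OK = 0          # success status
NFS4ERR_NOENT = 2    # "no such file or directory" status

# Hypothetical representation of one operation: (name, handler).
# A handler returns an NFSv4 status code.
Operation = Tuple[str, Callable[[], int]]


def evaluate_compound(ops: List[Operation]) -> Tuple[int, List[Tuple[str, int]]]:
    """Evaluate ops in order and stop after the first non-OK status.

    Earlier operations are not rolled back: there is no atomicity.
    """
    results: List[Tuple[str, int]] = []
    status = NFS4_OK
    for name, handler in ops:
        status = handler()
        results.append((name, status))
        if status != NFS4_OK:
            break  # later operations are never evaluated
    return status, results


# A multi-component lookup in which the second LOOKUP fails.
request = [
    ("PUTROOTFH", lambda: NFS4_OK),
    ("LOOKUP export", lambda: NFS4_OK),
    ("LOOKUP missing", lambda: NFS4ERR_NOENT),
    ("GETATTR", lambda: NFS4_OK),
]
status, results = evaluate_compound(request)
```

   In this sketch the reply carries three results and the overall
   status is that of the failing LOOKUP; the trailing GETATTR is not
   evaluated.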

   With the use of the COMPOUND procedure, the client is able to build
   simple or complex requests. These COMPOUND requests allow for a
   reduction in the number of RPCs needed for logical file system
   operations. For example, multi-component lookup requests can be
   constructed by combining multiple LOOKUP operations. Those can be
   further combined with operations such as GETATTR, READDIR, or OPEN
   plus READ to do more complicated sets of operations without incurring
   additional latency.

   NFSv4.1 also contains a considerable set of callback operations in
   which the server makes an RPC directed at the client. Callback RPCs
   have a similar structure to that of the normal server requests. In
   all minor versions of the NFSv4 protocol there are two callback RPC
   procedures, CB_NULL and CB_COMPOUND. The CB_COMPOUND procedure is
   defined in an analogous fashion to that of COMPOUND with its own set
   of callback operations.

   The addition of new server and callback operations within the
   COMPOUND and CB_COMPOUND request framework provides a means of
   extending the protocol in subsequent minor versions.

   Except for a small number of operations needed for session creation,
   server requests and callback requests are performed within the
   context of a session. Sessions provide a client context for every
   request and support robust reply protection for non-idempotent
   requests.

2.4.  Client Identifiers and Client Owners

   For each operation that obtains or depends on locking state, the
   specific client must be identifiable by the server.

   Each distinct client instance is represented by a client ID. A
   client ID is a 64-bit identifier representing a specific client at a
   given time.
   The client ID is changed whenever the client re-initializes, and may
   change when the server re-initializes. Client IDs are used to
   support lock identification and crash recovery.

   During steady state operation, the client ID associated with each
   operation is derived from the session (see Section 2.10) on which the
   operation is sent. A session is associated with a client ID when the
   session is created.

   Unlike NFSv4.0, the only NFSv4.1 operations possible before a client
   ID is established are those needed to establish the client ID.

   A sequence of an EXCHANGE_ID operation followed by a CREATE_SESSION
   operation using that client ID (eir_clientid as returned from
   EXCHANGE_ID) is required to establish and confirm the client ID on
   the server. Establishment of identification by a new incarnation of
   the client also has the effect of immediately releasing any locking
   state that a previous incarnation of that same client might have had
   on the server. Such released state would include all lock, share
   reservation, and layout state and, where the server is not supporting
   the CLAIM_DELEGATE_PREV claim type, all delegation state associated
   with the same client with the same identity. For discussion of
   delegation state recovery, see Section 10.2.1. For discussion of
   layout state recovery, see Section 12.7.1.

   Releasing such state requires that the server be able to determine
   that one client instance is the successor of another. Where this
   cannot be done, for any of a number of reasons, the locking state
   will remain for a time subject to lease expiration (see Section 8.3)
   and the new client will need to wait for such state to be removed, if
   it makes conflicting lock requests.
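
   The establish-and-confirm sequence described above can be pictured
   with a toy, in-memory model of the server side. It is a sketch under
   strong simplifying assumptions and not the EXCHANGE_ID state machine
   defined later in this document: the ToyServer and ClientRecord
   classes and their attributes are hypothetical, locking state is
   reduced to a set of strings, and the error is reported as a Python
   exception that merely carries the name NFS4ERR_STALE_CLIENTID.

```python
import itertools
from dataclasses import dataclass, field


@dataclass
class ClientRecord:
    clientid: int
    verifier: bytes
    locks: set = field(default_factory=set)  # stand-in for locking state


class ToyServer:
    """Hypothetical in-memory model; not the full EXCHANGE_ID rules."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.confirmed = {}  # owner string -> confirmed ClientRecord
        self.pending = {}    # client ID -> (owner string, ClientRecord)

    def exchange_id(self, co_ownerid: bytes, co_verifier: bytes) -> int:
        current = self.confirmed.get(co_ownerid)
        if current is not None and current.verifier == co_verifier:
            return current.clientid  # same incarnation: same client ID
        # New owner or new incarnation: hand out an unconfirmed client ID.
        record = ClientRecord(next(self._ids), co_verifier)
        self.pending[record.clientid] = (co_ownerid, record)
        return record.clientid

    def create_session(self, clientid: int) -> None:
        if clientid in self.pending:
            co_ownerid, record = self.pending.pop(clientid)
            # Confirming the new incarnation replaces the old record and,
            # with it, the previous incarnation's locking state.
            self.confirmed[co_ownerid] = record
            return
        if not any(r.clientid == clientid for r in self.confirmed.values()):
            raise LookupError("NFS4ERR_STALE_CLIENTID")
```

   The point of the sketch is the ordering: in this model the previous
   incarnation's state survives the EXCHANGE_ID of the new incarnation
   and is dropped only when CREATE_SESSION confirms the new client ID.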

   Client identification is encapsulated in the following Client Owner
   data type:

   struct client_owner4 {
           verifier4       co_verifier;
           opaque          co_ownerid<NFS4_OPAQUE_LIMIT>;
   };

   The first field, co_verifier, is a client incarnation verifier. The
   server will start the process of canceling the client's leased state
   if co_verifier is different from what the server has previously
   recorded for the identified client (as specified in the co_ownerid
   field).

   The second field, co_ownerid, is a variable length string that
   uniquely defines the client so that subsequent instances of the same
   client bear the same co_ownerid with a different verifier.

   There are several considerations for how the client generates the
   co_ownerid string:

   o  The string should be unique so that multiple clients do not
      present the same string. The consequences of two clients
      presenting the same string range from one client getting an error
      to one client having its leased state abruptly and unexpectedly
      canceled.

   o  The string should be selected so that subsequent incarnations
      (e.g. restarts) of the same client cause the client to present the
      same string. The implementor is cautioned against an approach
      that requires the string to be recorded in a local file because
      this precludes the use of the implementation in an environment
      where there is no local disk and all file access is from an
      NFSv4.1 server.

   o  The string should be the same for each server network address that
      the client accesses. This way, if a server has multiple
      interfaces, the client can trunk traffic over multiple network
      paths as described in Section 2.10.4. (Note: the precise opposite
      was advised in the NFSv4.0 specification [20].)

   o  The algorithm for generating the string should not assume that the
      client's network address will not change, unless the client
      implementation knows it is using statically assigned network
      addresses. This includes changes between client incarnations and
      even changes while the client is still running in its current
      incarnation. Thus with dynamic address assignment, if the client
      includes just the client's network address in the co_ownerid
      string, there is a real risk that after the client gives up the
      network address, another client, using a similar algorithm for
      generating the co_ownerid string, would generate a conflicting
      co_ownerid string.

   Given the above considerations, an example of a well-generated
   co_ownerid string is one that includes:

   o  If applicable, the client's statically assigned network address.

   o  Additional information that tends to be unique, such as one or
      more of:

      *  The client machine's serial number (for privacy reasons, it is
         best to perform some one-way function on the serial number).

      *  A MAC address (again, a one-way function should be performed).

      *  The timestamp of when the NFSv4.1 software was first installed
         on the client (though this is subject to the previously
         mentioned caution about using information that is stored in a
         file, because the file might only be accessible over NFSv4.1).

      *  A true random number. However, since this number ought to be
         the same between client incarnations, this shares the same
         problem as that of using the timestamp of the software
         installation.

   o  For a user level NFSv4.1 client, it should contain additional
      information to distinguish the client from other user level
      clients running on the same host, such as a process identifier or
      other unique sequence.
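
   The recommendations above can be sketched as follows. This is one
   possible construction, not a format defined by this specification:
   the function names, the "key=value;" layout, and the choice of
   SHA-256 as the one-way function are all assumptions made for
   illustration. The verifier helper assumes the 8-byte size of
   verifier4 in the NFSv4 XDR.

```python
import hashlib
import os


def one_way(value: str) -> str:
    """Hash an identifier so the raw value is not sent on the wire."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def make_co_ownerid(serial_number, mac_address,
                    static_address="", instance_tag=""):
    """Build a co_ownerid that is stable across restarts of one client.

    static_address is included only when the address is known to be
    statically assigned; instance_tag distinguishes user level clients
    that share a host. The field layout is hypothetical.
    """
    fields = []
    if static_address:
        fields.append("addr=" + static_address)
    fields.append("serial=" + one_way(serial_number))
    fields.append("mac=" + one_way(mac_address.lower()))
    if instance_tag:
        fields.append("instance=" + instance_tag)
    return ";".join(fields).encode("utf-8")


def new_co_verifier() -> bytes:
    """Fresh 8-byte incarnation verifier, generated once per restart."""
    return os.urandom(8)
```

   The same inputs always yield the same co_ownerid, so a restarted
   client presents the same string with a different co_verifier. Note
   that nothing here depends on the server's network address, which
   keeps the string identical across the server's interfaces.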

   The client ID is assigned by the server (the eir_clientid result from
   EXCHANGE_ID) and should be chosen so that it will not conflict with a
   client ID previously assigned by the server. This applies across
   server restarts.

   In the event of a server restart, a client may find out that its
   current client ID is no longer valid when it receives an
   NFS4ERR_STALE_CLIENTID error. The precise circumstances depend on
   the characteristics of the sessions involved, specifically whether
   the session is persistent (see Section 2.10.5.5), but in each case
   the client will receive this error when it attempts to establish a
   new session with the existing client ID, indicating that a new
   client ID must be obtained via EXCHANGE_ID and the new session
   established with that client ID.

   When a session is not persistent, the client will find out that it
   needs to create a new session as a result of getting an
   NFS4ERR_BADSESSION, since the session in question was lost as part of
   a server restart. When the existing client ID is presented to a
   server as part of creating a session and that client ID is not
   recognized, as would happen after a server restart, the server will
   reject the request with the error NFS4ERR_STALE_CLIENTID.

   In the case of the session being persistent, the client will
   re-establish communication using the existing session after the
   restart. This session will be associated with the existing client ID
   but may only be used to retransmit operations that the client
   previously transmitted and did not see replies to. Replies to
   operations that the server previously performed will come from the
   reply cache; otherwise, NFS4ERR_DEADSESSION will be returned.
   Hence, such a session is referred to as "dead". In this situation,
   in order to perform new operations, the client must establish a new
   session. If an attempt is made to establish this new session with
   the existing client ID, the server will reject the request with
   NFS4ERR_STALE_CLIENTID.

   When NFS4ERR_STALE_CLIENTID is received in either of these
   situations, the client must obtain a new client ID by use of the
   EXCHANGE_ID operation, then use that client ID as the basis of a new
   session, and then proceed to any other necessary recovery for the
   server restart case (see Section 8.4.2).

   See the descriptions of EXCHANGE_ID (Section 18.35) and
   CREATE_SESSION (Section 18.36) for a complete specification of these
   operations.

2.4.1.  Upgrade from NFSv4.0 to NFSv4.1

   To facilitate upgrade from NFSv4.0 to NFSv4.1, a server may compare a
   client_owner4 in an EXCHANGE_ID with an nfs_client_id4 established
   using the SETCLIENTID operation of NFSv4.0. A server that does so
   will allow an upgraded client to avoid waiting until the lease (i.e.
   the lease established by the NFSv4.0 instance of the client) expires.
   This requires that the client_owner4 be constructed the same way as
   the nfs_client_id4. If the latter's contents included the server's
   network address (per the recommendations of the NFSv4.0 specification
   [20]), and the NFSv4.1 client does not wish to use a client ID that
   prevents trunking, it should send two EXCHANGE_ID operations. The
   first EXCHANGE_ID will have a client_owner4 equal to the
   nfs_client_id4. This will clear the state created by the NFSv4.0
   client. The second EXCHANGE_ID will not have the server's network
   address.
   The state created for the second EXCHANGE_ID will not have to wait
   for lease expiration, because there will be no state to expire.

2.4.2.  Server Release of Client ID

   NFSv4.1 introduces a new operation called DESTROY_CLIENTID
   (Section 18.50) which the client SHOULD use to destroy a client ID it
   no longer needs. This permits graceful, bilateral release of a
   client ID. The operation cannot be used if there are sessions
   associated with the client ID, or state with an unexpired lease.

   If the server determines that the client holds no associated state
   for its client ID (including sessions, opens, locks, delegations,
   layouts, and wants), the server may choose to unilaterally release
   the client ID in order to conserve resources. If the client contacts
   the server after this release, the server must ensure the client
   receives the appropriate error so that it will use the
   EXCHANGE_ID/CREATE_SESSION sequence to establish a new client ID.
   The server ought to be very hesitant to release a client ID since the
   resulting work on the client to recover from such an event will be
   the same burden as if the server had failed and restarted. Typically
   a server would not release a client ID unless there had been no
   activity from that client for many minutes. As long as there are
   sessions, opens, locks, delegations, layouts, or wants, the server
   MUST NOT release the client ID. See Section 2.10.11.1.4 for
   discussion on releasing inactive sessions.

2.4.3.  Resolving Client Owner Conflicts

   When the server gets an EXCHANGE_ID for a client owner that currently
   has no state, or that has state, but the lease has expired, the
   server MUST allow the EXCHANGE_ID, and confirm the new client ID if
   followed by the appropriate CREATE_SESSION.

   When the server gets an EXCHANGE_ID for a new incarnation of a client
   owner that currently has an old incarnation with state and an
   unexpired lease, the server is allowed to dispose of the state of the
   previous incarnation of the client owner if one of the following is
   true:

   o  The principal that created the client ID for the client owner is
      the same as the principal that is issuing the EXCHANGE_ID. Note
      that if the client ID was created with SP4_MACH_CRED state
      protection (Section 18.35), the principal MUST be based on
      RPCSEC_GSS authentication, the RPCSEC_GSS service used MUST be
      integrity or privacy, and the same GSS mechanism and principal
      must be used as that used when the client ID was created.

   o  The client ID was established with SP4_SSV protection
      (Section 18.35, Section 2.10.7.3) and the client sends the
      EXCHANGE_ID with the security flavor set to RPCSEC_GSS using the
      GSS SSV mechanism (Section 2.10.8).

   o  The client ID was established with SP4_SSV protection, and under
      the conditions described herein, the EXCHANGE_ID was sent with
      SP4_MACH_CRED state protection. Because the SSV might not persist
      across client and server restart, and because the first time a
      client sends EXCHANGE_ID to a server it does not have an SSV, the
      client MAY send the subsequent EXCHANGE_ID without an SSV
      RPCSEC_GSS handle. Instead, as with SP4_MACH_CRED protection, the
      principal MUST be based on RPCSEC_GSS authentication, the
      RPCSEC_GSS service used MUST be integrity or privacy, and the same
      GSS mechanism and principal MUST be used as that used when the
      client ID was created.

   If none of the above situations apply, the server MUST return
   NFS4ERR_CLID_INUSE.
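
   The three conditions above can be read as a single decision
   function. The sketch below is one possible reading, not normative
   text: the Cred class, its string-valued fields, and the function
   names are hypothetical, and the string "ssv" merely stands for the
   GSS SSV mechanism. It also assumes the stricter interpretation in
   which a client ID established with SP4_SSV is matched only by the
   second and third conditions.

```python
from dataclasses import dataclass
from enum import Enum


class StateProtect(Enum):
    SP4_NONE = 0
    SP4_MACH_CRED = 1
    SP4_SSV = 2


@dataclass(frozen=True)
class Cred:
    """Hypothetical summary of the credential on an EXCHANGE_ID request."""
    principal: str
    flavor: str = "RPCSEC_GSS"
    gss_mech: str = "krb5"          # "ssv" stands for the GSS SSV mechanism
    gss_service: str = "integrity"  # "none", "integrity", or "privacy"


def _mach_cred_rules(creator: Cred, sender: Cred) -> bool:
    """RPCSEC_GSS, integrity or privacy, same mechanism and principal."""
    return (sender.flavor == "RPCSEC_GSS"
            and sender.gss_service in ("integrity", "privacy")
            and sender.gss_mech == creator.gss_mech
            and sender.principal == creator.principal)


def may_dispose_old_state(established, requested, creator, sender) -> bool:
    """True if the server may discard the previous incarnation's state.

    established: protection chosen when the client ID was created.
    requested:   protection asked for by the new EXCHANGE_ID.
    """
    if established is StateProtect.SP4_SSV:
        if sender.flavor == "RPCSEC_GSS" and sender.gss_mech == "ssv":
            return True                               # second condition
        return (requested is StateProtect.SP4_MACH_CRED
                and _mach_cred_rules(creator, sender))  # third condition
    if established is StateProtect.SP4_MACH_CRED:
        return _mach_cred_rules(creator, sender)      # first condition, note
    return sender.principal == creator.principal      # first condition


def exchange_id_status(established, requested, creator, sender) -> str:
    ok = may_dispose_old_state(established, requested, creator, sender)
    return "NFS4_OK" if ok else "NFS4ERR_CLID_INUSE"
```

   A server that adopts a more permissive reading of the first
   condition would relax the SP4_SSV branch accordingly.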

   If the server accepts the principal and co_ownerid as matching that
   which created the client ID, and the co_verifier in the EXCHANGE_ID
   differs from the co_verifier used when the client ID was created,
   then after the server receives a CREATE_SESSION that confirms the
   client ID, the server deletes state. If the co_verifier values are
   the same (e.g. the client is either updating properties of the
   client ID (Section 18.35) or attempting trunking (Section 2.10.4)),
   the server MUST NOT delete state.

2.5.  Server Owners

   The Server Owner is similar to a Client Owner (Section 2.4), but
   unlike the Client Owner, there is no shorthand server ID. The Server
   Owner is defined in the following data type:

   struct server_owner4 {
           uint64_t        so_minor_id;
           opaque          so_major_id;
   };

   The Server Owner is returned from EXCHANGE_ID. When the so_major_id
   fields are the same in two EXCHANGE_ID results, the connections that
   each EXCHANGE_ID was sent over can be assumed to address the same
   Server (as defined in Section 1.5). If the so_minor_id fields are
   also the same, then not only do both connections connect to the same
   server, but the session can be shared across both connections. The
   reader is cautioned that multiple servers may deliberately or
   accidentally claim to have the same so_major_id or
   so_major_id/so_minor_id; the reader should examine Section 2.10.4
   and Section 18.35 in order to avoid acting on falsely matching Server
   Owner values.

   The considerations for generating a so_major_id are similar to those
   for generating a co_ownerid string (see Section 2.4).
   The consequences of two servers generating conflicting so_major_id
   values are less dire than they are for co_ownerid conflicts because
   the client can use RPCSEC_GSS to compare the authenticity of each
   server (see Section 2.10.4).

2.6.  Security Service Negotiation

   With the NFSv4.1 server potentially offering multiple security
   mechanisms, the client needs a method to determine or negotiate which
   mechanism is to be used for its communication with the server. The
   NFS server may have multiple points within its file system namespace
   that are available for use by NFS clients. These points can be
   considered security policy boundaries, and in some NFS
   implementations are tied to NFS export points. In turn the NFS
   server may be configured such that each of these security policy
   boundaries may have different or multiple security mechanisms in use.

   The security negotiation between client and server must be done with
   a secure channel to eliminate the possibility of a third party
   intercepting the negotiation sequence and forcing the client and
   server to choose a lower level of security than required or desired.
   See Section 21 for further discussion.

2.6.1.  NFSv4.1 Security Tuples

   An NFS server can assign one or more "security tuples" to each
   security policy boundary in its namespace. Each security tuple
   consists of a security flavor (see Section 2.2.1.1), and if the
   flavor is RPCSEC_GSS, a GSS-API mechanism OID, a GSS-API quality of
   protection, and an RPCSEC_GSS service.

2.6.2.  SECINFO and SECINFO_NO_NAME

   The SECINFO and SECINFO_NO_NAME operations allow the client to
   determine, on a per filehandle basis, what security tuple is to be
   used for server access.
   In general, the client will not have to use either operation except
   during initial communication with the server or when the client
   crosses security policy boundaries at the server. However, the
   server's policies may also change at any time and force the client to
   negotiate a new security tuple.

   Where the use of different security tuples would affect the type of
   access that would be allowed if a request was sent over the same
   connection used for the SECINFO or SECINFO_NO_NAME operation (e.g.
   read-only vs. read-write access), security tuples that allow greater
   access should be presented first. Where the general level of access
   is the same and different security flavors limit the range of
   principals whose privileges are recognized (e.g. allowing or
   disallowing root access), flavors supporting the greatest range of
   principals should be listed first.

2.6.3.  Security Error

   Based on the assumption that each NFSv4.1 client and server must
   support a minimum set of security (i.e., LIPKEY, SPKM-3, and
   Kerberos-V5 all under RPCSEC_GSS), the NFS client will initiate file
   access to the server with one of the minimal security tuples. During
   communication with the server, the client may receive an NFS error of
   NFS4ERR_WRONGSEC. This error allows the server to notify the client
   that the security tuple currently being used contravenes the server's
   security policy. The client is then responsible for determining (see
   Section 2.6.3.1) what security tuples are available at the server and
   choosing one which is appropriate for the client.

2.6.3.1.  Using NFS4ERR_WRONGSEC, SECINFO, and SECINFO_NO_NAME

   This section explains the mechanics of NFSv4.1 security negotiation.

2.6.3.1.1.  Put Filehandle Operations

   The term "put filehandle operation" refers to PUTROOTFH, PUTPUBFH,
   PUTFH, and RESTOREFH.
   Each of the subsections herein describes how the server handles a
   subseries of operations that starts with a put filehandle operation.

2.6.3.1.1.1.  Put Filehandle Operation + SAVEFH

   The client is saving a filehandle for a future RESTOREFH, LINK, or
   RENAME. SAVEFH MUST NOT return NFS4ERR_WRONGSEC. To determine
   whether the put filehandle operation returns NFS4ERR_WRONGSEC or not,
   the server implementation pretends SAVEFH is not in the series of
   operations and examines which of the situations described in the
   other subsections of Section 2.6.3.1.1 applies.

2.6.3.1.1.2.  Two or More Put Filehandle Operations

   For a series of N put filehandle operations, the server MUST NOT
   return NFS4ERR_WRONGSEC to the first N-1 put filehandle operations.
   The N'th put filehandle operation is handled as if it is the first in
   a subseries of operations. For example, if the server received
   PUTFH, PUTROOTFH, LOOKUP, then the PUTFH is ignored for
   NFS4ERR_WRONGSEC purposes, and the PUTROOTFH, LOOKUP subseries is
   processed according to Section 2.6.3.1.1.3.

2.6.3.1.1.3.  Put Filehandle Operation + LOOKUP (or OPEN of an Existing
              Name)

   This situation also applies to a put filehandle operation followed by
   a LOOKUP or an OPEN operation that specifies an existing component
   name.

   In this situation, the client is potentially crossing a security
   policy boundary, and the set of security tuples the parent directory
   supports may differ from those of the child. The server
   implementation may decide whether to impose any restrictions on
   security policy administration.
   There are at least three approaches (sec_policy_child is the tuple
   set of the child export, sec_policy_parent is that of the parent).

   a)  sec_policy_child <= sec_policy_parent (<= for subset). This
       means that the set of security tuples specified on the security
       policy of a child directory is always a subset of that of its
       parent directory.

   b)  sec_policy_child ^ sec_policy_parent != {} (^ for intersection,
       {} for the empty set). This means that the security tuples
       specified on the security policy of a child directory always
       have a non-empty intersection with that of the parent.

   c)  sec_policy_child ^ sec_policy_parent == {}. This means that
       the set of tuples specified on the security policy of a child
       directory might not intersect with that of the parent. In other
       words, there are no restrictions on how the system administrator
       may set up these tuples.

   In order for a server to support approaches (b) (for the case when a
   client chooses a flavor that is not a member of sec_policy_parent)
   and (c), the put filehandle operation cannot return NFS4ERR_WRONGSEC
   when there is a security tuple mismatch. Instead, the error should
   be returned from the LOOKUP (or OPEN by existing component name) that
   follows.

   Since the above guideline does not contradict approach (a), it should
   be followed in general. Even if approach (a) is implemented, it is
   possible for the security tuple used to be acceptable for the target
   of LOOKUP but not for the filehandles used in the put filehandle
   operation. The put filehandle operation could be a PUTROOTFH or
   PUTPUBFH, where the client cannot know the security tuples for the
   root or public filehandle.
   Or the security policy for the filehandle used by the put filehandle
   operation could have changed since the time the filehandle was
   obtained.

   Therefore, an NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC in
   response to the put filehandle operation if the operation is
   immediately followed by a LOOKUP or an OPEN by component name.

2.6.3.1.1.4.  Put Filehandle Operation + LOOKUPP

   Since SECINFO only works its way down, there is no way LOOKUPP can
   return NFS4ERR_WRONGSEC without SECINFO_NO_NAME. SECINFO_NO_NAME
   solves this issue via style SECINFO_STYLE4_PARENT, which works in the
   opposite direction to SECINFO. As with Section 2.6.3.1.1.3, a put
   filehandle operation that is followed by a LOOKUPP MUST NOT return
   NFS4ERR_WRONGSEC. If the server does not support SECINFO_NO_NAME,
   the client's only recourse is to send the put filehandle operation,
   LOOKUPP, GETFH sequence of operations with every security tuple it
   supports.

   Regardless of whether SECINFO_NO_NAME is supported, an NFSv4.1 server
   MUST NOT return NFS4ERR_WRONGSEC in response to a put filehandle
   operation if the operation is immediately followed by a LOOKUPP.

2.6.3.1.1.5.  Put Filehandle Operation + SECINFO/SECINFO_NO_NAME

   A security sensitive client is allowed to choose a strong security
   tuple when querying a server to determine a file object's permitted
   security tuples. The security tuple chosen by the client does not
   have to be included in the tuple list of the security policy of
   either the parent directory indicated in the put filehandle
   operation, or the child file object indicated in SECINFO (or any
   parent directory indicated in SECINFO_NO_NAME). Of course the server
   has to be
   configured for whatever security tuple the client selects, otherwise
   the request will fail at the RPC layer with an appropriate
   authentication error.

   In theory, there is no connection between the security flavor used by
   SECINFO or SECINFO_NO_NAME and those supported by the security
   policy. But in practice, the client may start looking for strong
   flavors from those supported by the security policy, followed by
   those in the REQUIRED set.

   The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC to a put
   filehandle operation that is immediately followed by SECINFO or
   SECINFO_NO_NAME. The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC
   from SECINFO or SECINFO_NO_NAME.

2.6.3.1.1.6.  Put Filehandle Operation + Nothing

   The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC.

2.6.3.1.1.7.  Put Filehandle Operation + Anything Else

   "Anything Else" includes OPEN by filehandle.

   The security policy enforcement applies to the filehandle specified
   in the put filehandle operation. Therefore the put filehandle
   operation must return NFS4ERR_WRONGSEC when there is a security tuple
   mismatch. This avoids the complexity of adding NFS4ERR_WRONGSEC as
   an allowable error to every other operation.

   A COMPOUND containing the series put filehandle operation +
   SECINFO_NO_NAME (style SECINFO_STYLE4_CURRENT_FH) is an efficient way
   for the client to recover from NFS4ERR_WRONGSEC.

   The NFSv4.1 server MUST NOT return NFS4ERR_WRONGSEC to any operation
   other than a put filehandle operation, LOOKUP, LOOKUPP, and OPEN (by
   component name).

2.6.3.1.1.8.  Operations after SECINFO and SECINFO_NO_NAME

   Suppose a client sends a COMPOUND procedure containing the series
   SEQUENCE, PUTFH, SECINFO_NO_NAME, READ, and suppose the security
   tuple used does not match that required for the target file. By rule
   (see Section 2.6.3.1.1.5), neither PUTFH nor SECINFO_NO_NAME can
   return NFS4ERR_WRONGSEC. By rule (see Section 2.6.3.1.1.7), READ
   cannot return NFS4ERR_WRONGSEC. The issue is resolved by the fact
   that SECINFO and SECINFO_NO_NAME consume the current filehandle (note
   that this is a change from NFSv4.0). This leaves no current
   filehandle for READ to use, and READ returns NFS4ERR_NOFILEHANDLE.

2.6.3.1.2.  LINK and RENAME

   The LINK and RENAME operations use both the current and saved
   filehandles. When the current filehandle is injected into a series
   of operations via a put filehandle operation, the server MUST return
   NFS4ERR_WRONGSEC, per Section 2.6.3.1.1. LINK and RENAME MAY return
   NFS4ERR_WRONGSEC if the security policy of the saved filehandle
   rejects the security flavor used in the COMPOUND request's
   credentials. If the server does so, then if there is no intersection
   between the security policies of saved and current filehandles, this
   means it will be impossible for the client to perform the intended
   LINK or RENAME operation.

   For example, suppose the client sends this COMPOUND request:
   SEQUENCE, PUTFH bFH, SAVEFH, PUTFH aFH, RENAME "c" "d", where
   filehandles bFH and aFH refer to different directories. Suppose no
   common security tuple exists between the security policies of aFH and
   bFH. If the client sends the request using credentials acceptable to
   bFH's security policy but not aFH's policy, then the PUTFH aFH
   operation will fail with NFS4ERR_WRONGSEC.
   After a SECINFO_NO_NAME request, the client sends SEQUENCE, PUTFH
   bFH, SAVEFH, PUTFH aFH, RENAME "c" "d", using credentials acceptable
   to aFH's security policy, but not bFH's policy. The server returns
   NFS4ERR_WRONGSEC on the RENAME operation.

   To prevent a client from entering an endless sequence of a request
   containing LINK or RENAME, followed by a request containing
   SECINFO_NO_NAME, the server MUST detect when the security policies of
   the current and saved filehandles have no mutually acceptable
   security tuple, and MUST NOT return NFS4ERR_WRONGSEC in that
   situation. Instead the server MUST return NFS4ERR_XDEV.

   Thus while a server MAY return NFS4ERR_WRONGSEC from LINK and RENAME,
   the server implementor may reasonably decide the consequences are not
   worth the security benefits, and so allow the security policy of the
   current filehandle to override that of the saved filehandle.

2.7.  Minor Versioning

   To address the requirement of an NFS protocol that can evolve as the
   need arises, the NFSv4.1 protocol contains the rules and framework to
   allow for future minor changes or versioning.

   The base assumption with respect to minor versioning is that any
   future accepted minor version must follow the IETF process and be
   documented in a standards track RFC. Therefore, each minor version
   number will correspond to one or more new RFCs. Minor version zero
   of the NFSv4 protocol is represented by [20], and minor version one
   is represented by this document [[Comment.1: RFC Editor: change
   "document" to "RFC" when we publish]]. The COMPOUND and CB_COMPOUND
   procedures support the encoding of the minor version being requested
   by the client.
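
   As a minimal sketch of how the minor version carried in COMPOUND can
   be acted upon, the following model checks the requested version
   before touching any operation. The dictionary layout, the function
   name, and the set of supported versions are assumptions made for
   illustration; the NFS4ERR_MINOR_VERS_MISMATCH status is taken from
   the COMPOUND procedure definition rather than from this section, and
   every operation is simply pretended to succeed.

```python
# Assumption: a server that offers both NFSv4.0 and NFSv4.1.
SUPPORTED_MINOR_VERSIONS = frozenset({0, 1})


def process_compound(args: dict) -> dict:
    """Reject an unsupported minor version before evaluating any operation.

    args mimics the COMPOUND arguments with the keys "tag",
    "minorversion", and "argarray" (a list of operation names).
    """
    if args["minorversion"] not in SUPPORTED_MINOR_VERSIONS:
        return {"status": "NFS4ERR_MINOR_VERS_MISMATCH",
                "tag": args["tag"],
                "resarray": []}  # no operation is evaluated
    # Placeholder: a real server would dispatch each operation in turn.
    results = [{"op": op, "status": "NFS4_OK"} for op in args["argarray"]]
    return {"status": "NFS4_OK", "tag": args["tag"], "resarray": results}
```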

   The following items represent the basic rules for the development of
   minor versions. Note that a future minor version may decide to
   modify or add to the following rules as part of the minor version
   definition.

   1.   Procedures are not added or deleted

        To maintain the general RPC model, NFSv4 minor versions will not
        add to or delete procedures from the NFS program.

   2.   Minor versions may add operations to the COMPOUND and
        CB_COMPOUND procedures.

        The addition of operations to the COMPOUND and CB_COMPOUND
        procedures does not affect the RPC model.

        *  Minor versions may append attributes to the bitmap4 that
           represents sets of attributes and the fattr4 that represents
           sets of attribute values.

           This allows for the expansion of the attribute model to allow
           for future growth or adaptation.

        *  Minor version X must append any new attributes after the last
           documented attribute.

           Since attribute results are specified as an opaque array of
           per-attribute XDR encoded results, the complexity of adding
           new attributes in the midst of the current definitions would
           be too burdensome.

   3.   Minor versions must not modify the structure of an existing
        operation's arguments or results.

        Again, the complexity of handling multiple structure definitions
        for a single operation is too burdensome. New operations should
        be added instead of modifying existing structures for a minor
        version.

        This rule does not preclude the following adaptations in a minor
        version.

        *  adding bits to flag fields such as new attributes to
           GETATTR's bitmap4 data type and providing corresponding
           variants of opaque arrays, such as a notify4 used together
           with such bitmaps.

        *  adding bits to existing attributes like ACLs that have flag
           words

        *  extending enumerated types (including NFS4ERR_*) with new
           values

        *  adding cases to a switched union

   4.   Minor versions may not modify the structure of existing
        attributes.

   5.   Minor versions may not delete operations.

        This prevents the potential reuse of a particular operation
        "slot" in a future minor version.

   6.   Minor versions may not delete attributes.

   7.   Minor versions may not delete flag bits or enumeration values.

   8.   Minor versions may declare that an operation MUST NOT be
        implemented.

        Specifying that an operation MUST NOT be implemented is
        equivalent to obsoleting an operation. For the client, it means
        that the operation should not be sent to the server. For the
        server, an NFS error can be returned as opposed to "dropping"
        the request as an XDR decode error. This approach allows for
        the obsolescence of an operation while maintaining its structure
        so that a future minor version can reintroduce the operation.

        1.  Minor versions may declare that an attribute MUST NOT be
            implemented.

        2.  Minor versions may declare that a flag bit or enumeration
            value MUST NOT be implemented.

   9.   Minor versions may downgrade features from REQUIRED to
        RECOMMENDED, or RECOMMENDED to OPTIONAL.

   10.  Minor versions may upgrade features from OPTIONAL to RECOMMENDED
        or RECOMMENDED to REQUIRED.

   11.  A client and server that support minor version X should support
        minor versions 0 (zero) through X-1 as well.

   12.  Except for infrastructural changes, no new features may be
        introduced as REQUIRED in a minor version.

        This rule allows for the introduction of new functionality and
        forces the use of implementation experience before designating a
        feature as REQUIRED. On the other hand, some classes of
        features are infrastructural and have broad effects. Allowing
        such features to not be REQUIRED complicates implementation of
        the minor version.

   13.  A client MUST NOT attempt to use a stateid, filehandle, or
        similar returned object from the COMPOUND procedure with minor
        version X for another COMPOUND procedure with minor version Y,
        where X != Y.

2.8.  Non-RPC-based Security Services

   As described in Section 2.2.1.1.1.1, NFSv4.1 relies on RPC for
   identification, authentication, integrity, and privacy. NFSv4.1
   itself provides or enables additional security services as described
   in the next several subsections.

2.8.1.  Authorization

   Authorization to access a file object via an NFSv4.1 operation is
   ultimately determined by the NFSv4.1 server. A client can
   predetermine its access to a file object via the OPEN (Section 18.16)
   and the ACCESS (Section 18.1) operations.

   Principals with appropriate access rights can modify the
   authorization on a file object via the SETATTR (Section 18.30)
   operation. Attributes that affect access rights include: mode,
   owner, owner_group, acl, dacl, and sacl. See Section 5.

2.8.2.  Auditing

   NFSv4.1 provides auditing on a per file object basis, via the acl and
   sacl attributes as described in Section 6. It is outside the scope
   of this specification to specify audit log formats or management
   policies.

2.8.3.  Intrusion Detection

   NFSv4.1 provides alarm control on a per file object basis, via the
   acl and sacl attributes as described in Section 6. Alarms may serve
Expires February 23, 2009 [Page 38] 2128 2129 Internet-Draft NFSv4.1 August 2008 2130 2131 2132 as the basis for intrusion detection. It is outside the scope of 2133 this specification to specify heuristics for detecting intrusion via 2134 alarms. 2135 2136 2.9. Transport Layers 2137 2138 2.9.1. REQUIRED and RECOMMENDED Properties of Transports 2139 2140 NFSv4.1 works over RDMA and non-RDMA-based transports with the 2141 following attributes: 2142 2143 o The transport supports reliable delivery of data, which NFSv4.1 2144 requires but neither NFSv4.1 nor RPC has facilities for ensuring. 2145 [23] 2146 2147 o The transport delivers data in the order it was sent. Ordered 2148 delivery simplifies detection of transmit errors, and simplifies 2149 the sending of arbitrary sized requests and responses, via the 2150 record marking protocol [3]. 2151 2152 Where an NFSv4.1 implementation supports operation over the IP 2153 network protocol, any transport used between NFS and IP MUST be among 2154 the IETF-approved congestion control transport protocols. At the 2155 time this document was written, the only two transports that had the 2156 above attributes were TCP and SCTP. To enhance the possibilities for 2157 interoperability, an NFSv4.1 implementation MUST support operation 2158 over the TCP transport protocol. 2159 2160 Even if NFSv4.1 is used over a non-IP network protocol, it is 2161 RECOMMENDED that the transport support congestion control. 2162 2163 It is permissible for a connectionless transport to be used under 2164 NFSv4.1, however reliable and in-order delivery of data combined with 2165 congestion control by the connectionless transport is REQUIRED. 2166 NFSv4.1 assumes that a client transport address and server transport 2167 address used to send data over a transport together constitute a 2168 connection, even if the underlying transport eschews the concept of a 2169 connection. 2170 2171 2.9.2. 
Client and Server Transport Behavior 2172 2173 If a connection-oriented transport (e.g. TCP) is used, the client 2174 and server SHOULD use long lived connections for at least three 2175 reasons: 2176 2177 1. This will prevent the weakening of the transport's congestion 2178 control mechanisms via short lived connections. 2179 2180 2181 2182 2183 Shepler, et al. Expires February 23, 2009 [Page 39] 2184 2185 Internet-Draft NFSv4.1 August 2008 2186 2187 2188 2. This will improve performance for the WAN environment by 2189 eliminating the need for connection setup handshakes. 2190 2191 3. The NFSv4.1 callback model differs from NFSv4.0, and requires the 2192 client and server to maintain a client-created backchannel (see 2193 Section 2.10.3.1) for the server to use. 2194 2195 In order to reduce congestion, if a connection-oriented transport is 2196 used, and the request is not the NULL procedure, 2197 2198 o A requester MUST NOT retry a request unless the connection the 2199 request was sent over was lost before the reply was received. 2200 2201 o A replier MUST NOT silently drop a request, even if the request is 2202 a retry. (The silent drop behavior of RPCSEC_GSS [4] does not 2203 apply because this behavior happens at the RPCSEC_GSS layer, a 2204 lower layer in the request processing). Instead, the replier 2205 SHOULD return an appropriate error (see Section 2.10.5.1) or it 2206 MAY disconnect the connection. 2207 2208 When sending a reply, the replier MUST send the reply to the same 2209 full network address (e.g. if using an IP-based transport, the source 2210 port of the requester is part of the full network address) that the 2211 requester sent the request from. If using a connection-oriented 2212 transport, replies MUST be sent on the same connection the request 2213 was received from. 2214 2215 If a connection is dropped after the replier receives the request but 2216 before the replier sends the reply, the replier might have an pending 2217 reply. 
If a connection is established with the same source and
destination full network address as the dropped connection, then the
replier MUST NOT send the reply until the client retries the request.
The reason for this prohibition is that the client MAY retry a
request over a different connection than is associated with the
session.

When using RDMA transports there are other reasons for not tolerating
retries over the same connection:

o  RDMA transports use "credits" to enforce flow control, where a
   credit is a right to a peer to transmit a message. If one peer
   were to retransmit a request (or reply), it would consume an
   additional credit. If the replier retransmitted a reply, it would
   certainly result in an RDMA connection loss, since the requester
   would typically only post a single receive buffer for each
   request. If the requester retransmitted a request, the additional
   credit consumed on the server might lead to RDMA connection
   failure unless the client accounted for it and decreased its
   available credit, leading to wasted resources.

o  RDMA credits present a new issue to the reply cache in NFSv4.1.
   The reply cache may be used when a connection within a session is
   lost, such as after the client reconnects. Credit information is
   a dynamic property of the RDMA connection, and stale values must
   not be replayed from the cache. This implies that the reply cache
   contents must not be blindly used when replies are sent from it,
   and credit information appropriate to the channel must be
   refreshed by the RPC layer.

In addition, as described in Section 2.10.5.2, while a session is
active, the NFSv4.1 requester MUST NOT stop waiting for a reply.

2.9.3.  Ports

Historically, NFSv3 servers have listened over TCP port 2049. The
registered port 2049 [24] for the NFS protocol should be the default
configuration. NFSv4.1 clients SHOULD NOT use the RPC binding
protocols as described in [25].

2.10.  Session

2.10.1.  Motivation and Overview

Previous versions and minor versions of NFS have suffered from the
following:

o  Lack of support for Exactly Once Semantics (EOS). This includes
   lack of support for EOS through server failure and recovery.

o  Limited callback support, including no support for sending
   callbacks through firewalls, and races between replies to normal
   requests and callbacks.

o  Limited trunking over multiple network paths.

o  Requiring machine credentials for fully secure operation.

Through the introduction of a session, NFSv4.1 addresses the above
shortfalls with practical solutions:

o  EOS is enabled by a reply cache with a bounded size, making it
   feasible to keep the cache in persistent storage and enable EOS
   through server failure and recovery. One reason that previous
   revisions of NFS did not support EOS was that some EOS
   approaches often limited parallelism. As will be explained in
   Section 2.10.5, NFSv4.1 supports both EOS and unlimited
   parallelism.

o  The NFSv4.1 client (defined in Section 1.5, Paragraph 2) creates
   transport connections and provides them to the server to use for
   sending callback requests, thus solving the firewall issue
   (Section 18.34). Races between responses from client requests,
   and callbacks caused by the requests are detected via the
   session's sequencing properties which are a consequence of EOS
   (Section 2.10.5.3).

o  The NFSv4.1 client can add an arbitrary number of connections to
   the session, and thus provide trunking (Section 2.10.4).

o  The NFSv4.1 client and server produce a session key independent
   of client and server machine credentials which can be used to
   compute a digest for protecting critical session management
   operations (Section 2.10.7.3).

o  The NFSv4.1 client can also create secure RPCSEC_GSS contexts for
   use by the session's backchannel that do not require the server to
   authenticate to a client machine principal (Section 2.10.7.2).

A session is a dynamically created, long-lived server object created
by a client, used over time from one or more transport connections.
Its function is to maintain the server's state relative to the
connection(s) belonging to a client instance. This state is entirely
independent of the connection itself, and indeed the state exists
whether the connection exists or not. A client may have one or more
sessions associated with it so that client-associated state may be
accessed using any of the sessions associated with that client's
client ID, when connections are associated with those sessions. When
no connections are associated with any of a client ID's sessions for
an extended time, such objects as locks, opens, delegations, layouts,
etc. are subject to expiration. The session serves as an object
representing a means of access by a client to the associated client
state on the server, independent of the physical means of access to
that state.

A single client may create multiple sessions. A single session MUST
NOT serve multiple clients.

2.10.2.  NFSv4 Integration

Sessions are part of NFSv4.1 and not NFSv4.0. Normally, a major
infrastructure change such as sessions would require a new major
version number to an ONC RPC program like NFS.
However, because
NFSv4 encapsulates its functionality in a single procedure, COMPOUND,
and because COMPOUND can support an arbitrary number of operations,
sessions have been added to NFSv4.1 with little difficulty. COMPOUND
includes a minor version number field, and for NFSv4.1 this minor
version is set to 1. When the NFSv4 server processes a COMPOUND with
the minor version set to 1, it expects a different set of operations
than it does for NFSv4.0. NFSv4.1 defines the SEQUENCE operation,
which is required for every COMPOUND that operates over an
established session, with the exception of some session
administration operations, such as DESTROY_SESSION (Section 18.37).

2.10.2.1.  SEQUENCE and CB_SEQUENCE

In NFSv4.1, when the SEQUENCE operation is present, it MUST be the
first operation in the COMPOUND procedure. The primary purpose of
SEQUENCE is to carry the session identifier. The session identifier
associates all other operations in the COMPOUND procedure with a
particular session. SEQUENCE also contains required information for
maintaining EOS (see Section 2.10.5). Session-enabled NFSv4.1
COMPOUND requests thus have the form:

   +-----+--------------+-----------+------------+-----------+----
   | tag | minorversion | numops    |SEQUENCE op | op + args | ...
   |     |   (== 1)     | (limited) |  + args    |           |
   +-----+--------------+-----------+------------+-----------+----

and the replies have the form:

   +------------+-----+--------+-------------------------------+--//
   |last status | tag | numres |status + SEQUENCE op + results |  //
   +------------+-----+--------+-------------------------------+--//
           //-----------------------+----
           // status + op + results | ...
           //-----------------------+----

A CB_COMPOUND procedure request and reply have a similar form to
COMPOUND, but instead of a SEQUENCE operation, there is a CB_SEQUENCE
operation. CB_COMPOUND also has an additional field called
"callback_ident", which is superfluous in NFSv4.1 and MUST be ignored
by the client. CB_SEQUENCE has the same information as SEQUENCE, and
also includes other information needed to resolve callback races
(Section 2.10.5.3).

2.10.2.2.  Client ID and Session Association

Each client ID (Section 2.4) can have zero or more active sessions.
A client ID and associated session are required to perform file
access in NFSv4.1. Each time a session is used (whether by a client
sending a request to the server, or the client replying to a callback
request from the server), the state leased to its associated client
ID is automatically renewed.

State such as share reservations, locks, delegations, and layouts
(Section 1.6.4) is tied to the client ID. Client state is not tied
to any individual session. Successive state changing operations from
a given state owner MAY go over different sessions, provided the
session is associated with the same client ID. A callback MAY arrive
over a different session than the session that originally
acquired the state pertaining to the callback. For example, if
session A is used to acquire a delegation, a request to recall the
delegation MAY arrive over session B if both sessions are associated
with the same client ID. Section 2.10.7.1 and Section 2.10.7.2
discuss the security considerations around callbacks.

2.10.3.  Channels

A channel is not a connection. A channel represents the direction
ONC RPC requests are sent.

Each session has one or two channels: the fore channel and the
backchannel. Because there are at most two channels per session, and
because each channel has a distinct purpose, channels are not
assigned identifiers.

The fore channel is used for ordinary requests from the client to the
server, and carries COMPOUND requests and responses. A session
always has a fore channel.

The backchannel is used for callback requests from server to client,
and carries CB_COMPOUND requests and responses. Whether there is a
backchannel or not is a decision by the client; however, many
features of NFSv4.1 require a backchannel. NFSv4.1 servers MUST
support backchannels.

Each session has resources for each channel, including separate reply
caches (see Section 2.10.5.1). Note that even the backchannel
requires a reply cache because some callback operations are
nonidempotent.

2.10.3.1.  Association of Connections, Channels, and Sessions

Each channel is associated with zero or more transport connections
(whether of the same transport protocol or different transport
protocols). A connection can be associated with one channel or both
channels of a session; the client and server negotiate whether a
connection will carry traffic for one channel or both channels via
the CREATE_SESSION (Section 18.36) and the BIND_CONN_TO_SESSION
(Section 18.34) operations. When a session is created via
CREATE_SESSION, the connection that transported the CREATE_SESSION
request is automatically associated with the fore channel, and
optionally the backchannel.
If the client specifies no state
protection (Section 18.35) when the session is created, then when
SEQUENCE is transmitted on a different connection, the connection is
automatically associated with the fore channel of the session
specified in the SEQUENCE operation.

A connection's association with a session is not exclusive. A
connection associated with the channel(s) of one session may be
simultaneously associated with the channel(s) of other sessions
including sessions associated with other client IDs.

It is permissible for connections of multiple transport types to be
associated with the same channel. For example, both a TCP and an RDMA
connection can be associated with the fore channel. In the event an
RDMA and non-RDMA connection are associated with the same channel,
the maximum number of slots SHOULD be at least one more than the
total number of RDMA credits (Section 2.10.5.1). This way, if all
RDMA credits are used, the non-RDMA connection can have at least one
outstanding request. If a server supports multiple transport types,
it MUST allow a client to associate connections from each transport
to a channel.

It is permissible for a connection of one type of transport to be
associated with the fore channel, and a connection of a different
type to be associated with the backchannel.

2.10.4.  Trunking

Trunking is the use of multiple connections between a client and
server in order to increase the speed of data transfer. NFSv4.1
supports two types of trunking: session trunking and client ID
trunking. NFSv4.1 repliers and requesters MUST support session
trunking. NFSv4.1 servers MAY support client ID trunking. NFSv4.1
clients MUST support client ID trunking.

Session trunking is essentially the association of multiple
connections, each with potentially different target and/or source
network addresses, to the same session.

Client ID trunking is the association of multiple sessions to the
same client ID, major server owner ID (Section 2.5), and server scope
(Section 11.7.7). When two servers return the same major server
owner and server scope it means the two servers are cooperating on
locking state management which is a prerequisite for client ID
trunking.

Understanding and distinguishing session and client ID trunking
requires understanding how the results of the EXCHANGE_ID
(Section 18.35) operation identify a server. Suppose a client sends
EXCHANGE_ID over two different connections each with a possibly
different target network address but each EXCHANGE_ID with the same
value in the eia_clientowner field. If the same NFSv4.1 server is
listening over each connection, then each EXCHANGE_ID result MUST
return the same values of eir_clientid, eir_server_owner.so_major_id
and eir_server_scope. The client can then treat each connection as
referring to the same server (subject to verification, see
Paragraph 5 later in this section), and it can use each connection to
trunk requests and replies. The question is whether session trunking
and/or client ID trunking applies.

Session Trunking  If the eia_clientowner argument is the same in two
   different EXCHANGE_ID requests, and the eir_clientid,
   eir_server_owner.so_major_id, eir_server_owner.so_minor_id, and
   eir_server_scope results match in both EXCHANGE_ID results, then
   the client is permitted to perform session trunking.
   If the
   client has no session mapping to the tuple of eir_clientid,
   eir_server_owner.so_major_id, eir_server_scope,
   eir_server_owner.so_minor_id, then it creates the session via a
   CREATE_SESSION operation over one of the connections, which
   associates the connection to the session. If there is a session
   for the tuple, the client can send BIND_CONN_TO_SESSION to
   associate the connection to the session. (Of course, if the
   client does not want to use session trunking, it can invoke
   CREATE_SESSION on the connection. This will result in client ID
   trunking as described below.)

Client ID Trunking  If the eia_clientowner argument is the same in
   two different EXCHANGE_ID requests, and the eir_clientid,
   eir_server_owner.so_major_id, and eir_server_scope results match
   in both EXCHANGE_ID results, but the eir_server_owner.so_minor_id
   results do not match then the client is permitted to perform
   client ID trunking. The client can associate each connection with
   different sessions, where each session is associated with the same
   server.

   Of course, even if the eir_server_owner.so_minor_id fields do
   match, the client is free to employ client ID trunking instead of
   session trunking.

   The client completes the act of client ID trunking by invoking
   CREATE_SESSION on each connection, using the same client ID that
   was returned in eir_clientid. These invocations create two
   sessions and also associate each connection with its respective
   session.

   When doing client ID trunking, locking state is shared across
   sessions associated with the same client ID. This requires the
   server to coordinate state across sessions.
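
The comparison of EXCHANGE_ID results described above can be sketched
as follows. This is only an illustration, not part of the protocol
definition: the names ExchangeIdResult, Trunking, and
trunking_permitted are hypothetical, the sketch assumes both
EXCHANGE_ID requests carried the same eia_clientowner, and it leaves
out the verification step that a client may perform before trunking
traffic.

```python
from dataclasses import dataclass
from enum import Enum


class Trunking(Enum):
    """Hypothetical labels for the outcome of the comparison."""
    NONE = "no trunking"
    CLIENT_ID = "client ID trunking permitted"
    SESSION = "session trunking permitted"


@dataclass(frozen=True)
class ExchangeIdResult:
    """Subset of EXCHANGE_ID results relevant to trunking detection."""
    clientid: int        # eir_clientid
    so_major_id: bytes   # eir_server_owner.so_major_id
    so_minor_id: int     # eir_server_owner.so_minor_id
    server_scope: bytes  # eir_server_scope


def trunking_permitted(a: ExchangeIdResult, b: ExchangeIdResult) -> Trunking:
    """Compare two EXCHANGE_ID results obtained over two connections.

    Assumption: both requests used the same eia_clientowner value.
    """
    same_server = (
        a.clientid == b.clientid
        and a.so_major_id == b.so_major_id
        and a.server_scope == b.server_scope
    )
    if not same_server:
        return Trunking.NONE
    if a.so_minor_id == b.so_minor_id:
        # All four values match. The client may still choose client ID
        # trunking instead by sending CREATE_SESSION on each connection.
        return Trunking.SESSION
    # Everything matches except so_minor_id.
    return Trunking.CLIENT_ID
```

A result of Trunking.SESSION does not oblige the client to use
session trunking; as noted above, it remains free to create a
separate session on each connection.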

When two servers over two connections claim matching or partially
matching eir_server_owner, eir_server_scope, and eir_clientid values,
the client does not have to trust the servers' claims. The client
may verify these claims before trunking traffic in the following
ways:

o  For session trunking, clients SHOULD reliably verify if
   connections between different network paths are in fact associated
   with the same NFSv4.1 server and usable on the same session, and
   servers MUST allow clients to perform reliable verification. When
   a client ID is created, the client SHOULD specify that
   BIND_CONN_TO_SESSION is to be verified according to the SP4_SSV or
   SP4_MACH_CRED (Section 18.35) state protection options. For
   SP4_SSV, reliable verification depends on a shared secret (the
   SSV) that is established via the SET_SSV (Section 18.47)
   operation.

   When a new connection is associated with the session (via the
   BIND_CONN_TO_SESSION operation, see Section 18.34), if the client
   specified SP4_SSV state protection for the BIND_CONN_TO_SESSION
   operation, the client MUST send the BIND_CONN_TO_SESSION with
   RPCSEC_GSS protection, using integrity or privacy, and an
   RPCSEC_GSS handle created with the GSS SSV mechanism
   (Section 2.10.8).

   If the client mistakenly tries to associate a connection to a
   session of a wrong server, the server will either reject the
   attempt because it is not aware of the session identifier of the
   BIND_CONN_TO_SESSION arguments, or it will reject the attempt
   because the RPCSEC_GSS authentication fails. Even if the server
   mistakenly or maliciously accepts the connection association
   attempt, the RPCSEC_GSS verifier it computes in the response will
   not be verified by the client, so the client will know it cannot
   use the connection for trunking the specified session.

   If the client specified SP4_MACH_CRED state protection, the
   BIND_CONN_TO_SESSION operation will use RPCSEC_GSS integrity or
   privacy, using the same credential that was used when the client
   ID was created. Mutual authentication via RPCSEC_GSS assures the
   client that the connection is associated with the correct session
   of the correct server.

o  For client ID trunking, the client has at least two options for
   verifying that the same client ID obtained from two different
   EXCHANGE_ID operations came from the same server. The first
   option is to use RPCSEC_GSS authentication when issuing each
   EXCHANGE_ID. Each time an EXCHANGE_ID is sent with RPCSEC_GSS
   authentication, the client notes the principal name of the GSS
   target. If the EXCHANGE_ID results indicate client ID trunking is
   possible, and the GSS targets' principal names are the same, the
   servers are the same and client ID trunking is allowed.

   The second option for verification is to use SP4_SSV protection.
   When the client sends EXCHANGE_ID it specifies SP4_SSV protection.
   The first EXCHANGE_ID the client sends always has to be confirmed
   by a CREATE_SESSION call. The client then sends SET_SSV. Later
   the client sends EXCHANGE_ID to a second destination network
   address, different from the one the first EXCHANGE_ID was sent
   to. The client
   checks that each EXCHANGE_ID reply has the same eir_clientid,
   eir_server_owner.so_major_id, and eir_server_scope. If so, the
   client verifies the claim by issuing a CREATE_SESSION to the
   second destination address, protected with RPCSEC_GSS integrity
   using an RPCSEC_GSS handle returned by the second EXCHANGE_ID.
   If
   the server accepts the CREATE_SESSION request, and if the client
   verifies the RPCSEC_GSS verifier and integrity codes, then the
   client has proof the second server knows the SSV, and thus the two
   servers are the same for the purposes of client ID trunking.

2.10.5.  Exactly Once Semantics

Via the session, NFSv4.1 offers Exactly Once Semantics (EOS) for
requests sent over a channel. EOS is supported on both the fore
channel and the backchannel.

Each COMPOUND or CB_COMPOUND request that is sent with a leading
SEQUENCE or CB_SEQUENCE operation MUST be executed by the receiver
exactly once. This requirement holds regardless of whether the
request is sent with reply caching specified (see
Section 2.10.5.1.3). The requirement holds even if the requester is
issuing the request over a session created between a pNFS data client
and pNFS data server. To understand the rationale for this
requirement, divide the requests into three classifications:

o  Nonidempotent requests.

o  Idempotent modifying requests.

o  Idempotent non-modifying requests.

An example of a nonidempotent request is RENAME. It is obvious that
if a replier executes the same RENAME request twice, and the first
execution succeeds, the re-execution will fail. If the replier
returns the result from the re-execution, this result is incorrect.
Therefore, EOS is required for nonidempotent requests.

An example of an idempotent modifying request is a COMPOUND request
containing a WRITE operation. Repeated execution of the same WRITE
has the same effect as execution of that write a single time.
Nevertheless, enforcing EOS for WRITEs and other idempotent modifying
requests is necessary to avoid data corruption.

Suppose a client sends WRITE A to a noncompliant server that does not
enforce EOS, and receives no response, perhaps due to a network
partition. The client reconnects to the server and re-sends WRITE A.
Now, the server has two outstanding instances of A. The server can be
in a situation in which it executes and replies to the retry of A,
while the first A is still waiting in the server's internal I/O
system for some resource. Upon receiving the reply to the second
attempt of WRITE A, the client believes its write is done so it is
free to send WRITE B which overlaps the range of A. When the original
A is dispatched from the server's I/O system, and executed (thus the
second time A will have been written), then what has been written by
B can be overwritten and thus corrupted.

An example of an idempotent non-modifying request is a COMPOUND
containing SEQUENCE, PUTFH, READLINK and nothing else. The
re-execution of such a request will not cause data corruption, or
produce an incorrect result. Nonetheless, to keep the implementation
simple, the replier MUST enforce EOS for all requests whether
idempotent and non-modifying or not.

Note that true and complete EOS is not possible unless the server
persists the reply cache in stable storage or is somehow implemented
to never require a restart (indeed if such a
server exists, the distinction between a reply cache kept in stable
storage versus one that is not is one without meaning). See
Section 2.10.5.5 for a discussion of persistence in the reply cache.
Regardless, even if the server does not persist the reply cache, EOS
improves robustness and correctness over previous versions of NFS
because the legacy duplicate request/reply caches were based on the
ONC RPC transaction identifier (XID).
Section 2.10.5.1 explains the
shortcomings of the XID as a basis for a reply cache and describes
how NFSv4.1 sessions improve upon the XID.

2.10.5.1.  Slot Identifiers and Reply Cache

The RPC layer provides a transaction ID (XID), which, while required
to be unique, is not convenient for tracking requests for two
reasons. First, the XID is only meaningful to the requester; it
cannot be interpreted by the replier except to test for equality with
previously sent requests. When consulting an RPC-based duplicate
request cache, the opaqueness of the XID requires a computationally
expensive lookup (often via a hash that includes XID and source
address). NFSv4.1 requests use a non-opaque slot ID which is an
index into a slot table, which is far more efficient. Second,
because RPC requests can be executed by the replier in any order,
there is no bound on the number of requests that may be outstanding
at any time. To achieve perfect EOS using ONC RPC would require
storing all replies in the reply cache. XIDs are 32 bits; storing
over four billion (2^32) replies in the reply cache is not practical.
In practice, previous versions of NFS have chosen to store a fixed
number of replies in the cache, and use a least recently used (LRU)
approach to replacing cache entries with new entries when the cache
is full. In NFSv4.1, the number of outstanding requests is bounded
by the size of the slot table, and a sequence ID per slot is used to
tell the replier when it is safe to delete a cached reply.
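
As an illustration of why a slot table is cheaper to consult than an
XID-keyed cache, the following sketch models a replier-side reply
cache as a fixed-size array indexed directly by slot ID. It is a
simplified model under stated assumptions, not a definitive
implementation: the names ReplyCache, Slot, and SeqMisordered are
hypothetical, SeqMisordered merely stands in for
NFS4ERR_SEQ_MISORDERED, requests still in progress are not modeled,
and the sequence ID comparison anticipates the rules given later in
this section.

```python
from dataclasses import dataclass
from typing import Any, Callable, List

SEQ_MODULUS = 2**32  # sequence IDs are 32-bit unsigned values


class SeqMisordered(Exception):
    """Stands in for NFS4ERR_SEQ_MISORDERED in this sketch."""


@dataclass
class Slot:
    seq_id: int = 0            # assumption: a fresh slot expects sequence ID 1
    cached_reply: Any = None
    has_reply: bool = False


class ReplyCache:
    """Bounded reply cache: one cached reply per slot, no hashing, no LRU."""

    def __init__(self, num_slots: int) -> None:
        self.slots: List[Slot] = [Slot() for _ in range(num_slots)]

    def handle(self, slot_id: int, seq_id: int, execute: Callable[[], Any]) -> Any:
        if not 0 <= slot_id < len(self.slots):
            raise ValueError("slot ID outside the slot table")
        slot = self.slots[slot_id]  # direct index, unlike an XID lookup
        if seq_id == (slot.seq_id + 1) % SEQ_MODULUS:
            # New request: execute it once and overwrite the old reply,
            # which the advancing sequence ID shows is no longer needed.
            reply = execute()
            slot.seq_id, slot.cached_reply, slot.has_reply = seq_id, reply, True
            return reply
        if seq_id == slot.seq_id and slot.has_reply:
            # Retransmission: replay the cached reply without re-executing.
            return slot.cached_reply
        raise SeqMisordered(f"slot {slot_id}: unexpected sequence ID {seq_id}")
```

Because the table never holds more than one reply per slot, its size
is fixed when the slot table is sized, which is what makes keeping it
in persistent storage feasible.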

In the NFSv4.1 reply cache, when the requester sends a new request,
it selects a slot ID in the range 0..N, where N is the replier's
current maximum slot ID granted to the requester on the session over
which the request is to be sent. The value of N starts out as equal
to ca_maxrequests - 1 (Section 18.36), but can be adjusted by the
response to SEQUENCE or CB_SEQUENCE as described later in this
section. The slot ID must be unused by any of the requests that the
requester already has active on the session. "Unused" here means the
requester has no outstanding request for that slot ID.

A slot contains a sequence ID and the cached reply corresponding to
the request sent with that sequence ID. The sequence ID is a 32 bit
unsigned value, and is therefore in the range 0..0xFFFFFFFF (2^32 -
1). The first time a slot is used, the requester MUST specify a
sequence ID of one (1) (Section 18.36). Each time a slot is reused,
the request MUST specify a sequence ID that is one greater than that
of the previous request on the slot. If the previous sequence ID was
0xFFFFFFFF, then the next request for the slot MUST have the sequence
ID set to zero (i.e., ((2^32 - 1) + 1) mod 2^32).

The sequence ID accompanies the slot ID in each request. It is for
the critical check at the server: it is used to efficiently determine
whether a request using a certain slot ID is a retransmit or a new,
never-before-seen request. It is not feasible for the client to
assert that it is retransmitting to implement this, because for any
given request the client cannot know whether the server has seen it
unless the server actually replies. Of course, if the client has
seen the server's reply, the client would not retransmit.
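
The requester-side rules above (choose an unused slot ID in 0..N,
start each slot at sequence ID 1, add one on each reuse, and wrap
from 0xFFFFFFFF to 0) can be sketched as follows. The names
RequesterSlots and next_sequence_id are hypothetical and exist only
for this illustration; adjustment of N by SEQUENCE or CB_SEQUENCE
replies is not modeled.

```python
from typing import Dict, Optional, Set, Tuple


def next_sequence_id(previous: Optional[int]) -> int:
    """Sequence ID for the next new request on a slot.

    `previous` is None if the slot has never been used.
    """
    if previous is None:
        return 1                      # first use of a slot
    return (previous + 1) % 2**32     # 0xFFFFFFFF wraps around to 0


class RequesterSlots:
    """Tracks which slot IDs in 0..N have an outstanding request."""

    def __init__(self, highest_slot_id: int) -> None:
        self.highest_slot_id = highest_slot_id   # N in the text above
        self.last_seq: Dict[int, int] = {}       # slot ID -> last sequence ID sent
        self.busy: Set[int] = set()              # slots with an outstanding request

    def acquire(self) -> Optional[Tuple[int, int]]:
        """Return (slot ID, sequence ID) for a new request, or None if all are in use."""
        for slot_id in range(self.highest_slot_id + 1):
            if slot_id not in self.busy:
                self.busy.add(slot_id)
                seq_id = next_sequence_id(self.last_seq.get(slot_id))
                self.last_seq[slot_id] = seq_id
                return slot_id, seq_id
        return None

    def release(self, slot_id: int) -> None:
        """Call once the reply for the slot's outstanding request has arrived."""
        self.busy.discard(slot_id)
```

In this model a retry does not call acquire() again: it re-sends the
same slot ID and sequence ID, which is how the replier can recognize
it as a retransmission.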

   The replier compares each received request's sequence ID with the
   last one previously received for that slot ID, to see if the new
   request is:

   o  A new request, in which the sequence ID is one greater than that
      previously seen in the slot (accounting for sequence wraparound).
      The replier proceeds to execute the new request, and the replier
      MUST increase the slot's sequence ID by one.

   o  A retransmitted request, in which the sequence ID is equal to that
      currently recorded in the slot. If the original request has
      executed to completion, the replier returns the cached reply. See
      Section 2.10.5.2 for direction on how the replier deals with
      retries of requests that are still in progress.

   o  A misordered retry, in which the sequence ID is less than
      (accounting for sequence wraparound) that previously seen in the
      slot. The replier MUST return NFS4ERR_SEQ_MISORDERED (as the
      result from SEQUENCE or CB_SEQUENCE).

   o  A misordered new request, in which the sequence ID is two or more
      greater than (accounting for sequence wraparound) that previously
      seen in the slot. Note that because the sequence ID must wrap
      around to zero (0) once it reaches 0xFFFFFFFF, a misordered new
      request and a misordered retry cannot be distinguished. Thus,
      the replier MUST return NFS4ERR_SEQ_MISORDERED (as the result from
      SEQUENCE or CB_SEQUENCE).

   Unlike the XID, the slot ID is always within a specific range; this
   has two implications. The first implication is that for a given
   session, the replier need only cache the results of a limited number
   of COMPOUND requests. The second implication derives from the
   first, which is that unlike XID-indexed reply caches (also known as
   duplicate request caches - DRCs), the slot ID-based reply cache
   cannot be overflowed.
   Through use of the sequence ID to identify
   retransmitted requests, the replier does not need to actually cache
   the request itself, reducing the storage requirements of the reply
   cache further. These facilities make it practical to maintain all
   the required entries for an effective reply cache.

   The slot ID, sequence ID, and session ID therefore take over the
   traditional role of the XID and source network address in the
   replier's reply cache implementation. This approach is considerably
   more portable and completely robust - it is not subject to the
   reassignment of ports as clients reconnect over IP networks. In
   addition, the RPC XID is not used in the reply cache, enhancing
   robustness of the cache in the face of any rapid reuse of XIDs by the
   requester. While the replier does not care about the XID for the
   purposes of reply cache management (but the replier MUST return the
   same XID that was in the request), nonetheless there are
   considerations for the XID in NFSv4.1 that are the same as all other
   previous versions of NFS. The RPC XID remains in each message and
   must be formulated in NFSv4.1 requests as in any other ONC RPC
   request. The reasons include:

   o  The RPC layer retains its existing semantics and implementation.

   o  The requester and replier must be able to interoperate at the RPC
      layer, prior to the NFSv4.1 decoding of the SEQUENCE or
      CB_SEQUENCE operation.

   o  If an operation is being used that does not start with SEQUENCE or
      CB_SEQUENCE (e.g. BIND_CONN_TO_SESSION), then the RPC XID is
      needed for correct operation to match the reply to the request.

   o  The SEQUENCE or CB_SEQUENCE operation may generate an error.
      If so, the embedded slot ID, sequence ID, and session ID (if
      present) in the request will not be in the reply, and the
      requester has only the XID to match the reply to the request.

   Given that well formulated XIDs continue to be required, this raises
   the question of why SEQUENCE and CB_SEQUENCE replies have a session
   ID, slot ID, and sequence ID. Having the session ID in the reply
   means the requester does not have to use the XID to look up the
   session ID, which would be necessary if the connection were
   associated with multiple sessions. Having the slot ID and sequence
   ID in the reply means the requester does not have to use the XID to
   look up the slot ID and sequence ID. Furthermore, since the XID is
   only 32 bits, it is too small to guarantee the re-association of a
   reply with its request ([26]); having session ID, slot ID, and
   sequence ID in the reply allows the client to validate that the
   reply in fact belongs to the matched request.

   The SEQUENCE (and CB_SEQUENCE) operation also carries a
   "highest_slotid" value which carries additional requester slot usage
   information. The requester must always indicate the slot ID
   representing the outstanding request with the highest-numbered slot
   value. The requester should in all cases provide the most
   conservative value possible, although it can be increased somewhat
   above the actual instantaneous usage to maintain some minimum or
   optimal level. This provides a way for the requester to yield unused
   request slots back to the replier, which in turn can use the
   information to reallocate resources.
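
   A minimal sketch of how a requester might pick a slot and compute
   the highest_slotid it reports is shown here. The function name
   prepare_request and the in_use set are hypothetical, and choosing
   the lowest free slot is the policy assumed by this sketch rather
   than the only permissible one.

```python
from typing import Optional, Set, Tuple


def prepare_request(in_use: Set[int],
                    max_slot_id: int) -> Optional[Tuple[int, int]]:
    """Claim the lowest unused slot in 0..max_slot_id.

    Returns (slot_id, highest_slotid), where highest_slotid is the
    highest slot ID with an outstanding request once the new request
    is counted, or None if every slot is busy.
    """
    for slot_id in range(max_slot_id + 1):
        if slot_id not in in_use:
            in_use.add(slot_id)          # the slot now has an outstanding request
            return slot_id, max(in_use)  # most conservative value to report
    return None
```

   With slots 0, 1, and 3 outstanding and a maximum slot ID of 4, the
   sketch selects slot 2 and reports a highest_slotid of 3.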

   The replier responds with both a new target highest_slotid, and an
   enforced highest_slotid, described as follows:

   o  The target highest_slotid is an indication to the requester of the
      highest_slotid the replier wishes the requester to be using. This
      permits the replier to withdraw (or add) resources from a
      requester that has been found to not be using them, in order to
      more fairly share resources among a varying level of demand from
      other requesters. The requester must always comply with the
      replier's value updates, since they indicate newly established
      hard limits on the requester's access to session resources.
      However, because of request pipelining, the requester may have
      active requests in flight reflecting prior values; therefore, the
      replier must not immediately require the requester to comply.

   o  The enforced highest_slotid indicates the highest slot ID the
      requester is permitted to use on a subsequent SEQUENCE or
      CB_SEQUENCE operation. The replier's enforced highest_slotid
      SHOULD be no less than the highest_slotid the requester indicated
      in the SEQUENCE or CB_SEQUENCE arguments.

      If a replier detects the client is being intransigent, i.e. it
      fails in a series of requests to honor the target highest_slotid
      even though the replier knows there are no outstanding requests at
      higher slot IDs, it MAY take more forceful action. When faced
      with intransigence, the replier MAY reply with a new enforced
      highest_slotid that is less than its previous enforced
      highest_slotid. Thereafter, if the requester continues to send
      requests with a highest_slotid that is greater than the replier's
      new enforced highest_slotid, the server MAY return
      NFS4ERR_BAD_HIGHSLOT, unless the slot ID in the request is greater
      than the new enforced highest_slotid, and the request is a retry.

      The replier SHOULD retain the slots it wants to retire until the
      requester sends a request with a highest_slotid less than or equal
      to the replier's new enforced highest_slotid. Also, if a request
      is received with a slot that is higher than the new enforced
      highest_slotid, and the sequence ID is one higher than what is in
      the slot's reply cache, then the server can both retire the slot
      and return NFS4ERR_BADSLOT (however the server MUST NOT do one and
      not the other). (It is safe to retire the slot because, by using
      the next sequence ID, the client is indicating it has received
      the previous reply for the slot.) Once the replier has forcibly
      lowered the enforced highest_slotid, the requester is only
      allowed to send retries to the to-be-retired slots.

   o  The requester SHOULD use the lowest available slot when issuing a
      new request. This way, the replier may be able to retire slot
      entries faster. However, where the replier is actively adjusting
      its granted highest_slotid, it will not be able to use only the
      receipt of the slot ID and highest_slotid in the request. Neither
      the slot ID nor the highest_slotid used in a request may reflect
      the replier's current idea of the requester's session limit,
      because the request may have been sent from the requester before
      the update was received. Therefore, in the downward adjustment
      case, the replier may have to retain a number of reply cache
      entries at least as large as the old value of maximum requests
      outstanding, until it can infer that the requester has seen a
      reply containing the new granted highest_slotid.
      The replier can infer that the requester has seen such a reply
      when it receives a new request with the same slot ID as the
      request replied to and the next higher sequence ID.

2.10.5.1.1. Caching of SEQUENCE and CB_SEQUENCE Replies

   When a SEQUENCE or CB_SEQUENCE operation is successfully executed,
   its reply MUST always be cached. Specifically, session ID, sequence
   ID, and slot ID MUST be cached in the reply cache. The reply from
   SEQUENCE also includes the highest slot ID, target highest slot ID,
   and status flags. Instead of caching these values, the server MAY
   re-compute the values from the current state of the fore channel,
   session and/or client ID as appropriate. Similarly, the reply from
   CB_SEQUENCE includes a highest slot ID and target highest slot ID.
   The client MAY re-compute the values from the current state of the
   session as appropriate.

   Regardless of whether a replier is re-computing highest slot ID,
   target slot ID, and status on replies to retries or not, the
   requester MUST NOT assume the values are being re-computed whenever
   it receives a reply after a retry is sent, since it has no way of
   knowing whether the reply it has received was sent by the server in
   response to the retry, or is a delayed response to the original
   request. Therefore, it may be the case that highest slot ID, target
   slot ID, or status bits reflect the state of affairs when the
   request was first executed. Although acting based on such delayed
   information is valid, it may cause the receiver to do unneeded work.
   Requesters MAY choose to send additional requests to get the current
   state of affairs or use the state of affairs reported by subsequent
   requests, in preference to acting immediately on data which may be
   out of date.

2.10.5.1.2. Errors from SEQUENCE and CB_SEQUENCE

   Any time SEQUENCE or CB_SEQUENCE returns an error, the sequence ID of
   the slot MUST NOT change. The replier MUST NOT modify the reply
   cache entry for the slot whenever an error is returned from SEQUENCE
   or CB_SEQUENCE.

2.10.5.1.3. Optional Reply Caching

   On a per-request basis the requester can choose to direct the replier
   to cache the reply to all operations after the first operation
   (SEQUENCE or CB_SEQUENCE) via the sa_cachethis or csa_cachethis
   fields of the arguments to SEQUENCE or CB_SEQUENCE. The reason it
   would not direct the replier to cache the entire reply is that the
   request is composed of all idempotent operations [23]. Caching the
   reply may offer little benefit. If the reply is too large (see
   Section 2.10.5.4), it may not be cacheable anyway. Even if the reply
   to an idempotent request is small enough to cache, unnecessarily
   caching the reply slows down the server and increases RPC latency.

   Whether the requester requests the reply to be cached or not has no
   effect on the slot processing. If the results of SEQUENCE or
   CB_SEQUENCE are NFS4_OK, then the slot's sequence ID MUST be
   incremented by one. If a requester does not direct the replier to
   cache the reply, the replier MUST do one of the following:

   o  The replier can cache the entire original reply. Even though
      sa_cachethis or csa_cachethis are FALSE, the replier is always
      free to cache. It may choose this approach in order to simplify
      implementation.

   o  The replier enters into its reply cache a reply consisting of the
      original results to the SEQUENCE or CB_SEQUENCE operation, and
      with the next operation in COMPOUND or CB_COMPOUND having the
      error NFS4ERR_RETRY_UNCACHED_REP.
      Thus, if the requester later retries the request, it will get
      NFS4ERR_RETRY_UNCACHED_REP.

2.10.5.2. Retry and Replay of Reply

   A requester MUST NOT retry a request, unless the connection it used
   to send the request disconnects. The requester can then reconnect
   and re-send the request, or it can re-send the request over a
   different connection that is associated with the same session.

   If the requester is a server wanting to re-send a callback operation
   over the backchannel of a session, the requester of course cannot
   reconnect because only the client can associate connections with the
   backchannel. The server can re-send the request over another
   connection that is bound to the same session's backchannel. If there
   is no such connection, the server MUST indicate that the session has
   no backchannel by setting the SEQ4_STATUS_CB_PATH_DOWN_SESSION flag
   bit in the response to the next SEQUENCE operation from the client.
   The client MUST then associate a connection with the session (or
   destroy the session).

   Note that it is not fatal for a client to retry without a disconnect
   between the request and retry. However, the retry does consume
   resources, especially with RDMA, where each request, retry or not,
   consumes a credit. Retries for no reason, especially retries sent
   shortly after the previous attempt, are a poor use of network
   bandwidth and defeat the purpose of a transport's inherent congestion
   control system.

   A requester MUST wait for a reply to a request before using the slot
   for another request. If it does not wait for a reply, then the
   requester does not know what sequence ID to use for the slot on its
   next request.
   For example, suppose a requester sends a request with
   sequence ID 1, and does not wait for the response. The next time it
   uses the slot, it sends the new request with sequence ID 2. If the
   replier has not seen the request with sequence ID 1, then the replier
   is not expecting sequence ID 2, and rejects the requester's new
   request with NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or
   CB_SEQUENCE).

   RDMA fabrics do not guarantee that the memory handles (Steering Tags)
   within each RPC/RDMA "chunk" ([8]) are valid on a scope outside that
   of a single connection. Therefore, handles used by the direct
   operations become invalid after connection loss. The server must
   ensure that any RDMA operations which must be replayed from the reply
   cache use the newly provided handle(s) from the most recent request.

   A retry might be sent while the original request is still in progress
   on the replier. The replier SHOULD deal with the issue by returning
   NFS4ERR_DELAY as the reply to the SEQUENCE or CB_SEQUENCE operation,
   but implementations MAY return NFS4ERR_SEQ_MISORDERED. Since errors
   from SEQUENCE and CB_SEQUENCE are never recorded in the reply cache,
   this approach allows the results of the execution of the original
   request to be properly recorded in the reply cache (assuming the
   requester specified the reply to be cached).

2.10.5.3. Resolving Server Callback Races

   It is possible for server callbacks to arrive at the client before
   the reply from related fore channel operations. For example, a
   client may have been granted a delegation to a file it has opened,
   but the reply to the OPEN (informing the client of the granting of
   the delegation) may be delayed in the network.
   If a conflicting
   operation arrives at the server, it will recall the delegation using
   the backchannel, which may be on a different transport connection,
   perhaps even a different network, or even a different session
   associated with the same client ID.

   The presence of a session between client and server alleviates this
   issue. When a session is in place, each client request is uniquely
   identified by its { session ID, slot ID, sequence ID } triple. By
   the rules under which slot entries (reply cache entries) are retired,
   the server has knowledge whether the client has "seen" each of the
   server's replies. The server can therefore provide sufficient
   information to the client to allow it to disambiguate between an
   erroneous or conflicting callback and a race condition.

   For each client operation which might result in some sort of server
   callback, the server SHOULD "remember" the { session ID, slot ID,
   sequence ID } triple of the client request until the slot ID
   retirement rules allow the server to determine that the client has,
   in fact, seen the server's reply. Until the time the { session ID,
   slot ID, sequence ID } request triple can be retired, any recalls of
   the associated object MUST carry an array of these referring
   identifiers (in the CB_SEQUENCE operation's arguments), for the
   benefit of the client. After this time, it is not necessary for the
   server to provide this information in related callbacks, since it is
   certain that a race condition can no longer occur.

   The CB_SEQUENCE operation which begins each server callback carries a
   list of "referring" { session ID, slot ID, sequence ID } triples. If
   the client finds the request corresponding to the referring session
   ID, slot ID and sequence ID to be currently outstanding (i.e.
   the server's reply has not been seen by the client), it can determine
   that the callback has raced the reply, and act accordingly. If the
   client does not find the request corresponding to the referring
   triple to be outstanding (including the case of a session ID
   referring to a destroyed session), then there is no race with respect
   to this triple. The server SHOULD limit the referring triples to
   requests that refer to just those that apply to the objects referred
   to in the CB_COMPOUND procedure.

   The client must not simply wait forever for the expected server reply
   to arrive before responding to the CB_COMPOUND that won the race,
   because it is possible that it will be delayed indefinitely. The
   client should assume the likely case that the reply will arrive
   within the average round trip time for COMPOUND requests to the
   server, and wait that period of time. If that period of time
   expires, it can respond to the CB_COMPOUND with NFS4ERR_DELAY.

   There are other scenarios under which callbacks may race replies.
   Among them are pNFS layout recalls as described in Section 12.5.5.2.

2.10.5.4. COMPOUND and CB_COMPOUND Construction Issues

   Very large requests and replies may pose both buffer management
   issues (especially with RDMA) and reply cache issues. When the
   session is created (Section 18.36), for each channel (fore and
   back), the client and server negotiate the maximum sized request they
   will send or process (ca_maxrequestsize), the maximum sized reply
   they will return or process (ca_maxresponsesize), and the maximum
   sized reply they will store in the reply cache
   (ca_maxresponsesize_cached).

   If a request exceeds ca_maxrequestsize, the reply will have the
   status NFS4ERR_REQ_TOO_BIG.
   A replier MAY return NFS4ERR_REQ_TOO_BIG
   as the status for the first operation (SEQUENCE or CB_SEQUENCE) in
   the request (which means no operations in the request executed, and
   the state of the slot in the reply cache is unchanged), or it MAY opt
   to return it on a subsequent operation in the same COMPOUND or
   CB_COMPOUND request (which means at least one operation did execute
   and the state of the slot in the reply cache does change). The
   replier SHOULD set NFS4ERR_REQ_TOO_BIG on the operation that exceeds
   ca_maxrequestsize.

   If a reply exceeds ca_maxresponsesize, the reply will have the status
   NFS4ERR_REP_TOO_BIG. A replier MAY return NFS4ERR_REP_TOO_BIG as the
   status for the first operation (SEQUENCE or CB_SEQUENCE) in the
   request, or it MAY opt to return it on a subsequent operation (in the
   same COMPOUND or CB_COMPOUND reply). A replier MAY return
   NFS4ERR_REP_TOO_BIG in the reply to SEQUENCE or CB_SEQUENCE, even if
   the response would still exceed ca_maxresponsesize.

   If sa_cachethis or csa_cachethis are TRUE, then the replier MUST
   cache a reply except if an error is returned by the SEQUENCE or
   CB_SEQUENCE operation (see Section 2.10.5.1.2). If the reply exceeds
   ca_maxresponsesize_cached (and sa_cachethis or csa_cachethis are
   TRUE), then the server MUST return NFS4ERR_REP_TOO_BIG_TO_CACHE. Even
   if NFS4ERR_REP_TOO_BIG_TO_CACHE (or any other error for that matter)
   is returned on an operation other than the first operation (SEQUENCE
   or CB_SEQUENCE), the reply MUST be cached if sa_cachethis or
   csa_cachethis are TRUE. For example, if a COMPOUND has eleven
   operations, including SEQUENCE, the fifth operation is a RENAME, and
   the tenth operation is a READ for one million bytes, the server may
   return NFS4ERR_REP_TOO_BIG_TO_CACHE on the tenth operation.
   Since
   the server executed several operations, especially the non-idempotent
   RENAME, the client's request to cache the reply needs to be honored
   in order for correct operation of exactly once semantics. If the
   client retries the request, the server will have cached a reply that
   contains results for ten of the eleven requested operations, with the
   tenth operation having a status of NFS4ERR_REP_TOO_BIG_TO_CACHE.

   When sending operations that change the current filehandle (except
   for PUTFH, PUTPUBFH, PUTROOTFH and RESTOREFH), a client needs to take
   care that it does not exceed the maximum reply buffer before the
   GETFH operation. Otherwise the client will have to retry the
   operation that changed the current filehandle, in order to obtain the
   desired filehandle. For the OPEN operation (see Section 18.16),
   retry is not always available as an option. The following guidelines
   for the handling of filehandle changing operations are advised:

   o  Within the same COMPOUND procedure, a client SHOULD send GETFH
      immediately after a current filehandle changing operation. A
      client MUST send GETFH after a current filehandle changing
      operation that is also non-idempotent (for example, the OPEN
      operation), unless the operation is RESTOREFH. RESTOREFH is an
      exception, because even though it is non-idempotent, the
      filehandle RESTOREFH produced originated from an operation that is
      either idempotent (e.g. PUTFH, LOOKUP), or non-idempotent (e.g.
      OPEN, CREATE). If the origin is non-idempotent, then because the
      client MUST send GETFH after the origin operation, the client can
      recover if RESTOREFH returns an error.

   o  A server MAY return NFS4ERR_REP_TOO_BIG or
      NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a
      filehandle changing operation if the reply would be too large on
      the next operation.

   o  A server SHOULD return NFS4ERR_REP_TOO_BIG or
      NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a
      filehandle changing non-idempotent operation if the reply would be
      too large on the next operation, especially if the operation is
      OPEN.

   o  A server MAY return NFS4ERR_UNSAFE_COMPOUND to a non-idempotent
      current filehandle changing operation, if it looks at the next
      operation (in the same COMPOUND procedure) and finds it is not
      GETFH. The server SHOULD do this if it is unable to determine in
      advance whether the total response size would exceed
      ca_maxresponsesize_cached or ca_maxresponsesize.

2.10.5.5. Persistence

   Since the reply cache is bounded, it is practical for the reply cache
   to persist across server restarts. The replier MUST persist the
   following information if it agreed to persist the session (when the
   session was created; see Section 18.36):

   o  The session ID.

   o  The slot table including the sequence ID and cached reply for each
      slot.

   The above are sufficient for a replier to provide EOS semantics for
   any requests that were sent and executed before the server restarted.
   If the replier is a client, then there is no need for it to persist
   any more information, unless the client will be persisting all other
   state across client restart, in which case the server will never
   see any NFSv4.1-level protocol manifestation of a client restart.
   If
   the replier is a server, with just the slot table and session ID
   persisting, any requests the client retries after the server restart
   will return the results that are cached in the reply cache, and any
   new requests (i.e. the sequence ID is one (1) greater than the slot's
   sequence ID) MUST be rejected with NFS4ERR_DEADSESSION (returned by
   SEQUENCE). Such a session is considered dead. A server MAY
   re-animate a session after a server restart so that the session will
   accept new requests as well as retries. To re-animate a session, the
   server needs to persist additional information through server
   restart:

   o  The client ID. This is a prerequisite to let the client create
      more sessions associated with the same client ID as the
      re-animated session.

   o  The client ID's sequence ID that is used for creating sessions
      (see Section 18.35 and Section 18.36). This is a prerequisite to
      let the client create more sessions.

   o  The principal that created the client ID. This allows the server
      to authenticate the client when it sends EXCHANGE_ID.

   o  The SSV, if SP4_SSV state protection was specified when the client
      ID was created (see Section 18.35). This lets the client create
      new sessions, and associate connections with the new and existing
      sessions.

   o  The properties of the client ID as defined in Section 18.35.

   A persistent reply cache places certain demands on the server. The
   execution of the sequence of operations (starting with SEQUENCE) and
   placement of its results in the persistent cache MUST be atomic. If
   a client retries a sequence of operations that was previously
   executed on the server, the only acceptable outcomes are either the
   original cached reply or an indication that the client ID or session
   has been lost (indicating a catastrophic loss of the reply cache or a
   session that has been deleted because the client failed to use the
   session for an extended period of time).

   A server could fail and restart in the middle of a COMPOUND procedure
   that contains one or more non-idempotent or idempotent-but-modifying
   operations. This creates an even greater challenge for atomic
   execution and placement of results in the reply cache. One way to
   view the problem is as a single transaction consisting of each
   operation in the COMPOUND followed by storing the result in
   persistent storage, then finally a transaction commit. If there is a
   failure before the transaction is committed, then the server rolls
   back the transaction. If the server itself fails, then when it
   restarts, its recovery logic could roll back the transaction before
   starting the NFSv4.1 server.

   While the description of the implementation for atomic execution of
   the request and caching of the reply is beyond the scope of this
   document, an example implementation for NFSv2 [27] is described in
   [28].

2.10.6. RDMA Considerations

   A complete discussion of the operation of RPC-based protocols over
   RDMA transports is in [8]. A discussion of the operation of NFSv4,
   including NFSv4.1, over RDMA is in [9]. Where RDMA is considered,
   this specification assumes the use of such a layering; it addresses
   only the upper layer issues relevant to making best use of RPC/RDMA.

2.10.6.1. RDMA Connection Resources

   RDMA requires its consumers to register memory and post buffers of a
   specific size and number for receive operations.

   Registration of memory can be a relatively high-overhead operation,
   since it requires pinning of buffers, assignment of attributes (e.g.
   readable/writable), and initialization of hardware translation.
   Preregistration is desirable to reduce overhead. These registrations
   are specific to hardware interfaces and even to RDMA connection
   endpoints; therefore, negotiation of their limits is desirable to
   manage resources effectively.

   Following basic registration, these buffers must be posted by the RPC
   layer to handle receives. These buffers remain in use by the RPC/
   NFSv4.1 implementation; the size and number of them must be known to
   the remote peer in order to avoid RDMA errors which would cause a
   fatal error on the RDMA connection.

   NFSv4.1 manages slots as resources on a per session basis (see
   Section 2.10), while RDMA connections manage credits on a per
   connection basis. This means that in order for a peer to send data
   over RDMA to a remote buffer, it has to have both an NFSv4.1 slot,
   and an RDMA credit. If multiple RDMA connections are associated with
   a session, then if the total number of credits across all RDMA
   connections associated with the session is X, and the number of slots
   in the session is Y, then the maximum number of outstanding requests
   is the lesser of X and Y.

2.10.6.2. Flow Control

   Previous versions of NFS do not provide flow control; instead they
   rely on the windowing provided by transports like TCP to throttle
   requests. This does not work with RDMA, which provides no operation
   flow control and will terminate a connection in error when limits are
   exceeded. Limits such as maximum number of requests outstanding are
   therefore negotiated when a session is created (see the
   ca_maxrequests field in Section 18.36). These limits then provide
   the maxima which each connection associated with the session's
   channel(s) must remain within.
   RDMA connections are managed within these limits as described in
   section 3.3 ("Flow Control"[[Comment.2: RFC Editor: please verify
   section and title of the RPCRDMA document which is currently at
   http://tools.ietf.org/html/draft-ietf-nfsv4-rpcrdma-08#section-3.3]])
   of [8]; if there are multiple RDMA connections, then the maximum
   number of requests for a channel will be divided among the RDMA
   connections. Put a different way, the onus is on the replier to
   ensure that the total number of RDMA credits across all connections
   associated with the replier's channel does not exceed the channel's
   maximum number of outstanding requests.

   The limits may also be modified dynamically at the replier's choosing
   by manipulating certain parameters present in each NFSv4.1 reply. In
   addition, the CB_RECALL_SLOT callback operation (see Section 20.8)
   can be sent by a server to a client to return RDMA credits to the
   server, thereby lowering the maximum number of requests a client can
   have outstanding to the server.

2.10.6.3. Padding

   Header padding is requested by each peer at session initiation (see
   the ca_headerpadsize argument to CREATE_SESSION in Section 18.36),
   and subsequently used by the RPC RDMA layer, as described in [8].
   Zero padding is permitted.

   Padding leverages the useful property that RDMA preserves alignment
   of data, even when the data is placed into anonymous (untagged)
   buffers. If requested, client inline writes will insert appropriate
   pad bytes within the request header to align the data payload on the
   specified boundary. The client is encouraged to add sufficient
   padding (up to the negotiated size) so that the "data" field of the
   NFSv4.1 WRITE operation is aligned.
   Most servers can make good use of such padding, which allows them to
   chain receive buffers in such a way that any data carried by client
   requests will be placed into appropriate buffers at the server,
   ready for file system processing. The receiver's RPC layer
   encounters no overhead from skipping over pad bytes, and the RDMA
   layer's high performance makes the insertion and transmission of
   padding on the sender a significant optimization. In this way, the
   need for servers to perform RDMA Read to satisfy all but the largest
   client writes is obviated. An added benefit is the reduction of
   message round trips on the network, a potentially good trade where
   latency is present.

   The value to choose for padding is subject to a number of criteria.
   A primary source of variable-length data in the RPC header is the
   authentication information, the form of which is client-determined,
   possibly in response to server specification. The contents of
   COMPOUNDs, sizes of strings such as those passed to RENAME, etc. all
   go into the determination of a maximal NFSv4.1 request size and
   therefore minimal buffer size. The client must select its offered
   value carefully, so as not to overburden the server, and vice versa.
   The payoff of an appropriate padding value is higher performance.
   [[Comment.3: RFC editor please keep this diagram on one page.]]

       Sender gather:
           |RPC Request|Pad bytes|Length| -> |User data...|
           \------+----------------------/       \
                   \                              \
                    \    Receiver scatter:         \-----------+- ...
               /-----+----------------\             \           \
               |RPC Request|Pad|Length|   ->  |FS buffer|->|FS buffer|->...

   In the above case, the server may recycle unused buffers to the next
   posted receive if unused by the actual received request, or may pass
   the now-complete buffers by reference for normal write processing.
   For a server which can make use of it, this removes any need for data
   copies of incoming data, without resorting to complicated end-to-end
   buffer advertisement and management. This includes most kernel-based
   and integrated server designs, among many others. The client may
   perform similar optimizations, if desired.

2.10.6.4. Dual RDMA and Non-RDMA Transports

   Some RDMA transports (for example [10]) permit a "streaming"
   (non-RDMA) phase, where ordinary traffic might flow before "stepping
   up" to RDMA mode, commencing RDMA traffic. Some RDMA transports start
   connections always in RDMA mode. NFSv4.1 allows, but does not
   assume, a streaming phase before RDMA mode. When a connection is
   associated with a session, the client and server negotiate whether
   the connection is used in RDMA or non-RDMA mode (see Section 18.36
   and Section 18.34).

2.10.7. Sessions Security

2.10.7.1. Session Callback Security

   Via session / connection association, NFSv4.1 improves security over
   that provided by NFSv4.0 for the backchannel. The connection is
   client-initiated (see Section 18.34), and subject to the same
   firewall and routing checks as the fore channel. The connection
   cannot be hijacked by an attacker who connects to the client port
   prior to the intended server as is possible with NFSv4.0. At the
   client's option (see Section 18.35), connection association is fully
   authenticated before being activated (see Section 18.34). Traffic
   from the server over the backchannel is authenticated exactly as the
   client specifies (see Section 2.10.7.2).

2.10.7.2. Backchannel RPC Security

   When the NFSv4.1 client establishes the backchannel, it informs the
   server of the security flavors and principals to use when sending
   requests. If the security flavor is RPCSEC_GSS, the client expresses
   the principal in the form of an established RPCSEC_GSS context. The
   server is free to use any of the flavor/principal combinations the
   client offers, but it MUST NOT use unoffered combinations. This way,
   the client need not provide a target GSS principal for the
   backchannel as it did with NFSv4.0, nor does the server have to
   implement an RPCSEC_GSS initiator as it did with NFSv4.0 [20].

   The CREATE_SESSION (Section 18.36) and BACKCHANNEL_CTL
   (Section 18.33) operations allow the client to specify
   flavor/principal combinations.

   Also note that the SP4_SSV state protection mode (see Section 18.35
   and Section 2.10.7.3) has the side benefit of providing SSV-derived
   RPCSEC_GSS contexts (Section 2.10.8).

2.10.7.3. Protection from Unauthorized State Changes

   As described to this point in the specification, the state model of
   NFSv4.1 is vulnerable to an attacker that sends a SEQUENCE operation
   with a forged session ID and with a slot ID that it expects the
   legitimate client to use next. When the legitimate client uses the
   slot ID with the same sequence number, the server returns the
   attacker's result from the reply cache, which disrupts the legitimate
   client and thus denies service to it. Similarly, an attacker could
   send a CREATE_SESSION with a forged client ID to create a new session
   associated with the client ID. The attacker could send requests
   using the new session that change locking state, such as LOCKU
   operations to release locks the legitimate client has acquired.
   Setting a security policy on the file which requires RPCSEC_GSS
   credentials when manipulating the file's state is one potential
   workaround, but has the disadvantage of preventing a legitimate
   client from releasing state when RPCSEC_GSS is required to do so, but
   a GSS context cannot be obtained (possibly because the user has
   logged off the client).

   NFSv4.1 provides three options to a client for state protection,
   which are specified when a client creates a client ID via EXCHANGE_ID
   (Section 18.35).

   The first (SP4_NONE) is to simply waive state protection.

   The other two options (SP4_MACH_CRED and SP4_SSV) share several
   traits:

   o  An RPCSEC_GSS-based credential is used to authenticate client ID
      and session maintenance operations, including creating and
      destroying a session, associating a connection with the session,
      and destroying the client ID.

   o  Because RPCSEC_GSS is used to authenticate client ID and session
      maintenance, the attacker cannot associate a rogue connection with
      a legitimate session, or associate a rogue session with a
      legitimate client ID in order to maliciously alter the client ID's
      lock state via CLOSE, LOCKU, DELEGRETURN, LAYOUTRETURN, etc.

   o  In cases where the server's security policies on a portion of its
      namespace require RPCSEC_GSS authentication, a client may have to
      use an RPCSEC_GSS credential to remove per-file state (for example
      LOCKU, CLOSE, etc.). The server may require that the principal
      that removes the state match certain criteria (for example, the
      principal might have to be the same as the one that acquired the
      state). However, the client might not have an RPCSEC_GSS context
      for such a principal, and might not be able to create such a
      context (perhaps because the user has logged off).
      When the client establishes SP4_MACH_CRED or SP4_SSV protection,
      it can specify a list of operations that the server MUST allow
      using the machine credential (if SP4_MACH_CRED is used) or the SSV
      credential (if SP4_SSV is used).

   The SP4_MACH_CRED state protection option uses a machine credential
   where the principal that creates the client ID must also be the
   principal that performs client ID and session maintenance operations.
   The security of the machine credential state protection approach
   depends entirely on safeguarding the per-machine credential.
   Assuming a proper safeguard, using the per-machine credential for
   operations like CREATE_SESSION, BIND_CONN_TO_SESSION,
   DESTROY_SESSION, and DESTROY_CLIENTID will prevent an attacker from
   associating a rogue connection with a session, or associating a rogue
   session with a client ID.

   There are at least three scenarios for the SP4_MACH_CRED option:

   1.  The system administrator configures a unique, permanent
       per-machine credential for one of the mandated GSS mechanisms
       (for example, if Kerberos V5 is used, a "keytab" containing a
       principal named after the client host name could be used).

   2.  The client is used by a single user, and so the client ID and its
       sessions are used by just that user. If the user's credential
       expires, then session and client ID maintenance cannot occur, but
       since the client has a single user, only that user is
       inconvenienced.

   3.  The physical client has multiple users, but the client
       implementation has a unique client ID for each user.
       This is effectively the same as the second scenario, but a
       disadvantage is that each user must be allocated at least one
       session, so the approach suffers from lack of economy.

   The SP4_SSV protection option uses a Secret State Verifier (SSV)
   which is shared between a client and server. The SSV serves as the
   secret key for an internal (that is, internal to NFSv4.1) GSS
   mechanism that uses the secret key for Message Integrity Code (MIC)
   and Wrap tokens (Section 2.10.8). The SP4_SSV protection option is
   intended for a client that has multiple users, where the system
   administrator does not wish to configure a permanent machine
   credential for each client. The SSV is established on the server via
   SET_SSV (see Section 18.47). To prevent eavesdropping, a client
   SHOULD send SET_SSV via RPCSEC_GSS with the privacy service. Several
   aspects of the SSV make it intractable for an attacker to guess the
   SSV, and thus associate rogue connections with a session, and rogue
   sessions with a client ID:

   o  The arguments to and results of SET_SSV include digests of the old
      and new SSV, respectively.

   o  Because the initial value of the SSV is zero, and therefore known,
      the client that opts for SP4_SSV protection and opts to apply
      SP4_SSV protection to BIND_CONN_TO_SESSION and CREATE_SESSION MUST
      send at least one SET_SSV operation before the first
      BIND_CONN_TO_SESSION operation or before the second CREATE_SESSION
      operation on a client ID. If it does not, the SSV mechanism will
      not generate tokens (Section 2.10.8). A client SHOULD send
      SET_SSV as soon as a session is created.

   o  A SET_SSV does not replace the SSV with the argument to SET_SSV.
      Instead, the current SSV on the server is logically exclusive ORed
      (XORed) with the argument to SET_SSV. Each time a new principal
      uses a client ID for the first time, the client SHOULD send a
      SET_SSV with that principal's RPCSEC_GSS credentials, with
      RPCSEC_GSS service set to RPC_GSS_SVC_PRIVACY.

   Here are the types of attacks that can be attempted by an attacker
   named Eve on a victim named Bob, and how SP4_SSV protection foils
   each attack:

   o  Suppose Eve is the first user to log into a legitimate client.
      Eve's use of an NFSv4.1 file system will cause the legitimate
      client to create a client ID with SP4_SSV protection, specifying
      that the BIND_CONN_TO_SESSION operation MUST use the SSV
      credential. Eve's use of the file system also causes an SSV to be
      created. The SET_SSV operation that creates the SSV will be
      protected by the RPCSEC_GSS context created by the legitimate
      client which uses Eve's GSS principal and credentials. Eve can
      eavesdrop on the network while her RPCSEC_GSS context is created,
      and the SET_SSV using her context is sent. Even if the legitimate
      client sends the SET_SSV with RPC_GSS_SVC_PRIVACY, because Eve
      knows her own credentials, she can decrypt the SSV. Eve can
      compute an RPCSEC_GSS credential that BIND_CONN_TO_SESSION will
      accept, and so associate a new connection with the legitimate
      session. Eve can change the slot ID and sequence state of a
      legitimate session, and/or the SSV state, in such a way that when
      Bob accesses the server via the same legitimate client, the
      legitimate client will be unable to use the session.

      The client's only recourse is to create a new client ID for Bob to
      use, and establish a new SSV for the client ID. The client will
      be unable to delete the old client ID, and will let the lease on
      the old client ID expire.

      Once the legitimate client establishes an SSV over the new session
      using Bob's RPCSEC_GSS context, Eve can use the new session via
      the legitimate client, but she cannot disrupt Bob. Moreover,
      because the client SHOULD have modified the SSV due to Eve using
      the new session, Bob cannot get revenge on Eve by associating a
      rogue connection with the session.

      The question is: how did the legitimate client detect that Eve has
      hijacked the old session? When the client detects that a new
      principal, Bob, wants to use the session, it SHOULD have sent a
      SET_SSV, which leads to the following sub-scenarios:

      *  Let us suppose that from the rogue connection, Eve sent a
         SET_SSV with the same slot ID and sequence ID that the
         legitimate client later uses. The server will assume the
         SET_SSV sent with Bob's credentials is a retry, and return to
         the legitimate client the reply it sent Eve. However, unless
         Eve can correctly guess the SSV the legitimate client will use,
         the digest verification checks in the SET_SSV response will
         fail. That is an indication to the client that the session has
         apparently been hijacked.

      *  Alternatively, Eve sent a SET_SSV with a different slot ID than
         the legitimate client uses for its SET_SSV. Then the digest
         verification of the SET_SSV sent with Bob's credentials fails
         on the server, and the error returned to the client makes it
         apparent that the session has been hijacked.

      *  Alternatively, Eve sent an operation other than SET_SSV, but
         with the same slot ID and sequence that the legitimate client
         uses for its SET_SSV. The server returns to the legitimate
         client the response it sent Eve. The client sees that the
         response is not at all what it expects.
         The client assumes either session hijacking or a server bug,
         and either way destroys the old session.

   o  Eve associates a rogue connection with the session as above, and
      then destroys the session. Again, Bob goes to use the server from
      the legitimate client, which sends a SET_SSV using Bob's
      credentials. The client receives an error that indicates the
      session does not exist. When the client tries to create a new
      session, this will fail because the SSV it has does not match the
      one the server has, and now the client knows the session was
      hijacked. The legitimate client establishes a new client ID.

   o  If Eve creates a connection before the legitimate client
      establishes an SSV, because the initial value of the SSV is zero
      and therefore known, Eve can send a SET_SSV that will pass the
      digest verification check. However, because the new connection
      has not been associated with the session, the SET_SSV is rejected
      for that reason.

   In summary, an attacker's disruption of state when SP4_SSV protection
   is in use is limited to the formative period of a client ID, its
   first session, and the establishment of the SSV. Once a
   non-malicious user uses the client ID, the client quickly detects any
   hijack and rectifies the situation. Once a non-malicious user
   successfully modifies the SSV, the attacker cannot use NFSv4.1
   operations to disrupt the non-malicious user.

   Note that neither the SP4_MACH_CRED nor the SP4_SSV protection
   approach prevents hijacking of a transport connection that has
   previously been associated with a session. If the goal of a
   counter-threat strategy is to prevent connection hijacking, the use
   of IPsec is RECOMMENDED.

   If a connection hijack occurs, the hijacker could in theory change
   locking state and negatively impact the service to legitimate
   clients. However, if the server is configured to require the use of
   RPCSEC_GSS with integrity or privacy on the affected file objects,
   and if the EXCHGID4_FLAG_BIND_PRINC_STATEID capability
   (Section 18.35) is in force, this will thwart unauthorized attempts
   to change locking state.

2.10.8. The SSV GSS Mechanism

   The SSV provides the secret key for a mechanism that NFSv4.1 uses for
   state protection. Contexts for this mechanism are not established
   via the RPCSEC_GSS protocol. Instead, the contexts are automatically
   created when EXCHANGE_ID specifies SP4_SSV protection. The only
   tokens defined are the PerMsgToken (emitted by GSS_GetMIC) and the
   SealedMessage token (emitted by GSS_Wrap).

   The mechanism OID for the SSV mechanism is:
   iso.org.dod.internet.private.enterprise.Michael Eisler.nfs.ssv_mech
   (1.3.6.1.4.1.28882.1.1). While the SSV mechanism does not define any
   initial context tokens, the OID can be used to let servers indicate
   that the SSV mechanism is acceptable whenever the client sends a
   SECINFO or SECINFO_NO_NAME operation (see Section 2.6).

   The SSV mechanism defines four subkeys derived from the SSV value.
   Each time SET_SSV is invoked, the subkeys are recalculated by the
   client and server. The calculation of each of the four subkeys
   depends on each of the four respective ssv_subkey4 enumerated values.
   The calculation uses the HMAC [11] algorithm, using the current SSV
   as the key, the one-way hash algorithm as negotiated by EXCHANGE_ID,
   and the input text as represented by the XDR encoded enumeration of
   type ssv_subkey4.

   /* Input for computing subkeys */
   enum ssv_subkey4 {
           SSV4_SUBKEY_MIC_I2T     = 1,
           SSV4_SUBKEY_MIC_T2I     = 2,
           SSV4_SUBKEY_SEAL_I2T    = 3,
           SSV4_SUBKEY_SEAL_T2I    = 4
   };

   The subkey derived from SSV4_SUBKEY_MIC_I2T is used for calculating
   message integrity codes (MICs) that originate from the NFSv4.1
   client, whether as part of a request over the fore channel, or a
   response over the backchannel. The subkey derived from
   SSV4_SUBKEY_MIC_T2I is used for MICs originating from the NFSv4.1
   server. The subkey derived from SSV4_SUBKEY_SEAL_I2T is used for
   encrypting text originating from the NFSv4.1 client, and the subkey
   derived from SSV4_SUBKEY_SEAL_T2I is used for encrypting text
   originating from the NFSv4.1 server.

   The PerMsgToken description is based on an XDR definition:

   /* Input for computing smt_hmac */
   struct ssv_mic_plain_tkn4 {
           uint32_t        smpt_ssv_seq;
           opaque          smpt_orig_plain<>;
   };

   /* SSV GSS PerMsgToken token */
   struct ssv_mic_tkn4 {
           uint32_t        smt_ssv_seq;
           opaque          smt_hmac<>;
   };

   The field smt_hmac is an HMAC calculated by using the subkey derived
   from SSV4_SUBKEY_MIC_I2T or SSV4_SUBKEY_MIC_T2I as the key, the
   one-way hash algorithm as negotiated by EXCHANGE_ID, and the input
   text as represented by data of type ssv_mic_plain_tkn4. The field
   smpt_ssv_seq is the same as smt_ssv_seq. The field smpt_orig_plain
   is the "message" input passed to GSS_GetMIC() (see Section 2.3.1 of
   [7]). The caller of GSS_GetMIC() provides a pointer to a buffer
   containing the plain text. The SSV mechanism's entry point for
   GSS_GetMIC() encodes this into an opaque array, and the encoding will
   include an initial four byte length, plus any necessary padding.
   Prepended to this will be the XDR encoded value of smpt_ssv_seq, thus
   making up an XDR encoding of a value of data type ssv_mic_plain_tkn4,
   which in turn is the input into the HMAC.

   The token emitted by GSS_GetMIC() is XDR encoded and of XDR data type
   ssv_mic_tkn4. The field smt_ssv_seq comes from the SSV sequence
   number, which is equal to 1 after SET_SSV (Section 18.47) is called
   the first time on a client ID. Thereafter, it is incremented on each
   SET_SSV. Thus smt_ssv_seq represents the version of the SSV at the
   time GSS_GetMIC() was called. As noted in Section 18.35, the client
   and server can maintain multiple concurrent versions of the SSV.
   This allows the SSV to be changed without serializing all RPC calls
   that use the SSV mechanism with SET_SSV operations. Once the HMAC is
   calculated, it is XDR encoded into smt_hmac, which will include an
   initial four byte length, and any necessary padding. Prepended to
   this will be the XDR encoded value of smt_ssv_seq.

   The SealedMessage description is based on an XDR definition:

   /* Input for computing ssct_encr_data and ssct_hmac */
   struct ssv_seal_plain_tkn4 {
           opaque          sspt_confounder<>;
           uint32_t        sspt_ssv_seq;
           opaque          sspt_orig_plain<>;
           opaque          sspt_pad<>;
   };

   /* SSV GSS SealedMessage token */
   struct ssv_seal_cipher_tkn4 {
           uint32_t        ssct_ssv_seq;
           opaque          ssct_iv<>;
           opaque          ssct_encr_data<>;
           opaque          ssct_hmac<>;
   };

   The token emitted by GSS_Wrap() is XDR encoded and of XDR data type
   ssv_seal_cipher_tkn4.

   The ssct_ssv_seq field has the same meaning as smt_ssv_seq.

   The ssct_encr_data field is the result of encrypting a value of the
   XDR encoded data type ssv_seal_plain_tkn4.
   The encryption key is the subkey derived from SSV4_SUBKEY_SEAL_I2T or
   SSV4_SUBKEY_SEAL_T2I, and the encryption algorithm is that negotiated
   by EXCHANGE_ID.

   The ssct_iv field is the initialization vector (IV) for the
   encryption algorithm (if applicable) and is sent in clear text. The
   content and size of the IV MUST comply with the specification of the
   encryption algorithm. For example, the id-aes256-CBC algorithm MUST
   use a 16 byte initialization vector (IV) which MUST be unpredictable
   for each instance of a value of type ssv_seal_plain_tkn4 that is
   encrypted with a particular SSV key.

   The ssct_hmac field is the result of computing an HMAC using the
   value of the XDR encoded data type ssv_seal_plain_tkn4 as the input
   text. The key is the subkey derived from SSV4_SUBKEY_MIC_I2T or
   SSV4_SUBKEY_MIC_T2I, and the one-way hash algorithm is that
   negotiated by EXCHANGE_ID.

   The sspt_confounder field is a random value.

   The sspt_ssv_seq field is the same as ssct_ssv_seq.

   The sspt_orig_plain field is the original plaintext and is the
   "input_message" input passed to GSS_Wrap() (see Section 2.3.3 of
   [7]). As with the handling of the plaintext by the SSV mechanism's
   GSS_GetMIC() entry point, the entry point for GSS_Wrap() expects a
   pointer to the plaintext, and will XDR encode an opaque array into
   sspt_orig_plain representing the plain text, along with the other
   fields of an instance of data type ssv_seal_plain_tkn4.

   The sspt_pad field is present to support encryption algorithms that
   require inputs to be in fixed sized blocks. The content of sspt_pad
   is zero filled except for the length.
   Beware that the XDR encoding of ssv_seal_plain_tkn4 contains three
   variable length arrays, and so each array consumes four bytes for an
   array length, and each array that follows the length is always padded
   to a multiple of four bytes per the XDR standard.

   For example, suppose the encryption algorithm uses 16 byte blocks,
   the sspt_confounder is three bytes long, and the sspt_orig_plain
   field is 15 bytes long. The XDR encoding of sspt_confounder uses
   eight bytes (4 + 3 + 1 byte pad), the XDR encoding of sspt_ssv_seq
   uses four bytes, the XDR encoding of sspt_orig_plain uses 20 bytes (4
   + 15 + 1 byte pad), and the smallest XDR encoding of the sspt_pad
   field is four bytes. This totals 36 bytes. The next multiple of 16
   is 48, thus the length field of sspt_pad needs to be set to 12 bytes,
   or a total encoding of 16 bytes. The total number of XDR encoded
   bytes is thus 8 + 4 + 20 + 16 = 48.

   GSS_Wrap() emits a token that is an XDR encoding of a value of data
   type ssv_seal_cipher_tkn4. Note that regardless of whether the
   caller of GSS_Wrap() requests confidentiality or not, the token
   always has confidentiality. This is because the SSV mechanism is for
   RPCSEC_GSS, and RPCSEC_GSS never produces GSS_Wrap() tokens without
   confidentiality.

   There is one SSV per client ID. Effectively there is a single GSS
   context for a client ID / SSV pair. All SSV mechanism RPCSEC_GSS
   handles of a client ID / SSV pair share the same GSS context. SSV
   GSS contexts do not expire except when the SSV is destroyed (causes
   would include the client ID being destroyed or a server restart).
   Since one purpose of context expiration is to replace keys that have
   been in use for "too long" and are hence vulnerable to compromise by
   brute force or accident, the client can replace the SSV key by
   sending periodic SET_SSV operations, by cycling through different
   users' RPCSEC_GSS credentials. This way the SSV is replaced without
   destroying the SSV's GSS contexts.

   SSV RPCSEC_GSS handles can be expired or deleted by the server at any
   time and the EXCHANGE_ID operation can be used to create more SSV
   RPCSEC_GSS handles. Expiration of SSV RPCSEC_GSS handles does not
   imply that the SSV or its GSS context have expired.

   The client MUST establish an SSV via SET_SSV before the SSV GSS
   context can be used to emit tokens from GSS_Wrap() and GSS_GetMIC().
   If SET_SSV has not been successfully called, attempts to emit tokens
   MUST fail.

   The SSV mechanism does not support replay detection and sequencing in
   its tokens because RPCSEC_GSS does not use those features (see
   Section 5.2.2 "Context Creation Requests" in [4]).

2.10.9. Session Mechanics - Steady State

2.10.9.1. Obligations of the Server

   The server has the primary obligation to monitor the state of
   backchannel resources that the client has created for the server
   (RPCSEC_GSS contexts and backchannel connections). If these
   resources vanish, the server takes action as specified in
   Section 2.10.11.2.

2.10.9.2. Obligations of the Client

   The client SHOULD honor the following obligations in order to utilize
   the session:

   o  Keep a necessary session from going idle on the server. A client
      that requires a session, but nonetheless is not sending
      operations, risks having the session be destroyed by the server.
      This is because sessions consume resources, and resource
      limitations may force the server to cull an inactive session. A
      server MAY consider a session to be inactive if the client has not
      used the session before the session inactivity timer
      (Section 2.10.10) has expired.

   o  Destroy the session when not needed. If a client has multiple
      sessions, one of which has no requests waiting for replies, and
      has been idle for some period of time, it SHOULD destroy the
      session.

   o  Maintain GSS contexts for the backchannel. If the client requires
      the server to use the RPCSEC_GSS security flavor for callbacks,
      then it needs to be sure the contexts handed to the server via
      BACKCHANNEL_CTL are unexpired.

   o  Preserve a connection for a backchannel. The server requires a
      backchannel in order to gracefully recall recallable state, or
      notify the client of certain events. Note that if the connection
      is not being used for the fore channel, there is no way for the
      client to tell if the connection is still alive (e.g., the server
      restarted without sending a disconnect). The onus is on the
      server, not the client, to determine if the backchannel's
      connection is alive, and to indicate in the response to a SEQUENCE
      operation when the last connection associated with a session's
      backchannel has disconnected.

2.10.9.3. Steps the Client Takes To Establish a Session

   If the client does not have a client ID, the client sends EXCHANGE_ID
   to establish a client ID. If it opts for SP4_MACH_CRED or SP4_SSV
   protection, in the spo_must_enforce list of operations, it SHOULD at
   minimum specify: CREATE_SESSION, DESTROY_SESSION,
   BIND_CONN_TO_SESSION, BACKCHANNEL_CTL, and DESTROY_CLIENTID. If it
   opts for SP4_SSV protection, the client needs to ask for SSV-based
   RPCSEC_GSS handles.
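
   The recommended minimum for spo_must_enforce can be checked
   mechanically. The Python sketch below is illustrative only: the set
   name and the helper function are hypothetical, operations are
   represented as plain strings rather than the protocol's operation
   numbers, and only the five operation names are taken from the
   paragraph above.

```python
# Operations the text above says a client SHOULD at minimum list in
# spo_must_enforce when it opts for SP4_MACH_CRED or SP4_SSV protection.
RECOMMENDED_MUST_ENFORCE = frozenset({
    "CREATE_SESSION",
    "DESTROY_SESSION",
    "BIND_CONN_TO_SESSION",
    "BACKCHANNEL_CTL",
    "DESTROY_CLIENTID",
})


def missing_must_enforce(requested_ops):
    """Return, in sorted order, the recommended operations that are
    absent from a proposed spo_must_enforce list (hypothetical helper)."""
    return sorted(RECOMMENDED_MUST_ENFORCE - set(requested_ops))
```

   A client implementation might run such a check before sending
   EXCHANGE_ID and treat a non-empty result as a configuration warning
   rather than an error, since the requirement is a SHOULD.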

   The client uses the client ID to send a CREATE_SESSION on a
   connection to the server.  The results of CREATE_SESSION indicate
   whether the server will persist the session reply cache through a
   server restart or not, and the client notes this for future
   reference.

   If the client specified SP4_SSV state protection when the client ID
   was created, then it SHOULD send SET_SSV in the first COMPOUND after
   the session is created.  Each time a new principal goes to use the
   client ID, it SHOULD send a SET_SSV again.

   If the client wants to use delegations, layouts, directory
   notifications, or any other state that requires a backchannel, then
   it must add a connection to the backchannel if CREATE_SESSION did not
   already do so.  The client creates a connection, and calls
   BIND_CONN_TO_SESSION to associate the connection with the session and
   the session's backchannel.  If CREATE_SESSION did not already do so,
   the client MUST tell the server what security is required in order
   for the client to accept callbacks.  The client does this via
   BACKCHANNEL_CTL.  If the client selected SP4_MACH_CRED or SP4_SSV
   protection when it called EXCHANGE_ID, then the client SHOULD specify
   that the backchannel use RPCSEC_GSS contexts for security.

   If the client wants to use additional connections for the
   backchannel, then it must call BIND_CONN_TO_SESSION on each
   connection it wants to use with the session.  If the client wants to
   use additional connections for the fore channel, then it must call
   BIND_CONN_TO_SESSION if it specified SP4_SSV or SP4_MACH_CRED state
   protection when the client ID was created.

   At this point the session has reached steady state.

2.10.10.  Session Inactivity Timer

   The server MAY maintain a session inactivity timer for each session.
   If the session inactivity timer expires, then the server MAY destroy
   the session.  To avoid losing a session due to inactivity, the client
   MUST renew the session inactivity timer.  The length of the session
   inactivity timer MUST NOT be less than the lease_time attribute
   (Section 5.8.1.11).  As with lease renewal (Section 8.3), when the
   server receives a SEQUENCE operation, it resets the session
   inactivity timer, and MUST NOT allow the timer to expire while the
   rest of the operations in the COMPOUND procedure's request are still
   executing.  Once the last operation has finished, the server MUST set
   the session inactivity timer to expire no sooner than the sum of the
   current time and the value of the lease_time attribute.

2.10.11.  Session Mechanics - Recovery

2.10.11.1.  Events Requiring Client Action

   The following events require client action to recover.

2.10.11.1.1.  RPCSEC_GSS Context Loss by Callback Path

   If all RPCSEC_GSS contexts granted by the client to the server for
   callback use have expired, the client MUST establish a new context
   via BACKCHANNEL_CTL.  The sr_status_flags field of the SEQUENCE
   results indicates when callback contexts are nearly expired, or fully
   expired (see Section 18.46.3).

2.10.11.1.2.  Connection Loss

   If the client loses the last connection of the session, and if it
   wants to retain the session, then it must create a new connection,
   and if, when the client ID was created, BIND_CONN_TO_SESSION was
   specified in the spo_must_enforce list, the client MUST use
   BIND_CONN_TO_SESSION to associate the connection with the session.
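
   The decision described in the preceding paragraph can be sketched in
   Python as follows.  This is only an illustration: the function name
   and the step labels are hypothetical, and the sketch models the
   choice of steps rather than any real network activity.

```python
def reconnect_plan(wants_session, bind_enforced):
    """Return the ordered steps after the last connection is lost.

    wants_session: the client still wants to retain the session.
    bind_enforced: BIND_CONN_TO_SESSION was listed in spo_must_enforce
                   when the client ID was created.
    """
    if not wants_session:
        # Nothing to recover; the client lets the session go.
        return []
    steps = ["create_connection"]
    if bind_enforced:
        # The client MUST associate the new connection with the session.
        steps.append("BIND_CONN_TO_SESSION")
    return steps


print(reconnect_plan(wants_session=True, bind_enforced=True))
```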

   If there was a request outstanding at the time of the connection
   loss, then if the client wants to continue to use the session, it
   MUST retry the request, as described in Section 2.10.5.2.  Note that
   it is not necessary to retry requests over a connection with the same
   source network address or the same destination network address as the
   lost connection.  As long as the session ID, slot ID, and sequence ID
   in the retry match those of the original request, the server will
   recognize the request as a retry if it executed the request prior to
   disconnect.

   If the connection that was lost was the last one associated with the
   backchannel, and the client wants to retain the backchannel and/or
   not put recallable state subject to revocation, the client must
   reconnect, and if it does, it MUST associate the connection to the
   session and backchannel via BIND_CONN_TO_SESSION.  The server SHOULD
   indicate when it has no callback connection via the sr_status_flags
   result from SEQUENCE.

2.10.11.1.3.  Backchannel GSS Context Loss

   Via the sr_status_flags result of the SEQUENCE operation or other
   means, the client will learn if some or all of the RPCSEC_GSS
   contexts it assigned to the backchannel have been lost.  If the
   client wants to retain the backchannel and/or not put recallable
   state subject to revocation, the client must use BACKCHANNEL_CTL to
   assign new contexts.

2.10.11.1.4.  Loss of Session

   The replier might lose a record of the session.  Causes include:

   o  Replier failure and restart.

   o  A catastrophe that causes the reply cache to be corrupted or lost
      on the media it was stored on.  This applies even if the replier
      indicated in the CREATE_SESSION results that it would persist the
      cache.

   o  The server purges the session of a client that has been inactive
      for a very extended period of time.

   Loss of reply cache is equivalent to loss of session.  The replier
   indicates loss of session to the requester by returning
   NFS4ERR_BADSESSION on the next operation that uses the session ID
   that refers to the lost session.

   After an event like a server restart, the client may have lost its
   connections.  The client assumes for the moment that the session has
   not been lost.  It reconnects, and if it specified connection
   association enforcement when the session was created, it invokes
   BIND_CONN_TO_SESSION using the session ID.  Otherwise, it invokes
   SEQUENCE.  If BIND_CONN_TO_SESSION or SEQUENCE returns
   NFS4ERR_BADSESSION, the client knows the session was lost.  If the
   connection survives session loss, then the next SEQUENCE operation
   the client sends over the connection will get back
   NFS4ERR_BADSESSION.  The client again knows the session was lost.

   When the client detects session loss, it must call CREATE_SESSION to
   recover.  Any non-idempotent operations that were in progress may
   have been performed on the server at the time of session loss.  The
   client has no general way to recover from this.

   Note that loss of session does not imply loss of lock, open,
   delegation, or layout state because locks, opens, delegations, and
   layouts are tied to the client ID and depend on the client ID, not
   the session.  Nor does loss of lock, open, delegation, or layout
   state imply loss of session state, because the session depends on the
   client ID; loss of client ID however does imply loss of session,
   lock, open, delegation, and layout state.  See Section 8.4.2.
   A session can survive a server restart, but lock recovery may still
   be needed.

   It is possible that CREATE_SESSION will fail with
   NFS4ERR_STALE_CLIENTID (for example, the server restarts and does not
   preserve client ID state).  If so, the client needs to call
   EXCHANGE_ID, followed by CREATE_SESSION.

2.10.11.2.  Events Requiring Server Action

   The following events require server action to recover.

2.10.11.2.1.  Client Crash and Restart

   As described in Section 18.35, a restarted client sends EXCHANGE_ID
   in such a way that it causes the server to delete any sessions it
   had.

2.10.11.2.2.  Client Crash with No Restart

   If a client crashes and never comes back, it will never send
   EXCHANGE_ID with its old client owner.  Thus the server has session
   state that will never be used again.  After an extended period of
   time and if the server has resource constraints, it MAY destroy the
   old session as well as locking state.

2.10.11.2.3.  Extended Network Partition

   To the server, the extended network partition may be no different
   from a client crash with no restart (see Section 2.10.11.2.2).
   Unless the server can discern that there is a network partition, it
   is free to treat the situation as if the client has crashed
   permanently.

2.10.11.2.4.  Backchannel Connection Loss

   If there were callback requests outstanding at the time of a
   connection loss, then the server MUST retry the request, as described
   in Section 2.10.5.2.  Note that it is not necessary to retry requests
   over a connection with the same source network address or the same
   destination network address as the lost connection.  As long as the
   session ID, slot ID, and sequence ID in the retry match those of the
   original request, the callback target will recognize the request as a
   retry even if it did see the request prior to disconnect.

   If the connection lost is the last one associated with the
   backchannel, then the server MUST indicate that in the
   sr_status_flags field of every SEQUENCE reply until the backchannel
   is reestablished.  There are two situations, each of which uses
   different status flags: no connectivity for the session's
   backchannel, and no connectivity for any session backchannel of the
   client.  See Section 18.46 for a description of the appropriate flags
   in sr_status_flags.

2.10.11.2.5.  GSS Context Loss

   The server SHOULD monitor when the number of RPCSEC_GSS contexts
   assigned to the backchannel reaches one, and when that one context is
   near expiry (i.e., between one and two periods of lease time),
   indicate so in the sr_status_flags field of all SEQUENCE replies.
   The server MUST indicate when all of the backchannel's assigned
   RPCSEC_GSS contexts have expired in the sr_status_flags field of all
   SEQUENCE replies.

2.10.12.  Parallel NFS and Sessions

   A client and server can potentially be a non-pNFS implementation, a
   metadata server implementation, a data server implementation, or two
   or three types of implementations.  The EXCHGID4_FLAG_USE_NON_PNFS,
   EXCHGID4_FLAG_USE_PNFS_MDS, and EXCHGID4_FLAG_USE_PNFS_DS flags (not
   mutually exclusive) are passed in the EXCHANGE_ID arguments and
   results to allow the client to indicate how it wants to use sessions
   created under the client ID, and to allow the server to indicate how
   it will allow the sessions to be used.
   See Section 13.1 for pNFS sessions considerations.


3.  Protocol Constants and Data Types

   The syntax and semantics to describe the data types of the NFSv4.1
   protocol are defined in the XDR RFC4506 [2] and RPC RFC1831 [3]
   documents.  The next sections build upon the XDR data types to define
   constants, types and structures specific to this protocol.  The full
   list of XDR data types is in [12].

3.1.  Basic Constants

   const NFS4_FHSIZE               = 128;
   const NFS4_VERIFIER_SIZE        = 8;
   const NFS4_OPAQUE_LIMIT         = 1024;
   const NFS4_SESSIONID_SIZE       = 16;

   const NFS4_INT64_MAX            = 0x7fffffffffffffff;
   const NFS4_UINT64_MAX           = 0xffffffffffffffff;
   const NFS4_INT32_MAX            = 0x7fffffff;
   const NFS4_UINT32_MAX           = 0xffffffff;

   const NFS4_MAXFILELEN           = 0xffffffffffffffff;
   const NFS4_MAXFILEOFF           = 0xfffffffffffffffe;

   Except where noted, all these constants are defined in bytes.

   o  NFS4_FHSIZE is the maximum size of a filehandle.

   o  NFS4_VERIFIER_SIZE is the fixed size of a verifier.

   o  NFS4_OPAQUE_LIMIT is the maximum size of certain opaque
      information.

   o  NFS4_SESSIONID_SIZE is the fixed size of a session identifier.

   o  NFS4_INT64_MAX is the maximum value of a signed 64 bit integer.

   o  NFS4_UINT64_MAX is the maximum value of an unsigned 64 bit
      integer.

   o  NFS4_INT32_MAX is the maximum value of a signed 32 bit integer.

   o  NFS4_UINT32_MAX is the maximum value of an unsigned 32 bit
      integer.

   o  NFS4_MAXFILELEN is the maximum length of a regular file.

   o  NFS4_MAXFILEOFF is the maximum offset into a regular file.

3.2.  Basic Data Types

   These are the base NFSv4.1 data types.

   +---------------+---------------------------------------------------+
   | Data Type     | Definition                                        |
   +---------------+---------------------------------------------------+
   | int32_t       | typedef int int32_t;                              |
   | uint32_t      | typedef unsigned int uint32_t;                    |
   | int64_t       | typedef hyper int64_t;                            |
   | uint64_t      | typedef unsigned hyper uint64_t;                  |
   | attrlist4     | typedef opaque attrlist4<>;                       |
   |               | Used for file/directory attributes.               |
   | bitmap4       | typedef uint32_t bitmap4<>;                       |
   |               | Used in attribute array encoding.                 |
   | changeid4     | typedef uint64_t changeid4;                       |
   |               | Used in the definition of change_info4.           |
   | clientid4     | typedef uint64_t clientid4;                       |
   |               | Shorthand reference to client identification.     |
   | count4        | typedef uint32_t count4;                          |
   |               | Various count parameters (READ, WRITE, COMMIT).   |
   | length4       | typedef uint64_t length4;                         |
   |               | Describes LOCK lengths.                           |
   | mode4         | typedef uint32_t mode4;                           |
   |               | Mode attribute data type.                         |
   | nfs_cookie4   | typedef uint64_t nfs_cookie4;                     |
   |               | Opaque cookie value for READDIR.                  |
   | nfs_fh4       | typedef opaque nfs_fh4<NFS4_FHSIZE>;              |
   |               | Filehandle definition.                            |
   | nfs_ftype4    | enum nfs_ftype4;                                  |
   |               | Various defined file types.                       |
   | nfsstat4      | enum nfsstat4;                                    |
   |               | Return value for operations.                      |
   | offset4       | typedef uint64_t offset4;                         |
   |               | Various offset designations (READ, WRITE, LOCK,   |
   |               | COMMIT).                                          |
   | qop4          | typedef uint32_t qop4;                            |
   |               | Quality of protection designation in SECINFO.     |
   | sec_oid4      | typedef opaque sec_oid4<>;                        |
   |               | Security Object Identifier.  The sec_oid4 data    |
   |               | type is not really opaque.  Instead it contains   |
   |               | an ASN.1 OBJECT IDENTIFIER as used by GSS-API in  |
   |               | the mech_type argument to GSS_Init_sec_context.   |
   |               | See [7] for details.                              |
   | sequenceid4   | typedef uint32_t sequenceid4;                     |
   |               | Sequence number used for various session          |
   |               | operations (EXCHANGE_ID, CREATE_SESSION,          |
   |               | SEQUENCE, CB_SEQUENCE).                           |
   | seqid4        | typedef uint32_t seqid4;                          |
   |               | Sequence identifier used for file locking.        |
   | sessionid4    | typedef opaque sessionid4[NFS4_SESSIONID_SIZE];   |
   |               | Session identifier.                               |
   | slotid4       | typedef uint32_t slotid4;                         |
   |               | Sequencing artifact for various session           |
   |               | operations (SEQUENCE, CB_SEQUENCE).               |
   | utf8string    | typedef opaque utf8string<>;                      |
   |               | UTF-8 encoding for strings.                       |
   | utf8str_cis   | typedef utf8string utf8str_cis;                   |
   |               | Case-insensitive UTF-8 string.                    |
   | utf8str_cs    | typedef utf8string utf8str_cs;                    |
   |               | Case-sensitive UTF-8 string.                      |
   | utf8str_mixed | typedef utf8string utf8str_mixed;                 |
   |               | UTF-8 strings with a case sensitive prefix and a  |
   |               | case insensitive suffix.                          |
   | component4    | typedef utf8str_cs component4;                    |
   |               | Represents path name components.                  |
   | linktext4     | typedef utf8str_cs linktext4;                     |
   |               | Symbolic link contents.                           |
   | pathname4     | typedef component4 pathname4<>;                   |
   |               | Represents path name for fs_locations.            |
   | verifier4     | typedef opaque verifier4[NFS4_VERIFIER_SIZE];     |
   |               | Verifier used for various operations (COMMIT,     |
   |               | CREATE, EXCHANGE_ID, OPEN, READDIR, WRITE).       |
   |               | NFS4_VERIFIER_SIZE is defined as 8.               |
   +---------------+---------------------------------------------------+

                          End of Base Data Types

                                 Table 1

3.3.  Structured Data Types

3.3.1.  nfstime4

   struct nfstime4 {
           int64_t         seconds;
           uint32_t        nseconds;
   };

   The nfstime4 data type gives the number of seconds and nanoseconds
   since midnight or 0 hour January 1, 1970 Coordinated Universal Time
   (UTC).  Values greater than zero for the seconds field denote dates
   after the 0 hour January 1, 1970.  Values less than zero for the
   seconds field denote dates before the 0 hour January 1, 1970.  In
   both cases, the nseconds field is to be added to the seconds field
   for the final time representation.  For example, if the time to be
   represented is one-half second before 0 hour January 1, 1970, the
   seconds field would have a value of negative one (-1) and the
   nseconds field would have a value of one-half second (500000000).
   Values greater than 999,999,999 for nseconds are invalid.

   This data type is used to pass time and date information.  A server
   converts to and from its local representation of time when processing
   time values, preserving as much accuracy as possible.  If the
   precision of timestamps stored for a file system object is less than
   defined, loss of precision can occur.  An adjunct time maintenance
   protocol is RECOMMENDED to reduce client and server time skew.

3.3.2.  time_how4

   enum time_how4 {
           SET_TO_SERVER_TIME4 = 0,
           SET_TO_CLIENT_TIME4 = 1
   };

3.3.3.  settime4

   union settime4 switch (time_how4 set_it) {
    case SET_TO_CLIENT_TIME4:
            nfstime4       time;
    default:
            void;
   };

   The time_how4 and settime4 data types are used for setting timestamps
   in file object attributes.  If set_it is SET_TO_SERVER_TIME4, then
   the server uses its local representation of time for the time value.

3.3.4.  specdata4

   struct specdata4 {
           uint32_t specdata1; /* major device number */
           uint32_t specdata2; /* minor device number */
   };

   This data type represents the device numbers for the device file
   types NF4CHR and NF4BLK.

3.3.5.  fsid4

   struct fsid4 {
           uint64_t        major;
           uint64_t        minor;
   };

3.3.6.  chg_policy4

   struct change_policy4 {
           uint64_t        cp_major;
           uint64_t        cp_minor;
   };

   The change_policy4 data type is used for the change_policy
   RECOMMENDED attribute.  It provides change sequencing indication
   analogous to the change attribute.  To enable the server to present a
   value valid across server re-initialization without requiring
   persistent storage, two 64-bit quantities are used, allowing one to
   be a server instance ID and the second to be incremented
   non-persistently, within a given server instance.

3.3.7.  fattr4

   struct fattr4 {
           bitmap4         attrmask;
           attrlist4       attr_vals;
   };

   The fattr4 data type is used to represent file and directory
   attributes.

   The bitmap is a counted array of 32 bit integers used to contain bit
   values.  The position of the integer in the array that contains bit n
   can be computed from the expression (n / 32) and its bit within that
   integer is (n mod 32).

                     0            1
   +-----------+-----------+-----------+--
   |  count    | 31  ..  0 | 63  .. 32 |
   +-----------+-----------+-----------+--

3.3.8.  change_info4

   struct change_info4 {
           bool            atomic;
           changeid4       before;
           changeid4       after;
   };

   This data type is used with the CREATE, LINK, OPEN, REMOVE, and
   RENAME operations to let the client know the value of the change
   attribute for the directory in which the target file system object
   resides.

3.3.9.  netaddr4

   struct netaddr4 {
           /* see struct rpcb in RFC 1833 */
           string na_r_netid<>;    /* network id */
           string na_r_addr<>;     /* universal address */
   };

   The netaddr4 data type is used to identify network transport
   endpoints.  The na_r_netid and na_r_addr fields respectively contain
   a netid and uaddr.  The netid and uaddr concepts are defined in [13].
   The netid and uaddr formats for TCP over IPv4 and TCP over IPv6 are
   defined in [13], specifically Tables 2 and 3 and Sections 3.2.3.3 and
   3.2.3.4.

3.3.10.  state_owner4

   struct state_owner4 {
           clientid4       clientid;
           opaque          owner<NFS4_OPAQUE_LIMIT>;
   };

   typedef state_owner4 open_owner4;
   typedef state_owner4 lock_owner4;

   The state_owner4 data type is the base type for the open_owner4
   (Section 3.3.10.1) and lock_owner4 (Section 3.3.10.2) data types.

3.3.10.1.  open_owner4

   This data type is used to identify the owner of open state.

3.3.10.2.  lock_owner4

   This structure is used to identify the owner of byte-range locking
   state.

3.3.11.  open_to_lock_owner4

   struct open_to_lock_owner4 {
           seqid4          open_seqid;
           stateid4        open_stateid;
           seqid4          lock_seqid;
           lock_owner4     lock_owner;
   };

   This data type is used for the first LOCK operation done for an
   open_owner4.
   It provides both the open_stateid and lock_owner such
   that the transition is made from a valid open_stateid sequence to
   that of the new lock_stateid sequence.  Using this mechanism avoids
   the confirmation of the lock_owner/lock_seqid pair since it is tied
   to established state in the form of the open_stateid/open_seqid.

3.3.12.  stateid4

   struct stateid4 {
           uint32_t        seqid;
           opaque          other[12];
   };

   This data type is used for the various state sharing mechanisms
   between the client and server.  The client never modifies a value of
   data type stateid.  The starting value of the seqid field is
   undefined.  The server is required to increment the seqid field by
   one (1) at each transition of the stateid.  This is important since
   the client will inspect the seqid in OPEN stateids to determine the
   order of OPEN processing done by the server.

3.3.13.  layouttype4

   enum layouttype4 {
           LAYOUT4_NFSV4_1_FILES   = 0x1,
           LAYOUT4_OSD2_OBJECTS    = 0x2,
           LAYOUT4_BLOCK_VOLUME    = 0x3
   };

   This data type indicates what type of layout is being used.  The file
   server advertises the layout types it supports through the
   fs_layout_type file system attribute (Section 5.12.1).  A client asks
   for layouts of a particular type in LAYOUTGET, and processes those
   layouts in its layout-type-specific logic.

   The layouttype4 data type is 32 bits in length.  The range
   represented by the layout type is split into three parts.  Type 0x0
   is reserved.  Types within the range 0x00000001-0x7FFFFFFF are
   globally unique and are assigned according to the description in
   Section 22.4; they are maintained by IANA.  Types within the range
   0x80000000-0xFFFFFFFF are site specific and for private use only.
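
   The three-way split of the layouttype4 range described above can be
   sketched in Python as follows.  The function name and the returned
   labels are hypothetical and chosen only for illustration; the range
   boundaries are the ones given in the text.

```python
def classify_layouttype4(value):
    """Classify a 32-bit layouttype4 value by the ranges in the text."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("layouttype4 is a 32-bit quantity")
    if value == 0x0:
        return "reserved"
    if value <= 0x7FFFFFFF:
        # 0x00000001-0x7FFFFFFF: globally unique, maintained by IANA.
        return "IANA-assigned"
    # 0x80000000-0xFFFFFFFF: site specific, private use only.
    return "site-specific"


# The three layout types defined by the enumeration above.
DEFINED_TYPES = {
    0x1: "LAYOUT4_NFSV4_1_FILES",
    0x2: "LAYOUT4_OSD2_OBJECTS",
    0x3: "LAYOUT4_BLOCK_VOLUME",
}

for number, name in DEFINED_TYPES.items():
    print(name, classify_layouttype4(number))
```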

   The LAYOUT4_NFSV4_1_FILES enumeration specifies that the NFSv4.1 file
   layout type, as defined in Section 13, is to be used.  The
   LAYOUT4_OSD2_OBJECTS enumeration specifies that the object layout, as
   defined in [29], is to be used.  Similarly, the LAYOUT4_BLOCK_VOLUME
   enumeration specifies that the block/volume layout, as defined in
   [30], is to be used.

3.3.14.  deviceid4

   const NFS4_DEVICEID4_SIZE = 16;

   typedef opaque  deviceid4[NFS4_DEVICEID4_SIZE];

   Layout information includes device IDs that specify a storage device
   through a compact handle.  Addressing and type information is
   obtained with the GETDEVICEINFO operation.  Device IDs are not
   guaranteed to be valid across metadata server restarts.  A device ID
   is unique per client ID and layout type.  See Section 12.2.10 for
   more details.

3.3.15.  device_addr4

   struct device_addr4 {
           layouttype4             da_layout_type;
           opaque                  da_addr_body<>;
   };

   The device address is used to set up a communication channel with the
   storage device.  Different layout types will require different data
   types to define how they communicate with storage devices.  The
   opaque da_addr_body field must be interpreted based on the specified
   da_layout_type field.

   This document defines the device address for the NFSv4.1 file layout
   (see Section 13.3), which identifies a storage device by network IP
   address and port number.  This is sufficient for the clients to
   communicate with the NFSv4.1 storage devices, and may be sufficient
   for other layout types as well.  Device types for object storage
   devices and block storage devices (e.g., SCSI volume labels) will be
   defined by their respective layout specifications.

3.3.16.  layout_content4

   struct layout_content4 {
           layouttype4 loc_type;
           opaque      loc_body<>;
   };

   The loc_body field must be interpreted based on the layout type
   (loc_type).  This document defines the loc_body for the NFSv4.1 file
   layout type; see Section 13.3 for its definition.

3.3.17.  layout4

   struct layout4 {
           offset4                 lo_offset;
           length4                 lo_length;
           layoutiomode4           lo_iomode;
           layout_content4         lo_content;
   };

   The layout4 data type defines a layout for a file.  The layout type
   specific data is opaque within lo_content.  Since layouts are
   sub-dividable, the offset and length together with the file's
   filehandle, the client ID, iomode, and layout type, identify the
   layout.

3.3.18.  layoutupdate4

   struct layoutupdate4 {
           layouttype4             lou_type;
           opaque                  lou_body<>;
   };

   The layoutupdate4 data type is used by the client to return updated
   layout information to the metadata server via the LAYOUTCOMMIT
   (Section 18.42) operation.  This data type provides a channel to pass
   layout type specific information (in field lou_body) back to the
   metadata server.  For example, for the block/volume layout type this
   could include the list of reserved blocks that were written.  The
   contents of the opaque lou_body argument are determined by the layout
   type.  The NFSv4.1 file-based layout does not use this data type; if
   lou_type is LAYOUT4_NFSV4_1_FILES, the lou_body field MUST have a
   zero length.

3.3.19.  layouthint4

   struct layouthint4 {
           layouttype4             loh_type;
           opaque                  loh_body<>;
   };

   The layouthint4 data type is used by the client to pass in a hint
   about the type of layout it would like created for a particular file.
   It is the data type specified by the layout_hint attribute described
   in Section 5.12.4.  The metadata server may ignore the hint, or may
   selectively ignore fields within the hint.  This hint should be
   provided at create time as part of the initial attributes within
   OPEN.  The loh_body field is specific to the type of layout
   (loh_type).  The NFSv4.1 file-based layout uses the
   nfsv4_1_file_layouthint4 data type as defined in Section 13.3.

3.3.20.  layoutiomode4

   enum layoutiomode4 {
           LAYOUTIOMODE4_READ      = 1,
           LAYOUTIOMODE4_RW        = 2,
           LAYOUTIOMODE4_ANY       = 3
   };

   The iomode specifies whether the client intends to just read or both
   read and write the data represented by the layout.  While the
   LAYOUTIOMODE4_ANY iomode MUST NOT be used in the arguments to the
   LAYOUTGET operation, it MAY be used in the arguments to the
   LAYOUTRETURN and CB_LAYOUTRECALL operations.  The LAYOUTIOMODE4_ANY
   iomode specifies that layouts pertaining to both LAYOUTIOMODE4_READ
   and LAYOUTIOMODE4_RW iomodes are being returned or recalled,
   respectively.  The metadata server's use of the iomode may depend on
   the layout type being used.  The storage devices MAY validate I/O
   accesses against the iomode and reject invalid accesses.

3.3.21.  nfs_impl_id4

   struct nfs_impl_id4 {
           utf8str_cis   nii_domain;
           utf8str_cs    nii_name;
           nfstime4      nii_date;
   };

   This data type is used to identify client and server implementation
   details.  The nii_domain field is the DNS domain name that the
   implementer is associated with.  The nii_name field is the product
   name of the implementation and is completely free form.
It is 4950 RECOMMENDED that the nii_name be used to distinguish machine 4951 architecture, machine platforms, revisions, versions, and patch 4952 levels. The nii_date field is the timestamp of when the software 4953 instance was published or built. 4954 4955 3.3.22. threshold_item4 4956 4957 struct threshold_item4 { 4958 layouttype4 thi_layout_type; 4959 bitmap4 thi_hintset; 4960 opaque thi_hintlist<>; 4961 }; 4962 4963 This data type contains a list of hints specific to a layout type for 4964 helping the client determine when it should send I/O directly through 4965 the metadata server versus the storage devices. The data type 4966 consists of the layout type (thi_layout_type), a bitmap (thi_hintset) 4967 describing the set of hints supported by the server (they may differ 4968 based on the layout type), and a list of hints (thi_hintlist), whose 4969 content is determined by the hintset bitmap. See the mdsthreshold 4970 attribute for more details. 4971 4972 The thi_hintset field is a bitmap of the following values: 4973 4974 4975 4976 4977 4978 4979 4980 4981 4982 4983 Shepler, et al. Expires February 23, 2009 [Page 89] 4984 4985 Internet-Draft NFSv4.1 August 2008 4986 4987 4988 +-------------------------+---+---------+---------------------------+ 4989 | name | # | Data | Description | 4990 | | | Type | | 4991 +-------------------------+---+---------+---------------------------+ 4992 | threshold4_read_size | 0 | length4 | The file size below which | 4993 | | | | it is RECOMMENDED to read | 4994 | | | | data through the MDS. | 4995 | threshold4_write_size | 1 | length4 | The file size below which | 4996 | | | | it is RECOMMENDED to | 4997 | | | | write data through the | 4998 | | | | MDS. 
   | threshold4_read_iosize  | 2 | length4 | For read I/O sizes below  |
   |                         |   |         | this threshold it is      |
   |                         |   |         | RECOMMENDED to read data  |
   |                         |   |         | through the MDS.          |
   | threshold4_write_iosize | 3 | length4 | For write I/O sizes below |
   |                         |   |         | this threshold it is      |
   |                         |   |         | RECOMMENDED to write data |
   |                         |   |         | through the MDS.          |
   +-------------------------+---+---------+---------------------------+

3.3.23. mdsthreshold4

   struct mdsthreshold4 {
           threshold_item4 mth_hints<>;
   };

   This data type holds an array of elements of data type
   threshold_item4, each of which is valid for a particular layout type.
   An array is necessary because a server can support multiple layout
   types for a single file.


4. Filehandles

   The filehandle in the NFS protocol is a per server unique identifier
   for a file system object. The contents of the filehandle are opaque
   to the client. Therefore, the server is responsible for translating
   the filehandle to an internal representation of the file system
   object.

4.1. Obtaining the First Filehandle

   The operations of the NFS protocol are defined in terms of one or
   more filehandles. Therefore, the client needs a filehandle to
   initiate communication with the server. With the NFSv3 protocol
   RFC1813 [21], there exists an ancillary protocol to obtain this first
   filehandle. The MOUNT protocol, RPC program number 100005, provides
   the mechanism of translating a string based file system path name to
   a filehandle which can then be used by the NFS protocols.

   The MOUNT protocol has deficiencies in the area of security and use
   via firewalls.
   This is one reason that the use of the public
   filehandle was introduced in RFC2054 [31] and RFC2055 [32]. With the
   use of the public filehandle in combination with the LOOKUP operation
   in the NFSv3 protocol, it has been demonstrated that the MOUNT
   protocol is unnecessary for viable interaction between NFS client and
   server.

   Therefore, the NFSv4.1 protocol will not use an ancillary protocol
   for translation from string based path names to a filehandle. Two
   special filehandles will be used as starting points for the NFS
   client.

4.1.1. Root Filehandle

   The first of the special filehandles is the ROOT filehandle. The
   ROOT filehandle is the "conceptual" root of the file system name
   space at the NFS server. The client uses or starts with the ROOT
   filehandle by employing the PUTROOTFH operation. The PUTROOTFH
   operation instructs the server to set the "current" filehandle to the
   ROOT of the server's file tree. Once this PUTROOTFH operation is
   used, the client can then traverse the entirety of the server's file
   tree with the LOOKUP operation. A complete discussion of the server
   name space is in Section 7.

4.1.2. Public Filehandle

   The second special filehandle is the PUBLIC filehandle. Unlike the
   ROOT filehandle, the PUBLIC filehandle may be bound to or represent
   an arbitrary file system object at the server. The server is
   responsible for this binding. It may be that the PUBLIC filehandle
   and the ROOT filehandle refer to the same file system object.
   However, it is up to the administrative software at the server and
   the policies of the server administrator to define the binding of the
   PUBLIC filehandle and server file system object. The client may not
   make any assumptions about this binding. The client uses the PUBLIC
   filehandle via the PUTPUBFH operation.

4.2.
Filehandle Types

   In the NFSv3 protocol, there was one type of filehandle with a single
   set of semantics. This type of filehandle is termed "persistent" in
   NFSv4.1. The semantics of a persistent filehandle remain the same as
   before. A new type of filehandle introduced in NFSv4.1 is the
   "volatile" filehandle, which attempts to accommodate certain server
   environments.

   The volatile filehandle type was introduced to address server
   functionality or implementation issues which make correct
   implementation of a persistent filehandle infeasible. Some server
   environments do not provide a file system level invariant that can be
   used to construct a persistent filehandle. The underlying server
   file system may not provide the invariant or the server's file system
   programming interfaces may not provide access to the needed
   invariant. Volatile filehandles may ease the implementation of
   server functionality such as hierarchical storage management or file
   system reorganization or migration. However, the volatile filehandle
   increases the implementation burden for the client.

   Since the client will need to handle persistent and volatile
   filehandles differently, a file attribute is defined which may be
   used by the client to determine the filehandle types being returned
   by the server.

4.2.1. General Properties of a Filehandle

   The filehandle contains all the information the server needs to
   distinguish an individual file. To the client, the filehandle is
   opaque. The client stores filehandles for use in a later request and
   can compare two filehandles from the same server for equality by
   doing a byte-by-byte comparison. However, the client MUST NOT
   otherwise interpret the contents of filehandles.
   If two filehandles
   from the same server are equal, they MUST refer to the same file.
   Servers SHOULD try to maintain a one-to-one correspondence between
   filehandles and files but this is not required. Clients MUST use
   filehandle comparisons only to improve performance, not for correct
   behavior. All clients need to be prepared for situations in which it
   cannot be determined whether two filehandles denote the same object
   and in such cases, avoid making invalid assumptions which might cause
   incorrect behavior. Further discussion of filehandle and attribute
   comparison in the context of data caching is presented in
   Section 10.3.4.

   As an example, in the case that two different path names when
   traversed at the server terminate at the same file system object, the
   server SHOULD return the same filehandle for each path. This can
   occur if a hard link is used to create two file names which refer to
   the same underlying file object and associated data. For example, if
   paths /a/b/c and /a/d/c refer to the same file, the server SHOULD
   return the same filehandle for both path name traversals.

4.2.2. Persistent Filehandle

   A persistent filehandle is defined as having a fixed value for the
   lifetime of the file system object to which it refers. Once the
   server creates the filehandle for a file system object, the server
   MUST accept the same filehandle for the object for the lifetime of
   the object. If the server restarts, the NFS server must honor the
   same filehandle value as it did in the server's previous
   instantiation. Similarly, if the file system is migrated, the new
   NFS server must honor the same filehandle as the old NFS server.
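
   The comparison rule from Section 4.2.1 can be sketched as follows.
   This is a minimal illustration, not part of the protocol: the
   function names same_object and can_share_cache are hypothetical.
   The point it shows is that byte-for-byte equality proves two
   filehandles from one server name the same object, while inequality
   proves nothing, so the result may only be used as an optimization.

```python
from typing import Optional


def same_object(fh_a: bytes, fh_b: bytes) -> Optional[bool]:
    """Compare two opaque filehandles obtained from the same server.

    Returns True when the handles are byte-for-byte equal, which means
    they refer to the same object.  Returns None ("cannot tell") when
    they differ, because a server is not required to keep a one-to-one
    mapping between filehandles and files.
    """
    if fh_a == fh_b:
        return True
    return None


def can_share_cache(fh_a: bytes, fh_b: bytes) -> bool:
    """Hypothetical client helper: share cached data only on a proven match."""
    return same_object(fh_a, fh_b) is True
```

   A client following this sketch never concludes that two objects are
   different from the filehandles alone; it simply skips the
   optimization when the handles differ.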

   The persistent filehandle will become stale or invalid when the
   file system object is removed. When the server is presented with a
   persistent filehandle that refers to a deleted object, it MUST return
   an error of NFS4ERR_STALE. A filehandle may become stale when the
   file system containing the object is no longer available. The file
   system may become unavailable if it exists on removable media and the
   media is no longer available at the server or the file system in
   whole has been destroyed or the file system has simply been removed
   from the server's name space (i.e. unmounted in a UNIX environment).

4.2.3. Volatile Filehandle

   A volatile filehandle does not share the same longevity
   characteristics as a persistent filehandle. The server may determine
   that a volatile filehandle is no longer valid at many different
   points in time. If the server can definitively determine that a
   volatile filehandle refers to an object that has been removed, the
   server should return NFS4ERR_STALE to the client (as is the case for
   persistent filehandles). In all other cases where the server
   determines that a volatile filehandle can no longer be used, it
   should return an error of NFS4ERR_FHEXPIRED.

   The REQUIRED attribute "fh_expire_type" is used by the client to
   determine what type of filehandle the server is providing for a
   particular file system. This attribute is a bitmask with the
   following values:

   FH4_PERSISTENT  The value of FH4_PERSISTENT is used to indicate a
      persistent filehandle, which is valid until the object is removed
      from the file system. The server will not return
      NFS4ERR_FHEXPIRED for this filehandle. FH4_PERSISTENT is defined
      as a value in which none of the bits specified below are set.

   FH4_VOLATILE_ANY  The filehandle may expire at any time, except as
      specifically excluded (i.e.
      FH4_NOEXPIRE_WITH_OPEN).

   FH4_NOEXPIRE_WITH_OPEN  May only be set when FH4_VOLATILE_ANY is set.
      If this bit is set, then the meaning of FH4_VOLATILE_ANY is
      qualified to exclude any expiration of the filehandle when it is
      open.

   FH4_VOL_MIGRATION  The filehandle will expire as a result of a file
      system transition (migration or replication), in those cases in
      which the continuity of filehandle use is not specified by
      _handle_ class information within the fs_locations_info attribute.
      When this bit is set, clients without access to fs_locations_info
      information should assume filehandles will expire on file system
      transitions.

   FH4_VOL_RENAME  The filehandle will expire during rename. This
      includes a rename by the requesting client or a rename by any
      other client. If FH4_VOLATILE_ANY is set, FH4_VOL_RENAME is
      redundant.

   Servers which provide volatile filehandles that may expire while open
   (i.e. if FH4_VOL_MIGRATION or FH4_VOL_RENAME is set or if
   FH4_VOLATILE_ANY is set and FH4_NOEXPIRE_WITH_OPEN is not set),
   should deny a RENAME or REMOVE that would affect an OPEN file or any
   of the components leading to the OPEN file. In addition, the server
   should deny all RENAME or REMOVE requests during the grace period
   upon server restart.

   Servers which provide volatile filehandles that may expire while open
   require special care as regards handling of RENAMEs and REMOVEs.
   This situation can arise if FH4_VOL_MIGRATION or FH4_VOL_RENAME is
   set, if FH4_VOLATILE_ANY is set and FH4_NOEXPIRE_WITH_OPEN is not
   set, or if a non-readonly file system has a transition target in a
   different _handle_ class.
   In these cases, the server should deny a
   RENAME or REMOVE that would affect an OPEN file or any of the
   components leading to the OPEN file. In addition, the server should
   deny all RENAME or REMOVE requests during the grace period, in order
   to make sure that reclaims of files where filehandles may have
   expired do not do a reclaim for the wrong file.

   Volatile filehandles are especially suitable for implementation of
   the pseudo file systems used to bridge exports. See Section 7.5 for
   a discussion of this.

4.3. One Method of Constructing a Volatile Filehandle

   A volatile filehandle, while opaque to the client, could contain:

      [volatile bit = 1 | server boot time | slot | generation number]

   o  slot is an index in the server volatile filehandle table

   o  generation number is the generation number for the table entry/
      slot

   When the client presents a volatile filehandle, the server makes the
   following checks, which assume that the check for the volatile bit
   has passed. If the server boot time is less than the current server
   boot time, return NFS4ERR_FHEXPIRED. If slot is out of range, return
   NFS4ERR_BADHANDLE. If the generation number does not match, return
   NFS4ERR_FHEXPIRED.

   When the server restarts, the table is gone (it is volatile).

   If volatile bit is 0, then it is a persistent filehandle with a
   different structure following it.

4.4. Client Recovery from Filehandle Expiration

   If possible, the client SHOULD recover from the receipt of an
   NFS4ERR_FHEXPIRED error. The client must take on additional
   responsibility so that it may prepare itself to recover from the
   expiration of a volatile filehandle. If the server returns
If the server returns 5291 persistent filehandles, the client does not need these additional 5292 steps. 5293 5294 For volatile filehandles, most commonly the client will need to store 5295 the component names leading up to and including the file system 5296 object in question. With these names, the client should be able to 5297 recover by finding a filehandle in the name space that is still 5298 available or by starting at the root of the server's file system name 5299 space. 5300 5301 If the expired filehandle refers to an object that has been removed 5302 from the file system, obviously the client will not be able to 5303 recover from the expired filehandle. 5304 5305 It is also possible that the expired filehandle refers to a file that 5306 has been renamed. If the file was renamed by another client, again 5307 it is possible that the original client will not be able to recover. 5308 However, in the case that the client itself is renaming the file and 5309 the file is open, it is possible that the client may be able to 5310 recover. The client can determine the new path name based on the 5311 processing of the rename request. The client can then regenerate the 5312 new filehandle based on the new path name. The client could also use 5313 the compound operation mechanism to construct a set of operations 5314 like: 5315 5316 5317 5318 5319 Shepler, et al. Expires February 23, 2009 [Page 95] 5320 5321 Internet-Draft NFSv4.1 August 2008 5322 5323 5324 RENAME A B 5325 LOOKUP B 5326 GETFH 5327 5328 Note that the COMPOUND procedure does not provide atomicity. This 5329 example only reduces the overhead of recovering from an expired 5330 filehandle. 5331 5332 5333 5. File Attributes 5334 5335 To meet the requirements of extensibility and increased 5336 interoperability with non-UNIX platforms, attributes must be handled 5337 in a flexible manner. 
   The NFSv3 fattr3 structure contains a fixed
   list of attributes that not all clients and servers are able to
   support or care about. The fattr3 structure cannot be extended as
   new needs arise and it provides no way to indicate non-support. With
   the NFSv4.1 protocol, the client is able to query what attributes the
   server supports and construct requests with only those supported
   attributes (or a subset thereof).

   To this end, attributes are divided into three groups: REQUIRED,
   RECOMMENDED, and named. Both REQUIRED and RECOMMENDED attributes are
   supported in the NFSv4.1 protocol by a specific and well-defined
   encoding and are identified by number. They are requested by setting
   a bit in the bit vector sent in the GETATTR request; the server
   response includes a bit vector to list what attributes were returned
   in the response. New REQUIRED or RECOMMENDED attributes may be added
   to the NFSv4 protocol as part of a new minor version by publishing a
   standards-track RFC which allocates a new attribute number value and
   defines the encoding for the attribute. See Section 2.7 for further
   discussion.

   Named attributes are accessed by the new OPENATTR operation, which
   accesses a hidden directory of attributes associated with a file
   system object. OPENATTR takes a filehandle for the object and
   returns the filehandle for the attribute hierarchy. The filehandle
   for the named attributes is a directory object accessible by LOOKUP
   or READDIR and contains files whose names represent the named
   attributes and whose data bytes are the value of the attribute. For
   example:

   +----------+-----------+---------------------------------+
   | LOOKUP   | "foo"     | ; look up file                  |
   | GETATTR  | attrbits  |                                 |
   | OPENATTR |           | ; access foo's named attributes |
   | LOOKUP   | "x11icon" | ; look up specific attribute    |
   | READ     | 0,4096    | ; read stream of bytes          |
   +----------+-----------+---------------------------------+

   Named attributes are intended for data needed by applications rather
   than by an NFS client implementation. NFS implementors are strongly
   encouraged to define their new attributes as RECOMMENDED attributes
   by bringing them to the IETF standards-track process.

   The set of attributes which are classified as REQUIRED is
   deliberately small since servers must do whatever it takes to support
   them. A server should support as many of the RECOMMENDED attributes
   as possible but by their definition, the server is not required to
   support all of them. Attributes are deemed REQUIRED if the data is
   both needed by a large number of clients and is not otherwise
   reasonably computable by the client when support is not provided on
   the server.

   Note that the hidden directory returned by OPENATTR is a convenience
   for protocol processing. The client should not make any assumptions
   about the server's implementation of named attributes and whether the
   underlying file system at the server has a named attribute directory
   or not. Therefore, operations such as SETATTR and GETATTR on the
   named attribute directory are undefined.

5.1. REQUIRED Attributes

   These MUST be supported by every NFSv4.1 client and server in order
   to ensure a minimum level of interoperability. The server MUST store
   and return these attributes and the client MUST be able to function
   with an attribute set limited to these attributes.
   With just the
   REQUIRED attributes some client functionality may be impaired or
   limited in some ways. A client may ask for any of these attributes
   to be returned by setting a bit in the GETATTR request and the server
   must return their value.

5.2. RECOMMENDED Attributes

   These attributes are understood well enough to warrant support in the
   NFSv4.1 protocol. However, they may not be supported on all clients
   and servers. A client may ask for any of these attributes to be
   returned by setting a bit in the GETATTR request but must handle the
   case where the server does not return them. A client may ask for the
   set of attributes the server supports and SHOULD NOT request
   attributes the server does not support. A server should be tolerant
   of requests for unsupported attributes and simply not return them
   rather than considering the request an error. It is expected that
   servers will support all attributes they comfortably can and only
   fail to support attributes which are difficult to support in their
   operating environments. A server should provide attributes whenever
   it does not have to "tell lies" to the client. For example, a file
   modification time should be either an accurate time or should not be
   supported by the server. This will not always be comfortable to
   clients but the client is better positioned to decide whether and how
   to fabricate or construct an attribute or whether to do without the
   attribute.

5.3. Named Attributes

   These attributes are not supported by direct encoding in the NFSv4
   protocol but are accessed by string names rather than numbers and
   correspond to an uninterpreted stream of bytes which are stored with
   the file system object.
   The name space for these attributes may be
   accessed by using the OPENATTR operation. The OPENATTR operation
   returns a filehandle for a virtual "named attribute directory" and
   further perusal and modification of the name space may be done using
   operations that work on more typical directories. In particular,
   READDIR may be used to get a list of such named attributes and LOOKUP
   and OPEN may select a particular attribute. Creation of a new named
   attribute may be the result of an OPEN specifying file creation.

   Once an OPEN is done, named attributes may be examined and changed by
   normal READ and WRITE operations using the filehandles and stateids
   returned by OPEN.

   Named attributes and the named attribute directory may have their own
   (non-named) attributes. Each of these objects must have all of the
   REQUIRED attributes and may have additional RECOMMENDED attributes.
   However, the set of attributes for named attributes and the named
   attribute directory need not be as large as, and typically will not
   be as large as that for other objects in that file system.

   Named attributes and the named attribute directory may be the target
   of delegations (in the case of the named attribute directory these
   will be directory delegations). However, since granting of
   delegations or not is within the server's discretion, a server need
   not support delegations on named attributes or the named attribute
   directory.

   It is RECOMMENDED that servers support arbitrary named attributes. A
   client should not depend on the ability to store any named attributes
   in the server's file system. If a server does support named
   attributes, a client which is also able to handle them should be able
   to copy a file's data and metadata with complete transparency from
   one location to another; this would imply that names allowed for
   regular directory entries are valid for named attribute names as
   well.

   In NFSv4.1, the structure of named attribute directories is
   restricted in a number of ways, in order to prevent the development
   of non-interoperable implementations in which some servers support a
   fully general hierarchical directory structure for named attributes
   while others support a limited set that is nonetheless fully adequate
   to the feature's goals. In such an environment, clients or
   applications might come to depend on non-portable extensions. The
   restrictions are:

   o  CREATE is not allowed in a named attribute directory. Thus, such
      objects as symbolic links and special files are not allowed to be
      named attributes. Further, directories may not be created in a
      named attribute directory so no hierarchical structure of named
      attributes for a single object is allowed.

   o  If OPENATTR is done on a named attribute directory or on a named
      attribute, the server MUST return NFS4ERR_WRONG_TYPE.

   o  Doing a RENAME of a named attribute to a different named attribute
      directory or to an ordinary (i.e. non-named-attribute) directory
      is not allowed.

   o  Creating hard links between named attribute directories or between
      named attribute directories and ordinary directories is not
      allowed.

   Names of attributes will not be controlled by this document or other
   IETF standards track documents. See Section 22.1 for further
   discussion.

5.4.
Classification of Attributes

   Each of the REQUIRED and RECOMMENDED attributes can be classified in
   one of three categories: per server, per file system, or per file
   system object. Note that it is possible that some per file system
   attributes may vary within the file system. See the "homogeneous"
   attribute for its definition. Note that the attributes
   time_access_set and time_modify_set are not listed in this section
   because they are write-only attributes corresponding to time_access
   and time_modify, and are used in a special instance of SETATTR.

   o  The per server attribute is:

         lease_time

   o  The per file system attributes are:

         supported_attrs, suppattr_exclcreat, fh_expire_type,
         link_support, symlink_support, unique_handles, aclsupport,
         cansettime, case_insensitive, case_preserving,
         chown_restricted, files_avail, files_free, files_total,
         fs_locations, homogeneous, maxfilesize, maxname, maxread,
         maxwrite, no_trunc, space_avail, space_free, space_total,
         time_delta, change_policy, fs_status, fs_layout_type,
         fs_locations_info, fs_charset_cap

   o  The per file system object attributes are:

         type, change, size, named_attr, fsid, rdattr_error, filehandle,
         acl, archive, fileid, hidden, maxlink, mimetype, mode,
         numlinks, owner, owner_group, rawdev, space_used, system,
         time_access, time_backup, time_create, time_metadata,
         time_modify, mounted_on_fileid, dir_notif_delay,
         dirent_notif_delay, dacl, sacl, layout_type, layout_hint,
         layout_blksize, layout_alignment, mdsthreshold, retention_get,
         retention_set, retentevt_get, retentevt_set, retention_hold,
         mode_set_masked

   For quota_avail_hard, quota_avail_soft, and quota_used see their
   definitions below for the
   appropriate classification.

5.5. Set-Only and Get-Only Attributes

   Some REQUIRED and RECOMMENDED attributes are set-only, i.e. they can
   be set via SETATTR but not retrieved via GETATTR. Similarly, some
   REQUIRED and RECOMMENDED attributes are get-only, i.e. they can be
   retrieved via GETATTR but not set via SETATTR. If a client attempts
   to set a get-only attribute or get a set-only attribute, the server
   MUST return NFS4ERR_INVAL.

5.6. REQUIRED Attributes - List and Definition References

   The list of REQUIRED attributes appears in Table 2. The meanings of
   the columns of the table are:

   o  Name: the name of the attribute.

   o  Id: the number assigned to the attribute. In the event of
      conflicts between the assigned number and [12], the latter is
      authoritative.

   o  Data Type: The XDR data type of the attribute.

   o  Acc: Access allowed to the attribute. R means read-only (GETATTR
      may retrieve, SETATTR may not set). W means write-only (SETATTR
      may set, GETATTR may not retrieve). R W means read/write (GETATTR
      may retrieve, SETATTR may set).

   o  Defined in: the section of this specification that describes the
      attribute.
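
   As a rough illustration of how the Acc column interacts with the
   rule in Section 5.5, the sketch below checks a GETATTR or SETATTR
   request against a small access map. The helper names check_getattr
   and check_setattr and the ATTR_ACCESS dictionary are assumptions
   made for this example; only the Acc values (taken from Tables 2 and
   3) and the numeric status codes come from the protocol.

```python
NFS4_OK = 0
NFS4ERR_INVAL = 22

# Acc values for a handful of attributes, copied from Tables 2 and 3.
# A real server would carry the complete tables; this subset is illustrative.
ATTR_ACCESS = {
    "supported_attrs": "R",
    "size": "R W",
    "mode": "R W",
    "layout_hint": "W",
    "time_modify_set": "W",
}


def check_getattr(name: str) -> int:
    """Return NFS4_OK if GETATTR may retrieve the attribute, else NFS4ERR_INVAL."""
    return NFS4_OK if "R" in ATTR_ACCESS[name].split() else NFS4ERR_INVAL


def check_setattr(name: str) -> int:
    """Return NFS4_OK if SETATTR may set the attribute, else NFS4ERR_INVAL."""
    return NFS4_OK if "W" in ATTR_ACCESS[name].split() else NFS4ERR_INVAL
```

   Under this sketch, retrieving layout_hint or setting supported_attrs
   yields NFS4ERR_INVAL, while size may be both retrieved and set.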

   +--------------------+----+------------+-----+------------------+
   | Name               | Id | Data Type  | Acc | Defined in:      |
   +--------------------+----+------------+-----+------------------+
   | supported_attrs    | 0  | bitmap4    | R   | Section 5.8.1.1  |
   | type               | 1  | nfs_ftype4 | R   | Section 5.8.1.2  |
   | fh_expire_type     | 2  | uint32_t   | R   | Section 5.8.1.3  |
   | change             | 3  | uint64_t   | R   | Section 5.8.1.4  |
   | size               | 4  | uint64_t   | R W | Section 5.8.1.5  |
   | link_support       | 5  | bool       | R   | Section 5.8.1.6  |
   | symlink_support    | 6  | bool       | R   | Section 5.8.1.7  |
   | named_attr         | 7  | bool       | R   | Section 5.8.1.8  |
   | fsid               | 8  | fsid4      | R   | Section 5.8.1.9  |
   | unique_handles     | 9  | bool       | R   | Section 5.8.1.10 |
   | lease_time         | 10 | nfs_lease4 | R   | Section 5.8.1.11 |
   | rdattr_error       | 11 | enum       | R   | Section 5.8.1.12 |
   | filehandle         | 19 | nfs_fh4    | R   | Section 5.8.1.13 |
   | suppattr_exclcreat | 75 | bitmap4    | R   | Section 5.8.1.14 |
   +--------------------+----+------------+-----+------------------+

                                  Table 2

5.7. RECOMMENDED Attributes - List and Definition References

   The RECOMMENDED attributes are defined in Table 3. The meanings of
   the column headers are the same as Table 2; see Section 5.6 for the
   meanings.

   +--------------------+----+----------------+-----+------------------+
   | Name               | Id | Data Type      | Acc | Defined in:      |
   +--------------------+----+----------------+-----+------------------+
   | acl                | 12 | nfsace4<>      | R W | Section 6.2.1    |
   | aclsupport         | 13 | uint32_t       | R   | Section 6.2.1.2  |
   | archive            | 14 | bool           | R W | Section 5.8.2.1  |
   | cansettime         | 15 | bool           | R   | Section 5.8.2.2  |
   | case_insensitive   | 16 | bool           | R   | Section 5.8.2.3  |
   | case_preserving    | 17 | bool           | R   | Section 5.8.2.4  |
   | change_policy      | 60 | chg_policy4    | R   | Section 5.8.2.5  |
   | chown_restricted   | 18 | bool           | R   | Section 5.8.2.6  |
   | dacl               | 58 | nfsacl41       | R W | Section 6.2.2    |
   | dir_notif_delay    | 56 | nfstime4       | R   | Section 5.11.1   |
   | dirent_notif_delay | 57 | nfstime4       | R   | Section 5.11.2   |
   | fileid             | 20 | uint64_t       | R   | Section 5.8.2.7  |
   | files_avail        | 21 | uint64_t       | R   | Section 5.8.2.8  |
   | files_free         | 22 | uint64_t       | R   | Section 5.8.2.9  |
   | files_total        | 23 | uint64_t       | R   | Section 5.8.2.10 |
   | fs_charset_cap     | 76 | uint32_t       | R   | Section 5.8.2.11 |
   | fs_layout_type     | 62 | layouttype4<>  | R   | Section 5.12.1   |
   | fs_locations       | 24 | fs_locations   | R   | Section 5.8.2.12 |
   | fs_locations_info  | 67 | *              | R   | Section 5.8.2.13 |
   | fs_status          | 61 | fs4_status     | R   | Section 5.8.2.14 |
   | hidden             | 25 | bool           | R W | Section 5.8.2.15 |
   | homogeneous        | 26 | bool           | R   | Section 5.8.2.16 |
   | layout_alignment   | 66 | uint32_t       | R   | Section 5.12.2   |
   | layout_blksize     | 65 | uint32_t       | R   | Section 5.12.3   |
   | layout_hint        | 63 | layouthint4    | W   | Section 5.12.4   |
   | layout_type        | 64 | layouttype4<>  | R   | Section 5.12.5   |
   | maxfilesize        | 27 | uint64_t       | R   | Section 5.8.2.17 |
   | maxlink            | 28 | uint32_t       | R   | Section 5.8.2.18 |
   | maxname            | 29 | uint32_t       | R   | Section 5.8.2.19 |
   | maxread            | 30 | uint64_t       | R   | Section 5.8.2.20 |
   | maxwrite           | 31 | uint64_t       | R   | Section 5.8.2.21 |
   | mdsthreshold       | 68 | mdsthreshold4  | R   | Section 5.12.6   |
   | mimetype           | 32 | utf8<>         | R W | Section 5.8.2.22 |
   | mode               | 33 | mode4          | R W | Section 6.2.4    |
   | mode_set_masked    | 74 | mode_masked4   | W   | Section 6.2.5    |
   | mounted_on_fileid  | 55 | uint64_t       | R   | Section 5.8.2.23 |
   | no_trunc           | 34 | bool           | R   | Section 5.8.2.24 |
   | numlinks           | 35 | uint32_t       | R   | Section 5.8.2.25 |
owner | 36 | utf8<> | R W | Section 5.8.2.26 | 5693 | owner_group | 37 | utf8<> | R W | Section 5.8.2.27 | 5694 | quota_avail_hard | 38 | uint64_t | R | Section 5.8.2.28 | 5695 | quota_avail_soft | 39 | uint64_t | R | Section 5.8.2.29 | 5696 | quota_used | 40 | uint64_t | R | Section 5.8.2.30 | 5697 | rawdev | 41 | specdata4 | R | Section 5.8.2.31 | 5698 | retentevt_get | 71 | retention_get4 | R | Section 5.13.3 | 5699 | retentevt_set | 72 | retention_set4 | W | Section 5.13.4 | 5700 | retention_get | 69 | retention_get4 | R | Section 5.13.1 | 5701 | retention_hold | 73 | uint64_t | R W | Section 5.13.5 | 5702 | retention_set | 70 | retention_set4 | W | Section 5.13.2 | 5703 | sacl | 59 | nfsacl41 | R W | Section 6.2.3 | 5704 | space_avail | 42 | uint64_t | R | Section 5.8.2.32 | 5705 | space_free | 43 | uint64_t | R | Section 5.8.2.33 | 5706 | space_total | 44 | uint64_t | R | Section 5.8.2.34 | 5707 | space_used | 45 | uint64_t | R | Section 5.8.2.35 | 5708 5709 5710 5711 Shepler, et al. Expires February 23, 2009 [Page 102] 5712 5713 Internet-Draft NFSv4.1 August 2008 5714 5715 5716 | system | 46 | bool | R W | Section 5.8.2.36 | 5717 | time_access | 47 | nfstime4 | R | Section 5.8.2.37 | 5718 | time_access_set | 48 | settime4 | W | Section 5.8.2.38 | 5719 | time_backup | 49 | nfstime4 | R W | Section 5.8.2.39 | 5720 | time_create | 50 | nfstime4 | R W | Section 5.8.2.40 | 5721 | time_delta | 51 | nfstime4 | R | Section 5.8.2.41 | 5722 | time_metadata | 52 | nfstime4 | R | Section 5.8.2.42 | 5723 | time_modify | 53 | nfstime4 | R | Section 5.8.2.43 | 5724 | time_modify_set | 54 | settime4 | W | Section 5.8.2.44 | 5725 +--------------------+----+----------------+-----+------------------+ 5726 5727 Table 3 5728 5729 * fs_locations_info4 5730 5731 5.8. Attribute Definitions 5732 5733 5.8.1. Definitions of REQUIRED Attributes 5734 5735 5.8.1.1. 
Attribute 0: supported_attrs 5736 5737 The bit vector which would retrieve all REQUIRED and RECOMMENDED 5738 attributes that are supported for this object. The scope of this 5739 attribute applies to all objects with a matching fsid. 5740 5741 5.8.1.2. Attribute 1: type 5742 5743 Designates the type of an object in terms of one of a number of 5744 special constants: 5745 5746 o NF4REG designates a regular file. 5747 5748 o NF4DIR designates a directory. 5749 5750 o NF4BLK designates a block device special file. 5751 5752 o NF4CHR designates a character device special file. 5753 5754 o NF4LNK designates a symbolic link. 5755 5756 o NF4SOCK designates a named socket special file. 5757 5758 o NF4FIFO designates a fifo special file. 5759 5760 o NF4ATTRDIR designates a named attribute directory. 5761 5762 o NF4NAMEDATTR designates a named attribute. 5763 5764 5765 5766 5767 Shepler, et al. Expires February 23, 2009 [Page 103] 5768 5769 Internet-Draft NFSv4.1 August 2008 5770 5771 5772 Within the explanatory text and operation descriptions, the following 5773 phrases will be used with the meanings given below: 5774 5775 o The phrase "is a directory" means that the object is of type 5776 NF4DIR or of type NF4ATTRDIR. 5777 5778 o The phrase "is a special file" means that the object is of one of 5779 the types NF4BLK, NF4CHR, NF4SOCK, or NF4FIFO. 5780 5781 o The phrase "is an ordinary file" means that the object is of type 5782 NF4REG or of type NF4NAMEDATTR. 5783 5784 5.8.1.3. Attribute 2: fh_expire_type 5785 5786 Server uses this to specify filehandle expiration behavior to the 5787 client. See Section 4 for additional description. 5788 5789 5.8.1.4. Attribute 3: change 5790 5791 A value created by the server that the client can use to determine if 5792 file data, directory contents or attributes of the object have been 5793 modified. 
   The server may return the object's time_metadata attribute for this
   attribute's value but only if the file system object cannot be
   updated more frequently than the resolution of time_metadata.

5.8.1.5. Attribute 4: size

   The size of the object in bytes.

5.8.1.6. Attribute 5: link_support

   True, if the object's file system supports hard links.

5.8.1.7. Attribute 6: symlink_support

   True, if the object's file system supports symbolic links.

5.8.1.8. Attribute 7: named_attr

   True, if this object has named attributes. In other words, the
   object has a non-empty named attribute directory.

5.8.1.9. Attribute 8: fsid

   Unique file system identifier for the file system holding this
   object. fsid contains major and minor components, each of which is
   of data type uint64_t.

5.8.1.10. Attribute 9: unique_handles

   True, if two distinct filehandles are guaranteed to refer to two
   different file system objects.

5.8.1.11. Attribute 10: lease_time

   Duration of leases at server in seconds.

5.8.1.12. Attribute 11: rdattr_error

   Error returned from an attempt to retrieve attributes during a
   READDIR operation.

5.8.1.13. Attribute 19: filehandle

   The filehandle of this object (primarily for READDIR requests).

5.8.1.14. Attribute 75: suppattr_exclcreat

   The bit vector which would set all REQUIRED and RECOMMENDED
   attributes that are supported by the EXCLUSIVE4_1 method of file
   creation via the OPEN operation. The scope of this attribute applies
   to all objects with a matching fsid.

5.8.2. Definitions of Uncategorized RECOMMENDED Attributes

   The definitions of most of the RECOMMENDED attributes follow.
   Collections that share a common category are defined in other
   sections.

5.8.2.1. Attribute 14: archive

   True, if this file has been archived since the time of last
   modification (deprecated in favor of time_backup).

5.8.2.2. Attribute 15: cansettime

   True, if the server is able to change the times for a file system
   object as specified in a SETATTR operation.

5.8.2.3. Attribute 16: case_insensitive

   True, if file name comparisons on this file system are case
   insensitive.

5.8.2.4. Attribute 17: case_preserving

   True, if file name case on this file system is preserved.

5.8.2.5. Attribute 60: change_policy

   A value created by the server that the client can use to determine if
   some server policy related to the current file system has been
   subject to change. If the value remains the same then the client can
   be sure that the values of the attributes related to fs location and
   the fss_type field of the fs_status attribute have not changed. On
   the other hand, a change in this value does not necessarily imply a
   change in policy. It is up to the client to interrogate the server
   to determine if some policy relevant to it has changed. See
   Section 3.3.6 for details.

   This attribute MUST change when the value returned by the
   fs_locations or fs_locations_info attribute changes, when a file
   system goes from read-only to writable or vice versa, or when the
   allowable set of security flavors for the file system or any part
   thereof is changed.

5.8.2.6. Attribute 18: chown_restricted

   If TRUE, the server will reject any request to change either the
   owner or the group associated with a file if the caller is not a
   privileged user (for example, "root" in UNIX operating environments
   or, in Windows 2000, the "Take Ownership" privilege).

5.8.2.7. Attribute 20: fileid

   A number uniquely identifying the file within the file system.

5.8.2.8. Attribute 21: files_avail

   File slots available to this user on the file system containing this
   object - this should be the smallest relevant limit.

5.8.2.9. Attribute 22: files_free

   Free file slots on the file system containing this object - this
   should be the smallest relevant limit.

5.8.2.10. Attribute 23: files_total

   Total file slots on the file system containing this object.

5.8.2.11. Attribute 76: fs_charset_cap

   Character set capabilities for this file system. See Section 14.4.

5.8.2.12. Attribute 24: fs_locations

   Locations where this file system may be found. If the server returns
   NFS4ERR_MOVED as an error, this attribute MUST be supported.

5.8.2.13. Attribute 67: fs_locations_info

   Full function file system location.

5.8.2.14. Attribute 61: fs_status

   Generic file system type information.

5.8.2.15. Attribute 25: hidden

   True, if the file is considered hidden with respect to the Windows
   API.

5.8.2.16. Attribute 26: homogeneous

   True, if this object's file system is homogeneous, i.e., the per-
   file-system attributes are the same for all of the file system's
   objects.

5.8.2.17. Attribute 27: maxfilesize

   Maximum supported file size for the file system of this object.

5.8.2.18. Attribute 28: maxlink

   Maximum number of links for this object.

5.8.2.19. Attribute 29: maxname

   Maximum file name size supported for this object.

5.8.2.20. Attribute 30: maxread

   Maximum read size supported for this object.

5.8.2.21. Attribute 31: maxwrite

   Maximum write size supported for this object. This attribute SHOULD
   be supported if the file is writable. Lack of this attribute can
   lead to the client either wasting bandwidth or not receiving the best
   performance.

5.8.2.22. Attribute 32: mimetype

   MIME body type/subtype of this object.

5.8.2.23. Attribute 55: mounted_on_fileid

   Like fileid, but if the target filehandle is the root of a file
   system, this attribute represents the fileid of the underlying
   directory.

   UNIX-based operating environments connect a file system into the
   namespace by connecting (mounting) the file system onto the existing
   file object (the mount point, usually a directory) of an existing
   file system. When the mount point's parent directory is read via an
   API like readdir(), the return results are directory entries, each
   with a component name and a fileid. The fileid of the mount point's
   directory entry will be different from the fileid that the stat()
   system call returns. The stat() system call is returning the fileid
   of the root of the mounted file system, whereas readdir() is
   returning the fileid stat() would have returned before any file
   systems were mounted on the mount point.

   Unlike NFSv3, NFSv4.1 allows a client's LOOKUP request to cross other
   file systems.
   The client detects the file system crossing whenever the filehandle
   argument of LOOKUP has an fsid attribute different from that of the
   filehandle returned by LOOKUP. A UNIX-based client will consider
   this a "mount point crossing". UNIX has a legacy scheme for allowing
   a process to determine its current working directory. This relies on
   readdir() of a mount point's parent and stat() of the mount point
   returning fileids as previously described. The mounted_on_fileid
   attribute corresponds to the fileid that readdir() would have
   returned as described previously.

   While the NFSv4.1 client could simply fabricate a fileid
   corresponding to what mounted_on_fileid provides (and if the server
   does not support mounted_on_fileid, the client has no choice), there
   is a risk that the client will generate a fileid that conflicts with
   one that is already assigned to another object in the file system.
   Instead, if the server can provide the mounted_on_fileid, the
   potential for client operational problems in this area is eliminated.

   If the server detects that there is no mount point at the target
   file object, then the value for mounted_on_fileid that it returns is
   the same as that of the fileid attribute.

   The mounted_on_fileid attribute is RECOMMENDED, so the server SHOULD
   provide it if possible, and for a UNIX-based server, this is
   straightforward. Usually, mounted_on_fileid will be requested during
   a READDIR operation, in which case it is trivial (at least for
   UNIX-based servers) to return mounted_on_fileid since it is equal to
   the fileid of a directory entry returned by readdir().
   If mounted_on_fileid is requested in a GETATTR operation, the server
   should obey an invariant that has it returning a value that is equal
   to the file object's entry in the object's parent directory, i.e.,
   what readdir() would have returned. Some operating environments
   allow a series of two or more file systems to be mounted onto a
   single mount point. In this case, for the server to obey the
   aforementioned invariant, it will need to find the base mount point,
   and not the intermediate mount points.

5.8.2.24. Attribute 34: no_trunc

   If this attribute is TRUE, then if the client uses a file name longer
   than name_max, an error will be returned instead of the name being
   truncated.

5.8.2.25. Attribute 35: numlinks

   Number of hard links to this object.

5.8.2.26. Attribute 36: owner

   The string name of the owner of this object.

5.8.2.27. Attribute 37: owner_group

   The string name of the group ownership of this object.

5.8.2.28. Attribute 38: quota_avail_hard

   The value in bytes which represents the amount of additional disk
   space beyond the current allocation that can be allocated to this
   file or directory before further allocations will be refused. It is
   understood that this space may be consumed by allocations to other
   files or directories.

5.8.2.29. Attribute 39: quota_avail_soft

   The value in bytes which represents the amount of additional disk
   space that can be allocated to this file or directory before the user
   may reasonably be warned. It is understood that this space may be
   consumed by allocations to other files or directories though there is
   a rule as to which other files or directories.

5.8.2.30. Attribute 40: quota_used

   The value in bytes which represents the amount of disk space used by
   this file or directory and possibly a number of other similar files
   or directories, where the set of "similar" meets at least the
   criterion that allocating space to any file or directory in the set
   will reduce the "quota_avail_hard" of every other file or directory
   in the set.

   Note that there may be a number of distinct but overlapping sets of
   files or directories for which a quota_used value is maintained,
   e.g., "all files with a given owner", "all files with a given group
   owner", etc.

   The server is at liberty to choose any of those sets but should do so
   in a repeatable way. The rule may be configured per file system or
   may be "choose the set with the smallest quota".

5.8.2.31. Attribute 41: rawdev

   Raw device identifier; the UNIX device major/minor node information.
   If the value of type is not NF4BLK or NF4CHR, the value returned
   SHOULD NOT be considered useful.

5.8.2.32. Attribute 42: space_avail

   Disk space in bytes available to this user on the file system
   containing this object - this should be the smallest relevant limit.

5.8.2.33. Attribute 43: space_free

   Free disk space in bytes on the file system containing this object -
   this should be the smallest relevant limit.

5.8.2.34. Attribute 44: space_total

   Total disk space in bytes on the file system containing this object.

5.8.2.35. Attribute 45: space_used

   Number of file system bytes allocated to this object.

5.8.2.36. Attribute 46: system

   This attribute is TRUE if this file is a "system" file with respect
   to the Windows operating environment.

5.8.2.37. Attribute 47: time_access

   The time_access attribute represents the time of last access to the
   object by a read that was satisfied by the server. The notion of
   what is an "access" depends on the server's operating environment
   and/or the server's file system semantics. For example, for servers
   obeying POSIX semantics, time_access would be updated only by the
   READLINK, READ, and READDIR operations and not any of the operations
   that modify the content of the object. Of course, setting the
   corresponding time_access_set attribute is another way to modify the
   time_access attribute.

   Whenever the file object resides on a writable file system, the
   server should make best efforts to record time_access into stable
   storage. However, to mitigate the performance effects of doing so,
   and most especially whenever the server is satisfying the read of the
   object's content from its cache, the server MAY cache access time
   updates and lazily write them to stable storage. It is also
   acceptable to give administrators of the server the option to disable
   time_access updates.

5.8.2.38. Attribute 48: time_access_set

   Set the time of last access to the object. SETATTR use only.

5.8.2.39. Attribute 49: time_backup

   The time of last backup of the object.

5.8.2.40. Attribute 50: time_create

   The time of creation of the object. This attribute does not have any
   relation to the traditional UNIX file attribute "ctime" or "change
   time".

5.8.2.41. Attribute 51: time_delta

   Smallest useful server time granularity.

5.8.2.42. Attribute 52: time_metadata

   The time of last metadata modification of the object.

5.8.2.43. Attribute 53: time_modify

   The time of last modification to the object.

5.8.2.44. Attribute 54: time_modify_set

   Set the time of last modification to the object. SETATTR use only.

5.9. Interpreting owner and owner_group

   The RECOMMENDED attributes "owner" and "owner_group" (and also users
   and groups within the "acl" attribute) are represented in terms of a
   UTF-8 string. To avoid a representation that is tied to a particular
   underlying implementation at the client or server, the use of the
   UTF-8 string has been chosen. Note that section 6.1 of RFC2624 [33]
   provides additional rationale. It is expected that the client and
   server will have their own local representation of owner and
   owner_group that is used for local storage or presentation to the end
   user. Therefore, it is expected that when these attributes are
   transferred between the client and server, the local representation
   is translated to a syntax of the form "user@dns_domain". This allows
   a client and server that do not use the same local representation to
   translate to a common syntax that can be interpreted by both.

   Similarly, security principals may be represented in different ways
   by different security mechanisms. Servers normally translate these
   representations into a common format, generally that used by local
   storage, to serve as a means of identifying the users corresponding
   to these security principals. When these local identifiers are
   translated to the form of the owner attribute, associated with files
   created by such principals, they identify, in a common format, the
   users associated with each corresponding set of security principals.

   The translation used to interpret owner and group strings is not
   specified as part of the protocol. This allows various solutions to
   be employed.
   For example, a local translation table may be consulted that maps
   between a numeric identifier and the user@dns_domain syntax. A name
   service may also be used to accomplish the translation. A server may
   provide a more general service, not limited by any particular
   translation (which would only translate a limited set of possible
   strings) by storing the owner and owner_group attributes in local
   storage without any translation or it may augment a translation
   method by storing the entire string for attributes for which no
   translation is available while using the local representation for
   those cases in which a translation is available.

   Servers that do not provide support for all possible values of the
   owner and owner_group attributes SHOULD return an error
   (NFS4ERR_BADOWNER) when a string is presented that has no
   translation, as the value to be set for a SETATTR of the owner,
   owner_group, or acl attributes. When a server does accept an owner
   or owner_group value as valid on a SETATTR (and similarly for the
   owner and group strings in an acl), it is promising to return that
   same string when a corresponding GETATTR is done. Configuration
   changes (including changes to the mapping of the string to the
   local representation) and ill-constructed name translations (those
   that contain aliasing) may make that promise impossible to honor.
   Servers should make appropriate efforts to avoid a situation in which
   these attributes have their values changed when no real change to
   ownership has occurred.

   The "dns_domain" portion of the owner string is meant to be a DNS
   domain name, for example, user@ietf.org. Servers should accept as
   valid a set of users for at least one domain.
   A server may treat other domains as having no valid translations. A
   more general service is provided when a server is capable of
   accepting users for multiple domains, or for all domains, subject to
   security constraints.

   In the case where there is no translation available to the client or
   server, the attribute value must be constructed without the "@".
   Therefore, the absence of the @ from the owner or owner_group
   attribute signifies that no translation was available at the sender
   and that the receiver of the attribute should not use that string as
   a basis for translation into its own internal format. Even though
   the attribute value cannot be translated, it may still be useful.
   In the case of a client, the attribute string may be used for local
   display of ownership.

   To provide a greater degree of compatibility with NFSv3, which
   identified users and groups by 32-bit unsigned user identifiers and
   group identifiers, owner and group strings that consist of decimal
   numeric values with no leading zeros can be given a special
   interpretation by clients and servers which choose to provide such
   support. The receiver may treat such a user or group string as
   representing the same user as would be represented by an NFSv3 uid or
   gid having the corresponding numeric value. A server is not
   obligated to accept such a string, but may return an NFS4ERR_BADOWNER
   instead. To avoid this mechanism being used to subvert user and
   group translation, so that a client might pass all of the owners and
   groups in numeric form, a server SHOULD return an NFS4ERR_BADOWNER
   error when there is a valid translation for the user or group
   designated in this way. In that case, the client must use the
   appropriate name@domain string and not the special form for
   compatibility.
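
   The receiver-side rules above can be sketched roughly as follows.
   This is only an illustration in Python, not part of the protocol:
   the names OwnerString, classify_owner, and accept_numeric are
   hypothetical, and two details are assumptions made for the sketch,
   namely that the string is split at the last "@" and that the single
   digit "0" counts as a numeric value with no leading zeros. Whether
   a server then accepts a numeric string or returns NFS4ERR_BADOWNER
   remains a policy decision that the sketch does not model.

```python
from dataclasses import dataclass
from typing import Optional

UINT32_MAX = 0xFFFFFFFF  # NFSv3 uids and gids are 32-bit unsigned values


@dataclass(frozen=True)
class OwnerString:
    """Hypothetical result type; not defined by the protocol."""
    kind: str                         # "name@domain", "numeric", "untranslated"
    name: Optional[str] = None
    domain: Optional[str] = None
    numeric_id: Optional[int] = None


def classify_owner(value: str, accept_numeric: bool = True) -> OwnerString:
    """Classify an owner or owner_group string as a receiver might."""
    if "@" in value:
        # Translated form. Assumption: the domain follows the last "@".
        name, _, domain = value.rpartition("@")
        return OwnerString("name@domain", name=name, domain=domain)

    # No "@": the sender had no translation available. A receiver that
    # chooses to support NFSv3 compatibility may still interpret a plain
    # decimal string without leading zeros as a uid or gid.
    is_plain_decimal = value.isascii() and value.isdigit()
    has_leading_zero = len(value) > 1 and value.startswith("0")
    if accept_numeric and is_plain_decimal and not has_leading_zero:
        number = int(value)
        if number <= UINT32_MAX:
            return OwnerString("numeric", numeric_id=number)

    # Anything else is not a basis for translation; it may still be
    # shown to the user as-is (this includes the string "nobody").
    return OwnerString("untranslated", name=value)
```

   A receiver that does not offer the NFSv3 compatibility form would
   call classify_owner with accept_numeric set to False, in which case
   every string without an "@" is treated as untranslated.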

   The owner string "nobody" may be used to designate an anonymous user,
   which will be associated with a file created by a security principal
   that cannot be mapped through normal means to the owner attribute.

5.10. Character Case Attributes

   With respect to the case_insensitive and case_preserving attributes,
   each UCS-4 character (which UTF-8 encodes) has a "long descriptive
   name" RFC1345 [34] which may or may not include the word "CAPITAL" or
   "SMALL". The presence of SMALL or CAPITAL allows an NFS server to
   implement unambiguous and efficient table driven mappings for case
   insensitive comparisons, and non-case-preserving storage. For
   general character handling and internationalization issues, see
   Section 14.

5.11. Directory Notification Attributes

   As described in Section 18.39, the client can request a minimum delay
   for notifications of changes to attributes, but the server is free to
   ignore what the client requests. The client can determine in advance
   what notification delays the server will accept by issuing a GETATTR
   for either or both of two directory notification attributes. When
   the client calls the GET_DIR_DELEGATION operation and asks for
   attribute change notifications, it should request notification delays
   that are no less than the values in the server-provided attributes.

5.11.1. Attribute 56: dir_notif_delay

   The dir_notif_delay attribute is the minimum number of seconds the
   server will delay before notifying the client of a change to the
   directory's attributes.

5.11.2. Attribute 57: dirent_notif_delay

   The dirent_notif_delay attribute is the minimum number of seconds the
   server will delay before notifying the client of a change to a file
   object that has an entry in the directory.

5.12. pNFS Attribute Definitions

5.12.1. Attribute 62: fs_layout_type

   The fs_layout_type attribute (see Section 3.3.13) applies to a file
   system and indicates what layout types are supported by the file
   system. When the client encounters a new fsid, the client SHOULD
   obtain the value for the fs_layout_type attribute associated with the
   new file system. This attribute is used by the client to determine
   if the layout types supported by the server match any of the client's
   supported layout types.

5.12.2. Attribute 66: layout_alignment

   When a client holds layouts on files of a file system, the
   layout_alignment attribute indicates the preferred alignment for I/O
   to files on that file system. Where possible, the client should send
   READ and WRITE operations with offsets that are whole multiples of
   the layout_alignment attribute.

5.12.3. Attribute 65: layout_blksize

   When a client holds layouts on files of a file system, the
   layout_blksize attribute indicates the preferred block size for I/O
   to files on that file system. Where possible, the client should send
   READ operations with a count argument that is a whole multiple of
   layout_blksize, and WRITE operations with a data argument of size
   that is a whole multiple of layout_blksize.

5.12.4. Attribute 63: layout_hint

   The layout_hint attribute (see Section 3.3.19) may be set on newly
   created files to influence the metadata server's choice for the
   file's layout.
   If possible, this attribute is one of those set in the initial
   attributes within the OPEN operation. The metadata server may choose
   to ignore this attribute. The layout_hint attribute is a subset of
   the layout structure returned by LAYOUTGET. For example, instead of
   specifying particular devices, this would be used to suggest the
   stripe width of a file. The server implementation determines which
   fields within the layout will be used.

5.12.5. Attribute 64: layout_type

   This attribute lists the layout type(s) available for a file. The
   value returned by the server is for informational purposes only. The
   client will use the LAYOUTGET operation to obtain the information
   needed in order to perform I/O, for example, the specific device
   information for the file and its layout.

5.12.6. Attribute 68: mdsthreshold

   This attribute is a server-provided hint used to communicate to the
   client when it is more efficient to send READ and WRITE operations to
   the metadata server or the data server. The two types of thresholds
   described are file size thresholds and I/O size thresholds. If a
   file's size is smaller than the file size threshold, data accesses
   SHOULD be sent to the metadata server. If an I/O request has a
   length that is below the I/O size threshold, the I/O SHOULD be sent
   to the metadata server. Each threshold type is specified separately
   for READ and WRITE.

   The server MAY provide both types of thresholds for a file. If both
   file size and I/O size are provided, the client SHOULD reach or
   exceed both thresholds before issuing its READ or WRITE requests to
   the data server.
   Alternatively, if only one of the specified thresholds is reached or
   exceeded, the I/O requests are sent to the metadata server.

   For each threshold type, a value of 0 indicates no READ or WRITE
   should be sent to the metadata server, while a value of all 1s
   indicates all READs or WRITEs should be sent to the metadata server.

   The attribute is available on a per filehandle basis. If the current
   filehandle refers to a non-pNFS file or directory, the metadata
   server should return an attribute that is representative of the
   filehandle's file system. It is suggested that this attribute be
   queried as part of the OPEN operation. Due to dynamic system
   changes, the client should not assume that the attribute will remain
   constant for any specific time period, thus it should be periodically
   refreshed.

5.13. Retention Attributes

   Retention is a concept whereby a file object can be placed in an
   immutable, undeletable, unrenamable state for a fixed or infinite
   duration of time. Once in this "retained" state, the file cannot be
   moved out of the state until the duration of retention has been
   reached.

   When retention is enabled, retention MUST extend to the data of the
   file, and the name of the file. The server MAY extend retention to
   any other property of the file, including any subset of REQUIRED,
   RECOMMENDED, and named attributes, with the exceptions noted in this
   section.

   Servers MAY support or not support retention on any file object type.

   The five retention attributes are explained in the next subsections.

5.13.1. Attribute 69: retention_get

   If retention is enabled for the associated file, this attribute's
   value represents the retention begin time of the file object.
   This attribute's value is only readable with the GETATTR operation
   and MUST NOT be modified by the SETATTR operation (Section 5.5).
   The value of the attribute consists of:

   const RET4_DURATION_INFINITE    = 0xffffffffffffffff;
   struct retention_get4 {
           uint64_t        rg_duration;
           nfstime4        rg_begin_time<1>;
   };

   The field rg_duration is the duration in seconds indicating how
   long the file will be retained once retention is enabled. The field
   rg_begin_time is an array of up to one absolute time value. If the
   array is zero length, no beginning retention time has been
   established, and retention is not enabled. If rg_duration is equal
   to RET4_DURATION_INFINITE, the file, once retention is enabled,
   will be retained for an infinite duration.

   If (as soon as) rg_duration is zero, then rg_begin_time will be of
   zero length, and again, retention is not (no longer) enabled.

5.13.2. Attribute 70: retention_set

   This attribute is used to set the retention duration and optionally
   enable retention for the associated file object. This attribute is
   only modifiable via the SETATTR operation and MUST NOT be retrieved
   by the GETATTR operation (Section 5.5). This attribute corresponds
   to retention_get. The value of the attribute consists of:

   struct retention_set4 {
           bool            rs_enable;
           uint64_t        rs_duration<1>;
   };

   If the client sets rs_enable to TRUE, then it is enabling retention
   on the file object with the begin time of retention starting from
   the server's current time and date. The duration of the retention
   can also be provided if the rs_duration array is of length one.
   The duration is the time in seconds from the begin time of
   retention, and if set to RET4_DURATION_INFINITE, the file is to be
   retained forever. If retention is enabled, with no duration
   specified in either this SETATTR or a previous SETATTR, the
   duration defaults to zero seconds. The server MAY restrict the
   enabling of retention or the duration of retention on the basis of
   the ACE4_WRITE_RETENTION ACL permission. The enabling of retention
   MUST NOT prevent the enabling of event-based retention nor the
   modification of the retention_hold attribute.

   The following rules apply to both the retention_set and
   retentevt_set attributes.

   o  As long as retention is not enabled, the client is permitted to
      decrease the duration.

   o  The duration can always be set to an equal or higher value, even
      if retention is enabled. Note that once retention is enabled,
      the actual duration (as returned by the retention_get or
      retentevt_get attributes, see Section 5.13.1 or Section 5.13.3),
      is constantly counting down to zero (one unit per second),
      unless the duration was set to RET4_DURATION_INFINITE. Thus it
      will not be possible for the client to precisely extend the
      duration on a file that has retention enabled.

   o  While retention is enabled, attempts to disable retention or
      decrease the retention's duration MUST fail with the error
      NFS4ERR_INVAL.

   o  If the principal attempting to change retention_set or
      retentevt_set does not have ACE4_WRITE_RETENTION permissions,
      the attempt MUST fail with NFS4ERR_ACCESS.

5.13.3. Attribute 71: retentevt_get

   Get the event-based retention duration, and if enabled, the
   event-based retention begin time of the file object.
   This attribute is like retention_get but refers to event-based
   retention. The event that triggers event-based retention is not
   defined by the NFSv4.1 specification.

5.13.4. Attribute 72: retentevt_set

   Set the event-based retention duration, and optionally enable
   event-based retention on the file object. This attribute
   corresponds to retentevt_get and is like retention_set, but refers
   to event-based retention. When event-based retention is set, the
   file MUST be retained even if non-event-based retention has been
   set, and the duration of non-event-based retention has been
   reached. Conversely, when non-event-based retention has been set,
   the file MUST be retained even if event-based retention has been
   set, and the duration of event-based retention has been reached.
   The server MAY restrict the enabling of event-based retention or
   the duration of event-based retention on the basis of the
   ACE4_WRITE_RETENTION ACL permission. The enabling of event-based
   retention MUST NOT prevent the enabling of non-event-based
   retention nor the modification of the retention_hold attribute.

5.13.5. Attribute 73: retention_hold

   Get or set administrative retention holds, one hold per bit
   position.

   This attribute allows one to 64 administrative holds, one hold per
   bit on the attribute. If retention_hold is not zero, then the file
   MUST NOT be deleted, renamed, or modified, even if the duration on
   enabled event or non-event-based retention has been reached. The
   server MAY restrict the modification of retention_hold on the basis
   of the ACE4_WRITE_RETENTION_HOLD ACL permission. The enabling of
   administrative retention holds does not prevent the enabling of
   event-based or non-event-based retention.
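
   As a minimal sketch of how the two retention kinds and the
   administrative holds combine, the following Python fragment models
   a check of whether a file must still be protected. The names
   RetentionState and is_protected are hypothetical and not part of
   the protocol. The begin-time fields stand in for the zero- or
   one-element begin-time arrays (None meaning zero length), and the
   sketch assumes that retention lasts until the begin time plus the
   duration.

```python
from dataclasses import dataclass
from typing import Optional

RET4_DURATION_INFINITE = 0xFFFFFFFFFFFFFFFF  # constant from Section 5.13.1


@dataclass
class RetentionState:
    """Hypothetical per-file state; the field names are illustrative only."""
    begin_time: Optional[int] = None      # non-event-based begin time, None if not enabled
    duration: int = 0                     # seconds, as set through retention_set
    evt_begin_time: Optional[int] = None  # event-based begin time, None if not enabled
    evt_duration: int = 0                 # seconds, as set through retentevt_set
    retention_hold: int = 0               # 64-bit mask, one administrative hold per bit


def _active(begin: Optional[int], duration: int, now: int) -> bool:
    """True while one kind of retention is enabled and has not run out."""
    if begin is None:                     # zero-length array: not enabled
        return False
    if duration == RET4_DURATION_INFINITE:
        return True
    # Assumption: retention ends once begin + duration has been reached.
    return now < begin + duration


def is_protected(state: RetentionState, now: int) -> bool:
    """True if the file must not be deleted, renamed, or modified."""
    if state.retention_hold != 0:         # any hold bit keeps the file protected
        return True
    # Either kind of retention is enough, even if the other has expired.
    return (_active(state.begin_time, state.duration, now)
            or _active(state.evt_begin_time, state.evt_duration, now))
```

   The hold mask is tested first because a non-zero retention_hold
   protects the file regardless of either duration. The two retention
   kinds are then ORed, reflecting that an expired duration of one
   kind does not release a file still retained by the other.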

   If the principal attempting to change retention_hold does not have
   ACE4_WRITE_RETENTION_HOLD permissions, the attempt MUST fail with
   NFS4ERR_ACCESS.

6. Access Control Attributes

   Access Control Lists (ACLs) are file attributes that specify
   fine-grained access control. This chapter covers the "acl", "dacl",
   "sacl", "aclsupport", "mode", and "mode_set_masked" file
   attributes, and their interactions. Note that file attributes may
   apply to any file system object.

6.1. Goals

   ACLs and modes represent two well-established models for specifying
   permissions. This chapter specifies requirements that attempt to
   meet the following goals:

   o  If a server supports the mode attribute, it should provide
      reasonable semantics to clients that only set and retrieve the
      mode attribute.

   o  If a server supports ACL attributes, it should provide
      reasonable semantics to clients that only set and retrieve those
      attributes.

   o  On servers that support the mode attribute, if ACL attributes
      have never been set on an object, via inheritance or explicitly,
      the behavior should be traditional UNIX-like behavior.

   o  On servers that support the mode attribute, if the ACL
      attributes have been previously set on an object, either
      explicitly or via inheritance:

      *  Setting only the mode attribute should effectively control
         the traditional UNIX-like permissions of read, write, and
         execute on owner, owner_group, and other.

      *  Setting only the mode attribute should provide reasonable
         security. For example, setting a mode of 000 should be enough
         to ensure that future opens for read or write by any
         principal fail, regardless of a previously existing or
         inherited ACL.

   o  NFSv4.1 may introduce different semantics relating to the mode
      and ACL attributes, but it does not render invalid any
      previously existing implementations. Additionally, this chapter
      provides clarifications based on previous implementations and
      discussions around them.

   o  On servers that support both the mode and the acl or dacl
      attributes, the server must keep the two consistent with each
      other. The value of the mode attribute (with the exception of
      the three high order bits described in Section 6.2.4), must be
      determined entirely by the value of the ACL, so that use of the
      mode is never required for anything other than setting the three
      high order bits. See Section 6.4.1 for exact requirements.

   o  When a mode attribute is set on an object, the ACL attributes
      may need to be modified so as to not conflict with the new mode.
      In such cases, it is desirable that the ACL keep as much
      information as possible. This includes information about
      inheritance, AUDIT and ALARM ACEs, and permissions granted and
      denied that do not conflict with the new mode.

6.2. File Attributes Discussion

6.2.1. Attribute 12: acl

   The NFSv4.1 ACL attribute contains an array of access control
   entries (ACEs) that are associated with the file system object.
   Although the client can read and write the acl attribute, the
   server is responsible for using the ACL to perform access control.
   The client can use the OPEN or ACCESS operations to check access
   without modifying or reading data or metadata.

   The NFS ACE structure is defined as follows:

   typedef uint32_t        acetype4;

   typedef uint32_t        aceflag4;

   typedef uint32_t        acemask4;

   struct nfsace4 {
           acetype4        type;
           aceflag4        flag;
           acemask4        access_mask;
           utf8str_mixed   who;
   };

   To determine if a request succeeds, the server processes each
   nfsace4 entry in order. Only ACEs which have a "who" that matches
   the requester are considered. Each ACE is processed until all of
   the bits of the requester's access have been ALLOWED. Once a bit
   (see below) has been ALLOWED by an ACCESS_ALLOWED_ACE, it is no
   longer considered in the processing of later ACEs. If an
   ACCESS_DENIED_ACE is encountered where the requester's access still
   has unALLOWED bits in common with the "access_mask" of the ACE, the
   request is denied. When the ACL is fully processed, if there are
   bits in the requester's mask that have not been ALLOWED or DENIED,
   access is denied.

   Unlike the ALLOW and DENY ACE types, the ALARM and AUDIT ACE types
   do not affect a requester's access, and instead are for triggering
   events as a result of a requester's access attempt. Therefore,
   AUDIT and ALARM ACEs are processed only after processing ALLOW and
   DENY ACEs.

   The NFSv4.1 ACL model is quite rich. Some server platforms may
   provide access control functionality that goes beyond the
   UNIX-style mode attribute, but which is not as rich as the NFS ACL
   model. So that users can take advantage of this more limited
   functionality, the server may support the acl attributes by mapping
   between its ACL model and the NFSv4.1 ACL model. Servers must
   ensure that the ACL they actually store or enforce is at least as
   strict as the NFSv4 ACL that was set. It is tempting to accomplish
   this by rejecting any ACL that falls outside the small set that can
   be represented accurately.
   However, such an approach can render ACLs unusable without special
   client-side knowledge of the server's mapping, which defeats the
   purpose of having a common NFSv4 ACL protocol. Therefore servers
   should accept every ACL that they can without compromising
   security. To help accomplish this, servers may make a special
   exception, in the case of unsupported permission bits, to the rule
   that bits not ALLOWED or DENIED by an ACL must be denied. For
   example, a UNIX-style server might choose to silently allow read
   attribute permissions even though an ACL does not explicitly allow
   those permissions. (An ACL that explicitly denies permission to
   read attributes should still be rejected.)

   The situation is complicated by the fact that a server may have
   multiple modules that enforce ACLs. For example, the enforcement
   for NFSv4.1 access may be different from, but not weaker than, the
   enforcement for local access, and both may be different from the
   enforcement for access through other protocols such as SMB. So it
   may be useful for a server to accept an ACL even if not all of its
   modules are able to support it.

   The guiding principle with regard to NFSv4 access is that the
   server must not accept ACLs that appear to make access to the file
   more restrictive than it really is.

6.2.1.1. ACE Type

   The constants used for the type field (acetype4) are as follows:

   const ACE4_ACCESS_ALLOWED_ACE_TYPE      = 0x00000000;
   const ACE4_ACCESS_DENIED_ACE_TYPE       = 0x00000001;
   const ACE4_SYSTEM_AUDIT_ACE_TYPE        = 0x00000002;
   const ACE4_SYSTEM_ALARM_ACE_TYPE        = 0x00000003;

   Only the ALLOWED and DENIED types may be used in the dacl
   attribute, and only the AUDIT and ALARM types may be used in the
   sacl attribute. All four are permitted in the acl attribute.

   +------------------------------+--------------+---------------------+
   | Value                        | Abbreviation | Description         |
   +------------------------------+--------------+---------------------+
   | ACE4_ACCESS_ALLOWED_ACE_TYPE | ALLOW        | Explicitly grants   |
   |                              |              | the access defined  |
   |                              |              | in acemask4 to the  |
   |                              |              | file or directory.  |
   | ACE4_ACCESS_DENIED_ACE_TYPE  | DENY         | Explicitly denies   |
   |                              |              | the access defined  |
   |                              |              | in acemask4 to the  |
   |                              |              | file or directory.  |
   | ACE4_SYSTEM_AUDIT_ACE_TYPE   | AUDIT        | LOG (in a system    |
   |                              |              | dependent way) any  |
   |                              |              | access attempt to a |
   |                              |              | file or directory   |
   |                              |              | which uses any of   |
   |                              |              | the access methods  |
   |                              |              | specified in        |
   |                              |              | acemask4.           |
   | ACE4_SYSTEM_ALARM_ACE_TYPE   | ALARM        | Generate a system   |
   |                              |              | ALARM (system       |
   |                              |              | dependent) when any |
   |                              |              | access attempt is   |
   |                              |              | made to a file or   |
   |                              |              | directory for the   |
   |                              |              | access methods      |
   |                              |              | specified in        |
   |                              |              | acemask4.           |
   +------------------------------+--------------+---------------------+

   The "Abbreviation" column denotes how the types will be referred to
   throughout the rest of this chapter.

6.2.1.2. Attribute 13: aclsupport

   A server need not support all of the above ACE types. This
   attribute indicates which ACE types are supported for the current
   file system. The bitmask constants used to represent the above
   definitions within the aclsupport attribute are as follows:

   const ACL4_SUPPORT_ALLOW_ACL    = 0x00000001;
   const ACL4_SUPPORT_DENY_ACL     = 0x00000002;
   const ACL4_SUPPORT_AUDIT_ACL    = 0x00000004;
   const ACL4_SUPPORT_ALARM_ACL    = 0x00000008;

   Servers which support either the ALLOW or DENY ACE type SHOULD
   support both ALLOW and DENY ACE types.

   Clients should not attempt to set an ACE unless the server claims
   support for that ACE type. If the server receives a request to set
   an ACE that it cannot store, it MUST reject the request with
   NFS4ERR_ATTRNOTSUPP. If the server receives a request to set an ACE
   that it can store but cannot enforce, the server SHOULD reject the
   request with NFS4ERR_ATTRNOTSUPP.

   Support for any of the ACL attributes is optional (albeit,
   RECOMMENDED). However, a server that supports either of the new ACL
   attributes (dacl or sacl) MUST allow use of the new ACL attributes
   to access all of the ACE types which it supports. In other words,
   if such a server supports ALLOW or DENY ACEs, then it MUST support
   the dacl attribute, and if it supports AUDIT or ALARM ACEs, then it
   MUST support the sacl attribute.

6.2.1.3. ACE Access Mask

   The bitmask constants used for the access mask field are as
   follows:

   const ACE4_READ_DATA            = 0x00000001;
   const ACE4_LIST_DIRECTORY       = 0x00000001;
   const ACE4_WRITE_DATA           = 0x00000002;
   const ACE4_ADD_FILE             = 0x00000002;
   const ACE4_APPEND_DATA          = 0x00000004;
   const ACE4_ADD_SUBDIRECTORY     = 0x00000004;
   const ACE4_READ_NAMED_ATTRS     = 0x00000008;
   const ACE4_WRITE_NAMED_ATTRS    = 0x00000010;
   const ACE4_EXECUTE              = 0x00000020;
   const ACE4_DELETE_CHILD         = 0x00000040;
   const ACE4_READ_ATTRIBUTES      = 0x00000080;
   const ACE4_WRITE_ATTRIBUTES     = 0x00000100;
   const ACE4_WRITE_RETENTION      = 0x00000200;
   const ACE4_WRITE_RETENTION_HOLD = 0x00000400;

   const ACE4_DELETE               = 0x00010000;
   const ACE4_READ_ACL             = 0x00020000;
   const ACE4_WRITE_ACL            = 0x00040000;
   const ACE4_WRITE_OWNER          = 0x00080000;
   const ACE4_SYNCHRONIZE          = 0x00100000;

   Note that some masks have coincident values, for example,
   ACE4_READ_DATA and ACE4_LIST_DIRECTORY. The mask entries
   ACE4_LIST_DIRECTORY, ACE4_ADD_FILE, and ACE4_ADD_SUBDIRECTORY are
   intended to be used with directory objects, while ACE4_READ_DATA,
   ACE4_WRITE_DATA, and ACE4_APPEND_DATA are intended to be used with
   non-directory objects.

6.2.1.3.1. Discussion of Mask Attributes

   ACE4_READ_DATA

      Operation(s) affected:

         READ
         OPEN

      Discussion:

         Permission to read the data of the file.

         Servers SHOULD allow a user the ability to read the data of
         the file when only the ACE4_EXECUTE access mask bit is
         allowed.

   ACE4_LIST_DIRECTORY

      Operation(s) affected:

         READDIR

      Discussion:

         Permission to list the contents of a directory.

   ACE4_WRITE_DATA

      Operation(s) affected:

         WRITE
         OPEN
         SETATTR of size

      Discussion:

         Permission to modify a file's data.

   ACE4_ADD_FILE

      Operation(s) affected:

         CREATE
         LINK
         OPEN
         RENAME

      Discussion:

         Permission to add a new file in a directory. The CREATE
         operation is affected when nfs_ftype4 is NF4LNK, NF4BLK,
         NF4CHR, NF4SOCK, or NF4FIFO. (NF4DIR is not listed because it
         is covered by ACE4_ADD_SUBDIRECTORY.) OPEN is affected when
         used to create a regular file. LINK and RENAME are always
         affected.

   ACE4_APPEND_DATA

      Operation(s) affected:

         WRITE
         OPEN
         SETATTR of size

      Discussion:

         The ability to modify a file's data, but only starting at
         EOF. This allows for the notion of append-only files, by
         allowing ACE4_APPEND_DATA and denying ACE4_WRITE_DATA to the
         same user or group. If a file has an ACL such as the one
         described above and a WRITE request is made for somewhere
         other than EOF, the server SHOULD return NFS4ERR_ACCESS.

   ACE4_ADD_SUBDIRECTORY

      Operation(s) affected:

         CREATE
         RENAME

      Discussion:

         Permission to create a subdirectory in a directory. The
         CREATE operation is affected when nfs_ftype4 is NF4DIR. The
         RENAME operation is always affected.

   ACE4_READ_NAMED_ATTRS

      Operation(s) affected:

         OPENATTR

      Discussion:

         Permission to read the named attributes of a file or to look
         up the named attributes directory. OPENATTR is affected when
         it is not used to create a named attribute directory. This is
         when 1.)
         createdir is TRUE, but a named attribute directory already
         exists, or 2.) createdir is FALSE.

   ACE4_WRITE_NAMED_ATTRS

      Operation(s) affected:

         OPENATTR

      Discussion:

         Permission to write the named attributes of a file or to
         create a named attribute directory. OPENATTR is affected when
         it is used to create a named attribute directory. This is
         when createdir is TRUE and no named attribute directory
         exists. The ability to check whether or not a named attribute
         directory exists depends on the ability to look it up;
         therefore, users also need the ACE4_READ_NAMED_ATTRS
         permission in order to create a named attribute directory.

   ACE4_EXECUTE

      Operation(s) affected:

         READ
         OPEN
         REMOVE
         RENAME
         LINK
         CREATE

      Discussion:

         Permission to execute a file.

         Servers SHOULD allow a user the ability to read the data of
         the file when only the ACE4_EXECUTE access mask bit is
         allowed. This is because there is no way to execute a file
         without reading the contents. Though a server may treat
         ACE4_EXECUTE and ACE4_READ_DATA bits identically when
         deciding to permit a READ operation, it SHOULD still allow
         the two bits to be set independently in ACLs, and MUST
         distinguish between them when replying to ACCESS operations.
         In particular, servers SHOULD NOT silently turn on one of the
         two bits when the other is set, as that would make it
         impossible for the client to correctly enforce the
         distinction between read and execute permissions.

         As an example, following a SETATTR of the following ACL:

            nfsuser:ACE4_EXECUTE:ALLOW

         A subsequent GETATTR of ACL for that file SHOULD return:

            nfsuser:ACE4_EXECUTE:ALLOW

         Rather than:

            nfsuser:ACE4_EXECUTE/ACE4_READ_DATA:ALLOW

   ACE4_EXECUTE

      Operation(s) affected:

         LOOKUP

      Discussion:

         Permission to traverse/search a directory.

   ACE4_DELETE_CHILD

      Operation(s) affected:

         REMOVE
         RENAME

      Discussion:

         Permission to delete a file or directory within a directory.
         See Section 6.2.1.3.2 for information on how ACE4_DELETE and
         ACE4_DELETE_CHILD interact.

   ACE4_READ_ATTRIBUTES

      Operation(s) affected:

         GETATTR of file system object attributes
         VERIFY
         NVERIFY
         READDIR

      Discussion:

         The ability to read basic attributes (non-ACLs) of a file. On
         a UNIX system, basic attributes can be thought of as the stat
         level attributes. Allowing this access mask bit would mean
         the entity can execute "ls -l" and stat. If a READDIR
         operation requests attributes, this mask must be allowed for
         the READDIR to succeed.

   ACE4_WRITE_ATTRIBUTES

      Operation(s) affected:

         SETATTR of time_access_set, time_backup, time_create,
         time_modify_set, mimetype, hidden, system

      Discussion:

         Permission to change the times associated with a file or
         directory to an arbitrary value. Also permission to change
         the mimetype, hidden and system attributes. A user having
         ACE4_WRITE_DATA or ACE4_WRITE_ATTRIBUTES will be allowed to
         set the times associated with a file to the current server
         time.

   ACE4_WRITE_RETENTION

      Operation(s) affected:

         SETATTR of retention_set, retentevt_set.

      Discussion:

         Permission to modify the durations of event and
         non-event-based retention. Also permission to enable event
         and non-event-based retention. A server MAY behave such that
         setting ACE4_WRITE_ATTRIBUTES allows ACE4_WRITE_RETENTION.

   ACE4_WRITE_RETENTION_HOLD

      Operation(s) affected:

         SETATTR of retention_hold.

      Discussion:

         Permission to modify the administrative retention holds. A
         server MAY map ACE4_WRITE_ATTRIBUTES to
         ACE4_WRITE_RETENTION_HOLD.

   ACE4_DELETE

      Operation(s) affected:

         REMOVE

      Discussion:

         Permission to delete the file or directory. See
         Section 6.2.1.3.2 for information on how ACE4_DELETE and
         ACE4_DELETE_CHILD interact.

   ACE4_READ_ACL

      Operation(s) affected:

         GETATTR of acl, dacl, or sacl
         NVERIFY
         VERIFY

      Discussion:

         Permission to read the ACL.

   ACE4_WRITE_ACL

      Operation(s) affected:

         SETATTR of acl and mode

      Discussion:

         Permission to write the acl and mode attributes.

   ACE4_WRITE_OWNER

      Operation(s) affected:

         SETATTR of owner and owner_group

      Discussion:

         Permission to write the owner and owner_group attributes. On
         UNIX systems, this is the ability to execute chown() and
         chgrp().

   ACE4_SYNCHRONIZE

      Operation(s) affected:

         NONE

      Discussion:

         Permission to access file locally at the server with
         synchronized reads and writes.

   Server implementations need not provide the granularity of control
   that is implied by this list of masks. For example, POSIX-based
   systems might not distinguish ACE4_APPEND_DATA (the ability to
   append to a file) from ACE4_WRITE_DATA (the ability to modify
   existing contents); both masks would be tied to a single "write"
   permission. When such a server returns attributes to the client, it
   would show both ACE4_APPEND_DATA and ACE4_WRITE_DATA if and only if
   the write permission is enabled.

   If a server receives a SETATTR request that it cannot accurately
   implement, it should err in the direction of more restricted
   access, except in the previously discussed cases of execute and
   read. For example, suppose a server cannot distinguish overwriting
   data from appending new data, as described in the previous
   paragraph. If a client submits an ALLOW ACE where ACE4_APPEND_DATA
   is set but ACE4_WRITE_DATA is not (or vice versa), the server
   should either turn off ACE4_APPEND_DATA or reject the request with
   NFS4ERR_ATTRNOTSUPP.

6.2.1.3.2. ACE4_DELETE vs. ACE4_DELETE_CHILD

   Two access mask bits govern the ability to delete a directory
   entry: ACE4_DELETE on the object itself (the "target"), and
   ACE4_DELETE_CHILD on the containing directory (the "parent").

   Many systems also take the "sticky bit" (MODE4_SVTX) on a directory
   to allow unlink only to a user that owns either the target or the
   parent; on some such systems the decision also depends on whether
   the target is writable.
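
   The unlink decision that this section goes on to describe can be
   sketched roughly as follows. This is an illustration in Python, not
   a normative algorithm: Perm and may_unlink are hypothetical names,
   and each argument is assumed to be the already-evaluated result of
   ACL processing for one mask bit. Two choices are assumptions of the
   sketch rather than requirements: with MODE4_SVTX set it picks the
   ownership option (one of the behaviors a server may adopt), and it
   refuses the request when a bit is explicitly denied and neither is
   allowed.

```python
from enum import Enum


class Perm(Enum):
    """Outcome of ACL processing for one access mask bit (illustrative)."""
    ALLOWED = "allowed"
    DENIED = "denied"
    UNSPECIFIED = "unspecified"  # neither explicitly ALLOWED nor DENIED


def may_unlink(delete_on_target, delete_child_on_parent, add_file_on_parent,
               parent_sticky=False, owns_target_or_parent=False):
    """Decide whether a directory entry may be removed."""
    # Either bit being allowed is sufficient, even if the other is denied.
    if Perm.ALLOWED in (delete_on_target, delete_child_on_parent):
        return True

    # Neither ACL mentions deletion: fall back to ACE4_ADD_FILE on the parent.
    if (delete_on_target is Perm.UNSPECIFIED
            and delete_child_on_parent is Perm.UNSPECIFIED):
        if add_file_on_parent is not Perm.ALLOWED:
            return False
        # Assumption: with the sticky bit set, also require ownership.
        if parent_sticky:
            return owns_target_or_parent
        return True

    # Assumption: an explicit DENY with no ALLOW on either bit is refused.
    return False
```

   For instance, may_unlink(Perm.DENIED, Perm.ALLOWED,
   Perm.UNSPECIFIED) permits the removal because ACE4_DELETE_CHILD is
   allowed on the parent, even though ACE4_DELETE is denied on the
   target.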

   Servers SHOULD allow unlink if either ACE4_DELETE is permitted on
   the target, or ACE4_DELETE_CHILD is permitted on the parent. (Note
   that this is true even if the parent or target explicitly denies
   one of these permissions.)

   If the ACLs in question neither explicitly ALLOW nor DENY either of
   the above, and if MODE4_SVTX is not set on the parent, then the
   server SHOULD allow the removal if and only if ACE4_ADD_FILE is
   permitted. In the case where MODE4_SVTX is set, the server may also
   require the remover to own either the parent or the target, or may
   require the target to be writable.

   This allows servers to support something close to traditional
   UNIX-like semantics, with ACE4_ADD_FILE taking the place of the
   write bit.

6.2.1.4. ACE flag

   The bitmask constants used for the flag field are as follows:

   const ACE4_FILE_INHERIT_ACE             = 0x00000001;
   const ACE4_DIRECTORY_INHERIT_ACE        = 0x00000002;
   const ACE4_NO_PROPAGATE_INHERIT_ACE     = 0x00000004;
   const ACE4_INHERIT_ONLY_ACE             = 0x00000008;
   const ACE4_SUCCESSFUL_ACCESS_ACE_FLAG   = 0x00000010;
   const ACE4_FAILED_ACCESS_ACE_FLAG       = 0x00000020;
   const ACE4_IDENTIFIER_GROUP             = 0x00000040;
   const ACE4_INHERITED_ACE                = 0x00000080;

   A server need not support any of these flags. If the server
   supports flags that are similar to, but not exactly the same as,
   these flags, the implementation may define a mapping between the
   protocol-defined flags and the implementation-defined flags.

   For example, suppose a client tries to set an ACE with
   ACE4_FILE_INHERIT_ACE set but not ACE4_DIRECTORY_INHERIT_ACE. If
   the server does not support any form of ACL inheritance, the server
   should reject the request with NFS4ERR_ATTRNOTSUPP. If the server
   supports a single "inherit ACE" flag that applies to both files and
   directories, the server may reject the request (i.e., requiring the
   client to set both the file and directory inheritance flags). The
   server may also accept the request and silently turn on the
   ACE4_DIRECTORY_INHERIT_ACE flag.

6.2.1.4.1. Discussion of Flag Bits

   ACE4_FILE_INHERIT_ACE
      Any non-directory file in any sub-directory will get this ACE
      inherited.

   ACE4_DIRECTORY_INHERIT_ACE
      Can be placed on a directory and indicates that this ACE should
      be added to each new directory created.
      If this flag is set in an ACE in an ACL attribute to be set on a
      non-directory file system object, the operation attempting to
      set the ACL SHOULD fail with NFS4ERR_ATTRNOTSUPP.

   ACE4_INHERIT_ONLY_ACE
      Can be placed on a directory but does not apply to the
      directory; ALLOW and DENY ACEs with this bit set do not affect
      access to the directory, and AUDIT and ALARM ACEs with this bit
      set do not trigger log or alarm events. Such ACEs only take
      effect once they are applied (with this bit cleared) to newly
      created files and directories as specified by the above two
      flags.
      If this flag is present on an ACE, but neither
      ACE4_DIRECTORY_INHERIT_ACE nor ACE4_FILE_INHERIT_ACE is present,
      then an operation attempting to set such an attribute SHOULD
      fail with NFS4ERR_ATTRNOTSUPP.

   ACE4_NO_PROPAGATE_INHERIT_ACE
      Can be placed on a directory. This flag tells the server that
      inheritance of this ACE should stop at newly created child
      directories.

   ACE4_INHERITED_ACE
      Indicates that this ACE is inherited from a parent directory.
      A server that supports automatic inheritance will place this flag
      on any ACEs inherited from the parent directory when creating a
      new object.  Client applications will use this to perform
      automatic inheritance.  Clients and servers MUST clear this bit in
      the acl attribute; it may only be used in the dacl and sacl
      attributes.

   ACE4_SUCCESSFUL_ACCESS_ACE_FLAG
   ACE4_FAILED_ACCESS_ACE_FLAG
      The ACE4_SUCCESSFUL_ACCESS_ACE_FLAG (SUCCESS) and
      ACE4_FAILED_ACCESS_ACE_FLAG (FAILED) flag bits may be set only on
      ACE4_SYSTEM_AUDIT_ACE_TYPE (AUDIT) and ACE4_SYSTEM_ALARM_ACE_TYPE
      (ALARM) ACE types.  If, during the processing of the file's ACL,
      the server encounters an AUDIT or ALARM ACE that matches the
      principal attempting the OPEN, the server notes that fact, and the
      presence, if any, of the SUCCESS and FAILED flags encountered in
      the AUDIT or ALARM ACE.  Once the server completes the ACL
      processing, it then notes if the operation succeeded or failed.
      If the operation succeeded, and if the SUCCESS flag was set for a
      matching AUDIT or ALARM ACE, then the appropriate AUDIT or ALARM
      event occurs.  If the operation failed, and if the FAILED flag was
      set for the matching AUDIT or ALARM ACE, then the appropriate
      AUDIT or ALARM event occurs.  Either or both of the SUCCESS and
      FAILED flags can be set, but if neither is set, the AUDIT or ALARM
      ACE is not useful.

      The previously described processing applies to ACCESS operations
      even when they return NFS4_OK.  For the purposes of AUDIT and
      ALARM, we consider an ACCESS operation to be a "failure" if it
      fails to return a bit that was requested and supported.

   ACE4_IDENTIFIER_GROUP
      Indicates that the "who" refers to a GROUP as defined under UNIX
      or a GROUP ACCOUNT as defined under Windows.  Clients and servers
      MUST ignore the ACE4_IDENTIFIER_GROUP flag on ACEs with a who
      value equal to one of the special identifiers outlined in
      Section 6.2.1.5.

6.2.1.5.  ACE Who

   The "who" field of an ACE is an identifier that specifies the
   principal or principals to whom the ACE applies.  It may refer to a
   user or a group, with the flag bit ACE4_IDENTIFIER_GROUP specifying
   which.

   There are several special identifiers which need to be understood
   universally, rather than in the context of a particular DNS domain.
   Some of these identifiers cannot be understood when an NFS client
   accesses the server, but have meaning when a local process accesses
   the file.  The ability to display and modify these permissions is
   permitted over NFS, even if none of the access methods on the server
   understands the identifiers.

   +---------------+--------------------------------------------------+
   | Who           | Description                                      |
   +---------------+--------------------------------------------------+
   | OWNER         | The owner of the file.                           |
   | GROUP         | The group associated with the file.              |
   | EVERYONE      | The world, including the owner and owning group. |
   | INTERACTIVE   | Accessed from an interactive terminal.           |
   | NETWORK       | Accessed via the network.                        |
   | DIALUP        | Accessed as a dialup user to the server.         |
   | BATCH         | Accessed from a batch job.                       |
   | ANONYMOUS     | Accessed without any authentication.             |
   | AUTHENTICATED | Any authenticated user (opposite of ANONYMOUS).  |
   | SERVICE       | Access from a system service.                    |
   +---------------+--------------------------------------------------+

                                  Table 4

   To avoid conflict, these special identifiers are distinguished by an
   appended "@" and should appear in the form "xxxx@" (with no domain
   name after the "@").  For example: ANONYMOUS@.

   The ACE4_IDENTIFIER_GROUP flag MUST be ignored on entries with these
   special identifiers.  When encoding entries with these special
   identifiers, the ACE4_IDENTIFIER_GROUP flag SHOULD be set to zero.

6.2.1.5.1.  Discussion of EVERYONE@

   It is important to note that "EVERYONE@" is not equivalent to the
   UNIX "other" entity.  This is because, by definition, UNIX "other"
   does not include the owner or owning group of a file.  "EVERYONE@"
   means literally everyone, including the owner or owning group.

6.2.2.  Attribute 58: dacl

   The dacl attribute is like the acl attribute, but dacl allows just
   ALLOW and DENY ACEs.  The dacl attribute supports automatic
   inheritance (see Section 6.4.3.2).

6.2.3.  Attribute 59: sacl

   The sacl attribute is like the acl attribute, but sacl allows just
   AUDIT and ALARM ACEs.  The sacl attribute supports automatic
   inheritance (see Section 6.4.3.2).

6.2.4.  Attribute 33: mode

   The NFSv4.1 mode attribute is based on the UNIX mode bits.  The
   following bits are defined:

   const MODE4_SUID = 0x800;  /* set user id on execution */
   const MODE4_SGID = 0x400;  /* set group id on execution */
   const MODE4_SVTX = 0x200;  /* save text even after use */
   const MODE4_RUSR = 0x100;  /* read permission: owner */
   const MODE4_WUSR = 0x080;  /* write permission: owner */
   const MODE4_XUSR = 0x040;  /* execute permission: owner */
   const MODE4_RGRP = 0x020;  /* read permission: group */
   const MODE4_WGRP = 0x010;  /* write permission: group */
   const MODE4_XGRP = 0x008;  /* execute permission: group */
   const MODE4_ROTH = 0x004;  /* read permission: other */
   const MODE4_WOTH = 0x002;  /* write permission: other */
   const MODE4_XOTH = 0x001;  /* execute permission: other */

   Bits MODE4_RUSR, MODE4_WUSR, and MODE4_XUSR apply to the principal
   identified in the owner attribute.  Bits MODE4_RGRP, MODE4_WGRP, and
   MODE4_XGRP apply to principals identified in the owner_group
   attribute but who are not identified in the owner attribute.  Bits
   MODE4_ROTH, MODE4_WOTH, and MODE4_XOTH apply to any principal that
   does not match that in the owner attribute, and does not have a group
   matching that of the owner_group attribute.

   Bits within the mode other than those specified above are not defined
   by this protocol.  A server MUST NOT return bits other than those
   defined above in a GETATTR or READDIR operation, and it MUST return
   NFS4ERR_INVAL if bits other than those defined above are set in a
   SETATTR, CREATE, OPEN, VERIFY or NVERIFY operation.

6.2.5.  Attribute 74: mode_set_masked

   The mode_set_masked attribute is a write-only attribute that allows
   individual bits in the mode attribute to be set or reset, without
   changing others.
   It allows, for example, the bits MODE4_SUID, MODE4_SGID, and
   MODE4_SVTX to be modified while leaving unmodified any of the nine
   low-order mode bits devoted to permissions.

   When the nine low-order bits are left unmodified, neither the acl
   nor the dacl attribute should be automatically modified as discussed
   in Section 6.4.1.

   The mode_set_masked attribute consists of two words each in the form
   of a mode4.  The first consists of the value to be applied to the
   current mode value and the second is a mask.  Only bits set to one in
   the mask word are changed (set or reset) in the file's mode.  All
   other bits in the mode remain unchanged.  Bits in the first word that
   correspond to bits which are zero in the mask are ignored, except
   that undefined bits are checked for validity and can result in
   NFS4ERR_INVAL as described below.

   The mode_set_masked attribute is only valid in a SETATTR operation.
   If it is used in a CREATE or OPEN operation, the server MUST return
   NFS4ERR_INVAL.

   Bits not defined as valid in the mode attribute are not valid in
   either word of the mode_set_masked attribute.  The server MUST return
   NFS4ERR_INVAL if any of those are on in a SETATTR.  If the mode and
   mode_set_masked attributes are both specified in the same SETATTR,
   the server MUST also return NFS4ERR_INVAL.

6.3.  Common Methods

   The requirements in this section will be referred to in future
   sections, especially Section 6.4.

6.3.1.  Interpreting an ACL

6.3.1.1.  Server Considerations

   The server uses the algorithm described in Section 6.2.1 to determine
   whether an ACL allows access to an object.  However, the ACL may not
   be the sole determiner of access.
   For example:

   o  In the case of a file system exported as read-only, the server may
      deny write permissions even though an object's ACL grants it.

   o  Server implementations MAY grant ACE4_WRITE_ACL and ACE4_READ_ACL
      permissions to prevent a situation from arising in which there is
      no valid way to ever modify the ACL.

   o  All servers will allow a user the ability to read the data of the
      file when only the execute permission is granted (i.e., if the ACL
      denies the user the ACE4_READ_DATA access and allows the user
      ACE4_EXECUTE, the server will allow the user to read the data of
      the file).

   o  Many servers have the notion of owner-override in which the owner
      of the object is allowed to override accesses that are denied by
      the ACL.  This may be helpful, for example, to allow users
      continued access to open files on which the permissions have
      changed.

   o  Many servers have the notion of a "superuser" that has privileges
      beyond an ordinary user.  The superuser may be able to read or
      write data or metadata in ways that would not be permitted by the
      ACL.

6.3.1.2.  Client Considerations

   Clients SHOULD NOT do their own access checks based on their
   interpretation of the ACL, but rather use the OPEN and ACCESS
   operations to do access checks.  This allows the client to act on the
   results of having the server determine whether or not access should
   be granted based on its interpretation of the ACL.

   Clients must be aware of situations in which an object's ACL will
   define a certain access even though the server will not enforce it.
   In general, but especially in these situations, the client needs to
   do its part in the enforcement of access as defined by the ACL.
   To do this, the client MAY send the appropriate ACCESS operation
   prior to servicing the request of the user or application in order to
   determine whether the user or application should be granted the
   access requested.  For examples in which the ACL may define accesses
   that the server doesn't enforce, see Section 6.3.1.1.

6.3.2.  Computing a Mode Attribute from an ACL

   The following method can be used to calculate the MODE4_R*, MODE4_W*
   and MODE4_X* bits of a mode attribute, based upon an ACL.

   First, for each of the special identifiers OWNER@, GROUP@, and
   EVERYONE@, evaluate the ACL in order, considering only ALLOW and DENY
   ACEs for the identifier EVERYONE@ and for the identifier under
   consideration.  The result of the evaluation will be an NFSv4 ACL
   mask showing exactly which bits are permitted to that identifier.

   Then translate the calculated mask for OWNER@, GROUP@, and EVERYONE@
   into mode bits for, respectively, the user, group, and other, as
   follows:

   1.  Set the read bit (MODE4_RUSR, MODE4_RGRP, or MODE4_ROTH) if and
       only if ACE4_READ_DATA is set in the corresponding mask.

   2.  Set the write bit (MODE4_WUSR, MODE4_WGRP, or MODE4_WOTH) if and
       only if ACE4_WRITE_DATA and ACE4_APPEND_DATA are both set in the
       corresponding mask.

   3.  Set the execute bit (MODE4_XUSR, MODE4_XGRP, or MODE4_XOTH), if
       and only if ACE4_EXECUTE is set in the corresponding mask.

6.3.2.1.  Discussion

   Some server implementations also add bits permitted to named users
   and groups to the group bits (MODE4_RGRP, MODE4_WGRP, and
   MODE4_XGRP).

   Implementations are discouraged from doing this, because it has been
   found to cause confusion for users who see members of a file's group
   denied access that the mode bits appear to allow.  (The presence of
   DENY ACEs may also lead to such behavior, but DENY ACEs are expected
   to be more rarely used.)

   The same user confusion seen when fetching the mode also results if
   setting the mode does not effectively control permissions for the
   owner, group, and other users; this motivates some of the
   requirements that follow.

6.4.  Requirements

   The server that supports both mode and ACL must take care to
   synchronize the MODE4_*USR, MODE4_*GRP, and MODE4_*OTH bits with the
   ACEs which have respective who fields of "OWNER@", "GROUP@", and
   "EVERYONE@" so that the client can see semantically equivalent access
   permissions exist whether the client asks for owner, owner_group and
   mode attributes, or for just the ACL.

   In this section, much is made of the methods in Section 6.3.2.  Many
   requirements refer to this section.  But note that the methods have
   behaviors specified with "SHOULD".  This is intentional, to avoid
   invalidating existing implementations that compute the mode according
   to the withdrawn POSIX ACL draft (1003.1e draft 17), rather than by
   actual permissions on owner, group, and other.

6.4.1.  Setting the mode and/or ACL Attributes

   In the case where a server supports the sacl or dacl attribute, in
   addition to the acl attribute, the server MUST fail a request to set
   the acl attribute simultaneously with a dacl or sacl attribute.  The
   error to be given is NFS4ERR_ATTRNOTSUPP.

6.4.1.1.  Setting mode and not ACL

   When any of the nine low-order mode bits are subject to change,
   either because the mode attribute was set or because the
   mode_set_masked attribute was set and the mask included one or more
   bits from the nine low-order mode bits, and no ACL attribute is
   explicitly set, the acl and dacl attributes must be modified in
   accordance with the updated value of those bits.  This must happen
   even if the value of the low-order bits is the same after the mode is
   set as before.

   Note that any AUDIT or ALARM ACEs (hence any ACEs in the sacl
   attribute) are unaffected by changes to the mode.

   In cases in which the permissions bits are subject to change, the acl
   and dacl attributes MUST be modified such that the mode computed via
   the method in Section 6.3.2 yields the low-order nine bits (MODE4_R*,
   MODE4_W*, MODE4_X*) of the mode attribute as modified by the
   attribute change.  The ACL attributes SHOULD also be modified such
   that:

   1.  If MODE4_RGRP is not set, entities explicitly listed in the ACL
       other than OWNER@ and EVERYONE@ SHOULD NOT be granted
       ACE4_READ_DATA.

   2.  If MODE4_WGRP is not set, entities explicitly listed in the ACL
       other than OWNER@ and EVERYONE@ SHOULD NOT be granted
       ACE4_WRITE_DATA or ACE4_APPEND_DATA.

   3.  If MODE4_XGRP is not set, entities explicitly listed in the ACL
       other than OWNER@ and EVERYONE@ SHOULD NOT be granted
       ACE4_EXECUTE.

   Access mask bits other than those listed above, appearing in ALLOW
   ACEs, MAY also be disabled.

   Note that ACEs with the flag ACE4_INHERIT_ONLY_ACE set do not affect
   the permissions of the ACL itself, nor do ACEs of the type AUDIT and
   ALARM.
   As such, it is desirable to leave these ACEs unmodified when
   modifying the ACL attributes.

   Also note that the requirement may be met by discarding the acl and
   dacl, in favor of an ACL that represents the mode and only the mode.
   This is permitted, but it is preferable for a server to preserve as
   much of the ACL as possible without violating the above requirements.
   Discarding the ACL makes it effectively impossible for a file created
   with a mode attribute to inherit an ACL (see Section 6.4.3).

6.4.1.2.  Setting ACL and not mode

   When setting the acl or dacl and not setting the mode or
   mode_set_masked attributes, the permission bits of the mode need to
   be derived from the ACL.  In this case, the ACL attribute SHOULD be
   set as given.  The nine low-order bits of the mode attribute
   (MODE4_R*, MODE4_W*, MODE4_X*) MUST be modified to match the result
   of the method in Section 6.3.2.  The three high-order bits of the
   mode (MODE4_SUID, MODE4_SGID, MODE4_SVTX) SHOULD remain unchanged.

6.4.1.3.  Setting both ACL and mode

   When setting both the mode (includes use of either the mode attribute
   or the mode_set_masked attribute) and the acl or dacl attributes in
   the same operation, the attributes MUST be applied in this order:
   mode (or mode_set_masked), then ACL.  The mode-related attribute is
   set as given, then the ACL attribute is set as given, possibly
   changing the final mode, as described above in Section 6.4.1.2.

6.4.2.  Retrieving the mode and/or ACL Attributes

   This section applies only to servers that support both the mode and
   ACL attributes.
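
   The method of Section 6.3.2, on which the requirements of Sections
   6.4.1.2 and 6.4.2 depend, can be illustrated with a short Python
   sketch.  It is not part of the protocol: the Ace tuple and the
   function names are hypothetical, the ACE type and access mask values
   are those given for the acl attribute in Section 6.2.1, and the
   sketch assumes that ACEs carrying ACE4_INHERIT_ONLY_ACE are skipped
   because they do not affect access to the object itself.

```python
from typing import NamedTuple, Sequence

# ACE types, access mask bits and flag values from Section 6.2.1.
ACE4_ACCESS_ALLOWED_ACE_TYPE = 0x00000000
ACE4_ACCESS_DENIED_ACE_TYPE = 0x00000001
ACE4_READ_DATA = 0x00000001
ACE4_WRITE_DATA = 0x00000002
ACE4_APPEND_DATA = 0x00000004
ACE4_EXECUTE = 0x00000020
ACE4_INHERIT_ONLY_ACE = 0x00000008

# (read, write, execute) mode bits for each special identifier.
MODE_BITS = {
    "OWNER@": (0x100, 0x080, 0x040),     # MODE4_RUSR, MODE4_WUSR, MODE4_XUSR
    "GROUP@": (0x020, 0x010, 0x008),     # MODE4_RGRP, MODE4_WGRP, MODE4_XGRP
    "EVERYONE@": (0x004, 0x002, 0x001),  # MODE4_ROTH, MODE4_WOTH, MODE4_XOTH
}


class Ace(NamedTuple):
    """Hypothetical stand-in for the nfsace4 structure."""
    ace_type: int
    flag: int
    access_mask: int
    who: str


def permitted_mask(acl: Sequence[Ace], who: str) -> int:
    """Evaluate the ACL in order for one special identifier.

    Only ALLOW and DENY ACEs for `who` and for EVERYONE@ are considered;
    the first ACE that mentions a bit decides that bit.
    """
    allowed = 0
    decided = 0
    for ace in acl:
        if ace.ace_type not in (ACE4_ACCESS_ALLOWED_ACE_TYPE,
                                ACE4_ACCESS_DENIED_ACE_TYPE):
            continue  # AUDIT and ALARM ACEs do not affect access
        if ace.flag & ACE4_INHERIT_ONLY_ACE:
            continue  # assumption: inherit-only ACEs are ignored here
        if ace.who not in (who, "EVERYONE@"):
            continue
        undecided = ace.access_mask & ~decided
        if ace.ace_type == ACE4_ACCESS_ALLOWED_ACE_TYPE:
            allowed |= undecided
        decided |= undecided
    return allowed


def mode_from_acl(acl: Sequence[Ace]) -> int:
    """Return the nine low-order mode bits implied by the ACL."""
    mode = 0
    for who, (read_bit, write_bit, exec_bit) in MODE_BITS.items():
        mask = permitted_mask(acl, who)
        if mask & ACE4_READ_DATA:
            mode |= read_bit
        # The write bit needs both WRITE_DATA and APPEND_DATA.
        if mask & ACE4_WRITE_DATA and mask & ACE4_APPEND_DATA:
            mode |= write_bit
        if mask & ACE4_EXECUTE:
            mode |= exec_bit
    return mode
```

   For an ACL that allows OWNER@ read, write, append, and execute,
   GROUP@ read and execute, and EVERYONE@ read, mode_from_acl yields
   0o754.  The MODE4_SUID, MODE4_SGID, and MODE4_SVTX bits are outside
   the scope of this method and are left at zero by the sketch.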

   Some server implementations may have a concept of "objects without
   ACLs", meaning that all permissions are granted and denied according
   to the mode attribute, and that no ACL attribute is stored for that
   object.  If an ACL attribute is requested of such a server, the
   server SHOULD return an ACL that does not conflict with the mode;
   that is to say, the ACL returned SHOULD represent the nine low-order
   bits of the mode attribute (MODE4_R*, MODE4_W*, MODE4_X*) as
   described in Section 6.3.2.

   For other server implementations, the ACL attribute is always present
   for every object.  Such servers SHOULD store at least the three
   high-order bits of the mode attribute (MODE4_SUID, MODE4_SGID,
   MODE4_SVTX).  The server SHOULD return a mode attribute if one is
   requested, and the low-order nine bits of the mode (MODE4_R*,
   MODE4_W*, MODE4_X*) MUST match the result of applying the method in
   Section 6.3.2 to the ACL attribute.

6.4.3.  Creating New Objects

   If a server supports any ACL attributes, it may use the ACL
   attributes on the parent directory to compute an initial ACL
   attribute for a newly created object.  This will be referred to as
   the inherited ACL within this section.  The act of adding one or more
   ACEs to the inherited ACL that are based upon ACEs in the parent
   directory's ACL will be referred to as inheriting an ACE within this
   section.

   Implementors should standardize on what the behavior of CREATE and
   OPEN must be depending on the presence or absence of the mode and ACL
   attributes.

   1.  If just the mode is given in the call:

       In this case, inheritance SHOULD take place, but the mode MUST be
       applied to the inherited ACL as described in Section 6.4.1.1,
       thereby modifying the ACL.

   2.  If just the ACL is given in the call:

       In this case, inheritance SHOULD NOT take place, and the ACL as
       defined in the CREATE or OPEN will be set without modification,
       and the mode modified as in Section 6.4.1.2.

   3.  If both mode and ACL are given in the call:

       In this case, inheritance SHOULD NOT take place, and both
       attributes will be set as described in Section 6.4.1.3.

   4.  If neither mode nor ACL is given in the call:

       In the case where an object is being created without any initial
       attributes at all, e.g., an OPEN operation with an opentype4 of
       OPEN4_CREATE and a createmode4 of EXCLUSIVE4, inheritance SHOULD
       NOT take place (note that EXCLUSIVE4_1 is a better choice of
       createmode4, since it does permit initial attributes).  Instead,
       the server SHOULD set permissions to deny all access to the newly
       created object.  It is expected that the appropriate client will
       set the desired attributes in a subsequent SETATTR operation, and
       the server SHOULD allow that operation to succeed, regardless of
       what permissions the object is created with.  For example, an
       empty ACL denies all permissions, but the server should allow the
       owner's SETATTR to succeed even though WRITE_ACL is implicitly
       denied.

       In other cases, inheritance SHOULD take place, and no
       modifications to the ACL will happen.  The mode attribute, if
       supported, MUST be as computed in Section 6.3.2, with the
       MODE4_SUID, MODE4_SGID and MODE4_SVTX bits clear.  If no
       inheritable ACEs exist on the parent directory, the rules for
       creating acl, dacl or sacl attributes are implementation defined.
       If either the dacl or sacl attribute is supported, then the
       ACL4_DEFAULTED flag SHOULD be set on the newly created
       attributes.

6.4.3.1.  The Inherited ACL

   If the object being created is not a directory, the inherited ACL
   SHOULD NOT inherit ACEs from the parent directory ACL unless the
   ACE4_FILE_INHERIT_ACE flag is set.

   If the object being created is a directory, the inherited ACL should
   inherit all inheritable ACEs from the parent directory, that is,
   those that have the ACE4_FILE_INHERIT_ACE or
   ACE4_DIRECTORY_INHERIT_ACE flag set.  If the inheritable ACE has
   ACE4_FILE_INHERIT_ACE set, but ACE4_DIRECTORY_INHERIT_ACE is clear,
   the inherited ACE on the newly created directory MUST have the
   ACE4_INHERIT_ONLY_ACE flag set to prevent the directory from being
   affected by ACEs meant for non-directories.

   When a new directory is created, the server MAY split any inherited
   ACE which is both inheritable and effective (in other words, which
   has neither ACE4_INHERIT_ONLY_ACE nor ACE4_NO_PROPAGATE_INHERIT_ACE
   set), into two ACEs, one with no inheritance flags, and one with
   ACE4_INHERIT_ONLY_ACE set.  (In the case of a dacl or sacl attribute,
   both of those ACEs SHOULD also have the ACE4_INHERITED_ACE flag set.)
   This makes it simpler to modify the effective permissions on the
   directory without modifying the ACE which is to be inherited to the
   new directory's children.

6.4.3.2.  Automatic Inheritance

   The acl attribute consists only of an array of ACEs, but the sacl
   (Section 6.2.3) and dacl (Section 6.2.2) attributes also include an
   additional flag field.

   struct nfsacl41 {
           aclflag4        na41_flag;
           nfsace4         na41_aces<>;
   };

   The flag field applies to the entire sacl or dacl; three flag values
   are defined:

   const ACL4_AUTO_INHERIT         = 0x00000001;
   const ACL4_PROTECTED            = 0x00000002;
   const ACL4_DEFAULTED            = 0x00000004;

   and all other bits must be cleared.  The ACE4_INHERITED_ACE flag may
   be set in the ACEs of the sacl or dacl (whereas it must always be
   cleared in the acl).

   Together these features allow a server to support automatic
   inheritance, which we now explain in more detail.

   Inheritable ACEs are normally inherited by child objects only at the
   time that the child objects are created; later modifications to
   inheritable ACEs do not result in modifications to inherited ACEs on
   descendants.

   However, the dacl and sacl provide an OPTIONAL mechanism which allows
   a client application to propagate changes to inheritable ACEs to an
   entire directory hierarchy.

   A server that supports this performs inheritance at object creation
   time in the normal way, and SHOULD set the ACE4_INHERITED_ACE flag on
   any inherited ACEs as they are added to the new object.

   A client application such as an ACL editor may then propagate changes
   to inheritable ACEs on a directory by recursively traversing that
   directory's descendants and modifying each ACL encountered to remove
   any ACEs with the ACE4_INHERITED_ACE flag and to replace them by the
   new inheritable ACEs (also with the ACE4_INHERITED_ACE flag set).  It
   uses the existing ACE inheritance flags in the obvious way to decide
   which ACEs to propagate.
   (Note that it may encounter further inheritable ACEs when descending
   the directory hierarchy, and that those will also need to be taken
   into account when propagating inheritable ACEs to further
   descendants.)

   The reach of this propagation may be limited in two ways: first,
   automatic inheritance is not performed from any directory ACL that
   has the ACL4_AUTO_INHERIT flag cleared; and second, automatic
   inheritance stops wherever an ACL with the ACL4_PROTECTED flag is
   set, preventing modification of that ACL and also (if the ACL is set
   on a directory) of the ACL on any of the object's descendants.

   This propagation is performed independently for the sacl and the dacl
   attributes; thus the ACL4_AUTO_INHERIT and ACL4_PROTECTED flags may
   be independently set for the sacl and the dacl, and propagation of
   one type of acl may continue down a hierarchy even where propagation
   of the other acl has stopped.

   New objects should be created with a dacl and a sacl that both have
   the ACL4_PROTECTED flag cleared and the ACL4_AUTO_INHERIT flag set to
   the same value as that on, respectively, the sacl or dacl of the
   parent object.

   Both the dacl and sacl attributes are RECOMMENDED, and a server may
   support one without supporting the other.

   A server that supports both the old acl attribute and one or both of
   the new dacl or sacl attributes must do so in such a way as to keep
   all three attributes consistent with each other.  Thus the ACEs
   reported in the acl attribute should be the union of the ACEs
   reported in the dacl and sacl attributes, except that the
   ACE4_INHERITED_ACE flag must be cleared from the ACEs in the acl.
   And of course a client that queries only the acl will be unable to
   determine the values of the sacl or dacl flag fields.
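
   The consistency rule for the acl, dacl, and sacl attributes can be
   sketched in Python as follows.  This is an illustration only: the
   Ace tuple and the function name are hypothetical, and listing the
   dacl ACEs before the sacl ACEs is an assumption of the sketch, since
   the text requires only that the acl report the union of the two.

```python
from typing import List, NamedTuple, Sequence

ACE4_INHERITED_ACE = 0x00000080


class Ace(NamedTuple):
    """Hypothetical stand-in for the nfsace4 structure."""
    ace_type: int
    flag: int
    access_mask: int
    who: str


def acl_from_dacl_and_sacl(dacl_aces: Sequence[Ace],
                           sacl_aces: Sequence[Ace]) -> List[Ace]:
    """Build the ACE array reported in the acl attribute.

    The result holds every ACE of the dacl and the sacl, with the
    ACE4_INHERITED_ACE flag cleared.  The relative order of the two
    groups is an assumption of this sketch.
    """
    combined = list(dacl_aces) + list(sacl_aces)
    return [ace._replace(flag=ace.flag & ~ACE4_INHERITED_ACE)
            for ace in combined]
```

   The flag field of the dacl and sacl (na41_flag) has no counterpart
   in the acl and is simply dropped by such a conversion.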

   When a client performs a SETATTR for the acl attribute, the server
   SHOULD set the ACL4_PROTECTED flag to true on both the sacl and the
   dacl.  By using the acl attribute, as opposed to the dacl or sacl
   attributes, the client signals that it may not understand automatic
   inheritance, and thus cannot be trusted to set an ACL for which
   automatic inheritance would make sense.

   When a client application queries an ACL, modifies it, and sets it
   again, it should leave any ACEs marked with ACE4_INHERITED_ACE
   unchanged, in their original order, at the end of the ACL.  If the
   application is unable to do this, it should set the ACL4_PROTECTED
   flag.  This behavior is not enforced by servers, but violations of
   this rule may lead to unexpected results when applications perform
   automatic inheritance.

   If a server also supports the mode attribute, it SHOULD set the mode
   in such a way that leaves inherited ACEs unchanged, in their original
   order, at the end of the ACL.  If it is unable to do so, it SHOULD
   set the ACL4_PROTECTED flag on the file's dacl.

   Finally, in the case where the request that creates a new file or
   directory does not also set permissions for that file or directory,
   and there are also no ACEs to inherit from the parent directory,
   then the server's choice of ACL for the new object is
   implementation-dependent.  In this case, the server SHOULD set the
   ACL4_DEFAULTED flag on the ACL it chooses for the new object.  An
   application performing automatic inheritance takes the ACL4_DEFAULTED
   flag as a sign that the ACL should be completely replaced by one
   generated using the automatic inheritance rules.

7.  Single-server Namespace

   This chapter describes the NFSv4 single-server namespace.
   Single-server namespaces may be presented directly to clients, or
   they may be used as a basis to form larger multi-server namespaces
   (e.g., site-wide or organization-wide) to be presented to clients, as
   described in Section 11.

7.1.  Server Exports

   On a UNIX server, the namespace describes all the files reachable by
   pathnames under the root directory or "/".  On a Windows server the
   namespace constitutes all the files on disks named by mapped disk
   letters.  NFS server administrators rarely make the entire server's
   file system namespace available to NFS clients.  More often portions
   of the namespace are made available via an "export" feature.  In
   previous versions of the NFS protocol, the root filehandle for each
   export was obtained through the MOUNT protocol; the client sent a
   string that identified the export name within the namespace and the
   server returned the root filehandle for that export.  The MOUNT
   protocol also provided an EXPORTS procedure that enumerated the
   server's exports.

7.2.  Browsing Exports

   The NFSv4.1 protocol provides a root filehandle that clients can use
   to obtain filehandles for the exports of a particular server, via a
   series of LOOKUP operations within a COMPOUND, to traverse a path.  A
   common user experience is to use a graphical user interface (perhaps
   a file "Open" dialog window) to find a file via progressive browsing
   through a directory tree.  The client must be able to move from one
   export to another export via single-component, progressive LOOKUP
   operations.

   This style of browsing is not well supported by the NFSv3 protocol.
   In NFSv3, the client expects all LOOKUP operations to remain within a
   single server file system.
   For example, the device attribute will not change.  This prevents a
   client from taking namespace paths that span exports.

   In the case of NFSv3, an automounter on the client can obtain a
   snapshot of the server's namespace using the EXPORTS procedure of the
   MOUNT protocol.  If it understands the server's pathname syntax, it
   can create an image of the server's namespace on the client.  The
   parts of the namespace that are not exported by the server are filled
   in with directories that might be constructed similarly to an NFSv4.1
   "pseudo file system" (see Section 7.3) that allows the user to browse
   from one mounted file system to another.  There is a drawback to this
   representation of the server's namespace on the client: it is static.
   If the server administrator adds a new export, the client will be
   unaware of it.

7.3.  Server Pseudo File System

   NFSv4.1 servers avoid this namespace inconsistency by presenting all
   the exports for a given server within the framework of a single
   namespace, for that server.  An NFSv4.1 client uses LOOKUP and
   READDIR operations to browse seamlessly from one export to another.

   Where there are portions of the server namespace that are not
   exported, clients require some way of traversing those portions to
   reach actual exported file systems.  A technique that servers may use
   to provide for this is to bridge unexported portions of the namespace
   via a "pseudo file system" that provides a view of exported
   directories only.  A pseudo file system has a unique fsid and behaves
   like a normal, read-only file system.

   Based on the construction of the server's namespace, it is possible
   that multiple pseudo file systems may exist.
   For example,

      /a         pseudo file system
      /a/b       real file system
      /a/b/c     pseudo file system
      /a/b/c/d   real file system

   Each of the pseudo file systems is considered a separate entity and
   therefore MUST have its own fsid, unique among all the fsids for that
   server.

7.4.  Multiple Roots

   Certain operating environments are sometimes described as having
   "multiple roots".  In such environments individual file systems are
   commonly represented by disk or volume names.  NFSv4 servers for
   these platforms can construct a pseudo file system above these root
   names so that disk letters or volume names are simply directory names
   in the pseudo root.

7.5.  Filehandle Volatility

   The nature of the server's pseudo file system is that it is a logical
   representation of file system(s) available from the server.
   Therefore, the pseudo file system is most likely constructed
   dynamically when the server is first instantiated.  It is expected
   that the pseudo file system may not have an on disk counterpart from
   which persistent filehandles could be constructed.  Even though it is
   preferable that the server provide persistent filehandles for the
   pseudo file system, the NFS client should expect that pseudo file
   system filehandles are volatile.  This can be confirmed by checking
   the associated "fh_expire_type" attribute for those filehandles in
   question.  If the filehandles are volatile, the NFS client must be
   prepared to recover a filehandle value (e.g. with a series of LOOKUP
   operations) when receiving an error of NFS4ERR_FHEXPIRED.

   Because it is quite likely that servers will implement pseudo file
   systems using volatile filehandles, clients need to be prepared for
   them, rather than assuming that all filehandles will be persistent.

7.6.
Exported Root

   If the server's root file system is exported, one might conclude that
   a pseudo file system is unneeded.  This is not necessarily so.
   Assume the following file systems on a server:

      /      fs1  (exported)
      /a     fs2  (not exported)
      /a/b   fs3  (exported)

   Because fs2 is not exported, fs3 cannot be reached with simple
   LOOKUPs.  The server must bridge the gap with a pseudo file system.

7.7.  Mount Point Crossing

   The server file system environment may be constructed in such a way
   that one file system contains a directory which is 'covered' or
   mounted upon by a second file system.  For example:

      /a/b       (file system 1)
      /a/b/c/d   (file system 2)

   The pseudo file system for this server may be constructed to look
   like:

      /          (place holder/not exported)
      /a/b       (file system 1)
      /a/b/c/d   (file system 2)

   It is the server's responsibility to present a pseudo file system
   that is complete to the client.  If the client sends a lookup request
   for the path "/a/b/c/d", the server's response is the filehandle of
   the root of the file system "/a/b/c/d".  In previous versions of the
   NFS protocol, the server would respond with the filehandle of
   directory "/a/b/c/d" within the file system "/a/b".

   The NFS client will be able to determine if it crosses a server mount
   point by a change in the value of the "fsid" attribute.

7.8.  Security Policy and Namespace Presentation

   Because NFSv4 clients possess the ability to change the security
   mechanisms used, after determining what is allowed, by using SECINFO
   and SECINFO_NO_NAME, the server SHOULD NOT present a different view
   of the namespace based on the security mechanism being used by a
   client.
   Instead, it should present a consistent view and return
   NFS4ERR_WRONGSEC if an attempt is made to access data with an
   inappropriate security mechanism.

   If security considerations make it necessary to hide the existence of
   a particular file system, as opposed to all of the data within it,
   the server can apply the security policy of a shared resource in the
   server's namespace to components of the resource's ancestors.  For
   example:

      /                      (place holder/not exported)
      /a/b                   (file system 1)
      /a/b/MySecretProject   (file system 2)

   The /a/b/MySecretProject directory is a real file system and is the
   shared resource.  Suppose the security policy for
   /a/b/MySecretProject is Kerberos with integrity and it is desired to
   limit knowledge of the existence of this file system.  In this case,
   the server should apply the same security policy to /a/b.  This
   allows for knowledge of the existence of a file system to be secured
   when desirable.

   For the case of the use of multiple, disjoint security mechanisms in
   the server's resources, applying that sort of policy would leave the
   higher-level file system with no usable security flavor, which would
   make that higher-level file system inaccessible.  Therefore, that
   sort of configuration is not compatible with hiding the existence (as
   opposed to the contents) from clients using multiple disjoint sets of
   security flavors.

   In other circumstances, a desirable policy is for the security of a
   particular object in the server's namespace to include the union of
   all security mechanisms of all direct descendants.
   A common and convenient practice, unless strong security requirements
   dictate otherwise, is to make all of the pseudo file system
   accessible by all of the valid security mechanisms.

   Where there is concern about the security of data on the network,
   clients should use strong security mechanisms to access the pseudo
   file system in order to prevent man-in-the-middle attacks.

8.  State Management

   Integrating locking into the NFS protocol necessarily causes it to be
   stateful.  With the inclusion of such features as share reservations,
   file and directory delegations, recallable layouts, and support for
   mandatory byte-range locking, the protocol becomes substantially more
   dependent on proper management of state than the traditional
   combination of NFS and NLM [35].  These features include expanded
   locking facilities, which provide some measure of interclient
   exclusion, but the state also offers features not readily providable
   using a stateless model.  There are three components to making this
   state manageable:

   o  Clear division between client and server

   o  Ability to reliably detect inconsistency in state between client
      and server

   o  Simple and robust recovery mechanisms

   In this model, the server owns the state information.  The client
   requests changes in locks and the server responds with the changes
   made.  Non-client-initiated changes in locking state are infrequent.
   The client receives prompt notification of such changes and can
   adjust its view of the locking state to reflect the server's changes.

   Individual pieces of state created by the server and passed to the
   client at its request are represented by 128-bit stateids.
   These stateids may represent a particular open file, a set of
   byte-range locks held by a particular owner, or a recallable
   delegation of privileges to access a file in particular ways, or at a
   particular location.

   In all cases, there is a transition from the most general information
   which represents a client as a whole to the eventual lightweight
   stateid used for most client and server locking interactions.  The
   details of this transition will vary with the type of object but it
   always starts with a client ID.

8.1.  Client and Session ID

   A client must establish a client ID (see Section 2.4) and then one or
   more session IDs (see Section 2.10) before performing any operations
   to open, lock, delegate, or obtain a layout for a file object.  Each
   session ID is associated with a specific client ID, and thus serves
   as a shorthand reference to an NFSv4.1 client.

   For some types of locking interactions, the client will represent
   some number of internal locking entities called "owners", which
   normally correspond to processes internal to the client.  For other
   types of locking-related objects, such as delegations and layouts, no
   such intermediate entities are provided for, and the locking-related
   objects are considered to be transferred directly between the server
   and a unitary client.

8.2.  Stateid Definition

   When the server grants a lock of any type (including opens,
   byte-range locks, delegations, and layouts) it responds with a unique
   stateid that represents a set of locks (often a single lock) for the
   same file, of the same type, and sharing the same ownership
   characteristics.  Thus opens of the same file by different
   open-owners each have an identifying stateid.  Similarly, each set of
   byte-range locks on a file owned by a specific lock-owner has its own
   identifying stateid.  Delegations and layouts also have associated
   stateids by which they may be referenced.  The stateid is used as a
   shorthand reference to a lock or set of locks; given a stateid, the
   server can determine the associated state-owner or state-owners (in
   the case of an open-owner/lock-owner pair) and the associated
   filehandle.  When stateids are used, the current filehandle must be
   the one associated with that stateid.

   All stateids associated with a given client ID are associated with a
   common lease which represents the claim of those stateids and the
   objects they represent to be maintained by the server.  See
   Section 8.3 for a discussion of leases.

   The server may assign stateids independently for different clients.
   A stateid with the same bit pattern for one client may designate an
   entirely different set of locks for a different client.  The stateid
   is always interpreted with respect to the client ID associated with
   the current session.  Stateids apply to all sessions associated with
   the given client ID and the client may use a stateid obtained from
   one session on another session associated with the same client ID.

8.2.1.  Stateid Types

   With the exception of special stateids (see Section 8.2.3), each
   stateid represents locking objects of one of a set of types defined
   by the NFSv4.1 protocol.  Note that in all these cases, where we
   speak of a guarantee, it is understood there are situations such as a
   client restart, or lock revocation, that allow the guarantee to be
   voided.

   o  Stateids may represent opens of files.

      Each stateid in this case represents the open state for a given
      client ID/open-owner/filehandle triple.
      Such stateids are subject to change (with consequent incrementing
      of the stateid's seqid) in response to OPENs that result in an
      upgrade and to OPEN_DOWNGRADE operations.

   o  Stateids may represent sets of byte-range locks.

      All locks held on a particular file by a particular owner and
      obtained under the aegis of a particular open file are associated
      with a single stateid, with the seqid being incremented whenever
      LOCK and LOCKU operations affect that set of locks.

   o  Stateids may represent file delegations, which are recallable
      guarantees by the server to the client that other clients will
      not reference, or will not modify, a particular file until the
      delegation is returned.  In NFSv4.1, file delegations may be
      obtained on both regular and non-regular files.

      A stateid represents a single delegation held by a client for a
      particular filehandle.

   o  Stateids may represent directory delegations, which are recallable
      guarantees by the server to the client that other clients will
      not modify the directory until the delegation is returned.

      A stateid represents a single delegation held by a client for a
      particular directory filehandle.

   o  Stateids may represent layouts, which are recallable guarantees by
      the server to the client that particular files may be accessed
      via an alternate data access protocol at specific locations.  Such
      access is limited to particular sets of byte ranges and may
      proceed until those byte ranges are reduced or the layout is
      returned.

      A stateid represents the set of all layouts held by a particular
      client for a particular filehandle with a given layout type.
      The seqid is updated as the layouts of that set change, with
      layout-stateid-changing operations such as LAYOUTGET and
      LAYOUTRETURN.

8.2.2.  Stateid Structure

   Stateids are divided into two fields, a 96-bit "other" field
   identifying the specific set of locks and a 32-bit "seqid" sequence
   value.  Except in the case of special stateids (see Section 8.2.3), a
   particular value of the "other" field denotes a set of locks of the
   same type (for example byte-range locks, opens, delegations, or
   layouts), for a specific file or directory, and sharing the same
   ownership characteristics.  The seqid designates a specific instance
   of such a set of locks, and is incremented to indicate changes in
   such a set of locks, whether by the addition or deletion of locks
   from the set, a change in the byte-range they apply to, or an upgrade
   or downgrade in the type of one or more locks.

   When such a set of locks is first created the server returns a
   stateid with seqid value of one.  On subsequent operations which
   modify the set of locks the server is required to increment the seqid
   field by one (1) whenever it returns a stateid for the same state-
   owner/file/type combination and there is some change in the set of
   locks actually designated.  In this case the server will return a
   stateid with an "other" field the same as previously used for that
   state-owner/file/type combination, with an incremented seqid field.
   This pattern continues until the seqid is incremented past
   NFS4_UINT32_MAX, at which point one (not zero) is the next seqid
   value.
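
   The field layout and the increment-and-wrap rule just described can
   be illustrated with a minimal Python sketch.  The names Stateid,
   first_stateid, and after_change are hypothetical and chosen only for
   illustration; they are not defined by the protocol.

```python
from dataclasses import dataclass

NFS4_UINT32_MAX = 0xFFFFFFFF  # largest value of the 32-bit seqid field


@dataclass(frozen=True)
class Stateid:
    """Illustrative model of a stateid (hypothetical class name)."""
    other: bytes  # 96-bit (12-byte) identifier of the set of locks
    seqid: int    # 32-bit sequence value

    def __post_init__(self) -> None:
        if len(self.other) != 12:
            raise ValueError('"other" must be exactly 12 bytes')
        if not 0 <= self.seqid <= NFS4_UINT32_MAX:
            raise ValueError("seqid must fit in 32 bits")


def first_stateid(other: bytes) -> Stateid:
    """Stateid returned when a set of locks is first created (seqid one)."""
    return Stateid(other, 1)


def after_change(sid: Stateid) -> Stateid:
    """Stateid returned after a change to the designated set of locks."""
    # The "other" field is unchanged; the seqid advances by one and
    # wraps from NFS4_UINT32_MAX to one, never to zero.
    nxt = 1 if sid.seqid == NFS4_UINT32_MAX else sid.seqid + 1
    return Stateid(sid.other, nxt)
```

   In this sketch the wrap is handled in one place so that a server
   using it could never hand out a seqid of zero for an assigned
   stateid.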

   The purpose of incrementing the seqid is to allow the server to
   communicate to the client the order in which operations that
   modified locking state associated with a stateid have been processed
   and to make it possible for the client to send requests that are
   conditional on the set of locks not having changed since the stateid
   in question was returned.

   Except for layout stateids (Section 12.5.3), when a client sends a
   stateid to the server, it has two choices with regard to the seqid
   sent.  It may set the seqid to zero to indicate to the server that it
   wishes the most up-to-date seqid for that stateid's "other" field to
   be used.  This would be the common choice in the case of a stateid
   sent with a READ or WRITE operation.  It also may set a non-zero
   value, in which case the server checks if that seqid is the correct
   one.  In that case the server is required to return
   NFS4ERR_OLD_STATEID if the seqid is lower than the most current value
   and NFS4ERR_BAD_STATEID if the seqid is greater than the most current
   value.  This would be the common choice in the case of stateids sent
   with a CLOSE or OPEN_DOWNGRADE.  Because OPENs may be sent in
   parallel for the same owner, a client might close a file without
   knowing that an OPEN upgrade had been done by the server, changing
   the lock in question.  If CLOSE were sent with a zero seqid, the OPEN
   upgrade would be canceled before the client even received an
   indication that an upgrade had happened.

   When a stateid is sent by the server to the client as part of a
   callback operation, it is not subject to checking for a current seqid
   and returning NFS4ERR_OLD_STATEID.  This is because the client is not
   in a position to know the most up-to-date seqid and thus cannot
   verify it.
   Unless specially noted, the seqid value for a stateid sent by the
   server to the client as part of a callback is required to be zero,
   with NFS4ERR_BAD_STATEID returned if it is not.

   In making comparisons between seqids, both by the client in
   determining the order of operations and by the server in determining
   whether NFS4ERR_OLD_STATEID is to be returned, the possibility of
   the seqid being wrapped around past the NFS4_UINT32_MAX value needs
   to be taken into account.  When two seqid values are being compared,
   the total count of slots for all sessions associated with the current
   client is used to do this.  When one seqid value is less than this
   total slot count and another seqid value is greater than
   NFS4_UINT32_MAX minus the total slot count, the latter is to be
   treated as lower than the former, despite the fact that it is
   numerically greater.

8.2.3.  Special Stateids

   Stateid values whose "other" field is either all zeros or all ones
   are reserved.  They may not be assigned by the server but have
   special meanings defined by the protocol.  The particular meaning
   depends on whether the "other" field is all zeros or all ones and the
   specific value of the "seqid" field.

   The following combinations of "other" and "seqid" are defined in
   NFSv4.1:

   o  When "other" and "seqid" are both zero, the stateid is treated as
      a special anonymous stateid, which can be used in READ, WRITE, and
      SETATTR requests to indicate the absence of any open state
      associated with the request.  When an anonymous stateid value is
      used, and an existing open denies the form of access requested,
      then access will be denied to the request.  This stateid MUST NOT
      be used on operations to data servers (Section 13.6).

   o  When "other" and "seqid" are both all ones, the stateid is a
      special read bypass stateid.  When this value is used in WRITE or
      SETATTR, it is treated like the anonymous value.  When used in
      READ, the server MAY grant access, even if access would normally
      be denied to READ requests.  This stateid MUST NOT be used on
      operations to data servers.

   o  When "other" is zero and "seqid" is one, the stateid represents
      the current stateid, which is whatever value is the last stateid
      returned by an operation within the COMPOUND.  In the case of an
      OPEN, the stateid returned for the open file, and not the
      delegation, is used.  The stateid passed to the operation in place
      of the special value has its "seqid" value set to zero, except
      when the current stateid is used by the operation CLOSE or
      OPEN_DOWNGRADE.  If there is no operation in the COMPOUND which
      has returned a stateid value, the server MUST return the error
      NFS4ERR_BAD_STATEID.  As illustrated in Figure 6, if the value of
      a current stateid is a special stateid, and the stateid of an
      operation's arguments has "other" set to zero, and "seqid" set to
      one, then the server MUST return the error NFS4ERR_BAD_STATEID.

   o  When "other" is zero and "seqid" is NFS4_UINT32_MAX, the stateid
      represents a reserved stateid value defined to be invalid.  When
      this stateid is used, the server MUST return the error
      NFS4ERR_BAD_STATEID.

   If a stateid value is used which has all zeros or all ones in the
   "other" field, but does not match one of the cases above, the server
   MUST return the error NFS4ERR_BAD_STATEID.

   Special stateids, unlike other stateids, are not associated with
   individual client IDs or filehandles and can be used with all valid
   client IDs and filehandles.  In the case of a special stateid
   designating the current stateid, the current stateid value
   substituted for the special stateid is associated with a particular
   client ID and filehandle, and so, if it is used where the current
   filehandle does not match that associated with the current stateid,
   the operation to which the stateid is passed will return
   NFS4ERR_BAD_STATEID.

8.2.4.  Stateid Lifetime and Validation

   Stateids must remain valid until either a client restart or a server
   restart or until the client returns all of the locks associated with
   the stateid by means of an operation such as CLOSE or DELEGRETURN.
   If the locks are lost due to revocation, the stateid remains a valid
   designation of that revoked state until the client frees it by using
   FREE_STATEID.  Stateids associated with byte-range locks are an
   exception.  They remain valid even if a LOCKU frees all remaining
   locks, so long as the open file with which they are associated
   remains open, unless the client does a FREE_STATEID to cause the
   stateid to be freed.

   It should be noted that there are situations in which the client's
   locks become invalid, without the client requesting they be returned.
   These include lease expiration and a number of forms of lock
   revocation within the lease period.  It is important to note that in
   these situations, the stateid remains valid and the client can use it
   to determine the disposition of the associated lost locks.

   An "other" value must never be reused for a different purpose (i.e.
   different filehandle, owner, or type of locks) within the context of
   a single client ID.
   A server may retain the "other" value for the same purpose beyond the
   point where it may otherwise be freed, but if it does so, it must
   maintain "seqid" continuity with previous values.

   One mechanism that may be used to satisfy the requirement that the
   server recognize invalid and out-of-date stateids is for the server
   to divide the "other" field of the stateid into two fields:

   o  An index into a table of locking-state structures.

   o  A generation number which is incremented on each allocation of a
      table entry for a particular use.

   The server then stores in each table entry:

   o  The client ID with which the stateid is associated.

   o  The current generation number for the (at most one) valid stateid
      sharing this index value.

   o  The filehandle of the file on which the locks are taken.

   o  An indication of the type of stateid (open, byte-range lock, file
      delegation, directory delegation, layout).

   o  The last "seqid" value returned corresponding to the current
      "other" value.

   o  An indication of the current status of the locks associated with
      this stateid; in particular, whether these have been revoked and
      if so, for what reason.

   With this information, an incoming stateid can be validated and the
   appropriate error returned when necessary.  Special and non-special
   stateids are handled separately.  (See Section 8.2.3 for a discussion
   of special stateids.)

   Note that stateids are implicitly qualified by the current client ID,
   as derived from the client ID associated with the current session.
   Note, however, that the semantics of the session will prevent
   stateids associated with a previous client or server instance from
   being analyzed by this procedure.
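
   The table-based mechanism above might be sketched as follows in
   Python.  The split of the 96-bit "other" field into a 32-bit index
   and a 64-bit generation is an assumption made for this example, since
   the text does not fix the field widths, and the names pack_other,
   unpack_other, StateEntry, and find_entry are hypothetical.

```python
import struct
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Assumed split of the 96-bit "other" field: 32-bit table index followed
# by a 64-bit generation number (4 + 8 = 12 bytes, big-endian).
_OTHER = struct.Struct(">IQ")


def pack_other(index: int, generation: int) -> bytes:
    """Build the 12-byte "other" field from index and generation."""
    return _OTHER.pack(index, generation)


def unpack_other(other: bytes) -> Tuple[int, int]:
    """Split a 12-byte "other" field into (index, generation)."""
    return _OTHER.unpack(other)


@dataclass
class StateEntry:
    """One table entry holding the items listed in the text."""
    client_id: int
    generation: int
    filehandle: bytes
    kind: str  # "open", "lock", "file_deleg", "dir_deleg", or "layout"
    last_seqid: int = 1
    revoked_reason: Optional[str] = None  # None while the locks are intact


def find_entry(table: List[Optional[StateEntry]],
               other: bytes) -> Optional[StateEntry]:
    """Return the live entry named by "other", or None if it is stale."""
    index, generation = unpack_other(other)
    if index >= len(table):
        return None  # index outside the table
    entry = table[index]
    if entry is None or entry.generation != generation:
        return None  # slot is free or has been reallocated since
    return entry
```

   Because the generation changes each time a slot is reallocated, a
   stateid that named an earlier use of the same slot no longer matches
   and can be recognized as out of date.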

   If a server restart has resulted in an invalid client ID or an
   invalid session ID, SEQUENCE will return an error and the operation
   that takes a stateid as an argument will never be processed.

   If there has been a server restart where there is a persistent
   session, and all leased state has been lost, then the session in
   question will, although valid, be marked as dead, and any operation
   not satisfied by means of the reply cache will receive the error
   NFS4ERR_DEADSESSION, and thus not be processed as indicated below.

   When a stateid is being tested, and the "other" field is all zeros or
   all ones, a check that the "other" and "seqid" fields match a defined
   combination for a special stateid is done and the results determined
   as follows:

   o  If the "other" and "seqid" fields do not match a defined
      combination associated with a special stateid, the error
      NFS4ERR_BAD_STATEID is returned.

   o  If the special stateid is one designating the current stateid, and
      there is a current stateid, then the current stateid is
      substituted for the special stateid and the checks appropriate to
      non-special stateids are performed.

   o  If the combination is valid in general but is not appropriate to
      the context in which the stateid is used (e.g. an all-zero stateid
      is used when an open stateid is required in a LOCK operation), the
      error NFS4ERR_BAD_STATEID is also returned.

   o  Otherwise, the check is completed and the special stateid is
      accepted as valid.
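
   The first step of this check, recognizing which (if any) defined
   special combination a stateid carries, can be sketched in Python as
   below.  The enumeration Special and the function classify are
   hypothetical names used only for this illustration; the combinations
   themselves are those listed in Section 8.2.3.

```python
from enum import Enum

NFS4_UINT32_MAX = 0xFFFFFFFF
ALL_ZEROS = bytes(12)        # 96-bit "other" field of all zeros
ALL_ONES = b"\xff" * 12      # 96-bit "other" field of all ones


class Special(Enum):
    NOT_SPECIAL = "not special"   # ordinary, server-assigned stateid
    ANONYMOUS = "anonymous"       # other = 0, seqid = 0
    READ_BYPASS = "read bypass"   # other and seqid all ones
    CURRENT = "current"           # other = 0, seqid = 1
    INVALID = "invalid"           # other = 0, seqid = NFS4_UINT32_MAX
    UNDEFINED = "undefined"       # reserved "other", no defined meaning


def classify(other: bytes, seqid: int) -> Special:
    """Map an ("other", seqid) pair to the special stateid it denotes."""
    if other == ALL_ZEROS:
        if seqid == 0:
            return Special.ANONYMOUS
        if seqid == 1:
            return Special.CURRENT
        if seqid == NFS4_UINT32_MAX:
            return Special.INVALID
        return Special.UNDEFINED
    if other == ALL_ONES:
        if seqid == NFS4_UINT32_MAX:
            return Special.READ_BYPASS
        return Special.UNDEFINED
    return Special.NOT_SPECIAL
```

   A server following the text would answer both INVALID and UNDEFINED
   with NFS4ERR_BAD_STATEID, and would go on to apply the context checks
   described above to the remaining special values.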

   When a stateid is being tested, and the "other" field is neither all
   zeros nor all ones, the following procedure could be used to validate
   an incoming stateid and return an appropriate error, when necessary,
   assuming that the "other" field would be divided into a table index
   and an entry generation.

   o  If the table index field is outside the range of the associated
      table, return NFS4ERR_BAD_STATEID.

   o  If the selected table entry is of a different generation than that
      specified in the incoming stateid, return NFS4ERR_BAD_STATEID.

   o  If the selected table entry does not match the current filehandle,
      return NFS4ERR_BAD_STATEID.

   o  If the client ID in the table entry does not match the client ID
      associated with the current session, return NFS4ERR_BAD_STATEID.

   o  If the stateid represents revoked state, then return
      NFS4ERR_EXPIRED, NFS4ERR_ADMIN_REVOKED, or NFS4ERR_DELEG_REVOKED,
      as appropriate.

   o  If the stateid type is not valid for the context in which the
      stateid appears, return NFS4ERR_BAD_STATEID.  Note that a stateid
      may be valid in general, as would be reported by the TEST_STATEID
      operation, but be invalid for a particular operation, as, for
      example, when a stateid which doesn't represent byte-range locks
      is passed to the non-from_open case of LOCK or to LOCKU, or when a
      stateid which does not represent an open is passed to CLOSE or
      OPEN_DOWNGRADE.  In such cases, the server MUST return
      NFS4ERR_BAD_STATEID.

   o  If the "seqid" field is not zero, and it is greater than the
      current sequence value corresponding to the current "other" field,
      return NFS4ERR_BAD_STATEID.

   o  If the "seqid" field is not zero, and it is less than the current
      sequence value corresponding to the current "other" field, return
      NFS4ERR_OLD_STATEID.

   o  Otherwise, the stateid is valid and the table entry should contain
      any additional information about the type of stateid and
      information associated with that particular type of stateid, such
      as the associated set of locks (including open-owner and
      lock-owner information) and details of the specific locks, such as
      open modes and byte ranges.

8.2.5.  Stateid Use for I/O Operations

   Clients performing I/O operations need to select an appropriate
   stateid based on the locks (including opens and delegations) held by
   the client and the various types of state-owners issuing the I/O
   requests.  SETATTR operations which change the file size are treated
   like I/O operations in this regard.

   The following rules, applied in order of decreasing priority, govern
   the selection of the appropriate stateid.  In following these rules,
   the client will only consider locks of which it has actually received
   notification by an appropriate operation response or callback.  Note
   that the rules are slightly different in the case of I/O to data
   servers when file layouts are being used (see Section 13.9.1).

   o  If the client holds a delegation for the file in question, the
      delegation stateid SHOULD be used.

   o  Otherwise, if the entity (e.g. process) corresponding to the
      lock-owner issuing the I/O has a lock stateid for the associated
      open file, then the lock stateid for that lock-owner and open file
      SHOULD be used.

   o  If there is no lock stateid, then the open stateid for the open
      file in question SHOULD be used.

   o  Finally, if none of the above applies, then a special stateid
      SHOULD be used.

   Ignoring these rules may result in situations in which the server
   does not have information necessary to properly process the request.
   For example, when mandatory byte-range locks are in effect, if the
   stateid does not indicate the proper lock-owner, via a lock stateid,
   a request might be avoidably rejected.

   The server, however, should not try to enforce these ordering rules
   and should use whatever information is available to properly process
   I/O requests.  In particular, when a client has a delegation for a
   given file, the server SHOULD take note of this fact in processing a
   request, even if it is sent with a special stateid.

8.2.6.  Stateid Use for SETATTR Operations

   Because each operation is associated with a session ID and from that
   the client ID can be determined, operations do not need to include a
   stateid for the server to be able to determine whether they should
   cause a delegation to be recalled or are to be treated as done within
   the scope of the delegation.

   In the case of SETATTR operations, a stateid is present.  In cases
   other than those which set the file size, the client may send either
   a special stateid or, when a delegation is held for the file in
   question, a delegation stateid.  While the server SHOULD validate the
   stateid and may use the stateid to optimize the determination as to
   whether a delegation is held, it SHOULD note the presence of a
   delegation even when a special stateid is sent, and MUST accept a
   valid delegation stateid when sent.

8.3.
Lease Renewal 8874 8875 The purpose of a lease is to allow the client to indicate to the 8876 server, in a low-overhead way, that it is active, and thus that the 8877 server is to retain the client's locks. This arrangement allows the 8878 server to remove stale locking-related objects that are held by a 8879 client that has crashed or is otherwise unreachable, once the 8880 relevant lease expires. This in turn allows other clients to obtain 8881 conflicting locks without being delayed indefinitely by inactive or 8882 unreachable clients. It is not a mechanism for cache consistency and 8883 lease renewals may not be denied if the lease interval has not 8884 expired. 8885 8886 Since each session is associated with a specific client (identified 8887 by the client's client ID), any operation sent on that session is an 8888 indication that the associated client is reachable. When a request 8889 is sent for a given session, successful execution of a SEQUENCE 8890 operation (or successful retrieval of the result of SEQUENCE from the 8891 reply cache) on an unexpired lease will result in the lease being 8892 implicitly renewed, for the standard renewal period (equal to the 8893 lease_time attribute). 8894 8895 If the client ID's lease has not expired when the server receives a 8896 SEQUENCE operation, then the server MUST renew the lease. If the 8897 client ID's lease has expired when the server receives a SEQUENCE 8898 operation, the server MAY renew the lease; this depends on whether 8899 any state was revoked as a result of the client's failure to renew 8900 8901 8902 8903 Shepler, et al. Expires February 23, 2009 [Page 159] 8904 8905 Internet-Draft NFSv4.1 August 2008 8906 8907 8908 the lease before expiration. 8909 8910 Absent other activity that would renew the lease, a COMPOUND 8911 consisting of a single SEQUENCE operation will suffice. 
The client should also take communication-related delays into account
and take steps to ensure that the renewal messages actually reach the
server in good time.  For example:

o  When trunking is in effect, the client should consider issuing
   multiple requests on different connections, in order to ensure
   that renewal occurs, even in the event of blockage in the path
   used for one of those connections.

o  Transport retransmission delays might become so large as to
   approach or exceed the length of the lease period.  This may be
   particularly likely when the server is unresponsive due to a
   restart; see Section 8.4.2.1.  If the client implementation is not
   careful, transport retransmission delays can result in the client
   failing to detect a server restart before the grace period ends.
   The scenario is that the client is using a transport with
   exponential backoff, such that the maximum retransmission timeout
   exceeds both the grace period and the lease_time attribute.  A
   network partition causes the client's connection's retransmission
   interval to back off, and even after the partition heals, the next
   transport-level retransmission is sent after the server has
   restarted and its grace period ends.

   The client MUST either recover from the ensuing NFS4ERR_NO_GRACE
   errors, or it MUST ensure that despite transport-level
   retransmission intervals that exceed the lease_time, nonetheless a
   SEQUENCE operation is sent that renews the lease before
   expiration.  The client can achieve this by associating a new
   connection with the session, and sending a SEQUENCE operation on
   it.  However, if the attempt to establish a new connection is
   delayed for some reason (e.g.
   exponential backoff of the connection establishment packets), the
   client will have to abort the connection establishment attempt
   before the lease expires, and attempt to reconnect.

If the server renews the lease upon receiving a SEQUENCE operation,
the server MUST NOT allow the lease to expire while the rest of the
operations in the COMPOUND procedure's request are still executing.
Once the last operation has finished, and the response to COMPOUND
has been sent, the server MUST set the lease to expire no sooner than
the sum of the current time and the value of the lease_time
attribute.

A client ID's lease can expire when it has been at least the lease
interval (lease_time) since the last lease-renewing SEQUENCE
operation was sent on any of the client ID's sessions and there are
no active COMPOUND operations on any such sessions.

Because the SEQUENCE operation is the basic mechanism to renew a
lease, and because it must be done at least once for each lease
period, it is the natural mechanism whereby the server will inform
the client of changes in the lease status that the client needs to be
informed of.  The client should inspect the status flags
(sr_status_flags) returned by SEQUENCE and take the appropriate
action (see Section 18.46.3 for details).

o  The status bits SEQ4_STATUS_CB_PATH_DOWN and
   SEQ4_STATUS_CB_PATH_DOWN_SESSION indicate problems with the
   backchannel which the client may need to address in order to
   receive callback requests.

o  The status bits SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRING and
   SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED indicate problems with GSS
   contexts for the backchannel which the client may have to address
   to allow callback requests to be sent to it.
o  The status bits SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED,
   SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED,
   SEQ4_STATUS_ADMIN_STATE_REVOKED, and
   SEQ4_STATUS_RECALLABLE_STATE_REVOKED notify the client of lock
   revocation events.  When these bits are set, the client should use
   TEST_STATEID to find what stateids have been revoked and use
   FREE_STATEID to acknowledge loss of the associated state.

o  The status bit SEQ4_STATUS_LEASE_MOVE indicates that
   responsibility for lease renewal has been transferred to one or
   more new servers.

o  The status bit SEQ4_STATUS_RESTART_RECLAIM_NEEDED indicates that
   due to server restart the client must reclaim locking state.

o  The status bit SEQ4_STATUS_BACKCHANNEL_FAULT indicates the server
   has encountered an unrecoverable fault with the backchannel (e.g.
   it has lost track of a sequence ID for a slot in the backchannel).

8.4.  Crash Recovery

A critical requirement in crash recovery is that both the client and
the server know when the other has failed.  Additionally, it is
required that a client sees a consistent view of data across server
restarts.  All READ and WRITE operations that may have been queued
within the client or network buffers must wait until the client has
successfully recovered the locks protecting the READ and WRITE
operations.  Any that reach the server before the server can safely
determine that the client has recovered enough locking state to be
sure that such operations can be safely processed must be rejected.
This will happen because either:

o  The state presented is no longer valid since it is associated with
   a now invalid client ID.
   In this case the client will receive either an NFS4ERR_BADSESSION
   or NFS4ERR_DEADSESSION error, and any attempt to attach a new
   session to the existing client ID will result in an
   NFS4ERR_STALE_CLIENTID error.

o  Subsequent recovery of locks may make execution of the operation
   inappropriate (NFS4ERR_GRACE).

8.4.1.  Client Failure and Recovery

In the event that a client fails, the server may release the client's
locks when the associated lease has expired.  Conflicting locks from
another client may only be granted after this lease expiration.  As
discussed in Section 8.3, when a client has not failed and
re-establishes its lease before expiration occurs, requests for
conflicting locks will not be granted.

To minimize client delay upon restart, lock requests are associated
with an instance of the client by a client-supplied verifier.  This
verifier is part of the client_owner4 sent in the initial EXCHANGE_ID
call made by the client.  The server returns a client ID as a result
of the EXCHANGE_ID operation.  The client then confirms the use of
the client ID by establishing a session associated with that client
ID (see Section 18.36.3 for a description of how this is done).  All
locks, including opens, byte-range locks, delegations, and layouts
obtained by sessions using that client ID are associated with that
client ID.

Since the verifier will be changed by the client upon each
initialization, the server can compare a new verifier to the verifier
associated with currently held locks and determine that they do not
match.  This signifies the client's new instantiation and subsequent
loss (upon confirmation of the new client ID) of locking state.  As a
result, the server is free to release all locks held which are
associated with the old client ID which was derived from the old
verifier.
At this point conflicting locks from other clients, kept waiting
while the lease had not yet expired, can be granted.  In addition,
all stateids associated with the old client ID can also be freed, as
they are no longer referenceable.

Note that the verifier must have the same uniqueness properties as
the verifier for the COMMIT operation.

8.4.2.  Server Failure and Recovery

If the server loses locking state (usually as a result of a restart),
it must allow clients time to discover this fact and re-establish the
lost locking state.  The client must be able to re-establish the
locking state without having the server deny valid requests because
the server has granted conflicting access to another client.
Likewise, if there is a possibility that clients have not yet
re-established their locking state for a file, and that such locking
state might make it invalid to perform READ or WRITE operations, for
example through the establishment of mandatory locks, the server must
disallow READ and WRITE operations for that file.

A client can determine that loss of locking state has occurred via
several methods.

1.  When a SEQUENCE (most common) or other operation returns
    NFS4ERR_BADSESSION, this may mean the session has been destroyed,
    but the client ID is still valid.  The client sends a
    CREATE_SESSION request with the client ID to re-establish the
    session.  If CREATE_SESSION fails with NFS4ERR_STALE_CLIENTID,
    the client must establish a new client ID (see Section 8.1) and
    re-establish its lock state with the new client ID, after the
    CREATE_SESSION operation succeeds (see Section 8.4.2.1).

2.  When a SEQUENCE (most common) or other operation on a persistent
    session returns NFS4ERR_DEADSESSION, this indicates that a
    session is no longer usable for new operations (i.e. those not
    satisfied from the reply cache).  Once all pending operations are
    determined to be either performed before the retry or not
    performed, the client sends a CREATE_SESSION request with the
    client ID to re-establish the session.  If CREATE_SESSION fails
    with NFS4ERR_STALE_CLIENTID, the client must establish a new
    client ID (see Section 8.1) and re-establish its lock state with
    the new client ID, after the CREATE_SESSION operation succeeds
    (see Section 8.4.2.1).

3.  When an operation that is neither SEQUENCE nor preceded by
    SEQUENCE (for example, CREATE_SESSION or DESTROY_SESSION) returns
    NFS4ERR_STALE_CLIENTID, the client MUST establish a new client ID
    (Section 8.1) and re-establish its lock state (Section 8.4.2.1).

8.4.2.1.  State Reclaim

When state information and the associated locks are lost as a result
of a server restart, the protocol must provide a way to cause that
state to be re-established.  The approach used is to define, for most
types of locking state (layouts are an exception), a request whose
function is to allow the client to re-establish on the server a lock
first obtained from a previous instance.  Generally these requests
are variants of the requests normally used to create locks of that
type and are referred to as "reclaim-type" requests and the process
of re-establishing such locks is referred to as "reclaiming" them.
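
The three detection cases listed in Section 8.4.2 can be read as a small decision procedure on the client. The following is a minimal sketch under stated assumptions, not part of the protocol: the names `Recovery`, `next_recovery_step`, and `recovery_plan` are hypothetical, errors are represented as plain strings, and the returned plan is only a list of labels for the operations the client would send.

```python
from enum import Enum, auto


class Recovery(Enum):
    """Hypothetical labels for the client's next recovery step."""
    NONE = auto()
    CREATE_SESSION = auto()              # session gone, client ID may still be valid
    NEW_CLIENT_ID_THEN_RECLAIM = auto()  # client ID gone, locking state was lost


def next_recovery_step(error: str) -> Recovery:
    """Map an error name from the text to the recovery step it calls for."""
    if error in ("NFS4ERR_BADSESSION", "NFS4ERR_DEADSESSION"):
        # For NFS4ERR_DEADSESSION the client first settles all pending
        # operations; that bookkeeping is outside this sketch.
        return Recovery.CREATE_SESSION
    if error == "NFS4ERR_STALE_CLIENTID":
        return Recovery.NEW_CLIENT_ID_THEN_RECLAIM
    return Recovery.NONE


def recovery_plan(error: str, create_session_status: str = "NFS4_OK") -> list:
    """Return the ordered operations a client might send (illustrative only).

    create_session_status is the assumed result of the first
    CREATE_SESSION attempt made with the existing client ID.
    """
    plan = []
    step = next_recovery_step(error)
    if step is Recovery.CREATE_SESSION:
        plan.append("CREATE_SESSION")
        step = next_recovery_step(create_session_status)
    if step is Recovery.NEW_CLIENT_ID_THEN_RECLAIM:
        plan += ["EXCHANGE_ID", "CREATE_SESSION", "reclaim locks"]
    return plan
```

For example, `recovery_plan("NFS4ERR_BADSESSION", "NFS4ERR_STALE_CLIENTID")` yields a first CREATE_SESSION attempt followed by the full sequence of establishing a new client ID, creating a session, and reclaiming locks.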

Because each client must have an opportunity to reclaim all of the
locks that it has without the possibility that some other client will
be granted a conflicting lock, a special period called the "grace
period" is devoted to the reclaim process.  During this period,
requests creating client IDs and sessions are handled normally, but
locking requests are subject to special restrictions.  Only
reclaim-type locking requests are allowed, unless the server can
reliably determine (through state persistently maintained across
restart instances), that granting any such lock cannot possibly
conflict with a subsequent reclaim.  When a request is made to obtain
a new lock (i.e. not a reclaim-type request) during the grace period
and such a determination cannot be made, the server must return the
error NFS4ERR_GRACE.

Once a session is established using the new client ID, the client
will use reclaim-type locking requests (e.g. LOCK requests with
reclaim set to TRUE and OPEN operations with a claim type of
CLAIM_PREVIOUS; see Section 9.11) to re-establish its locking state.
Once this is done, or if there is no such locking state to reclaim,
the client sends a global RECLAIM_COMPLETE operation, i.e. one with
the rca_one_fs argument set to FALSE, to indicate that it has
reclaimed all of the locking state that it will reclaim.  Once a
client sends such a RECLAIM_COMPLETE operation, it may attempt
non-reclaim locking operations, although it may get NFS4ERR_GRACE
errors on the operations until the period of special handling is
over.  See Section 11.7.7 for a discussion of the analogous handling
of lock reclamation in the case of file systems transitioning from
server to server.

During the grace period, the server must reject READ and WRITE
operations and non-reclaim locking requests (i.e.
other LOCK and OPEN operations) with an error of NFS4ERR_GRACE,
unless it can guarantee that these may be done safely, as described
below.

The grace period may last until all clients which are known to
possibly have had locks have done a global RECLAIM_COMPLETE
operation, indicating that they have finished reclaiming the locks
they held before the server restart.  This means that a client which
has done a RECLAIM_COMPLETE must be prepared to receive an
NFS4ERR_GRACE when attempting to acquire new locks.  In order for the
server to know that all clients with possible prior lock state have
done a RECLAIM_COMPLETE, the server must maintain in stable storage a
list of clients which may have such locks.  The server may also
terminate the grace period before all clients have done a global
RECLAIM_COMPLETE.  The server SHOULD NOT terminate the grace period
before a time equal to the lease period in order to give clients an
opportunity to find out about the server restart, as a result of
issuing requests on associated sessions with a frequency governed by
the lease time.  Note that when a client does not issue such requests
(or they are issued by the client but not received by the server), it
is possible for the grace period to expire before the client finds
out that the server restart has occurred.

Some additional time in order to allow a client to establish a new
client ID and session and to effect lock reclaims may be added to the
lease time.  Note that analogous rules apply to file system-specific
grace periods discussed in Section 11.7.7.
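
The grace period rules above can be illustrated with a small server-side sketch. It shows one possible policy only, and it rests on assumptions that the text does not mandate: grace is allowed to end as soon as every client on the stable-storage list has done a global RECLAIM_COMPLETE, and otherwise only after the lease period plus an optional allowance has elapsed. The function and parameter names (`grace_may_end`, `must_return_grace`, `slack`) are hypothetical.

```python
def grace_may_end(elapsed, lease_time, prior_clients, completed_clients, slack=0.0):
    """Decide whether the server may end the grace period (one possible policy).

    elapsed            seconds since the server restarted
    lease_time         the lease period, in seconds
    prior_clients      clients recorded in stable storage as possibly holding locks
    completed_clients  clients that have done a global RECLAIM_COMPLETE
    slack              assumed extra time for new client IDs, sessions, and reclaims
    """
    if set(prior_clients) <= set(completed_clients):
        # Every client that might have had locks has finished reclaiming.
        return True
    # Otherwise wait at least a full lease period (plus the allowance) so that
    # clients have a chance to notice the restart.
    return elapsed >= lease_time + slack


def must_return_grace(is_reclaim, provably_safe=False):
    """During the grace period, decide whether a request gets NFS4ERR_GRACE.

    provably_safe stands for the server's own guarantee, for example from
    state kept in stable storage, that the request cannot conflict with a
    later reclaim.
    """
    return not is_reclaim and not provably_safe
```

A server that keeps no such guarantees simply calls `must_return_grace(is_reclaim)` and rejects every READ, WRITE, and non-reclaim locking request until `grace_may_end` becomes true.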

If the server can reliably determine that granting a non-reclaim
request will not conflict with reclamation of locks by other clients,
the NFS4ERR_GRACE error does not have to be returned even within the
grace period, although NFS4ERR_GRACE must always be returned to
clients attempting a non-reclaim lock request before doing their own
global RECLAIM_COMPLETE.  For the server to be able to service READ
and WRITE operations during the grace period, it must again be able
to guarantee that no possible conflict could arise between a
potential reclaim locking request and the READ or WRITE operation.
If the server is unable to offer that guarantee, the NFS4ERR_GRACE
error must be returned to the client.

For a server to provide simple, valid handling during the grace
period, the easiest method is to simply reject all non-reclaim
locking requests and READ and WRITE operations by returning the
NFS4ERR_GRACE error.  However, a server may keep information about
granted locks in stable storage.  With this information, the server
could determine if a regular lock or READ or WRITE operation can be
safely processed.

For example, if the server maintained on stable storage summary
information on whether mandatory locks exist, either mandatory
byte-range locks, or share reservations specifying deny modes, many
requests could be allowed during the grace period.  If it is known
that no such share reservations exist, OPEN requests that do not
specify deny modes may be safely granted.  If, in addition, it is
known that no mandatory byte-range locks exist, either through
information stored on stable storage or simply because the server
does not support such locks, READ and WRITE requests may be safely
processed during the grace period.
Another important case is where it is known that no mandatory
byte-range locks exist, either because the server does not provide
support for them, or because their absence is known from persistently
recorded data.  In this case, READ and WRITE operations specifying
stateids derived from reclaim-type operations may be validly
processed during the grace period because the fact of the valid
reclaim ensures that no lock subsequently granted can prevent the
I/O.

To reiterate, a server that allows non-reclaim lock and I/O requests
to be processed during the grace period MUST determine that no lock
subsequently reclaimed will be rejected and that no lock subsequently
reclaimed would have prevented any I/O operation processed during the
grace period.

Clients should be prepared for the return of NFS4ERR_GRACE errors for
non-reclaim lock and I/O requests.  In this case the client should
employ a retry mechanism for the request.  A delay (on the order of
several seconds) between retries should be used to avoid overwhelming
the server.  Further discussion of the general issue is included in
[36].  The client must account for the server that can perform I/O
and non-reclaim locking requests within the grace period as well as
those that cannot do so.

A reclaim-type locking request outside the server's grace period can
only succeed if the server can guarantee that no conflicting lock or
I/O request has been granted since restart.

A server may, upon restart, establish a new value for the lease
period.  Therefore, clients should, once a new client ID is
established, refetch the lease_time attribute and use it as the basis
for lease renewal for the lease associated with that server.
However, the server must establish, for this restart event, a grace
period at least as long as the lease period for the previous server
instantiation.  This allows the client state obtained during the
previous server instance to be reliably re-established.

8.4.3.  Network Partitions and Recovery

If the duration of a network partition is greater than the lease
period provided by the server, the server will not have received a
lease renewal from the client.  If this occurs, the server may free
all locks held for the client, or it may allow the lock state to
remain for a considerable period, subject to the constraint that if a
request for a conflicting lock is made, locks associated with an
expired lease do not prevent such a conflicting lock from being
granted but MUST be revoked as necessary so as not to interfere with
such conflicting requests.

If the server chooses to delay freeing of lock state until there is a
conflict, it may either free all of the client's locks once there is
a conflict, or it may only revoke the minimum set of locks necessary
to allow conflicting requests.  When it adopts the finer-grained
approach, it must revoke all locks associated with a given stateid,
even if the conflict is with only a subset of locks.

When the server chooses to free all of a client's lock state, either
immediately upon lease expiration, or as a result of the first
attempt to obtain a conflicting lock, the server may report the loss
of lock state in a number of ways.

The server may choose to invalidate the session and the associated
client ID.  In this case, once the client can communicate with the
server, it will receive an NFS4ERR_BADSESSION error.
Upon attempting to create a new session, it would get an
NFS4ERR_STALE_CLIENTID.  Upon creating the new client ID and new
session it would attempt to reclaim locks and not be allowed to do so
by the server.

Another possibility is for the server to maintain the session and
client ID but for all stateids held by the client to become invalid
or stale.  Once the client can reach the server after such a network
partition, the status returned by the SEQUENCE operation will
indicate a loss of locking state, i.e. the flag
SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED will be set in sr_status_flags.
In addition, all I/O submitted by the client with the now invalid
stateids will fail with the server returning the error
NFS4ERR_EXPIRED.  Once the client learns of the loss of locking
state, it will suitably notify the applications that held the
invalidated locks.  The client should then take action to free
invalidated stateids, either by establishing a new client ID using a
new verifier or by doing a FREE_STATEID operation to release each of
the invalidated stateids.

When the server adopts a finer-grained approach to revocation of
locks when leases have expired, only a subset of stateids will
normally become invalid during a network partition.  When the client
can communicate with the server after such a network partition heals,
the status returned by the SEQUENCE operation will indicate a partial
loss of locking state (SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED).  In
addition, operations, including I/O submitted by the client, with the
now invalid stateids will fail with the server returning the error
NFS4ERR_EXPIRED.  Once the client learns of the loss of locking
state, it will use the TEST_STATEID operation on all of its stateids
to determine which locks have been lost and then suitably notify the
applications that held the invalidated locks.
The client can then release the invalidated locking state and
acknowledge the revocation of the associated locks by doing a
FREE_STATEID operation on each of the invalidated stateids.

When a network partition is combined with a server restart, there are
edge conditions that place requirements on the server in order to
avoid silent data corruption following the server restart.  Two of
these edge conditions are known, and are discussed below.

The first edge condition arises as a result of scenarios such as the
following:

1.  Client A acquires a lock.

2.  Client A and server experience mutual network partition, such
    that client A is unable to renew its lease.

3.  Client A's lease expires, and the server releases the lock.

4.  Client B acquires a lock that would have conflicted with that of
    Client A.

5.  Client B releases its lock.

6.  Server restarts.

7.  Network partition between client A and server heals.

8.  Client A connects to new server instance and finds out about
    server restart.

9.  Client A reclaims its lock within the server's grace period.

Thus, at the final step, the server has erroneously granted client
A's lock reclaim.  If client B modified the object the lock was
protecting, client A will experience object corruption.

The second known edge condition arises in situations such as the
following:

1.  Client A acquires one or more locks.

2.  Server restarts.

3.  Client A and server experience mutual network partition, such
    that client A is unable to reclaim all of its locks within the
    grace period.

4.  Server's reclaim grace period ends.
    Client A has either no locks or an incomplete set of locks known
    to the server.

5.  Client B acquires a lock that would have conflicted with a lock
    of client A that was not reclaimed.

6.  Client B releases the lock.

7.  Server restarts a second time.

8.  Network partition between client A and server heals.

9.  Client A connects to new server instance and finds out about
    server restart.

10. Client A reclaims its lock within the server's grace period.

As with the first edge condition, the final step of the scenario of
the second edge condition has the server erroneously granting client
A's lock reclaim.

Solving the first and second edge conditions requires that the server
either always assume after it restarts that some edge condition
occurred, and thus return NFS4ERR_NO_GRACE for all reclaim attempts,
or that the server record some information in stable storage.  The
amount of information the server records in stable storage is in
inverse proportion to how harsh the server intends to be whenever
edge conditions arise.  The server that is completely tolerant of all
edge conditions will record in stable storage every lock that is
acquired, removing the lock record from stable storage only when the
lock is released.  For the two edge conditions discussed above, the
harshest a server can be, and still support a grace period for
reclaims, requires that the server record some minimal information in
stable storage.  For example, a server implementation could, for each
client, save in stable storage a record containing:

o  the co_ownerid field from the client_owner4 presented in the
   EXCHANGE_ID operation.

o  a boolean that indicates if the client's lease expired or if there
   was administrative intervention (see Section 8.5) to revoke a
   byte-range lock, share reservation, or delegation and there has
   been no acknowledgement, via FREE_STATEID, of such revocation.

o  a boolean that indicates whether the client may have locks that it
   believes to be reclaimable in situations in which the grace period
   was terminated, making the server's view of lock reclaimability
   suspect.  The server will set this for any client record in stable
   storage where the client has not done a suitable RECLAIM_COMPLETE
   (global or file system-specific depending on the target of the
   lock request) before it grants any new (i.e. not reclaimed) lock
   to any client.

Assuming the above record keeping, for the first edge condition,
after the server restarts, the record that client A's lease expired
means that another client could have acquired a conflicting
byte-range lock, share reservation, or delegation.  Hence the server
must reject a reclaim from client A with the error NFS4ERR_NO_GRACE.

For the second edge condition, after the server restarts for a second
time, the indication that the client had not completed its reclaims
at the time at which the grace period ended means that the server
must reject a reclaim from client A with the error NFS4ERR_NO_GRACE.

When either edge condition occurs, the client's attempt to reclaim
locks will result in the error NFS4ERR_NO_GRACE.  When this is
received, or after the client restarts with no lock state, the client
will send a global RECLAIM_COMPLETE.
When the RECLAIM_COMPLETE is received, the server and client are
again in agreement regarding reclaimable locks and both booleans in
persistent storage can be reset, to be set again only when there is a
subsequent event that causes lock reclaim operations to be
questionable.

Regardless of the level and approach to record keeping, the server
MUST implement one of the following strategies (which apply to
reclaims of share reservations, byte-range locks, and delegations):

1.  Reject all reclaims with NFS4ERR_NO_GRACE.  This is extremely
    unforgiving, but necessary if the server does not record lock
    state in stable storage.

2.  Record sufficient state in stable storage such that all known
    edge conditions involving server restart, including the two noted
    in this section, are detected.  It is acceptable to erroneously
    recognize an edge condition and not allow a reclaim, when, with
    sufficient knowledge it would be allowed.  The error the server
    would return in this case is NFS4ERR_NO_GRACE.  Note it is not
    known if there are other edge conditions.

    In the event that, after a server restart, the server determines
    that there is unrecoverable damage or corruption to the
    information in stable storage, then for all clients and/or locks
    which may be affected, the server MUST return NFS4ERR_NO_GRACE.

A mandate for the client's handling of the NFS4ERR_NO_GRACE error is
outside the scope of this specification, since the strategies for
such handling are very dependent on the client's operating
environment.  However, one potential approach is described below.
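
The per-client record suggested in this section (the co_ownerid plus two booleans) and the way a server might consult it after a restart can be sketched as follows. This is an illustration under stated assumptions, not a prescribed implementation: the names `ClientRecord`, `may_reclaim`, and `on_global_reclaim_complete` are hypothetical, stable storage is modeled as an in-memory dictionary, and a client with no record is conservatively refused, which the text permits but does not require.

```python
from dataclasses import dataclass


@dataclass
class ClientRecord:
    """Hypothetical stable-storage record kept per client."""
    co_ownerid: bytes
    # Lease expired or locks were administratively revoked, and the client
    # has not yet acknowledged the revocation via FREE_STATEID.
    revoked_unacked: bool = False
    # A new (non-reclaimed) lock was granted before this client did a
    # suitable RECLAIM_COMPLETE, so its idea of reclaimable locks is suspect.
    reclaim_suspect: bool = False


def may_reclaim(records, co_ownerid, storage_intact=True):
    """Return True if a reclaim from this client may be considered.

    When this returns False the server would answer NFS4ERR_NO_GRACE.
    """
    if not storage_intact:
        # Damaged or corrupt stable storage: refuse all affected reclaims.
        return False
    record = records.get(co_ownerid)
    if record is None:
        # Assumption: an unknown client is treated as an edge condition.
        return False
    return not (record.revoked_unacked or record.reclaim_suspect)


def on_global_reclaim_complete(record):
    """Client and server agree again, so both booleans can be reset."""
    record.revoked_unacked = False
    record.reclaim_suspect = False
```

In the first edge condition, client A's record would have `revoked_unacked` set when its lease expired, so `may_reclaim` is false after the restart. In the second, `reclaim_suspect` would have been set when client B was granted its new lock.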

When the client receives NFS4ERR_NO_GRACE, it could examine the
change attribute of the objects the client is trying to reclaim state
for, and use that to determine whether to re-establish the state via
normal OPEN or LOCK requests. This is acceptable provided the
client's operating environment allows it. In other words, the client
implementor is advised to document this behavior for users. The
client could also inform the application that its byte-range locks or
share reservations (whether they were delegated or not) have been
lost, for example via a UNIX signal or a GUI pop-up window. See
Section 10.5 for a discussion of what the client should do to deal
with unreclaimed delegations on client state.

For further discussion of revocation of locks see Section 8.5.

8.5. Server Revocation of Locks

At any point, the server can revoke locks held by a client and the
client must be prepared for this event. When the client detects that
its locks have been or may have been revoked, the client is
responsible for validating the state information between itself and
the server. Validating locking state for the client means that it
must verify or reclaim state for each lock currently held.

The first occasion of lock revocation is upon server restart. Note
that this includes situations in which sessions are persistent and
locking state is lost. In this class of instances, the client will
receive an error (NFS4ERR_STALE_CLIENTID) on an operation that takes
a client ID (usually as part of recovery in response to a problem
with the current session), and the client will proceed with normal
crash recovery as described in Section 8.4.2.1.

The second occasion of lock revocation is the inability to renew the
lease before expiration, as discussed in Section 8.4.3. While this
is considered a rare or unusual event, the client must be prepared to
recover. The server is responsible for determining the precise
consequences of the lease expiration, informing the client of the
scope of the lock revocation decided upon. The client then uses the
status information provided by the server in the SEQUENCE results
(field sr_status_flags, see Section 18.46.3) to synchronize its
locking state with that of the server, in order to recover.

The third occasion of lock revocation can occur as a result of
revocation of locks within the lease period, either because of
administrative intervention, or because a recallable lock (a
delegation or layout) was not returned within the lease period after
having been recalled. While these are considered rare events, they
are possible and the client must be prepared to deal with them. When
either of these events occurs, the client finds out about the
situation through the status returned by the SEQUENCE operation. Any
use of stateids associated with locks revoked during the lease period
will receive the error NFS4ERR_ADMIN_REVOKED or
NFS4ERR_DELEG_REVOKED, as appropriate.

In all situations in which a subset of locking state may have been
revoked, which include all cases in which locking state is revoked
within the lease period, it is up to the client to determine which
locks have been revoked and which have not. It does this by using
the TEST_STATEID operation on the appropriate set of stateids.
Once the set of revoked locks has been determined, the applications
can be notified, and the invalidated stateids can be freed and lock
revocation acknowledged by using FREE_STATEID.

8.6. Short and Long Leases

When determining the time period for the server lease, the usual
lease tradeoffs apply. Short leases are good for fast server
recovery at a cost of increased operations to effect lease renewal
(when there are no other operations during the period to effect lease
renewal as a side-effect). Long leases are certainly kinder and
gentler to servers trying to handle very large numbers of clients.
The number of extra requests to effect lock renewal drops in inverse
proportion to the lease time. The disadvantages of long leases
include the possibility of slower recovery after certain failures.
After server failure, a longer grace period may be required when some
clients do not promptly reclaim their locks and do a global
RECLAIM_COMPLETE. In the event of client failure, there can be a
longer period for leases to expire thus forcing conflicting requests
to wait.

Long leases are practical if the server can store lease state in
non-volatile memory. Upon recovery, the server can reconstruct the
lease state from its non-volatile memory and continue operation with
its clients, so long leases would not be an issue.

8.7. Clocks, Propagation Delay, and Calculating Lease Expiration

To avoid the need for synchronized clocks, lease times are granted by
the server as a time delta. However, there is a requirement that the
client and server clocks do not drift excessively over the duration
of the lease. There is also the issue of propagation delay across
the network which could easily be several hundred milliseconds as
well as the possibility that requests will be lost and need to be
retransmitted.
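
Because the lease is granted as a delta, a client can track it
against its own local clock and leave a margin for the propagation
delay just noted. The following is a minimal sketch of that
arithmetic, not a prescribed algorithm: the function name and
parameters are hypothetical, times are integer milliseconds on a
local monotonic clock, and the client is assumed to have some
estimate of the one-way delay.

```python
def latest_renewal_send_ms(received_at_ms, lease_ms, one_way_delay_ms):
    """Latest local time (ms) at which a renewal can still be sent.

    received_at_ms   -- local clock reading when the reply arrived
    lease_ms         -- lease period granted by the server, as a delta
    one_way_delay_ms -- client's estimate of one-way propagation delay
    """
    # The lease was already one propagation delay old on arrival.
    lease_age_on_arrival = one_way_delay_ms
    # The renewal needs one more propagation delay to reach the server.
    renewal_transit = one_way_delay_ms
    return received_at_ms + lease_ms - lease_age_on_arrival - renewal_transit


# Illustrative figures only: a 90 second lease and a 200 ms delay.
deadline = latest_renewal_send_ms(received_at_ms=0,
                                  lease_ms=90_000,
                                  one_way_delay_ms=200)
```

With these illustrative figures the renewal must leave the client no
later than 89,600 ms after the reply arrived. A client whose delay
estimate grows during the lease would recompute the deadline with the
larger value.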

To take propagation delay into account, the client should subtract it
from lease times (e.g. if the client estimates the one-way
propagation delay as 200 milliseconds, then it can assume that the
lease is already 200 milliseconds old when it gets it). In addition,
it will take another 200 milliseconds to get a response back to the
server. So the client must send a lease renewal or write data back
to the server at least 400 milliseconds before the lease would
expire. If the propagation delay varies over the life of the lease
(e.g. the client is on a mobile host), the client will need to
continuously subtract the increase in propagation delay from the
lease times.

The server's lease period configuration should take into account the
network distance of the clients that will be accessing the server's
resources. It is expected that the lease period will take into
account the network propagation delays and other network delay
factors for the client population. Since the protocol does not allow
for an automatic method to determine an appropriate lease period, the
server's administrator may have to tune the lease period.

8.8. Obsolete Locking Infrastructure From NFSv4.0

There are a number of operations and fields within existing
operations that no longer have a function in NFSv4.1. In one way or
another, these changes are all due to the implementation of sessions,
which provide client context and exactly-once semantics as a base
feature of the protocol, separate from locking itself.

The following NFSv4.0 operations MUST NOT be implemented in NFSv4.1.
The server MUST return NFS4ERR_NOTSUPP if these operations are found
in an NFSv4.1 COMPOUND.

o SETCLIENTID since its function has been replaced by EXCHANGE_ID.

o SETCLIENTID_CONFIRM since client ID confirmation now happens by
  means of CREATE_SESSION.

o OPEN_CONFIRM because state-owner-based seqids have been replaced
  by the sequence ID in the SEQUENCE operation.

o RELEASE_LOCKOWNER because lock-owners with no associated locks do
  not have any sequence-related state and so can be deleted by the
  server at will.

o RENEW because every SEQUENCE operation for a session causes lease
  renewal, making a separate operation superfluous.

Also, there are a number of fields, present in existing operations
related to locking, that have no use in minor version one. They were
used in minor version zero to perform functions now provided in a
different fashion.

o Sequence ids used to sequence requests for a given state-owner and
  to provide retry protection, now provided via sessions.

o Client IDs used to identify the client associated with a given
  request. Client identification is now available using the client
  ID associated with the current session, without needing an
  explicit client ID field.

Such vestigial fields in existing operations have no function in
NFSv4.1 and are ignored by the server. Note that client IDs in
operations new to NFSv4.1 (such as CREATE_SESSION and
DESTROY_CLIENTID) are not ignored.

9. File Locking and Share Reservations

To support Win32 share reservations it is necessary to provide
operations which atomically open or create files. Having a separate
share/unshare operation would not allow correct implementation of the
Win32 OpenFile API.
In order to correctly implement share semantics,
the previous NFS protocol mechanisms used when a file is opened or
created (LOOKUP, CREATE, ACCESS) need to be replaced. The NFSv4.1
protocol defines an OPEN operation which is capable of atomically
looking up, creating, and locking a file on the server.

9.1. Opens and Byte-Range Locks

It is assumed that manipulating a byte-range lock is rare when
compared to READ and WRITE operations. It is also assumed that
server restarts and network partitions are relatively rare.
Therefore it is important that the READ and WRITE operations have a
lightweight mechanism to indicate if they possess a held lock. A
byte-range lock request contains the heavyweight information required
to establish a lock and uniquely define the owner of the lock.

9.1.1. State-owner Definition

When opening a file or requesting a byte-range lock, the client must
specify an identifier which represents the owner of the requested
lock. This identifier is in the form of a state-owner, represented
in the protocol by a state_owner4, a variable-length opaque array
which, when concatenated with the current client ID, uniquely defines
the owner of a lock managed by the client. This may be a thread ID,
process ID, or other unique value.

Owners of opens and owners of byte-range locks are separate entities
and remain separate even if the same opaque arrays are used to
designate owners of each. The protocol distinguishes between
open-owners (represented by open_owner4 structures) and lock-owners
(represented by lock_owner4 structures).
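
The separation of open-owners and lock-owners can be pictured as two
distinct key types, each formed from the client ID plus the opaque
owner value. The sketch below is illustrative only: the Python class
names, field names, and sample values are hypothetical and are not
the protocol's XDR definitions of open_owner4 and lock_owner4.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpenOwner:
    clientid: int   # client ID of the current session
    owner: bytes    # opaque, client-chosen (e.g. from a process ID)


@dataclass(frozen=True)
class LockOwner:
    clientid: int   # client ID of the current session
    owner: bytes    # opaque, client-chosen (e.g. from a thread ID)


# The same opaque bytes designate two separate owners, one per kind.
open_owner = OpenOwner(clientid=0x1234, owner=b"pid-4711")
lock_owner = LockOwner(clientid=0x1234, owner=b"pid-4711")

# A server-side table keyed by owner keeps the two apart.
state_table = {open_owner: "open state", lock_owner: "lock state"}
```

Because the two classes never compare equal to each other, the table
holds two entries even though the client ID and opaque bytes match.
Changing either the client ID or the opaque bytes yields a different
owner of the same kind.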

Each open is associated with a specific open-owner while each
byte-range lock is associated with a lock-owner and an open-owner,
the latter being the open-owner associated with the open file under
which the LOCK operation was done. Delegations and layouts, on the
other hand, are not associated with a specific owner but are
associated with the client as a whole (identified by a client ID).

9.1.2. Use of the Stateid and Locking

All READ, WRITE and SETATTR operations contain a stateid. For the
purposes of this section, SETATTR operations which change the size
attribute of a file are treated as if they are writing the area
between the old and new size (i.e. the range truncated or added to
the file by means of the SETATTR), even where SETATTR is not
explicitly mentioned in the text. The stateid passed to one of these
operations must be one that represents an open, a set of byte-range
locks, or a delegation, or it may be a special stateid representing
anonymous access or the special bypass stateid.

If the state-owner performs a READ or WRITE in a situation in which
it has established a byte-range lock or share reservation on the
server (any OPEN constitutes a share reservation) the stateid
(previously returned by the server) must be used to indicate what
locks, including both byte-range locks and share reservations, are
held by the state-owner. If no state is established by the client,
either byte-range lock or share reservation, a special stateid for
anonymous state (zero as "other" and "seqid") is used. (See
Section 8.2.3 for a description of 'special' stateids in general.)
Regardless of whether a stateid for anonymous state or a stateid
returned by the server is used, if there is a conflicting share
reservation or mandatory byte-range lock held on the file, the server
MUST refuse to service the READ or WRITE operation.

Share reservations are established by OPEN operations and by their
nature are mandatory in that when the OPEN denies READ or WRITE
operations, that denial results in such operations being rejected
with error NFS4ERR_LOCKED. Byte-range locks may be implemented by
the server as either mandatory or advisory, or the choice of
mandatory or advisory behavior may be determined by the server on the
basis of the file being accessed (for example, some UNIX-based
servers support a "mandatory lock bit" on the mode attribute such
that if set, byte-range locks are required on the file before I/O is
possible). When byte-range locks are advisory, they only prevent the
granting of conflicting lock requests and have no effect on READs or
WRITEs. Mandatory byte-range locks, however, prevent conflicting I/O
operations. When such operations are attempted, they are rejected
with NFS4ERR_LOCKED. When the client gets NFS4ERR_LOCKED on a file
it knows it has the proper share reservation for, it will need to
send a LOCK request on the region of the file that includes the
region the I/O was to be performed on, with an appropriate locktype
(i.e. READ*_LT for a READ operation, WRITE*_LT for a WRITE
operation).

Note that for UNIX environments that support mandatory file locking,
the distinction between advisory and mandatory locking is subtle. In
fact, advisory and mandatory byte-range locks are exactly the same
insofar as the APIs and requirements on implementation are concerned.
If the mandatory lock attribute is set on the file, the server checks
to see if the lock-owner has an appropriate shared (read) or
exclusive (write) byte-range lock on the region it wishes to read or
write to. If there is no appropriate lock, the server checks if
there is a conflicting lock (which can be done by attempting to
acquire the conflicting lock on behalf of the lock-owner, and if
successful, release the lock after the READ or WRITE is done), and if
there is, the server returns NFS4ERR_LOCKED.

For Windows environments, byte-range locks are always mandatory, so
the server always checks for byte-range locks during I/O requests.

Thus, the NFSv4.1 LOCK operation does not need to distinguish between
advisory and mandatory byte-range locks. It is the NFSv4.1 server's
processing of the READ and WRITE operations that introduces the
distinction.

Every stateid which is validly passed to READ, WRITE or SETATTR, with
the exception of special stateid values, defines an access mode for
the file (i.e. READ, WRITE, or READ-WRITE).

o For stateids associated with opens, this is the mode defined by
  the original OPEN which caused the allocation of the open stateid
  and as modified by subsequent OPENs and OPEN_DOWNGRADEs for the
  same open-owner/file pair.

o For stateids returned by byte-range lock requests, the appropriate
  mode is the access mode for the open stateid associated with the
  lock set represented by the stateid.

o For delegation stateids the access mode is based on the type of
  delegation.

When a READ, WRITE, or SETATTR (which specifies the size attribute)
is done, the operation is subject to checking against the access mode
to verify that the operation is appropriate given the stateid with
which the operation is associated.

In the case of WRITE-type operations (i.e. WRITEs and SETATTRs which
set size), the server MUST verify that the access mode allows writing
and MUST return an NFS4ERR_OPENMODE error if it does not. In the
case of READ, the server may perform the corresponding check on the
access mode, or it may choose to allow READ on opens for WRITE only,
to accommodate clients whose write implementation may unavoidably do
reads (e.g. due to buffer cache constraints). However, even if READs
are allowed in these circumstances, the server MUST still check for
locks that conflict with the READ (e.g. another open specifying
denial of READs). Note that a server which does enforce the access
mode check on READs need not explicitly check for conflicting share
reservations since the existence of OPEN for read access guarantees
that no conflicting share reservation can exist.

The read bypass special stateid (all bits of "other" and "seqid" set
to one) indicates a desire to bypass locking checks. The server MAY
allow READ operations to bypass locking checks at the server, when
this special stateid is used. However, WRITE operations with this
special stateid value MUST NOT bypass locking checks and are treated
exactly the same as if a special stateid for anonymous state were
used.

A lock may not be granted while a READ or WRITE operation using one
of the special stateids is being performed and the scope of the lock
to be granted would conflict with the READ or WRITE operation. This
can occur when:

o A mandatory byte-range lock is requested with a range that
  conflicts with the range of the READ or WRITE operation.
  For the purposes of this paragraph, a conflict occurs when a shared
  lock is requested and a WRITE operation is being performed, or an
  exclusive lock is requested and either a READ or a WRITE operation
  is being performed.

o A share reservation is requested which denies reading and/or
  writing and the corresponding operation is being performed.

o A delegation is to be granted and the delegation type would
  prevent the I/O operation, i.e. READ and WRITE conflict with a
  write delegation and WRITE conflicts with a read delegation.

When a client holds a delegation, it needs to ensure that the stateid
sent conveys the association of the operation with the delegation, to
prevent the delegation from being avoidably recalled. When the
delegation stateid, or an open stateid associated with that
delegation, or a stateid representing byte-range locks derived from
such an open is used, the server knows that the READ, WRITE, or
SETATTR does not conflict with the delegation, but is sent under the
aegis of the delegation. Even though it is possible for the server
to determine from the client ID (via the session ID) that the client
does in fact have a delegation, the server is not obliged to check
this, so using a special stateid can result in avoidable recall of
the delegation.

9.2. Lock Ranges

The protocol allows a lock-owner to request a lock with a byte range
and then either upgrade, downgrade, or unlock a sub-range of the
initial lock, or a range that consists of a range which overlaps,
fully or partially, that initial lock or a combination of a set of
existing locks for the same lock-owner. It is expected that this
will be an uncommon type of request.
In any case, servers or server file systems may not be able to
support sub-range lock semantics. In the event that a server
receives a locking request that represents a sub-range of current
locking state for the lock-owner, the server is allowed to return the
error NFS4ERR_LOCK_RANGE to signify that it does not support
sub-range lock operations. Therefore, the client should be prepared
to receive this error and, if appropriate, report the error to the
requesting application.

The client is discouraged from combining multiple independent locking
ranges that happen to be adjacent into a single request since the
server may not support sub-range requests and for reasons related to
the recovery of file locking state in the event of server failure.
As discussed in Section 8.4.2, the server may employ certain
optimizations during recovery that work effectively only when the
client's behavior during lock recovery is similar to the client's
locking behavior prior to server failure.

9.3. Upgrading and Downgrading Locks

If a client has a write lock on a byte-range, it can request an
atomic downgrade of the lock to a read lock via the LOCK request, by
setting the type to READ_LT. If the server supports atomic
downgrade, the request will succeed. If not, it will return
NFS4ERR_LOCK_NOTSUPP. The client should be prepared to receive this
error, and if appropriate, report the error to the requesting
application.

If a client has a read lock on a byte-range, it can request an atomic
upgrade of the lock to a write lock via the LOCK request by setting
the type to WRITE_LT or WRITEW_LT. If the server does not support
atomic upgrade, it will return NFS4ERR_LOCK_NOTSUPP. If the upgrade
can be achieved without an existing conflict, the request will
succeed. Otherwise, the server will return either NFS4ERR_DENIED or
NFS4ERR_DEADLOCK. The error NFS4ERR_DEADLOCK is returned if the
client sent the LOCK request with the type set to WRITEW_LT and the
server has detected a deadlock. The client should be prepared to
receive such errors and, if appropriate, report the error to the
requesting application.

9.4. Stateid Seqid Values and Byte-Range Locks

When a lock or unlock request is done, passing a stateid, the stateid
returned has the same "other" value and a "seqid" value that is
incremented to reflect the occurrence of the lock or unlock request.
The server MUST increment the value of the "seqid" field whenever
there is any change to the locking status of any byte offset as
described by any of the locks covered by the stateid. A change in
locking status includes a change from locked to unlocked or the
reverse or a change from being locked for read to being locked for
write or the reverse.

When there is no such change, as, for example, when a range already
locked for write is locked again for write, the server MAY increment
the "seqid" value.

9.5. Issues with Multiple Open-Owners

When the same file is opened by multiple open-owners, a client will
have multiple open stateids for that file, each associated with a
different open-owner. In that case, there can be multiple LOCK and
LOCKU requests for the same lock-owner issued using the different
open stateids, and so a situation may arise in which there are
multiple stateids, each representing byte-range locks on the same
file and held by the same lock-owner but each associated with a
different open-owner.
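
As background for the multiple-stateid case, the "seqid" rule of
Section 9.4 for a single stateid can be sketched with a toy model.
Everything here is hypothetical (class name, per-byte dictionary,
starting value of the counter), and the sketch makes one choice the
text leaves open: it does not increment "seqid" when a request leaves
the locking status of every byte unchanged, which is the MAY case.

```python
class LockStateid:
    """Toy model of one byte-range lock stateid (hypothetical)."""

    def __init__(self, other):
        self.other = other   # unchanged across lock and unlock requests
        self.seqid = 0       # starting value chosen for illustration
        self._bytes = {}     # byte offset -> "read" or "write"

    def _apply(self, new_bytes):
        # The seqid MUST change when any byte's locking status changed.
        # This sketch leaves it alone otherwise (the MAY case).
        if new_bytes != self._bytes:
            self._bytes = new_bytes
            self.seqid += 1
        return (self.other, self.seqid)

    def lock(self, offset, length, mode):
        """Lock [offset, offset + length) for "read" or "write"."""
        new_bytes = dict(self._bytes)
        for b in range(offset, offset + length):
            new_bytes[b] = mode
        return self._apply(new_bytes)

    def unlock(self, offset, length):
        """Unlock [offset, offset + length)."""
        new_bytes = {b: m for b, m in self._bytes.items()
                     if not (offset <= b < offset + length)}
        return self._apply(new_bytes)


sid = LockStateid(other=b"example")
first = sid.lock(0, 10, "write")    # locking status changes
again = sid.lock(0, 10, "write")    # nothing changes
down = sid.lock(0, 5, "read")       # write -> read on bytes 0..4
```

In this run "first" and "again" carry the same "seqid", while "down"
carries a larger one. A per-byte dictionary is only practical for a
toy; a real server would track ranges.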

In such a situation, the locking status of each byte (i.e. whether it
is locked, the read or write mode of the lock and the lock-owner
holding the lock) MUST reflect the last LOCK or LOCKU operation done
for the lock-owner in question, independent of the stateid through
which the request was issued.

When a byte is locked by the lock-owner in question, the open-owner
to which that lock is assigned SHOULD be that of the open-owner
associated with the stateid through which the last LOCK of that byte
was done. When there is a change in the open-owner associated with
locks for the stateid through which a LOCK or LOCKU was done, the
"seqid" field of the stateid MUST be incremented, even if the
locking, in terms of lock-owners, has not changed. When there is a
change to the set of locked bytes associated with a different stateid
for the same lock-owner, i.e. associated with a different open-owner,
the "seqid" value for that stateid MUST NOT be incremented.

9.6. Blocking Locks

Some clients require the support of blocking locks. While NFSv4.1
provides a callback when a previously unavailable lock becomes
available, this is an OPTIONAL feature and clients cannot depend on
its presence. Clients need to be prepared to continually poll for
the lock. This presents a fairness problem. Two of the lock types,
READW and WRITEW, are used to indicate to the server that the client
is requesting a blocking lock. When the callback is not used, the
server should maintain an ordered list of pending blocking locks.
When the conflicting lock is released, the server may wait for the
period of time equal to lease_time for the first waiting client to
re-request the lock.
After the lease period expires, the next waiting client request is
allowed the lock. Clients are required to poll at an interval
sufficiently small that they are likely to acquire the lock in a
timely manner. The server is not required to maintain a list of
pending blocked locks as it is used to increase fairness and not
correct operation. Because of the unordered nature of crash
recovery, storing of lock state to stable storage would be required
to guarantee ordered granting of blocking locks.

Servers may also note the lock types and delay returning denial of
the request to allow extra time for a conflicting lock to be
released, allowing a successful return. In this way, clients can
avoid the burden of needlessly frequent polling for blocking locks.
The server should take care in the length of delay in the event the
client retransmits the request.

If a server receives a blocking lock request, denies it, and then
later receives a nonblocking request for the same lock, which is also
denied, then it should remove the lock in question from its list of
pending blocking locks. Clients should use such a nonblocking
request to indicate to the server that this is the last time they
intend to poll for the lock, as may happen when the process
requesting the lock is interrupted. This is a courtesy to the
server, to prevent it from unnecessarily waiting a lease period
before granting other lock requests. However, clients are not
required to perform this courtesy, and servers must not depend on
them doing so. Also, clients must be prepared for the possibility
that this final locking request will be accepted.

When server indicates, via the flag OPEN4_