Issue159

Title: draft-09 pNFS issues (already)
Priority: required
Status: resolved-edit-complete
Superseder:
Nosy List: mre, spencer.shepler
Assigned To: spencer.shepler
Topics:

Created on 2007-03-02.23:47:29 by mre, last changed 2008-01-11.19:56:41 by mre.

Messages
msg539 Author: mre Date: 2007-03-03.07:22:21
-----Original Message-----
From: Eisler, Michael 
Sent: Friday, March 02, 2007 11:19 PM
To: Iyer, Rahul
Cc: Goodson, Garth
Subject: RE: possible type in draft 9

 

> -----Original Message-----
> From: Iyer, Rahul 
> Sent: Friday, March 02, 2007 4:35 PM
> To: Eisler, Michael
> Cc: Goodson, Garth
> Subject: RE: possible type in draft 9
> 
> Hi Mike,
> Here are a few of my comments. Apologies for the delay. This 
> covers some of the pNFS section. I'll review the NFSv4 files 
> part of pNFS and send any comments I might have a bit later.
> Thanks,
> Regards
> Rahul
> 
> --- Start comments ---
> Section 12.3 pNFS Operations
> ============================
> 
> "GETDEVICELIST (Section 17.41), allows clients to fetch the all the
> device ID to storage device address mappings of particular file
> system."
> 
> This is not entirely accurate. For cluster file systems that have
> nodes 'come and go' like IBM's GPFS implementation, this definition
> doesn't hold true.  This seems to imply that once a client does a
> GETDEVICELIST, the client will never need to do a GETDEVICEINFO
> (because it already has all the mappings).  GETDEVICELIST is more like
> an "at that moment in time" thing. So, adding something like "at that
> point in time" would make it more accurate.

I see your point. I will clarify this in -10.
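
To illustrate the "at that moment in time" point, here is a minimal sketch of a client-side device cache that treats the GETDEVICELIST result as a snapshot and falls back to GETDEVICEINFO for any device ID it has not seen. The class name and the two callables are hypothetical stand-ins for the real operations, not part of any draft or implementation.

```python
from typing import Callable, Dict


class DeviceCache:
    """Toy cache of device ID -> storage device address mappings."""

    def __init__(self,
                 get_device_list: Callable[[], Dict[int, str]],
                 get_device_info: Callable[[int], str]) -> None:
        # Both callables are hypothetical stand-ins for GETDEVICELIST
        # and GETDEVICEINFO.
        self._get_device_info = get_device_info
        # The list is only a snapshot taken at this point in time.
        self._mappings = dict(get_device_list())

    def lookup(self, device_id: int) -> str:
        # A device that joined after the snapshot is not in the cache,
        # so the client still needs a per-device query.
        if device_id not in self._mappings:
            self._mappings[device_id] = self._get_device_info(device_id)
        return self._mappings[device_id]
```

With this shape, a node that joins the cluster after the initial fetch is resolved on first use rather than being treated as unknown.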



> 
> ---
> 
> "LAYOUTCOMMIT (Section 17.42) is used to inform the metadata server
> that the client wants to commit data it wrote to the storage device
> (which as indicated in the layout segment returned by LAYOUTGET)."
> 
> This isn't entirely accurate either. COMMITing data to the data
> servers seems to sound like sending a COMMIT operation to the data
> server to COMMIT an unstable write to it. What LAYOUTCOMMIT aims to
> achieve is committing the metadata changes that correspond to the data
> changes that were made at the data servers.

OK.

> 
> ---
> 
> 12.7.2. Dealing with Lease Expiration on the Client
> =================================================== 
> Paragraph 3 
> 
> The client needn't necessarily write through the MDS. It could very
> well refetch its layouts and then do the I/O. Something like 'it is
> recommended' might be a better way to phrase this.

OK. I also need to mention that the device mappings have gone away.

> 
> --- 
> 
> Paragraph 4
> 
> NFS4ERR_STATE_CLIENTID should be NFS4ERR_STALE_CLIENTID

OK.

> 
> ---
> 
> Paragraph 5, sentence 3
> 
> Grammatical error

OK

> 
> ---
> 
> 12.7.3. Dealing with Loss of Layout State on the Metadata Server
> ===============================================================
> 
> It is stated that this section "describes the recovery from the
> situation...", but this section really just gives pointers to other
> sections/documents. It would be nice to reword the opening statement
> to reflect this.

OK.

> 
> ---
> 
> 12.7.4. Recovery from Metadata Server Restart
> =============================================
> 
> 1st Paragraph, 3rd sentence
> 
> "then client has additional work to do in order to get the client,..."
> probably should be 'then the client' and 'to get itself'.

OK

> 
> ---
> 
> 3rd bullet
> 
> "The client does not have a copy of the data in its memory and the
> metadata server is still in its grace period."
> 
> should probably be 
> 
> "If the client does not have a copy of the data in its memory and the
> metadata server is still in its grace period, ..."

OK.

> 
> ---
> 
> 3rd bullet, 2nd Paragraph, 2nd sentence
> 
> "The use of LAYOUTCOMMIT in reclaim mode informs the metadata server
> that the layout segment has changed."
> 
> This is not 100% true. In the files case, the client never changes
> the layout segment. In the blocks case, it might. A better way to
> phrase this would be to say "either the layout segment, or the data
> region it represents (or both) has changed"

OK.
> 
> ---
> 
> 12.7.5. Operations During Metadata Server Grace Period
> ======================================================
> 
> It's stated here that WRITEs and LAYOUTGETs may be allowed during the
> grace period. Are OPENs allowed in the grace period? If not, then how
> would one perform a WRITE, because the state id would be invalid; and
> to get a valid state id, the client would have to do an OPEN.

Draft-08 said the WRITEs could proceed during the grace period:
  However, depending on the storage protocol and server 
  implementation, the server may be able to determine that a 
  particular request is safe. For example, a server may save 
  provisional allocation mappings for each file to stable 
  storage, and use this information during the recovery grace 
  period to determine that a WRITE request is safe. 

That's wrong of course.

If a metadata server can save provisional allocation mappings to
stable storage, it could easily, for example, record two bits to
stable storage. One bit says whether the file has an OPEN or not.
The other says whether there was a deny mode other than none. If both
bits are zero, then the metadata server can allow the open.
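
A minimal sketch of the two-bit check described above, assuming the metadata server can read the bits back from stable storage after it restarts. The type and function names are hypothetical and only illustrate the decision; how the bits are stored is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PersistedOpenBits:
    """Two per-file bits kept on stable storage (hypothetical layout)."""
    had_open: bool       # the file had an OPEN before the restart
    had_deny_mode: bool  # some OPEN used a deny mode other than none


def may_open_during_grace(bits: PersistedOpenBits) -> bool:
    """Allow the OPEN during the grace period only if both bits are zero."""
    return not bits.had_open and not bits.had_deny_mode
```

If either bit is set, the server cannot tell whether the new OPEN conflicts with state a client may still reclaim, so the request has to wait for the grace period to end.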

> 
> ---
> 
> 12.7.6. Storage Device Recovery
> ===============================
> 
> "Second, the best solution is for the client to err on the side of
> caution and attempt to re-write the modified data through another
> path."
> 
> Add "if available" to the end of the sentence?

OK.

I wasn't thinking about permanent crashes here; I was thinking about
a storage server that restarted.

> 
> The client should immediately write the data to the metadata server,
> ..."
> 
> This seems like an either/or situation to me? The client could either
> try to write the data to the device through another path *or* write it
> through the MDS with the stable flag set to FILE_SYNC4.

Yes I think more text could be added here.
I think the safest is to go the metadata server route.
If the storage device restarts, and the storage protocol supports
sessions, and the session persisted, then retrying works, but
if the device failed, it could just as easily be messed up after
restart.

With NFSv4.1 layouts, the client could issue an EXCHANGE_ID to
the secondary data server and if it gets back the same client and server owner
as the failed primary, then it knows it can use the session it
had with the primary, and retry. That feels safe because clearly
the problem was just a failed network path, not a data server failure.
Other storage protocols might have similar trunking systems. So more
text.
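
The comparison could look roughly like the sketch below. It assumes the client keeps the client ID and server owner it got from the primary; the record type is a hypothetical simplification, since a real EXCHANGE_ID result carries more fields than shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExchangeIdResult:
    """Simplified, hypothetical view of an EXCHANGE_ID reply."""
    client_id: int
    server_owner: bytes


def can_reuse_session(primary: ExchangeIdResult,
                      secondary: ExchangeIdResult) -> bool:
    """True if the secondary address reports the same client and server
    owner as the failed primary, i.e. it is another path to the same
    data server rather than a different server."""
    return (primary.client_id == secondary.client_id
            and primary.server_owner == secondary.server_owner)
```

When the check fails, the client has no basis for retrying on the old session and would fall back to the metadata server route.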

Otherwise, I don't know if in general using another path to retry the
WRITE is safe. I'm worried by the scenario in this passage on
exactly once semantics (section 2.10.4.)

  Suppose a client issues WRITEs A, B, C to a 
  noncompliant server that does not enforce EOS, and 
  receives no response, perhaps due to a network partition. 
  The client reconnects to the server and re-issues all 
  three WRITEs. Now, the server has outstanding two instances 
  of each of A, B, and C. The server can be in a situation in 
  which it executes and replies to the retries of A, B, and C 
  while the first A, B, and C are still waiting around in the 
  server's I/O system for some resource. Upon receiving the 
  replies to the second attempts of WRITEs A, B, and C, the 
  client believes its writes are done so it is free to do issue 
  WRITE D which overlaps the range of one or more of A, B, C. 
  If any of A, B, or C are subsequently are executed for the 
  second time, then what has been written by D can be overwritten 
  and thus corrupted.

If the two storage devices don't have a common reply cache,
exactly once semantics aren't possible, and so dangling WRITEs
like these remain a risk. Re-sending via the metadata server, which
has control over the data server, prevents these races. I agree it is
perfectly fine for the client to issue new I/Os to the secondary
path/device.
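
The race in the quoted passage can be reproduced with a toy model. This is only a sketch: the single-block store, the request IDs, and the `use_reply_cache` switch are hypothetical simplifications of what a server that enforces exactly once semantics would do.

```python
from typing import Dict, List, Tuple

# (request id, block number, data); a retry carries the same request id.
Write = Tuple[str, int, str]


def run(execution_order: List[Write], use_reply_cache: bool) -> Dict[int, str]:
    """Apply WRITEs in the order the server happens to execute them."""
    store: Dict[int, str] = {}
    seen = set()  # request ids already executed; stands in for a reply cache
    for req_id, block, data in execution_order:
        if use_reply_cache and req_id in seen:
            continue  # duplicate of an executed request: do not re-execute
        seen.add(req_id)
        store[block] = data
    return store


# The retry of A executes, then the new WRITE D, then the stalled original A.
order = [("A", 0, "a"), ("D", 0, "d"), ("A", 0, "a")]
without_cache = run(order, use_reply_cache=False)  # block 0 ends up "a"
with_cache = run(order, use_reply_cache=True)      # block 0 ends up "d"
```

Without the shared cache, the stalled first instance of A overwrites D, which is the corruption the passage warns about.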

I wonder if the client needs an op to tell the metadata server that
the storage device failed?
msg538 Author: mre Date: 2007-03-03.07:21:02
-----Original Message-----
From: Goodson, Garth 
Sent: Friday, March 02, 2007 2:09 PM
To: Eisler, Michael
Subject: possible type in draft 9

12.6

If the server's response does not have EXCHGID4_FLAG_USE_PNFS_MDS, then 
contrary to what the fs_layout_type attribute said, the client does not
                                                         ^^^^^^
support pNFS, and the client will not be able use pNFS to that server

client should be server I think.
msg537 Author: mre Date: 2007-03-02.23:47:29
-----Original Message-----
From: Eisler, Michael 
Sent: Friday, March 02, 2007 3:47 PM
To: Goodson, Garth
Cc: Iyer, Rahul
Subject: RE: possible type in draft 9

 

> -----Original Message-----
> From: Goodson, Garth 
> Sent: Friday, March 02, 2007 3:14 PM
> To: Eisler, Michael
> Cc: Iyer, Rahul
> Subject: Re: possible type in draft 9
> 
> Couple of other comments (sorry I'm just now getting to read it)...
> 
> 12.7.2 should probably also talk about layout lease expiration.

OK.

> 
> I think we need to bring up on the WG that we have now tied 
> devices to 
> leases.  I think this is a fairly big change that others 
> should be made 
> aware of.

I wasn't aware that it was a big change (I think this was
there before I started editing the chapter) but OK. 

> 
> 12.7.4
> 
> 2nd bullet point
> 
> If the client synchronously wrote data to the storage device, 
> but still 
> ^has^ a copy of that data in its memory
> 
> 3rd bullet
>   the device IDs may map to difference device addresses
>                             ^^^^^^^^^^ different
> last bullet
> the failure of LAYOUTCOMMIT means the data in the range <loca_offset, 
> loca_length> lost
> 
> I would say "potentially lost" it really depends on the 
> backend.  E.g., 
> in GX with coral, the LAYOUTCOMMIT is not really required.

Then the metadata server shouldn't return an error from LAYOUTCOMMIT ...
we traded email on this, and I believe that was the argument you were
making (LAYOUTCOMMIT should almost never fail after
a metadata restart). If GX/Coral doesn't need LAYOUTCOMMIT (perhaps that
should be conveyed somehow in the return from LAYOUTGET), then
it doesn't need to return an error.

> 
> 12.7.6
> 
> The client should immediately write the data to the metadata server,
> 
> It is equally valid for the client to use one of the 
> multipath devices 
> listed in the device (for v4.1 file layout) or some other multipath 
> mechanism (e.g., for LUNs).
> 
> But it is always safest to write through the MDS

OK, I will make this update in draft-10.

> 
> (I think this may map to Comment 16 found in 13.6)
> 
> 13.7
> 
> Maybe add something about using the sessions tied to data servers not 
> accepting mds type operations.

OK.

> 
> 13.9
> 
> Why no delegation stateids?

This was one of the issues in the issues tracker. 
http://www.nfsv4-editor.org/cgi-bin/roundup/nfsv4/issue141 .
I see that I misinterpreted the issue.
History
Date User Action Args
2008-01-11 19:56:41 mre set assignedto: mre -> spencer.shepler, nosy: + spencer.shepler
2007-11-07 18:50:24 mre set status: unread -> resolved-edit-complete
2007-03-03 07:22:21 mre set messages: + msg539
2007-03-03 07:21:03 mre set messages: + msg538
2007-03-02 23:47:30 mre create