From braden Tue Jan 2 14:54:58 1990 Received-Date: Tue, 2 Jan 90 14:54:58 -0800 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Tue, 2 Jan 90 14:54:58 -0800 Date: Tue, 2 Jan 90 14:55:45 PST From: braden Posted-Date: Tue, 2 Jan 90 14:55:45 PST Message-Id: <9001022255.AA12875@braden.isi.edu> Received: by braden.isi.edu (4.0/4.0.3-4) id ; Tue, 2 Jan 90 14:55:45 PST To: end2end-interest Subject: X-Kernel ----- Begin Included Message ----- From llp@cs.arizona.edu Tue Jan 2 14:28:40 1990 Date: Tue, 2 Jan 90 15:25:27 MST From: "Larry Peterson" To: xkernel-group@cs.arizona.edu, llp@cs.arizona.edu Subject: distribution announcement We have put together a public distribution of the x-kernel. You can retrieve a copy by doing anonymous ftp to cs.arizona.edu and getting xkernel/xkernel.tar.Z. We've also established a mail group (this group) where we will put postings about future releases. You're on this list because at some point in time you expressed at least a passing interest in the x-kernel. We don't expect much traffic on this group (there's a separate xkernel-bugs alias), but if you want your name taken off, please drop me a note. Likewise, let me know of anyone else that might like to be on this group. Cheers, Larry ----- End Included Message ----- From braden Thu Jan 11 14:30:49 1990 Received-Date: Thu, 11 Jan 90 14:30:49 -0800 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Thu, 11 Jan 90 14:30:49 -0800 Date: Thu, 11 Jan 90 14:31:40 PST From: braden Posted-Date: Thu, 11 Jan 90 14:31:40 PST Message-Id: <9001112231.AA16877@braden.isi.edu> Received: by braden.isi.edu (4.0/4.0.3-4) id ; Thu, 11 Jan 90 14:31:40 PST To: end2end-interest Subject: Issue on Symmetric Routes Cc: estrin@usc.edu Deborah Estrin, chairperson of ANRG, has posed the following question to the E2E RG. Although routing issues are generally out of the scope of E2E, there are some end-to-end performance issues here too. 
Bob Braden

----- Included Message -----

Something that keeps coming up is this issue of symmetric routes. I wonder if you could bring this up with E2E, either via email or at the next meeting. I know D. Clark's opinion on this and would like to hear comments of other E2E members... (rumor has it that not everyone on E2E always agrees with one another...)

Some argue that we want symmetric paths because:

a. congestion control mechanisms use reverse flows
b. we want to avoid two bills
c. network stability based on estimates on RTD can get messed up if acks and data take different routes
d. half-live/dead connections are a painful form of error.
e. it seems to make certain policies much easier to implement (see DDC's RFC-1102, Section 7).

On the other hand:

f. policy-related constraints themselves (e.g. access control or billing) might lead to different routes in the two directions.
g. when you are doing source routing and synthesizing routes based on source and transit policies, then you don't know if the route can be used in both directions without having access to the destination's policies and doing route synthesis from that perspective also.

OF COURSE everyone hopes and plans that in the GENERAL case routes should be symmetric, that policy will allow them to be, and that the preferred return route should be the one that the source took in getting to the destination. BUT what about exceptions? Should we allow them? How strongly do folks feel about this issue? Is it just that one should be able to request symmetric paths or to favor them over others by defaulting to trying to reverse the source route, for example?
----- End Included Message ----- From casner Thu Jan 11 19:51:37 1990 Received-Date: Thu, 11 Jan 90 19:51:37 -0800 Received: from casner.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Thu, 11 Jan 90 19:51:37 -0800 Posted-Date: Thu 11 Jan 90 19:50:40-PST Received: by casner.isi.edu (4.0/4.0.3-4) id ; Thu, 11 Jan 90 19:50:42 PST Date: Thu 11 Jan 90 19:50:40-PST From: Stephen Casner Subject: Re: Issue on Symmetric Routes To: braden, END2END-INTEREST Cc: estrin@usc.edu Message-Id: <632116240.0.CASNER@ISI.EDU> In-Reply-To: <9001112231.AA16877@braden.isi.edu> Mail-System-Version: Allow me to contribute an opinion from the radicals and heretics WG (connection-oriented IP). We believe asymmetric routes should be allowed because resources may not be available in both directions on the forward route. If this seems unlikely, consider a mix of applications including asymmetric ones (FTP) or a mesh of multicast connections where hop choices may look different from various directions. In fact, for the near term (ST-II) protocol, we've chosen to make connections simplex (with one or more destinations) and use a separate connection from each source (say, in a conference). The family of connections for a session would be grouped at a higher layer. We may be the lunatic fringe, but we believe that our resource reservation mechanisms will need to work in concert with the policy routing mechanisms, too. Or should we expect to be constrained to symmetry at AD boundary points? 
-- Steve
-------

From craig@NNSC.NSF.NET Fri Jan 12 05:44:35 1990
Received-Date: Fri, 12 Jan 90 05:44:35 -0800
Received: from vaxa.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Fri, 12 Jan 90 05:44:35 -0800
Posted-Date: Fri, 12 Jan 90 08:11:50 -0500
Message-Id: <9001121314.AA19240@vaxa.isi.edu>
Received: from [192.31.103.6] by vaxa.isi.edu (5.61/5.61) id AA19240; Fri, 12 Jan 90 05:14:55 -0800
To: braden
Cc: end2end-interest, estrin@usc.edu
Subject: re: Issue on Symmetric Routes
Date: Fri, 12 Jan 90 08:11:50 -0500
From: Craig Partridge

> c. network stability based on estimates on RTD can get messed up if
> acks and data take different routes

Is this really true? I thought the algorithms based on round-trip delay were just that, based on round-trip delay, not one-way path delay. Have I missed something?

Craig

From braden Fri Jan 19 11:13:31 1990
Received-Date: Fri, 19 Jan 90 11:13:31 -0800
Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Fri, 19 Jan 90 11:13:31 -0800
Date: Fri, 19 Jan 90 11:14:53 PST
From: braden
Posted-Date: Fri, 19 Jan 90 11:14:53 PST
Message-Id: <9001191914.AA18849@braden.isi.edu>
Received: by braden.isi.edu (4.0/4.0.3-4) id ; Fri, 19 Jan 90 11:14:53 PST
To: end2end-interest
Subject: "Chaotic behavior?"
Cc: braden

In an article in today's NYTimes (that well-known technical journal!), John Markoff writes about "AT&T's Trouble Shows Computers Defy Understanding Even In Collapse". Not a bad article, actually. The following paragraph is included (quoted without permission):

"In another example, in 1987, computer designers at TRW Inc., a large government contractor, were surprised to find that a computer network they had strung together in Europe for a United States intelligence agency was exhibiting strange, unpredictable behavior. On close examination the engineers discovered nothing wrong with the design of the system, which linked hundreds of computers as part of a military data communications network."
"They later said they suspected that they were confronted with the mathematical concept called chaos, a way of describing otherwise unpredictable manmade and natural phenomena like turbulence in rapidly moving water or in the atmosphere."

Please, no flaming about this definition of chaos; that is not the point. I am wondering whether this was a kind of end-to-end instability that we are looking for. Does anyone know about this incident, or whom to contact to get more information about it?

Bob Braden

From perry@MCL.Unisys.COM Fri Jan 19 16:32:43 1990
Posted-Date: Fri, 19 Jan 90 19:32:27 EST
Received-Date: Fri, 19 Jan 90 16:32:43 -0800
Received: from KAUAI.MCL.UNISYS.COM by venera.isi.edu (5.61/5.61+local) id ; Fri, 19 Jan 90 16:32:43 -0800
Received: from LANAI.MCL.UNISYS.COM ([192.31.44.6]) by kauai.MCL.Unisys.COM (4.1/Domain/jpb/mls/2.9) id AA01182; Fri, 19 Jan 90 19:33:51 EST
Received: by LANAI.MCL.UNISYS.COM [4.1/Domain/jbp/2.4] id AA13544; Fri, 19 Jan 90 19:32:27 EST
Date: Fri, 19 Jan 90 19:32:27 EST
From: perry@MCL.Unisys.COM (Dennis Perry)
Message-Id: <9001200032.AA13544@LANAI.MCL.UNISYS.COM>
To: braden@venera.isi.edu, end2end-interest@venera.isi.edu
Subject: Re: "Chaotic behavior?"
Cc: perry@mcl.unisys.com

Bob,

I talked to my brother, Dewayne, who works at Bell Labs in Murray Hill doing research in software engineering environments. As of a few days ago, there was still no definitive word on what had happened. The software had been deployed for about a year. He indicated that two switches had malfunctioned. What I read in the paper indicated only one switch in NY. Apparently, an untested occurrence occurred and caused the problem.
dennis From Mills@udel.edu Fri Jan 19 21:40:01 1990 Posted-Date: Sat, 20 Jan 90 5:25:03 GMT Received-Date: Fri, 19 Jan 90 21:40:01 -0800 Received: from louie.udel.edu by venera.isi.edu (5.61/5.61+local) id ; Fri, 19 Jan 90 21:40:01 -0800 Received: from huey.udel.edu by louie.udel.edu id aa21788; 20 Jan 90 0:30 EST Date: Sat, 20 Jan 90 5:25:03 GMT From: Mills@udel.edu To: braden@venera.isi.edu Cc: end2end-interest@venera.isi.edu, braden@venera.isi.edu Subject: Re: "Chaotic behavior?" Message-Id: <9001200025.aa18230@huey.udel.edu> Bob, You are probably aware of the French TRANSPAC network fiasco some years back in which the X.25 PADs apparently got into a shouting match, but had not yet adopted "chaos" to the Franglois. Unfortunately, I do not have specific pointers to that incident either. Onward System Seven! Dave From craig@NNSC.NSF.NET Mon Jan 22 05:18:23 1990 Posted-Date: Mon, 22 Jan 90 08:16:56 -0500 Received-Date: Mon, 22 Jan 90 05:18:23 -0800 Message-Id: <9001221318.AA13364@venera.isi.edu> Received: from WS6.NNSC.NSF.NET by venera.isi.edu (5.61/5.61+local) id ; Mon, 22 Jan 90 05:18:23 -0800 To: end2end-interest Subject: simulation of DEC bit Date: Mon, 22 Jan 90 08:16:56 -0500 From: Craig Partridge ------- Forwarded Message Received: from allspice.lcs.mit.edu by NNSC.NSF.NET id aa27157; 21 Jan 90 19:08 EST Received: from [128.8.128.102] by PTT.LCS.MIT.EDU via TCP with SMTP id AA29113; Sun, 21 Jan 90 19:06:12 EST Received: by prufrock.cs.UMD.EDU (5.61/UMIACS-0.9/04-05-88) id AA01892; Sun, 21 Jan 90 19:06:05 -0500 Date: Sun, 21 Jan 90 19:06:05 -0500 From: Jean Bolot Message-Id: <9001220006.AA01892@prufrock.cs.UMD.EDU> To: info-netsim@ALLSPICE.LCS.MIT.EDU Subject: simulation of DEC sources Hello everybody, I am using the MIT simulator to observe the behavior of networks in which sources have different retransmission policies. Most of you on this list might be aware that the simulator can handle 4.2, 4.3 and Jacobson-Karels retransmission policies. 
My question is the following: has anyone written (or does anyone intend to write in the near future) a version that handles DEC-style retransmission (i.e., with the DEC congestion avoidance bit, and window size adjustments based on the values of the bits received)? Thanks in advance for your help. - -Jean Jean Bolot Department of Computer Science University of Maryland College Park, MD 20742 ------- End of Forwarded Message From lixia@arisia.Xerox.COM Mon Jan 22 16:10:43 1990 Posted-Date: Mon, 22 Jan 90 16:10:07 -0800 Received-Date: Mon, 22 Jan 90 16:10:43 -0800 Received: from arisia.Xerox.COM by venera.isi.edu (5.61/5.61+local) id ; Mon, 22 Jan 90 16:10:43 -0800 Received: by arisia.Xerox.COM (5.61+/IDA-1.2.8/gandalf) id AA01484; Mon, 22 Jan 90 16:10:07 -0800 Message-Id: <9001230010.AA01484@arisia.Xerox.COM> Date: Mon, 22 Jan 90 16:10:07 -0800 From: Lixia Zhang To: craig@NNSC.NSF.NET, end2end-interest@venera.isi.edu Subject: Re: simulation of DEC bit I plan to code the DECBIT scheme into MY simulator sometime soon and I'd like to collect comments/suggestions what to see about DECBIT. Lixia PS: My simulator is substantially different from NETSIM so this won't help Jean Bolot (so didn't include him in reply). From braden Wed Jan 31 12:31:44 1990 Received-Date: Wed, 31 Jan 90 12:31:44 -0800 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Wed, 31 Jan 90 12:31:44 -0800 Date: Wed, 31 Jan 90 12:33:07 PST From: braden Posted-Date: Wed, 31 Jan 90 12:33:07 PST Message-Id: <9001312033.AA01699@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Wed, 31 Jan 90 12:33:07 PST To: federation@nnsc.nsf.net, ietf, long@nic.near.net Subject: Re: New Working Group on End-to-End User Connectivity Cc: braden, end2end-interest, pgross@nri.reston.va.us Dan, Sorry, but I have to declare prior use of the names "end-to-end" and "end2end". 
It would create a lot of confusion in IETF/IRTF land, to have both an IETF working group and an IRTF research group with the same name. Bob Braden Chair, End-to-End Research Group ----- Begin Included Message ----- From owner-ietf@venera.isi.edu Mon Jan 29 22:05:20 1990 To: ietf@venera.isi.edu, federation@nnsc.nsf.net Subject: New Working Group on End-to-End User Connectivity Date: Mon, 29 Jan 90 22:07:47 -0500 From: Daniel Long Hi! For those of you who missed Craig's announcement, this IETF working group is being formed in cooperation with FARNET. Its goal is to develop a straw-man plan for how a user having difficulty with the network can report problems and, how to make sure that, once reported, the problem gets fixed. (Note that because the Internet has a large number of management entities involved in running it, making sure a problem gets fixed requires a well-defined understanding about cooperation required among the various administrators). I've created a mailing list for folks interested in participating in or learning more about the WG. Please send to: end2end-request@nic.near.net The WG will meet on Wednesday afternoon at the upcoming IETF. I'd like to invite anyone who's interested to attend and get involved. Let me know if you have any questions. Regards, Dan Long NEARnet/BBN 617-873-2766 ----- End Included Message ----- From braden Wed Jan 31 14:26:36 1990 Received-Date: Wed, 31 Jan 90 14:26:36 -0800 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Wed, 31 Jan 90 14:26:36 -0800 Date: Wed, 31 Jan 90 14:28:06 PST From: braden Posted-Date: Wed, 31 Jan 90 14:28:06 PST Message-Id: <9001312228.AA01808@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Wed, 31 Jan 90 14:28:06 PST To: end2end-interest Subject: Deflection routing At the BBN workshop last week, Maxemchuk sung the praises of "deflection routing". Can someone explain to me what that is, and/or give me a reference I can read, please? Thanks. 
Bob Braden

From Shenker.pa@Xerox.COM Wed Jan 31 17:03:44 1990
Posted-Date: 31 Jan 90 16:54:08 PST (Wednesday)
Received-Date: Wed, 31 Jan 90 17:03:44 -0800
Received: from Xerox.COM by venera.isi.edu (5.61/5.61+local) id ; Wed, 31 Jan 90 17:03:44 -0800
Received: from Cabernet.ms by ArpaGateway.ms ; 31 JAN 90 16:54:06 PST
Date: 31 Jan 90 16:54:08 PST (Wednesday)
From: Shenker.pa@Xerox.COM
Subject: Re: Deflection routing
In-Reply-To: <9001312228.AA01808@braden.isi.edu>
To: braden@venera.isi.edu
Cc: end2end-interest@venera.isi.edu
Message-Id: <900131-165406-7293@Xerox>

As I understand it, Maxemchuk works with a Manhattan network (with one-way streets) with each node having two input ports and two output ports. At each time step (the net is synchronous), the incoming packets (at most one from each port) are routed to the outgoing ports. If both incoming packets want to use the same outgoing port, one of them is "deflected" to the other port. Since the network topology is a mesh, subsequent routing decisions can ensure that the deflected packet eventually gets home.

Scott

From craig@NNSC.NSF.NET Thu Feb 1 04:57:18 1990
Posted-Date: Thu, 01 Feb 90 07:56:44 -0500
Received-Date: Thu, 1 Feb 90 04:57:18 -0800
Message-Id: <9002011257.AA06781@venera.isi.edu>
Received: from WS6.NNSC.NSF.NET by venera.isi.edu (5.61/5.61+local) id ; Thu, 1 Feb 90 04:57:18 -0800
To: braden
Cc: end2end-interest
Subject: re: Deflection routing
Date: Thu, 01 Feb 90 07:56:44 -0500
From: Craig Partridge

Bob:

Funny you should ask... I asked Maxemchuk last week.

Craig

Received: from inet.att.com by NNSC.NSF.NET id aa01043; 29 Jan 90 14:03 EST
Received: by research; Mon Jan 29 14:03:34 1990
Date: Mon, 29 Jan 90 14:04:32 EST
From: Nick Maxemchuk
To: craig@NNSC.NSF.NET
Subject: Re: Manhattan routing

I've written quite a few. These are probably the best overviews:

N. F. Maxemchuk, "Regular and Mesh Topologies in Local and Metropolitan Area Networks," AT&T Technical Journal, Sept. 85, vol. 64, no. 7, pp.
1659-1686.

N. F. Maxemchuk, "The Manhattan Street Network," Proc. of Globecom '85, pp. 255-261, Dec. 2-5, 1986, New Orleans, LA.

N. F. Maxemchuk, "Routing in the Manhattan Street Network," IEEE Trans. on Commun., May 1987, vol. COM-35, no. 5, pp. 503-512.

N. F. Maxemchuk, "Comparison of Deflection and Store-and-Forward Techniques in the Manhattan Street and Shuffle-Exchange Networks," IEEE INFOCOM '89, Ottawa, Ont. Canada, April 25-27, 1989, pp. 800-809.

There is also a paper I'm preparing on livelocks and flow control, but the draft is very rough.

Nick

From braden Thu Feb 1 12:17:49 1990
Received-Date: Thu, 1 Feb 90 12:17:49 -0800
Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Thu, 1 Feb 90 12:17:49 -0800
Date: Thu, 1 Feb 90 12:19:18 PST
From: braden
Posted-Date: Thu, 1 Feb 90 12:19:18 PST
Message-Id: <9002012019.AA02082@braden.isi.edu>
Received: by braden.isi.edu (4.1/4.0.3-4) id ; Thu, 1 Feb 90 12:19:18 PST
To: Shenker.pa@xerox.com
Subject: Re: Deflection routing
Cc: end2end-interest

Scott,

Thanks for the explanation. If I understand, he resolves contention by introducing some local "mis-routing". Presumably this is virtual-circuit routing, so each packet contains a VCI that can be mapped into the preferred output port at any router.

Bob

From braden Wed Feb 7 12:12:50 1990
Received-Date: Wed, 7 Feb 90 12:12:50 -0800
Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Wed, 7 Feb 90 12:12:50 -0800
Date: Wed, 7 Feb 90 12:16:49 PST
From: braden
Posted-Date: Wed, 7 Feb 90 12:16:49 PST
Message-Id: <9002072016.AA05630@braden.isi.edu>
Received: by braden.isi.edu (4.1/4.0.3-4) id ; Wed, 7 Feb 90 12:16:49 PST
To: westine
Subject: Monthly Report for E2E RG
Cc: braden, end2end-interest, estrin@usc.edu

END-TO-END RESEARCH GROUP

The End-to-End Research Group met for two days at the Xerox Palo Alto Research Center (PARC) on January 17-18.
The entire second day was devoted to a joint meeting with the Privacy and Security Working Group; see their report for a summary of this day. The topics discussed during the first day included the following.

o MULTICASTING

  The group discussed the question: what are the primary areas left for research in Internet multicasting (see below).

o DATA ENCODING

  Craig Partridge presented some partial results for performance measurements on various data encodings. It was observed that our machines are increasingly RISC-based with boundary alignment constraints; "bytes are not cheap". A useful effort would be to design a data encoding that would be efficient for RISC CPUs.

o ASYMMETRIC ROUTES

  The group discussed the architectural question: are asymmetric routes (i.e., different routes in the two directions) intrinsically bad? This issue arose out of Dave Clark's RFC on policy-based routing (RFC-1102), which argued for symmetric routes at the inter-AD level. Three arguments have been raised against asymmetric routes: (1) they double the effort for billing and accounting; (2) they cause "weirdness" when things break; and (3) there may be a problem constructing a path for sending control information back to the source from an intermediate gateway.

  The group decided that, with respect to (3), an architecture that makes it difficult to send an error message from the gateway system back to the source is a bad idea. Clark suggested one solution might be to send control or error information forward to the destination and thence back to the source (like the DEC congestion bit).

o TCP WINDOW SIZE

  Revisiting the issue of TCP operation over a "big fat pipe" (see RFC-1072, RFC-1106), we learned that Van Jacobson is thinking of a modification to classical-VJ slow-start and congestion-avoidance, to handle this problem. The RG wants to follow up on this.

o HIGH SPEED PROTOCOLS

  Dave Clark presented his latest strategy for defeating protocol layering.
A number of the discussions concerned the general question: what are the research issues? The following issues were identified in the meeting: o Scaling issues for resource location protocols. o Implications for all protocol layers of synchronized clocks. o Scaling of clock synchronization protocols, e.g., NTP. o Best algorithm for multicast routing. o Inter-AD multicast routing algorithms. o Congestion control with multicasting. o Self-organizing set of agents, e.g., NTP agents. o Data encodings efficient for RISC chips. o TCP congestion control over big fat pipes. o Possible phase changes with network growth. Bob Braden From legato!Legato.COM!nowicki@Sun.COM Fri Feb 9 16:47:50 1990 Posted-Date: Fri, 9 Feb 90 14:25:59 PST Received-Date: Fri, 9 Feb 90 16:47:50 -0800 Received: from Sun.COM by venera.isi.edu (5.61/5.61+local) id ; Fri, 9 Feb 90 16:47:50 -0800 Received: from sun.Sun.COM (sun-bb.Corp.Sun.COM) by Sun.COM (4.1/SMI-4.1) id AA11084; Fri, 9 Feb 90 16:50:02 PST Received: from legato.UUCP by sun.Sun.COM (4.1/SMI-4.1) id AA14529; Fri, 9 Feb 90 16:49:00 PST Received: from rose.Legato.COM by Legato.COM (4.0/SMI-4.0) id AA05542 for end2end-interest@venera.isi.edu; Fri, 9 Feb 90 14:25:59 PST Date: Fri, 9 Feb 90 14:25:59 PST From: nowicki@Legato.COM (Bill Nowicki) Message-Id: <9002092225.AA05542@Legato.COM> To: braden@venera.isi.edu Subject: Re: Monthly Report for E2E RG Cc: end2end-interest@venera.isi.edu Date: Wed, 7 Feb 90 12:16:49 PST From: braden@venera.isi.edu Subject: Monthly Report for E2E RG o DATA ENCODING Craig Partridge presented some partial results for performance measurements on various data encodings. It was observed that our machines are increasingly RISC-based with boundary alignment constraints; "bytes are not cheap". A useful effort would be to design a data encoding that would be efficient for RISC CPUs. That was not my conclusion. 
I thought the intuition was that XDR would have an even GREATER advantage on RISC machines, since it already aligns everything. What I would like to see is some measurements to get an idea of how much the speed penalty is for ASN.1 on a RISC, as well as how XDR compares to NDR on little-endian RISCs. This must be weighted by the fact that of the four combinations of byte order for sender and receiver, NDR is definitely worse in at least three of the four cases. Why do we need yet another design, when we do not understand how the existing designs compare?

-- WIN

From braden Wed Feb 14 09:33:03 1990
Received-Date: Wed, 14 Feb 90 09:33:03 -0800
Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Wed, 14 Feb 90 09:33:03 -0800
Date: Wed, 14 Feb 90 09:37:00 PST
From: braden
Posted-Date: Wed, 14 Feb 90 09:37:00 PST
Message-Id: <9002141737.AA07119@braden.isi.edu>
Received: by braden.isi.edu (4.1/4.0.3-4) id ; Wed, 14 Feb 90 09:37:00 PST
To: end2end-interest
Subject: Forwarded mail

----- Begin Included Message -----

From craig@NNSC.NSF.NET Wed Feb 14 08:17:10 1990
To: end2end-tf@ISI.EDU
To: iesg@nri.reston.va.us
Cc: zweig@cs.uiuc.edu
Subject: TCP alternate checksums...
Date: Wed, 14 Feb 90 11:11:50 -0500
From: Craig Partridge

Hi folks:

Johnny Zweig and I have been puttering on this RFC on and off since he sent a note last year asking about supporting alternative checksums in TCP. I view this purely as a research note, saying "if you wanted to do this, here's a method." I doubt this would be a candidate for near-term standardization.

Comments welcome...

Craig

Internet Engineering Task Force                                 J. Zweig
Request for Comments: XXXX                                          UIUC
                                                            C. Partridge
                                                                     BBN
                                                           February 1990

                     TCP Alternate Checksum Options

Status of This Memo

This memo proposes a pair of TCP options to allow use of alternate data checksum algorithms in the TCP header. The use of these options is experimental, and not recommended for production use. Distribution of this memo is unlimited.
Introduction

Some members of the networking community have expressed interest in using checksum-algorithms with different error detection and correction properties than the standard TCP checksum. The option described in this memo provides a mechanism to negotiate the use of an alternate checksum at connection-establishment time, as well as a mechanism to carry additional checksum information for algorithms that utilize checksums that are longer than 16 bits.

Definition of the Options

The TCP Alternate Checksum Request Option may be sent in a SYN segment by a TCP to indicate that the TCP is prepared to both generate and receive checksums based on an alternate algorithm. During communication, the alternate checksum replaces the regular TCP checksum in the checksum field of the TCP header. Should the alternate checksum require more than 2 octets to transmit, the checksum may either be moved into a TCP Alternate Checksum Data Option and the checksum field of the TCP header sent as 0, or the data may be split between the header field and the option. Alternate checksums are computed over the same data as the regular TCP checksum (see TCP Alternate Checksum Data Option discussion below).

TCP Alternate Checksum Request Option

The format of the TCP Alternate Checksum Request Option is:

    +----------+----------+----------+
    |  Kind=X  | Length=3 |  chksum  |
    +----------+----------+----------+

Here chksum is a number identifying the type of checksum to be used. The currently defined values of chksum are:

    0 -- TCP checksum
    1 -- 8-bit Fletcher's algorithm (see Appendix I)
    2 -- 16-bit Fletcher's algorithm (see Appendix II)

Note that the 8-bit Fletcher algorithm has a 16-bit checksum and the 16-bit algorithm gives a 32-bit checksum.

Alternate checksum negotiation proceeds as follows:

A SYN segment used to originate a connection may contain the Alternate Checksum Request Option, which specifies an alternate checksum-calculation algorithm to be used for the connection.
The acknowledging SYN-ACK segment may also carry the option. If both SYN segments carry the Alternate Checksum Request option, and both specify the same algorithm, that algorithm must be used for the remainder of the connection. Otherwise, the standard TCP checksum algorithm must be used for the entire connection.

Any segment with the SYN bit set must always use the standard TCP checksum algorithm. Thus the SYN segment will always be understood by the receiving TCP. The alternate checksum must not be used until the first non-SYN segment. The option may not be sent in any segment that does not have the SYN bit set.

An implementation of TCP which does not support the option should silently ignore it (as RFC 1122 requires). Ignoring the option will force any TCP attempting to use an alternate checksum to use the standard TCP checksum algorithm, thus ensuring interoperability.

TCP Alternate Checksum Data Option

The format of the TCP Alternate Checksum Data Option is:

    +---------+---------+---------+     +---------+
    | Kind=Z  |Length=N |  data   | ... |  data   |
    +---------+---------+---------+     +---------+

This field is used only when the alternate checksum that is negotiated is longer than 16 bits. These checksums will not fit in the checksum field of the TCP header and thus at least part of them must be put in an option. Whether the checksum is split between the checksum field in the TCP header and the option or the entire checksum is placed in the option is determined on a checksum by checksum basis.

The length of this option will depend on the choice of alternate checksum algorithm for this connection.

While computing the alternate checksum, the TCP checksum field and the data portion of the TCP Alternate Checksum Data Option are replaced with zeros.
An otherwise acceptable segment carrying this option on a connection using a 16-bit checksum algorithm, or carrying this option with an inappropriate number of data octets for the chosen alternate checksum algorithm, is in error and must be discarded; a RST-segment must be generated, and the connection aborted.

APPENDIX I: The 8-bit Fletcher Checksum Algorithm

The 8-bit Fletcher Checksum Algorithm is calculated over a sequence of data octets (call them D[1] through D[N]) by maintaining 2 unsigned 1's-complement 8-bit accumulators A and B whose contents are initially zero, and performing the following loop where i ranges from 1 to N:

    A := A + D[i]
    B := B + A

It can be shown that at the end of the loop A will contain the 8-bit 1's complement sum of all octets in the datagram, and that B will contain (N)D[1] + (N-1)D[2] + ... + D[N].

The octets covered by this algorithm should be the same as those over which the standard TCP checksum calculation is performed, with the pseudoheader being D[1] through D[12] and the TCP header beginning at D[13]. Note that, for purposes of the checksum computation, the checksum field itself must be equal to zero.

At the end of the loop, A goes in the first byte of the TCP checksum and B goes in the second byte.

Note that, unlike the OSI version of the Fletcher checksum, this checksum does not adjust the check bytes so that the receiver checksum is 0.

There are a number of much faster algorithms for calculating the two octets of the 8-bit Fletcher checksum. For more information see [Sklower89], [Nakassis88] and [Fletcher82]. Naturally, any computation which computes the same number as would be calculated by the loop above may be used to calculate the checksum. One advantage of the Fletcher algorithms over the standard TCP checksum algorithm is the ability to detect the transposition of octets/words of any size within a datagram.
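As a rough illustration of the Appendix I loop, here is a minimal Python sketch. It assumes that "1's-complement 8-bit accumulator" means addition with end-around carry; the names `ones_complement_add` and `fletcher8` are hypothetical and do not come from the memo, and the input is just a byte string rather than a real pseudoheader plus segment.

```python
def ones_complement_add(x: int, y: int, bits: int = 8) -> int:
    """Add two `bits`-wide values with end-around carry (assumed reading
    of the memo's "1's-complement accumulator")."""
    mask = (1 << bits) - 1
    total = x + y
    # Fold the carry-out (0 or 1) back into the low bits.
    return (total & mask) + (total >> bits)


def fletcher8(octets: bytes) -> tuple[int, int]:
    """Return (A, B) after running the Appendix I loop over `octets`."""
    a = 0
    b = 0
    for d in octets:
        a = ones_complement_add(a, d)   # A := A + D[i]
        b = ones_complement_add(b, a)   # B := B + A
    return a, b


if __name__ == "__main__":
    # Swapping two octets leaves A unchanged but alters B, which is the
    # transposition-detection property mentioned in the text.
    print(fletcher8(bytes([1, 2])))
    print(fletcher8(bytes([2, 1])))
```

Under these assumptions A would go in the first octet of the TCP checksum field and B in the second, as the appendix states.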
APPENDIX II: The 16-bit Fletcher Checksum Algorithm

The 16-bit Fletcher Checksum algorithm proceeds in precisely the same manner as the 8-bit checksum algorithm, except that A, B and the D[i] are 16-bit quantities. It is necessary (as it is with the standard TCP checksum algorithm) to pad a datagram containing an odd number of octets with a zero octet.

Result A should be placed in the TCP header checksum field and Result B should appear in a TCP Alternate Checksum Data option. This option must be present in every TCP header. The two bytes reserved for B should be set to zero during the calculation of the checksum.

The checksum field of the TCP header shall contain the contents of A at the end of the loop. The TCP Alternate Checksum Data option must be present and contain the contents of B at the end of the loop.

BIBLIOGRAPHY:

[BrBoPa89] Braden, R., Borman, D., Partridge, C. "Computing the Internet Checksum". ACM Computer Communication Review, Vol. 19, No. 2, April, 1989, pp. 86-101. [Note that this includes Plummer, W. "IEN-45: TCP Checksum Function Design" (1978) as an appendix.]

[Fletcher82] Fletcher, J. "An Arithmetic Checksum for Serial Transmissions". IEEE Transactions on Communication, Vol. COM-30, No. 1, January, 1982, pp. 247-252.

[Nakassis88] Nakassis, T. "Fletcher's Error Detection Algorithm: How to implement it efficiently and how to avoid the most common pitfalls". ACM Computer Communication Review, Vol. 18, No. 5, October, 1988, pp. 86-94.

[Sklower89] Sklower, K. "Improving the Efficiency of the OSI Checksum Calculation". ACM Computer Communication Review, Vol. 19, No. 5, October, 1989, pp. 32-43.
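The 16-bit variant in Appendix II can be sketched the same way. This is only an illustration under stated assumptions: octets are paired into 16-bit words in network (big-endian) byte order, 1's-complement addition is taken to mean end-around carry, and the function name `fletcher16_words` is hypothetical. The sketch does not build a TCP header or option; it only computes the two results.

```python
def fletcher16_words(octets: bytes) -> tuple[int, int]:
    """Return (A, B) for the 16-bit Fletcher loop over `octets`.

    Assumptions (not stated in the memo): words are formed big-endian,
    and accumulator addition uses end-around carry.
    """
    if len(octets) % 2:
        octets += b"\x00"          # pad odd-length data with a zero octet
    a = 0
    b = 0
    for i in range(0, len(octets), 2):
        word = (octets[i] << 8) | octets[i + 1]
        total = a + word           # A := A + D[i]
        a = (total & 0xFFFF) + (total >> 16)
        total = b + a              # B := B + A
        b = (total & 0xFFFF) + (total >> 16)
    return a, b


if __name__ == "__main__":
    a, b = fletcher16_words(b"\x00\x01\x00\x02\x03")
    # Per the appendix, A would be carried in the TCP checksum field and
    # B in the TCP Alternate Checksum Data option.
    print(hex(a), hex(b))
```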
------- End of Forwarded Message ----- End Included Message ----- From braden Mon Feb 26 13:41:58 1990 Received-Date: Mon, 26 Feb 90 13:41:58 -0800 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Mon, 26 Feb 90 13:41:58 -0800 Date: Mon, 26 Feb 90 13:44:57 PST From: braden Posted-Date: Mon, 26 Feb 90 13:44:57 PST Message-Id: <9002262144.AA02271@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Mon, 26 Feb 90 13:44:57 PST To: van@helios.ee.lbl.gov Subject: RFC-1072 extensions Cc: braden, end2end-interest Van, A man named Joseph Maestas from LANL wants to know where he can get code for RFC-1072. I believe you have implemented at least part of this code in your 4.4 experimental kernel; is that true? What is the status of it? Thanks, Bob From legato!Legato.COM!nowicki@Sun.COM Wed Feb 28 12:44:59 1990 Posted-Date: Wed, 28 Feb 90 12:34:18 PST Received-Date: Wed, 28 Feb 90 12:44:59 -0800 Received: from Sun.COM by venera.isi.edu (5.61/5.61+local) id ; Wed, 28 Feb 90 12:44:59 -0800 Received: from sun.Sun.COM (sun-bb.Corp.Sun.COM) by Sun.COM (4.1/SMI-4.1) id AA07403; Wed, 28 Feb 90 12:46:03 PST Received: from legato.UUCP by sun.Sun.COM (4.1/SMI-4.1) id AA08896; Wed, 28 Feb 90 12:45:48 PST Received: from rose.Legato.COM by Legato.COM (4.0/SMI-4.0) id AA28251 for mishkin@Apollo.COM; Wed, 28 Feb 90 12:34:18 PST Date: Wed, 28 Feb 90 12:34:18 PST From: nowicki@Legato.COM (Bill Nowicki) Message-Id: <9002282034.AA28251@Legato.COM> To: mishkin@Apollo.COM Subject: Re: Data Rep comparisons revisited Cc: end2end@Venera.ISI.EDU I just realized that this message was sent to only me personally, and not the list. I thought that it was a question of Craig, so did not answer it, but will now: From: uunet!apollo.com!mishkin (Nathaniel Mishkin) Date: Sun, 11 Feb 90 12:14:40 EST Subject: Re: Monthly Report for E2E RG To: nowicki@Legato.COM (Bill Nowicki) Could you clarify these comments? 
I read Craig's paper on this stuff a while ago (so I don't remember the details of HIS comments), but I did point out some problems with his comments to him. My experience has been that the protocol (NDR) often gets confused with what today's stubs (from the NIDL compiler) happen to actually do.

For the benefit of those like Nat who were not at the meeting: the idea was to compare only the representations, using hand-optimized code, not just the stubs generated by any compiler. The point that Van made was that tests done on a CISC like the 68020 would give an unfair advantage to ASN.1, since it has byte instructions. On most modern RISC machines, there is a large penalty for accessing non-aligned data objects. Since both XDR and NDR align, they should therefore have a speed advantage over ASN.1 even if ASN.1 might save a few bytes of space.

The other operation that is very expensive on many RISC machines is the branch, since it can flush pipelines. The main difference between NDR and XDR is that NDR needs a conditional branch on each reception to determine if the bytes need to be swapped or not. This penalty is paid in all four cases of sender/receiver byte order, even though the benefit of avoiding a swap comes in only one of the four cases. That is, in the big->big endian case, neither swaps, but NDR does a test, so XDR MUST be faster. In the little->big and big->little cases, there is a single swap in both, but NDR also does a test, so XDR MUST be faster. Only in the little->little endian case does NDR avoid the swaps.

To get the expected value, each of the cases needs to be weighted by its estimated probability and summed. For example, if saving the swapping gains 20%, but only happens 25% of the time, then the expected gain is really only 5%. If the conditional branch costs more than 5%, then XDR is faster on the average. XDR always wins over NDR on space, since you do not need to encode the byte order, as well as on complexity and size of the code.
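The expected-value argument above can be written out as a small calculation. This is only a sketch of the arithmetic in the message; the function names are hypothetical, and the 20% and 25% figures are the illustrative numbers from the text, not measurements.

```python
import math


def expected_gain(gain_when_swap_avoided, probability_of_that_case):
    """Weight the gain in the favorable case by how often that case occurs."""
    return gain_when_swap_avoided * probability_of_that_case


def xdr_faster_on_average(swap_gain, p_no_swap, branch_cost):
    """XDR wins on average if the per-reception branch costs more than the expected gain."""
    return branch_cost > expected_gain(swap_gain, p_no_swap)


# Figures from the message: a 20% gain that applies 25% of the time.
gain = expected_gain(0.20, 0.25)
print(math.isclose(gain, 0.05))
```

With those figures, any branch cost above 5% tips the average in favor of XDR, which is the point being made.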
At any rate, the conclusion was that nobody has ever seen any meaningful numbers.

-- WIN

From mishkin@apollo.com Thu Mar 1 10:59:22 1990
Posted-Date: Thu, 1 Mar 90 09:52:25 EST
Received-Date: Thu, 1 Mar 90 10:59:22 -0800
Message-Id: <9003011859.AA09458@venera.isi.edu>
Received: from amway.ch.apollo.hp.com by venera.isi.edu (5.61/5.61+local) id ; Thu, 1 Mar 90 10:59:22 -0800
Received: from jrst.ch.apollo.hp.com by amway.apollo.com id Thu, 1 Mar 90 09:51:03 EST
Received: by jrst.ch.apollo.hp.com id ; Thu, 1 Mar 90 09:52:26 EST
From: mishkin@apollo.com (chc 02 rd)
Date: Thu, 1 Mar 90 09:52:25 EST
Subject: Re: Data Rep comparisons revisited
To: nowicki@Legato.COM (Bill Nowicki)
Cc: end2end@Venera.ISI.EDU, pjl@apollo.com
In-Reply-To: nowicki@Legato.COM (Bill Nowicki), wed, 28 feb 90 15:34:18

    The other operation that is very expensive on many RISC machines is the branch, since it can flush pipelines. The main difference between NDR and XDR is that NDR needs a conditional branch on each reception to determine if the bytes need to be swapped or not. This penalty is paid in all four cases of sender/receiver byte order, even though the benefit of avoiding a swap comes in only one of the four cases. That is, in the big->big endian case, neither swaps, but NDR does a test, so XDR MUST be faster. In the little->big and big->little cases, there is a single swap in both, but NDR also does a test, so XDR MUST be faster. Only in the little->little endian case does NDR avoid the swaps.

First of all, I'm not sure I buy the argument about the cost of the branch. Given the nature of stubs, my intuition is that there will be plenty of operations that are candidates for being executed in the branch shadow, reducing (eliminating?) the cost of the branch.

Second, the cost of the tests can be (virtually) eliminated if you're willing to pay a code size price.
E.g., the stubs we generate to unmarshall NDR have two code paths:

    if (...the local drep matches the sender's drep in all ways...) {
        copy out the 1st param...
        copy out the 2nd param...
        ...
    } else {
        if (...the local drep of the type of the 1st param matches the sender's drep...)
            copy out the 1st param...
        else
            copy and convert the 1st param...
        if (...the local drep of the type of the 2nd param matches the sender's drep...)
            copy out the 2nd param...
        else
            copy and convert the 2nd param...
        ...
    }

The "then" clause of the "if" contains no tests. BTW, the stub generator has an option that makes it generate only the "else" clause, to make stubs smaller (and sometimes slower).

Third, you mention the cost of encoding the byte order. While I'm pretty sure you understand, I just want to make sure: The encoding of the byte order used by the sender is (a) small (a 4 byte descriptor describes byte order, floating point format, and character representation), and (b) appears just once in the encoded byte stream (at the front of the byte stream that encodes the parameters to a remote call, when NDR is being used for RPC).

Fourth, the complete picture has to count marshalling cost as well as unmarshalling cost. NDR can essentially never be more costly than XDR in marshalling, and it is faster in case the marshalling system's native data representation is (a) not the same as XDR, but (b) is one of the reps allowed by NDR. (VAX and IBM System 370 architectures have such data reps, so I'm not just quibbling here.)

Remember also that when we're talking about saving a byte swap, we're really talking about potentially saving a whole data copy, not just replacing a (presumably slow) swap with a (presumably faster) copy. On the marshalling side, if the local drep matches the wire format, I can potentially ream out the data directly to the network from the application's data space.
(I'm probably too optimistic about when I'll be able to program in an environment where an application can poke a scatter/gather network controller directly, but I can dream!) On the unmarshalling side, under the same conditions, I can potentially pass a reference to a network receive buffer (or the closest equivalent the OS is likely to let me have) to a server routine. -- Nat ------- From Eric.Cooper@N.SP.CS.CMU.EDU Thu Mar 1 12:54:35 1990 Posted-Date: Thu, 1 Mar 90 15:53:06 -0500 (EST) Received-Date: Thu, 1 Mar 90 12:54:35 -0800 Received: from N.SP.CS.CMU.EDU by venera.isi.edu (5.61/5.61+local) id ; Thu, 1 Mar 90 12:54:35 -0800 Received: from Messages.7.10.N.CUILIB.3.45.SNAP.NOT.LINKED.N.SP.CS.CMU.EDU.vax.22 via MS.5.6.N.SP.CS.CMU.EDU.vax_22; Thu, 1 Mar 90 15:53:06 -0500 (EST) Message-Id: Date: Thu, 1 Mar 90 15:53:06 -0500 (EST) From: Eric.Cooper@CS.CMU.EDU To: mishkin@APOLLO.COM Subject: Re: Data Rep comparisons revisited Cc: end2end@VENERA.ISI.EDU, pjl@APOLLO.COM In-Reply-To: <9003011859.AA09458@venera.isi.edu> References: <9003011859.AA09458@venera.isi.edu> ... I can potentially ream out the data directly to the network from the application's data space. (I'm probably too optimistic about when I'll be able to program in an environment where an application can poke a scatter/gather network controller directly, but I can dream!) ... I can potentially pass a reference to a network receive buffer (or the closest equivalent the OS is likely to let me have) to a server routine. You can do both these things in Nectar: applications can marshal/unmarshal directly to and from controller memory, and in the case where either one is a no-op, you can just pass it by reference to the appropriate routine. Professor Eric C. 
Cooper
School of Computer Science
Carnegie Mellon University
Pittsburgh, Pennsylvania 15213-3890
Internet: ecc@cs.cmu.edu
Phone: +1 412 268 3734
FAX: +1 412 681 5739

From craig@NNSC.NSF.NET Fri Mar 2 08:28:11 1990
Posted-Date: Fri, 02 Mar 90 11:29:12 -0500
Received-Date: Fri, 2 Mar 90 08:28:11 -0800
Message-Id: <9003021628.AA11671@venera.isi.edu>
Received: from WS6.NNSC.NSF.NET by venera.isi.edu (5.61/5.61+local) id ; Fri, 2 Mar 90 08:28:11 -0800
To: mishkin@apollo.com
Cc: end2end@venera.isi.edu
Subject: re: Data Rep comparisons revisited
Date: Fri, 02 Mar 90 11:29:12 -0500
From: Craig Partridge

Nat:

I've been reading your exchange with Bill with interest. I want to push back on a point.

> Fourth, the complete picture has to count marshalling cost as well as
> unmarshalling cost. NDR can essentially never be more costly than XDR
> in marshalling, and it is faster in case the marshalling system's native
> data representation is (a) not the same as XDR, but (b) is one of the
> reps allowed by NDR. (VAX and IBM System 370 architectures have such
> data reps, so I'm not just quibbling here.)

Right. I count both.

Consider the following situation for integers. This is crude because it doesn't cover instruction timings, but helps make the point. Most systems I know do a 4-byte-swap in four instructions, so I'll assume 4 for a swap, and I'll assume the swap can be done concurrently with moving data from a network buffer into an integer register or location on the stack. I assume a move has to be done -- in cases I'm aware of, you have to move data in and out of the network packet.

So... here are the four possible combinations of host pairs and instruction counts.
Sender big-endian and Receiver big-endian
    XDR - 1 move to send; 1 move to receive
    NDR - 1 move to send; 1 test, 1 branch, and 1 move to receive

Sender big-endian and Receiver little-endian
    XDR - 1 move to send; 4 to swap and receive
    NDR - 1 move to send; 1 test, 1 branch, and 4 to swap and receive

Sender little-endian and Receiver big-endian
    XDR - 4 to swap and send; 1 move to receive
    NDR - 1 move to send; 1 test, 1 branch, and 4 to swap and receive

Sender little-endian and Receiver little-endian
    XDR - 4 to swap and send; 4 to swap and receive
    NDR - 1 move to send; 1 test, 1 branch, and 1 move to receive

If we assume each case is equally likely, the weighted sum of instructions for XDR is (2 + 5 + 5 + 8)/4 = 5 instructions. For NDR it is (4 + 7 + 7 + 4)/4 = 5.5 instructions. I think smart coding could save two instructions (the branch and move could be done together when types match), but that just makes NDR take the same time as XDR, if all cases are equally likely.

Of course all cases aren't equally likely, but I've heard arguments about what the scenario is. Choosing a scenario favors one data type over another. All I claim is that I'm testing a particular scenario, in which we're exchanging small amounts of data. For the NDR case, I've made various assumptions about how frequently the label shows up (i.e. length of call stream).

The task force actually beat up on me for not thinking carefully enough about what I was measuring (and whether the results could be shown on paper instead of in code), so I'm still trying to refine the work.
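The tally above can be reproduced mechanically. The sketch below simply sums the per-case instruction counts as listed, under the message's own assumptions (a swap costs 4, a move, test or branch costs 1, and all four cases are equally likely). The names are illustrative and not from any stub generator.

```python
from fractions import Fraction

MOVE, SWAP, TEST, BRANCH = 1, 4, 1, 1  # instruction counts assumed in the message


def xdr_cost(sender_big, receiver_big):
    """XDR is big-endian on the wire: each little-endian side pays a swap."""
    send = MOVE if sender_big else SWAP
    recv = MOVE if receiver_big else SWAP
    return send + recv


def ndr_cost(sender_big, receiver_big):
    """NDR sends in native order; the receiver tests, branches, then moves or swaps."""
    send = MOVE
    recv = TEST + BRANCH + (MOVE if sender_big == receiver_big else SWAP)
    return send + recv


def expected(cost):
    """Average a cost function over the four equally weighted byte-order pairs."""
    cases = [(s, r) for s in (True, False) for r in (True, False)]
    return Fraction(sum(cost(s, r) for s, r in cases), len(cases))


print(expected(xdr_cost), expected(ndr_cost))
```

Summing the counts exactly as listed gives 20/4 = 5 for XDR and 22/4 = 5.5 for NDR. Changing the weights in `expected` is the place to model a different mix of hosts.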
Craig From craig@NNSC.NSF.NET Wed Mar 7 11:07:51 1990 Received-Date: Wed, 7 Mar 90 11:07:51 -0800 Received: from vaxa.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Wed, 7 Mar 90 11:07:51 -0800 Posted-Date: Wed, 07 Mar 90 13:16:03 -0500 Message-Id: <9003071813.AA13005@vaxa.isi.edu> Received: from WS6.NNSC.NSF.NET by vaxa.isi.edu (5.61/5.61) id AA13005; Wed, 7 Mar 90 10:13:11 -0800 To: end2end-interest Subject: Wanted: Book Reviewer Date: Wed, 07 Mar 90 13:16:03 -0500 From: Craig Partridge If anyone is interested in reviewing Network Computing Architecture by L. Zahn, T. Dineen, P. Leach, E. Martin, N. Mishkin, J. Pato and G. Wyant. 224pp. [About the Apollo networking environment] for CCR, please let me know. The deal is you get a free copy of the book, but have to write up a review for me by June 1st.
Craig From craig@NNSC.NSF.NET Fri Mar 9 06:41:12 1990 Posted-Date: Fri, 09 Mar 90 09:37:00 -0500 Received-Date: Fri, 9 Mar 90 06:41:12 -0800 Message-Id: <9003091441.AA09534@venera.isi.edu> Received: from WS6.NNSC.NSF.NET by venera.isi.edu (5.61/5.61+local) id ; Fri, 9 Mar 90 06:41:12 -0800 To: end2end Subject: have book reviewer Date: Fri, 09 Mar 90 09:37:00 -0500 From: Craig Partridge Thanks! Craig From braden Mon Mar 19 09:17:11 1990 Received-Date: Mon, 19 Mar 90 09:17:11 -0800 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Mon, 19 Mar 90 09:17:11 -0800 Date: Mon, 19 Mar 90 09:18:02 PST From: braden Posted-Date: Mon, 19 Mar 90 09:18:02 PST Message-Id: <9003191718.AA00702@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Mon, 19 Mar 90 09:18:02 PST To: end2end-interest Subject: Transport protocol performance ... ----- Begin Included Message ----- From owner-ietf@ISI.EDU Fri Mar 16 18:09:12 1990 Date: Fri, 16 Mar 90 15:12:24 CST From: Guy Almes To: craig@nnsc.nsf.net, ietf@ISI.EDU Subject: Re: IBM Workstation faster than Cray? Craig, Anyone who saw the NSFnet demonstration at National Net'90 this week can tell you that the IBM RISC/6000 runs TCP/IP pretty fast. They had one machine in Ann Arbor connected to a second machine in Washington over a 22Mb/s fraction of a T3 serial line. This second machine was in turn connected to a third machine over an FDDI ring. TCP-based applications running from the first machine (over the half- T3 through the second machine as gateway and over the FDDI ring) to the third machine exhibited application-level performance of 8 to 10 Mb/s. Note that the software involved is very new and not tuned. So I don't think Dave Borman's record is in danger yet. On the other hand, the prospects for high-performance wide-area networking look bright. I'd welcome anyone providing more detail to this account or correcting any mistakes. 
-- Guy ----- End Included Message ----- ----- Begin Included Message ----- From owner-ietf@ISI.EDU Sat Mar 17 05:42:40 1990 Date: Fri, 16 Mar 90 21:57:57 EST From: Hans-Werner Braun To: ietf@ISI.EDU Subject: Re: IBM Workstation faster than Cray? Some correction, Guy. No fractional T3 was involved. The link was a clear channel T3 straight into the packet switch. While the RS/6000 machines were tested in the lab before, to my knowledge Net'90 was the first time they were tested on a live circuit. The circuit was provided by MCI, with a microwave shot from the NSFNET Network Operations Center in Ann Arbor to the nearest MCI access point and from there to Washington. There were many parties involved to make this work, and make it work in time. The figure you mention of about 22Mbps was what was measured by IBM, Unix Application to Unix Application, when the demo came up. With regard to file transfer speeds, note that unless you have very large TCP windows you will be in trouble with latencies on eventually cross country circuits running at T3 speeds. Look, unoptimized TCPs don't even run well on T1 links across the country. If I recall right, you could not even get the 448Kbps that we had end-end initially for end-end speeds on the T1 network. The 22Mbps demo had no issues with windows or acknowledgements as it was sending packets at full thrust from what the Unix application could deliver to the T3 hardware. To get an initial indication for a difference, one can easily compare speeds between identical end systems by trying FTP and NFS. Again, this was all just an initial test and demo of a prototype. By the time the NSFNET delivers operational T3 things should become more interesting. To make them really interesting, people should really take a look at the performance of their TCPs so that they are ready in time on their hosts for T3 speeds with cross country fiber latencies. 
-- Hans-Werner ----- End Included Message ----- From braden Tue Mar 20 11:00:59 1990 Received-Date: Tue, 20 Mar 90 11:00:59 -0800 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Tue, 20 Mar 90 11:00:59 -0800 Date: Tue, 20 Mar 90 11:01:48 PST From: braden Posted-Date: Tue, 20 Mar 90 11:01:48 PST Message-Id: <9003201901.AA01159@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Tue, 20 Mar 90 11:01:48 PST To: van@helios.ee.lbl.gov Subject: Re: RFC 1072 Implementations Needed Cc: end2end-interest From tcp-ip-RELAY@NIC.DDN.MIL Mon Mar 19 21:46:04 1990 Date: 19 Mar 90 12:30:00 MST From: "2645 Pierson, Lyndon G." Subject: RFC 1072 Implementations Needed To: "tcp-ip" RFC 1072 represents a significant improvement in the performance of TCP in communication environments with high delay-bandwidth product (geosynchronous satellite links at T1 and above) and moderate error rates. Does anyone know of a full implementation of TCP which incorporates the RFC 1072 window scaling, selective acknowledgements, and rtt estimator? (or even a planned implementation?) L. G. Pierson (505)-845-8212, LGPIERS@SANDIA.GOV ____________________________________________________ Van, Are you planning to make an RFC-1072 implementation available, e.g., in 4.4BSD? This is the 3rd inquiry in a month... 
Bob From craig@NNSC.NSF.NET Tue Mar 20 16:06:16 1990 Posted-Date: Tue, 20 Mar 90 19:00:15 -0500 Received-Date: Tue, 20 Mar 90 16:06:16 -0800 Message-Id: <9003210006.AA27431@venera.isi.edu> Received: from nnsc.nsf.net by venera.isi.edu (5.61/5.61+local) id ; Tue, 20 Mar 90 16:06:16 -0800 To: braden Cc: van@helios.ee.lbl.gov Cc: end2end-interest@venera.isi.edu Subject: re: RFC 1072 Implementations Needed Date: Tue, 20 Mar 90 19:00:15 -0500 From: Craig Partridge Bob: In principle I have no problem with seeing RFC 1072 put in BSD but in practice this may be a problem -- the IETF is in the midst of developing a complete suite of options (some from RFC 1072, some not) for gigabit paths (which are, by definition, big fat pipes). I'd hate to see two variants running around in products... (could care less about multiple variants in research -- the more variety in experimentation the better!!). Craig From van@helios.ee.lbl.gov Fri Mar 23 19:40:17 1990 Posted-Date: Fri, 23 Mar 90 19:40:45 PST Received-Date: Fri, 23 Mar 90 19:40:17 -0800 Received: from vs.ee.lbl.gov by venera.isi.edu (5.61/5.61+local) id ; Fri, 23 Mar 90 19:40:17 -0800 Received: by helios.ee.lbl.gov (5.61/1.39) id AA00505; Fri, 23 Mar 90 19:40:48 -0800 Message-Id: <9003240340.AA00505@helios.ee.lbl.gov> To: Craig Partridge Cc: end2end-interest@venera.isi.edu Subject: Re: RFC 1072 Implementations Needed In-Reply-To: Your message of Tue, 20 Mar 90 19:00:15 EST. Date: Fri, 23 Mar 90 19:40:45 PST From: Van Jacobson Craig, I got deathly sick at the last IETF and while you were holding the TCP options meeting, I was passed out in my hotel room, enjoying the interesting dreams that come with a 105 deg. fever. I'm really sorry I missed the meeting, particularly after reading the summary you sent to the tcp-ip list a few weeks ago. Pieces of the sequence number buried in the urgent pointer?? A new option for urps >2^16?? Perhaps I wasn't the only one in Tallahassee having fever dreams! 
:)

This strange stuff is certainly going to slow down exactly the implementations you want to go fast (the problem you're trying to solve only exists at bandwidths >100Mbps). And, as we discussed before, RFC1072 actually contained a simple, efficient solution to the problem (the timestamp/echo option).

Just to make sure we're in sync here, the problem we're trying to solve is that TCP duplicate detection and sequencing can fail if it is possible to wrap the sequence space in less than the IP-guaranteed maximum packet lifetime (the IP time-to-live). An expression for the constraint that must be satisfied is

    2^31 / B > TTL

where B is the bandwidth of the path (in bytes/sec) and TTL is the IP TTL (which has units of "seconds" in this case). That is, the time needed to send 2^31 bytes must exceed the packet lifetime. To plug in some numbers: the max ttl you could use on a 1Gbps link would be 17 sec.; using the rfc793 ttl of 30, you are completely safe at any bandwidth less than 573Mbps; and, using the max ttl of 255, you are safe at any bandwidth < 67Mbps.

As an aside, note that this constraint has absolutely nothing to do with windows, big or small. Although Alex McKenzie's rfc pointed out a real problem, I think there was some confusion caused by bringing it up in the context of big windows. Because the IP TTL is in seconds, the issue is how fast does the bandwidth let you eat sequence space, not how many round trips it takes to eat the space. Or to put it another way, we're going to be in trouble on FDDI local nets with 16KB windows long, long before we have a problem with 100MB windows on a Gbit transcontinental backbone.

Getting back to the solution, remember that rfc1072 included an "echo" option that was intended to help the sender measure round-trip-time. The problem it solves is that almost all TCP implementations time one packet per window.
If you look at rtt estimation as a signal processing problem (which it is), you have a data signal at some frequency (the packet rate) but you sample it at a lower frequency (the window rate) and drive the estimator off the sampled data. Unfortunately the lower frequency sampling violates Nyquist's criterion and can (does) introduce "aliasing" artifacts in the estimated rtt. (I have (slightly pathological) Arpanet data showing that this frequently resulted in a 70% underestimate of the average rtt.)

A good rtt estimator with a conservative rto calculation can tolerate the aliasing when the sampling frequency is "close" to the data frequency. E.g., with a window of 8 packets, you sample at 1/8 the data frequency -- less than an order of magnitude off. But when the window is tens or hundreds of packets, the rtt estimator is going to get fooled and you'll get spurious retransmits (and, because of the huge "stored energy" in a link that requires a large window, these spurious retransmits are deadly -- they correspond exactly to a feedback control system with a loop gain >1 and it's very difficult to keep such a system stable.)

A solution to the aliasing problem that actually simplifies the sender substantially (in the usual case, the rtt code is the single biggest protocol cost for tcp) is to have the sender put a timestamp in each packet and have the receiver reflect that timestamp back in the ack. Thus a single subtract gives the sender an accurate rtt measurement for every ack (every other packet with a sensible receiver).

Let me stress again that you essentially *must* use the echo option with big windows -- not using it opens the door to some really dangerous instabilities (and you probably want to use the option even with small windows -- it makes the sender faster).

But, since packets now have a timestamp, the receiver should be able to use it to protect against sequence number wraparound.
In particular, the receiver's algorithm is:

1) if an arriving packet is in sequence, record its timestamp and accept it normally.

2) Otherwise, if the packet is outside the window, reject it. (normal tcp processing)

3) Otherwise, if the packet timestamp is less than the timestamp of the most recently received in-sequence packet, treat the packet as if it is outside the window.

4) Otherwise treat the packet as a normal in-window, out-of-sequence tcp packet (e.g., queue it for reassembly).

Discussion
----------

Step (1) says that in the "normal" case of no errors and data arriving in sequence, there is no additional tcp processing from what we do now. There is a possibility that a packet from 2^32 bytes in the past will arrive at exactly the wrong time and it will mistakenly be accepted. The probability of this happening is *at most* the mss divided by the size of the sequence space, e.g., 2^12 / 2^32 for an FDDI link. Recalling the discussion we had during one of the host requirements meetings, the 16 bit tcp checksum gives you a basic unreliability of one part in 2^16. If the reliability of other protocol mechanisms is "good" compared to the checksum (e.g., at least an order of magnitude more reliable) they are "good enough". I.e., they don't contribute significantly to the overall reliability. Since the reliability of the sequence check must always be better than the checksum check (since the mss must be < 2^16) and, under any reasonable model of packet lifetime, will be many orders of magnitude more reliable than the checksum, I think step (1) is justified.

Step (3) places some requirements on the clock. First, it requires that the clock not be "too slow": that it tick at least once for each 2^31 bytes sent. This is probably not a serious problem. Even with rfc1072 extension, 2^31 bytes must be at least two windows and, to be useful to the sender for round trip timing, the clock should tick at least once per window's worth of data.
Or, to make this more quantitative, any clock faster than 1 tick/sec will work up to link speeds of ~2 Gbps. A 1ms clock will work up to link speeds of 2 Tbps. Second, step (3) requires that the clock not be "too fast": that it doesn't wrap within the IP packet lifetime guarantee. Since the clock is 32 bits and the worst-case packet lifetime is 255 seconds, the maximum acceptable clock frequency is one tick every 59 ns. Since the sender is using the clock for rtt calculations, it doesn't need to have much more resolution than the granularity of the retransmit timer. E.g., tens or hundreds of milliseconds. I really can't see a system wanting to schedule a retransmit accurate to a few nanoseconds. Since we're in the neighborhood, I'm reminded of the discussion we had at host requirements about whether IP TTL could be treated as a strict hopcount. Note that step (3) really relaxes TCP's requirements on the network layer. E.g., with a 10ms clock (the fastest I think it should run) it takes 249 days to wrap the sign bit. So as long as the IP layer guarantees that packets self-destruct within 6 months, TCP will work correctly. Six months at T1 speeds requires half a terabyte of buffer in the path or at least 2 GB of buffer per hop for the maximum of 255 hops. Since we aren't constructing networks with quite this much buffer capacity, IP could be free to interpret ttl as a hopcount if all we ran was time-stamped tcp. - Van ps- It's unrelated to the above, but I didn't understand the desire for a new urgent pointer at all. In any use I can envision, the urgent pointer is sent or received at the right edge of the window. Since it's relative to the packet sequence number, that means it must be large enough to address any point in a packet, i.e., 16 bits since the max packet length is 16 bits. Why does expanding the window require expanding the urgent pointer? I see the two as entirely unrelated. 
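The four receiver steps in the message above can be sketched as follows. This is an illustration only, not code from any TCP implementation. The class and function names are hypothetical, a segment is judged by its starting sequence number alone, and "less than" for timestamps is assumed to mean a 32-bit sign-bit comparison, in line with the remark about wrapping the sign bit.

```python
MASK32 = 0xFFFFFFFF


def ts_before(a, b):
    """True if 32-bit timestamp a is earlier than b in modular (sign-bit) order."""
    return ((a - b) & MASK32) >= 0x80000000


class Receiver:
    """Hypothetical receiver state: next expected sequence number and window size."""

    def __init__(self, rcv_nxt, window):
        self.rcv_nxt = rcv_nxt
        self.window = window
        self.last_ts = None  # timestamp of the most recent in-sequence packet

    def classify(self, seq, ts, length):
        # Step 1: in-sequence packet, record its timestamp and accept it.
        if seq == self.rcv_nxt:
            self.last_ts = ts
            self.rcv_nxt = (self.rcv_nxt + length) & MASK32
            return "accept"
        # Step 2: outside the window, reject (normal tcp processing).
        if ((seq - self.rcv_nxt) & MASK32) >= self.window:
            return "reject"
        # Step 3: in window but older than the last in-sequence timestamp,
        # so treat it as if it were outside the window.
        if self.last_ts is not None and ts_before(ts, self.last_ts):
            return "reject"
        # Step 4: ordinary in-window, out-of-sequence packet.
        return "queue"
```

A real receiver would also trim segments that straddle the window edges; that is left out here to keep the four steps visible.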
From braden Tue Mar 27 10:27:52 1990
Received-Date: Tue, 27 Mar 90 10:27:52 -0800
Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Tue, 27 Mar 90 10:27:52 -0800
Date: Tue, 27 Mar 90 10:28:13 PST
From: braden
Posted-Date: Tue, 27 Mar 90 10:28:13 PST
Message-Id: <9003271828.AA04258@braden.isi.edu>
Received: by braden.isi.edu (4.1/4.0.3-4) id ; Tue, 27 Mar 90 10:28:13 PST
To: end2end-interest

----- Begin Included Message -----

From tcp-ip-RELAY@NIC.DDN.MIL Tue Mar 27 09:59:34 1990
Date: Tue, 27 Mar 90 10:21:46 EST
From: fab%saturn.ACC.COM@salt.acc.com (Fred Bohle acc_gnsc)
To: lgpiers@sandia.gov, tcp-ip@nic.ddn.mil
Subject: Re: RFC 1072 Implementations Needed

>Message-Id: <9003200542.AA00921@saturn.acc.com>
>Date: 19 Mar 90 12:30:00 MST
>From: "2645 Pierson, Lyndon G."
>Subject: RFC 1072 Implementations Needed
>To: "tcp-ip"
>
>RFC 1072 represents a significant improvement in the performance
>of TCP in communication environments with high delay-bandwidth product
>(geosynchronous satellite links at T1 and above) and moderate error rates.
>
>Does anyone know of a full implementation of TCP which incorporates
>the RFC 1072 window scaling, selective acknowledgements, and rtt
>estimator? (or even a planned implementation?)
>
>L. G. Pierson (505)-845-8212, LGPIERS@SANDIA.GOV
>
>

I am currently working on Window Scaling and Van Jacobson's rtt estimator. Do you also have a need for Selective Acknowledgements? Please contact me directly for more information. These enhancements will shortly be part of ACC'S ACCES/MVS R3.0, no, no, I mean Interlink's SNS/TCPaccess R1.0.
Fred
------------------------------------------------------------------------
Fred Bohle EMAIL: fab@saturn.acc.com
ACC (Interlink) AT&T : 301-290-8100
10220 Old Columbia Road
Columbia, MD 21046
------------------------------------------------------------------------

----- End Included Message -----

From craig@NNSC.NSF.NET Thu Mar 29 07:29:41 1990
Posted-Date: Thu, 29 Mar 90 10:25:22 -0500
Received-Date: Thu, 29 Mar 90 07:29:41 -0800
Message-Id: <9003291529.AA24826@venera.isi.edu>
Received: from nnsc.nsf.net by venera.isi.edu (5.61/5.61+local) id ; Thu, 29 Mar 90 07:29:41 -0800
Received: by NNSC.NSF.NET id aa08401; 29 Mar 90 10:25 EST
To: end2end-interest
Subject: paper on diagnostic system
Date: Thu, 29 Mar 90 10:25:22 -0500
From: Craig Partridge

Hi folks:

Some time ago (about a year or so) I mentioned an interesting AI system I saw at NTT which diagnosed network problems on LANs. The author (Toshiharu Sugawara) gave me a copy of his paper on the system today -- if you'd like a copy, drop me a note with your preferred snail mail address.

Craig

From braden Thu Mar 29 09:26:25 1990
Received-Date: Thu, 29 Mar 90 09:26:25 -0800
Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Thu, 29 Mar 90 09:26:25 -0800
Date: Thu, 29 Mar 90 09:26:41 PST
From: braden
Posted-Date: Thu, 29 Mar 90 09:26:41 PST
Message-Id: <9003291726.AA05065@braden.isi.edu>
Received: by braden.isi.edu (4.1/4.0.3-4) id ; Thu, 29 Mar 90 09:26:41 PST
To: craig@nnsc.nsf.net, end2end-interest
Subject: Re: paper on diagnostic system

Mr. Sugawara gave a talk on his paper at ISI last week, but unfortunately due to the language barrier I was not able to get a very good idea of exactly what he has done. It is an interesting piece of work, however.
Bob Braden From braden Fri Apr 6 15:17:13 1990 Received-Date: Fri, 6 Apr 90 15:17:13 -0700 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Fri, 6 Apr 90 15:17:13 -0700 Date: Fri, 6 Apr 90 15:17:34 PDT From: braden Posted-Date: Fri, 6 Apr 90 15:17:34 PDT Message-Id: <9004062217.AA07759@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Fri, 6 Apr 90 15:17:34 PDT To: end2end-interest Subject: Lost in the Protocol Forest This guy was asking about RDP, so out of curiosity I asked him why he thought he needed RDP, and suggested he look into VMTP. Here is his reply... I think this guy needs help! I am surprised that he has not been XTP'd. Bob ----- Begin Included Message ----- From psm%helios.nosc.mil@nosc.mil Fri Apr 6 15:10:47 1990 Date: Fri, 6 Apr 90 15:08:08 PDT From: psm%helios.nosc.mil@nosc.mil To: braden@ISI.EDU Subject: Re:: RDP Implementations: Summary Hi Bob. We're designing a Navy communications system where the components at a given site (probably all hanging off one LAN) need to talk to one another, giving us very fast delivery of packets. We initially attempted to use TCP because it provides reliable, sequenced, non-duplicative delivery, but it had some flaws that we couldn't live with. One problem was if the other end was down, the attempt to establish communication with it didn't return to the caller for 45 seconds. Likewise, the protocol didn't return to the caller for 3.5 minutes if the connection broke during a session. Since we don't have access to the parameters that govern these and other defaults, we're looking into alternatives. RDP looked like one possibility. I haven't heard of VMTP, so I'd be interested in learning more about it, if it's applicable. Thanks for the interest. Scot McIntosh. 
----- End Included Message ----- From van@helios.ee.lbl.gov Mon Apr 30 01:39:43 1990 Posted-Date: Mon, 30 Apr 90 01:40:59 PDT Received-Date: Mon, 30 Apr 90 01:39:43 -0700 Received: from vs.ee.lbl.gov by venera.isi.edu (5.61/5.61+local) id ; Mon, 30 Apr 90 01:39:43 -0700 Received: by helios.ee.lbl.gov (5.61/1.39) id AA26743; Mon, 30 Apr 90 01:40:59 -0700 Message-Id: <9004300840.AA26743@helios.ee.lbl.gov> To: end2end-interest@venera.isi.edu Subject: modified TCP congestion avoidance algorithm Date: Mon, 30 Apr 90 01:40:59 PDT From: Van Jacobson This is a description of the modified TCP congestion avoidance algorithm that I promised at the teleconference. BTW, on re-reading, I noticed there were several errors in Lixia's note besides the problem I noted at the teleconference. I don't know whether that's because I mis-communicated the algorithm at dinner (as I recall, I'd had some wine) or because she's convinced that TCP is ultimately irrelevant :). Either way, you will probably be disappointed if you experiment with what's in that note. First, I should point out once again that there are two completely independent window adjustment algorithms running in the sender: Slow-start is run when the pipe is empty (i.e., when first starting or re-starting after a timeout). Its goal is to get the "ack clock" started so packets will be metered into the network at a reasonable rate. The other algorithm, congestion avoidance, is run any time *but* when (re-)starting and is responsible for estimating the (dynamically varying) pipesize. You will cause yourself, or me, no end of confusion if you lump these separate algorithms (as Lixia's message did). The modifications described here are only to the congestion avoidance algorithm, not to slow-start, and they are intended to apply to large bandwidth-delay product paths (though they don't do any harm on other paths). 
Remember that with regular TCP (or with slow-start/c-a TCP), throughput really starts to go to hell when the probability of packet loss is on the order of the reciprocal of the bandwidth-delay product (measured in packets). E.g., you might expect a 1% packet loss rate to translate into a 1% lower throughput but for, say, a TCP connection with a 100 packet b-d p. (= window), it results in a 50-75% throughput loss. To make TCP effective on fat pipes, it would be nice if throughput degraded only as a function of loss probability rather than as the product of the loss probability and the b-d p. (Assuming, of course, that we can do this without sacrificing congestion avoidance.)

These mods do two things: (1) prevent the pipe from going empty after a loss (if the pipe doesn't go empty, you won't have to waste round-trip times re-filling it) and (2) correctly account for the amount of data actually in the pipe (since that's what congestion avoidance is supposed to be estimating and adapting to).

For (1), remember that we use a packet loss as a signal that the pipe is overfull (congested) and that packet loss can be detected in one of two different ways: (a) via a retransmit timeout or (b) when some small number (3-4) of consecutive duplicate acks has been received (the "fast retransmit" algorithm). In case (a), the pipe is guaranteed to be empty so we must slow-start. In case (b), if the duplicate ack threshold is small compared to the bandwidth-delay product, we will detect the loss with the pipe almost full. I.e., given a threshold of 3 packets and an LBL-MIT bandwidth-delay of around 24KB or 16 packets (assuming 1500 byte MTUs), the pipe is 75% full when fast-retransmit detects a loss (actually, until gateways start doing some sort of congestion control, the pipe is overfull when the loss is detected so *at least* 75% of the packets needed for ack clocking are in transit when fast-retransmit happens). Since the pipe is full, there's no need to slow-start after a fast-retransmit.
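As a rough check of the 16-packet and 75% figures above, here is a minimal C sketch. The function names are hypothetical, and the accounting is an assumption: it treats the lost packet plus the packets that triggered the duplicate acks as having left the pipe, which is one reading of the numbers in the text, not something the message spells out.

```c
#include <assert.h>

/* Packets that fit in the pipe: bandwidth-delay product divided by the
 * packet size, rounded down. */
unsigned pipe_packets(unsigned bdp_bytes, unsigned mtu_bytes)
{
    assert(mtu_bytes > 0);
    return bdp_bytes / mtu_bytes;
}

/* Percentage of the pipe still in transit when the dup-ack threshold is
 * reached.  Assumption (not stated in the message): the lost packet and
 * the `thresh` packets whose arrival produced the duplicate acks have all
 * left the pipe. */
unsigned pipe_fill_percent(unsigned pipe_pkts, unsigned thresh)
{
    unsigned gone = thresh + 1;

    assert(pipe_pkts > 0);
    if (gone > pipe_pkts)
        gone = pipe_pkts;
    return 100 * (pipe_pkts - gone) / pipe_pkts;
}
```

With a 24KB path and 1500 byte packets, pipe_packets() gives 16, and pipe_fill_percent(16, 3) gives 75.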
For (2), consider what a duplicate ack means: either the network duplicated a packet (i.e., the NSFNet braindead IBM token ring adapters) or the receiver got an out-of-order packet. The usual cause of out-of-order packets at the receiver is a missing packet. I.e., if there are W packets in transit and one is dropped, the receiver will get W-1 out-of-order and (4.3-tahoe TCP will) generate W-1 duplicate acks. If the `consecutive duplicates' threshold is set high enough, we can reasonably assume that duplicate acks mean dropped packets.

But there's more information in the ack: The receiver can only generate one in response to a packet arrival. I.e., a duplicate ack means that a packet has left the network (it is now cached at the receiver). If the sender is limited by the congestion window, a packet can now be sent. (The congestion window is a count of how many packets will fit in the pipe. The ack says a packet has left the pipe so a new one can be added to take its place.) To put this another way, say the current congestion window is C (i.e., C packets will fit in the pipe) and D duplicate acks have been received. Then only C-D packets are actually in the pipe and the sender wants to use a window of C+D packets to fill the pipe to its estimated capacity (C+D sent - D received = C in pipe).

So, conceptually, the slow-start/cong.avoid/fast-rexmit changes are:

 - The sender's input routine is changed to set `cwnd' to `ssthresh' when the dup ack threshold is reached. [It used to set cwnd to mss to force a slow-start.] Everything else stays the same.

 - The sender's output routine is changed to use an effective window of min(snd_wnd, cwnd + dupacks*mss) [the change is the addition of the `dupacks*mss' term.] `Dupacks' is zero until the rexmit threshold is reached and zero except when receiving a sequence of duplicate acks.
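The two conceptual changes above can be sketched in a few lines of C. This is only an illustration of the min(snd_wnd, cwnd + dupacks*mss) rule and the C+D accounting; the function names are hypothetical and are not taken from the BSD sources.

```c
#include <assert.h>
#include <stdint.h>

/* Effective send window in bytes: min(snd_wnd, cwnd + dupacks*mss).
 * The caller is expected to pass dupacks == 0 until the retransmit
 * threshold has been reached, as described in the text. */
uint32_t effective_window(uint32_t snd_wnd, uint32_t cwnd,
                          uint32_t dupacks, uint32_t mss)
{
    uint32_t inflated = cwnd + dupacks * mss;

    return snd_wnd < inflated ? snd_wnd : inflated;
}

/* Packets actually in the pipe when (c + d) have been sent and d
 * duplicate acks have come back: (C+D sent) - (D received) = C. */
uint32_t packets_in_pipe(uint32_t c, uint32_t d)
{
    uint32_t sent = c + d;

    assert(sent >= d);  /* guards against wrap-around of c + d */
    return sent - d;
}
```

For example, with cwnd of 8 segments of 1460 bytes and 3 duplicate acks, the effective window is 11 segments (16060 bytes) unless the receiver's offered window is smaller.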
The actual implementation is slightly different than the above because I wanted to avoid the multiply in the output routine (multiplies are expensive on some risc machines). A diff of the old and new fastrexmit code is attached (your line numbers will vary).

Note that we still do congestion avoidance (i.e., the window is reduced by 50% when we detect the packet loss). But, as long as the receiver's offered window is large enough (at most twice the bandwidth-delay product is needed), we continue sending packets (at exactly half the rate we were sending before the loss) even after the loss is detected so the pipe stays full at exactly the level we want and a slow-start isn't necessary.

Some algebra might make this last point clear: Say U is the sequence number of the first un-acked packet and we are using a window size of W when packet U is dropped. Packets [U..U+W) are in transit. When the loss is detected, we send packet U and pull the window back to W/2. But in the round-trip time it takes the U retransmit to fill the receiver's hole and an ack to get back, W-1 dup acks will arrive (one for each packet in transit). The window is effectively inflated by one packet for each of these acks so packets [U..U+W/2+W-1) are sent. But we don't re-send packets unless we know they've been lost so the amount actually sent between the loss detection and the recovery ack is (U+W/2+W-1) - (U+W) = W/2-1 which is exactly the amount congestion avoidance allows us to send (if we add in the rexmit of U). The recovery ack is for packet U+W so when the effective window is pulled back from W/2+W-1 to W/2 (which happens because the recovery ack is `new' and sets dupacks to zero), we are allowed to send up to packet U+W+W/2 which is exactly the first packet we haven't yet sent. (I.e., there is no sudden burst of packets as the `hole' is filled.)
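The algebra above can be checked with a small packet-counting model in C. This is a simplified sketch, not the kernel code: it counts in whole packets relative to U, assumes every one of the W-1 duplicate acks inflates the window by one packet (as the algebra does), and uses hypothetical function names.

```c
#include <assert.h>

/* New (never before sent) packets released between loss detection and the
 * recovery ack.  Sequence numbers are relative to U, so packets [0..w)
 * are already in transit and the window is pulled back to w/2. */
unsigned new_packets_during_recovery(unsigned w)
{
    unsigned next_unsent = w;   /* first packet not yet sent */
    unsigned sent = 0;
    unsigned d;

    assert(w >= 2);
    for (d = 1; d < w; d++) {           /* the W-1 duplicate acks */
        unsigned edge = w / 2 + d;      /* right edge of inflated window */

        while (next_unsent < edge) {    /* send only unsent packets */
            next_unsent++;
            sent++;
        }
    }
    return sent;
}

/* Duplicate acks that release nothing because the inflated window still
 * covers only packets that were already sent (w/2 + d <= w). */
unsigned idle_dup_acks(unsigned w)
{
    assert(w >= 2);
    return w - w / 2;
}
```

For W = 16 the model sends 7 new packets (W/2-1) and stays idle for the first 8 duplicate acks, which matches the W/2-1 count derived in the text.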
Also, when sending packets between the loss detection and the recovery ack, we do nothing for the first W/2 dup acks (because they only allow us to send packets we've already sent) and the bottleneck gateway is given W/2 packet times to clean out its backlog. Thus when we start sending our W/2-1 new packets, the bottleneck queue is as empty as it can be. [I don't know if you can get the flavor of what happens from this description -- it's hard to see without a picture. But I was delighted by how beautifully it worked -- it was like watching the innards of an engine when all the separate motions of crank, pistons and valves suddenly fit together and everything appears in exactly the right place at just the right time.] Also note that this algorithm interoperates with old tcp's: Most pre-tahoe tcp's don't generate the dup acks on out-of-order packets. If we don't get the dup acks, fast retransmit never fires and the window is never inflated so everything happens in the old way (via timeouts). Everything works just as it did without the new algorithm (and just as slow). If you want to simulate this, the intended environment is: - large bandwidth-delay product (say 20 or more packets) - receiver advertising window of two b-d p (or, equivalently, advertised window of the unloaded b-d p but two or more connections simultaneously sharing the path). - average loss rate (from congestion or other source) less than one lost packet per round-trip-time per active connection. (The algorithm works at higher loss rate but the TCP selective ack option has to be implemented otherwise the pipe will go empty waiting to fill the second hole and throughput will once again degrade at the product of the loss rate and b-d p. With selective ack, throughput is insensitive to b-d p at any loss rate.) 
And, of course, we should always remember that good engineering practise suggests a b-d p worth of buffer at each bottleneck -- less buffer and your simulation will exhibit the interesting pathologies of a poorly engineered network but will probably tell you little about the workings of the algorithm (unless the algorithm misbehaves badly under these conditions but my simulations and measurements say that it doesn't). In these days of $100/megabyte memory, I dearly hope that this particular example of bad engineering is of historical interest only.

 - Van

-----------------
*** /tmp/,RCSt1a26717 Mon Apr 30 01:35:17 1990
--- tcp_input.c Mon Apr 30 01:33:30 1990
***************
*** 834,850 ****
           * Kludge snd_nxt & the congestion
           * window so we send only this one
!          * packet. If this packet fills the
!          * only hole in the receiver's seq.
!          * space, the next real ack will fully
!          * open our window. This means we
!          * have to do the usual slow-start to
!          * not overwhelm an intermediate gateway
!          * with a burst of packets. Leave
!          * here with the congestion window set
!          * to allow 2 packets on the next real
!          * ack and the exp-to-linear thresh
!          * set for half the current window
!          * size (since we know we're losing at
!          * the current window size).
           */
          if (tp->t_timer[TCPT_REXMT] == 0 ||
--- 834,850 ----
           * Kludge snd_nxt & the congestion
           * window so we send only this one
!          * packet.
!          *
!          * We know we're losing at the current
!          * window size so do congestion avoidance
!          * (set ssthresh to half the current window
!          * and pull our congestion window back to
!          * the new ssthresh).
!          *
!          * Dup acks mean that packets have left the
!          * network (they're now cached at the receiver)
!          * so bump cwnd by the amount in the receiver
!          * to keep a constant cwnd packets in the
!          * network.
           */
          if (tp->t_timer[TCPT_REXMT] == 0 ||
***************
*** 853,864 ****
      else if (++tp->t_dupacks == tcprexmtthresh) {
          tcp_seq onxt = tp->snd_nxt;
!         u_int win =
!             MIN(tp->snd_wnd, tp->snd_cwnd) / 2 /
!                 tp->t_maxseg;

          if (win < 2)
              win = 2;
          tp->snd_ssthresh = win * tp->t_maxseg;
-
          tp->t_timer[TCPT_REXMT] = 0;
          tp->t_rtt = 0;
--- 853,864 ----
      else if (++tp->t_dupacks == tcprexmtthresh) {
          tcp_seq onxt = tp->snd_nxt;
!         u_int win = MIN(tp->snd_wnd,
!                 tp->snd_cwnd);

+         win /= tp->t_maxseg;
+         win >>= 1;
          if (win < 2)
              win = 2;
          tp->snd_ssthresh = win * tp->t_maxseg;
          tp->t_timer[TCPT_REXMT] = 0;
          tp->t_rtt = 0;
***************
*** 866,873 ****
          tp->snd_cwnd = tp->t_maxseg;
          (void) tcp_output(tp);
!
          if (SEQ_GT(onxt, tp->snd_nxt))
              tp->snd_nxt = onxt;
          goto drop;
      }
  } else
--- 866,879 ----
          tp->snd_cwnd = tp->t_maxseg;
          (void) tcp_output(tp);
!         tp->snd_cwnd = tp->snd_ssthresh +
!             tp->t_maxseg *
!             tp->t_dupacks;
          if (SEQ_GT(onxt, tp->snd_nxt))
              tp->snd_nxt = onxt;
          goto drop;
+     } else if (tp->t_dupacks > tcprexmtthresh) {
+         tp->snd_cwnd += tp->t_maxseg;
+         (void) tcp_output(tp);
+         goto drop;
      }
  } else
***************
*** 874,877 ****
--- 880,890 ----
      tp->t_dupacks = 0;
      break;
+ }
+ if (tp->t_dupacks) {
+     /*
+      * the congestion window was inflated to account for
+      * the other side's cached packets - retract it.
+      */
+     tp->snd_cwnd = tp->snd_ssthresh;
  }
  tp->t_dupacks = 0;
*** /tmp/,RCSt1a26725 Mon Apr 30 01:35:23 1990
--- tcp_timer.c Mon Apr 30 00:36:29 1990
***************
*** 223,226 ****
--- 223,227 ----
      tp->snd_cwnd = tp->t_maxseg;
      tp->snd_ssthresh = win * tp->t_maxseg;
+     tp->t_dupacks = 0;
  }
  (void) tcp_output(tp);

From legato!Legato.COM!nowicki@Sun.COM Mon Apr 30 09:42:23 1990
Posted-Date: Mon, 30 Apr 90 09:35:06 PDT
Received-Date: Mon, 30 Apr 90 09:42:23 -0700
Received: from Sun.COM by venera.isi.edu (5.61/5.61+local) id ; Mon, 30 Apr 90 09:42:23 -0700
Received: from sun.Sun.COM (sun-bb.Corp.Sun.COM) by Sun.COM (4.1/SMI-4.1) id AA13769; Mon, 30 Apr 90 09:41:49 PDT
Received: from legato.UUCP by sun.Sun.COM (4.1/SMI-4.1) id AA19614; Mon, 30 Apr 90 09:41:45 PDT
Received: from rose.Legato.COM by Legato.COM (4.0/SMI-4.0) id AA02967 for van@helios.ee.lbl.gov; Mon, 30 Apr 90 09:35:06 PDT
Date: Mon, 30 Apr 90 09:35:06 PDT
From: nowicki@Legato.COM (Bill Nowicki)
Message-Id: <9004301635.AA02967@Legato.COM>
To: van@helios.ee.lbl.gov
Subject: Re: TCP meters
Cc: end2end@venera.isi.edu

As for SunOS, for quite some time (SunOS 2.0, as I recall) there has been kmem_alloc and kmem_free for allocating and freeing arbitrary chunks of kernel memory. The PCB allocation and deallocation should be localized enough that just changing the m_get and mfree to #ifdef'ed calls to kmem_alloc should not be any major roadblock -- just another portability gotcha.
-- WIN

From van@helios.ee.lbl.gov Mon Apr 30 10:35:01 1990
Posted-Date: Mon, 30 Apr 90 10:36:12 PDT
Received-Date: Mon, 30 Apr 90 10:35:01 -0700
Received: from vs.ee.lbl.gov by venera.isi.edu (5.61/5.61+local) id ; Mon, 30 Apr 90 10:35:01 -0700
Received: by helios.ee.lbl.gov (5.61/1.39) id AA27020; Mon, 30 Apr 90 10:36:14 -0700
Message-Id: <9004301736.AA27020@helios.ee.lbl.gov>
To: end2end-interest@venera.isi.edu
Subject: modified TCP congestion avoidance algorithm (correction)
Date: Mon, 30 Apr 90 10:36:12 PDT
From: Van Jacobson

I shouldn't make last minute 'fixes'. The code I sent out last night had a small error:

*** t.c Mon Apr 30 10:28:52 1990
--- tcp_input.c Mon Apr 30 10:30:41 1990
***************
*** 885,893 ****
       * the congestion window was inflated to account for
       * the other side's cached packets - retract it.
       */
!     tp->snd_cwnd = tp->snd_ssthresh;
  }
- tp->t_dupacks = 0;
  if (SEQ_GT(ti->ti_ack, tp->snd_max)) {
      tcpstat.tcps_rcvacktoomuch++;
      goto dropafterack;
--- 885,894 ----
       * the congestion window was inflated to account for
       * the other side's cached packets - retract it.
       */
!     if (tp->snd_cwnd > tp->snd_ssthresh)
!         tp->snd_cwnd = tp->snd_ssthresh;
!     tp->t_dupacks = 0;
  }
  if (SEQ_GT(ti->ti_ack, tp->snd_max)) {
      tcpstat.tcps_rcvacktoomuch++;
      goto dropafterack;

From braden Tue May 1 16:26:44 1990
Received-Date: Tue, 1 May 90 16:26:44 -0700
Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Tue, 1 May 90 16:26:44 -0700
Date: Tue, 1 May 90 16:27:20 PDT
From: braden
Posted-Date: Tue, 1 May 90 16:27:20 PDT
Message-Id: <9005012327.AA02617@braden.isi.edu>
Received: by braden.isi.edu (4.1/4.0.3-4) id ; Tue, 1 May 90 16:27:20 PDT
To: deering@pescadero.stanford.edu
Subject: Multicast Routing
Cc: end2end-interest

Steve,

As you know, there is a low-key war going on these days to determine which routing protocol, OSPF or extended IS-IS, will become the standard open intra-AS routing protocol for the Internet. The war will probably go on for awhile...
I am concerned about multicast routing. You have been working for some time (a year?) with the OSPF folks to put multicast routing into OSPF. What is the status of that, by the way? Have you given any thought to the difficulty of putting multicast routing into extended IS-IS? I am worried that the following sequence of events is possible: (1) ANSI standardizes the Internet extensions for IS-IS -- Ross is chugging away at that -- without multicast routing, and subsequently (2) the IETF/IAB decides to adopt IS-IS rather than OSPF. This sequence could make widespread deployment of intra-AS multicast routing within a reasonable time frame much more difficult. Comments? Bob From cheriton@Pescadero.Stanford.EDU Tue May 1 22:32:35 1990 Posted-Date: Tue, 1 May 90 22:32:19 PDT Received-Date: Tue, 1 May 90 22:32:35 -0700 Received: from Pescadero.Stanford.EDU by venera.isi.edu (5.61/5.61+local) id ; Tue, 1 May 90 22:32:35 -0700 Received: by Pescadero.Stanford.EDU (5.59/25-eef) id AA17646; Tue, 1 May 90 22:32:19 PDT Date: Tue, 1 May 90 22:32:19 PDT From: David Cheriton Message-Id: <9005020532.AA17646@Pescadero.Stanford.EDU> To: end2end-interest@venera.isi.edu Subject: Timeliness is Next to Godliness This is a solicitation and reminder that Dave Mills and I are putting together document on summary of possible benefits of synchronized time as a ubiquitous Internet service along with costs, disadvantages and generally anything nasty that can be said. We have agreed to split it roughly with Mills handling the service provision issues and Cheriton handling the service utilization. Please send any contribution you have to one or both of us. Time is of the essence, of course. David C. 
From J.Crowcroft@Cs.Ucl.AC.UK Wed May 2 07:16:43 1990 Posted-Date: Wed, 02 May 90 15:09:58 +0100 Received-Date: Wed, 2 May 90 07:16:43 -0700 Message-Id: <9005021416.AA07654@venera.isi.edu> Received: from cs.ucl.ac.uk by venera.isi.edu (5.61/5.61+local) id ; Wed, 2 May 90 07:16:43 -0700 To: David Cheriton Cc: end2end-interest@venera.isi.edu Subject: Re: Timeliness is Next to Godliness In-Reply-To: Your message of Tue, 01 May 90 22:32:19 -0700. <9005020532.AA17646@Pescadero.Stanford.EDU> Date: Wed, 02 May 90 15:09:58 +0100 From: Jon Crowcroft Source-Info: perky.cs.ucl.ac.uk >This is a solicitation and reminder that Dave Mills and I are putting >together document on summary of possible benefits of synchronized >time as a ubiquitous Internet service along with costs, disadvantages >and generally anything nasty that can be said. > We have agreed to split it roughly with Mills handling the service >provision issues and Cheriton handling the service utilization. David, we have a UK/ISO version of the NTP service, with a very draft RFC awaiting a bit of editing - Dave Mills had some comments which we havnt got round to incorporating yet - we'll send it soon... jon From braden Wed May 2 10:35:23 1990 Received-Date: Wed, 2 May 90 10:35:23 -0700 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Wed, 2 May 90 10:35:23 -0700 Date: Wed, 2 May 90 10:35:57 PDT From: braden Posted-Date: Wed, 2 May 90 10:35:57 PDT Message-Id: <9005021735.AA03121@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Wed, 2 May 90 10:35:57 PDT To: J.Crowcroft@cs.ucl.ac.uk, cheriton@pescadero.stanford.edu Subject: Re: Timeliness is Next to Godliness Cc: end2end-interest From J.Crowcroft@Cs.Ucl.AC.UK Wed May 2 07:19:23 1990 To: David Cheriton Cc: end2end-interest@ISI.EDU Subject: Re: Timeliness is Next to Godliness In-Reply-To: Your message of Tue, 01 May 90 22:32:19 -0700. 
<9005020532.AA17646@Pescadero.Stanford.EDU> Date: Wed, 02 May 90 15:09:58 +0100 From: Jon Crowcroft Source-Info: perky.cs.ucl.ac.uk >This is a solicitation and reminder that Dave Mills and I are putting >together document on summary of possible benefits of synchronized >time as a ubiquitous Internet service along with costs, disadvantages >and generally anything nasty that can be said. > We have agreed to split it roughly with Mills handling the service >provision issues and Cheriton handling the service utilization. David, we have a UK/ISO version of the NTP service, with a very draft RFC awaiting a bit of editing - Dave Mills had some comments which we havnt got round to incorporating yet - we'll send it soon... jon Jon, A UK version? I can't wait to learn... does it send its bits backwards, or does it establish Royal Time, or ??? :-) Bob From Mills@udel.edu Wed May 2 19:33:21 1990 Posted-Date: Thu, 3 May 90 1:11:39 GMT Received-Date: Wed, 2 May 90 19:33:21 -0700 Received: from louie.udel.edu by venera.isi.edu (5.61/5.61+local) id ; Wed, 2 May 90 19:33:21 -0700 Received: from huey.udel.edu by louie.udel.edu id ae00249; 2 May 90 22:19 EDT Date: Thu, 3 May 90 1:11:39 GMT From: Mills@udel.edu To: braden@venera.isi.edu Cc: J.Crowcroft@cs.ucl.ac.uk, cheriton@pescadero.stanford.edu, end2end-interest@venera.isi.edu Subject: Re: Timeliness is Next to Godliness Message-Id: <9005022111.aa18119@huey.udel.edu> Bob, As I understand it, NTP+ASN.1+ROS+TP0+X.25=RoyalTime, formerly GMT, not to be confused with UTC. Even the BIH is gone and the Government even took away the Royal Greenwich atomic watches (in the interest of austerity). Slough is next... 
Dave From J.Crowcroft@Cs.Ucl.AC.UK Thu May 3 01:35:56 1990 Posted-Date: Thu, 03 May 90 09:30:30 +0100 Received-Date: Thu, 3 May 90 01:35:56 -0700 Message-Id: <9005030835.AA16114@venera.isi.edu> Received: from cs.ucl.ac.uk by venera.isi.edu (5.61/5.61+local) id ; Thu, 3 May 90 01:35:56 -0700 To: Mills@louie.udel.edu Cc: braden@venera.isi.edu, cheriton@pescadero.stanford.edu, end2end-interest@venera.isi.edu Subject: Re: Timeliness is Next to Godliness In-Reply-To: Your message of Thu, 03 May 90 01:11:39 +0000. <9005022111.aa18119@huey.udel.edu> Date: Thu, 03 May 90 09:30:30 +0100 From: Jon Crowcroft Source-Info: hubris.cs.ucl.ac.uk >As I understand it, NTP+ASN.1+ROS+TP0+X.25=RoyalTime, formerly GMT, not to >be confused with UTC. Even the BIH is gone and the Government even took >away the Royal Greenwich atomic watches (in the interest of austerity). >Slough is next... Dave, you are right, except we allow NTP on ASN/ROS over anything, but anything in the uk happens to usually be tp0 and x25 - moves are afoot to rectify this outmoded situation... the vanishment of greenwich world status was initially an EC directive i believe - they couldn't stand the temporal chauvinism. the Royal Observatory and Mean Time line are still where they 'always' were though... jon From Mills@udel.edu Thu May 3 06:33:00 1990 Posted-Date: Thu, 3 May 90 13:25:40 GMT Received-Date: Thu, 3 May 90 06:33:00 -0700 Received: from louie.udel.edu by venera.isi.edu (5.61/5.61+local) id ; Thu, 3 May 90 06:33:00 -0700 Received: from huey.udel.edu by louie.udel.edu id aa05377; 3 May 90 9:27 EDT Date: Thu, 3 May 90 13:25:40 GMT From: Mills@udel.edu To: Jon Crowcroft Cc: Mills@louie.udel.edu, braden@venera.isi.edu, cheriton@pescadero.stanford.edu, end2end-interest@venera.isi.edu Subject: Re: Timeliness is Next to Godliness Message-Id: <9005030925.aa22712@huey.udel.edu> Jon, As I recall, the Prime Meridian got there only after some considerable squabble between the British and the French.
Once the chronometer prize was won, its fate was sealed. There be a lesson in that, including the observation that the Prime zips from pole to pole, but its twin wiggles all over the Pacific. Heck, some countries didn't believe in Pop Gregory until the twentieth century and some don't even yet. Dave From craig@NNSC.NSF.NET Thu May 3 06:44:37 1990 Posted-Date: Thu, 03 May 90 09:44:35 -0400 Received-Date: Thu, 3 May 90 06:44:37 -0700 Message-Id: <9005031344.AA19888@venera.isi.edu> Received: from nnsc.nsf.net by venera.isi.edu (5.61/5.61+local) id ; Thu, 3 May 90 06:44:37 -0700 To: end2end-interest@venera.isi.edu Cc: mills@udel.edu Subject: re: Timeliness is Next to Godliness Date: Thu, 03 May 90 09:44:35 -0400 From: Craig Partridge SWAN Tours Announces its Timely Cruise Come tour the world in search of sites with significance to timekeeping. Visit the Prime Meridian, the atomic clocks that keep international time, see Aztec and Mayan calendars and cross the international date line! Our expert consultant, Dr. David Mills, will be on hand to explain the importance of each day's tour, with frequent evening lectures. Get your tickets now for this once in a lifetime opportunity! [Well, it would be fun wouldn't it?] [For those who don't know, SWAN specializes in cruises with evening lectures from experts in the field explaining the next day's tour -- e.g.
an expert on ancient Greece gives lectures while you tour the Aegean ] From Mills@udel.edu Thu May 3 09:15:52 1990 Posted-Date: Thu, 3 May 90 16:06:54 GMT Received-Date: Thu, 3 May 90 09:15:52 -0700 Received: from louie.udel.edu by venera.isi.edu (5.61/5.61+local) id ; Thu, 3 May 90 09:15:52 -0700 Received: from huey.udel.edu by louie.udel.edu id aa06953; 3 May 90 12:08 EDT Date: Thu, 3 May 90 16:06:54 GMT From: Mills@udel.edu To: Craig Partridge Cc: end2end-interest@venera.isi.edu, mills@udel.edu Subject: Re: Timeliness is Next to Godliness Message-Id: <9005031206.aa24486@huey.udel.edu> Craig, Tourists, your attention is directed to Appendix E of the latest revision of the NTP spec. Craig, you remind me I forgot to include a description of the chronometry of the remarkable Maya calendar; however, you may note my personal letterhead includes a scanned and displayed glyph of the Long-Count dating used in that calendar. My habit is to display the date of public presentations in Long-Count glyphs, but nobody seems to notice. You may derive some indication of the importance of precision dating to the Maya from the file pub/ntp/jaguar.txt found on louie.udel.edu. Reliable reports have it that Mark Pullen of DARPA has adopted it as the DARPA battle cry. Dave From cheriton@Pescadero.Stanford.EDU Thu May 3 11:56:55 1990 Posted-Date: Thu, 3 May 90 11:56:44 PDT Received-Date: Thu, 3 May 90 11:56:55 -0700 Received: from Pescadero.Stanford.EDU by venera.isi.edu (5.61/5.61+local) id ; Thu, 3 May 90 11:56:55 -0700 Received: by Pescadero.Stanford.EDU (5.59/25-eef) id AA24148; Thu, 3 May 90 11:56:44 PDT Date: Thu, 3 May 90 11:56:44 PDT From: David Cheriton Message-Id: <9005031856.AA24148@Pescadero.Stanford.EDU> To: end2end@venera.isi.edu Subject: Congestion at Purdue Anyone familiar with Comer and Yavatkar's paper in the upcoming ICDCS on rate-based congestion control?
From braden Fri May 4 09:11:49 1990 Received-Date: Fri, 4 May 90 09:11:49 -0700 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Fri, 4 May 90 09:11:49 -0700 Date: Fri, 4 May 90 09:12:34 PDT From: braden Posted-Date: Fri, 4 May 90 09:12:34 PDT Message-Id: <9005041612.AA04242@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Fri, 4 May 90 09:12:34 PDT To: end2end-interest Subject: What is this all about?? ----- Begin Included Message ----- From owner-ietf@ISI.EDU Thu May 3 19:08:52 1990 To: hjf@sword.bellcore.com Cc: ietf@ISI.EDU, dave@sword.bellcore.com Subject: Re: Congestion problems In-Reply-To: Your message of Thu, 03 May 90 18:01:37 -0400. <9005032201.AA02354@rapier> Date: Thu, 03 May 90 20:53:57 EDT From: dave@sword.bellcore.com Nicely stated. By the way, Paul Tsuchiya, Joe Lawrence and I had quite a debate over the relative importance of trying to squelch traffic at the source when congestion is being experienced in the network supporting SMDS. They have some pretty convincing arguments that suggest that if you cannot MAKE the traffic sources behave, there is little benefit to attempting to localize the penalty (i.e., discarding) to the noisiest sources, and that the overhead and complexity required to put such a system in place may be greater than its worth. So, it may be worthwhile revisiting our recent thinking. I guess I made some assumptions that relied on some transport layer behaviors that wouldn't take effect without explicit feedback mechanisms, sorry. 
----- End Included Message ----- From lixia@kandron.parc.xerox.com Fri May 4 16:31:55 1990 Posted-Date: Fri, 4 May 1990 16:40:51 PDT Received-Date: Fri, 4 May 90 16:31:55 -0700 Received: from arisia.Xerox.COM by venera.isi.edu (5.61/5.61+local) id ; Fri, 4 May 90 16:31:55 -0700 Received: from kandron.parc.Xerox.COM by arisia.Xerox.COM with SMTP (5.61+/IDA-1.2.8/gandalf) id AA01957; Fri, 4 May 90 16:31:41 -0700 Received: by kandron.parc.xerox.com (5.61+/IDA-1.2.8/gandalf) id AA15372; Fri, 4 May 90 16:40:53 PDT Sender: Lixia Zhang Date: Fri, 4 May 1990 16:40:51 PDT From: lixia@parc.xerox.com Reply-To: lixia@parc.xerox.com To: Van Jacobson Cc: end2end-interest@venera.isi.edu Subject: Re: RFC 1072 Implementations Needed In-Reply-To: Your message of Fri, 23 Mar 90 19:40:45 PST Message-Id: This comment is really late (never read the msg till now), but late is better than never. ............ Just to make sure we're in sync here, the problem we're trying to solve is that TCP duplicate detection and sequencing can fail if it is possible to wrap the sequence space in less than the IP-guaranteed maximum packet lifetime (the IP time-to-live). An expression for the constraint that must be satisfied is 2^31 / B > TTL where B is the bandwidth of the path (in bytes/sec) and TTL is the IP TTL (which has units of "seconds" in this case). To plug in some numbers, the max ttl you could use on a 1Gbps link would be 17 sec., using the rfc793 ttl of 30, you are completely safe at any bandwidth less than 573Mbps and, using the max ttl of 255, you are safe at any bandwidth < 67Mbps. As an aside, note that this constraint has absolutely nothing to do with windows, big or small. I agree with all the above, but the last sentence is not exactly right --- The window size has to be smaller than seq# space. Although Alex McKenzie's rfc pointed out a real problem, I think there was some confusion caused by bringing it up in the context of big windows.
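The numbers quoted above (17 sec., 573Mbps, 67Mbps) can be reproduced with a few lines of C. This is a minimal sketch that assumes the constraint is read as "2^31 bytes of sequence space must last longer than the TTL"; the function names are hypothetical.

```c
#include <assert.h>

/* Seconds needed to consume 2^31 bytes of sequence space at the given
 * link rate in bits per second. */
double wrap_seconds(double bits_per_sec)
{
    assert(bits_per_sec > 0.0);
    return 2147483648.0 / (bits_per_sec / 8.0);
}

/* Highest link rate (bits per second) at which 2^31 bytes of sequence
 * space still last longer than ttl_sec seconds. */
double max_safe_bps(double ttl_sec)
{
    assert(ttl_sec > 0.0);
    return 8.0 * 2147483648.0 / ttl_sec;
}
```

wrap_seconds(1e9) comes out a little over 17 seconds, max_safe_bps(30) just under 573 Mbps, and max_safe_bps(255) a little over 67 Mbps, in line with the figures in the message.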
Because the IP TTL is in seconds, the issue is how fast does the bandwidth let you eat sequence space, not how many round trips it takes to eat the space. Or to put it another way, we're going to be in trouble on FDDI local nets with 16KB windows long, long before we have a problem with 100MB windows on a Gbit transcontinental backbone. I don't understand the last sentence. Could Van help explain a bit? .......... Let me stress again that you essentially *must* use the echo option with big windows -- not using it opens the door to some really dangerous instabilities (and you probably want to use the option even with small windows -- it makes the sender faster). But, since packets now have a timestamp, the receiver should be able to use it to protect against sequence number wraparound. ------------------------------------------ ?? Why do you want to do this? This somehow sounds as we don't really trust TTL (since you worry that seq# may wraparound within pkts lifetime). I'm not arguing TTL is trustful (especially with the Internet BW going up). But whatever our attitude is, we ought to spell out more EXPLICITLY and more LOUDLY than this. In particular, the receiver's algorithm is: 1) if an arriving packet is in sequence, record its timestamp and accept it normally. 2) Otherwise, if the packet is outside the window, reject it. (normal tcp processing) 3) Otherwise, if the packet timestamp is less than the timestamp of the most recently received in-sequence packet, treat the packet as if it is outside the window. 4) Otherwise treat the packet as a normal in-window, out-of-sequence tcp packet (e.g., queue it for reassembly). OK, if we do want to make use of timestamp to protect against seq# wraparound, I think the step (1) in the above should be moved after step (3). I.e. we should do: 1) If the packet is outside the window, reject it. 
(normal tcp processing) 2) Otherwise, if the packet timestamp is less than the timestamp of the most recently received in-sequence packet, treat the packet as if it is outside the window. 3) Otherwise, if an arriving packet is in sequence, record its timestamp and accept it normally. 4) Otherwise treat the packet as a normal in-window, out-of-sequence tcp packet (e.g., queue it for reassembly). Even when a packet arrived in-seq, better to check whether it's too old first. Lixia From craig@NNSC.NSF.NET Mon May 7 04:38:00 1990 Posted-Date: Mon, 07 May 90 07:32:37 -0400 Received-Date: Mon, 7 May 90 04:38:00 -0700 Message-Id: <9005071138.AA08606@venera.isi.edu> Received: from nnsc.nsf.net by venera.isi.edu (5.61/5.61+local) id ; Mon, 7 May 90 04:38:00 -0700 To: cheriton@pescadero.stanford.edu Cc: end2end Subject: re: Congestion at Purdue Date: Mon, 07 May 90 07:32:37 -0400 From: Craig Partridge Dave: Here's the abstract of their paper (from the upcoming CCR bibliography). I haven't seen the paper itself. Craig %K Congestion\0Control %A D. Comer %A R. Yavatkar %T A rate-based congestion avoidance and control scheme for packet switched networks %J Proc. 10th Intl. Conf. Distributed Computing Systems (ICDCS-10) %D May 28-June 1, 1990 %C Paris, France %I IEEE %X \fBAbstract:\fP The problem of congestion control in packet switched networks continues to attract widespread attention in the networking community. Under overloading conditions, a network drops packets, informs the traffic sources of congestion, and relies on the external sources to reduce the load on the network. The policy of reacting to the congestion after it occurs results in considerable degradation in the throughput during network recovery. We argue that it is important to exercise congestion control inside the network and discard the excess traffic before it enters the network instead of waiting for the congestion to build at some intermediate point in the network.
We have devised a congestion avoidance and control scheme that monitors the incoming traffic to each destination and provides rate based feedback information to the sources of bursty traffic so that sources of traffic can adjust their packet rates to match the network capacity. The paper discusses the scheme in detail and describes the results of an experimental evaluation. From J.Crowcroft@Cs.Ucl.AC.UK Mon May 7 08:30:03 1990 Posted-Date: Mon, 07 May 90 16:22:36 +0100 Received-Date: Mon, 7 May 90 08:30:03 -0700 Message-Id: <9005071530.AA13329@venera.isi.edu> Received: from cs.ucl.ac.uk by venera.isi.edu (5.61/5.61+local) id ; Mon, 7 May 90 08:30:03 -0700 To: Craig Partridge Cc: end2end@venera.isi.edu Subject: Congestion (of a sort) In-Reply-To: Your message of Mon, 07 May 90 07:32:37 -0400. <9005071138.AA08606@venera.isi.edu> Date: Mon, 07 May 90 16:22:36 +0100 From: Jon Crowcroft Source-Info: sol.cs.ucl.ac.uk should you have cause to telephone London now, you'll find there's two exchanges (071 central, 081 outer) rather than good olde 01. (if you use the quipu or nysernet directory, this info is already there). Its due to them running out of numbers (one very cute but totally ignorant suggestion from a naive user was - why dont they bring back using the letters again, then we'd have(26+10)**7 instead of just 10**7) cheers jon From Mills@udel.edu Mon May 7 09:08:09 1990 Posted-Date: Mon, 7 May 90 15:53:40 GMT Received-Date: Mon, 7 May 90 09:08:09 -0700 Received: from louie.udel.edu by venera.isi.edu (5.61/5.61+local) id ; Mon, 7 May 90 09:08:09 -0700 Received: from huey.udel.edu by louie.udel.edu id aa10384; 7 May 90 11:56 EDT Date: Mon, 7 May 90 15:53:40 GMT From: Mills@udel.edu To: Jon Crowcroft Cc: Craig Partridge , end2end@venera.isi.edu Subject: Re: Congestion (of a sort) Message-Id: <9005071153.aa29894@huey.udel.edu> Allan Sherman: "Let's all go down to the AT&T and complain to the President waltz..." about digit dialing. 
If you remember that musical farce, you remember the San Francisco exchange codes - Emerson-8, Davenport-3, MurrayHill-7. But, while Jon points out the joys of base-36 arithmetic, we should point out the telephone keypad, at least in this country, has only 12 keys. As a warrior in the scheme to force 12 keys on AT&T twenty years ago (I had the first such instrument in Michigan - everybody else had a 10-key instrument), I have to report I was unsuccessful in forcing a 16-key phone on the world, even though the MF tones are defined for it.

Dave

P.S. I also had the first Data Access Arrangement in Michigan. Remember those things? Time lurches on. DLM

From braden Fri May 11 09:08:48 1990
Received-Date: Fri, 11 May 90 09:08:48 -0700
Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Fri, 11 May 90 09:08:48 -0700
Date: Fri, 11 May 90 09:09:32 PDT
From: braden
Posted-Date: Fri, 11 May 90 09:09:32 PDT
Message-Id: <9005111609.AA07280@braden.isi.edu>
Received: by braden.isi.edu (4.1/4.0.3-4) id ; Fri, 11 May 90 09:09:32 PDT
To: end2end-interest
Subject: Van's comments

----- Begin Included Message -----

From van@helios.ee.lbl.gov Fri May 11 04:03:53 1990
To: braden@ISI.EDU
Subject: Re: Minutes of Teleconference - for approval
In-Reply-To: Your message of Fri, 04 May 90 16:54:03 PDT.
Date: Fri, 11 May 90 04:04:30 PDT
From: Van Jacobson

> Clark explained why this two-channel
> model is not a good approach: the two streams must be synchronized,
> which is a difficult problem.

Clark explained that this was the original motivation for the urgent pointer (responding to my crack that `urgent' probably existed because opening two connections under Multics would have been too expensive). At least one party (myself) was not convinced that experience and current practice validate this design decision.
The only extant user of URG (telnet) embeds explicit synchronization in the data stream and uses the urgent pointer only as an indicator that the synchronization is present somewhere in the stream. And, if the `pointer' model of URG were ever desired, I don't see why it's hard for an application to keep track of what it has sent (in bits, bytes, records, requests, or whatever units it happens to work in) and tell its peer via a `control' connection that `unit X is urgent'. Plus a separate connection allows you to implement other models of `urgent data' (like a priority model) and works better if we ever do something like round-robin on a per-connection (as opposed to per-host-pair) basis.

Although I'm sure I've missed the subtleties of the argument for URG and will regret saying this, I still think TCP would have been cleaner, simpler and just as useful had URG been left out of the design.

> Jacobson suggested that for TP4 the absolute time would
> probably be more appropriate. (He did not explain this teaser).

I think the explanation was "Oops, sorry! Fuzzy thinking!" I was thinking of a protocol where either the receiver did more traffic analysis or the receiver response time could be long compared to the round trip time (perhaps netblt would fit one or both of these categories). Then the receiver might want the frequency and reference of the time to be fixed.

> Jacobson explained that there are actually 2 algorithms
> interleaved: (1) slow start for fat pipe ("keep the flywheel
> spinning when a packet is lost") and (2) increased fairness on
> busy network. There are some errors in the description of the
> algorithm.
>
> Lixia pointed out that the algorithm loses when the pipe size is
> larger than the switch buffer size.
I thought I said that I had described two different algorithms over dinner: The one I just sent a note about (which is really about improving TCP efficiency over LFNs and making it less sensitive to loss -- Remember that 1/(1 + pW) throughput scaling we talked about two years ago? The new algorithm was intended to change that to 1/(1 + p) [which is the best any protocol can do] and it does.)

Then there was the "fairness" change to slow-start that you mentioned that was intended to add some hysteresis to account for the fact that a connection is more likely to have packets dropped in the first few round trip times after a drop (i.e., while slow-start has made its window artificially small). At the Xerox PARC meeting, I thought suitable hysteresis might result from giving a connection log2(W) round trip times of grace before a loss would change the congestion avoidance threshold (ssthresh). Subsequent thought and experiment showed this was actually a horrible idea if the window was large compared to the available network buffer resources. Lixia implemented and simulated something very close to this second algorithm and found the same thing: it doesn't work very well on busy networks.

At the time of the teleconference, Lixia had *not* simulated the first algorithm, probably because I botched the description over dinner and didn't make it clear there were two different algorithms doing two different things.

> Currently NSFNET has a drop ratio of 1-2% and a delay of 1-2
> seconds.

How about 150-300ms? (the fact that delay increases quickly with packet size is an ongoing mystery but it never gets anywhere near as bad as 1-2 sec). NSFNET also does some "anti-drops" (it duplicates packets and the rate of duplication increases to several percent when the load increases). This makes for an interesting "noise" source in an algorithm that works off of duplicate acks.

> Clark said that some RISC chips don't do a move-and-sum very
> well, and he cited the MIPS processor. Jacobson said that is a
> solved problem.

I made versions of bcopy and bcopy+checksum for the pmax that ran at the same speed (on packet sizes >128 bytes). The trick was to load two cache-lines worth of data into registers, doing the adds on the previous line's data in the load and store delay slots of the current line (the '>128' is what it takes to amortize the cost of the first line's load where you can't do anything useful in the delay slots). The code for this was on one of Dave Cheriton's machines at Stanford (we don't have any pmax's) and seems to have vanished. I'll get someone to type it back in from my listing and mail it to DDC (or the list if others are interested).

> Jacobson argued that there are only two accuracies: none, and
> max; "there is no win for crummy time". Braden pointed out that
> this is only true once synchronization is complete everywhere.

I was trying to say that with the algorithms Dave Mills has developed, it doesn't cost any more, in terms of traffic, to get the best time the system is capable of. Then you pointed out that this was only true at steady-state and accuracy costs in time to convergence and/or traffic during convergence. This is probably one of the major scale issues: Assume some fixed frequency of host failures so the density of "starting" hosts will increase linearly with the number of hosts on the network. What is the cost, in terms of traffic, time to convergence and fraction of hosts out-of-sync as a function of accuracy and number of hosts?
- Van ----- End Included Message ----- From deering@Pescadero.Stanford.EDU Fri May 11 15:48:33 1990 Posted-Date: 11 May 1990 14:48-PDT Received-Date: Fri, 11 May 90 15:48:33 -0700 Received: from Pescadero.Stanford.EDU by venera.isi.edu (5.61/5.61+local) id ; Fri, 11 May 90 15:48:33 -0700 Received: by Pescadero.Stanford.EDU (5.59/25-eef) id AA26564; Fri, 11 May 90 15:48:18 PDT Date: 11 May 1990 14:48-PDT From: Steve Deering Subject: Re: Multicast Routing To: braden@venera.isi.edu Cc: end2end-interest@venera.isi.edu Message-Id: <90/05/11 1448.847@pescadero.stanford.edu> In-Reply-To: braden's message of Tue, 1 May 90 162720 PDT > I am concerned about multicast routing. You have been working for > some time (a year?) with the OSPF folks to put multicast routing > into OSPF. What is the status of that, by the way? It's been about 8 months since I first met with Jon Moy to discuss multicast routing extensions for OSPF. Since that time, an IETF working group has been formed (with me as chair) to develop a specification for the multicast OSPF extensions. We had a productive meeting at the Pittsburgh IETF last week, at which we reviewed the extensions required, resolved some outstanding issues, and identified some new ones. John Moy has volunteered to write up the specification (probably in the form of an appendix to the current OSPF spec.), and he hopes to have that ready for review before the August IETF meeting. The implementors of OSPF at U. Maryland and Cornell are eager to implement the multicast stuff as soon as we can agree on a spec. > Have you given any thought to the difficulty of putting multicast > routing into extended IS-IS? Yes, I have thought about it, but not in great detail. I can't immediately see why it should be any more difficult to add multicast routing to IS-IS than it is for OSPF. 
I had assumed that whatever we learn from the OSPF effort will map directly into IS-IS (at least for IP multicasting -- CLNP multicasting raises some additional issues, related to addressing/multihoming/etc.). > I am worried that the following sequence of events is possible: > (1) ANSI standardizes the Internet extensions for IS-IS -- Ross is > chugging away at that -- without multicast routing, and subsequently > (2) the IETF/IAB decides to adopt IS-IS rather than OSPF. This > sequence could make widespread deployment of intra-AS multicast > routing within a reasonable time frame much more difficult. Yes, that scenario is possible. (As an IAB member, perhaps you have a better idea than I do how likely it is.) I have been more concerned about the reality of OSPF currently being deployed at a number of places, without the multicast extensions in place yet. Do you have any reason to believe that the IS-IS people would (or would not) be receptive to multicast routing extensions? I am certainly willing to talk to Ross about it, but I cannot commit to any more committees at this time. 
Steve From lixia@kandron.parc.xerox.com Fri May 11 18:26:52 1990 Posted-Date: Fri, 11 May 1990 18:35:54 PDT Received-Date: Fri, 11 May 90 18:26:52 -0700 Received: from arisia.Xerox.COM by venera.isi.edu (5.61/5.61+local) id ; Fri, 11 May 90 18:26:52 -0700 Received: from kandron.parc.Xerox.COM by arisia.Xerox.COM with SMTP (5.61+/IDA-1.2.8/gandalf) id AA20951; Fri, 11 May 90 18:26:36 -0700 Received: by kandron.parc.xerox.com (5.61+/IDA-1.2.8/gandalf) id AA16469; Fri, 11 May 90 18:35:55 PDT Sender: Lixia Zhang Date: Fri, 11 May 1990 18:35:54 PDT From: lixia@parc.xerox.com Reply-To: lixia@parc.xerox.com To: van@helios.ee.lbl.gov Cc: end2end-interest@venera.isi.edu Subject: Re: Van's comments In-Reply-To: Your message of Fri, 11 May 90 09:09:32 PDT Message-Id: NSFNET also has does some "anti-drops" (it duplicates packets and the rate of duplication increases to several percent when the load increases). This make for an interesting "noise" source in an algorithm that works off of duplicate acks. Van, This sounds interesting. Could you spell out more details of this "anti-drop" algorithm? Like where packets get duplicated, and by how much? (how is the duplicate ratio related to the load increase ?) How is the network load measured? Any measurement results about the effect/effectiveness of this algorithm? Lixia From Mills@udel.edu Sun May 13 14:06:48 1990 Posted-Date: Sun, 13 May 90 0:26:50 GMT Received-Date: Sun, 13 May 90 14:06:48 -0700 Received: from louie.udel.edu by venera.isi.edu (5.61/5.61+local) id ; Sun, 13 May 90 14:06:48 -0700 Received: by louie.udel.edu id ae19272; 13 May 90 20:58 GMT Received: from huey.udel.edu by louie.udel.edu id aa02884; 12 May 90 20:33 EDT Date: Sun, 13 May 90 0:26:50 GMT From: Mills@udel.edu To: end2end Subject: Wandering nanoseconds Message-Id: <9005122026.aa21197@huey.udel.edu> Folks, The table I flashed at the recent N2N telemeet was Xerogmented. 
Following is the original extract from a section of a report I wrote for the National Academy. The data was assembled from briefing slides presented by AT&T, MCI and Sprint.

"The accuracy achieved by a slave clock also depends on the synchronization path to the master and may be impaired by temperature variations, transmission errors, protection switching and equipment maintenance. Table 3 shows the expected daily and yearly variation (wander) for various facilities.

Facility                      Daily var (ns)    Yearly var (ns)
---------------------------------------------------------------
Radio Link 1000 km            210               420-580
Coaxial Cable 1000 km         57                860
Fiber optic 1000 km           110-160           1,690-2,440
Polyethene 100 km buried      100               1,500
Polyethene 50 km aerial       830               2,080
Paper 50 km buried            160               2,360
Paper 250 km aerial           14,000            36,000

Table 3. Facility Delay Variation

"As a specific example, measurements of a 30-mile circuit between two typical exchanges near New York City showed a diurnal variation due to temperature of 200 ns RMS and a pulse-stuffing wander of 75 ns RMS.

Note that paper-insulated cables and aerial cables of all types can be considered unlikely for most interesting backbone circuits, except in the Louisiana bayous, where the principal danger is shotgun-toting old boys. My conclusions are (a) for the most accurate time transfer don't throw away the copper just yet, (b) preamble guard times up to a few microseconds may be required for synchronous systems, (c) existing (radio/satellite) time-transfer techniques are not likely to be obsoleted by fiber. Time transfer with GPS can be accurate to less than a nanosecond (if you believe the title of the paper I flashed), while time transfer with LORAN-C is typically accurate to 500 ns.
Dave

From ddc@thyme.LCS.MIT.EDU Mon May 14 07:31:03 1990
Posted-Date: Mon, 14 May 90 10:31:26 -0400
Received-Date: Mon, 14 May 90 07:31:03 -0700
Received: from THYME.LCS.MIT.EDU by venera.isi.edu (5.61/5.61+local) id ; Mon, 14 May 90 07:31:03 -0700
Received: from THYME.LCS.MIT.EDU by thyme.LCS.MIT.EDU via TCP with SMTP id AA01487; Mon, 14 May 90 10:31:28 EDT
Message-Id: <9005141431.AA01487@thyme.LCS.MIT.EDU>
To: braden@venera.isi.edu
Cc: end2end-interest@venera.isi.edu
Subject: Re: Van's comments
In-Reply-To: Your message of Fri, 11 May 90 09:09:32 -0700. <9005111609.AA07280@braden.isi.edu>
From: David Clark
Date: Mon, 14 May 90 10:31:26 -0400
Sender: ddc@thyme.LCS.MIT.EDU

Folks, concerning Van's comment about Urgent. I agree with him about the suitability of Urgent. I was not trying to defend the reasoning, just note what had been said back then. If you want to see the original justification of Urgent (perhaps there is a historian out there) see the design documents for a protocol called DSP by Dave Reed. That's all.

Dave

From braden Mon May 14 15:07:47 1990
Received-Date: Mon, 14 May 90 15:07:47 -0700
Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Mon, 14 May 90 15:07:47 -0700
Date: Mon, 14 May 90 15:08:23 PDT
From: braden
Posted-Date: Mon, 14 May 90 15:08:23 PDT
Message-Id: <9005142208.AA08080@braden.isi.edu>
Received: by braden.isi.edu (4.1/4.0.3-4) id ; Mon, 14 May 90 15:08:23 PDT
To: end2end-interest
Subject: Minutes of last E2E meeting

End-to-End Research Group MEETING MINUTES
Videoconference Friday April 27, 1990
MINUTES TAKEN BY: Lixia Zhang

Present at BBN: Present at DARPA: Present at ISI: Present at SRI: Clark (MIT) Cooper (CMU) Braden (ISI) Cheriton (Stanford) Partridge (BBN) Mills (U. Del) Estrin* (USC) Deering (Stanford) Topolcic* (BBN) Jacobson (LBL) Mckenney* (SRI) Zhang (PARC) *Guest

MINUTES

1.
TCP ENHANCEMENTS 1.1 FAT PIPE ENHANCEMENTS Jacobson has talked to Sandia people who run TCP over T1 satellite channels without FEC (forward error correction), and a data error rate of 2%. They really need selective-ACK (SACK) and big window size. The research work for employing big window and time-stamping (see RFC1072) is considered completed. The mechanisms are not implemented in BSD4.3 due to mbuf limitation, but they can be easily implemented in 4.4. In fact SACK has been coded (but not tested). Jacobson would like to test the code with live traffic. DARTNET can be a nice vehicle for this test, but will need a "flakeway" in order to see how throughput scales with loss rate. Will also want live traffic for this test. Partridge mentioned that Dave Borman is also interested in this issue and may like to join the game. Braden then asked to what extent the E2E RG needs to continue working on this TCP enhancement. Partridge will not attend the forthcoming IETF, but he hopes to convince his group to proceed with RFC1072 at the next IETF meeting at Vancouver. One issue raised at IETF is the 16-bit Urgent Pointer. Several people said that 16-bit should be long enough. Clark: more than 16-bit length for Urgent Pointer is silly -- the Urgent Pointer is a message to the application and is set relative to the current sequence number. One should never have more than 2^16 bytes data waiting in the pipe. Currently Telnet is the only user of Urgent Pointer option. Cheriton suggested that a more general solution -- TOS/priority at the transport level -- would be better. Jacobson suggested a two separate channel model, one for urgent data and the other for normal data. Clark explained why this two-channel model was not adopted by the TCP developers: the two streams must be synchronized, which was thought to be a difficult problem. 
[Subsequent comment by Jacobson: The only extant user of URG (telnet) embeds explicit synchronization in the data stream and uses the urgent pointer only as an indicator that the synchronization is present somewhere in the stream. And, if the `pointer' model of URG were ever desired, I don't see why it's hard for an application to keep track of what it has sent (in bits, bytes, records, requests, or whatever units it happens to work in) and tell its peer via a `control' connection that `unit X is urgent'. Plus a separate connection allows you to implement other models of `urgent data' (like a priority model) and works better if we ever do something like round-robin on a per-connection (as opposed to per-host-pair) basis. Although I'm sure I've missed the subtleties of the argument for URG and will regret saying this, I still think TCP would have been cleaner, simpler and just as useful had URG been left out of the design.]

1.2 VAN's TIMESTAMP PROPOSAL

Braden raised the packet lifetime problem in high speed networks. An RFC is needed to convince people that Van's time-stamping algorithm really works. Craig suggested an article in CCR. Braden asked whether CCR is an effective vehicle for communication with IETF. Craig thinks it is moderately effective. Jacobson asked if he could just turn his previous message into an RFC. Braden agreed and offered help with getting the RFC out. Cheriton questioned whether we would need to expand the time-stamp option field to 64 bits later, but Jacobson said that 32 bits is adequate for TCP.

1.3 REVISED SLOW-START ALGORITHM

Braden had sent to the group Lixia's description of Jacobson's revised Slow-Start algorithm. He asked whether the description was correct. Jacobson said he has a partially-completed message correcting the algorithm.

ACTION [DONE]: Jacobson: Finish message describing revised Slow Start algorithm.
Jacobson explained that he has proposed 2 different new algorithms:

(1) slow start for LFNs ("keep the flywheel spinning when a packet is lost"). This is intended to improve TCP efficiency over LFNs by making it less sensitive to losses. The original slow-start had a throughput scaling as 1/(1 + pW); the revised algorithm scales as 1/(1 + p) [which is the best any protocol can do].

(2) increased fairness on a busy network. This is intended to add some hysteresis to account for the fact that a connection is more likely to have packets dropped in the first few round trip times after a drop (i.e., while slow-start has made its window artificially small). At the Xerox PARC meeting, Van suggested suitable hysteresis might result from giving a connection log2(W) round trip times of grace before a loss would change the congestion avoidance threshold (ssthresh). Subsequent thought and experiment showed this was actually a horrible idea if the window was large compared to the available network buffer resources. Lixia implemented and simulated something very close to this second algorithm and found the same thing: it doesn't work very well on busy networks.

Jacobson argued that we must engineer the network with buffer size greater than pipe size, otherwise the system cannot be stable. Sincoskie and Fraser are convinced by this argument. But some theoreticians at Bell Labs and Bellcore are not convinced, arguing that the pipe size can grow faster than the buffer size. Braden asked for suggestions on Bell Labs/Bellcore people we could invite to the next meeting. Van and Clark made some suggestions, and suggested asking Shenker's opinion.

Jacobson wants to test the revised Slow-Start algorithm over NSFNET to MIT. Currently NSFNET has a drop ratio of 1-2% and a delay of 150-300 ms. Braden also suggested DARTNET as the testbed. Jacobson thinks we can do both. He will test it first and then hand the code over to Clark or his student and let them play with it.
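
[Illustrative note: the two throughput scalings quoted above, 1/(1 + pW) and 1/(1 + p), can be compared numerically. The minutes do not define the symbols, so the sketch below assumes p is a per-packet loss probability and W is the window size in packets; the function names and sample values are hypothetical and chosen only to show how the gap widens as W grows.]

```python
def original_scaling(p, w):
    """Relative throughput 1/(1 + p*W) quoted for the original slow-start.

    Assumption: p is a per-packet loss probability, w a window in packets.
    """
    return 1.0 / (1.0 + p * w)


def revised_scaling(p):
    """Relative throughput 1/(1 + p) quoted for the revised algorithm."""
    return 1.0 / (1.0 + p)


if __name__ == "__main__":
    p = 0.01  # assumed value, in line with the 1-2% drop ratio mentioned
    for w in (8, 64, 512):  # assumed window sizes, in packets
        print(f"W={w:4d}  original={original_scaling(p, w):.3f}  "
              f"revised={revised_scaling(p):.3f}")
```

Under these assumptions the original scaling falls off as the window grows, while the revised scaling depends only on p, which is why the change matters most on long fat networks.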
ACTION: Jacobson: Test revised Slow Start to MIT over NSFnet.

1.4 TCP ALTERNATE CHECKSUM

Cooper suggested this subject. He wants a checksum for the ATM environment that detects reassembly errors (mis-ordering). It was noted that the CCITT T1 study group is considering sequence numbers in the Adaptation Layer of ATM. Partridge said that BBN is preparing an alternative proposal to handle this problem.

Action item: Partridge to send a copy of BBN's proposal on the ATM Adaptation Layer to the group.

Cooper would like to move checksum computation away from TCP, e.g., pre-compute it at the network interface. He questioned the efficiency of doing checksums at higher layers. Jacobson commented that one should always do checksum and data move together, making the checksum "costless"; the network interface should be treated as memory. He claimed that this scales to multiprocessors (another teaser). Clark said that some RISC chips don't do a move-and-sum very well, and he cited the MIPS processor. Jacobson said that is a solved problem.

Clark has submitted a paper to SIGCOMM on protocol layering, arguing that "packets are no damn good".

Action item: Clark to send a copy of his SIGCOMM paper to the group.

Deering pointed out that the proposed Alternate Checksum Option (by Partridge, see RFC1146) does not allow the checksum field to be moved to the end of the packet for efficient hardware implementation. Jacobson believes the alternate checksum option should die, and gave arguments for the "light-weight" checksum currently in TCP. If additional protection is needed, it should be supplied over each hop by appropriate link-level mechanisms. Cheriton wants to discuss checksum issues at the next meeting.

2. SYNCHRONIZED TIME

Braden requested that the research group prepare a position paper on the applications and implications of ubiquitous synchronized time, for the IAB and other planning bodies.
Mills agreed to contribute data and facts on how well and accurate the network time can be synchronized. Jacobson pointed out that, besides the benefits, there are also costs to be listed. Currently, synchronized time is available only on gateways, and there will be additional cost to bring the time to workstations. Scalability may also be an issue. Some further research may need be done to make sync'd time a truly available resource for applications. Mills has some numbers for the cost, although he does not like the current values. In terms of scalability, he thinks it should not be worse than routing protocol. Cooper reported that a CMU student (Bart Bloch?) is finishing up a thesis on replicated systems which make use of sync'd time. The virtue of a global clock has been known for long time, but few people have tried to take advantage of sync'd time. Clark noticed that lots of people still do painful handshaking all the time. Cheriton pointed out that different applications need different degrees of accuracy in sync'd time. We need a hook from application to the transport level to specify what time accuracy is needed. Jacobson argued that with Dave Mills' algorithms, no more traffic is required to get the best time of which the system is capable. Braden pointed out that this is only true in steady state; accuracy costs in time and traffic to converge. [Jacobson added: This is probably one of the major scale issues: Assume some fixed frequency of host failures, so the density of "starting" hosts will increase linearly with the number of hosts on the network. What is the cost, in terms of traffic, time to convergence, and fraction of hosts out-of-sync, as a function of accuracy and number of hosts?] Liskov has tried to make use of sync time. The idea is to avoid handshaking if there is a global clock available, and fall back to the less efficient procedure if not. ACTION: Clark to send an updated version of Liskov paper to the group. 
(The paper has been submitted to SIGCOMM) Mills pointed out that when we move to gigabit networks, everything changes by a factor of thousand. One concern is the required accuracy to be considered synchronized. Mills showed some numbers of time jitter (delay variance) over various transmission lines. The point is that the time jitter is unavoidable. Braden suggested to write up a 10-20 page coherent description of the cost and benefit of sync'd time. This is to be input to the IAB and to agencies for planning future R&D. Cheriton and Mills volunteered to compile people's messages to put together this paper. Everyone is requested to think about the potential benefits of universal sync'd time and report to two Dave's. ACTION: Cheriton & Mills: Put together draft of summary of possible benefits of synchronized time as a ubiquitous Internet service. Cooper and Clark volunteered to write up a few paragraphs about global clock applications. Zhang mentioned ongoing work with Shenker on using global clock in network jitter control. Multicast is currently used for time distribution in NTP in Canada. ACTION: Deering: Look at Mills draft of multicast NTP. 3. STANDARD FORMAT FOR PACKET TRACE ACTION: Braden: Try to persuade Bill Nowicki to propose packet trace standard. 4. MULTICASTING Cheriton gave a presentation on "New Age of Multicasting in the Spirit of Internet". The idea is to establish a "channel" model that one sends to and receives from. There are currently four kinds of channels: One to one: eg telephone conversation One to many: eg TV channel Many to one: eg Logging channel Many to many: eg CB radio channel One-to-one is actually a special case of many-to-many. A channel can be specified by source-destination address pairs, so he would allow multicast source addresses. He suggested extending IP addresses by including UDP port number and reserving part of port # space for multicast ports. There was a brief discussion on this channel model. 5. 
TCP MONITORING Estrin, representing ANRG, made a request to build monitoring tools in BSD TCP implementation. Jacobson replied that 4.3BSD already collects global statistics, and in 4.4 the monitoring will be done on per-connection basis. He expects there will be a new socket class for protocol monitoring; when a connection closes, its statistics will be delivered to all appropriate open monitoring sockets. What is available currently is the statistics of all connections, things like the number of bytes transmitted, retransmitted, received, duplicated, etc. ACTION [DONE]: Jacobson: Send message describing TCP information being monitored in 4.3BSD (done). 6. DARTNET EXPERIMENTS 6.1 STATUS Jacobson will get an alpha T1 card from Sun today (4/27), and will start debugging the driver after returning from the IETF meeting (5/1--5/4). He needs two more weeks of work on his reservation code/ algorithm before he can send it out. 6.1 PLANNING The discussion resulted in the following three steps: 1. Before starting experiments, we need to load DARTNET with conventional protocol/traffic for awhile, to test out bugs and get measurement tools working. Meanwhile, Lixia will simulate Van's revised Slow-Start algorithm and his resource reservation algorithm. 2. Van will start first with congestion control tests: run the old congestion-avoidance algorithm and then the revised algorithm. One goal is to compare results with Lixia's earlier simulation. Van will proceed to test his resource reservation mechanism. He will overload the network without his mechanism, and then show how much his mechanism improves things. At the same time, simulation effort will proceed with the test algorithms of the next stage. 3. 
By this time, other tests should be ready to go:

   o statistical queueing,
   o rate-based flow control and flow-based resource management,
   o connection management

According to Topolcic at BBN, it should be possible to try a video conference using ST encapsulated in IP soon after DARTNET becomes available. There was some discussion of ways to force real traffic across DARTNET. Clark incautiously suggested forcing LCS.MIT traffic to reach NSFnet via DARTNET and BARRNET. Braden said that careful planning and coordination of any such routing change will be necessary.

6.2 COLLABORATION

It is important that people can collaborate as closely as possible. One tool for collaboration is video conferencing. Having teleconferencing at all research sites requires moving the current video host implementation out of the Butterfly. Casner currently plans to port the code to Sparcstations in about 6 months. [Note: He is trying to move this earlier: RTB]. (There was a random discussion about frustration with the delay in making teleconf facilities more commonly available, the choice of codec, how to show multiple sites on a workstation screen, etc.)

6.3 TRAFFIC GENERATORS

Paul McKenney plans to contribute traffic generators. A question that arose is: how good a model of "real" traffic is needed in order to satisfy the demands of experiments? Lixia asked: at what protocol layer should traffic generators operate -- IP, TCP, or application? It was decided that application-level data generators (as opposed to a simple packet pump at network level) are best. With a high level data generator, an important issue is what to do with temporarily blocked data when lower layer(s) flow control is enforced. This is application-dependent and must be captured in the generator model.

Jacobson suggested we need two different kinds of traffic generators. At the beginning of any experiment we will need very simple data generators (e.g. infinite data source, Poisson source, etc.) for debugging.
Later we will need more complicated generators, e.g., simulating the characteristics of video, FTP, and Telnet traffic, and to the extent possible, real traffic sources. To set up particular experiments, it will be necessary to fire off programs at various test hosts using synchronized clocks. Synchronization will be needed to only about 1 second. 7. RESOURCE RESERVATION MODELS At the last DARTNET teleconf, there was a discussion between Van and Paul about implementation issues for resource reservation -- whether it should be done at link level or network level. Formally, it should be at the network layer, but in real implementations it may better be done at the link level (i.e., in the driver). Jacobson gave his thoughts on connection-oriented resource management. There are 4 major issues: (1) Managing resources on a link, with policy feedback based on link utilization. (2) Choosing units of resource accounting. (3) What to do when reach a resource limit (4) Resource binding time. The primitives that are required are: - packet clarification - resource control .......... (Van, could you help complete this list please? -- LZ) Clark commented that speed requires both right architecture and efficient implementation. If a control algorithm is complicated in implementation, it will lose. The core of speed lies on what happens when forwarding a packet. And the most critical part is link layer. Although some protocol models do resource management on a per connection basis at network layer, resource management eventually comes down to who owns/can use the channel. Jacobson briefly described his resource control implementation in the driver: - one queue for each traffic class. - traffic classes are set up by higher level protocols in advance. - when a traffic class runs over its usage limit, the driver will upcall to a pre-specified network-layer procedure (e.g., to send Source Quench or set the Dec bit). 
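The driver scheme listed above might be sketched in C roughly as follows. This is only an illustration, not Jacobson's code: the names (traffic_class, class_setup, class_enqueue, class_dequeue, over_limit_fn, count_upcall), the use of bytes as the accounting unit, and the fixed queue depth are all assumptions made for the example. Resetting the usage counter at the end of an accounting period is left out.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_DEPTH 8   /* hypothetical per-class queue depth */

/* Upcall into a network-layer procedure (e.g., one that sends Source Quench). */
typedef void (*over_limit_fn)(int class_id, void *arg);

struct traffic_class {
    int           id;
    size_t        limit;                /* usage limit, set up in advance */
    size_t        usage;                /* bytes accepted so far (assumed unit) */
    size_t        pkt_len[QUEUE_DEPTH]; /* queued packet lengths; payload omitted */
    int           head;
    int           count;
    over_limit_fn upcall;               /* pre-specified procedure */
    void         *upcall_arg;
};

/* A class must be set up by a higher-level protocol before packets arrive. */
void class_setup(struct traffic_class *tc, int id, size_t limit,
                 over_limit_fn upcall, void *arg)
{
    assert(tc != NULL);
    tc->id = id;
    tc->limit = limit;
    tc->usage = 0;
    tc->head = 0;
    tc->count = 0;
    tc->upcall = upcall;
    tc->upcall_arg = arg;
}

/* Queue one packet on its class queue. Returns 0 on success, -1 if the
 * queue is full. Makes the upcall when the class runs over its limit. */
int class_enqueue(struct traffic_class *tc, size_t len)
{
    assert(tc != NULL);
    if (tc->count == QUEUE_DEPTH)
        return -1;
    tc->pkt_len[(tc->head + tc->count) % QUEUE_DEPTH] = len;
    tc->count++;
    tc->usage += len;
    if (tc->usage > tc->limit && tc->upcall != NULL)
        tc->upcall(tc->id, tc->upcall_arg);
    return 0;
}

/* Take the next packet off the class queue; returns 0 if the queue is empty. */
size_t class_dequeue(struct traffic_class *tc)
{
    size_t len;
    assert(tc != NULL);
    if (tc->count == 0)
        return 0;
    len = tc->pkt_len[tc->head];
    tc->head = (tc->head + 1) % QUEUE_DEPTH;
    tc->count--;
    return len;
}

/* Example upcall for experiments: just counts how often it was invoked. */
void count_upcall(int class_id, void *arg)
{
    (void)class_id;
    ++*(int *)arg;
}
```

Because every class has to exist before its first packet arrives, a sketch like this can give each configured class its own queue but cannot create queues on demand for arbitrary source-destination pairs.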
Jacobson's mechanism can implement FQ on a per-class basis, but not on a per source-destination address pair basis, because each traffic class has to be set up prior to arrival of packets from that class. Braden asked the group to explore some convergence between Van's implementation of resource control and the "kernel" approach of the COIP group. Topolcic (who was able to join the group at this point) explained the COIP WG's concept of "protocol kernel". It is felt that various connection-oriented protocols have different ways to do similar things. The idea is to implement identical functions once only and implement differences separately. One example is connection setup. All the protocols will have a setup stage, although the connection descriptors may contain different information. All connection-oriented approaches also need to run a state machine, although the number of states, state information, and state transition table may be different. In a sense, the "protocol kernel" may be considered a state machine interpreter. Similarly, there may also be a packet interpreter to accept packets of different protocols. The whole design is still under discussion; the implementation has not started. Since Topolcic had arrived (unfortunately, Casner was unable to make it), discussion returned to ST traffic for DARTNET testing. BBN is working on encapsulating ST packets in IP (putting IP header on top of ST packets and sending out without ST connection setup). Topolcic has talked to Cohen about how to route the encapsulated ST packets to DARTNET. There was again a random discussion about video conference hardware and about variable rate video coding. The available hardware does only constant rate-encoding; manufacturers are waiting for the market. Currently commercial video conferences are all run through circuit-switching networks, so there is no need for variable rate encoder. 8. NEXT MEETING Cooper will host the next End-to-End Research Group at CMU on June 13-14. 
Travel info will be sent out soon.

From braden Mon May 14 17:14:56 1990
Received-Date: Mon, 14 May 90 17:14:56 -0700
Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Mon, 14 May 90 17:14:56 -0700
Date: Mon, 14 May 90 17:15:40 PDT
From: braden
Posted-Date: Mon, 14 May 90 17:15:40 PDT
Message-Id: <9005150015.AA08236@braden.isi.edu>
Received: by braden.isi.edu (4.1/4.0.3-4) id ; Mon, 14 May 90 17:15:40 PDT
To: end2end-interest
Subject: JSAC issue on Congestion Control

May 1990 issue of IEEE Comm Mag (p51) calls for papers for a JSAC issue to be devoted to Congestion Control in High-Speed Packet-Switched Networks. Submission due Sept 1, 1990.

Bob Braden

From braden Tue May 15 09:22:52 1990
Received-Date: Tue, 15 May 90 09:22:52 -0700
Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Tue, 15 May 90 09:22:52 -0700
Date: Tue, 15 May 90 09:23:39 PDT
From: braden
Posted-Date: Tue, 15 May 90 09:23:39 PDT
Message-Id: <9005151623.AA08423@braden.isi.edu>
Received: by braden.isi.edu (4.1/4.0.3-4) id ; Tue, 15 May 90 09:23:39 PDT
To: end2end-interest
Subject: Hemant Kanakia alights -- at a gigabit

----- Begin Included Message -----

From kanakia@research.att.com Tue May 15 07:41:49 1990
From: kanakia@research.att.com
Date: Tue, 15 May 90 10:40:35 EDT
To: braden@ISI.EDU

Bob,

Hello. I have finally settled in Bell Labs and New Jersey. Feel more settled workwise after having decided on my long-term research plan. It seems I am going to start building a gigabit network. (Who isn't these days?) I think I have a neat idea for a fast packet switch - 16 by 16, 1 gigabit maximum line rate. Currently, the effort is funded internally and is ready to go to prototype phase. It is wonderful working in a place where management is technical and has the discretionary power to start major projects.

I miss being able to plug into the Internet community to take part in ongoing technical discussions.
That was easy to do at Stanford, courtesy Steve Deering and David Cheriton. Because of my relative isolation, I was wondering if you could add me to a mailing list where I recollect there was technical discussion about high-speed networks and other IETE-TG related issues. This is not an attempt to be a part of the task group. I am not sure I have at this stage much to contribute and may just add more smoke. I also do not want to travel much while I am trying to set up the project that has just started. I am interested in being able to read and exchange technical stuff mail with others working in this area, especially in the internet world. And I would appreciate any help you could give me there. Hemant Kanakia ----- End Included Message ----- From lixia@kandron.parc.xerox.com Tue May 15 12:35:57 1990 Posted-Date: Tue, 15 May 1990 12:45:17 PDT Received-Date: Tue, 15 May 90 12:35:57 -0700 Received: from arisia.Xerox.COM by venera.isi.edu (5.61/5.61+local) id ; Tue, 15 May 90 12:35:57 -0700 Received: from kandron.parc.Xerox.COM by arisia.Xerox.COM with SMTP (5.61+/IDA-1.2.8/gandalf) id AA22080; Tue, 15 May 90 12:35:55 -0700 Received: by kandron.parc.xerox.com (5.61+/IDA-1.2.8/gandalf) id AA17110; Tue, 15 May 90 12:45:18 PDT Sender: Lixia Zhang Date: Tue, 15 May 1990 12:45:17 PDT From: lixia@parc.xerox.com Reply-To: lixia@parc.xerox.com To: braden@venera.isi.edu Cc: end2end-interest@venera.isi.edu Subject: Re: Minutes of last E2E meeting In-Reply-To: Your message of Mon, 14 May 90 15:08:23 PDT Message-Id: Sorry, but this time I have to stand up against misquotes, also to share my REAL simulation results with others. Jacobson explained that he has proposed 2 different new algorithms: (1) slow start for LFNs ("keep the flywheel spinning when a packet is lost"). This is intended to improve TCP efficiency over LFNs by making it less sensitive to losses. 
The original slow-start had a throughput scaling as 1/(1 + pW); the revised algorithm scales as 1/(1 + p) [which is the best any protocol can do].

(2) increased fairness on a busy network. This is intended to add some hysteresis to account for the fact that a connection is more likely to have packets dropped in the first few round trip times after a drop (i.e., while slow-start has made its window artificially small). At the Xerox PARC meeting, Van suggested suitable hysteresis might result from giving a connection log2(W) round trip times of grace before a loss would change the congestion avoidance threshold (ssthresh). Subsequent thought and experiment showed this was actually a horrible idea if the window was large compared to the available network buffer resources.

Lixia implemented and simulated something very close to this second algorithm and found the same thing: it doesn't work very well on busy networks.

That is NOT what I simulated or found. I did observe the problem mentioned above, but I considered that as a separate one from the major problem with the algorithm (the January version of Slow-Start, SS(jan), as described in my msg to Bob). SS(jan) indeed strived to improve efficiency over LFNs by trying to keep the pipe full in the following ways:

(a) When receiving a dup_ack and #dup_ack < N (some threshold), cwnd += max_segment;
    An ack (duplicate or not) means that a packet has left the network; but a duplicate ack does not advance the flow control window, so cwnd is increased to get a new packet out to keep the pipe full.

(b) Upon detection of a pkt loss by dup_ack (i.e. #dup_ack == N):
    o Retransmit the missing pkt immediately. (I forgot to mention this detail in the earlier msg -- LZ).
    o If cwnd < ssthresh, do not change cwnd or ssthresh.
      (cwnd is never below ssthresh in this case, if the network load--the number of users--is not drastically increased)
    o If cwnd >= ssthresh: ssthresh = (cwnd - #dup_acks) / 2, and cwnd = ssthresh + #dup_ack
      This reduces cwnd to half (instead of maxseg, as in old Slow-Start), taking into account that cwnd has been, and should stay, inflated by the #dup_acks.

What I saw as a major problem exposed in the simulation is that, after detecting #dup_ack == N :
  - now there are cwnd packets (about half of the pipe+buffer size) outstanding; the last pkt sent was the retransmission of the lost pkt.
  - when the loss is recovered -- the hole filled -- the sender gets an ack back for all previously outstanding pkts; it then sends cwnd (half of pipe+buffer!) packets in one BURST.
    * If the pipe is bigger than the switch buffer, pkts are lost instantly (that's why I said earlier that SS(jan) was a big loss when pipe > buffer).
    * sending out big bursts of packets certainly causes stability concerns.

In short, filling a hole results in an ack for all outstanding pkts. If we don't close cwnd down to bottom, we have to have a way to prevent packet bursts at this time.

Van's latest revision of Slow-Start (in his 4/30 msg, SS(4/30) -- I have to timestamp his revisions :-) solved this problem:
  - when #dup_ack < N, do nothing
  - when #dup_ack == N, retransmit the missing pkt; reduce cwnd to half, H.
  - upon receiving every further dup_ack, cwnd += maxseg
    This tries to KEEP H pkts in the net by inflating cwnd by dup_acks -- the key point in the solution.
  - when the hole is filled, immediately reset cwnd to H (deflation).

This seems to work nicely for single loss cases. I'll have to simulate to see what happens if there are 2 losses in a row.
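The four SS(4/30) steps listed above can be sketched as a small piece of sender state in C. This is a reading of the list, not Van's code: the struct and function names, the value chosen for N (the message leaves it as "some threshold"), and the use of bytes as the window unit are assumptions for illustration. Ordinary window growth on new acks outside of loss recovery is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>

#define DUP_THRESH 3     /* N; hypothetical value, the text does not fix it */

struct ss_state {
    unsigned cwnd;       /* congestion window, in bytes (assumed unit) */
    unsigned maxseg;     /* segment size, in bytes */
    unsigned half;       /* H: the halved window chosen at loss detection */
    unsigned dup_acks;   /* consecutive duplicate acks seen so far */
    int      recovering; /* nonzero from loss detection until the hole fills */
};

/* Process one duplicate ack. Returns 1 when the caller should retransmit
 * the missing packet now, 0 otherwise. */
int ss_on_dup_ack(struct ss_state *s)
{
    assert(s != NULL);
    s->dup_acks++;
    if (s->dup_acks < DUP_THRESH)
        return 0;                   /* below N: do nothing */
    if (s->dup_acks == DUP_THRESH) {
        s->half = s->cwnd / 2;      /* reduce cwnd to half, H */
        s->cwnd = s->half;
        s->recovering = 1;
        return 1;                   /* retransmit the missing pkt */
    }
    s->cwnd += s->maxseg;           /* inflate: keep H pkts in the net */
    return 0;
}

/* Process an ack that advances the window (the hole has been filled). */
void ss_on_new_ack(struct ss_state *s)
{
    assert(s != NULL);
    if (s->recovering) {
        s->cwnd = s->half;          /* deflate immediately back to H */
        s->recovering = 0;
    }
    s->dup_acks = 0;
}
```

The deflation in ss_on_new_ack is what prevents the burst described for SS(jan): the ack that fills the hole covers all outstanding packets, but the window it opens is only H, not the inflated value.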
Lixia

From craig@NNSC.NSF.NET Wed May 16 08:44:55 1990
Posted-Date: Wed, 16 May 90 11:44:29 -0400
Received-Date: Wed, 16 May 90 08:44:55 -0700
Message-Id: <9005161544.AA20344@venera.isi.edu>
Received: from nnsc.nsf.net by venera.isi.edu (5.61/5.61+local) id ; Wed, 16 May 90 08:44:55 -0700
To: end2end-interest
Subject: conference on high speed networks
Date: Wed, 16 May 90 11:44:29 -0400
From: Craig Partridge

A call just landed on my desk, courtesy of Marjorie Johnson. It is advertised as the 3rd conference on high speed networking, to be held in Berlin, 18-22 March 1991. Sponsor is IFIP WG 6.4 with DETECON and GMD-FOKUS. General chairs are Popescu-Zeletin and Kanzow, Program chairs are Danthine (Univ. Liege) and Spaniol (Univ. Aachen). The usual suspects on the program committee. Papers are due 15 September. Central e-mail box is embox@fokus.berlin.gmd.dbp.de

Craig

From braden Wed May 16 10:18:32 1990
Received-Date: Wed, 16 May 90 10:18:32 -0700
Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Wed, 16 May 90 10:18:32 -0700
Date: Wed, 16 May 90 10:19:18 PDT
From: braden
Posted-Date: Wed, 16 May 90 10:19:18 PDT
Message-Id: <9005161719.AA09223@braden.isi.edu>
Received: by braden.isi.edu (4.1/4.0.3-4) id ; Wed, 16 May 90 10:19:18 PDT
To: craig@nnsc.nsf.net, end2end-interest
Subject: Re: conference on high speed networks

Boy, if you like to write papers and give talks on high speed networking, you can have a full time job just going to conferences!
Bob From guru@flora.wustl.edu Wed May 23 11:11:40 1990 Posted-Date: Wed, 23 May 90 13:11:53 -0500 Received-Date: Wed, 23 May 90 11:11:40 -0700 Received: from wucs1.wustl.edu by venera.isi.edu (5.61/5.61+local) id ; Wed, 23 May 90 11:11:40 -0700 Return-Path: Received: from flora.wustl.edu by wucs1.wustl.edu (5.59/1.35); id AA05930; Wed, 23 May 90 13:11:12 CDT Received: from localhost by flora.wustl.edu (4.0/SMI-4.0) id AA03124; Wed, 23 May 90 13:11:54 CDT Message-Id: <9005231811.AA03124@flora.wustl.edu> To: braden@venera.isi.edu Cc: deering@pescadero.stanford.edu, end2end-interest@venera.isi.edu Subject: Re: Multicast Routing In-Reply-To: Your message of Tue, 01 May 90 16:27:20 -0700. <9005012327.AA02617@braden.isi.edu> Date: Wed, 23 May 90 13:11:53 -0500 From: Gurudatta Parulkar As you know, there is a low-key war going on these days to determine which routing protocol, OSPF or extended IS-IS, will become the standard open intra-AS routing protocol for the Internet. The war will probably go on for awhile... I am concerned about multicast routing. You have been working for some time (a year?) with the OSPF folks to put multicast routing into OSPF. What is the status of that, by the way? Have you given any thought to the difficulty of putting multicast routing into extended IS-IS? I am worried that the following sequence of events is possible: (1) ANSI standardizes the Internet extensions for IS-IS -- Ross is chugging away at that -- without multicast routing, and subsequently (2) the IETF/IAB decides to adopt IS-IS rather than OSPF. This sequence could make widespread deployment of intra-AS multicast routing within a reasonable time frame much more difficult. Comments? Sorry I am responding so late. Finding an optimal multicast route is an NP complete problem, and one has to depend on heuristics which are hard to decide for something as complex and dynamic as the Internet. 
Moreover, if the multicast group is very dynamic (end points come and go often), maintaining a close to optimal tree is very difficult. These are difficulties in addition to what we see in point to point routing algorithms (stability, convergence, etc.) Given this I wonder how one could possibly work out a multicast routing solution in the short time before adoption of OSPF or IS-IS as an internet standard.

Was there any discussion on these issues following Bob's message (I missed that and would appreciate receiving those messages again)?

Thanks.
-guru

From deering@Pescadero.Stanford.EDU Wed May 23 12:47:28 1990
Posted-Date: 23 May 1990 12:04-PDT
Received-Date: Wed, 23 May 90 12:47:28 -0700
Received: from Pescadero.Stanford.EDU by venera.isi.edu (5.61/5.61+local) id ; Wed, 23 May 90 12:47:28 -0700
Received: by Pescadero.Stanford.EDU (5.59/25-eef) id AA14253; Wed, 23 May 90 12:46:45 PDT
Date: 23 May 1990 12:04-PDT
From: Steve Deering
Subject: Re: Multicast Routing
To: Gurudatta Parulkar
Cc: end2end-interest@venera.isi.edu
Message-Id: <90/05/23 1204.276@pescadero.stanford.edu>
In-Reply-To: Gurudatta Parulkar's message of Wed, 23 May 90 131153 -0500

> From: Gurudatta Parulkar
>
> Finding an optimal multicast route is an NP complete problem, and one
> has to depend on heuristics which are hard to decide for something as
> complex and dynamic as the Internet.

That depends on what you are trying to optimize. Yes, least-cost multicast routing, in which the goal is to minimize the network cost (e.g., total number of packet-hops), is NP-complete (it's the Steiner tree problem). On the other hand, shortest-path multicast routing, in which the goal is minimizing the path length (usually a measure of delay) to each multicast receiver from the sender, is straightforward -- shortest-path trees are derived trivially from the unicast routing tables.

I claim that minimizing multicast delay is much more important than minimizing network cost.
Many multicast applications, such as conferencing and resource location, are delay-sensitive, while network resources (bandwidth, processing, memory) keep getting cheaper. Besides, in "typical" topologies, the network costs of shortest-path routing are not much worse than least-cost routing (ref. Bharath-Kumar and Jaffe, IEEE Trans. Communications, March 1983). > More over, if the multicast group is very dynamic (end points come and > go often), maintaining a close to optimal tree is very difficult. Maintaining shortest-path trees for dynamic multicast groups is not difficult -- all the hard work is done by the unicast routing protocol. See my paper from SIGCOMM '88 for examples of how to do shortest-path multicast routing in a distance-vector, link-state, or single-spanning- tree (i.e., bridge) environment. Of course, there are some significant scaling issues that arise with shortest-path multicast routing; they are handled by the usual mechanisms for dealing with scaling problems, such as caching, hierarchy, and scoping. > Was there any discussion on these issues following Bob's message (I > missed that and appreciate receiving those messages again)? There was only a single reply message from me, which I have appended here in case anyone else missed it. ---------------------------------------------------------------------- Date: 11 May 1990 14:48-PDT From: Steve Deering Subject: Re: Multicast Routing To: braden@venera.isi.edu Cc: end2end-interest@venera.isi.edu Message-Id: <90/05/11 1448.847@pescadero.stanford.edu> In-Reply-To: braden's message of Tue, 1 May 90 162720 PDT > I am concerned about multicast routing. You have been working for > some time (a year?) with the OSPF folks to put multicast routing > into OSPF. What is the status of that, by the way? It's been about 8 months since I first met with Jon Moy to discuss multicast routing extensions for OSPF. 
Since that time, an IETF working group has been formed (with me as chair) to develop a specification for the multicast OSPF extensions. We had a productive meeting at the Pittsburgh IETF last week, at which we reviewed the extensions required, resolved some outstanding issues, and identified some new ones. John Moy has volunteered to write up the specification (probably in the form of an appendix to the current OSPF spec.), and he hopes to have that ready for review before the August IETF meeting. The implementors of OSPF at U. Maryland and Cornell are eager to implement the multicast stuff as soon as we can agree on a spec. > Have you given any thought to the difficulty of putting multicast > routing into extended IS-IS? Yes, I have thought about it, but not in great detail. I can't immediately see why it should be any more difficult to add multicast routing to IS-IS than it is for OSPF. I had assumed that whatever we learn from the OSPF effort will map directly into IS-IS (at least for IP multicasting -- CLNP multicasting raises some additional issues, related to addressing/multihoming/etc.). > I am worried that the following sequence of events is possible: > (1) ANSI standardizes the Internet extensions for IS-IS -- Ross is > chugging away at that -- without multicast routing, and subsequently > (2) the IETF/IAB decides to adopt IS-IS rather than OSPF. This > sequence could make widespread deployment of intra-AS multicast > routing within a reasonable time frame much more difficult. Yes, that scenario is possible. (As an IAB member, perhaps you have a better idea than I do how likely it is.) I have been more concerned about the reality of OSPF currently being deployed at a number of places, without the multicast extensions in place yet. Do you have any reason to believe that the IS-IS people would (or would not) be receptive to multicast routing extensions? I am certainly willing to talk to Ross about it, but I cannot commit to any more committees at this time. 
Steve From guru@flora.wustl.edu Wed May 23 13:20:21 1990 Posted-Date: Wed, 23 May 90 15:20:49 -0500 Received-Date: Wed, 23 May 90 13:20:21 -0700 Received: from wucs1.wustl.edu by venera.isi.edu (5.61/5.61+local) id ; Wed, 23 May 90 13:20:21 -0700 Return-Path: Received: from flora.wustl.edu by wucs1.wustl.edu (5.59/1.35); id AA08259; Wed, 23 May 90 15:20:07 CDT Received: from localhost by flora.wustl.edu (4.0/SMI-4.0) id AA03316; Wed, 23 May 90 15:20:51 CDT Message-Id: <9005232020.AA03316@flora.wustl.edu> To: Steve Deering Cc: end2end-interest@venera.isi.edu Subject: Re: Multicast Routing In-Reply-To: Your message of 23 May 90 12:04:00 -0700. <90/05/23 1204.276@pescadero.stanford.edu> Date: Wed, 23 May 90 15:20:49 -0500 From: Gurudatta Parulkar > On the other hand, shortest-path multicast routing, in which the goal > is minimizing the path length (usually a measure of delay) to each > multicast receiver from the sender, is straightforward -- shortest-path > trees are derived trivially from the unicast routing tables. I am sorry but I don't understand the shortest path trees as opposed to least cost tree and how they are trivially derived from unicast routing tables. Could you please explain little more ? Thanks very much for your response. -guru From deering@Pescadero.Stanford.EDU Wed May 23 14:51:55 1990 Posted-Date: 23 May 1990 13:42-PDT Received-Date: Wed, 23 May 90 14:51:55 -0700 Received: from Pescadero.Stanford.EDU by venera.isi.edu (5.61/5.61+local) id ; Wed, 23 May 90 14:51:55 -0700 Received: by Pescadero.Stanford.EDU (5.59/25-eef) id AA14731; Wed, 23 May 90 14:51:26 PDT Date: 23 May 1990 13:42-PDT From: Steve Deering Subject: Re: Multicast Routing To: Gurudatta Parulkar Cc: end2end-interest@venera.isi.edu Message-Id: <90/05/23 1342.584@pescadero.stanford.edu> In-Reply-To: Gurudatta Parulkar's message of Wed, 23 May 90 152049 -0500 OK, here's a simple example. Imagine the "O"s are network nodes, joined by point-to-point links as illustrated. 
The node labelled "sender" is sending multicast packets to the group consisting of the two nodes labelled "member". The "*"s are supposed to be arrow heads, indicating the direction of packet flow.

         sender                        sender
            O                             O
           / \                           / \
          /   \                         /   \
         *     \                       *     *
        O       O                     O       O
       /         \                   /         \
      /           \                 /           \
     *             \               *             *
    O--------------*O             O---------------O
 member          member        member          member

  Least-Cost Multicast         Shortest-Path Multicast

Assume that each link has cost 1 and delay 1.

under least-cost multicast routing:
  - the network cost of each multicast packet (i.e., the total number of links traversed) is 3, which is the minimum possible in this topology.
  - the delivery delay to one member is 2, and to the other member is 3.

under shortest-path multicast routing:
  - the network cost of each multicast packet is 4.
  - the delay to each member is 2, which is the minimum possible in this topology.

Computing least-cost trees in an arbitrary topology is NP-complete. Computing shortest-path trees in an arbitrary topology is what unicast routing protocols do; for example, Dijkstra's algorithm is used to compute shortest-path trees in most link-state routing protocols.

This distinction between least-cost and shortest-path (or minimum-delay) multicasting, and the trade-offs between the two, has been well covered in the multicast routing literature; see, for example, the paper by Bharath-Kumar and Jaffe that I cited, or David Wall's thesis (which is cited in every other paper on multicasting). Of course, the theorists mainly write papers about least-cost multicast routing, each trying to out-heuristic the previous one; shortest-path multicast routing is already a "solved problem". I claim that shortest-path is what we want for multicasting in real internetworks.

For details on possible ways to do shortest-path multicast routing in a datagram internetwork, I again refer you to my SIGCOMM '88 paper (and to my (ever) forthcoming thesis).
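The numbers in the five-node example above can be checked with a short C sketch. It builds the shortest-path tree by breadth-first search, which is enough here because every link has delay 1 (a real link-state router would run Dijkstra's algorithm over weighted links), and it compares that tree with the least-cost tree entered by hand. The node labels and function names are made up for the illustration.

```c
#include <assert.h>

#define NNODES 5
/* Hypothetical labels for the five nodes in the example topology. */
enum { SENDER, LEFT, RIGHT, MEMBER_A, MEMBER_B };

/* Adjacency matrix: sender to both middle nodes, each middle node to the
 * member below it, and one link between the two members. */
static const int adj[NNODES][NNODES] = {
    /* SENDER   */ {0, 1, 1, 0, 0},
    /* LEFT     */ {1, 0, 0, 1, 0},
    /* RIGHT    */ {1, 0, 0, 0, 1},
    /* MEMBER_A */ {0, 1, 0, 0, 1},
    /* MEMBER_B */ {0, 0, 1, 1, 0},
};

/* Breadth-first search from root; parent[v] is the previous hop toward
 * the root, or -1 for the root itself. */
void shortest_path_tree(int root, int parent[NNODES])
{
    int queue[NNODES], seen[NNODES] = {0};
    int head = 0, tail = 0;
    assert(root >= 0 && root < NNODES);
    for (int i = 0; i < NNODES; i++)
        parent[i] = -1;
    seen[root] = 1;
    queue[tail++] = root;
    while (head < tail) {
        int u = queue[head++];
        for (int v = 0; v < NNODES; v++) {
            if (adj[u][v] && !seen[v]) {
                seen[v] = 1;
                parent[v] = u;
                queue[tail++] = v;
            }
        }
    }
}

/* Delay to a node: number of unit-delay links on its path from the root. */
int tree_delay(const int parent[NNODES], int node)
{
    int hops = 0;
    while (parent[node] != -1) {
        node = parent[node];
        hops++;
    }
    return hops;
}

/* Network cost: number of distinct links a packet crosses to reach all
 * members, counting shared links once. */
int tree_cost(const int parent[NNODES], const int *members, int nmembers)
{
    int used[NNODES] = {0};   /* used[v]: link parent[v] -> v already counted */
    int cost = 0;
    for (int i = 0; i < nmembers; i++) {
        for (int v = members[i]; parent[v] != -1 && !used[v]; v = parent[v]) {
            used[v] = 1;
            cost++;
        }
    }
    return cost;
}
```

Feeding tree_cost and tree_delay the hand-built least-cost tree (sender to left node, left node to one member, that member to the other) gives cost 3 with delays 2 and 3, while the tree from shortest_path_tree gives cost 4 with delay 2 to both members.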
Steve

From braden Mon May 28 10:54:20 1990
Received-Date: Mon, 28 May 90 10:54:20 -0700
Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Mon, 28 May 90 10:54:20 -0700
Date: Mon, 28 May 90 10:05:17 PDT
From: braden
Posted-Date: Mon, 28 May 90 10:05:17 PDT
Message-Id: <9005281705.AA01497@braden.isi.edu>
Received: by braden.isi.edu (4.1/4.0.3-4) id ; Mon, 28 May 90 10:05:17 PDT
To: lixia@parc.xerox.com, van@helios.ee.lbl.gov
Subject: Re: RFC 1072 Implementations Needed
Cc: end2end-interest

From lixia@kandron.parc.xerox.com Wed May 9 10:50:47 1990
Sender: Lixia Zhang
Date: Wed, 9 May 1990 10:58:48 PDT
From: lixia@parc.xerox.com
Reply-To: lixia@parc.xerox.com
To: Van Jacobson , braden@ISI.EDU
Subject: Re: RFC 1072 Implementations Needed
In-Reply-To: Your message of Fri, 23 Mar 90 19:40:45 PST
Cc: lixia@parc.xerox.com

Bob and Van,

Here's some detailed explanation re. my earlier msg about RFC1072 implementation. I suggested changing Van's algorithm:

1) if an arriving packet is in sequence, record its timestamp and accept it normally.
2) Otherwise, if the packet is outside the window, reject it. (normal tcp processing)
3) Otherwise, if the packet timestamp is less than the timestamp of the most recently received in-sequence packet, treat the packet as if it is outside the window.
4) Otherwise treat the packet as a normal in-window, out-of-sequence tcp packet (e.g., queue it for reassembly).

by moving step (1) to be after step (3).

The advantage of this change is to get rid of the problem of mistaking an old pkt (which happens to bear the expected seq#) for a new one. The disadvantage is that it may falsely reject valid pkt(s) in the following case: Assume a sender transmitted pkts A, B, C, D, E, F, G in sequence, and pkt C is lost and retransmitted. If D arrived after the rxt'ed C, D will be rejected (and has to be rxted) since it has a timestamp smaller than the rxt'ed C.

Van, Lixia's change seems right to me.
The exception she notes (confusion if retransmitted segment arrives badly out of order) would be fixed if the retransmitted segment carried the same timestamp as the original. That would be logically consistent with the timestamp as an extension of the sequence number space. Unfortunately, you also plan to use the timestamp for measuring RTT, and that usage requires that the retransmission timestamp be the current time. Too bad!

Could we imagine sending TWO timestamps in a retransmission, one the original timestamp, and the other the current time? The receiver would be "trained" to echo only the later of the two.

Bob

From craig@NNSC.NSF.NET Tue May 29 07:18:53 1990
Posted-Date: Tue, 29 May 90 10:18:04 -0400
Received-Date: Tue, 29 May 90 07:18:53 -0700
Message-Id: <9005291418.AA19384@venera.isi.edu>
Received: from WS6.NNSC.NSF.NET by venera.isi.edu (5.61/5.61+local) id ; Tue, 29 May 90 07:18:53 -0700
To: end2end
Subject: sigcomm '90
Date: Tue, 29 May 90 10:18:04 -0400
From: Craig Partridge

I sent the SIGCOMM program and registration out today to the IETF and TCP-IP lists. As is normal, the E2E RG is well represented at SIGCOMM (papers by Eric, Scott, DDC, Lixia and Jon and a course taught by Van) and the general level of papers is quite high so I expect a fair number of the E2E members are likely to want to attend.

I'm writing to encourage you to register early (before 9/1) because there's a small chance we're going to run out of space at the conference. We're doing a big publicity blitz for the first time [direct mailing, ads in non-ACM journals, etc] and may overflow our hotel space.
Craig

From postel Wed May 30 14:01:17 1990
Received-Date: Wed, 30 May 90 14:01:17 -0700
Received: from bel.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Wed, 30 May 90 14:01:17 -0700
Date: Wed, 30 May 90 14:01:30 PDT
From: postel
Posted-Date: Wed, 30 May 90 14:01:30 PDT
Message-Id: <9005302101.AA05264@bel.isi.edu>
Received: by bel.isi.edu (4.1/4.0.3-4) id ; Wed, 30 May 90 14:01:30 PDT
To: end2end-interest
Subject: IP Timestamp option
Cc: finn

Hi.

Greg Finn has been doing some experiments with IP congestion control and needed a time stamp on IP datagrams. Here is a description of what he has been using.

--jon.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

UNIX TIMESTAMPS

BSD 4.3 UNIX represents its time-of-day as seconds since its origin time of January 1, 1970 UTC. The structure used to hold this is a timeval. The long int tv_sec holds the number of seconds since origin and the long int tv_usec holds the fractional portion of a second in microseconds. It is not necessarily true that tv_usec always is less than 1,000,000. The effective grain of resolution is the OS's scheduling interval, called a `tick'. A global int, tick (see ), contains the duration of a scheduling interval in microseconds. The number of ticks per second is defined at startup in the int hz. For the VAX the clock ticks at 100 Hz; for SUN 3/xxx models the hardclock() routine discards every other tick, resulting in a scheduling tick of 50 Hz.

    struct timeval {
        long tv_sec;
        long tv_usec;
    };

IP Source Timestamp Option

The IP/SQ algorithm requires that a system transmission time-stamp be placed into outgoing datagrams. This is accomplished just before the datagram is handed to the local network interface routines for transmission. An IP header option, IPOPT_STS, is used for this purpose. If the header of the datagram is already too full and the option cannot be inserted, the datagram is transmitted without containing the option.
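The insert-if-it-fits step can be sketched in C as follows, using the 10-octet layout (type 205, length 10, four octets of seconds, four of microseconds) given in the option description that follows. Several things here are assumptions rather than part of Greg's note: the function names, sending the two fields in network (big-endian) byte order, and leaving any padding to a 32-bit boundary to the caller. The 40-octet limit follows from the 60-octet maximum IP header less the 20-octet fixed part.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IP_MAX_OPTLEN 40   /* 60-octet maximum IP header minus 20 fixed octets */
#define STS_LEN       10   /* total length of the Source Timestamp option */

/* Build an IP option type octet: copy flag (1 bit), class (2 bits),
 * option number (5 bits). */
uint8_t ipopt_type(unsigned copy, unsigned opt_class, unsigned number)
{
    assert(copy <= 1 && opt_class <= 3 && number <= 31);
    return (uint8_t)((copy << 7) | (opt_class << 5) | number);
}

/* Append a Source Timestamp option to the options area if there is room.
 * Returns 1 if the option was inserted, 0 if the header is too full, in
 * which case the datagram simply goes out without the option.
 * Byte order is assumed to be big-endian; the note does not specify it. */
int sts_append(uint8_t opts[IP_MAX_OPTLEN], size_t *optlen,
               uint32_t sec, uint32_t usec)
{
    uint8_t *p;
    int i;
    assert(opts != NULL && optlen != NULL && *optlen <= IP_MAX_OPTLEN);
    if (*optlen + STS_LEN > IP_MAX_OPTLEN)
        return 0;
    p = opts + *optlen;
    *p++ = ipopt_type(1, 2, 13);        /* copy=1, class=2, number=13: 205 */
    *p++ = STS_LEN;
    for (i = 3; i >= 0; i--)
        *p++ = (uint8_t)(sec >> (8 * i));
    for (i = 3; i >= 0; i--)
        *p++ = (uint8_t)(usec >> (8 * i));
    *optlen += STS_LEN;
    return 1;
}
```

The type octet works out as 128 + 64 + 13 = 205, which matches the value quoted for IPOPT_STS.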
The new IP option, the Source Timestamp option (IPOPT_STS), is indicated by an IP option field value of 205. The IP option field octet breaks down as:

    Copy Flag   bit  0:     1   Copy this option on fragmentation
    Class       bits 1-2:   2   Option's use is for measurement
    Type        bits 3-7:  13   Type thirteen for Source Timestamp option

     Option Length <------- Seconds ---------><------ Microseconds ----->
     _____________________________________________________________________
    |      |      |      |      |      |      |      |      |      |      |
    | 205  |  10  |      |      |      |      |      |      |      |      |
    |______|______|______|______|______|______|______|______|______|______|

    struct sts_timestamp {
        u_char sts_code;    /* IPOPT_STS (205) */
        u_char sts_len;     /* size of structure (10) */
        long   sts_sec;     /* seconds since Jan. 1, 1970 UTC */
        long   sts_usec;    /* added microseconds offset */
    };

The length field for this option always contains the value ten. The time of day is found by summing the seconds and microseconds fields.

The seconds field is a four-octet integer. It contains the four byte tv_sec field (seconds since Jan. 1, 1970) from the timeval structure that is returned by UNIX as its daytime clock.

The microseconds field is a four byte integer that contains the tv_usec field from the timeval structure. It contains the fractional part of a second in units of microseconds. The UNIX hardclock() routine should rarely let the tv_usec field hold an amount of time much larger than one second plus a scheduling tick (for BSD UNIX systems, the basic scheduling tick is approximately 1/100th of a second for a VAX, on SUN 3/xxx models it is 1/50 second). The routines that receive SQ messages should take care to normalize the timeval passed back in a Source Timestamp option. Illegal values must be detected and no congestion control related actions taken as a result of its reception.

This option allows the source to specify its time-of-day clock in the header of an outgoing IP packet. In the case of the IP/SQ algorithm, a returning SQ message contains the original time-stamp.
Therefore, the source can determine the round-trip time to the gateway that built the SQ message. Although the option could be reduced by two and perhaps three octets, preserving compatibility with the UNIX standard format was considered more important. A reason to decrease the size of this option is that the IP/SQ algorithm effectively increases the size of each IP datagram by 10 octets. This deserves some more consideration at a later date. ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ From Mills@udel.edu Sun Jun 3 20:17:47 1990 Posted-Date: Mon, 4 Jun 90 3:15:33 GMT Received-Date: Sun, 3 Jun 90 20:17:47 -0700 Received: from huey.udel.edu by venera.isi.edu (5.61/5.61+local) id ; Sun, 3 Jun 90 20:17:47 -0700 Date: Mon, 4 Jun 90 3:15:33 GMT From: Mills@udel.edu To: postel@venera.isi.edu Cc: end2end-interest@venera.isi.edu, finn@venera.isi.edu Subject: Re: IP Timestamp option Message-Id: <9006032315.aa11190@huey.udel.edu> Greg, Do I understand that your intent is to etch the timestamp in the options field as the driver actually sends the packet, or at the time the packet is created? Also, if your intent is as an applique to use in specific experiments, rather than as a generic facility, then your timestamp format is probably appropriate; however, if you expect the feature to be used widely, then I suggest you might consider the NTP timestamp format, which is a 64-bit quantity with 32 bits as the fraction part. In principle, this gives you precisions down to 232 picoseconds, which allows you to measure the propagation time for on-chip loopback. 
Dave From finn Mon Jun 4 11:13:49 1990 Received-Date: Mon, 4 Jun 90 11:13:49 -0700 Received: from dalek.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Mon, 4 Jun 90 11:13:49 -0700 Date: Mon, 4 Jun 90 11:14:21 PDT From: finn Posted-Date: Mon, 4 Jun 90 11:14:21 PDT Message-Id: <9006041814.AA00773@dalek.isi.edu> Received: by dalek.isi.edu (4.1/4.0.3-4) id ; Mon, 4 Jun 90 11:14:21 PDT To: Mills@udel.edu Cc: postel, end2end-interest Cc: Finn Subject: IP Timestamp option Dave, I have the IP driver place the time-stamp into the IP header just as it's handed to the local network driver output routine. The assumption is that the driver inserts negligible delay before actual transmission. In any case, what delay it inserts on average would figure into the round-trip time that my algorithm uses. As a practical matter, the effectively usable grain appears to be 20000 microseconds, which is a scheduling interval on SUN 3/xxx models. That is, the times that I actually see are so rounded due to the hardclock() code in the kernel. Higher precision is not practical unless all BSD kernels are altered. I do not think that that is likely in the current circumstance. My principal concern is that the format is immediately available in a standard UNIX implementation. It is also computationally efficient in that sense, since no conversion is performed into or out from the workstation. Note that this UNIX format was specifically created by the UNIX designers to be a resolution-independent date/time quantity. The principal drawback is its length in bytes which is now ten. If you are wondering whether the format would be better if different, I maintain that the first thing to consider would be shortening it to six bytes of total magnitude. This would allow the time-stamp to be inserted with no padding as its total length would then be eight bytes and so evenly divisible by 32-bit header length count. Each datagram would then be four bytes shorter. 
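The padding arithmetic behind the four-byte saving can be sketched in C. This is only an illustration, assuming the timestamp is the sole option in the header; `padded_len` is a hypothetical helper name, not taken from any kernel source.

```c
#include <assert.h>

/* Round an IP options area up to the next multiple of 4 octets, since the
 * IP header length field counts 32-bit words.  (Hypothetical helper.) */
static unsigned padded_len(unsigned option_octets)
{
    /* The IP header is at most 60 octets, 20 of which are fixed,
     * so options can occupy at most 40 octets. */
    assert(option_octets <= 40u);
    return (option_octets + 3u) & ~3u;
}

/* Octets occupied in the header by a timestamp option that carries
 * `magnitude` octets of time after its type and length octets. */
static unsigned sts_header_cost(unsigned magnitude)
{
    return padded_len(2u + magnitude);
}
```

With these definitions, sts_header_cost(8) is 12 (ten octets padded to twelve) and sts_header_cost(6) is 8, which is the four-octet difference per datagram described above. If other options share the header, the saving may differ.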
One could also allow seven bytes of magnitude if the option became one with a known fixed length, so saving the length byte. That is not practical though, as all IPs everywhere would need to be recoded. Restating my personal position, I maintain that a change of format would best be served by moving to a six byte format. An increase in precision is not generally useful in a standard kernel. A more precise format, such as is made available by add-on cards, of necessity requires non-standard kernel modifications. At this point we can argue the relative merits of maintaining compatibility. Obviously, although its length does bother me, I do not feel increase of precision a compelling reason for change to a non-standard format. No doubt this will occasion some discussion. Sincerely, --- ggf From Mills@udel.edu Mon Jun 4 12:13:45 1990 Posted-Date: Mon, 4 Jun 90 19:09:24 GMT Received-Date: Mon, 4 Jun 90 12:13:45 -0700 Received: from huey.udel.edu by venera.isi.edu (5.61/5.61+local) id ; Mon, 4 Jun 90 12:13:45 -0700 Date: Mon, 4 Jun 90 19:09:24 GMT From: Mills@udel.edu To: finn@venera.isi.edu Cc: Mills@udel.edu, postel@venera.isi.edu, end2end-interest@venera.isi.edu, Finn@venera.isi.edu Subject: Re: IP Timestamp option Message-Id: <9006041509.aa16507@huey.udel.edu> Greg, You are probably familiar with much of the hoary discussion that led to the choice of timestamp format in ICMP, UDP/TIME and UDP/NTP. The UDP/NTP choice, seconds past 1 January 1900 to 64 bits with 32 bits fraction, was arrived with not a little bit of argument, analysis and debate. Conversion to/from ICMP, UDP/TIME and Unix format requires one multiply/divide operation. A considerable body of information bearing on this issue can be found in the (amended) NTP specification now reclining in the files pub/ntp/ntprh.ps, -/ntpr.ps and -/ntpra.ps on louie.udel.edu, should you be adventurous enough to snarf those couple of megabytes. 
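The conversion between the UNIX timeval format and the 64-bit NTP format mentioned above can be sketched as follows. This is a minimal illustration, not code from the NTP distribution: `struct ntp_ts`, `unix_to_ntp` and `ntp_to_unix` are hypothetical names, and the arguments are assumed to be an already normalized timeval. The only constant involved is the 2208988800 seconds (25567 days) between 1 Jan 1900 and 1 Jan 1970.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds from 0000 1 Jan 1900 (NTP origin) to 0000 1 Jan 1970 (UNIX origin). */
#define NTP_UNIX_OFFSET UINT32_C(2208988800)

struct ntp_ts {          /* hypothetical name for the 64-bit NTP timestamp */
    uint32_t sec;        /* seconds since 1900 */
    uint32_t frac;       /* fraction of a second, in units of 2^-32 s */
};

/* UNIX (seconds, microseconds) to NTP: one multiply (shift) and one divide. */
static struct ntp_ts unix_to_ntp(uint32_t tv_sec, uint32_t tv_usec)
{
    struct ntp_ts t;
    assert(tv_usec < 1000000u);              /* reject unnormalized input */
    t.sec  = tv_sec + NTP_UNIX_OFFSET;       /* wraps modulo 2^32 */
    t.frac = (uint32_t)(((uint64_t)tv_usec << 32) / 1000000u);
    return t;
}

/* NTP to UNIX.  The fraction is truncated, so a round trip through both
 * functions can lose up to one microsecond. */
static void ntp_to_unix(struct ntp_ts t, uint32_t *tv_sec, uint32_t *tv_usec)
{
    *tv_sec  = t.sec - NTP_UNIX_OFFSET;
    *tv_usec = (uint32_t)(((uint64_t)t.frac * 1000000u) >> 32);
}
```

One unit of the fraction field is 2^-32 second, roughly 232 picoseconds, so half a second (500000 microseconds) maps to a fraction of 0x80000000.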
I would regard a driver-timestamp option as very valuable and especially useful in cases where considerable encryption overhead, output queue processing, etc., are involved. Of course, the encryption case is tricky, since the driver timestamp may not be under the crypto-checksum span. Nevertheless, should such an option become available, I would nominate some 2000+ NTP users that could trim their time-transfer error budget in worthwhile ways. All this seems to argue for a Unix-independent format. While it might be convenient to repackage the format for six octets, the additional overhead for, say, the eight-octet NTP timestamp may not be particularly burdensome. Did I miss a companion proposal for an input-driver timestamp? For the record, fuzzballs stamp every packet buffer upon arrival at the driver and can stamp a reply message to the user program when a buffer has departed the driver. In principle, this provides a time adjustment that can be used in subsequent processing, so that the error budgets can be reduced to the neighborhood of microseconds. Finally, the SPARCstation hardware is allegedly capable of most accurate timestamps using onboard hardware; however, Sun apparently broke the timekeeping function in at least two ways yet to be resolved. Meanwhile, from my experience with NTP on many VAX and Sun platforms, you should be able to maintain precision pretty much at the level of the kernel tick variable - 10/20 ms in those machines. Fuzzballs have had squid-like output rate controls since circa 1983, but this was for much lower-speed circuits than I think you have in mind. Back then I was horsing 1200-bps links and trying to equalize fairness on the driver queues. Fuzzies estimate the roundtrip delay on every transmitted segment initial sequence number, as well as the smoothed rate in octets per second of the TCP right window edge. At one time I had a D/A converter and panel meter rigged to watch these numbers. Made an interesting demonstration item.
Dave From finn Mon Jun 4 15:03:10 1990 Received-Date: Mon, 4 Jun 90 15:03:10 -0700 Received: from dalek.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Mon, 4 Jun 90 15:03:10 -0700 Date: Mon, 4 Jun 90 15:03:42 PDT From: finn Posted-Date: Mon, 4 Jun 90 15:03:42 PDT Message-Id: <9006042203.AA00803@dalek.isi.edu> Received: by dalek.isi.edu (4.1/4.0.3-4) id ; Mon, 4 Jun 90 15:03:42 PDT To: Mills@udel.edu Cc: postel, end2end-interest Cc: Finn Subject: IP Timestamp option Dave, You are correct that conversion to/from your NTP format is trivial. However, NTP representation format is non-standard as far as UNIX is concerned [Leffler]. Unless my RFC 958 copy is bad, your origin is 0000 1 Jan 1900. BSD and AT&T UNIX origin is 0000 1 Jan 1970. Curiously, why did you choose 1900? I have no companion for input time-stamp. Sufficient would be the same with merely a different type octet. One might run out of IP header though, which points out an IP design flaw (IMHO). I do not think an argument that SPARC station hardware is allegedly capable of higher precision time-stamps is germane. As you claim, they don't now maintain a higher precision user readable clock (although their tick is supposedly 100/sec instead of 50/sec). But surely that is a minor point. SPARC is a CPU chip, or more loosely, a motherboard category. There are a lot of different UNIX platforms. Many of these do not support higher resolution clocks without increased overhead, which was the reason that the tick rate was held down in the first place. Other than the inevitable confusion that results from yet another new UNIX standard for white bread (if it is not to be made two bytes shorter), I have no objection to a few more "unimplemented" bits to the right of a scheduling tick. My algorithm would throw them away though; I cannot use them, as a tick is the smallest real-time UNIX response quantum.
Sincerely, --- ggf From Mills@udel.edu Mon Jun 4 18:45:40 1990 Posted-Date: Tue, 5 Jun 90 1:37:19 GMT Received-Date: Mon, 4 Jun 90 18:45:40 -0700 Received: from huey.udel.edu by venera.isi.edu (5.61/5.61+local) id ; Mon, 4 Jun 90 18:45:40 -0700 Date: Tue, 5 Jun 90 1:37:19 GMT From: Mills@udel.edu To: finn@venera.isi.edu Cc: Mills@udel.edu, postel@venera.isi.edu, end2end-interest@venera.isi.edu, Finn@venera.isi.edu Subject: Re: IP Timestamp option Message-Id: <9006042137.aa19139@huey.udel.edu> Greg, Your point appears to be that confusion will result from creation of a new "non-standard Unix" format. Us old Internet buzzards have, of course, been constantly confused by the apparent expectation that Unix standards have anything necessarily to do with Internet standards. In fact, the only extant Internet time-format standards are in the DAYTIME, TIME and NTP protocols. In point of fact, both TIME and NTP stake origins at the origin of this century, not that it matters much. If any claim to primordial time can be made in the modern era, it would be the instantiation of Coordinated Universal Time (UTC), which ticked its first at 0000 hours 1 January 1972, so even Unix anticipated that event. You will be amused to learn that in the Digital Time Service format, the origin of time was placed upon issue of the papal bull of 1582 that created the Gregorian Calendar. Many more amusing stories like that can be found op cit my last message. So far as resolution is concerned, I submit any relevance to a Unix platform tick to a Sun/3, SPARCjumper or VAX implementation may not be useful when you consider the vast array of platforms Unix presently runs on. Are you prepared, for instance, for the 8.5-nanosecond resolution of a Cray? How about a Sequent? Fuzzballs don't count, although they keep milliseconds. My advice is to stick to an Internet "standard" representation that is capable of handling the full spectrum of implementations and precisions now in use.
Having spied on Unix, how about V-kernel, Mach, IBM 8000, etc., etc.? Dave From cheriton@Pescadero.Stanford.EDU Mon Jun 4 23:29:14 1990 Posted-Date: Mon, 4 Jun 90 23:28:26 PDT Received-Date: Mon, 4 Jun 90 23:29:14 -0700 Received: from Pescadero.Stanford.EDU by venera.isi.edu (5.61/5.61+local) id ; Mon, 4 Jun 90 23:29:14 -0700 Received: by Pescadero.Stanford.EDU (5.59/25-eef) id AA06678; Mon, 4 Jun 90 23:28:26 PDT Date: Mon, 4 Jun 90 23:28:26 PDT From: David Cheriton Message-Id: <9006050628.AA06678@Pescadero.Stanford.EDU> To: Mills@udel.edu, finn@venera.isi.edu Subject: Re: IP Timestamp option Cc: end2end-interest@venera.isi.edu, postel@venera.isi.edu I would definitely agree with Dave Mills on this: 1) I think designing protocols around some perceived compatibility with the current version of a popular operating system is a loser. Protocols have a longer lifetime (of stability) if successful, and if not, who cares! And Unix in particular has shown the ability to evolve, in its various forms, greater-precision time formats. 2) Machines and networks are clearly going much faster in the future. I think much greater time precision (as provided by NTP) may well be important, more than we realize now (perhaps I should have said "I suspect"). 3) I read a confusion in your message between what precision the time format allows versus the time precision my system clock provides. The picosecond bits may be irrelevant for the current SUN workstations, but I can easily handle these extra bits, and I'll be glad I have them if Bill Joy's fantasies for the SUN 9 come true.
David Cheriton From braden Tue Jun 5 12:25:29 1990 Received-Date: Tue, 5 Jun 90 12:25:29 -0700 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Tue, 5 Jun 90 12:25:29 -0700 Date: Tue, 5 Jun 90 12:26:29 PDT From: braden Posted-Date: Tue, 5 Jun 90 12:26:29 PDT Message-Id: <9006051926.AA05825@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Tue, 5 Jun 90 12:26:29 PDT To: end2end-interest Subject: multicasting ----- Begin Included Message ----- From iso-RELAY@NIC.DDN.MIL Tue Jun 5 10:30:27 1990 Date: Mon, 4 Jun 90 09:32:11 EDT From: patb@gateway.mitre.org (Pat Blankenship) To: new@louie.udel.edu Subject: Re: Broadcast / Multicast in ISO world Cc: iso@nic.ddn.mil Hi Darren. I am a member of the DCA Protocol Standards Technical Panel, and we have been looking into the question you have raised. The military, especially tactical forces, have a distinct need for multicasting which can make use of existing broadcast (radio-based) non-OSI subnets. We too have found it very difficult to construct a scenario for multi- casting in a CO-environment. It is a lot easier in the CL environment. ISO explored this whole topic, both CO and CL, in a project called Multi-Peer Data Transmission (MPDT), which would have caused major changes in the reference model and virtually all protocols. The project was cancelled due to lack of support in late 1989. We are hoping to do further study and perhaps prototyping of a multicast service in a CL environment in the next year or so. FYI, there are a couple of RFCs describing multicasting in the Internet. The latest one is RFC 1054, by Steve Deering. Let me know if anyone responds to you with any bright ideas. 
Pat Blankenship MITRE Corporation Eatontown, NJ ----- End Included Message ----- From finn Tue Jun 5 16:09:20 1990 Received-Date: Tue, 5 Jun 90 16:09:20 -0700 Received: from dalek.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Tue, 5 Jun 90 16:09:20 -0700 Date: Tue, 5 Jun 90 16:10:06 PDT From: finn Posted-Date: Tue, 5 Jun 90 16:10:06 PDT Message-Id: <9006052310.AA00895@dalek.isi.edu> Received: by dalek.isi.edu (4.1/4.0.3-4) id ; Tue, 5 Jun 90 16:10:06 PDT To: cheriton@pescadero.stanford.edu Cc: Mills@udel.edu, end2end-interest, postel Cc: Finn Subject: IP Timestamp option Quantities of religious dogma have been disinterred by this discussion. Each of us practices religious intolerance. I derive amusement from this but it is not very productive. Let us move away from religion. As I stated in the first message that brought this issue to the surface, my principal objection to my format arises from IP header limitations, from lack of space and forced padding to 32-bit boundaries. Keep in mind that we are discussing the IP layer and not TCP, UDP or higher layers. If some type of rate controlled congestion avoidance algorithm occurs at the IP layer I feel certain that it needs a feedback interval to ensure stability. For this type of application, a time-stamp goes out on EVERY datagram. As things stand now, an 8-octet IP header time-stamp option, rather than a 10-octet one, would usually save four octets per datagram. This requires a 6-octet magnitude part. Saving four octets on each datagram sent is significant. But perhaps you disagree? My position is that a 6-octet magnitude format should be adopted for this class of application. My choice of the UNIX date/time stamp is merely provisional. Does this make my position clear? The NTP format does not address this issue at all. It would still require sending those four extra octets for each datagram.
Why not two classes of IP header option time-stamps: One that is used on every datagram and one that is used in those cases where high precision date/time stamps are required? A decision on that is a necessary precursor to our religious battle. I trust that this clears up any confusion that might previously have existed concerning my position. Back to religion ... I can't let this rest unanswered. You assert that protocols have a lot longer lifetime than the current version of popular operating systems. But not IP protocols. The IP/TCP shift occurred in the early 80's. Yet UNIX was around in the 70's, and IBM .... You go on to say, "and if not, who cares!" This is a strange way to put forward an argument. One might ask, why did you make the assertion in the first place? I agree with your second assertion. However, what claim to fame does 100 ns or 1 ns or 200 ps have? Laboratory MBE devices far surpass even the 200 ps figure. It appears that the NTP format is already obsolete ... ... just kidding. Sincerely, --- ggf From Mills@udel.edu Tue Jun 5 22:13:02 1990 Posted-Date: Wed, 6 Jun 90 5:06:12 GMT Received-Date: Tue, 5 Jun 90 22:13:02 -0700 Received: from huey.udel.edu by venera.isi.edu (5.61/5.61+local) id ; Tue, 5 Jun 90 22:13:02 -0700 Date: Wed, 6 Jun 90 5:06:12 GMT From: Mills@udel.edu To: finn@venera.isi.edu Cc: cheriton@pescadero.stanford.edu, Mills@udel.edu, end2end-interest@venera.isi.edu, postel@venera.isi.edu, Finn@venera.isi.edu Subject: Re: IP Timestamp option Message-Id: <9006060106.aa29341@huey.udel.edu> Greg, My cards are on the table. However, be advised the 232-ps resolution in NTP stamps was not established by whim. Knowing time precisely does not do you justice unless you know position precisely, as I have come to understand. If you know time precisely, NTP will provide navigation to within a couple of hundred feet. However, as I have recently verified, U.S.
navigation systems, in particular the Global Positioning System, have intentionally degraded accuracy to the order of LORAN-C; that is, a thousand feet or so. I submit that, unless you can get a waiver from DoD and obtain the undither code, it doesn't make much sense to strive for less than 500 ns or so relative to UTC. Yes, I know you only need relative time, not absolute time. However, when you come right down to it, you have to account for continental drift and even altitude due gravitational shift. At Boulder this amounts to a whopping few parts in 10^14. Awesome. Dave From cheriton@Pescadero.Stanford.EDU Tue Jun 5 22:35:55 1990 Posted-Date: Tue, 5 Jun 90 22:35:11 PDT Received-Date: Tue, 5 Jun 90 22:35:55 -0700 Received: from Pescadero.Stanford.EDU by venera.isi.edu (5.61/5.61+local) id ; Tue, 5 Jun 90 22:35:55 -0700 Received: by Pescadero.Stanford.EDU (5.59/25-eef) id AA11188; Tue, 5 Jun 90 22:35:11 PDT Date: Tue, 5 Jun 90 22:35:11 PDT From: David Cheriton Message-Id: <9006060535.AA11188@Pescadero.Stanford.EDU> To: finn@venera.isi.edu Subject: Re: IP Timestamp option Cc: Mills@udel.edu, end2end-interest@venera.isi.edu, postel@venera.isi.edu I presume labelling one's points as "religious dogma" is a way to avoid intellectually defending one's position. Therefore, I have nothing more to say. From finn Wed Jun 6 09:08:00 1990 Received-Date: Wed, 6 Jun 90 09:08:00 -0700 Received: from dalek.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Wed, 6 Jun 90 09:08:00 -0700 Date: Wed, 6 Jun 90 09:08:38 PDT From: finn Posted-Date: Wed, 6 Jun 90 09:08:38 PDT Message-Id: <9006061608.AA00946@dalek.isi.edu> Received: by dalek.isi.edu (4.1/4.0.3-4) id ; Wed, 6 Jun 90 09:08:38 PDT To: cheriton@pescadero.stanford.edu Cc: Mills@udel.edu, end2end-interest, postel Subject: IP Timestamp option Will the type of modification to IP that requires an interval time-stamp be utilized? If we accept that as either likely or given, we arrive at the issue that I discussed in my reply to you. 
Introductory sentence aside, the content of the first section of my response is a purely technical argument. For my class of measurement, six bytes of magnitude is sufficient. We would be better served by deciding whether or not the four octets per datagram saved (on every datagram) is worth proposing two IP header timestamp options rather than one all-encompassing option. I carefully differentiated in that message the purely technical from the (if you prefer) quasi-technical arguments, a term I freely apply to my discussion in the later section of the reply. You took this all as deadly serious and saw no humour in the discussion. My apologies for offending you. --- ggf From cheriton@Pescadero.Stanford.EDU Wed Jun 6 13:02:00 1990 Posted-Date: Wed, 6 Jun 90 13:01:26 PDT Received-Date: Wed, 6 Jun 90 13:02:00 -0700 Received: from Pescadero.Stanford.EDU by venera.isi.edu (5.61/5.61+local) id ; Wed, 6 Jun 90 13:02:00 -0700 Received: by Pescadero.Stanford.EDU (5.59/25-eef) id AA14207; Wed, 6 Jun 90 13:01:26 PDT Date: Wed, 6 Jun 90 13:01:26 PDT From: David Cheriton Message-Id: <9006062001.AA14207@Pescadero.Stanford.EDU> To: finn@venera.isi.edu Subject: Re: IP Timestamp option Cc: Mills@udel.edu, end2end-interest@venera.isi.edu, postel@venera.isi.edu It seems highly desirable to minimize the number of time formats in the Internet architecture. Given the status of NTP, it seems logical to pick a time format compatible with NTP, especially given that conversion to and from is not a big cost. I can imagine a case for sending different precisions or subfields of a timestamp, especially since the current NTP time format is not adequate in total span to cover all time. However, we would like a consistent "time architecture" to the protocol architecture. Perhaps we need an Internet time architect who coordinates the various needs and uses of time to this end (and we have an obvious candidate).
David Cheriton From Mills@udel.edu Wed Jun 6 13:32:23 1990 Posted-Date: Wed, 6 Jun 90 20:26:31 GMT Received-Date: Wed, 6 Jun 90 13:32:23 -0700 Received: from huey.udel.edu by venera.isi.edu (5.61/5.61+local) id ; Wed, 6 Jun 90 13:32:23 -0700 Date: Wed, 6 Jun 90 20:26:31 GMT From: Mills@udel.edu To: finn@venera.isi.edu Cc: cheriton@pescadero.stanford.edu, Mills@udel.edu, end2end-interest@venera.isi.edu, postel@venera.isi.edu Subject: Re: IP Timestamp option Message-Id: <9006061626.aa06982@huey.udel.edu> Greg, Hey, no offense taken. Timecasting is too much fun to be taken seriously, anyway. I personally see merit in having the option, even if it winds up "nonstandard." However, if the particular application requires only that the packet gazinta and gazouta times be known precisely, I would recommend the purely internal scheme I mentioned in connection with the fuzzbums: stamp the gazinta in the driver, but leave the stamp itself in the buffer, separate from the packet itself, and stamp the buffer for the gazouta upon completion interrupt and provide this info in a reply message to the user process. Of course, this is not the architecture of Unix, at least in the forms I know about. Dave From Mills@udel.edu Wed Jun 6 13:42:57 1990 Posted-Date: Wed, 6 Jun 90 20:41:38 GMT Received-Date: Wed, 6 Jun 90 13:42:57 -0700 Received: from huey.udel.edu by venera.isi.edu (5.61/5.61+local) id ; Wed, 6 Jun 90 13:42:57 -0700 Date: Wed, 6 Jun 90 20:41:38 GMT From: Mills@udel.edu To: David Cheriton Cc: finn@venera.isi.edu, Mills@udel.edu, end2end-interest@venera.isi.edu, postel@venera.isi.edu Subject: Re: IP Timestamp option Message-Id: <9006061641.aa07298@huey.udel.edu> David, While definitely not volunteering as nominal architect, I do suggest Appendix E of the Version 3 (sic) spec now displayed in the familiar spot. It jabbers and yawns on about the timescales, calendars and time-transfer technology encountered by Internet time travellers.
The new Version 3 is running in all fuzzballs, documented and tested, but still in review by my buddies. Anybody is welcome to paw the scraps found in the files ntprh.ps, ntpr.ps and ntpra.ps in the pub/ntp directory on louie.udel.edu. About 100 pages of prose, tables, formulae, graphs and related goop lurk there. Come to think of it, I'll copy this list with the announcement. Dave From Mills@udel.edu Wed Jun 6 13:53:15 1990 Posted-Date: Wed, 6 Jun 90 20:49:50 GMT Received-Date: Wed, 6 Jun 90 13:53:15 -0700 Received: from huey.udel.edu by venera.isi.edu (5.61/5.61+local) id ; Wed, 6 Jun 90 13:53:15 -0700 Date: Wed, 6 Jun 90 20:49:50 GMT From: Mills@udel.edu To: end2end-interest Subject: [To: ntp: NTP Version 3 (what?!)] Message-Id: <9006061649.aa07524@huey.udel.edu> ----- Forwarded message # 1: Received: from louie.udel.edu by huey.udel.edu id aa04216; 6 Jun 90 12:51 EDT Received: from trantor.umd.edu by louie.udel.edu id aa10108; 6 Jun 90 12:49 EDT Received: by trantor.umd.edu (5.63/1.34) id AA02482; Wed, 6 Jun 90 16:47:20 GMT Received: from huey.udel.edu by trantor.umd.edu (5.63/1.34) id AA02478; Wed, 6 Jun 90 16:47:15 GMT Date: Wed, 6 Jun 90 16:37:31 GMT From: Mills@udel.edu To: ntp@trantor.umd.edu Subject: NTP Version 3 (what?!) Message-Id: <9006061237.aa04070@huey.udel.edu> Folks, I am happy to announce that Version 3 of NTP may have (almost) arrived. The simulations, fuzzball code and documentation are now believed in operational status. The fuzzballs have been running the new code for a week (didn't notice, did you?). The Version 3 spec/implementation guide is in the three PostScript files ntprh.ps, ntpr.ps and ntpra.ps, all in the pub/ntp directory on louie.udel.edu. At that place are also the updated C-language simulator ntp.c and (for the brave) the fuzzball assembly-language NTP module ntpsrv.mac. There are a whole bunch of dangling issues here, both in detail functionality, compatibility and debugging the weenies.
The new stuff doesn't change the architecture much, but it does change a lot of pesky details. My intent is that Version 3 interoperates with previous versions while adding improved algorithms, believable error bounds and trustworthy correctness principles. I do not anticipate a Version 3 daemon will be difficult to adapt from either a Version 1 or Version 2 implementation, except possibly for the intersection algorithm now part of the clock-selection procedure. This algorithm involves constructing and sorting a list of all eligible peers, not just the few low-dispersion peers admitted by the Version-2 algorithm. Of course, Version 3 owes many of its ideas on correctness to DTS; however, in order to preserve stability and accuracy without compromising correctness, the DTS intersection algorithm had to be amended in subtle ways. There are a couple of insiduousities that may not claim your attention right away. In order to improve the statistical estimates based on long-term averages, which will be necessary as the speeds we run these things at increase, the peer delay now is considered a signed number. However, all distance calculations are based on the quantity dispersion plus one-half the delay, which should always be a positive number greater than zero. The Version-2 algorithms added a fudge factor to the delay so that it would never become negative. The fudge factor is now a rigorous calculation (see Appendix G). Also, in order to account for the maximum credible skew in oscillator frequencies, the dispersion is now a time-sensitive quantity, which means every offset/delay sample in the entire system has to be tracked in age and its dispersion corrected accordingly. This is not hard, just fastidious. The local-clock algorithm has been improved so that it is now possible to back off the poll interval for even those peers selected for synchronization. This has become rather important for the existing primary reference (stratum-1) servers now operating.
Most of the fuzzball primary servers are now serving over a packet per second on average for well over 100 customers. This situation cannot continue to escalate or NTP will rise from the noise and become a serious traffic problem in its own right. The new model allows the poll rate to be decreased to a packet every 17 minutes or so, which may hold us for a year or two or until the gigabit era, whichever arrives first. Among the things hopefully improved in Version 3 is a major reduction in the tweakable parameters that complicate previous versions. The protocol is largely self-tuning over regimes from the NSF backbone to the infamous Norway link. In tests here I have intentionally diddled the timekeeping in raucous ways and enjoyed watching the algorithms fight back. None of the parameters are especially critical; however, there remain a couple which can be fine-tuned to emphasize reliability as against precision, for instance. For a j-random workstation you might want very reliable service, even if the precision suffers; while, to number the Norway atomic seconds, you might want precision and to pass up more of the outliers. Experience should refine the details. You will note a number of protocol details have been simplified. For instance, the clock holdoff is gone, since the carefully crafted (8) sanity checks remove the need for that. The poll interval is based entirely on reachability and not on dispersion threshold or survivor status. After some experiment I decided the extra functionality provided by the detail engineering was not justified by its obscurity and opted for the kinder, gentler approach. I welcome your comments and advice on these issues. This thing has eaten six months of my life and consumed a lot of late-night testing and checking against the specification. I would very much appreciate whatever time you have to read and mark up the specification, especially the painful task of carefully checking the procedure code for bugs and omissions.
I tried faithfully to keep the fuzzcode, simulator, specification prose, C-language code (in Appendix I) consistent, but errors may have crept in. In particular, the primary-clock procedures have not been thoroughly tested, since the fuzzball does not use that technique. On the other hand, xntpd does use that technique, so careful observers may find some gotchas. I am purposely not distributing this thing outside the NTP list for now to give you guys first whack at critique. Please treat it as a draft document and confine redistribution to your friends who understand this has not yet been proposed for standard status. While I would be pleased as punch to hear comments and corrections on the entire 100+-page document, I am most interested in fixing bugs in Sections 3-5 and Appendices A-D and I. The remaining appendices are largely analytical or tutorial in nature and best read next to a warm, crackling fire on a dark, rainy night. From the Preface This document describes Version 3 of the Network Time Protocol (NTP). It supersedes Version 2 of the protocol described in RFC-1119 dated September 1989. However, it neither changes the protocol in any significant way nor obsoletes previous versions or existing implementations. The main motivation for the new version is to refine the analysis and implementation models for new applications at much higher network speeds to the gigabit-per-second regime and to provide for the enhanced stability, accuracy and precision required at such speeds. In particular, the sources of time and frequency errors have been rigorously examined and error bounds established in order to improve performance, provide a model for correctness assertions and indicate timekeeping quality to the user.
The revision also incorporates two new optional features, (1) an algorithm to combine the offsets of a number of peer time servers in order to enhance accuracy and (2) improved local-clock algorithms which allow the poll intervals on all synchronization paths to be substantially increased in order to reduce network overhead. An overview of the changes, which are described in detail in Appendix D, follows: 1. In Version 3 The local-clock algorithm has been overhauled to improve stability and accuracy. Appendix G presents a detailed mathematical model and design example which has been refined with the aid of feedback-control analysis and extensive simulation using data collected over ordinary Internet paths. Section 5 of RFC-1119 on the NTP local clock has been completely rewritten to describe the new algorithm. Since the new algorithm can result in message rates far below the old ones, it is highly recommended that they be used in new implementations. Note that this algorithm is not integral to the NTP protocol specification itself and its use does not affect interoperability with previous versions or existing implementations; however, in order to insure overall NTP subnet stability in the Internet, it is essential that the local-clock characteristics of all NTP time servers conform to the analytical models presented previously and in this document. 2. In Version 3 a new algorithm to combine the offsets of a number of peer time servers is presented in Appendix F. This algorithm is modelled on those used by national standards laboratories to combine the weighted offsets from a number of standard clocks to construct a synthetic laboratory timescale more accurate than that of any clock separately. It can be used in an NTP implementation to improve accuracy and stability and reduce errors due to asymmetric paths in the Internet. 
The new algorithm has been simulated using data collected over ordinary Internet paths and, along with the new local-clock algorithm, implemented and tested in the Fuzzball time servers now running in the Internet. Note that this algorithm is not integral to the NTP protocol specification itself and its use does not affect interoperability with previous versions or existing implementations. 3. Several inconsistencies and minor errors in previous versions have been corrected in Version 3. The description of the procedures has been rewritten in pseudo-code augmented by English commentary for clarity and to avoid ambiguity. Appendix I has been added to illustrate C-language implementations of the various filtering and selection algorithms suggested for NTP. Additional information is included in Section 5 and in Appendix E, which includes the tutorial material formerly included in Section 2 of RFC-1119, as well as much new material clarifying the interpretation of timescales and leap seconds. 4. Minor changes have been made in the Version-3 local-clock algorithms to avoid problems observed when leap seconds are introduced in the UTC timescale and also to support an auxiliary precision oscillator, such as a cesium clock or timing receiver, as a precision timebase. In addition, changes were made to some procedures described in Section 3 and in the clock-filter and clock-selection procedures described in Section 4. While these changes were made to correct minor bugs found as the result of experience and are recommended for new implementations, they do not affect interoperability with previous versions or existing implementations in other than minor ways (at least until the next leap second). 5. In Version 3 changes were made to the way delay, offset and dispersion are defined, calculated and processed in order to reliably bound the errors inherent in the time-transfer procedures. 
In particular, the error accumulations were moved from the delay computation to the dispersion computation and both included in the clock filter and selection procedures. The clock-selection procedure was modified to remove the first of the two sorting/discarding steps and replace it with an algorithm first proposed by Marzullo and later incorporated in the Digital Time Service. These changes do not significantly affect the ordinary operation of or compatibility with various versions of NTP, but they do provide the basis for formal statements of correctness as described in Appendix H. Dave ----- End of forwarded messages From almquist@jessica.stanford.edu Thu Jun 7 16:12:03 1990 Posted-Date: Thu, 07 Jun 90 16:11:31 -0700 Received-Date: Thu, 7 Jun 90 16:12:03 -0700 Received: from Jessica.Stanford.EDU by venera.isi.edu (5.61/5.61+local) id ; Thu, 7 Jun 90 16:12:03 -0700 Received: from LOCALHOST by jessica.stanford.edu (5.59/25-eef) id AA01519; Thu, 7 Jun 90 16:11:32 PDT Message-Id: <9006072311.AA01519@jessica.stanford.edu> To: mogul@decwrl.dec.com, van@helios.ee.lbl.gov Cc: end2end, ietf-rreq-editor@jessica.stanford.edu Subject: IP Fragmentation strategies Date: Thu, 07 Jun 90 16:11:31 -0700 From: "Philip Almquist" Jeff and Van, Is there in any sense of the word an optimal way to size the fragments when doing IP fragmentation? This issue came up in the Router Requirements Working Group, and it was suggested that one or both of you might have some insight into this question. Some people thought that the best mechanism was to make all fragments MTU-sized except for the last. Others suggested that a strategy which generated the same number of fragments but made them all of equal size might make it less likely that there would be additional fragmentation later on.
Philip From mogul@decwrl.dec.com Thu Jun 7 16:38:56 1990 Posted-Date: 7 Jun 1990 1638-PDT (Thursday) Received-Date: Thu, 7 Jun 90 16:38:56 -0700 Received: from decwrl.dec.com by venera.isi.edu (5.61/5.61+local) id ; Thu, 7 Jun 90 16:38:56 -0700 Received: by decwrl.dec.com; id AA19233; Thu, 7 Jun 90 16:38:22 -0700 Received: by acetes.pa.dec.com (5.54.5/4.7.34) id AA05949; Thu, 7 Jun 90 16:38:07 PDT From: mogul@decwrl.dec.com (Jeffrey Mogul) Message-Id: <9006072338.AA05949@acetes.pa.dec.com> Date: 7 Jun 1990 1638-PDT (Thursday) To: "Philip Almquist" Cc: end2end, ietf-rreq-editor@jessica.stanford.edu, van@helios.ee.lbl.gov Subject: Re: IP Fragmentation strategies In-Reply-To: "Philip Almquist" / Thu, 07 Jun 90 16:11:31 -0700. <9006072311.AA01519@jessica.stanford.edu> Is there in any sense of the word an optimal way to size the fragments when doing IP fragmentation? This issue came up in the Router Requirements Working Group, and it was suggested that one or both of you might have some insight into this question. Some people thought that the best mechanism was to make all fragments MTU-sized except for the last. Others suggested that a strategy which generated the same number of fragments but made them all of equal size might make it less likely that there would be additional fragmentation later on. From the point of view of the PMTU Discovery working group (and current proposed protocol) it doesn't matter what the routers do, since the effect of the PMTU Discovery mechanism is to prevent fragmentation in almost all instances.
I have heard some people (such as Leo McLaughlin from TWG, if my memory serves) say that if a router generates fragments so that a packet turns into a tiny datagram followed by a large datagram (the reverse of current practice, as it happens) then this is optimal for hosts with interfaces that cannot handle back-to-back packets; the second fragment arrives (in the sense that reception completes) long enough after the first fragment that there is more chance to be ready for it. (If the second fragment is tiny, it "arrives" almost simultaneously with the first fragment.) Of course, this might argue for the "equal pieces" approach, since in this case it would be unlikely that there would be any really tiny fragments at all. As to the question: which scheme causes the least number of fragments if additional fragmentation is done ... I have pondered this on occasion but it makes my brain hurt to think about it. However, if you use the table of known MTUs in draft-ietf-mtudisc-pathmtu-00.txt and compute all the possible ways in which a datagram that starts out with size = the MTU of some network and is fragmented N times, then there are (16!)/((16-N)!) possible combinations, and less than 5! of these are at all likely. So you could write the obvious program to simulate various strategies and decide if one has a clear advantage. 
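A minimal sketch of the kind of "obvious program" mentioned above, comparing the two strategies from the original question along a path of decreasing MTUs. It assumes a 20-byte IP header with no options and fragment data in multiples of 8 bytes; the function names and the sample path are illustrative only, and sizes are bytes of IP payload.

```python
IP_HEADER = 20  # assumption: no IP options


def max_fragment_data(mtu):
    # Largest payload per fragment, rounded down to a multiple of 8 bytes.
    return (mtu - IP_HEADER) // 8 * 8


def fragment_greedy(payload, mtu):
    """All fragments MTU-sized except for the last."""
    limit = max_fragment_data(mtu)
    sizes = []
    while payload > limit:
        sizes.append(limit)
        payload -= limit
    sizes.append(payload)
    return sizes


def fragment_equal(payload, mtu):
    """Same number of fragments as greedy, but of roughly equal size."""
    limit = max_fragment_data(mtu)
    count = max(1, -(-payload // limit))   # ceiling division
    per = -(-payload // count)             # even share, rounded up
    per = -(-per // 8) * 8                 # then up to a multiple of 8
    sizes = []
    while payload > per:
        sizes.append(per)
        payload -= per
    sizes.append(payload)
    return sizes


def fragment_path(payload, mtus, strategy):
    """Apply one strategy at every hop and return the final fragment sizes."""
    fragments = [payload]
    for mtu in mtus:
        fragments = [piece for f in fragments for piece in strategy(f, mtu)]
    return fragments


if __name__ == "__main__":
    path = [1500, 1006]  # hypothetical path: Ethernet, then a 1006-byte link
    for payload in (2000, 1900):
        for strategy in (fragment_greedy, fragment_equal):
            result = fragment_path(payload, path, strategy)
            print(payload, strategy.__name__, len(result), result)
```

On this sample path a 2000-byte payload ends up as 3 fragments with the MTU-sized strategy and 4 with equal pieces, while a 1900-byte payload ends up as 3 and 2 respectively, so neither strategy wins in every case even in this toy setting.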
-Jeff From Mills@udel.edu Sun Jun 10 20:43:26 1990 Posted-Date: Mon, 11 Jun 90 3:41:19 GMT Received-Date: Sun, 10 Jun 90 20:43:26 -0700 Received: from huey.udel.edu by venera.isi.edu (5.61/5.61+local) id ; Sun, 10 Jun 90 20:43:26 -0700 Date: Mon, 11 Jun 90 3:41:19 GMT From: Mills@udel.edu To: end2end Subject: [To: ntp: V3 compressed] Message-Id: <9006102341.aa07944@huey.udel.edu> ----- Forwarded message # 1: Received: from louie.udel.edu by huey.udel.edu id aa07340; 10 Jun 90 20:48 EDT Received: from trantor.umd.edu by louie.udel.edu id aa08929; 10 Jun 90 20:41 EDT Received: by trantor.umd.edu (5.63/1.34) id AA08013; Mon, 11 Jun 90 00:40:02 GMT Received: from huey.udel.edu by trantor.umd.edu (5.63/1.34) id AA08007; Mon, 11 Jun 90 00:39:57 GMT Date: Mon, 11 Jun 90 0:38:01 GMT From: Mills@udel.edu To: ntp@trantor.umd.edu Subject: V3 compressed Message-Id: <9006102038.aa07297@huey.udel.edu> Folks, For convenience, I have bottled the three parts of the V3 spec as one document and left both the raw and compressed versions in the pub/ntp directory on louie.udel.edu as the files ntpv3.ps and ntpv3.ps.Z respectively. While at it, I fixed the bugs found by Darren New. The raw document is to become a UDel technical report so it can be cited if necessary. Dave ----- End of forwarded messages From Mills@udel.edu Fri Jun 15 07:09:49 1990 Posted-Date: Fri, 15 Jun 90 14:04:31 GMT Received-Date: Fri, 15 Jun 90 07:09:49 -0700 Received: from huey.udel.edu by venera.isi.edu (5.61/5.61+local) id ; Fri, 15 Jun 90 07:09:49 -0700 Date: Fri, 15 Jun 90 14:04:31 GMT From: Mills@udel.edu To: end2end Subject: [Erik E. Fair: Re: Peer selection] Message-Id: <9006151004.aa19816@huey.udel.edu> Folks, The natives are getting restless trying to find time servers in a more-or-less systematic fashion. Erik has one suggestion, but his is not the only one. It strikes me this is of course a generic issue and maybe worth study and resolution. I am tinkling this group rather than the entire ietf first. 
Dave ----- Forwarded message # 1: Received: from louie.udel.edu by huey.udel.edu id aa16762; 15 Jun 90 0:48 EDT Received: from apple.com by louie.udel.edu id aa17613; 15 Jun 90 0:47 EDT Received: by apple.com with SMTP (5.61/25-eef) id AA24974; Thu, 14 Jun 90 21:47:18 -0700 for Mills@udel.edu From: "Erik E. Fair" (Your Friendly Postmaster) Subject: Re: Peer selection In-Reply-To: <9006142315.aa16258@huey.udel.edu> To: Mills@udel.edu X-Return-Path: ntp-relay@trantor.umd.edu Cc: ntp@trantor.umd.edu Date: Thu, 14 Jun 90 21:47:17 -0700 Message-Id: <24969.645425237@apple.com> Sender: fair@apple.com There is another alternative that I have been thinking of foisting on you all: clock.org domain with CNAMEs for each primary, and designated secondary NTP server on the Internet, managed by the NTP cognoscenti, in cooperation. That is, ntp1.stratum1.clock.org. IN CNAME dcn1.udel.edu. ntp2.stratum1.clock.org. IN CNAME dcn5.udel.edu. ntp3.stratum1.clock.org. IN CNAME dcn6.udel.edu. ntp4.stratum1.clock.org. IN CNAME apple.com. ntp5.stratum1.clock.org. IN CNAME bitsy.mit.edu. ntp6.stratum1.clock.org. IN CNAME fuzz.sdsc.edu. ntp7.stratum1.clock.org. IN CNAME ncarfuzz.ucar.edu. Then, perhaps: barrnet1.stratum2.clock.org. IN CNAME kermit.stanford.edu. barrnet2.stratum2.clock.org. IN CNAME violet.berkeley.edu. nysernet1.stratum2.clock.org. IN CNAME lilben.tn.cornell.edu. and so on. What I'd like to see is a well connected stratum 2 NTP distribution network, so that if there is a failure of all but one stratum 1 server, we can still all get true tick from that one, via a well connected stratum 2 distribution network. This would have prevented the failure we had a while ago when all the fuzzies stopped talking to the rest of us, and bitsy and apple were chiming for those parts of the network that were listening to us.
The above could also lead to simpler config files which wouldn't have to change as often, and the NTP network could be reconfigured centrally by the administrator who controlled the domain... I haven't thought out the name scheme very carefully, so please don't take this as a completely concrete proposal. However, I think that the basic gist of it (some centrally managed CNAMES for the servers) can really make our lives easier. I also think that us stratum 1 tickers should force this kind of a setup soon, by turning on authentication, and not ticking with anyone who isn't a designated stratum 2 server. Otherwise, we'll all end up like dcn1 did when everyone beat on it for TCP/time, long ago. Erik E. Fair apple!fair fair@apple.com ----- End of forwarded messages From Mills@udel.edu Fri Jun 15 07:20:16 1990 Posted-Date: Fri, 15 Jun 90 14:09:46 GMT Received-Date: Fri, 15 Jun 90 07:20:16 -0700 Received: from huey.udel.edu by venera.isi.edu (5.61/5.61+local) id ; Fri, 15 Jun 90 07:20:16 -0700 Date: Fri, 15 Jun 90 14:09:46 GMT From: Mills@udel.edu To: end2end Subject: [Stan Barber: Re: Peer selection] Message-Id: <9006151009.aa19866@huey.udel.edu> Folks, And so it goes. No, I don't think this belongs on namedroppers. Fundamental policy issues here. Note in passing, I designed the NTP control-message functions to support a wire-walker feature so, in principle, it is possible to winkle out the subnet topology given access to at least one existing server, so that is not the issue. I see the scope extending to multicast-group formation and maintenance, even ad-hoc groups. We need a resource- location protocol, right? Deja vu all over again... 
Dave ----- Forwarded message # 1: Received: from louie.udel.edu by huey.udel.edu id aa17723; 15 Jun 90 3:03 EDT Received: from bcm.tmc.edu by louie.udel.edu id aa18529; 15 Jun 90 2:59 EDT Received: from tmc.edu by bcm.tmc.edu (AA04900); Fri, 15 Jun 90 01:59:17 CDT Received: by tmc.edu (AA22528); Fri, 15 Jun 90 01:59:30 CDT Message-Id: <9006150659.AA22528@tmc.edu> From: Stan Barber Date: Fri, 15 Jun 90 01:59:29 CDT In-Reply-To: Yes X-Mailer: Mail User's Shell (6.5.6 6/30/89) To: "Erik E. Fair" (Your Friendly Postmaster) , Mills@udel.edu Subject: Re: Peer selection Cc: ntp@trantor.umd.edu Why not make it "time.net" :-)? ----- End of forwarded messages From J.Crowcroft@cs.ucl.ac.uk Sun Jun 17 06:38:49 1990 Posted-Date: Sun, 17 Jun 90 14:28:10 +0100 Received-Date: Sun, 17 Jun 90 06:38:49 -0700 Message-Id: <9006171338.AA06546@venera.isi.edu> Received: from nsfnet-relay.ac.uk by venera.isi.edu (5.61/5.61+local) id ; Sun, 17 Jun 90 06:38:49 -0700 Received: from cs.ucl.ac.uk by vax.NSFnet-Relay.AC.UK via NSFnet with SMTP id aa05444; 17 Jun 90 14:23 BST Received: from bells.cs.ucl.ac.uk by vs6.Cs.Ucl.AC.UK via Ethernet with SMTP id aa02797; 17 Jun 90 14:28 WET DST Received: from sol.cs.ucl.ac.uk by bells.cs.ucl.ac.uk with SMTP inbound id ; Sun, 17 Jun 1990 14:28:10 +0100 To: Mills@louie.udel.edu Cc: end2end Subject: Re: [Erik E. Fair: Re: Peer selection] In-Reply-To: Your message of Fri, 15 Jun 90 14:04:31 +0000. <9006151004.aa19816@huey.udel.edu> Date: Sun, 17 Jun 90 14:28:10 +0100 From: Jon Crowcroft >There is another alternative that I have been thinking of foisting on >you all: > clock.org > >domain with CNAMEs for each primary, and designated secondary NTP >server on the Internet, managed by the NTP cognoscenti, in cooperation. sounds like a v.
good idea to me - i never did like all that config info...we had been considering doing something extremely similar for the NTP on ROS using the directory (the code for lookup is already sort of there, but we need to define a few attributes.../object IDs and all that stuff...) j. From J.Crowcroft@cs.ucl.ac.uk Sun Jun 17 06:44:22 1990 Posted-Date: Sun, 17 Jun 90 14:34:46 +0100 Received-Date: Sun, 17 Jun 90 06:44:22 -0700 Message-Id: <9006171344.AA06568@venera.isi.edu> Received: from nsfnet-relay.ac.uk by venera.isi.edu (5.61/5.61+local) id ; Sun, 17 Jun 90 06:44:22 -0700 Received: from cs.ucl.ac.uk by vax.NSFnet-Relay.AC.UK via NSFnet with SMTP id aa05549; 17 Jun 90 14:28 BST Received: from bells.cs.ucl.ac.uk by vs6.Cs.Ucl.AC.UK via Ethernet with SMTP id aa02857; 17 Jun 90 14:34 WET DST Received: from sol.cs.ucl.ac.uk by bells.cs.ucl.ac.uk with SMTP inbound id ; Sun, 17 Jun 1990 14:34:46 +0100 To: end2end Subject: UCL/Nemesys ATM/B-ISDN Simulator & PBR from DEC Date: Sun, 17 Jun 90 14:34:46 +0100 From: Jon Crowcroft 1. The story is this - we can give the Nemesys ATM/B-ISDN simulator to friends so long as we DO NOT tell our partners!! in other words, if anyone is interested, they should contact me (and cc: d.sirovica@cs.ucl.ac.uk) and we will dispatch tape... word of warning - it's in a transitional stage, so you either get a year old version, or wait 6 months and get one that includes lots of fancy graphics, control systems, etc etc... 2. the Policy Based Routing stuff I mentioned in Pitt was a DEC WRL report by Jeff Mogul from 6 months ago... (WRL Res Rep 89/4) - I'm not sure if I was meant to see it, so it's probably best you ask Jeff direct... regards j.
From Mills@udel.edu Sun Jun 17 08:10:32 1990 Posted-Date: Sun, 17 Jun 90 15:08:56 GMT Received-Date: Sun, 17 Jun 90 08:10:32 -0700 Received: from huey.udel.edu by venera.isi.edu (5.61/5.61+local) id ; Sun, 17 Jun 90 08:10:32 -0700 Date: Sun, 17 Jun 90 15:08:56 GMT From: Mills@udel.edu To: Jon Crowcroft Cc: Mills@louie.udel.edu, end2end Subject: Re: [Erik E. Fair: Re: Peer selection] Message-Id: <9006171108.aa04541@huey.udel.edu> Jon, There is of course precedent in DTS, which uses multicast to find nearby chimers and directory services to find far-away ones. DTS does this for every server at least once per day. If we copied the idea in NTP there would be upwards of 2000 servers honking for several friends that often. Note that using multicast to find NTP neighbors (on the same cable) would not help much in NTP, since most of the peer paths are via WANs without that feature. Now, let's use IP multicast to do all this and then we get to solve the same problem in forming the IP multicast group and its data base. However, as I pointed out before, you can use the NTP control message (or its SNMP equivalent if/when that happens) to explore the subnet above you on the way to the root, assuming you can find at least one friend at an appropriate stratum. This might be augmented with IP multicast and directory services to construct a fairly efficient system.
Dave From J.Crowcroft@cs.ucl.ac.uk Sun Jun 17 08:36:47 1990 Posted-Date: Sun, 17 Jun 90 16:27:48 +0100 Received-Date: Sun, 17 Jun 90 08:36:47 -0700 Message-Id: <9006171536.AA07363@venera.isi.edu> Received: from nsfnet-relay.ac.uk by venera.isi.edu (5.61/5.61+local) id ; Sun, 17 Jun 90 08:36:47 -0700 Received: from cs.ucl.ac.uk by vax.NSFnet-Relay.AC.UK via NSFnet with SMTP id aa08104; 17 Jun 90 16:24 BST Received: from bells.cs.ucl.ac.uk by vs6.Cs.Ucl.AC.UK via Ethernet with SMTP id aa03092; 17 Jun 90 16:27 WET DST Received: from sol.cs.ucl.ac.uk by bells.cs.ucl.ac.uk with SMTP inbound id ; Sun, 17 Jun 1990 16:27:50 +0100 To: Mills@louie.udel.edu Cc: end2end Subject: Re: [Erik E. Fair: Re: Peer selection] In-Reply-To: Your message of Sun, 17 Jun 90 15:08:56 +0000. <9006171108.aa04541@huey.udel.edu> Date: Sun, 17 Jun 90 16:27:48 +0100 From: Jon Crowcroft >peer paths are via WANs without that feature. Now, let's use IP multicast >to do all this and then we get to solve the same problem in forming the >IP multicast group and its data base. now this brings up something of a chicken and egg thing - it has been suggested that by stepping up TTL in a multicast request for each subsequent request, one can achieve a search starting 'around here' and getting wider - if NTP provides good rtt estimates, then one can do this realistically by interpreting TTL in strict time sense - however, if NTP is using this to peer (and explore the subnet and form its tree), which comes first ? hmm...
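The stepped-TTL search described above can be sketched as a small simulation. This is only an illustration under stated assumptions: no real multicast is sent, TTL is treated as a plain hop count, and the function name, the hop_distance table and the step size are all hypothetical.

```python
def expanding_ring_search(hop_distance, max_ttl=32, step=1):
    """Return (ttl, servers) for the smallest TTL that reaches any server.

    hop_distance maps a server name to its distance in hops from the
    searcher; a real implementation would learn this only from replies.
    """
    ttl = step
    while ttl <= max_ttl:
        # Every server within the current ring answers the request.
        reached = sorted(name for name, hops in hop_distance.items()
                         if hops <= ttl)
        if reached:
            return ttl, reached
        ttl += step  # nobody answered: widen the ring and ask again
    return None, []


if __name__ == "__main__":
    servers = {"campus-a": 3, "campus-b": 3, "regional": 7}
    print(expanding_ring_search(servers))
```

The search stops at the first ring that produces an answer, so nearby servers are found without the request ever reaching the distant ones.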
jon From J.Crowcroft@cs.ucl.ac.uk Sun Jun 17 09:04:31 1990 Posted-Date: Sun, 17 Jun 90 16:53:15 +0100 Received-Date: Sun, 17 Jun 90 09:04:31 -0700 Message-Id: <9006171604.AA07581@venera.isi.edu> Received: from nsfnet-relay.ac.uk by venera.isi.edu (5.61/5.61+local) id ; Sun, 17 Jun 90 09:04:31 -0700 Received: from cs.ucl.ac.uk by vax.NSFnet-Relay.AC.UK via NSFnet with SMTP id aa08811; 17 Jun 90 16:49 BST Received: from bells.cs.ucl.ac.uk by vs6.Cs.Ucl.AC.UK via Ethernet with SMTP id aa03186; 17 Jun 90 16:53 WET DST Received: from sol.cs.ucl.ac.uk by bells.cs.ucl.ac.uk with SMTP inbound id ; Sun, 17 Jun 1990 16:53:17 +0100 To: end2end Subject: UCL research notes...in the last 6 months... Date: Sun, 17 Jun 90 16:53:15 +0100 From: Jon Crowcroft %A Paul Barker %T Issues raised by an X.500 pilot service %R RN/90/5 %D January 1990 %P %O %Y %X This paper considers the issues raised by trying to run an X.500 pilot Directory Service. The discussion draws on experience gained from participation in directory pilots within the ESPRIT project THORN, and the U.K.~academic community. .P A report is given on the status of the U.K.~Academic Pilot and other piloting activities. .P The following points are discussed: the naming of organisations and people in the directory; what type of information is stored in the directory; managing the data in the directory; accessing the data -- mode of access and user interfaces; security -- authentication and access control; data protection legislation. %A Paul Barker %T Managing data derived from multiple sources in an X.500 Directory %R RN/90/6 %D January 1990 %P %O %Y %X X.500 Directories will not be the master source of much data until they are much longer established, and, hopefully by then, trusted. Until then there will remain the substantial problem of keeping an X.500 Directory up-to-date from a number of sources. The volume of data will require that maintenance procedures are as automated as possible. 
.P However, naive procedures will not suffice for a number of reasons: different sources will name the same object differently; different sources will name different objects the same; the data obtained from these sources may overlap to some extent, though one source may be regarded as more authoritative than another. .P These problems, amongst others, are considered and suggestions made to solve the problems. %A Paul Barker %T Directory System Configuration Guide %R RN/90/7 %D January 1990 %P %O %Y %X This document is a hardware and software configuration guide for U.K. academic sites participating in the Directory Pilot. .P Several people contributed considerably in the production of this document. I would like to acknowledge the help given by Andrew Findlay, Graham Carpenter, Chris Elvin, Tony Bates, Colin Robbins and John Andrews. %A S.M.~Walton %A G.J.~Knight %T The Construction of a Connectionless Ethernet ISDN Gateway %R RN/90/8 %D January 1990 %O lpr -Pps paper.ps %Y /cs/research/proof/ron/simonw/paper.ps:4813070290 %P 10 %J Proceedings of Open Systems and Interoperability Online Conference, 1990 %K ISDN gateway relay %X As part of its continuing research into Open Systems and ISDN systems the Computer Science Department at University College London has built a Network Layer gateway between Primary rate ISDN and Ethernet. The gateway is built from a 68000-based front-end, with a Primary rate ISDN interface, installed in a SUN workstation; it is designed to have a minimal impact on the native SUN operating system software. The gatewaying task is shared between the SUN and the front-end, with the aim of relieving the front-end of the more complex gatewaying tasks such as fragmentation and the exchange of routing information with other gateways. The gateway operates as a Connectionless Network Relay and thus reflects the dominant mode of working on LANs. 
.P The paper describes the design decisions that were taken in the construction of the gateway and outlines the eventual hardware and software design. Consideration is given to the overall system and protocol architecture, memory access, buffering strategies, circuit management, and the facilities and protocols used for network management. %A Jon Crowcroft %A Mark d'Inverno %T Specification, Design, and Implementation of A Real Time Conferencing System %R RN/90/9 %D February 1990 %x submitted to ACM SIGCOMM Symposium '90 %Y /cs/research/darpa/ron/jon/docs/in/xonf/xonf.tex %O %P %K Distributed Systems, Formal Specification, Conferencing, Floor Control, Z. %X This paper presents the specification, design and implementation of a text based multi-way real time conferencing program. The motivation was frustration with the limitations of the Unix\footnote{Unix is a trademark of AT\&T Bell Laboratories} talk program.\footnote{The talk program was originally written by Kipp Hickman.} .P The system is described in three parts: The user interface, the distribution mechanism for users' contributions and the floor control scheme. .P There are certain limitations to the ability of humans to assimilate textual information. We look at how these are reflected in the design of the windowing interface to the conference, and how they affect the floor control system. .P Underlying mechanisms are emerging for multi-destination delivery of data. We see how these can be used by the conferencing system. .P Humans have evolved many complex ways of interacting face to face. In this paper, we specify one specific system of floor control using the Formal Specification Language, Z. .P We assume the reader is familiar with networked windowing systems such as X-Windows. 
%A Z.~Wang %A J.~Crowcroft %T Shortest Path First with Emergency Exits %R RN/90/11 %D February 1990 %O Postscript format %Y ~zwang/docs/rn/spf-ee.ps:4214260490 %P %x submitted to SIGCOMM '90 %K Routing algorithm, shortest-path algorithm %X Under heavy and dynamic traffic, the SPF routing algorithm often suffers from wild oscillation and severe congestion and results in degradation of the network performance. In this paper, we present a new routing algorithm (SPF-EE) which attempts to eliminate the problems associated with the SPF algorithm by providing alternative paths as emergency exits when packets are accumulating in some areas of the network. With the SPF-EE algorithm, traffic is routed along the shortest-paths under normal conditions. However, in the presence of congestion and resource failures, the traffic can be dispersed temporarily to alternative paths without routing table updating. Extensive simulation shows that the SPF-EE algorithm achieves greater throughput, higher responsiveness, better congestion control and fault tolerance, and substantially improves the performance of routing in a dynamic environment. %A S.E.~Kille %T Implementing The Directory %R RN/90/12 %D March 1990 %P %O %Y /cs/research/dsa/users/pink/steve/rn-90-12.psc:1314200390 %X This paper is being presented at the IEE Colloquium on `The Global Directory' in April 1990. .P The OSI Directory has recently been jointly standardised as CCITT X.500~Series Recommendations and ISO~9594. There are a range of closed and open applications for which this service is suitable. A natural application of the Directory is to provide a global service to look up information about people or OSI Applications: the majority of examples in the standard are formulated with respect to this sort of service. Many groups are working towards provision of this sort of service. This paper discusses a number of practical developments which are moving towards provision of an open and distributed directory.
Implementing the directory is discussed both in terms of implementing components, and integrating them together to implement a service. .P A prerequisite to any practical investigation or use of the directory is the availability of implementations of DUAs and DSAs. Two implementations of the directory, and their key features are described. Then, the application of these systems into two directory pilots is described. Finally, some of the problems and implications of this work are considered. %A S.R.~Wilbur %T The Challenges of Distributed Computing %R RN/90/13 %D November 1989 %P 25 %O %Y %X Inaugural lecture by Professor Stephen Wilbur given on 30th November 1989. %A S.E.~Kille %T Replication in the OSI Directory %R RN/90/18 %D April 1990 %O LaTeX %Y %P 6 %X This note is a position paper, submitted for the IEEE Workshop on the Management of Replicated Data. .P The OSI Directory is intended to provide a large and highly distributed database. This paper gives a very brief summary of the OSI Directory, which does not have {\it standardised support\/} for replication. The general requirements for replication in the OSI Directory are then described. Implementations of the OSI Directory have added in non-standard support for replication. One such implementation is described, and experience in deploying it in pilot exercises is discussed. Developments in the OSI Standards bodies to support replication are considered, and the future of the work in this area is discussed. %A A.N.~Refenes %T Message passing via singly-buffered channels: an efficient and flexible communications control mechanism %R RN/90/19 %B EUROMICRO-90, 16th Int. Symposium on Microprogramming and Microprocessing, 27-30 August 1990, Amsterdam. 
%D March 1990 %O | tbl| pic | eqn | psroff -mm -rO0.9i -rW6.7i -rL11.4i -Ppsc %Y /cs/docs/research/researchlib/lists/rn9019 %P %X Message-passing via singly-buffered channels provides an efficient and safe mechanism for controlling the communication and synchronisation between concurrent processes. Message-passing via singly-buffered channels is a symmetric communications mechanism that permits an arbitrary number of processes to be synchronised in one of three complementary ways: a common handshake, a singly-buffered receive action, or a singly-buffered send action. This is a generalisation of the usual approach employed in languages like CSP and Ada, in which communication is asymmetric and restricted to involve only two processes. A formal description of the mechanism is given and a generic implementation strategy is described. %A Tom Casey %A Richard Beckwith %A Gordon Jameson %A Yael Pinto %A Bill Tuck %T Interactive Video using Broadcast Satellite and Terrestrial Return Paths %R RN/90/21 %D April 1990 %P 13 %O latex %Y ~tcasey/doc/eact90 %K Video Broadcast,Image Server, Teletext, Laser Video Recorder %X The paper describes the design and implementation of a prototype for an interactive video network using broadcast satellite and terrestrial return paths. The prototype is being developed and tested on the interactive network of the University of London. The system will then be scaled up to include video transmission on the Olympus satellite with data links between the receiver and the head-end sites. %A Tom Casey %A Richard Beckwith %A Gordon Jameson %A Bill Tuck %T The Communications Infrastructure for Interactive Video using DBS and ISDN %R RN/90/22 %D April 1990 %P 11 %O latex %Y ~tcasey/doc/delta_b %K DBS, ISDN, Image, Server, Access, Protocol %X This paper describes the design and implementation of a prototype for an Interactive video network using DBS and ISDN return paths.
The prototype is being developed and tested on the interactive video network of the University of London, and in the Computer Science Communications Laboratory of University College London using a pair of British Telecom IDA connections. The system is also being scaled up to include video transmission on the Olympus satellite with data links between the receiver and the head-end sites. %A Michael Roe %A Tom Casey %T Integrating Cryptography in the Trusted Computing Base %R RN/90/25 %D May 1990 %P 8 %O latex %Y ~tcasey/doc/streams %K TCB, STREAMS, cryptography, validation %X Secure distributed systems are not easily constructed, as they combine mechanisms based on very different theories of security (encryption and reference monitors). We show how these mechanisms may be integrated via UNIX STREAMS. Examples are given of how this architecture can support existing security protocols, and it is shown why it is consistent with the Bell-LaPadula and Biba information-flow models. %T On extended attribute grammars %A N.P.~Chapman %R RN/90/27 %D June 1990 %O %Y %P %X We present a new formal definition of Madsen and Watt's extended attribute grammars (EAGs), in an attempt to overcome certain perceived technical deficiencies of their original definition. The new definition is purely syntactical. We show how, by including an algebraic specification of the types of attribute values, it is possible to characterize the class of well-formed EAGs using the equations of the specification. We show that the necessary conditions are satisfied by several useful attribute types. %T Using the OSI Directory to achieve User Friendly Naming %A S.E.~Kille %R RN/90/29 %D June 1990 %O %Y %P %X The OSI Directory has user friendly naming as a goal. A simple minded usage of the directory does not achieve this. 
Two aspects not achieved are: \bulletlist \point A user oriented notation \point Guessability \endlist This proposal sets out some conventions for representing names in a friendly manner, and shows how this can be used to achieve really friendly naming. From Mills@udel.edu Sun Jun 17 17:55:15 1990 Posted-Date: Mon, 18 Jun 90 0:47:47 GMT Received-Date: Sun, 17 Jun 90 17:55:15 -0700 Received: from huey.udel.edu by venera.isi.edu (5.61/5.61+local) id ; Sun, 17 Jun 90 17:55:15 -0700 Date: Mon, 18 Jun 90 0:47:47 GMT From: Mills@udel.edu To: Jon Crowcroft Cc: Mills@louie.udel.edu, end2end Subject: Re: [Erik E. Fair: Re: Peer selection] Message-Id: <9006172047.aa06625@huey.udel.edu> Jon, Not quite. Using the control message and one known peer address, an NTP chimer can poke the peer's tables and find out which associations are active and thus their IP addresses. However, a chimer keeps only those peers at equal or lower stratum as persistent associations, and you get only what happens to be in its tables. However, you can walk the subnet from there to the root recursively. Since all the root (primary) servers chime with all the others, it is possible (at present) to winkle all the primary servers in this way. Now, an IP multicast hop-bounded search would find nearby servers presumably independent of subnet hierarchy, which would also be useful; however, it is unclear what impact 2000 chimers would have on the gateway tables. Perhaps the most urgent need is for campus servers to find a reliable set of stratum-2 friends right now and unload the clamoring horde at the primary servers. This could be done with maybe only 20-50 in the multicast group and may well be worth trying.
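The recursive walk toward the root described above might look roughly like the following. It is a sketch only: the peer tables are modelled as an in-memory dictionary rather than fetched with NTP control messages, and the function name and all server names are hypothetical.

```python
from collections import deque


def find_primaries(start, peer_table):
    """Walk peer associations from `start` and collect stratum-1 servers.

    peer_table maps a server name to (stratum, [names of its peers]);
    it stands in for what a control-message query would return.
    """
    seen = {start}
    queue = deque([start])
    primaries = set()
    while queue:
        server = queue.popleft()
        stratum, peers = peer_table[server]
        if stratum == 1:
            primaries.add(server)
        for peer in peers:
            # Follow each association once; skip servers we cannot query.
            if peer not in seen and peer in peer_table:
                seen.add(peer)
                queue.append(peer)
    return sorted(primaries)


if __name__ == "__main__":
    table = {
        "campus": (3, ["s2a"]),
        "s2a": (2, ["p1", "s2b"]),
        "s2b": (2, ["p2"]),
        "p1": (1, ["p2", "p3"]),
        "p2": (1, ["p1", "p3"]),
        "p3": (1, ["p1", "p2"]),
    }
    print(find_primaries("campus", table))
```

Because each server lists only peers at equal or lower stratum, the walk can only move sideways or toward the root, and because the primaries all peer with one another, reaching one of them is enough to discover the rest.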
Dave

From S.Kille@cs.ucl.ac.uk Mon Jun 18 03:15:17 1990
Posted-Date: Mon, 18 Jun 90 10:59:23 +0100
Received-Date: Mon, 18 Jun 90 03:15:17 -0700
Received: from nsfnet-relay.ac.uk by venera.isi.edu (5.61/5.61+local) id ; Mon, 18 Jun 90 03:15:17 -0700
Received: from cs.ucl.ac.uk by vax.NSFnet-Relay.AC.UK via NSFnet with SMTP id aa07695; 18 Jun 90 10:56 BST
Received: from bells.cs.ucl.ac.uk by vs6.Cs.Ucl.AC.UK via Ethernet with SMTP id aa02835; 18 Jun 90 10:59 WET DST
Received: from glenlivet.cs.ucl.ac.uk by bells.cs.ucl.ac.uk with SMTP inbound id ; Mon, 18 Jun 1990 10:59:27 +0100
To: Mills@louie.udel.edu
Cc: end2end
Subject: Re: [Erik E. Fair: Re: Peer selection]
Phone: +44-71-380-7294
In-Reply-To: Your message of Fri, 15 Jun 90 14:04:31 +0000. <9006151004.aa19816@huey.udel.edu>
Date: Mon, 18 Jun 90 10:59:23 +0100
Message-Id: <811.645703163@UK.AC.UCL.CS>
From: Steve Kille

Dave,

Erik is clearly addressing some very real requirements. The way in which he tries to use DNS to achieve this seems to me to show that DNS does not provide all the services you need. Clearly it could be extended.

I'd like to propose that you use the X.500 Directory for your NTP configuration. Here is a thumbnail sketch of the sort of thing you might do. Clearly there are a lot of variants.

Define a new Object Class:

    NTPServer OBJECT-CLASS
        SUBCLASS OF ApplicationEntity
        MUST CONTAIN NTPLevel Peers PeerPolicy
        MAY CONTAIN -- Other Quality Attributes

This defines a server type, which would be named in a ``natural'' manner, probably in the context of an organisation:

    Master Time Server, Computer Lab, University of Delaware, US

It is a subclass of ApplicationEntity (presumably some internet variant), which would define the basic attributes for such objects (common name for naming, addressing information, etc).

Then there are a set of attributes.
These might be defined as follows:

    NTPLevel ATTRIBUTE
        WITH SYNTAX INTEGER (1..ub-max-ntp-levels)
        MATCHES FOR EQUALITY, ORDERING
        SINGLE VALUE

This defines the level of server.

    Peers ATTRIBUTE
        WITH SYNTAX distinguishedNameSyntax
        MULTI VALUE

This defines a list of other NTP servers which the defined server should peer with. This allows for managers to define a basic configuration. This seems appropriate for the major servers.

    PeerPolicy ATTRIBUTE
        WITH SYNTAX CHOICE {
            explicit-peer [0] DistinguishedName,
            subtree [1] DistinguishedName,
            group [2] DistinguishedName }
        MULTI VALUE

This allows the manager to define who is allowed to peer with the server. This can be:

- explicit peers. So that a static configuration can be clearly defined.
- subtree. Typically to allow any server within an organisation to peer
- group. This would allow a defined list (e.g., US National NTP Servers) to be identified by name of group

In addition, there could be various quality of service attributes (availability, characteristics of server, etc...).

NTP servers can be identified in several ways:

- explicitly by name
- lookup by approximate name (white pages searching)
- searching in a context (locally) on the basis of associated attributes. This would be a sort of yellow page search, to identify a convenient local server with the right characteristics.

This approach allows for:

1. Definition of a configuration of major servers
2. Local (unregistered) servers to identify local servers to peer with.
3. User identification of a local server

I believe that this is the sort of framework that is needed. DNS could be extended to do all of this. However, I would encourage you to think about using X.500.

Anyone interested?
Steve From Mills@udel.edu Mon Jun 18 05:27:33 1990 Posted-Date: Mon, 18 Jun 90 12:17:27 GMT Received-Date: Mon, 18 Jun 90 05:27:33 -0700 Received: from huey.udel.edu by venera.isi.edu (5.61/5.61+local) id ; Mon, 18 Jun 90 05:27:33 -0700 Date: Mon, 18 Jun 90 12:17:27 GMT From: Mills@udel.edu To: Steve Kille Cc: Mills@louie.udel.edu, end2end Subject: Re: [Erik E. Fair: Re: Peer selection] Message-Id: <9006180817.aa09410@huey.udel.edu> Steve, Sure, I would absolutely love the idea of automating the peer-discovery process. I keep no particular axe grinder on which technology to use and whether it makes use of DNS, X.500, IP multicasting, etc., but I do not have the personal bandwidth to prosecute a specific agenda to use any particular one. I would rather this be viewed in a wider context of which NTP is only one application. It would be Real Neat if some warm body were prepared to wield blunt instrument to xntp, for example, and I would cheer/assist in doing that. However, all my wbs are presently navel-deep in other projects right now. Meanwhile, I have two problems with your suggested database structure. First, many of the performance metrics you might want to know, such as stratum, dispersion, etc., are dynamic quantities unsuitable for a generally static directory database. It is probably justifiable to assign an operational class to any particular server which may have relevance to peer-selection strategy and which would have more permanence; however, I am more inclined towards a real-time distributed algorithm which can construct such a strategy based on the current status and recent history of performance. Such an algorithm represents in principle a system management function in its own right. The second problem I have has nothing to do with any particular technology, but with the collection and maintenance of the database itself. 
I don't think a technology that requires specific maintenance of NTP information distinct from general housekeeping chores (like DNS maintenance) will as a practical matter succeed, unless there is general utility added that is absolutely irresistible for other reasons. In other words, if somehow we absolutely had to have X.500 for other reasons and NTP came along for the ride, then so be it. Otherwise, heck, nobody hardly ever even tells me anything to park in clock.txt without me knocking on their door.

Dave

From S.Kille@cs.ucl.ac.uk Mon Jun 18 06:13:26 1990
Posted-Date: Mon, 18 Jun 90 13:54:07 +0100
Received-Date: Mon, 18 Jun 90 06:13:26 -0700
Received: from nsfnet-relay.ac.uk by venera.isi.edu (5.61/5.61+local) id ; Mon, 18 Jun 90 06:13:26 -0700
Received: from cs.ucl.ac.uk by vax.NSFnet-Relay.AC.UK via NSFnet with SMTP id aa14741; 18 Jun 90 13:49 BST
Received: from bells.cs.ucl.ac.uk by vs6.Cs.Ucl.AC.UK via Ethernet with SMTP id aa04305; 18 Jun 90 13:54 WET DST
Received: from glenlivet.cs.ucl.ac.uk by bells.cs.ucl.ac.uk with SMTP inbound id ; Mon, 18 Jun 1990 13:54:16 +0100
To: Mills@louie.udel.edu
Cc: end2end
Subject: Re: [Erik E. Fair: Re: Peer selection]
Phone: +44-71-380-7294
In-Reply-To: Your message of Mon, 18 Jun 90 12:17:27 +0000. <9006180817.aa09410@huey.udel.edu>
Date: Mon, 18 Jun 90 13:54:07 +0100
Message-Id: <1204.645713647@UK.AC.UCL.CS>
From: Steve Kille

Dave,

Wider context: I agree. From my standpoint, I would like to see the Directory used for a lot of things, and we need to broaden. White Pages is the preferred initial application. Other applications which seem likely:

- support of X.400
- support of other applications (file transfer etc..)
- X.519 authentication
- library/bibliographic experiment
- Replacement of DNS. experimental for now, but is a possible longer term service goal

I think that NTP support might be a rather neat choice for an experimental directory application.
- It is a serious service, but does not have the operational requirements of something such as mail. Viz, it is still reasonably contained.
- The directory needs synchronised clocks (symbiosis)
- It can make quite a bit of fancy use of the directory

Dynamic data: Dynamic is relative. Some things need to be dealt with in NTP itself. Other things need to be dealt with by talking directly to the NTP servers by a management protocol (SNMP seems the best bet). Other information is better stored in a general purpose system. The directory is primarily targeted at relatively stable things. However, there is no fundamental reason for not having things which change moderately fast (e.g., minutes, or perhaps even seconds). I think that NTP needs all three mechanisms, and there need to be engineering tradeoffs as to which is used for which type of data.

Directory Maintenance: I agree that NTP should use a general purpose directory, which is essentially being maintained for other reasons. DNS is the current operational choice. However, I believe that X.500 infrastructure on the Internet is now only a question of WHEN, not IF. I think that the X.500 services better meet the requirements of NTP. The choice of Directory for NTP is political/strategic as well as technical.

Steve

From Mills@udel.edu Fri Jun 29 18:44:37 1990
Posted-Date: Sat, 30 Jun 90 1:40:31 GMT
Received-Date: Fri, 29 Jun 90 18:44:37 -0700
Received: from huey.udel.edu by venera.isi.edu (5.61/5.61+local) id ; Fri, 29 Jun 90 18:44:37 -0700
Date: Sat, 30 Jun 90 1:40:31 GMT
From: Mills@udel.edu
To: end2end
Subject: [uucp: Warning From uucp]
Message-Id: <9006292140.aa27860@huey.udel.edu>

(wondrous gasp)

You may fondly remember the wonderful nonsense spewed by Communication Satellite on old mit-multics machine. Even if the dude that sent this doesn't come close nonsensewise, at least he tries.
Dave ----- Forwarded message # 1: Received: from louie.udel.edu by huey.udel.edu id aa21530; 29 Jun 90 8:18 EDT Received: from att-in.att.com by louie.udel.edu id aa07456; 29 Jun 90 8:17 EDT From: uucp@att-in.att.com Date: Fri, 29 Jun 90 08:09 EDT To: Mills@udel.edu Subject: Warning From uucp Message-ID: <9006290817.aa07456@louie.udel.edu> We have been unable to contact machine 'tds386e' since you queued your job. The following file have not been delivered. tds386e!mail tds (Date 06/27) The job will be deleted in several days if the problem is not corrected. If you care to kill the job, execute the following command: Note: this command can only be executed on machine (att): uustat -ktds386eZ6a8a Sincerely, att!uucp ############################################# ##### Data File: ############################ From udel.edu!Mills Wed Jun 27 20:52:44 GMT 1990 remote from att Received: by att.att.com; Wed Jun 27 17:12:42 1990 Posted-Date: Wed, 27 Jun 90 20:52:44 GMT Received: from huey.udel.edu by venera.isi.edu (5.61/5.61+local) id ; Wed, 27 Jun 90 14:00:38 -0700 Date: Wed, 27 Jun 90 20:52:44 GMT From: Mills@udel.edu To: braden@venera.isi.edu Cc: end2end-tf@venera.isi.edu Subject: Re: Next Meeting Message-Id: <9006271652.aa05229@huey.udel.edu> Bob, I'm not going to tell you about conflicts that may occur next semmester. I won't know about them until the semester begins. I suppose that means you will be MAD. Dave ----- End of forwarded messages From Mills@udel.edu Fri Jun 29 18:46:23 1990 Posted-Date: Sat, 30 Jun 90 1:42:11 GMT Received-Date: Fri, 29 Jun 90 18:46:23 -0700 Received: from huey.udel.edu by venera.isi.edu (5.61/5.61+local) id ; Fri, 29 Jun 90 18:46:23 -0700 Date: Sat, 30 Jun 90 1:42:11 GMT From: Mills@udel.edu To: end2end Subject: More delights from att.com Message-Id: <9006292142.aa27868@huey.udel.edu> Folks, Getting closer, getting closer... 
Dave ----- Forwarded message # 1: Received: from louie.udel.edu by huey.udel.edu id ad21639; 29 Jun 90 8:28 EDT Received: from att-in.att.com by louie.udel.edu id aa07520; 29 Jun 90 8:28 EDT From: uucp@tds386e.att.com Date: Fri, 29 Jun 90 08:17 EDT To: Mills@udel.edu Message-ID: <9006290828.aa07520@louie.udel.edu> remote execution [uucp job tds386eA6a8d (6/29-8:17:04)] rmail tds exited with status 1 ===== stderr was ===== bad system name: honet9 uux failed ( 11 ) rmail: Return to att!udel.edu!Mills ----- End of forwarded messages From craig@NNSC.NSF.NET Tue Jul 10 11:33:03 1990 Posted-Date: Tue, 10 Jul 90 14:30:40 -0400 Received-Date: Tue, 10 Jul 90 11:33:03 -0700 Message-Id: <9007101833.AA26144@venera.isi.edu> Received: from WS6.NNSC.NSF.NET by venera.isi.edu (5.61/5.61+local) id ; Tue, 10 Jul 90 11:33:03 -0700 To: end2end-interest Subject: paper on ATM Segmentation and Reassembly Date: Tue, 10 Jul 90 14:30:40 -0400 From: Craig Partridge Hi folks: Last month I made available a rough-and-ready draft of a paper Julio Escobar and I wrote on an ATM SAR protocol. Thanks to comments from Steve Deering and considerable editorial work by Julio, we've got something very close to a polished version now ready. Unfortunately, I've only got it in hardcopy (due to a diagram that we neglected to prepare with our text formatter). If you'd like a copy send me a note. The abstract is appended. Craig Segmentation and reassembly of protocol data units into and from fixed-size cells in an Asynchronous Transfer Mode network is carried out by the Adaptation layer of the network using Segmentation and Reassembly protocols. We have developed an experimental Segmentation and Reassembly protocol to be used with all desired Asynchronous Transfer Mode services. The use of a single protocol for all services simplifies implementation and interoperability. 
Among its main characteristics, the protocol provides cell-based error correction and detection, a cell sequence number modulo 1024 to provide cell sequence integrity, and the ability for applications to insert control cells in the Asynchronous Transfer Mode cell stream. The protocol requires a 3-octet trailer.

From braden Mon Aug 27 12:38:32 1990
Received-Date: Mon, 27 Aug 90 12:38:32 -0700
Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Mon, 27 Aug 90 12:38:32 -0700
Date: Mon, 27 Aug 90 12:38:23 PDT
From: braden
Posted-Date: Mon, 27 Aug 90 12:38:23 PDT
Message-Id: <9008271938.AA00579@braden.isi.edu>
Received: by braden.isi.edu (4.1/4.0.3-4) id ; Mon, 27 Aug 90 12:38:23 PDT
To: end2end-interest, lixia@parc.xerox.com, van@helios.ee.lbl.gov
Subject: Draft RFC on TCP Timestamps
Cc: braden

Hi.

Some months ago Lixia and I undertook to write an RFC containing Van's proposal for using the TCP timestamp option to solve the problem of high speed operation. I believed it to be an important idea that deserved to be written up and considered for standardization, so as E2E chair I undertook the writing task with Lixia's help.

After a long delay, here is a fairly advanced draft; comments are welcome.

Bob Braden

Van and Lixia: I think this meets all objections you made to the last draft.

Network Working Group                                        V. Jacobson
Request for Comments: 11XX                                           LBL
                                                               R. Braden
                                                                     ISI
                                                                L. Zhang
                                                                    PARC
                                                         August 28, 1990

                 TCP Reliability over High-Speed Paths

                             *** DRAFT ***

Status of This Memo

   This memo describes an experimental protocol extension to TCP.

Summary

   This memo describes a small extension to TCP to support reliable operation over very high-speed paths, using the TCP timestamp echo option proposed in RFC-1072.

1. INTRODUCTION

   TCP uses positive acknowledgments and retransmissions to provide reliable end-to-end delivery over a full-duplex virtual circuit or connection [Postel81].
   A connection is defined by its two end points; each end point is a "socket", i.e., a (host,port) pair. To protect against data corruption, TCP uses an end-to-end checksum. Duplication and reordering are handled using a fine-grained sequence number space, with each octet receiving a distinct sequence number.

   The TCP protocol [Postel81] was designed to operate reliably over almost any transmission medium regardless of transmission rate, delay, corruption, duplication, or reordering of segments. In practice, proper TCP implementations have demonstrated remarkable robustness in adapting to a wide range of network characteristics. For example, TCP implementations can adapt well in the range from about 100 to 10**7 bits per second transfer rate, and 1 ms to 100 seconds delay.

   However, the introduction of fiber optics is resulting in ever-higher transmission speeds, and the fastest paths are moving out of the domain for which TCP was originally engineered. This memo and RFC-1072 [Jacobson88] propose modest extensions to TCP to extend the domain of its application to higher speeds.

   There is no one-line answer to the question: "How fast can TCP go?".

Jacobson, Braden, & Zhang                                       [Page 1]

***DRAFT***             TCP over High-Speed Paths

   The issues are reliability and efficiency, and these depend upon the round-trip delay and the maximum time that segments may be queued in the Internet, as well as upon the transmission speed. We must think through these relationships very carefully if we are to successfully extend TCP's domain.

   TCP efficiency depends not upon the transfer rate itself, but rather upon the product of the transfer rate and the round-trip delay. This "bandwidth*delay product" measures the amount of data that would "fill the pipe" and is the buffer space required at sender and receiver to obtain maximum throughput on the TCP connection over the path.
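   As a rough illustration of the bandwidth*delay product just described, the following sketch computes the number of bytes needed to "fill the pipe". The function name and the sample paths are assumptions chosen for illustration; they do not come from the memo.

```python
def bandwidth_delay_product(bits_per_sec, rtt_secs):
    """Return the bytes in flight needed to fill the pipe.

    This is (transfer rate in bits/sec * round-trip delay in secs) / 8,
    i.e. the buffer space needed at each end for maximum throughput.
    """
    return bits_per_sec * rtt_secs / 8.0


# Hypothetical sample paths (illustrative values, not from the memo).
SAMPLE_PATHS = [
    ("DS1, 60 ms RTT", 1.5e6, 0.060),
    ("DS3, 60 ms RTT", 45e6, 0.060),
    ("1 Gbps, 60 ms RTT", 1e9, 0.060),
]

for name, bps, rtt in SAMPLE_PATHS:
    bdp = bandwidth_delay_product(bps, rtt)
    print(f"{name:18s} needs about {bdp:,.0f} bytes of window")
```

   Paths whose product exceeds the 64KB limit of the 16-bit window field are the "long fat" cases that window extensions are meant to address.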
   RFC-1072 proposed a set of TCP extensions to improve TCP efficiency for "LFNs" (long fat networks), i.e., networks with large bandwidth*delay products.

   On the other hand, high transfer rate alone can threaten TCP reliability, by violating the assumptions behind the TCP mechanism for duplicate detection and sequencing. The present memo specifies a solution for this problem that will extend TCP reliability well beyond the upper limit of foreseeable bandwidths.

   An especially serious kind of error would result from an accidental reuse of TCP sequence numbers. Suppose that an "old duplicate segment", i.e., a duplicate data segment that was delayed in Internet queues, was delivered to the receiver at the wrong moment so that its sequence numbers fell somewhere within the current window. There would be no checksum failure to warn of the error, and the result could be an undetected corruption of the data. Reception of an old duplicate ACK segment at the transmitter could be only slightly less serious: it is likely to lock up the connection so that no further progress can be made, so a RST will be required to resynchronize the two ends.

   Duplication of sequence numbers might happen in either of two ways:

   (1) Sequence number wrap-around on the current connection

       A TCP sequence number is limited to 32 bits. At a high enough transfer rate, the 32-bit sequence space may be "wrapped" (cycled) within the time that a segment may be delayed in queues. Section 2 discusses this case and proposes a mechanism to reject old duplicates on the current connection.

   (2) Segment from an earlier connection incarnation

       Suppose a connection terminates, either by a proper close sequence or due to a host crash, and the same connection (i.e., connecting the same pair of sockets) is immediately reopened.
       A delayed segment from the terminated connection could fall within the current window for the new incarnation and be accepted as valid. This case is discussed in Section 3.

   TCP reliability depends upon the existence of a bound on the lifetime of a segment: the "Maximum Segment Lifetime" or MSL. An MSL is generally required by any reliable transport protocol, since every sequence number field must be finite and therefore must eventually be reused. In the Internet protocol suite, the MSL bound is enforced by an IP-layer mechanism, the "Time-to-Live" or TTL field.

2. SEQUENCE NUMBER WRAP-AROUND

   2.1 Background

      Avoiding reuse of sequence numbers within the same connection is simple in principle: enforce a segment lifetime shorter than the time it takes to cycle the sequence space. More specifically, if the maximum effective bandwidth at which TCP is able to transmit over a particular path is B bytes per second, then the following constraint must be satisfied for error-free operation:

          2**31 / B > MSL (secs)                                     [1]

      The following table shows the value for Twrap = 2**31/B in seconds, for some important values of the bandwidth B:

          Network     B*8         B           Twrap
                      bits/sec    bytes/sec   secs
          _______     ________    _________   ______

          ARPANET     56kbps      7KBps       3*10**5 (~3.6 days)
          DS1         1.5Mbps     190KBps     10**4 (~3 hours)
          Ethernet    10Mbps      1.25MBps    1700 (~30 mins)
          DS3         45Mbps      5.6MBps     380
          FDDI        100Mbps     12.5MBps    170
          Gigabit     1Gbps       125MBps     17

      It is clear why wrap-around of the sequence space was not a problem for 56kbps packet switching or even 10Mbps Ethernets. On the other hand, at DS3 and FDDI speeds, Twrap is comparable to the 2 minute MSL assumed by the TCP specification [Postel81]. Moving towards gigabit speeds, Twrap becomes too small for reliable enforcement by the TTL mechanism.
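      The Twrap column above follows directly from constraint [1]. As a minimal sketch, the values can be recomputed as shown below; the bandwidth figures and the 2 minute MSL are taken from the text, while the function and variable names are illustrative assumptions.

```python
SEQ_HALF_SPACE = 2 ** 31   # bytes; numerator of constraint [1]
MSL_SECS = 120             # the 2 minute MSL assumed by the TCP specification


def twrap_secs(bytes_per_sec):
    """Seconds needed to send 2**31 bytes at B bytes/sec (Twrap)."""
    return SEQ_HALF_SPACE / bytes_per_sec


# Bandwidth B in bytes/sec, as listed in the table.
NETWORKS = [
    ("ARPANET", 7e3),
    ("DS1", 190e3),
    ("Ethernet", 1.25e6),
    ("DS3", 5.6e6),
    ("FDDI", 12.5e6),
    ("Gigabit", 125e6),
]

for name, b in NETWORKS:
    t = twrap_secs(b)
    # Constraint [1] requires Twrap > MSL; the ratio shows the margin.
    print(f"{name:9s} Twrap = {t:>10,.0f} secs  ({t / MSL_SECS:,.1f} x MSL)")
```

      A ratio near or below 1 in the last column marks a path on which constraint [1] cannot be met by a 2 minute MSL.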
      McKenzie has pointed out [McKenzie89] that the 16-bit window field of TCP limits the effective bandwidth B to 2**16/RTT, where RTT is the round-trip time in seconds. If the RTT is large enough, this limits B to a value that meets the constraint [1] for a large MSL value. For example, consider a transcontinental backbone with an RTT of 60ms (basically set by the laws of physics). With the bandwidth*delay product limited to 64KB by the TCP window size, B is then limited to 1.1MBps, no matter how high the theoretical transfer rate of the path. This corresponds to cycling the sequence number space in Twrap = 2000 secs, which is safe in today's Internet.

      Based on this reasoning, McKenzie cautions that expanding the TCP window space as proposed in RFC-1072 will lead to sequence wrap-around and hence to possible data corruption. We believe that this is mis-identifying the culprit, which is not the larger window but rather the high bandwidth.

         For example, consider a (very large) FDDI LAN with a diameter of 10km. Using the speed of light, we can compute the RTT across the ring as (2*10**4)/(3*10**8) = 67 microseconds, and the delay*bandwidth product is then about 6.7 Kbits, or 833 bytes. A TCP connection across this LAN using a window of only 833 bytes will run at the full 100Mbps and wrap the sequence space in only 170 seconds.

      Thus, high speed alone can cause a reliability problem with sequence number wrap-around, even without extended windows.

      An "obvious" fix for the problem of cycling the sequence space is to increase the size of the TCP sequence number field. For example, the sequence number field (and also the acknowledgment field) could be expanded to 64 bits. However, the proposals for making such a change while maintaining compatibility with current TCP have tended towards complexity and ugliness.

      We propose an alternative solution to the problem, using the timestamp echo option defined in RFC-1072. This option was originally defined to help the sender to measure RTT accurately.
      We now propose using this same timestamp option at the receiver to reject old duplicate segments.

      The following two sections describe these two different applications of the TCP timestamp option.

   2.2 Measuring Round-Trip Time

      The echo option of RFC-1072 was designed to solve the following problem: almost all TCP implementations base their RTT measurements on a sample of only one packet per window. If we look at RTT estimation as a signal processing problem (which it is), there is a data signal at some frequency (the packet rate) that is sampled at a lower frequency (the window rate). Unfortunately, this lower sampling frequency violates Nyquist's criterion [***ref.***], introducing "aliasing" artifacts into the estimated RTT.*

      A good RTT estimator with a conservative retransmission timeout calculation can tolerate the aliasing when the sampling frequency is "close" to the data frequency. For example, with a window of 8 packets, the sample rate is 1/8 the data frequency -- less than an order of magnitude different. However, when the window is tens or hundreds of packets, the RTT estimator may be seriously in error, resulting in spurious retransmissions.**

      A solution to the aliasing problem that actually simplifies the sender substantially (since the RTT code is typically the single biggest protocol cost for TCP) is to have the sender put a timestamp in each segment and have the receiver reflect that timestamp back in the ACK segment. Then a single subtract gives the sender an accurate RTT measurement for every ACK segment (corresponding to every other data segment with a sensible receiver). RFC-1072 defined a timestamp echo option for this purpose.

      It is vitally important to use the timestamp echo option with big windows; otherwise, the door is opened to some dangerous instabilities due to aliasing. Furthermore, the option is probably useful for all TCP's, since it simplifies the sender.
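      The "single subtract" at the sender can be sketched as follows. This is a minimal illustration only: the class and method names are hypothetical, the clock is supplied by the caller as an integer tick count, and the 32-bit mask is an assumption made here so that the subtraction survives a wrap of the timestamp value.

```python
TS_MASK = 0xFFFFFFFF  # assumed 32-bit timestamp field


class EchoRttSampler:
    """Sender-side RTT sampling from echoed timestamps (illustrative)."""

    def __init__(self):
        self.samples = []

    def stamp(self, now_ticks):
        # Value the sender would place in the outgoing segment.
        return now_ticks & TS_MASK

    def on_ack(self, echoed_ticks, now_ticks):
        # One subtraction per ACK yields an RTT sample, in clock ticks.
        rtt = (now_ticks - echoed_ticks) & TS_MASK
        self.samples.append(rtt)
        return rtt


sampler = EchoRttSampler()
ts = sampler.stamp(1000)          # segment sent at tick 1000
rtt = sampler.on_ack(ts, 1060)    # its echo comes back at tick 1060
print("RTT sample in ticks:", rtt)
```

      Because every ACK carries an echo, the sender obtains a sample per ACK rather than one per window, which is what removes the aliasing described above.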
      _________________________
      *VJ has (slightly pathological) Arpanet data showing that aliasing frequently resulted in a 70% underestimate of the average RTT.
      **VJ notes: Because of the huge "stored energy" in a link that requires a large window, these spurious retransmits are deadly -- they correspond exactly to a feedback control system with a loop gain >1, and it's very difficult to keep such a system stable.

   2.3 Avoiding Old Duplicate Segments

      This section describes the application of the timestamp echo option defined in RFC-1072 to prevent data corruption caused by sequence number wrap-around.

      2.3.1 Basic Algorithm

         Assume that every received TCP segment contains a timestamp. The basic idea is that a segment received with a timestamp that is earlier than the timestamp of the most recently accepted segment can be discarded as an old duplicate. More specifically, the following processing is to be performed on normal incoming segments:

         R1)  If the timestamp in the arriving segment is less than the timestamp of the most recently received in-sequence segment, treat the arriving segment as not acceptable:

                 If SEG.LEN > 0, send an acknowledgement in reply as specified in RFC-793 page 69, and drop the segment; otherwise, just silently drop the segment.*

         R2)  If the segment is outside the window, reject it (normal TCP processing)

         R3)  If an arriving segment is in-sequence (i.e., at the left window edge), accept it normally and record its timestamp.

         R4)  Otherwise, treat the segment as a normal in-window, out-of-sequence TCP segment (e.g., queue it for reassembly).

         Steps R2-R4 are the normal TCP processing steps specified by RFC-793, except that in R3 the latest timestamp is set from each in-sequence segment that is accepted. Thus, the latest timestamp recorded at the receiver corresponds to the left edge of the window and only advances when the left edge moves [Jacobson88].
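         Rules R1-R4 can be sketched as a small receive routine. This is a simplified illustration under stated assumptions, not an implementation of the memo: all names are hypothetical, sequence numbers and timestamps are compared as plain integers (a real TCP would use 32-bit modular comparisons), and the reassembly queue is not drained when a hole is filled.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    seq: int      # sequence number of first octet
    length: int   # SEG.LEN
    ts: int       # timestamp carried in the segment


@dataclass
class Receiver:
    rcv_nxt: int              # left window edge
    rcv_wnd: int              # window size
    ts_recent: int = 0        # timestamp of latest in-sequence segment
    queued: dict = field(default_factory=dict)

    def receive(self, seg):
        # R1: older timestamp than the latest in-sequence segment.
        if seg.ts < self.ts_recent:
            return "ack-and-drop" if seg.length > 0 else "drop"
        # R2: outside the window, normal TCP rejection.
        if not (self.rcv_nxt <= seg.seq < self.rcv_nxt + self.rcv_wnd):
            return "reject"
        # R3: in-sequence, accept and record its timestamp.
        if seg.seq == self.rcv_nxt:
            self.ts_recent = seg.ts
            self.rcv_nxt += seg.length
            return "accept"
        # R4: in-window but out-of-sequence, queue for reassembly.
        self.queued[seg.seq] = seg
        return "queue"


rx = Receiver(rcv_nxt=100, rcv_wnd=1000, ts_recent=5)
print(rx.receive(Segment(seq=200, length=10, ts=6)))  # queued, ts_recent unchanged
print(rx.receive(Segment(seq=100, length=10, ts=4)))  # old timestamp, dropped with ACK
print(rx.receive(Segment(seq=100, length=10, ts=6)))  # in-sequence, accepted
```

         Note that queuing a segment under R4 leaves ts_recent untouched; only the R3 branch advances it, matching the statement that the recorded timestamp tracks the left window edge.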
         _________________________
         *Sending an ACK segment in reply is not strictly necessary, since the case can only arise when a later in-order segment has already been received. However, for consistency and simplicity, we suggest treating a timestamp failure the same way TCP treats any other unacceptable segment.

         It is important to note that the timestamp is checked only when a segment first arrives at the receiver, regardless of whether it is in-sequence or is queued for reassembly. Consider the following example.

            Suppose the segment sequence: A.1, B.1, C.1, ..., Z.1 has been sent, where the letter indicates the sequence number and the digit represents the timestamp. Suppose also that segment B.1 has been lost. The highest in-sequence timestamp is 1 (from A.1), so C.1, ..., Z.1 are considered acceptable and are queued for reassembly. When B is retransmitted as segment B.2 (using the latest timestamp), it fills the hole and causes all the segments through Z to be acknowledged and passed to the user. The timestamps of the queued segments are *not* inspected again at this time, since they have already been accepted. When B.2 is accepted, the receiver's current timestamp is set to 2.

         This rule is vital to allow reasonable performance under loss. A full window of data is in transit at all times, and after a loss a full window less one packet will show up out-of-sequence to be queued at the receiver (e.g., up to ~2**30 bytes of data); the timestamp option must not result in discarding this data.

         In certain unlikely circumstances, the algorithm of rules R1-R4 could lead to discarding some segments unnecessarily, as shown in the following example:

            Suppose again that segments: A.1, B.1, C.1, ..., Z.1 have been sent in sequence and that segment B.1 has been lost. Furthermore, suppose delivery of some of C.1, ... Z.1 is delayed until AFTER the retransmission B.2 arrives at the receiver.
            These delayed segments will be discarded unnecessarily when they do arrive, since their timestamps are now out of date.

         This case is very unlikely to occur. If the retransmission was triggered by a timeout, some of the segments C.1, ... Z.1 must have been delayed longer than the RTO time. This is presumably an unlikely event, or there would be many spurious timeouts and retransmissions. If B's retransmission was triggered by the "fast retransmit" algorithm, i.e., by duplicate ACK's, then the queued segments that caused these ACK's must have been received already.

         Even if a segment was delayed past the RTO, the selective acknowledgment (SACK) facility of RFC-1072 will cause the delayed packets to be retransmitted at the same time as B.2, avoiding an extra RTT and therefore causing a very small performance penalty.

         We know of no case with a significant probability of occurrence in which timestamps will cause performance degradation by unnecessarily discarding segments.

      2.3.2 Header Prediction

         Jacobson has developed an implementation technique called "header prediction" [Jacobson90] that is applicable to any window-based transport protocol but is most important for high-speed links. This technique optimizes the code for the most common case: receiving a segment correctly and in order. Using header prediction, the receiver asks the question, "Is this segment the next in sequence?" This question can be answered in fewer machine instructions than the question, "Is this segment within the window?"

         Adding header prediction to our timestamp procedure leads to the following sequence for processing an arriving TCP segment:

         H1)  Check timestamp (same as step R1 above)

         H2)  Do header prediction: if segment is in-sequence and there are no special conditions requiring additional processing, accept the segment, record its timestamp, and skip H3.

         H3)  Process the segment normally, as specified in RFC-793.
              This includes dropping segments that are outside the window and possibly sending acknowledgments, and queueing in-window, out-of-sequence segments.

         However, the timestamp check in step H1 is very unlikely to fail, and it is a relatively expensive operation since it requires interval arithmetic on a finite field. To perform this check on every single segment seems like poor implementation engineering, defeating the purpose of header prediction. Therefore, we suggest that an implementor interchange H1 and H2, i.e., perform header prediction FIRST, performing H1 and H3 only if header prediction fails. We believe that this change might gain 5-10% in performance on high-speed networks.

         This reordering does raise a theoretical hazard: there is a possibility that a segment from 2**32 bytes in the past will arrive at exactly the wrong time and mistakenly be accepted by the header-prediction step. We make the following argument to show that the probability of this failure is negligible.

            If all segments are equally likely to show up as old duplicates, then the probability of an old duplicate exactly matching the left window edge is the maximum segment size (MSS) divided by the size of the sequence space. This ratio must be less than 2**-16, since MSS must be < 2**16; for example, it will be (2**12)/(2**32) = 2**-20 for an FDDI link. In fact, the older a segment is, the less likely it is to be retained in the Internet, and under any reasonable model of segment lifetime the probability of an old duplicate exactly at the left window edge must be much smaller than 2**-16.

            The 16 bit TCP checksum also allows a basic unreliability of one part in 2**16. A protocol mechanism whose reliability exceeds the reliability of the TCP checksum should be considered "good enough", i.e., it won't contribute significantly to the overall error rate.
We therefore believe we can ignore the problem of an old duplicate being accepted by doing header prediction before checking the timestamp.

2.3.3 Timestamp Frequency

It is important to understand that the receiver algorithm for timestamps does not involve clock synchronization with the sender. The sender's clock is used to stamp the segments, and the sender uses this fact to measure RTT's. However, the receiver treats the timestamp as simply a monotone-increasing serial number, without any necessary connection to its clock. From the receiver's viewpoint, the timestamp is acting as a logical extension of the high-order bits of the sequence number.

The receiver algorithm places the following requirements on the frequency of the timestamp "clock":

(a) Timestamp clock must not be "too slow". It must tick at least once for each 2**31 bytes sent. In fact, in order to be useful to the sender for round trip timing, the clock should tick at least once per window's worth of data, and even with the RFC-1072 window extension, 2**31 bytes must be at least two windows. To make this more quantitative, any clock faster than 1 tick/sec will reject old duplicate segments for link speeds of ~2 Gbps; a 1ms clock will work up to link speeds of 2 Tbps (10**12 bps!).

(b) Timestamp clock must not be "too fast". Its cycling time must be greater than MSL seconds. Since the clock (timestamp) is 32 bits and the worst-case MSL is 255 seconds, the maximum acceptable clock frequency is one tick every 59 ns.

However, since the sender is using the timestamp for RTT calculations, the timestamp doesn't need to have much more resolution than the granularity of the retransmit timer, e.g., tens or hundreds of milliseconds. Thus, both limits are easily satisfied with a reasonable clock rate in the range 1-100ms per tick.

Using the timestamp option relaxes the requirements on MSL for avoiding sequence number wrap-around.
For example, with a 1 ms timestamp clock, the 32-bit timestamp will wrap its sign bit in 25 days. Thus, it will reject old duplicates on the same connection as long as MSL is 25 days or less. This appears to be a very safe figure. If the timestamp has 10 ms resolution, the MSL requirement is boosted to 250 days. An MSL of 25 days or longer can probably be assumed by the gateway system without requiring precise MSL enforcement by the TTL value in the IP layer.

3. DUPLICATES FROM EARLIER INCARNATIONS OF CONNECTION

We now turn to the second potential cause of old duplicate packet errors, packets from an earlier incarnation of the same connection. Appendix A contains a review of the mechanisms currently included in TCP to handle this problem. These mechanisms depend in an essential manner upon an enforced maximum segment lifetime (MSL). However, unlike the case discussed in the previous section, the MSL required to prevent failures due to an earlier connection incarnation does not depend (directly) upon the transfer rate.

However, the timestamp option can provide additional security against old duplicates from earlier connections, and in some cases can allow the MSL bound to be relaxed.

There are two cases to be considered (see Appendix A for more explanation): (1) a system crashing (and losing connection state) and rebooting, and (2) the same connection being closed and reopened without a loss of host state.

3.1 System Crash with Loss of State

TCP's quiet time of 1*MSL upon system startup handles the loss of connection state in a system crash/restart. The required MSL here does not depend upon the transfer speed. The current TCP MSL of 2 minutes seems acceptable as an operational compromise, as many host systems take this long to boot after a crash.

However, the timestamp option may be used to ease the MSL requirements, or to provide additional security against data corruption.
What is required is to be able to guarantee that the first value of the sender's timestamp clock after a crash/restart will be greater than or equal to the last value before the event. This requires that the host clock be synchronized to a stable time source. For example, suppose that the clock is always re-synchronized to within N timestamp clock ticks and that booting takes more than N ticks (or there is a quiet time of N ticks). This will guarantee monotonicity of the timestamps, so old duplicates will be rejected without any assumption about MSL.

3.2 Closing and Reopening a Connection

According to the TCP specification, a delay of 2*MSL in TIME-WAIT state ties up a socket pair for 4 minutes. Applications built upon TCP that close one connection and open a new one (e.g., FTP using Stream mode) must choose a new socket pair each time. This delay serves two different purposes:

(a) Implement the full-duplex reliable close handshake of TCP. Note that for this purpose, the proper time to delay the final close step is only vaguely related to the MSL; it really depends upon the RTO for the FIN segments and upon the RTT of the path.

(b) Allow old duplicate segments to expire. If the TIME-WAIT delay plus the RTT together last at least one tick of the sender's timestamp clock, then the timestamp mechanism will prevent old duplicate segments without waiting 2*MSL.

From this, we conclude that the timestamp mechanism can be considered to provide additional security against old duplicates from earlier connection incarnations, and in some circumstances could be used to relax somewhat the requirement on the enforced maximum segment lifetime. However, some TIME-WAIT delay and MSL enforcement mechanism must be retained to provide the reliable close handshake of TCP.
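The clock-rate figures quoted in Section 2.3.3 and the one-tick condition of Section 3.2 are simple arithmetic, and can be checked with a short sketch. The Python below is only an illustration under stated assumptions: the function names are hypothetical and are not part of any TCP implementation, and the TIME-WAIT, RTT and tick values in the last call are made-up sample inputs.

```python
# Illustrative arithmetic only; all function names here are hypothetical.

SECONDS_PER_DAY = 86400


def sign_bit_wrap_days(tick_seconds):
    """Days for a 32-bit timestamp to advance by 2**31 ticks (sign-bit wrap)."""
    return (2 ** 31) * tick_seconds / SECONDS_PER_DAY


def fastest_tick_ns(msl_seconds=255):
    """Shortest tick, in ns, whose full 2**32-tick cycle still lasts MSL seconds."""
    return msl_seconds / 2 ** 32 * 1e9


def timestamp_covers_reopen(time_wait_seconds, rtt_seconds, tick_seconds):
    """Section 3.2 condition: TIME-WAIT delay plus RTT spans at least one tick."""
    return time_wait_seconds + rtt_seconds >= tick_seconds


print(round(sign_bit_wrap_days(0.001), 1))     # 1 ms clock: roughly 25 days
print(round(sign_bit_wrap_days(0.010), 1))     # 10 ms clock: roughly 250 days
print(round(fastest_tick_ns(), 1))             # roughly 59 ns for MSL = 255 s
print(timestamp_covers_reopen(0.5, 0.1, 0.1))  # sample values, assumed
```

The first three results agree with the rounded figures in the text (25 days, 250 days, 59 ns); the last call merely shows how the one-tick condition would be evaluated for a given clock rate.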
Finally, we note that TIME-WAIT state could cause an indirect performance problem if an application needed to repeatedly close one connection and open another at a very high frequency, since the number of available TCP ports on a host is less than 2**16. However, high network speeds are not the major contributor to this problem; the RTT is the limiting factor in how quickly connections can be opened and closed. Therefore, the problem will not be any worse at high transfer speeds.

4. CONCLUSIONS

We have presented a mechanism, using the TCP timestamp echo option of RFC-1072, that will allow very high TCP transfer rates without reliability problems due to old duplicate segments on the same connection. This mechanism also provides additional security against intrusion of old duplicates from earlier incarnations of the same connection, and can be used to reduce the quiet time when a system is rebooted.

REFERENCES

[Cerf76] Cerf, V., "TCP Resynchronization", Tech Note #79, Digital Systems Lab, Stanford, January 1976.

[Dalal74] Dalal, Y., "More on Selecting Sequence Numbers", INWG Protocol Note #4, October 1974.

[Garlick77] Garlick, L., R. Rom and J. B. Postel, "Issues in Reliable Host-to-Host Protocols", Proc. Second Berkeley Workshop on Distributed Data Management and Computer Networks, May 1977.

[Jacobson88] Jacobson, V. and Braden, R., "TCP Extensions for Long-Delay Paths", RFC-1072, October 1988.

[Jacobson90] Jacobson, V., "4BSD Header Prediction", ACM Computer Communication Review, April 1990.

[McKenzie89] McKenzie, A., "A Problem with the TCP Big Window Option", RFC-1110, August 1989.

[Postel81] Postel, J., "Transmission Control Protocol", RFC-793, September 1981.

[Tomlinson74] Tomlinson, R. S., "Selecting Sequence Numbers", INWG Protocol Note #2, September 1974.
APPENDIX A -- Protection against Old Duplicates in TCP

During the development of TCP, a great deal of effort was devoted to the problem of protecting a TCP connection from segments left from earlier incarnations of the same connection. Several different mechanisms were proposed for this purpose [Tomlinson74] [Dalal74] [Cerf76] [Garlick77].

The connection parameters that are required in this discussion are:

    Tc = Connection duration in seconds.
    Nc = Total number of bytes sent on connection.
    B  = Effective bandwidth of connection = Nc/Tc.

Tomlinson proposed a scheme with two parts: a clock-driven selection of ISN (Initial Sequence Number) for a connection, and a resynchronization procedure [Tomlinson74]. The clock-driven scheme chooses:

    ISN = (integer(R*t)) mod 2**32        [2]

where t is the current time relative to an arbitrary origin, and R is a constant. R was intended to be chosen so that ISN will advance faster than sequence numbers will be used up on the connection. However, at high speeds this will not be true; the consequences of this will be discussed below.

The clock-driven choice of ISN in formula [2] guarantees freedom from old duplicates matching a reopened connection if the original connection was "short-lived" and "slow". By "short-lived", we mean a connection that stayed open for a time Tc less than the time to cycle the ISN, i.e., Tc < 2**32/R seconds. By "slow", we mean that the effective transfer rate B is less than R.

This is illustrated in Figure 1, where sequence numbers are plotted against time. The asterisks show the ISN lines from formula [2], while the circles represent the trajectories of several short-lived incarnations of the same connection, each terminating at the "x".

Note: allowing rapid reuse of connections was believed to be an important goal during the early TCP development.
This requirement was driven by the hope that TCP would serve as a basis for user-level transaction protocols as well as connection-oriented protocols. The paradigm discussed was the "Christmas Tree" or "Kamikaze" segment that contained SYN and FIN bits as well as data. Enthusiasm for this was somewhat dampened when it was observed that the 3-way SYN handshake and the FIN handshake mean that 5 packets are required for a minimum exchange. Furthermore, the TIME-WAIT state delay implies that the same connection really cannot be reopened immediately. No further work has been done in this area, although existing applications (especially SMTP) often generate very short TCP sessions. The reuse problem is generally avoided by using a different port pair for each connection.

[Figure: sequence number (Seq #, 0 to 2**32) plotted against time; ISN lines (*) cycle every 4.55 hrs; connection trajectories (o) each end at an "x".]

Figure 1. Clock-Driven ISN avoiding duplication on short-lived, slow connections.

However, clock-driven ISN selection does not protect against old duplicate packets for a long-lived or fast connection: the connection may close (or crash) just as the ISN has cycled around and reached the same value again. If the connection is then reopened, a datagram still in transit from the old connection may fall into the current window. This is illustrated by Figure 2 for a slow, long-lived connection, and by Figures 3 and 4 for fast connections. In each case, the point "x" marks the place at which the original connection closes or crashes. The arrow in Figure 2 illustrates an old duplicate segment. Figure 3 shows a connection whose total byte count Nc < 2**32, while Figure 4 concerns Nc >= 2**32.
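Formula [2] and the hazard just described can be made concrete with a small sketch. This is an illustration only, not the code of any TCP: the function names are hypothetical, and R is assumed to be 2**18 bytes/sec, a value chosen here because it reproduces the 4.55-hour ISN cycle shown on the time axis of the figures.

```python
# Sketch of clock-driven ISN selection, formula [2]. Names are hypothetical.

SEQ_SPACE = 2 ** 32
R = 2 ** 18  # bytes/sec; ASSUMED value, chosen to give a 4.55-hour cycle


def clock_driven_isn(t_seconds, rate=R):
    """ISN = (integer(R*t)) mod 2**32, with t relative to an arbitrary origin."""
    return int(rate * t_seconds) % SEQ_SPACE


def isn_cycle_seconds(rate=R):
    """Time for the clock-driven ISN to wrap around once."""
    return SEQ_SPACE / rate


cycle = isn_cycle_seconds()
print(round(cycle / 3600, 2))  # ISN cycle time in hours

# A connection opened at time t and reopened exactly one cycle later is
# handed the same ISN, so a segment still in transit from the first
# incarnation could fall into the window of the second.
t_open = 100.5
print(clock_driven_isn(t_open) == clock_driven_isn(t_open + cycle))
```

The second print shows why clock-driven selection alone cannot protect a connection that stays open for a full ISN cycle or longer.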
To prevent the duplication illustrated in Figure 2, Tomlinson proposed to "resynchronize" the connection sequence numbers if they came within an MSL of the ISN. Resynchronization might take the form of a delay (point "y") or the choice of a new sequence number (point "z").

[Figure: sequence number (Seq #, 0 to 2**32) plotted against time; ISN lines (*) cycle every 4.55 hrs; a slow connection trajectory (o) closes at "x" near the ISN line, with resynchronization points "y" and "z".]

Figure 2. Resynchronization to Avoid Duplication on Slow, Long-Lived Connection

[Figure: sequence number (Seq #, 0 to 2**32) plotted against time; ISN lines (*) cycle every 4.55 hrs; a fast connection trajectory (o) rises more steeply than the ISN line and closes at "x"; an arrow marks an old duplicate.]

Figure 3. Duplication on Fast Connection: Nc < 2**32 bytes

[Figure: sequence number (Seq #, 0 to 2**32) plotted against time; ISN lines (*) cycle every 4.55 hrs; a fast connection trajectory (o) wraps the sequence space before closing at "x".]

Figure 4. Duplication on Fast Connection: Nc > 2**32 bytes

In summary, Figures 1-4 illustrate four possible failure modes for old duplicate packets from an earlier incarnation. We will call these four modes F1, F2, F3, and F4:

    F1: B < R,  Tc < 4.55 hrs.   (Figure 1)
    F2: B < R,  Tc >= 4.55 hrs.  (Figure 2)
    F3: B >= R, Nc < 2**32       (Figure 3)
    F4: B >= R, Nc >= 2**32      (Figure 4)

Another limitation of clock-driven ISN selection should be mentioned. Tomlinson assumed that the current time t in formula [2] is obtained from a clock that is persistent over a system crash. For his scheme to work correctly, the clock must be restarted with an accuracy of 1/R seconds (e.g., 4 microseconds in the case of TCP).
While this may be possible for some hosts and some crashes, in most cases there will be an uncertainty in the clock after a crash that ranges from a second to several minutes.

As a result of this random clock offset after system reinitialization, there is a possibility that old segments sent before the crash may fall into the window of a new connection incarnation. The solution to this problem that was adopted in the final TCP spec is a "quiet time" of MSL seconds when the system is initialized [Postel81, p. 28]. No TCP connection can be opened until the expiration of this quiet time.

A different approach was suggested by Garlick, Rom, and Postel [Garlick77]. Rather than using clock-driven ISN selection, they proposed to maintain connection records containing the last ISN used on every connection. To immediately open a new incarnation of a connection, the ISN is taken to be greater than the last sequence number of the previous incarnation, so that the new incarnation will have unique sequence numbers. To handle a system crash, they proposed a quiet time, i.e., a delay at system startup time to allow old duplicates to expire. Note that the connection records need be kept only for MSL seconds; after that, no collision is possible, and a new connection can start with sequence number zero.

The scheme finally adopted for TCP combines features of both these proposals. TCP uses three mechanisms:

(A) ISN selection is clock-driven to handle short-lived connections. The parameter R = 250KBps, so that the ISN value cycles in 2**32/R = 4.55 hours.

(B) (One end of) a closed connection is left in a "busy" state, known as "TIME-WAIT" state, for a time of 2*MSL. TIME-WAIT state handles the proper close of a long-lived connection without resynchronization. It also allows reliable completion of the full-duplex close handshake.

(C) There is a quiet time of one MSL at system startup.
This handles a crash of a long-lived connection and avoids time resynchronization problems in (A).

Notice that (B) and (C) together are logically sufficient to prevent accidental reuse of sequence numbers from a different incarnation, for any of the failure modes F1-F4. (A) is not logically necessary since the close delay (B) makes it impossible to reopen the same TCP connection immediately. However, the use of (A) does give additional assurance in a common case, perhaps compensating for a host that has set its TIME-WAIT state delay too short.

Some TCP implementations have permitted a connection in the TIME-WAIT state to be reopened immediately by the other side, thus short-circuiting mechanism (B). Specifically, a new SYN for the same socket pair is accepted when the earlier incarnation is still in TIME-WAIT state. Old duplicates in one direction can be avoided by choosing the ISN to be the next unused sequence number from the preceding connection (i.e., FIN+1); this is essentially an application of the scheme of Garlick, Rom, and Postel, using the connection block in TIME-WAIT state as the connection record. However, the connection is still vulnerable to old duplicates in the other direction. Mechanism (A) prevents trouble in mode F1, but failures can arise in F2, F3, or F4; of these, F2, on short, fast connections, is the most dangerous.

Finally, we note TCP will operate reliably without any MSL-based mechanisms in the following restricted domain:

    * Total data sent is less than 2**32 octets, and
    * Effective sustained rate less than 250KBps, and
    * Connection duration less than 4.55 hours.

At the present time, the great majority of current TCP usage falls into this restricted domain. The third component, connection duration, is the most commonly violated.
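The failure modes F1-F4 and the restricted domain above reduce to a few comparisons on Tc, Nc and B = Nc/Tc. The sketch below shows one way to express them; it is an illustration only, the function names are hypothetical, and R is assumed to be 2**18 bytes/sec (the text rounds this to 250KBps), which gives the 4.55-hour ISN cycle quoted above.

```python
# Classifying a connection by the failure modes F1-F4 of Appendix A.
# Function names are hypothetical; R = 2**18 bytes/sec is an ASSUMED value.

SEQ_SPACE = 2 ** 32
R = 2 ** 18


def failure_mode(tc_seconds, nc_bytes, rate=R):
    """Return 'F1'..'F4' from duration Tc (> 0) and byte count Nc."""
    b = nc_bytes / tc_seconds      # effective bandwidth B = Nc/Tc
    cycle = SEQ_SPACE / rate       # ISN cycle time in seconds
    if b < rate:
        return "F1" if tc_seconds < cycle else "F2"
    return "F3" if nc_bytes < SEQ_SPACE else "F4"


def in_restricted_domain(tc_seconds, nc_bytes, rate=R):
    """True when all three conditions of the restricted domain hold."""
    return (nc_bytes < SEQ_SPACE
            and nc_bytes / tc_seconds < rate
            and tc_seconds < SEQ_SPACE / rate)


print(failure_mode(60, 1_000_000))          # short and slow
print(failure_mode(20_000, 1_000_000))      # slow but longer than one cycle
print(failure_mode(10, 10_000_000))         # fast, Nc < 2**32
print(failure_mode(1_000, 2 ** 33))         # fast, Nc >= 2**32
print(in_restricted_domain(60, 1_000_000))
```

Note that the restricted domain coincides with mode F1: if B < R and Tc is less than one ISN cycle, then Nc = B*Tc is necessarily below 2**32.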
From Z.Wang@cs.ucl.ac.uk Thu Sep 20 02:35:03 1990 Posted-Date: Thu, 20 Sep 90 10:33:27 +0100 Received-Date: Thu, 20 Sep 90 02:35:03 -0700 Message-Id: <9009200935.AA04277@venera.isi.edu> Received: from bells.cs.ucl.ac.uk by venera.isi.edu (5.61/5.61+local) id ; Thu, 20 Sep 90 02:35:03 -0700 Received: from sol.cs.ucl.ac.uk by bells.cs.ucl.ac.uk with SMTP inbound id ; Thu, 20 Sep 1990 10:33:31 +0100 To: end2end, Z.Wang@cs.ucl.ac.uk Subject: A New Congestion Control Scheme: Tri-S Date: Thu, 20 Sep 90 10:33:27 +0100 From: Zheng Wang Dear End2enders; While I was working on routing algorithms, I was interested in some behaviors of the Slow-Start scheme. After further examination of many end user control schemes, we have developed a new scheme called "Slow Start and Search (Tri-S)". The new scheme adopts a new approach for end user traffic control. It has three important features: 1) In contrast to most of the user control schemes in which traffic demands are subject to continuous adjustment based on the feedback, the Tri-S scheme attempts to establish the sharing of the resources when there are significant traffic changes, e.g. at the beginning and the end of a connection. Once the sharing of the resources has been negotiated, it will remain unchanged until the next negotiation. 2) The operating-point searching in the Tri-S scheme is based on continuous evaluation of the current throughput gradient. Throughput increases linearly with the traffic load under light traffic and levels off when the path is saturated. The throughput gradient approaches zero when a queue is building up at the bottleneck. 3) In the Tri-S scheme, a new approach called "statistical fairness" is adopted. The idea is to ensure that during the share negotiation all users start increasing their traffic demands at the same time from the same level and with the same algorithm until the path is saturated.
The final sharing of resources may not be equal due to the statistical nature of traffic and network. But all users have the equal opportunity. In other words, such approach is statistically fair, ie. over many runs of negotiations, each user has an equal share on the average. A typical session is like this: When a new user joins in, all users start a new session of share negotiation if the overall traffic demands can not be accommodated with the resources and lead to overflow of the buffer. When a user leaves, the remaining users will detect the change in the throughput gradient and absorb the relieved resources. I have written a paper on this with Jon. If you are interested, I can send you the postscript file via email. Due to excessive data and graphs, the ps file is over 2 MBytes and may take you 20 mins to print out (if your printer does understand our postscript :-) So if you prefer a hardcopy, drop me your surface address. By the way, if you are going to Sigcomm next week, I can hand to you a copy in person ... We would appreciate any comments from you. Zheng From lixia@redwing.parc.xerox.com Thu Sep 20 08:04:42 1990 Posted-Date: Thu, 20 Sep 1990 8:04:14 PDT Received-Date: Thu, 20 Sep 90 08:04:42 -0700 Received: from arisia.Xerox.COM by venera.isi.edu (5.61/5.61+local) id ; Thu, 20 Sep 90 08:04:42 -0700 Received: from redwing.parc.xerox.com by arisia.Xerox.COM with SMTP (5.61+/IDA-1.2.8/gandalf) id AA09648; Thu, 20 Sep 90 08:04:44 -0700 Received: by redwing.parc.xerox.com (5.61+/IDA-1.2.8/gandalf) id AA00474; Thu, 20 Sep 90 08:04:15 PDT Sender: Lixia Zhang Date: Thu, 20 Sep 1990 8:04:14 PDT From: lixia@parc.xerox.com Reply-To: lixia@parc.xerox.com To: Zheng Wang Cc: end2end, Z.Wang@cs.ucl.ac.uk Subject: Re: A New Congestion Control Scheme: Tri-S In-Reply-To: Your message of Thu, 20 Sep 1990 02:33:27 PDT Message-Id: Zheng, Your stuff sounds interesting, although I can't figure out exactly how it works just from your brief msg (e.g. 
When a new user joins in, how would all other existing users discover this fact and how/when they start negotiation? Have you tested the scheme by any means, simulation or whatever?). I'll be at SSIGCOMM and would love to read your paper. Thanks, Lixia From tds@honet9.att.com Fri Sep 21 08:45:24 1990 Posted-Date: Fri, 21 Sep 90 11:19 EDT Received-Date: Fri, 21 Sep 90 08:45:24 -0700 Message-Id: <9009211545.AA21239@venera.isi.edu> Received: from att.att.com by venera.isi.edu (5.61/5.61+local) id ; Fri, 21 Sep 90 08:45:24 -0700 From: tds@honet9.att.com Date: Fri, 21 Sep 90 11:19 EDT To: Z.Wang@cs.ucl.ac.uk Cc: end2end In-Reply-To: <9009200935.AA04277@venera.isi.edu> Subject: A New Congestion Control Scheme: Tri-S I'll look forward to getting a copy of your paper at SIGCOMM. Of course we in the phone biz are very interested in reservation and the like. The first question that comes to mind, if I understand your scheme, is how the endpoints probe the throughput gradient. The second question is how the scheme scales to high speed, when both the number and rate of ``connections'' gets large. Tony DeSimone From braden Tue Oct 2 10:18:46 1990 Received-Date: Tue, 2 Oct 90 10:18:46 -0700 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Tue, 2 Oct 90 10:18:46 -0700 Date: Tue, 2 Oct 90 10:18:41 PDT From: braden Posted-Date: Tue, 2 Oct 90 10:18:41 PDT Message-Id: <9010021718.AA15864@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Tue, 2 Oct 90 10:18:41 PDT To: end2end-interest Subject: Trailing TCP Checksums Several of the gigabit testbed groups have recently dusted off the venerable trailing-TCP-checksum idea. Here is a summary of recent messages on this subject, for your edification (slightly truncated in spots, mostly to protect the innocent). Bob Braden _______________________________________________________________ I started by throwing the rabbit into the ring: Will someone please remind me why trailing TCP checksums are a Bad Idea? 
Bob Braden ________ Reply from Craig Partridge: Bob: Here's my argument about why trailing checksums are a Bad Idea. * If you are going to maintain compatibility with old TCPs, then you'll have to negotiate trailing checksums. But folks who want trailing checksums only claim they add performance in pipelined architectures. This implies you'll have to change the pipeline depending on the result of TCP option negotiation in the SYNs. Sounds very messy to me.... * More concretely, I doubt trailers really give a performance boost, for the following reason. Everyone agrees that the receiving TCP doesn't win. You have to accumulate the entire segment and confirm it checksums before passing it up to the application, and you can just as easily check a number at the front of the segment as at the end. Furthermore, you can checksum as the segment comes in, regardless of where the checksum sits. So the question is, does the sender win? The idea of a trailer is that the header of the segment is on the network while the rest of the data is being checksummed. If the header isn't on the network, then you could just as easily stuff the checksum into the header as add it in a trailer. But how often can you be sure that your pipeline won't block at the MAC layer due to flow control? My guess, from what I've seen so far, is that the pipelines end up having a buffer at the MAC layer, to buffer the segment in case the MAC layer isn't ready for the segment yet. Now we're in an argument about how often the MAC layer is blocked. I'll bet it is most of the time, even if only for a very very short period. On token rings you have to wait for the token (and on gigabit token rings, token times are in the hundreds of thousands of bits). On ATM nets, flow control is likely to squelch you periodically, and if you do per datagram routing, you may have to wait for routing information. 
So, in my view, the header is extremely unlikely to get off your machine before the checksumming is done (and we haven't even discussed the fact that the IP checksum is easy to parallelize or how long the pipeline for other functions is likely to be). If that's the case, what does a trailer buy you except nuisance? Craig PS: I did tell Dick Binder that I was willing to see folks define a TCP checksum trailer option to play with, much as we defined a TCP alternate checksum option to play with. I just said I thought it was a bad idea, and thought there were others agreed with me (perhaps for different reasons). _______ From Vint Cerf: Bob, My understanding of the work at CMU (H.T.Kung is the PI for the Nectar switch effort) is that they were looking for a quick way to put up applications driving the gigabit switch at gigabit speeds. The checksum proved to be a bottleneck and I think they decided to try putting it at the end, in hardware. ... Vint ___________________________________________ From Hamant Kanakia: Bob: I thought this trailing checksum issue was settled long ago! (This was the only change I required of VMTP for my work on NAB.) The following are my reasons for favoring trailing checksums. One would like trailing checksums for any transport protocol (TCP, TP, VMTP) in order to generate or verify these on-the-fly. The reason why this wins is because pipelining this function reduces memory references made per data word. The pipeline, at least the NAB-style pipeline, is used to reduce the cost of memory references and NOT to speed up the operation of performing checksum. First storing a packet and then reading it to perform checksum almost doubles memory reads per packet. 
The only argument against the trailing checksum seems to be that such a change would make the modified TCP (or TP) non-standard, but then there are lots of other changes that people seem to want to make in TCP and as long as you are making some changes, I would say include this change as well. At worst, it will not hurt the performance. One may not choose to use NAB-style pipelining, but then the location of checksum will not increase processing cost. Incidentally, I believe that UltraNet folks also use trailing checksums (for TP) and at least one person from UltraNet I spoke to believed that performance they see is primarily a result of this feature. ____________________________________ A partial rebuttal from Craig: > First storing a packet and then reading it to perform checksum almost doubles > memory reads per packet. Why not do the checksum at the same time the packet is being stored? Van and Dave Clark have shown this is a big win in software -- is it hard to do in hardware? Craig ____________________________________ From Dave Clark: I think that the location of the checksum has nothing to do with pipelining, for sure on the receiving side. It can process the bytes, prepare a result, and then leave to the CPU the one memory reference to retrieve the value in the packet and compare. The issue is only on the sending side. If the pipeline is so constrained that it cannot get to bytes once they have gone by, (FIFO hardware rather than memory, for example) then it cannot get to the header field to put down the computed value, so it needs to put it at the end. I always dislike pipelines that simple, since I fear that in practice they will be too restricted to do what is needed. But perhaps this one optimization has some value. But for sure you can do the computation in the pipeline, independent of where the value is in the packet.
Dave From kanakia@research.att.com Tue Oct 2 19:22:07 1990 Posted-Date: Tue, 2 Oct 90 22:21:08 EDT Received-Date: Tue, 2 Oct 90 19:22:07 -0700 Message-Id: <9010030222.AA10043@venera.isi.edu> Received: from research.att.com by venera.isi.edu (5.61/5.61+local) id ; Tue, 2 Oct 90 19:22:07 -0700 From: kanakia@research.att.com Date: Tue, 2 Oct 90 22:21:08 EDT To: end2end-interest Subject: Re: Trailing Checksums In response to msgs from Craig and Dave on the subject of trailing checksums: >Why not do the checksum at the same time the packet is being stored? Van >and Dave Clark have shown this is a big win in software -- is it hard to >do in hardware? >>Craig NAB's Packet Processing Pipeline is "hardware" that stores/reads packets and calculates checksums while storing/reading the packet. (It also does encryption/decryption.) Using a processor to move (let alone checksumming) is a bad idea, although many seem to favor the idea. I think it is a bad idea since that wastes memory bandwidth. A block-mode DMA transfer cycle is smaller than the processor-initiated cycle since the later cycle includes address set-up time for each word being accessed. If you were using a DRAM with static-column or nibble or page mode, a full read cycle for random access would take at least 3-5 times more than the reading words sequentially using a page-mode cycle. Although some processors (e.g. AMD's 29K) do allow block moves from memory to its on-chip cache, the block size is governed by the onchip cache size. This block size is typically a lot smaller than what one would like for the efficient transfer of packets from memory. Please note that the argument again boils down to what saves memory bandwidth. Processor reading a word at a time (and calculating a partial checksum) requires address set-up time for each word being read and thus wastes memory bandwidth. >From Dave Clark: >The issue is only on the sending side. 
If the pipeline is so constrained >that it cannot get to bytes once they have gone by, (FIFO hardware rather >than memory, for example) then it cannot get to the header field to put down >the computed value, so it needs to put it at the end. >I always dislike pipelines that simple, since I fear that in practice they >will be too restricted to do what is needed. Dave, If you don't like the pipeline that simple, what do you suggest? Build a receiving pipeline that merely appends the checksum, and a sending pipeline where you can reach in and put bytes in the header, after the checksum is calculated? That would certainly increase the latency for sending a single msg. The latency in this pipeline would be at least one packet transmission time. Moreover, the pipeline hardware would be messy (and costly). Making the network adapter look like a memory addressable by the host processor is the other "bad" option. With this memory-like model of the interface, the host processor would checksum while storing packets into this memory, but, as I already pointed out above, that wastes memory bandwidth. I would say that if the memory bandwidth is a critical resource, one should either use a FIFO-like hardware or use a pipeline; both reduce the memory bottleneck. The pipeline should be shallow, i.e., allow little storage of data within the pipe. The pipeline could either sit between the host bus and the adapter memory or between the adapter memory and the network. For reasons outlined in my PhD thesis, I chose the latter option in the NAB. There are (important -:)) host interfaces built or being built which use a simple pipeline or a simple FIFO-like hardware. E.g., UltraNet's host interface, DECSRC's Autonet's host interface, Host Interface for plan-9 hosts (ongoing work at Bell Labs), and of course NAB. So why not make these folks happy by using a trailing checksum? The point to focus on is to ask if there is any reason at all to provide a checksum in the header as TCP currently does.
If there is none, then the trailing checksum idea seems like the natural winner. Hemant Kanakia From craig@NNSC.NSF.NET Wed Oct 3 08:39:32 1990 Posted-Date: Wed, 03 Oct 90 11:37:24 -0400 Received-Date: Wed, 3 Oct 90 08:39:32 -0700 Message-Id: <9010031539.AA26843@venera.isi.edu> Received: from nnsc.nsf.net by venera.isi.edu (5.61/5.61+local) id ; Wed, 3 Oct 90 08:39:32 -0700 To: end2end-interest Subject: Borman tests of Van's code From: Craig Partridge Date: Wed, 03 Oct 90 11:37:24 -0400 Sender: craig@NNSC.NSF.NET Dave Borman called me today to tell me that he's going to run a test of the window-multiplier and TCP echo options between a couple of his Crays late this week (over HSX). One of the tests will be over a long distance (1,000 mile +) link. He says his hope is to get about 400Mbits/sec. That's what he's seen with non-TCP tests of HSX using 64Kbyte transfer sizes. (HSX can go up to 800Mbits/sec but that requires larger transfer sizes). Catch him in the halls at Interop if you want results hot off the HSX... Craig From braden Wed Oct 3 10:13:28 1990 Received-Date: Wed, 3 Oct 90 10:13:28 -0700 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Wed, 3 Oct 90 10:13:28 -0700 Date: Wed, 3 Oct 90 10:13:23 PDT From: braden Posted-Date: Wed, 3 Oct 90 10:13:23 PDT Message-Id: <9010031713.AA16289@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Wed, 3 Oct 90 10:13:23 PDT To: end2end-interest, kanakia@research.att.com Subject: Re: Trailing Checksums Build a receiving pipeline that merely appends the checksum, and a sending pipeline where you can reach in and put bytes in the header, after the checksum is calculated? That would certainly increase the latency for sending a single msg. The latency in this pipeline would be at lest one packet transmission time. Moreover, the pipeline hardware would be messy (and costly). Hamant, Of course, you were minimizing latency for VMTP. 
The gigabit folks are maximizing throughput, and are facing a bandwidth*delay product where there may be 100s of packets in a single window. In that situation, one packet's delay does not seem significant. This does not detract from your main argument about memory bandwidth. Bob From craig@NNSC.NSF.NET Wed Oct 3 10:25:32 1990 Posted-Date: Wed, 03 Oct 90 13:25:52 -0400 Received-Date: Wed, 3 Oct 90 10:25:32 -0700 Message-Id: <9010031725.AA01502@venera.isi.edu> Received: from WS6.NNSC.NSF.NET by venera.isi.edu (5.61/5.61+local) id ; Wed, 3 Oct 90 10:25:32 -0700 To: kanakia@research.att.com Cc: end2end-interest Subject: re: Trailing Checksums From: Craig Partridge Date: Wed, 03 Oct 90 13:25:52 -0400 Sender: craig@NNSC.NSF.NET > In response to msgs from Craig and Dave on the subject of trailing > checksums: > > >Why not do the checksum at the same time the packet is being stored? Van > >and Dave Clark have shown this is a big win in software -- is it hard to > >do in hardware? > > >>Craig > > NAB's Packet Processing Pipeline is "hardware" that stores/reads packets and > calculates checksums while storing/reading the packet. (It also does > encryption/decryption.) > > Using a processor to move (let alone checksumming) is a bad idea, > although many seem to favor the idea. Hemant: I think you missed my point. If I am DMA'ing from main memory to outboard memory, why can't I have two sets of wires out of my DMA chip -- one straight to my outboard memory, the other into a cumulative adder for my TCP checksum. In other words, it seems to me that doing the checksum in parallel with moving the memory is just a matter of having two sets of wires at the point where you DMA. 
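The idea of accumulating the checksum in the same pass that moves the data can be illustrated in software. What follows is a minimal Python sketch, not anyone's actual driver or hardware design. The function name copy_with_checksum is hypothetical, and the sketch assumes the standard 16-bit one's-complement Internet checksum, with an odd trailing byte padded by a zero byte.

```python
def copy_with_checksum(src: bytes) -> tuple:
    """Copy src into a new buffer and accumulate the checksum in the same pass.

    Returns (copy, checksum). Each 16-bit word is read once, stored once,
    and added to the running one's-complement sum at the same time.
    """
    dst = bytearray(len(src))
    total = 0
    for i in range(0, len(src), 2):
        hi = src[i]
        has_lo = i + 1 < len(src)
        lo = src[i + 1] if has_lo else 0   # assumed zero padding for odd length
        dst[i] = hi                        # the "store" half of the pass
        if has_lo:
            dst[i + 1] = lo
        total += (hi << 8) | lo            # the "adder" half of the pass
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return bytes(dst), (~total) & 0xFFFF


data = bytes.fromhex("0001f203f4f5f6f7")
copy, checksum = copy_with_checksum(data)
print(copy == data, hex(checksum))
```

The sketch only shows that the sum needs no second pass over the data. It says nothing about where the finished value is then placed in the packet.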
Craig From minshall@wc.novell.com Wed Oct 3 16:48:20 1990 Posted-Date: Wed, 03 Oct 90 17:05:28 -0700 Received-Date: Wed, 3 Oct 90 16:48:20 -0700 Received: from OPTICS.KINETICS.COM by venera.isi.edu (5.61/5.61+local) id ; Wed, 3 Oct 90 16:48:20 -0700 Received: from plasma.wc.novell.com by wc.novell.com (4.0/SMI-DDN) id AA16085; Wed, 3 Oct 90 16:47:30 PDT Received: from localhost by plasma.wc.novell.com (3.2/SMI-3.2) id AA08931; Wed, 3 Oct 90 17:05:29 PDT Message-Id: <9010040005.AA08931@plasma.wc.novell.com> To: kanakia@research.att.com Cc: end2end-interest Subject: Re: Trailing Checksums In-Reply-To: Your message of Tue, 02 Oct 90 22:21:08 -0400. <9010030222.AA10043@venera.isi.edu> Date: Wed, 03 Oct 90 17:05:28 -0700 From: minshall@wc.novell.com >The latency in this pipeline would be at lest one packet transmission time. >Moreover, the pipeline hardware would be messy (and costly). Just to be picky, "transmission time" implies "transmission time on the wire", whereas the real latency here is "memory to memory transfer time". I think the point Craig made, which I haven't quite seen addressed, is that if your are DMA'ing from host memory directly onto the wire you are going to have underrun/overrun problems. Thus, you probably need to buffer "on board". Thus (as Craig just said) why not compute the checksum as you are transferring to the "on board" memory. The place I think this breaks down is in a specialized "network computer" (such as a router) where the entire system is tuned to the task of sending and receiving packets (and thus memory overruns, etc., don't happen by executive fiat). Then, you don't need to buffer and so maybe a trailing checksum makes sense. Greg Minshall Novell, Inc. 
minshall@wc.novell.com 1-415-975-4507 From kanakia@research.att.com Thu Oct 4 08:48:33 1990 Posted-Date: Thu, 4 Oct 90 11:48:22 EDT Received-Date: Thu, 4 Oct 90 08:48:33 -0700 Message-Id: <9010041548.AA06118@venera.isi.edu> Received: from research.att.com by venera.isi.edu (5.61/5.61+local) id ; Thu, 4 Oct 90 08:48:33 -0700 From: kanakia@research.att.com Date: Thu, 4 Oct 90 11:48:22 EDT To: end2end-interest Greg and Craig: What I said: >>The latency in this pipeline would be at lest one packet transmission time. >>Moreover, the pipeline hardware would be messy (and costly). Greg Minshall's response: >Just to be picky, "transmission time" implies "transmission time on >the wire", whereas the real latency here is "memory to memory transfer >time". Umhh! being picky gets you nowhere! Pipeline I was discussing runs at the same rate as the network link. Thus, storing of a packet by the pipeline (or reading one to send out) takes the same time as the packet transmission time. Greg Minshall: >I think the point Craig made, which I haven't quite seen addressed, is >that if your are DMA'ing from host memory directly onto the wire you >are going to have underrun/overrun problems. >Thus, you probably need >to buffer "on board". Three out of 4 network interfaces I mentioned in the previous msg do know how to avoid underruns and overruns, and not all of them use "on board" memory to store a complete packet before begining transmission. To know how Autonet's host interface avoids the problem, ask M.Schroeder from DECSRC, To know how Plan 9's host interface avoids the problem, ask me or Ken Thompson after few months. To know how NAB avoids the problem, read on. In the case of the NAB, a full packet (only the data part of a packet) is transferred to the "on board" memory. This transfer runs 3 times faster than the transfer out from that memory to the network (via the pipeline). Memory has enough bandwidth to support both transfers to run at the same time. 
Thus, the memory-to-memory transfer time you mention is not the bottleneck in NAB. As long as one doesn't insist on using commercially available DMA chips or on using a processor to do memory-to-memory transfers, one can make the memory-to-memory transfer time much faster than the time it takes to send/receive data on the wire. This is true currently with 100 Mb links and VME bus. This assertion would also hold in the gigabit range, although you would have to use a higher-speed bus than VME. From Greg: >>>Thus, you probably need >> to buffer "on board". Thus (as Craig just said) why not compute the > checksum as you are transferring to the "on board" memory. From Craig: >>> If I am DMA'ing from main memory to outboard memory, why can't >> I have two sets of wires out of my DMA chip -- one straight to my > outboard memory, the other into a cumulative adder for my TCP checksum. You sure could compute the checksum while transferring data from host memory to the "on board" memory. If you use software to do that it would waste bandwidth (see the argument in my previous msg.) If you use hardware and do not use a trailing checksum, you have the following problems to deal with. The point you both are missing is what to do with the checksum you just calculated. Interrupt the on board processor so that it can read the computed checksum and insert it in the right place in the "on board" memory? Or have additional hardware to address the on board memory and to insert the checksum at the right place in the header? The latter option also means this hardware would compete with the "on board" processor for peeking into the on board memory. And while you are servicing this "insert checksum" interrupt or the hardware is waiting to gain access to the on board memory, you would not be able to start DMA'ing (and checksumming) the next packet. Of course you could solve the last problem by building a separate FIFO that just holds computed checksums!!
I am not saying it can't be done, but to understand how messy it would all be I suggest you both try building hardware that would checksum and DMA at the top speed of the bus, and has the capability to reach into the "on board" memory to insert the checksum in the right place in the packet header. And why do all that?? Just so that we don't have a trailing checksum? I close my tirade by citing a general belief of mine. The underlying principle in designing a packet format should be that if one chooses to process a packet in flight with very little storage, the packet format should make it possible. This principle should be followed not just for the checksum field but also for digital signatures, presentation level fields etc. Currently I don't see any other overriding reason to lay out a packet format differently. That way everybody can build host interfaces and packet switches any way they like and any way they know how to. Hemant Kanakia AT&T Bell Labs, Murray Hill. 201-582-3090. From craig@NNSC.NSF.NET Thu Oct 4 09:13:05 1990 Posted-Date: Thu, 04 Oct 90 12:13:19 -0400 Received-Date: Thu, 4 Oct 90 09:13:05 -0700 Message-Id: <9010041613.AA06794@venera.isi.edu> Received: from nnsc.nsf.net by venera.isi.edu (5.61/5.61+local) id ; Thu, 4 Oct 90 09:13:05 -0700 To: kanakia@research.att.com Cc: end2end-interest Subject: re: checksum From: Craig Partridge Date: Thu, 04 Oct 90 12:13:19 -0400 Sender: craig@NNSC.NSF.NET Hemant: Thanks for the tirade. It was helpful. Among other things, it finally explained to me why a trailer checksum is useful, to wit, insertion of the checksum forward into the pipeline or into a buffer in the pipeline is a problem. (Doable but very painful). As a result, trailer checksums are convenient. The reason I was beating on this subject is that I have some nagging concern that if we espouse trailers, we'd set a precedent for future gigabit researchers, and I wanted to make sure our reasons for doing trailers were clear and reasonable -- not snake oil. Thanks!
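The placement problem summarized here can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not a model of the NAB or of any real adapter. The names ones_sum, send_with_trailer and send_with_header are hypothetical, and the data is assumed to arrive in even-length chunks. With a trailer, the sender holds only a 16-bit running sum and forwards each chunk as it arrives. With the checksum ahead of the data, nothing can leave until the whole packet has been seen.

```python
def ones_sum(data, total=0):
    """Add even-length data to a running 16-bit one's-complement sum."""
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return total


def send_with_trailer(chunks):
    """Forward each chunk immediately; emit the checksum last."""
    total = 0
    for chunk in chunks:
        total = ones_sum(chunk, total)
        yield chunk                       # bytes leave as soon as they arrive
    yield ((~total) & 0xFFFF).to_bytes(2, "big")


def send_with_header(chunks):
    """Checksum precedes the data, so the whole packet is buffered first."""
    buffered = list(chunks)               # one full packet of storage
    total = 0
    for chunk in buffered:
        total = ones_sum(chunk, total)
    yield ((~total) & 0xFFFF).to_bytes(2, "big")
    yield from buffered


packet = [bytes.fromhex("0001f203"), bytes.fromhex("f4f5f6f7")]
print([c.hex() for c in send_with_trailer(packet)])
print([c.hex() for c in send_with_header(packet)])
```

Both functions emit the same bytes in a different order. The difference is the storage and delay needed before the first byte can leave.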
Craig From J.Crowcroft@cs.ucl.ac.uk Thu Oct 4 09:13:50 1990 Posted-Date: Thu, 04 Oct 90 17:09:42 +0100 Received-Date: Thu, 4 Oct 90 09:13:50 -0700 Message-Id: <9010041613.AA06831@venera.isi.edu> Received: from bells.cs.ucl.ac.uk by venera.isi.edu (5.61/5.61+local) id ; Thu, 4 Oct 90 09:13:50 -0700 Received: from sol.cs.ucl.ac.uk by bells.cs.ucl.ac.uk with SMTP inbound id <20117-0@bells.cs.ucl.ac.uk>; Thu, 4 Oct 1990 17:09:54 +0100 To: kanakia@research.att.com Cc: end2end-interest In-Reply-To: Your message of Thu, 04 Oct 90 11:48:22 -0400. <9010041548.AA06118@venera.isi.edu> Date: Thu, 04 Oct 90 17:09:42 +0100 From: Jon Crowcroft >You sure could compute checksum while transferring data from host memory >to the "on board" memory. >If you use software to do that it would waste bandwidth (see the argument in >my previous msg.) If you use hardware and not use trailing checksum you have >the following problems to deal with. why not put the checksum in the next packet:-) jon From braden Fri Oct 5 16:31:22 1990 Received-Date: Fri, 5 Oct 90 16:31:22 -0700 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Fri, 5 Oct 90 16:31:22 -0700 Date: Fri, 5 Oct 90 16:31:16 PDT From: braden Posted-Date: Fri, 5 Oct 90 16:31:16 PDT Message-Id: <9010052331.AA18098@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Fri, 5 Oct 90 16:31:16 PDT To: end2end-interest Subject: Architecture Limits Hi. At a recent IAB meeting, the question came up of limits to the Internet architecture. Since I pointed out that the E2E RG has been delving into this for several years, I was asked to summarize our thoughts for the IAB. Here is my attempt; please give me feedback, on errors and/or omissions. Thanks! Bob Braden _____________________________________________________________________ Considerations on Internet Architecture Limits Developed by the End-to-End Research Group October 1990 A. 
Quantitative Transport Protocol Limits

A.1 TCP Performance

Depends upon both bandwidth and delay.

To fully utilize the bandwidth with a large bandwidth*delay product, TCP needs the extensions defined in RFC-1072:

o extended windows
o timestamps to measure RTT
o selective acknowledgments

A.2 TCP Correctness

At FDDI bandwidth of 100Mbps, the 32-bit TCP sequence space will be used up in 170 seconds, which is close to the 120 second maximum segment lifetime prescribed by TCP. This begins to raise the possibility of data corruption by old duplicate segments.

This can be avoided by a simple TCP extension suggested by Van Jacobson and described in a forthcoming RFC. This extension will extend the range of TCP correctness well beyond 1 Gbps.

A.3 Port Space

If there is a rapid sequence of short-duration TCP connections between a given host pair, the 16-bit port space can be used up very quickly. This limits the use of TCP for short transactions (unless the transactions are multiplexed on the same open TCP connection, but that has other problems). This argues for a timer-based transaction-transport protocol.

Note that each port pair is tied up in TIME-WAIT state for 240 seconds. If the RTT is less than 2 ms and if a minimum 5-packet exchange is used to open a connection, send data, and close the connection

    SYN-->
                      <--SYN ACK
    ACK DATA FIN-->
                      <--ACK FIN
    ACK--->

and if this is done repeatedly between the same host pair as fast as possible, then all 65000 ports will be used up in less than 240 seconds, and further progress will be held up until the 240 seconds elapse.

Since this limit depends upon RTT but not bandwidth, it will not get worse at higher speeds.

A.4 Window-Based Flow Control

There is a fundamental problem with window-based flow control as the bandwidth-delay product becomes very large: the transport protocol will be unable to reach steady-state before the data is exhausted. Furthermore, existing congestion control mechanisms do not work very well.
On the other hand, rate-based flow control may be subject to catastrophic phase-entrainment effects, and we don't currently know how to avoid this. B. Engineering/Implementation Issues B.1 Packet Sizes Higher-speed networks are tending to larger packet sizes (32Kb for FDDI, 72Kb for SMDS). MTU Discovery will be essential to make use of these sizes. If the CPU is interrupted once for each packet, and if the max reasonable interrupt frequency is 1000 per second, then at 1 Gbps each packet seen by the CPU had better be at least 10**6 bits, a factor of 100 larger than currently used. B.2 Performance Limits The fundamental performance limit is now understood to be the host memory bus speed. Dave Clark summarized the current wisdom most eloquently at SIGCOMM '90. From llp@cs.arizona.edu Fri Oct 5 18:36:43 1990 Posted-Date: Fri, 5 Oct 90 18:36:38 MST Received-Date: Fri, 5 Oct 90 18:36:43 -0700 Received: from megaron.cs.Arizona.EDU by venera.isi.edu (5.61/5.61+local) id ; Fri, 5 Oct 90 18:36:43 -0700 Received: from cheltenham.cs.arizona.edu by megaron.cs.arizona.edu (5.61/15) via SMTP id AA03149; Fri, 5 Oct 90 18:36:40 -0700 Date: Fri, 5 Oct 90 18:36:38 MST From: "Larry Peterson" Message-Id: <9010060136.AA17862@cheltenham.cs.arizona.edu> Received: by cheltenham.cs.arizona.edu; Fri, 5 Oct 90 18:36:38 MST To: braden Cc: end2end-interest In-Reply-To: <9010052331.AA18098@braden.isi.edu> Subject: Architecture Limits How about a bullet on functionality. The architecture provides only a byte-stream and an unreliable datagram end-to-end service. It does not support RPC or any of its variations, and it does not support group communication. This may seem like too obvious of point to make, but I consider this the most significant limitation of the architecture---there are probably more application programmers out there rolling their own transport protocols *outside* the architecture than there are programmers using naked TCP and UDP. 
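The quantitative figures in the Architecture Limits summary (A.2, A.3 and B.1) can be checked with a few lines of arithmetic. The Python sketch below uses hypothetical function names and rests on two assumptions that the summary does not state. The first is that the 170-second figure counts half of the 32-bit sequence space (2**31 bytes), the point at which old and new sequence numbers can no longer be told apart. The second is that one open/send/close exchange costs roughly two round-trip times.

```python
def wrap_seconds(bits_per_sec, space_bytes=2**31):
    """Time to send space_bytes of sequence space at the given line rate."""
    return space_bytes / (bits_per_sec / 8)


def break_even_rtt(ports, time_wait_sec, rtts_per_connection=2):
    """Largest RTT at which all ports are consumed within one TIME-WAIT period.

    rtts_per_connection=2 is an assumption about the 5-packet exchange.
    """
    return time_wait_sec / ports / rtts_per_connection


def min_packet_bits(bits_per_sec, interrupts_per_sec):
    """Smallest packet that keeps the CPU at or below the interrupt budget."""
    return bits_per_sec / interrupts_per_sec


print(round(wrap_seconds(100e6)))                  # seconds at 100 Mbps
print(round(break_even_rtt(65000, 240) * 1e3, 2))  # milliseconds
print(min_packet_bits(1e9, 1000))                  # bits per packet at 1 Gbps
```

Under these assumptions the sequence space figure comes out near 172 seconds, the break-even RTT near 1.85 ms, and the packet size at 10**6 bits, all in line with the numbers quoted in the summary.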
Larry From braden Sun Oct 7 15:45:51 1990 Received-Date: Sun, 7 Oct 90 15:45:51 -0700 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Sun, 7 Oct 90 15:45:51 -0700 Date: Sun, 7 Oct 90 15:45:47 PDT From: braden Posted-Date: Sun, 7 Oct 90 15:45:47 PDT Message-Id: <9010072245.AA18568@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Sun, 7 Oct 90 15:45:47 PDT To: end2end-interest Subject: Msg from Estrin on Van's connectionless ideas (Forwarded with permission): ----- Begin Included Message ----- From estrin%jerico.usc.edu@usc.edu Fri Oct 5 14:48:19 1990 Date: Fri, 5 Oct 90 14:47:33 PDT From: estrin@usc.edu (Deborah Estrin) Sender: estrin%jerico.usc.edu@usc.edu To: braden@ISI.EDU Subject: van, connection/less dogma, etc. Reply-To: estrin@usc.edu Bob, I had an interesting conversation with Van at Sigcomm. I started it by saying that I wanted to understand his basic arguments with things like lixia's flow protocol and the impression that i had 3rd hand that he did not believe in resource reservation. I wondered how you could make svc guarantees in many situations without it... Anyway, here is my attempt to articulate our simple conversation. ... There is nothing profound, but i finally understood what Van's anti-connectionist ravings were about... ------ >From estrin@usc.edu Fri Oct 5 14:42:45 1990 Return-Path: Date: Fri, 5 Oct 90 14:40:48 PDT From: estrin@usc.edu (Deborah Estrin) Sender: estrin To: van@helios.ee.lbl.gov Cc: floyd@horse.ee.lbl.gov, estrin Subject: VAN is this what you said when we spoke at sigcomm? Reply-To: estrin@usc.edu Sally, since i expect you will read this before Van, I would be interested to hear what you have to say. I was just trying to capture the gist of a conversation i had with Van at Sigcomm. It helps me to write things down and repeat them back to see if i actually understood... Here is an attempt to document my conversation with Van. 1.
Connectionist architectures at the lowest level are not the right way to go. The fundamental unit should still be the datagram because out of it you can build everything else most efficiently.

2. Gateways can keep state regarding:
a- user class
b- type of service class (maybe this is combined with user class?)
c- priority
d- resource-reservation ID (if there is one)
e- % allocation allowed to (a,b,c,d) class (a,b,c,d together define aggregatable traffic that shares resources on a datagram basis)
f- counter of packets handled for the class
g- absolute threshold allowed to this class (don't remember what this is)
Given this information, you can do anything on top of it. The goal is to aggregate traffic as much as possible. This creates the most efficient system because of the statistical properties of traffic (more aggregation--smoother looking traffic patterns)...

3. Gateways then handle packets according to their class. Such a gateway can enforce both: (a) resource sharing policies such as: NSF gets 50% and NASA gets 50% of this link when there is contention, and if there is no contention then either can get 100% if they need it and are not interfering with the other. So packets are queued according to the class, allocation guaranteed, existing queue length, etc. You also need a protocol to install the % thresholds in the gws to correspond to the sharing policies. (b) service guarantees for particular transport sessions when needed. In other words, flow-like protocols can be implemented in this way if you have a Resource Reservation protocol that can cause the gw to set up corresponding class state and manage the various parameters. The RR protocol component of the gw would have to keep track of the guarantees made in order to not overextend itself of course...

4.
But bottom line is that the gw is built out of datagrams, queue management, state, instead of building the flow protocol/resource reservation at the lowest level and treating Datagrams as noise using unused capacity.

5. Van also talked about additional problems with flow style resource reservation because of the e2e delay required to obtain a guarantee. He recommended a credit system. But this requires that you have the right kinds of credits generated and handy at the source. This means very good, predictable traffic patterns, or else inefficient use of the resources... I think it degenerates into e2e delay if your model of use is not accurate. But the two could coexist, with credits being an optimization that you have alongside a default flow setup protocol.

6. It seems that state established during policy route (PR) setup is consistent with all this in that it provides for aggregation of transport sessions and does not imply that the gws are doing connections in terms of forwarding traffic. The state established for PR is used for facilitating routing decisions only. D. ----- End Included Message ----- From lixia@redwing.parc.xerox.com Mon Oct 8 08:09:14 1990 Posted-Date: Sun, 7 Oct 1990 11:29:51 PDT Received-Date: Mon, 8 Oct 90 08:09:14 -0700 Received: from arisia.Xerox.COM by venera.isi.edu (5.61/5.61+local) id ; Mon, 8 Oct 90 08:09:14 -0700 Received: from redwing.parc.xerox.com by arisia.Xerox.COM with SMTP (5.61+/IDA-1.2.8/gandalf) id AA04340; Mon, 8 Oct 90 08:09:52 -0700 Received: by redwing.parc.xerox.com (5.61+/IDA-1.2.8/gandalf) id AA02919; Sun, 7 Oct 90 11:29:53 PDT Sender: Lixia Zhang Date: Sun, 7 Oct 1990 11:29:51 PDT From: lixia@parc.xerox.com Reply-To: lixia@parc.xerox.com To: braden Cc: end2end-interest Subject: Re: Architecture Limits In-Reply-To: Your message of Fri, 5 Oct 1990 18:36:38 PDT Message-Id: I second Larry's bullet on functionality.
In addition to what Larry already mentioned, TCP/IP also lacks support to real time applications with rigid performance requirements, such as video and voice. I also have a question: On the other hand, rate-based flow control may be subject to catastrophic phase-entrainment effects, and we don't currently know how to avoid this. I don't understand the problem well. Could you explain a bit ? Lixia From guru@flora.wustl.edu Mon Oct 8 09:39:31 1990 Posted-Date: Mon, 08 Oct 90 11:41:01 -0500 Received-Date: Mon, 8 Oct 90 09:39:31 -0700 Received: from wucs1.wustl.edu by venera.isi.edu (5.61/5.61+local) id ; Mon, 8 Oct 90 09:39:31 -0700 Return-Path: Received: from flora.wustl.edu by wucs1.wustl.edu (5.59/1.35); id AA26529; Mon, 8 Oct 90 11:38:13 CDT Received: from localhost by flora.wustl.edu (4.0/SMI-4.0) id AA05231; Mon, 8 Oct 90 11:41:02 CDT Message-Id: <9010081641.AA05231@flora.wustl.edu> To: braden Cc: end2end-interest Subject: Re: Architecture Limits In-Reply-To: Your message of Fri, 05 Oct 90 16:31:16 -0700. <9010052331.AA18098@braden.isi.edu> Date: Mon, 08 Oct 90 11:41:01 -0500 From: Gurudatta Parulkar > Hi. At a recent IAB meeting, the question came up of limits to the > Internet architecture. Since I pointed out that the E2E RG has been > delving into this for several years, I was asked to summarize our > thoughts for the IAB. Here is my attempt; please give me feedback, > on errors and/or omissions. I would like to first understand what does the IAB consider as the target environment for the next generation Internet Architecture. Specifically, there are two points that I consider important: - what application mix the internet architecture is to support? For example, a number of people believe the application mix would include more of interactive imaging (multimedia conferencing, interactive televisualization, interactive electronic radiology, etc. etc.) and less (in proportion, of course) of existing traffic, such as electronic mail and domain name system. 
Given this environment, the limits will turn out to be different than otherwise. - can we characterize the underlying networks that would comprise the future internet? If ATM becomes a default technology at the network layer, and phone companies are able to provide packet switched capabilities (instead of just leased circuit switched channels) with statistical guarantees, the burden on the internet may be much less. However, one can easily argue that the future internet (at least the NREN part) will continue to be a set of very heterogeneous networks, with a large part providing no statistical guarantees. This will require the internet architecture to be more sophisticated in order to support the interactive imaging applications well. Without the target environment defined, I don't think we can outline the limits in a meaningful way. > A. Quantitative Transport Protocol Limits > > A.1 TCP Performance > > Depends upon both bandwidth and delay. > > For fully utilize the bandwidth with a large bandwidth*delay > products, TCP needs the extensions defined in RFC-1072: > o extended windows > o timestamps to measure RTT > o selective acknowledgments). Regarding the sequence number, has anybody considered the following simple extension: TCP uses a sequence number for each byte, and therefore can run out of sequence numbers fast. However, TCP can introduce an option using which two ends can agree on using one seq number for each data segment of a large size (e.g., 1Mbyte). In other words, you are numbering a large data segment and not each byte in the segment. The existing sequence number will be sufficient for this scheme. Regarding the selective ack, I want to understand the semantics of its implementation. The selective ack scheme requires the receiving end to buffer partially received data, while the missing packets are detected, retransmission requests made, and retransmissions received. During this wait, does the TCP allow the application to use the partially received data?
If not, selective ACK doesn't seem to help with application blocking due to errors and retransmissions. Also, does the application have any say in what gets selectively acked or nacked, to avoid receiving retransmissions that the application may not care about? Both these comments have to do with providing a service abstraction more than just reliable byte stream. As Larry Peterson also mentioned, the reliable byte stream may not be adequate for all applications. > A.4 Window-Based Flow Control > > There is a fundamental problem with window-based flow control > as the bandwidth-delay product becomes very large: the > transport protocol will be unable to reach steady-state before > the data is exhausted. Furthermore, existing congestion control > mechanisms do not work very well. I am glad to hear this comment. > On the other hand, rate-based flow control may be subject to > catastrophic phase-entrainment effects, and we don't currently > know how to avoid this. However, I wish I could understand this one. I guess this is enough for now. -guru From braden Mon Oct 8 10:04:53 1990 Received-Date: Mon, 8 Oct 90 10:04:53 -0700 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Mon, 8 Oct 90 10:04:53 -0700 Date: Mon, 8 Oct 90 10:04:45 PDT From: braden Posted-Date: Mon, 8 Oct 90 10:04:45 PDT Message-Id: <9010081704.AA19000@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Mon, 8 Oct 90 10:04:45 PDT To: lixia@parc.xerox.com Subject: Re: Architecture Limits Cc: end2end-interest From lixia@redwing.parc.xerox.com Mon Oct 8 08:09:33 1990 Sender: Lixia Zhang Date: Sun, 7 Oct 1990 11:29:51 PDT From: lixia@parc.xerox.com Reply-To: lixia@parc.xerox.com To: braden@ISI.EDU Cc: end2end-interest@ISI.EDU Subject: Re: Architecture Limits In-Reply-To: Your message of Fri, 5 Oct 1990 18:36:38 PDT I second Larry's bullet on functionality.
In addition to what Larry already mentioned, TCP/IP also lacks support to real time applications with rigid performance requirements, such as video and voice. I also have a question: On the other hand, rate-based flow control may be subject to catastrophic phase-entrainment effects, and we don't currently know how to avoid this. Yes, I did not include these qualitative issues simply because I did not think they were intended to be part of the question. However, you are convincing me that for completeness they should be included. I don't understand the problem well. Could you explain a bit ? Lixia The phases of all the "independently" clocked transmitters may drift until they are aligned, and then all transmitters will send packets simultaneously, overflowing queues. Bob From estrin%jerico.usc.edu@usc.edu Mon Oct 8 11:37:27 1990 Posted-Date: Mon, 8 Oct 90 11:37:13 PDT Received-Date: Mon, 8 Oct 90 11:37:27 -0700 Received: from usc.edu by venera.isi.edu (5.61/5.61+local) id ; Mon, 8 Oct 90 11:37:27 -0700 Received: from jerico.usc.edu by usc.edu (5.59/SMI-3.0DEV3) id AA09398; Mon, 8 Oct 90 11:36:32 PDT Received: by jerico.usc.edu (4.1/SMI-3.0DEV3) id AA16516; Mon, 8 Oct 90 11:37:13 PDT Date: Mon, 8 Oct 90 11:37:13 PDT Message-Id: <9010081837.AA16516@jerico.usc.edu> From: estrin@usc.edu (Deborah Estrin) Sender: estrin%jerico.usc.edu@usc.edu To: braden%venera.isi.edu@usc.edu Cc: end2end-interest In-Reply-To: braden%venera.isi.edu@usc.edu's message of Sun, 7 Oct 90 15:45:47 PDT <9010072245.AA18568@braden.isi.edu> Subject: Re: Msg from Estrin on Van's connectionless ideas Reply-To: estrin@usc.edu Well, bob, I did give you permission. But you could have editted out the reference to "Van's anti-connectionist RAVINGS"!!!!! Sorry, Van. No slur intended :} ---from one who raves all the time... 
From braden Mon Oct 8 11:43:37 1990 Received-Date: Mon, 8 Oct 90 11:43:37 -0700 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Mon, 8 Oct 90 11:43:37 -0700 Date: Mon, 8 Oct 90 11:43:30 PDT From: braden Posted-Date: Mon, 8 Oct 90 11:43:30 PDT Message-Id: <9010081843.AA19215@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Mon, 8 Oct 90 11:43:30 PDT To: estrin@usc.edu Subject: Re: Msg from Estrin on Van's connectionless ideas Cc: end2end-interest Well, bob, I did give you permission. But you could have editted out the reference to "Van's anti-connectionist RAVINGS"!!!!! Sorry, Van. No slur intended :} ---from one who raves all the time... Deborah, Sorry, I didn't edit it out because I thought it was sorta cute, and I knew Van would enjoy it rather than be offended by it. Researchers are supposed to rave, aren't they? Bob From tds@honet9.att.com Mon Oct 8 15:36:38 1990 Posted-Date: Mon, 8 Oct 90 18:15 EDT Received-Date: Mon, 8 Oct 90 15:36:38 -0700 Message-Id: <9010082236.AA08680@venera.isi.edu> Received: from att.att.com by venera.isi.edu (5.61/5.61+local) id ; Mon, 8 Oct 90 15:36:38 -0700 From: tds@honet9.att.com Date: Mon, 8 Oct 90 18:15 EDT To: end2end-interest Cc: estrin@usc.edu In-Reply-To: Subject: RE: Msg from Estrin on Van's connectionless ideas >>>>> On Mon, 8 Oct 90 11:37:13 PDT, estrin@usc.edu (Deborah Estrin) said: Deborah> Well, bob, I did give you permission. But you could have editted out Deborah> the reference to "Van's anti-connectionist RAVINGS"!!!!! >>>>> On Mon, 8 Oct 90 11:43:30 PDT, braden@venera.isi.edu said: braden> Sorry, I didn't edit it out because I thought it was sorta cute, and I braden> knew Van would enjoy it rather than be offended by it. Researchers braden> are supposed to rave, aren't they? Hmmmm. Is this some kind of invitation? OK, here goes! 
I've had only a couple of discussions with Van on this connectionist stuff, and it could be that I don't understand him, but it all still sounds like dogma to me. I originally thought the issue was stateful versus stateless. The less state the better, at least conceptually, and at least for failure recovery. But apparently for real-time traffic the case is settled: stateless network nodes cannot do resource reservation (by definition) and so cannot make service guarantees. And stateless nodes imply a "bad" service discipline, most likely FIFO. A network that wants to provide service for real-time traffic must provide service guarantees and must maintain state information. But once I figured this out, I learned that stateful/stateless was not the real issue, but something about connection vs. connectionless was the issue. At this point I mostly fall off the boat. I don't see why anything Deborah said Van said (!) applies only to (connectionless, datagram, route-on-global-address) networks and not to (connection-oriented, ATM, route-on-VCI) networks. Or are we talking about something else? I think Van's point, to me and in Deborah's note, is that a network built to provide datagram service can maintain sufficient state to provide service guarantees. No doubt that is true. But is it somehow inherently better? I don't see why. My feeling is that connections (or flows or associations or something) are the appropriate unit on which to make routing and resource reservation decisions. As well, the network needs to provide for applications that don't want to bother making connections (e.g. name servers, maybe some query-response and of course NTP) but I claim that switches should be designed to provide that on a best-effort basis and expect traffic that is resource (bandwidth) intensive to make reservations. I think this is in the spirit of Lixia's Flow Network work, although I'm too lazy to go find her thesis and check.
[Exercise for the reader: Prove that all applications whose b/w demand should concern the network can stand to make reservations. Hint: consider that we're ultimately serving a user that is pleased to get 250 ms echoplex delays. :-] Getting back to my point: connections are the units on which the network makes routing and resource reservation decisions. That doesn't preclude statistical sharing, but does make rerouting under failure more complicated. The connectionless network node would need to develop some mechanism for service denial, to ensure that its resources are not over-reserved. Maybe this is the crux: the problem is not how to provide reserved resources but what to do when you *deny* a request for resources. This is exactly what the phone company has been trying to do for at least fifty years with its hierarchical alternate-routing network and in the last decade with non-hierarchical routing in the AT&T network. You need to determine an end-to-end connection and reserve resources along the way. The tradeoffs: if you want efficiency you must have high blocking on direct connections and many users sharing alternate paths, but you need to ensure that the use of alternate paths doesn't degrade the performance of the network under congestion. I think the strategies can (with some work) be mapped to networks that allow statistical sharing of bandwidth among virtual connections. I would like to know how that problem is solved in a "connectionless" network without knowing about the pattern of flows in the network. And determining the pattern of flows is cheating, because at that point the only difference between connectionless or connection-oriented is the use of a global destination address for local routing in the former, and a local destination address (assigned at call setup) in the latter. 
But if one is going to go to all the trouble of keeping state information on all the flows going through a gateway, why not take advantage of the call setup or reservation handshake or whatever to assign a virtual circuit identifier to simplify the search of the routing table? At best the argument against is that not much is gained, or we've never done it that way. Whew, I feel much better now. Nothing like a good rant every once in a while! Tony From tds@honet9.att.com Mon Oct 8 16:41:13 1990 Posted-Date: Mon, 8 Oct 90 19:14 EDT Received-Date: Mon, 8 Oct 90 16:41:13 -0700 Message-Id: <9010082341.AA11109@venera.isi.edu> Received: from att.att.com by venera.isi.edu (5.61/5.61+local) id ; Mon, 8 Oct 90 16:41:13 -0700 From: tds@honet9.att.com Date: Mon, 8 Oct 90 19:14 EDT To: end2end-interest In-Reply-To: Subject: RE: Architecture Limits >>>>> On Mon, 8 Oct 90 10:04:45 PDT, braden@venera.isi.edu said: braden> The phases of all the "independently" clocked transmitters may drift until braden> they are aligned, and then all transmitters will send packets simultaneously, braden> overflowing queues. Is this considered a real problem? Sounds like the proverbial "set of measure zero," except for strictly deterministic sources (i.e. TDM) where it is manifested as phase slips (or manipulation of payload pointers for you SONET fans). 
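A minimal sketch of the effect Bob describes in the quoted text above, under stated assumptions: time is slotted, every source sends one packet per `period` slots, and the switch forwards one packet per slot. The function name `peak_queue` and all parameters are hypothetical and do not come from any message in this thread. The sketch only shows what happens to the backlog if the phases are aligned; it says nothing about how likely alignment is, which is the question raised here.

```python
def peak_queue(phases, period, ticks):
    """Peak backlog at a switch that forwards one packet per tick.

    phases: per-source slot offset within the sending period (assumed model).
    period: every source sends exactly one packet per `period` ticks.
    """
    queue = 0
    peak = 0
    for t in range(ticks):
        # A source emits a packet whenever the clock reaches its phase.
        arrivals = sum(1 for p in phases if t % period == p % period)
        queue += arrivals
        peak = max(peak, queue)
        if queue > 0:
            queue -= 1  # the outgoing line drains one packet per tick
    return peak


N_SOURCES, PERIOD = 8, 10            # 80% offered load, assumed values
spread = list(range(N_SOURCES))      # every source in its own slot
aligned = [0] * N_SOURCES            # all sources drifted into the same slot

print("spread phases :", peak_queue(spread, PERIOD, 100))
print("aligned phases:", peak_queue(aligned, PERIOD, 100))
```

The offered load is identical in both runs; only the phases differ. With spread phases the backlog never exceeds one packet, while aligned phases put all eight packets into the queue in the same slot, so a buffer sized for the average case would overflow.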
Tony From cheriton@Pescadero.Stanford.EDU Mon Oct 8 21:27:01 1990 Posted-Date: Mon, 8 Oct 90 21:26:55 PDT Received-Date: Mon, 8 Oct 90 21:27:01 -0700 Received: from Pescadero.Stanford.EDU by venera.isi.edu (5.61/5.61+local) id ; Mon, 8 Oct 90 21:27:01 -0700 Received: by Pescadero.Stanford.EDU (5.59/25-eef) id AA23137; Mon, 8 Oct 90 21:26:55 PDT Date: Mon, 8 Oct 90 21:26:55 PDT From: David Cheriton Message-Id: <9010090426.AA23137@Pescadero.Stanford.EDU> To: braden, lixia@parc.xerox.com Subject: Re: Architecture Limits Cc: end2end-interest As much as I would agree/argue that we don't fully understand rate control (after all, I want to do some further research on this), I think some of the phase entrainment concerns are overblown relative to rate control. If hosts are sending out packets at regular rates, i.e. a packet every 1/r time units for rate r, it seems to me there are significant constraints on the degree that "phase entrainment" can cause problems. 1) for a given switch, packet arrivals are serialized by number of lines, so pulses beyond some level look like back-to-back packets, which switch reception should be able to handle to the point of buffering. 2) The 1/r period is relatively short compared to other times, such as RTT, so I believe that the switching buffer will have to have capacity that matches bursts larger than what one can receive in a period 1/r. That is, the rate interval is too short to overflow buffers/queues unless they are inadequate for any approach (that makes reasonable utilization of the network). 3) Given 1) and 2), queue overflow is a danger only to the degree that the switch is too optimistic in rates it provides to sources, relative to capacity and buffering. But this is an issue for any congestion control scheme. In particular, when a switch starts to see queues grow, does it react strongly at the risk of poor line utilization if one of the sources slows down, or does it bet on some load disappearing and react more mildly?
The former seems necessary with very little buffering, but the latter seems feasible as at least a first round reaction providing there is buffering to absorb the impact when the switch is guessing wrong. (And I recognize that one needs an exponential backup of rates to stabilize when everyone really is leaning on the network). From cheriton@Pescadero.Stanford.EDU Mon Oct 8 22:41:42 1990 Posted-Date: Mon, 8 Oct 90 22:41:37 PDT Received-Date: Mon, 8 Oct 90 22:41:42 -0700 Received: from Pescadero.Stanford.EDU by venera.isi.edu (5.61/5.61+local) id ; Mon, 8 Oct 90 22:41:42 -0700 Received: by Pescadero.Stanford.EDU (5.59/25-eef) id AA23333; Mon, 8 Oct 90 22:41:37 PDT Date: Mon, 8 Oct 90 22:41:37 PDT From: David Cheriton Message-Id: <9010090541.AA23333@Pescadero.Stanford.EDU> To: end2end-interest, tds@honet9.att.com Subject: RE: Msg from Estrin on Van's connectionless ideas Cc: estrin@usc.edu Yes, Tony, I cant wait for your brave new world of "resource reservations" and "real-time traffic guarantees" (which of course I will want written into my contract for service from the telephone companies. Because my computers will anxiously monitor traffic to make sure these guarantees are honored (even though there is nothing like this in the current phone network) and autodial a class action lawyer when they are violated. Also, when Bob Braden gets me to block off a day for a teleconference and the network says "all circuits are now busy, try your call later", just as we all sit down to start the conference, I will be delighted to do so, with my friendly workstation which will try again 2.3 microseconds later, and then again a similar time interval later --- because you never know when those resource might become available, right, and the network certainly cant charge me for denial of service in this situation. If that level of enthusiasm is not sufficient, my workstation has another 40 or more friends that would like to keep the network reservation manager on its toes. 
Sorry for the sarcasm, but ever since I spent 3 days on the east coast trying to reach family on the west coast by phone after Oct. 17th earthquake, knowing full well that the phone network was running at less than 25 percent real utilization (if it was built without circuit setup, etc.), the phone company view that denial of service is better than degraded service just does not wash. If degraded service was such a bogeyman, people would not use cellular phones. David Cheriton From Z.Wang@cs.ucl.ac.uk Tue Oct 9 02:27:12 1990 Posted-Date: Tue, 09 Oct 90 10:24:04 +0100 Received-Date: Tue, 9 Oct 90 02:27:12 -0700 Message-Id: <9010090927.AA25635@venera.isi.edu> Received: from bells.cs.ucl.ac.uk by venera.isi.edu (5.61/5.61+local) id ; Tue, 9 Oct 90 02:27:12 -0700 Received: from sol.cs.ucl.ac.uk by bells.cs.ucl.ac.uk with SMTP inbound id <17217-0@bells.cs.ucl.ac.uk>; Tue, 9 Oct 1990 10:24:11 +0100 To: braden Cc: end2end-interest Subject: Re: Msg from Estrin on Van's connectionless ideas In-Reply-To: Your message of Sun, 07 Oct 90 15:45:47 -0700. <9010072245.AA18568@braden.isi.edu> Date: Tue, 09 Oct 90 10:24:04 +0100 From: Zheng Wang It seems to me that with all the state information inside the network "VC vs datagram" is not an issue any longer. Whether there is reservation or not makes little difference to the network since the information is there and the gateways or switches can easily do the queuing with or without reservation. Whether there is a reservation does affect applications and end user traffic control. So it is decided by the users. If you want absolute guaranteed resources, then make reservations. Or if you can accept degraded services, ask for the best deal (it works like a theatre ticket office -- if you want to make sure, you book it and pay the standard price, otherwise you can go there before it starts.
You may get a cheaper ticket if the demand is low or buy one in the black market :-) Zheng From guru@flora.wustl.edu Tue Oct 9 08:13:34 1990 Posted-Date: Tue, 09 Oct 90 10:14:50 -0500 Received-Date: Tue, 9 Oct 90 08:13:34 -0700 Received: from wucs1.wustl.edu by venera.isi.edu (5.61/5.61+local) id ; Tue, 9 Oct 90 08:13:34 -0700 Return-Path: Received: from flora.wustl.edu by wucs1.wustl.edu (5.59/1.35); id AA19176; Tue, 9 Oct 90 10:12:03 CDT Received: from localhost by flora.wustl.edu (4.0/SMI-4.0) id AA05840; Tue, 9 Oct 90 10:14:52 CDT Message-Id: <9010091514.AA05840@flora.wustl.edu> To: David Cheriton Cc: end2end-interest, tds@honet9.att.com, estrin@usc.edu Subject: Re: Msg from Estrin on Van's connectionless ideas In-Reply-To: Your message of Mon, 08 Oct 90 22:41:37 -0700. <9010090541.AA23333@Pescadero.Stanford.EDU> Date: Tue, 09 Oct 90 10:14:50 -0500 From: Gurudatta Parulkar Well, I understand this point, and I thought this has been the strength of a classical datagram model. However, I thought Tony was trying to contrast Van's proposed scheme with the connection-oriented scheme. Clearly, Van has moved away a great deal from the classical datagram model. His scheme (as I understand it) includes reservations and attempts to provide performance guarantees. As Tony outlined, I don't see why Van's scheme (which seems a little perplexing) should do better than the straightforward connection-oriented one. That is, if you are going to have reservations to provide performance guarantees, what is wrong with doing it the straightforward way? I don't think Tony or others who argue for connection-oriented support disagree that we should continue to have datagram support, and applications can use it, if they can tolerate degraded service. -guru Yes, Tony, I cant wait for your brave new world of "resource reservations" and "real-time traffic guarantees" (which of course I will want written into my contract for service from the telephone companies.
Because my computers will anxiously monitor traffic to make sure these guarantees are honored (even though there is nothing like this in the current phone network) and autodial a class action lawyer when they are violated. Also, when Bob Braden gets me to block off a day for a teleconference and the network says "all circuits are now busy, try your call later", just as we all sit down to start the conference, I will be delighted to do so, with my friendly workstation which will try again 2.3 microseconds later, and then again a similar time interval later --- because you never know when those resource might become available, right, and the network certainly cant charge me for denial of service in this situation. If that level of enthusiasm is not sufficient, my workstation has another 40 or more friends that would like to keep the network reservation manager on its toes. Sorry for the sarcasm, but ever since I spent 3 days on the east coast trying to reach family on the west coast by phone after Oct. 17th earthquake, knowing full well that the phone network was running at less than 25 percent real utilization (if it was built without circuit setup, etc.), the phone company view that denial of service is better than degraded service just does not wash. If degraded service was such a bogeyman, people would not use cellular phones.
David Cheriton From guru@flora.wustl.edu Tue Oct 9 08:15:17 1990 Posted-Date: Tue, 09 Oct 90 10:16:45 -0500 Received-Date: Tue, 9 Oct 90 08:15:17 -0700 Received: from wucs1.wustl.edu by venera.isi.edu (5.61/5.61+local) id ; Tue, 9 Oct 90 08:15:17 -0700 Return-Path: Received: from flora.wustl.edu by wucs1.wustl.edu (5.59/1.35); id AA19208; Tue, 9 Oct 90 10:13:59 CDT Received: from localhost by flora.wustl.edu (4.0/SMI-4.0) id AA05852; Tue, 9 Oct 90 10:16:47 CDT Message-Id: <9010091516.AA05852@flora.wustl.edu> To: Zheng Wang Cc: braden, end2end-interest Subject: Re: Msg from Estrin on Van's connectionless ideas In-Reply-To: Your message of Tue, 09 Oct 90 10:24:04 +0100. <9010090927.AA25635@venera.isi.edu> Date: Tue, 09 Oct 90 10:16:45 -0500 From: Gurudatta Parulkar It seems to me that with all the state information inside the network "VC vs datagram" is not an issue any longer. That is why we call our service primitive CONGRAM (aimed at providing strengths of both CONnection and dataGRAM)! -guru From postel Tue Oct 9 10:01:57 1990 Received-Date: Tue, 9 Oct 90 10:01:57 -0700 Received: from bel.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Tue, 9 Oct 90 10:01:57 -0700 Date: Tue, 9 Oct 90 10:00:30 PDT From: postel Posted-Date: Tue, 9 Oct 90 10:00:30 PDT Message-Id: <9010091700.AA26642@bel.isi.edu> Received: by bel.isi.edu (4.1/4.0.3-4) id ; Tue, 9 Oct 90 10:00:30 PDT To: end2end-interest Subject: ravings Tony: Hi. In the ever shifting grounds of the connections vs. datagrams debate another place where the argument rages is the nature of the state and what happens when things go wrong. There is a suggestion that the state in the routers be "soft state" in the sense that if the router hiccups and forgets all its state, communication keeps working (on a raw datagram basis) and the router builds up its state again to provide the preferred performance (or the resource reservation).
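One way to picture the soft-state idea described here, offered as a sketch rather than as anything proposed in this thread. The class `SoftStateRouter`, its methods, and the labels `reserved` and `best-effort` are all hypothetical names chosen for illustration.

```python
class SoftStateRouter:
    """Toy router whose per-flow state is only a hint (hypothetical model)."""

    def __init__(self):
        # flow id -> service class; losing this table must never stop traffic
        self.flows = {}

    def refresh(self, flow_id, service_class):
        # Reservation information carried along with the traffic
        # installs or re-installs the state.
        self.flows[flow_id] = service_class

    def crash(self):
        # The router "hiccups" and forgets everything it knew.
        self.flows.clear()

    def forward(self, flow_id):
        # Unknown flows are still forwarded, just as raw datagrams.
        return self.flows.get(flow_id, "best-effort")


router = SoftStateRouter()
router.refresh("video-1", "reserved")
print(router.forward("video-1"))   # state present

router.crash()
print(router.forward("video-1"))   # state lost, traffic still flows

router.refresh("video-1", "reserved")
print(router.forward("video-1"))   # state rebuilt by the next refresh
```

The design point is that `forward` has a default: the table improves service when it is present, but its absence degrades service instead of breaking the communication.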
The idea is that somehow the traffic could be handled (with some perhaps lower performance for the end user) as raw datagrams, and that additional information for resource reservation, or policy routing, or classes of service is carried along and built up in the routers that can then give better service to datagrams matching one of these preferred classes; but if anything goes wrong (a route changes because a router goes down, or the state information is lost) the communication keeps on going as raw datagrams (until the state information is built up again). --jon. From tds@honet9.att.com Tue Oct 9 10:06:21 1990 Posted-Date: Tue, 9 Oct 90 11:24 EDT Received-Date: Tue, 9 Oct 90 10:06:21 -0700 Message-Id: <9010091706.AA06673@venera.isi.edu> Received: from att.att.com by venera.isi.edu (5.61/5.61+local) id ; Tue, 9 Oct 90 10:06:21 -0700 From: tds@honet9.att.com Date: Tue, 9 Oct 90 11:24 EDT To: cheriton@Pescadero.Stanford.EDU Cc: end2end-interest, estrin@usc.edu In-Reply-To: Subject: RE: Msg from Estrin on Van's connectionless ideas >>>>> On Mon, 8 Oct 90 22:41:37 PDT, David Cheriton said: [demonstration of the litigious mentality rampant in America deleted] David> I will be delighted to do so, with my friendly workstation which will try David> again 2.3 microseconds later, and then again a similar time interval David> later --- because you never know when those resource might become David> available, right, and the network certainly cant charge me for denial David> of service in this situation. At least now I know where the "malicious users" everyone talks about can be found. I wonder what PacTel does to you today if you leave your phone off the hook, constantly demanding service, without dialing. You'll probably find that the phone company values the dial-tone generator enough to take it back after a time whether you like it or not.
Some of the things the phone company did in the predivestiture days to "protect the network" were positively comical, like the Hush-a-Phone business [anybody remember that: the phone company tried to argue that a little foam handset attachment was an illegal interconnection device], but your message helps me understand that kind of thinking within the phone company. David> Sorry for the sarcasm, Sarcasm's fine, as long as you keep your sense of humor. I wonder: did a phone company truck run over your dog when you were a kid? David> but ever since I spent 3 days on the east coast David> trying to reach family on the west coast by phone after Oct. 17th earthquake, David> knowing full well that the phone network was running at less than 25 percent David> real utilization (if it was built without circuit setup, etc.), the phone I think this was explained to you at the Cambridge meeting last January, but I'll try again. The network (in AT&T's case anyway) implemented controls to block incoming calls to the bay area so that outgoing calls would have a better chance of completing. Since you were calling in, you were blocked so that your family would have a better chance of calling you if they tried. This is a good strategy for a circuit-switched network with a focused calling pattern and a high probability of blocking on access links at the focus. The "real utilization" figure above is of course nonsense. Circuit utilization was near 100%. If you're suggesting that a better strategy would have been to redesign the phone network instead so that it didn't use circuits, well, what can I say, some of us need to solve real problems in real time. David> company view that denial of service is better than degraded service just does not David> wash. If degraded service was such a bogeyman, David> people would not use cellular phones. I'm not precluding some kind of service degradation. For example, schemes for bit-dropping in voice communications have been around for some time.
My contention is that networks shared by many users that are uncontrolled in allocating resources operate well only at low utilizations. When resources become scarce, the network is in the best position to search out and allocate resources. I wish you could give me a counterexample. If you're opposed to reservations and (ultimately) denial of service, what do you suggest instead? Since the flame-to-fact density of the above seems rather high, let me make a quantitative observation. Earlier this year I was browsing some of the NSFnet traffic data that Merit makes available. At that point there had been a tenfold increase in packet traffic traversing the net in a little more than a year. I could be wrong, but I believe there was no corresponding increase in transmission capacity. This tells me that those T1 puppies were way underutilized, which is OK as long as your uncle is paying the bill but not likely to be a stable mode of operation when somebody tries to figure out how to charge for all this. From braden Tue Oct 9 10:22:12 1990 Received-Date: Tue, 9 Oct 90 10:22:12 -0700 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Tue, 9 Oct 90 10:22:12 -0700 Date: Tue, 9 Oct 90 10:22:04 PDT From: braden Posted-Date: Tue, 9 Oct 90 10:22:04 PDT Message-Id: <9010091722.AA19686@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Tue, 9 Oct 90 10:22:04 PDT To: guru@flora.wustl.edu Subject: Re: Architecture Limits Cc: end2end-interest Guru, Thanks for your comments. I would like to first understand what does the IAB consider as the target environment for the next generation Internet Architecture. Specifically, there are two points that I consider important: - what application mix the internet architecture is to support? For example, a number of people believe the application mix would include more of interactive imaging (multimedia conferencing, interactive televisualization, interactive electronic radiology, etc. etc.)
and less (in proportion, of course) of existing traffic, such as electronic mail and domain name system. Given this environment, the limits will turn out to be different than otherwise. I think you are trying to make the same point that Lixia did... that the current Internet architecture provides only best-effort delivery, not qualities of service such as bounded-delay. It is not a question of MORE of these services; the current system does not support them at all! There is no doubt that this is a limitation. - can we characterize the underlying networks that would comprise the future internet? If ATM becomes a default technology at the network layer, and phone companies are able to provide packet switched capabilities (instead of just leased circuit switched channels) with statistical guarantees, the burden on the internet may be much less. However, one can easily argue that the future internet (at least the NREN part) will continue to be a set of very heterogeneous networks, with a large part providing no statistical guarantees. This will require the internet architecture to be more sophisticated in order to support the interactive imaging applications well. This seems like an interesting question. One way to put it might be: a prime virtue of the Internet architecture is that it can use any network, no matter how bad... a future architecture needs to be able to make effective use of any network, no matter how good. Without the target environment defined, I don't think we can outline the limits in a meaningful way. That seems too strong a statement... > A. Quantitative Transport Protocol Limits > > A.1 TCP Performance > > Depends upon both bandwidth and delay. > > To fully utilize the bandwidth with a large bandwidth*delay > product, TCP needs the extensions defined in RFC-1072: > o extended windows > o timestamps to measure RTT > o selective acknowledgments.
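A worked example of why the bandwidth*delay product matters, as a rough sketch. The 65535-byte figure is the largest window expressible in TCP's 16-bit window field without extensions; the 60 ms round-trip time and the 1 Gbit/s path are assumed values for illustration, and the function names are hypothetical.

```python
CLASSIC_MAX_WINDOW = 65535   # bytes; limit of the 16-bit TCP window field
RTT = 0.060                  # seconds; assumed cross-country round trip


def max_throughput_bps(window_bytes, rtt_seconds):
    """A sender can have at most one window outstanding per round trip."""
    return window_bytes * 8 / rtt_seconds


def window_needed_bytes(bandwidth_bps, rtt_seconds):
    """Bandwidth*delay product: bytes in flight needed to keep the pipe full."""
    return bandwidth_bps * rtt_seconds / 8


ceiling = max_throughput_bps(CLASSIC_MAX_WINDOW, RTT)
print(f"ceiling with a 64 KB window: {ceiling / 1e6:.2f} Mbit/s")   # about 8.74

needed = window_needed_bytes(1e9, RTT)   # assumed 1 Gbit/s path
print(f"window needed at 1 Gbit/s:   {needed / 1e6:.1f} MB")        # 7.5
```

Under these assumptions an unextended window caps a single connection below 9 Mbit/s regardless of link speed, while filling a gigabit path would need a window more than a hundred times larger than the classic field can express.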
Regarding the sequence number, has anybody considered the following simple extension: TCP uses a sequence number for each byte, and therefore, can run out of sequence numbers fast. However, TCP can introduce an option using which two ends can agree on using one seq number for each data segment of a large size (e.g., 1Mbyte). In other words, you are numbering a large data segment and not each byte in the segment. The existing sequence number will be sufficient for this scheme. Yes, this has been considered, but rejected as too complex in practice and too ugly in theory. Van has come up with a much cleaner approach, described in an RFC that has been submitted and should be published soon. Regarding the selective ack, I want to understand semantics of its implementation. The selective ack scheme requires the receiving end to buffer partially received data, while the missing packets are detected, retransmission requests made, and retransmissions received. During this wait, does TCP allow the application to use the partially received data? If not, selective ACK doesn't seem to help application blocking due to errors and retransmissions? Also, does the application have any say in what gets selectively acked or nacked, to avoid receiving retransmissions that the application may not care about? Both these comments have to do with providing a service abstraction more than just reliable byte stream. As Larry Peterson also mentioned, the reliable byte stream may not be adequate for all applications. Grafting a more complex service abstraction onto TCP would probably not work very well. Note that a new transport protocol with the attributes you suggest could easily be added to the Internet, in parallel with TCP. Therefore, this is a limitation in the current realization, not a fundamental architectural limitation such as the lack of qualities of service that you talked about earlier.
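On the sequence-number point raised earlier in this message, a back-of-the-envelope sketch of how quickly a 32-bit byte-numbered sequence space wraps. The link rates are assumed values and `wrap_seconds` is a hypothetical helper. The second call models the per-segment numbering suggested above (one number per 1 Mbyte), which the reply says was considered and rejected; it is shown only for the arithmetic.

```python
SEQ_SPACE = 2 ** 32   # distinct 32-bit TCP sequence numbers


def wrap_seconds(rate_bps, bytes_per_seqno=1):
    """Time to consume the whole sequence space at a steady sending rate."""
    total_bytes = SEQ_SPACE * bytes_per_seqno
    return total_bytes * 8 / rate_bps


T1 = 1.544e6      # bit/s
GIGABIT = 1e9     # bit/s, assumed

print(f"T1, per-byte numbering:        {wrap_seconds(T1) / 3600:.1f} hours")
print(f"1 Gbit/s, per-byte numbering:  {wrap_seconds(GIGABIT):.1f} seconds")
print(f"1 Gbit/s, one number per 1 MB: {wrap_seconds(GIGABIT, 2 ** 20) / 86400:.0f} days")
```

At T1 speed the space lasts for hours, but at an assumed gigabit rate it wraps in roughly half a minute, which is the sense in which per-byte numbering "runs out fast".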
Bob Braden From braden Tue Oct 9 10:35:22 1990 Received-Date: Tue, 9 Oct 90 10:35:22 -0700 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Tue, 9 Oct 90 10:35:22 -0700 Date: Tue, 9 Oct 90 10:35:15 PDT From: braden Posted-Date: Tue, 9 Oct 90 10:35:15 PDT Message-Id: <9010091735.AA19715@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Tue, 9 Oct 90 10:35:15 PDT To: cheriton@pescadero.stanford.edu, tds@honet9.att.com Subject: RE: Msg from Estrin on Van's connectionless ideas Cc: end2end-interest, estrin@usc.edu Since the flame-to-fact density of the above seems rather high, let me make a quantitative observation. Earlier this year I was brousing some of the NSFnet traffic data that Merit makes available. At that point there had been a tenfold increase in packet traffic traversing the net in a little more than a year. I could be wrong, but I believe there was no corresponding increase in transmission capacity. This tells me that those T1 puppies were way underutilized, which is OK as long as your uncle is paying the bill but not likely to be a stable mode of operation when somebody tries to figure out how to charge for all this. Tony, Aw c'mon, surely it is true for the phone companies as for everyone else that new capacity tends to come in big chunks with significant capital costs. Like laying fibers, for example. The result is that at any time, there must be parts of the network that are seriously under-utilized, while other parts are not. The phone companies have the advantage of very large numbers, of course, but they cannot repeal the laws of economics! This does not diminish your argument about the relative behavior of CO vs. CL under heavy load; I am just reacting to what seems to me to be a somewhat bogus (as well as irrelevant) "uncle pays" comment. 
Bob Braden From braden Tue Oct 9 10:41:53 1990 Received-Date: Tue, 9 Oct 90 10:41:53 -0700 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Tue, 9 Oct 90 10:41:53 -0700 Date: Tue, 9 Oct 90 10:41:48 PDT From: braden Posted-Date: Tue, 9 Oct 90 10:41:48 PDT Message-Id: <9010091741.AA19749@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Tue, 9 Oct 90 10:41:48 PDT To: llp@cs.arizona.edu Subject: Re: Architecture Limits Cc: end2end-interest From llp@cs.arizona.edu Fri Oct 5 18:36:58 1990 Date: Fri, 5 Oct 90 18:36:38 MST From: "Larry Peterson" To: braden@ISI.EDU Cc: end2end-interest@ISI.EDU In-Reply-To: <9010052331.AA18098@braden.isi.edu> Subject: Architecture Limits How about a bullet on functionality. The architecture provides only a byte-stream and an unreliable datagram end-to-end service. It does not support RPC or any of its variations, and it does not support group communication. This may seem like too obvious a point to make, but I consider this the most significant limitation of the architecture---there are probably more application programmers out there rolling their own transport protocols *outside* the architecture than there are programmers using naked TCP and UDP. Larry Larry, I think I would claim that the Internet architecture does support RPC. Don't you run RPC over the Internet? VMTP certainly does. Perhaps you are suggesting the lack of a widely-implemented transaction transport protocol, a theme with which I certainly agree. Similarly, in what sense does the Internet architecture, as it is currently defined to include IP multicast, not support group communication? Is your concern at the transport level?
Bob Braden From guru@flora.wustl.edu Tue Oct 9 12:12:58 1990 Posted-Date: Tue, 09 Oct 90 14:14:22 -0500 Received-Date: Tue, 9 Oct 90 12:12:58 -0700 Received: from wucs1.wustl.edu by venera.isi.edu (5.61/5.61+local) id ; Tue, 9 Oct 90 12:12:58 -0700 Return-Path: Received: from flora.wustl.edu by wucs1.wustl.edu (5.59/1.35); id AA26993; Tue, 9 Oct 90 14:11:32 CDT Received: from localhost by flora.wustl.edu (4.0/SMI-4.0) id AA06026; Tue, 9 Oct 90 14:14:23 CDT Message-Id: <9010091914.AA06026@flora.wustl.edu> To: braden Cc: end2end-interest, estrin@usc.edu Subject: Re: Msg from Estrin on Van's connectionless ideas In-Reply-To: Your message of Tue, 09 Oct 90 10:35:15 -0700. <9010091735.AA19715@braden.isi.edu> Date: Tue, 09 Oct 90 14:14:22 -0500 From: Gurudatta Parulkar Since the flame-to-fact density of the above seems rather high, let me make a quantitative observation. Earlier this year I was brousing some of the NSFnet traffic data that Merit makes available. At that point there had been a tenfold increase in packet traffic traversing the net in a little more than a year. I could be wrong, but I believe there was no corresponding increase in transmission capacity. This tells me that those T1 puppies were way underutilized, which is OK as long as your uncle is paying the bill but not likely to be a stable mode of operation when somebody tries to figure out how to charge for all this. Tony, Aw c'mon, surely it is true for the phone companies as for everyone else that new capacity tends to come in big chunks with significant capital costs. Like laying fibers, for example. The result is that at any time, there must be parts of the network that are seriously under-utilized, while other parts are not. The phone companies have the advantage of very large numbers, of course, but they cannot repeal the laws of economics! This does not diminish your argument about the relative behavior of CO vs. 
CL under heavy load; I am just reacting to what seems to me to be a somewhat bogus (as well as irrelevant) "uncle pays" comment. In most of this discussion, we are forgetting the lowest level service primitive, which is actually circuit switched today. For example, it is true that the datagram networks, such as NSFNet backbone, are built out of leased circuit switched lines that are part of complex TDM hierarchy of the phone network. The leased lines of the backbone need to be considerably underutilized to get decent service to datagram networks with relatively unpredictable traffic patterns. Also, the datagram network provider is paying for the peak rate of the leased line (if you lease 1.5 Mbps, the phone company assumes you are sending data at peak rate all the time). Thus, today a datagram network is essentially built on top of circuit switched network. Wouldn't it be more effective if the phone company could provide a leased packet switched channel (LPSC) on demand, with the datagram network paying on a usage basis (goals of ATM)? Thus, every time there is a TCP connection opened, the gateway can ask the phone network to set up an LPSC, and tear it down when the TCP connection is terminated. This is much more effective in terms of cost and usage of resources. Now, the point I want to make is that this LPSC on demand can be the same as a connection with resource reservations. So one can build both datagram networks and connection-oriented networks using LPSC.
-guru From braden Tue Oct 9 13:39:04 1990 Received-Date: Tue, 9 Oct 90 13:39:04 -0700 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Tue, 9 Oct 90 13:39:04 -0700 Date: Tue, 9 Oct 90 13:38:59 PDT From: braden Posted-Date: Tue, 9 Oct 90 13:38:59 PDT Message-Id: <9010092038.AA19897@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Tue, 9 Oct 90 13:38:59 PDT To: end2end-interest Subject: Comments from Gerald Newman on Cray TCP speedups ----- Begin Included Message ----- From gkn@Sds.Sdsc.Edu Tue Oct 2 11:12:12 1990 Date: Tue, 2 Oct 90 18:11:24 GMT From: gkn@Sds.Sdsc.Edu (Gerard K. Newman) Subject: RE: TCP extensions To: braden@ISI.EDU Organization: San Diego Supercomputer Center X-St-Vmsmail-To: ST%"braden@venera.isi.edu" Bob: We're considering making some enhancements to TCP processing in order to get more out of the CASA gigabit testbed. We're mostly concentrating on the enhancement of a particular implementation of TCP (UNICOS) rather than protocol changes; however, it seems pretty clear that we need some extensions to TCP to support big windows and perhaps selective acknowledgement. This is not yet work in progress, but I'll throw out some things we've been thinking about down here anyway. A couple of obvious places to look for performance enhancement are buffering (minimizing the number of copies, appropriate buffer sizes, etc), checksumming (although this already vectorizes on the CRAY), timer support (it's still pretty expensive to set a timer in UNICOS). Another place to look which is CRAY specific is the interface between the mainframe itself and the IOS (I/O subsystem), which has some performance bottlenecks as well. Another killer for performance in CASA is the fact that we have to do data marshalling; the CRAY floating point format is like nothing else, and we have to convert between it and the IEEE format that the CM-2 at LANL uses.
We have two algorithms for doing this -- one which is robust but costs 18 cp (clock periods, 6.1ns) per 64-bit word, and one which is not robust (doesn't check for legality, nor does it cope well with the edges of the conversion space) which costs only 3 cp. As ugly as this sounds, one thing under consideration is to do the copy/checksum/conversion at the same time, saving a pass over the buffers to be transmitted. This of course supposes that everything being transmitted is floating-point data. I'm also interested in being on the end2end-interest mailing list. Cheers, gkn ----- End Included Message ----- From tds@honet9.att.com Tue Oct 9 22:31:36 1990 Posted-Date: Tue, 9 Oct 90 22:35 EDT Received-Date: Tue, 9 Oct 90 22:31:36 -0700 Message-Id: <9010100531.AA04217@venera.isi.edu> Received: from att.att.com by venera.isi.edu (5.61/5.61+local) id ; Tue, 9 Oct 90 22:31:36 -0700 From: tds@honet9.att.com Date: Tue, 9 Oct 90 22:35 EDT To: end2end-interest, estrin@usc.edu In-Reply-To: Subject: there are circuits and there are circuits... (was: RE: Msg from Estrin on Van's connectionless ideas) >>>>> On Tue, 09 Oct 90 14:14:22 -0500, Gurudatta Parulkar said: Guru> In most of this discussion, we are forgeting the lowest level service Guru> primitive, which is actually circuit switched today. Guru> For example, it is true that the datagram networks, such as NSFNet Guru> backbone, are built out of leased circuit switched lines that are part Guru> of complex TDM hierarchy of the phone network. There's a distinction that should be made between switched circuits and leased circuits (i.e. those used in the NSFnet). Both leased circuits and interswitch trunks to carry switched traffic are routed on the same facility (fiber, microwave, whatever) network using DCSs (Digital Cross-Connect Systems). This routing changes slowly. Then switches (4ESS, 5ESS, DMS, whatever) route calls on the trunks. 
So the telephone companies have a circuit-switched network built on top of a facility network. From my perspective, the NSFnet is built from the facility network and not the circuit-switched network. Guru> Thus, today a datagram Guru> network is essentially built on top of circuit switched network. I would maintain that this is not true. Today a datagram network is built on top of a facility network. Higher speeds in the TDM hierarchy were invented for "facility multiplexing," not to provide service to users, so the only way to get higher bandwidth was to get resources out of the facility pot. Until recently real-time circuit-switching was done only for 64 Kb/s circuits. Today switched service is available at up to 1.536 Mb/s, but as far as I know no one has suggested any nontrivial use of real-time wideband circuits in datagram networks. Guru> Wouldn't it be more effective, if the phone company can provide a Guru> leased packet switched channel (LPSC) on demand, and the Guru> datagram network pays on the usage basis (goals of ATM). Thus, every Guru> time there is a TCP connection opened, the gateway can ask the phone Guru> network to set up a LPSC, and tear is down when TCP connection is Guru> terminated. This is much more effective in terms of cost and usage of Guru> resources. This may be a longer-term architecture, but similar arguments might be made for datagrams on top of switched circuits, which might be the precursor to the ATM world. In any case I don't think the correct view is to map a TCP connection to a connection (either circuit or LPSC) in the carrier's network. I imagine that if gateways are keeping track of point-to-point traffic flows, there should be a way of feeding that information into a routing algorithm to determine the optimal rearrangement of connections. (This is the focus of a research proposal I'm putting together so if anyone's interested in pursuing this let me know.) 
For example, I made another observation from the NSFNet data: the two highest packet counts were recorded at Princeton and Palo Alto, so all else being equal (which it never is) one would expect the highest point-to-point traffic to be between Princeton and Palo Alto. Yet the shortest path between those nodes takes three hops (if the map I have is current). Gateways that could dynamically establish circuits might establish a direct Princeton-Palo Alto connection to avoid using up resources on the multihop path. There are efficiencies on both sides: the datagram network calls up bandwidth when and where it needs it and the carrier provides the bandwidth out of the shared pool of switched resources. Tony Oh, and by the way, >>>>> On Tue, 9 Oct 90 10:35:15 PDT, braden@venera.isi.edu said: Bob> what seems to me to be a somewhat bogus (as well as irrelevant) "uncle Bob> pays" comment. Bob's completely right about the bogusity of my "uncle pays" reference. From cheriton@Pescadero.Stanford.EDU Tue Oct 9 22:33:08 1990 Posted-Date: Tue, 9 Oct 90 22:33:01 PDT Received-Date: Tue, 9 Oct 90 22:33:08 -0700 Received: from Pescadero.Stanford.EDU by venera.isi.edu (5.61/5.61+local) id ; Tue, 9 Oct 90 22:33:08 -0700 Received: by Pescadero.Stanford.EDU (5.59/25-eef) id AA28290; Tue, 9 Oct 90 22:33:01 PDT Date: Tue, 9 Oct 90 22:33:01 PDT From: David Cheriton Message-Id: <9010100533.AA28290@Pescadero.Stanford.EDU> To: braden, guru@flora.wustl.edu Subject: Re: Architecture Limits Cc: end2end-interest Dave Clark gave a very eloquent and insightful talk at SIGCOMM (in my humble opinion), arguing that a key piece of the magic in the Internet was designing for what we know not, all sorts of future applications and network technologies. If Guru's claim is correct, that we need to be able to predict the future to design for it, it's time to collect up our marbles and go home before we lose our marbles, so to speak.
I think it would be interesting to hear some compelling arguments that identify what, if anything, we really do need to know about the future to do the right thing now. I think the key deficiency is lack of knowledge about the mechanisms and approaches being proposed, not the futurenet and application characteristics. From zsu@NISC.SRI.COM Tue Oct 9 23:40:51 1990 Posted-Date: Tue, 09 Oct 90 23:39:21 PDT Received-Date: Tue, 9 Oct 90 23:40:51 -0700 Received: from TERRA.NISC.SRI.COM by venera.isi.edu (5.61/5.61+local) id ; Tue, 9 Oct 90 23:40:51 -0700 Received: by terra.nisc.sri.com (5.64/SRI-NISC1.0) id AA04751; Tue, 9 Oct 90 23:39:23 -0700 From: zsu@NISC.SRI.COM (Zaw-Sing Su) Message-Id: <9010100639.AA04751@terra.nisc.sri.com> To: braden Cc: end2end-interest Subject: Re: Msg from Estrin on Van's connectionless ideas In-Reply-To: Your message of Sun, 07 Oct 90 15:45:47 -0700. <9010072245.AA18568@braden.isi.edu> Date: Tue, 09 Oct 90 23:39:21 PDT > 2. Gateways can keep state regarding: > a-user class > > b-type of service class (maybe this is combined with user class?) > > c-priority > > d-resource-reservation ID (if there is one) > > e- % allocation allowed to (a,b,c,d) class (a,b,c,d together define > aggregatable traffic that share resources on a datagram basis) > > f-counter of packets handled for the class > > g-absolute threshold allowed to this class (don't remember what this > is) > > Given this information, you can do anything on top of it. The goal > is to aggregate traffic as much as possible. This creates the most > efficient system because of the statistical properties of traffic > (more aggregation--smoother looking traffic patterns)... Without identifying the different connections that packets belong to, how does Van suggest aggregating traffic from parallel real-time (say, voice) transactions while still delivering the performance (voice quality in this case) required for each transaction?
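The per-class gateway state quoted above (items a through g) can be written down as a small data structure. What follows is only a minimal sketch under stated assumptions: the names ClassKey, ClassState and Gateway are hypothetical and do not come from the quoted message, and item g is merely stored because the quoted text itself leaves its meaning open. The sketch also shows the concern raised in the question: two voice calls that map to the same (a,b,c,d) class share one counter, so the gateway cannot tell them apart.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class ClassKey:
    """Items a-d of the quoted list; together they name one aggregate class."""
    user_class: str                        # (a)
    service_class: str                     # (b)
    priority: int                          # (c)
    reservation_id: Optional[int] = None   # (d) may be absent


@dataclass
class ClassState:
    """Items e-g of the quoted list."""
    share_percent: float                   # (e) % allocation for the class
    packets_handled: int = 0               # (f) per-class packet counter
    abs_threshold: Optional[int] = None    # (g) stored only; meaning not given


class Gateway:
    """Hypothetical gateway holding state per class, not per connection."""

    def __init__(self) -> None:
        self.classes: Dict[ClassKey, ClassState] = {}

    def configure(self, key: ClassKey, share_percent: float,
                  abs_threshold: Optional[int] = None) -> None:
        self.classes[key] = ClassState(share_percent, 0, abs_threshold)

    def account(self, key: ClassKey) -> None:
        # The packet is counted against its class; which call it came
        # from is not recorded anywhere.
        self.classes[key].packets_handled += 1


gw = Gateway()
voice = ClassKey(user_class="research", service_class="voice", priority=1)
gw.configure(voice, share_percent=20.0)

# Two parallel voice calls in the same class: 7 packets from one call,
# 3 from the other. The gateway sees only the total.
for _ in range(7):
    gw.account(voice)
for _ in range(3):
    gw.account(voice)
```

After these ten packets the gateway holds a single entry whose counter reads 10; nothing in the state says that one call contributed 7 and the other 3. A distinct reservation_id per call would separate them, which is the possibility raised at the end of the message.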
Let us assume that there are enough voice packets arriving at a gateway so that they have to be queued for transmission. How would the gateway schedule the packets so that there are enough packets delivered in time for all transactions? It seems that without being able to distinguish them, there is the danger that the gateway might ship more packets for some transactions and not enough for others. Do I miss a point here? Maybe what he calls "resource-reservation ID" is an implicit identification for connections? Zaw-Sing From cheriton@Pescadero.Stanford.EDU Tue Oct 9 23:54:52 1990 Posted-Date: Tue, 9 Oct 90 23:43:58 PDT Received-Date: Tue, 9 Oct 90 23:54:52 -0700 Received: from Pescadero.Stanford.EDU by venera.isi.edu (5.61/5.61+local) id ; Tue, 9 Oct 90 23:54:52 -0700 Received: by Pescadero.Stanford.EDU (5.59/25-eef) id AA28501; Tue, 9 Oct 90 23:43:58 PDT Date: Tue, 9 Oct 90 23:43:58 PDT From: David Cheriton Message-Id: <9010100643.AA28501@Pescadero.Stanford.EDU> To: tds@honet9.att.com Subject: RE: Msg from Estrin on Van's connectionless ideas Cc: end2end-interest, estrin@usc.edu 1) Monopolistic/dictatorial systems have to expect that not everyone will kowtow to their official doctrine. The real threat to the phone companies is not malicious, annoyed users, but that someone else might do networking right, and big and medium users will abandon the telco's. 2) I think it is entirely reasonable to talk about the "real utilization" in terms of what the physical channels can carry in the context of a discussion about designing the next generation of networking. Certainly, I don't necessarily argue with the telco's call blocking approach for the given instance, but I do argue against building the mistakes into the next generation of networking. 3) I think the right model for managing under load is to control the resource allocation across users to provide fairness, modified by some notion of priority/quality of service, that is coupled into the charging - i.e.
you pay more for higher priority service. Thus, to first approximation, every user in a priority class sees the same degraded service under load. If there is any significant amount of time that this service is really unacceptable, fire the network planning manager. A real network has to have capacity to handle expected peak loads even after significant line failures, so real capacity limits should only be reached when there is beyond-peak traffic together with failures. Again, I claim the competition (the current circuit switched stuff) is running the lines in many cases at a "real utilization" that is comfortably low for datagrams and the type of mechanism I suggest. 4) NSF capacities are going up, e.g. DS-3 backbone, plus if the network charged per packet, I think the additional capacity would always be economic, else the charging would reduce/control the packet traffic levels. I don't see this question as simply the old datagrams vs virtual circuits, but rather the issue of which is better --- degraded service or denial of service. I have lots of ideas of how applications can adapt to degraded service, and how the network can inform them about changes. I have no idea how to deal usefully with denial of service, and I see limited hope in hiding these extra RTTs that reservations, circuit setup, etc. imply when data rates even more clearly expose prop. delay as the performance killer.
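One way to make point 3 above concrete is a weighted max-min ("water-filling") allocation, in which a user's weight stands in for the priority it pays for. This is a sketch under assumptions, not the mechanism proposed in the message: the function name weighted_fair_share, the choice of weighted max-min fairness, and all the numbers are illustrative only.

```python
def weighted_fair_share(capacity, demands, weights):
    """Split `capacity` among users by weighted max-min fairness.

    demands: user -> requested rate; weights: user -> priority weight.
    Nobody receives more than they asked for, and unused share is
    redistributed among the users that still want more.
    """
    alloc = {u: 0.0 for u in demands}
    active = {u for u in demands if demands[u] > 0}
    remaining = float(capacity)
    eps = 1e-12
    while active and remaining > eps:
        total_w = sum(weights[u] for u in active)
        # Users whose whole demand fits inside their current weighted share.
        satisfied = {u for u in active
                     if demands[u] <= remaining * weights[u] / total_w + eps}
        if not satisfied:
            # Overload: everyone left is degraded in proportion to weight.
            for u in active:
                alloc[u] = remaining * weights[u] / total_w
            break
        for u in satisfied:
            alloc[u] = float(demands[u])
            remaining -= demands[u]
        active -= satisfied
    return alloc


# Overload example: 10 units of capacity, 18 units requested.
# C pays for twice the priority of A and B.
demands = {"A": 2, "B": 8, "C": 8}
weights = {"A": 1, "B": 1, "C": 2}
shares = weighted_fair_share(10, demands, weights)
```

In this example A's small demand is met in full, and the 8 units left over are split 1:2 between B and C, so both are degraded but neither is denied. With enough capacity for all 18 units, every user simply receives its demand.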
David Cheriton From J.Crowcroft@cs.ucl.ac.uk Wed Oct 10 02:11:27 1990 Posted-Date: Wed, 10 Oct 90 10:00:34 +0100 Received-Date: Wed, 10 Oct 90 02:11:27 -0700 Message-Id: <9010100911.AA09541@venera.isi.edu> Received: from bells.cs.ucl.ac.uk by venera.isi.edu (5.61/5.61+local) id ; Wed, 10 Oct 90 02:11:27 -0700 Received: from sol.cs.ucl.ac.uk by bells.cs.ucl.ac.uk with SMTP inbound id <2923-0@bells.cs.ucl.ac.uk>; Wed, 10 Oct 1990 10:00:38 +0100 To: tds@honet9.att.com, end2end-interest, estrin@usc.edu Subject: Re: Msg from Estrin on Van's connectionless ideas In-Reply-To: Your message of Tue, 09 Oct 90 23:43:58 -0700. <9010100643.AA28501@Pescadero.Stanford.EDU> Date: Wed, 10 Oct 90 10:00:34 +0100 From: Jon Crowcroft differences between circuit switching and cell switching include visibility to the end user and abilities to aggregate traffic circuits are, by definition, switched slowly compared with each packet time - this assumes you can statistically aggregate traffic over end to end paths (not clear in the future) - you need to do autocorrelation stuff on all the NSFnet data - and they don't have video - when our video is more stable, i'm quite happy to gather stats and hand them to anyone with neat tools for such analysis (Bob has some i believe) the lack of visibility of the internal routing of circuits means the end user cannot influence things too much - I believe this mucks up time varying policies and makes it hard for decisions to be made at the right place and time... any experience based on the (even digital) phone system is totally irrelevant to a data net or dynamic compressed voice/video net for example, the CODEC we're using for video can vary from 0 - 2Mbps in one video frame time (in principle) - and we are connecting to east&west coast from here - i refuse to pay full time for the full 2Mbps to both sites from the UK, when my average is less than 128kbps, and variance could be as much as 1Mbps...
if you can switch in 30 64 kbps channels 50 times per second from each customer premises, then i might just consider a circuit based service From llp@cs.arizona.edu Wed Oct 10 09:33:07 1990 Posted-Date: Wed, 10 Oct 90 09:32:52 MST Received-Date: Wed, 10 Oct 90 09:33:07 -0700 Received: from megaron.cs.Arizona.EDU by venera.isi.edu (5.61/5.61+local) id ; Wed, 10 Oct 90 09:33:07 -0700 Received: from cheltenham.cs.arizona.edu by megaron.cs.arizona.edu (5.61/15) via SMTP id AA21785; Wed, 10 Oct 90 09:33:02 -0700 Date: Wed, 10 Oct 90 09:32:52 MST From: "Larry Peterson" Message-Id: <9010101632.AA12930@cheltenham.cs.arizona.edu> Received: by cheltenham.cs.arizona.edu; Wed, 10 Oct 90 09:32:52 MST To: braden Cc: llp@cs.arizona.edu, end2end-interest In-Reply-To: <9010091741.AA19749@braden.isi.edu> Subject: Re: Architecture Limits I think I would claim that the Internet architecture does support RPC. Don't you run RPC over the Internet? VMTP certainly does. Perhaps you are suggesting the lack of a widely-implemented transaction transport protocol, a theme with which I certainly agree. Similarly, in what sense does the Internet architecture as it is currently defined to include IP multicast, not support group communication? Is your concern at the transport level? My claim is that the architecture supports neither RPC nor group protocols, although I grant you that it does not prohibit you from building your own. VMTP is not officially part of the architecture, and that's one (the main?) reason why it is not widely implemented. (No, I'm not prepared to defend VMTP as the right transaction transport protocol, but I consider it a shame that an acceptable alternative hasn't been ironed out.) Likewise, IP multicast supports the delivery of packets to many hosts, but there is no transport protocol that can really take advantage of it. As pointed out by others, the list is longer than the two examples I raised.
\begin{flame} The issue is not what services are and are not supported, but rather how we can possibly define an architecture for the future given our fuzzy vision of what services will be required and what form the technology will take. The real limitation of the current architecture that I wanted to make with my original message is that it's locked into a "one size fits all" mentality. As I argued at the Pittsburgh e2e meeting, you can augment/permute/extend/enhance/tweak TCP only so long before you end up with... [substitute your favorite example; mine's TCP]. It's time to think seriously about a meta-architecture. \end{flame} Larry From Z.Wang@cs.ucl.ac.uk Wed Oct 10 10:35:04 1990 Posted-Date: Wed, 10 Oct 90 18:34:15 +0100 Received-Date: Wed, 10 Oct 90 10:35:04 -0700 Message-Id: <9010101735.AA21846@venera.isi.edu> Received: from bells.cs.ucl.ac.uk by venera.isi.edu (5.61/5.61+local) id ; Wed, 10 Oct 90 10:35:04 -0700 Received: from sol.cs.ucl.ac.uk by bells.cs.ucl.ac.uk with SMTP inbound id <15081-0@bells.cs.ucl.ac.uk>; Wed, 10 Oct 1990 18:34:18 +0100 To: David Cheriton Cc: tds@honet9.att.com, end2end-interest Subject: Re: Msg from Estrin on Van's connectionless ideas In-Reply-To: Your message of Tue, 09 Oct 90 23:43:58 -0700. <9010100643.AA28501@Pescadero.Stanford.EDU> Date: Wed, 10 Oct 90 18:34:15 +0100 From: Zheng Wang >I don't see this question as simply the old datagrams vs virtual circuits, >but rather the issue of which is better --- degraded service or denial >of service. ^^^^^^^^^^^^^^^^^^^^^^^ ^^^^^^^^^^ It is exactly the ISSUE. But it is an application issue. I am sure that some applications prefer the former while some prefer the latter. The "congram" (in Guru's term) has to do better than VC or datagram since it has double overhead: 1) state info on the switches --> memory and cpu overhead on switches 2) src/dest address & QoS info in packets --> packet header overhead In return, we can provide both VC and datagram services.
In fact, full ranges of services can be provided to the applications: 1) datagram: very cheap, use spare resources in a standby fashion. 2) congram: standard charge, no denial, guaranteed fair share, but QoS is demand-dependent. 3) VC: expensive, resources reserved and guaranteed, denial if requirements cannot be met. Zheng From Z.Wang@cs.ucl.ac.uk Wed Oct 10 11:08:41 1990 Posted-Date: Wed, 10 Oct 90 19:08:19 +0100 Received-Date: Wed, 10 Oct 90 11:08:41 -0700 Message-Id: <9010101808.AA23382@venera.isi.edu> Received: from bells.cs.ucl.ac.uk by venera.isi.edu (5.61/5.61+local) id ; Wed, 10 Oct 90 11:08:41 -0700 Received: from sol.cs.ucl.ac.uk by bells.cs.ucl.ac.uk with SMTP inbound id <15755-0@bells.cs.ucl.ac.uk>; Wed, 10 Oct 1990 19:08:21 +0100 To: tds@honet9.att.com Cc: end2end-interest Subject: Re: there are circuits and there are circuits... (was: RE: Msg from Estrin on Van's connectionless ideas) In-Reply-To: Your message of Tue, 09 Oct 90 22:35:00 -0400. <9010100531.AA04217@venera.isi.edu> Date: Wed, 10 Oct 90 19:08:19 +0100 From: Zheng Wang >I imagine that if gateways are >keeping track of point-to-point traffic flows, there should be a way >of feeding that information into a routing algorithm to determine the >optimal rearrangement of connections. It is a problem indeed - the routing is going to have terrible problems now since the switches are aware of the life time of end2end connections.
Zheng From craig@NNSC.NSF.NET Wed Oct 10 11:44:24 1990 Posted-Date: Wed, 10 Oct 90 14:40:30 -0400 Received-Date: Wed, 10 Oct 90 11:44:24 -0700 Message-Id: <9010101844.AA24897@venera.isi.edu> Received: from nnsc.nsf.net by venera.isi.edu (5.61/5.61+local) id ; Wed, 10 Oct 90 11:44:24 -0700 To: cheriton@pescadero.stanford.edu Cc: end2end-interest Subject: re: Architecture Limits From: Craig Partridge Date: Wed, 10 Oct 90 14:40:30 -0400 Sender: craig@NNSC.NSF.NET Dave: In general I agree with your note (in my tutorial yesterday I told people that one of the differences between telco-trained engineers and data communications types, is the telco guys want to know what you're planning to do, while the datacomm guys say "I want to do everything"). > I think it would be interesting to hear some compelling arguments that > identify what, if anything, we really do need to know about the future > to do the right thing now. One point I'd like to suggest is important, is that people keep in mind that networks aren't the only thing speeding up -- systems and disk capacity are keeping pace... (I've heard lots of dumb talks about how gigabit networks will be hard because hosts have only 10s of MIPS). Craig PS: I too liked DDC's talk. From tds@honet9.att.com Thu Oct 11 07:18:29 1990 Posted-Date: Thu, 11 Oct 90 09:51 EDT Received-Date: Thu, 11 Oct 90 07:18:29 -0700 Message-Id: <9010111418.AA01689@venera.isi.edu> Received: from att.att.com by venera.isi.edu (5.61/5.61+local) id ; Thu, 11 Oct 90 07:18:29 -0700 From: tds@honet9.att.com Date: Thu, 11 Oct 90 09:51 EDT To: Z.Wang@cs.ucl.ac.uk Cc: end2end-interest In-Reply-To: Subject: RE: there are circuits and there are circuits... >>>>> On Wed, 10 Oct 90 19:08:19 +0100, Zheng Wang said: >I imagine that if gateways are >keeping track of point-to-point traffic flows, there should be a way >of feeding that information into a routing algorithm to determine the >optimal rearrangement of connections.
Zheng> It is a problem indeed - the routing is going to have terrible Zheng> problems now since the switches are aware of the life time of Zheng> end2end connections. That's not what I intended at all. Even with no knowledge of end-to-end connections the routers can observe patterns in the traffic flows, e.g. how much traffic was sent from point A to point B in the last five minutes? Now can you accumulate that information and use it to rearrange circuits on a fairly slow (minutes to hours) timescale? Is there enough "coherence" in the network on those timescales? (This is a question of user behavior more than anything.) And is it worth the trouble for the potential performance gains from the reduction in use of multihop paths and cost savings from not keeping as many dedicated circuits up? Has anybody kept track of NSF traffic data over the course of a day to see what the traffic patterns look like? My seat-of-the-pants guess is that there's a lot of structure, but I've never been able to get at the data. If someone knows where that kind of data can be gotten I'm sure I could find someone around here who'd like to bash on it.
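The measurement described above (how much traffic went from point A to point B in the last five minutes, and which multihop pairs might justify a direct circuit) can be sketched as a sliding-window accumulator. This is a rough illustration under assumptions: the class name FlowWindow, the hop-count table, the threshold and every packet count below are hypothetical, and no real NSFnet data is used. Only the Princeton to Palo Alto three-hop figure echoes the earlier message in this thread.

```python
from collections import defaultdict, deque


class FlowWindow:
    """Per-(src, dst) packet totals over the most recent `window_s` seconds.

    Assumes record() is called with non-decreasing timestamps. No
    end-to-end connection state is kept, only point-to-point volumes.
    """

    def __init__(self, window_s=300.0):
        self.window_s = window_s
        self.samples = deque()            # (time, src, dst, packets)
        self.totals = defaultdict(int)    # (src, dst) -> packets in window

    def record(self, t, src, dst, packets):
        self.samples.append((t, src, dst, packets))
        self.totals[(src, dst)] += packets
        self._expire(t)

    def _expire(self, now):
        # Drop samples that have aged out of the window.
        while self.samples and self.samples[0][0] <= now - self.window_s:
            _, s, d, n = self.samples.popleft()
            self.totals[(s, d)] -= n
            if self.totals[(s, d)] == 0:
                del self.totals[(s, d)]

    def candidates(self, hops, min_packets):
        """Busy pairs currently routed over more than one hop, busiest first.

        hops: (src, dst) -> current hop count; unknown pairs count as 1 hop.
        """
        picks = [(n, pair) for pair, n in self.totals.items()
                 if n >= min_packets and hops.get(pair, 1) > 1]
        return [pair for n, pair in sorted(picks, reverse=True)]


hops = {("Princeton", "Palo Alto"): 3, ("Princeton", "X"): 1}
w = FlowWindow(window_s=300.0)
w.record(0.0, "Princeton", "Palo Alto", 100)
w.record(10.0, "Princeton", "X", 500)
w.record(400.0, "Princeton", "Palo Alto", 50)   # the first two samples expire
direct = w.candidates(hops, min_packets=40)
```

Whether traffic is coherent enough over minutes to hours for such a ranking to stay useful is exactly the open question in the message; the sketch only shows that the bookkeeping needs no per-connection state in the routers.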
Tony From lixia@redwing.parc.xerox.com Thu Oct 11 09:32:30 1990 Posted-Date: Thu, 11 Oct 1990 9:32:25 PDT Received-Date: Thu, 11 Oct 90 09:32:30 -0700 Received: from arisia.Xerox.COM by venera.isi.edu (5.61/5.61+local) id ; Thu, 11 Oct 90 09:32:30 -0700 Received: from redwing.parc.xerox.com by arisia.Xerox.COM with SMTP (5.61+/IDA-1.2.8/gandalf) id AA10254; Thu, 11 Oct 90 09:33:12 -0700 Received: by redwing.parc.xerox.com (5.61+/IDA-1.2.8/gandalf) id AA01357; Thu, 11 Oct 90 09:32:26 PDT Sender: Lixia Zhang Date: Thu, 11 Oct 1990 9:32:25 PDT From: lixia@parc.xerox.com Reply-To: lixia@parc.xerox.com To: braden Cc: end2end-interest Subject: Re: Architecture Limits In-Reply-To: Your message of Mon, 8 Oct 1990 10:04:45 PDT Message-Id: > On the other hand, rate-based flow control may be subject to > catastrophic phase-entrainment effects, and we don't currently > know how to avoid this. > > I don't understand the problem well. Could you explain a bit ? > > Lixia > > The phases of all the "independently" clocked transmitters may drift until > they are aligned, and then all transmitters will send packets simultaneously, > overflowing queues. Sorry for my slowness, but this is still not clear to me. I agree that clocks of rate-controlled flows may drift. But, without any external force from the network side to sync the clocks (e.g. nothing like acks to affect the clock or transmission), why SHOULD (or how can) all the clocks drift until aligned, rather than drift independently and randomly? Is there either a theory or an observed fact to prove the existence of this "catastrophic phase-entrainment" in rate-control?? ^^^^^ ^^^^^^^^^^^^ (We all know stories like DEC routing update sync or LAN broadcast storms, but that has no relation to rate control). In my flow net simulation (which uses rate control), I tried hard to look for such phase-entrainment behavior, but got no evidence that it exists (if interested, see my thesis).
There is evidence, however, that window flow control may cause some sort of phase-entrainment (we are still studying this. More later). Lixia From lixia@redwing.parc.xerox.com Thu Oct 11 10:12:58 1990 Posted-Date: Thu, 11 Oct 1990 10:12:57 PDT Received-Date: Thu, 11 Oct 90 10:12:58 -0700 Received: from arisia.Xerox.COM by venera.isi.edu (5.61/5.61+local) id ; Thu, 11 Oct 90 10:12:58 -0700 Received: from redwing.parc.xerox.com by arisia.Xerox.COM with SMTP (5.61+/IDA-1.2.8/gandalf) id AA10998; Thu, 11 Oct 90 10:13:38 -0700 Received: by redwing.parc.xerox.com (5.61+/IDA-1.2.8/gandalf) id AA01362; Thu, 11 Oct 90 10:12:58 PDT Sender: Lixia Zhang Date: Thu, 11 Oct 1990 10:12:57 PDT From: lixia@parc.xerox.com Reply-To: lixia@parc.xerox.com To: David Cheriton Subject: Re: Architecture Limits In-Reply-To: Your message of Tue, 9 Oct 1990 22:33:01 PDT Cc: end2end-interest Message-Id: > I think it would be interesting to hear some compelling arguments that > identify what, if anything, we really do need to know about the future > to do the right thing now. I think the key deficiency is lack of knowledge > about the mechanisms and approaches being proposed, not the futurenet > and application characteristics. We would probably be better off with the best possible knowledge about future technologies and applications. But about the key deficiency, I cannot agree more with Dave. One example is the window flow control mechanism. It has been in use for so many years that (I feel) its features ought to be well understood. Nonetheless only very recently it was observed that window control may cause bursty transmission behavior caused by a burst of ack returns, which happens when acks, on their way to the data sender, get queued behind data packets. This phenomenon has been seen not only in simulation, but also by DECbit people in measurement (I heard from KK). Lixia (oops, I've criticized window in two msgs in a row.
Please don't take it as me being a biased rate-advocate--I'm just trying to gain more knowledge about both mechanisms.) From kanakia@research.att.com Thu Oct 11 11:39:13 1990 Posted-Date: Thu, 11 Oct 90 14:38:59 EDT Received-Date: Thu, 11 Oct 90 11:39:13 -0700 Message-Id: <9010111839.AA13030@venera.isi.edu> Received: from research.att.com by venera.isi.edu (5.61/5.61+local) id ; Thu, 11 Oct 90 11:39:13 -0700 From: kanakia@research.att.com Date: Thu, 11 Oct 90 14:38:59 EDT To: end2end-interest Subject: Re: Msg from Estrin on Van's connectionless ideas It may seem foolish to jump into the discussion where participants include Dave Cheriton, Guru and Tony. Reminds me of a raging bull in a china shop scenario. But since I would like to focus the debate on a topic that is somewhat sidelined so far, here I go. A number of general principles should be outlined and specified to which most of us can agree. In my opinion these are: (in the form of a statement followed by an inset that expands it) 1. Future high-speed packet switching networks should be built to provide at least the same grade of service that the networks one intends to replace. Does this mean that we want to build a network that offers as ubiquitous service and at least as good quality real-time voice and video transport as existing Telephone and CATV networks provide? My personal answer is yes. If you don't agree with this statement, you should stop reading the msg. The rest of the msg is unlikely to be relevant to what you are doing/thinking. 2. This implies that the network should offer a guaranteed rate service to users with real-time voice and video based applications. It does not imply that we build a network that somehow mimics circuit-switched network. 
To elaborate, it would mean that a user should see the same low probability of blocking calls, if the network is blocking, and see the same clarity in signals reconstructed out of packets as a voice call or a TV transmission, whether the network is blocking or non-blocking. Otherwise, we would not succeed in building a replacement network. 3. The guaranteed rate service is incompatible with the pure datagram model. It is surprising that this point needs to be made at all. The emphasis is on the "pure" part. Any switching that uses information, carried in each packet or only at set-up time, that would be meaningful only for a sequence of packets between two entities is not a "pure" datagram model. Obviously, the Estrin model is not a pure datagram model. The Estrin model is a shorthand to say the model that allegedly Van is using and was reconstructed by Deborah Estrin. It seems unfair to praise or criticize a model as Van's model until he owns up the model. I prefer a different name. I have been calling the network I am building as using a "light-weight" Virtual Circuit model of switching. If you like the name, give the credit for it (the name and not the network) to Ion Leslie who coined it while we were discussing the network. The light-weight VC switching to me implies switching that is not pure datagram model. It also is not a virtual circuit switching in the sense that switches do not guarantee ordered or reliable delivery of packets on connection-basis. It will make resource (i.e., buffers and transmission capacity) allocation decisions based on information such as rate, jitter, absolute delay bounds required, carried either in each packet or a burst of packets or negotiated at the circuit set-up time. The model does not in my mind preclude aggregation of information from many different virtual connections passing through the switch. More on aggregation comes later. 4. Can't provide everything to everybody. 
Only people with deep pockets even think they can do so. And when it comes to building a nationwide high-speed network, forget it. Not even Uncle Sam has deep enough (make it infinitely deep) pocket necessary to build a network that has the capacity to provide everything to everybody. Many Computer architectures who thought along a similar line about address sizes, memory, CPU power etc have also bitten dust. One can provide degraded service to some types of applications, but to make that the only way of accommodating traffic overload is going to be painful to those users who are used to hearing clear conversations or flicker-free, and non-fading TV signals. (David may not like the fact that he was blocked from reaching his family but I bet he has not stopped using phones or fax machines and switched over permanently to using email.) 5. The network must drop packets in the absence of global congestion control. All of us in the Internet world of datagram switching accept this fact but this also needs to be told to Datakit type folks who have built virtual-circuit switching model. The only way of not dropping a packet is to not allow any type of statistical behavior in the network usage, and then design the network to handle the worst-case. Even that I bet would be a very costly network design. 6. Guaranteed rates and delays for some sessions force a network to monitor all rates (at least for sessions of the same or higher priority). This is a big jump and here are the smaller steps one can take to reach the above conclusion. Even with the fixed rate input source, the resulting data stream gets slightly bunched in passing through a packet-switching network. This is also true for an input source to which one has attached an admission control scheme to provide a fixed rate input. Increasing burstiness of one session at the network input causes all sessions to be more bursty at the output. 
If you don't believe these statements, you should consult your favorite statistician, or me to provide references to the work of my favorite statisticians. (Most likely, we will just end up arguing further on this.) 7. Nothing I said so far precludes a packet-switching model that uses aggregated state information. Aggregated state is a result of monitored or reserved traffic parameters offered by several sessions of similar types. I have purposely kept this definition of the aggregated state model vague. I consider that to be the interesting thing to debate and focus on in the discussion that has followed from the msg on the Estrin model. Now, here are the problems and opportunities as I see them in aggregation. The first two questions are obvious. (I got tired of typing so these also remain the last questions I ask in this msg.) Why aggregate the state-information? The answers are less obvious. Aggregation offers a reduction in the memory needed to keep the state. Is this reduction important? The second answer, provided by Van (by hearsay again), is that aggregation allows burstiness of traffic sources to smooth out when taken into account in resource allocation decisions. This is a reasonable answer but ..(see next paragraph). Another possible answer is that somehow aggregation would reduce the average computation load per packet switched. The latter is important in high-speed packet networks, and it alone could justify the aggregated model. Why not aggregate the state-information? People frequently forget to consider both sides of an issue. Not me. I see aggregated state as causing a number of subtle problems. Aggregation of state information retained/monitored at a gateway/switch may smooth out the effect of variations in traffic sources but would it remain fair in allocating resources?
Consider the question of congestion control, done either implicitly by dropping a packet (Van prefers this) or by explicit signaling (Raj Jain (DECbit alg) and I (selective and localized back-pressure of Ynet) prefer this). If you allow bursty behavior of the sources within a class (a group defined by virtue of having an aggregated state), then it would always mean that one member could end up hogging the resources at a switch allocated to the class. That would not be fair. If you don't allow the bursty sources to belong to the same class, you have destroyed one important reason to use the aggregation technique in resource allocation policies. Another way to look at the problem is to realize that the pure datagram model is but an example of a single-class society. And have we solved the problem of fairness in that model? All of this leads me to believe that it is unavoidable to keep per-channel information and to make some use of it in resource allocation policies. But of course, doing that destroys the other reasons (memory and computation load) mentioned above for doing aggregation. Another problem is that we will have to define and build the aggregated model, i.e. who gets aggregated with whom, without knowing the future applications that may arise. Would these applications be satisfied by the behavior that results from the rather ad-hoc choices we would make today? Any takers for continuing to focus on this aspect of the Estrin model? Disclaimers: 1. Most of the early part of the message is a result of the deep resonance I felt with the ideas Bob Gallager from MIT presented at the ITC seminar. If you find any originality and beauty of expression in them, these should be credited to him. I will take the flak for misrepresenting (misresonating with) what he said at the talk, and for whatever you don't like about these seven statements. 2. And of course count me in as a member of the admiration society for Dave Clark's talk at SIGCOMM90.
He indeed eloquently focused on the memory bandwidth problem, the solution of which has motivated much of my original work in networking so far. 3. Looking over the content of the msg it struck me that I am sending this to the E2E forum, but have discussed no E2E issue in it. Any complaints about this, send to Bob Braden who sent the first message. (I suspect he will pipe the complaints to /dev/null.) Hemant Kanakia kanakia@research.att.com AT&T Bell Labs, 600 Mountain Ave, Murray Hill, NJ 07974. 201-582-3090. From guru@flora.wustl.edu Thu Oct 11 12:11:17 1990 Posted-Date: Thu, 11 Oct 90 14:12:48 -0500 Received-Date: Thu, 11 Oct 90 12:11:17 -0700 Received: from wucs1.wustl.edu by venera.isi.edu (5.61/5.61+local) id ; Thu, 11 Oct 90 12:11:17 -0700 Return-Path: Received: from flora.wustl.edu by wucs1.wustl.edu (5.59/1.35); id AA17779; Thu, 11 Oct 90 14:09:59 CDT Received: from localhost by flora.wustl.edu (4.0/SMI-4.0) id AA07682; Thu, 11 Oct 90 14:12:50 CDT Message-Id: <9010111912.AA07682@flora.wustl.edu> To: Craig Partridge Cc: cheriton@pescadero.stanford.edu, end2end-interest Subject: Re: Architecture Limits In-Reply-To: Your message of Wed, 10 Oct 90 14:40:30 -0400. <9010101844.AA24897@venera.isi.edu> Date: Thu, 11 Oct 90 14:12:48 -0500 From: Gurudatta Parulkar In general I agree with your note (in my tutorial yesterday I told people that one of the differences between telco-trained engineers and data communications types, is the telco guys want to know what you're planning to do, while the datacomm guys say "I want to do everything"). And who does the given job right? I suppose telco guys!! (Of course not.) > I think it would be interesting to hear some compelling arguments that > identify what, if anything, we really do need to know about the future > to do the right thing now. One point I'd like to suggest is important, is that people keep in mind that networks aren't the only thing speeding up -- systems and disk capacity are keeping pace...
(I've heard lots of dumb talks about how gigabits networks will be hard because hosts have only 10s of MIPS). Agreed. However, people should also note that they are NOT going to have crays on their desk in the near future. Therefore, a statement like 2 crays talking to each other devoting 25% of their cpu power to tcp/ip and achieving 300 Mbps throughput should be interpreted carefully. -guru From tds@honet9.att.com Thu Oct 11 14:21:39 1990 Posted-Date: Thu, 11 Oct 90 16:56 EDT Received-Date: Thu, 11 Oct 90 14:21:39 -0700 Message-Id: <9010112121.AA19333@venera.isi.edu> Received: from att.att.com by venera.isi.edu (5.61/5.61+local) id ; Thu, 11 Oct 90 14:21:39 -0700 From: tds@honet9.att.com Date: Thu, 11 Oct 90 16:56 EDT To: guru@flora.wustl.edu Cc: craig@NNSC.NSF.NET, cheriton@pescadero.stanford.edu, end2end-interest In-Reply-To: Subject: telco bashing (was: Architecture Limits) >>>>> On Thu, 11 Oct 90 14:12:48 -0500, Gurudatta Parulkar said: Craig> In general I agree with your note (in my tutorial yesterday I told Craig> people that one of the differences between telco-trained engineers and Craig> data communications types, is the telco guys want to know what you're Craig> planning to do, while the datacomm guys say "I want to do everything"). Guru> And who does the given job right? I suppose telco guys!! (Of course Guru> not.) Sheesh. I'm not one to argue that the phone companies have done everything right, but for a network that was originally intended to connect an electret microphone to a speaker and now can carry an aggregate 50 Gb/s of switched traffic and give a user T1 rates in a few seconds, I don't think the telco guys have anything to be ashamed of! Good thing I was never "telco-trained" or I'd start taking this personally.
Tony From craig@NNSC.NSF.NET Thu Oct 11 16:17:32 1990 Posted-Date: Thu, 11 Oct 90 19:16:12 -0400 Received-Date: Thu, 11 Oct 90 16:17:32 -0700 Message-Id: <9010112317.AA23321@venera.isi.edu> Received: from nnsc.nsf.net by venera.isi.edu (5.61/5.61+local) id ; Thu, 11 Oct 90 16:17:32 -0700 To: guru@flora.wustl.edu Cc: end2end-interest Subject: re: telco bashing From: Craig Partridge Date: Thu, 11 Oct 90 19:16:12 -0400 Sender: craig@NNSC.NSF.NET Guru and Tony: My comment about data comm folks vs. telco folks was not intended as disparaging either party -- it is simply a matter of pointing out why cooperation between datacomm and telco folks on gigabit networking is often tricky -- they've got different mindsets about how to solve the problems. Craig From minshall@wc.novell.com Thu Oct 11 20:30:22 1990 Posted-Date: Thu, 11 Oct 90 20:47:46 -0700 Received-Date: Thu, 11 Oct 90 20:30:22 -0700 Received: from OPTICS.KINETICS.COM by venera.isi.edu (5.61/5.61+local) id ; Thu, 11 Oct 90 20:30:22 -0700 Received: from plasma.wc.novell.com by wc.novell.com (4.0/SMI-DDN) id AA21007; Thu, 11 Oct 90 20:29:34 PDT Received: from localhost by plasma.wc.novell.com (3.2/SMI-3.2) id AA23301; Thu, 11 Oct 90 20:47:47 PDT Message-Id: <9010120347.AA23301@plasma.wc.novell.com> To: tds@honet9.att.com Cc: Z.Wang@cs.ucl.ac.uk, end2end-interest Subject: Re: there are circuits and there are circuits... In-Reply-To: Your message of Thu, 11 Oct 90 09:51:00 -0400. <9010111418.AA01689@venera.isi.edu> Date: Thu, 11 Oct 90 20:47:46 -0700 From: minshall@wc.novell.com > Has anybody kept track of NSF traffic data over the course of a day to > see what the traffic patterns look like? My seat-of-the-pants guess > is that there's a lot of structure, but I've never been able to get at > the data. If someone knows where that kind of data can be gotten I'm > sure I could find someone around here who'd like to bash on it.
I think Jeff Mogul presented a paper at SIGCOMM 90 (I've seen it as a WRL research report) which analyzes some traffic data (from a LAN) and exhibits some of the structure. It might be interesting to get him some NSF data and see what he can see (and see what he can see). Greg Minshall Novell, Inc. minshall@wc.novell.com 1-415-975-4507 From J.Crowcroft@cs.ucl.ac.uk Fri Oct 12 02:05:49 1990 Posted-Date: Fri, 12 Oct 90 10:04:57 +0100 Received-Date: Fri, 12 Oct 90 02:05:49 -0700 Message-Id: <9010120905.AA07766@venera.isi.edu> Received: from bells.cs.ucl.ac.uk by venera.isi.edu (5.61/5.61+local) id ; Fri, 12 Oct 90 02:05:49 -0700 Received: from sol.cs.ucl.ac.uk by bells.cs.ucl.ac.uk with SMTP inbound id <3083-0@bells.cs.ucl.ac.uk>; Fri, 12 Oct 1990 10:04:59 +0100 To: minshall@wc.novell.com Cc: end2end-interest Subject: Re: there are circuits and there are circuits... In-Reply-To: Your message of Thu, 11 Oct 90 20:47:46 -0700. <9010120347.AA23301@plasma.wc.novell.com> Date: Fri, 12 Oct 90 10:04:57 +0100 From: Jon Crowcroft >I think Jeff Mogul presented a paper at SIGCOMM 90 (I've seen it as >a WRL research report) which analyzes some traffic data (from a LAN) >and exhibits some of the structure. It might be interesting to get Greg, but the best i've seen is our very leader's (bob braden of end2end) analysis of correlation of one packet's dst to the next and subsequent packets (and {sub}*sequent pkts) which was also done by jain and some people at mit and others...
however, LAN traffic includes very strong correlations due to 8 packet nfs interactions and the like - i've asked a couple of times for a CD-Rom full of NSFNET backbone data to analyse on a 480MIPS parallel processor we have here, but either it hasn't been saved, or it's export embargoed:-) (Actually, i suspect i asked the wrong people) but until someone does the analysis for backbone traffic, we won't really have a good handle on even meeting the base target for next generation network designs (that is, they cater at least as well as this generation nets do for this generation traffic) but, i agree really with the Clark dictum, this debate is probably beside the point - we don't know what traffic mix will be on the matrix tomorrow btw, if people are interested - the BBC (british tv folks) have developed a v. smart video CODEC that compresses down to 32kbps at best (but bursts at 6Mbps to achieve full PAL broadcast quality)... it goes as far, i believe, as recognizing facial features, and sending wire-frame animations once the end points have exchanged the surface info...plus, i think, they use fractal compression techniques for details... so, the packet of today (burst of 16k bits, say) is the "cut to camera two" full video frame change of tomorrow...:-) so long as everyone watches the same show, circuit switching may survive; otherwise...
jon From legato!Legato.COM!nowicki@Sun.COM Fri Oct 12 10:31:21 1990 Posted-Date: Fri, 12 Oct 90 10:03:17 PDT Received-Date: Fri, 12 Oct 90 10:31:21 -0700 Received: from Sun.COM by venera.isi.edu (5.61/5.61+local) id ; Fri, 12 Oct 90 10:31:21 -0700 Received: from sun.Eng.Sun.COM (sun-bb.Corp.Sun.COM) by Sun.COM (4.1/SMI-4.1) id AA27002; Fri, 12 Oct 90 10:31:18 PDT Received: from legato.UUCP by sun.Eng.Sun.COM (4.1/SMI-4.1) id AA06403; Fri, 12 Oct 90 10:31:15 PDT Received: from quattro.Legato.COM by Legato.COM (4.0/SMI-4.0) id AA05926 for end2end-interest@venera.isi.edu; Fri, 12 Oct 90 09:59:59 PDT Received: from rose.Legato.COM by quattro.Legato.COM (4.1/SMI-4.1) id AA09932; Fri, 12 Oct 90 10:03:17 PDT Date: Fri, 12 Oct 90 10:03:17 PDT From: nowicki@Legato.COM (Bill Nowicki) Message-Id: <9010121703.AA09932@quattro.Legato.COM> To: guru@flora.wustl.edu Subject: Re: Architecture Limits Cc: end2end-interest However, people should also note that they are NOT going to have crays on their desk in the near future. And, speaking of reality, they are not going to have gigabits on their desktop any time soon either. Just take a look at InterOp. Almost every booth had a machine faster than 10 MIPS. The network was still Ethernet, since the FDDI ring that was supposed to be the backbone did not really work. So by the time 100 Mbps networks are "real", it is likely that 100 MIPS machines will be real. Any farther out than that, and your crystal ball must be pretty good. For quite some time, gigabit networks will need to be shared by many simultaneous users because they are so expensive (or else paid for by the government!).
- WIN From J.Crowcroft@cs.ucl.ac.uk Fri Oct 12 11:15:06 1990 Posted-Date: Fri, 12 Oct 90 19:14:17 +0100 Received-Date: Fri, 12 Oct 90 11:15:06 -0700 Message-Id: <9010121815.AA24031@venera.isi.edu> Received: from bells.cs.ucl.ac.uk by venera.isi.edu (5.61/5.61+local) id ; Fri, 12 Oct 90 11:15:06 -0700 Received: from sol.cs.ucl.ac.uk by bells.cs.ucl.ac.uk with SMTP inbound id <17014-0@bells.cs.ucl.ac.uk>; Fri, 12 Oct 1990 19:14:21 +0100 To: minshall@wc.novell.com, end2end-interest, J.Crowcroft@cs.ucl.ac.uk Subject: Re: there are circuits and there are circuits... In-Reply-To: Your message of Fri, 12 Oct 90 10:04:57 +0100. <9010120905.AA07766@venera.isi.edu> Date: Fri, 12 Oct 90 19:14:17 +0100 From: Jon Crowcroft >btw, if people are interested - the BBC (british tv folks) have >developed a v. smart video CODEC that compresses down to 32kbps at best (but >bursts at 6Mbps to achieve full PAL broadcast quality)... I forgot to say: the reason they want this is so they can run over a packet switched net instead of paying for broadcast quality satellite channel circuits!! From estrin%jerico.usc.edu@usc.edu Fri Oct 12 15:02:06 1990 Posted-Date: Fri, 12 Oct 90 15:00:54 PDT Received-Date: Fri, 12 Oct 90 15:02:06 -0700 Received: from usc.edu by venera.isi.edu (5.61/5.61+local) id ; Fri, 12 Oct 90 15:02:06 -0700 Received: from jerico.usc.edu by usc.edu (5.59/SMI-3.0DEV3) id AA15267; Fri, 12 Oct 90 15:00:59 PDT Received: by jerico.usc.edu (4.1/SMI-3.0DEV3) id AA10684; Fri, 12 Oct 90 15:00:54 PDT Date: Fri, 12 Oct 90 15:00:54 PDT Message-Id: <9010122200.AA10684@jerico.usc.edu> From: estrin@usc.edu (Deborah Estrin) Sender: estrin%jerico.usc.edu@usc.edu To: end2end-interest Subject: now i know.... Reply-To: estrin@usc.edu That there is NOTHING that can get Van to respond to email in a deterministic fashion!!!!! :} First, I misrepresent him by my attempt to articulate his "model", then others rip it apart, or agree with it in my name and he is still silent!!!
:} Anyway, let me say again that I take sole credit only for any errors that i introduced into the articulation of my conversation with Van... I will ponder Hemant's message before responding in full but one quick comment is on the last section "Why not aggregate the state-inf?"...You say that we should not make ad hoc choices about who gets aggregated with whom. I don't understand this comment. The whole point of the van-according-to-estrin model is that you are not imposing (or restricting) particular aggregation on future applications. You are determining aggregation depending upon what an application needs. That is the flexibility of it. There will be different resource reservation approaches that sit on the spectrum between guaranteed service with risk of service denial, and degradable service with less risk of denial (i.e., cheriton's comment). If the GWs are built with the sort of structure in my earlier message, this is not precluded. At any particular time of course you can't do everything. You can't arbitrarily aggregate and make svc guarantees, of course. But otherwise, at this very hypothetical level, I don't see how this model binds us via "ad hoc choices"....can you elaborate? (3 pages was not enuf for me :}) You also say "...unavoidable to keep per channel information": what is a channel in this sense? As to this being appropriate fodder for E2E list....Don't you know that E2ETF is just short for "End to end and everything in between" Task Force (oops, i mean research group) (It just occurred to me that perhaps van did reply to all this but only to the e2e TF list...if so, will someone let me know...In the meantime, i appreciate the conversation continuing on the e2e-interest list). D.
From legato!Legato.COM!nowicki@Sun.COM Fri Oct 12 15:48:24 1990 Posted-Date: Fri, 12 Oct 90 14:49:16 PDT Received-Date: Fri, 12 Oct 90 15:48:24 -0700 Received: from Sun.COM by venera.isi.edu (5.61/5.61+local) id ; Fri, 12 Oct 90 15:48:24 -0700 Received: from sun.Eng.Sun.COM (sun-bb.Corp.Sun.COM) by Sun.COM (4.1/SMI-4.1) id AA05244; Fri, 12 Oct 90 15:48:21 PDT Received: from legato.UUCP by sun.Eng.Sun.COM (4.1/SMI-4.1) id AA01819; Fri, 12 Oct 90 15:48:18 PDT Received: from quattro.Legato.COM by Legato.COM (4.0/SMI-4.0) id AA08813 for end2end-interest@venera.isi.edu; Fri, 12 Oct 90 14:45:58 PDT Received: from rose.Legato.COM by quattro.Legato.COM (4.1/SMI-4.1) id AA10179; Fri, 12 Oct 90 14:49:16 PDT Date: Fri, 12 Oct 90 14:49:16 PDT From: nowicki@Legato.COM (Bill Nowicki) Message-Id: <9010122149.AA10179@quattro.Legato.COM> To: kanakia@research.att.com Subject: "reality" Cc: end2end-interest I suppose I am out of my league here talking about reality in a "research group", but let me at least state my data points. Ten years ago, I had a Sun-0 with 3Mbps Ethernet on my desk. I now have a SparcStation and 10Mbps Ethernet. My CPU has gotten about twenty times faster, and my network has gotten about three times faster. At this rate, by the turn of the century my CPU should be about 200 MIPS, and affordable FDDI will probably be "just around the corner". But I will let the hardware folks do the specific prognostication. -- WIN From craig@NNSC.NSF.NET Sat Oct 13 10:42:09 1990 Posted-Date: Sat, 13 Oct 90 13:39:01 -0400 Received-Date: Sat, 13 Oct 90 10:42:09 -0700 Message-Id: <9010131742.AA25032@venera.isi.edu> Received: from nnsc.nsf.net by venera.isi.edu (5.61/5.61+local) id ; Sat, 13 Oct 90 10:42:09 -0700 To: minshall@wc.novell.com Cc: end2end-interest Subject: mogul and NSF data From: Craig Partridge Date: Sat, 13 Oct 90 13:39:01 -0400 Sender: craig@NNSC.NSF.NET > It might be interesting to get > him some NSF data and see what he can see (and see what he can see). 
He has some -- he demoed the software for me at Interop using some NSFNET backbone traffic. Craig From craig@NNSC.NSF.NET Mon Oct 15 06:27:10 1990 Posted-Date: Mon, 15 Oct 90 09:26:51 -0400 Received-Date: Mon, 15 Oct 90 06:27:10 -0700 Message-Id: <9010151327.AA05198@venera.isi.edu> Received: from nnsc.nsf.net by venera.isi.edu (5.61/5.61+local) id ; Mon, 15 Oct 90 06:27:10 -0700 To: llp@cs.arizona.edu Cc: end2end-interest Subject: re: Architecture Limits From: Craig Partridge Date: Mon, 15 Oct 90 09:26:51 -0400 Sender: craig@NNSC.NSF.NET [I've been at Interop and only partly able to read my mail, sorry to jump in late] Larry: I agree that functionality is important, but I wouldn't hang my hat on RPC. I think RPC will be in retreat, or dramatically revamped within ten years. The reason is that on high-delay bandwidth product networks (i.e. gigabit networks) RPC gives terrible performance because it moves such tiny pieces of data with such constrained semantics -- doing rpc over such networks is akin to moving the contents of your house across country one by one in the backseat of your sports car. Better to get better semantics and better overall performance by using a moving van. :-) Craig From craig@NNSC.NSF.NET Mon Oct 15 06:48:59 1990 Posted-Date: Mon, 15 Oct 90 09:44:49 -0400 Received-Date: Mon, 15 Oct 90 06:48:59 -0700 Message-Id: <9010151348.AA05489@venera.isi.edu> Received: from nnsc.nsf.net by venera.isi.edu (5.61/5.61+local) id ; Mon, 15 Oct 90 06:48:59 -0700 To: tds@honet9.att.com Cc: end2end-interest Subject: re: RE: Msg from Estrin on Van's connectionless ideas From: Craig Partridge Date: Mon, 15 Oct 90 09:44:49 -0400 Sender: craig@NNSC.NSF.NET Tony: Nice rave -- mind if I mutter my beard in reply? A couple of points/incomplete research questions: > [Exercise for the reader: Prove that all applications whose b/w demand > should concern the network can stand to make reservations. 
Hint: consider > that we're ultimately serving a user that is pleased to get 250 ms echoplex > delays. :-] (1) I think I can disprove this theorem. Applications will care about the delay imposed by reservations, and their b/w demands will be high. Sketch of the proof -- on a fast CPU (say 1 BIP with 32-bit words), a system will be consuming data at gigabit rates. The data rate through the CPU for a RISC architecture will be in the 10s of gigabits, and while loops and caching reduce the amount of that data which must come from off system, some number of gigabits will have to come over the network -- implication is that even your workstation will, for bursts, want gigabits of data. Now posit an application which is touching data on many remote machines spread around the country (perhaps doing database consistency checking or library catalog lookups). A setup RTT is on the order of 100 ms (based on estimates I've gotten from Fraser and Sincoskie). Just ten of those RTTs is enough to make a clearly observable difference in execution time. More generally you can make the argument, that, for application performance, extra RTs are bad, and one wants to stomp them out where ever possible. (2) Re: stateless/stateful. Crazy proposition. It seems to me from some of what I've heard Van say, and some crazed thinking on my part about fair queueing, that it might be possible to setup a world in which by putting a small amount of state information in each datagram, routers could, based entirely on the traffic they currently have queued, plus a few permanent rules, make rational and reasonable decisions about resource allocations. I have emphatically not thought this through -- just want to see how big the ripples in the pond get as a result of casting this stone. 
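The setup-delay argument in point (1) comes down to simple arithmetic, sketched here. This block is not from the original message: the function name and the 1 Gb/s link rate are illustrative assumptions, while the 100 ms setup RTT and the count of ten setups are the figures quoted in the message.

```python
def serial_setup_cost(num_setups, setup_rtt_s=0.100, link_rate_bps=1e9):
    """Cost of performing `num_setups` call setups one after another.

    Returns (seconds spent waiting, bits the link could have carried meanwhile).
    The link rate is an assumed figure for a gigabit path.
    """
    if num_setups < 0:
        raise ValueError("num_setups must be non-negative")
    wait_s = num_setups * setup_rtt_s
    return wait_s, wait_s * link_rate_bps


if __name__ == "__main__":
    wait_s, idle_bits = serial_setup_cost(10)
    print(f"{wait_s:.1f} s of setup delay, {idle_bits / 1e9:.1f} Gbit of unused capacity")
```

Ten serial setups at 100 ms each add about one second before any data moves, which is the "clearly observable difference" the message refers to.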
Craig From craig@NNSC.NSF.NET Mon Oct 15 07:00:09 1990 Posted-Date: Mon, 15 Oct 90 09:56:30 -0400 Received-Date: Mon, 15 Oct 90 07:00:09 -0700 Message-Id: <9010151400.AA05789@venera.isi.edu> Received: from nnsc.nsf.net by venera.isi.edu (5.61/5.61+local) id ; Mon, 15 Oct 90 07:00:09 -0700 To: end2end-interest Subject: interim Cray results From: Craig Partridge Date: Mon, 15 Oct 90 09:56:30 -0400 Sender: craig@NNSC.NSF.NET Hi folks: I did chat with Dave Borman at Interop and he had some preliminary test results. He asked me not to give numbers since they're still preliminary and subject to tuning, but the gist is that the window-size option and timestamp appeared to work OK. Complete results not available because well before Dave managed to saturate the HSX, he tickled a bug in the Cray OS involving buffer management. Better results later this fall. Craig From craig@NNSC.NSF.NET Mon Oct 15 07:06:40 1990 Posted-Date: Mon, 15 Oct 90 10:06:03 -0400 Received-Date: Mon, 15 Oct 90 07:06:40 -0700 Message-Id: <9010151406.AA05929@venera.isi.edu> Received: from WS6.NNSC.NSF.NET by venera.isi.edu (5.61/5.61+local) id ; Mon, 15 Oct 90 07:06:40 -0700 To: tds@honet9.att.com Cc: end2end-interest Subject: NSFNET data From: Craig Partridge Date: Mon, 15 Oct 90 10:06:03 -0400 Sender: craig@NNSC.NSF.NET Tony: NSF backbone data can be gotten from Merit (at least before ANS showed up -- dunno the current map). Bilal Chinoy was the contact. (1-800-66MERIT). Also, Steve Heimlich of U.Md. did a nice study of traffic patterns (think it was in the Winter USENIX -- you can also FTP it from UMD). 
Craig From tds@honet9.att.com Mon Oct 15 08:49:19 1990 Posted-Date: Mon, 15 Oct 90 11:25 EDT Received-Date: Mon, 15 Oct 90 08:49:19 -0700 Message-Id: <9010151549.AA09227@venera.isi.edu> Received: from att.att.com by venera.isi.edu (5.61/5.61+local) id ; Mon, 15 Oct 90 08:49:19 -0700 From: tds@honet9.att.com Date: Mon, 15 Oct 90 11:25 EDT To: craig@NNSC.NSF.NET Cc: tds@honet9.att.com, end2end-interest In-Reply-To: Subject: RE: Msg from Estrin on Van's connectionless ideas >>>>> On Mon, 15 Oct 90 09:44:49 -0400, Craig Partridge said: Craig> Nice rave -- mind if I mutter my beard in reply? Craig> A couple of points/incomplete research questions: > [Exercise for the reader: Prove that all applications whose b/w demand > should concern the network can stand to make reservations. Hint: consider > that we're ultimately serving a user that is pleased to get 250 ms echoplex > delays. :-] Craig> bursts, want gigabits of data. Now posit an application which is Craig> touching data on many remote machines spread around the country Craig> (perhaps doing database consistency checking or library catalog Craig> lookups). A setup RTT is on the order of 100 ms (based on estimates Craig> I've gotten from Fraser and Sincoskie). Just ten of those RTTs is Craig> enough to make a clearly observable difference in execution time. I don't get it. I guess your point is that one RTT may not be so bad, but you may want to touch many locations which gives you a multiplier which pushes delays up too high. But if delay performance is a concern, why are all the setups done serially? Anyway, the library catalog example is a good illustration of the paradigm I hold. Imagine that you want to get your hands on a color postscript version of a Gutenberg bible. You want to send a query to lots of points. (I know this is not exactly original, since it sounds much like "knowbots.") Setting up connections is a pain so you just send to a few hundred electronic catalogs around the world.
The network doesn't much care, since you are not using a lot of bandwidth, but be warned, you're only going to get best effort delivery. Once you find what you're looking for you want to move gigabits so now the network wants a chance to select a route and allocate resources, since you want to move lots of stuff and the network wants to do that efficiently, and maybe send your traffic through Utah because the link through Seattle is clogged up now. For this you get some level of service guarantees that maybe lets you build better transport protocols. Here's a relevant problem I'm too lazy to solve: If I can get loss rates of 10e-6 for 1Kbyte packets in a network that requires me to amortize 100ms call setup over the life of 1250 Mbyte transfer at 1 Gb/s (nominally 10 seconds), what loss rate do I need in a network with no call setup to get the same response time if my transport protocol does slow-start? The answer gives some kind of target to think about. Fiddle with parameters as deemed appropriate. Craig> More generally you can make the argument, that, for application Craig> performance, extra RTs are bad, and one wants to stomp them out Craig> where ever possible. Certainly true, but as always there are tradeoffs. Hemant pointed out very nicely some of the considerations in providing service guarantees to applications that need them. For me the telling challenge is constructing a traffic-sensitive routing scheme to distribute traffic in the network while maintaining service guarantees. Craig> (2) Re: stateless/stateful. Crazy proposition.
It seems to me Craig> from some of what I've heard Van say, and some crazed thinking Craig> on my part about fair queueing, that it might be possible to setup Craig> a world in which by putting a small amount of state information Craig> in each datagram, routers could, based entirely on the traffic they Craig> currently have queued, plus a few permanent rules, make rational and Craig> reasonable decisions about resource allocations. I have emphatically Craig> not thought this through -- just want to see how big the ripples in Craig> the pond get as a result of casting this stone. I must not understand, since this sounds distressingly like call setup for every packet. Anyway, back to my day job... Tony From craig@NNSC.NSF.NET Mon Oct 15 09:29:47 1990 Posted-Date: Mon, 15 Oct 90 12:26:50 -0400 Received-Date: Mon, 15 Oct 90 09:29:47 -0700 Message-Id: <9010151629.AA10884@venera.isi.edu> Received: from nnsc.nsf.net by venera.isi.edu (5.61/5.61+local) id ; Mon, 15 Oct 90 09:29:47 -0700 To: tds@honet9.att.com Cc: end2end-interest Subject: RE: Msg from Estrin on Van's connectionless ideas From: Craig Partridge Date: Mon, 15 Oct 90 12:26:50 -0400 Sender: craig@NNSC.NSF.NET > I don't get it. I guess your point is that one RTT may not be so bad, > but you may want to touch many locations which gives you a multiplier > which pushes delays up too high. But if delay performance is a > concern, why are all the setups done serially? Because I don't know, apriori, which sites my application will talk to. I don't even know within the application until it is done. Parallelism only helps if there aren't dependencies. Assume there are; imagine that the application has to go traipsing through indexes (or name servers) to determine which 100 sites to talk to this time, and those aren't 100 independent threads, because they interact in some fashion. > I must not understand, since this sounds distressingly like call setup > for every packet. 
Well, I guess I view call setup as potentially cheap, if we do it right. Craig From braden Mon Oct 15 12:02:19 1990 Received-Date: Mon, 15 Oct 90 12:02:19 -0700 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Mon, 15 Oct 90 12:02:19 -0700 Date: Mon, 15 Oct 90 12:02:10 PDT From: braden Posted-Date: Mon, 15 Oct 90 12:02:10 PDT Message-Id: <9010151902.AA00634@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Mon, 15 Oct 90 12:02:10 PDT To: llp@cs.arizona.edu Subject: Re: Architecture Limits Cc: end2end-interest VMTP is not officially part of the architecture, and that's one (the main?) reason why it is not widely implemented. ... The real limitation of the current architecture that I wanted to make with my original message is that it's locked into a "one size fits all" mentality. As I argued at the Pittsburgh e2e meeting, you can augment/permute/extend/enhance/tweak TCP only so long before you end up with... [substitute your favorite example; mine's TCP]. It's time to think seriously about a meta-architecture. Larry, As Dave Clark argued forcefully in his '88 SIGCOMM paper, the IP datagram layer provides a least-common-denominator service on which any transport protocol can be built. This is already a "meta-architecture". I have trouble blaming the lack of diversity in transport protocols on the Internet architecture. The fault, dear Brutus, is not in our architecture...
Bob Braden From hwb@merit.edu Tue Oct 16 07:17:42 1990 Posted-Date: Tue, 16 Oct 90 10:16:57 EDT Received-Date: Tue, 16 Oct 90 07:17:42 -0700 Received: from mcr.umich.edu by venera.isi.edu (5.61/5.61+local) id ; Tue, 16 Oct 90 07:17:42 -0700 Received: Tue, 16 Oct 90 10:16:57 EDT by mcr.umich.edu (5.51/1.6) Date: Tue, 16 Oct 90 10:16:57 EDT From: Hans-Werner Braun Message-Id: <9010161416.AA16542@mcr.umich.edu> To: craig@nnsc.nsf.net Subject: Re: NSFNET data Cc: end2end-interest, swolff@note.nsf.gov >>NSF backbone data can be gotten from Merit (at least before ANS showed up -- dunno the current map). Bilal Chinoy was the contact. (1-800-66MERIT).<< People who want to have NSFNET traffic data should approach NSF, per NSF's request. A good start may be to contact Steve Wolff. -- Hans-Werner From postel Tue Oct 16 08:51:43 1990 Received-Date: Tue, 16 Oct 90 08:51:43 -0700 Received: from bel.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Tue, 16 Oct 90 08:51:43 -0700 Date: Tue, 16 Oct 90 08:49:55 PDT From: postel Posted-Date: Tue, 16 Oct 90 08:49:55 PDT Message-Id: <9010161549.AA05121@bel.isi.edu> Received: by bel.isi.edu (4.1/4.0.3-4) id ; Tue, 16 Oct 90 08:49:55 PDT To: end2end-interest Subject: RFC1185 on TCP over High-Speed Paths ----- Begin Included Message ----- From RFC-DIST-LIST-RELAY@NIC.DDN.MIL Mon Oct 15 19:43:57 1990 To: Request-for-Comments-List:; Subject: RFC1185 on TCP over High-Speed Paths Cc: jkrey@ISI.EDU Reply-To: jkrey@ISI.EDU Date: Mon, 15 Oct 90 17:31:38 PDT From: "Joyce K. Reynolds" A new Request for Comments is now available from the Network Information Center in the online library at NIC.DDN.MIL. RFC 1185: Title: TCP Extension for High-Speed Paths Author: V. Jacobson, R. Braden & L. 
Zhang Mailbox: van@CSAM.LBL.GOV, Braden@ISI.EDU, lixia@PARC.XEROX.COM Pages: 21 Characters: 49,508 pathname: RFC:RFC1185.TXT This memo describes an Experimental Protocol extension to TCP for the Internet community, and requests discussion and suggestions for improvements. Please refer to the current edition of the "IAB Official Protocol Standards" for the standardization state and status of this protocol. Distribution of this memo is unlimited. RFCs can be obtained via FTP from NIC.DDN.MIL, with the pathname RFC:RFCnnnn.TXT (where "nnnn" refers to the number of the RFC). Login with FTP, username "anonymous" and password "guest". The NIC also provides an automatic mail service for those sites which cannot use FTP. Address the request to SERVICE@NIC.DDN.MIL and in the subject field of the message indicate the RFC number, as in "Subject: RFC nnnn". RFCs can also be obtained via FTP from NIS.NSF.NET. Using FTP, login with username "anonymous" and password "guest"; then connect to the RFC directory ("cd RFC"). The file name is of the form RFCnnnn.TXT-1 (where "nnnn" refers to the number of the RFC). The NIS also provides an automatic mail service for those sites which cannot use FTP. Address the request to NIS-INFO@NIS.NSF.NET and leave the subject field of the message blank. The first line of the text of the message must be "SEND RFCnnnn.TXT-1", where nnnn is replaced by the RFC number. Requests for special distribution should be addressed to either the author of the RFC in question, or to NIC@NIC.DDN.MIL. Unless specifically noted otherwise on the RFC itself, all RFCs are for unlimited distribution. Submissions for Requests for Comments should be sent to POSTEL@ISI.EDU. Requests to be added to or deleted from this distribution list should be sent to RFC-REQUEST@NIC.DDN.MIL. Joyce K. 
Reynolds USC/Information Sciences Institute ----- End Included Message ----- From llp@cs.arizona.edu Tue Oct 16 09:44:11 1990 Posted-Date: Tue, 16 Oct 90 09:44:03 MST Received-Date: Tue, 16 Oct 90 09:44:11 -0700 Received: from megaron.cs.Arizona.EDU by venera.isi.edu (5.61/5.61+local) id ; Tue, 16 Oct 90 09:44:11 -0700 Received: from cheltenham.cs.arizona.edu by megaron.cs.arizona.edu (5.61/15) via SMTP id AA18924; Tue, 16 Oct 90 09:44:05 -0700 Date: Tue, 16 Oct 90 09:44:03 MST From: "Larry Peterson" Message-Id: <9010161644.AA02069@cheltenham.cs.arizona.edu> Received: by cheltenham.cs.arizona.edu; Tue, 16 Oct 90 09:44:03 MST To: Craig Partridge Cc: end2end-interest Subject: re: Architecture Limits Craig, I guess I don't equate RPC with "tiny pieces of data"; maybe I should have used the term "message transaction protocol". It may be that we will use it more for fetching remote data than asking for remote computation, but there will still be plenty of that to do in a gigabit world. Also, while I believe that we're going to have to think hard about pre-fetching and caching, I don't see the fundamental need for a request/reply protocol going away. Larry From llp@cs.arizona.edu Tue Oct 16 09:45:52 1990 Posted-Date: Tue, 16 Oct 90 09:45:46 MST Received-Date: Tue, 16 Oct 90 09:45:52 -0700 Received: from megaron.cs.Arizona.EDU by venera.isi.edu (5.61/5.61+local) id ; Tue, 16 Oct 90 09:45:52 -0700 Received: from cheltenham.cs.arizona.edu by megaron.cs.arizona.edu (5.61/15) via SMTP id AA19070; Tue, 16 Oct 90 09:45:48 -0700 Date: Tue, 16 Oct 90 09:45:46 MST From: "Larry Peterson" Message-Id: <9010161645.AA02161@cheltenham.cs.arizona.edu> Received: by cheltenham.cs.arizona.edu; Tue, 16 Oct 90 09:45:46 MST To: braden Cc: end2end-interest Subject: Re: Architecture Limits In-Reply-To: <9010151902.AA00634@braden.isi.edu> Bob, I apologize for the confrontational tone of my previous message. My only excuse is that my tone was only as harsh as what I'm suggesting is radical.
As Dave Clark argued forcefully in his '88 SIGCOMM paper, the IP datagram layer provides a least-common-denominator service on which any transport protocol can be built. This is already "meta-architecture". I have trouble blaming the lack of diversity in transport protocols on the Internet architecture. The fault, dear Brutus, is not in our architecture... Anyway, I couldn't agree more that IP's least-common-denominator design is a wonderful thing. I probably apply the LCD argument more than any other in the systems I design, and I certainly hope this feature of the Internet architecture doesn't get lost as we worry about gigabit nets. My observation is that there hasn't been a new transport protocol added to the architecture since its inception. I'm not sure if the architecture is at fault, or something much more nebulous (e.g., the standardization policy), but as Clark88 points out, it was recognized from the start that TCP was not the right protocol for everybody. One could then argue that UDP provides an escape hatch---you can build your own transport protocol outside the architecture by using UDP---but I think this is the mechanism that I'm arguing is broken. Here's why: (1) Protocols designed outside the architecture are generally ad hoc, and quite often don't perform very well. (2) The wheel is being reinvented every day; I wonder how many man years of programming effort could have been saved if the Internet architecture provided a request/reply protocol. (3) The coverage of these protocols is very limited; I can't get to your distributed applications because I didn't implement your version of some transport protocol or another. (4) People are using the wrong transport protocol because it's easier; for example, they are using TCP because it's reliable even though they don't want byte streams. Now, what is it that makes me believe that we can do better?
This is only an observation, but I see the OS community working very hard to standardize a few interfaces, and for better or worse, I think they're going to standardize the protocol interface. They may use the xkernel or they may use System V streams; my best guess is that a small handful of de facto standard interfaces are going to evolve. (I could argue why I think this is going to happen, but I'll leave that for some other time.) My basic argument is that the networks community should not passively let this happen, but should take a leading role. It is by defining an "abstract protocol" (around which the protocol interface is wrapped) that we would be able to define a meta-architecture that makes it possible to specify and widely disseminate new transport protocols. Larry From llp@cs.arizona.edu Tue Oct 16 09:47:32 1990 Posted-Date: Tue, 16 Oct 90 09:47:25 MST Received-Date: Tue, 16 Oct 90 09:47:32 -0700 Received: from megaron.cs.Arizona.EDU by venera.isi.edu (5.61/5.61+local) id ; Tue, 16 Oct 90 09:47:32 -0700 Received: from cheltenham.cs.arizona.edu by megaron.cs.arizona.edu (5.61/15) via SMTP id AA19186; Tue, 16 Oct 90 09:47:28 -0700 Date: Tue, 16 Oct 90 09:47:25 MST From: "Larry Peterson" Message-Id: <9010161647.AA02192@cheltenham.cs.arizona.edu> Received: by cheltenham.cs.arizona.edu; Tue, 16 Oct 90 09:47:25 MST To: Craig Partridge Cc: end2end-interest Subject: re: Architecture Limits [I've forwarded Craig's response to my last message because I forgot to send my previous message to the whole group.] I believe pre-fetching and caching is a dead-end -- consistency issues will kill you (you're caching to keep from using the network, but consistency checking forces you to use the network). And I agree we'll need request/reply. But I think the emphasis will be very different. Instead of sending small amounts of data (i.e.
a page), or calling a pre-defined procedure, we're gonna move entire procedures (and their subroutines) or retrieve an entire file in one swoop. I think the resulting protocol you need is very different from what we currently call transaction protocols/RPC. I agree that *what* we choose to move across the network is going to change, but not the need to move things with a transaction-like protocol. It may be the case that instead of asking you to run a remote procedure for me, you'll send me a copy of your procedure and I'll run it myself. It may be the case that instead of reading a page of a file, I'll read an entire partition of the file system. We're working on a project to build a nationwide collaboratory (that's the buzzword you get when you cross collaboration and laboratory) for a community of molecular biologists. They have lots of read-mostly data (i.e., consistency is not a big issue) that they need to ship around; high-throughput nets will make it possible to send page images in addition to the smaller text-oriented objects they can ship now. Regarding the death of pre-fetching and caching.... that's certainly going to put a lot of OS people out of work :) Larry From braden Tue Oct 16 09:58:23 1990 Received-Date: Tue, 16 Oct 90 09:58:23 -0700 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Tue, 16 Oct 90 09:58:23 -0700 Date: Tue, 16 Oct 90 09:58:19 PDT From: braden Posted-Date: Tue, 16 Oct 90 09:58:19 PDT Message-Id: <9010161658.AA01042@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Tue, 16 Oct 90 09:58:19 PDT To: llp@cs.arizona.edu Subject: Re: Architecture Limits Cc: end2end-interest My observation is that there hasn't been a new transport protocol added to the architecture since its inception. I'm not sure if the architecture is at fault, or something much more nebulous Yes, I think it is something more nebulous. Two things, actually.
(1) Lack of funding About the time we "finished" TCP and UDP and were ready to move on to transaction transport protocols (which was also about the time that the E2E TF was formed, 1985), the government (DARPA) stopped funding transport protocol research; networking was "done". Later, NSF started to fund it, but their style (competitive rather than collaborative) is not likely to lead to the kind of broad consensus necessary to make a new transport protocol fly. (2) Commercial Interests Since the Internet has become so commercially successful, it becomes hard to introduce totally new things unless there are overwhelming reasons to do so. Transaction transport protocols, like multicasting, are universally thought to be a "good thing", but not enough of a win over TCP/UDP to make a lot of pressure on vendors. (e.g., the standardization policy), but as Clark88 points out, it was recognized from the start that TCP was not the right protocol for everybody. One could then argue that UDP provides an escape hatch---you can build your own transport protocol outside the architecture by using UDP---but I think this is the mechanism that I'm arguing is broken. Here's why: (1) Protocols designed outside the architecture are generally ad hoc, and quite often don't perform very well. I would say: "designed outside the collaborative Internet technical community". Designing a protocol is hard, and requires some external review and broad bashing. (2) The wheel is being reinvented every day; I wonder how many man years of programming effort could have been saved if the Internet architecture provided a request/reply protocol. I have been harping on that theme since RFC-955.
Bob From llp@cs.arizona.edu Tue Oct 16 11:07:03 1990 Posted-Date: Tue, 16 Oct 90 11:06:58 MST Received-Date: Tue, 16 Oct 90 11:07:03 -0700 Received: from megaron.cs.Arizona.EDU by venera.isi.edu (5.61/5.61+local) id ; Tue, 16 Oct 90 11:07:03 -0700 Received: from cheltenham.cs.arizona.edu by megaron.cs.arizona.edu (5.61/15) via SMTP id AA23824; Tue, 16 Oct 90 11:07:00 -0700 Date: Tue, 16 Oct 90 11:06:58 MST From: "Larry Peterson" Message-Id: <9010161806.AA04870@cheltenham.cs.arizona.edu> Received: by cheltenham.cs.arizona.edu; Tue, 16 Oct 90 11:06:58 MST To: braden Cc: llp@cs.arizona.edu, end2end-interest In-Reply-To: <9010161658.AA01042@braden.isi.edu> Subject: Re: Architecture Limits Yes, I think it is something more nebulous. Two things, actually. (1) Lack of funding About the time we "finished" TCP and UDP and were ready to move on to transaction transport protocols (which was also about the time that the E2E TF was formed, 1985), the government (DARPA) stopped funding transport protocol research; networking was "done". Later, NSF started to fund it, but their style (competitive rather than collaborative) is not likely to lead to the kind of broad consensus necessary to make a new transport protocol fly. (2) Commercial Interests Since the Internet has become so commercially successful, it becomes hard to introduce totally new things unless there are overwhelming reasons to do so. Transaction transport protocols, like multicasting, are universally thought to be a "good thing", but not enough of a win over TCP/UDP to make a lot of pressure on vendors. Seems like we should do everything we can to make protocols simpler and cheaper (and probably shorter-lived too). I'll admit that I'm proceeding based on early results, but I think it can be done.
Larry From braden Tue Oct 16 11:46:33 1990 Received-Date: Tue, 16 Oct 90 11:46:33 -0700 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Tue, 16 Oct 90 11:46:33 -0700 Date: Tue, 16 Oct 90 11:46:29 PDT From: braden Posted-Date: Tue, 16 Oct 90 11:46:29 PDT Message-Id: <9010161846.AA01192@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Tue, 16 Oct 90 11:46:29 PDT To: llp@cs.arizona.edu Subject: Re: Architecture Limits Cc: end2end-interest From llp@cs.arizona.edu Tue Oct 16 11:07:28 1990 Date: Tue, 16 Oct 90 11:06:58 MST From: "Larry Peterson" To: braden@ISI.EDU Cc: llp@cs.arizona.edu, end2end-interest@ISI.EDU In-Reply-To: <9010161658.AA01042@braden.isi.edu> Subject: Re: Architecture Limits Yes, I think it is something more nebulous. Two things, actually. (1) Lack of funding About the time we "finished" TCP and UDP and were ready to move on to transaction transport protocols (which was also about the time that the E2E TF was formed, 1985), the government (DARPA) stopped funding transport protocol research; networking was "done". Later, NSF started to fund it, but their style (competitive rather than collaborative) is not likely to lead to the kind of broad consensus necessary to make a new transport protocol fly. (2) Commercial Interests Since the Internet has become so commercially successful, it becomes hard to introduce totally new things unless there are overwhelming reasons to do so. Transaction transport protocols, like multicasting, are universally thought to be a "good thing", but not enough of a win over TCP/UDP to make a lot of pressure on vendors. Seems like we should do everything we can to make protocols simpler and cheaper (and probably shorter-lived too). I'll admit that I'm proceeding based on early results, but I think it can be done. Larry Larry, Sorry, but I think that is the wrong idea. Well, the issue is the service interface...
that had better be stable for a decade, or applications programmers are not going to be interested! The protocol itself can change, of course. Bob From llp@cs.arizona.edu Tue Oct 16 12:01:12 1990 Posted-Date: Tue, 16 Oct 90 12:01:07 MST Received-Date: Tue, 16 Oct 90 12:01:12 -0700 Received: from megaron.cs.Arizona.EDU by venera.isi.edu (5.61/5.61+local) id ; Tue, 16 Oct 90 12:01:12 -0700 Received: from cheltenham.cs.arizona.edu by megaron.cs.arizona.edu (5.61/15) via SMTP id AA26343; Tue, 16 Oct 90 12:01:09 -0700 Date: Tue, 16 Oct 90 12:01:07 MST From: "Larry Peterson" Message-Id: <9010161901.AA06425@cheltenham.cs.arizona.edu> Received: by cheltenham.cs.arizona.edu; Tue, 16 Oct 90 12:01:07 MST To: braden Cc: llp@cs.arizona.edu, end2end-interest In-Reply-To: <9010161846.AA01192@braden.isi.edu> Subject: Re: Architecture Limits Seems like we should do everything we can to make protocols simpler and cheaper (and probably shorter-lived too). I'll admit that I'm proceeding based on early results, but I think it can be done. Sorry, but I think that is the wrong idea. Well, the issue is the service interface... that had better be stable for a decade, or applications programmers are not going to be interested! The protocol itself can change, of course. We must be mis-communicating somewhere. The service interface certainly needs to be stable for an extended amount of time. I'm arguing that the protocols themselves should be free to change far more frequently than they do now. One way to get this is to stabilize the inter-protocol interface.... I implement a new protocol that conforms to the interface, and the next thing you know the whole world is running it because they support that interface too. If stable interfaces are good for application programmers, then they're good for protocol programmers too.
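[Editorial sketch, not part of the original message: the idea of a stable inter-protocol interface can be illustrated roughly as below. Every name here (Protocol, push, pop, Loopback, SeqNum, App) is hypothetical and chosen only for illustration; none is taken from the x-kernel or from System V streams. The point is only that a new protocol written against the shared interface slots into the stack without changes to its neighbours.]

```python
from abc import ABC, abstractmethod


class Protocol(ABC):
    """Hypothetical uniform interface that every layer implements."""

    def __init__(self, below=None):
        self.below = below      # next protocol down the stack, if any
        self.above = None       # filled in when a layer is stacked on top
        if below is not None:
            below.above = self

    @abstractmethod
    def push(self, msg: bytes) -> None:
        """Send a message down the stack."""

    @abstractmethod
    def pop(self, msg: bytes) -> None:
        """Deliver a message up the stack."""


class Loopback(Protocol):
    """Stand-in for a network: returns each outgoing message upward."""

    def push(self, msg):
        self.above.pop(msg)

    def pop(self, msg):
        raise NotImplementedError("nothing sits below the loopback layer")


class SeqNum(Protocol):
    """Toy 'new protocol': prepends a one-byte sequence number."""

    def __init__(self, below):
        super().__init__(below)
        self.next_seq = 0

    def push(self, msg):
        header = bytes([self.next_seq % 256])
        self.next_seq += 1
        self.below.push(header + msg)

    def pop(self, msg):
        self.above.pop(msg[1:])   # strip the header before delivering


class App(Protocol):
    """Top of the stack: records whatever is delivered to it."""

    def __init__(self, below):
        super().__init__(below)
        self.received = []

    def push(self, msg):
        self.below.push(msg)

    def pop(self, msg):
        self.received.append(msg)
```

[Under these assumptions, App(Loopback()) and App(SeqNum(Loopback())) both work with the same application code; only the stack composition differs.]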
Larry From braden Tue Oct 16 12:33:14 1990 Received-Date: Tue, 16 Oct 90 12:33:14 -0700 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Tue, 16 Oct 90 12:33:14 -0700 Date: Tue, 16 Oct 90 12:33:11 PDT From: braden Posted-Date: Tue, 16 Oct 90 12:33:11 PDT Message-Id: <9010161933.AA01239@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Tue, 16 Oct 90 12:33:11 PDT To: end2end-interest Subject: Satellites are still interesting... ----- Begin Included Message ----- From alison@osc.edu Tue Oct 16 09:29:20 1990 Date: Tue, 16 Oct 90 12:29:50 -0400 From: alison@osc.edu To: braden@ISI.EDU Subject: TCP Extensions for Long-Delay Paths You probably know about this NASA ACTS (Advanced Comm Technology Satellite) project, which is due to launch a satellite in June 1992 which can be used for a testbed for network experiments. It has (among other things) a gigabit channel, which will form another gigabit testbed to complement those which Bob Kahn has organized. The Ohio Supercomputer Center will be getting one of the ground stations for the satellite, and we've already had one NRA accepted by NASA to look at file transfer using the gigabit channel of the satellite. We have another proposal in to DARPA in response to their BAA which included funding for some of the NASA experiments of interest to DARPA. That proposal involves using the broadcast ability of the gigabit channel of the satellite to support scientists at up to three remote (and separated) geographical locations viewing the same complex simulation running on a Cray and steering the computation (in turn, of course) from their local workstations. Don't know if this is funded, of course. In any case, we are now looking at what kinds of protocols we need to support these projects. The TCP extensions for Long-Delay Paths are certainly one part of what we need. Has anyone implemented the extensions, and is there any experience with them?
The RFC is two years old, so I was hoping maybe someone had tried out the ideas over the past two years. ----- End Included Message ----- From cheriton@Pescadero.Stanford.EDU Wed Oct 17 11:44:05 1990 Posted-Date: Wed, 17 Oct 90 11:43:54 PDT Received-Date: Wed, 17 Oct 90 11:44:05 -0700 Received: from Pescadero.Stanford.EDU by venera.isi.edu (5.61/5.61+local) id ; Wed, 17 Oct 90 11:44:05 -0700 Received: by Pescadero.Stanford.EDU (5.59/25-eef) id AA04831; Wed, 17 Oct 90 11:43:54 PDT Date: Wed, 17 Oct 90 11:43:54 PDT From: David Cheriton Message-Id: <9010171843.AA04831@Pescadero.Stanford.EDU> To: end2end Subject: Stanford Meeting Travle Info Hi Folks, I tried to edit/update travel info Scott used last year. I think most everyone knows the routine anyway. Here goes .... see you Nov. 8/9th. Here is some travel info. Both the San Francisco (SFO) and San Jose (SJC) Airports are convenient (1/2 hour drive to Palo Alto). Schematic local maps are available upon request (give me a FAX #). (1) Hotels: Hyatt Rickey's 4219 El Camino Real Palo Alto 415-493-8000 Palo Alto Holiday Inn 625 El Camino Real Palo Alto 415-328-2800 Dinah's Motor Hotel 4261 El Camino Real Palo Alto 415-493-2844 (2) Transportation from Airport to Hotels: SuperShuttle: 415-558-8500 (no service from San Jose Airport) from San Francisco Airport to Hyatt Rickey's: $13/person (other hotels, $22/person) No reservations are needed for arrival. Instead, upon arriving at airport, go outside (on the main terminal level, not the baggage claim level) and present yourself to one of SuperShuttle's "curb coordinators" (!) and they will find you a spot on a shuttle. Shuttles leave for Palo Alto roughly every 15 minutes. 
Airport Connection: 415-363-1500 Scheduled Shuttles: Reservations required from SFO: at 15 minutes after the hour, $13/person from SJC: at 50 minutes after the hour, $11/person Shared Rides from SFO: $23/person, max waiting time 30 minutes Driving to Palo Alto from Airports: From SFO, take 101 South. From SJC, take 101 North. Take 101 to the Oregon Expressway and Embarcadero Exit. Head West on Oregon Expressway until you hit El Camino Real. To Hyatt Rickey's and Dinah's: Take a left, and continue South on El Camino. After passing Charleston/Arastradero Rd., the hotels are on your left. To Palo Alto Holiday Inn: Take a right. The hotel will be on your right just before you pass under University Avenue. To Stanford: Take a right to University Avenue, turn left onto University Ave (Palm Drive) Turn off is on right side of El Camino, over the underpass! (You can also come up University Ave from 101 as well as Embarcadero). Follow Palm Drive to quad, fold up your car, put it in your pocket (or just abandon it). Meeting is in Margaret Jacks Hall Bldg 460, Rm 252 which is same building as Computer Science is in, on the way to Memorial Church (fitting!) Please let me know if you need campus directions/map or parking permit. It is walking distance from Palo Alto Holiday Inn, and Airport Connection offers a minibus service to/from airports from there. From braden Wed Oct 17 12:02:28 1990 Received-Date: Wed, 17 Oct 90 12:02:28 -0700 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Wed, 17 Oct 90 12:02:28 -0700 Date: Wed, 17 Oct 90 12:02:23 PDT From: braden Posted-Date: Wed, 17 Oct 90 12:02:23 PDT Message-Id: <9010171902.AA01595@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Wed, 17 Oct 90 12:02:23 PDT To: end2end Subject: Re: Stanford Meeting Travle Info Hi. Dave Cheriton accidentally sent out travel information to the entire end2end-interest mailing list, for a subset E2E meeting to be held next month at Stanford.
Sorry to clog your mailboxes. But this brings up something I have been pondering. Perhaps we should try to set up a broad community meeting on issues of interest to the E2E RG, perhaps in 1991. Comments and suggestions on this would be welcome. Bob Braden From cheriton@Pescadero.Stanford.EDU Wed Oct 17 12:24:17 1990 Posted-Date: Wed, 17 Oct 90 12:24:12 PDT Received-Date: Wed, 17 Oct 90 12:24:17 -0700 Received: from Pescadero.Stanford.EDU by venera.isi.edu (5.61/5.61+local) id ; Wed, 17 Oct 90 12:24:17 -0700 Received: by Pescadero.Stanford.EDU (5.59/25-eef) id AA05028; Wed, 17 Oct 90 12:24:12 PDT Date: Wed, 17 Oct 90 12:24:12 PDT From: David Cheriton Message-Id: <9010171924.AA05028@Pescadero.Stanford.EDU> To: braden, llp@cs.arizona.edu Subject: Re: Architecture Limits Cc: end2end-interest Larry, I think you are dramatically underestimating the problem of changing or even extending protocols. I just talked with a company yesterday that is a perfect candidate for using IP multicast, and they'd love to, but it is really impractical until it is part of the standard OS distribution from the major manufacturers, including Sun, DEC, IBM and HP. Considering IP multicast has been around for years now and does not involve changing interfaces, and is needed by many people, I would put it forward as an indication of the speed of evolution we can expect. I think TCP/UDP got in on a window of opportunity with Berkeley Unix which is now basically closed, or at best ajar. That is, these protocols rode BSD into hundreds of companies, sites, and established these services as a standard. Now we face the chicken and egg of wont use until standard service, and not everywhere (standard) until widely needed.
From llp@cs.arizona.edu Wed Oct 17 18:59:20 1990 Posted-Date: Wed, 17 Oct 90 13:40:30 MST Received-Date: Wed, 17 Oct 90 18:59:20 -0700 Received: from megaron.cs.Arizona.EDU by venera.isi.edu (5.61/5.61+local) id ; Wed, 17 Oct 90 18:59:20 -0700 Received: from cheltenham.cs.arizona.edu by megaron.cs.arizona.edu (5.61/15) via SMTP id AA29158; Wed, 17 Oct 90 13:40:33 -0700 Date: Wed, 17 Oct 90 13:40:30 MST From: "Larry Peterson" Message-Id: <9010172040.AA17302@cheltenham.cs.arizona.edu> Received: by cheltenham.cs.arizona.edu; Wed, 17 Oct 90 13:40:30 MST To: David Cheriton Cc: braden, llp@cs.arizona.edu, end2end-interest In-Reply-To: <9010171924.AA05028@Pescadero.Stanford.EDU> Subject: Re: Architecture Limits I think TCP/UDP got in on a window of opportunity with Berkeley Unix which is now basically closed, or at best ajar. That is, these protocols rode BSD into hundreds of companies, sites, and established these services as a standard. Now we face the chicken and egg of wont use until standard service, and not everywhere (standard) until widely needed. You're right. I just think I see a new window of opportunity based on reaction we're getting to the xkernel. Larry From braden Thu Oct 18 12:06:49 1990 Received-Date: Thu, 18 Oct 90 12:06:49 -0700 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Thu, 18 Oct 90 12:06:49 -0700 Date: Thu, 18 Oct 90 12:06:45 PDT From: braden Posted-Date: Thu, 18 Oct 90 12:06:45 PDT Message-Id: <9010181906.AA02398@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Thu, 18 Oct 90 12:06:45 PDT To: end2end-interest Subject: End-to-end Protocols Workshop... The point raised by Greg in the following has been a sticking point for awhile... how to avoid conflict/overlap with SIGCOMM. I think there IS a place for a meeting that is deep enough to interest the best people in the field, but not as formal (eg not refereed or published) as SIGCOMM. 
Another potential conflict is with the meetings that the gigabits researchers are starting to have. Bill Nowicki suggests scheduling it with IETF. There is a serious problem with this as a generic solution to anything... IETF meetings now stretch to 4.5 days because people are BUSY for that time! Not clear how to fit in a 1.5 day E2E workshop. Still, it is an interesting idea which I will think about. More comments would be welcome. Bob Braden ----- Begin Included Message ----- From minshall@wc.novell.com Wed Oct 17 19:16:38 1990 To: braden@ISI.EDU Subject: Re: Stanford Meeting Travle Info In-Reply-To: Your message of Wed, 17 Oct 90 12:02:23 -0700. <9010171902.AA01595@braden.isi.edu> Date: Wed, 17 Oct 90 16:27:09 -0700 From: minshall@wc.novell.com Bob, I think a broad meeting would be well attended, though I'm not sure that most of the casual attendees (like myself) wouldn't get just as much out of SIGCOMM, ie: I'm not sure about the value gained (by the attendees, over SIGCOMM, compared to going to SIGCOMM) relative to the "cost" you would incur. Anyway, count me as a possible attendee. (But, you might gain the value that is associated with hosting a successful meeting.) Greg Minshall Novell, Inc. minshall@wc.novell.com 1-415-975-4507 ----- End Included Message ----- ----- Begin Included Message ----- From legato!Legato.COM!nowicki@Sun.COM Wed Oct 17 21:00:16 1990 Date: Wed, 17 Oct 90 13:59:57 PDT From: nowicki@Legato.COM (Bill Nowicki) To: braden@ISI.EDU Subject: Re: Stanford Meeting Travle Info Date: Wed, 17 Oct 90 12:02:23 PDT From: braden@venera.isi.edu Posted-Date: Wed, 17 Oct 90 12:02:23 PDT Subject: Re: Stanford Meeting Travle Info But this brings up something I have been pondering. Perhaps we should try to set up a broad community meeting on issues of interest to the E2E RG, perhaps in 1991. How about overlapping an RG meeting with the IETF? The Security folks did this last time at Vancouver, for example.
There already is a plethora of meetings and shortage of travel funds. As you saw, Phil Gross is already trying to move in on transport protocols. Maybe you need to come to IETF and keep him honest? -- WIN ----- End Included Message ----- From craig@NNSC.NSF.NET Thu Oct 18 12:19:08 1990 Posted-Date: Thu, 18 Oct 90 15:19:15 -0400 Received-Date: Thu, 18 Oct 90 12:19:08 -0700 Message-Id: <9010181919.AA29735@venera.isi.edu> Received: from WS6.NNSC.NSF.NET by venera.isi.edu (5.61/5.61+local) id ; Thu, 18 Oct 90 12:19:08 -0700 To: end2end-interest Subject: re: End-to-End Protocols Workshop From: Craig Partridge Date: Thu, 18 Oct 90 15:19:15 -0400 Sender: craig@NNSC.NSF.NET Bob: I agree that focussed workshops are worthwhile -- certainly the IRSG workshop that Dave and I hosted last winter was well-attended and people felt it was productive. If you were to do such a workshop, I'd suggest trying to find some particular issues to focus on; and I firmly believe writing up a short summary of what happened there (i.e. a workshop report RFC) makes the workshop more valuable to the community. As for finding a good meeting time, I think doubling up with other meetings *does* seem to help with this horrible problem. From my point of view, the best meetings to double up with are INET '91, SIGCOMM '91 and Rare/EARN '91 (all in Europe). IETF is also do-able for me. Craig From braden Thu Oct 18 12:47:54 1990 Received-Date: Thu, 18 Oct 90 12:47:54 -0700 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Thu, 18 Oct 90 12:47:54 -0700 Date: Thu, 18 Oct 90 12:47:51 PDT From: braden Posted-Date: Thu, 18 Oct 90 12:47:51 PDT Message-Id: <9010181947.AA02495@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Thu, 18 Oct 90 12:47:51 PDT To: craig@nnsc.nsf.net, end2end-interest Subject: re: End-to-End Protocols Workshop As for finding a good meeting time, I think doubling up with other meetings *does* seem to help with this horrible problem.
From my point of view, the best meetings to double up with are INET '91, SIGCOMM '91 and Rare/EARN '91 (all in Europe). IETF is also do-able for me. Craig Craig, It is my impression that (reimbursed) European travel poses a problem for a lot of people, although I would be VERY happy to have evidence to the contrary! Bob From craig@NNSC.NSF.NET Thu Oct 18 14:25:27 1990 Posted-Date: Thu, 18 Oct 90 17:16:23 -0400 Received-Date: Thu, 18 Oct 90 14:25:27 -0700 Message-Id: <9010182125.AA05798@venera.isi.edu> Received: from nnsc.nsf.net by venera.isi.edu (5.61/5.61+local) id ; Thu, 18 Oct 90 14:25:27 -0700 To: braden Cc: end2end-interest Subject: re: End-to-End Protocols Workshop From: Craig Partridge Date: Thu, 18 Oct 90 17:16:23 -0400 Sender: craig@NNSC.NSF.NET Bob: I confess I wasn't worrying so much about reimbursed European travel budget (tho a two-conferences-for-one deal would seem more appealing) than I was listing meetings that I thought might help give us an interesting mix of people (i.e. might be complementary). Another thought is INFOCOM '91. Craig From braden Thu Oct 18 17:24:08 1990 Received-Date: Thu, 18 Oct 90 17:24:08 -0700 Received: from braden.isi.edu by venera.isi.edu (5.61/5.61+local) id ; Thu, 18 Oct 90 17:24:08 -0700 Date: Thu, 18 Oct 90 17:24:05 PDT From: braden Posted-Date: Thu, 18 Oct 90 17:24:05 PDT Message-Id: <9010190024.AA02667@braden.isi.edu> Received: by braden.isi.edu (4.1/4.0.3-4) id ; Thu, 18 Oct 90 17:24:05 PDT To: end2end-interest Subject: Mailing list change For a long time, the mailing list "end2end@isi.edu" has existed as an alias for "end2end-interest@isi.edu". This has sometimes caused confusion, not least for yours truly. I have therefore asked to have "end2end@isi.edu" removed; thus, "end2end-interest@isi.edu" will be the (only) list name for discussions of issues of interest to the E2E RG.
Thanks, Bob Braden From cheriton@Pescadero.Stanford.EDU Thu Oct 18 21:08:52 1990 Posted-Date: Thu, 18 Oct 90 21:08:46 PDT Received-Date: Thu, 18 Oct 90 21:08:52 -0700 Received: from Pescadero.Stanford.EDU by venera.isi.edu (5.61/5.61+local) id ; Thu, 18 Oct 90 21:08:52 -0700 Received: by Pescadero.Stanford.EDU (5.59/25-eef) id AA14360; Thu, 18 Oct 90 21:08:46 PDT Date: Thu, 18 Oct 90 21:08:46 PDT From: David Cheriton Message-Id: <9010190408.AA14360@Pescadero.Stanford.EDU> To: end2end-interest Subject: re: End-to-End Protocols Workshop 1) I agree with holding it in conjunction with something else, like SigComm'91 2) I agree there should be a theme, and a clear focus to the program that makes it different from a conference. 3) I dont have a problem with funds for European travel, just time, which is solved in part by combining. From guru@flora.wustl.edu Fri Oct 19 12:39:51 1990 Posted-Date: Fri, 19 Oct 90 14:41:24 -0500 Received-Date: Fri, 19 Oct 90 12:39:51 -0700 Received: from wucs1.wustl.edu by venera.isi.edu (5.61/5.61+local) id ; Fri, 19 Oct 90 12:39:51 -0700 Return-Path: Received: from flora.wustl.edu by wucs1.wustl.edu (5.59/1.35); id AA00527; Fri, 19 Oct 90 14:38:32 CDT Received: from localhost by flora.wustl.edu (4.0/SMI-4.0) id AA13697; Fri, 19 Oct 90 14:41:26 CDT Message-Id: <9010191941.AA13697@flora.wustl.edu> To: kanakia@research.att.com Cc: end2end-interest Subject: Re: Msg from Estrin on Van's connectionless ideas In-Reply-To: Your message of Thu, 11 Oct 90 14:38:59 -0400. <9010111839.AA13030@venera.isi.edu> Date: Fri, 19 Oct 90 14:41:24 -0500 From: Gurudatta Parulkar Just when you thought it is over, ... Sorry for a late response, but I had trouble keeping up with the mail. > A number of general principles should be outlined and specified to > which most of us can agree. In my opinion these are: (in the form of a > statement followed by an inset that expands it) Excellent message! Surprised that it didn't generate much response. > 1. 
Future high-speed packet switching networks should be built to provide
> at least the same grade of service as the networks one intends to replace.
> Does this mean that we want to build a network that offers as
> ubiquitous service and at least as good quality real-time voice and
> video transport as existing Telephone and CATV networks provide?
> My personal answer is yes.

I understand your point, but ... It is the goal of the BISDN providers to replace all the voice, data, and video networks with one single high-speed packet switched network. Therefore, the BISDN has to be at least as ubiquitous and as good as existing telephone, CATV, and data networks. However, I am not sure if federal agencies that support communication for the scientific world have as their highest priority to create one single BISDN which is better than all these networks together. I believe their priority is to provide high speed networks to facilitate innovations in science and engineering.

> If you don't agree with this statement, you should stop reading the msg.
> The rest of the msg is unlikely to be relevant to what you
> are doing/thinking.

Though we don't have complete agreement, your message was too good to quit. So here I go (most of what you said applies to networks of reasonable size wanting to support a variety of applications).

> 2. This implies that the network should offer a guaranteed rate
> service to users with real-time voice and video based applications.

Complete agreement.

>3. The guaranteed rate service is incompatible with the pure datagram model.
> It is surprising that this point needs to be made at all.
> The light-weight VC switching to me implies switching that is not pure
> datagram model. It also is not a virtual circuit switching in the sense
> that switches do not guarantee ordered or reliable delivery of packets
> on connection-basis.
> It will make resource (i.e., buffers and transmission capacity)
> allocation decisions based on information such as rate, jitter,
> absolute delay bounds required,

The light-weight VC so far is not different from connections in an ATM environment. Due to prejudice resulting from using old names, people have been inventing new names, such as congrams, flows, soft connections, etc. However, all these different names lead to the same conclusion: when you have a very demanding application (requires bandwidth close to peak rate, and requires guaranteed performance), it is best to set a path, allocate resources (statistical allocation, of course), use them, and when application is done, release the path and associated resources. I believe it is also best to do these steps in the most straightforward way.

> It will make resource (i.e., buffers and transmission capacity)
> allocation decisions based on information such as rate, jitter,
> absolute delay bounds required,
> carried either in each packet or a burst of packets or
> negotiated at the circuit set-up time.

Theoretically, you are right. These reservations can be adjusted for each packet, burst of packets, or a connection. However, the important questions are:

What is the cost of adjusting these reservations for each packet, burst of packets, or each connection?

What is the effect of these adjustments (if done for each packet) on stability and convergence of network control algorithms?

What happens if you don't get the reservation you need? If you were making reservation on a connection basis, you would block the connection, and the application would retry it later. But if you were making a reservation for each packet, and you don't get the reservation, what would you do to the packet? How would an application deal with this?

I believe it is better to make these adjustments as infrequently as possible.

>5. The network must drop packets in the absence of global congestion
> control.
> All of us in the Internet world of datagram switching accept this fact but
> this also needs to be told to Datakit type folks who have built
> virtual-circuit switching model. The only way of not dropping a packet
> is to not allow any type of statistical behavior in the network usage, and
> then design the network to handle the worst-case. Even that I bet would be
> a very costly network design.

Fortunately ATM folks that I have come across don't have this problem.

>6. Guaranteed rates and delays for some sessions force a network
> to monitor all rates (at least for sessions of the same or higher
> priority).

No problem with this.

> The model does not in my
> mind preclude aggregation of information from many different virtual
> connections passing through the switch.

More on aggregation comes later.

>7. Nothing I said so far precludes packet-switching model that uses aggregated
> state information. Aggregation state is a result of monitored or reserved
> traffic parameters offered by several sessions of similar types.
> I have purposely kept this definition of aggregated state model vague.
> I consider that to be the interesting thing to debate and focus on in
> the discussion that has followed from the msg on the Estrin model.

Let me try to summarize what you said in MY words: We can divide applications into two classes: M and NM (like P and NP). Applications in class M have good multiplexing characteristics (can be aggregated with others) and applications in class NM do not multiplex well with other applications and demand guarantees. So for a conversation originating from an application in class NM, we should have a dedicated light-weight VC (or a connection), and for a conversation from an application in class M, we should multiplex it with others (because this leads to smoother traffic, reduced memory, and reduced computation in packet switches). Is this a reasonable interpretation of what you said?
Assuming it is, the million dollar questions that we as researchers should try to answer are:

- how to decide if an application is in M or NM?
- can an application be only in M or in NM?
- for applications in NM, how much resources to allocate?
- how many and what type of applications in M can be multiplexed on a given channel?
- how do active conversations of applications in M share resources in a fair way?

I believe this message has become too long, and I should stop. However, I have to point out that in our congram model, we have two kinds of congrams: UCON and PICON. A UCON is set up for an application that needs performance guarantees and does not multiplex well with others. PICON is used to multiplex data from a variety of application conversations that do have good multiplexing characteristics. And we are trying to get answers to some of the questions mentioned above. (Now you know why I liked Hemant's message so much.)

-guru

From guru@flora.wustl.edu Fri Oct 19 13:02:27 1990
Posted-Date: Fri, 19 Oct 90 15:03:44 -0500
Received-Date: Fri, 19 Oct 90 13:02:27 -0700
Received: from wucs1.wustl.edu by venera.isi.edu (5.61/5.61+local) id ; Fri, 19 Oct 90 13:02:27 -0700
Return-Path:
Received: from flora.wustl.edu by wucs1.wustl.edu (5.59/1.35); id AA01348; Fri, 19 Oct 90 15:00:57 CDT
Received: from localhost by flora.wustl.edu (4.0/SMI-4.0) id AA13722; Fri, 19 Oct 90 15:03:48 CDT
Message-Id: <9010192003.AA13722@flora.wustl.edu>
To: tds@honet9.att.com
Cc: end2end-interest, estrin@usc.edu
Subject: Re: there are circuits and there are circuits... (was: RE: Msg from Estrin on Van's connectionless ideas)
In-Reply-To: Your message of Tue, 09 Oct 90 22:35:00 -0400. <9010100531.AA04217@venera.isi.edu>
Date: Fri, 19 Oct 90 15:03:44 -0500
From: Gurudatta Parulkar

Guru> In most of this discussion, we are forgetting the lowest level service
Guru> primitive, which is actually circuit switched today.
Guru> For example, it is true that the datagram networks, such as NSFNet
Guru> backbone, are built out of leased circuit switched lines that are part
Guru> of complex TDM hierarchy of the phone network.

There's a distinction that should be made between switched circuits and leased circuits (i.e. those used in the NSFnet). Both leased circuits and interswitch trunks to carry switched traffic are routed on the same facility (fiber, microwave, whatever) network using DCSs (Digital Cross-Connect Systems). This routing changes slowly. Then switches (4ESS, 5ESS, DMS, whatever) route calls on the trunks. So the telephone companies have a circuit-switched network built on top of a facility network. From my perspective, the NSFnet is built from the facility network and not the circuit-switched network.

You are right. However, I was told that the trend is to use the same switching systems (e.g., 5ESS) to do both. That is, the same switches can be used in the facility network and circuit switched network. The control is "slightly" different. Also, I believe there is no reason why this two-level hierarchy cannot be merged into one (you have also indicated that in your message).

Guru> Wouldn't it be more effective, if the phone company can provide a
Guru> leased packet switched channel (LPSC) on demand, and the
Guru> datagram network pays on the usage basis (goals of ATM). Thus, every
Guru> time there is a TCP connection opened, the gateway can ask the phone
Guru> network to set up a LPSC, and tear it down when TCP connection is
Guru> terminated. This is much more effective in terms of cost and usage of
Guru> resources.

This may be a longer-term architecture, but similar arguments might be made for datagrams on top of switched circuits, which might be the precursor to the ATM world.
Well, the point I wanted to make is that datagrams are not necessarily the LOWEST level abstraction using which we should build all other higher level networking services (the Estrin model seems to imply this). Datagrams themselves are built on top of switched connections. What I am suggesting is that you can make the underlying circuit switched model evolve towards supporting packet switched connections (or light weight VC) using which you can support both datagrams and connections.

-guru

From Mills@udel.edu Fri Oct 19 23:30:57 1990
Posted-Date: Sat, 20 Oct 90 6:26:17 GMT
Received-Date: Fri, 19 Oct 90 23:30:57 -0700
Received: from huey.udel.edu by venera.isi.edu (5.61/5.61+local) id ; Fri, 19 Oct 90 23:30:57 -0700
Date: Sat, 20 Oct 90 6:26:17 GMT
From: Mills@udel.edu
To: braden
Cc: end2end-interest
Subject: Re: Mailing list change
Message-Id: <9010200226.aa13698@huey.udel.edu>

Bob,

Your plan makes it impossible to direct a message to a private list known to be receptive to personal flames not expected to be resolved in the general community. My contributions are suitably modulated.

Dave

From tds@honet9.att.com Mon Oct 29 07:06:37 1990
Posted-Date: Mon, 29 Oct 90 09:41 EST
Received-Date: Mon, 29 Oct 90 07:06:37 -0800
Message-Id: <9010291506.AA06345@venera.isi.edu>
Received: from ATT-IN.ATT.COM by venera.isi.edu (5.61/5.61+local) id ; Mon, 29 Oct 90 07:06:37 -0800
From: tds@honet9.att.com
Date: Mon, 29 Oct 90 09:41 EST
To: guru@flora.wustl.edu
Cc: end2end-interest, estrin@usc.edu
In-Reply-To:
Subject: just when you thought it was safe...

I've had a couple of messages from Guru sitting in my mail box for some time, and I wanted to make a couple of comments.
>>>>> On Fri, 19 Oct 90 14:41:24 -0500, Gurudatta Parulkar said:

Guru> when you have a very demanding application (requires bandwidth close
Guru> to peak rate, and requires guaranteed performance), it is best to set
Guru> a path, allocate resources (statistical allocation, of course), use
Guru> them, and when application is done, release the path and associated
Guru> resources.

Should be obvious by now that I'm in phenomenally close agreement with these sentiments.

Guru> What happens if you don't get the
Guru> reservation you need? If you were making reservation on a connection
Guru> basis, you would block the connection, and the application would retry
Guru> it later.

Actually things are more complicated. Blocking is the mechanism that a large connection-oriented network uses to spread the "connection demand" across the network, through whatever routing algorithm is in use. Saying that a node blocks a connection does not say that the network blocks the connection (or equivalently, that the application is blocked). Ultimately that may happen, but the network has the opportunity to look elsewhere for resources. Elementary teletraffic theory gives the "node" blocking, but network blocking under an alternate-routing scheme is non-trivial. My view is that working on the level of connections gives an effective way of implementing load sharing by alternate routing.

>>>>> On Fri, 19 Oct 90 15:03:44 -0500, Gurudatta Parulkar said:

Tony> the telephone companies have a circuit-switched network built on top
Tony> of a facility network. From my perspective, the NSFnet is built from
Tony> the facility network and not the circuit-switched network.

Guru> You are right. However, I was told that the trend is to use the same
Guru> switching systems (e.g., 5ESS) to do both. That is, the same switches
Guru> can be used in the facility network and circuit switched network. The
Guru> control is "slightly" different.

I have to quibble with "slightly".
Circuit demand is spread across a network by routing algorithms. We know how to alternate route efficiently on a richly connected circuit-switched network using fast (and usually distributed) routing algorithms. I don't think we have good, fast distributed schemes for alternate routing on a sparse network. So the facility network might in fact have very different control, even if both use the same switching fabric.

Tony

From minshall@wc.novell.com Mon Nov 12 17:25:01 1990
Posted-Date: Mon, 12 Nov 90 17:25:04 -0800
Received-Date: Mon, 12 Nov 90 17:25:01 -0800
Received: from [130.57.64.11] by venera.isi.edu (5.61/5.61+local) id ; Mon, 12 Nov 90 17:25:01 -0800
Received: from plasma.wc.novell.com by wc.novell.com (4.0/SMI-DDN) id AA11398; Mon, 12 Nov 90 17:24:03 PST
Received: from localhost by plasma.wc.novell.com (3.2/SMI-3.2) id AA03767; Mon, 12 Nov 90 17:25:05 PST
Message-Id: <9011130125.AA03767@plasma.wc.novell.com>
To: David Cheriton
Cc: end2end-interest
Subject: Re: Architecture Limits
In-Reply-To: Your message of Wed, 17 Oct 90 12:24:12 -0700. <9010171924.AA05028@Pescadero.Stanford.EDU>
Date: Mon, 12 Nov 90 17:25:04 -0800
From: minshall@wc.novell.com

> I think you are dramatically underestimating the problem of changing
> or even extending protocols. I just talked with a company yesterday
> that is a perfect candidate for using IP multicast, and they'd love to,
> but it is really impractical until it is part of the standard OS distribution
> from the major manufacturers, including Sun, DEC, IBM and HP. Considering
> IP multicast has been around for years now and does not involve changing
> interfaces, and is needed by many people, I would put it forward as an
> indication of the speed of evolution we can expect.

I have the vague impression that over the last ten years research dollars for building network facilities (protocols, etc.) have dried up and that researchers are now being asked to look for these facilities from off-the-shelf software vendors.
The problem is that vendors are by no means in a position to supply the research community with anything "researchy". We, by our natures, have to see what is going to sell the most in the market, and develop that. So, we know people need telnet (say) and we sell telnet. There just aren't any applications with massive demand which need, for example, IP multi-casting. Until there are, researchers with a legitimate need for multi-casting are going to need to add it themselves. When there is a large community of people demanding these applications, vendors will start to supply them. Somewhat of a chicken and egg problem, to be sure.

(There is a separate, but highly interesting, study of how high-tech ideas permeate engineering organizations and ultimately find their ways into products. SNMP is the most recent example I can think of.)

I think the challenge is to simultaneously make use of off-the-shelf solutions for those applications which need it (which a) is cheaper for the tax-payer and b) creates a larger market for commercial networking products, thus helping the growth of the network computing industry) while at the same time being able to add new facilities which have not yet found their way into the vendor-supported domain. If you make the latter *too* easy to do, NIH will tend to shrink the former.

> I think TCP/UDP got in on a window of opportunity with Berkeley Unix
> which is now basically closed, or at best ajar. That is, these protocols
> rode BSD into hundreds of companies, sites, and established these services
> as a standard. Now we face the chicken and egg of won't use until standard
> service, and not everywhere (standard) until widely needed.

I think Berkeley Unix is responsible for a lot of good in the world. To my uncertain knowledge, 4.4 development is still at least somewhat open to new additions. I think people with value-added this, that, or the other thing might well work with Mike Karels to get it into 4.4.

Greg Minshall
Novell, Inc.
minshall@wc.novell.com
1-415-975-4507

ps - It's always hard (for me at least) to get the tone right in these messages. I'm not trying to bash the research community. I'm not even trying to salvage the honor of the vendor community. I'm just trying to point out that there exists a current philosophy which doesn't work, isn't going to work, and (probably) shouldn't work.

From christos@dworkin.wustl.edu Tue Nov 13 11:09:31 1990
Posted-Date: Tue, 13 Nov 90 13:09:37 CST
Received-Date: Tue, 13 Nov 90 11:09:31 -0800
Received: from wucs1.wustl.edu by venera.isi.edu (5.61/5.61+local) id ; Tue, 13 Nov 90 11:09:31 -0800
Return-Path:
Received: from dworkin.wustl.edu by wucs1.wustl.edu (5.59/1.35); id AA06245; Tue, 13 Nov 90 13:08:07 CST
Received: by dworkin.wustl.edu (4.0/yuck-4.0) id AA08301; Tue, 13 Nov 90 13:09:37 CST
Date: Tue, 13 Nov 90 13:09:37 CST
From: christos@dworkin.wustl.edu (Chris Papadopoulos)
Message-Id: <9011131909.AA08301@dworkin.wustl.edu>
To: end2end-interest
Subject: Looking for probes for TCP/IP

Let me first introduce myself. I am Christos Papadopoulos, a graduate student at Washington Univ working with Guru Parulkar. I am interested in remote visualization. I plan to implement an application which we think has characteristics typical of visualization applications and distribute it across our campus network using TCP/IP. I then plan to tap into the protocols at different levels and take various measurements to identify where the bottlenecks are (if any).

The application we picked is "Display of cell trajectories in 3D using Optical Sectioning Microscopy". This uses an optical microscope equipped with a CCD camera to collect images of planes in an organism. The big advantage of this method is that there is no physical slicing involved, thus the organism does not have to be killed. The CCD camera has variable pixel resolution of up to 1000 by 1000, at 4096 gray levels. The volumetric data will be in the range of 10-20 MBytes per image.
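As a rough check on the figures above, the following sketch works out how many bytes one 1000 by 1000 plane at 4096 gray levels occupies, and how many such planes a 10-20 MByte volume would correspond to. It is only an illustration: the function names are made up here, 1 MByte is taken as 10**6 bytes, pixels are assumed to be stored either bit-packed or padded to whole bytes, and the 10 Mbit/s Ethernet rate (with no protocol overhead) is an assumption that does not come from the message.

```python
def bits_per_pixel(gray_levels: int) -> int:
    """Smallest number of bits that can encode `gray_levels` distinct values."""
    return (gray_levels - 1).bit_length()


def plane_bytes(width: int, height: int, gray_levels: int, padded: bool = False) -> int:
    """Bytes for one image plane; `padded` rounds each pixel up to whole bytes."""
    bits = bits_per_pixel(gray_levels)
    if padded:
        bits = 8 * ((bits + 7) // 8)
    return width * height * bits // 8


def transfer_seconds(nbytes: int, link_bits_per_second: float) -> float:
    """Lower bound on transfer time: payload bits divided by the raw link rate."""
    return nbytes * 8 / link_bits_per_second


if __name__ == "__main__":
    MB = 1_000_000        # assumption: 1 MByte = 10**6 bytes
    ETHERNET_BPS = 10e6   # assumption: 10 Mbit/s Ethernet, overhead ignored

    packed = plane_bytes(1000, 1000, 4096)               # 12 bits per pixel, bit-packed
    padded = plane_bytes(1000, 1000, 4096, padded=True)  # 12 bits stored in 16
    print("bytes per plane (packed, padded):", packed, padded)

    for volume in (10 * MB, 20 * MB):
        print(volume // MB, "MBytes:",
              volume // padded, "padded planes,",
              transfer_seconds(volume, ETHERNET_BPS), "s at raw link rate")
```

Under these assumptions a single plane is 1.5 to 2 MBytes, so the quoted volume would be a stack of roughly five to thirteen planes, and moving one volume takes at least several seconds on such a link before any protocol overhead is counted.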
To trace the trajectories of cells, several images need to be collected to construct a short animation. There are a number of steps that need to be performed before the data can be rendered. These include elimination of noise due to the point spread function of the lens, edge enhancement, image segmentation and various geometric transformations.

After distributing the application we would like to monitor the communication and take measurements on protocol overhead. The communication model we think may be appropriate is as follows:

             TCPq    Networkq       Networkq    TCPq
--------     ----|     ----|          ----|     ----|      ----------
|Sender|-----> |||-----> |||----------> |||-----> |||----->|Receiver|
--------     ----|     ----| Ethernet ----|     ----|      ----------

The measurements we would like to take include:

(i) Dynamic queue length at the various queues
(ii) Various delays a packet experiences

The question I want to ask is the following: Are there probe and measurement utility programs which will allow us to take these measurements without any kernel modifications?

Thanks in advance,

Christos.