use TCP_NODELAY on TCP sockets?

Job Snijders job at fastly.com
Wed May 15 18:37:18 CEST 2024


Dear BIRD people,

On most systems the RFC 896 TCP congestion control mechanism, also
known as "Nagle's algorithm", is used by default. This algorithm is
intended to help coalesce consecutive small packets from userland
applications (like BIRD) into a single larger TCP packet. The idea is
that this reduces bandwidth usage, because there is less TCP overhead
when data is bundled into fewer packets.

This happens at the cost of an increase in end-to-end latency: the
sender will locally queue up data (in the operating system's TCP
buffers) until it either receives a TCP ACK from the remote side for
all previously sent data, or sufficient additional data has piled up
to send out a full-sized TCP segment.
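
For illustration, the gist of that send decision can be sketched as
follows (a heavily simplified sketch; the function and variable names
are made up, this is not actual kernel code):

  #include <stddef.h>

  /* Rough sketch of the sender-side decision made by Nagle's
   * algorithm (RFC 896). Identifiers are illustrative only. */
  int
  nagle_may_send_now(size_t queued_bytes, size_t mss, int unacked_in_flight)
  {
    if (queued_bytes >= mss)
      return 1;  /* a full-sized segment is ready: send it */
    if (!unacked_in_flight)
      return 1;  /* nothing awaiting an ACK: a small segment may go out */
    return 0;    /* otherwise keep buffering until an ACK arrives or a
                    full segment's worth of data has accumulated */
  }

Setting TCP_NODELAY effectively removes the unacked_in_flight check,
so small segments are transmitted immediately.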

In my simple testing setup (announcing 255 routes to a peer), with
this patch it takes ~0.007478 seconds after the SYN to send the UPDATE
message out. Without this patch, it takes ~0.206955 seconds to send
the UPDATE out (perhaps because of a negative interaction between
Nagle's algorithm and Delayed ACKs?).

I think using TCP_NODELAY is interesting to consider, because it seems
sensible to try to deliver BGP messages as fast as possible. OpenBGPD
and FRR set the TCP_NODELAY socket option.

But please be very careful in considering this patch, because it does
introduce some subtle changes in the on-the-wire behaviour of BIRD. For
example, without this patch an UPDATE for a handful of routes and the
End-of-RIB marker might end up in the same TCP packet (if this fits);
but with this patch, the End-of-RIB marker ends up in its own TCP
packet. As things are today, setting TCP_NODELAY will increase the
average number of packets sent to peers. Packing multiple BGP messages
into fewer TCP segments could perhaps be reintroduced by using writev()
instead of write(), when it is known that a bunch of messages have been
queued up for a peer.
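
As a rough illustration of that idea (a hypothetical sketch;
flush_queued_messages() and the MAX_BATCH bound are made up, this is
not BIRD code):

  #include <sys/types.h>
  #include <sys/uio.h>

  #define MAX_BATCH 16  /* illustrative upper bound on messages per call */

  /* Hypothetical sketch: hand several queued BGP messages to the
   * kernel in one writev() call, so they can share TCP segments even
   * with TCP_NODELAY set. */
  ssize_t
  flush_queued_messages(int fd, void *const msgs[], const size_t lens[], int n)
  {
    struct iovec iov[MAX_BATCH];

    if (n > MAX_BATCH)
      n = MAX_BATCH;

    for (int i = 0; i < n; i++)
      {
        iov[i].iov_base = msgs[i];
        iov[i].iov_len = lens[i];
      }

    /* One syscall; the kernel packs the buffers into as few TCP
     * segments as it can, restoring the coalescence lost to
     * TCP_NODELAY. */
    return writev(fd, iov, n);
  }

A caller would of course still need to handle partial writes and any
remaining messages beyond the batch bound.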

Another complex behaviour is that the effect of Nagle's algorithm
depends on the latency between the local and remote systems. Nagle's
algorithm makes it so that data is locally buffered as long as the
remote side has not yet ACKed all previously sent data; this means
that the chances of message coalescence increase proportionally with
distance (latency). Without this patch BIRD will send more individual
TCP packets to peers that are close by compared to peers that are far
away. But with TCP_NODELAY applied, all peers will receive the same
number of TCP packets, regardless of latency.

I look forward to hearing your thoughts!

Kind regards,

Job

 sysdep/unix/io.c | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/sysdep/unix/io.c b/sysdep/unix/io.c
index 9b499020..f9746cb7 100644
--- a/sysdep/unix/io.c
+++ b/sysdep/unix/io.c
@@ -946,7 +946,7 @@ sock_new(pool *p)
 static int
 sk_setup(sock *s)
 {
-  int y = 1;
+  int y = 1, nodelay = 1;
   int fd = s->fd;
 
   if (s->type == SK_SSH_ACTIVE)
@@ -1048,6 +1048,10 @@ sk_setup(sock *s)
 	return -1;
   }
 
+  if ((s->type == SK_TCP_PASSIVE) || (s->type == SK_TCP_ACTIVE))
+    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &nodelay, sizeof(nodelay)) < 0)
+      ERR("TCP_NODELAY");
+
   /* Must be after sk_set_tos4() as setting ToS on Linux also mangles priority */
   if (s->priority >= 0)
     if (sk_set_priority(s, s->priority) < 0)

