In this situation we were discarding the HTTP response without reading it before closing, which kept Go from reusing the connection. I didn't dig quite as deep as this post's author, but I imagine the same RST behavior was happening under the hood.
/* As outlined in RFC 2525, section 2.17, we send a RST here because
* data was lost. To witness the awful effects of the old behavior of
* always doing a FIN, run an older 2.1.x kernel or 2.0.x, start a bulk
* GET in an FTP client, suspend the process, wait for the client to
* advertise a zero window, then kill -9 the FTP client, wheee...
* Note: timeout is always zero in such a case.
*/
Ok, so the RST is explained and well justified by the literature. But what are the “awful effects” of sending FIN instead? Can someone explain?Um, yes? That's how TCP has been universally implemented for more than 30 years. See [0], 2.17 for discussion.
In some protocols, end-of-file has semantic meaning that all data has been transferred, and TCP is set up such that you should be able to rely on that - if you can't rely on that difference, it is a bug in a TCP stack along the way.
FIN also has a sequence number, so you can wait to ACK it until you get the corresponding data if it is dropped or out of order.
TCP RST says the other side won't be resending if not ACKed, it is reset. Further, the downloading client usually cannot even read any packets in the receive window either once an RST has been received - that might be hundreds of KB of missed data.
RST and FIN are very semantically meaningfully different.
Reading the post, if gunicorn is e.g. sending a 404 after seeing a POST to a path it doesn't know about before reading the body, the client will never get the 404 because gunicorn hasn't read the message body.
This case is partly why "Expect: 100-continue" exists, so it will be properly handled, even if it does introduce an extra round-trip lag in the POST.
It might be dangerous to have your protocol rely on a piece of TCP that is often incorrectly implemented.
Also, I think state of the art hasn't really changed? If you don't want a reset, you need to read everything from the socket before you close. If you don't really care about a reset as long as it doesn't interrupt the reader, you can shutdown in your direction, and drop the socket off to something that will wait "long enough" before it closes. In an eventloop architecture, you can just put in as a deferred task; in process per connection, you should probably send the socket to a dedicated lingering closer process that doesn't interrupt your flow.
That's not what's happening here. The server is closing the socket when there's data from the client that it hasn't read.
If the client does a read() after the RST gets there, any data in the receive window is gone, any packets that might have been dropped and need to be resent are gone, etc.
FTP has different data and command connections so the server may not have an outstanding read to detect the data connection break.
But.. it should still clean up both when the command connection dies
For the case here the server should call shutdown with SHUT_WR after sending the data and then drain the incoming data before closing the socket.
Here's a little reproducer: https://gist.github.com/jcalvinowens/da57edda9a01ca9f4c4088a...
$ gcc -O2 test.c -o test
$ strace -e socket,connect,write,accept,read,close ./test --rx
<...>
socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 3
accept(3, NULL, NULL) = 4
close(3) = 0
read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
<...>
read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 3035
read(4, "", 4096) = 0
close(4) = 0
+++ exited with 0 +++
$ strace -e socket,connect,write,accept,read,close ./test --tx
<...>
socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(31337), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 600000) = 600000
close(3) = 0
+++ exited with 0 +++
...versus: $ gcc -O2 -DWRITE_TO_SOCKET_BEFORE_READ test.c -o test
$ strace -e socket,connect,write,accept,read,close ./test --rx
<...>
socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 3
accept(3, NULL, NULL) = 4
close(3) = 0
write(4, "\250\3\0\0\0\0\0\0\250\3\0\0\0\0\0\0$\0\0\0\0\0\0\0$\0\0\0\0\0\0\0"..., 4096) = 4096
read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 4096
<...>
read(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 4096) = 997
read(4, 0x7ffd45c2d3c0, 4096) = -1 ECONNRESET (Connection reset by peer)
<...>
+++ exited with 1 +++
$ strace -e socket,connect,write,accept,read,close ./test --tx
<...>
socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 3
connect(3, {sa_family=AF_INET, sin_port=htons(31337), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
write(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 600000) = 600000
close(3)
+++ exited with 0 +++blog - git - desktop - contact
2026-05-05
Two services running on the same machine. One of them opens a listening TCP socket bound to localhost, the other one connects to that. They exchange data. Every now and then, the service that initiated the connection gets an ECONNRESET while reading data from the socket -- but no other errors show up in the logs, no crashes, nothing. What's going on?
Go on to part 2.
Let's start with the "server", i.e. the service that opens the listening socket.
The following program does just that: Create a new TCP socket, wait for connections, fork into a new process for each request. There's not much of a "request", though: The server simply dumps 600'000 x bytes to the client upon connection.
The number 600'000 bears some meaning: It needs to be large enough to trigger the behavior that I want to show. 600 bytes, for example, probably won't work.
Now on to the client: It connects to port 8125 on localhost where our server is waiting, and then it calls recv() until EOF or error. We'll get to the --spam flag in a second.
Also:
Let's run the two programs:
[terminal1]$ ./server
[terminal2]$ ./client
Read 600000 bytes, final return value was 0, errno was 0
Nothing spectacular.
But let's use the --spam flag:
$ ./client --spam
Read 600000 bytes, final return value was -1, errno was 104
Connection reset by peer
$ ./client --spam
Read 256000 bytes, final return value was -1, errno was 104
Connection reset by peer
$ ./client --spam
Read 351232 bytes, final return value was -1, errno was 104
Connection reset by peer
$ ./client --spam
Read 351232 bytes, final return value was -1, errno was 104
Connection reset by peer
$ ./client --spam
Read 351232 bytes, final return value was -1, errno was 104
Connection reset by peer
$ ./client --spam
Read 256000 bytes, final return value was -1, errno was 104
Connection reset by peer
$ ./client --spam
Read 600000 bytes, final return value was -1, errno was 104
Connection reset by peer
--spam causes the client to first send some data to the server before it tries to receive data. And appararently this causes the connection to break at some point: The client's recv() sees a return value of -1 and errno gets set to 104 = Connection reset by peer.
First, what's "on the wire"?
Okay. So there actually is a TCP RST. Could have also been a programming error or misinterpretation on my part.
strace ./server seesThe RST originates from the server, so let's attach strace and see what we get:
19:59:03.420432 accept(3, NULL, NULL) = 4
19:59:05.652715 clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fe43484fa10) = 239546
[pid 239546] 19:59:05.652831 ...
[pid 239546] 19:59:05.652959 mmap(NULL, 602112, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 <unfinished ...>
[pid 239546] 19:59:05.652980 <... mmap resumed>) = 0x7fe43456d000
[pid 239546] 19:59:05.653235 sendto(4, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 600000, 0, NULL, 0) = 600000
[pid 239546] 19:59:05.653474 close(4) = 0
[pid 239546] 19:59:05.653553 exit_group(0) = ?
[pid 239546] 19:59:05.653667 +++ exited with 0 +++
No crash. It forked and used sendto() to dump all the data to the client. Then it quit.
Also note that sendto() returned the full 600000, so from the perspective of this program, "all data got sent" (there's a footnote, obviously, as the manpage explains: "Successful completion of a call to sendto() does not guarantee delivery of the message. A return value of -1 indicates only locally-detected errors.").
In fact, there is no difference here whether you use --spam on the client or not.
strace ./client --spam sees19:59:05.652518 connect(3, {sa_family=AF_INET, sin_port=htons(8125), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
19:59:05.652649 sendto(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 100, 0, NULL, 0) = 100
19:59:05.652805 recvfrom(3, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096, 0, NULL, NULL) = 4096
19:59:05.653382 recvfrom(3, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096, 0, NULL, NULL) = 4096
...
19:59:05.654440 recvfrom(3, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096, 0, NULL, NULL) = 4096
19:59:05.654473 recvfrom(3, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096, 0, NULL, NULL) = 1024
19:59:05.654506 recvfrom(3, 0x55d23c5b5010, 4096, 0, NULL, NULL) = -1 ECONNRESET (Connection reset by peer)
19:59:05.654575 write(1, "Read 128000 bytes, final return "..., 60) = 60
19:59:05.654694 write(4, "Connection reset by peer\n", 25) = 25
19:59:05.654725 close(4) = 0
19:59:05.654750 close(3) = 0
19:59:05.654783 exit_group(0) = ?
19:59:05.654864 +++ exited with 0 +++
Nothing out of the ordinary here, either. We ran recvfrom() until one of them returned with -1 / ECONNRESET.
Let's test when we get the ECONNRESET. Because, if you scroll back up, some of the invocations read the full 600'000 bytes and some returned other values. So there's probably some timing issue here.
Let's do what is most obvious in hindsight: Delay the close() in the server. Because if anything legitimately creates a RST, it'll probably be that.
In serve_client(), simply add a sleep(1) before the call to close():
sleep(1);
if (close(s) == -1)
{
perror("close");
return 1;
}
And there you go, we can clearly see a one second delay now. A tcpdump shows it best:
So, without having dug deeper, here is some speculation:
close() in the server, the socket is "dirty" (because of pending data), so a RST gets fired to (hopefully) tell the client that not all data has been read.This makes sense to me at the moment, although I haven't found any definitive sources yet to confirm this behavior.
(Initially, one of the hypotheses was that some buffer was filling up, hence spamlen is set to 100 and it originally was 100'000 -- but that doesn't matter, a single byte is enough.)
gunicorn serving a flask app was running behind nginx as a reverse proxy. Sometimes, nginx got these ECONNRESET from gunicorn. There were also cloud hosters involved and firewalls and lots of routers and parallel requests and what not, getting derailed by ioctls and non-blocking sockets, so boiling this down took a while.
Essentially, what nginx does is pass the HTTP request on to gunicorn, but it does so with two syscalls:
09:11:31.254489 writev(29, [{iov_base="POST /reveal/d48z/iha4A9MOMuLW40"..., iov_len=392}], 1) = 392
09:11:31.255435 writev(29, [{iov_base="compat=lynx+needs+this", iov_len=22}], 1) = 22
392 bytes HTTP headers, 22 bytes HTTP body, 414 bytes in total.
gunicorn then reads this data from the socket:
09:11:31.593968 recvfrom(6, "POST /reveal/gEJh/bIoAUdWrSV47mI"..., 8192, 0, NULL, NULL) = 414
However, every now and then, it only got the data from the first writev() call:
09:11:31.251229 recvfrom(6, "POST /reveal/d48z/iha4A9MOMuLW40"..., 8192, 0, NULL, NULL) = 392
gunicorn (and/or the application running inside of it) was happy to just get the headers, it didn't care about the body. I assume that this part of the software stack is "lazy": When there's nothing accessing the body in the application, it won't bother even recv()-ing it.
Problem is, gunicorn ends these transactions like that:
09:11:31.583979 sendto(6, "HTTP/1.1 200 OK\r\nServer: gunicor"..., 212, 0, NULL, 0) = 212
09:11:31.584225 sendto(6, "\312\205]"..., 614400, 0, NULL, 0) = 614400
09:11:31.590869 close(6) = 0
Send off the data and close the socket. If there's data still pending to be read, this will cause a RST, I think.
The workaround was to have the Python app do some dummy operation on the HTTP body, to make sure that it got fully read from the socket. So far, I haven't seen any more ECONNRESET. (Depending on your application, this can open up the possibility of a DoS, though: If someone POSTs 10 GB of data on your server and you only have 1 GB of memory, then you might be in trouble. client_max_body_size in nginx can probably prevent that.)
Verify that the close() call really is the cause of the TCP RST and find a good source for that.
It might be because of RFC 1122:
A host MAY implement a "half-duplex" TCP close sequence, so that an application that has called CLOSE cannot continue to read data from the connection. If such a host issues a CLOSE call while received data is still pending in TCP, or if new data is received after CLOSE is called, its TCP SHOULD send a RST to show that data was lost.
Figure out who's to "blame" on the Python side: gunicorn, flask, or the actual flask app. Report upstream.