Unexpected SocketExceptions after AWS Lambda Execution

After successfully migrating from Quarkus RESTEasy Classic to RESTEasy Reactive, one persistent problem with a specific AWS Lambda remained open. This Lambda connects to the Microsoft Graph API to send mails and perform a few other tasks. Before the migration this worked quite reliably, apart from sporadic java.net.SocketException: Connection reset errors during regular executions of the Lambda. After the migration, however, we saw many SocketExceptions occurring after the actual execution of the Lambda. Turning on detailed logging showed the following (omitting the full stack trace here):

END RequestId: 5b3b8a77-1d2b-5a4c-b6d6-d622a70caaee
REPORT RequestId: 5b3b8a77-1d2b-5a4c-b6d6-d622a70caaee	Duration: 1960.48 ms	Billed Duration: 2506 ms	Memory Size: 512 MB	Max Memory Used: 181 MB	Init Duration: 545.50 ms
[id: 0x441da31f, L:/<local-ip>:48654 - R:graph.microsoft.com/20.190.190.102:443] READ COMPLETE
[id: 0x441da31f, L:/<local-ip>:48654 - R:graph.microsoft.com/20.190.190.102:443] EXCEPTION: java.net.SocketException: Connection reset
[id: 0x441da31f, L:/<local-ip>:48654 - R:graph.microsoft.com/20.190.190.102:443] CLOSE
[id: 0x441da31f, L:/<local-ip>:48654 ! R:graph.microsoft.com/20.190.190.102:443] USER_EVENT: io.netty.channel.socket.ChannelInputShutdownReadComplete@2bb1f694
[id: 0x441da31f, L:/<local-ip>:48654 ! R:graph.microsoft.com/20.190.190.101:443] WRITE: 0B
[id: 0x441da31f, L:/<local-ip>:48654 ! R:graph.microsoft.com/20.190.190.101:443] FLUSH
[id: 0x441da31f, L:/<local-ip>:48654 ! R:graph.microsoft.com/20.190.190.101:443] CLOSE
[id: 0x441da31f, L:/<local-ip>:48654 ! R:graph.microsoft.com/20.190.190.102:443] USER_EVENT: SslCloseCompletionEvent(java.nio.channels.ClosedChannelException)
[id: 0x441da31f, L:/<local-ip>:48654 ! R:graph.microsoft.com/20.190.190.102:443] INACTIVE
[id: 0x441da31f, L:/<local-ip>:48654 ! R:graph.microsoft.com/20.190.190.102:443] UNREGISTERED

When these exceptions occurred, it was usually roughly 6 minutes after the Lambda had finished its work and always on the I/O thread (vert.x-eventloop-thread-1), so it was obvious that the problem arose through the use of RESTEasy Reactive. To make things even stranger, the exceptions didn't occur on every run. Sometimes the Lambda finished as expected, and sometimes we just saw some activity on the event loop after the execution, like

END RequestId: 6e2dc844-94ea-5e81-b9e3-94127b2a5639
REPORT RequestId: 6e2dc844-94ea-5e81-b9e3-94127b2a5639	Duration: 2431.19 ms	Billed Duration: 2971 ms	Memory Size: 512 MB	Max Memory Used: 180 MB	Init Duration: 538.81 ms
[id: 0x40792513, L:/<local-ip>:46942 - R:graph.microsoft.com/20.190.190.102:443] WRITE: 0B
[id: 0x40792513, L:/<local-ip>:46942 - R:graph.microsoft.com/20.190.190.102:443] FLUSH
[id: 0x40792513, L:/<local-ip>:46942 - R:graph.microsoft.com/20.190.190.102:443] CLOSE
[id: 0x40792513, L:/<local-ip>:46942 ! R:graph.microsoft.com/20.190.190.102:443] USER_EVENT: SslCloseCompletionEvent(java.nio.channels.ClosedChannelException)
[id: 0x40792513, L:/<local-ip>:46942 ! R:graph.microsoft.com/20.190.190.102:443] INACTIVE
[id: 0x40792513, L:/<local-ip>:46942 ! R:graph.microsoft.com/20.190.190.102:443] UNREGISTERED

which looks quite normal, or at least didn't cause any pain. From my understanding, the ClosedChannelException can be ignored. In short, it looked like a classic problem of asynchronous programming: sometimes you see the error, sometimes you don't, which makes it very hard to find the root cause. Additionally, running the Lambda locally in quarkus:dev mode or via the AWS SAM CLI never produced the exception.

Turning off the Connection Pool

After asking in the Quarkus discussion forum, the Quarkus team suggested reducing the keep-alive time of the connection pool used by the REST client (so that connections are closed earlier) or even disabling the pool completely. Playing around with the quarkus.rest-client.connection-ttl parameter and a few others had no positive effect on the problem, and I couldn't find any timeout value that would explain why the exception occurred around 6 minutes after the Lambda had run.
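
Just to illustrate the kind of tuning that was tried first, a corresponding entry in application.properties could look like this (the value shown is only an example, not the one we actually used):

# connection time-to-live for the REST client connection pool (illustrative value)
quarkus.rest-client.connection-ttl=30000

Finally, I decided to turn off the connection pool completely by implementing a ContextResolver for HttpClientOptions: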

import io.vertx.core.http.HttpClientOptions;

import javax.ws.rs.ext.ContextResolver;
import javax.ws.rs.ext.Provider;

@Provider
public class CustomHttpClientOptions implements ContextResolver<HttpClientOptions> {

    @Override
    public HttpClientOptions getContext(Class<?> aClass) {
        final HttpClientOptions options = new HttpClientOptions();
        // disable keep-alive so that connections are not pooled and reused
        options.setKeepAlive(false);
        return options;
    }
}

Implemented this way, it turns off the pool for all REST clients. Maybe the Quarkus team will also implement a parameter so that the pool can simply be turned off via the corresponding value in application.properties, which would make the above class obsolete. Alternatively, it is possible to turn off the pool only for specific REST clients by creating the client programmatically, as described in the official guide.
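
As a rough sketch of this per-client approach: the GraphClient interface and its path are only illustrative, and the @Provider annotation would have to be removed from CustomHttpClientOptions so that it is no longer registered globally.

import java.net.URI;

import javax.ws.rs.GET;
import javax.ws.rs.Path;

import org.eclipse.microprofile.rest.client.RestClientBuilder;

public class GraphClientFactory {

    // hypothetical REST client interface, only for illustration
    @Path("/v1.0/me")
    public interface GraphClient {
        @GET
        String me();
    }

    public GraphClient create() {
        return RestClientBuilder.newBuilder()
                .baseUri(URI.create("https://graph.microsoft.com"))
                // register the resolver only for this client instead of globally
                .register(new CustomHttpClientOptions())
                .build(GraphClient.class);
    }
}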

In the end this resolved the problem. Yes, the lack of a connection pool has an impact on the overall performance, but in our case this is uncritical because the Lambda is neither under heavy load nor time-critical.

Conclusion

We were able to solve the problem of the SocketException thrown after the execution of the Lambda by disabling the connection pool of the REST client, with the downside of reduced performance, which is not a problem in our case (but might be in others).

What remains unclear is:

  • What is the root cause of this behavior? Why is there activity on the connection minutes after the Lambda has finished?
  • Why is the exception thrown roughly 6 minutes after the regular execution has finished? Is there a timeout value which causes this specific duration?

It is remarkable that we do not see this kind of problem with the RESTEasy Reactive REST client in any other service, so maybe it is also related to the Microsoft Graph API this Lambda connects to. Based on my experience with Microsoft, that wouldn't surprise me.

Update 21.02.2023

Further analysis yielded a few more insights and developments:

  • The ~6 minutes timeout we saw is not caused by Quarkus or any Lambda code but by AWS shutting down the Lambda execution environment after waiting for further invocations which, in our scenario, did not come. AWS does not make the algorithm transparent that decides how long to wait before shutting down the execution environment; in our case it was just around 6 minutes. That is also the reason why we never saw this issue locally, but only in the AWS environment.
  • We compile some of our Lambdas natively in order to improve startup times and reduce the memory footprint. The SocketException only appears when the Lambdas are natively compiled and run with the custom (provided) Lambda runtime. Running them non-natively with the AWS-provided Java runtime does not show this behavior. It seems that something does not work correctly during the shutdown phase when the Lambdas are natively compiled.
  • This also means that the problem does not exist for only one specific Lambda, as assumed at first, but for all our natively compiled Lambdas that use the RESTEasy Reactive REST client.
  • As of Quarkus version 2.16.3.Final there is a parameter quarkus.rest-client.keep-alive-enabled for directly enabling or disabling the connection pool for all REST clients (enabled by default). Accordingly, quarkus.rest-client."config-key".keep-alive-enabled enables or disables the keep-alive only for the specified REST client; see the snippet after this list. I contributed this parameter to Quarkus with PR #31125. When it is used, the ContextResolver<HttpClientOptions> implementation shown above is not necessary anymore.
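
In application.properties this could look as follows; the config key graph-api is just an example name:

# disable the connection pool (keep-alive) for all REST clients
quarkus.rest-client.keep-alive-enabled=false

# ... or only for the REST client with the (example) config key "graph-api"
quarkus.rest-client."graph-api".keep-alive-enabled=false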

Feedback Welcome

I am always happy to receive comments, corrections, improvements or feedback in general! Please drop me a note on Mastodon or by E-Mail anytime.