Advanced Data Solutions : 401 1 2148074248 in IIS logs behind a load balancer with multiple servers

If you're receiving an unexpected 401 and the IIS logs show 401 1 2148074248, this post may be useful if your setup includes both of the following:

Windows Authentication enabled in IIS (specifically if NTLM is being used), and
a load balancer with multiple web servers behind it

This is an infrequent occurrence, but I have personally troubleshot it a few times over the past several years. It's an odd one and can be difficult to identify, especially if it's intermittent or you cannot reproduce it on demand. This particular issue should not occur if you have only one server, whether it sits behind a load balancer or not. So if you have multiple web servers, remove all but one, and the issue goes away, there's a chance you are experiencing this problem.

To troubleshoot this, it is extremely helpful if internal traffic between the load balancer and the web servers is unencrypted. That way the HTTP requests and NTLM messages are visible in network traces, and the problem can be found more quickly.

The first thing that comes to mind when I see this particular problem is that the NTLM messages are being split between multiple TCP connections.

The overall auth flow of NTLM is described on this page, and it wouldn't hurt to understand it more deeply. In short, when a client authenticates using NTLM, multiple round trips are needed before the authenticated user context is fully built.

Those round trips consist of three (3) NTLM messages that must be exchanged between client and server, in order, on the same socket and server, for NTLM to succeed.
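When reading a trace, it helps to know which of the three messages an HTTP header is carrying. The following is a minimal sketch, not a full NTLM parser: it assumes the header uses the plain "NTLM" scheme (a "Negotiate" header may wrap the token differently and is not handled), and the function name ntlm_message_type is made up for this example. It relies only on the fact that an NTLM message starts with the 8-byte signature "NTLMSSP\0" followed by a little-endian 32-bit message type.

```python
import base64
import struct

NTLM_SIGNATURE = b"NTLMSSP\x00"
VALID_TYPES = {1, 2, 3}  # Type-1 negotiate, Type-2 challenge, Type-3 authenticate


def ntlm_message_type(header_value):
    """Return 1, 2 or 3 for an 'NTLM <base64>' header value, or None otherwise.

    Pass the value of an Authorization header (client to server) or a
    WWW-Authenticate header (server to client).
    """
    scheme, _, token = header_value.strip().partition(" ")
    token = token.strip()
    if scheme.upper() != "NTLM" or not token:
        # A bare "NTLM" (as in the initial 401.2 response) carries no message.
        return None
    try:
        raw = base64.b64decode(token, validate=True)
    except ValueError:
        return None
    if len(raw) < 12 or raw[:8] != NTLM_SIGNATURE:
        return None
    (msg_type,) = struct.unpack_from("<I", raw, 8)
    return msg_type if msg_type in VALID_TYPES else None
```

In a healthy exchange on one TCP connection you would expect to see the types in the order 1 (request), 2 (401 response), 3 (request).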

By "on the same socket and server" I mean that the client must keep communicating with the same server, and the client-side IP:port combination must stay the same while the NTLM messages are exchanged. (Remember that in a load-balanced setup, the direct client of the web server is typically the load balancer, because that is where the TCP connections originate in most scenarios.) In other words, if the load balancer opened port 50000 to communicate with a web server and Windows Auth/NTLM is needed, the load balancer must keep all of the NTLM messages on that connection and on that server, rather than spreading them across different ephemeral/dynamic ports. If the messages are split across different ports/TCP connections or across servers, you can see the 401 1 2148074248 issue.
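The rule above can be modeled in a few lines. This is only an illustrative sketch: the Observed record, the find_split_handshakes function, and the IP addresses in the usage example are all hypothetical, and it assumes you have already extracted the NTLM message types and connection details from your own traces. It flags any Type-3 message that arrives on a connection (load balancer IP, load balancer port, server IP) on which no Type-2 was issued.

```python
from typing import NamedTuple


class Observed(NamedTuple):
    lb_ip: str      # load balancer's internal source IP
    lb_port: int    # load balancer's ephemeral source port
    server_ip: str  # web server on the other end of the connection
    ntlm_type: int  # 1, 2 or 3


def find_split_handshakes(messages):
    """Return Type-3 messages seen on a connection that never carried a Type-2."""
    challenged = set()  # connections with an outstanding Type-2 challenge
    orphans = []
    for msg in messages:
        conn = (msg.lb_ip, msg.lb_port, msg.server_ip)
        if msg.ntlm_type == 2:
            challenged.add(conn)
        elif msg.ntlm_type == 3:
            if conn in challenged:
                challenged.discard(conn)  # handshake completed on this connection
            else:
                orphans.append(msg)       # Type-3 landed somewhere else
    return orphans


# Hypothetical trace: Type-1 and Type-2 on one connection, Type-3 on another.
trace = [
    Observed("10.0.0.5", 50000, "10.0.1.10", 1),
    Observed("10.0.0.5", 50000, "10.0.1.10", 2),
    Observed("10.0.0.6", 50001, "10.0.1.11", 3),
]
for msg in find_split_handshakes(trace):
    print("Type-3 without a matching Type-2:", msg)
```

A server receiving such an orphaned Type-3 has no partial security context to complete, which is the situation that produces the 401 1 2148074248 entry.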

401.1 == logon failure.
The 2148074248 code translates to:
SEC_E_INVALID_TOKEN: The token supplied to the function is invalid.
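IIS logs the win32 status as a decimal number, while Windows error references usually list these codes in hexadecimal, so converting between the two makes lookups easier. A small sketch follows; the describe_win32_status helper and its lookup table are made up for this example and cover only a few codes relevant to this post.

```python
# Only a handful of codes; extend the table as needed.
KNOWN_WIN32_STATUS = {
    0x80090308: "SEC_E_INVALID_TOKEN",
    0x8009030E: "SEC_E_NO_CREDENTIALS",
    0x00000005: "ERROR_ACCESS_DENIED",
}


def describe_win32_status(value):
    """Format an sc-win32-status value from an IIS log as hex plus a name."""
    code = int(value) & 0xFFFFFFFF  # treat the value as an unsigned 32-bit code
    name = KNOWN_WIN32_STATUS.get(code, "unknown")
    return f"0x{code:08X} {name}"


print(describe_win32_status(2148074248))  # 0x80090308 SEC_E_INVALID_TOKEN
```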

Here's an example of what this looks like in network traces. These are real-world traces from a customer environment.

Note this is a new TCP connection.

The initial request in frame 6109 was anonymous, so the server sent back the typical 401.2 and requested Windows Auth. This would have logged a "401 2 5" in the IIS log. This is normal.

The second request was frame 6118, and it contained the NTLM Type-1 message (not shown).

In a default setup (kernel-mode authentication enabled), the second 401 is actually sent by the HTTP.sys driver underneath IIS. That 401 contains the NTLM Type-2 message and is normal. If kernel-mode authentication is disabled, you would instead see a 401 1 2148074254 in the IIS log, which would also be normal here.

The problem with the communication above is the TCP FIN sent from the client in frame 6121. This is unexpected: what we would expect to see here is a third HTTP request containing the NTLM Type-3 message to complete the auth flow.

What actually happened here is the load balancer had sent the Type-3 message to a new server, instead of sending it on the original (now-closed) socket:

Notice all the IPs are different here: the internal interface of the load balancer is different, along with the server-side IP (it was a different server).

It is on this second server that the 401 1 2148074248 is observed. Since this 401 is unexpected from the end client's perspective (as far as the client is concerned, it was using the same sockets to communicate with the load balancer), a credential prompt appeared in the client's browser. That particular error code is returned by Local Security Authority (LSA) code, and it occurs because the security context is only partial: the context was not generated on this server (or even on the same socket, which is what NTLM requires), so authentication failed.
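To see which servers are logging the failure, you can scan each server's IIS log for the status triple. The sketch below assumes W3C-format logs whose "#Fields:" line includes sc-status, sc-substatus, and sc-win32-status; the find_status function and the sample log lines are hypothetical and only for illustration.

```python
def find_status(lines, status, substatus, win32):
    """Return log rows (as dicts) matching an sc-status/substatus/win32 triple."""
    wanted = (str(status), str(substatus), str(win32))
    fields = []
    hits = []
    for line in lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The directive names the columns for the rows that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue
        values = line.split()
        if len(values) != len(fields):
            continue  # skip malformed or truncated rows
        row = dict(zip(fields, values))
        triple = (row.get("sc-status"), row.get("sc-substatus"),
                  row.get("sc-win32-status"))
        if triple == wanted:
            hits.append(row)
    return hits


# Hypothetical sample, not taken from the customer environment.
sample_log = [
    "#Fields: date time s-ip cs-method cs-uri-stem sc-status sc-substatus sc-win32-status time-taken",
    "2021-07-01 12:00:00 10.0.1.10 GET /app/ 401 2 5 3",
    "2021-07-01 12:00:01 10.0.1.11 GET /app/ 401 1 2148074248 15",
]
for row in find_status(sample_log, 401, 1, 2148074248):
    print(row["date"], row["time"], row["s-ip"], row["cs-uri-stem"])
```

Running this against the logs from every server behind the load balancer, and lining up the timestamps with the normal 401 entries on the other servers, can help confirm that a handshake was split.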

It's not shown here, but when digging into the NTLM messages in the HTTP requests/responses, we could see that server-2 was indeed receiving NTLM challenges sent to the client by server-1.

The problem here was that the load balancer was not maintaining session affinity (persistence). Interestingly, the load balancer was configured with IP-based persistence, but why it did not always honor that setting is unknown.

This particular issue was resolved when the load balancer was switched to a cookie-based persistence mechanism.

Posted at https://sl.advdat.com/3wcDOpB

Thursday, July 1, 2021
