BizTalk ESB Toolkit 2.0 Portal Timeouts and (401) Unauthorized Errors

The Problem

During application testing in our recently built test and newly built production BizTalk 2009 environments, the ESB Portal began throwing either a System.TimeoutException or a (401) Unauthorized error.  This happened with increasing frequency on the portal home page and the Faults page.  On the home page, the problem seemed to be localized to the Faults pane.

When we saw the (401) Unauthorized errors, they contained a detail message like this:

MessageSecurityException: The HTTP request is unauthorized with client authentication scheme ‘Negotiate’. The authentication header received from the server was ‘Negotiate,NTLM’.

De-selecting some of the BizTalk applications in My Settings seemed to decrease but not eliminate the problem.  We had already checked and re-checked virtual directory authentication and application pool settings, etc.  Needless to say, everyone was tired of being unable to reliably view faults through the portal.

Debugging

A couple of issues complicated the debugging process, both related to the portal pulling fault data from a web service – specifically the ESB.Exceptions.Service.

First, the ESB.Exceptions.Service uses WCF's webHttpBinding (in other words, REST), introduced in .NET 3.5.  REST is fine for certain applications, but it lacks many features of SOAP.  The one that stands out here is REST's lack of a fault communication protocol.  SOAP has a well-defined structure and protocol for faults, so from the client side it's easy to detect a failed service call and get details about it.  With REST, you'll probably end up with a 400 Bad Request error and be left to guess what happened.

In other words, one can’t really trust the error messages arising from calls to the ESB.Exceptions.Service.

Second, the ESB.Exceptions.Service does not have built-in exception logging.  [In another post I’ll have a simple solution for that.]  Combined with REST’s lack of a fault protocol, any exception that occurs inside the service is essentially lost and obscured.

One of our first debugging steps was to run SQL Profiler on the EsbExceptionDb and see which queries were taking so long.  To our great surprise, when we refreshed the Faults page in the portal we saw in Profiler the same query running over and over, dozens or hundreds of times!

Fortunately, I was able to obtain permissions to our test EsbExceptionDb, which had over 10,000 faults in it, and run the portal and WCF services on my development machine.  Sure enough, I kept hitting a breakpoint inside the ESB.Exceptions.Service GetFaults() method over and over until the client timed out.  However, there were no loops in the code to explain that behavior!

Next, I turned on full WCF diagnostics for the ESB.Exceptions.Service, including message logging, using the WCF Service Configuration Editor.  Using the Service Trace Viewer tool, I indeed saw the same service call happening again and again – but the trace also captured an error at the end of each call cycle.
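For reference, the kind of configuration the WCF Service Configuration Editor writes when tracing and message logging are enabled looks roughly like the fragment below.  This is a sketch rather than the exact output from our environment: the log file path and the listener name are placeholders I chose for illustration, and the size limits should be tuned to taste.

```xml
<configuration>
  <!-- Other sections of the service's web.config are omitted here. -->
  <system.diagnostics>
    <sources>
      <!-- WCF activity and error tracing -->
      <source name="System.ServiceModel"
              switchValue="Information, ActivityTracing"
              propagateActivity="true">
        <listeners>
          <add name="xml" />
        </listeners>
      </source>
      <!-- Logged messages go to the same listener -->
      <source name="System.ServiceModel.MessageLogging">
        <listeners>
          <add name="xml" />
        </listeners>
      </source>
    </sources>
    <sharedListeners>
      <!-- Placeholder path: the app pool identity needs write access to it -->
      <add name="xml"
           type="System.Diagnostics.XmlWriterTraceListener"
           initializeData="C:\Logs\EsbExceptionsService.svclog" />
    </sharedListeners>
    <trace autoflush="true" />
  </system.diagnostics>
  <system.serviceModel>
    <diagnostics>
      <messageLogging logEntireMessage="true"
                      logMalformedMessages="true"
                      logMessagesAtServiceLevel="true"
                      logMessagesAtTransportLevel="true"
                      maxMessagesToLog="3000" />
    </diagnostics>
  </system.serviceModel>
</configuration>
```

The resulting .svclog file opens directly in the Service Trace Viewer.  Full message logging is verbose, so it is best switched off again once the problem has been found.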

The error was a failure serializing the service method's response back to XML.  The service call itself was completing successfully (which I had also observed in the debugger).  Once WCF took control again to send the response back to the client, serialization failed.  Instead of simply returning an error, WCF repeatedly re-executed the service method!  This could be a bug in WCF 3.5 SP1.

Problem Solved

The solution to the WCF re-execution problem was increasing the maxItemsInObjectGraph setting.  On the service side, I did this by opening ESB.Exceptions.Service’s web.config, locating the <serviceBehaviors> section, and adding the line <dataContractSerializer maxItemsInObjectGraph="2147483647" /> to the existing “ExceptionServiceBehavior” behavior.
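In context, the service-side change looks something like the fragment below.  Only the behavior name and the <dataContractSerializer> line come from the actual change; the comment stands in for whatever elements your copy of the behavior already contains.

```xml
<system.serviceModel>
  <behaviors>
    <serviceBehaviors>
      <behavior name="ExceptionServiceBehavior">
        <!-- Existing elements of this behavior stay as they are. -->
        <!-- 2147483647 is Int32.MaxValue, the largest value the setting accepts. -->
        <dataContractSerializer maxItemsInObjectGraph="2147483647" />
      </behavior>
    </serviceBehaviors>
  </behaviors>
</system.serviceModel>
```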

With that simple configuration change, the service call now returned promptly and the portal displayed a matching error about being unable to de-serialize the data.  As with the service, I needed to increase the maxItemsInObjectGraph setting.  I opened the portal’s web.config, located the <endpointBehaviors> section, and added the line <dataContractSerializer maxItemsInObjectGraph="2147483647" /> to the existing “bamservice” behavior.  The error message didn’t change!  I eventually discovered that the <dataContractSerializer> element must be placed before the <webHttp /> element.
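On the portal side, the behavior ended up looking roughly like this.  The sketch assumes the "bamservice" behavior holds nothing besides <webHttp />; if yours has other elements, leave them in place and just make sure the serializer element comes first.

```xml
<system.serviceModel>
  <behaviors>
    <endpointBehaviors>
      <behavior name="bamservice">
        <!-- Must appear before <webHttp />; placed after it, the setting had no effect for us. -->
        <dataContractSerializer maxItemsInObjectGraph="2147483647" />
        <webHttp />
      </behavior>
    </endpointBehaviors>
  </behaviors>
</system.serviceModel>
```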

The portal now displayed the home page and Faults page properly, and the timeout and unauthorized errors disappeared.