Fix for BizTalk ESB Toolkit 2.0 Error 115004 in ALL.Exceptions Send Port

On occasion we have had messages suspend on the ALL.Exceptions send port with the error:

Error 115004: An unexpected error occurred while attempting to retrieve the System.Exception object from the ESB Fault Message.

The source of the error is the pipeline component Microsoft.Practices.ESB.ExceptionHandling.Pipelines.ESBFaultProcessor, part of the ESB Toolkit’s custom pipeline on the ALL.Exceptions port.

Some suspicions and hunting through code using .NET Reflector led to an explanation.

The ExceptionMgmt.CreateFaultMessage() method, which is used to create a fault message in an orchestration exception handler, automatically locates and stores the exception object that was previously thrown.  It stores the exception by binary-serializing it, Base64 encoding it and storing it in a property on the first message part of the fault message.  Later on, the ESBFaultProcessor pipeline component attempts to de-serialize the exception.

The trouble arises when the thrown exception contains a non-serializable inner exception more than one level deep.  The method ExceptionMgmt.IsExceptionSerializable() only checks the root exception and the first InnerException.  If a non-serializable exception happens to be nested further, the code does not detect it.  As a result, the ESBFaultProcessor fails while attempting to de-serialize it.

In our case, we are pulling a flat file down from a web service and disassembling it inside of an orchestration using the XLANGPipelineManager class.  If there is a problem with the file format, an XLANGPipelineManagerException is thrown.  It contains an InnerException of XmlException, which in turn contains an InnerException of Microsoft.BizTalk.ParsingEngine.AbortException – which has no serialization constructor.

To solve this issue, I wrote a short helper method in C#.  I call it immediately after ExceptionMgmt.CreateFaultMessage() and pass it the newly created fault message and the caught exception’s Message property value.  It checks whether the stored exception can be de-serialized, and if not, replaces the stored exception with a special exception class.  This is the same thing that would have happened had the IsExceptionSerializable() method correctly detected the situation.

I submitted this bug to Microsoft Connect.

To use this code, you’ll need a C# class library with references to:

  • Microsoft.Practices.ESB.ExceptionHandling
  • Microsoft.Practices.ESB.ExceptionHandling.Schemas.Faults
  • Microsoft.XLANGS.BaseTypes

For convenience, I added a couple of using statements at the top.

using System;
using Microsoft.XLANGs.BaseTypes;
using ExceptionHandling = Microsoft.Practices.ESB.ExceptionHandling;
using ExceptionHandlingSchemas = Microsoft.Practices.ESB.ExceptionHandling.Schemas.Property;

namespace BizTalkHelpers
{
    public static class OrchestrationHelper
    {
        /// <summary>
        /// Work around a bug in the BizTalk ESB Toolkit 2.0 related to
        /// non-serializable exceptions. When
        /// ExceptionMgmt.CreateFaultMessage() creates a message, it
        /// automatically locates and stores the caught exception. If the
        /// exception contains an InnerException more than one level deep
        /// that is not serializable, the ESBFaultProcessor pipeline
        /// component will later fail when it attempts to deserialize the
        /// exception, resulting in the error:
        /// Error 115004: An unexpected error occurred while attempting to
        /// retrieve the System.Exception object from the ESB Fault Message.
        /// </summary>
        /// <param name="msg">
        /// Message created by ExceptionMgmt.CreateFaultMessage()</param>
        /// <param name="exceptionMsg">
        /// Message property value of the caught exception</param>
        public static void FixNonSerializableExceptionInFaultMsg(
            XLANGMessage msg, string exceptionMsg)
        {
            // Incoming msg must have been created by
            // ExceptionMgmt.CreateFaultMessage()
            XLANGPart p = msg[0];

            if (p == null)
            {
              return;
            }

            // Extract the Base64-encoded string representation of the
            // exception serialized by CreateFaultMessage().
            string str =
              p.GetPartProperty(
                typeof(ExceptionHandlingSchemas.SystemException)) as string;

            if (str == null)
            {
              return;
            }

            try
            {
              ExceptionHandling.Formatter.DeserializeObject<Exception>(str);
            }
            catch (Exception)
            {
              // If an exception is not serializable, the correct behavior
              // is to store a serialized instance of
              // SetExceptionNonSerializableException.
              ExceptionHandling.SetExceptionNonSerializableException ex =
                new ExceptionHandling.SetExceptionNonSerializableException(
                  0x1c13e, new object[] { exceptionMsg });

              p.SetPartProperty(
                typeof(ExceptionHandlingSchemas.SystemException),
                ExceptionHandling.Formatter.SerializeObject<
                ExceptionHandling.SetExceptionNonSerializableException>(ex));
            }
        }
    }
}

BizTalk Adapter for DB2 Error SQLSTATE: HY000, SQLCODE: -270

My client had been using the BizTalk Adapter for DB2 from the BizTalk Adapters for Host Systems for quite some time with no significant issues.  The target system was DB2 on an AS/400 (iSeries/System i).  Late last week, the AS/400’s i5/OS was upgraded from V5R4 to V6R1 and the DB2 team did a database restore.

Somewhere around that time, BizTalk started logging the following errors:

An internal network library error has occurred. The requested command encountered an implementation-specific error condition on the target system. SQLSTATE: HY000, SQLCODE: –270

Searches for this error turned up only one post on a Host Integration Server newsgroup, which didn’t give us any answers.

I started out by looking up the SQLCODE –270 in the IBM SQL Messages and Codes book.  That description had to do with unique indexes or constraints on distributed tables, which didn’t seem relevant to our situation.

I found the actual meaning of –270 in DB2OLEDB.H on the Host Integration Server 2006 CD (MSI\x86\PFiles\SDK\Include).  It’s defined there as DB2OLEDB_DDM_CMDCHKRM, and CMDCHKRM means “command check reply message.”  The problem is that the error code(s) contained in the reply message are not surfaced, so this was still a dead end.

The Microsoft OLE DB Provider for DB2 is the foundation for the DB2 Adapter, so in order to rule out BizTalk and the DB2 Adapter as possible problems, we created a five-line C# command-line app:

OleDbConnection cn =
    new OleDbConnection("Provider=DB2OLEDB;REST_OF_CONNECTION_STRING_HERE");
cn.Open();
OleDbCommand cmd =
    new OleDbCommand("SELECT COUNT(*) FROM A-TABLE-IN-DB2", cn);
int rc = Convert.ToInt32(cmd.ExecuteScalar());
Console.WriteLine("Rows in table: " + rc.ToString());

Sure enough, the test app encountered the same error.  The problem was definitely in the OLE DB Provider for DB2.

The OLE DB Provider requires various “packages” (a DB2 concept) to exist in the DB2 system.  The packages correspond to various transaction isolation levels, so they are named READ UNCOMMITTED, REPEATABLE READ, etc.  We did not enable transactions in the DB2 connection string, nor did we configure isolation level in BizTalk, so we still don’t know which isolation level (and thus which package) is being used.  SERIALIZABLE is a good guess since it is often the BizTalk default.

When a connection is opened, if the DB2 Provider finds that the package associated with the active isolation level does not exist, it is supposed to automatically create it.  The active user account must have sufficient rights in DB2.  If that is not an option, then the Data Access Tool can be used to manually create the packages (the Packages button on the next-to-last page of the New Data Source wizard).

In our case, the user account should have had enough permissions to automatically create a package, but evidently that process failed and resulted in the obscure SQLSTATE: HY000, SQLCODE: –270 error.  As soon as I manually created the packages in the Data Access Tool, the error disappeared and everything began working normally again!

This page of the OLE DB Provider for DB2 documentation is an excellent resource for understanding the DB2 packages, the auto-create process, various error messages that may result and more.

An Optimization for the BizTalk ESB Toolkit 2.0 Portal Faults Page

While debugging the issues described in my previous post, I looked at how the ESB.Exceptions.Service’s GetFaults() method was implemented.  In our case, we had stack traces inside the Description text, so the size of the data returned for each fault was quite large.  Multiplied by thousands of faults, this is why we overflowed the default setting for maxItemsInObjectGraph.

However, this raised an important question: why was the value for Description (and many other fields) being returned from the service when the web pages never show it?

The answer?  The service’s GetFaults() method returns every column from the Fault table, including potentially large values like ExceptionStackTrace, ExceptionMessage and Description.  These fields are never used by the ESB Portal, so this behavior only serves to slow down the queries and cause issues like that described in my last post!

I modified the GetFaults() method’s Linq query to select only the columns used in the portal:

select new
{
    f.Application,
    f.DateTime,
    f.ErrorType,
    f.FailureCategory,
    f.FaultCode,
    f.FaultGenerator,
    f.FaultID,
    f.FaultSeverity,
    f.InsertMessagesFlag,
    f.MachineName,
    f.Scope,
    f.ServiceName
};

And then created the actual Fault objects just before returning from the method:

List<Fault> faults = new List<Fault>();

foreach (var fault in result)
{
    Fault f = new Fault()
    {
        Application = fault.Application,
        DateTime = fault.DateTime,
        ErrorType = fault.ErrorType,
        FailureCategory = fault.FailureCategory,
        FaultCode = fault.FaultCode,
        FaultGenerator = fault.FaultGenerator,
        FaultID = fault.FaultID,
        FaultSeverity = fault.FaultSeverity,
        InsertMessagesFlag = fault.InsertMessagesFlag,
        MachineName = fault.MachineName,
        Scope = fault.Scope,
        ServiceName = fault.ServiceName
    };
    
    faults.Add(f);
}

return faults;

This avoids the expense of SQL Server selecting many large data values that are never used, and can greatly reduce the amount of data that must be serialized and de-serialized across the service boundary.

This change provided a very noticeable boost in performance on the Faults page when searching, filtering and moving between pages.

BizTalk ESB Toolkit 2.0 Portal Timeouts and (401) Unauthorized Errors

The Problem

During application testing in our recently-built test and newly-built production BizTalk 2009 environments, we started having problems with the ESB Portal throwing a System.TimeoutException or a (401) Unauthorized error.  This was happening with increasing frequency on the portal home page and the Faults page.  On the home page, the problem seemed to be localized to the Faults pane.

When we saw the (401) Unauthorized errors, they contained a detail message like this:

MessageSecurityException: The HTTP request is unauthorized with client authentication scheme ‘Negotiate’. The authentication header received from the server was ‘Negotiate,NTLM’.

De-selecting some of the BizTalk applications in My Settings seemed to decrease but not eliminate the problem.  We had already checked and re-checked virtual directory authentication and application pool settings, etc.  Needless to say, everyone was tired of being unable to reliably view faults through the portal.

Debugging

A couple of issues complicated the debugging process, both related to the portal pulling fault data from a web service – specifically the ESB.Exceptions.Service.

First, the ESB.Exceptions.Service uses the webHttp (in other words, REST) binding introduced in .NET 3.5.  REST is fine for certain applications, but it also lacks many features of SOAP.  The one that stands out in particular here is REST’s lack of a fault communication protocol.  SOAP has a well-defined structure and protocol for faults, so from the client side it’s easy to identify and obtain information about a service call failure.  With REST, you’ll probably end up with a 400 Bad Request error and you’re on your own to guess as to what happened.

In other words, one can’t really trust the error messages arising from calls to the ESB.Exceptions.Service.

Second, the ESB.Exceptions.Service does not have built-in exception logging.  [In another post I’ll have a simple solution for that.]  Combined with REST’s lack of a fault protocol, any exception that occurs inside the service is essentially lost and obscured.

One of our first debugging steps was to run SQL Profiler on the EsbExceptionDb and see which queries were taking so long.  To our great surprise, when we refreshed the Faults page in the portal we saw in Profiler the same query running over and over, dozens or hundreds of times!

Fortunately, I was able to obtain permissions to our test EsbExceptionDb, which had over 10,000 faults in it, and run the portal and WCF services on my development machine.  Sure enough, I kept hitting a breakpoint inside the ESB.Exceptions.Service GetFaults() method over and over until the client timed out.  However, there were no loops in the code to explain that behavior!

Next, I turned on full WCF diagnostics for the ESB.Exceptions.Service, including message logging, using the WCF Service Configuration Editor.  Using the Service Trace Viewer tool, I indeed saw the same service call happening again and again – but the trace also captured an error at the end of each call cycle.

The error was a failure serializing the service method’s response back to XML.  The service call was actually completing successfully (which I had also observed in the debugger).  Once WCF took control again to send the response back to the client, it failed.  Instead of just dying, it continuously re-executed the service method!  This could be a bug in WCF 3.5 SP1.

Problem Solved

The solution to the WCF re-execution problem was increasing the maxItemsInObjectGraph setting.  On the service side, I did this by opening ESB.Exceptions.Service’s web.config, locating the <serviceBehaviors> section, and adding the line <dataContractSerializer maxItemsInObjectGraph="2147483647" /> to the existing “ExceptionServiceBehavior” behavior.

With that simple configuration change, the service call now returned promptly and the portal displayed a matching error about being unable to de-serialize the data.  As with the service, I needed to increase the maxItemsInObjectGraph setting.  I opened the portal’s web.config, located the <endpointBehaviors> section, and added the line <dataContractSerializer maxItemsInObjectGraph="2147483647" /> to the existing “bamservice” behavior.  The error message didn’t change!  I eventually discovered that the <dataContractSerializer> element must be placed before the <webHttp /> element.

The portal now displayed the home page and Faults page properly, and the timeout and unauthorized errors disappeared.

Race Condition in BizTalk ESB Toolkit 2.0 Exception Notification Service

We are using the ESB Exception Notification (aka ESB.AlertService) Windows service in conjunction with the ESB Portal website.  On occasion, we have a problem where the service indefinitely sends out duplicate emails for the same alert.  In the server’s Application Event Log, we see the error: “An exception of type ‘System.Data.StrongTypingException’ occurred and was caught.”  The log entry also includes “The value for column ‘To’ in table ‘AlertEmail’ is DBNull.”

We are allowing the service to pull user email addresses from Active Directory by configuring the LDAP Root under Alert Queue Options to LDAP://DC=company, DC=com.  With Active Directory you don’t need to specify a server name in your LDAP path.  Just point to the domain itself and Windows will figure out which domain controller to contact.

The vast majority of the rows in AlertEmail contain the correct email address in the To column, but every once in a while there is a NULL.  Looking at the service code (QueueGenerator.cs), we can see that the email address in CustomEmail is always used first, if one was provided when the alert subscription was created.  We do not set this value, so the code next attempts to pull the email address from Active Directory using the GetEmailAddress() method (ActiveDirectoryHelper.cs).

In order to reduce the number of AD queries, the code caches email addresses using the Enterprise Library caching block.  The cached entries expire after a configurable interval, which defaults to 1 minute.  If the username is already in the cache, then the corresponding email address is returned.  Otherwise, the code looks up the username in AD, grabs the email address and caches it.  The lookup code throws an exception if it doesn’t get back a valid email address, so it doesn’t explain how we got a NULL email address.

The problematic code is the cache lookup:

if (CacheMgr.Contains(name))
{
  Trace.WriteLine("Reusing email address for " + name + " from cache.");
  return (string)CacheMgr.GetData(name);
}

This is a classic race condition.  The code checks to see if the username is in the cache, then runs a Trace.WriteLine(), then asks for the cached data associated with the username.  In the time between the Contains() and the GetData() calls, the cached data can expire and drop out of the cache, in which case GetData() will return null.  Most of the time it gets lucky and the data is still cached.  This probably explains how we sometimes get NULL values in the database.

The proper code is simple because GetData() simply returns null when the requested data is not in the cache:

string cachedEmail = (string)CacheMgr.GetData(name);

if (!string.IsNullOrEmpty(cachedEmail))
{
    Trace.WriteLine("Reusing email address for " + name + " from cache.");
    return cachedEmail;
}

The new version of the code eliminates the race condition and should prevent us from ever seeing NULL values in the database.

I also created a bug report on Microsoft Connect.