Geeks With Blogs

Charles Young

I've spent some time in the last two days testing the resilience of a BizTalk production environment.  The environment consists of a two-box BizTalk 2004 server group and a two-box (active-passive) SQL cluster.  Testing primarily consisted of rebooting machines, moving SQL cluster groups and, best of all, pulling power cables out of the wall while creating large numbers of files in a drop folder.  We tried several failure scenarios for each of the four machines, and checked carefully for 'lost' messages and any other problems.
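As an illustration only (this is not the tooling used in these tests), a check for 'lost' messages can be sketched as a simple reconciliation of what was sent against what arrived. The sketch assumes that each test file carries a unique ID in its file name and that the destination folder preserves that name; `find_lost_messages`, `sent_ids`, `out_dir` and `suspended_ids` are hypothetical names.

```python
from pathlib import Path
from typing import Iterable, Set


def find_lost_messages(
    sent_ids: Iterable[str],
    out_dir: str,
    suspended_ids: Iterable[str] = (),
) -> Set[str]:
    """Return the IDs that were sent but neither delivered nor suspended.

    Assumption: delivered files are named "<message id>.xml" in out_dir.
    """
    delivered = {path.stem for path in Path(out_dir).glob("*.xml")}
    # A message is only 'lost' if it is unaccounted for everywhere.
    return set(sent_ids) - delivered - set(suspended_ids)
```

An empty result means every message is accounted for. Messages that were legitimately suspended are passed in separately so that they are not reported as lost.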

I'm glad to report that the testing was successful.  At one point, we thought we had lost a single message, but the problem proved spurious.  When we killed a process running a little file generation utility we had created, a malformed XML file was dropped into the 'in' folder and subsequently suspended, quite correctly, by BizTalk.  In total, we passed something like 100,000 messages through BizTalk while testing, and every one got through.
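A test generator can avoid leaving a half-written file in the drop folder by writing each message to a staging folder first and then moving the finished file into place. The following is a minimal sketch of that idea, not the utility we used. The function name, the `Message` element and the folder parameters are all hypothetical, and the move is only atomic when the staging and drop folders are on the same volume.

```python
import os
import tempfile
import uuid
from xml.sax.saxutils import escape


def write_message_atomically(drop_dir: str, staging_dir: str, payload: str) -> str:
    """Write one XML test message so the drop folder never sees a partial file.

    Assumption: staging_dir and drop_dir are on the same volume, so that
    os.replace is an atomic rename rather than a copy.
    """
    message_id = uuid.uuid4().hex
    xml = (
        '<?xml version="1.0" encoding="utf-8"?>\n'
        f'<Message id="{message_id}">{escape(payload)}</Message>\n'
    )

    # Write the complete document outside the drop folder first.
    fd, temp_path = tempfile.mkstemp(suffix=".tmp", dir=staging_dir)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(xml)
        handle.flush()
        os.fsync(handle.fileno())

    # Only a fully written file is ever moved into the drop folder.
    final_path = os.path.join(drop_dir, f"{message_id}.xml")
    os.replace(temp_path, final_path)
    return final_path
```

If the generator is killed part-way through a write, the incomplete file stays in the staging folder and is never picked up by the file receive location.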

The only problem we ran into concerned tracking records.   Every time we pulled the plug on a server, we were left with a handful of spurious records that showed up in the HAT Operations/Messages view.   These records were marked as 'Delivered, not consumed', but in fact the messages were consumed correctly by the service instance.  The records were for messages that were being processed at the point we switched off the power.   BizTalk did not lose these messages, and correctly routed them to their destination. 

The documentation is a little opaque here, but I think the issue is related to TDDS (Tracking Data Decode Service, also known as the BAM Event Bus Service).   TDDS is a Windows service that is responsible for transferring and decoding event data from the MessageBox to the tracking database.   Microsoft states that tracking data can be "lost to the backlog of the BAM Event Bus service", and that as a result, you cannot "rely on HAT to reveal everything".  These comments are associated with recovery scenarios.

BizTalk generally took up to a minute or so to recover from failure (we didn't actually time this), although in the very last test, a couple of messages seemed to get 'stuck' for about 3-4 minutes before being routed.

Posted on Tuesday, July 20, 2004 9:54 PM | BizTalk Server 2004/2006

Comments on this post: BizTalk 2004 recovery: Works well, but beware of HAT!

# re: BizTalk 2004 recovery: Works well, but beware of HAT!
I believe we have experienced exactly the same issues regarding the "delivered not consumed" entries. Were you unable to remove these entries in the traditional fashion within HAT? We had to use a stored procedure supplied by Microsoft that replaced the CleanUpMsgBox.
Left by Matt Hall on Jul 21, 2004 5:54 AM

# re: BizTalk 2004 recovery: Works well, but beware of HAT!
Interesting write up. Either I or Jean plan to post a detailed description of how some of this works on the core engine blog. There seems to be some concern here. We are also doing some good stuff in SP1 to get more immediate restarts and reduce the overhead of this in certain orchestration scenarios.
Left by lee on Sep 08, 2004 8:04 AM

# re: BizTalk 2004 recovery: Works well, but beware of HAT!
Nice tests.

So, does it mean that you must have tracking enabled in a live environment in order to have a guarantee of delivery of your messages in BizTalk?
(I thought that tracking was only for development or test purposes)

If so, how do you clean up or archive the messages in the BizTalk DB?

I saw a script on the "BizTalk Core Engine" blog about a stored procedure to clean up some information in the MessageBox DB, but they advise against using it in a live environment.

Thank you
Left by Joao Morais on Apr 12, 2005 12:12 PM

