Tuesday, June 14, 2011

Stung by HornetQ: The Revenge

Following on from my previous post, I've been spending some more time with HornetQ and have discovered a few more gotchas:

4. Stuck By Default

There's some advice here that says...

"Probably the most common messaging anti-pattern we see is users who create a new connection/session/producer for every message they send or every message they consume. This is a poor use of resources... Always re-use them"

...couple that with other advice that says...

"Please note the default value for address-full-policy [when the send buffer is full] is to PAGE [out to disk]"

And you might think the Right Thing To Do is to set up a single connection/session/producer and send all your messages to the queue. But if you're doing this in a transaction (as most web applications are) you'd be wrong. Why? Because there's some conflicting advice that says...

"By default, HornetQ does not page messages - this must be explicitly configured to activate it"

And a setting in the JBoss/HornetQ integration (hornetq-configuration.xml) that says...

<max-size-bytes>10485760</max-size-bytes>
<address-full-policy>BLOCK</address-full-policy>

So by default JBoss will get stuck if you try sending more than 10MB of messages from a single transacted producer (i.e. before your MDBs can start consuming them). 10MB is not a lot: for me it was about 1,000 messages of around 5,000 characters each (a Unicode XML string).
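
For illustration, here's a rough standalone sketch of the kind of transacted producer loop that hits this limit. The JNDI names, queue name and payload below are made up, and in a real web application the transaction would be container-managed rather than a transacted JMS session, but the effect on the send buffer should be much the same:

// A single connection/session/producer, re-used for every message, as the docs advise.
// JNDI names and queue are hypothetical - adjust for your own deployment.
import javax.jms.*;
import javax.naming.InitialContext;

public class BatchProducer {
    public static void main(String[] args) throws Exception {
        InitialContext ctx = new InitialContext();
        ConnectionFactory cf = (ConnectionFactory) ctx.lookup("ConnectionFactory");
        Queue queue = (Queue) ctx.lookup("queue/exampleQueue");

        Connection connection = cf.createConnection();
        try {
            Session session = connection.createSession(true, Session.SESSION_TRANSACTED);
            MessageProducer producer = session.createProducer(queue);

            // ~1,000 messages of ~5,000 characters is roughly 10MB of Unicode text - enough
            // to fill the default 10MB buffer before commit() lets any MDB consume them.
            // With <address-full-policy>BLOCK</address-full-policy> the send() call hangs.
            for (int i = 0; i < 1000; i++) {
                TextMessage message = session.createTextMessage(buildLargePayload(i));
                producer.send(message);
            }
            session.commit();
        } finally {
            connection.close();
        }
    }

    // Stand-in for the ~5,000 character Unicode XML strings described above.
    private static String buildLargePayload(int i) {
        StringBuilder sb = new StringBuilder("<payload id=\"" + i + "\">");
        for (int j = 0; j < 5000; j++) {
            sb.append('x');
        }
        return sb.append("</payload>").toString();
    }
}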

Here are my suggestions:
  1. HornetQ should treat its JBoss JMS integration as more of a first-class citizen. It should be a primary use case, rather than relegated to a chapter at the back of the User Guide. Why? Because most people who just dip into the User Guide are going to be doing so from a JBoss JMS mindset. So if you say something like "Please note the default value is to PAGE", you also need to say, immediately afterwards, "(except on JBoss, where the default value is to BLOCK)".

  2. BLOCKing is a poor default value. Either make it fail (so the user gets an error) or make it PAGE (so the user gets an error when their disk is full); at least then the developer knows where to look. Blocking just results in the queue being 'stuck', with no clue for a developer who has barely heard of their underlying JMS implementation, let alone blocking versus paging and <address-full-policy>. (The PAGE workaround is sketched just below this list.)
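
For anyone who wants paging in the meantime, the change goes in the <address-settings> section of the JBoss-bundled hornetq-configuration.xml. Something like the sketch below - the match pattern and sizes are only examples, so adjust them for your own deployment:

<address-settings>
   <!-- match="#" applies to every address; use a specific queue name to narrow it -->
   <address-setting match="#">
      <!-- Keep up to ~10MB of messages for an address in memory... -->
      <max-size-bytes>10485760</max-size-bytes>
      <!-- ...then page the rest out to disk in ~2MB chunks... -->
      <page-size-bytes>2097152</page-size-bytes>
      <!-- ...instead of the JBoss default of BLOCKing the producer -->
      <address-full-policy>PAGE</address-full-policy>
   </address-setting>
</address-settings>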

5. Stuck By Bugs

HornetQ is pretty new and there are a few bugs that can cause your JMS messages to get stuck: MDBs will roll back and retry indefinitely, messages with different JMS priorities may get forgotten, and messages can be forwarded to dead cluster nodes.

When you have several different bugs interacting to produce one overall symptom (i.e. a stuck queue) it can be very hard to separate them and understand their underlying causes. This causes a lot of pain!
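
For what it's worth, the settings that normally bound redelivery and divert poisoned messages to a dead letter queue live in the same <address-settings> section. A sketch is below, with a made-up queue name and values; whether it actually helps depends on which of the bugs above you're hitting:

<address-settings>
   <address-setting match="jms.queue.exampleQueue">
      <!-- Give up after 5 failed delivery attempts... -->
      <max-delivery-attempts>5</max-delivery-attempts>
      <!-- ...and move the message to this address instead of retrying forever -->
      <dead-letter-address>jms.queue.DLQ</dead-letter-address>
      <!-- Pause 5 seconds between redelivery attempts -->
      <redelivery-delay>5000</redelivery-delay>
   </address-setting>
</address-settings>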

6. Stuck By Birth

This one isn't really HornetQ's fault, but its behaviour seems different from JBoss Messaging's. If your MDB uses @EJB injection you really need to set up <depends ... > blocks in your jboss.xml...

<message-driven>
   <ejb-name>LongRunningProcessConsumerBean</ejb-name>
   <destination-jndi-name>queue/avant-ss.long-running-processes</destination-jndi-name>
   <!-- Stop MDB consuming too early -->
   <depends>jboss.j2ee:ear=avant-ss-app.ear,jar=avant-ss-dev-ejb.jar,name=LongRunningProcessBean,service=EJB3</depends>
   <depends>jboss.j2ee:ear=avant-ss-app.ear,jar=avant-ss-dev-ejb.jar,name=BusinessRulesBean,service=EJB3</depends>
</message-driven>

...because HornetQ starts up and begins consuming very early. This is particularly bad because the resulting errors go quietly into boot.log, not the regular system.log, so you don't realise that your MDBs crashed on startup.

6 comments:

Clebert Suconic said...

I've yet to see software that doesn't have any bugs ;-) That's why companies offer support contracts.

Well.. I reckon we are lacking a few support options on HornetQ, but this is something we are about to fix.

BTW: we advise doing as much as you can in a single TX, but you can't just consume all the memory of the server in a single TX.

Maybe you could increase the page-max-size.

Richard said...

Clebert,

Thanks for your response!

Please don't misunderstand: I think you guys are doing a great job, and are really responsive, and I appreciate finding bugs is just the nature of early adoption.

With respect to the 'blocking' issue: the problem is that with JBoss Messaging (which used a relational database) you could never 'consume all the memory of the server' by producing lots of messages. So our code that queues up 10MB of messages ran fine, just creating lots of database entries. But it hangs under HornetQ by default.

So the default behaviour of HornetQ is quite different from JBoss Messaging's, and perhaps that could be improved. Either a) page, don't block (more like JBoss Messaging), or b) block but warn the developer (and don't say in the User Guide that you page!)

Regards,

Richard.

Jaikiran said...

Richard,

Regarding the need to set up <depends> in the jboss.xml to prevent HornetQ from delivering before the MDB is fully ready (with @EJB injected): do you see that issue even in the latest AS 6.1.0 snapshots? If yes, can you please file a JIRA and attach a sample application? I think with the upgrade to HornetQ in the 6.1.0 snapshots, this should have been fixed.

Carl said...

I was rattled by the BLOCK default configuration about a week ago.

We're considering upgrading from JBM to HornetQ, and deployed it on our system test cluster. A bad consumer caused the DLQ to fill up to the set limit, after which queues started to freeze (we route messages to the DLQ manually).

It took me a while to figure out that HornetQ paging needs to be enabled explicitly.

I also think that paging should be enabled by default. It's not fun to see a system freeze up.

Dmitry S said...

Thank you very much for your post.
If nothing picks up outbound messages, then HornetQ just gets stuck after some time, without any information in the logs. Fucking JBoss has the wrong default configuration for HornetQ.

Now it seems to be working fine.

Rich said...

Thanks so much!
We'd been having this issue for the past few weeks and could not find a solution until I read your post.
Awesome fix!