Monday, February 7, 2011

Cluster Ointment: JBoss, JGroups and Shunning

For some months now we've been having intermittent problems with our JBoss cluster. When a cluster node died and restarted, it would sometimes fail to rejoin the cluster. The log would just keep saying:

WARN [org.jgroups.protocols.pbcast.GMS] join(<IP address>) sent to <IP address> timed out (after 3000 ms), retrying
WARN [org.jgroups.protocols.pbcast.GMS] join(<IP address>) sent to <IP address> timed out (after 3000 ms), retrying
WARN [org.jgroups.protocols.pbcast.GMS] join(<IP address>) sent to <IP address> timed out (after 3000 ms), retrying
WARN [org.jgroups.protocols.pbcast.GMS] join(<IP address>) sent to <IP address> timed out (after 3000 ms), retrying

This happened on JBoss 4.0.3, 4.2, 5.0 and 5.1. But today I found some interesting tidbits:
  • When a [JGroups] member [defined by its IP and port number] is expelled from the group, e.g. because it didn't respond to are-you-alive messages, and later comes back, then it is shunned (source)
  • JGroups relies on the fact that the assignment of ports by the OS is always (not necessarily monotonically) increasing across a single machine... [but this may not hold] for TCP ... because we're defining the start_port for a member, and so that member will always reuse the same port when restarted [the term is 'reincarnated'] (source)
  • In [JGroups] 2.8, shunning has been removed, so the sections below only apply to versions up to 2.7 (source)
So any version of JBoss using JGroups 2.7 or earlier (which includes JBoss 5.1) can hit this problem if you explicitly define the start_port for a node. The solution is to edit jgroups-channelfactory-stacks.xml and turn shunning off (shun="false") in FD and pbcast.GMS:

<FD_SOCK/>
<FD timeout="6000" max_tries="5" shun="false"/>
<VERIFY_SUSPECT timeout="1500"/>
<pbcast.NAKACK use_mcast_xmit="false" gc_lag="0"
   retransmit_timeout="300,600,1200,2400,4800"
   discard_delivered_msgs="true"/>
<UNICAST timeout="300,600,1200,2400,3600"/>
<pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
   max_bytes="400000"/>
<pbcast.GMS print_local_addr="true" join_timeout="3000"
   shun="false"
   reject_join_from_existing_member="false"
   view_bundling="true"
   view_ack_collection_timeout="5000"/>
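
If you want to check whether a stack file still has shunning switched on anywhere, a small script along these lines might help. This is only a sketch: the function name `find_shunning_protocols`, the `<config>` wrapper element and the sample values are my own assumptions for illustration, not something taken from the JBoss or JGroups distributions. It simply walks every element in the XML and lists those whose `shun` attribute is set to anything other than "false".

```python
import xml.etree.ElementTree as ET


def find_shunning_protocols(xml_text):
    """Return (tag, shun_value) pairs for elements whose shun attribute is not "false".

    Hypothetical helper: it only looks at explicit shun="..." attributes.
    """
    root = ET.fromstring(xml_text)
    hits = []
    # root.iter() visits the root and every nested element, so the exact
    # nesting of the stack file does not matter here.
    for element in root.iter():
        shun = element.attrib.get("shun")
        if shun is not None and shun.strip().lower() != "false":
            hits.append((element.tag, shun))
    return hits


# Made-up sample fragment; the <config> wrapper is an assumption.
SAMPLE = """
<config>
  <FD timeout="6000" max_tries="5" shun="true"/>
  <VERIFY_SUSPECT timeout="1500"/>
  <pbcast.GMS print_local_addr="true" join_timeout="3000" shun="false"/>
</config>
"""

if __name__ == "__main__":
    for tag, value in find_shunning_protocols(SAMPLE):
        print(f"{tag}: shun={value}")
```

Note that this only catches protocols that set the attribute explicitly. A protocol that leaves `shun` out and relies on its default will not be reported, so it is still worth reading through the stack you actually use.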

This has shown an immediate improvement in our cluster. Hope it works for you too!

7 comments:

bilsch said...

Just curious, were you able to reliably reproduce the condition to prove the change helped? ( I'm having a similar problem and am working on recreation process )

Richard said...

Sort of.

Following a node shutdown/restart, we would typically see this problem 1 in 5 times. I believe whether the node gets 'shunned' depends on how quickly it comes back up.

Since the change, we have never seen this problem again.

Regards,

Richard.

bilsch said...

Interesting, thanks!

FWIW I've observed that once this condition is entered the only way to recover is to completely take the cluster down. Obviously less than ideal.

hunters said...

Hi Richard

Do you get this problem with jGroups 2.8+?

Richard said...

No. Shunning has been removed in 2.8+

Anonymous said...

Hi Richard

Can you please tell me which of these sections the changes are required in?
UDP
UDP-Sync
TCP
TCP-Sync

Richard said...

I'm afraid I don't recall, but I'd imagine TCP.