WARN [org.jgroups.protocols.pbcast.GMS] join(<IP address>) sent to <IP address> timed out (after 3000 ms), retrying
WARN [org.jgroups.protocols.pbcast.GMS] join(<IP address>) sent to <IP address> timed out (after 3000 ms), retrying
WARN [org.jgroups.protocols.pbcast.GMS] join(<IP address>) sent to <IP address> timed out (after 3000 ms), retrying
WARN [org.jgroups.protocols.pbcast.GMS] join(<IP address>) sent to <IP address> timed out (after 3000 ms), retrying
WARN [org.jgroups.protocols.pbcast.GMS] join(<IP address>) sent to <IP address> timed out (after 3000 ms), retrying
WARN [org.jgroups.protocols.pbcast.GMS] join(<IP address>) sent to <IP address> timed out (after 3000 ms), retrying
WARN [org.jgroups.protocols.pbcast.GMS] join(<IP address>) sent to <IP address> timed out (after 3000 ms), retrying
This happened on JBoss 4.0.3, 4.2, 5.0 and 5.1. But today I found some interesting tidbits:
- When a [JGroups] member [defined by its IP and port number] is expelled from the group, e.g. because it didn't respond to are-you-alive messages, and later comes back, then it is shunned (source)
- JGroups relies on the fact that the assignment of ports by the OS is always (not necessarily monotonically) increasing across a single machine... [but this may not hold] for TCP ... because we're defining the start_port for a member, and so that member will always reuse the same port when restarted [the term is 'reincarnated'] (source)
- In [JGroups] 2.8, shunning has been removed, so the sections below only apply to versions up to 2.7 (source)
<FD_SOCK/>
<FD timeout="6000" max_tries="5" shun="false"/>
<VERIFY_SUSPECT timeout="1500"/>
<pbcast.NAKACK use_mcast_xmit="false" gc_lag="0"
               retransmit_timeout="300,600,1200,2400,4800"
               discard_delivered_msgs="true"/>
<UNICAST timeout="300,600,1200,2400,3600"/>
<pbcast.STABLE stability_delay="1000" desired_avg_gossip="50000"
               max_bytes="400000"/>
<pbcast.GMS print_local_addr="true" join_timeout="3000"
            shun="false"
            reject_join_from_existing_member="false"
            view_bundling="true"
            view_ack_collection_timeout="5000"/>
This has shown an immediate improvement in our cluster. Hope it works for you too!
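If you want to check an existing stack for leftover shun settings, something like the following might help. It is only a rough sketch in Python, not part of JGroups or JBoss: the function name `find_shun_enabled` and the sample fragment are made up for illustration, and it only looks at explicit `shun` attributes. A protocol with no `shun` attribute falls back to whatever default your JGroups version uses, which this script does not know about.

```python
import xml.etree.ElementTree as ET


def find_shun_enabled(protocols_xml):
    """Return the names of protocols whose explicit shun attribute is not "false"."""
    # Wrap the fragment in a dummy root so several sibling protocol
    # elements parse as a single XML document.
    root = ET.fromstring("<config>" + protocols_xml + "</config>")
    flagged = []
    for element in root.iter():
        shun = element.get("shun")  # None when the attribute is absent
        if shun is not None and shun.strip().lower() != "false":
            flagged.append(element.tag)
    return flagged


# Hypothetical fragment: FD still has shunning switched on, GMS does not.
sample = """
<FD timeout="6000" max_tries="5" shun="true"/>
<VERIFY_SUSPECT timeout="1500"/>
<pbcast.GMS print_local_addr="true" join_timeout="3000" shun="false"/>
"""

print(find_shun_enabled(sample))  # prints ['FD']
```

To use it on a real file, paste the protocol elements of one stack into the string, or read the file and pass in the relevant part.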
7 comments:
Just curious, were you able to reliably reproduce the condition to prove the change helped? (I'm having a similar problem and am working on a way to recreate it.)
Sort of.
Following a node shutdown/restart, we would typically see this problem about 1 time in 5. I believe whether the node gets 'shunned' depends on how quickly it comes back up.
Since the change, we have never seen this problem again.
Regards,
Richard.
Interesting, thanks!
FWIW I've observed that once this condition is entered the only way to recover is to completely take the cluster down. Obviously less than ideal.
Hi Richard
Do you get this problem with JGroups 2.8+?
No. Shunning has been removed in 2.8+
Hi Richard
Can you please tell me in which section these changes are required: UDP, UDP-Sync, TCP or TCP-Sync?
I'm afraid I don't recall, but I'd imagine TCP.