[BUG] Federation HA - Passive node not getting Presence syncing broadcast #3564

Open · xhuang-sangoma opened this issue Jan 22, 2025 · 4 comments

xhuang-sangoma commented Jan 22, 2025

OpenSIPS version you are running

[root@sip-b97b69845-fgbn6 /]# opensips -V
version: opensips 3.5.3 (x86_64/linux)
flags: STATS: On, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, FAST_LOCK-ADAPTIVE_WAIT
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: 051e1c4cc
main.c compiled on 02:28:56 Dec 19 2024 with cc 8

Describe the bug

This is a follow-up to a previous issue reported at #2960.

We have two OpenSIPS instances configured as an active-backup HA pair in federation cluster mode.

The active node has the following settings:

modparam("clusterer","db_url","mysql://xxxx")
modparam("clusterer", "my_node_id", 1)
modparam("clusterer", "sharing_tag" ,"69.108.214.70/1=active")

modparam("presence","db_url","mysql://xxxx")
modparam("presence", "db_update_period", 0)
modparam("presence", "fallback2db", 0)
modparam("presence", "cluster_id", 1)
modparam("presence", "cluster_federation_mode", "full-sharing")
modparam("presence", "cluster_be_active_shtag" ,"69.168.214.70")

The backup node has the following settings:

modparam("clusterer","db_url","mysql://xxxx")
modparam("clusterer", "my_node_id", 2) 
modparam("clusterer", "sharing_tag" ,"69.108.214.70/1=backup")

modparam("presence","db_url","mysql://xxxx")
modparam("presence", "db_update_period", 0)
modparam("presence", "fallback2db", 0)
modparam("presence", "cluster_id", 1)
modparam("presence", "cluster_federation_mode", "full-sharing")
modparam("presence", "cluster_be_active_shtag" ,"69.168.214.70")

These are the entries in the clusterer table:

mysql> select * from clusterer;
+----+------------+---------+-------------------------+-------+-----------------+----------+---------------+-------+-------------+
| id | cluster_id | node_id | url                     | state | no_ping_retries | priority | sip_addr      | flags | description |
+----+------------+---------+-------------------------+-------+-----------------+----------+---------------+-------+-------------+
|  2 |          1 |       1 | bin:69.108.214.99:5566  |     1 |               3 |       50 | 69.108.214.70 | seed  | NULL        |
|  4 |          1 |       2 | bin:69.108.214.97:5566  |     1 |               3 |       50 | 69.108.214.70 | NULL  | NULL        |
+----+------------+---------+-------------------------+-------+-----------------+----------+---------------+-------+-------------+

The VIP 69.168.214.70 is configured on the active node_id 1.

Phones send REGISTER and SUBSCRIBE (for BLF) requests to the VIP on the active node, and the subscriptions are processed by the active node and stored in memory.

Note that we set db_update_period and fallback2db to 0 and cluster_federation_mode to full-sharing because we don't want to use the DB to share the subscriptions. Instead, we want the backup node to get the subscriptions synced by receiving the cluster broadcast from the active node.

We expect that subscriptions processed by the active node will also be synced to the backup node, so that running "opensips-cli -x mi subs_phtable_list" on both the active and backup node shows the same list.

The actual result is that only the active node prints out the subscription list; the backup node prints an empty list.

This causes a problem: if we switch the VIP and make the backup node active, it will fail to handle any PUBLISH or deliver NOTIFYs to subscribers, since it does not have the subscriptions.
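
For context, our failover simply moves the VIP to the backup node and flips the sharing tag there to active. A sketch of that step, assuming the clusterer MI command clusterer_shtag_set_active and the tag/cluster values shown above:

opensips-cli -x mi clusterer_shtag_set_active 69.168.214.70/1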

We think the issue is related to this code commit: 8b96b70

In this code, the backup node checks the cluster_be_active_shtag; since the tag is not active there, the node stops accepting any presence cluster traffic. We think this is incorrect.

The expected behaviour is that the backup node should continue to accept the presence cluster broadcasts and store the subscriptions in memory. It just does NOT NEED TO PROCESS any of them.

@bogdan-iancu (Member)

@xhuang-sangoma , there might be a bit of confusion here. The cluster_be_active_shtag is meant for scenarios where you have federated clustering but want some nodes within the cluster to be inactive from the clustering perspective (not to send or receive anything via the clustering layer) - the idea is to allow such cluster-idle nodes to act as standby backups. Such nodes are DB-updated, not cluster-updated.
But in your case you do not want to use the DB at all, so the cluster-idle nodes will be totally disconnected (data-wise) from the rest of the nodes.
The right configuration is to drop this cluster_be_active_shtag and let all nodes in the cluster receive and send data over the cluster - so all nodes will have the published data replicated. To control the active-backup setup of your nodes, you need to use the sharing tag attached to the subscription, via the handle_subscribe() function - this sh-tag controls which server is responsible for the actions related to the subscription, and of course, this sh-tag must be active on the opensips node handling the data and backup on the opensips node in standby.
With this setup, both nodes will share the full presentity data set, but only one will perform subscription-related actions (expiring, notifications, etc.).
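
Roughly, a minimal sketch of what I mean, reusing the tag name from your own config (the exact tag name is up to you; use "=backup" instead of "=active" on the standby node):

modparam("clusterer", "sharing_tag", "69.168.214.70/1=active")

modparam("presence", "db_update_period", 0)
modparam("presence", "fallback2db", 0)
modparam("presence", "cluster_id", 1)
modparam("presence", "cluster_federation_mode", "full-sharing")
# note: no cluster_be_active_shtag at all

route[handle_presence]
{
	t_newtran();
	if (is_method("PUBLISH")) {
		handle_publish();
	}
	if (is_method("SUBSCRIBE")) {
		# the sharing tag passed here decides which node acts on the subscription
		handle_subscribe(,"69.168.214.70");
	}
	exit;
}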

@xhuang-sangoma (Author)

@bogdan-iancu Thanks for looking into this.

Following your suggestion, I updated the configuration as below:

Active Node (Note that I've commented out the cluster_be_active_shtag option)

modparam("presence", "db_update_period", 0)
modparam("presence", "fallback2db", 0)
modparam("presence", "cluster_id", 1) 
modparam("presence", "cluster_federation_mode", "full-sharing")
#modparam("presence", "cluster_be_active_shtag" ,"69.168.214.70")

route[handle_presence]
{
	t_newtran();
	if (is_method("PUBLISH")) {
		 handle_publish();
	}
	if (is_method("SUBSCRIBE")) {
		handle_subscribe(,"69.168.214.70");
	}
	exit;
}

Backup Node:

modparam("presence", "db_update_period", 0)
modparam("presence", "fallback2db", 0)
modparam("presence", "cluster_id", 1) 
modparam("presence", "cluster_federation_mode", "full-sharing")
#modparam("presence", "cluster_be_active_shtag" ,"69.168.214.70")

route[handle_presence]
{
	t_newtran();
	if (is_method("PUBLISH")) {
		 handle_publish();
	}
	if (is_method("SUBSCRIBE")) {
		handle_subscribe(,"69.168.214.70");
	}
	exit;
}

The result:

  1. There are errors like the following logged on the backup node:
Jan 25 01:54:23 [326554] INFO:clusterer:handle_sync_packet: Received all sync packets for capability 'presence' in cluster 1
Jan 25 01:54:23 [326544] CRITICAL:db_mysql:wrapper_single_mysql_real_query: driver error (1062): Duplicate entry 'HP-000FD3D085A7-18300027-sandbox2-sip.nxf-test.fonality.com-pres' for key 'presentity.presentity_idx'
Jan 25 01:54:23 [326544] ERROR:core:db_do_insert: error while submitting query
Jan 25 01:54:23 [326544] ERROR:presence:update_presentity: inserting new record in database
Jan 25 01:54:23 [326544] ERROR:presence:handle_replicated_publish: failed to update presentity based on replicated Publish
Jan 25 01:54:23 [326544] ERROR:presence:handle_replicated_publish: failed to handle bin packet 101 from node 1
Jan 25 01:54:23 [326544] WARNING:presence:bin_packet_handler: failed to process sync chunk!

It seems the backup node is still trying to insert entries into the presentity table. As a backup node, it's not supposed to do that.

  2. The backup node is not receiving any subscription sync from the active node.

The active node prints out an entry in subs_phtable_list:

[root@sip-b97b69845-fgbn6 /]# opensips-cli -x mi subs_phtable_list
[
    {
        "pres_uri": "sip:[email protected]",
        "event": "message-summary",
        "expires": "2025-01-25 02:09:40",
        "db_flag": 2,
        "version": 16,
        "sharing_tag": "69.168.214.70",
        "to_user": "HM-0000000012862-18300027",
        "to_domain": "69.168.214.70",
        "to_tag": "c36a-521bf6280b9673752315e00b3a6378f9",
        "from_user": "HM-0000000012862-18300027",
        "from_domain": "69.168.214.70",
        "from_tag": "720cdb28",
        "contact": "sip:[email protected]:60909;transport=UDP",
        "callid": "WR1r0xWGrL48bcfGSfC4rg..",
        "local_cseq": 16,
        "remote_cseq": 16
    }
]


[root@sip-b97b69845-fgbn6 /]# opensips-cli -x mi clusterer_list_shtags
[
    {
        "Tag": "69.168.214.70",
        "Cluster": 1,
        "State": "active"
    }
]

The backup node has an empty list:

[root@sip-b97b69845-bzbr6 /]# opensips-cli -x mi subs_phtable_list
[]

[root@sip-b97b69845-bzbr6 /]# opensips-cli -x mi clusterer_list_shtags
[
    {
        "Tag": "69.168.214.70",
        "Cluster": 1,
        "State": "backup"
    }
]

@bogdan-iancu (Member)

@xhuang-sangoma , I'm a bit confused as to what you want to achieve here. Do you want to set up a pure active-backup configuration, or a federation (with multiple active nodes sharing parts of the data)?

@xhuang-sangoma (Author)

xhuang-sangoma commented Jan 28, 2025 via email
