Exceptions seen : org.opendaylight.controller.cluster.datastore.exceptions.NoShardLeaderException: Could not process ready local transaction member-1-datastore-operational-fe-4-txn-403-0.
Description
Environment
Activity
Prasad Soudi December 24, 2019 at 6:38 AM
In the test case Bring Up All Instances, why is only node ODL1 started while ODL2 and ODL3 are not?
Because ODL1 was started alone, all these issues started to happen.
If all the nodes had been started along with ODL1, this issue would not have happened.
Prasad Soudi December 19, 2019 at 10:19 AM
In the test case Bring Up All Instances, why is only node ODL1 started while ODL2 and ODL3 are not?
ODL1 started at : Nov 27, 2019 12:16:39 PM org.apache.karaf.main.lock.SimpleFileLock lock
ODL1 waited for nodes ODL2 and ODL3 for some time,
2019-11-27T12:16:57,034 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-19 | NettyTransport | 47 - com.typesafe.akka.slf4j - 2.5.25 | Remote connection to [null] failed with java.net.ConnectException: Connection refused: /10.30.170.124:2550
2019-11-27T12:16:57,035 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-19 | NettyTransport | 47 - com.typesafe.akka.slf4j - 2.5.25 | Remote connection to [null] failed with java.net.ConnectException: Connection refused: /10.30.170.116:2550
As ODL2 and ODL3 were not up, ODL1 formed its own cluster:
2019-11-27T12:17:08,912 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | Cluster(akka://opendaylight-cluster-data) | 47 - com.typesafe.akka.slf4j - 2.5.25 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.109:2550] - Node [akka.tcp://opendaylight-cluster-data@10.30.170.109:2550] is JOINING itself (with roles [member-1, dc-default]) and forming new cluster
If all the nodes had been started along with ODL1, this issue would not have happened.
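As a quick way to confirm whether the other members were even listening, the Akka remoting port can be probed directly from ODL1. A minimal sketch in Python, using the IPs and port 2550 from the "Connection refused" logs above; which IP belongs to member-2 versus member-3 is an assumption here:

# Minimal sketch: check that the Akka remoting port of every member is reachable
# before expecting a three-node cluster to form. IPs/port taken from the logs above;
# the member-2/member-3 mapping is assumed, not confirmed.
import socket

MEMBERS = {
    "member-1": "10.30.170.109",
    "member-2": "10.30.170.124",  # assumed mapping
    "member-3": "10.30.170.116",  # assumed mapping
}
AKKA_REMOTING_PORT = 2550

for name, ip in MEMBERS.items():
    try:
        with socket.create_connection((ip, AKKA_REMOTING_PORT), timeout=5):
            print(f"{name} ({ip}:{AKKA_REMOTING_PORT}) reachable")
    except OSError as e:
        print(f"{name} ({ip}:{AKKA_REMOTING_PORT}) NOT reachable: {e}")

If either of the other two probes fails, ODL1 can only ever form a single-member cluster, exactly as the JOINING log above shows.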
Srinivas Rachakonda December 18, 2019 at 11:49 AM (edited)
All the karaf logs are at:
The exceptions are seen at multiple locations. Roughly, the test case has the following steps:
===========================
1) Create All Controller Sessions
2) Take Down Leader Of Default Shard and bring it up
3) Create Networks, Create Subnets For net_1, Create Subnets For net_2, Add Ssh Allow All Rule
4) Create Vm Instances For net_1, Create Vm Instances For net_2, Check Vm Instances Have Ip Address
5) Create Router router_2, router_3 and add interfaces to this router
6) Take Down ODL1
7) Bring up ODL1 and take down ODL2
8) Bring up ODL2 and take down ODL3
9) Bring up ODL3
10) Ping all Vm Instance1 In net_2 From net_1, Ping all Vm Instance1 In net_1 From net_2, Connectivity Tests From Vm Instance1 In net_1 In Healthy Cluster
11) Take Down ODL1 and ODL2, Connectivity Tests From Vm Instance1 In net_1 With Two ODLs Down
12) Bring Up ODL1 and ODL2
13) Take Down ODL2 and ODL3, Connectivity Tests From Vm Instance1 In net_2
14) Bring up ODL2 and ODL3
15) Take down all ODL nodes
16) Bring up all ODL nodes >>>> this test case fails (see the readiness-check sketch after this list)
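For step 16, rather than running the post-restart checks immediately, the suite could first wait for the default shard to actually elect a leader. A minimal sketch of such a readiness check, assuming Jolokia is reachable on port 8181 with the default admin/admin credentials and using the shard MBean naming from the OpenDaylight clustering documentation (all deployment-specific details here are assumptions):

# Minimal sketch: poll the default operational shard on ODL1 until it reports a
# leader, instead of proceeding with connectivity checks right after restarting
# all nodes. Port 8181, admin/admin and the 300s budget are assumptions.
import base64, json, time, urllib.request

ODL1 = "10.30.170.109"                      # from the logs in this ticket
SHARD = "member-1-shard-default-operational"
URL = (f"http://{ODL1}:8181/jolokia/read/org.opendaylight.controller:"
       f"type=DistributedOperationalDatastore,Category=Shards,name={SHARD}")
AUTH = base64.b64encode(b"admin:admin").decode()  # assumed default credentials

def shard_state():
    req = urllib.request.Request(URL, headers={"Authorization": "Basic " + AUTH})
    with urllib.request.urlopen(req, timeout=10) as resp:
        value = json.load(resp).get("value", {})
    return value.get("RaftState"), value.get("Leader")

deadline = time.time() + 300
while time.time() < deadline:
    try:
        state, leader = shard_state()
        print(f"RaftState={state} Leader={leader}")
        if state in ("Leader", "Follower") and leader:
            break
    except OSError as e:
        print(f"query failed: {e}")
    time.sleep(10)
else:
    raise RuntimeError(f"{SHARD} never settled on a leader")

If this loop times out, the failure is the same leaderless condition analysed in the comments below, not a dataplane connectivity problem.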
Prasad Soudi December 18, 2019 at 10:14 AM
2019-11-27T12:16:40,485 Karaf started
2019-11-27T12:16:57,034 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-19 | NettyTransport | 47 - com.typesafe.akka.slf4j - 2.5.25 | Remote connection to [null] failed with java.net.ConnectException: Connection refused: /10.30.170.124:2550
2019-11-27T12:16:57,035 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-19 | NettyTransport | 47 - com.typesafe.akka.slf4j - 2.5.25 | Remote connection to [null] failed with java.net.ConnectException: Connection refused: /10.30.170.116:2550
2019-11-27T12:17:02,909 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-36 | ReliableDeliverySupervisor | 47 - com.typesafe.akka.slf4j - 2.5.25 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.170.116:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.170.116:2550]] Caused by: [java.net.ConnectException: Connection refused: /10.30.170.116:2550]
2019-11-27T12:17:02,910 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-34 | ReliableDeliverySupervisor | 47 - com.typesafe.akka.slf4j - 2.5.25 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.170.124:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.170.124:2550]] Caused by: [java.net.ConnectException: Connection refused: /10.30.170.124:2550]
2019-11-27T12:17:08,912 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | Cluster(akka://opendaylight-cluster-data) | 47 - com.typesafe.akka.slf4j - 2.5.25 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.109:2550] - Node [akka.tcp://opendaylight-cluster-data@10.30.170.109:2550] is JOINING itself (with roles [member-1, dc-default]) and forming new cluster
2019-11-27T12:17:08,915 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | Cluster(akka://opendaylight-cluster-data) | 47 - com.typesafe.akka.slf4j - 2.5.25 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.109:2550] - is the new leader among reachable nodes (more leaders may exist)
2019-11-27T12:17:08,931 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | Cluster(akka://opendaylight-cluster-data) | 47 - com.typesafe.akka.slf4j - 2.5.25 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.109:2550] - Leader is moving node [akka.tcp://opendaylight-cluster-data@10.30.170.109:2550] to [Up]
2019-11-27T12:18:28,136 | ERROR | Blueprint Extender: 1 | AbstractDataStore | 246 - org.opendaylight.controller.sal-distributed-datastore - 1.10.2 | Shard leaders failed to settle in 90 seconds, giving up
From the logs: connections to 10.30.170.124 and 10.30.170.116 were refused, shard leaders failed to settle in 90 seconds, and 10.30.170.109:2550 went on to form its own cluster. The shards then remained in the Candidate state and never settled, because they can never win an election: with 10.30.170.124 and 10.30.170.116 unreachable there is no majority of the three voting members.
In this state no transactions can be processed, as there is no leader. That is why it throws "org.opendaylight.controller.cluster.datastore.exceptions.NoShardLeaderException".
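This can be confirmed from ODL1 itself. A minimal sketch that reads the Akka cluster membership over Jolokia; the MBean name akka:type=Cluster is the Akka default, and port 8181 with admin/admin credentials are assumptions about this deployment. With ODL2 and ODL3 down it should list only member-1's address, with the other two absent or unreachable:

# Minimal sketch: read the cluster membership ODL1 has converged on via Jolokia.
# MBean/attribute names ("akka:type=Cluster", "Members", "Unreachable") are the
# Akka defaults; port and credentials are assumed defaults.
import base64, json, urllib.request

ODL1 = "10.30.170.109"   # from the logs above
AUTH = base64.b64encode(b"admin:admin").decode()  # assumed default credentials

def jolokia_read(mbean, attribute):
    url = f"http://{ODL1}:8181/jolokia/read/{mbean}/{attribute}"
    req = urllib.request.Request(url, headers={"Authorization": "Basic " + AUTH})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp).get("value")

print("Members    :", jolokia_read("akka:type=Cluster", "Members"))
print("Unreachable:", jolokia_read("akka:type=Cluster", "Unreachable"))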
Can you please tell us more about the test case you were trying to run?
For further analysis we need the karaf logs, akka logs, akka.conf, module-shards.conf and modules.conf from all 3 CICs.
In a 3-node setup we are seeing exceptions after repeatedly bringing ODL nodes up and down.
Logs:
https://jenkins.opendaylight.org/releng/view/netvirt-csit/job/netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-sodium/393/
Karaf Logs:
https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-sodium/393/odl_1/odl1_karaf.log.gz
Karaf exceptions:
org.opendaylight.controller.cluster.datastore.exceptions.NoShardLeaderException: Could not process ready local transaction member-1-datastore-operational-fe-4-txn-403-0. Shard member-1-shard-default-operational currently has no leader. Try again later.
at org.opendaylight.controller.cluster.datastore.ShardTransactionMessageRetrySupport$MessageInfo.timedOut(ShardTransactionMessageRetrySupport.java:102) ~[246:org.opendaylight.controller.sal-distributed-datastore:1.10.2]
at org.opendaylight.controller.cluster.datastore.ShardTransactionMessageRetrySupport.onTimerMessage(ShardTransactionMessageRetrySupport.java:72) ~[246:org.opendaylight.controller.sal-distributed-datastore:1.10.2]
at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:369) ~[246:org.opendaylight.controller.sal-distributed-datastore:1.10.2]
at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270) ~[231:org.opendaylight.controller.sal-akka-raft:1.10.2]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) ~[41:com.typesafe.akka.actor:2.5.25]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) ~[41:com.typesafe.akka.actor:2.5.25]
at scala.PartialFunction.applyOrElse(PartialFunction.scala:127) ~[481:org.scala-lang.scala-library:2.12.10.v20190904-150159-VFINAL-61701c2]
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126) ~[481:org.scala-lang.scala-library:2.12.10.v20190904-150159-VFINAL-61701c2]
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20) ~[41:com.typesafe.akka.actor:2.5.25]
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:172) ~[481:org.scala-lang.scala-library:2.12.10.v20190904-150159-VFINAL-61701c2]
at scala.PartialFunction.applyOrElse(PartialFunction.scala:127) ~[481:org.scala-lang.scala-library:2.12.10.v20190904-150159-VFINAL-61701c2]
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126) ~[481:org.scala-lang.scala-library:2.12.10.v20190904-150159-VFINAL-61701c2]
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20) ~[41:com.typesafe.akka.actor:2.5.25]
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:172) ~[481:org.scala-lang.scala-library:2.12.10.v20190904-150159-VFINAL-61701c2]
at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:93) ~[239:org.opendaylight.controller.sal-clustering-commons:1.10.2]
at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:29) ~[239:org.opendaylight.controller.sal-clustering-commons:1.10.2]
at scala.PartialFunction.applyOrElse(PartialFunction.scala:127) ~[481:org.scala-lang.scala-library:2.12.10.v20190904-150159-VFINAL-61701c2]
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126) ~[481:org.scala-lang.scala-library:2.12.10.v20190904-150159-VFINAL-61701c2]
at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:30) ~[481:org.scala-lang.scala-library:2.12.10.v20190904-150159-VFINAL-61701c2]
at akka.actor.Actor.aroundReceive(Actor.scala:539) ~[41:com.typesafe.akka.actor:2.5.25]
at akka.actor.Actor.aroundReceive$(Actor.scala:537) ~[41:com.typesafe.akka.actor:2.5.25]
at akka.persistence.AbstractPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:455) ~[44:com.typesafe.akka.persistence:2.5.25]
at akka.persistence.Eventsourced$$anon$4.stateReceive(Eventsourced.scala:898) ~[44:com.typesafe.akka.persistence:2.5.25]
at akka.persistence.Eventsourced.aroundReceive(Eventsourced.scala:250) ~[44:com.typesafe.akka.persistence:2.5.25]
at akka.persistence.Eventsourced.aroundReceive$(Eventsourced.scala:249) ~[44:com.typesafe.akka.persistence:2.5.25]
at akka.persistence.AbstractPersistentActor.aroundReceive(PersistentActor.scala:455) ~[44:com.typesafe.akka.persistence:2.5.25]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:612) ~[41:com.typesafe.akka.actor:2.5.25]
at akka.actor.ActorCell.invoke(ActorCell.scala:581) ~[41:com.typesafe.akka.actor:2.5.25]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:268) ~[41:com.typesafe.akka.actor:2.5.25]
at akka.dispatch.Mailbox.run(Mailbox.scala:229) ~[41:com.typesafe.akka.actor:2.5.25]
at akka.dispatch.Mailbox.exec(Mailbox.scala:241) ~[41:com.typesafe.akka.actor:2.5.25]
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [41:com.typesafe.akka.actor:2.5.25]
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [41:com.typesafe.akka.actor:2.5.25]
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [41:com.typesafe.akka.actor:2.5.25]
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [41:com.typesafe.akka.actor:2.5.25]