Exceptions seen: org.opendaylight.controller.cluster.datastore.exceptions.NoShardLeaderException: Could not process ready local transaction member-1-datastore-operational-fe-4-txn-403-0.

Description

In a 3-node setup, we are seeing exceptions after repeatedly bringing ODL up and down.

 

Logs:

https://jenkins.opendaylight.org/releng/view/netvirt-csit/job/netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-sodium/393/

 

 

Karaf Logs:

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-sodium/393/odl_1/odl1_karaf.log.gz

 

Karaf exceptions:

org.opendaylight.controller.cluster.datastore.exceptions.NoShardLeaderException: Could not process ready local transaction member-1-datastore-operational-fe-4-txn-403-0. Shard member-1-shard-default-operational currently has no leader. Try again later.
at org.opendaylight.controller.cluster.datastore.ShardTransactionMessageRetrySupport$MessageInfo.timedOut(ShardTransactionMessageRetrySupport.java:102) ~[246:org.opendaylight.controller.sal-distributed-datastore:1.10.2]
at org.opendaylight.controller.cluster.datastore.ShardTransactionMessageRetrySupport.onTimerMessage(ShardTransactionMessageRetrySupport.java:72) ~[246:org.opendaylight.controller.sal-distributed-datastore:1.10.2]
at org.opendaylight.controller.cluster.datastore.Shard.handleNonRaftCommand(Shard.java:369) ~[246:org.opendaylight.controller.sal-distributed-datastore:1.10.2]
at org.opendaylight.controller.cluster.raft.RaftActor.handleCommand(RaftActor.java:270) ~[231:org.opendaylight.controller.sal-akka-raft:1.10.2]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) ~[41:com.typesafe.akka.actor:2.5.25]
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) ~[41:com.typesafe.akka.actor:2.5.25]
at scala.PartialFunction.applyOrElse(PartialFunction.scala:127) ~[481:org.scala-lang.scala-library:2.12.10.v20190904-150159-VFINAL-61701c2]
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126) ~[481:org.scala-lang.scala-library:2.12.10.v20190904-150159-VFINAL-61701c2]
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20) ~[41:com.typesafe.akka.actor:2.5.25]
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:172) ~[481:org.scala-lang.scala-library:2.12.10.v20190904-150159-VFINAL-61701c2]
at scala.PartialFunction.applyOrElse(PartialFunction.scala:127) ~[481:org.scala-lang.scala-library:2.12.10.v20190904-150159-VFINAL-61701c2]
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126) ~[481:org.scala-lang.scala-library:2.12.10.v20190904-150159-VFINAL-61701c2]
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20) ~[41:com.typesafe.akka.actor:2.5.25]
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:172) ~[481:org.scala-lang.scala-library:2.12.10.v20190904-150159-VFINAL-61701c2]
at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:93) ~[239:org.opendaylight.controller.sal-clustering-commons:1.10.2]
at org.opendaylight.controller.cluster.common.actor.MeteringBehavior.apply(MeteringBehavior.java:29) ~[239:org.opendaylight.controller.sal-clustering-commons:1.10.2]
at scala.PartialFunction.applyOrElse(PartialFunction.scala:127) ~[481:org.scala-lang.scala-library:2.12.10.v20190904-150159-VFINAL-61701c2]
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:126) ~[481:org.scala-lang.scala-library:2.12.10.v20190904-150159-VFINAL-61701c2]
at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:30) ~[481:org.scala-lang.scala-library:2.12.10.v20190904-150159-VFINAL-61701c2]
at akka.actor.Actor.aroundReceive(Actor.scala:539) ~[41:com.typesafe.akka.actor:2.5.25]
at akka.actor.Actor.aroundReceive$(Actor.scala:537) ~[41:com.typesafe.akka.actor:2.5.25]
at akka.persistence.AbstractPersistentActor.akka$persistence$Eventsourced$$super$aroundReceive(PersistentActor.scala:455) ~[44:com.typesafe.akka.persistence:2.5.25]
at akka.persistence.Eventsourced$$anon$4.stateReceive(Eventsourced.scala:898) ~[44:com.typesafe.akka.persistence:2.5.25]
at akka.persistence.Eventsourced.aroundReceive(Eventsourced.scala:250) ~[44:com.typesafe.akka.persistence:2.5.25]
at akka.persistence.Eventsourced.aroundReceive$(Eventsourced.scala:249) ~[44:com.typesafe.akka.persistence:2.5.25]
at akka.persistence.AbstractPersistentActor.aroundReceive(PersistentActor.scala:455) ~[44:com.typesafe.akka.persistence:2.5.25]
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:612) ~[41:com.typesafe.akka.actor:2.5.25]
at akka.actor.ActorCell.invoke(ActorCell.scala:581) ~[41:com.typesafe.akka.actor:2.5.25]
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:268) ~[41:com.typesafe.akka.actor:2.5.25]
at akka.dispatch.Mailbox.run(Mailbox.scala:229) ~[41:com.typesafe.akka.actor:2.5.25]
at akka.dispatch.Mailbox.exec(Mailbox.scala:241) ~[41:com.typesafe.akka.actor:2.5.25]
at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) [41:com.typesafe.akka.actor:2.5.25]
at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) [41:com.typesafe.akka.actor:2.5.25]
at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) [41:com.typesafe.akka.actor:2.5.25]
at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) [41:com.typesafe.akka.actor:2.5.25]

Environment

None

Activity


Prasad Soudi December 24, 2019 at 6:38 AM

In the test case Bring Up All Instances, why is only ODL1 started while ODL2 and ODL3 are not?

Because ODL1 was started alone, all of these issues began to happen.

If all the nodes had been started along with ODL1, this issue would not have happened.

Prasad Soudi December 19, 2019 at 10:19 AM

In the test case Bring Up All Instances, why is only ODL1 started while ODL2 and ODL3 are not?

ODL1 started at: Nov 27, 2019 12:16:39 PM (org.apache.karaf.main.lock.SimpleFileLock lock)

 

ODL1 waited for nodes ODL2 and ODL3 for some time:
2019-11-27T12:16:57,034 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-19 | NettyTransport | 47 - com.typesafe.akka.slf4j - 2.5.25 | Remote connection to [null] failed with java.net.ConnectException: Connection refused: /10.30.170.124:2550
2019-11-27T12:16:57,035 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-19 | NettyTransport | 47 - com.typesafe.akka.slf4j - 2.5.25 | Remote connection to [null] failed with java.net.ConnectException: Connection refused: /10.30.170.116:2550

 

As ODL2 and ODL3 were not up, ODL1 formed its own cluster:
2019-11-27T12:17:08,912 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | Cluster(akka://opendaylight-cluster-data) | 47 - com.typesafe.akka.slf4j - 2.5.25 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.109:2550] - Node [akka.tcp://opendaylight-cluster-data@10.30.170.109:2550] is JOINING itself (with roles [member-1, dc-default]) and forming new cluster

 

If all the nodes had been started along with ODL1, this issue would not have happened.
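For reference, the seed-node list that governs how a member joins (or, failing that, forms its own cluster) lives in each node's akka.conf. A sketch for this 3-node cluster, using the member IPs and actor-system name visible in the logs; the exact file layout is assumed typical of an ODL Sodium deployment:

```hocon
odl-cluster-data {
  akka {
    remote {
      netty.tcp {
        hostname = "10.30.170.109"   # this member's own address (ODL1)
        port = 2550
      }
    }
    cluster {
      # All three configured members; a node retries these seeds before
      # (eventually) a seed node joins itself and forms a new cluster.
      seed-nodes = ["akka.tcp://opendaylight-cluster-data@10.30.170.109:2550",
                    "akka.tcp://opendaylight-cluster-data@10.30.170.116:2550",
                    "akka.tcp://opendaylight-cluster-data@10.30.170.124:2550"]
      roles = ["member-1"]
    }
  }
}
```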

Srinivas Rachakonda December 18, 2019 at 11:49 AM
Edited

All the karaf logs are at:

https://logs.opendaylight.org/releng/vex-yul-odl-jenkins-1/netvirt-csit-3node-0cmb-1ctl-2cmp-openstack-queens-upstream-stateful-sodium/393/

 

The exceptions are seen at multiple locations. Broadly, the test case has the steps below:
1) Create All Controller Sessions
2) Take Down Leader Of Default Shard and bring it up
3) Create Networks, Create Subnets For net_1, Create Subnets For net_2, Add Ssh Allow All Rule
4) Create Vm Instances For net_1, Create Vm Instances For net_2, Check Vm Instances Have Ip Address
5) Create Router router_2, router_3 and add interfaces to this router
6) Take Down ODL1
7) Bring up ODL1 and take down ODL2
8) Bring up ODL2 and take down ODL3
9) Bring up ODL3
10) Ping all Vm Instance1 In net_2 From net_1, Ping all Vm Instance1 In net_1 From net_2, Connectivity Tests From Vm Instance1 In net_1 In Healthy Cluster
11) Take Down ODL1 and ODL2, Connectivity Tests From Vm Instance1 In net_1 With Two ODLs Down
12) Bring Up ODL1 and ODL2
13) Take Down ODL2 and ODL3, Connectivity Tests From Vm Instance1 In net_2
14) Bring up ODL2 and ODL3
15) Take down all ODL nodes
16) Bring up all ODL nodes >>>> this test case fails
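Between the take-down/bring-up steps above, shard leadership can be checked over Jolokia before driving traffic, so the test fails fast instead of hitting transaction timeouts. A minimal sketch: the MBean name follows the standard ODL shard naming seen in these logs, but the helper function and the sample payload are illustrative assumptions, not ODL code.

```python
import json

# Shard MBean readable via GET http://<node>:8181/jolokia/read/<SHARD_MBEAN>
# (name pattern matches the shard in the exception: member-1-shard-default-operational).
SHARD_MBEAN = ("org.opendaylight.controller:type=DistributedOperationalDatastore,"
               "Category=Shards,name=member-1-shard-default-operational")

def shard_has_leader(jolokia_response: str) -> bool:
    """Return True if the shard reports a current leader (hypothetical helper)."""
    value = json.loads(jolokia_response)["value"]
    # A shard stuck without quorum reports no Leader and stays in Candidate.
    return bool(value.get("Leader")) and value.get("RaftState") != "Candidate"

# Illustrative payload resembling a node stuck without quorum,
# i.e. the state in which NoShardLeaderException is thrown.
sample = json.dumps({"value": {"Leader": "", "RaftState": "Candidate"}})
print(shard_has_leader(sample))  # False
```

In a healthy cluster the same read would show a non-empty Leader and a Leader/Follower RaftState, and the helper returns True.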

Prasad Soudi December 18, 2019 at 10:14 AM

2019-11-27T12:16:40,485 Karaf started
2019-11-27T12:16:57,034 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-19 | NettyTransport | 47 - com.typesafe.akka.slf4j - 2.5.25 | Remote connection to [null] failed with java.net.ConnectException: Connection refused: /10.30.170.124:2550
2019-11-27T12:16:57,035 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-19 | NettyTransport | 47 - com.typesafe.akka.slf4j - 2.5.25 | Remote connection to [null] failed with java.net.ConnectException: Connection refused: /10.30.170.116:2550
2019-11-27T12:17:02,909 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-36 | ReliableDeliverySupervisor | 47 - com.typesafe.akka.slf4j - 2.5.25 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.170.116:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.170.116:2550]] Caused by: [java.net.ConnectException: Connection refused: /10.30.170.116:2550]
2019-11-27T12:17:02,910 | WARN | opendaylight-cluster-data-akka.actor.default-dispatcher-34 | ReliableDeliverySupervisor | 47 - com.typesafe.akka.slf4j - 2.5.25 | Association with remote system [akka.tcp://opendaylight-cluster-data@10.30.170.124:2550] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://opendaylight-cluster-data@10.30.170.124:2550]] Caused by: [java.net.ConnectException: Connection refused: /10.30.170.124:2550]
2019-11-27T12:17:08,912 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | Cluster(akka://opendaylight-cluster-data) | 47 - com.typesafe.akka.slf4j - 2.5.25 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.109:2550] - Node [akka.tcp://opendaylight-cluster-data@10.30.170.109:2550] is JOINING itself (with roles [member-1, dc-default]) and forming new cluster
2019-11-27T12:17:08,915 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | Cluster(akka://opendaylight-cluster-data) | 47 - com.typesafe.akka.slf4j - 2.5.25 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.109:2550] - is the new leader among reachable nodes (more leaders may exist)
2019-11-27T12:17:08,931 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | Cluster(akka://opendaylight-cluster-data) | 47 - com.typesafe.akka.slf4j - 2.5.25 | Cluster Node [akka.tcp://opendaylight-cluster-data@10.30.170.109:2550] - Leader is moving node [akka.tcp://opendaylight-cluster-data@10.30.170.109:2550] to [Up]
2019-11-27T12:18:28,136 | ERROR | Blueprint Extender: 1 | AbstractDataStore | 246 - org.opendaylight.controller.sal-distributed-datastore - 1.10.2 | Shard leaders failed to settle in 90 seconds, giving up

 

The logs show connections refused from 10.30.170.124 and 10.30.170.116, and shard leaders failed to settle within 90 seconds. 10.30.170.109:2550 then started to form its own cluster. The shards remained in the Candidate state thereafter and never settled, because they could never get a majority while 10.30.170.124 and 10.30.170.116 were unreachable.

 

In this state, the node cannot process any transactions because there is no leader; that is why it throws "org.opendaylight.controller.cluster.datastore.exceptions.NoShardLeaderException".
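The quorum arithmetic behind this is simple. A minimal sketch (plain Python, not ODL code) of the Raft majority rule: a candidate needs votes from a strict majority of the *configured* cluster, not merely of the members currently reachable.

```python
def can_elect_leader(cluster_size: int, reachable: int) -> bool:
    """A leader needs votes from more than half of the configured members."""
    return reachable > cluster_size // 2

# member-1 alone in a 3-node cluster: 1 reachable of 3 configured members,
# so no leader can be elected and every transaction fails with
# NoShardLeaderException until quorum returns.
print(can_elect_leader(3, 1))  # False
print(can_elect_leader(3, 2))  # True: one more node up restores quorum
```

This is why bringing up any one of ODL2/ODL3 would have let the shards settle, while ODL1 on its own stays in Candidate indefinitely.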

 

Can you please tell us more about the test case you were running?

For further analysis, we need the karaf.log, akka.log, akka.conf, module-shards.conf, and modules.conf files from all 3 controller nodes.

Details


Created November 28, 2019 at 6:52 AM
Updated February 6, 2025 at 2:12 PM