Inject DataBroker only when all shards have leaders

Description

We are hitting an issue in genius on stable/oxygen, where randomly idmanager-impl bundle does not come up, as the datastore read in the blueprint initialization code was failing with NoShardLeaderException.

https://lists.opendaylight.org/pipermail/genius-dev/2019-January/003554.html

While applications work on putting propoer failure handling, was thinking it would be better if dataBroker is injected only when all shards have leaders, than having a 2 min timeout.

 

ODL :: genius :: idmanager-impl (260)

-------------------------------------

Status: Failure

Blueprint

1/8/19 10:47 AM

Exception:

 

org.osgi.service.blueprint.container.ComponentDefinitionException: Error when instantiating bean idManager of class org.opendaylight.genius.idmanager.IdManager org.osgi.service.blueprint.container.ComponentDefinitionException: org.osgi.service.blueprint.container.ComponentDefinitionException: Error when instantiating bean idManager of class org.opendaylight.genius.idmanager.IdManager                 at org.apache.aries.blueprint.container.ServiceRecipe.createService(ServiceRecipe.java:310)                 at org.apache.aries.blueprint.container.ServiceRecipe.internalGetService(ServiceRecipe.java:252)                 at org.apache.aries.blueprint.container.ServiceRecipe.internalCreate(ServiceRecipe.java:149)                 at org.apache.aries.blueprint.di.AbstractRecipe$1.call(AbstractRecipe.java:79)                 at java.util.concurrent.FutureTask.run(FutureTask.java:266)                 at org.apache.aries.blueprint.di.AbstractRecipe.create(AbstractRecipe.java:88)                 at org.apache.aries.blueprint.container.BlueprintRepository.createInstances(BlueprintRepository.java:255)                 at org.apache.aries.blueprint.container.BlueprintRepository.createAll(BlueprintRepository.java:186)                 at org.apache.aries.blueprint.container.BlueprintContainerImpl.instantiateEagerComponents(BlueprintContainerImpl.java:704)                 at org.apache.aries.blueprint.container.BlueprintContainerImpl.doRun(BlueprintContainerImpl.java:410)                 at org.apache.aries.blueprint.container.BlueprintContainerImpl.run(BlueprintContainerImpl.java:275)                 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)                 at java.util.concurrent.FutureTask.run(FutureTask.java:266)                 at org.apache.aries.blueprint.container.ExecutorServiceWrapper.run(ExecutorServiceWrapper.java:106)                 at org.apache.aries.blueprint.utils.threading.impl.DiscardableRunnable.run(DiscardableRunnable.java:48)                 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)                 at java.util.concurrent.FutureTask.run(FutureTask.java:266)                 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)                 at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)                 at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)                 at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)                 at java.lang.Thread.run(Thread.java:748) Caused by: org.osgi.service.blueprint.container.ComponentDefinitionException: Error when instantiating bean idManager of class org.opendaylight.genius.idmanager.IdManager                 at org.apache.aries.blueprint.container.BeanRecipe.wrapAsCompDefEx(BeanRecipe.java:361)                 at org.apache.aries.blueprint.container.BeanRecipe.getInstanceFromType(BeanRecipe.java:351)                 at org.apache.aries.blueprint.container.BeanRecipe.getInstance(BeanRecipe.java:282)                 at org.apache.aries.blueprint.container.BeanRecipe.internalCreate2(BeanRecipe.java:830)                 at org.apache.aries.blueprint.container.BeanRecipe.internalCreate(BeanRecipe.java:811)                 at org.apache.aries.blueprint.di.AbstractRecipe$1.call(AbstractRecipe.java:79)                 at java.util.concurrent.FutureTask.run(FutureTask.java:266)                 at org.apache.aries.blueprint.di.AbstractRecipe.create(AbstractRecipe.java:88)                 at org.apache.aries.blueprint.di.RefRecipe.internalCreate(RefRecipe.java:62)                 at org.apache.aries.blueprint.di.AbstractRecipe.create(AbstractRecipe.java:106)                 at org.apache.aries.blueprint.container.ServiceRecipe.createService(ServiceRecipe.java:285)                 ... 21 more Caused by: ReadFailedException{message=Error executeRead ReadData for path /(urn:opendaylight:genius:idmanager?revision=2016-04-06)id-pools, errorList=[RpcError [message=Error executeRead ReadData for path /(urn:opendaylight:genius:idmanager?revision=2016-04-06)id-pools, severity=ERROR, errorType=APPLICATION, tag=operation-failed, applicationTag=null, info=null, cause=org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard 192.168.70.1-shard-default-config currently has no leader. Try again later.]]}                 at org.opendaylight.controller.cluster.datastore.NoOpTransactionContext.executeRead(NoOpTransactionContext.java:71)                 at org.opendaylight.controller.cluster.datastore.TransactionProxy$1.invoke(TransactionProxy.java:98)                 at org.opendaylight.controller.cluster.datastore.TransactionContextWrapper.executePriorTransactionOperations(TransactionContextWrapper.java:194)                 at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory.onFindPrimaryShardFailure(AbstractTransactionContextFactory.java:109)                 at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory.access$100(AbstractTransactionContextFactory.java:37)                 at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory$1.onComplete(AbstractTransactionContextFactory.java:136)                 at org.opendaylight.controller.cluster.datastore.AbstractTransactionContextFactory$1.onComplete(AbstractTransactionContextFactory.java:130)                 at akka.dispatch.OnComplete.internal(Future.scala:260)                 at akka.dispatch.OnComplete.internal(Future.scala:258)                 at akka.dispatch.japi$CallbackBridge.apply(Future.scala:188)                 at akka.dispatch.japi$CallbackBridge.apply(Future.scala:185)                 at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:60)                 at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)                 at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:91)                 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)                 at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:81)                 at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:91)                 at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)                 at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:43)                 at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)                 at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)                 at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)                 at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) Caused by: org.opendaylight.controller.md.sal.common.api.data.DataStoreUnavailableException: Shard 192.168.70.1-shard-default-config currently has no leader. Try again later.

 

Activity

Show:

Robert Varga July 31, 2020 at 1:29 PM

Most of the refactoring is done, such that core services do not really rely on blueprint. Most notably AbstractDataStore is started asynchronously and published to Service Registry only after shards have settled. There is still a notable gap of Distributed EOS, which does not wait for settle:

2020-07-31T15:18:16,998 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-17 | OSGiDistributedEntityOwnershipService | 139 - org.opendaylight.controller.sal-distributed-eos - 2.0.4.SNAPSHOT | Distributed Entity Ownership Service starting 2020-07-31T15:18:17,009 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-17 | OSGiDistributedEntityOwnershipService | 139 - org.opendaylight.controller.sal-distributed-eos - 2.0.4.SNAPSHOT | Distributed Entity Ownership Service started 2020-07-31T15:18:17,012 | INFO | opendaylight-cluster-data-shard-dispatcher-48 | EntityOwnershipShard | 136 - org.opendaylight.controller.sal-clustering-commons - 2.0.4.SNAPSHOT | Shard created : member-1-shard-entity-ownership-operational, persistent : false 2020-07-31T15:18:17,013 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-28 | DistributedEntityOwnershipService | 139 - org.opendaylight.controller.sal-distributed-eos - 2.0.4.SNAPSHOT | Successfully created entity-ownership shard 2020-07-31T15:18:17,014 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-28 | RoleChangeNotifier | 136 - org.opendaylight.controller.sal-clustering-commons - 2.0.4.SNAPSHOT | RoleChangeNotifier:akka.tcp://opendaylight-cluster-data@127.0.0.1:2550/user/shardmanager-operational/member-1-s hard-entity-ownership-operational/member-1-shard-entity-ownership-operational-notifier#533402702 created and ready for shard:member-1-shard-entity-ownership-operational 2020-07-31T15:18:17,018 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-17 | OSGiDOMDataBroker | 138 - org.opendaylight.controller.sal-distributed-datastore - 2.0.4.SNAPSHOT | DOM Data Broker starting 2020-07-31T15:18:17,039 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-17 | ConcurrentDOMDataBroker | 185 - org.opendaylight.yangtools.util - 5.0.5 | ThreadFactory created: CommitFutures 2020-07-31T15:18:17,051 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-17 | OSGiDOMDataBroker | 138 - org.opendaylight.controller.sal-distributed-datastore - 2.0.4.SNAPSHOT | DOM Data Broker started 2020-07-31T15:18:17,051 | INFO | opendaylight-cluster-data-shard-dispatcher-48 | EntityOwnershipShard | 136 - org.opendaylight.controller.sal-clustering-commons - 2.0.4.SNAPSHOT | Starting recovery for member-1-shard-entity-ownership-operational with journal batch size 1 2020-07-31T15:18:17,055 | INFO | opendaylight-cluster-data-shard-dispatcher-48 | EntityOwnershipShard | 136 - org.opendaylight.controller.sal-clustering-commons - 2.0.4.SNAPSHOT | member-1-shard-entity-ownership-operational: Recovery completed - Switching actor to Follower - last log index = -1, last l og term = -1, snapshot index = -1, snapshot term = -1, journal size = 0 2020-07-31T15:18:17,059 | INFO | opendaylight-cluster-data-akka.actor.default-dispatcher-35 | RoleChangeNotifier | 136 - org.opendaylight.controller.sal-clustering-commons - 2.0.4.SNAPSHOT | RoleChangeNotifier for member-1-shard-entity-ownership-operational , received role change from null to Follower 2020-07-31T15:18:17,060 | INFO | opendaylight-cluster-data-shard-dispatcher-47 | EntityOwnershipShard | 136 - org.opendaylight.controller.sal-clustering-commons - 2.0.4.SNAPSHOT | member-1-shard-entity-ownership-operational (Candidate): Starting new election term 1 2020-07-31T15:18:17,061 | INFO | opendaylight-cluster-data-shard-dispatcher-47 | EntityOwnershipShard | 136 - org.opendaylight.controller.sal-clustering-commons - 2.0.4.SNAPSHOT | member-1-shard-entity-ownership-operational (Follower) :- Switching from behavior Follower to Candidate, election term: 1

i.e. DistributedEntityOwnershipService itself does not wait for settle before being published. This will need to be fixed in a follow-up patch.

Robert Varga April 28, 2020 at 10:24 AM

So the general approach we will take to this is that we will completely eliminate use of BP in mdsal/controller in favor of using OSGi DS. This work will result in the system being decomposed to smaller components (as opposed to BP containers). Each component can (and will) start as soon as it can. There will be no deadlines to activation, as OSGi DS is inherently asynchronous and aligned with OSGi lifecycle.

This will result in services being available as components, and thus the choice of DI framework with be left to downstreams – who can opt to switch to OSGi DS, too, eliminating this need for battling timeouts.

 

Robert Varga September 16, 2019 at 10:38 PM

There are different trade-offs here and depending on how the applications are wired, and how BP is configured.

Faseela K January 11, 2019 at 4:25 PM

Sure, I am checking all these. We will also work towards adding retries in application side logic.

Please do check if there will be a benefit in making this 2 min timeout increased? Or is it already configurable, so that we can pick a value based on scale of the deployment.

Robert Varga January 11, 2019 at 2:26 PM

This is related to CDS, hence it is not an MD-SAL issue.

Done

Details

Assignee

Reporter

Components

Fix versions

Priority

Created January 11, 2019 at 7:06 AM
Updated January 10, 2021 at 10:23 AM
Resolved September 22, 2020 at 8:21 AM