ElanUtils.waitForTransactionToComplete should not LOG.error for OptimisticLockFailedException

Description

https://bugzilla.redhat.com/show_bug.cgi?id=1567509 requests that this LOG.error() be suppressed:

I'm not entirely convinced yet that it can just be supressed - or if it must be "handled" (retries?) - let's have a closer look, and discuss further here...

Environment

None

Attachments

1

Activity

Show:

Michael Vorburger April 26, 2018 at 11:45 AM

I suddenly had a doubt whether making ElanUtils.waitForTransactionToComplete() only LOG.debug instead of error (in c/70968) could ever hide any problems, so just to re-assure myself I put together the attached FutureGetOnFailureTes.java, which proves - of course, that IFF all callers of ElanUtils.waitForTransactionToComplete() "handle" the CheckedFuture which that returns, which 3/4 callers already did, and c/70968 adds to the 4th, then this is fine.  I've raised c/71417 which adds a @CheckReturnValue to it.

With this, close this issue as the LOG.error() shown on top is now suppressed. There may be other errors now - let's track that in new JIRAs.

Kiran N Upadhyaya April 16, 2018 at 1:35 PM

Michael,

Thanks for posting the patch. I will review them.

A few observations:

1) I dont think that remove() called from both the ElanInterfaceManager and ElanInterfaceStateChangeListener() is causing this conflicting modification. Assuming 1NODE CSIT run, the JOBS for both of them are enqueued on the same key, and coupled with the synchronous submit from waitForOperationalTransaction(), these two submits should not conflict with each other. (I would have rather expected a ModifiedNodeDoesNotExistException). I guess this is already known and the goal here is to suppress the exception.

2) Now, since there is a clear Conflicting Modification from the stack trace, I suspect 

ElanUtils.waitForTransactionToComplete(flowTx) , such that some other job(with a different key) or a non-DJC task might be trying to remove this flow. This is similar to that is linked to this bug. In this case, both the threads are run with different triggers but both the situations are valid. For this, I agree with changing the trace to debug to suppress the exception.

 I'll post further comments, if any, on the reviews.

Thanks!

Kiran N Upadhyaya April 16, 2018 at 1:07 PM
Edited

There was a similar bug that was analysed to be containing conflicting modification between two DJCs with different keys:

https://lf-opendaylight.atlassian.net/browse/NETVIRT-972.

Analysis:

There are two job-coordinators with different keys trying to add the same flow resulting in conflicting modification exception. One would think that these two jobs should have the same key, but that is not the case here.

One job adds flows onto remote DPNs for a new FIB entry (and has the FIB prefix in the job-key). The other job adds remote flows for all FIB entries for a VPN when that VPN establishes footprint on a new DPN (this cannot have prefix in its job-key). Both the keys are valid, and they will sometimes conflict with each other.

The functionality failure caused due to this was resolved a while back by adding retry for one these jobs, but the exception will still be sent into the logs.

Michael Vorburger April 16, 2018 at 12:19 PM

I've added you as reviewer on the (still Draft) changes c/70968 and c/70969; please chime in there or here (better than on the private email) with any comments you have, if any, to contribute to this issue... thanks!

Michael Vorburger April 16, 2018 at 9:17 AM

Changes c/70968 and c/70969 prepared, but still Draft due to Broken World - will publish when Peace is back.

Done

Details

Assignee

Reporter

Labels

Fix versions

Priority

Created April 16, 2018 at 8:51 AM
Updated April 26, 2018 at 11:50 AM
Resolved April 26, 2018 at 11:46 AM