Default tables missing


Arunprakash D February 6, 2019 at 7:03 AM

DeviceContext writes node information to the operational (oper) inventory.

RoleContext is responsible for a device's mastership election and for the ownership-change callback.

FRM registers for ownership callbacks and is notified once a master is elected for a device. In some cases, RoleContext is still performing the ownership election while DeviceContext goes ahead and writes the node information to the oper inventory. Applications listening on the node DTCL then push the default-table flows, which may be dropped by FRM because it has not yet received the ownership callback.

The new implementation has DeviceContext wait for the mastership election to complete and only then write the switch information to the oper inventory. This ensures FRM always has the mastership details when it receives flow information.
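The proposed ordering can be sketched roughly as follows. This is a minimal illustration only, using hypothetical `RoleContext`/`DeviceContext` classes and a plain `CompletableFuture`; the real OpenFlow plugin and MD-SAL entity-ownership APIs differ.

```java
import java.util.concurrent.CompletableFuture;

// Hypothetical sketch: DeviceContext defers the oper-inventory write until
// RoleContext's mastership election completes, so FRM's ownership callback
// fires before any node DTCL listeners see the node and push default flows.
class RoleContext {
    // Completed by the ownership service once mastership is decided.
    final CompletableFuture<Boolean> mastershipElected = new CompletableFuture<>();
}

class DeviceContext {
    private final RoleContext roleContext;

    DeviceContext(RoleContext roleContext) {
        this.roleContext = roleContext;
    }

    // Old behavior wrote immediately; here the write is chained after election.
    CompletableFuture<String> writeNodeToOperInventory(String nodeId) {
        return roleContext.mastershipElected
                .thenApply(isMaster -> isMaster
                        ? "wrote " + nodeId + " to oper inventory"
                        : "skipped " + nodeId + " (not master)");
    }
}
```

The key point is only the ordering: the oper-inventory write is gated on the election future, so no listener can observe the node before mastership is known.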

JamO Luhrsen December 21, 2018 at 6:28 PM

Was able to recreate with the distro created in the debug patch

Here is a link to the robot failure where you can see table=45 was not found on one of the nodes.

this is a clustered job so three karaf logs to look at:

ODL 1
ODL 2
ODL 3

The node with the missing table=45 was the first compute node, and its OVSDB UUID was 46c9d66c-60a6-4da3-8b58-2c4831689600. I think the ODL that owned and was writing to it was ODL 2, based on grepping for that UUID in each karaf.log and seeing things like addPatchPort in ODL 2 and not in the others.

JamO Luhrsen December 20, 2018 at 10:52 PM

Here's the sandbox job to try and reproduce this. Note that it
will be purged in two days, so if we don't hit the problem before then I'll
have to recreate.

JamO Luhrsen December 20, 2018 at 10:46 PM

Looks like the 3-node cluster jobs see this more frequently than others. This tempest one in particular failed because of missing default tables 3 times in the past 30 days (it runs once a day).

I will create a sandbox job that runs without any test cases, in a loop,
using the distribution from the patch (c/78730) and monitor for any
failures due to missing tables.

JamO Luhrsen December 20, 2018 at 10:29 PM

We need to figure out which job sees this most frequently and then try to reproduce it there with this patch. The job given in the description is a gate job. I've checked the non-gate job of the same type, and this problem hasn't happened in at least the last 30 runs.

I'll see what I can figure out, but if anyone else knows please comment.

Status: Done

Created December 10, 2018 at 2:11 PM
Updated July 9, 2019 at 6:04 AM
Resolved July 9, 2019 at 6:04 AM