OFP should proactively check and fix flow table on nodes
Description
Activity

Anil Vishnoi August 21, 2018 at 8:25 AM
https://lf-opendaylight.atlassian.net/browse/OPNFLWPLUG-1028 is something we are targeting for Neon.
Yes, I agree that having a retry mechanism is useful. Currently a user or application can trigger device reconciliation, but there is no service that does this automatically for the user.
https://docs.opendaylight.org/projects/openflowplugin/en/stable-oxygen/specs/reconciliation-cli.html
My initial comment was not against the functionality, but rather about where it should be implemented (at the application level, in FRM or a new application), and that it should be configurable. As of now I can think of the following configuration parameters that we could provide:
(1) Enable / disable service
(2) Retry interval (the periodic interval at which this service triggers synchronization for each device; if it's set to 0, synchronization is triggered on each statistics cycle). Syncing on every statistics cycle can cause performance issues, so we need to provide an interval that users can configure for their environment and scale.
(3) How many attempts it should make to install the missing flows (because if the user installs a wrong flow, you don't want the sync service to keep firing the flow-mod in every iteration).
(4) Whether it should remove a flow that is present in the operational datastore but not in the config datastore (an unknown flow installed by someone externally), and how many statistics cycles we want to wait before removing it (ideally 2).
Probably some more will come up during the development phase.
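To make options (1) to (4) concrete, here is a minimal Python sketch of how such a service might decide what to do in one cycle. It is not OFP or FRM code: `SyncConfig`, `SyncState`, `plan_cycle` and all field names are hypothetical, flows are represented by plain string IDs, and the sets of missing and unknown flows are assumed to have been computed elsewhere. Note that in this sketch the grace period for unknown flows is counted in sync runs, which equal statistics cycles only when the interval is 0.

```python
from dataclasses import dataclass, field


@dataclass
class SyncConfig:
    """Hypothetical knobs mirroring options (1)-(4) in the comment above."""
    enabled: bool = False           # (1) service stays off unless the user opts in
    retry_interval_s: int = 0       # (2) 0 means "sync on every statistics cycle"
    max_attempts: int = 3           # (3) stop re-sending a flow after this many tries
    remove_unknown: bool = False    # (4) delete flows that are absent from config
    unknown_grace_cycles: int = 2   # (4) sync runs to wait before deleting


@dataclass
class SyncState:
    """Per-device bookkeeping kept between cycles."""
    last_sync_s: float = float("-inf")
    attempts: dict = field(default_factory=dict)      # flow id -> install tries
    unknown_seen: dict = field(default_factory=dict)  # flow id -> runs observed


def plan_cycle(cfg, state, now_s, missing, unknown):
    """Return (to_install, to_remove) for one statistics cycle."""
    if not cfg.enabled or now_s - state.last_sync_s < cfg.retry_interval_s:
        return [], []
    state.last_sync_s = now_s

    # Drop counters for flows that are no longer missing.
    state.attempts = {f: n for f, n in state.attempts.items() if f in missing}
    to_install = []
    for flow in sorted(missing):
        if state.attempts.get(flow, 0) < cfg.max_attempts:
            state.attempts[flow] = state.attempts.get(flow, 0) + 1
            to_install.append(flow)

    # Count consecutive sync runs in which each unknown flow was observed.
    state.unknown_seen = {f: state.unknown_seen.get(f, 0) + 1 for f in unknown}
    to_remove = []
    if cfg.remove_unknown:
        to_remove = sorted(f for f, n in state.unknown_seen.items()
                           if n >= cfg.unknown_grace_cycles)
    return to_install, to_remove
```

With `max_attempts=2`, a flow that keeps failing is sent twice and then left alone, which is the behaviour option (3) asks for when the flow itself is wrong.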

Vishal Thapar August 21, 2018 at 6:00 AM
@Anil Vishnoi Agree, we're mainly talking about (1) here. The issues we've seen mostly come down to a flow in FRM not being pushed because leader election is in progress. This mostly happens during bring-up or restart, or when an ask timeout occurs for other reasons and leader re-election kicks in.
For (2), though we don't currently have use cases for external entities, I do agree that a config parameter to enable/disable this functionality would be a good approach.
I believe https://lf-opendaylight.atlassian.net/browse/OPNFLWPLUG-1028 addresses a part of it. Arun mentions a proper fix for OPNFLWPLUG-1028 in Neon. I think a proper fix for 1028 should address the bulk of the issues that require this. If we can prioritize a proper redesign/fix for it, it would help.
I think a retry mechanism would help as a failsafe. Just like 1028, there might be other corner cases where a flow can miss being programmed, not to mention bugs in fixes that are coming in or are already in. We can't assume that we've fixed all possible scenarios where (1) can occur, and having an option to retry would help avoid data plane failures in such scenarios.

Anil Vishnoi August 20, 2018 at 9:31 PM
@Tim Rozet As you mentioned above, ODL as a controller has complete control over flow programming, so then the question is how the flow got deleted from the switch. It means someone externally deleted the flow from the switch, either intentionally (an attack) or accidentally (an admin's mistake). In a production environment it is critical that the application is notified of both of these scenarios, and re-installing the flow/group as the default action is not a recommended approach, because you are probably ignoring a possible network attack. So we can't assume reprogramming as the default action in this case. That's the reason this feature needs to be implemented in a way that the user can enable it if they want it: applications that don't care about an external system removing flows/groups from their controlled switches (like netvirt) can enable this feature by default, while applications that want to take action in these scenarios (like isolating the switch and avoiding any programming on it, as it might be compromised) can keep it disabled.
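The opt-in behaviour described above could look roughly like the following Python sketch: the application is always notified of a missing flow, and the flow is re-installed only when the application has chosen that policy. `MissingFlowAction`, `handle_missing_flow` and the two callbacks are hypothetical names for illustration, not an existing OFP API.

```python
from enum import Enum


class MissingFlowAction(Enum):
    """Hypothetical per-application policy for flows removed externally."""
    NOTIFY_ONLY = "notify-only"   # default: let the application decide
    REINSTALL = "reinstall"       # opt-in: push the flow again


def handle_missing_flow(node_id, flow_id, action, reinstall, notify):
    """Notify in every case; re-install only if the policy allows it.

    `reinstall` and `notify` are caller-supplied callables taking
    (node_id, flow_id). Returns True if a re-install was attempted.
    """
    notify(node_id, flow_id)
    if action is MissingFlowAction.REINSTALL:
        reinstall(node_id, flow_id)
        return True
    return False
```

An application like netvirt would pass `MissingFlowAction.REINSTALL`, while an application that wants to isolate a possibly compromised switch would keep `NOTIFY_ONLY` and react in its own `notify` callback.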

Anil Vishnoi August 20, 2018 at 8:55 PM
I am not sure what the real ask is here. There are two scenarios where a flow won't be present on the switch:
(1) The user configured the flow/group, but FRM failed to install it properly on the switch (during normal installation or during reconciliation).
(2) The flow was installed properly by openflowplugin, but was removed by an external entity (a user, another application sharing the switch, etc.).
FRM is responsible for (1) and should take care of programming flows/groups properly. Many improvements have been made in this area. So if you are hitting an issue where the flow/group is correct but the plugin is not installing it, please report that specific issue with the error message.
https://git.opendaylight.org/gerrit/73531
https://git.opendaylight.org/gerrit/71090
https://git.opendaylight.org/gerrit/70184
A retry mechanism makes sense only for (2), and (2) can be use-case specific as well. So just enforcing a common rule of re-installing a flow/group whenever it's missing from operational might not suit all use cases. That's why I was suggesting implementing this as a new feature in FRM, which the user can enable if they really need it.

Vishal Thapar July 16, 2018 at 6:22 AM
@Anil Vishnoi This is an issue with Netvirt, which uses FRM. The expectation is that applications write flows to ConfigDS and then OFP takes over. Here we're seeing that the flow is added to and present in FRM, but missing from switches. So it's not a case of a flow going missing, but of a flow never showing up. Netvirt today doesn't even track operational flows. If it were using RPCs, it would make sense.
In other words, this is a bug in -frm. Doesn't that count as part of OFP?
Details

If an OVS node is somehow missing a flow that should exist on the switch, currently OFP does not try to proactively fix the issue by reprogramming the flow on the switch. OFP should periodically read all of the flows and groups on the switch and reprogram them if something is missing.
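The periodic check requested here boils down to comparing what is configured against what the switch reports. A minimal Python sketch of that comparison follows, assuming both views are available as dictionaries keyed by a hypothetical `(table_id, flow_id)` tuple. Matching purely on IDs is a simplification; a real implementation would likely need to compare on match and priority, since IDs in the two views may not line up. Groups could be compared the same way.

```python
def diff_flows(config, operational):
    """Compare configured flows with the flows reported by the switch.

    Both arguments map a hypothetical (table_id, flow_id) key to a flow body.
    Returns (missing, unknown):
      missing - configured but not on the switch, candidates for reprogramming
      unknown - on the switch but not configured
    """
    missing = {k: v for k, v in config.items() if k not in operational}
    unknown = {k: v for k, v in operational.items() if k not in config}
    return missing, unknown


# Example: flow "b" in table 0 was configured but never reached the switch.
config = {(0, "a"): {"priority": 10}, (0, "b"): {"priority": 20}}
operational = {(0, "a"): {"priority": 10}, (17, "x"): {"priority": 5}}
missing, unknown = diff_flows(config, operational)
```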