OFP RPC does not work from all instances in the cluster

Description

Regression was detected here:

https://jenkins.opendaylight.org/releng/view/openflowplugin/job/openflowplugin-csit-3node-clustering-only-sodium/

To reproduce, connect an OVS switch to all 3 controllers and send an RPC like this to each instance:
POST http://controller:8181/restconf/operations/sal-flow:add-flow

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<input xmlns="urn:opendaylight:flow:service">
    <node xmlns:inv="urn:opendaylight:inventory">/inv:nodes/inv:node[inv:id="openflow:1"]</node>
    <table_id>0</table_id>
    <priority>2</priority>
    <match>
        <ethernet-match>
            <ethernet-type>
                <type>2048</type>
            </ethernet-type>
        </ethernet-match>
        <ipv4-destination>10.0.1.0/24</ipv4-destination>
    </match>
    <instructions>
        <instruction>
            <order>0</order>
            <apply-actions>
                <action>
                    <output-action>
                        <output-node-connector>1</output-node-connector>
                    </output-action>
                    <order>0</order>
                </action>
            </apply-actions>
        </instruction>
    </instructions>
</input>
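
The reproduction step can be scripted roughly as follows. This is only a sketch under stated assumptions: the controller addresses passed to `send_to_all` are placeholders, and the `admin`/`admin` basic-auth credentials are an assumption about the test setup, not something stated in this report. The URL path and payload are the ones shown above.

```python
import base64
import urllib.error
import urllib.request

RPC_PATH = "/restconf/operations/sal-flow:add-flow"

# Payload copied from the report above.
ADD_FLOW_XML = """\
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<input xmlns="urn:opendaylight:flow:service">
    <node xmlns:inv="urn:opendaylight:inventory">/inv:nodes/inv:node[inv:id="openflow:1"]</node>
    <table_id>0</table_id>
    <priority>2</priority>
    <match>
        <ethernet-match>
            <ethernet-type>
                <type>2048</type>
            </ethernet-type>
        </ethernet-match>
        <ipv4-destination>10.0.1.0/24</ipv4-destination>
    </match>
    <instructions>
        <instruction>
            <order>0</order>
            <apply-actions>
                <action>
                    <output-action>
                        <output-node-connector>1</output-node-connector>
                    </output-action>
                    <order>0</order>
                </action>
            </apply-actions>
        </instruction>
    </instructions>
</input>
"""


def build_add_flow_request(host, user="admin", password="admin", port=8181):
    """Build (but do not send) the add-flow POST for one controller.

    The credentials are assumed defaults; adjust them for your deployment.
    """
    url = f"http://{host}:{port}{RPC_PATH}"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url,
        data=ADD_FLOW_XML.encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/xml",
            "Accept": "application/xml",
            "Authorization": f"Basic {token}",
        },
    )


def send_to_all(hosts, timeout=30):
    """Send the RPC to every controller and collect (status, body) per host."""
    results = {}
    for host in hosts:
        request = build_add_flow_request(host)
        try:
            with urllib.request.urlopen(request, timeout=timeout) as response:
                results[host] = (response.status, response.read().decode("utf-8", "replace"))
        except urllib.error.HTTPError as err:
            # RESTCONF errors arrive as HTTP error responses with an XML body.
            results[host] = (err.code, err.read().decode("utf-8", "replace"))
        except urllib.error.URLError as err:
            # Connection-level failure (refused, unreachable, timed out).
            results[host] = (None, str(err.reason))
    return results
```

Calling `send_to_all(["10.0.0.1", "10.0.0.2", "10.0.0.3"])` (hypothetical addresses) returns one entry per controller, which makes it easy to see which instance rejects the RPC.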

At least one instance fails with this error:

<errors xmlns="urn:ietf:params:xml:ns:yang:ietf-restconf">
    <error>
        <error-type>application</error-type>
        <error-tag>operation-failed</error-tag>
        <error-message>The operation encountered an unexpected error while executing.</error-message>
        <error-info>Ask timed out on [Actor[akka.tcp://opendaylight-cluster-data@10.18.130.162:2550/user/rpc/broker#-516941188]] after [15000 ms]. Message of type [org.opendaylight.controller.remote.rpc.messages.ExecuteRpc]. A typical reason for `AskTimeoutException` is that the recipient actor didn't send a reply.</error-info>
    </error>
</errors>
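
When checking many responses, the error body above can be classified automatically. The following is a minimal sketch using only the Python standard library; the helper names `parse_restconf_errors` and `is_remote_rpc_timeout` are made up for illustration, and the timeout check is a plain substring match on the `error-info` text seen in this report.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the <errors> element in the response above.
RESTCONF_NS = {"rc": "urn:ietf:params:xml:ns:yang:ietf-restconf"}


def parse_restconf_errors(body):
    """Return one dict per <error> element in a RESTCONF <errors> document."""
    root = ET.fromstring(body)
    errors = []
    for err in root.findall("rc:error", RESTCONF_NS):
        errors.append({
            "type": err.findtext("rc:error-type", default="", namespaces=RESTCONF_NS),
            "tag": err.findtext("rc:error-tag", default="", namespaces=RESTCONF_NS),
            "message": err.findtext("rc:error-message", default="", namespaces=RESTCONF_NS),
            "info": err.findtext("rc:error-info", default="", namespaces=RESTCONF_NS),
        })
    return errors


def is_remote_rpc_timeout(error):
    """Heuristic: does this error look like the Akka ask timeout from this report?"""
    info = error["info"]
    return "AskTimeoutException" in info or "Ask timed out" in info
```

An instance whose response contains an error for which `is_remote_rpc_timeout` returns true is showing the symptom described in this issue.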

Environment

None

Attachments

06 Aug 2019, 11:59 AM
06 Aug 2019, 11:57 AM
06 Aug 2019, 11:49 AM

Activity

Luis Gomez Palacios August 13, 2019 at 7:55 PM

I think the issue is fixed now: https://jenkins.opendaylight.org/sandbox/job/openflowplugin-csit-3node-clustering-only-sodium/

Luis Gomez Palacios August 12, 2019 at 12:57 PM

Hi Emmett,

FYI, I started a verification on this patch, considering this is the candidate fix:

https://git.opendaylight.org/gerrit/c/controller/+/83530

Emmett Cox August 8, 2019 at 1:58 PM

The root cause is in the OpsRegistrar changes made as part of my commit.

Those changes dropped the logic that removed and closed old RPC registrations. As a result, the RPCs were not updated correctly and failed when a node was shut down.

I'm in the middle of making code changes to fix the bug.
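
To illustrate why closing old registrations matters, here is a toy model in Python. It is not the OpsRegistrar code and does not use any OpenDaylight API; the class names, the address keys, and the `None` convention for a departed node are all assumptions made for this sketch.

```python
class ToyRegistration:
    """Stand-in for one remote node's RPC registration (hypothetical type)."""

    def __init__(self, address, rpcs):
        self.address = address
        self.rpcs = frozenset(rpcs)
        self.closed = False

    def close(self):
        self.closed = True


class ToyRegistrar:
    """Tracks at most one live registration per remote address."""

    def __init__(self):
        self._regs = {}

    def update_endpoints(self, endpoints):
        """Apply an update mapping address -> RPC names, or None if the node left.

        Old registrations are collected and closed after the new ones are in
        place. Skipping that step leaves stale routes to nodes that are gone.
        """
        stale = []
        for address, rpcs in endpoints.items():
            if rpcs is None:
                old = self._regs.pop(address, None)
            else:
                old = self._regs.get(address)
                self._regs[address] = ToyRegistration(address, rpcs)
            if old is not None:
                stale.append(old)
        for registration in stale:
            registration.close()
        return stale

    def routes_for(self, rpc):
        """Addresses that currently advertise the given RPC."""
        return sorted(addr for addr, reg in self._regs.items() if rpc in reg.rpcs)
```

In this model, an update that reports a node as gone removes its entry and closes the old registration, so `routes_for` no longer returns that node. Without the remove-and-close step, calls could still be routed to the shut-down node and time out.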

Emmett Cox August 6, 2019 at 11:37 AM
Edited

I discovered that I was missing the debug option for the remote RPC logs, so a bit more is being logged now.

I'm going to include the logs from all 3 nodes; give me a minute to add them.

Emmett Cox August 1, 2019 at 12:35 PM

I wonder about the "Connection refused" warning for one of the nodes. I've noticed the same warning in some of the other tests that succeed, but those tests take a few seconds longer to execute. Could it simply be Akka timing out when the call would work if given a few extra seconds? Not that it should take so long, but...

Done

Details

Assignee

Emmett Cox

Reporter

Luis Gomez Palacios

Priority

Highest

Created July 18, 2019 at 9:25 PM

Updated February 6, 2025 at 2:17 PM

Resolved August 14, 2019 at 10:32 AM
