This repository has been archived by the owner on Oct 29, 2021. It is now read-only.

Repeated make helm-delete-vpn helm-install-vpn makes impossible to install any NSC #2255

Open
Bolodya1997 opened this issue Jul 20, 2021 · 5 comments
Labels
bug 🐛 Something isn't working

Comments

@Bolodya1997
Collaborator

Expected Behavior

Repeated make helm-delete-vpn helm-install-vpn should delete the previous vpn-case NSC and NSE and install new ones.

Current Behavior

After some repetitions, make helm-delete-vpn helm-install-vpn starts failing to install the vpn-case NSC, and from then on any other NSC.

Failure Information (for bugs)

Steps to Reproduce

  1. Start NSM.
  2. Run make helm-delete-vpn helm-install-vpn.
  3. Repeat [2] several times.

Important: make helm-install-vpn should be executed while the last vpn instance (nsc/nse) is still being deleted/terminating. If you wait after helm-delete-vpn before running helm-install-vpn, the test seems fine even when repeated multiple times.

Context

  • Vagrant

Failure Logs

Looking at the logs, it seems the request from the manager to the forwarder carries an invalid netnsInode reference, a namespace reference to the terminated vpn container, while I am trying to install the icmp NSC.
@Bolodya1997 Bolodya1997 added the bug 🐛 Something isn't working label Jul 20, 2021
@Bolodya1997 Bolodya1997 added this to Community issues in Issue/PR tracking Jul 20, 2021
@karthick18

karthick18 commented Jul 27, 2021

Thanks for filing this bug. Attaching some logs.
git log
commit fea0424 (HEAD -> master, origin/master, origin/HEAD)
Author: juangascon <31316533+juangascon@users.noreply.github.com>
Date:   Fri May 21 16:26:25 2021 +0200

The attached logs have failures around the time the bug was reproduced.
Easily reproducible with a bash loop like:
while true; do make helm-delete-vpn helm-install-vpn; sleep 5; done
You will then find a helm-install-vpn failure, with the vpn-nsc pod stuck in the Init state and gRPC failures while trying to set up the secure-intranet-connectivity service.

I am also attaching a minor nsm patch to run with the ubuntu-20.04 box and to increase the RAM size for the Vagrant VMs. It has nothing to do with the failure per se.

The failure logs are from the forwarders on node 4 and node 1 (suffixes 4 and 1), along with the nsm-manager container logs from the respective nodes; node 4 hosts the nsc and node 1 hosts the gateway nse pod.
The reference to a stale container in the forwarder while trying to configure vpp (using Ligato) is a concern. From what I have seen, the pod reference even for a subsequent install of icmp-responder contains a namespace reference to that of the vpn nsc.
nsm-failure.tar.gz

nsm-vagrant-patch.tar.gz

@edwarnicke
Member

@karthick18 Question... is there a reason you are using NSM v0.2 vs NSM v1.0?

@karthick18

karthick18 commented Jul 29, 2021

> @karthick18 Question... is there a reason you are using NSM v0.2 vs NSM v1.0?

Not really, I just wanted to use the latest. Having said that, I think I also ran one against v1.0 and saw the same thing. Not completely sure, but I can try it again.

@karthick18

karthick18 commented Aug 18, 2021

I got sidetracked from this issue as it wasn't really a blocker. However, I have since moved to the nsm v0.1.0 tag and reproduced it there as well.
I was also able to fix it based on the issues seen, and ran the repeat loop 30 times without problems.

The patch addresses 3 things: 2 panics in nsmd, and a lockup in the vpp server that seems to be the result of a configurator race in the dataplane between Request and Close. The lockup in vpp leads to vpp connect failures and the dataplane failing with ErrConnectionFailed to the network service manager; vppctl also hangs on its unix socket when this happens.
With a mutex synchronizing ConnectOrDisConnect, the vpp lockup no longer seems to occur, and dataplane configuration succeeds in the repeated install/delete test case.

However, I need to confirm whether the next 2 patches in nsmd, mentioned below, would prevent the vpp lockup (highly unlikely).

The first one is simple and is based on a missing network service endpoint resulting in traversal of nil endpoints in RestoreConnections. This issue won't be present in v0.2.0.
The second panic stems from an interface typecast error while trying to close a local connection with a local network service client for a remote destination connection, and vice versa for a remote network service client with a local destination/connection.

This was hit from updating cross connects (UpdateXcon). Need to check whether it can still be hit.

time="2021-08-18T20:48:24Z" level=info msg="Connection with Remote Network Service kube-worker1 at 10.44.0.3:5001 is established"
time="2021-08-18T20:48:24Z" level=info msg="NSM_Heal(1.1-EBDDDDA5) Connection d healing state is finished..."

panic: interface conversion: nsm.NSMConnection is connection.Connection, not connection.Connection (types from different packages)

goroutine 1427 [running]:
github.com/networkservicemesh/networkservicemesh/controlplane/pkg/nsm.(*nsmClient).Close(0xc000493758, 0xb51d10, 0xc00002c020, 0xb57468, 0xc000a6ac60, 0xc000493758, 0x0)
    /root/networkservicemesh/controlplane/pkg/nsm/remote_nsm_client.go:33 +0x14b
github.com/networkservicemesh/networkservicemesh/controlplane/pkg/nsm.(*networkServiceManager).closeEndpoint(0xc00014c120, 0xb51d10, 0xc00002c020, 0xc000258e60, 0x0, 0x0)
    /root/networkservicemesh/controlplane/pkg/nsm/nsm.go:891 +0x2fa
github.com/networkservicemesh/networkservicemesh/controlplane/pkg/nsm.(*networkServiceManager).close(0xc00014c120, 0xb51d10, 0xc00002c020, 0xc000258e60, 0xc000250101, 0x2, 0x2)
    /root/networkservicemesh/controlplane/pkg/nsm/nsm.go:427 +0x13a
github.com/networkservicemesh/networkservicemesh/controlplane/pkg/nsm.(*networkServiceManager).Close(...)
    /root/networkservicemesh/controlplane/pkg/nsm/nsm.go:408
github.com/networkservicemesh/networkservicemesh/controlplane/pkg/nsm.(*networkServiceManager).Heal(0xc00014c120, 0xb50080, 0xc000258e60, 0x1)
    /root/networkservicemesh/controlplane/pkg/nsm/nsm_heal.go:61 +0x2df
github.com/networkservicemesh/networkservicemesh/controlplane/pkg/services.(*ClientConnectionManager).UpdateXcon(0xc0005004e0, 0xb50080, 0xc000258e60, 0xc000355b60)
    /root/networkservicemesh/controlplane/pkg/services/client_connection_manager.go:50 +0x1ce
github.com/networkservicemesh/networkservicemesh/controlplane/pkg/nsmd.(*NsmMonitorCrossConnectClient).dataplaneCrossConnectMonitor(0xc0004f62c0, 0xc0003a80c0, 0xb51cd8, 0xc0002dbc80)
    /root/networkservicemesh/controlplane/pkg/nsmd/nsmd_crossconnect_client.go:204 +0x598
created by github.com/networkservicemesh/networkservicemesh/controlplane/pkg/nsmd.(*NsmMonitorCrossConnectClient).DataplaneAdded
    /root/networkservicemesh/controlplane/pkg/nsmd/nsmd_crossconnect_client.go:92 +0x12d
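The panic above is the classic unchecked type assertion. A small standalone sketch (all type and function names here are hypothetical, not the NSM types) of the failure mode and the comma-ok guard, which a type switch, as in the patch, achieves equivalently:

```go
package main

import "fmt"

// Hypothetical stand-ins for the local and remote connection types;
// none of these names come from the NSM code.
type conn interface{ kind() string }

type localConn struct{}

func (localConn) kind() string { return "local" }

type remoteConn struct{}

func (remoteConn) kind() string { return "remote" }

// closeLocal shows the failure mode: an unchecked assertion like
// c.(localConn) panics when handed a remoteConn. The comma-ok form
// (or a type switch, as the patch uses) turns that into an error.
func closeLocal(c conn) error {
	lc, ok := c.(localConn)
	if !ok {
		return fmt.Errorf("not a local connection: %s", c.kind())
	}
	_ = lc // would close the local connection here
	return nil
}

func main() {
	fmt.Println(closeLocal(localConn{}))  // prints <nil>
	fmt.Println(closeLocal(remoteConn{})) // prints: not a local connection: remote
}
```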

The other, straightforward panic, also addressed in the v0.2.0 branch, is the result of traversing nil endpoints returned from finding network service endpoints in 1.0 while restoring connections.

2021/08/18 19:56:08 Reporting span 1532223ecb7f9611:1532223ecb7f9611:0:1
time="2021-08-18T19:56:08Z" level=error msg="Failed to find NSE to recovery: rpc error: code = Unknown desc = no NetworkService with name: secure-intranet-connectivity"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0x9566bb]

goroutine 1256 [running]:
github.com/networkservicemesh/networkservicemesh/controlplane/pkg/nsm.(*networkServiceManager).RestoreConnections(0xc000104120, 0xc0003e3540, 0x2, 0x2, 0xc00002c0d0, 0x8)
    /root/networkservicemesh/controlplane/pkg/nsm/nsm.go:758 +0x101b
github.com/networkservicemesh/networkservicemesh/controlplane/pkg/services.(*ClientConnectionManager).UpdateFromInitialState(...)
    /root/networkservicemesh/controlplane/pkg/services/client_connection_manager.go:177
github.com/networkservicemesh/networkservicemesh/controlplane/pkg/nsmd.(*NsmMonitorCrossConnectClient).dataplaneCrossConnectMonitor(0xc00007ef00, 0xc0004945a0, 0xb51df8, 0xc00007e1c0)
    /root/networkservicemesh/controlplane/pkg/nsmd/nsmd_crossconnect_client.go:222 +0x94d
created by github.com/networkservicemesh/networkservicemesh/controlplane/pkg/nsmd.(*NsmMonitorCrossConnectClient).DataplaneAdded
    /root/networkservicemesh/controlplane/pkg/nsmd/nsmd_crossconnect_client.go:92 +0x12d
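The guarded shape the patch introduces can be shown standalone. In this sketch (lookup, restore, and the types are illustrative, not the NSM code), the endpoint list is only traversed when the registry lookup succeeded, instead of dereferencing a nil result:

```go
package main

import "fmt"

// Illustrative stand-ins for the registry lookup; none of these
// names come from the NSM code.
type endpoint struct{ name string }

type findResult struct{ endpoints []*endpoint }

func lookup(service string) (*findResult, error) {
	if service != "known-service" {
		return nil, fmt.Errorf("no NetworkService with name: %s", service)
	}
	return &findResult{endpoints: []*endpoint{{name: "nse-1"}}}, nil
}

// restore mirrors the guarded shape of the patch: the endpoint list
// is traversed only in the else-branch, when the lookup succeeded,
// instead of dereferencing a nil result and panicking.
func restore(service string) []string {
	var names []string
	res, err := lookup(service)
	if err != nil {
		fmt.Println("Failed to find NSE to recovery:", err)
	} else {
		for _, ep := range res.endpoints {
			names = append(names, ep.name)
		}
	}
	return names
}

func main() {
	fmt.Println(restore("known-service"))   // prints [nse-1]
	fmt.Println(restore("no-such-service")) // logs the error, then prints []
}
```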
To reproduce on nsm v0.1.0, run this after k8s-infra-deploy:

#!/usr/bin/env bash
i=0
while [ $i -lt 10 ]; do
    echo "Testing helm install $i"
    helm install --atomic -n nsm-system vpn-test deployments/helm/vpn
    echo "nsm-app installed"
    helm delete -n nsm-system vpn-test
    i=$(($i+1))
done

The fix on v0.1.0 for the issues to make the test successful is:
diff --git a/controlplane/pkg/nsm/endpoint_client.go b/controlplane/pkg/nsm/endpoint_client.go
index c6d77751..8156e962 100644
--- a/controlplane/pkg/nsm/endpoint_client.go
+++ b/controlplane/pkg/nsm/endpoint_client.go
@@ -38,10 +38,14 @@ func (c *endpointClient) Cleanup() error {
 	return err
 }
 func (c *endpointClient) Close(ctx context.Context, conn nsm.NSMConnection) error {
+	var err error
 	if c.client == nil {
 		return fmt.Errorf("Remote NSM Connection is already cleaned...")
 	}
-	_, err := c.client.Close(ctx, conn.(*connection.Connection))
+	switch conn.(type) {
+	case *connection.Connection:
+		_, err = c.client.Close(ctx, conn.(*connection.Connection))
+	}
 	_ = c.Cleanup()
 	return err
 }
diff --git a/controlplane/pkg/nsm/nsm.go b/controlplane/pkg/nsm/nsm.go
index 9da67277..ce4d1d65 100644
--- a/controlplane/pkg/nsm/nsm.go
+++ b/controlplane/pkg/nsm/nsm.go
@@ -754,15 +754,16 @@ func (srv *networkServiceManager) RestoreConnections(xcons []*crossconnect.Cross
 			})
 			if err != nil {
 				logrus.Errorf("Failed to find NSE to recovery: %v", err)
-			}
-			for _, ep := range endpoints.NetworkServiceEndpoints {
-				if xcon.GetRemoteDestination() != nil && ep.EndpointName == xcon.GetRemoteDestination().GetNetworkServiceEndpointName() {
-					endpoint = &registry.NSERegistration{
-						NetworkServiceManager:  endpoints.NetworkServiceManagers[ep.NetworkServiceManagerName],
-						NetworkserviceEndpoint: ep,
-						NetworkService:         endpoints.NetworkService,
+			} else {
+				for _, ep := range endpoints.NetworkServiceEndpoints {
+					if xcon.GetRemoteDestination() != nil && ep.EndpointName == xcon.GetRemoteDestination().GetNetworkServiceEndpointName() {
+						endpoint = &registry.NSERegistration{
+							NetworkServiceManager:  endpoints.NetworkServiceManagers[ep.NetworkServiceManagerName],
+							NetworkserviceEndpoint: ep,
+							NetworkService:         endpoints.NetworkService,
+						}
+						break
 					}
-					break
 				}
 			}
 		}
@@ -775,6 +776,7 @@ func (srv *networkServiceManager) RestoreConnections(xcons []*crossconnect.Cross
 					endpoint = localEndpoint.Endpoint
 					endpointRenamed = true
 				}
+
 			} else {
 				logrus.Errorf("Failed to find Endpoint %s", endpointName)
 			}
@@ -782,7 +784,6 @@ func (srv *networkServiceManager) RestoreConnections(xcons []*crossconnect.Cross
 			logrus.Infof("Endpoint found: %v", endpoint)
 		}
 	}
-
 	clientConnection := &model.ClientConnection{
 		ConnectionID:            xcon.GetId(),
 		Xcon:                    xcon,
diff --git a/controlplane/pkg/nsm/remote_nsm_client.go b/controlplane/pkg/nsm/remote_nsm_client.go
index bc5d25a0..0fe15997 100644
--- a/controlplane/pkg/nsm/remote_nsm_client.go
+++ b/controlplane/pkg/nsm/remote_nsm_client.go
@@ -27,10 +27,14 @@ func (c *nsmClient) Request(ctx context.Context, request nsm.NSMRequest) (nsm.NS
 	return proto.Clone(response).(*connection.Connection), err
 }
 func (c *nsmClient) Close(ctx context.Context, conn nsm.NSMConnection) error {
+	var err error
 	if c == nil || c.client == nil {
 		return fmt.Errorf("Remote NSM Connection is not initialized...")
 	}
-	_, err := c.client.Close(ctx, conn.(*connection.Connection))
+	switch conn.(type) {
+	case *connection.Connection:
+		_, err = c.client.Close(ctx, conn.(*connection.Connection))
+	}
 	_ = c.Cleanup()
 	return err
 }
diff --git a/dataplane/vppagent/pkg/vppagent/vppagent.go b/dataplane/vppagent/pkg/vppagent/vppagent.go
index c364b0c4..c8cdbe75 100644
--- a/dataplane/vppagent/pkg/vppagent/vppagent.go
+++ b/dataplane/vppagent/pkg/vppagent/vppagent.go
@@ -38,7 +38,7 @@ import (
 	"github.com/sirupsen/logrus"
 	"google.golang.org/grpc"
 	"google.golang.org/grpc/status"
+	"sync"

 	"github.com/networkservicemesh/networkservicemesh/controlplane/pkg/apis/crossconnect"
 	local "github.com/networkservicemesh/networkservicemesh/controlplane/pkg/apis/local/connection"
 	remote "github.com/networkservicemesh/networkservicemesh/controlplane/pkg/apis/remote/connection"
@@ -76,6 +76,7 @@ type VPPAgent struct {
 	srcIP            net.IP
 	egressInterface  common.EgressInterface
 	monitor          monitor_crossconnect.MonitorServer
+	sync.Mutex
 }

 func CreateVPPAgent() *VPPAgent {
@@ -129,6 +130,8 @@ func (v *VPPAgent) Request(ctx context.Context, crossConnect *crossconnect.Cross
 }

 func (v *VPPAgent) ConnectOrDisConnect(ctx context.Context, crossConnect *crossconnect.CrossConnect, connect bool) (*crossconnect.CrossConnect, error) {
+	v.Lock()
+	defer v.Unlock()
 	if crossConnect.GetLocalSource().GetMechanism().GetType() == local.MechanismType_MEM_INTERFACE &&
 		crossConnect.GetLocalDestination().GetMechanism().GetType() == local.MechanismType_MEM_INTERFACE {
 		return v.directMemifConnector.ConnectOrDisConnect(crossConnect, connect)
@karthick18

Also attaching the patch on the v1.0 branch, which is inlined in the comment above, as a file.
nsm-patch.txt

@denis-tingaikin denis-tingaikin moved this from Community issues to Backlog in Issue/PR tracking Aug 24, 2021