bpf: Reset CT entry for CT_REOPENED #32614
Conversation
Force-pushed from cee504a to 81a8e00.
/test
Force-pushed from 81a8e00 to c901696.
Simplified bpf changes a bit.
/test
Force-pushed from c901696 to 130f2ea.
bpf lint fixes.
/test
Nice catch! For clarity: this change would only help TCP connections, correct? I think we need to find a solution for UDP / SCTP as well. But long-term I think the CT lookups from
TCP only, due to this snippet in
Our Envoy proxy redirects are TCP only, so fixing this for TCP suffices to address the Envoy issue for now. This same issue could come up with the DNS proxy though, so it would be good to solve this for UDP (and SCTP) as well.
I'll split off the
Reset the CT entry when an old entry is found while looking up an entry for a TCP SYN packet (CT_REOPENED). This way the old entry has no effect on the entry for the new connection and security identity, and proxy and nodeport related flags get set correctly for the new connection. This prevents a failing proxy redirect when the CT entry of a non-redirected connection would have been reused for a redirected connection, which has happened in CI where an egress L7 policy was applied and an old CT entry from a time without the L7 policy existed. Fixes: cilium#27762 Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Reset the CT entry when an old entry is found while looking up an entry for a TCP SYN packet (CT_REOPENED). This way the old entry has no effect on the entry for the new connection and security identity, and proxy and nodeport related flags get set correctly for the new connection. Signed-off-by: Jarno Rajahalme <jarno@isovalent.com>
Force-pushed from 130f2ea to d813c38.
I considered this approach, but concluded that it would easily result in more brittle code, and that this simpler approach is better, for the following reasons:
Given this, it might be better to get rid of it. As for changes for CT_ESTABLISHED: changes in proxy redirection mid-connection will not work, but currently policy changes break existing connections whose redirection status changes, since the forward direction immediately behaves according to the new policy, while the return direction takes the redirection status from the CT entry, which is not currently updated.
Looks like
/test
```c
# ifdef ENABLE_DSR
	/* Clear .dsr_internal flag for old connections: */
	if (ret == CT_REOPENED && ct_state->dsr_internal)
		ct_update_dsr(get_ct_map4(tuple), tuple, false);
# endif /* ENABLE_DSR */
```
Let's address this part separately (#32642), there's more to it.
This was conditionally setting the flag to false; I removed it as the flag is now false by default. IMO removing this is the right thing to do in this PR (no functional change for the parts that were already updated for CT_REOPENED).
ty! I took a closer look at the nodeport.h changes.
```c
	/* Maybe we can be a bit more selective about CT_REOPENED?
	 * But we have to assume that both the CT and the SNAT entry are stale.
	 */
	case CT_REOPENED:
```
Removing the comment without any explanation of why it no longer applies doesn't feel very helpful.
This PR changes to reset the CT entry for CT_REOPENED, so there is no longer any concern about being more selective?
```diff
-	ct_state_new.dsr_internal = 1;
-	ct_state_new.proxy_redirect = 0;
-	ct_state_new.from_l7lb = 0;
+	ct_state_new.dsr_internal = true;
+	/* all other flags are already false */
```
This is just a cleanup, correct?
Yes, stemming from the fact that other sites were missing these explicit `= 0` statements.
```diff
 	case CT_REOPENED:
+		memset(&ct_state, 0, sizeof(ct_state));
+		ct_state.rev_nat_index = ct_state_svc.rev_nat_index;
+		fallthrough;
 	case CT_NEW:
 		ct_state.src_sec_id = src_sec_identity;
-		ct_state.node_port = 1;
+		ct_state.node_port = true;
 #ifndef HAVE_FIB_IFINDEX
```
I still believe that teaching `ct_entry_matches_types()` about `src_sec_id` and `ifindex` for the case of `CT_ENTRY_NODEPORT` would get us much further here, and would address any concerns for CT_ESTABLISHED.
We could literally just copy the code that fills the `ct_state` in front of the CT lookup, and say "this is what I'm expecting".
So you mean the `ct_*` function would insert the new entry then?
We can't do policy lookup before CT lookup, as we can do policy lookup on a forward direction only, and a CT lookup is needed to know if a packet is in forward or reply direction. Also, some of the CT entry contents come from the policy lookup, so it does not seem feasible to be able to give a full expected CT entry before the CT lookup.
If we follow your change suggestion here, that would be my conclusion as well. We're already clearing quite a few parts of the entry, so why not go all the way. But I don't have sufficient historical context on why we've introduced
Ack. So we need to decide whether this PR is only about fixing the L7 proxy aspect, or wants to address fuzzy CT lookups as a whole.
I would like this PR to be a backportable bug fix for the L7 proxy redirect issue/flake. In that light, maybe I should remove the cleanups as well to make this truly minimal? TBH, I don't know if the nodeport.h changes are absolutely necessary for the L7 proxy redirect fix, though.
Here's the commit that added `CT_REOPENED` (#13340):
It looks to me
Reset the CT entry when an old entry is found while looking up an entry for a TCP SYN packet (CT_REOPENED). This way the old entry has no effect on the entry for the new connection and security identity, and proxy and nodeport related flags get set correctly for the new connection.
This prevents a failing proxy redirect when the CT entry of a non-redirected connection would have been reused for a redirected connection, which has happened in CI where an egress L7 policy was applied and an old CT entry from a time without the L7 policy existed.
For this bug to trigger, a pod needs to open a new TCP connection using the same (ephemeral) source port to the same destination before and after a policy change adding an L7 policy applicable to that connection. While this bug has surfaced in CI, it is less likely to be hit when policies are more stable. Given this, we need further discussion about backporting this bug fix to the stable releases.
Fixes: #27762 (see #27762 (comment))