Segmentation fault during Windows pcap processing (maybe caused by detect-MHR) #3534

Open
philrz opened this issue Jan 8, 2024 · 11 comments
Labels
Area: Windows Type: Bug 🐛 Unexpected behavior or output.

Comments

@philrz
Contributor

philrz commented Jan 8, 2024

I've reproduced this issue using Zeek v6.0.2 that I compiled on a Windows 2019 Server AWS EC2 instance using the instructions from https://docs.zeek.org/en/master/install.html.

I've used my compiled Zeek to successfully generate logs from small/medium pcaps. However, when I try to process the pcap at https://archive.wrccdc.org/pcaps/2018/wrccdc.2018-03-23.010014000000000.pcap.gz (after gunzipping) I can fairly consistently trigger a segmentation fault if I invoke the local script (unmodified from what shipped with the release).

The error is easiest to see if I drop into Git Bash after calling the BAT script that sets the environment variables needed to run my compiled Zeek.

C:\Users\Administrator\Downloads\zeek-6.0.2\build>call zeek-path-dev

C:\Users\Administrator\Downloads\zeek-6.0.2\build>bash

Administrator@EC2AMAZ-GE3C5F0 MINGW64 ~/Downloads/zeek-6.0.2/build
$ src/zeek.exe -C -r wrccdc.2018-03-23.010014000000000.pcap

Administrator@EC2AMAZ-GE3C5F0 MINGW64 ~/Downloads/zeek-6.0.2/build
$ echo $?
0

Administrator@EC2AMAZ-GE3C5F0 MINGW64 ~/Downloads/zeek-6.0.2/build
$ src/zeek.exe -C -r wrccdc.2018-03-23.010014000000000.pcap local
Segmentation fault

Administrator@EC2AMAZ-GE3C5F0 MINGW64 ~/Downloads/zeek-6.0.2/build
$ echo $?

I then started commenting out lines in local.zeek to try to narrow it down, and it seems I can avoid the segfault if I comment out this line:

@load frameworks/files/detect-MHR

Ten successful runs in a row with that line commented out:

Administrator@EC2AMAZ-GE3C5F0 MINGW64 ~/Downloads/zeek-6.0.2/build
$ for run in 0 1 2 3 4 5 6 7 8 9; do  echo Running;   src/zeek -C -r wrccdc.2018-03-23.010014000000000.pcap local; echo $?; done
Running
0
Running
0
Running
0
Running
0
Running
0
Running
0
Running
0
Running
0
Running
0
Running
0
@philrz
Contributor Author

philrz commented Jan 9, 2024

Hmm. The plot thickens.

I thought that while waiting to hear back on this one I'd try testing with the rest of the many test pcaps at https://archive.wrccdc.org/pcaps/2018/ but with that detect-MHR line commented out, since if those tests all worked it would help confirm detect-MHR was a root cause and that other egregious runtime problems aren't lurking in there.

It didn't take long to find that https://archive.wrccdc.org/pcaps/2018/wrccdc.2018-03-23.010356000000000.pcap.gz also triggers a segfault. So maybe that has a different root cause, or maybe the root cause is the same and commenting out detect-MHR just somehow gave more headroom before causing a segfault.

Anyway, just figured I'd throw that on the pile in case it helps with isolating the problem and verifying a fix.

@JustinAzoff
Contributor

Well, detect-MHR specifically is potentially very relevant.

That script does one thing: it looks for downloads of a certain MIME type: https://github.com/zeek/zeek/blob/master/scripts/policy/frameworks/files/detect-MHR.zeek#L18-L24

Then, for any match, it looks up the file hash with a TXT DNS query, calling lookup_hostname_txt on fmt("%s.malware.hash.cymru.com", hash) inside of a when statement.
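
For anyone who wants to try the same kind of lookup outside Zeek, here is a minimal shell sketch of how the query name is built. The function name mhr_query_name is made up for illustration and the hash is a placeholder, not a real sample; the only part taken from the script is the "%s.malware.hash.cymru.com" format string. Whether dig is available on a given Windows box is an assumption, so that line is left commented out.

```shell
# Hypothetical helper: build the DNS name that gets queried for a file hash,
# mirroring fmt("%s.malware.hash.cymru.com", hash) from the script.
mhr_query_name() {
    printf '%s.malware.hash.cymru.com\n' "$1"
}

# Placeholder hash for illustration only (not a real sample).
name=$(mhr_query_name "0123456789abcdef0123456789abcdef01234567")
echo "$name"

# Assumption: dig is installed. Uncomment to send the TXT query by hand.
# dig +short TXT "$name"
```

If a manual query like this resolves fine, that would point away from basic DNS reachability and toward how the lookup is handled inside Zeek.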

Using when with DNS queries has been a bit buggy even on Linux.

Only one other script that does this might be triggering on a random pcap run: ssh/interesting-hostnames.zeek.

Does that other pcap crash if you disable both the MHR and the ssh script?

@philrz
Contributor Author

philrz commented Jan 10, 2024

Thanks @JustinAzoff! Indeed, your theory panned out. I just commented out the SSH interesting-hostnames line in my local.zeek on my test Windows VM and was able to successfully run Zeek against that pcap dozens of times without triggering the segfault.
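
The "comment out a line, rerun" step can be scripted so the stock local.zeek stays untouched. This is only a sketch: disable_load is a hypothetical helper name, and the load paths in the usage comments are assumed to match what the stock local.zeek contains, so check them against the actual file before relying on it.

```shell
# Hypothetical helper: print a copy of a Zeek script with the "@load" line
# for the given script path commented out. The original file is not modified.
disable_load() {
    # $1 = script file, $2 = load path as written after "@load"
    sed "s|^@load $2[[:space:]]*\$|# @load $2|" "$1"
}

# Example usage (paths are assumptions; adjust to where local.zeek lives):
# disable_load local.zeek frameworks/files/detect-MHR > step1.zeek
# disable_load step1.zeek protocols/ssh/interesting-hostnames > local-nodns.zeek
# src/zeek -C -r wrccdc.2018-03-23.010356000000000.pcap local-nodns.zeek
```

Writing the result to a separate file makes it easy to switch between the stock and the reduced script set while bisecting.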

(I also just realized that I pasted the wrong pcap into my prior comment, but I just went back and fixed it. It should have been https://archive.wrccdc.org/pcaps/2018/wrccdc.2018-03-23.010356000000000.pcap.gz.)

So does knowing the problem reproduces reliably on Windows actually make it any easier to fix? 😬 😄

@JustinAzoff
Contributor

Does this script make things crash?

event zeek_init()
{
	when ( local result = lookup_hostname_txt("example.com"))
	{
		print result;
	}
}

We have some tests for DNS, but I think they all run using the fake 'test' resolver, so if it's the real resolver that has the issue, this could have been missed.

@philrz
Contributor Author

philrz commented Jan 10, 2024

Assuming I did it right, it doesn't seem to cause a crash. I put your script into a file justin.zeek and ran it repeatedly. The output toggles between two values from run to run. No crash.

Administrator@EC2AMAZ-NVIH7LJ MINGW64 ~/zeek/build (master)
$ cat justin.zeek
event zeek_init()
{
        when ( local result = lookup_hostname_txt("example.com"))
        {
                print result;
        }
}

Administrator@EC2AMAZ-NVIH7LJ MINGW64 ~/zeek/build (master)
$ src/zeek justin.zeek
wgyf8z8cgvm2qmxpnbnldrcltvk4xqfn

Administrator@EC2AMAZ-NVIH7LJ MINGW64 ~/zeek/build (master)
$ src/zeek justin.zeek
v=spf1 -all

Administrator@EC2AMAZ-NVIH7LJ MINGW64 ~/zeek/build (master)
$ src/zeek justin.zeek
wgyf8z8cgvm2qmxpnbnldrcltvk4xqfn

Administrator@EC2AMAZ-NVIH7LJ MINGW64 ~/zeek/build (master)
$ src/zeek justin.zeek
v=spf1 -all

Administrator@EC2AMAZ-NVIH7LJ MINGW64 ~/zeek/build (master)
$ src/zeek justin.zeek
wgyf8z8cgvm2qmxpnbnldrcltvk4xqfn

Administrator@EC2AMAZ-NVIH7LJ MINGW64 ~/zeek/build (master)
$ src/zeek justin.zeek
v=spf1 -all

Administrator@EC2AMAZ-NVIH7LJ MINGW64 ~/zeek/build (master)
$ src/zeek justin.zeek
wgyf8z8cgvm2qmxpnbnldrcltvk4xqfn

@timwoj
Contributor

timwoj commented Jan 10, 2024

It's definitely possible there's some bug in c-ares on Windows. I'll try to get a backtrace out of a Windows build tomorrow for this.

@timwoj
Contributor

timwoj commented Jan 10, 2024

Got a more useful backtrace finally (after I actually read your repro steps above 🤦🏼‍♂️ ):

Exception thrown: read access violation.
filt was 0xFFFFFFFFFFFFFFD7.

>	zeek.exe!windows_kevent_copyout(kqueue * kq, int nready, kevent * eventlist, int nevents) Line 142	C
 	zeek.exe!kevent(int kqfd, const kevent * changelist, int nchanges, kevent * eventlist, int nevents, const timespec * timeout) Line 451	C
 	zeek.exe!zeek::iosource::Manager::Poll(std::vector<zeek::iosource::Manager::ReadySource,std::allocator<zeek::iosource::Manager::ReadySource>> * ready, double timeout, zeek::iosource::IOSource * timeout_src) Line 180	C++
 	zeek.exe!zeek::iosource::Manager::FindReadySources(std::vector<zeek::iosource::Manager::ReadySource,std::allocator<zeek::iosource::Manager::ReadySource>> * ready) Line 174	C++
 	zeek.exe!zeek::run_state::detail::run_loop() Line 270	C++
 	zeek.exe!main(int argc, char * * argv) Line 95	C++
 	zeek.exe!invoke_main() Line 79	C++
 	zeek.exe!__scrt_common_main_seh() Line 288	C++
 	zeek.exe!__scrt_common_main() Line 331	C++
 	zeek.exe!mainCRTStartup(void * __formal) Line 17	C++
 	kernel32.dll!BaseThreadInitThunk()	Unknown
 	ntdll.dll!RtlUserThreadStart()	Unknown

@JustinAzoff
Contributor

Ah, @philrz, another thing you can try is to use the stock scripts, but set ZEEK_DNS_FAKE=1 in the environment. If it crashes without that, but runs OK with that set, then yeah, it's definitely something with c-ares or, as the backtrace above shows, the event loop.
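
A rough way to run that comparison from Git Bash is to count failing runs with and without the variable set. This is a sketch only: count_failures is a hypothetical helper, and the zeek command lines in the comments simply reuse the pcap and options from earlier in this thread.

```shell
# Hypothetical helper: run a command N times and print how many of the runs
# exited with a non-zero status (a segfault also counts as non-zero).
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard the command's own output; only the exit status matters here.
        "$@" >/dev/null 2>&1 || failures=$((failures + 1))
        i=$((i + 1))
    done
    echo "$failures"
}

# Example usage, real resolver vs. ZEEK_DNS_FAKE=1:
# count_failures 10 src/zeek -C -r wrccdc.2018-03-23.010014000000000.pcap local
# count_failures 10 env ZEEK_DNS_FAKE=1 src/zeek -C -r wrccdc.2018-03-23.010014000000000.pcap local
```

Using env to set the variable keeps it scoped to the zeek process, so the two runs can be compared in the same shell session.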

@philrz
Contributor Author

philrz commented Jan 11, 2024

@JustinAzoff: Ok, I just tried that, and it seems to validate your "it's definitely something" theory. I was able to use the stock scripts (i.e., both detect-MHR and interesting-hostnames enabled in my invoked local.zeek) and both those test wrccdc pcaps could be processed to completion several dozen times in a row without triggering the segfault.

@timwoj
Contributor

timwoj commented Jan 11, 2024

It's odd because DNS_Mgr and c-ares aren't doing anything untoward that I can tell. It drops the nameserver connection a couple of times but re-establishes at the same time both times. I've tried running against another large pcap and it's not failing there either, though it's doing the same thing. I'll open an issue on the kqueue repo with the crash and see if they have any pointers about what could cause that memory to be invalid.

@philrz
Contributor Author

philrz commented Jan 15, 2024

Update: I spotted mheily/libkqueue#155 as the issue @timwoj mentioned in the last comment above.
