-
-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ETW-based process events table for Windows #7821
Add ETW-based process events table for Windows #7821
Conversation
… Pushing initial version of the ETW Process Event Publisher and Subscriber
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only took a quick look so far (mostly at the schema).
Very glad to see the thorough comments on all the ETW "framework" header files! Thank you for doing that.
After we get this initial work merged into osquery core, it would be really helpful if you could write up some documentation in the osquery repo about how further ETW tables should be built using the framework.
table_name("etw_process_events") | ||
description("Windows process execution events.") | ||
schema([ | ||
Column("type", TEXT, "Event Type"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would help to see the possible values. Is this right?
Column("type", TEXT, "Event Type"), | |
Column("type", TEXT, "Event Type (ProcessStart, ProcessStop)"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point! I think having the possible types listed here will help
Column("header_pid", BIGINT, "Process ID of the process reporting the event", hidden=True), | ||
Column("process_sequence_number", BIGINT, "Process Sequence Number - Present only on ProcessStart events", hidden=True), | ||
Column("parent_process_sequence_number", BIGINT, "Parent Process Sequence Number - Present only on ProcessStart events", hidden=True), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these columns hidden because they are mostly useful for dev/debuggign?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is the process_sequence_number unique at some level? Would be useful to have a non-PID unique identifier for a process
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A corner case scenario on Windows called "PID reuse" can happen when the same PIDs get reused during OS runtime lifecycle. This is something very unlikely that OS developers decided to address by including the concept of the "process sequence number" on the ETW fields. I'm including this data as a hidden value and exposing this information in the unlikely event this is needed.
@zwass yes, these could be useful for debugging scenarios
@defensivedepth yes, they are unique
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marcosd4h How unique? Sysmon generates a GUID for each new process creation (https://learn.microsoft.com/en-us/sysinternals/downloads/sysmon#event-id-1-process-creation) specifically because of the issues around PIDs.
I would recommend that we do not hide the sequence number columns. For security use cases / correlation, some type of GUID would be better than PIDs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marcosd4h Wanted to make sure my reply didn't get lost in all the other comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @defensivedepth, sorry for the late response! I totally miss your response.
This sequence value the code is exposing is generated by the kernel itself. The code is just grabbing it from the ETW event and exposing it to the table. In other words, the code does not control this value; it just relies on the kernel-generated value to be unique for the current os boot session.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No real comments, but I'm excited to see it!
@@ -0,0 +1,29 @@ | |||
table_name("etw_process_events") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This isn't blocking)
Does it make sense to expose any of this in the existing process_events
? I feel like we've never really had a good answer for how we handle that, and the migration between event subsytems. (process_events vs bpf_process_events vs es_process_events)
I guess maybe it makes more sense to let them all diverge, and people can write platform specific SQL?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed this yesterday with @zwass and other folks during the Osquery office hours. There is still not a clear decision on how to move fwd. We most likely go ahead with this change and then explore options to converge es_process_events
, bpf_process_events
, and etw_process_events
into a single table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My takeaway from office hours was:
- We should keep separate tables
- Future work to explore a merged
process_events
- We should probably name them
process_events_etw
so like tables sort together
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I missed office hours, apologies, just my 2c here. I'm not a huge fan of diverging tables, as it puts the data in different places, and it's not nearly as intuitive as there being just one process_events
table. The diverging schemas was what we had originally shipped OS specific columns for I thought.
I'm not going to stand in the way of progress here, just wanted to voice my thoughts on this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @muffins, thanks for providing your input here. During office-hours meeting we decided to merge es_process_events
, bpf_process_events
, and etw_process_events
into a single table. There has to be some work involved to decide how to merge the schemas through os-specific columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marcosd4h thanks for the context! This is what I get for not coming to OH :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://hackmd.io/F9uUgvUkQuuyfrPoZbstYg?view#ETW-events-table--Process-events
I think we were more nuanced. IIRC the extended schema has some issues under the hood, and it's often hard to see how to combine things.
So we were mulling on the idea of having divergent tables, and also a single table with a simplified set of things. But we didn't think we should block here on that.
We did most think we should start naming them process_events_XXXX
so they sorted together
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@directionless @zwass I ended up naming the table process_etw_events
. The main rationale for picking this name instead of process_events_XXXX
is that evented tables must have the _events
suffix at the end of the table name
About the failing Windows test:
|
…ries as tdh.dll is going to be loaded at runtime to get ETW working
Hi @zwass , excellent work!, BTW i had a POC for the same, but using a different provider (not the kernel one) with further modifications to share code between process_events and processes while getting information not present in ETW event. In case you are interested to add some improvements or fields in a future PR, feel free to use some of those changes. (lcostantino#1) |
@lcostantino Flaco!! mann, it's really a small world, after all! I'm happy to see you here. Also, glad to see that you are thinking that adding ETW support to Osquery is the way to go. Would it be possible for you to code review this PR if you have a chance? I agree that there are things in your PoC that can be used to enrich the process events created on this PR. I also created a framework to add new ETW-based Event publishers, described in the blueprint [here]. (#7826). Your feedback on the blueprint will be greatly appreciated. Are you on the Osquery slack by chance? |
Small clarification here: @zwass created this PR by mistake while we were reviewing the initial specs for this new evented table. That's why there are confusing details in the PR description. This PR introduces POTE (Programmable OS Tracing Engine) framework + a new windows evented table called Having POTE in place will simplify the addition of future evented tables as POTE provides a simplified mechanism to create ETW-based Event publishers. The PR blueprint is detailed here. |
Hey Marcos, didn't knew it was you!!! happy to see you, sure i will check the PR. (not in OSQUERY slack yet) |
T pop() { | ||
std::unique_lock<std::mutex> lock(mutex_); | ||
while (queue_.empty()) { | ||
condition_.wait(lock); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would it be safer to use wait with predicate for empty? (spurious wake should also be covered with that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point! Thanks for bringing it up. I've just modified the wait call to use a predicate to avoid lost or spurious wakeups
@marcosd4h for the CVEs of krabsetw, the CI script will ask you to insert all fields for the library, but I checked in the NVD database and there's no CPE for the library. I think I need to improve the very first message when the library is not found, so that it points to the instructions for adding a new library and handling special cases. |
Good catch @Smjert - I'll start a thread with NIST per this guidance to get a CPE created for krabsetw. Thanks for pointing this out! |
Hello @swannman and thanks! That's definitely useful! |
Thanks @Smjert, for pointing this out. I updated the branch to get the new CI scripts and updated the library manifest file. BTW, the version field was required and KrabsETW library does not have tagged released versions, so I ended up just using 0.0.0.0. Let me know if that works here. |
Yeah the version is mostly for human readers when there's no CPE, but if there's no version it's fine too, the commit sha is actually what gets compared in the repo. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Presuming you've changed the items you've noted, this is largely looking good to me. As mentioned I'd really love to have another C++ human look through, esp for potential design improvements, but overall I think it's fine.
I'm happy to stamp if needed, but I'm going to let other folks have a chance to look through. If no one else steps up ping me and I'm happy to approve.
case EtwProviderConfig::EtwKernelProviderType::File: { | ||
// https://learn.microsoft.com/en-us/windows/win32/etw/fileio | ||
kernelProvider = std::make_shared<krabs::kernel::file_io_provider>(); | ||
} break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: I think you want these breaks inside of the curlys, keep all the code for each case together inside.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems easier to read to me as well. Any reason not to do it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point; I've just fixed that.
@@ -0,0 +1,29 @@ | |||
table_name("etw_process_events") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marcosd4h thanks for the context! This is what I get for not coming to OH :)
…n logic. Improving shutdown logic on the ETW sessions runners. Adding more tests to cover the shutdown scenario. Adding integration test.
Hey @muffins, I hope you are doing great! I wanted to let you know that I've just pushed integration and events tests for the ETW logic. I've also included tests for the concurrent queue code. Adding tests was a great idea as it helped me to improve the publisher shutdown logic. Thanks for bringing this up! Also, sorry for the delay in getting the tests in place; I've been buried with high-priority work at FleetDM and didn't have the time the last few weeks to work on this. |
@marcosd4h no stress on the time delays, it was also holidays so I wasn't looking at computers much! I'll try and take another look through everything over the next week or two. Hopefully we can get this merged up soon. Thanks a ton for bearing with us and writing tests, I know everyone will be pumped for the extra assurances 🙏 🥳 🧑🏭 |
Thanks @muffins, I hope you had a great start of the year!! I'm looking forward to the upcoming review. I'm open to work on any change that might be needed to get this merged 💯 🥳 🥳 🥳 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marcosd4h this is mostly nits and some requests for clarification.
This is tricky to read as such a huge PR, but it definitely helps that you documented many of the class members and methods. Thanks!
|
||
namespace osquery { | ||
|
||
// Returns a reference to the single global EtwController instance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// Returns a reference to the single global EtwController instance | |
// Returns a reference to the single global EtwController instance |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
return instance; | ||
} | ||
|
||
// New events get store in the post-processing queue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// New events get store in the post-processing queue | |
// New events get stored in the post-processing queue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
case EtwProviderConfig::EtwKernelProviderType::File: { | ||
// https://learn.microsoft.com/en-us/windows/win32/etw/fileio | ||
kernelProvider = std::make_shared<krabs::kernel::file_io_provider>(); | ||
} break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems easier to read to me as well. Any reason not to do it?
void KernelEtwSessionRunnable::initKernelTraceSession( | ||
const std::string& sessionName) { | ||
if (sessionName.empty()) { | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
void KernelEtwSessionRunnable::stopKernelTraceSession( | ||
const std::string& sessionName) { | ||
if (sessionName.empty()) { | ||
return; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
void EtwPublisherProcesses::initializeHardVolumeConversions() { | ||
const std::string validDriveLetters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"; | ||
|
||
for (auto& driveLetter : validDriveLetters) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Comment says done but looks like change wasn't pushed?
processStartAggregationCache_.erase(it++); | ||
} else { | ||
++it; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the same? Feels easier to understand for me to have the iterator advance outside the conditional
processStartAggregationCache_.erase(it++); | |
} else { | |
++it; | |
} | |
processStartAggregationCache_.erase(it); | |
} | |
++it; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the suggestion, I think the code is more readable now. I'm going to try out this change and see how it goes
strlen(userNameStr.data()) == 0 || | ||
strlen(userNameStr.data()) >= MAX_PATH || | ||
sidType == SID_NAME_USE::SidTypeInvalid) { | ||
// Inserting empty username to avoid the lookup logic to be called again |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see this comment but I don't actually see the insert. Am I misunderstanding?
Maybe log here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch! I've just added the missing insert here. This helper is used to populate an attribute of the resulting event, I'd prefer not to log here
case TOKEN_ELEVATION_TYPE::TokenElevationTypeDefault: { | ||
tokenInfo.assign("TokenElevationTypeDefault"); | ||
} break; | ||
|
||
case TOKEN_ELEVATION_TYPE::TokenElevationTypeFull: { | ||
tokenInfo.assign("TokenElevationTypeFull"); | ||
} break; | ||
|
||
case TOKEN_ELEVATION_TYPE::TokenElevationTypeLimited: { | ||
tokenInfo.assign("TokenElevationTypeLimited"); | ||
} break; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's use the same break
in the curly braces syntax here please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
(ULONG)sizeof(EVENT_TRACE_PROPERTIES); | ||
|
||
/// Best effort to stop ongoing trace session | ||
ControlTraceA(NULL, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could be nice to log the return value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point, just fixed this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@marcosd4h thank you for addressing the comments!
This is looking ready to merge to me. I can't approve it since I opened the PR even though Marcos wrote all the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A somewhat blind 👍, since @zwass would have, but PR authorship happened. Consider the magic co-authored-by silliness.
uhoh, GitHub wants to mark me as the author in the squash commit. Anyone know a way to resolve that? I didn't write a single line of this. |
Yes, that's how it works. The best you can do is Could probably close this, and have Marcos open anew |
Maybe I can squash this locally into a commit with Marcos' name on it and then we can "rebase and merge" that? Marcos is out on Vacation this week so I'd rather not bother him to open another PR, and it would be nice to have the commit associated with all the comments here. |
Hrm. That sounds a little messy? We can try, but I don't think we can roll it back. |
I tried the things I could... adding Marcos as "co-author" and clarifying in the commit message seems to be the best we can do right now. |
This PR implements POTE (see Blueprint) and the new evented table
process_etw_events
.Blueprint: #7826
This work done by @marcosd4h. I just opened the PR.