Implement a performant cache for users and groups on Windows #7516
Add two new services, `UsersService` and `GroupsService`, which scan users and groups information and cache it in memory.

The speed of a scan is configurable with two new flags, `users_service_delay` and `groups_service_delay`: the services scan 100 users or groups at a time and then wait for the configured delay to throttle the scan. The interval between full scans is configurable via two more new flags, `users_service_interval` and `groups_service_interval`.

While the cache is being built, an optimized index is created for the columns that are marked as index. The very first time they run, the services do a full scan before providing any results, so any table or other code that wants to access the cache has to wait. Once the cache is initialized it is updated incrementally, so an access to the cache that happens during one of the scans blocks only for a very short time.

The internal helper APIs used to collect the users and groups information from Windows have also been improved, to avoid unnecessary data transformation round trips.
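As a rough illustration of the cache behavior described above, here is a minimal C++ sketch. It is not the code in this PR: `UserRow`, `UsersCache` and every method name are hypothetical, and the two indexed columns (`uid` and `username`) are assumed only for the example. It shows lookups blocking until the first full scan is marked complete, and batch updates holding the lock only briefly.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <optional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical row type; the real tables carry more columns.
struct UserRow {
  std::uint32_t uid{0};
  std::string username;
};

// Hypothetical in-memory cache with one index per "index" column.
class UsersCache {
 public:
  // Called by the scanning service once per batch. The lock is held only
  // while this batch is merged, so readers block for a short time.
  void updateBatch(const std::vector<UserRow>& batch) {
    std::lock_guard<std::mutex> lock(mutex_);
    for (const auto& row : batch) {
      auto existing = by_uid_.find(row.uid);
      if (existing != by_uid_.end()) {
        // Drop the stale name entry in case the user was renamed.
        uid_by_name_.erase(existing->second.username);
        existing->second = row;
      } else {
        by_uid_.emplace(row.uid, row);
      }
      uid_by_name_[row.username] = row.uid;
    }
  }

  // Called once, after the first full scan has finished.
  void markInitialized() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      initialized_ = true;
    }
    initialized_cv_.notify_all();
  }

  // Lookup through the uid index; waits for the first full scan.
  std::optional<UserRow> findByUid(std::uint32_t uid) {
    std::unique_lock<std::mutex> lock(mutex_);
    waitInitialized(lock);
    auto it = by_uid_.find(uid);
    if (it == by_uid_.end()) {
      return std::nullopt;
    }
    return it->second;
  }

  // Lookup through the username index; waits for the first full scan.
  std::optional<UserRow> findByName(const std::string& name) {
    std::unique_lock<std::mutex> lock(mutex_);
    waitInitialized(lock);
    auto it = uid_by_name_.find(name);
    if (it == uid_by_name_.end()) {
      return std::nullopt;
    }
    return by_uid_.at(it->second);
  }

 private:
  void waitInitialized(std::unique_lock<std::mutex>& lock) {
    initialized_cv_.wait(lock, [this] { return initialized_; });
  }

  std::mutex mutex_;
  std::condition_variable initialized_cv_;
  bool initialized_{false};
  std::unordered_map<std::uint32_t, UserRow> by_uid_;
  std::unordered_map<std::string, std::uint32_t> uid_by_name_;
};
```

In this sketch a table implementation would call `findByUid` or `findByName` when the query constrains an indexed column, and fall back to iterating the whole cache otherwise.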
When osquery is installed on Windows domains that contain several thousand users, tables like [...]. The reason is that the groups table does not implement improved filtering on index columns [...]. Given that changing the schema is always a bit undesirable, and that the process to do so would have taken several intermediate steps to get to a good place, I've opted to focus on performance first. Note that on osquery >= 4.9.0 it is possible to write a query joining users and groups that greatly reduces the run time, to a few minutes depending on the number of users: [...]
This would also move the processing into the osquery process, so osquery would use 100% of a core for the minutes required to complete the query, which would require raising the watchdog limits. The solution proposed here instead gives more control over how many resources osquery uses and how much overhead it causes on the system. It should also let a normal JOIN query return in a few seconds.
This is very interesting, and I have no idea what I think. In effect, creating a materialized cache for this data makes osquery much closer to the database everyone thinks it is. I added it to an agenda doc for next office hours.
This is great! Thanks @Smjert
I like the service and caching approach, and I think it works well for this issue.
I reviewed most of the code, and it looks good. The only caveat is that I am a bit new to the Windows APIs, but I'm learning.
Windows-only flag that defines the number of milliseconds to wait, during a scan of users information, between one batch of 100 users and the next. This is meant to throttle the CPU usage of osquery and especially of the LSASS process on a Windows Server DC. The first batch of users is always gathered immediately at the start of the scan.
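The throttling described by this flag can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `scanInBatches`, `kBatchSize` and the injectable `sleep_fn` parameter are hypothetical names, and the per-user work is reduced to a callback. The first batch is processed immediately and every later batch is preceded by one pause of the configured delay.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

// Batch size taken from the description above; the constant name is hypothetical.
constexpr std::size_t kBatchSize = 100;

using SleepFn = std::function<void(std::chrono::milliseconds)>;

// Processes `names` in batches of kBatchSize. The first batch runs
// immediately; each later batch is preceded by one `delay` pause.
// Returns the number of pauses taken. `sleep_fn` is injectable so the
// throttling can be exercised without really sleeping.
inline std::size_t scanInBatches(
    const std::vector<std::string>& names,
    std::chrono::milliseconds delay,
    const std::function<void(const std::string&)>& process,
    const SleepFn& sleep_fn = [](std::chrono::milliseconds d) {
      std::this_thread::sleep_for(d);
    }) {
  std::size_t pauses = 0;
  for (std::size_t start = 0; start < names.size(); start += kBatchSize) {
    if (start != 0) {
      // Throttle between batches, never before the first one.
      sleep_fn(delay);
      ++pauses;
    }
    const std::size_t end = std::min(start + kBatchSize, names.size());
    for (std::size_t i = start; i < end; ++i) {
      process(names[i]);
    }
  }
  return pauses;
}
```

With 250 users, for example, this takes two pauses: one before the second batch and one before the third, so a larger delay stretches the scan out and lowers its peak load.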
`--users_service_interval=1800`
Do you think there should be a way to trigger this to resync? Hidden column or such?