-
-
Notifications
You must be signed in to change notification settings - Fork 2.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rocksdb: Do not fsync WAL writes #7094
rocksdb: Do not fsync WAL writes #7094
Conversation
I'm unsure if this is the right approach. Or what the default is |
Yeah, I suspect this is the right start but it's not complete. |
@Jimjazzz would you mind testing https://github.com/osquery/osquery/suites/2682446025/artifacts/59150013 to see if it also fixes the issue. If it does we can go from there and further improve the safety of this approach. |
plugins/database/rocksdb.cpp
Outdated
@@ -30,6 +30,7 @@ HIDDEN_FLAG(int32, rocksdb_write_buffer, 16, "Max write buffer number"); | |||
HIDDEN_FLAG(int32, rocksdb_merge_number, 4, "Min write buffer number to merge"); | |||
HIDDEN_FLAG(int32, rocksdb_background_flushes, 4, "Max background flushes"); | |||
HIDDEN_FLAG(int32, rocksdb_buffer_blocks, 256, "Write buffer blocks (4k)"); | |||
HIDDEN_FLAG(bool, rocksdb_logs_wal, false, "Use the write-ahead-log for logs"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This defaults to false
? Do we really want to disable the WAL? Didn't you and @alessandrogario do a bunch of work to make it work?
I have tested it, but still i see that .LOG files are created (it went up to around 200 files). Attached here is the db of around the last hour: Following is the evolution of db in terms of size and number of files: |
8edd41e
to
59b3756
Compare
Ok, I looked into a few more of the details you shared and tried to test locally. I think a small change to not fsync the WAL on each write will alleviate the IOPS issue. If you combine the change in this PR with |
@Jimjazzz thank you so much for the testing effort. Would you mind trying once again with the latest changes: https://github.com/osquery/osquery/actions/runs/830071018 See my comment above about what is new. |
Hi @theopolis , we have tested the latest build but i did not get the time to write back a report to you. In terms of DB size and #files, i see a difference where this new build the number of LOG files is way more contained. Previous scenario grows up to 700 files, while the new one is around 100. I see that write IOPS are smaller, however i think that read spikes are bigger now: Also attached here the latest status of osquery.db (snapshot taken 5/14 16:28 GMT+2) I can see a big improvement with respect to v4.6.0.2 , but not as big with the normal v4.8.0 release. Still, completely disabling WAL and sync gives me the best results. However, i have been seeing that osquery seems to degrade after some time running when sending logs via TLS, as i can see event loss in our SIEM. I wonder if events buffering could be causing increased IOPS, since when we changed tls_logger_max_lines to 8192 we saw a good improvement on the situation. Now we are not reaching the limit in any flushm but we are still drastically lowering intervals and flush periods to see if we get any improvements. |
Ok thanks for testing. I think we should at least land this change. This essentially does not force an fsync after each WAL write. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems worth it for the limited improvement we get here. Let's make sure this is documented in the release notes.
This addresses #7079 but makes logs more susceptible to data loss if a process crashes.