
Implement stickiness for the duration of master_ttl via cookies #194

Merged: 31 commits merged into instacart:master from the stickiness-without-context branch on Mar 20, 2018

Conversation

rosa commented Feb 12, 2018

This pull request significantly changes how Makara's context and stickiness behave. It implements what is described in this comment, and can be summarised as:

  • Stickiness is no longer maintained only for the request immediately after master is used; it is maintained for the whole duration of master_ttl, across as many requests as that covers. I noticed an older comment in which someone else assumed this behaviour, as a more intuitive way of mitigating replication lag.
  • Makara no longer keeps track of any context identifier or hash. It only keeps track of which configurations are stuck to master and when that stickiness expires. This is what is considered the context now: a map of configuration ids and their stickiness expiration times.
  • Context is maintained between requests by storing it in a cookie. We don't store a hash anymore; we store the whole context (the map of config ids and timestamps). See the sketch after this list.
  • It's still possible to skip stickiness, force stickiness, and release all stuck connections at once.
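To make the new model concrete, here is a rough illustration (the proxy ids and the exact serialisation are assumptions based on the cookie format discussed later in the review, not the gem's literal internals) of what the context amounts to under this change: a plain mapping from proxy/configuration ids to the timestamps at which their stickiness expires, joined into a single cookie value.

```ruby
# Illustrative only: 'mysql_main' and 'mysql_reports' are made-up proxy ids.
context = {
  'mysql_main'    => 1518270031.31, # stuck to master until this Unix timestamp
  'mysql_reports' => 1518270030.31
}

cookie_value = context.map { |proxy_id, expires_at| "#{proxy_id}:#{expires_at}" }.join('|')
# => "mysql_main:1518270031.31|mysql_reports:1518270030.31"
```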

Todo:

  • Support passing stickiness information via query args, as it's currently possible with http://mysite.com/path/to/resource?_mkra_ctxt=#{context}.

This is a very big change in code and in behaviour, so I totally understand any reluctance to merge it 😅 If this is too radical, I'm happy to settle on an intermediate solution, although I do believe the change in behaviour makes more sense as far as the idea of replication lag mitigation goes.

@bleonard

💥 Thanks @rosa
I'll look at it this week.
Have you put this into use yet in your app?

rosa commented Feb 12, 2018

@bleonard not yet! I have only tested it locally, but I'll let you know if we end up using it in production. I'll work on that this week 🤞

@bleonard left a comment

This is great, @rosa! I tried to be thorough out of respect for the great work. Very exciting!

The main thing is "Full Request Sticking", which I mention a few times: we should stay stuck for the whole request if it's a long one. Right now, I think the Time.now stuff might cause subtle issues in that world.

README.md Outdated

```ruby
Makara::Cache.store = MyRedisCacheStore.new
```

@bleonard:
It might be useful to still support this store= setter and give a deprecation warning.

@rosa:
Good idea 👍
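As a rough sketch of that suggestion (the wording and exact shape are assumptions, not the code that ended up in the PR), the old setter could stay around as a no-op that warns:

```ruby
module Makara
  module Cache
    # Kept only for backwards compatibility: the context now lives in a cookie,
    # so an external store is no longer used.
    def self.store=(_store)
      warn '[Makara] DEPRECATION: `Makara::Cache.store=` is deprecated and has no effect; ' \
           'stickiness is now persisted via a cookie.'
    end
  end
end
```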

README.md Outdated
```ruby
Makara::Cache.store = MyRedisCacheStore.new
```
To handle persistence of context across requests in a Rack app, makara provides a middleware. It lays a cookie named `_mkra_ctxt` which contains the current context. If the next request is executed before the cookie expires, that given context will be used. If something occurs which naturally requires master on the second request, the context is updated and stored again.
@bleonard:
You changed the cookie name to _mkra_stck.

@rosa:
Ooops, yes I did! I renamed it later on, thinking that using the same name could cause issues with already set cookies, and forgot to change this reference.
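For reference, in a plain Rack app the middleware described in the README excerpt above would be mounted like any other; a minimal sketch, assuming the class is Makara::Middleware and using MyApp as a placeholder application:

```ruby
# config.ru (illustrative sketch; MyApp is a stand-in Rack application)
require 'makara'

use Makara::Middleware
run MyApp.new
```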

README.md Outdated

```ruby
ctx = Makara::Context.generate # or any unique sha
Makara::Context.set_current ctx
Makara::Context.release_all
```

@bleonard:
Should they be able to release a single one?

@bleonard:
I see that you did. Just not in the readme. That's fine. Nice.

@rosa:
True! I need to add it to the readme 😅

## Todo

* Cookie based cache store?
@bleonard:
💥

seed ||= "#{Time.now.to_i}#{Thread.current.object_id}#{rand(99999)}"
Digest::MD5.hexdigest(seed)
end
IDENTIFIER = '_mkra_stck'
@bleonard:
in our app, we usually .freeze these. it's a habit now. have you seen benefits there?

def previously_stuck_to_master?
@sticky && Makara::Context.previously_stuck?(@id)
def stuck_to_master?
@sticky && !@skip_sticking && Makara::Context.stuck?(@id)
@bleonard:
Should this use should_stick? somehow? Or DRY that up?
It's just the two variables but seems semantically correct.
should_stick?("stuck_to_master_check", []) && Makara::Context.stuck?(@id)

end


# if we are configured to be sticky and we aren't bypassing stickiness
# If we are configured to be sticky and we aren't bypassing stickiness,
# (method and args don't matter)
def should_stick?(method_name, args)
@sticky && !@skip_sticking
@bleonard:
I wonder why we sent this method name and args here. probably so someone more clever than me could override them in some way. Not sure if that's needed.

@rosa:
Ahh! I forgot to comment on this! It's useful when implementing other kinds of proxies, like the AR adapter: https://github.com/taskrabbit/makara/blob/c6b158e847468379da8cc2eb69aee455bef386fb/lib/active_record/connection_adapters/makara_abstract_adapter.rb#L167-L171

@bleonard:
Oh yes, that's pretty key. I wonder if the above should use it with fake arguments or DRY up that check or leave it as-is. Unclear to me.

end


def stick_to_master(method_name, args, write_to_cache = true)
@bleonard:
As noted above, let's keep write_to_cache.

end

def hijacked?
@hijacked
end

def stick_to_master!(write_to_cache = true)
@master_context = Makara::Context.get_current
Makara::Context.stick(@master_context, @id, @ttl) if write_to_cache
@bleonard:
This write_to_cache thing seems important, but it really would complicate your Context checks.
Now that I think about it, things get a little weird in the new model with really long requests. It seems against the expected behavior to stick in a request, it be a long request, and then unstick later in that request. Let's call this "Full Request Sticking" and I'll put a note above in the context.

Here, let's keep the write_to_cache variable and pass it to the Context.

end

def stuck?(config_id)
data[config_id] && !expired?(data[config_id])
@bleonard:
For the "Full Request Sticking" concept, this method would look more like:

return false unless data[config_id]
return true if data[config_id][:ttl] # because it was asked to stick in this request
expired?(data[config_id][:expires_at])

rosa commented Feb 19, 2018

@bleonard thanks so much for your thorough review, I really appreciate it! I'll address each comment separately and will push changes soon.

You made a very good point about long requests; I hadn't thought about that. What you propose with "full request sticking" makes sense to me. Unsticking in the middle of the request doesn't seem very problematic (if master_ttl was chosen to allow for replication lag, queries that go to the replicas after that TTL has passed should be fine in theory, even if they happen in the same request), but sticking well before the request is over does seem problematic! I mean the part where sticking a config does Time.now.to_f + ttl; I definitely need to change that, and I will explore implementing the whole full request sticking concept.

Thanks again! 🙇‍♀️

@bleonard

Great job @rosa, look forward to seeing the updates.

rosa commented Feb 22, 2018

Ok! This took me a bit longer than I thought 😅 I think I have now addressed all the issues brought up. I have also done a fair bit of refactoring of my previous code.

The main changes with respect to the original implementation are:

Proxy instead of Configuration

I stopped using the concept of a configuration id, or of sticking a configuration, because it's confusing and doesn't make a lot of sense. I started using this terminology because in Makara::Context we were passing config_id around to build the cache key, so I continued with it. It made sense to me because I knew what I was talking about, but to someone not familiar with Makara it probably wouldn't (what can "sticking a configuration" possibly mean?). Then I realised that the id defined in the configuration actually identifies the proxy, and in fact we use it as @id inside the Proxy. With this change everything becomes much clearer: it's the proxy that gets stuck to master, not the configuration.

Full request sticking

Make the TTL "kick in" at the end of the request, when we move on to the next context and persist it. This is the "full request sticking" we talked about: if we stick to master in a long request, the stickiness could expire within that same request. This is no longer the case. After thinking about how to do this for a while, I ended up splitting the current context into two different sets of data: stored data and staged data. The stored data comes from the cookie; it has already been persisted before. Staged data is new for this request, and will be committed/stored at the end of the request.
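A rough sketch of that stored/staged split, just to illustrate the idea (the class and method names here are illustrative, not the PR's exact code):

```ruby
class SketchContext
  def initialize(stored_data = {})
    @stored_data = stored_data # proxy_id => absolute expiration timestamp (parsed from the cookie)
    @staged_data = {}          # proxy_id => TTL in seconds, staged during this request
  end

  # Sticking a proxy during the request only records the TTL; no clock is read yet.
  def stage(proxy_id, ttl)
    @staged_data[proxy_id] = ttl.to_f
  end

  # Stuck if it was staged in this request, or if the cookie says it hasn't expired yet.
  def stuck?(proxy_id)
    @staged_data.key?(proxy_id) ||
      (@stored_data[proxy_id] && @stored_data[proxy_id] > Time.now.to_f)
  end

  # Called once, at the end of the request: drop expired entries and turn staged
  # TTLs into absolute expiration times, so the TTL only starts counting now.
  def commit
    now = Time.now.to_f
    @stored_data.reject! { |_proxy_id, expires_at| expires_at <= now }
    @staged_data.each do |proxy_id, ttl|
      @stored_data[proxy_id] = now + ttl if ttl > 0
    end
    @staged_data.clear
    @stored_data
  end
end
```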

Sticking without persisting

Support Proxy#stick_to_master!(persist = false). While doing the "full request sticking" part, I realised I hadn't taken this into account. Someone can decide they want to stick to master for the current request, for a specific operation, but without necessarily staying stuck for the duration of master_ttl. This was achieved before via write_to_cache = false and an instance variable @master_context in the proxy. If write_to_cache was false, the context was only stored in that instance variable. Then on subsequent queries, if the current context was @master_context, it meant we were stuck, but we hadn't persisted this to the cache for the next request at all. Doing the same kind of implementation in this case doesn't quite work, because Context.release[_all] doesn't know about individual proxy instances and their instance variables. It worked previously because all we did was change the current context identifier, which indirectly "expired" @master_context. However, I found that using the new stored and staged areas in Context, I could implement this in an intuitive way: sticking without persisting is the same as adding a proxy to the staging area with a ttl = 0. That means: this proxy should be stuck for the current request and no more. Then, Context.release[_all] obviously clears the stored and staged data, so we can expire non-persisted stickiness as well.
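With the sketch above, the persist = false behaviour boils down to staging with a TTL of 0 (again, the names are illustrative):

```ruby
context = SketchContext.new

# stick_to_master!(persist = false): stuck for the rest of this request only.
context.stage('mysql_main', 0)
context.stuck?('mysql_main') # => true, for the current request
context.commit               # => {} (ttl 0 entries are never written to the cookie)

# stick_to_master!(persist = true) with master_ttl = 5 seconds.
context.stage('mysql_main', 5)
context.commit               # => { 'mysql_main' => <now + 5.0> }
```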


Apart from this, I addressed the other minor issues you noticed, like showing a warning for Makara::Cache.store = .., or gsubbing | and : in the proxy id.

I still need to test this with our own app, and see if we can get it to production 🤞

@bleonard left a comment

Looking great @rosa!
Just a few notes on possible edge cases and cookie lifecycle.

README.md Outdated

Makara handles stickyness by keeping track of a context (sha). In a multi-instance environment it persists a context in a cache. If Rails is present it will automatically use Rails.cache. You can provide any kind of store as long as it responds to the methods required in [lib/makara/cache.rb](lib/makara/cache.rb).
Makara handles stickiness by keeping track of which proxies are stuck at any given moment. The context is basically a hash of proxy ids and the timestamp until which they are stuck.
@bleonard:
this sounds like we create a hash (like MD5 sort of thing) of those things. Maybe "The context keeps track of the expiration timestamp for each proxy id."

@rosa:
You are right; I was influenced by the Hash class name. Perhaps calling it a dictionary or a mapping is better.

README.md Outdated

It'll keep the proxy stick to master for the current request, and if `persist = true` (default), it'll be also stored in the context for subsequent requests, keeping the proxy stuck up to the duration of `master_ttl` configured for the proxy.
@bleonard:
"stuck"

@rosa:
😅 Thanks!

## Todo

* Cookie based cache store?
* Support for providing context as query param
@bleonard:
Good idea!

@bleonard:
Maybe path/here?makara_stick=proxy_one|proxy_two
Not sure if need timestamps. In that case could just be the whole cookie string.

@rosa:

Yes, something like that! Either the whole cookie string, or just one, and merge it together with the cookie string 🤔

# Keeps track of the current and previous context (hexdigests)
# If a new context is needed it can be generated via Makara::Context.generate

# Keeps track of the current stickiness state for different Makara configurations
module Makara
@bleonard:
👍

end

def stage(proxy_id, ttl)
staged_data[proxy_id] = ttl.to_f
@bleonard:
I think you'll want to do a max here of the ttl - imagine it did a write_to_context=true and then write_to_context=false in the same request. This would end up at zero but probably should have ended up persisting.

Maybe:

[staged_data[proxy_id].to_f, ttl.to_f].max

@rosa (Feb 25, 2018):

Oohh, good point! I'll add a test for this! Not only in the same request, but also if it came from the previous request and it still hasn't expired.
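Folded into the earlier sketch, the suggested max-merge would keep a later stage call from lowering an earlier TTL for the same proxy (rosa's follow-up about unexpired values carried over from the cookie would need extra handling on top of this):

```ruby
class SketchContext
  # nil.to_f is 0.0, so the first stage for a proxy behaves exactly as before;
  # staging ttl = 0 after a ttl > 0 in the same request keeps the larger value.
  def stage(proxy_id, ttl)
    @staged_data[proxy_id] = [@staged_data[proxy_id].to_f, ttl.to_f].max
  end
end
```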

# the current store
def commit
release_expired
store_staged_data
@bleonard:
do we want to store_staged_data before the release_expired so that the 0 values don't get written?

@rosa:
Hmm... the 0 values won't get written in any case, since we'll only store proxy_id, ttl pairs with ttl > 0. I don't see any other reason why we should check for expiration of data staged in this request; in theory no staged data can be expired, because we'll be comparing Time.now + ttl (with ttl > 0) against Time.now, so I believe this is the right order. Am I missing anything?


# Pairs of {proxy_id}:{timestamp}, separated by "|"
# proxy_id1:1518270031.3132212|proxy_id2:1518270030.313232 ..
def parse(cookie_string)
@bleonard:
I don't think this is used anymore (or there is a duplicate in the new Cookie class).

@rosa:
Oooops, yes! I moved it to Cookie and forgot to remove it from here! 😅
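For reference, a minimal sketch of parsing that proxy_id:timestamp format (the real version now lives in the Cookie class, as noted above):

```ruby
# Illustrative parser for "proxy_id1:1518270031.31|proxy_id2:1518270030.31".
def parse(cookie_string)
  return {} if cookie_string.nil? || cookie_string.empty?

  cookie_string.split('|').each_with_object({}) do |pair, data|
    proxy_id, timestamp = pair.split(':')
    data[proxy_id] = timestamp.to_f
  end
end

parse('proxy_id1:1518270031.31|proxy_id2:1518270030.31')
# => {"proxy_id1"=>1518270031.31, "proxy_id2"=>1518270030.31}
```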

end

def store(context_data, headers, options = {})
unless context_data.nil?
@bleonard:
If I follow everything correctly, this could be nil if there were no changes to the cookie (@was_dirty == false); I'm not clear on the situation. Do you have to set the header here to keep the cookie around? That is, the TTL was longer than the request, so there were no changes, but there still needs to be a cookie. In other words, does it pass through by default? In that case, do we ever delete the cookie?

@rosa (Feb 25, 2018):

> this could be nil if there were no changes to the cookie

Correct!

> Do you have to set the header here to keep the cookie around?

No! If you don't send anything, the cookie will be kept as it was. That was my idea in returning nil here: if there is no next context, then there is no cookie sent.

> In that case, do we ever delete the cookie?

We do! When the next context exists and it's empty, we send a cookie with max-age=0 and expires=<current time>. That's a way of deleting the cookie.

@bleonard:
I see. thanks.
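Sketched out, the deletion case rosa describes could look roughly like this (the header construction is illustrative; the middleware's actual code may differ):

```ruby
require 'time'

# When the next context exists but is empty, send an already-expired cookie so
# the browser drops it: max-age=0 plus an expires set to the current time.
def deletion_cookie(cookie_name = '_mkra_stck')
  "#{cookie_name}=; path=/; max-age=0; expires=#{Time.now.httpdate}"
end

deletion_cookie
# => "_mkra_stck=; path=/; max-age=0; expires=Tue, 20 Mar 2018 06:24:00 GMT" (for example)
```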

now = Time.now

cookie[:max_age] = if context_data.any?
(context_data.values.max - now.to_f).ceil + 5
@bleonard:
might be nice to make this 5 a BUFFER_TIME constant or something to be clear what it is semantically.

@rosa:
Yes, much better, great suggestion.
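A sketch of the agreed change, with the slack pulled into a named constant (BUFFER_TIME is an illustrative name, and the empty case assumes the max-age=0 deletion behaviour discussed earlier):

```ruby
BUFFER_TIME = 5 # seconds of slack past the latest expiration

def max_age_for(context_data, now = Time.now)
  return 0 if context_data.empty?

  (context_data.values.max - now.to_f).ceil + BUFFER_TIME
end

max_age_for({ 'mysql_main' => Time.now.to_f + 4.2 }) # => 10 (4.2s rounded up to 5, plus the buffer)
```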

end


def stick_to_master(method_name, args, write_to_cache = true)
@bleonard:
do we need to pass through that write_to_cache=true (now persist=true) thing here to stick_to_master! ?

@rosa:
I don't think it's necessary here: stick_to_master is private to the proxy, and whenever we call it, it's because we came across some action that made us use master, and in that case we always want to persist the value for the master_ttl duration.

rosa commented Feb 25, 2018

Thanks for the second thorough review @bleonard! I'm quite enjoying working on this 😄

rosa commented Feb 26, 2018

Last set of changes pushed. In the end I couldn't get to try this with our app last week; I'll see if I can do that this week 💪


id.gsub(/[\|:]/, '').tap do |sanitized_id|
if sanitized_id.size != id.size
Logging::Logger.log "Proxy id '#{id}' changed to '#{sanitized_id}'", :warn
@bleonard:
Minor thing. Might want to say "Makara" here so they have more context.

@rosa (Feb 27, 2018):

True. For some reason I had assumed without checking that the logger would tag the lines with [Makara], but I realised it was added manually on exception handling. What I have done instead is to add this by default to all lines logged through Makara::Logging::Logger.log: 51d32a5 What do you think?
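As a rough sketch of the idea in 51d32a5 (the real logger's interface may differ slightly from this), every message funnelled through the Makara logger gets the tag added in one place:

```ruby
module Makara
  module Logging
    class Logger
      class << self
        attr_accessor :logger

        # Tag every line so callers don't have to repeat the "[Makara]" prefix.
        def log(message, level = :error)
          logger.send(level, "[Makara] #{message}") if logger
        end
      end
    end
  end
end

Makara::Logging::Logger.log "Proxy id 'my|id' changed to 'myid'", :warn
```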


bleonard commented Mar 4, 2018

@rosa I took some time today to get master to pass Travis. If you want to rebase on that, feel free.

rosa force-pushed the stickiness-without-context branch from 51d32a5 to 1a1b2b6 on March 4, 2018 21:15
rosa commented Mar 4, 2018

> I took some time today to get master to pass Travis. If you want to rebase on that, feel free.

Excellent! 🎉 I had looked briefly at the Postgres-related failures, but the only thing I did was to specify the pg gem version as ~> 0.18, which didn't fix the tests locally for me (surely something set up wrongly on my machine). Thanks so much for fixing this!

BTW, I started testing this PR on our app last week, and if everything goes well, we should be running it in production this week 🤞

rosa commented Mar 19, 2018

Hey @bleonard, sorry for the delay! Testing this in our app was delayed because of other unrelated infrastructure tasks, but I finally got there last week. We've been using this version of Makara in production for 4 days already, and no issues so far 😄

@bleonard

That's great!
I merged a few more things into master from other PRs. Want to rebase and try that?

bleonard commented Mar 19, 2018

I tagged that version as v0.3.10
I think your update deserves a semantic version bump at least :-)

Instead of keeping track of two MD5 hashes that allow us to be stuck to
master for 1 or 2 requests at most, keep whether we are currently stuck
or not, for each configuration.

The context represents the current stickiness state. It knows whether we
are stuck to master for each of the different Makara configurations we
have defined. It's persisted between requests via a single cookie we set,
whose max age depends on the stickiness duration for each of the
configurations we store at any given moment.

These were handled directly by the middleware before, but now that
cookie setting is done in Context, support for this should be moved
here.

We'll rely exclusively on the cookie store for the context.

It's always set to the current context when the pool is initialised, and
then compared against the current context as well. The only case in
which it could be relevant is when the context is forced to be a new one
manually, in the middle of a request. This will be handled in a
different way.

We don't need to track any context hashes anymore, just whether we are
currently stuck in this request or not.

… request

It's enough to stay stuck until the cache expires or all connections
are released (which will be added later). Releasing all connections no
longer depends on a specific context, so all we need to rely on is the
cache.

rosa added 17 commits March 20, 2018 07:24

The id defined in the configuration for a proxy was being referred to as
`config_id` in several places, so I had kept it, but it was certainly
confusing; it wasn't easy to know what I was talking about. That id
really identifies the proxy, so there is no point in using a misleading
name for it.

And while at it, increase the buffer for the cookie's max-age to 5 seconds
instead of 1. It shouldn't do any harm, and it can be easily overridden.

This ensures support on old IE versions, while for other browsers it'll
simply be ignored in favour of `max-age`.

Now the context doesn't know anything about cookies, only about the hash
with the data.

…uest

That is, when we stick a proxy to master in a request and the request is
very long, if we set the expiration time based on the TTL right at the
moment of sticking, we might unstick before the request even ends. This
is probably not the intention, so instead of setting the expiration time
right when the proxy is stuck, we wait until the context is committed,
and at that time we move the newly stuck proxies from the staging area
to the storage area.

That is, Proxy#stick_to_master!(false) should stick the proxy to master
for the current request but not beyond that. We shouldn't store that in
the cookie for future requests.

These characters are used as separators in the context cookie.

Freeze time when starting the request and always use that time to check
whether an entry has expired, instead of considering it expired at any
point during the request. Also, always use the max TTL seen for any
staged or stored entry.
rosa force-pushed the stickiness-without-context branch from 1a1b2b6 to 5b1f0a7 on March 20, 2018 06:24
rosa commented Mar 20, 2018

> I merged a few more things into master from other PRs. Want to rebase and try that?

Done! Basecamp 3 is now running version v0.3.10 + this pull request in production.

@bleonard

💥 I think I'll release v0.3.10 as-is and then v0.4.0 with this PR.
Do you know of any breaking changes? I think it should be relatively straightforward. To my knowledge, there is nothing new to do.

bleonard merged commit 8dd258e into instacart:master on Mar 20, 2018
rosa commented Mar 21, 2018

\o/ Thanks! I think the only (obvious) breaking change is that anything depending on the previous context cookie won't work, and that MemoryStore has disappeared.

rosa deleted the stickiness-without-context branch on March 21, 2018 09:51
rosa restored the stickiness-without-context branch on March 21, 2018 15:02
bleonard commented Mar 21, 2018

I suppose, though I never considered someone in Ruby or JS reading that cookie.
We'll see.

There were some Ruby < 2.3 errors with the deprecation string that I fixed on master.
So it and the 0.4.0 tag should be good to go.

I'm about to embark on a pretty serious sabbatical and this is a big change, so I'll let people upgrade to 0.3.10 and/or try this tag out and push this up in May.

rosa commented Mar 22, 2018

> There were some Ruby < 2.3 errors with the deprecation string that I fixed on master.

Ouch, sorry about that, and thanks for fixing them 🙏

> I'm about to embark on a pretty serious sabbatical and this is a big change, so I'll let people upgrade to 0.3.10 and/or try this tag out and push this up in May.

Perfect 👍 Enjoy your sabbatical and thanks so much for all your work and help with this! 🎉

rosa mentioned this pull request on May 9, 2018
bleonard commented May 9, 2018

We've been using the v0.4.0 tag for a while and things seem good.
I've pushed it up to ruby gems: https://rubygems.org/gems/makara/versions/0.4.0

Congratulations @rosa - this is a great release. 🎉

rosa commented May 9, 2018

Thanks @bleonard, and welcome back! We've also been using 0.4.0 since 20th March with no issues 😄

Thanks again for all your help!
