
Investigate no comms in docker #1113

Open: wants to merge 11 commits into base: master

Conversation

Tomasz-Kluczkowski (Contributor) commented Jun 7, 2021

Add a guard to Flower's boot sequence that waits for the broker. This prevents a situation where Flower boots in an incorrect state and is never able to communicate with the broker, even if it becomes available later on.
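For context, a minimal sketch of the guard idea. The helper name is_broker_connected, the _error_handler callback and the max_retries argument mirror the snippets shown further down in this PR; the defaults and logging wording are illustrative and may differ from the actual diff:

    import logging

    from kombu.exceptions import OperationalError

    logger = logging.getLogger(__name__)


    def is_broker_connected(celery_app, max_retries=10):
        """Return True once the broker accepts a connection, False when retries run out."""
        def _error_handler(exc, interval):
            logger.warning(f'Broker not reachable yet ({exc}), retrying in {interval}s...')

        with celery_app.connection() as conn:
            broker_url = conn.as_uri()
            try:
                # kombu retries the connection itself and calls _error_handler between attempts
                conn.ensure_connection(errback=_error_handler, max_retries=max_retries)
                logger.info(f'Established connection to broker: {broker_url}. Starting Flower...')
                return True
            except OperationalError as exc:
                logger.error(f'Cannot connect to broker: {broker_url}. Error: {exc}.')
                return False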

I appreciate that I did not solve the real issue of Flower being unable to call the broker's URL despite the fact that it is reachable, but:

  • this issue is only a problem if we boot without a broker. If the broker disappears after Flower has successfully booted with a broker present and then comes back, Flower operates as usual and can access the broker again.
  • if we protect Flower from booting without the broker, the issue is resolved.

A smarter solution (for which I am afraid I do not have time) would be to understand why Flower/Tornado cannot call a simple URL using the async HTTP client after Flower has booted without the broker. I tried but lost my patience :). Tornado is not something I deal with in my work life, but I am open to enlightenment :)
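For reference, the kind of call I mean is a plain Tornado async fetch of the broker's HTTP API, something along these lines (the endpoint and credentials below are placeholders for illustration, not Flower's actual code):

    from tornado.httpclient import AsyncHTTPClient

    async def probe_broker_api():
        # Hypothetical RabbitMQ management API endpoint; host and credentials are placeholders.
        client = AsyncHTTPClient()
        response = await client.fetch(
            'http://rabbitmq:15672/api/overview',
            auth_username='guest',
            auth_password='guest',
        )
        return response.code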

I also changed the Dockerfile and removed the ENTRYPOINT, since we no longer have flower as a standalone command and we also need to be able to fully override the command that Flower's image starts with, in order to specify the broker options and so on.

Small fixes to some docs where I noticed the old way of doing things was still in use.

@mher
I have tested this code locally and also by replacing the image that the Dockerfile uses with an image based on this branch in my fork of Flower and running docker-compose up with various options: I used RabbitMQ and Redis, and checked booting Flower both with and without them running. The new code makes Flower wait until the broker is running, or exit if max_retries is exceeded.

Image showing the new behaviour: (screenshot from 2021-06-07 21-21-32)

Related to issue: #1112

This function is now called before flower.start() to make sure we have a working broker, so that starting Flower sets it up correctly.
This is a first attempt at solving the issue of Flower booting up in a broken state.

There can be a race condition from the moment we check the connection to the moment we call flower.start(), although it is unlikely.
I will ponder on that...
As of Celery 5 the command-line invocation no longer supports specifying Celery args after the flower command.
Also add depends_on for flower.
We no longer have `flower` as a separate command, and this entrypoint was incorrect.
We also need to be able to fully override the arguments for Celery and Flower via command-line arguments when using the image.
@@ -48,7 +50,12 @@ def sigterm_handler(signal, frame):
sys.exit(0)

signal.signal(signal.SIGTERM, sigterm_handler)

if not is_broker_connected(celery_app=app):
Tomasz-Kluczkowski (Contributor Author) commented on the diff above:
Here, in theory, the connection is checked but could drop before we call flower.start().

The chances are super small though.
I will think about a better way of doing this, but for now it is good enough.

Tomasz-Kluczkowski (Contributor Author) commented:

I looked at the kombu code:

    def ensure_connection(self, *args, **kwargs):
        """Public interface of _ensure_connection for retro-compatibility.

        Returns kombu.Connection instance.
        """
        self._ensure_connection(*args, **kwargs)
        return self

and it follows a similar pattern (check the connection, then return a value), so I think we should be fine checking the connection and then calling flower.start().
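In other words the boot sequence becomes a plain check-then-start. A sketch of that pattern, mirroring the diff hunk shown earlier (the exit code and comments are illustrative, not the exact PR code):

    import sys

    if not is_broker_connected(celery_app=app):
        sys.exit(1)  # max_retries exceeded; let the orchestrator restart the container
    flower.start()   # the broker was reachable a moment ago, so Flower can set itself up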

conn.ensure_connection(errback=_error_handler, max_retries=max_retries)
logger.info(f'Established connection to broker: {broker_url}. Starting Flower...')
is_connected = True
except OperationalError as e:
Tomasz-Kluczkowski (Contributor Author) commented on the diff above:

I need to find out whether kombu can raise other errors and add them here.
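If kombu can indeed surface other errors here, one option (a hedged sketch, not the PR's code) is to widen the except clause to kombu's base exception plus raw socket errors:

    from kombu.exceptions import KombuError  # base class; OperationalError derives from it

    try:
        conn.ensure_connection(errback=_error_handler, max_retries=max_retries)
        is_connected = True
    except (KombuError, OSError) as exc:
        # OSError covers low-level socket failures that might escape kombu's own wrapping
        logger.error(f'Cannot connect to broker: {broker_url}. Error: {exc}.')
        is_connected = False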

logger.error(f'Cannot connect to broker: {broker_url}. Error: {exc}. {next_step}')

try:
conn.ensure_connection(errback=_error_handler, max_retries=max_retries)
Tomasz-Kluczkowski (Contributor Author) commented on the diff above:

We should start adding these to this file as changes are merged; I had a hard time remembering what I have done over the last few months.
Tomasz-Kluczkowski (Contributor Author) commented:

@mher I have made a tiny change inspired by other usages of Celery connections: when checking for the presence of the broker at Flower's startup, we can reuse a connection from the pool on the celery_app object instead of creating a new one. Not a big deal since this is done only once, but good practice I think.
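For reference, a sketch of what the pooled-connection version can look like (hedged; the PR's actual code may differ). celery_app.connection_or_acquire() borrows a connection from the app's pool and hands it back afterwards, instead of building a fresh one:

    # Borrow a connection from celery_app's pool rather than creating a new one.
    with celery_app.connection_or_acquire() as conn:
        conn.ensure_connection(errback=_error_handler, max_retries=max_retries)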

This whole set of changes is now active in our production and staging Kubernetes clusters and is working fine.

Please let me know if you are happy with this PR :).

I was asked for more Prometheus metrics, and I also see a fix needed for the queuing-time metric, so I will open another PR for those.

I also wanted to ask here if you would be OK with adding some project-wide standards in another PR:

  • linting
  • code formatting (black). Here I have a question: would you be OK with making the line length 120 chars, as we are in the 21st century :P and 79-80 seems like such a last-century setting :).
  • git pre-commit hooks to enforce the above
  • a step in the CI pipeline to enforce the above

Finally, if you are aware of anything that needs fixing, please let me know and I will try to help.

nicolaerosia commented:

@Tomasz-Kluczkowski I'm not @mher, but I guess nobody is against what you mentioned these days, and I would even go further and add:

  • static checking (mypy?)
  • typing annotations

If you're considering black, I would do it from day 0 with a line length of 120 to avoid later noise.

mher (Owner) commented Jul 5, 2021

@Tomasz-Kluczkowski can you move docker updates into a separate pull request?

Tomasz-Kluczkowski (Contributor Author) commented Jul 5, 2021 via email

Tomasz-Kluczkowski (Contributor Author) commented:

@mher Done in #1121

What are your thoughts on the bug I was trying to fix in this PR?
I am using this fix in our production and it is fine, although I would prefer a more elegant solution, so maybe you have ideas?
