Some Python tips for experienced developers🐍

This is a collection of tips from a notion document for onboarding new developers at my current work that I started to help experienced developers who may not know the ins and outs of python. It's not exhaustive or supposed to contain unique insights you couldn't find elsewhere, it's just that after seeing many developers leave "Protip: Do (x)" in slack, I decided it'd be a good idea to collect them together. It's also mostly here for my benefit, so I keep a copy of it somewhere for my own reference.

There's quite a lot pythonic python that is non-obvious and just requires you to 'be in the know'. This is a collection of comments seen repeatedly in code review or a given as a helpful tip on slack. Hopefully it will help new comers get up to speed on common mistakes (that we've all made) and avoid them.

Logging

Don't format the string before sending it to the logger

Prefer

logger.info("Order: %s for User:%s went wrong", order.id, user.id)

over

logger.info("Order: {} for User: {} went wrong".format(order.id, user.id))
logger.info("Order:%s for User: %s went wrong" % (order.id, user.id))
logger.info(f"Order: {order.id} for User:{user.id} went wrong")

The former only formats the string if the logger gets evaluated, the latter evaluates the string formatting before it gets sent to the logger

Formatting python log messages - Reinout van Rees

Sentry is also able to group messages with the same template together, so the first would result in a single sentry error for all user and order combination.

The last three would have a different sentry entry for each order/user combo.

Use `logger.exception` when logging inside exception handlers

If you're handling an exception and want this data to appear in sentry, the logger.error(exc_info=True) and logger.exception() are equivalent.

Use `raise from` for reraising exceptions with a different type

raise from is often useful when you want to capture a third party library exception but raise it as an internal exception.

def some_django_rest_api():
    # Our APIs should only raise APIException types.
    try:
         stripe.Charge.something():
    except StripeError as e:
         raise PaymentExternalError("blah") from e

Python "raise from" usage - Stack Overflow

This means the original exception detail and traceback will appear in sentry.

Use `LogEvery` for repeated log messages

Logging the status syncing of 50k users can result in many repeated messages in logging/datadog. You can use LogEvery to group the messages together.

total_count = qs.count()
with LogEvery(log_every=10, total_count=total_count) as le:
    for each in qs.iterator():
        ... do something ...
        le.tick()

# Or, for the common case:
for each in LogEvery(log_every=10, total_count=total_count).iter(qs):
    ... do something ...

class LogEvery:
    """
    Log things periodically. Typical usage is something like this:
    ```
    total_count = qs.count()
    with LogEvery(log_every=10, total_count=total_count) as le:
        for each in qs:
            ... do something ...
            le.tick()
    You'll get a log message every 10 items to help you track progress.
    If you do supply total_count, you'll get richer messages including
    percentage completion and estimated time to finish.
    You don't have to use this as a context manager, but the start and finish
    log messages and timing data will be more accurate if you do.
    The above case is actually quite common, and there's a shortcut you can
    use:
    total_count = qs.count()
    for each in LogEvery(log_every=10, total_count=total_count).iter(qs):
        ... do something ...
    In this usage, you don't have to remember to call tick() yourself, the
    LogEvery instance will do that for you.
    """

    log_every: int
    total_count: Optional[int] = None

    name: str = ''
    count: int = 0
    log_start: bool = True
    log_finish: bool = True
    logger: logging.Logger = dataclasses.field(default_factory=logging.getLogger)

    msg: str = 'Processed {count} items'
    pct_msg: str = '{pct:02.2f}% Processed {count} of {total_count} items. Est finish: {finish}'
    start_msg: str = 'Starting'
    finish_msg: str = 'Finished'
    finish_with_error_msg: str = 'Finished with error'

    _started_at: Optional[float] = None
    _msg: Optional[str] = None

    def tick(self):
        self._start()
        self.count += 1
        if self.count % self.log_every == 0:
            self.log()

    def log(self):
        # We need to catch both total_count is None and total_count == 0 to avoid
        # a division by zero. If we do have a total_count, then we can output
        # some more advanced stats - % completed, estimated completion time, etc.
        if self.total_count:
            completion = self.count / self.total_count
            duration_secs = time.monotonic() - self._started_at
            estimated_secs = ((1 / completion) * duration_secs) - duration_secs
            finish = dt.datetime.now(dt.timezone.utc) + dt.timedelta(
                seconds=estimated_secs
            )
            info = {
                'count': self.count,
                'total_count': self.total_count,
                'pct': completion * 100,
                'finish': finish.strftime('%Y-%m-%d %H:%M:%S %Z'),
            }
            formatted = self.pct_msg.format(**info)
            self._info(formatted)
        else:
            self._info(self.msg.format(count=self.count))

    def iter(self, it):
        with self:
            try:
                for each in it:
                    yield each
                    self.tick()
            except GeneratorExit:
                # Our consumer (ie. caller) stopped consuming. That's OK,
                # but we'll swallow the GeneratorExit to prevent it being logged
                # as a failure.
                pass

    def start(self):
        if self.log_start:
            self._info(self.start_msg)

    def finish(self, exc_type, exc_val):
        if self.log_finish:
            if exc_type is not None:
                self._exception(self.finish_with_error_msg)
            else:
                self._info(self.finish_msg)

    def __enter__(self):
        self._start()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.finish(exc_type, exc_val)

    def _start(self):
        if self._started_at is None:
            self._started_at = time.monotonic()
            self.start()

    def _info(self, msg, *args, **kwargs):
        self.logger.info(self._p(msg), *args, **kwargs)

    def _exception(self, msg, *args, **kwargs):
        self.logger.exception(self._p(msg), *args, **kwargs)

    def _p(self, msg):
        # Given a message, prefix it with our name if one is defined. Otherwise
        # just return the message as-is.
        if self.name:
            return f'{self.name} {msg}'
        else:
            return msg

Structuring Your Code

utils.py and util.py considered evil

If you are going to create a file called utils.py or add any function to an already exiting one: think again

putting things into utils makes it a bag full of everything from simple datetime calculations to sophisticated validators and Class mixins.

utils is a nondescriptive name so nobody including you will know what has been put there in two weeks

there is always a better place with a better name than utils

if something is more generic than the project you are working in/requires sharing - move it to a shared library, but please don't call it utils there 🙃

Some examples from when we have already moved from various utils:

file_size_validator was moved to helpers/validators
GatherSuppressedErrorsFromFormMixin to helpers/mixins
calculate_days_until to helpers/datetime

a few topic related files can reside in their own directory, so we have helpers/forms/ with form related modules

Also related reads:

Style

Avoid long calculations or database actions for `@property`

SomeModel.hello gives no indication to the reader that hello performs cpu/memory intensive calculations, IO or database access. Most readers would assume that SomeModel.hello is just a class attribute. It's better to leave it as a method SomeModel.hello(). Naming is difficult, but SomeModel.calculate_hello() or SomeModel.fetch_hello() helps indicate to other developers its doing something else under the hood.

Prefer to raise exceptions instead of returning errors

Avoid returning errors, prefer to raise an exception and the callee can handle that instead in a try/catch block.

Best practice in python for return value on error vs. success - Stack Overflow

def i_love_horses(horses):
    errors = []
    for horsie in horses:
        if not is_a_horse(horsie):
            errors.append(HorseException("not a horse")
    return errors

# callee
errors = i_love_horses(my_horses)
if errors:
    # do some error handling

def i_love_horses(horses):
    errors = []
    for horsie in horses:
        if not is_a_horse(horsie):
            errors.append(HorseException("not a horse")
    if errors:
        raise ValidationError("those are not horses", errors=errors)

# callee
try:
    i_love_horses(my_horses):
except ValidationError:
    log

We've got exceptions in python, so use them to indicate errors, in python use if/else for error handling costs more than try/except see

Using try vs if in python - Stack Overflow

Testing

Factoryboy

Use pytest factory fixtures instead of importing factories

from blah.factories import CampaignFactory:

def test_something_bad(): # Boo!
    CampaignFactory.create()

def test_something_good(campaign_factory):  # Yay!
    campaign_factory.create()

Welcome to pytest-factoryboy’s documentation! — pytest-factoryboy 2.0.2 documentation

Use model fixtures when you just need a default instance

The two below are equivalent (in most cases)

def test_a_thing(campaign_factory):
    campaign = campaign_factory.create()
    do_something(campaign)

def test_a_thingy(campaign):
    do_something(campaign)

The first uses the factory fixture to create a model instance, the second is the model fixture automatically created by pytest-factory boy see (https://pytest-factoryboy.readthedocs.io/en/latest/#model-fixture)

Subfactories allows generation of objects that span lookups

use:

travel_manager = factory.SubFactory(TravelManagerFactory)

over:

travel_manager = factory.LazyFunction(lambda: TravelManagerFactory.create())

which means we can use sub properties:

WaitListFactory.create(travel_manager__name='tomek')

Use SelfAttribute to generate models with the same parent object

class BasketFactory:
    user = factory.SubFactory(UserFactory)

class OrderFactory:
    user = factory.SubFactory(UserFactory)
    basket = factory.SubFactory(BasketFactory)

order = OrderFactory.create()

In this example order.user != order.basket.user

You can make the Order have the same ambassador as the factory by using SelfAttribute

class BasketFactory:
    user = factory.SubFactory(UserFactory)

class OrderFactory:
    user = factory.SelfAttribute('..basket.user')
    basket = factory.SubFactory(BasketFactory)

order = OrderFactory.create()

python - Factory Boy and related objects creation - Stack Overflow

Strings

Avoid `+=` when joining/concatenating strings, use `.join`

How slow is Python's string concatenation vs. str.join? - Stack Overflow

Use format_map when string formatting with dict data

If your data is in a dictionary, perhaps from a JSON file, then str.format_map() is your new best friend.

Pretty way:
print('{name} works at {status} {company}'.format_map(info))
Less pretty:
print(f'{info["name"]} works at {info["status"]} {info["company"]}')

From Raymond Hettinger's tweet

You could also do:

print('{name} works at {status} {company}'.format(**info)

But this generates an extra copy of the dictionary with the **info, see Built-in Types — Python 3.8.3 documentation

Django

Don't fetch an object if you don't need it

Django creates a field for foreign keys called {your_field}_id, which you can often use to avoid fetching the related object.

Model field reference | Django documentation | Django

If you have a `null=true` field, you probably want it `blank=true`

null controls whether the field IS NULL on the database level

blank controls whether the field is required on forms (and hence the admin)

python - differentiate null=True, blank=True in django - Stack Overflow

Avoid `NULL` `CharFields`

Note: for CharFields you don't want null or blank to be True. Having a CharField with null value makes it have two types of empty values because both '' and None is an empty state, but one was filled in and the second wasn't.

From Django docs:

Avoid using null on string-based fields such as CharField and TextField. If a string-based field has null=True, that means it has two possible values for “no data”: NULL, and the empty string. In most cases, it’s redundant to have two possible values for “no data;” the Django convention is to use the empty string, not NULL. One exception is when a CharField has both unique=True and blank=True set. In this situation, null=True is required to avoid unique constraint violations when saving multiple objects with blank values.

Model `Meta` ordering is not free

Model Meta options | Django documentation | Django

if you add an ordering as a Meta option, it will apply to all queries. Even if you need an ordered queryset for your particular usage, it might not be needed all possible queries. It's better just to order it for specifc case.

Consider django-fsm when adding models status fields

If you need to add a status field to a model, consider using django-fsm (see Order.status for an example)

GitHub - viewflow/django-fsm: Django friendly finite state machine support

Prefer properties for date based status fields

If you’re adding a field like is_paid or is_complete or is_smushed, it’s almost always better to have a datetime field paid_at, completed_at, smushed_at and expose boolean as a model property:

class Order(models.Model):
    paid_at = models.DateTimeField(null=True)

    @property
    def is_paid(self):
        return self.paid_at is not None

Always assume new type will be added to a set of choices

If you're adding a status, option, choice or type-like field, it is best to account for the possibility that requirements will change and another option is going to be added later. You can raise a NotImplementedError to account for any status field

def your_function():
    if reservation.status == OPEN:
        ....
    elif reservation.status == COMPLETED:
        do_something()
    else:
        raise NotImplemented()

Often when a new option is added, the developer adding the new field will not be aware of all the areas the set of choices are used. So we need a way for a test to fail so that we catch these problems in development rather than being surprised and react to it in production.

@pytest.mark.parametrize('status',  Reservation.SomeBookableStatus.values)
def test_your_function():
    ...

If you parametrize all the values in your choices rather than the ones that currently exist, then if a new choice is added, test_your_function will fail with a NotImplemented error.

Celery (ugh)

Use per task queues instead of general ones

This means we can easily fire up one of dynos to work through specific queues/tasks if a queue suddenly gets very large. Also in RabbitMQ one queue == one process == one cpu, so using multiple queues lets us make use of multiple CPUs.

Avoid doing requests to multiple external services in one task / queue

If one of the external services becomes unavailable, we want to avoid being blocked by that - try to separate external calls to separate tasks, and each of these tasks to separate queues. In that case we can process at least part of the bigger task.

Use retries, prepare for tasks which may fail

Consider the scenario when task fails - because of some external service being down, or for example db state being different than the one task expects. Use https://docs.celeryproject.org/en/latest/userguide/tasks.html#retrying Retry mechanism with backoff to be sure that the failed task will be processed when external service recovers.

Prepare for complete failure of the task. Log the failure. Will it affect user flow? If yes, prepare a separate mechanism like cron, which will respawn the tasks which failed completely.

Avoid renaming/removing celery tasks

Do NOT rename celery/consumer tasks, it’s a destructive operation.You might risk loosing all pending messages in the queue.

Read #7 for more info here

Common Issues Using Celery (And Other Task Queues) - Adam Johnson

if you have to rename, do it in two steps:

create proxy task with the same name (which will be calling new tasks with same payload)
remove proxy task after some time (in next releases)

Requests

Add a default timeout when making requests

requests does not a timeout by default!

Quickstart — Requests 2.24.0 documentation

Define a mapping of constants TIMEOUTS in a configuration file and use those entries for specific timeouts.

Postgres

Avoid CharFields with max_length use CharFieldUnlimited

There's no performance difference in postgres between varchar(n) varchar and text, so prefer the latter two over the former.

Use CharFieldUnlimited when you want single line django admin input widgets and TextField if you want multiline text fields.

Datetimes and Timezones

Prefer django.utils.timezone.now() over datetime.datetime.now()

Time zones | Django documentation | Django