Django: optimizing a query with spread data

I have Order objects and OrderOperation objects that represent actions on an Order (creation, modification, cancellation).

Conceptually, an order has 1 to many order operations. Each time there is an operation on the order, the total is computed in that operation. This means that when I need the total of an order, I just read the last order operation's total.

The simplified code

from decimal import Decimal
from typing import Optional

from django.db import models


class OrderOperation(models.Model):
    order = models.ForeignKey('Order', on_delete=models.CASCADE)
    total = models.DecimalField(max_digits=9, decimal_places=2)


class Order(models.Model):

    @property
    def last_operation(self) -> Optional['OrderOperation']:
        try:
            qs = self.orderoperation_set.all()
            return qs[len(qs) - 1]
        except AssertionError:
            # Negative indexing (no operation): querysets raise AssertionError
            # here, so IndexError can not happen.
            return None

    @property
    def total(self) -> Optional[Decimal]:
        last_operation = self.last_operation
        return last_operation.total if last_operation else None

The issue

Since I have lots of orders, every time I want to do a simple filter like "orders that have a total lower than 5€", it takes a long time, because I have to walk through all the orders in Python, using the following, obviously bad, query:

all_objects = Order.objects.all()
Order.objects.prefetch_related('orderoperation_set').filter(
    pk__in=[o.pk for o in all_objects if o.total <= some_value])

My current ideas / what I tried

Data denormalization?

I could simply create a total attribute on Order, and copy the operation total to the order total every time an operation is created. Then Order.objects.filter(total__lte=some_value) would work. However, before duplicating data in my database, I'd like to be sure there is no easier/cleaner solution.

Using annotate() method?

I somehow expected to be able to do: Order.objects.annotate(total=something_magical_here).filter(total__lte=some_value). It seems it's not possible.

Filtering separately then matching?

order_operations = OrderOperation.objects.filter(total__lte=some_value)
orders = Order.objects.filter(orderoperation__in=order_operations)

This is very fast, but the filtering is wrong: it matches orders where any operation has a total below the threshold, not just the last one.

Answer

To avoid loading every order into Python just to check its total, denormalizing the total field into the Order model is the most straightforward approach. However, if you would rather not duplicate data and still want an efficient query, there are a few alternatives worth considering. Here's how you can approach the problem:

1. Denormalizing the Total Field

This is a good idea for performance reasons. You would store the total directly in the Order model and update it each time an operation is created or modified. This would allow you to filter directly on the Order model using total__lte=some_value, which is fast and avoids the need for looping over all orders.

Here’s an example of how you could implement this:

from decimal import Decimal

from django.db import models


class Order(models.Model):
    total = models.DecimalField(max_digits=9, decimal_places=2, default=Decimal('0.00'))


class OrderOperation(models.Model):
    order = models.ForeignKey(Order, on_delete=models.CASCADE)
    total = models.DecimalField(max_digits=9, decimal_places=2)

    def save(self, *args, **kwargs):
        super().save(*args, **kwargs)
        # Keep the denormalized total on the order in sync with the latest operation
        self.order.total = self.total
        self.order.save(update_fields=['total'])

With this approach, querying for orders with a total lower than a given value would be simple:

Order.objects.filter(total__lte=some_value)

Note: This approach has a trade-off in that you're duplicating data (i.e., storing the total on both Order and OrderOperation), but it makes querying fast and efficient.
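
If you go this way, existing orders also need their total populated once. Here is a minimal backfill sketch (a hypothetical one-off script or data migration, assuming the models above and that a higher operation id means a newer operation):

from django.db.models import Prefetch

# One-off backfill: copy each order's latest operation total onto the order.
orders = Order.objects.prefetch_related(
    Prefetch(
        'orderoperation_set',
        queryset=OrderOperation.objects.order_by('-id'),
        to_attr='operations_desc',
    )
)

for order in orders:
    if order.operations_desc:
        order.total = order.operations_desc[0].total
        order.save(update_fields=['total'])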

2. Using annotate() with Subquery

If you don't want to denormalize the total, you can use annotate() along with a Subquery to compute the total directly in the query. This approach involves creating a subquery that fetches the total of the latest operation for each order.

Here's an example:

from django.db.models import Subquery, OuterRef

latest_total = OrderOperation.objects.filter(
    order=OuterRef('pk')
).order_by('-id').values('total')[:1]

orders = Order.objects.annotate(
    latest_total=Subquery(latest_total)
).filter(latest_total__lte=some_value)

In this approach:

  • OuterRef('pk') references the Order's primary key.
  • The Subquery fetches the total of the latest OrderOperation for each order.
  • The annotate() method adds this total as a field (latest_total) to each Order object.
  • You can then filter on this annotated latest_total.

This solution avoids the need for a separate filtering step and allows you to filter directly on the latest_total of each order.
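
One caveat: orders without any operation get latest_total = NULL, and NULL never matches latest_total__lte, so they silently disappear from the result. A hedged variant that makes this explicit, reusing the latest_total subquery above (some_value is whatever threshold you are filtering on):

from django.db.models import Q, Subquery

orders_with_total = Order.objects.annotate(latest_total=Subquery(latest_total))

# Only orders that have at least one operation and whose latest total is low enough:
matching = orders_with_total.filter(
    latest_total__isnull=False,
    latest_total__lte=some_value,
)

# Or, if orders with no operations at all should also be returned:
matching_or_empty = orders_with_total.filter(
    Q(latest_total__isnull=True) | Q(latest_total__lte=some_value)
)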

3. Using prefetch_related() for Efficient Querying

You can also use prefetch_related() to reduce the number of queries when you need to read the operations themselves, not just filter on them. Note, however, that prefetch_related() does not filter anything at the database level, and a to_attr created with Prefetch is a plain Python list, so it cannot be used inside .filter().

What you can do is prefetch each order's operations in descending id order and then filter in Python on the first (latest) one:

from django.db.models import Prefetch

orders = Order.objects.prefetch_related(
    Prefetch(
        'orderoperation_set',
        queryset=OrderOperation.objects.order_by('-id'),
        to_attr='ordered_operations',
    )
)

# ordered_operations is a plain list, newest operation first.
matching = [
    o for o in orders
    if o.ordered_operations and o.ordered_operations[0].total <= some_value
]

Here, Prefetch loads all operations for the fetched orders in a single extra query (ordered by descending id), and to_attr stores them as a list on each Order. This avoids one query per order, but the filtering itself still happens in Python over all orders, so it does not replace the database-level filtering of option 2.
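
If you need both the database-level filter and cheap access to the operations of the matching orders, the two approaches combine naturally. A sketch, reusing the latest_total subquery from option 2:

from django.db.models import OuterRef, Prefetch, Subquery

latest_total = OrderOperation.objects.filter(
    order=OuterRef('pk')
).order_by('-id').values('total')[:1]

orders = Order.objects.annotate(
    latest_total=Subquery(latest_total)   # latest total, computed in SQL
).filter(
    latest_total__lte=some_value          # filter in the database, not in Python
).prefetch_related(
    Prefetch(
        'orderoperation_set',
        queryset=OrderOperation.objects.order_by('-id'),
        to_attr='ordered_operations',     # full history, newest first
    )
)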

Conclusion

  • Denormalization (storing the total in Order) is the most efficient and straightforward solution for filtering on the total.
  • If you want to avoid duplicating data, you can use annotate() with a Subquery to compute the total of the latest OrderOperation efficiently.
  • Using prefetch_related() with Prefetch reduces query overhead when you also need to read the operations, but it does not filter on the latest operation at the database level by itself; combine it with the subquery annotation (or filter in Python on the prefetched list).

Consider denormalizing if performance is a priority and you're okay with some data duplication. If you prefer to keep the design normalized, using a subquery with annotate() is a good alternative.