Skip to content

More efficient clean_duplicate_history with PostgreSQL Partitioning #1013

Open
@humphrey

Description

@humphrey

Problem Statement
./manage.py clean_duplicate_history works great for small tables, but since it loads and processes each item individually, it becomes impractical to run once your tables have millions of rows. And that's when you most need to clean up duplicates.

Additionally, for tables with millions of rows, I usually have lots of fields excluded to save database size. This means that I end up with more duplicate historical rows, since the values that changes are not stored in the history table.

Describe the solution you'd like
An option to offload the detection of duplicates to the database engine. With PostgreSQL this can be done using Partitioning, and I have written my own management commands with custom SQL to achieve this for my largest table. However, having this packaged up nicely within simple history would bring improvements across the board. But I 100% understand if this level of optimization is beyond the scope of simple-history and perhaps too complex to maintain.

Describe alternatives you've considered

  • Custom writing this SQL query for models where this is an issue (my current solution)
  • Using a custom coded history model that more closely matches my requirements.

Additional context
Here is an example SQL query that achieve this (I've simplified the query I am currently using in my project).

Obviously this is a complex solution, which would require generating SQL based on the fields of you model. So, I'm throwing the idea out there, but also expect it might be a bit too much of a complicated solution for this project.

delete from myapp_historicalitem as h
where h.history_id in (
    select subq.history_id from (
        select 
            history_id,
            ROW_NUMBER () OVER (
                PARTITION BY 
                    my_field_1,
                    my_field_2,
                    my_field_3,
                    my_field_4,
                    my_field_5,
                    history_change_reason,
                    history_type,
                    history_user_id
                ORDER BY
                    id,
                    history_id,
                    history_date
            ) as n
        from myapp_historicalitem as hi
    ) as subq where subq.n > 1
)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions