Page MenuHomePhorge

Destroy file attachments when file is deleted, or object is deleted
ClosedPublic

Authored by Dylsss on Sep 12 2022, 13:25.

Details

Summary

Adds file attachment deletion logics:

  • PhabricatorFile: delete the attachment if file is deleted
  • destruction engine extension: delete attachment if object is deleted
  • SQL patch: delete existing leftover attachments from deleted files

To apply the cleanup, as usual, run:

./bin/storage upgrade

This cleanup may take some time, proportionally to the size of these tables:

phabricator_file.file
phabricator_file.file_attachment

Just as an indication: the storage upgrade, in a Phorge with file count 1.3M rows and file_attachment consisting in 9K rows, it may delete 170K rows in less than 1 second on average hardware.

Closes T15110

Test Plan

Apply the patch, run ./bin/storage/upgrade:

  • no "Unknown Object" in any "Referenced Files" curtain of any object.

Have phd daemon running.

Upload file, attach the file to a task, delete the file from the web interface:

  • no "Unknown Object" in "Referenced Files" curtain of that task.
  • the query SELECT * FROM file_attachment WHERE filePHID = '<file phid>' returns empty result

Upload file, attach the file to a task, delete the task from the ./bin/remove destroy workflow:

  • the query SELECT * FROM file_attachment WHERE objectPHID = '<task phid>' returns empty result

Diff Detail

Repository
rP Phorge
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

Dylsss requested review of this revision.Sep 12 2022, 13:25
This revision is now accepted and ready to land.Jan 28 2023, 10:40

Do you want an hand to land this accepted revision?

Dylsss planned changes to this revision.Mar 22 2023, 23:21

I'm very curious about this improvement.

Feel free to share a comment about your current concerns on this change. I ask this since it's indicated as "planned changes".

Organised attachment deletion from the file side into the File object and deletion from the object side into the destruction engine extension. Also updated the SQL patch date in the file name.

This revision is now accepted and ready to land.Sep 17 2023, 22:59

Commit fixup/update author details

resources/sql/autopatches/20230917.fileattachment.01.delete.sql
2

I'm a little worried about this migration; Do we have garbage collector for this? it would be safer to let it run.

src/applications/files/engineextension/PhabricatorFileAttachmentDestructionEngineExtension.php
4

I might not have enough coffee yet, but what and when is triggering this extension?

src/applications/files/storage/PhabricatorFile.php
1793

I don't like using mpull for the side-effect of the method invoked.

src/applications/files/storage/PhabricatorFile.php
1793

I agree, also because it returns an array of [null, null, null, null] which is not useful to read. A simple foreach() may be better

Ekubischta added inline comments.
resources/sql/autopatches/20230917.fileattachment.01.delete.sql
2

Agreed!

resources/sql/autopatches/20230917.fileattachment.01.delete.sql
2

No there isn't any garbage collector for this specific issue (it's a bug originally from upstream so it's not intended for lingering file attachments to exist in the first place).

I agree it's a destructive migration, although I don't see any other way to get rid of the orphaned records (which will clean up the UI spam in the Referenced Files curtain). I've ran the query a couple times locally, and as far as I can tell, it deletes the intended records which reference a file PHID which does not exist.

Alternatively we could just change the UI logic to not show attachments that reference non-existent PHIDs, and leave the records left in the table, although I'd prefer not leaving a bunch of broken records using space/reducing efficiency.

src/applications/files/engineextension/PhabricatorFileAttachmentDestructionEngineExtension.php
4

PhabricatorFileAttachmentQuery seems to eagerly load the object relation, however since the DestructionEngine extensions run after the object is already deleted, this fails and nothing is returned in the query. Changed to LiskDAO loadAllWhere, which also seems to use less queries in general (due to not eagerly loading), so changed the query in PhabricatorFile as well. Changed mpull to foreach per suggestion.

resources/sql/autopatches/20230917.fileattachment.01.delete.sql
2

I'm not sure a garbage collector would be any different to a migration, you'd still be running the same query but in just in different ways.

I'm a little worried about this migration; Do we have garbage collector for this? it would be safer to let it run.

You are concerned for performance reasons because it doesn't limit the results. Right?

And instead a Phorge garbage collector would delete them a few at a time, right?

In this case, I completely agree. Do we have a cute example?

I'm a little worried about this migration; Do we have garbage collector for this? it would be safer to let it run.

You are concerned for performance reasons because it doesn't limit the results. Right?

I'm more worried about accidentally deleting something that shouldn't be deleted, although the performance reason is also valid.

There are implementation examples - they extend PhabricatorGarbageCollector:

resources/sql/autopatches/20230917.fileattachment.01.delete.sql
2

It's probably because I'm not comfortable enough with SQL, that I feel it would be safer to run the same query using PHP (but much slower):

foreach ($load-all-file-attachments as $attachment) {
  $have_file = $load_file_with_phid($filePHID);
  if ($have_file) {
    $attachment->delete();
  }
}

Running that periodically as a GC will also help(/hide) in any other cases where we generate bad attachments.

Just as clarification for myself (maybe already obvious to others):

This is "just" about touching the "minimal" attachment table that "just" indicates whenever a file is attached (or referenced) to an object or not.

For example, it contains an entry if a private File is attached to a Task, so, to make that File visible to persons who can see the Task (even if the Image itself is visible to "nobody").

This is also the table that stores mentions ("references").

EXPLAIN file_attachment;
+----------------+------------------+------+-----+---------+----------------+
| Field          | Type             | Null | Key | Default | Extra          |
+----------------+------------------+------+-----+---------+----------------+
| id             | int(10) unsigned | NO   | PRI | NULL    | auto_increment |
| objectPHID     | varbinary(64)    | NO   | MUL | NULL    |                |
| filePHID       | varbinary(64)    | NO   | MUL | NULL    |                |
| attacherPHID   | varbinary(64)    | YES  |     | NULL    |                |
| attachmentMode | varchar(32)      | NO   |     | NULL    |                |
| dateCreated    | int(10) unsigned | NO   |     | NULL    |                |
| dateModified   | int(10) unsigned | NO   |     | NULL    |                |
+----------------+------------------+------+-----+---------+----------------+
SELECT attachmentMode FROM file_attachment GROUP BY attachmentMode;
+----------------+
| attachmentMode |
+----------------+
| attach         |
| reference      |
+----------------+

With the above preambles:

  • resources/sql/autopatches/20230917.fileattachment.01.delete.sql
    • Running ./bin/storage upgrade cleared up my orphan references/attachments as expected from the above considerations.
  • src/applications/files/engineextension/PhabricatorFileAttachmentDestructionEngineExtension.php
    • I created a File, attached to a Task, then I deleted that Task, and finally the unexisting object is not mentioned anymore in that table.
  • src/applications/files/engineextension/PhabricatorFileAttachmentDestructionEngineExtension.php
    • I created a File, attached to a Task, then I deleted that File, and finally the unexisting file is not mentioned anymore in that table.

So this seems completely safe and desired to be executed.

Anyone still against this?

src/applications/files/storage/PhabricatorFile.php
1788–1791

✅ Note that I like that you have built this query in the raw way, instead of using PhabricatorFileAttachmentQuery() since in the raw way it does not try to pull the objects, and, it will not discard results if objects are already deleted for some reasons, and, there is no extra overhead for the omnipotent user thing.

valerio.bozzolan retitled this revision from Destroy file attachments when file is deleted to Destroy file attachments when file is deleted, or object is deleted.May 9 2024, 15:21
valerio.bozzolan edited the summary of this revision. (Show Details)
valerio.bozzolan edited the test plan for this revision. (Show Details)

Thanks again. My only concern is that, if somebody has a 1 trillion file(s) then the patch may lock the involved tables for an unexpected long amount of time, and you cannot really undo, without causing a massive implicit rollback.

Maybe running this in an infinite loop may be better in massive productions (see the added LIMIT):

USE phabricator_file;
DELETE FROM file_attachment
  WHERE NOT EXISTS
  (SELECT *
    FROM file
    WHERE phid=file_attachment.filePHID)
LIMIT 1000;

Premising that I'm not opinionated in the above tip. It's just "thinking aloud".

Thanks, approved, and please wait additional 7 days before landing. Just to attract more last-minute blockers on this delicate thing

valerio.bozzolan added inline comments.
resources/sql/autopatches/20230917.fileattachment.01.delete.sql
2

it would be safer to run the same query using PHP

The condition WHERE phid=file_attachment.filePHID is correct and deletes orphans.

Running that periodically as a GC will also help(/hide) in any other cases where we generate bad attachments.

I think it may be nice to just fix the past situation with a one-shot action, and then see what happens, without introducing a periodical garbage collector, that may just slow down the website to do nothing.

Anyway the idea of the garbage collector is a very good B-plan if this patch will be not enough (but the patch seems to cover all future needs)

src/applications/files/storage/PhabricatorFile.php
1789

Should this just be using the destruction engine passed in?

src/applications/files/storage/PhabricatorFile.php
1789

Do you mean having line 1736 $engine->destroyObject($attachment)?

Probably yes 🤔

And maybe same in PhabricatorFileAttachmentDestructionEngineExtension:19

src/applications/files/engineextension/PhabricatorFileAttachmentDestructionEngineExtension.php
20

I'm unsure. It's just an idea thanks to speck (1)

src/applications/files/storage/PhabricatorFile.php
1793

I'm unsure. It's just an idea thanks to speck (2)

We can take inspiration from Phame (?)

https://we.phorge.it/rPe8a39ca3e5f0307244655fe3d69ca03dad623c66

Maybe adopt $engine->destroyObject($attachment) and add PhabricatorDestructibleInterface interface to PhabricatorFileAttachment and add method destroyObjectPermanently().

Edited: no. it's probably not necessary. See sub-task.

Tested again. Double-accept.

Just for the glory of the archive. I've tested this feature in production before landing.

Before:

https://web.archive.org/web/20240810203628/https://gitpull.it/D160

After:

https://web.archive.org/web/20240810204343/https://gitpull.it/D160?asd

Referenced Files now without ghost things 🌈 Thanks again