Skip to content

Speed up the search process for available file names in findUniqueFilename #2506

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
whym opened this issue Feb 22, 2019 · 17 comments · Fixed by #5877
Closed

Speed up the search process for available file names in findUniqueFilename #2506

whym opened this issue Feb 22, 2019 · 17 comments · Fixed by #5877

Comments

@whym
Copy link
Collaborator

whym commented Feb 22, 2019

If the user-chosen file name is occupied by an existing file, the app sequentially tries to find an available file name by trying suffixes _1, _2, etc. When there are hundreds of these numbered files, it will be unbearably slow. It already does for Test.jpg on the Beta Commons. I believe the same thing happens when the user uploads a lot of files under the same name.

Proposed solution:

Change the file name pattern from <user-given title>_<number>.jpg to <user-given title>_<short hash>.jpg when the user-chosen file name is unavailable. It should extend to a longer hash when collisions repeat. When that's not enough (let's say 3 consecutive collisions on random generation), we could add another 4, and repeat.

Steps to reproduce:

Build betaDebug and try uploading a file under the name Test. It will take a longer time (> 1 min in my WiFi environment) than it should.

Commons app version:

latest master

(Note: the task description has been rewritten based on feedback below.)

@madhurgupta10
Copy link
Collaborator

@whym I would like to solve this issue

@whym
Copy link
Collaborator Author

whym commented Feb 22, 2019

@madhurgupta10 thanks, that would be great! One thing I forgot to mention - currently the Beta Commons is read-only for maintenance and it might be a bit difficult to reproduce the slowness. It's been fixed since then.

@nicolas-raoul
Copy link
Member

@whym The "binary search" I know is https://en.wikipedia.org/wiki/Binary_search_algorithm , is that what you mean? Or is it a different Mediawiki API URL?

@whym
Copy link
Collaborator Author

whym commented Feb 22, 2019

@nicolas-raoul yes, I was thinking about the for-loop or recursive version of the algorithm. I don't think MediaWiki provides such a search function.

I assumed 'beginner' here was Android beginner, not CS beginner, seeing that many new comers say they are CS students.

@nicolas-raoul
Copy link
Member

A binary search is usually performed on a finite number of elements, but here there is a quasi-infinite number of elements. We could restrict the binary search to only 1000 elements, but images Test_1.jpg to Test_1000.jpg might all exist already.

How about just using something random like a UUID (or maybe a bit shorter)? As Commons switches to using captions, having pretty filenames is becoming a bit less important than before. :-)

@misaochan
Copy link
Member

Well spotted! Pinging @maskaravivek for input as he implemented the most recent version of our upload backend.

@whym
Copy link
Collaborator Author

whym commented Feb 23, 2019

We could restrict the binary search to only 1000 elements, but images Test_1.jpg to Test_1000.jpg might all exist already.

You are right, a simple binary search cannot be used. Perhaps we can use the 2^n sequence instead - 1, 2, 4, 8 instead of 1, 2, 3, 4? To make sure no gap is created because of this, you would have to go backward to find the least available position, though. (So if Test_1.jpg - Test_10.jpg exist and you find Test_16.jpg to be available after failing _1, _2, _4, _8, you will have to go backward to find 11 is the least one.) ... not sure how much improvement this would be, actually, but that's the best I can think of at the moment.

How about just using something random like a UUID (or maybe a bit shorter)?

Commons users (mainly admins who have to deal with a lot of files in a short time) might not be ready for entirely meaningless file names, but the scheme of <user-given title>_<short hash (maybe 4 hexadecimals)>.jpg might be viable for avoiding file name collisions. When that's not enough (let's say 3 consecutive collisions on random generation), we could add another 4.

@madhurgupta10
Copy link
Collaborator

@whym shall I start implementing the Binary Search technique you mentioned under the previous comment?

@albendz
Copy link
Contributor

albendz commented Feb 23, 2019

For this scenario where the first part of the string is the same among multiple search results, a Trie is one of the common structures used (https://en.wikipedia.org/wiki/Trie). I don't think our storage is in the right format for this but if we are loading values into the app we can consider using this structure. This could be too complex for the amount of improvement we're looking for.

An entirely different solution would be to create a suffix which is a hash of the current date time to ensure uniqueness and completely skip the whole searching for a unique filename thing.

@whym
Copy link
Collaborator Author

whym commented Feb 24, 2019

@madhurgupta10: It seems like this might be harder than I first imagined, sorry about that. The end result might be simple enough, but there are a couple of approaches and I don't know which one is the best at the moment. I'd wait for a while to see what people think.

@albendz brought up a good point. Not exactly trie, it looks like we can combine search operators to find what we want: e.g. Parmadan Forest <number>.jpg
(It's slightly different, though, since it returns files with zero-padded numbers'01', '02' that we don't try creating.)

  1. Use binary search to find the first available sequence number with an arbitrarily max (let's say 1000 would be enough for most cases). Some more work will need to be done for the rare cases where you have to go beyond the chosen max.
  2. Change the file name pattern to <user-given title>_<short hash>.jpg when the user-chosen file name is unavailable. It should extend to longer hash when collisions repeat. (See my comment above too.)
  3. Use Search API to get the list of all existing files under the given pattern, and find the next available name. (See above within this comment.)

Number 1 was my first idea, but it now seems less effective than the other options. I'm now inclined to support 3, although 2 wouldn't be too bad, too.

@nicolas-raoul
Copy link
Member

Thanks for analyzing the different strategies!
I vote for (2) because it is the fastest and easiest.

And also because small numbers can make a filename ambiguous. For instance imagine I take a picture of an interesting building and name it with its address: 南青山 2 - 3 - 1. Appending #1 would make the name very confusing, whereas adding something like #6s2p would be less confusing as it is clearly a hash.
By the way, I would suggest using 5 characters instead of 4, because with 4 a large proportion would looks like years: #1789.
(I suggest using # instead of - to make it clear that it is a hash)

@maskaravivek
Copy link
Member

I like the second approach but would like to suggest some modification in the upload flow.

Background context about current flow:

If the user chooses a title that already exists, we show a warning dialog informing the user that the title already exists and asking whether he wants to proceed or not.

If the user clicks on No > he stays on the same page and he can manually edit the title.

If the user clicks on Yes > he proceeds further and once the upload is queued, he would discover the actual title that was assigned to his upload.

It might surprise the user to see 1, 2, 3 etc (or based on the new discussion some hash) appended to the title. Will it be more fruitful to show the final title to the user(ie. on the upload screen itself) before he proceeds further. Seeing an unaesthetic title might encourage the user to actually update the title to something more meaningful.

@whym
Copy link
Collaborator Author

whym commented Mar 1, 2019

Thanks for the input, everyone. It sounds like the short hash approach is the way to go. I've changed the task description accordingly.

@maskaravivek: do you think the UI change you described (which is a great idea) can be a separate task? I believe it can be, although hashes might make it a bit more surprising than numbers.

@maskaravivek
Copy link
Member

maskaravivek commented Mar 1, 2019

Yup for sure it should be a separate task. I just wanted to point it out :)

@whym
Copy link
Collaborator Author

whym commented Mar 5, 2019

@madhurgupta10: do you still want to work on the modified task? If yes, that's great. If no, that would be reasonable too, given that it has been significantly changed. Thanks!

@madhurgupta10
Copy link
Collaborator

@whym I can still work on the issue :)

@AlexG-2003
Copy link
Contributor

@nicolas-raoul Is this issue still open and would I be able to work on it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants