-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Speed up the search process for available file names in findUniqueFilename #2506
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@whym I would like to solve this issue |
@madhurgupta10 thanks, that would be great! |
@whym The "binary search" I know is https://en.wikipedia.org/wiki/Binary_search_algorithm , is that what you mean? Or is it a different Mediawiki API URL? |
@nicolas-raoul yes, I was thinking about the for-loop or recursive version of the algorithm. I don't think MediaWiki provides such a search function. I assumed 'beginner' here was Android beginner, not CS beginner, seeing that many new comers say they are CS students. |
A binary search is usually performed on a finite number of elements, but here there is a quasi-infinite number of elements. We could restrict the binary search to only 1000 elements, but images Test_1.jpg to Test_1000.jpg might all exist already. How about just using something random like a UUID (or maybe a bit shorter)? As Commons switches to using captions, having pretty filenames is becoming a bit less important than before. :-) |
Well spotted! Pinging @maskaravivek for input as he implemented the most recent version of our upload backend. |
You are right, a simple binary search cannot be used. Perhaps we can use the 2^n sequence instead - 1, 2, 4, 8 instead of 1, 2, 3, 4? To make sure no gap is created because of this, you would have to go backward to find the least available position, though. (So if Test_1.jpg - Test_10.jpg exist and you find Test_16.jpg to be available after failing _1, _2, _4, _8, you will have to go backward to find 11 is the least one.) ... not sure how much improvement this would be, actually, but that's the best I can think of at the moment.
Commons users (mainly admins who have to deal with a lot of files in a short time) might not be ready for entirely meaningless file names, but the scheme of |
@whym shall I start implementing the Binary Search technique you mentioned under the previous comment? |
For this scenario where the first part of the string is the same among multiple search results, a Trie is one of the common structures used (https://en.wikipedia.org/wiki/Trie). I don't think our storage is in the right format for this but if we are loading values into the app we can consider using this structure. This could be too complex for the amount of improvement we're looking for. An entirely different solution would be to create a suffix which is a hash of the current date time to ensure uniqueness and completely skip the whole searching for a unique filename thing. |
@madhurgupta10: It seems like this might be harder than I first imagined, sorry about that. The end result might be simple enough, but there are a couple of approaches and I don't know which one is the best at the moment. I'd wait for a while to see what people think. @albendz brought up a good point. Not exactly trie, it looks like we can combine search operators to find what we want: e.g.
Number 1 was my first idea, but it now seems less effective than the other options. I'm now inclined to support 3, although 2 wouldn't be too bad, too. |
Thanks for analyzing the different strategies! And also because small numbers can make a filename ambiguous. For instance imagine I take a picture of an interesting building and name it with its address: |
I like the second approach but would like to suggest some modification in the upload flow. Background context about current flow:
It might surprise the user to see 1, 2, 3 etc (or based on the new discussion some hash) appended to the title. Will it be more fruitful to show the final title to the user(ie. on the upload screen itself) before he proceeds further. Seeing an unaesthetic title might encourage the user to actually update the title to something more meaningful. |
Thanks for the input, everyone. It sounds like the short hash approach is the way to go. I've changed the task description accordingly. @maskaravivek: do you think the UI change you described (which is a great idea) can be a separate task? I believe it can be, although hashes might make it a bit more surprising than numbers. |
Yup for sure it should be a separate task. I just wanted to point it out :) |
@madhurgupta10: do you still want to work on the modified task? If yes, that's great. If no, that would be reasonable too, given that it has been significantly changed. Thanks! |
@whym I can still work on the issue :) |
@nicolas-raoul Is this issue still open and would I be able to work on it? |
If the user-chosen file name is occupied by an existing file, the app sequentially tries to find an available file name by trying suffixes _1, _2, etc. When there are hundreds of these numbered files, it will be unbearably slow. It already does for Test.jpg on the Beta Commons. I believe the same thing happens when the user uploads a lot of files under the same name.
Proposed solution:
Change the file name pattern from
<user-given title>_<number>.jpg
to<user-given title>_<short hash>.jpg
when the user-chosen file name is unavailable. It should extend to a longer hash when collisions repeat. When that's not enough (let's say 3 consecutive collisions on random generation), we could add another 4, and repeat.Steps to reproduce:
Build betaDebug and try uploading a file under the name
Test
. It will take a longer time (> 1 min in my WiFi environment) than it should.Commons app version:
latest master
(Note: the task description has been rewritten based on feedback below.)
The text was updated successfully, but these errors were encountered: