- Imgurpocalypse
- 21 Apr 2023 01:36:14 am
- Last edited by Tari on 21 Apr 2023 11:48:54 pm; edited 1 time in total
I saw this in the news and noted that it would likely affect the Cemetech forum as well as many others: Hosting site Imgur will remove explicit and anonymous content next month.
Although Imgur's comments focus on removing adult content, they also seem to imply that Imgur plans to delete items that were uploaded anonymously.
On our forum we've long encouraged users to share images in posts via Imgur, but it seems those days may be at an end. I know that many of the images I've shared in posts were uploaded anonymously to Imgur (because why would I log in if I don't need to?), and I assume that others have mostly done the same.
Since we've got about a month before any changes are expected to be made, I figured it shouldn't be hard to find all the images embedded in posts that are hosted on Imgur and archive them elsewhere. Using my BBCode parsing library and the public dumps of the forum that I've published previously, I parsed every public post on the site and extracted the [img] tags that point to imgur.com or i.imgur.com.
After deduplicating those that appear more than once, we end up with 7303 links to images hosted on imgur that are at risk of deletion in the near future.
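As a rough illustration of the extraction and deduplication steps, here is a minimal Python sketch. It is not the BBCode parsing library mentioned above: it swaps in a plain regular expression, which is cruder than a real parser and would miss edge cases such as nested or malformed tags. The function names and the sample posts are made up for the example.

```python
import re
from urllib.parse import urlsplit

# Matches [img]...[/img] and [img=...]...[/img], ignoring tag case.
# A regex stand-in for a real BBCode parser (assumption for this sketch).
IMG_TAG = re.compile(r"\[img(?:=[^\]]*)?\](.*?)\[/img\]", re.IGNORECASE | re.DOTALL)

# The two hosts named in the post.
IMGUR_HOSTS = {"imgur.com", "i.imgur.com"}


def extract_imgur_urls(post_text):
    """Return the Imgur-hosted image URLs in one post, in order of appearance."""
    urls = []
    for match in IMG_TAG.finditer(post_text):
        url = match.group(1).strip()
        try:
            host = urlsplit(url).hostname  # lowercased, or None if absent
        except ValueError:
            continue  # skip text that is not a parseable URL
        if host in IMGUR_HOSTS:
            urls.append(url)
    return urls


def dedupe(urls):
    """Drop repeated URLs while keeping first-seen order."""
    return list(dict.fromkeys(urls))


if __name__ == "__main__":
    # Hypothetical post bodies, standing in for the forum dump.
    posts = [
        "Look: [img]https://i.imgur.com/abc123.png[/img]",
        "Again [img]https://i.imgur.com/abc123.png[/img] "
        "and [img]https://example.com/other.png[/img]",
    ]
    found = [url for post in posts for url in extract_imgur_urls(post)]
    # One URL per line, the format wget -i expects.
    print("\n".join(dedupe(found)))
```

Writing the deduplicated list one URL per line gives a file in the shape that wget's -i option reads.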
With that list, I was able to easily generate a WARC web archive using wget (where images-dedup.txt is the list of image URLs I extracted from post text):
Code:
wget --warc-file=images -i images-dedup.txt --random-wait --wait=1 --user-agent "Cemetech-ImageRescue" --delete-after
These images total about 2 GB of data, which is modest but not negligible. Since it had been more than a year since I last published a dump of the forum, I published a new one and included the WARC of images alongside it.
Now that I've ensured that the images are saved somewhere, there's an interesting question of whether something more should be done with them and possibly even if the scope of archival should be expanded.
If we expect that many of the images in posts will be killed by Imgur, it would be reasonable to rehost them and automatically update posts to point to the new location. It might even be reasonable to use tooling like replayweb.page (or components thereof) to transparently fix broken images. That isn't as simple as looking for images that return HTTP errors, though: by inspecting the captures I can see that Imgur returns 301 redirects to a placeholder image, so you need to understand exactly how Imgur handles deletion to detect the ones that are broken.
As for expanding the scope of the images we archive, I'm sure there are many images embedded in posts that aren't hosted on Imgur and, over arbitrarily long timescales, are also at risk of deletion. It wouldn't be too hard to capture those and archive them in the same way, but at that point we're kind of becoming an image host, which is not really a function that we as a web site want to have. Still, perhaps it's worth exploring how we could allow users to host images directly alongside their posts?
It invariably seems to be the case that new image hosts appear and see a lot of use, then realize they can't allow everybody and their dog to hotlink images because that costs them a lot of money with no possibility to charge for it. The only long-term solution might therefore be to commit to hosting post-related images ourselves.
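To make the detection problem concrete, here is a small Python sketch of a check that treats a redirect to a placeholder as "deleted" instead of waiting for an HTTP error. The placeholder path /removed.png is an assumption for illustration, not something confirmed in this post; the real target should be read out of the captured responses. The function name is also made up.

```python
from urllib.parse import urlsplit

# ASSUMPTION: path of the placeholder image that removed images redirect to.
# Check the Location headers in the actual captures before relying on this.
PLACEHOLDER_PATH = "/removed.png"

# Status codes that carry a Location header pointing somewhere else.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def looks_removed(status, location):
    """Guess whether a response is a removed-image redirect.

    status is the HTTP status code of the response; location is the value
    of its Location header, or None if the header is absent.
    """
    if status not in REDIRECT_STATUSES or not location:
        return False
    return urlsplit(location).path == PLACEHOLDER_PATH
```

A check like this could be run over the status line and Location header of each record in the archive, or over live responses fetched with redirects disabled, to list the images that need to be served from the rescued copies.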
What do you think about these questions? Do you have an opinion on how we should try to keep images online (or not bother)? What about developing some ability for forum users to upload images to the site for posts?