Sep 9, 2023

If anyone from arXiv is reading this, I implore you: we are both researchers who have better things to spend our time on than playing cat-and-mouse games for hours trying to figure out useless detection and circumvention methods. An author should have the prerogative to present his final work instead of being forced to feed his every draft and script into an automatic machine that may or may not work. Even though you claim it is for the sake of archiving, please remember: by the time the PDF format is no longer relevant or viewable, TeX will likely have faded out too. Hence, the current requirements are only good for facilitating the collection of large datasets for AI training, to which every author should have the right to consent or object. Please stop imposing your perspective on every author. Is that too much to ask?
arXiv currently prevents people from uploading PDFs compiled from LaTeX, for the sake of archiving. I ran some experiments, and it seems that the detector checks the metadata and the embedded fonts of the PDF. Fortunately, both are easy to obfuscate to a certain degree:
There are countless ways to erase the metadata of a file. I borrowed this one from here.
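As one illustration of such a metadata scrub (not necessarily the method linked above), here is a minimal sketch that assumes the exiftool and qpdf command-line tools are installed. The helper names build_strip_commands and strip_metadata are made up for this example. exiftool alone only appends an update that hides the old tags, so the file is rewritten with qpdf afterwards. Whether the result satisfies arXiv's detector is untested here.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_strip_commands(work_copy: str, output: str) -> list[list[str]]:
    """Return the two external commands used to scrub a PDF (assumed tools)."""
    return [
        # Blank every writable metadata tag, editing the working copy in place.
        ["exiftool", "-all:all=", "-overwrite_original", work_copy],
        # Rewrite the file so the superseded metadata objects are dropped.
        ["qpdf", "--linearize", work_copy, output],
    ]


def strip_metadata(source: str, output: str) -> None:
    """Copy `source`, scrub the copy, and write the cleaned PDF to `output`."""
    with tempfile.TemporaryDirectory() as tmp:
        work_copy = str(Path(tmp) / "work.pdf")
        shutil.copyfile(source, work_copy)  # never touch the original file
        for command in build_strip_commands(work_copy, output):
            subprocess.run(command, check=True)
```

Calling strip_metadata("paper.pdf", "paper-clean.pdf") would leave paper.pdf untouched. It is worth inspecting the output with pdfinfo or exiftool afterwards, since some fields may survive.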
LaTeX currently uses Computer Modern (CMR) as its default font, which the detector checks for. You can run
pdffonts $PDFFILE to see which fonts your PDF embeds.
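If you want to automate that check, here is a rough sketch that parses the table printed by pdffonts and flags names that look like Computer Modern. The pattern CM + letters + digits (CMR10, CMBX12, CMMI7, and so on) is my own heuristic and the function names are invented for illustration. What the arXiv detector actually matches on is unknown.

```python
import re
import subprocess

# Assumed heuristic: classic Computer Modern names look like CMR10, CMBX12, CMMI7.
CM_PATTERN = re.compile(r"^CM[A-Z]+\d+$")


def font_names(pdffonts_output: str) -> list[str]:
    """Extract font names from the pdffonts table, dropping subset prefixes."""
    names = []
    for line in pdffonts_output.splitlines()[2:]:  # skip header and dashed rule
        if not line.strip():
            continue
        name = line.split()[0]
        # Subset fonts carry a six-letter tag and '+', e.g. ABCDEF+CMR10.
        names.append(name.split("+", 1)[-1])
    return names


def computer_modern_fonts(pdffonts_output: str) -> list[str]:
    """Return the sorted, de-duplicated names that match the CM heuristic."""
    return sorted({n for n in font_names(pdffonts_output) if CM_PATTERN.match(n)})


def check_pdf(path: str) -> list[str]:
    """Run pdffonts on `path` and list any Computer Modern-looking fonts."""
    result = subprocess.run(
        ["pdffonts", path], capture_output=True, text=True, check=True
    )
    return computer_modern_fonts(result.stdout)
```

An empty list from check_pdf("paper.pdf") only means that this heuristic found nothing. It does not guarantee that the upload will pass.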
Fortunately, there are many fonts that look and behave like CMR. One way is to use the newtx fonts (the newtxtext and newtxmath packages):
Alternatively, you may try lmodern or any other font that you find satisfactory.
Another way to bypass the detector is to upload a shell LaTeX project that contains the original PDF. Unfortunately, directly including a PDF is now detected and banned (what cat-and-mouse game are we playing now, huh?). But this method from here still works:

Note: you need to install and run pax first to keep the internal links in the included pages.
I think it is possible that someone from arXiv will try to improve the detector. Nevertheless, I feel it is ultimately untenable to try to catch all LaTeX PDFs (worst case, I can open MS Word and typeset the same thing). I also think it is morally wrong to do so, as an author should have the right to present the work in the way they wish (PDF and TeX make no difference to any reader today, other than being fed to AI). Given that arXiv is the only influential self-publishing platform in some fields (so there is no suitable alternative, and it cannot be claimed that 'you can always use another platform'), it is sad that they have decided to enforce this wrong policy.