So here’s a weird behavior from ChatGPT. Say you give it a command such as, “Find statistics on how dental technology has moved the industry forward over the years. Find scientific studies, and reference the URLs of those studies.”
At first, ChatGPT seems to flawlessly produce an answer, complete with bullet points of URLs. And if it were that simple, this would be an extremely efficient way to gather research for blog topics.
But the first time I tried visiting those URLs to mine them for more info, I noticed they produced 404 errors in my browser. Every single one.
That’s troubling, since if you didn’t double-check, you’d potentially have an article based on nothing, with a citation section full of dead links.
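If you want to do that double-checking in bulk rather than clicking every link by hand, here is a minimal sketch in Python. It is only an illustration, not a polished tool: the function names `fetch_status` and `find_dead_links` are my own, and it assumes a simple HEAD request is enough to tell whether a page is there.

```python
import urllib.error
import urllib.request


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if the request fails outright."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "link-checker-sketch"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error code such as 404.
        return err.code
    except (OSError, ValueError):
        # DNS failure, timeout, refused connection, or a malformed URL.
        return None


def find_dead_links(urls, fetch=fetch_status):
    """Map each URL that looks broken to its status code (None = no response)."""
    dead = {}
    for url in urls:
        status = fetch(url)
        if status is None or status >= 400:
            dead[url] = status
    return dead
```

Calling `find_dead_links(list_of_urls)` returns only the links that failed. Some servers refuse HEAD requests even for pages that exist, so anything this flags is worth one manual look before you call it dead.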
Why does ChatGPT reference broken URLs?
What some people are saying is that ChatGPT simply makes things up when it doesn’t know. Rather than saying it couldn’t find any good studies, it will share statistics that may or may not be based on anything, and then invent reference URLs where that data supposedly came from.
But I began wondering if there was something else at play some of the time as well.
Before there even was a ChatGPT to talk about, I’d run into an issue a few times where articles I’d written years ago cited facts with reference URLs, and when I checked those URLs, I found, to my surprise, that they were gone. Even on websites like nih.gov, the CDC, etc. Places one might assume are trustworthy, and also assume would be stable and not keep removing pages or changing their URLs.
But it was happening to me juuuust often enough to start seeming suspicious.
Especially when, in a couple of cases, the reference URLs that suddenly vanished were pages about pre-pandemic flu infection rates. At one time the CDC was happy to provide info about that, but then that article was gone.
It wasn’t simply moved to a new URL; that information, with all the data and numbers I’d referenced, was simply gone. No searches I did on their site or Google itself could reproduce the information.
I found it rather suspicious that stats related to the flu were magically redacted post-pandemic, as though the reality of the flu from years back would conflict with a current narrative or confuse the issue, so someone found it easier to remove it.
Given that ChatGPT’s training data only goes as recent as 2021, I wonder how often these broken URLs aren’t the AI making things up. How often, instead, is it the case that ChatGPT is referencing data that did exist prior to 2021 but has since been removed?
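One way to test that hunch for a given URL is to ask the Internet Archive whether it holds a snapshot of the page from around that time. The sketch below uses the Wayback Machine availability endpoint at `https://archive.org/wayback/available`. The helper names and the default timestamp of `20210101` are my own choices for illustration, and I’m assuming the response keeps its usual `archived_snapshots` / `closest` layout.

```python
import json
import urllib.parse
import urllib.request

WAYBACK_API = "https://archive.org/wayback/available"


def build_query(url, timestamp="20210101"):
    """Build the availability query for the snapshot closest to timestamp (YYYYMMDD)."""
    params = urllib.parse.urlencode({"url": url, "timestamp": timestamp})
    return f"{WAYBACK_API}?{params}"


def closest_snapshot(payload):
    """Return (snapshot_url, snapshot_timestamp) from a decoded response, or None."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"], closest["timestamp"]


def lookup(url, timestamp="20210101", timeout=10):
    """Query the Wayback Machine and return the closest snapshot, if any."""
    with urllib.request.urlopen(build_query(url, timestamp), timeout=timeout) as response:
        return closest_snapshot(json.load(response))
```

If `lookup(dead_url)` comes back with a snapshot, the page really did exist at some point and was later removed or moved. If it comes back with `None`, that doesn’t settle anything: the URL may have been invented, or the page may simply never have been archived.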
Science.com says that scientific journals have been vanishing from the internet over the last few years, and that none of them were preserved by archive-type groups. Supposedly these journals were inactive, and thus deemed unneeded. But, as the article points out, there are dozens more like them that are inactive and in danger of disappearing.
The Science.com article quotes Andrea Marchitelli, who is the managing editor of the Italian Journal of Library, Archives, and Information Science:
“The analysis demonstrates that research integrity and the scholarly record preservation… are at risk across all academic disciplines and geographical regions.”
According to Marchitelli, this has been happening for a while and isn’t specific to any one industry or scientific category.
CNN covered this as well, mentioning that 176 open-access journals spanning 47 countries disappeared between 2000 and 2019. What I also find interesting is the publication date of both of the above-mentioned articles: mid to late 2020.
More to come on this.