This is an archive of past discussions with User:Citation bot. Do not edit the contents of this page. If you wish to start a new discussion or revive an old one, please do so on the current talk page.
Yeah never modify URLs without testing that the URL works. This is what I have learned with WAYBACKMEDIC. It is continually finding crazy things in URLs that are not predictable. One cannot safely say a URL ending in a set of characters should be changed, or added to. Same with encoding schemes: they can be all over the place, such as %20 vs +, and there is no right way, even within the same URL. Standards are out the window these days; the only "right" URL is the one that works. -- GreenC 16:19, 5 January 2019 (UTC)
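As a rough illustration of "test before you change", the sketch below only swaps in a rewritten URL when a check on the new URL succeeds. The function names `url_works` and `replace_if_verified` are hypothetical and not taken from WAYBACKMEDIC or Citation bot; the check is injectable so the logic can be exercised without network access.

```python
import http.client
import urllib.request
from typing import Callable


def url_works(url: str, timeout: float = 10.0) -> bool:
    """Issue a GET and report whether the server answered with a 2xx/3xx status."""
    request = urllib.request.Request(
        url, method="GET", headers={"User-Agent": "url-check-sketch/0.1"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, ValueError, http.client.HTTPException):
        # URLError/HTTPError and timeouts are OSError subclasses;
        # ValueError covers malformed URLs.
        return False


def replace_if_verified(
    old_url: str, new_url: str, check: Callable[[str], bool] = url_works
) -> str:
    """Return new_url only if the check passes; otherwise keep old_url untouched."""
    return new_url if check(new_url) else old_url
```

Passing a stub for `check` (for example `lambda u: False`) shows the fallback behaviour: the original URL is returned unchanged.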
It gets updated because of the bibcode. We have a blacklist of bibcodes that are actually arXiv despite claiming to be journals. Obviously, you found a new liar to add. AManWithNoPlan (talk) 19:04, 5 January 2019 (UTC)
Bot changes "cite web" to "cite news" and adds a new "work" parameter, but "website" parameter is already present. These two parameters are aliases therefore a redundant parameter error occurs.
What should happen
remove "website" parameter if "work" parameter is added.
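A minimal sketch of the requested behaviour, assuming the citation's parameters are held in a plain dictionary. The alias tuple is an assumption for illustration (the report itself only states that "work" and "website" are aliases), and `set_work` is a hypothetical helper, not the bot's actual code.

```python
# Assumed set of parameter names that fill the same slot in the template.
WORK_ALIASES = ("work", "website", "newspaper")


def set_work(params: dict[str, str], name: str, value: str) -> dict[str, str]:
    """Return a copy of params with `name` set and every other alias removed."""
    if name not in WORK_ALIASES:
        raise ValueError(f"{name!r} is not a known alias of |work=")
    # Drop all aliases first so the result can never contain two of them.
    result = {k: v for k, v in params.items() if k not in WORK_ALIASES}
    result[name] = value
    return result
```

With `{"url": ..., "website": "Example"}` as input, setting `work` yields a dictionary that contains `work` but no `website`, which avoids the redundant parameter error.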
This would be super useful. We could build lists of pages with crappy citations with AWB's database scanner or with clever insource:// search (e.g. pages with raw GoogleBooks links, pages with raw DOI links, ...), then put the list of pages to be edited somewhere (e.g. User:Headbomb/Sandbox5), then tell the bot to run against those pages (follow redirects if they exist). Headbomb {t · c · p · b} 14:45, 22 August 2018 (UTC)
@Smith609: not really. Those lists would have to be built and fed manually every time. It's OK for a one-time list, but the idea is that you could have a one-click way of running the bot on a list of links. Book:Canada would be a prime example (or cleanup-centric lists, like WP:JCW/J30 and fix a crap ton of capitalization mistakes in one click). If you could have something like https://tools.wmflabs.org/citations/list.php?linksonpage=Book:Canada, that would find all links on the page (likely direct links for simplicity) and run the bot on those pages, that would be great.
That is if you have [[Foobar|Barfoo]] somewhere on the page, get Foobar (follow redirects if there are any), and run the bot on that. Repeat for all other links it finds. Headbomb {t · c · p · b}22:30, 26 August 2018 (UTC)
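The link-harvesting step described here could be sketched as follows. This is only an illustration: `link_targets` is a hypothetical helper, it works on raw wikitext, and it does not follow redirects (that would need a query to the wiki's API). It also does not filter out File: or Category: links.

```python
import re

# [[Target]], [[Target|label]] or [[Target#Section|label]]; group 1 is the target.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def link_targets(wikitext: str) -> list[str]:
    """Return the distinct link targets on a page, in order of first appearance."""
    seen: list[str] = []
    for match in WIKILINK.finditer(wikitext):
        title = match.group(1).strip()
        if title and title not in seen:
            seen.append(title)
    return seen
```

For `[[Foobar|Barfoo]]` the function yields `Foobar`, which is the page the bot would then be asked to run on.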
This is basically a request that would allow any user to run a full-automated bot without needing WP:BRFA. Given this is a tool designed for manual watching of diffs, I wonder how wise it would be to turn the bot keys over. -- GreenC16:27, 5 January 2019 (UTC)
Indeed. It is not even a category, so one could do this on a fashion article and find a link to quantum mechanics because the designer's uncle was a physics professor. AManWithNoPlan (talk) 16:48, 5 January 2019 (UTC)
Yeah I agree there's a concern there. While running on Book:Canada (and other books) is no different than running on a category, maybe build a whitelist of users that could use it in such a fashion on other pages? Or some other whitelisting (e.g. any page that start with "Book:", "Wikipedia:WikiProject ..." + specific pages "User:EXAMPLE/SANDBOX2"). Headbomb {t · c · p · b}18:07, 5 January 2019 (UTC)
Yeah, but that's not extremely useful. I know what pages are on Book:Canada[5] (or say WP:JCW/Sandbox[6]), the goal is to kick the bot into action once the list of pages to run on has been built, much like it does with a category. Headbomb {t · c · p · b}20:58, 7 January 2019 (UTC)
So now we have a piped list of things no one knows to do anything with, and articles that still don't get edited by citation bot. Headbomb {t · c · p · b} 13:57, 8 January 2019 (UTC)
When trying to expand all the citations in an article, one is not expanded. It is a journal that apparently has two bibcodes, 1982mcts.book.....H and 1982MSS...C03....0H. Although the non-book bibcode is in the template, the book bibcode is reported for the big query, then later the non-book bibcode is reported as not found (big query returning a different bibcode from the one submitted in the query?). The details for this bibcode are added to the next citation returned from the big query, together with some details from that bibcode (additional authors, etc.).
There is something wrong with that one bibcode that redirects to another one. That makes us not expand it, since one check we do is to make sure the bibcode we get back is the one we sent out. This is unfixable, since we will not remove the double check. The second issue is that the bot currently rejects expansion of any book bibcodes, since that requires us to write code that we have not done yet. I might look into writing that code. AManWithNoPlan (talk) 22:53, 6 January 2019 (UTC)
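The double check described above can be pictured roughly like this. It is a sketch only: `accept_bibcode_record` and the record layout are hypothetical, and detecting book bibcodes by the ".book." marker is an assumption based on the example 1982mcts.book.....H.

```python
def accept_bibcode_record(requested: str, record: dict[str, str]) -> bool:
    """Accept a returned record only if it is for the bibcode that was asked for."""
    returned = record.get("bibcode", "")
    if returned != requested:
        # The service redirected to a different bibcode: do not expand.
        return False
    if ".book." in returned:
        # Assumed marker for book bibcodes, which are not expanded here.
        return False
    return True
```

In the reported case the query for 1982MSS...C03....0H comes back as 1982mcts.book.....H, so the record is rejected by the first test.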
No citations are mangled now, at least not in that example. Bibcode 1982MSS...C03....0H is ignored. Book bibcodes are ignored, except that cite journal templates are changed to cite book templates. Lithopsian (talk) 14:58, 7 January 2019 (UTC)
Anthony Chenevix-Trench: {{cite news |last=Heffer|first=Simon|title= Beaten by Eton: The Land of Lost Content: The biography of Anthony Chenevix-Trench by Mark Peel |date=27 July 1996 |accessdate= 3 December 2012 |location =London |newspaper=[[Daily Mail]] {{Subscription required|via=[[Questia Online Library]]}}|url=https://www.questia.com/read/1G1-111427463}}
On both subscription status is noted with the {{subscription}} template, which can be inside or outside the CS1|2.
The better format would be:
Bovver boot: {{cite news | url=https://www.questia.com/read/1G1-61177939 | title=Max hangs up his boots with £200m | work=[[The People]] | url-access=subscription | via = [[Questia Online Library]] | date=March 31, 1996 | accessdate=March 4, 2013 | author=Gunn, Cathy}}
Anthony Chenevix-Trench: {{cite news |last=Heffer|first=Simon|title= Beaten by Eton: The Land of Lost Content: The biography of Anthony Chenevix-Trench by Mark Peel |date=27 July 1996 |accessdate= 3 December 2012 |location =London |newspaper=[[Daily Mail]] |url=https://www.questia.com/read/1G1-111427463 | url-access=subscription | via = [[Questia Online Library]] }}
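The conversion between the two formats could be sketched as below, under the assumption that the text passed in contains exactly one cite template and at most one {{Subscription required}} or {{subscription}} call. `fold_subscription` is a hypothetical name; real wikitext has enough edge cases that this is an illustration rather than a safe bot task.

```python
import re

# Matches {{Subscription required}} / {{subscription}}, with an optional |via= value.
SUBSCRIPTION = re.compile(
    r"\s*\{\{\s*[Ss]ubscription(?: required)?\s*"
    r"(?:\|\s*via\s*=\s*(?P<via>.*?))?\s*\}\}"
)


def fold_subscription(cite: str) -> str:
    """Move the subscription note into |url-access= and |via= of the cite template."""
    match = SUBSCRIPTION.search(cite)
    if match is None or "url-access" in cite:
        return cite  # nothing to do, or already converted
    body = (cite[: match.start()] + cite[match.end():]).rstrip()
    if not body.endswith("}}"):
        return cite  # not the shape this sketch understands; leave it alone
    extra = " |url-access=subscription"
    if match.group("via"):
        extra += " |via=" + match.group("via").strip()
    return body[:-2].rstrip() + extra + " }}"
```

The function leaves the input untouched whenever it does not recognise the structure, which matches the caution about only acting when there is a single cite template in the ref tag.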
Would also want to only do this if there was one cite template in the ref tag; since, one might be applying this to more than one cite template. Given that this is not easily done within the bot’s code, it might be best to make a Bot request. AManWithNoPlan (talk) 20:02, 7 January 2019 (UTC)
I thought about making a bot but not sure it would pass COSMETIC. Understood about matching up is tricky. Will keep the idea in mind. --GreenC20:23, 7 January 2019 (UTC)
So if you find |format=PDF or similar (e.g. |format=pdf / |format=Portable Document Format / |format=pdf), remove it as pointless. Headbomb {t · c · p · b}17:41, 5 January 2019 (UTC)
I think |format=pdf exist in case the URL does not have an apparent ".pdf", so this suggestion would only be done when the URL has a ".pdf". But I wonder if there is any other reason for using |format=pdf? -- GreenC18:22, 5 January 2019 (UTC)
I find those rather pointless personally, but the above request was for when URLs end in PDF. I'll update the header. Headbomb {t · c · p · b}18:37, 5 January 2019 (UTC)
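For concreteness, the narrowed request (only when the URL itself ends in ".pdf") might look like the sketch below. `drop_redundant_pdf_format` is a hypothetical helper and the set of accepted spellings is an assumption; whether the removal is desirable at all is a separate question for the discussion.

```python
from urllib.parse import urlsplit

# Assumed spellings of the value that count as "this is a PDF".
PDF_FORMATS = {"pdf", "portable document format"}


def drop_redundant_pdf_format(params: dict[str, str]) -> dict[str, str]:
    """Remove |format= only when it says PDF and the URL path ends in .pdf."""
    fmt = params.get("format", "").strip().lower()
    path = urlsplit(params.get("url", "")).path.lower()
    if fmt in PDF_FORMATS and path.endswith(".pdf"):
        return {k: v for k, v in params.items() if k != "format"}
    return dict(params)
```

Checking the path rather than the whole URL means a query string such as `?download=1` after `paper.pdf` does not hide the extension.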
According to Xover: "An URL ending in ".pdf" can (and not infrequently does) return something other than a PDF." Trappist also brought up the concern that other wiki-languages don't support the PDF icon unless there is format=pdf, thus when they copy cites from enwiki they lose this meta information. Those are the two concerns that came up. -- GreenC 15:50, 7 January 2019 (UTC)
Yeah, Headbomb's description there about the result of the discussion is clearly biased or otherwise misleading in its intent. That discussion has not completed at this time. --Izno (talk) 15:51, 7 January 2019 (UTC)
a) If a url ending in .pdf returns anything but a PDF, then |format=PDF will STILL be displayed. b) This is the English Wikipedia. Unlike |language=English other wikis can easily implement automatic PDF detection, and would be better off doing so. Headbomb {t · c · p · b}16:44, 7 January 2019 (UTC)
This is the wrong bot for the initial cleanup. Something else needs to fix this and then we can play whack a mole on new ones. Assuming this is a good idea of course. AManWithNoPlan (talk) 19:17, 7 January 2019 (UTC)
There wouldn't be any 'initial cleanup' really, it's a cosmetic cleanup, so that's akin to removing |postscript=. or |url=<PMC-URL>. It simplifies the edit window and makes references easier/more consistent to edit. Headbomb {t · c · p · b} 21:03, 7 January 2019 (UTC)
Yes, but it seems to be adding it to all {{cite journal}}'s with |journal=Genetics, which is a bug. Link to old edit. I saw the script trying to make this change on a page a few moments before I reported this as well, so it is still doing it. (t) Josve05a (c)02:13, 11 January 2019 (UTC)
That is just spiffy. Might have to block that bibcode explicitly. I will investigate later tonight. Probably will need to search for it and remove it where ever it is. AManWithNoPlan (talk) 02:45, 11 January 2019 (UTC)
Let us all take a moment to ponder Headbomb being wrong about something. This is a rare event. Please observe a moment of silence. 🤣😁😂 AManWithNoPlan (talk) 03:58, 11 January 2019 (UTC)
I cleaned up the current uses, btw. The only thing in common they had is they all were concerning citations for various articles of Genetics.Headbomb {t · c · p · b}06:24, 11 January 2019 (UTC)
The cause is that the journal Genetics is not indexed, but this one book has Journal=Genetics set in its record. Thus, any search for journal=genetics gets a hit. AManWithNoPlan (talk) 06:30, 11 January 2019 (UTC)
{{cite arXiv |author=Limin Lu |date=1998 |title=The Metal Contents of Very Low Column Density Lyman-alpha Clouds: Implications for the Origin of Heavy Elements in the Intergalactic Medium |eprint=astro-ph/9802189 |display-authors=etal}}</ref>
to
{{cite journal |author=Limin Lu |date=1998 |title=The Metal Contents of Very Low Column Density Lyman-alpha Clouds: Implications for the Origin of Heavy Elements in the Intergalactic Medium |arxiv=astro-ph/9802189 |display-authors=etal|bibcode=1998astro.ph..2189L }}</ref>
The style guides and template documentation strongly discourage the use of urls unless they link to a 100% free full copy. Also, URLs that duplicate other identifiers are discouraged even if free. One reason is that with a doi you know you are going to a publisher, a link is without context. AManWithNoPlan (talk) 01:12, 12 January 2019 (UTC)
Bot breaks URL in pages field of citation template by changing hyphen to en dash in hidden URL
This bug may occur in this case because the link is a protocol-relative URL, which is a deprecated link format on Wikipedia. In such cases, citation bot should update the link format instead of breaking the URL with the unfortunate hyphen/dash exchange. Biogeographist (talk) 16:14, 10 January 2019 (UTC)
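One way to avoid this class of breakage is to apply the hyphen to en dash conversion only to the parts of the field that are not URLs. The sketch below is an assumption about how that could be done, not the bot's actual fix; `dash_pages` is a hypothetical name and the URL pattern deliberately also matches protocol-relative links starting with "//".

```python
import re

# A URL with a scheme, or a protocol-relative one, up to whitespace or "]".
URL = re.compile(r"((?:https?:)?//[^\s\]]+)")


def dash_pages(pages: str) -> str:
    """Turn digit-hyphen-digit into an en dash everywhere except inside URLs."""
    parts = URL.split(pages)
    # With one capturing group, re.split puts the URLs at the odd indices.
    for i in range(0, len(parts), 2):
        parts[i] = re.sub(r"(?<=\d)\s*-\s*(?=\d)", "–", parts[i])
    return "".join(parts)
```

For a field such as `[//example.org/a-b/12-3 45-67]` only the visible range `45-67` is changed; the link target is passed through byte for byte.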
URLs should almost never be modified unless it can issue a GET to verify the new URL works, or in known cases of URL changes. -- GreenC16:21, 10 January 2019 (UTC)
Side bar: people often talk about old-crusty-unreadable code. They say things like: we need to replace this Fortran with C/C++/Java/Go/etc.. Then they do that and discover that the old code was unreadable since 90% of the code was error/exception handling. The same is true of the Citation Bot: if the template were always used right and they did not have six different names for the exact same parameter, then the bot would be 75% smaller. This is GIGO, but I think we can prevent the GO half. AManWithNoPlan (talk) 17:48, 10 January 2019 (UTC)
{{wontfix}} so many links and so many that block us or time out that it does eventually finish (after a long-time), if you (and your web browser) will let it. Probably best to run section by section. AManWithNoPlan (talk) 16:49, 16 January 2019 (UTC)
There also is "first=SPIEGEL ONLINE, Hamburg|last=Germany" on the page already which also does not seem to be correct, however this was not added by the bot.
Once we have a doi to search with, we do not search AdsAbs using arXiv. If the bibcode does not know about the doi, then it is outdated information. AManWithNoPlan (talk) 21:36, 16 January 2019 (UTC)
Not really no, there's a slew of citations, mostly in mathematics, that never get anything but an arxiv bibcode. Headbomb {t · c · p · b}21:37, 16 January 2019 (UTC)
at the very least I should combine the two title checking codes into a function call and remove dashes before doing the compare since bibcodeland seems to eat em dashes and leave an empty plate of white space in its place. AManWithNoPlan (talk) 19:28, 17 January 2019 (UTC)
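A single shared comparison function along those lines might look like this sketch. The names `normalize_title` and `titles_match` are hypothetical; dashes are replaced by a space and whitespace is then collapsed, so a title with an em dash and the same title with a blank in its place compare equal.

```python
import re
import unicodedata

# Hyphen-minus, the Unicode hyphen/dash block U+2010..U+2015, and the minus sign.
_DASHES = re.compile(r"[\u2010-\u2015\u2212\-]")


def normalize_title(title: str) -> str:
    """Canonical form used only for comparison, never for display."""
    text = unicodedata.normalize("NFKC", title)
    text = _DASHES.sub(" ", text)
    text = re.sub(r"\s+", " ", text)
    return text.strip().casefold()


def titles_match(a: str, b: str) -> bool:
    """True if two titles are the same after dash and whitespace normalisation."""
    return normalize_title(a) == normalize_title(b)
```

Keeping the normalisation in one function means both title checks use identical rules.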
Why does the bot remove publisher and location from the "Cite journal" template? Especially for magazines that have been published for a long time, these things change and may perhaps be of interest? Mr.choppers | ✎ 04:20, 19 January 2019 (UTC)
please see above discussion links and join in. One might ask why is the publisher information almost always wrong. You might also ask why do people use cite journal for non-journals such as magazine? AManWithNoPlan (talk) 04:43, 19 January 2019 (UTC)
May be of interest is not a worthwhile reason - the citation is for a reference in an article and not intended for a treatise on the magazine itself. If such information is useful, then please wikilink the magazine name and create a nice article for it. AManWithNoPlan (talk) 04:46, 19 January 2019 (UTC)
If it is useless, then why do the parameters exist? I only found out about the existence of "cite magazine" a little while ago, hence the occasional reappearance of "cite journal." Mr.choppers | ✎ 16:34, 19 January 2019 (UTC)
they exist because all the citation templates are based upon the same core code and core documentation. So, there are lots of useless parameters. AManWithNoPlan (talk) 16:45, 19 January 2019 (UTC)
Flagging for archiving since links exist above {{notabug}}. The documentation is lacking considering the publisher location removal has been standard for a decade.
We run our own Citoid installation. He uses Wikipedia’s install. The Wikipedia install would have to be willing to allow us to hit them much more aggressively than their policy allows, but that would make it easier for us. We do nothing with combining equivalent references. AManWithNoPlan (talk) 13:32, 8 January 2019 (UTC)
I have looked at the reFill code base and it appears to not use the Citoid instance, at least not for everything. That is one reason it seems to handle international stuff much better. AManWithNoPlan (talk) 00:19, 11 January 2019 (UTC)
I know that some users are tirelessly working on converting bare links to journal articles into {{cite journal}} calls (which then citation bot can clean up). What are your preferred ways? Do you have regular expressions or other aids to share for the purpose? I see that a simplistic regex search for DOI URLs in bare links, like insource:http insource:/\[http[^ ]+10\.[0-9]{4,5}\/[^ ]+ /, finds several thousands of pages and I'm not sure what's the best way to help. Nemo18:07, 16 January 2019 (UTC)
I usually just search for something like insource:/\>https\:\/\/doi\.org\/10/> or search for specific publisher links and try to "fix all" from that domain. (t) Josve05a (c)18:20, 16 January 2019 (UTC)
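For working on a dump or a single page offline, the first search pattern above translates fairly directly into Python. This is a sketch: `find_doi_links` is a hypothetical helper, and the 4 to 5 digit prefix length is copied from the insource query rather than from any DOI specification.

```python
import re

# "[" followed by an http(s) URL that contains something shaped like a DOI.
BARE_DOI_LINK = re.compile(r"\[(https?://[^\s\]]*?(10\.[0-9]{4,5}/[^\s\]]+))")


def find_doi_links(wikitext: str) -> list[tuple[str, str]]:
    """Return (url, doi) pairs for bracketed external links that embed a DOI."""
    return [(m.group(1), m.group(2)) for m in BARE_DOI_LINK.finditer(wikitext)]
```

The extracted DOI could then be used to draft a {{cite journal}} call for the bot to expand, subject to a human check that the link really is a journal article.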
On that edit I only used the citation expander. The bot/tool can convert bare refs (with only URL) to the proper cite template, so no need to add basic cite journal fields manually. (t) Josve05a (c) 15:35, 18 January 2019 (UTC)
Is it just me, or has the bot been considerably slower for about a week? We're talking 30 minutes + to run on articles. Sometimes several hours. Headbomb {t · c · p · b} 03:24, 21 January 2019 (UTC)
I'm not sure, but other tools experience 500 errors due to a buildup of connections and work around the problem by periodically doing a "webservice restart". It seems kubernetes doesn't yet support increasing parallelism. Nemo11:56, 21 January 2019 (UTC)
These links can sometimes be ok, but they are often a violation of publisher copyright, so they can only be added if citeseer traces their provenance back to an author copy or a publisher-licensed copy. This needs to be checked by hand. Citation bot should never add such links automatically. There is currently a similar thread about Zenodo at WP:ANI likely to lead to a topic ban from modifying citations for the user incautiously adding such links. Do we want such a ban to be given to Citation bot? The edit is shown as "user activated" but is listed as being made by the bot and there is no responsibility assigned to a specific user for this bad edit.
Users are always responsible for the edits of the bot, since they are the ones that asked the bot to make the edit in the first place, so nothing is automatically added. The best way to deal with (the very small number of) copyvios on CiteSeerX is to contact them to take down the offending file (and possibly put a comment in the citeseerx parameter such as |citeseerx=<!--Copyvio: 10.1.1.whatever/foobar-->, although the CiteSeerX page contains more than just the file and the metadata it gives is useful). Headbomb {t · c · p · b} 16:59, 7 November 2018 (UTC)
The number of copyvios is not small, because citeseerx copies all sorts of copies of papers — often copies made available for some course by someone else – that are neither author copies nor licensed from the publisher. They may be fair use for a course but that doesn't make them fair use for citeseerx and for us. And if the edit cannot be attributed to the specific user who caused it (and that user convinced or prevented from continuing to make bad edits) or if the process does not involve the user specifically vetting the edits that are made, with a big warning about COPYLINK, then it should not be happening at all. —David Eppstein (talk) 17:23, 7 November 2018 (UTC)
since we do not link to the PDF directly, does that make it okay? honest question about how close to the illegal copy do we need to be in order to be evil. AManWithNoPlan (talk) 18:00, 7 November 2018 (UTC)
I doubt it. We're linking to a site whose only purpose is to provide the link. WP:ELNEVER seems unambiguous: "If there is reason to believe that a website has a copy of a work in violation of its copyright, do not link to it." —David Eppstein (talk) 18:07, 7 November 2018 (UTC)
goes double for people. The Pope and the queen of England are both exempt from all criminal prosecution worldwide. They have sovereign immunity at home and diplomatic immunity everywhere else. AManWithNoPlan (talk) 21:26, 22 January 2019 (UTC)
Why is cite book preferred to cite web for online “books”?
When we query the website, it gives us publisher information and says “I am a book online”. Many of the journals/books/patents/etc people reference are through websites. The issue of which template is preferable is another issue. AManWithNoPlan (talk) 23:26, 22 January 2019 (UTC)
The problem is, the online version doesn't give page numbers in the paper edition, and I am not citing a book, I'm citing a web page. I've never seen the book. I don't know why a bot is overriding legitimate editor choice of template. To me it seems a perverse outcome. Peacemaker67 (click to talk to me) 23:49, 22 January 2019 (UTC)
I am shocked that works. I forgot I added support for comments in the template type a while ago. I will blame being 35000 feet up in the air. AManWithNoPlan (talk) 03:15, 23 January 2019 (UTC)
That said, I do love the feature, but I would restrict it with an API call / additional checkbox in [25] so that usage is intentional, and users are warned to only use this on articles they plan to fully cleanup citations after the bot. Headbomb {t · c · p · b}18:22, 22 January 2019 (UTC)
Still iffy. The first one is a bare link, but when it tries to reformat a manual citation, you'll get in trouble and some will demand heads on pikes. An 'advanced' checkbox would probably be OK, but by default this is likely too risky. Headbomb {t · c · p · b}18:36, 22 January 2019 (UTC)
At least in Special:PermaLink/879639416, however, the citation ends up using a style consistent with the pre-existing one and all the others, which is why the specific case seemed fine to me. Isn't it? As for the general case, there ought to be a way to check whether the existing references use an inconsistent style or a (non-)style falling outside the realm of "where Wikipedia does not mandate". Nemo19:03, 22 January 2019 (UTC)
{{cite journal |last1=Benedict |first1=Ruth |title=Reviewed Work: An Apache Life-Way: The Economic, Social, and Religious Institutions of the Chiricahua Indians by Morris E. Opler |journal=American Anthropologist |series=New Series |date=October–December 1942 |volume=44 |issue=4, Part 1 |pages=692–693 |url=https://www-jstor-org.rp.nla.gov.au/stable/663315 |accessdate=17 January 2019 }}
What should happen
{{cite journal |last1=Benedict |first1=Ruth |title=Reviewed Work: An Apache Life-Way: The Economic, Social, and Religious Institutions of the Chiricahua Indians by Morris E. Opler |journal=American Anthropologist |series=New Series |date=October–December 1942 |volume=44 |issue=4, Part 1 |pages=692–693 |url=https://www-jstor-org.rp.nla.gov.au/stable/663315 |accessdate=17 January 2019 }}
seriously, that page is flagged to demand the use of the date format that the bot used. I wish everyone would use yyyy-mm-dd for computer stuff. In my writing I use 7 MAY 2001 format. AManWithNoPlan (talk) 14:08, 27 January 2019 (UTC)
@MB: I should note that there was consensus for over a decade to always remove publishers and recently this consensus is being challenged. I only note this since multiple people incorrectly believe that this is a new feature of the Bot. AManWithNoPlan (talk) 18:00, 27 January 2019 (UTC)
The correct title is actually on the reference page, it can be found after "Subscribe to the FT to read:", in this case it's: "Hanjin bankruptcy brings chaos but no capacity cut". Not sure if it's feasible/possible to make the bot search for this. It seems to be like this for all Financial Times pages (ft.com) so preventing all links to that site from being edited by the bot is also a possibility. Redalert2fan (talk) 19:39, 26 January 2019 (UTC)
See bug report template. Both an access date and a complete URL were removed by Citation Bot from a "Cite journal" template. RobDuch (talk) 20:50, 28 January 2019 (UTC)
The url removal is described merely as parameters removed. The reason, which the person using the bot would see, is that the url is removed because it is redundant with the DOI. AManWithNoPlan (talk) 21:11, 28 January 2019 (UTC)
The description of what is done is always hard to write since it is summarizing possibly 40 changed templates with 100 changes in one line. That is impossible to get right every time. AManWithNoPlan (talk) 22:06, 28 January 2019 (UTC)
Citation bot is making odd changes to references like this where it converts a {{cite journal}} to a {{cite book}} (when the reference in question very much is a journal, not a book) and removes valid publisher information. See also here where the bot simply removed parameters with no discernible reason. Can anyone explain why the bot is doing this? Parsecboy (talk) 12:40, 29 January 2019 (UTC)
The first edit is a bit strange, an alleged journal with an ISBN. Worldcat and Google Books seem to indicate that it is a book in a series rather than a typical journal. The bot might be able to be coded to avoid doing what it did, since context is important.
So, the bot removed the publisher since it is a journal and changed the type since it is a book. So, is it a bournal or jook? I am only 90% joking. The distinction is not always clear between a series of books and a journal. AManWithNoPlan (talk) 15:28, 29 January 2019 (UTC)
There are ten different DOI providers. We have always supported Crossref. We added more recently. Now even more are coming. We also are adding tests for the ones that don’t work so we know if they suddenly start working and can check for bugs. Who knew that movies had dois? And no, we don’t expand the black panther marvel movie doi even with the new code. https://github.com/ms609/citation-bot/pull/1253 AManWithNoPlan (talk) 18:18, 26 January 2019 (UTC)
{{cite web}} is incorrectly changed to {{cite book}} in two Kirkus Reviews citations; this is never a good idea because the parameters in each are intentionally styled differently. For example, |title= in {{cite book}} italicizes, but |title= in {{cite web}} does not, because the title of a web article should not be automatically italicized in its entirety.
Looks like this issue is similar/related to the previously reported bug where {{cite web}} was changed to {{cite journal}}. What are the criteria with which this bot is changing citation templates from one to another? I think we can assume that most of these templates have been specifically chosen by editors, what is the bot supposed to be "fixing"? Thanks.— TAnthonyTalk18:43, 30 January 2019 (UTC)
the website tells us that it is a book. I will look into it. Honestly editors generally don’t put much thought into which one they choose. AManWithNoPlan (talk) 18:48, 30 January 2019 (UTC)
Well they are online book reviews, not books. And I've found that even sloppy editors can see the difference between cite web and cite book, I do a lot of citation cleanup and have almost never had to change a template like this.— TAnthonyTalk18:51, 30 January 2019 (UTC)
See here. In this case these were correct "cite web" links, but because the URLs were the same ISSN-DOI links that posed problems earlier, the bot changes this to "cite journal" links. The URL still redirects to the correct place, but the DOI doesn't. It incorrectly replaces the "work" field with a "journal" field (there is no journal with the title "Wiley Online Library"...).
The DOI is valid and points to the correct journal, but you are right that these ISSN-only DOIs are problematic and should probably be 100% ignored. AManWithNoPlan (talk) 16:57, 29 January 2019 (UTC)
The bot adds the doi because of the ISSN in the URL. However, the doi goes to the journal main page, even if the URL was pointing to another page (e.g. the listing of the editorial board). What was being referenced here was a page on the journal's website, not an article published in the journal, so "cite web" was correct and "cite journal" is not. Note that the ISSN-containing URL has been abandoned by Wiley and pages have gotten new URLs that don't contain the ISSN. The old URLs are still functional, they are redirected to the new (non-ISSN) URL. Ideally, the bot would replace the old URL with the new one, but I have no idea how easy/difficult that is. If it's too hard, the bot should leave these instances alone. --Randykitty (talk) 17:07, 29 January 2019 (UTC)
Sorry, but this is not fixed. I just reverted the above diff and ran the bot again, with almost the same result, except that there are now erroneous DOIs and it still is changed to "cite journal"... --Randykitty (talk) 15:15, 30 January 2019 (UTC)
I have looked at those github links, but must admit my ignorance here and have no idea what all that means. Meanwhile, the bot is still doing this ([30]). I've corrected a few by hand, but that's quite tedious. This is such a wonderful tool and I really appreciate all the work and effort of you guys to keep this running, so I feel really bad to keep pestering you about this... --Randykitty (talk) 11:46, 2 February 2019 (UTC)
The linked code. I will convert to English. Drop url if:
Citation is complete
The doi is not an ISSN-only doi (points to article not journal)
The url hostname is on the list of canonical publishers
The url does not contain 'pdf', 'image', 'plate', 'figure', or 'picture'
The doi resolves to something
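Read as a single predicate, the conditions listed above could be sketched as follows. This is not the linked code: the `Citation` fields, the contents of `CANONICAL_PUBLISHER_HOSTS`, and the function name are all hypothetical, and the expensive checks (completeness, DOI resolution) are represented as precomputed booleans.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit

# Placeholder list; the real constant is reported to be CANONICAL_PUBLISHER_URLS.
CANONICAL_PUBLISHER_HOSTS = frozenset({"link.springer.com"})
MEDIA_WORDS = ("pdf", "image", "plate", "figure", "picture")


@dataclass
class Citation:
    url: str
    complete: bool          # all expected fields are already filled in
    doi_is_issn_only: bool  # DOI points at the journal, not an article
    doi_resolves: bool      # DOI resolved to something when polled


def should_drop_url(c: Citation, hosts: frozenset = CANONICAL_PUBLISHER_HOSTS) -> bool:
    """All five listed conditions must hold before the URL is dropped."""
    host = (urlsplit(c.url).hostname or "").lower()
    mentions_media = any(word in c.url.lower() for word in MEDIA_WORDS)
    return (
        c.complete
        and not c.doi_is_issn_only
        and host in hosts
        and not mentions_media
        and c.doi_resolves
    )
```

Because every condition is conjunctive, any single failure (for example a URL containing "pdf") keeps the URL in place.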
The title is slightly misleading because this code doesn't check at all whether there is a match, it just relies on whoever has previously compiled the template to have verified and stated an identity between the URL and the DOI. You could argue that if they didn't it's just GIGO (example) and that this assumption works in the large majority of cases but I'd be curious if it's 99 %, 95 % or 80 % or whatever. Maybe I'll run some regex on the dumps so that whoever wants can check a sample of URLs.
Speaking of which, it may be helpful to use a slightly different constant than CANONICAL_PUBLISHER_URLS, where several domains are unlikely to be the target of a DOI: for instance link.springer.com receives nearly all DOI redirects, while www.springer.com is more likely to contain journal descriptions where the URL patterns can get tricky. Nemo20:02, 27 January 2019 (UTC)
I have changed the pull to now also check if the doi url matches the url in the template and also if it matches what the url redirects to when actually polled. AManWithNoPlan (talk) 01:20, 29 January 2019 (UTC)
Citation bot adds a CiteSeerX link to a paper with the same title but a different (overlapping) author set and far different publication year. The authors and date of the paper are listed correctly on CiteSeerX but Citation bot fails to detect the inconsistency.
What should happen
Citation bot detects the inconsistent authors and year, doesn't consider the papers to be the same, and doesn't add the link
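The requested consistency check might be sketched like this. It is an assumption about one reasonable rule, not existing bot behaviour: `same_paper` is a hypothetical name, names are assumed to be written "First Last", and the one-year tolerance is an arbitrary illustrative choice.

```python
def surnames(authors: list[str]) -> set[str]:
    """Last word of each name, case-folded (assumes 'First Last' ordering)."""
    return {a.split()[-1].casefold() for a in authors if a.strip()}


def same_paper(
    cite_authors: list[str],
    cite_year: int,
    candidate_authors: list[str],
    candidate_year: int,
    max_year_gap: int = 1,
) -> bool:
    """Require matching author sets and nearly matching years, not just a title match."""
    if abs(cite_year - candidate_year) > max_year_gap:
        return False
    return surnames(cite_authors) == surnames(candidate_authors)
```

An overlapping but different author set, or a publication year far from the cited one, makes the function return False, so no link would be added.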
In citation template with contribution+title+series+publisher parameters, incorrectly changes publisher= to journal=
What should happen
contribution+title+series+journal is not a valid combination of parameters. Citation bot should recognize that it is converting a valid citation template into an invalid one and not do it, no matter what its source's metadata might say. GIGO is not an acceptable excuse for taking garbage from elsewhere and creating more of it here when it wasn't here already.
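A guard of the kind asked for here could be sketched as below. The list of invalid combinations contains only the one named in this report, and `is_valid` and `apply_if_valid` are hypothetical helpers; a real check would need a fuller rule set.

```python
# Parameter sets that must not all be present at once (only the reported case).
INVALID_COMBINATIONS = [frozenset({"contribution", "title", "series", "journal"})]


def is_valid(params: dict[str, str]) -> bool:
    """False if the non-empty parameters include any known invalid combination."""
    present = {name for name, value in params.items() if value.strip()}
    return not any(combo <= present for combo in INVALID_COMBINATIONS)


def apply_if_valid(before: dict[str, str], after: dict[str, str]) -> dict[str, str]:
    """Refuse an edit that would turn a valid template into an invalid one."""
    if is_valid(before) and not is_valid(after):
        return before
    return after
```

The key property is that the comparison is between the template before and after the edit, so pre-existing problems are not blamed on the change.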
Relevant diffs/links
Special:Diff/882190191 (incidentally, at least one and possibly both of the CiteSeerX links added in the same diff appear to fail WP:ELNEVER)
Replaces the documented and standard parameter arxiv=, in a citation for which the arXiv link is a courtesy link rather than the main publication venue of the reference, with the undocumented and obsolete parameters eprint= and class=
Converting |arxiv= to |eprint= could probably be removed at this point, since that dates back to a time where |arxiv= was not supported. The addition of |class= to a cite arxiv is fine though. Headbomb {t · c · p · b}22:32, 7 February 2019 (UTC)
While the conversion is technically correct, it is just one more pointless change to tick people off -- or at least confuse. Also, if the citation ever gets upgraded to {{cite journal}} we have to convert it back. AManWithNoPlan (talk) 22:55, 7 February 2019 (UTC)
I don't understand how this one happened. Citation bot did correctly find a publication matching the arXiv preprint. To do so, it must have matched title and authors, because that's the only information in common between the arXiv preprint and the published version. When I ask for bibtex metadata from doi.org, I get
@incollection{Grier_2013,
doi = {10.1007/978-3-642-39206-1_42},
url = {https://doi.org/10.1007%2F978-3-642-39206-1_42},
year = 2013,
publisher = {Springer Berlin Heidelberg},
pages = {497--503},
author = {Daniel Grier},
title = {Deciding the Winner of an Arbitrary Finite Poset Game Is {PSPACE}-Complete},
booktitle = {Automata, Languages, and Programming}
}
which does correctly include the title of the paper (but not the series). So the information was obviously there. But Citation bot chose to remove it. —David Eppstein (talk) 21:47, 7 February 2019 (UTC)
Two points, we check DOIs in this order: 1. CrossRef 2. dx.doi.org JSON (not bibtex) 3. Zotero on the website itself (yuck!). So, your information is doubly irrelevant, it is not the dx.doi.org JSON, and we use CrossRef. We get this: AManWithNoPlan (talk) 22:02, 7 February 2019 (UTC)
<isbn type="print">978-3-642-39205-4</isbn>
<isbn type="electronic">978-3-642-39206-1</isbn>
<issn type="print">0302-9743</issn>
<issn type="electronic">1611-3349</issn>
<series_title>Lecture Notes in Computer Science</series_title>
<volume_title>Automata, Languages, and Programming</volume_title>
<volume>7965</volume>
<contributors>
<contributor sequence="first" contributor_role="author">
<given_name>Daniel</given_name>
<surname>Grier</surname>
</contributor>
</contributors>
<component_number>Chapter 42</component_number>
<year media_type="print">2013</year>
<first_page>497</first_page>
<last_page>503</last_page>
<doi type="book_content">10.1007/978-3-642-39206-1_42</doi>
<publication_type>full_text</publication_type>
<article_title>
Deciding the Winner of an Arbitrary Finite Poset Game Is PSPACE-Complete
</article_title>
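The lookup order described above (CrossRef first, then the doi.org JSON, then Zotero) amounts to a fallback chain. The following is a minimal Python sketch of that idea only; Citation bot is not written this way, and every name below (`first_metadata`, the stub resolvers) is hypothetical and uses no network access.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A resolver takes a DOI and returns a metadata dict, or None if it has nothing.
Resolver = Callable[[str], Optional[Dict[str, str]]]


def first_metadata(
    doi: str, resolvers: List[Tuple[str, Resolver]]
) -> Optional[Tuple[str, Dict[str, str]]]:
    """Try each resolver in order and return (source_name, metadata) from the first hit."""
    for name, resolve in resolvers:
        metadata = resolve(doi)
        if metadata:
            return name, metadata
    return None


# Stand-in resolvers for illustration (hypothetical, no real API calls).
def crossref_stub(doi: str) -> Optional[Dict[str, str]]:
    return {"title": "Deciding the Winner of an Arbitrary Finite Poset Game Is PSPACE-Complete"}


def doi_json_stub(doi: str) -> Optional[Dict[str, str]]:
    return None


def zotero_stub(doi: str) -> Optional[Dict[str, str]]:
    return None


result = first_metadata(
    "10.1007/978-3-642-39206-1_42",
    [("crossref", crossref_stub), ("doi.org JSON", doi_json_stub), ("zotero", zotero_stub)],
)
```

Returning the source name next to the metadata is a design choice of this sketch: it would let an edit summary say which service a value came from.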
Access date should not be removed from citations to IUCN Red List assessments. Assessments get updated, and it is useful to know when an editor has checked that the information present in Wikipedia is the most updated available. A 2004 assessment recently accessed is likely to be up-to-date, whereas one with an ancient access date is more likely to need an update.
The assessment date is clear. '2004'. No need for an accessdate. Change |date=2004 to |date=30 April 2004 to be more specific. Headbomb {t · c · p · b}02:50, 9 February 2019 (UTC)
See the actual test in Module:Citation/CS1 at local function name_has_etal (name, etal, nocat).
The naive suggested implementation above can cause duplicate parameters (as in display-authors is already set and/or happens to be set to the exact number of authors in the list i.e. author1, 2, and display-authors=2 is set), or it can cross over into pages listed in Category:CS1 maint: display-authors. You can find some of the former in the contribution history there. I would say this is a bit context sensitive, which is why it's not an error at this time. Trappist the monk might have an opinion. --Izno (talk) 04:22, 7 February 2019 (UTC)
Also, one other thing I've been doing in the run is taking care of uses of |authors= where I see it, which are often used in combination. --Izno (talk) 04:43, 7 February 2019 (UTC)
Well GIGO can be handled by a different bot/AWB thing, but cases similar to the ones I linked in the diff should be able to be handled by this bot relatively easily. Headbomb {t · c · p · b}18:09, 7 February 2019 (UTC)
title= Archived copy is changed in to title= "Zoeken in over NA na een 404" which makes no sense in Dutch, it literally translates to "Search in over NA after a 404"
The "dead" page contains "Deze pagina is niet gevonden" which means "this page was not found", while the archived copy is a pdf which does not seem to contain a specific title (other than the file name). Redalert2fan (talk) 22:25, 8 February 2019 (UTC)
I think I found one more for you, exactly the same style of problem. This time it's in Vietnamese diff. title= "Bao phu nu - Đọc báo phụ nữ Việt Nam online tin tức mới nhất 24h" is added. I don't speak Vietnamese but according to Google Translate this means "Bao phu nu - Read newspaper Vietnamese women online latest news 24h" which seems like a title for the whole website and not for the specific article. By looking at the link a correct title should be something like "Bao-Trung-Quoc-noi-ve-may-bay-tuan-tieu-M28-cua-Viet-Nam". Thanks Redalert2fan (talk) 19:45, 9 February 2019 (UTC)
"J. SIAM" (correct) changed to "J. Siam" (incorrect)
What should happen
If you're going to make your usual excuse of "can't fix it because some other web site somewhere has bad metadata" then every edit needs to have the source of the metadata clearly identified so that the garbage can be traced back to its source. In this case, I checked the (JSON) metadata from doi.org and got "Journal of the Society for Industrial and Applied Mathematics" so that's not where it comes from.
Bad metadata for this is so common that we actually have a whole list of capitalization rules and exceptions. In fact it is so bad that we don’t trust the metadata and change the capitalization after we get it. AManWithNoPlan (talk) 14:11, 9 February 2019 (UTC)
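To make the "capitalization rules and exceptions" idea concrete, here is a small Python sketch of re-casing a journal name with an exception list. The word lists and the function name `fix_journal_case` are assumptions for illustration; the bot's real list is far longer and is not reproduced here.

```python
# Hypothetical exception lists; not the bot's actual data.
UPPERCASE_WORDS = {"SIAM", "IEEE", "ACM"}
LOWERCASE_WORDS = {"of", "the", "and", "for", "in", "on"}


def fix_journal_case(name: str) -> str:
    """Title-case a journal name, keeping known acronyms upper-case."""
    words = []
    for index, word in enumerate(name.split()):
        bare = word.rstrip(".,")      # word without trailing punctuation
        suffix = word[len(bare):]     # the punctuation that was stripped
        if bare.upper() in UPPERCASE_WORDS:
            words.append(bare.upper() + suffix)
        elif index > 0 and bare.lower() in LOWERCASE_WORDS:
            words.append(bare.lower() + suffix)
        else:
            words.append(bare[:1].upper() + bare[1:].lower() + suffix)
    return " ".join(words)
```

With "SIAM" on the exception list, `fix_journal_case("J. Siam")` gives back "J. SIAM"; without that entry the generic title-casing rule would produce the incorrect "J. Siam" reported above.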
Quite interesting, I also tried running it again and for the link it gave "Operation timed out after 10001 milliseconds with 0 bytes received" but in that case last time it probably didn't time out. Thanks for taking a look. Redalert2fan (talk) 23:35, 9 February 2019 (UTC)
The bot removes accessdates from citations that use chapterurl instead of the standard url. The parameter can be used as a standalone, especially when citing things like legislative texts (as my example shows). This bug was previously reported in 2015, but was withdrawn.
The bot changes the page numbers to reflect what pages the entire article can be found, overwriting any preexisting page numbers which direct the reader just to the relevant pages of said article.
What should happen
The citation should continue just displaying pages in the cited source containing the information that supports the article text. to quote Help:Citation Style 1#Pages, or A range of pages in the source that supports the content. to quote Template:Cite journal.
This was already partly addressed, perhaps a regression? A more complicated example which may be useful for additional unit testing: [37]. Nemo22:21, 13 February 2019 (UTC)
Not a bug! You appear to be referring to the cite/citation |pages= parameter, which is supposed to be a range, as appropriate for the full citation. And not the in-source specifier of where specific material is to be found, which is appropriate for individual (and multiple) short-cites within the article.
I suspect your complaint stems from this edit, which replaced things like "|pages= 64, 66, 70" and "|pages= 396, 422" with "pages= 55–76" and "pages= 381–429". This is an instance of the perennial attempt to "reuse a citation" with "named-refs". The problem is that while the "<ref name=" construction can make a note appear in more than one point in the text, it is still just one note applied to multiple, and usually differing, instances. The proper solution is to use short-cites (such as is done with the {{harv}} family of templates), which can be individually customized.
The problem here is you don't want to lose the specific page information. Which I think is legitimate. The proper way to preserve that information is put them into short-cites. But that can't be done in the bot, as the correct page number to use at each point in the text is indeterminable. E.g., one of the examples above has three page numbers, and appears in two places. Correct assignment of those page numbers requires comparison of the text with the source at each location. Until someone comes along to do that, I would like to suggest the following: that the incorrect page "range" being replaced be preserved as a comment. Also: we should have a maintenance category for such misplaced in-source specifiers. ♦ J. Johnson (JJ) (talk) 00:23, 14 February 2019 (UTC)
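The suggestion to preserve the replaced page value as a comment could look roughly like the Python sketch below. It is an illustration under stated assumptions, not the bot's code: the regular expression only handles a simple `|pages=` value with no nested templates, and the comment wording ("previous value") is invented here.

```python
import re

# Matches "|pages=", the value up to the next "|" or "}", and trailing spaces.
PAGES_PARAM = re.compile(r"(\|\s*pages\s*=\s*)([^|}]*?)(\s*)(?=[|}])")


def replace_pages_keep_old(template: str, new_range: str) -> str:
    """Set |pages= to new_range and keep the old value in an HTML comment."""

    def swap(match: "re.Match[str]") -> str:
        prefix, old, trailing = match.groups()
        if not old or old == new_range:
            return match.group(0)  # nothing to preserve
        return f"{prefix}{new_range}<!-- previous value: {old} -->{trailing}"

    return PAGES_PARAM.sub(swap, template, count=1)


before = "{{cite journal |title=X |pages= 64, 66, 70 |year=1990}}"
after = replace_pages_keep_old(before, "55–76")
```

HTML comments are not rendered, so the citation would display the full range while the specific pages remain visible to whoever later converts them into short-cites.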
It could easily be a bug, in the (not infrequent) case that the doi goes to a collection of smaller articles and the citation goes to an individual one of those smaller articles. For instance, some journals publish collections of book reviews under a single doi, but each review within that collection has its own smaller page range and its own author. Example:
I would be quite annoyed if I found Citation bot "fixing" these by expanding the page range to the whole book review column given by the metadata for the doi (pp. 241–247 in this example).
Also, putting detailed page information into short-cites only works for citation styles that use both short-cites and long-cites. Because our citation templates are unable to handle it, my usual solution for citing specific material within a longer journal paper is to write it out in untemplated text after the template. —David Eppstein (talk) 00:40, 14 February 2019 (UTC)
The bot converts a citation template with |title=/|work= parameters (where the |title= is a conference paper and the |work= is the proceedings title) to |chapter=/|title=/|work= (moving paper title to |chapter= and conference proceedings title to |title= but leaving |work= in place). The original |title=/|work= is not the best coding but is a valid combination of parameters. The changed |chapter=/|title=/|work= is an invalid combination, the citation template complains about it, and in addition it fails to display the chapter.
What should happen
Citation bot should never convert a template with a valid combination of parameters to a template with an invalid combination of parameters.
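One way to express that rule is a guard that compares the parameter set before and after an edit. The Python sketch below is hypothetical: it models only the single conflict described in this report (|chapter= used together with |work= or one of its aliases), and the alias list is an assumption rather than a copy of what Module:Citation/CS1 checks.

```python
from typing import Dict

# Assumed aliases of |work=; illustrative, not taken from the module source.
WORK_ALIASES = {"work", "journal", "newspaper", "magazine", "website", "periodical"}


def params_conflict(params: Dict[str, str]) -> bool:
    """True if |chapter= is set together with |work= or one of its aliases."""
    has_work = any(params.get(name) for name in WORK_ALIASES)
    return bool(params.get("chapter")) and has_work


def safe_apply(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, str]:
    """Keep the original parameters if the edit would introduce a conflict."""
    if not params_conflict(old) and params_conflict(new):
        return old
    return new


old = {"title": "Paper title", "work": "Proceedings of Foo"}
bad = {"chapter": "Paper title", "title": "Proceedings of Foo", "work": "Proceedings of Foo"}
good = {"chapter": "Paper title", "title": "Proceedings of Foo"}
```

Here `safe_apply(old, bad)` returns the untouched original, while `safe_apply(old, good)` accepts the conversion because |work= was removed along with the move to |chapter=.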
I'm sure nothing would be different if the template also had |mode=cs1. So it's not the style, but the all-in-one template parameterization that you're complaining about. But that has its advantages, too: for instance, that way you don't have quite as much of a problem with people using cite journal for conference papers. —David Eppstein (talk) 03:13, 11 February 2019 (UTC)
This is probably a Pale Moon browser fault which apparently doesn't encode the url properly. On SeaMonkey "&" is encoded as %26, and entering the full url with unencoded "&" trims it just like the bot did. (It apparently was a temporary browser glitch, because after testing in Pale Moon, url was properly encoded too) Cause found: automatic cite in Visual Editor decodes %26 in q= to "&" (VisualEditor/Feedback). --MarMi wiki (talk) 19:53, 14 February 2019 (UTC)
I wouldn't mind if everything except id= and pg= were trimmed from Google Books links, but I think others disagree. Presumably, because this is a subject of editor disagreement, it shouldn't be overridden by the bot making a choice on what to trim. —David Eppstein (talk) 23:14, 14 February 2019 (UTC)
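For readers unsure what trimming "everything except id= and pg=" means in practice, here is a Python sketch using only the standard library. It illustrates the option under discussion and is not a statement of what the bot does; the function name and the `KEEP` tuple are assumptions.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

KEEP = ("id", "pg")  # query parameters to retain (assumed for this sketch)


def trim_google_books(url: str) -> str:
    """Drop every query parameter except those in KEEP, and drop the fragment."""
    parts = urlsplit(url)
    query = [(key, value) for key, value in parse_qsl(parts.query) if key in KEEP]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))


# An "&" inside q= must be encoded as %26, otherwise the value is cut short.
encoded = dict(parse_qsl("q=Kew%26Co&hl=en"))
unencoded = dict(parse_qsl("q=Kew&Co&hl=en"))
```

The last two lines show the encoding issue from the previous comment: the encoded query keeps "Kew&Co" as the value of q=, while the unencoded one splits at the bare "&" and leaves only "Kew".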
journal = Methods in Molecular Biology (Clifton, N.j.)
You could specify an exception for that journal/series. It's really really common, and I need to cleanup about 30-40 conversions from some weird |journal=Methods in Molecular Biology → |series=Methods in Molecular Biology → |journal=Methods in Molecular Biology (Clifton, N.j) + |series=Methods in Molecular Biology → |journal=<!-- --> + |series=Methods in Molecular Biology cycle per dump. Headbomb {t · c · p · b}02:55, 9 February 2019 (UTC)
That’s an interesting question. What should be done when a decade old consensus is challenged? Should we stop and wait or what. I don’t know. AManWithNoPlan (talk) 01:26, 10 February 2019 (UTC)
It almost always is because journals get purchased and repurchased over their history. Sure the Journal of Foo may be published by the Foo Society today, but in 2 years it might get published by Elsevier. Which means that all instances of Foo Society would need to be changed to Elsevier. This is one of the many reasons why it's completely pointless to include publisher information, against the advice of every style guide out there. Headbomb {t · c · p · b}04:19, 10 February 2019 (UTC)
@AManWithNoPlan: My ability to assume good faith is stretched pretty much to the limit by the behaviour surrounding CitationBot recently (not yours in particular), but since you asked an apparently sincere question I'll make an effort to answer it in kind.
First of all, a consensus is not a consensus if you can't link to it. CitationBot has no bot authorization for removing these parameters, and there is no community discussion supporting removing them. That means that what you have is not a consensus, but mere absence of challenge. And I didn't challenge it back in 2009 because I had no idea CitationBot existed and never saw it edit: if I did I would have challenged it then. The argument that this behaviour has consensus is thus extremely weak. Lack of objections ("implied consensus") is the very weakest form of consensus to begin with, and lack of objection due to obscurity weakens it yet further. It is sufficient to support that CitationBot's behaviour over that time was in good faith, but not sufficient to lean on when objections became evident.
In addition, the long standing and strong consensus on Wikipedia, exemplified in BRD and CON etc., is that when any consensus (both strong and weak) is challenged, the status quo prevails until a new consensus is reached. But note what status quo means in this context: article content should remain the way it was and changing it is considered edit-warring, pointy, gaming, and generally disruptive behaviour. This is why I say that Smith609, Kaldari, and you are actually at peril of sanctions here! Once such edits are challenged, all edits should cease until consensus is reached! And in this case, not only are the edits challenged, but the first close of the RfC concluded that the consensus was against making these changes. Under these circumstances, the only constructive and collegial and respectful (of consensus, I mean) thing to do is to disable this function (or rule or module or however it's implemented) until the question is resolved. You should have done that the second the RfC was launched and waited for consensus to emerge, but if it didn't become clear to you sooner it certainly should have at the first close. It's always possible that consensus will turn out in your favour (unlikely at this point, yes, but by no means impossible), in which case you can re-enable the function afterwards and now with an actual consensus to back it up.
I'll add that if it is accurate that there's been a significant uptick in removals recently (after the start of the RfC, or, worse, after the first close that indicated consensus was against you) that would actually constitute using automated editing to enforce your preference against consensus and would have to end up at the drama boards. I really really hope that isn't the case, because the project never wins when that happens (at best we just limit the damage).
But that's why I say my ability to assume good faith is stretched to the breaking point where CitationBot is concerned: at every single crossroads its proponents make the choice concomitant with "What can I get away with?" and "How can I furthest advance my preference in spite of those pesky other editors?" and "I know better than those other editors that whine and complain.". I have so far seen not a single instance where the choice indicated any kind of respect for other editors or community consensus processes. It doesn't even matter if the community is wrong, by whatever standard you choose to apply: consensus and cooperation and respect for others' opinions is the fundament of how Wikipedia functions.
So apologies for the wall of text, but I really want CitationBot to succeed, because the state of citations on the project is shockingly bad and in desperate need of improvement. But not at the expense of fundamental pillars of the project. And all this, currently, over optional parameters that do no harm, even when used incorrectly, and are required in relatively few instances; and merely because they offend the sensibilities of a few (that is, the case against is essentially a style issue, much like whether commas or full stops separate datums in citations). Strident advocacy may appear to lead to "success", for CitationBot, in the short term; but in the long term it pretty much only leads to disruption, drama, and more loss of editors that we cannot afford. Please reconsider your (collective) priorities and mode of interaction with the wider community: I would love to be a cheerleader for CitationBot, but absent at least some measure of humility towards the community, that just cannot be. --Xover (talk) 08:32, 10 February 2019 (UTC)
The bot is user-activated. If you don't want the bot to remove publishers because of a misguided belief that this information belongs there, don't use the bot. Or put a comment in the publisher field. Headbomb {t · c · p · b}08:55, 10 February 2019 (UTC)
"Consensus needs a link" is a common fallacy, to the point Wikipedia:Consensus#Achieving consensus disproves it in the first sentence: «Editors usually reach consensus as a natural process [...] Consensus is a normal and usually implicit and invisible process» (cf. 2009).
Personally I wish this feature wasn't there, because I think very few people care about it either way, but I accept that it's been there for a long while for a reason. As for the rest, maybe the more discussions there are the more popular a tool becomes (and vice versa)? Nemo11:24, 10 February 2019 (UTC)
Just asserting that something is a fallacy does not make it so. That certain forms of consensus can be presumed from "implied consensus" does not mean all consensus must be implied or even that all consensus can be implied. And this is the second time I've had to ask you to refrain from strawman arguments: I even acknowledge implied consensus in the message you presumably read since you're replying to it, and explain why "implied consensus" is not sufficient foundation for mass automated edits against explicit consensus. Even the very policy you cite (selectively) explains that an implied consensus does not hold once challenged: at which point you're supposed to engage in consensus building before editing further. --Xover (talk) 15:49, 10 February 2019 (UTC)
I don’t have a strong opinion, I am here to code. Wow! That’s a lot of explanation! My one opinion is that people should remove publisher and location (which are almost always wrong sadly) and wiki link to a page about the journal, and make it if needed: a permanent fix that makes Wikipedia better and everyone happy. I just find it funny that pretty much everyone who complains is pointing to journals with incorrect publishers listed or journals so obscure that even that information won’t help much. AManWithNoPlan (talk) 14:06, 10 February 2019 (UTC)
My apologies: I misunderstood the intent of your previous comment. Since it was phrased as a question and accompanied by a direct indication that you lacked knowledge, I took it to mean that you were soliciting answers to the apparent question. In light of your more recent comment I realize that was not the case. I shall bother you no further with either information or attempts to engage in constructive dialogue. --Xover (talk) 15:49, 10 February 2019 (UTC)
In this diff, the bot reports altering the title of a journal citation but (as far as I can see) changes a double space into a single space. Isn't this the kind of cosmetic change that is supposed to be avoided as a stand-alone edit under WP:COSMETICBOT?
Note, also, in an edit the bot made earlier this month, it altered the same citation but without changing the spacing... so I'm not sure why it made the change as a separate edit a couple of weeks later. EdChem (talk) 14:21, 15 February 2019 (UTC)
The Bot does some mostly cosmetic changes and some very important changes. On rare occasions the changes are all cosmetic. Since this is very rare, we do not track changes and then not make the edit if only cosmetic changes are made. AManWithNoPlan (talk) 14:50, 15 February 2019 (UTC)
The Bot does white space normalization. There are quite a few white space characters that we convert to spaces, and the last step is combining multiple spaces into one so that the wiki text matches the rendering. AManWithNoPlan (talk) 14:57, 15 February 2019 (UTC)
Agreed that this should be avoided on its own when it's just regular spacing if possible, but at the same time, the coding complexity for it might be too much. Normalizing other spacing (like converting invisible non-breaking spaces to regular spaces) has enough advantages to do it on its own though. Headbomb {t · c · p · b}17:30, 15 February 2019 (UTC)
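The two ideas in this thread, normalizing white space and detecting an edit that changes nothing else, can be sketched in a few lines of Python. The set of space-like characters below is an assumption chosen for illustration; the bot's actual list is not given in this discussion.

```python
import re

# Assumed space-like characters: no-break space, en/em/thin/hair spaces,
# narrow no-break space, ideographic space.
SPACE_LIKE = "\u00a0\u2002\u2003\u2009\u200a\u202f\u3000"
_SPACE_LIKE_RE = re.compile("[" + SPACE_LIKE + "]")
_MULTI_SPACE_RE = re.compile(" {2,}")


def normalize_spaces(value: str) -> str:
    """Convert space-like characters to plain spaces, then collapse runs of spaces."""
    value = _SPACE_LIKE_RE.sub(" ", value)
    return _MULTI_SPACE_RE.sub(" ", value)


def is_whitespace_only_change(before: str, after: str) -> bool:
    """True if the two strings differ only in the white space handled above."""
    return before != after and normalize_spaces(before) == normalize_spaces(after)
```

A check like `is_whitespace_only_change` would let a caller skip saving when the whole edit is cosmetic, at the cost of the extra bookkeeping mentioned above.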
It is only 99+44⁄100% cosmetic since it improves the editors view of the page by making the editable text more in line with what is displayed. Humor intended. AManWithNoPlan (talk) 19:40, 15 February 2019 (UTC)
I do not know if that is fixable. <ref>{{cite book|last=Berridge|first=Vanessa|title=The Princess's Garden: Royal Intrigue and the Untold Story of Kew|year=2015|publisher=Amberley Publishing Limited|url=https://books.google.com/books?id=NhpzCgAAQBAJ&pg=PT21|page=21]</ref> Note that the template does not END! AManWithNoPlan (talk) 23:26, 16 February 2019 (UTC)
Ah, right. In some of those regular expressions we could add the newline to the excluded characters (I would hope no DOI includes a newline!) but a broken template call is a broken template call... Nemo23:32, 16 February 2019 (UTC)
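As a rough illustration of excluding the newline, compare two simplified DOI patterns in Python. Both expressions are assumptions written for this example and are not the bot's regular expressions; real DOIs can contain characters that the stricter pattern rejects.

```python
import re

# Loose pattern: stops only at "|" or "}", so it can run across line breaks.
LOOSE_DOI = re.compile(r"10\.\d{4,9}/[^|}]+")

# Stricter pattern: also stops at white space (including newlines) and brackets.
STRICT_DOI = re.compile(r"10\.\d{4,9}/[^\s|}\]<>]+")

text = "doi=10.1007/978-3-642-39206-1_42\nsome later text |page=21]"

loose_match = LOOSE_DOI.search(text).group(0)
strict_match = STRICT_DOI.search(text).group(0)
```

The loose pattern swallows the line break and the text after it, whereas the stricter one stops at the end of the DOI. It does nothing to repair the unterminated template itself.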