Hey, Trappist! You helped me a lot with your regex. I was wondering if you could "twist it" so it would incorporate one extra step:
I use this find-and-replace regex to check whether the language name is made up of more than one part (i.e. it includes spaces) and to change it accordingly so that bash can read it.
Find: (\w)( )(\w)
Replace: \1\\\2\3
Basically for languages like "Church Slavic" we need them like this: Church\ Slavic
Any way to incorporate that logic in the regex you gave me?
Regex is pretty cool but can quickly become overwhelmingly complex. For that reason, I like simple regexes to do things one step at a time. So, for this task, I would do two steps:
replace whitespace in language names with an escaped space character:
Yes, that's basically what I'm doing right now, just reversed. I totally get your logic, but the thing is I need to do these steps over and over again: 186 times for all the 2-character ISO codes and then more than 3,000 times for the 3-character ones. And then... You get the idea. So I'm trying to minimize the number of steps needed as much as possible so I can speed up the process. I even thought of writing a Python script that would do that for me, but in order to do that I first need to lower the complexity of the task (by minimizing the number of needed steps), because I'm not that tech-savvy. :P
The actual number of steps I need to take now:
Change the code of lang lister to the next language;
Copy all languages with their corresponding codes;
Paste them all in Notepad++;
Run your find and replace regex;
Add a slash before spaces;
Add language codes by merging them in that page above;
Append " \ at the end of each line to complete the script;
Sixth and seventh steps can be combined by changing the replace to:
"\\|\\s*language\\s*=\\s*$2\\b" "|language=$1" \\
It is probable that the space-to-escaped-space regex will fail when the language name is written in a non-Latn script, or in Latn script where the letter before and/or after the space carries a diacritic (presumably because \w does not reliably match such characters). An alternate find for that is: ([^:\s]) +(\S)
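Only the find is given above, so the matching replace here is my assumption; it collapses a run of spaces into a single escaped space:

    Find:    ([^:\s]) +(\S)
    Replace: $1\\ $2

In Notepad++ replacement syntax, \\ emits one literal backslash, so Church Slavic becomes Church\ Slavic regardless of the scripts involved.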
But, why do all of this? Why don't you create a special version of lang_lister() that takes the raw input from MediaWiki and then gives you the output that you require? Doing that will reduce your list of steps to more-or-less this:
invoke modified lang_lister() with the language-code for the desired language
Hahaha, no, I meant if it was possible to do it, because somehow it had escaped my thoughts that I could change the module (template?) itself to make it do the needed work for me, rather than create a script from scratch for that. The problem is that I'm not familiar enough with it to change it from scratch. I have no idea where to start and what to do to achieve the needed results. And I thought maybe you could show me what to change and where, and I could fine-tune it further according to the bot's needs. Should I change the module Cs1 documentation support or the template Citation Style documentation/language/doc? - Klein Muçi (talk) 13:30, 14 September 2020 (UTC)[reply]
delete everything between the lang_lister() function and the exported functions table
from the exported functions table delete everything except lang_lister = lang_lister
Save it and tell me where it is so that I can help if needed.
Decide what each entry in your final list should look like. Decide what the list of entries looks like (bulleted? in columns? plain?) and then edit what remains in your new module to make that happen. So that you don't end up saving a bazillion copies of the module that don't work, on some page (a sandbox page, the usurped module talk page, or the module's doc page) add an invoke:
{{#invoke:<module name>|<function name>|lang=<language code>}} – replace <module name>, <function name>, and <language code> with actual module name, the function name (lang_lister until you change it – or not), and a legitimate language code
save that page, then copy the page name into the 'Preview page with this template' box at the bottom of the module (in edit mode). When you change something and want to test it, click the adjacent Show preview button. Still, you should save your work occasionally.
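For orientation, here is a minimal sketch of what the stripped-down module might end up looking like (the output format and argument handling are my assumptions; adapt to whatever survives your deletions):

    local function lang_lister (frame)
        local lang = frame.args.lang or 'en'                        -- language code from the {{#invoke:}}
        local names = mw.language.fetchLanguageNames (lang, 'all')  -- code → name pairs as MediaWiki knows them
        local out = {}
        for code, name in pairs (names) do
            table.insert (out, string.format ('* %s: %s', code, name))
        end
        table.sort (out)
        return table.concat (out, '\n')                             -- one bulleted code/name pair per line
    end

    return {lang_lister = lang_lister}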
So basically like this? I'm yet to make changes to it other than deleting the unneeded parts. I'm not sure I will keep the module for long after I complete the task, so I didn't think much about the name.
^ This is the standard, and I'd like to have it in a bulleted list, as that's easier to visualize, and the bullet points are not copied when copy-pasting, so... pretty convenient. What should I do next? - Klein Muçi (talk) 14:43, 14 September 2020 (UTC)[reply]
Also, I noticed something about what you had written above regarding your last regex. I don't think it fails with non-Latin scripts, because I tried it with Korean and many other "similar" scripts and it worked fine. But your last suggestion "\\|\\s*language\\s*=\\s*$2\\b" "|language=$1" \\ "doesn't work". The whole idea of the bot is to change language values used in citations into ISO codes. So having that =$1 defeats the purpose of the script. But if I can make the "module solution" happen, I won't need to deal with these extra steps, so my attention is on that right now. :P - Klein Muçi (talk) 15:14, 14 September 2020 (UTC)[reply]
Did you use this find: ([a-z]{2,3}):\s*([^\n\r]+) with this replace: "\\|\\s*language\\s*=\\s*$2\\b" "|language=$1" \\? If not, then of course it didn't work.
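For illustration – assuming a pasted line like sq: shqip (my made-up example) – that find/replace pair turns it into:

    "\|\s*language\s*=\s*shqip\b" "|language=sq" \

That is, $1 holds the code and $2 holds the name, which is why swapping them "doesn't work".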
How much assistance do you want? I can write something that should work (I have, in fact, already done so – not tested). I can strip that to a skeleton so that you can find a solution yourself...
Oh wow! Apparently I hadn't, because it does work now. I have to try that F&R above with some non-Latin languages to see whether it escapes spaces well or not. But basically, if it does, you have already shortened my work tremendously. I only have to copy-paste and use one regex before saving and restarting the cycle. As for the assistance, believe me, as much as you can offer. Everything past wikimarkup leaves me with LOTS of trial and error, given that I have no technical background. Basically it takes me a month just to reinvent the wheel, if you get my metaphor. - Klein Muçi (talk) 15:54, 14 September 2020 (UTC)[reply]
Lovely! I have only one last problem with that. There are some languages which are "problematic" in following the standard. Edge cases like: սլավոներեն, եկեղեցական, luba-katanga (kiluba) or словѣньскъ / ⰔⰎⰑⰂⰡⰐⰠⰔⰍⰟ. Can anything be done about these cases, or should I deal with them manually? That's totally doable because there are not a lot of them. I was just wondering whether a simple technical solution might be within reach without my knowing about it. The problems I mean are these: the comma would make the space-escape regex fail, since it's not a word character; the parentheses would need to be escaped too, I think, considering that this is intended as a bash script (I guess they would need to be escaped even in the regex, no?), and they would also make the space-escape regex fail; and finally the slash itself would need to be escaped too, and I believe it would also make the space-escape regex fail. Maybe other cases like these exist too, but these are the ones I've found so far (I've already done all the entries alphabetically up to the code la). - Klein Muçi (talk) 17:37, 14 September 2020 (UTC)[reply]
We're talking about this? "\|\s*language\s*=\s*սլավոներեն,\ եկեղեցական\b" "|language=cu" \ Armenian for: Slavonic, Church. sq:Module:Test doesn't do anything about commas, and commas aren't regex special characters, so I don't understand the issue. The 'space escape' in Module:test (this: name:gsub (' +', '\\ ')) works on the name only, so it doesn't need to know about the characters on either side. It just looks for one or more space characters in the language name and replaces them with a single escaped space character. I've tweaked Module:Test to escape parentheses and the virgule.
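After that tweak, the whole escape amounts to something like this (a sketch; the function name is mine, not the module's):

    local function bash_escape (name)
        name = name:gsub (' +', '\\ ')        -- one or more spaces → a single escaped space
        name = name:gsub ('([()/])', '\\%1')  -- escape parentheses and the virgule
        return name
    end

Because it runs on the name alone, it doesn't care whether the characters around the space are Latin, Armenian, or anything else.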
Yes, about that. Well, okay then... I'll test it a bit later, but I don't suspect there will be any problems with it. If everything works correctly I will start basically from scratch, because a method like this is faaar less prone to errors and especially more secure regarding space escaping in non-Latin scripts. (I had difficulties with Arabic and Hebrew.) Since there are far fewer steps to be taken now (change code, copy-paste, save – repeat), I believe there won't be a need for scripts. I just have a naive question regarding that: is there any way I can select all of them quickly with a keyboard combination? Ctrl+A wouldn't work since it would get the whole page. - Klein Muçi (talk) 18:17, 14 September 2020 (UTC)[reply]
Don't you wish there was a 'do-what-I-want-done' button? Alas, I don't know of any such keyboard shortcut.
I can think of something that might be useful. I'm guessing that the English-language code/name pairing is the fallback. Bouncing back and forth between English and Hebrew versions of the sq:Module:Test output seems to confirm that. So, what Module:Test can do is create a list of English-language names and then, if English is not the language being rendered, compare that language's name against the English list. If the names are the same, skip, else create a new find/replace string. There isn't any benefit that I can see to copying stuff you've already copied.
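In module terms that comparison is only a few lines (a sketch; lang is the language being rendered, and make_pattern() is a hypothetical stand-in for whatever builds the find/replace string):

    local en_names = mw.language.fetchLanguageNames ('en', 'all')
    local local_names = mw.language.fetchLanguageNames (lang, 'all')
    local out = {}
    for code, name in pairs (local_names) do
        if 'en' == lang or name ~= en_names[code] then  -- skip names that merely fall back to English
            table.insert (out, make_pattern (code, name))
        end
    end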
What I do now is exactly that, with extra steps. I finish with all the languages in a certain code, then I start with the next code in the list until I get to a code with a different first letter. Then I group all the lines (those from aa, ab, af, ak, etc.) in a file named, for example, just A, and I use a Notepad++ command to remove the duplicates. Then I go on with the "b-codes" and so on. Having some steps removed from that would be another present for me. :P - Klein Muçi (talk) 18:55, 14 September 2020 (UTC)[reply]
Done, I think. You can switch it on with |no-fallback=true.
Is that basically the complete list of every language possible? At least, those possible on Wiki. If yes, where can I find the complete list of codes that I would need to change one after another so I can copy-paste the results? Is it the 2+3 character codes from this page? What about the IETF ones? And those other ones? - I now see that it is the complete list and IETF tags/languages are included. I don't know if I should try the overridden ones. My eye was caught by this one: ᬩᬲᬩᬮᬶ: ban-bali - maybe it's the only one, but my computer apparently can't render that script. What do I do in this case (or in other similar cases, if they exist)? :P
Given that in this conversation you've shown me that many things I suspected weren't possible actually are, I'm taking the courage to ask: is it possible to just show all the languages in all the languages immediately, without me needing to change them one by one? That's basically the end result I'm striving for, and that would literally be a 'do-what-I-want-done' button. :P - Klein Muçi (talk) 00:54, 15 September 2020 (UTC)[reply]
The ban-Bali IETF tag (Balinese written using the ISO 15924 Bali script) is relatively new. You can see what the little boxes are supposed to be at https://r12a.github.io/uniview/ – copy/paste them into the text area box at right and click the down arrow.
Even though you struck the question, I would suggest that the only codes you need to worry about are the codes listed at List of Wikipedias.
I was wondering when you would work yourself round to that question. Could be done I think. The module would just loop through a list of codes. I'll try that and let you know.
Another question: How is it possible that even like this I still get around 1000 duplicates?! I thought I should get none now that the English duplicates are gone. - Klein Muçi (talk) 09:44, 15 September 2020 (UTC)[reply]
Are these duplicates duplicates of English? Example?
LOL Well magic IS true after all, apparently. :P I have so much to learn regarding Lua. As for the duplicates, I was looping through the codes one by one manually, copy-pasting whatever results came through (of course there were very long pages, pages with only one result, and blank pages). I finished all the codes alphabetically from aa to cs. That left me with 11,760 lines. And when I ran the "remove duplicates" command in Notepad++, I was left with 7,063 lines. That's a tremendous amount of duplicates. But I don't know exactly what was removed, because Notepad++ doesn't have a "show diff" page. Maybe I can find one online to get some examples of the duplicate lines. I'll deal with the ban-Bali IETF tag when it shows up. Thank you for the help with that! :)) - Klein Muçi (talk) 10:32, 15 September 2020 (UTC)[reply]
You can check the differences here (the old text is the original and the new one has the duplicates removed) BUT it is very confusing, as Notepad++ has to sort lines lexicographically before removing duplicates, so I can't really make much sense of it. :/ - Klein Muçi (talk) 10:53, 15 September 2020 (UTC)[reply]
Not English. For example, do a Ctrl-F search for "\|\s*language\s*=\s*авадхи\b" "|language=awa" \ in the diff: 7 finds, of which all but one were deleted. Not surprising that one language written with a Cyrillic script would have the same name as, or fall back to, another Cyrillic-script language.
I see... So English isn't the only language producing duplicates, so to speak. Should we do anything about cases like these? Or should I take care of them at the end with that command? - Klein Muçi (talk) 11:18, 15 September 2020 (UTC)[reply]
I suspect that when I change sq:Module:Test to process a list of language codes/names, duplicates will naturally fall out (before adding a code/name pair, the module will look to see if that code/name pair already exists in our list) so I will probably disable the English fallback.
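That check is cheap while the list is being built (a sketch; make_pattern() is hypothetical, as before):

    local list, seen = {}, {}
    for code, name in pairs (names) do
        local line = make_pattern (code, name)
        if not seen[line] then          -- only add a find/replace string we haven't already emitted
            seen[line] = true
            table.insert (list, line)
        end
    end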
I understand. Well, if the end result is how I imagine it to be, that would be a great solution even for future updates. If anything changes regarding the vast number of languages that exist, I can just copy-paste the whole output every once in a while and keep the bot up to date with the changes. That would require me to pay more attention to the module itself, with a proper name, given that it is here to stay now. And I can also focus on trying to automate other CS1 problem-solving with the bot apart from the language issues. If everything goes all right, the next step would be to try to make the bot check for updates on its own, but given that languages don't change that often (I'm assuming), that won't be a big problem and it can be done manually for now. - Klein Muçi (talk) 12:09, 15 September 2020 (UTC)[reply]
Oh, so we still need to add the language codes. Couldn't it display all of them "automatically"? I mean, could we get the list of all language codes inside the module itself, so we just invoke it and get the needed results? That's what I was imagining when I talked about the auto-update. And why 31? Is that the list of "Wiki languages" available? Or just an example by you, and I should add the remaining codes? - Klein Muçi (talk) 13:26, 15 September 2020 (UTC)[reply]
According to List of Wikipedias there are 303 active Wikipedias. There are 818 language codes in MediaWiki's English language list. It seems most likely that editors at any one of the 303 Wikipedias would use their local language to write <language name> in |language=<language name>, or they would have copied a cs1|2 template from another wiki that used that wiki's local language. So, the 303 languages associated with the 303 active Wikipedias seem the correct number of languages to process. Scribunto has mw.site.interwikiMap() but that function returns a table that has non-language codes (interwiki prefixes). I suppose that sq:Module:Test could spin through that table and make another table of prefixes that match an entry in the en.wiki language list. I'll think about that.
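That spin-through would be short (a sketch):

    local en_names = mw.language.fetchLanguageNames ('en', 'all')
    local prefixes = {}
    for prefix in pairs (mw.site.interwikiMap ()) do
        if en_names[prefix] then        -- keep only interwiki prefixes that are also language codes
            table.insert (prefixes, prefix)
        end
    end
    table.sort (prefixes)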
Yes, that was the logic I was following. I imagined that if it could work with more than one code simultaneously, maybe it could work with all of them being part of the module code. Maybe 303 squared is the correct number, or maybe we need all possible languages so we future-proof it? Whatever the answer, what do you suggest my next step should be? Should I go on and copy-paste all the language codes (303 or more) into the invocation, or should I wait for them to become part of the module? - Klein Muçi (talk) 14:14, 15 September 2020 (UTC)[reply]
I tweaked sq:Module:Smallem (it moved from sq:Module:Test while I was doing it – don't do that) so that it used the 327 interwiki prefixes that have matching codes in the English language list. That did not work. It gave me this category:
I suspect that it is a post-expand include size limit error. That makes some sense. The 31-language list has a post-expand include size of 706,147/2,097,152 bytes, so it shouldn't be surprising that a 327-language list would produce more than 2MB of output.
Before going any further, what will be using this output? Is this the best form of output to use?
A sorted list of all of the interwiki prefixes with matching language codes can be found in the lua log. Edit sq:Përdoruesi:Trappist the monk/Livadhi personal; click Show preview; at the bottom of the page click the 'Parser profiling data' drop-down (it will be different for you, I switched my sq.wiki interface language to English); under the Lua logs heading click [Expand].
XD I'm sorry. Yes, I was just coming here to tell you exactly that. I just tried doing the same thing, not only with those 300+ languages but with all of them. See what happens here. (Not a surprise anymore.) That category is precisely that. Apparently it's a limit to protect the server from crashing/slowing down.
As I've mentioned earlier, it is a replace pywikibot called Smallem. At the moment it literally does just that: finds and replaces. For example, it finds deprecated harv parameters and removes them, it finds language values and replaces them with their corresponding ISO codes, it finds CS1 categories added as categories (not from templates, we've discussed that earlier) and removes them... The initial, ideal plan was to make it better at finding and fixing different kinds of CS1 errors, but since I was inspired by the language parameter, I started with that, and unfortunately that opened a rabbit hole which put a halt to the general progress.
Firstly, I devised by trial and error the regex you're currently seeing, in order to fix/replace/modify as many language values as it could. Then I made the bot work only with English language terms, given that that's the biggest occurrence. Then I added the top entries mentioned in this graphic, and the bot is currently running with that output weekly. Then I started working to add all the possible languages, given that A) I usually like to future-proof things, and B) SqWiki (or small wikis in general) is a bit unpredictable in the way it produces articles.
First I thought I would add only the languages closest to Albania, because those ought to be the languages people try to translate from the most, but judging from that graphic above I saw that wasn't true. Finnish was one of the top languages despite being quite far from Albania. Soon enough I understood that that was because SqWiki apparently tends to periodically mass-produce bot-generated articles. Therefore we have stub articles about "every" village in Finland and France (even though we may not have articles about every village in Albania yet). The data have been bot-generated by auto-translating stub articles from EnWiki, and all those articles have citations in their correlating languages, which usually get imported with their "problematic values". Add to this the fact that 90% of our new articles now come from CTT, and soon enough the citations' languages become unpredictable (even though most of the time they still come from EnWiki). Fun fact: most of our new articles at the moment are related to Japan, because that's the interest of a particular active user.
So I was trying to add all possible languages as efficiently as possible, and that's how I ended up at the technical village pump. You know the rest of the story. The output is fed into a bash script, and Smallem is run weekly on Toolforge. Soon enough I discovered that I would have to deal with big data because of the vast number of languages, and bash scripts didn't do well with those; I read that I would probably have to move to PHP or C# for a definitive solution. I'm also aware that my trial-and-error regex may not be the most optimized for this work, but I thought I would deal with those details gradually in the future, when they started to become a problem and when I had gotten the hang of it all better. Hopefully with the help of another programmer as a maintainer. (For me this is a great way to learn, and while doing so, I also help the community.)
Given your big help on the project, and the fact that it deals with CS1 problems, I also thought that maybe I could ask you in the end if you were interested. But judging by what is happening, I'm starting to see that I'm facing limits much faster than I had anticipated. - Klein Muçi (talk) 16:03, 15 September 2020 (UTC)[reply]
I did what you mentioned and it works, but what am I to do with that list, given that the limit we mentioned above blocks me from extracting any output from it? - Klein Muçi (talk) 16:16, 15 September 2020 (UTC)[reply]
I wondered if whatever tool you are using to ingest this list might have some sort of array or dictionary or something that might be more efficient than a list of independent search and replace strings. For example, if I were programming an awb script to do this, I would likely create a dictionary where the key is the language name and the value is the associated code. Then, I would write a function that searches for |language=<language name> and uses <language name> as an index into the dictionary. If found, replace <language name> with <language code>. This mechanism is much more efficient than a huge list of find/replace strings, each of which must be examined before moving on to the next article.
I don't know anything about python but surely it has something like c# dictionaries or Lua associative arrays (tables) that can hold lists of language-name / language-code pairs. Find someone who is conversant in python?
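In Lua terms (the module's language; python's dict is the direct analogue), the idea is just a keyed table (a sketch; the two entries are hypothetical examples):

    local name_to_code = {
        ['norvegjisht'] = 'no',
        ['shqip'] = 'sq',
    }

    local function code_for (name)
        return name_to_code[mw.ustring.lower (name)]  -- nil when the name is unknown
    end

Find |language=<language name>, look the name up, and substitute only when the lookup succeeds – one pass over the article instead of a giant list of patterns.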
You wanted to future proof things so here is a list that MediaWiki maintains. Yeah, as it is used now, you have to use the list in fragments. But it is there for the looking; you don't have to do much special to get it.
I understand. So basically that's the whole list that MediaWiki deals with? I shouldn't worry about ALL the languages, eh? That seems doable. How many codes do you estimate I should try in one go in order not to break the limit? Or do you not know yet? If not, don't worry; I'll find out soon enough anyway. As for the optimization of the code, I'm still not sure whether I can do that. At the moment, I don't deal with Python AT ALL. The replace pywikibot is premade and I only give it the strings to find and replace. Please, take a quick look here. See the local and global parameters. I guess other commands can be given in the bash shell, but I'm yet to experiment much there. So basically, if you chose to help in maintaining it later, much of your work would be just helping to find ways, with Lua in-wiki, to generate good regexes for it and make them part of the source code (which is just a big list of regexes), without having to deal with bash scripting or Python. You would just need a Toolforge account and me giving you access to the tool (Smallem). That is, if the bot is kept in its current mode, which, given my lack of knowledge, will be true for a while. But that's something to be discussed at the end of the conversation. For the moment, I'll try to create the needed list for languages, hopefully completing it this time, after having had to start from scratch around 5 times or so. - Klein Muçi (talk) 17:08, 15 September 2020 (UTC)[reply]
I tried the whole list. First of all, strangely enough, it still had duplicates. The whole list had about 50,000 entries, and after removing the duplicates it went down by 9,000 entries. I don't know why that is so. Secondly, unfortunately, the bash script got extra long with those added lines and there were too many arguments for it to run, so I guess I can't really use that list the way I'd like to. I wonder if it could have been possible with the method you mentioned. :/ - Klein Muçi (talk) 02:13, 16 September 2020 (UTC)[reply]
I think I can still use that list if I run the job on the job grid. But still, the duplicates were a bit surprising. Shouldn't it have had none? - Klein Muçi (talk) 09:59, 16 September 2020 (UTC)[reply]
You have to run sq:Module:smallem on sections of the whole language-code list, right? The module has no knowledge of other sections so it seems entirely likely that each independently created section will have some list items that also appear in other section lists.
I added a test probe to Module:smallem that counts the number of list items in the final list. When I let the module process all 320+ language codes, the list (were we able to render it) would have 41,968 items. The count of items in the list is available in the Lua log.
I have precisely 41,968 lines dedicated to fixing languages, so you (we?) are correct. I see now why the duplicates are created. So basically those are all the possible languages that exist? I want to confirm that as a fact so I know what to write on the bot's user page.
As for the name, I'm very glad that you asked. (Even though your question might be followed by a critique. :P ) Usually my nickname as an online persona is Bigem, a wordplay inspired by the initial letter of my last name. The bot is called Smallem, imagining it as a mini-helper. Given that the module doesn't serve for much more than helping me set up Smallem right now (and I guess that will be its purpose in the future too), that's the module's name as well. - Klein Muçi (talk) 14:32, 16 September 2020 (UTC)[reply]
Predicated on the notion that the majority of editors will write the value assigned to |language= in their own language on a Wikipedia that uses their own language, checking the languages associated with the various language editions of Wikipedia should cover most cases. An editor citing a Norwegian-language source at sq.wiki would write |lang=norvegjisht whereas an editor at en.wiki would write |lang=Norwegian, right? What we have now is a list of all of the languages supported by all of the specific-language Wikipedias.
I just wanted to know what the name meant because google translate just translated it to itself.
Ok. All of the languages supported by all of the specific language Wikipedias. Got that.
Oh, so it was sheer curiosity then. I was a bit afraid you would scold me for giving the module an obscure name. :P
So, I guess that brings us to the end of the problem and, while thanking you a hundred times for your help, I want to end it with an offer, as I mentioned in the beginning: do you want to have access to operating the bot, hoping to enhance its functionality in fixing CS1 errors in the future, maybe even beyond SqWiki? Don't be afraid at all to refuse my offer if it is outside your scope of interest. I just felt that since you helped a lot in overcoming a key problem (which would have otherwise literally taken me months to accomplish) and given that it deals with a subject you are familiar with, you deserved the possibility of operating it yourself too. - Klein Muçi (talk) 15:36, 16 September 2020 (UTC)[reply]
{{lang|fn=lang_xx_italic|code=eml|text=text}} → [text] Error: {{Lang-xx}}: unrecognized language code: eml (help)
We could do that and call it good enough, or we could add some sort of support for deprecated ISO 639 language codes to Module:Lang that would render properly formatted text with a suitable error message. I'm somewhat inclined toward the latter, because deprecated codes are just that: deprecated, not deleted.
If you think adding the deprecated support is better, then you have my support. If not, then converting the error to lang looks much better than the current setup. --Gonnym (talk) 08:43, 14 September 2020 (UTC)[reply]
{{lang/sandbox|eml|text}} → [text] Error: {{Lang}}: unrecognized language code: eml (help)
{{lang/sandbox|fn=lang_xx_italic|code=eml|text=text}} → [text] Error: {{Lang-xx}}: unrecognized language code: eml (help)
I think that some sort of error messaging is required per the TfD, though it isn't clear to me what that messaging should look like, so I will start a discussion at Template talk:Lang to see what the community thinks.
Wasn't sure you had finished the code, as it was still in the sandbox. Regarding the code, I'm still not sure why we are using user input when the template already knows what ISO codes it supports in the category. If we take as an example Category:Articles containing French-language text, which is used in the template doc, the code it says to use is {{Category articles containing non-English-language text|example=La plume de ma tante|French|fr|fre|fra}}. That produces:
This category contains articles with French-language text. The primary purpose of these categories is to facilitate manual or automated checking of text in other languages.
This category should only be added with the {{Lang}} family of templates, never explicitly.
For example: {{Lang|fr|text in French language here}}, which wraps the text with <span lang="fr">. Also available is {{Langx|fr|text in French language here}}, which displays as French: La plume de ma tante.
"fre" isn't supported and produces an error, and "fra" doesn't appear in the text anywhere. So either there is no point at all in listing the other ISO codes or there is and it should be added. But either way, if the backend knows what ISO each language supports, why can't it just retrieve the data and show it, instead of getting the data from a user, validating the data, then showing it or an error? --Gonnym (talk) 08:41, 14 September 2020 (UTC)[reply]
Ok, so the only issues there at the moment are the ISO codes. Which ISO code should I get? And what module call will give it to me? Also, is there a module call I can use that will give me a lang-x version for an ISO code (without doing an exist check)? --Gonnym (talk) 13:20, 14 September 2020 (UTC)[reply]
If you take the language name from the category title then _tag_from_name() will give you the appropriate language tag to use in {{lang}}. You will have to test for the existence of {{lang-??}} because Module:Lang does not keep a list of those templates. Testing for existence isn't an issue because the template is used only once per category, right?
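Roughly like this (a sketch; the exact calling convention of _tag_from_name() is my assumption, and the title pattern assumes names like 'Articles containing French-language text'):

    local name = mw.title.getCurrentTitle ().text:match ('^Articles containing (.-)%-language text$')
    local tag = require ('Module:Lang')._tag_from_name ({name})  -- 'fr' for 'French'; error text when not found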
Did some changes before I read your comment, so I'm not sure if what I did fixed it or not. Can you check Abenaki again? (with the /sandbox) --Gonnym (talk) 14:57, 14 September 2020 (UTC)[reply]
Yah, better. I tried it with |language=Eastern Abenaki and |language=Western Abenaki, both of which caused error messages because the ISO 639-3 spelling is Abnaki (without the 'e'). It worked when I gave it |language=Eastern Abnaki and |language=Western Abnaki.
I like named parameters, but... Because this template has a heritage of positional parameters, perhaps |language= should be backed up with args[1] (yields to |language= when both are present; plus error message?)
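That backup is a one-liner plus the optional conflict check (a sketch):

    local lang_name = args.language or args[1]  -- |language= wins when both are present
    if args.language and args[1] and args.language ~= args[1] then
        -- both given and different: emit an error message here
    end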
Ok, that seems to work, but I'm getting an error of {{Lang|cel-x-proto error: cel-x-proto is an IETF tag (help)|text in Proto-Celtic language here}}. What should be done here? --Gonnym (talk) 16:39, 14 September 2020 (UTC)[reply]
These categories are {{lang}} categories so the documentation should be using the Module:Lang data set. But, _name_from_tag() doesn't support a |label= parameter. You can build a wikilink from _name_from_tag(). If the returned value contains 'languages' then:
Couldn't you just add support for |label= to that function? It already has a |link=yes parameter. There is no real reason to make this harder than necessary. --Gonnym (talk) 18:56, 14 September 2020 (UTC)[reply]
{{lang|fn=tag_from_name|Old Church Slavonic}} → Error: language: Old Church Slavonic not found
{{lang|fn=name_from_tag|{{lang|fn=tag_from_name|Old Church Slavonic}}}} → Error: unrecognized language tag: Error: language: Old Church Slavonic not found
So from the above, the only two questions are why Old Korean and Church Slavonic work for the incorrect name. The others are correct not to work and will be CfDs once the code is live. --Gonnym (talk) 23:50, 14 September 2020 (UTC)[reply]
{{lang|fn=tag_from_name|Old Church Slavonic}} → Error: language: Old Church Slavonic not found
{{lang|fn=tag_from_name|Church Slavonic}} → cu
Old Korean (3rd-9th cent.) is a redirect to Old Korean. Module:lang, when it creates links to articles, strips IANA/ISO 639 disambiguators. When it creates categories, the IANA/ISO 639 disambiguators are retained so for oko the category name includes 'Old Korean (3rd-9th cent.)'. Module:Lang/name to tag strips IANA/ISO 639 disambiguators so when that list is queried:
{{lang|fn=tag_from_name|Old Korean (3rd-9th cent.)}} → oko
{{lang|fn=tag_from_name|Old Korean}} → oko
I suspect that the solution to this problem is to have Module:Lang/name to tag create name-to-tag entries for both disambiguated and undisambiguated names.
Thanks for explaining that! It really amazes me that the two Slavonics are two different things yet have the same code. Regarding "Old Korean", the category might be better off using the non-disambiguated title, as the article is at Old Korean, so maybe do what you did with the ISO module and add a |raw= parameter that, when used, gives the disambiguated one but in general use gives the cleaner version. Once we have a fix for these, I think the code is ready to go live. --Gonnym (talk) 11:05, 15 September 2020 (UTC)[reply]
I'm not sure if this is good. The output of the tag_from_name code (at least in this situation) should be what the lang template populates. In this case, it does populate it, so that's good. But if this change affects other codes, then this isn't good. Is there a way to make sure that the name/code pair used by the lang template to populate these categories is the same name/code pair that tag_from_name gives? --Gonnym (talk) 19:40, 15 September 2020 (UTC)[reply]
This change only affects tag_from_name(). That is the only function that uses Module:Lang/name to tag. Unless we always use disambiguated names it is not possible to guarantee that name → tag_from_name() → tag → name_from_tag() → name will be circular. It is not a perfect system because we override stuff ...
I'm not sure you understood what I meant. Take a look at both Category:Articles containing Old Church Slavonic-language text and Category:Articles containing Church Slavonic-language text with the /sandbox version. Both categories return a valid result. That isn't the expected result, as only one of those categories is actually being populated by the {{Lang}} template with the "cu" code. It's also not something out of our control, as which category gets populated by which code is something you wrote. We just need to be able to get the same result here, so that the error will appear for any category title that isn't the correct one being populated. --Gonnym (talk) 08:06, 16 September 2020 (UTC)[reply]
If you were to write {{lang|chu|<text>}}, Module:lang will use Module:Lang/ISO 639 synonyms to map chu to cu. When looking for the name to apply to the language link ({{lang-??}}), to the tool-tip ({{lang}}), or to the category name (both), Module:lang looks for cu in Module:Lang/data, where it finds and then uses the name 'Church Slavonic'. Were cu not overridden in Module:Lang/data, Module:lang would fetch 'Old Church Slavonic' from Module:Language/name/data (when there are multiple names associated with a code, Module:lang always takes the first name in the list).
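Put as code, that resolution order is (a sketch; synonyms, override, and name_data stand in for Module:Lang/ISO 639 synonyms, Module:Lang/data, and Module:Language/name/data, whose exact table shapes are my assumptions):

    local function name_for_tag (tag)
        tag = synonyms[tag] or tag                     -- e.g. synonyms['chu'] == 'cu'
        local names = override[tag] or name_data[tag]  -- en.wiki override wins over the IANA/ISO data
        return names and names[1]                      -- first listed name always wins
    end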
You are suggesting that tag_from_name() should only return a tag for the name that gets used in {{lang}} and {{lang-??}}. That was not the intent of tag_from_name(). It was intended as a way to find codes for a variety of legitimate names. I have wondered whether that table should be expanded to map all legitimate names to their associated codes; 'Old Bulgarian' currently returns an error but shouldn't.
is 'Old Church Slavonic' the same as {{lang|fn=name_from_tag|{{lang|fn=tag_from_name|Old Church Slavonic}}}}? No? error:
'Old Church Slavonic' == Error: unrecognized language tag: Error: language: Old Church Slavonic not found?
But then:
'Old Korean (3rd-9th cent.)' == Old Korean?
This same test might be applied to |language= from {{Category articles containing non-English-language text}} because |language= exists (presumably) because the category title is something odd.
A hard nut to crack because the underlying data are messy. It may be that we will need a new function; something that returns the category name so that we don't have to infer what the category name ought to be.
We are currently using _tag_from_name to see if the language name is the one Module:Lang uses for the category because you said that was the one to use; no one is forcing us to use it. If you don't want to change the function (and I agree), then the solution would be a new function with a very limited use: to see whether the language name supplied is the default one used by our system. This shouldn't be hard at all, as the same method you use to decide which language name to use can be used here. (At its dumbest, we could use tag_from_name -> name_from_tag -> is_name_equal_name, but I'm sure this can be made much more elegant.) --Gonnym (talk) 12:44, 16 September 2020 (UTC)[reply]
Yeah, I did say that _tag_from_name() was the thing to use, and in a perfect world I would have been correct. Now you know not to trust anything that I say, don't you? You don't believe me that tag_from_name -> name_from_tag -> is_name_equal_name won't work? Didn't I just demonstrate that that mechanism is wholly unreliable?
Here is another wrinkle. Believe it or not, _lang() and _lang_xx() handle category linking differently. _lang_xx() makes category names from the raw language name (with disambiguation if present); _lang() strips disambiguation when making the category name. I'm not sure why they are different (probably an oversight), so I think that this is an error and that _lang() needs fixing. If that is true, then _category_name_get(<tag>) could be used to return the category that both {{lang}} and {{lang-??}} populate. Compare the returned value to the actual category name; not the same? Error.
Now you know not to trust anything that I say, don't you? lol. Anyway, at least a few good things are coming out of all this back and forth, and unnoticed bugs are getting fixed. I've modified the /sandbox to test the name-tag-name-equal check and it works for our Slavonic friends, but going by what you just said, it probably fails somewhere. So I guess we'll hold off until the lang and lang-x templates are corrected, and then we can continue. Right? --Gonnym (talk) 14:40, 16 September 2020 (UTC)[reply]
Derived from name_from_tag(), returns category name or error message.
The new method would be (sketched in code just below):
1. fetch the category title
2. extract the language name from the category title
3. get the language tag from _tag_from_name()
   not successful: error
4. get the expected category title from _category_from_tag()
   not successful: error
5. compare the category title to the expected category title
   same: Yay!
   !same: error
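Strung together in Lua (a sketch; assumes lang = require ('Module:Lang'), name extracted in step 2, and the _category_from_tag() signature is my assumption):

    local tag = lang._tag_from_name ({name})          -- step 3; returns error text on failure
    local expected = lang._category_from_tag ({tag})  -- step 4; signature assumed
    if expected == mw.title.getCurrentTitle ().prefixedText then
        -- step 5, same: Yay!
    else
        -- step 5, !same: error
    end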
Further proof that you can't believe anything that I say: {{lang}} fetches the language name as-is from the data and uses it for the tool tip and the category. {{lang-??}} fetches the language name from the data and uses it for the category, but strips disambiguation for use in the language-name prefix:
Thanks. So I think I'm done with the code refactoring and I think it works. Once you make the Module:Lang/sandbox changes live, I'll update it here and I think it is good to go. --Gonnym (talk) 11:52, 17 September 2020 (UTC)[reply]
Since we're already waiting to push the changes, I've updated the non_en_src_cat() function to work with the shared code. While doing that I noticed that some of the language categories will always be empty. Category:Articles with Northern Ndebele-language sources (nde) says that it tracks usages of {{in lang|nde}}, but that isn't correct. Those get added to Category:Articles with Northern Ndebele-language sources (nd), along with {{in lang|nd}} usages. Should the categories be kept and the code updated, or should the categories be removed? And if removed, how can we get the correct category populated? --Gonnym (talk) 13:33, 17 September 2020 (UTC)[reply]
{{in lang}} does as {{lang}} does and promotes ISO 639-2, -3 to ISO 639-1 when there is an ISO 639-1 equivalent. So, any category for those ISO 639-2, -3 codes can go away when there is an equivalent ISO 639-1 category. I don't know why I created that nde category.
Can _category_from_tag() be modified to accept a template name? That way it could give me the correct lang, in lang, or CS1 category names. --Gonnym (talk) 15:05, 17 September 2020 (UTC)[reply]
I don't think that it should. {{in lang}} might be modified to accept a parameter (|list-cats=yes?) that instructs it to render a list of categories instead of a list of language names (it creates a list of categories anyway, so rendering that list shouldn't be too onerous). cs1|2 has nothing to do with Module:lang and only has specific categories for two-character language codes, all of which have the same form.
It works only if I keep using .in_lang, but if I switch to ._in_lang it doesn't. (If you could just change the list-cats parameter to an underscored one, that would be even better.) --Gonnym (talk) 11:24, 18 September 2020 (UTC)[reply]
_in_lang() has to be exported ...; try again.
Why? Does |list-cats= break something? The hyphenated-parameters form is consistent with multipart parameter names used by {{lang-??}} and similarly by {{ISO 639 name}}.
It doesn't break anything, it just makes the code a bit less nice, as Lua doesn't recognize hyphenated names (in simple dot notation) but does recognize underscored ones. Template parameters with underscores also seem to be much more common, but if it's already set up like this in the language templates, then never mind. Anyway, your fix works and the code works. Ready for live when the deprecated-code handling is done. --Gonnym (talk) 11:50, 18 September 2020 (UTC)[reply]
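For the record, the Lua wrinkle is just identifier syntax (nothing is actually broken):

    local a = args.list_cats     -- fine: underscores are legal in Lua identifiers
    local b = args['list-cats']  -- fine: hyphenated keys need bracket notation
    -- local c = args.list-cats  -- parse error: read as args.list minus cats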
Working deprecated codes
{{#invoke:ISO 639 name|iso_639_code_to_name|link=yes|sh}} and {{#invoke:ISO 639 name/sandbox|iso_639_code_to_name|link=yes|car}} are currently the only deprecated codes that work with the live version. Not sure if that means there is a bug somewhere (they are in a list they shouldn't be in) or if this is ok, but just letting you know. --Gonnym (talk) 22:29, 18 September 2020 (UTC)[reply]
sh is in the IANA language-subtag-registry file as a legitimate code even though the ISO 639-2 and -3 custodians show it as deprecated. I wish that I could find an up-to-date definitive listing of ISO 639-1 codes from the 639-1 custodian; the best I can find from them is a 2001 doc. According to an ISO 639-2 RA change notice, sh was deprecated in 2000. According to ISO-standards-are-us, there is a 2002 version still current as of 2019. I have no idea what's in it because I'll be damned before I hand over CHF 158 to find out. So, who do we believe? IANA or the ISO 639-2, -3 custodians?
According to the ISO 639-5 change notice page, car was deleted in 2009, so for ISO 639-5 we treat it as deprecated. It is still a valid ISO 639-2, -3 code.
The proper way to handle Category:CS1 maint: extra text: authors list is to evaluate what is there and, likely, change the author-name parameters to editor-name parameters. So, the process looks like:
Find the cs1|2 template that has the extra text (several possible patterns for that)
if there are existing editor-name parameters, abandon
find the offending author-name parameter(s) (could be multiples) and delete the offending text – there are false positives reported by the cs1|2 test
replace:
|last<n>= → |editor-last<n>=
|first<n>= → |editor-first<n>=
|author<n>= → |editor<n>=
... for all of the rest of the possible author-name parameters
But, what if the author-name marked with (ed.) is author-name 3 in a list of four other author names, none of which have the (ed.) annotation? Why is it that I haven't written a bot to do this?
cs1|2 adds the Category:CS1 maint: extra punctuation category when the last character of a non-title-holding parameter is one of the set [,:;]. Test each parameter in each cs1|2 template:
for each cs1|2 template:
    for each parameter in that template:
        is this a title-holding parameter?
            yes: next parameter
        does the parameter value end with [,:;]?
            yes: is it a semicolon that ends an html entity?
                yes: next parameter
                no: delete the trailing [,:;]
        next parameter
    next template
My conclusion is that these are not solvable by a simple regex find-and-replace.
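To illustrate what programmatic (rather than regex) handling might look like, here is a Lua sketch of the per-parameter test; is_title_param() is a hypothetical helper:

    local function strip_trailing_punct (param, value)
        if is_title_param (param) then     -- title-holding parameters keep their punctuation
            return value
        end
        if value:match ('[,:;]$') then
            if value:match ('&%w+;$') then -- trailing ; closes a named html entity like &nbsp;
                return value
            end
            return value:sub (1, -2)       -- delete the trailing comma/colon/semicolon
        end
        return value
    end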
Since I arrived at the decision to abandon Module:Language/name/data, some of what it does needs to become part of something else, and Module:Lang/name to tag, with a different name, seems the correct place for those things. When I created ~/utilities, I imagined it as a place for things that are heavily dependent on Module:Lang but are not 'part' of Module:lang. That's why {{in lang}} is there and why I initially had the Nihongo template support there. For me, the ~/documentor tool doesn't meet that requirement. Yeah, we could moosh them all into ~/utilities, but I rather prefer the segregation. And, it ain't broke, so why fix it?
I'm pretty confident that the current module naming scheme across the language, lang, and their sub-modules is very broken. Currently it's pretty much guesswork to find out which is used for what (but at least with some of them being deleted, it's getting clearer). I'm not saying the 3 above are the main issue, or even an issue; that was just me musing out loud. The reason I noticed it was that the cat_test() function, which you placed in ~/utilities and not in /documentor or /name to tag, is utility in function. The odd one out actually seems to be the in_lang() code, which isn't actually a utility but a product by itself, and doesn't really belong in a module called util. --Gonnym (talk) 13:42, 19 September 2020 (UTC)[reply]
Maybe so. I don't object to moving in_lang() to Module:In lang. That means that ~/utilities goes away because:
native_name_lang(), if it is ever developed, will be developed in Module:Infobox/utilities where it properly should belong
That does indeed help clean these up and sounds good. Waiting for you to say when the deprecation code is done so we can move the /sandbox code to live. --Gonnym (talk) 15:09, 19 September 2020 (UTC)[reply]
I fixed some template doc issues. I think if you fix the templates that show up first, the other pages will update by themselves. The system updates template changes much faster than module changes. --Gonnym (talk) 17:10, 19 September 2020 (UTC)[reply]
CS1 properties
And finally, the properties' categories...
Here I have some more questions other than "auto-fixing".
Can you explain to me in detail what these 3 categories serve for? I think I know their purpose in general lines, but I want to be better informed about them so I know what to do with them.
The first category has many subcategories on EnWiki. Is that list totally exhaustive and definitive? If yes, I should recreate it on SqWiki. But I'm not sure what it serves for and whether the articles in it require any kind of fix.
The second category... I sort of know what it is for, but I've seen you haven't clearly decided what to do with entries in it, so... Is that still the case? Can we do anything with the articles in it?
And the same question applies to the third category. Can we do anything with the articles in it?
I'm daring to also ask if we can use a bot on any of them, but I'm 90% sure we can't, since I'm not sure they require any kind of fixing whatsoever to begin with.
Please provide whatever information you can on those. And let me say I'm sorry for taking so much of your time, but hopefully this will be the last question of this sort for a while. :P - Klein Muçi (talk) 16:30, 19 September 2020 (UTC)[reply]
There was some dispute over how cs1|2 handles dates for citation metadata in the overlap period between the Julian and Gregorian calendars. I provided Category:CS1: Julian–Gregorian uncertainty as a way for those who were concerned to evaluate how cs1|2 does handle the overlap. Nothing has come of it and someday I hope to remove the category and the code that supports it.
Pretty much the same story for Category:CS1: long volume value; this was related to bolding of the |volume= value in various of the cs1|2 templates. Yet again, no real resolution.
I see... But is that the total list of subcategories, or are more continuously created on the go? And what does the name of the category really mean? Are there values using non-Latin alphabets? What does that mean for cs1|2? Can it understand those values? - Klein Muçi (talk) 17:19, 19 September 2020 (UTC)[reply]
The categories are associated with languages that are written using non-Latin script. cs1|2 adds articles to these categories when editors use |script-title=, |script-chapter=, |script-journal=, etc. I occasionally add a language code to the list and then create a category to match but that doesn't happen much anymore.
Understood. Then I'll go on and replicate them accordingly. And I guess that concludes this long conversation. Thank you a lot for your time! - Klein Muçi (talk) 17:50, 19 September 2020 (UTC)[reply]
The documentation template {{Category articles containing non-English-language text}} has been changed to recognize when category titles do not match the categories populated by Module:Lang. There is an ISO 639 language code for Pinyin (pny) which should be used in place of the IETF language tag zh-Latn-pinyin.
The error in the category arises because Module:lang cannot locate a language code to match Pinyin romanization:
{{lang|fn=name_from_tag|zh-Latn-pinyin}} → Chinese
{{lang|fn=category_from_tag|zh-Latn-pinyin}} → Category:Articles containing Chinese-language text
I'm not sure what #1 means. Regarding #2 and #3: I'm less knowledgeable here about whether these are wanted or not, and I'll guess that this is the main question here. #4 sounds like it goes with #2 (unless I missed something). #5 should be avoided. If we decide to categorize, then the backend should be able to support functions, including getting the correct category name; if we decide we don't, then the error is correct. In short, I think that we either do #2 or #3. --Gonnym (talk) 15:06, 20 September 2020 (UTC)[reply]
When something about a language code / name pair is not to en.wiki's liking, we override the name in Module:Lang/data. So for this example we add:
['zh-Latn-pinyin']={'Pinyin romanization'},
Doing this avoids the category-name error but shows that there isn't an article or redirect Pinyin romanization language; it should redirect to Pinyin.
But Pinyin romanization is not a language, so we shouldn't label it as a language. Pinyin romanization is a way of writing the Chinese language. This is why I suggested #2. But should zh-Latn-pinyin, a way of writing Chinese, be handled any differently from zh-Hant or zh-Hans, also ways of writing Chinese? I'm inclined to say no. And that suggests that, for the purposes of Module:lang creating categories, tool tips, and link labels, a variant tag should be validated but otherwise ignored, just as Module:lang ignores script and region subtags (unless specifically overridden in Module:lang/data as, for example, the various en-?? tags).
@Trappist the monk, this is all theoretically interesting ... but in practice, what it amounts to for now is that you have emptied nearly 200 categories without discussion, and dumped another set of pages into Special:WantedCategories.
Whatever the merits of your case, this is not the way to pursue it. Please restore the categories pending the outcome of whatever proposal you make at consensus-forming discussion. And please ping me in any reply. --BrownHairedGirl(talk) • (contribs) 15:51, 20 September 2020 (UTC)[reply]
The changes that added the empty categories to Category:Lang and lang-xx template errors were made specifically to identify categories that should not exist and categories that are misnamed, so that all of these may be deleted. It is not clear to me why those fourteen red-linked categories suddenly appeared. For whatever reason, they are no longer red-linked, so no longer an issue.
The issues remaining are the non-empty categories listed in Category:Lang and lang-xx template errors:
Category:Articles containing Marwari-language text – there are three Marwari language codes: Marwari (Pakistan) (mve), Marwari (mwr), and Marwari (India) (rwr); I am not sure why tag_from_name() is returning the wrong name
@Trappist the monk, nearly every time I have tried to engage with you over the last few months, you have failed to ping me in reply. That leaves me to have to hunt down the conversation and check whether you have replied. Your reply of 17:38[1] is now the second time in one day that you have chosen to reply to me without a ping, despite on this occasion being specifically asked to ping me.[2] (The first was your reply at 12:30.[3])
I do not know why you choose to persistently engage in this passive-aggressive behavior, but I have had enough of it. I will no longer try to engage with you.
Substantively, your reply doesn't deal with the fact that changes by you have created a situation where {{Lang}} populates a category, but {{Category articles containing non-English-language text}} no longer works to populate it ... and you offer no working remedy.
Over the last few years, I have created hundreds of these categories when they appear in Special:WantedCategories. They are tedious and time-consuming to produce, but I have always strived to do them properly, using the templates to create the relevant links.
However, you have now broken the system without having something better in place ... and you repeatedly fail to communicate effectively about your changes. This non-communication goes beyond your sustained failure to ping: it includes your failure to use meaningful edit summaries on major changes to modules which affect millions of pages, e.g. this edit[4] by you, which depopulated a set of categories being discussed at Wikipedia:Categories for discussion/Log/2020 August 18#Category:Articles_with_text_from_the_Afro-Asiatic_languages_collective.
I have had enough of the persistent non-communication, and of this sustained pattern of changes which screw up the work done by other editors without providing an alternative. I will no longer try to make these categories link into the lang system. When I encounter them at Special:WantedCategories (or my replica at https://quarry.wmflabs.org/query/30916), I will simply take the minimal step to remove the redlink: that is, I will create them with {{Tracking category}} and move on. --BrownHairedGirl(talk) • (contribs) 00:46, 21 September 2020 (UTC)[reply]
Yes, that's how I intend to do it from now on. The Lang system no longer works as it used to; there is no documented workaround in place; and the editor who broke the old system refuses to communicate effectively. So I now have that block of text to paste into any such categories which appear as redlinks. --BrownHairedGirl(talk) • (contribs) 13:47, 21 September 2020 (UTC)[reply]
No, I am not being at all WP:POINTy. Get yourself a mirror.
For several years, I have created these categories when they appear at Special:WantedCategories. They are slow to create: open the language article, look up the language code, check how it displays, do a few tests (because the code listed in the article doesn't always match the code which causes {{Lang}} to populate the article), then save.
They are the most time-consuming type of category to appear at SWC, but I have always taken whatever time is needed to do them properly ...and i have done many hundreds of them.
However, you have made changes to the lang system which means that these methods now don't work in some cases. When I asked you to resolve this, you engaged in repeated passive aggression: not pinging me in your replies, and adding lots of detail which doesn't answer my question of how I should now construct the categories. Since you offer no solution, I am not going to waste my time experimenting to find which options still work.
I have had enough of being messed around like this, so I have now started to make life easy for myself: instead of trying to use the lang system of templates, I simply create the category pages using {{Tracking category}}. That is a perfectly valid approach, because they are tracking categories ... and since WP:POINT describes disruption to make a point, this is not POINTy because it is not disruptive. It is clearly a perfectly valid approach to creating a page for a tracking category which already has a non-zero population ... and if you or anyone else wants to develop the category further, then of course you should feel free to do so.
That solution works for me, so I will simply route around your passive aggression rather than go to the drama boards. If you don't like that, then of course, feel free to take this to WP:ANI... but beware of WP:BOOMERANG. You will do yourself no favours with a request to "Please punish this editor who has been messed around by my undocumented changes and now won't volunteer her time to play a guessing game after I (Trappist) chose to repeatedly screw her around with useless communications". But if you want to make that trip, it's your choice. --BrownHairedGirl(talk) • (contribs) 16:16, 21 September 2020 (UTC)[reply]
BHG, no one has changed the way the categories are created. The only recent change was adding in the error message to allow fast and easy identification of categories not being populated by the template. Most have been like that for at least 3 years; others even more. The issue you reported about a false positive being added to the error category is being fixed. As these are tracking categories and not user-facing reading material, the incorrect error message does almost no harm. --Gonnym (talk) 16:52, 21 September 2020 (UTC)[reply]
@BrownHairedGirl: You? I have not been involved in any of these changes. Please sort out who you think you're talking to. The talk page warning here was the soft version of my request; please don't make me follow through with a harder version. --Izno (talk) 17:12, 21 September 2020 (UTC)[reply]
But regardless of who I am replying to, my answer remains the same: there is nothing at all WP:POINTy in my decision to desist from using a broken template system, and if you want to go to ANI, feel free.
The way I used to create those cats has been broken. So I will now create them a simpler way, and if anyone wants to polish the cats afterwards, they are free to do so. If you think there is an ANI case on that, that's up to you. --BrownHairedGirl(talk) • (contribs) 17:29, 21 September 2020 (UTC)[reply]
@BrownHairedGirl: Indeed, I actually don't mind the basically mindless filling-in of these red-linked categories. What you shouldn't be doing is besmirching a good-faith editor (if "uncommunicative") in category-space with your edits. That's what I am asking you to stop doing. A simple {{tracking category}} will suffice; any editors who are interested in fixing them beyond that can do so (you may wish to tell them they are now blue-linked rather than assume they will follow you around, but that is your prerogative). --Izno (talk) 17:41, 21 September 2020 (UTC)[reply]
@Izno, I am not gonna create some sort of log of these categs. I create them, then move on.
What exactly is the problem in describing as uncommunicative an editor who you seem to be agreeing is uncommunicative? I leave that note there to avoid having to field questions about why I created the cats in that way. Do you want to suggest an alternative wording? --BrownHairedGirl(talk) • (contribs) 17:52, 21 September 2020 (UTC)[reply]
"Seem to" is funny; the statement is quote-marked for a reason (mostly to use your own words, rather than suggest that I agree with them; some might consider them scare-quotes but that was not my intention). If you believe he is uncommunicative, that is a thing for his talk page or for ANI, not for random (er, systematic) categories. If you are personally asked to answer for the creation of those categories (I am skeptical, but willing to answer the point), I would expect you to say "Please speak with Ttm. I am only filling in the red category.", which I would expect would suffice for most if not all people.
Now that languages are out of the way, I was looking at the other error/maint categories of CS1. Would it be wise to have the bot try to fix some of the errors here while using regex to change all (some? maybe only the most obvious typos?) suggestions here into their correct form? What could go wrong, if anything, while doing this? - Klein Muçi (talk) 11:19, 17 September 2020 (UTC)[reply]
What could go wrong? You and several of your friends are sitting around the campfire, drinking your favorite libations and swapping lies, when one of them says, "Hey! Guys! Watch this!" What could possibly go wrong?
The suggestions in Module:Citation/CS1/Suggestions are not always guaranteed to be correct. I suspect that sometimes they are mere guesses. Still, the correctly spelled other-language forms of a parameter name might be replaced without too much going wrong. But perhaps the better solution for you, since sq:Kategoria:Faqe me burime që përmbajnë parametra të palejuar ("Pages with sources that contain disallowed parameters") has relatively few members, is to concentrate on the errors that you have and not bother with ~/Suggestions and errors that you may never encounter. The most common error in Kategoria:Faqe me burime që përmbajnë parametra të palejuar seems to be the various forms of |dead-url= (which, alas, is going to plague us for some time to come because the now-unmaintained tools reFill and reFill2 continue to add |deadurl=y). |month= appears to be another one; combine that with the value in |year= to make |date= if |date= is not already present, and delete |month= and |year=.
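For the |deadurl= forms, a regex pass is enough. A minimal Python sketch (the value-to-keyword pairs are my assumption of the common cases; verify against what actually appears in the category before letting a bot loose on it):
<syntaxhighlight lang="python">
import re

# Assumed mappings from the old parameter values to |url-status= keywords;
# check the values actually found in the error category first.
DEADURL_FIXES = [
    (re.compile(r'\|\s*dead-?url\s*=\s*(?:yes|y|true)\s*([|}])'), r'|url-status=dead\1'),
    (re.compile(r'\|\s*dead-?url\s*=\s*(?:no|n|false)\s*([|}])'), r'|url-status=live\1'),
]

def fix_deadurl(wikitext):
    for pattern, replacement in DEADURL_FIXES:
        wikitext = pattern.sub(replacement, wikitext)
    return wikitext

print(fix_deadurl('{{cite web |url=http://example.com |deadurl=y}}'))
# -> {{cite web |url=http://example.com |url-status=dead}}
</syntaxhighlight>
The |month= + |year= merge is less regex-friendly: |date= must only be added when it is absent, so that one wants a replacement function rather than a plain substitution.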
Haha! A "here, hold my beer" scenario was what I had in mind when I made that question. :P Regarding the class error, the way I was imagining the solution was for the bot to check all the pages in that category and simply remove the class parameter. Would that be a safe solution? The logic being that since pages are already in that category, the CS1 module would have already made the needed checks.
As for the suggestions... I was thinking more of typos in English than the foreign not-recognized aliases. That is, as you too say, rare, and I certainly don't wanna do another "all languages in all languages" for every parameter citation templates have. My script's completion time already grew by more than 24 hours (from a mere 30 minutes to 25 hours) after adding all the language regex-es. If you say typos are not safe to fix with a bot, I'm gonna agree with you. What would be the needed changes regarding the |dead-url= parameter? Can those fixes be regex-ified? I don't think the month + year = date case deserves to be solved with a bot, as I don't think it is a common occurrence, no?
Adding on that, do you think we can apply bot solutions at this category similar to what I said regarding |class=? Check the category, check the mentioned parameters there, remove wiki markup if found. Though, judging by the text there, I don't think that task can be automated as easily; it still requires human intervention to differentiate between different kinds of media, no? - Klein Muçi (talk) 15:47, 17 September 2020 (UTC)[reply]
If someone else has to hold your beer, you're doing it wrong.
Simply removing |class= from articles that have the error is not correct. You have to evaluate each usage of |class= so that you remove only those that are misused. |class= is valid for the |arxiv=YYMM.#### and |arxiv=YYMM.##### forms, so in those cases it should not be removed.
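A sketch of that evaluation, assuming the cs1|2 template text has already been isolated (the arXiv pattern here ignores version suffixes like v2, which a real pass would have to handle):
<syntaxhighlight lang="python">
import re

# |class= is only meaningful alongside the modern arXiv identifier
# forms YYMM.NNNN / YYMM.NNNNN.
ARXIV_MODERN = re.compile(r'\|\s*arxiv\s*=\s*\d{4}\.\d{4,5}\s*[|}]', re.I)
CLASS_PARAM = re.compile(r'\|\s*class\s*=\s*[^|}]*')

def maybe_remove_class(template_text):
    """Remove |class= from one cs1|2 template unless |arxiv= uses a
    modern identifier, in which case the parameter is valid."""
    if ARXIV_MODERN.search(template_text):
        return template_text          # valid use; leave it alone
    return CLASS_PARAM.sub('', template_text)
</syntaxhighlight>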
When we replaced |dead-url= with |url-status=, I wrote a bot task; details at the task page.
|month= has been dead a long time but still pops up occasionally. Probably not worth too much effort ...
When we added markup detection, I wrote a bot task; details at the task page.
XD The reference was to the usual expression one drunk person says before "going on an adventure" without the prior skills or information needed for said "adventure". That's what I want to avoid (but usually end up involved in anyway) when working with Smallem lately.
So what you're saying is that there may be articles that are in that category and practically have no real problem with the class parameter? Or that they do have problems but it can't simply be fixed by removing that?
I'm not sure what I should do regarding your bot jobs. Maybe I should be able to adapt them for Smallem? Of course, that would mean recreating them as simple regex-es. I think I can do that for the specific 2-3 transformations that are needed regarding |dead-url= / |url-status=, but I'm not sure how I would go about the job regarding markup detection. Maybe I should ask if Monkbot can work outside EnWiki? Although I'd like to have only one specific bot do all the changes regarding citations. - Klein Muçi (talk) 13:20, 18 September 2020 (UTC)[reply]
I wrote an awb script for that. I don't think that it is bot-able because, quite often, the things that get trapped in that category are the result of vandalism or of unintentionally adding new article-text in the middle of a cs1|2 template (these produce the CRLF errors). Fixing those kinds of errors requires humans. The script is wholly unpolished so I haven't published it, but if you want a copy I can give it to you.
Seems sort of silly to me to convert the monkbot tasks to pywikibot smallem. You are listed at sq:Wikipedia:AutoWikiBrowser/CheckPage, so get yourself a bot flag for smallem-awb. Import tasks monkbot 14 and 16, tweak them to name smallem-awb in the edit summaries, test to make sure they are working correctly after the import, and then switch to autosave.
Oh, well, I guess you could give it to me and I can check after every fix. I must warn you though that Albanian uses the letter Ë/ë a lot, and every script fixing CRLF characters in Albanian should at least take care not to remove diacritics. We also use Ç/ç (and that completes the full list of our non-latin letters) but that one is rarer.
As for the IABot subject, I try to check its edits regularly on our project, and I've spent a lot of time localizing its pages (userpage/meta page/interface), and I was sure that it did make that switch of parameters whenever it met them. But your doubt on the subject is making me doubt it too now. Maybe Cyber can give us some insight on that.
And finally, I have a problem fully grasping what you mean with "AWB jobs/tasks". I've used AWB in the past. (Even JWB.) But I've never used it with code. I've downloaded the program, set up the find and replace transformations I wanted to make, set up a summary, set up a database dump after downloading it (that's a step I've unfortunately forgotten how to do now) and had to press save manually after every edit. After getting tired of doing that and seeing there were no problems happening, I devised a simple script to press "Ctrl+S" every 2 seconds and that's as close to autosave in that program as I have ever been. :P But I know nothing of using code to operate it. How do you do that? - Klein Muçi (talk) 14:55, 18 September 2020 (UTC)[reply]
Umm, those are Latin-script characters:
Ë 00CB LATIN CAPITAL LETTER E WITH DIAERESIS
ë 00EB LATIN SMALL LETTER E WITH DIAERESIS
Ç 00C7 LATIN CAPITAL LETTER C WITH CEDILLA
ç 00E7 LATIN SMALL LETTER C WITH CEDILLA
cs1|2 doesn't care about them:
{{cite book |title=Ë/ë and Ç/ç}} → Ë/ë and Ç/ç. – no error.
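So if the worry is web tools that strip "invisible characters" and eat diacritics with them, target the control characters directly; letters like ë and ç are untouched by construction. A minimal sketch; note that it deliberately leaves tab, LF and CR alone, because newlines inside a template are exactly the cases that need a human:
<syntaxhighlight lang="python">
import re

# C0/C1 control characters, minus tab (09), LF (0A) and CR (0D).
CONTROL_CHARS = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]')

def strip_invisibles(text):
    return CONTROL_CHARS.sub('', text)

print(strip_invisibles('Ë/ë and Ç/ç\x0c'))  # -> Ë/ë and Ç/ç
</syntaxhighlight>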
It is not a matter of doubt. I just don't know because I don't pay much attention to that bot's operation unless it is doing something that it ought not be doing.
I think that if you create sq:User:smallem-awb and then add smallem-awb to sq:Wikipedia:AutoWikiBrowser/CheckPage under §Botët and then login to awb as user smallem-awb, you should see the Bots tab appear between the Skip and Start tabs. Auto-save is a checkbox on the Bots tab.
You are right. I shouldn't have called them non-latin, but the point is that every page I've found online that removes invisible characters also removes the diaeresis, and that's a big problem (also the cedilla, but that can usually be fixed manually in no time).
I literally had no idea about that kind of functionality. Assuming I did this (judging by the other bots already there, I don't think I'll need a new account/userpage for the bot, just to add its current name there), apart from the auto-save checkbox, do I also get a specific page where I can import your Monkbot's tasks? - Klein Muçi (talk) 17:13, 18 September 2020 (UTC)[reply]
Put them wherever you want. Settings and code files are stored on your local machine and run from there. I have a folder called Z:\Wikipedia\AWB\Monkbot_tasks (win 10). That folder holds the .xml settings files and the .cs code files. Because en.wiki requires prospective bot tasks to pass through the WP:BRFA gauntlet, I publish the code in Monkbot's userspace; I don't usually publish the settings files but can give you those if you want them.
Yeah, of course, but the main problem is that I still know nothing about what you're saying, apparently. I've never had to use files to operate AWB before. Is there somewhere I can learn about it? - Klein Muçi (talk) 09:54, 19 September 2020 (UTC)[reply]
Okay, I added Smallem as a bot user and logged in to AWB. I also saw that there is an option to open a file for settings at "File". So I guess I learned that; I'm supposing that's where you set up the settings files. Where do you open the code files? Or am I messing it up? - Klein Muçi (talk) 10:11, 19 September 2020 (UTC)[reply]
Start simple: User:Monkbot/Task 0: null edit. Copy/paste that to a file on your local machine (on mine it's at Z:\Wikipedia\AWB\Monkbot_tasks\Monkbot_task_0_null_edit.xml) – file extension is important. Start awb. Use the File → Open settings menu to browse to and open your task 0 file. Login as smallem. Make a list. On the Start tab, click Start. AWB should show you that it has added {{subst:null}} at the start of the page. On the Bots tab check Auto save. On the Start tab, click Start. AWB should start working through the list of articles.
Yes, it worked perfectly! Thank you! I'll try the other tasks now. I have a question though: I was thinking of adding the fix for |deadurl= -> |url-status= as a permanent regex fix for Smallem, thinking it will continuously pop up every now and then for a while. Am I right in that logic, or will the one-time AWB task be enough for it too? - Klein Muçi (talk) 11:09, 19 September 2020 (UTC)[reply]
Also, I tried doing the 2 other tasks. I just copy-pasted the script code without paying much attention to it, but it said there was an error. I tried continuing nonetheless and it worked, but neither of them made any fixes to the articles in the specific categories. Could this be because of the said error, because I should make some kind of adaptation to the code I copy-pasted blindly, or because none of our articles could benefit from those scripts at the moment? - Klein Muçi (talk) 11:50, 19 September 2020 (UTC)[reply]
Slow down. If you haven't already, copy the c# code from User:Monkbot/task 16: remove replace deprecated dead-url params#script and paste it into notepad++ or some other plain-text editor – if you have, do it again because I just updated it. Monkbot does not want to be responsible for edits that Smallem makes so at line 113, replace the text inside the quotes with an edit summary message that will be meaningful to sq.wiki editors. Save but don't close the file (mine is at Z:\Wikipedia\AWB\Monkbot_tasks\Monkbot_task_16_remove_replace_deprecated_dead-url_params.cs) – file extension is important.
Close awb and then restart so that you start afresh. Monkbot does not want to be responsible for automatic edits that awb makes so I always uncheck Auto tag, Apply general fixes, and Unicodify whole page on the Options tab whenever I start awb. Choose a category where you will find |dead-url= errors (for us that's Category:Pages with citations using unsupported parameters) but don't click Make list yet. In Notepad++ Ctrl-A Ctrl-C the c# code. At awb, Tools → Make module. Check Enabled. In the text box at the bottom, Ctrl-A Ctrl-V to paste the c# code into awb (yeah, overwrite what is already there). Click Make module. After a pause you should get the green message Module compiled and loaded. Close the Module window. Back in awb, File → Save settings as ... save these settings (same name as the .cs file except with .xml file extension seems sensible). Next time you load these settings, you won't have to copy/paste the c# code; it is stored in the settings file. Login as smallem, click Make list. As you did for task 0, run this task manually enough to become comfortable with what it is doing before switching to auto save. Check the edit summary to make sure it looks as you expect it to look.
Thank you genuinely for spending your time explaining this to me! Before I go on with what you wrote though, I want to ask you something: will everything you wrote above work with the other task too? The one related to formatting, that is; of course, not minding the specific instructions about specific lines. The reason I ask is that, as I mentioned above, I decided to write a regex for that task, since I suspect it will be a recurring occurrence for a while (along with ref=harv), so I've made it part of the find-and-replace source code so that it can run automatically (periodically) for a while. - Klein Muçi (talk) 12:52, 19 September 2020 (UTC)[reply]
I was able to do all that except for the "set the summary" step. Can you tell me exactly which line(s) to erase so I can set up a single general summary for all the needed changes? The summary is this: Rregullime automatike të gabimeve me referimet ("Automatic fixes of errors with references") - Klein Muçi (talk) 13:55, 19 September 2020 (UTC)[reply]
I have updated the script, so fetch a new copy. Look at line 5040; you can delete lines 5040 and 5041. Also, at the bottom of the file, change the file name listed there to your file name. I've taken to putting the file name there because I sometimes have multiple instances of awb running, each with a different module; the file name helps keep me organized.
Thank you! I was able to make it work like I wanted. What I did actually was to remove everything and leave only line 5040. It gave me one error (saying I needed to set up a summary even though it had one), but I was able to trick it by putting a single space in AWB's summary placeholder. - Klein Muçi (talk) 14:46, 19 September 2020 (UTC)[reply]
The question for you posed by Editor Klein Muçi was: Shouldn't IA Bot take care of the aforementioned transformations? That question refers to changing the no-longer-supported |dead-url= and |deadurl= parameters to |url-status= with the appropriate live or dead keywords.
Misnamed categories because the proper language names are overridden in that abomination that is Module:Language/data/wp languages merely for the purpose of suppressing the disambiguation.
You want me to predict the future? I can imagine that any of these except 'Hepburn romanization', 'Hong Kong Chinese in traditional script', 'traditional Chinese (HK)', and 'variant English' might be created in some form in the future. I tweaked Module:Lang/sandbox so that it will create categories for the various regional English tags listed in Module:Lang/data:
Category:Articles containing explicitly cited Australian English-language text
Category:Articles containing explicitly cited Canadian English-language text
Category:Articles containing explicitly cited Early Modern English-language text
Category:Articles containing explicitly cited British English-language text
Category:Articles containing explicitly cited Irish English-language text
Category:Articles containing explicitly cited Indian English-language text
Category:Articles containing explicitly cited New Zealand English-language text
Category:Articles containing explicitly cited American English-language text
Category:Articles containing explicitly cited South African English-language text
I haven't looked but it would not surprise me to find many or all of these in the articles listed in Category:Articles containing explicitly cited English-language text. Defer a decision about the regional English cats until after the module update but nuke the others?
You want me to predict the future? I somehow knew that no matter how I phrased my question, I'd get something like that :) I meant: in your /sandbox changes, are the above categories valid? --Gonnym (talk) 00:09, 24 September 2020 (UTC)[reply]
It is said that Niels Bohr (who had a much bigger brain than I) once made a remark something like: "Making predictions is difficult; especially about the future." Yeah, probably apocryphal, but ... Only the regional English cats are supported in ~/sandbox.
Question: I was checking the source code of Smallem randomly to see if I could do any regex optimizations on it now that I've completed giving it most of the tasks it can solve. I noticed these 4 lines:
Don't they seem a bit odd? You know what I mean. I don't know how many lines could be like this (I only noticed these because they were at the beginning of the code since I have them lexicographically sorted) or what I should do with them.
Fact: As we talked about some time ago, I took the liberty of creating this Meta discussion about the CS1 module. My hope is to get volunteers to help in practical ways in creating the Meta infrastructure of the CS1 system - you'll understand what I mean by "system" when you read the discussion. Of course, if that happens, your help in guiding us would be necessary, but I'm not too optimistic about the project yet so I didn't ping you on it. Feel free to participate in it though if you want. I hope you agree with everything I've said there (I feel like I've added nothing new that we haven't discussed before). - Klein Muçi (talk) 13:16, 27 September 2020 (UTC)[reply]
The tags are legitimate:
German names:
{{#language:abq|de}} → Abasinisch
{{#language:abq-latn|de}} → Abasinisch
English names:
{{#language:abq|en}} → Abaza
{{#language:abq-latn|en}} → Abaza
I don't know how many simple language tags are paired with IETF language tags; likely not all that many. If this duplication is not causing problems, is it worth the effort to 'fix' it?
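If you wanted a number, something like this would count them, assuming the list can be read as (name, tag) pairs; the real storage format in Smallem is surely different:
<syntaxhighlight lang="python">
from collections import defaultdict

# Hypothetical shape for Smallem's list: (language name, tag) pairs.
pairs = [
    ('Abasinisch', 'abq'), ('Abasinisch', 'abq-latn'),
    ('Abaza', 'abq'), ('Abaza', 'abq-latn'),
]

tags_by_name = defaultdict(set)
for name, tag in pairs:
    tags_by_name[name].add(tag)

# A name is 'paired' when it has both a bare tag and an IETF tag
# built from that same bare tag plus a hyphenated subtag.
paired = {name: tags for name, tags in tags_by_name.items()
          if any('-' in t and t.split('-')[0] in tags for t in tags)}
print(len(paired))  # 2
</syntaxhighlight>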
In the meta discussion, I think that you have stated my position accurately. It would be good to see that produce tangible results. If it does, let me know.
Glad to hear that. :) As for the tags, the problem is that I don't know how Smallem will react to them, but I suspect they will bring problems. The reason for that is:
Not simultaneously. One at a time. If done in the order that they are listed here, Smallem will find |language=Abasinisch and replace it with |language=abq. The next search for |language=Abasinisch will find nothing so Smallem will move on to whatever pattern follows next.
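A two-line demonstration that the duplicates cost redundant work but not correctness:
<syntaxhighlight lang="python">
import re

text = '{{cite book |language=Abasinisch}}'
# The first pattern fires and rewrites the parameter ...
text = re.sub(r'\|\s*language\s*=\s*Abasinisch\b', '|language=abq', text)
# ... so the identical later pattern finds nothing and changes nothing.
text = re.sub(r'\|\s*language\s*=\s*Abasinisch\b', '|language=abq', text)
print(text)  # -> {{cite book |language=abq}}
</syntaxhighlight>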
Oh! I see. Well, in that case I should try to remove the "second entry" for each language in the code, to make it a tiny bit faster. That was the initial intention when I randomly found out these details. Would it be wise to search for "-" and manually review cases like this? - Klein Muçi (talk) 14:45, 27 September 2020 (UTC)[reply]
After move issues
I created new language categories for the ones I found red-linked. The pages in Category:Lang and lang-xx template errors will need to finish moving categories before the categories can be sent to CfD.
Manually did a null edit on the pages of the smaller categories in the error category. A few bigger ones are now left. We also need to wait for the ~/data/name and ~/wp languages modules to update transclusions and then see if they can be sent to TfD.
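For the bigger ones left, a pywikibot loop saves the manual clicking; a sketch, assuming a standard pywikibot setup:
<syntaxhighlight lang="python">
import pywikibot
from pywikibot import pagegenerators

site = pywikibot.Site('en', 'wikipedia')
cat = pywikibot.Category(site, 'Category:Lang and lang-xx template errors')
for page in pagegenerators.CategorizedPageGenerator(cat):
    page.touch()  # null edit; forces the page's category links to refresh
</syntaxhighlight>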
The ISO 639-3 custodian says that hbs is active and that its ISO 639-1 equivalent, sh, is deprecated; see iso639-3:hbs. IANA has this for sh, so it is not listed as deprecated (there would be a Deprecated: YYYY-MM-DD item in the record):
Type: language
Subtag: sh
Description: Serbo-Croatian
Added: 2005-10-16
Scope: macrolanguage
Comments: sr, hr, bs are preferred for most modern uses
But, hbs is not listed in the source we use for synonyms. How Module:Lang should handle this oddball is a puzzlement.
For mo and mol, both deprecated, we might add support for ro-MD to the override data.
While I was writing the awb script to update the IANA data modules, I began second-guessing my decision to keep deprecated codes out of the IANA modules; the codes really are in the registry file, so they should be included in the data modules. And that brings us back to the question of how to inform editors that the codes they are providing to {{lang}} are deprecated. I intend to restart the deprecated-codes discussion at Template talk:Lang.
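Pulling the deprecated records out of the registry is mechanical. A sketch: the registry file really is '%%'-separated records of 'Field: value' lines like the sh record above, though continuation lines and repeated fields are glossed over here:
<syntaxhighlight lang="python">
import urllib.request

URL = ('https://www.iana.org/assignments/language-subtag-registry/'
       'language-subtag-registry')

def deprecated_subtags():
    """Yield (subtag, deprecation date) for deprecated language records."""
    raw = urllib.request.urlopen(URL).read().decode('utf-8')
    for record in raw.split('%%'):
        fields = dict(line.split(': ', 1)
                      for line in record.strip().splitlines()
                      if ': ' in line)
        if fields.get('Type') == 'language' and 'Deprecated' in fields:
            yield fields.get('Subtag'), fields['Deprecated']

for tag, date in deprecated_subtags():
    print(tag, date)  # e.g. iw 1989-01-01
</syntaxhighlight>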
I do not think that {{in lang}} should be changed away from the IANA data set. Where would it go? ISO 639 name? Then it would lose the capability to support IETF language tags.
My question regarding In lang was because (and correct me if I'm wrong) the reason we follow IANA for lang is so that we follow the HTML specifications. But In lang is not for HTML text on the page, but for outside sources. So my question was whether it should follow it, or enable even more languages (from where? I have no idea. I just thought about it because I saw those two categories enter the error cat). Regarding the deprecated message for the reader: does it matter to the end-user if they are getting a deprecated language tag? If not, then there is no need for a message. We can add a tracking category for those usages if that is needed (but is it needed? Is there a downside to viewing a deprecated language tag?). --Gonnym (talk) 19:14, 30 September 2020 (UTC)[reply]
No, {{in lang}} uses Module:Lang so that it has support for IETF language tags. IANA has all of the language tags from ISO 639-1, ISO 639-2T, ISO 639-3, and ISO 639-5 except the deprecated language tags, the ISO 639-2B language tags, and any ISO 639-2, -3, -5 language tags that have ISO 639-1 equivalents.
Thanks! Removed all usages from that and from insource:"Category:Articles with" insource:/\[Category:Articles with [A-Za-z ]+\-language sources/ and insource:"Category:CS1" insource:/\[Category:CS1[A-Za-z ]+\-language sources/. Couldn't get the search to find any in the style of "Category:Articles with text from the Berber languages collective" and "Category:Articles with Berber languages-collective sources (ber)", so either my code was wrong or there are none. --Gonnym (talk) 11:26, 1 October 2020 (UTC)[reply]
Which was why I asked if it wasn't painful to hack the fixes :) And you are right, if we do it we should take into account the two others you pointed out. --Gonnym (talk) 13:17, 1 October 2020 (UTC)[reply]
A few more date categories:
Category:Articles containing Middle English (1100-1500)-language text
Category:Articles containing Old Aramaic (up to 700 BCE)-language text
Category:Articles containing Ancient Greek (to 1453)-language text
Category:Articles containing Old Provençal (to 1500)-language text
Category:Articles containing Old Irish (to 900)-language text
Category:Articles containing Occitan (post 1500)-language text
Category:Articles containing Middle Korean (10th-16th cent.)-language text
Category:Articles containing Old Korean (3rd-9th cent.)-language text
"Old Aramaic (up to 700 BCE)" should change (if we change stuff) anyways as it says "up to" while the 3 stricken categories say "to". --Gonnym (talk) 14:04, 1 October 2020 (UTC)[reply]
Is there some MOS requirement that prohibits the use of 'up to' when the phrase precedes a date? If not then leave it alone; eschew special cases because they are special cases.
Just consistency between the titles, but it isn't a big issue. Old Persian and Jewish Babylonian Aramaic probably don't need the eras as the dates are in the same era and the other categories aren't using them. --Gonnym (talk) 14:23, 1 October 2020 (UTC)[reply]
Pretty sure you do need the BCE with Old Aramaic else you don't know if 'up to 700' means 700 BCE/BC or 700 CE/AD. Jewish Babylonian Aramaic probably doesn't need CE because current era can be assumed. Only example of the use of CE in a disambiguator? If so, special case...
Commenting on my deleted text :) I agree that 700 is unclear. However, by that logic so are Old Irish (to 900) and the other 3. --Gonnym (talk) 16:34, 1 October 2020 (UTC)[reply]
That's why I said that the current era can be assumed. Because Old Aramaic (up to 700 BCE) includes the era designator BCE we know that it is not 'up to 700' in the (assumed) current era (CE). All other of those disambiguators do not include an era designator so may be assumed to be in the current era. This appears to be in keeping with MOS:ERA which states, in part: In general, do not use CE or AD unless required to avoid ambiguity. It 'appears' to suggest that the current era may be assumed, though doesn't actually say that. Still, if 'Jewish Babylonian Aramaic (ca. 200-1200 CE)' is the only CE-marked date, it would be a special case to 'fix' it.
Ah, I see what you mean. Yeah, makes sense that they are assumed. I'm not sure what the fix entails so I can't really comment. But "Jewish Babylonian Aramaic" has both a dash issue and a ca. issue, so it would need fixing anyway, regardless of CE. If it's already being changed, then unless some heavy hack is needed, there's not a lot of reason not to remove the CE to match the others without it. --Gonnym (talk) 17:08, 1 October 2020 (UTC)[reply]
Not onerous. One could write a dedicated replacement for each unique disambiguator, but it is probably better to split the task into a table of appropriate patterns such that the pattern matching and replacement are confined to within the disambiguator...
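A minimal table-driven sketch; the patterns here are hypothetical examples, not the agreed renames:
<syntaxhighlight lang="python">
import re

# Each entry is (pattern, replacement), applied only inside the
# parenthesized disambiguator.
DISAMBIG_FIXES = [
    (re.compile(r'^up to '), 'to '),       # 'up to 700 BCE' -> 'to 700 BCE'
    (re.compile(r'\bca\. ?'), 'c. '),      # 'ca. 200-1200' -> 'c. 200-1200'
    (re.compile(r'(\d)-(\d)'), r'\1–\2'),  # hyphen -> en dash in date ranges
]

def fix_disambiguator(title):
    def fix(match):
        inner = match.group(1)
        for pattern, replacement in DISAMBIG_FIXES:
            inner = pattern.sub(replacement, inner)
        return '(' + inner + ')'
    return re.sub(r'\(([^()]*)\)', fix, title)

print(fix_disambiguator('Category:Articles containing Jewish Babylonian Aramaic (ca. 200-1200 CE)-language text'))
# -> Category:Articles containing Jewish Babylonian Aramaic (c. 200–1200 CE)-language text
</syntaxhighlight>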
Further proof that wherever the code/name definitions in Module:Language/data/wp_languages came from, a lot are wrong; en-SA is English as spoken in Saudi Arabia; ZA is South Africa.
They are errors but not errors. The deprecated test expects that {{lang|fn=tag_from_name|Hebrew}} will return iw but instead gets he. This is because, for Hebrew, iw is deprecated but he is not deprecated. {{lang/sandbox}} is doing the correct thing when it returns the active tag for the language name.
I suspect that a tweak to the reference data set assembled by the testcase is required. When a language name has both active and deprecated tags, skip the language name.
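The testcase itself is Lua, but the filter logic is simple; in Python, with assumed data shapes:
<syntaxhighlight lang="python">
# Names whose tags mix active and deprecated codes are skipped, since
# tag_from_name() legitimately returns the active tag for them.
def build_reference(names_to_tags, deprecated_tags):
    reference = {}
    for name, tags in names_to_tags.items():
        deprecated = [t for t in tags if t in deprecated_tags]
        if deprecated and len(deprecated) < len(tags):
            continue                    # mixed active + deprecated: skip
        reference[name] = tags
    return reference

print(build_reference({'Hebrew': ['he', 'iw']}, {'iw'}))  # {}
</syntaxhighlight>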
True, but even though it is currently unused, I don't think it would be deleted as it is part of the ISO 639 set, which means it would need to be updated anyway. Having it used here would at least mean that it would not be out of sync with the others. --Gonnym (talk) 14:09, 26 September 2020 (UTC)[reply]
I've updated the /sandbox with the 639-1 module and also commented out code that checks for 2 or 3 letters, as it seems (to me) unnecessary now that the lists are split, but I might be mistaken. The /testcases are all green except for 5 in Module talk:ISO 639 name/testcases which now use "not_found" instead of "not_code" for the error message. Can you take a look and see if the commented-out code is still needed or not? --Gonnym (talk) 15:46, 26 September 2020 (UTC)[reply]
I don't think it should be deleted. AWB is a Windows-only tool, so otherwise-qualified editors who edit with other systems would not be able to do the update.
Did you know about the external links in the style of iso639-3:bcp? Do you know where this is created and how? It seems very strange that we have external links without any sign that these are not Wikipedia links. --Gonnym (talk) 11:59, 7 October 2020 (UTC)[reply]
Could you change the sort key of the error messages produced by Module:Lang so that the key is the ISO code? It makes it easier to fix groups of pages. I tried doing it but couldn't get to the code without changing too many functions along the way and messing something up. --Gonnym (talk) 14:10, 9 October 2020 (UTC)[reply]
Can be done, but is it really necessary? Are you expecting that we will suddenly be getting a huge number of errors?
Well, there are over 100 in the transl category. But I guess you are right that if there isn't going to be an unexpected surge, there is no need. --Gonnym (talk) 16:04, 9 October 2020 (UTC)[reply]