User talk:Trappist the monk/Archive 14


Help with regex - continued

Hey, Trappist! With your regex you helped me a lot. I was wondering if you could "twist it" so it would incorporate one extra step in it:

I use these find and replace regex-es to see if the language name is made up of more than 1 part (it includes spaces) and to change it accordingly so the bash can read it.

  • Find: (\w)( )(\w)
  • Replace: \1\\\2\3

Basically for languages like "Church Slavic" we need them like this: Church\ Slavic

Any way to incorporate that logic in the regex you gave me?

Also, while I'm at this point:

  • "\|\s*language\s*=\s*Afar\b" "|language=
  • "\|\s*language\s*=\s*Abkhaz\b" "|language=
  • "\|\s*language\s*=\s*Avesta\b" "|language=
  • "\|\s*language\s*=\s*Afrika\b" "|language=
  • "\|\s*language\s*=\s*Akan\b" "|language=
  • (many more entries)

I use this page to get to this point:

  • "\|\s*language\s*=\s*Afar\b" "|language=aa
  • "\|\s*language\s*=\s*Abkhaz\b" "|language=ab
  • "\|\s*language\s*=\s*Avesta\b" "|language=ae
  • "\|\s*language\s*=\s*Afrika\b" "|language=af
  • "\|\s*language\s*=\s*Akan\b" "|language=ak
  • (many more entries)

Any way I can do that faster without having to leave Notepad++? - Klein Muçi (talk) 10:33, 14 September 2020 (UTC)

Regex is pretty cool but can quickly become overwhelmingly complex. For that reason, I like simple regexes to do things one step at a time. So, for this task, I would do two steps:
  1. replace whitespace in language names with an escaped space character:
    Find: ([a-zA-Z]) +([a-zA-Z])
    Replace: $1\\ $2
  2. use this to create your final output:
    Find: ([a-z]{2,3}):\s*([^\n\r]+)
    Replace: "\\|\\s*language\\s*=\\s*$2\\b" "|language=$1
Trappist the monk (talk) 11:23, 14 September 2020 (UTC)
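The two steps above can be sketched outside Notepad++ as follows. This is a minimal Python sketch, not anything used in the thread: the function names are made up for illustration, and it assumes each input line has the form `code: name`, which is what the second find pattern expects.

```python
import re

def escape_spaces(line):
    """Step 1: turn a run of spaces between two ASCII letters into backslash-space."""
    return re.sub(r"([a-zA-Z]) +([a-zA-Z])",
                  lambda m: m.group(1) + "\\ " + m.group(2), line)

def to_find_replace(line):
    """Step 2: turn 'code: name' into the quoted find/replace pair."""
    def build(m):
        code, name = m.group(1), m.group(2)
        return r'"\|\s*language\s*=\s*' + name + r'\b" "|language=' + code
    return re.sub(r"([a-z]{2,3}):\s*([^\n\r]+)", build, line)

# Assumed input format: "<code>: <language name>"
print(to_find_replace(escape_spaces("cu: Church Slavic")))
```

The replacement is built in a function rather than a template string so that the backslashes do not have to be double-escaped for `re.sub`.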
Yes, that's basically what I'm doing right now, just reversed. I totally get your logic but the thing is I need to do these steps over and over again: 186 times for all the 2-character ISO codes and then more than 3000 times for the 3-character ones. And then... You get the idea. So I'm trying to minimize the number of steps needed as much as possible so I can speed up the process as much as possible. I even thought of writing a Python script that will do that for me but in order to do that, I need to lower the complexity of the task first (by minimizing the number of needed steps) because I'm not that tech savvy. :P
The actual number of steps I need to take now:
  1. Change the code of lang lister to the next language;
  2. Copy all languages with their corresponding codes;
  3. Paste them all in Notepad++;
  4. Run your find and replace regex;
  5. Add a slash before spaces;
  6. Add language codes by merging them in that page above;
  7. Append " \ at the end of each line to complete the script;
  8. Save the file, restart the cycle;
I was trying to find a way to merge some of the 5th, 6th and 7th steps together. - Klein Muçi (talk) 11:58, 14 September 2020 (UTC)
Sixth and seventh steps can be combined by changing the replace to:
"\\|\\s*language\\s*=\\s*$2\\b" "|language=$1" \\
It is probable that the space-to-escaped-space regex will fail when the language name is written using non-Latn script or is written using Latn script and the letter before and/or after the space is Latn script with diacritic. An alternate find for that is: ([^:\s]) +(\S)
But, why do all of this? Why don't you create a special version of lang_lister() that takes the raw input from MediaWiki and then gives you the output that you require? Doing that will reduce your list of steps to more-or-less this:
  1. invoke modified lang_lister() with the language-code for the desired language
  2. copy/paste output into external file
  3. save the file
Trappist the monk (talk) 13:01, 14 September 2020 (UTC)
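To illustrate why the alternate find matters, here is a hedged Python sketch comparing the ASCII-only find with the any-script find, followed by the combined replace that appends the closing quote and trailing backslash. The constant and function names are illustrative only, and the `code: name` input format is an assumption.

```python
import re

ASCII_FIND = r"([a-zA-Z]) +([a-zA-Z])"   # first suggestion: ASCII letters only
ANY_SCRIPT_FIND = r"([^:\s]) +(\S)"      # alternate find: any non-space neighbours

def escape_spaces(line, find):
    """Replace each matched run of spaces with a single backslash-space."""
    return re.sub(find, lambda m: m.group(1) + "\\ " + m.group(2), line)

def to_script_line(line):
    """Combined replace: find/replace pair plus closing quote and ' \\'."""
    def build(m):
        return (r'"\|\s*language\s*=\s*' + m.group(2)
                + r'\b" "|language=' + m.group(1) + '" \\')
    return re.sub(r"([a-z]{2,3}):\s*([^\n\r]+)", build, line)

line = "cu: սլավոներեն, եկեղեցական"
print(escape_spaces(line, ASCII_FIND))                        # space is left unescaped
print(to_script_line(escape_spaces(line, ANY_SCRIPT_FIND)))   # space is escaped
```

One limitation of the any-script find is that it consumes the character after the space, so in a name made of consecutive one-character words a single pass would leave every second space unescaped.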
I can answer all that with 1 question: CAN I DO THAT?! XD Please, can you help me achieve that? :P That would be a game changer! - Klein Muçi (talk) 13:14, 14 September 2020 (UTC)
You don't need my permission. If you need help, let me know.
Trappist the monk (talk) 13:15, 14 September 2020 (UTC)

Hahaha, no, I meant if it was possible to do it because somehow it had escaped my thoughts that I could change the module (template?) itself to make it do the needed work for me and not create a script from scratch for that. The problem is that I'm not familiar enough with it to change it on my own. I have no idea where to start and what to do to achieve the needed results. And I thought maybe you could show me what to change and where, and I could fine-tune it further according to the bot's needs. Should I change the module Cs1 documentation support or the template Citation Style documentation/language/doc? - Klein Muçi (talk) 13:30, 14 September 2020 (UTC)

Make a copy of sq:Module:Cs1 documentation support someplace (give it a meaningful name). Delete:
  1. exclusion_lists{} table
  2. everything between lang_lister() function and the exported functions table
From the exported functions table, delete everything except lang_lister = lang_lister
Save it and tell me where it is so that I can help if needed.
Decide what each entry in your final list should look like. Decide what the list of entries looks like (bulleted? in columns? plain?) and then edit what remains in your new module to make that happen. So that you don't end up saving a bazillion copies of the module that don't work, on some page (a sandbox page, or usurp the module talk page or the module's doc page), add an invoke:
{{#invoke:<module name>|<function name>|lang=<language code>}} – replace <module name>, <function name>, and <language code> with actual module name, the function name (lang_lister until you change it – or not), and a legitimate language code
save that page then copy the page name to the Preview page with this template box at the bottom of the module (in edit mode). When you change something and want to test it, click the adjacent Show preview button. Still, you should save your work occasionally.
Trappist the monk (talk) 14:12, 14 September 2020 (UTC)
So basically like this? I'm yet to make changes to it other than deleting the unneeded parts. I'm not sure I will keep the module for long after I complete the task so I didn't think much about the name.
"\|\s*language\s*=\s*Qafár\ af\b" "|language=aa" \
^ This is the standard and I'd like to have it in a bulleted list, as that's easier to visualize and the bullet points are not copied when copy-pasting so... Pretty convenient. What should I do next? - Klein Muçi (talk) 14:43, 14 September 2020 (UTC)
Also, I noticed something about what you had written above regarding your last regex. I don't think it fails with non-Latin scripts because I tried it with Korean and many other "similar" scripts and it worked fine. But your last suggestion "\\|\\s*language\\s*=\\s*$2\\b" "|language=$1" \\ "doesn't work". The whole idea of the bot is to change language values used in citations into ISO codes. So having that =$1 defeats the purpose of the script. But if I can make the "module solution" happen, I wouldn't need to deal with these extra steps so my attention is on that right now. :P - Klein Muçi (talk) 15:14, 14 September 2020 (UTC)
Did you use this find: ([a-z]{2,3}):\s*([^\n\r]+) with this replace: "\\|\\s*language\\s*=\\s*$2\\b" "|language=$1" \\? If you didn't then of course it didn't work.
Trappist the monk (talk) 15:41, 14 September 2020 (UTC)
How much assistance do you want? I can (have done) write something that should work (not tested). I can strip that to a skeleton so that you can find a solution yourself...
Trappist the monk (talk) 15:41, 14 September 2020 (UTC)
Oh wow! Apparently I hadn't because it does work now. I have to try that F&R above with some non-Latin languages to see if it works well with escaping spaces or not. But basically, if it does, you have shortened my work tremendously even now. I only have to copy-paste and use 1 regex before saving and restarting the cycle. As for the assistance, believe me, as much as you can. Everything past wikimarkup leaves me with LOTS of trials and errors given that I have no technical background. Basically it takes me 1 month just to invent the wheel, if you get my metaphor. - Klein Muçi (talk) 15:54, 14 September 2020 (UTC)

Aaand, no. The space escape doesn't work with non-Latin languages. :P - Klein Muçi (talk) 16:02, 14 September 2020 (UTC)

sq:Përdoruesi:Trappist the monk/Livadhi personal
Trappist the monk (talk) 16:54, 14 September 2020 (UTC)
Lovely! I have only one last problem with that. There are some languages which are "problematic" in following the standard. Edge cases like: սլավոներեն, եկեղեցական, luba-katanga (kiluba) or словѣньскъ / ⰔⰎⰑⰂⰡⰐⰠⰔⰍⰟ. Can anything be done about these cases or should I deal with them manually? That's totally doable because there are not a lot of them. I was just wondering if maybe a simple technical solution could be within reach that I don't know about. The problems I mention are these: The comma would make the regex (?) for the space escape fail since it's not a word character. The parentheses would need to be escaped too, I think, considering that this is intended as a bash script (I guess they would need to be escaped even in regex, no?). They would also make the regex for the space escape fail. And finally the slash itself would need to be escaped too and I believe it also would make the regex for the space escape fail. Maybe other cases like these exist too but these are the ones I've found until now (I've already done all the entries alphabetically up to the code la). - Klein Muçi (talk) 17:37, 14 September 2020 (UTC)
We're talking about this? "\|\s*language\s*=\s*սլավոներեն,\ եկեղեցական\b" "|language=cu" \ Armenian for: Slavonic, Church. sq:Module:Test doesn't do anything about commas and commas aren't regex special characters so I don't understand the issue. The 'space escape' in Module:Test (this: name:gsub (' +', '\\ ')) works on the name only so doesn't need to know about the characters on either side. It just looks for one or more space characters in the language name and replaces them with a single escaped space character. I've tweaked Module:Test to escape parentheses and the virgule.
Trappist the monk (talk) 18:05, 14 September 2020 (UTC)
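The name-only escaping described above can be approximated in Python as follows. This is only a sketch: the module itself is written in Lua, the function name is hypothetical, and it assumes that "escaping" parentheses and the virgule means putting a backslash in front of each, which the thread does not spell out.

```python
import re

def escape_name(name):
    """Escape a language name for use inside the find pattern.

    Works on the name alone, so it does not care which script the
    neighbouring characters belong to.
    """
    # One or more spaces -> a single backslash-escaped space
    # (a Python rendering of the Lua call name:gsub(' +', '\\ ')).
    name = re.sub(r" +", lambda m: "\\ ", name)
    # Assumption: parentheses and the slash each get a leading backslash.
    name = re.sub(r"[()/]", lambda m: "\\" + m.group(0), name)
    return name

for raw in ("սլավոներեն, եկեղեցական",
            "luba-katanga (kiluba)",
            "словѣньскъ / ⰔⰎⰑⰂⰡⰐⰠⰔⰍⰟ"):
    print(escape_name(raw))
```

Spaces are handled first so that the backslashes they introduce are not touched by the second substitution.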
Yes, about that. Well, okay then... I'll test it a bit later but I don't suspect there will be any problems with it. If everything works correctly I will try to start basically from scratch because a method like this is faaar less prone to errors and especially is more secure regarding space escape in non-Latin characters. (I had difficulties with Arabic and Hebrew.) Since there are far fewer steps to be taken now (change code, copy-paste, save - repeat) I believe there won't be a need for scripts. I just have a naive question regarding that: Is there any way I can select all of them quickly with a keyboard combination? Ctrl+A wouldn't work since it would get the whole page. - Klein Muçi (talk) 18:17, 14 September 2020 (UTC)
Don't you wish there was a 'do-what-I-want-done' button? Alas, I don't know of any such keyboard shortcut.
I can think of something that might be useful. I'm guessing that the English-language code/name pairing is the fallback. Bouncing back and forth between English and Hebrew versions of the sq:Module:Test output seems to confirm that. So, what Module:Test can do is create a list of English-language names and then, if English is not the language being rendered, compare that language's name against the English list. If the names are the same, skip, else create a new find/replace string. There isn't any benefit that I can see to copying stuff you've already copied.
Trappist the monk (talk) 18:42, 14 September 2020 (UTC)
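The fallback check described above amounts to comparing two code-to-name tables. A minimal Python sketch follows, with a hypothetical function name and made-up sample tables; the real module would get its names from MediaWiki, not from literals like these.

```python
def without_english_fallbacks(local_names, english_names):
    """Keep only the code/name pairs whose local name differs from the English one.

    A local name identical to the English name is treated as a fallback
    and skipped, since the English run has already produced that entry.
    """
    return {code: name
            for code, name in local_names.items()
            if english_names.get(code) != name}

# Illustrative data only (not taken from MediaWiki):
english = {"aa": "Afar", "ab": "Abkhazian"}
local = {"aa": "Qafár af", "ab": "Abkhazian"}   # "ab" fell back to English

print(without_english_fallbacks(local, english))
```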

What I do now is exactly that with extra steps. I finish with all languages in a certain code. Then I start with the next code in the list until I get to a code with a different first letter. Then I group all the lines (those from aa, ab, af, ak, etc) in a file named, for example, just A and I use a Notepad++ command to remove the duplicates. Then I go on with "b-codes" and so on. Having some steps removed from that would be another present for me. :P - Klein Muçi (talk) 18:55, 14 September 2020 (UTC)

Done, I think. You can switch it on with |no-fallback=true.
Trappist the monk (talk) 19:29, 14 September 2020 (UTC)
Perfect! Now I have 2 questions:
  1. Is that the complete list of every language possible basically? At least, those possible on Wiki. If yes, where can I find the complete list of codes that I would need to change one after another so I can copy-paste the results? Is it the 2+3 character codes from this page? What about the IETF ones? And those other ones? - I now see that it is the complete list and IETF tags/languages are included. I don't know if I should try the overridden ones. My eye got caught on this one: ᬩᬲᬩᬮᬶ: ban-bali - Maybe it's the only one but my computer apparently can't render that script. What do I do in this case (or in other similar cases, if they exist)? :P
  2. Given that in this conversation you've shown me that many things I suspected were impossible are in fact possible, I'm taking the courage to ask: is it possible to just show all the languages in all the languages at once, without me needing to change them one by one? That's basically the end result I'm striving for and that would literally be a 'do-what-I-want-done' button. :P - Klein Muçi (talk) 00:54, 15 September 2020 (UTC)
The ban-Bali IETF tag (Balinese written using the ISO 15924 Bali script) is relatively new. You can see what the little boxes are supposed to be at https://r12a.github.io/uniview/ – copy/paste them into the text area box at right and click the down arrow.
Even though you struck the question, I would suggest that the only codes you need to worry about are the codes listed at List of Wikipedias.
I was wondering when you would work yourself round to that question. Could be done I think. The module would just loop through a list of codes. I'll try that and let you know.
Trappist the monk (talk) 10:06, 15 September 2020 (UTC)
Another question: How is it possible that even like this I still get around 1000 duplicates?! I thought I should get none now that the English duplicates are gone. - Klein Muçi (talk) 09:44, 15 September 2020 (UTC)
Are these duplicates duplicates of English? Example?
Trappist the monk (talk) 10:06, 15 September 2020 (UTC)
LOL Well magic IS true after all apparently. :P I have so much to learn regarding Lua. As for the duplicates, I was trying to loop through the codes now one by one manually, copy-pasting whatever results came through (of course there were very long pages, pages with only 1 result and blank pages). I finished all the codes alphabetically from aa to cs. That left me with 11760 lines of code. And when I try the "remove the duplicates" command on Notepad++, I'm left with 7063 lines of code. That's a tremendous amount of duplicates. But I don't know what exactly is removed because Notepad++ doesn't have a "show diff" page. Maybe I can try to find one online to find some examples of the duplicate lines. I'll deal with the ban-Bali IETF tag when it shows up. Thank you for the help with that! :)) - Klein Muçi (talk) 10:32, 15 September 2020 (UTC)
You can check the differences here (the old text is the original one and then I removed the duplicates) BUT it is very confusing as Notepad++ has to sort lines lexicographically before removing duplicates so I can't really make much sense of it. :/ - Klein Muçi (talk) 10:53, 15 September 2020 (UTC)
Not English. For example, do a Ctrl-F search for "\|\s*language\s*=\s*авадхи\b" "|language=awa" \ in the diff. 7× finds of which all but one were deleted. Not surprising that one language that is written with a Cyrillic script would have the same name as or fallback to another Cyrillic-script language.
Trappist the monk (talk) 11:11, 15 September 2020 (UTC)
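Instead of reading a sorted diff, the repeated lines can be counted directly. The following is a small Python sketch with a hypothetical function name; the sample lines are illustrative and stand in for the pasted module output.

```python
from collections import Counter

def report_duplicates(lines):
    """Return {line: count} for every line that occurs more than once."""
    counts = Counter(lines)
    return {line: n for line, n in counts.items() if n > 1}

# Illustrative sample: the same entry collected from several language runs.
awa = r'"\|\s*language\s*=\s*авадхи\b" "|language=awa" ' + "\\"
afar = r'"\|\s*language\s*=\s*Qafár\ af\b" "|language=aa" ' + "\\"
collected = [awa, afar, awa, awa]

for line, n in report_duplicates(collected).items():
    print(n, line)
```

Because nothing is sorted, the original order of the collected lines is left alone and only the offenders are listed.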

I see... So English isn't the only language that's bringing duplicates, so to speak. Should we do anything about cases like these? Or should I take care of it at the end with that command? - Klein Muçi (talk) 11:18, 15 September 2020 (UTC)

I suspect that when I change sq:Module:Test to process a list of language codes/names, duplicates will naturally fall out (before adding a code/name pair, the module will look to see if that code/name pair already exists in our list) so I will probably disable the English fallback.
Trappist the monk (talk) 11:30, 15 September 2020 (UTC)
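The idea that duplicates "naturally fall out" when one run processes several language codes can be sketched like this. It is a Python illustration under assumptions, not the module's Lua code: the function name is invented, and the wiki codes `xx` and `yy` and their name tables are placeholders.

```python
def build_entries(names_by_wiki):
    """Collect (code, name) pairs across several wikis, skipping repeats.

    Before a pair is added, the set of pairs seen so far is checked,
    so each pair appears once, in first-seen order.
    """
    seen = set()
    entries = []
    for wiki_code, names in names_by_wiki.items():
        for code, name in names.items():
            pair = (code, name)
            if pair in seen:
                continue          # already produced by an earlier wiki
            seen.add(pair)
            entries.append(pair)
    return entries

# Placeholder wikis "xx" and "yy" that share one Cyrillic-script name:
sample = {
    "xx": {"awa": "авадхи", "aa": "афар"},
    "yy": {"awa": "авадхи", "aa": "афарски"},
}
print(build_entries(sample))
```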
sq:Përdoruesi:Trappist the monk/Livadhi personal 31 language codes.
Trappist the monk (talk) 13:10, 15 September 2020 (UTC)

I understand. Well, if the end result is how I imagine it to be, that would be a great solution even for future updates. If anything changes regarding the vast amount of languages that exist, I can just copy-paste the whole code every once in a while and ensure the bot is up to date with the changes. That would require me to pay more attention to the module itself, with a proper name, given that it is here to stay now. And I can also focus on trying to automate other CS1 problem solving with the bot apart from language ones. If everything is all right, the next step would be to try and make the bot check for updates on its own but given that languages don't change that often (I'm assuming), that won't be a big problem and it can be done manually for now. - Klein Muçi (talk) 12:09, 15 September 2020 (UTC)

Oh, so we still need to add the language codes. Can't it be that it displays all of them "automatically"? I mean, to get the list of all language codes inside the module itself so we just invoke it and we get the needed results? That's what I was imagining when I talked about the auto-update. And why 31? Is that the list of "Wiki languages" available? Or just an example by you and I should add the remaining codes? - Klein Muçi (talk) 13:26, 15 September 2020 (UTC)
According to List of Wikipedias there are 303 active Wikipedias. There are 818 language codes in MediaWiki's English language list. It seems most likely that editors at any one of the 303 Wikipedias would use their local language to write <language name> in |language=<language name> or they would have copied a cs1|2 template from another wiki that used that wiki's local language. So, 303 languages associated with the 303 active Wikipedias seems the correct number of languages to process. Scribunto has mw.site.interwikiMap() but that function returns a table that has non-language codes (interwiki prefixes). I suppose that sq:Module:Test could spin through that table and make another table of prefixes that match an entry in the en.wiki language list. I'll think about that.
31 because I didn't want to add all of them ...
Trappist the monk (talk) 13:59, 15 September 2020 (UTC)
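Spinning through the interwiki table and keeping only prefixes that are also language codes is essentially a filter. Here is a hedged Python sketch of that idea; the function name and both sample collections are hypothetical stand-ins for the output of mw.site.interwikiMap() and the en.wiki language list.

```python
def language_prefixes(interwiki_prefixes, language_names):
    """Return, sorted, the interwiki prefixes that are also language codes."""
    return sorted(p for p in interwiki_prefixes if p in language_names)

# Stand-in data: some prefixes name projects rather than languages.
prefixes = ["commons", "sq", "wikt", "en", "aa"]
languages = {"aa": "Afar", "cu": "Church Slavic", "en": "English", "sq": "Albanian"}

print(language_prefixes(prefixes, languages))
```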
Yes, that was the logic I was following through. I imagined that if it could work with more than 1 code simultaneously, maybe it could work with all of them being part of the module code. Maybe 303 squared is the correct number or maybe we need all possible languages so we future failproof it? Whatever the answer, what do you suggest my next step should be? Should I go on and copy-paste all the language codes (303 or more) in the invocation or should I wait for them being part of the module? - Klein Muçi (talk) 14:14, 15 September 2020 (UTC)
I tweaked sq:Module:Smallem (formerly sq:Module:Test; it moved while I was doing it – don't do that) so that it used the 327 interwiki prefixes that have matching codes in the English language list. That did not work. It gave me this category:
sq:Category:Faqe ku stampat e përfshira kalojnë kufirin (google translate wasn't helpful → 'Sites where the stamps included cross the border')
I suspect that it is a post-expand include size limit error. That makes some sense. The 31-language list has a post-expand include size of 706,147/2,097,152 bytes so it shouldn't be surprising that a 327-language list would produce more than 2MB of output.
Before going any further, what will be using this output? Is this the best form of output to use?
Trappist the monk (talk) 15:15, 15 September 2020 (UTC)
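The size argument can be checked with a rough back-of-the-envelope calculation. The Python sketch below assumes output grows linearly with the number of languages, which is only an approximation (de-duplication inside the module would make the real figure smaller); the variable names are illustrative.

```python
POST_EXPAND_LIMIT = 2_097_152   # bytes, the limit quoted above
SIZE_FOR_31 = 706_147           # bytes, measured for the 31-language list

# Assumption: every language contributes about the same amount of output.
per_language = SIZE_FOR_31 / 31
estimate_for_327 = per_language * 327

print(f"about {per_language:,.0f} bytes per language")
print(f"about {estimate_for_327:,.0f} bytes for 327 languages")
print("over the limit:", estimate_for_327 > POST_EXPAND_LIMIT)
```

Under that linear assumption the 327-language list comes out at roughly 7.4 million bytes, several times the limit, which is consistent with the category that appeared.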
A sorted list of all of the interwiki prefixes with matching language codes can be found in the lua log. Edit sq:Përdoruesi:Trappist the monk/Livadhi personal; click Show preview; at the bottom of the page click the 'Parser profiling data' drop-down (it will be different for you, I switched my sq.wiki interface language to English); under the Lua logs heading click [Expand].
Trappist the monk (talk) 15:38, 15 September 2020 (UTC)
Warning: Long answer below:
XD I'm sorry. Yes, I was just coming here to tell you just that. I just tried doing the same thing not only with those 300+ languages but all of them. See what happens here. (Not a surprise anymore.) That category is precisely that. Apparently it's a limit to protect the server from crashing/slowing down. As I've mentioned earlier, the tool using this output is a replace pywikibot called Smallem. At the moment it literally does just that: Finds and replaces. For example, it finds deprecated harv parameters and it removes them, it finds language values and it replaces them with their corresponding ISO code, it finds CS1 categories added as categories (not from templates, we've discussed that earlier) and it removes them... The initial, ideal plan was to make it better at finding and fixing different kinds of CS1 errors but since I was inspired by the language parameter, I started with that and unfortunately that opened a rabbit hole that put a halt to the general progress. Firstly I devised with trial and error that regex you're currently seeing in order to fix/replace/modify as many language values as it could. Then I made the bot work only with English language terms, given that that's the biggest occurrence. Then I added the top entries mentioned in this graphic and the bot is currently running with that output weekly. Then I started working to add all the possible languages given that A) I usually like to "future failproof things", B) SqWiki (or small wikis in general) are a bit unpredictable in the way they produce articles. First I thought I would add only the languages closest to Albania because they ought to be the languages people will try to translate from the most but judging from that graphic above I saw that wasn't true. Finnish was one of the top languages even though Finland is a bit far from Albania. Soon enough I understood that that was because apparently SqWiki tends to periodically mass-produce bot-generated articles.
Therefore we have stub articles about "every" village in Finland and France (even though we may not have articles about every village in Albania yet). The data have been bot-generated by auto-translating stub articles from EnWiki and all those articles have citations in their corresponding languages which usually get imported with their "problematic values". Add here the fact that 90% of our new articles now come from CTT and soon enough the citations' languages become unpredictable (even though they still do come most of the time from EnWiki). Fun fact: Most of our new articles at the moment are related to Japan because that's the interest of a particular active user. So I was trying to add all possible languages as efficiently as possible and that's how I ended up in the technical village pump. You know the rest of the story. The output is fed into a bash script and Smallem is run weekly on Toolforge. Soon enough I discovered that I would have to deal with big data because of the vast amount of languages and bash scripts didn't do well with those; I read that probably I would have to move to PHP or C# for a definitive solution. I'm also aware that my regex, found by trial and error, may not be the most optimized for that work but I thought I would deal with those details gradually in the future when they started to become a problem and when I had gotten the hang of it all better. Hopefully by getting the help of another programmer as a maintainer. (For me it is a great way to learn and while doing so, I also help the community.) Given your big help on the project, and the fact that it deals with CS1 problems, I also thought that maybe I could ask you in the end if you were interested. But judging by what is happening I'm starting to see that I'm facing limits much faster than I had anticipated. - Klein Muçi (talk) 16:03, 15 September 2020 (UTC)

I did what you mentioned and it works but what am I to do with that list given that the limit we mentioned above blocks me from extracting any output from it? - Klein Muçi (talk) 16:16, 15 September 2020 (UTC)

I wondered if whatever tool you are using to ingest this list might have some sort of array or dictionary or something that might be more efficient than a list of independent search and replace strings. For example, if I were programming an awb script to do this, I would likely create a dictionary where the key is the language name and the value is the associated code. Then, I would write a function that searches for |language=<language name> and uses <language name> as an index into the dictionary. If found, replace <language name> with <language code>. This mechanism is much more efficient than a huge list of find/replace strings, each of which must be examined before moving on to the next article.
I don't know anything about python but surely it has something like c# dictionaries or Lua associative arrays (tables) that can hold lists of language-name / language-code pairs. Find someone who is conversant in python?
You wanted to future proof things so here is a list that MediaWiki maintains. Yeah, as it is used now, you have to use the list in fragments. But it is there for the looking; you don't have to do much special to get it.
Trappist the monk (talk) 16:43, 15 September 2020 (UTC)
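Python does have such a structure: the built-in `dict`. The dictionary approach described above might look roughly like the sketch below. It is not how the pywikibot replace script works; the names `LANGUAGE_CODES`, `LANGUAGE_PARAM` and `replace_language_names` are invented for illustration, and the table holds only a few pairs taken from this thread.

```python
import re

# Key: language name as written in |language=; value: language code.
LANGUAGE_CODES = {
    "Afar": "aa",
    "Abkhaz": "ab",
    "Akan": "ak",
    "Church Slavic": "cu",
    "Qafár af": "aa",
}

# One pattern for every language: capture whatever follows |language=
# up to (but not including) the whitespace before the next | or }.
LANGUAGE_PARAM = re.compile(r"\|\s*language\s*=\s*([^|}]*?)(?=\s*[|}])")

def replace_language_names(wikitext):
    """Replace known language names in |language= with their codes."""
    def swap(m):
        code = LANGUAGE_CODES.get(m.group(1))
        if code is None:
            return m.group(0)      # unknown name: leave the text as it is
        return "|language=" + code
    return LANGUAGE_PARAM.sub(swap, wikitext)

print(replace_language_names(
    "{{cite web |title=X |language = Church Slavic |url=y}}"))
```

The point of the design is that each article is scanned once with a single pattern and each hit costs one dictionary lookup, rather than trying tens of thousands of separate find/replace strings in turn.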
I understand. So basically that's the whole list that MediaWiki deals with? I shouldn't worry about ALL the languages, eh? That seems doable. How many codes do you estimate I should try in one go in order to not break the limit? Or do you not know yet? If not, don't worry. I'll find out soon enough anyway. As for the optimization of the code, I'm still not sure whether I can do that or not. At the moment, I don't deal AT ALL with Python. The replace pywikibot is premade and I only give the strings to find and replace. Please, take a quick look here. See the local and global parameters. I guess other commands can be given in the bash shell but I'm yet to experiment much on there. So basically, if you would choose to help in maintaining it later, much of your work would be in just helping to find ways with Lua in-wiki how to generate good regex-es for it and make them part of the source code (which is just a big list of regex-es), without having to deal with bash scripting or Python. You would just need a Toolforge account and me giving you access to the tool (Smallem). That is, if the bot is kept in the current mode, which, given my lack of knowledge, will be true for a while. But that's something to be discussed at the end of the conversation. For the moment, I'll try and create the needed list for languages, hopefully to complete it this time, after having had to start it from scratch around 5 times or so. - Klein Muçi (talk) 17:08, 15 September 2020 (UTC)
PS: The end section of the MW page I sent you above seems to indicate it is possible to go beyond simple F&R regex-es. - Klein Muçi (talk) 17:13, 15 September 2020 (UTC)
I tried the whole list. First of all, strangely enough, it still had duplicates. The whole list had about 50 000 entries and after removing the duplicates it went down by 9 000 entries. I don't know why that is so. Secondly, unfortunately, the bash script got extra long with those added lines and there were too many arguments for it to be run, so I guess I can't really use that list how I would have liked to. I wonder if with the method you mentioned it could have been possible. :/ - Klein Muçi (talk) 02:13, 16 September 2020 (UTC)
I think I can still use that list if I run the job at the job grid. But still, the duplicates were a bit surprising. Shouldn't it have had none of them? - Klein Muçi (talk) 09:59, 16 September 2020 (UTC)

I can confirm I can use the full list on the job grid. The duplicates remain the only problem (if it can be called a problem). - Klein Muçi (talk) 12:12, 16 September 2020 (UTC)

You have to run sq:Module:smallem on sections of the whole language-code list, right? The module has no knowledge of other sections so it seems entirely likely that each independently created section will have some list items that also appear in other section lists.
I added a test probe to Module:smallem that counts the number of list items in the final list. When I let the module process all 320+ language codes, the list (were we able to render it) would have 41,968 items. The count of items in the list is available in the Lua log.
What does smallem mean?
Trappist the monk (talk) 13:34, 16 September 2020 (UTC)
I have precisely 41,968 lines of code dedicated to fixing languages so you (we?) are correct. I see now why the duplicates are created. So basically those are all the possible languages that exist? I want to confirm that as a fact so I know what to write in the bot's user page.
As for the name, I'm very glad that you asked. (Even though your question might be followed by a critique. :P ) Usually my nickname as an online persona is Bigem, a wordplay inspired by the initial letter of my last name. The bot is called Smallem, imagining it as a mini-helper. Given that that module doesn't serve much more than to help me set up Smallem right now (and I guess that will be its purpose even in the future), that's the module's name too. - Klein Muçi (talk) 14:32, 16 September 2020 (UTC)
Predicated on the notion that the majority of editors will write the value assigned to |language= in their own language on a Wikipedia that uses their own language, checking the languages associated with the various language editions of Wikipedia should cover most cases. An editor citing a Norwegian-language source at sq.wiki would write |lang=norvegjisht whereas an editor at en.wiki would write |lang=Norwegian, right? What we have now is a list of all of the languages supported by all of the specific-language Wikipedias.
I just wanted to know what the name meant because google translate just translated it to itself.
Trappist the monk (talk) 15:03, 16 September 2020 (UTC)
Ok. All of the languages supported by all of the specific language Wikipedias. Got that.
Oh, so it was sheer curiosity then. I was a bit afraid you would scold me for giving the module an obscure name. :P
So, I guess that brings us to the end of the problem and, while thanking you a hundred times for your help, I want to end it with an offer, as I mentioned in the beginning: Do you want to have access to operating the bot, hoping to enhance its functionalities in the future in fixing CS1 errors, maybe even beyond SqWiki? Don't be afraid at all to refuse my offer if it is outside of your scope of interest. I just felt that since you helped a lot in overcoming a key problem of it (that would have taken me literally months to accomplish otherwise) and given that it deals with a subject you are familiar with, you deserved to have the possibility of operating it yourself too. - Klein Muçi (talk) 15:36, 16 September 2020 (UTC)

Template:Lang-eml

Template:Lang-eml was kept. So should it work? --Gonnym (talk) 17:58, 13 September 2020 (UTC)

Maybe. The way I read that discussion, an error message is the desired output. The current template does that though not very elegantly.
At the very least we could convert it to use Module:Lang so that the error message would be somewhat prettier cf.:
{{lang-eml|text}}{{lang-eml|text}}
{{lang|fn=lang_xx_italic|code=eml|text=text}} → [text] Error: {{Lang-xx}}: unrecognized language code: eml (help)
We could do that and call it good enough or we could add some sort of support for deprecated ISO 639 language codes to Module:Lang that would render properly-formatted text with a suitable error message. I'm sort of inclined to this last because deprecated codes are just that, deprecated, not deleted.
Trappist the monk (talk) 19:10, 13 September 2020 (UTC)[reply]
If you think adding the deprecated support is better, then you have my support. If not, then converting the error to lang looks much better than the current setup. --Gonnym (talk) 08:43, 14 September 2020 (UTC)[reply]
I have tweaked Module:Lang/sandbox and Module:lang/data/sandbox to add support for deprecated ISO 639 codes:
{{lang/sandbox|eml|text}} → [text] Error: {{Lang}}: unrecognized language code: eml (help)
{{lang/sandbox|fn=lang_xx_italic|code=eml|text=text}} → [text] Error: {{Lang-xx}}: unrecognized language code: eml (help)
I think that some sort of error messaging is required per the TfD though it isn't clear to me what that messaging should look like so I will start a discussion at Template talk:Lang to see what the community think.
Trappist the monk (talk) 11:53, 17 September 2020 (UTC)[reply]

off topic

But sort of related. You wanted me to add some error checking to {{Category articles containing non-English-language text}} and {{Category articles containing non-English-language text/inner core}}. I did that and now there are 304 categories listed at Category:Lang and lang-xx template errors. Surely you had a purpose in mind for these?

Trappist the monk (talk) 19:27, 13 September 2020 (UTC)[reply]

Wasn't sure you finished the code as it was in the sandbox still. Regarding the code, I'm still not sure why we are using user-input when the template already knows what ISO it supports in the category. If we take as an example Category:Articles containing French-language text which is used in the template doc, the code it says to use is {{Category articles containing non-English-language text|example=La plume de ma tante|French|fr|fre|fra}}. That produces:
This category contains articles with French-language text. The primary purpose of these categories is to facilitate manual or automated checking of text in other languages.

This category should only be added with the {{Lang}} family of templates, never explicitly.

For example: {{Lang|fr|text in French language here}}, which wraps the text with <span lang="fr">. Also available is {{Langx|fr|text in French language here}} which displays as French: ''La plume de ma tante''.
"fre" isn't supported and produces an error, and "fra" doesn't appear in the text anywhere. So either there is no point at all in listing the other ISO codes or there is and it should be added. But either way, if the backend knows what ISO each language supports, why can't it just retrieve the data and show it, instead of getting the data from a user, validating the data, then showing it or an error? --Gonnym (talk) 08:41, 14 September 2020 (UTC)[reply]
If the templates get fixed there will be no need to keep cat_test() so the sandbox is a fine place for it. Were it me, I would have {{Category articles containing non-English-language text}} fetch the language name from the category title. From the language name, the template can fetch the necessary language code. If there is a need to override (I'm not sure that there is) support for {{{1|}}} as an alternate language name could be provided. {{Category articles containing non-English-language text/core}} does nothing so can go away and I suspect that {{Category articles containing non-English-language text/inner core}} can also go away.
Trappist the monk (talk) 10:40, 14 September 2020 (UTC)[reply]
Ok, so we have two options. The first is a combination of Template:Category articles containing non-English-language text/sandbox and Template:Category articles containing non-English-language text/inner core/sandbox (the inner is used just because I don't want to "find" the language name from the title 5 or so times). This still needs a module call to fetch the ISO code. Second option is to use the code I added at Module:Lang/documentor tool/sandbox which also needs to fetch the ISO code. That module also takes care of Template:Non-English-language source category so that could be a good place to put this also. What do you think? --Gonnym (talk) 12:52, 14 September 2020 (UTC)[reply]
There are those who believe that wiki-text should be king; I am not one of them. If you intend articles_containing_language_text_category() in Module:Lang/documentor tool/sandbox to implement {{Category articles containing non-English-language text}} then I support your choice.
Trappist the monk (talk) 13:14, 14 September 2020 (UTC)[reply]
Ok, so the only issues there at the moment are the ISO. Which ISO should I get? And what module call will give me it? Also, is there a module call that I can use that will give me a lang-x version for an iso (without doing an exist check)? --Gonnym (talk) 13:20, 14 September 2020 (UTC)[reply]
If you take the language name from the category title then _tag_from_name() will give you the appropriate language tag to use in {{lang}}. You will have to test for the existence of {{lang-??}} because Module:Lang does not keep a list of those templates. Testing for existence isn't an issue because the template is used only once per category, right?
Trappist the monk (talk) 13:36, 14 September 2020 (UTC)[reply]
Can you take a look at Module:Lang/documentor tool/sandbox? Think it works. --Gonnym (talk) 14:40, 14 September 2020 (UTC)[reply]
I tried it at Category:Articles containing Abenaki-language text; did not work so well... Looks nice at Category:Articles containing French-language text.
Trappist the monk (talk) 14:49, 14 September 2020 (UTC)[reply]
Did some changes before I read your comment so not sure if what I did fixed it or not. Can you check Abenaki again? (with the /sandbox) --Gonnym (talk) 14:57, 14 September 2020 (UTC)[reply]
Yah, better. I tried it with |language=Eastern Abenaki and |language=Western Abenaki both of which caused error messages because ISO 639-3 spelling is Abnaki (without the 'e'). Worked when I gave it |language=Eastern Abnaki and |language=Western Abnaki.
So then I tried something more difficult: Category:Articles containing Proto-Celtic-language text. Did not work. Language tag for that is: cel-x-proto which violates your three-character-max-length test.
I like named parameters, but... Because this template has a heritage of positional parameters, perhaps |language= should be backed up with args[1] (yields to |language= when both are present; plus error message?)
Trappist the monk (talk) 16:03, 14 September 2020 (UTC)[reply]
The |language= parameter doesn't need to be used. The pagename is analyzed automatically. The issue with the >3 error checking is ._tag_from_name() doesn't return a good value in lua when it's an error and I thought that 3 characters were the max. Can you modify that function with an |error=no value (or something) so that when it finds an error it returns nil? That way I can just check if the object is nil instead of a length. Regarding Category:Articles containing Abenaki-language text it is good that it fails, as the correct categories are Category:Articles containing Eastern Abnaki-language text and Category:Articles containing Western Abnaki-language text. --Gonnym (talk) 16:22, 14 September 2020 (UTC)[reply]
Perhaps:
	if iso:find ('error') then
		error_message = iso
		iso = nil
	end
Trappist the monk (talk) 16:30, 14 September 2020 (UTC)[reply]
Ok, that seems to work, but I'm getting an error of {{Lang|cel-x-proto error: cel-x-proto is an IETF tag (help)|text in Proto-Celtic language here}}. What should be done here? --Gonnym (talk) 16:39, 14 September 2020 (UTC)[reply]
These categories are {{lang}} categories so the documentation should be using the Module:Lang data set. But, _name_from_tag() doesn't support a |label= parameter. You can build a wikilink from _name_from_tag(). If the returned value contains 'languages' then:
[[<name>|<code>]]
else
[[<name> language|<code>]]
Trappist the monk (talk) 17:33, 14 September 2020 (UTC)[reply]
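The wikilink rule suggested above (collective names already contain 'languages'; everything else gets a ' language' suffix) could be sketched like this. It is only an illustration in Python, not the Lua in Module:Lang; the function name `language_wikilink` and the sample names and codes are assumptions made for the example.

```python
def language_wikilink(name: str, code: str) -> str:
    """Build a piped wikilink whose visible label is the language code.

    Hypothetical helper: `name` stands for whatever _name_from_tag()
    returned, `code` for the tag that was looked up.
    """
    # Collective names such as 'Berber languages' already name the article.
    if "languages" in name:
        return f"[[{name}|{code}]]"
    # Individual language names get the ' language' article suffix.
    return f"[[{name} language|{code}]]"


print(language_wikilink("French", "fr"))
print(language_wikilink("Berber languages", "ber"))
```

A plain substring test mirrors the "contains 'languages'" wording; a stricter check (for example, requiring the name to end with 'languages') would be a design choice beyond what was described.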
Couldn't you just add support for |label= to that function? It already has a |link=yes parameter. There is no real reason to make this harder than necessary. --Gonnym (talk) 18:56, 14 September 2020 (UTC)[reply]
Switch to Module:Lang/sandbox. Report back.
Trappist the monk (talk) 19:52, 14 September 2020 (UTC)[reply]
Seems to be working good. See if you can find any error. I tested it on a normal page (Category:Articles containing French-language text), on a collective page (Category:Articles with text from the Berber languages collective), on a non-iso page (Category:Articles containing Proto-Celtic-language text) and on a bad page (Category:Articles containing Abenaki-language text). --Gonnym (talk) 22:09, 14 September 2020 (UTC)[reply]
Category:Articles containing traditional Chinese-language text fails. This cat is not populated by {{lang}} ({{zh}} I think). Also Category:Articles containing simplified Chinese-language text
Category:Articles containing Old Church Slavonic-language text works but should it?
{{lang|fn=tag_from_name|Old Church Slavonic}} → Error: language: Old Church Slavonic not found
{{lang|fn=name_from_tag|{{lang|fn=tag_from_name|Old Church Slavonic}}}} → Error: unrecognized language tag: Error: language: Old Church Slavonic not found
Category:Articles containing Old Korean (3rd-9th cent.)-language text works only with |language=Old Korean; as a positional parameter, doesn't work
Category:Articles containing Havasupai-Hualapai-Yavapai-language text doesn't work because spelling does not agree with IANA / ISO 639-3 spelling: Havasupai-Walapai-Yavapai. There are cats for each of the individual languages: Category:Articles containing Havasupai-language text, Category:Articles containing Walapai-language text, Category:Articles containing Yavapai-language text; these all work.
At Category:Articles with text from the Bihari languages collective there is a big red warning as a reminder that, if the CfD ever closes, collective language code categories will get new names.
Trappist the monk (talk) 23:28, 14 September 2020 (UTC)[reply]
If Category:Articles containing traditional Chinese-language text and Category:Articles containing simplified Chinese-language text don't use the Lang template they should fail as the text is clearly wrong.
Not sure why Category:Articles containing Old Church Slavonic-language text doesn't fail as the page being populated is Category:Articles containing Church Slavonic-language text.
Why is Category:Articles containing Old Korean (3rd-9th cent.)-language text working when the code is "Old Korean"?
Category:Articles containing Havasupai-Hualapai-Yavapai-language text should fail as the correct category is Category:Articles containing Havasupai-Walapai-Yavapai-language text.
So from the above, the only two questions are why Old Korean and Church Slavonic work for the incorrect name. The others are correct not to work and will be CfDs once the code is live. --Gonnym (talk) 23:50, 14 September 2020 (UTC)[reply]
Old Church Slavonic is listed in Module:Language/data/wp languages (which is a plague upon our house) and assigned the code cu. Module:lang searches Module:lang/data override{} table first, finds cu which has the assigned name Church Slavonic (two articles about different things assigned the same code). Both of Old Church Slavonic and Church Slavonic are listed in the tables created by Module:Lang/name to tag so both are found:
{{lang|fn=tag_from_name|Old Church Slavonic}} → Error: language: Old Church Slavonic not found
{{lang|fn=tag_from_name|Church Slavonic}} → cu
Old Korean (3rd-9th cent.) is a redirect to Old Korean. Module:lang, when it creates links to articles, strips IANA/ISO 639 disambiguators. When it creates categories, the IANA/ISO 639 disambiguators are retained so for oko the category name includes 'Old Korean (3rd-9th cent.)'. Module:Lang/name to tag strips IANA/ISO 639 disambiguators so when that list is queried:
{{lang|fn=tag_from_name|Old Korean (3rd-9th cent.)}} → oko
{{lang|fn=tag_from_name|Old Korean}} → oko
I suspect that the solution to this problem is to have Module:Lang/name to tag create name-to-tag entries for both disambiguated and undisambiguated names.
Trappist the monk (talk) 10:46, 15 September 2020 (UTC)[reply]
Thanks for explaining that! Really amazes me that the two Slavonic are two different things yet have the same code. Regarding the "Old Korean", the category might be better using the non-disambiguated title as the article is at Old Korean so maybe do what you did with the ISO module and add a |raw= parameter, that when used gives the disambiguated one, but in general use, gives the cleaner version. Once we have a fix for these, I think the code is ready to go live. --Gonnym (talk) 11:05, 15 September 2020 (UTC)[reply]
I've tweaked Module:Lang/name to tag so that it includes disambiguated names. Category:Articles containing Old Korean (3rd-9th cent.)-language text now works (Old Korean (3rd-9th cent.)-language is a red-link but that can be remedied with a redirect to Old Korean).
Trappist the monk (talk) 18:03, 15 September 2020 (UTC)[reply]
I'm not sure if this is good. The output of the tag_from_name code (at least in this situation) should be what the lang template populates. In this case, it does populate it, so that's good. But if this change affects other codes, then this isn't good. Is there a way to make sure that the name/code pairs used by the lang template to populate these categories, is the same name/pair that the tag_from_name gives? --Gonnym (talk) 19:40, 15 September 2020 (UTC)[reply]
This change only affects tag_from_name(). That is the only function that uses Module:Lang/name to tag. Unless we always use disambiguated names it is not possible to guarantee that name → tag_from_name() → tag → name_from_tag() → name will be circular. It is not a perfect system because we override stuff ...
Trappist the monk (talk) 23:38, 15 September 2020 (UTC)[reply]

I'm not sure you understood what I mean. Take a look at both Category:Articles containing Old Church Slavonic-language text and Category:Articles containing Church Slavonic-language text with the /sandbox version. Both categories return a valid result. That isn't the expected result as only one of those categories is actually being populated by the {{Lang}} template with the "cu" code. It's also not something out of our control, as what category gets populated by what code, is something you wrote. We just need to be able to get the same result here, so the error will appear for any category title that isn't the correct one being populated. --Gonnym (talk) 08:06, 16 September 2020 (UTC)[reply]

Module:Lang won't populate Category:Articles containing Old Church Slavonic-language text. The basic sources of the data set used by Module:lang are these three modules:
Module:Language/data/wp languages["cu"] = {"Old Church Slavonic"},
Module:Language/data/iana languages["cu"] = {"Church Slavic", "Church Slavonic", "Old Bulgarian", "Old Church Slavonic", "Old Slavonic"}
Module:Language/data/ISO 639-3["chu"] = {"Church Slavic", "Church Slavonic", "Old Bulgarian", "Old Church Slavonic", "Old Slavonic"}
They are combined (coalesced) in Module:Language/name/data by __coalesce() to produce:
["chu"] = {"Church Slavic", "Church Slavonic", "Old Bulgarian", "Old Church Slavonic", "Old Slavonic"}
["cu"] = {"Old Church Slavonic", "Church Slavic", "Church Slavonic", "Old Bulgarian", "Old Church Slavonic", "Old Slavonic"}
There is yet another data table:
Module:Lang/data["cu"] = {"Church Slavonic"}
If you were to write {{lang|chu|<text>}} Module:lang will use Module:Lang/ISO 639 synonyms to map chu to cu. When looking for the name to apply to the language link ({{lang-??}}), to the tool-tip ({{lang}}), to the category name (both), Module:lang looks for cu in Module:Lang/data where it finds and then uses the name 'Church Slavonic'. Were cu not overridden in Module:Lang/data, Module:lang would fetch 'Old Church Slavonic' from Module:Language/name/data (when there are multiple names associated with a code, Module:lang always takes the first name in the list).
You are suggesting that tag_from_name() should only return a tag for the name that gets used in {{lang}} and {{lang-??}}. That was not the intent of tag_from_name(). It was intended as a way to find codes for a variety of legitimate names. I have wondered if that table should be expanded to map all legitimate names to their associated code; 'Old Bulgarian' currently returns an error but shouldn't.
From Category:Articles containing Old Church Slavonic-language text, {{Category articles containing non-English-language text}} extracts 'Old Church Slavonic'. We use tag_from_name() to determine if that is a legitimate name. What we don't do is use name_from_tag() to see if 'Old Church Slavonic' is the name that {{lang}} and {{lang-??}} will use. So:
is 'Old Church Slavonic' the same as {{lang|fn=name_from_tag|{{lang|fn=tag_from_name|Old Church Slavonic}}}}? No? error:
'Old Church Slavonic' == Error: unrecognized language tag: Error: language: Old Church Slavonic not found?
But then:
'Old Korean (3rd-9th cent.)' == Old Korean?
This same test might be applied to |language= from {{Category articles containing non-English-language text}} because |language= exists (presumably) because the category title is something odd.
A hard nut to crack because the underlying data are messy. It may be that we will need a new function; something that returns the category name so that we don't have to infer what the category name ought to be.
Trappist the monk (talk) 12:19, 16 September 2020 (UTC)[reply]
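The lookup order described above (synonym mapping first, then the override table, otherwise the first name in the coalesced list) and the name → tag → name comparison can be sketched in Python. The table contents are copied from the lists quoted in this comment; the variable and function names are hypothetical, and `NAME_TO_TAG` holds only the two names that were said to be found in Module:Lang/name to tag.

```python
# Illustrative stand-ins for the data modules named above (not the real Lua tables).
SYNONYMS = {"chu": "cu"}            # role of Module:Lang/ISO 639 synonyms
OVERRIDE = {"cu": "Church Slavonic"}  # role of the override in Module:Lang/data
NAMES = {                            # role of the coalesced list for cu
    "cu": ["Old Church Slavonic", "Church Slavic", "Church Slavonic",
           "Old Bulgarian", "Old Church Slavonic", "Old Slavonic"],
}
NAME_TO_TAG = {"Old Church Slavonic": "cu", "Church Slavonic": "cu"}


def display_name(tag, override=OVERRIDE):
    """Return the name that would be used for a tag, or None if unknown."""
    code = SYNONYMS.get(tag, tag)     # promote chu -> cu
    if code in override:              # overrides win
        return override[code]
    names = NAMES.get(code, [])
    return names[0] if names else None  # otherwise the first listed name


def round_trips(name):
    """True when name -> tag -> name comes back to the same name."""
    tag = NAME_TO_TAG.get(name)
    return tag is not None and display_name(tag) == name


print(display_name("chu"))                  # Church Slavonic
print(display_name("cu", override={}))      # Old Church Slavonic
print(round_trips("Old Church Slavonic"))   # False
```

The sketch shows why both names resolve to a tag while only one of them survives the round trip, which is the mismatch under discussion.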
We are currently using _tag_from_name to see if the language name is the one Module:Lang uses for the category because you said that was the one to use; no one is forcing us to use this. If you don't want to change the function (which I agree with), then the solution would be a new function with a very limited use, which is to see if the language name supplied is the default one used by our system. This shouldn't be hard at all, as the same method you use to decide what language name to use can be used here. (In its dumbest form, we can use tag_from_name -> name_from_tag -> is_name_equal_name, but I'm sure this can be made much more elegant). --Gonnym (talk) 12:44, 16 September 2020 (UTC)[reply]
Yeah, I did say that _tag_from_name() was the thing to use, and in a perfect world I would have been correct. Now you know not to trust anything that I say, don't you? You don't believe me that tag_from_name -> name_from_tag -> is_name_equal_name won't work? Didn't I just demonstrate that that mechanism is wholly unreliable?
Here is another wrinkle. Believe it or not, _lang() and _lang_xx() handle category linking differently. _lang_xx() makes category names from the raw language name (with disambiguation if present); _lang() strips disambiguation when making the category name. I'm not sure why they are different (probably oversight) so I think that this is an error and that _lang() needs fixing. If that is true then _category_name_get(<tag>) could be used to return the category that both {{lang}} and {{lang-??}} populate. Compare the returned value to the actual category name; not same? error.
Trappist the monk (talk) 14:23, 16 September 2020 (UTC)[reply]
Now you know not to trust anything that I say, don't you? lol. Anyways, at least a few good things are coming out from all this back and forth and unnoticed bugs are getting fixed. I've modified the /sandbox to test the name-tag-name-equal check and it works for our Slavonic friends, but going by what you just said, it probably fails somewhere. So I guess we'll hold until the lang and lang-x templates are corrected and then we can continue. Right? --Gonnym (talk) 14:40, 16 September 2020 (UTC)[reply]
Yeah, works for Category:Articles containing Old Church Slavonic-language text and Category:Articles containing Church Slavonic-language text but doesn't work for Category:Articles containing Old Korean (3rd-9th cent.)-language text.
New function category_from_tag():
{{lang/sandbox|fn=category_from_tag|oko}} → Category:Articles containing Old Korean (3rd-9th cent.)-language text
{{lang/sandbox|fn=category_from_tag|cu}} → Category:Articles containing Church Slavonic-language text
{{lang/sandbox|fn=category_from_tag|chu}} → Category:Articles containing Church Slavonic-language text
{{lang/sandbox|fn=category_from_tag|art}} → Category:Articles containing constructed-language text
{{lang/sandbox|fn=category_from_tag|en-US}} → Category:Articles containing explicitly cited American English-language text
{{lang/sandbox|fn=category_from_tag|en-splat}} → Error: unrecognized variant: splat
Derived from name_from_tag(), returns category name or error message.
The new method would be?
  1. fetch category title
  2. extract language name from category title
  3. get language tag from _tag_from_name()
    not successful: error
  4. get expected category title from _category_from_tag()
    not successful: error
  5. compare category title to expected category title
    same: Yay!
    !same: error
Further proof that you can't believe anything that I say, {{lang}} fetches the language name as-is from the data and uses that for the tool tip and category. {{lang-??}} fetches the language name from the data and uses that for the category, but strips disambiguation for use in the language name prefix:
text ← see tooltip
Old Korean: text
so {{lang}} and {{lang-??}} populate the same categories for the same language tag. Fixing not needed.
I did, as part of this, fix make_category() because it had special handling for art (Artificial languages) that wasn't used.
Trappist the monk (talk) 18:14, 16 September 2020 (UTC)[reply]
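The five-step method listed above can be sketched in Python. This is not the code in Module:Lang/documentor tool/sandbox: the two dictionaries are hypothetical stand-ins for _tag_from_name() and _category_from_tag(), filled only with values quoted in this thread, and the title pattern is an assumption that covers just the 'Articles containing ...-language text' form.

```python
import re

# Hypothetical stand-ins for _tag_from_name() and _category_from_tag().
NAME_TO_TAG = {
    "Church Slavonic": "cu",
    "Old Church Slavonic": "cu",
    "Old Korean": "oko",
    "Old Korean (3rd-9th cent.)": "oko",
}
TAG_TO_CATEGORY = {
    "cu": "Category:Articles containing Church Slavonic-language text",
    "oko": "Category:Articles containing Old Korean (3rd-9th cent.)-language text",
}

# Step 2: the language name is whatever sits before '-language text'.
TITLE_RE = re.compile(r"^Category:Articles containing (.+)-language text$")


def check_category(title):
    """Return (ok, message) for a category title (step 1 is the caller's job)."""
    match = TITLE_RE.match(title)
    if not match:
        return False, "error: title does not match the expected pattern"
    name = match.group(1)

    tag = NAME_TO_TAG.get(name)              # step 3
    if tag is None:
        return False, f"error: language: {name} not found"

    expected = TAG_TO_CATEGORY.get(tag)      # step 4
    if expected is None:
        return False, f"error: no category for tag {tag}"

    if expected != title:                    # step 5
        return False, f"error: expected {expected}"
    return True, tag


for t in ("Category:Articles containing Church Slavonic-language text",
          "Category:Articles containing Old Church Slavonic-language text"):
    print(t, "->", check_category(t))
```

Comparing whole category titles, rather than language names, sidesteps the disambiguator problem: the Old Korean category passes because the expected title carries the same '(3rd-9th cent.)' text.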
Great job! It seems it now works. My code could probably be nicer, but it's probably good enough for live. --Gonnym (talk) 18:32, 16 September 2020 (UTC)[reply]
I've done some work on Module:Lang/documentor tool/sandbox so the other 2 language categories can use shared code. I've run into an issue with a soft redirect with Category:Articles with text from the Berber languages collective. Do you know how to check that the target isn't a soft redirect? --Gonnym (talk) 11:15, 17 September 2020 (UTC)[reply]
I'm not sure that I know what a 'soft redirect' is. Category:Articles with text from the Berber languages collective looks ok, all of its links link to their proper targets so what is the issue?
Trappist the monk (talk) 11:27, 17 September 2020 (UTC)[reply]
Category:Articles with text from Berber languages is a soft redirect (which was created by the mix-up with the /sandbox code containing the proposed new names). To be honest, this can just be deleted, as there is no need for soft redirects, as the only thing that should add pages to these categories is the template. --Gonnym (talk) 11:36, 17 September 2020 (UTC)[reply]
Deleted.
Trappist the monk (talk) 11:45, 17 September 2020 (UTC)[reply]
Thanks. So I think I'm done with the code refactoring and I think it works. Once you make the Module:Lang/sandbox changes live, I'll update it here and I think it is good to go. --Gonnym (talk) 11:52, 17 September 2020 (UTC)[reply]
After the deprecated ISO 639 codes issue is put to bed?
Trappist the monk (talk) 12:00, 17 September 2020 (UTC)[reply]
Sure, no rush here. --Gonnym (talk) 12:06, 17 September 2020 (UTC)[reply]

Happy Adminship Anniversary!

Template:Non-English-language source category categories

Since we're already waiting with pushing the changes, I've updated the non_en_src_cat() function to work with the shared code. While doing that I noticed that some of the language categories will always be empty. Category:Articles with Northern Ndebele-language sources (nde) says that it tracks usages of {{in lang|nde}} but that isn't correct. Those get added to Category:Articles with Northern Ndebele-language sources (nd), along with {{in lang|nd}} usages. Should the categories be kept and the code updated, or should the categories be removed? And if removed, how can we get the correct category being populated? --Gonnym (talk) 13:33, 17 September 2020 (UTC)[reply]

{{in lang}} does as {{lang}} does and promotes ISO 639-2, -3 to ISO 639-1 when there is an ISO 639-1 equivalent. So, any category for those ISO 639-2, -3 codes can go away when there is an equivalent ISO 639-1 category. I don't know why I created that nde category.
Trappist the monk (talk) 14:53, 17 September 2020 (UTC)[reply]
Can _category_from_tag() be modified to accept a template name? So that way it can give me the correct lang, in lang or CS1 category names. --Gonnym (talk) 15:05, 17 September 2020 (UTC)[reply]
I don't think that it should. {{in lang}} might be modified to accept a parameter (|list-cats=yes?) that instructs it to render a list of categories instead of a list of language names (it creates a list of categories anyway so rendering that list shouldn't be too onerous). cs1|2 has nothing to do with Module:lang and only has specific categories for two-character language codes; all of which have the same form.
Trappist the monk (talk) 15:50, 17 September 2020 (UTC)[reply]
Tweaked Module:lang/utilities/sandbox so:
{{#invoke:lang/utilities/sandbox|in_lang|oko|list-cats=yes}} → {{#invoke:lang/utilities/sandbox|in_lang|oko|list-cats=yes}}
{{#invoke:lang/utilities/sandbox|in_lang|oko|chu|yuf-x-wal|list-cats=yes}} → {{#invoke:lang/utilities/sandbox|in_lang|oko|chu|yuf-x-wal|list-cats=yes}}
when there is an error, returns empty string:
{{#invoke:lang/utilities/sandbox|in_lang|okoko|list-cats=yes}} →{{#invoke:lang/utilities/sandbox|in_lang|okoko|list-cats=yes}}←
Trappist the monk (talk) 16:45, 17 September 2020 (UTC)[reply]
Can you change |list-cats= to |list_cats=? Also, I'm getting an error Lua error in Module:Lang/utilities/sandbox at line 32: attempt to call method 'getTitle' (a nil value). (test with Category:Articles with Northern Ndebele-language sources (nd) in /sandbox). --Gonnym (talk) 08:33, 18 September 2020 (UTC)[reply]
Fixed, back to you: Lua error in Module:Lang/documentor_tool/sandbox at line 168: Tried to read nil global current_category_title.
Trappist the monk (talk) 10:30, 18 September 2020 (UTC)[reply]
It works only if I keep using .in_lang, but if I switch to ._in_lang it doesn't. (if you can just change the list-cats parameter to an underscore that would be even better). --Gonnym (talk) 11:24, 18 September 2020 (UTC)[reply]
_in_lang() has to be exported ...; try again.
Why? Does |list-cats= break something? The hyphenated-parameters form is consistent with multipart parameter names used by {{lang-??}} and similarly by {{ISO 639 name}}.
Trappist the monk (talk) 11:42, 18 September 2020 (UTC)[reply]
It doesn't break anything, it just makes the code a bit less nice as Lua doesn't recognize hyphenated parameters, but does underscore ones (in simple form). Template parameters with underscore also seem to be much more common, but if it's already set up like this in the language templates, then never mind. Anyways, your fix works and code works. Ready for live when the deprecated is done. --Gonnym (talk) 11:50, 18 September 2020 (UTC)[reply]

Working deprecated codes

{{#invoke:ISO 639 name|iso_639_code_to_name|link=yes|sh}} and {{#invoke:ISO 639 name/sandbox|iso_639_code_to_name|link=yes|car}} are currently the only deprecated codes that work with the live version. Not sure if that means there is a bug somewhere where they are in a list they shouldn't or if this is ok, but just letting you know. --Gonnym (talk) 22:29, 18 September 2020 (UTC)[reply]

sh is in the IANA language-subtag-registry file as a legitimate code even though ISO 639-2 and -3 custodians show it as deprecated. I wish that I could find an up-to-date definitive listing of ISO 639-1 codes from the 639-1 custodian. Best I can find from them is a 2001 doc. According to ISO 639-2 RA change notice, sh was deprecated in 2000. According to ISO-standards-are-us, there is a 2002 version still current as of 2019. No idea what's in that because I'll be damned before I hand over CHF158 to find out. So, who do we believe? IANA or the ISO 639 -2, -3 custodians?
According to ISO 639-5 change notice page, car was deleted in 2009 so for ISO 639-5, we treat it as deprecated. Still a valid ISO 639-2, -3 code.
Trappist the monk (talk) 23:07, 18 September 2020 (UTC)[reply]
That's very expensive! A shame we can't get the foundation to purchase access to that. --Gonnym (talk) 09:22, 19 September 2020 (UTC)[reply]

Auto-fixing CS1 maint categories

So, after dealing with the error ones (at least what could be fixed from them automatically), I'm looking at the maintenance categories.

I found only two I wanted to ask you about:

  1. Extra punctuation
  2. Extra text in authors etc. lists

Can we somehow deal with the extra punct. automatically?

What about the other set of categories? Can we safely program a list of regex-es to remove "repeating patterns" like "(.ed)" etc? - Klein Muçi (talk) 15:10, 19 September 2020 (UTC)[reply]

The proper way to handle Category:CS1 maint: extra text: authors list is to evaluate what is there and likely, change the author-name parameters to editor-name parameters. So, the process looks like:
  1. Find the cs1|2 template that has the extra text (several possible patterns for that)
  2. if there are existing editor-name parameters, abandon
  3. find the offending author-name parameter(s) (could be multiples) and delete the offending text – there are false positives reported by the cs1|2 test
  4. replace:
    • |last<n>= → |editor-last<n>=
    • |first<n>= → |editor-first<n>=
    • |author<n>= → |editor<n>=
    • ... for all of the rest of the possible author-name parameters
But, what if the author-name marked with (ed.) is author-name 3 in a list of four other author names, none of which have the (ed.) annotation? Why is it that I haven't written a bot to do this?
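As a rough illustration of steps 2 to 4 only, here is a Python sketch that works on a single template's parameters held in a dict. Everything in it is an assumption made for the example: the marker pattern covers only a trailing '(ed.)' or '(eds.)', only the three author-name parameter families listed above are handled, and the mixed-list case raised in the previous paragraph is not addressed at all.

```python
import re

# Hypothetical mapping for the three parameter families listed above.
AUTHOR_TO_EDITOR = {"last": "editor-last", "first": "editor-first", "author": "editor"}
PARAM_RE = re.compile(r"^(last|first|author)(\d*)$")
# Assumed marker pattern: a trailing '(ed.)', '(eds.)', '(ed)' or '(eds)'.
ED_MARK_RE = re.compile(r"\s*\(eds?\.?\)\s*$", re.IGNORECASE)


def authors_to_editors(params):
    """Return a new parameter dict with author names moved to editor names,
    or None when the template already has editor-name parameters."""
    if any(key.startswith("editor") for key in params):
        return None                      # step 2: abandon
    result = {}
    for key, value in params.items():
        match = PARAM_RE.match(key)
        if match:
            # step 4: rename, keeping the enumerator <n>
            key = AUTHOR_TO_EDITOR[match.group(1)] + match.group(2)
            # step 3: delete the offending text
            value = ED_MARK_RE.sub("", value)
        result[key] = value
    return result


print(authors_to_editors({"last1": "Smith (ed.)", "first1": "J.", "title": "Book"}))
```

Because the sketch renames every author-name parameter in the template, it would do the wrong thing for exactly the case asked about above, where only one name in a longer list carries the annotation.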
cs1|2 adds the Category:CS1 maint: extra punctuation category when the last character of a non-title holding parameter is one in the set of: [,:;]. Test each parameter in each cs1|2 template:
for each cs1|2 template:
  for each parameter in that template
    is this a title-holding parameter?
      yes, next parameter
    does parameter value end with [,:;]
      yes, is it a semicolon?
        yes, is it last character of an html entity?
          yes, next parameter
        no, delete trailing [,:;]
    next parameter
  next template
My conclusion is that these are not simple regex find and replace solvable.
Trappist the monk (talk) 16:05, 19 September 2020 (UTC)[reply]
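For what it is worth, the per-parameter test in the loop above might look like this in Python. It is a sketch under stated assumptions, not what cs1|2 does internally: `TITLE_PARAMS` is a made-up, incomplete set of title-holding parameter names, the entity pattern is a simplification, and values are assumed to be already trimmed of surrounding whitespace.

```python
import re

# Hypothetical, incomplete set of title-holding parameters.
TITLE_PARAMS = {"title", "chapter", "script-title", "trans-title"}
# A value that ends in a named or numeric HTML entity, e.g. '&amp;' or '&#59;'.
ENTITY_END_RE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);$")


def strip_trailing_punct(name, value):
    """Return value without one trailing , : or ; where that is safe."""
    if name in TITLE_PARAMS:
        return value                     # title-holding: leave alone
    if not value or value[-1] not in ",:;":
        return value                     # nothing to remove
    if value[-1] == ";" and ENTITY_END_RE.search(value):
        return value                     # semicolon closes an html entity
    return value[:-1]                    # delete the trailing character


print(strip_trailing_punct("publisher", "Acme,"))
print(strip_trailing_punct("publisher", "AT&amp;"))
```

This covers only the decision for one parameter; finding every cs1|2 template in the wikitext and splitting it into parameters (nested templates, pipes inside links) is the part that does not reduce to a simple find and replace.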

Well... That settles it then. - Klein Muçi (talk) 16:21, 19 September 2020 (UTC)[reply]

Lang/util merger

I've been looking at Module:Lang/utilities/sandbox, Module:Lang/name to tag and Module:Lang/documentor tool and seeing as how the /util module lost all the Nihongo code (and how cat_test() should be deleted), it would seem like these 3 could be merged. All 3 are utility related, so the scope fits; and all 3 are pretty small, so the size also works. What do you think? --Gonnym (talk) 09:24, 19 September 2020 (UTC)[reply]

Since I arrived at the decision to abandon Module:Language/name/data, some of what it does needs to become part of something and Module:Lang/name to tag, with a different name, seems the correct place for those things. When I created ~/utilities, I imagined it as a place for things that are heavily dependent on Module:Lang but are not 'part' of Module:lang. That's why {{in lang}} is there and why I initially had the Nihongo template support there. For me, the ~/documentor tool doesn't meet that requirement. Yeah, we could moosh them all into ~/utilities but I rather prefer the segregation. And, it ain't broke, so why fix it?
Trappist the monk (talk) 13:10, 19 September 2020 (UTC)[reply]
I'm pretty confident that the current module naming scheme across the language, lang and their sub-modules is currently very broken. Currently it's pretty much guess-work to find out which is used for what (but at least with some of it being deleted, it's getting clearer). I'm not saying the 3 above are the main issue, or even an issue, that was just me musing out loud. The reason I noticed it was because the cat_test() function which you placed in the /utils and not in the /documentor and /name to tag is utility in function. The odd one out actually seems to be the in_lang() code which isn't actually util but a product by itself and doesn't really belong in a module called util. --Gonnym (talk) 13:42, 19 September 2020 (UTC)[reply]
Maybe so. I don't object to moving in_lang() to Module:In lang. That means that ~/utilities goes away because:
native_name_lang(), if it is ever developed, will be developed in Module:Infobox/utilities where it properly should belong
cat_test() becomes obsolete after Module:Lang/documentor tool/sandbox becomes live
Trappist the monk (talk) 14:05, 19 September 2020 (UTC)[reply]
That does indeed help clean these up and sounds good. Waiting for you to say when the deprecation code is done so we can move the /sandbox code to live. --Gonnym (talk) 15:09, 19 September 2020 (UTC)[reply]
I have synced Module:Lang/sandbox → Module:Lang – deprecated ISO 639 code support gone. If there are no more changes to be made to Module:Lang/utilities/sandbox we should sync in_lang() and _in_lang() to Module:Lang/utilities (leaving behind native_name_lang() and cat_test()). And then, when you're ready, Module:Lang/documentor tool/sandbox can follow. Did I miss anything?
Trappist the monk (talk) 15:17, 19 September 2020 (UTC)[reply]
Only change to Module:Lang/utilities would be syncing with sandbox and then moving it to Module:In lang so I can adjust the module call from /documentor tool. --Gonnym (talk) 15:24, 19 September 2020 (UTC)[reply]
Module:Lang/utilities moved to Module:In lang. Everything that used Module:Lang/utilities/sandbox is now broken. I fixed {{Category articles containing non-English-language text}} to point to Module:In lang/sandbox. I'm running my null edit bot of the categories where it is transcluded.
Trappist the monk (talk) 16:52, 19 September 2020 (UTC)[reply]
I fixed some template doc issues. I think if you fix the templates that show up first, the other pages will update by themselves. The system updates template changes much faster than module changes. --Gonnym (talk) 17:10, 19 September 2020 (UTC)[reply]

CS1 properties

And finally, the properties' categories...

Here I have some more questions other than "auto-fixing".

  1. Foreign script
  2. Julian-Gregorian uncertainty
  3. Long volume

Can you explain to me in detail what these 3 categories are for? I think I know their purpose in general terms but I want to be better informed about them so I know what to do with them.

The first category has many subcategories in EnWiki. Is that list totally exhausted and definitive? If yes, I should recreate it in SqWiki. But I'm not sure what it serves for and if the articles in it require any kind of fix.

The second category... I sort of know what it is for but I've seen you haven't clearly decided what to do with entries in it so... Is that still the case? Can we do anything with articles in it?

And the same question applies to the third category. Can we do anything with the articles in it?

I'm daring to also ask if we can use a bot in any of them but I'm 90% sure we can't since I'm not sure if they require any kind of fixing whatsoever to begin with.

Please provide whatever information you can on those. And let me say I'm sorry for taking so much time from you but hopefully, this will be the last question of this sort for a while. :P - Klein Muçi (talk) 16:30, 19 September 2020 (UTC)[reply]

Properties do not need fixing.
The script categories collect articles for the same reason that the Category:CS1 foreign language sources subcategories collect articles.
There was some dispute over how cs1|2 handles dates for citation metadata in the overlap period between the Julian and Gregorian calendars. I provided Category:CS1: Julian–Gregorian uncertainty as a way for those who were concerned to evaluate how cs1|2 does handle the overlap. Nothing has come of it and someday I hope to remove the category and the code that supports it.
Pretty much the same story for Category:CS1: long volume value; this was related to bolding of the |volume= value in various of the cs1|2 templates. Yet again, no real resolution.
Trappist the monk (talk) 17:12, 19 September 2020 (UTC)[reply]
I see.. But is that the total list of subcategories or more are continuously created on the go? And what really does the name of the category mean? Are there values using non-Latin alphabets? What does that mean for cs1|2? Can it understand those values? - Klein Muçi (talk) 17:19, 19 September 2020 (UTC)[reply]
The categories are associated with languages that are written using non-Latin script. cs1|2 adds articles to these categories when editors use |script-title=, |script-chapter=, |script-journal=, etc. I occasionally add a language code to the list and then create a category to match but that doesn't happen much anymore.
Trappist the monk (talk) 17:45, 19 September 2020 (UTC)[reply]
Understood. Then I'll go on and replicate them accordingly. And I guess that concludes this long conversation. Thank you a lot for your time! - Klein Muçi (talk) 17:50, 19 September 2020 (UTC)[reply]

Precious anniversary

Precious
Eight years!

--Gerda Arendt (talk) 07:41, 20 September 2020 (UTC)[reply]

Broken categories

I went to create Category:Articles containing Pinyin romanization-language text ('cos it was listed as missing), but the templates say that its existence is an error, even {{Lang|zh-Latn-pinyin}} is populating it.

Then I saw that all Category:Lang and lang-xx template errors has 194 empty subcats.

Something has changed somewhere in the system. Any idea what has changed, and how it can be fixed? --BrownHairedGirl (talk) • (contribs) 11:48, 20 September 2020 (UTC)[reply]

The documentation template {{Category articles containing non-English-language text}} has been changed to recognize when category titles do not match the categories populated by Module:Lang. There is an ISO 639 language code for Pinyin (pny) which should be used in place of the IETF language tag zh-Latn-pinyin.
The error in the category arises because Module:lang cannot locate a language code to match Pinyin romanization:
{{lang|fn=name_from_tag|zh-Latn-pinyin}} → Chinese
{{lang|fn=category_from_tag|zh-Latn-pinyin}} → Category:Articles containing Chinese-language text
{{lang|fn=tag_from_name|{{lang|fn=name_from_tag|zh-Latn-pinyin}}}} → zh
Pinyin romanization is not an IANA recognized name but is instead derived from the pinyin variant tag.
The proper fix is to use the correct language tag so that the articles in Category:Articles containing Chinese-language text are recategorized to Category:Articles containing Pinyin-language text. There may need to be fixes to Module:lang as well.
Trappist the monk (talk) 12:30, 20 September 2020 (UTC) 13:14, 20 September 2020 (UTC)[reply]
From the links (Pinyin language and Pinyin romanization) those aren't the same. I think the issue is that the above category is made from data from Module:Language/data/iana variants (which category_from_tag() correctly retrieves) but the tag_from_name() function does not do the same backwards. --Gonnym (talk) 12:49, 20 September 2020 (UTC)[reply]
Point.
Trappist the monk (talk) 13:14, 20 September 2020 (UTC)[reply]
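A toy model of the asymmetry described here, for readers who want to see why the round trip fails. It is not the module's Lua code: the two tables are hypothetical miniatures of the real data modules, and the Python functions only borrow the names name_from_tag and tag_from_name for readability.

```python
from typing import Optional

# Hypothetical miniature tables standing in for the real data modules.
LANGUAGE_NAMES = {"zh": "Chinese", "pny": "Pinyin"}
VARIANT_NAMES = {"pinyin": "Pinyin romanization"}


def name_from_tag(tag: str) -> str:
    """Forward lookup: prefer a variant description when the tag carries one."""
    subtags = tag.split("-")
    for subtag in subtags[1:]:
        if subtag.lower() in VARIANT_NAMES:
            return VARIANT_NAMES[subtag.lower()]
    return LANGUAGE_NAMES[subtags[0].lower()]


def tag_from_name(name: str) -> Optional[str]:
    """Reverse lookup: only the language table is searched, never the variants."""
    for code, language in LANGUAGE_NAMES.items():
        if language == name:
            return code
    return None


# The forward lookup yields a variant-derived name ...
variant_name = name_from_tag("zh-Latn-pinyin")
# ... which the reverse lookup cannot map back to any tag.
round_trip = tag_from_name(variant_name)
```

In this model the category name is built from variant_name, while the check on the category page uses round_trip, so the two can never agree for a variant-derived name.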
Here are some possible solutions (not at all completely thought out):
  1. override individual tags in Module:lang/data on a one-by-one basis
  2. create language-name link-labels, tool tips, and category names that include the variant tag – zh-Latn-pinyin
    • link-label: [[Chinese language|Chinese (pinyin variant)]]
    • tool-tip: Chinese (pinyin variant) language text
    • category: Articles containing Chinese (pinyin variant)-language text
  3. ignore variants when creating category names
  4. change category naming structure so that cat names include the base language code so that Module:Lang can work backwards from the variant description text to the proper base language (like those in Category:Articles with non-English-language sources made by {{In lang}})
  5. add an |ignore-cat-name-error= parameter to {{Category articles containing non-English-language text}}
Are there other solutions?
Trappist the monk (talk) 14:43, 20 September 2020 (UTC)[reply]
I'm not sure what #1 means. Regarding #2 and #3 - I'm less knowledgeable here if these are wanted or not and I'll guess that this is the main question here. #4 sounds like it goes with #2 (unless I missed something). #5 should be avoided. If we decide to categorize, then the backend should be able to support functions, including getting the correct category name; if we decide we don't, then the error is correct. In short, I think that we either do #2 or #3. --Gonnym (talk) 15:06, 20 September 2020 (UTC)[reply]
When something about a language code / name pair is not to en.wiki's liking, we override the name in Module:Lang/data. So for this example we add:
['zh-Latn-pinyin'] = {'Pinyin romanization'},
Doing this avoids the category-name-error but shows that there isn't an article or redirect Pinyin romanization language; should redirect to Pinyin.
But, Pinyin romanization is not a language so we shouldn't label it as a language. Pinyin romanization is a way of writing Chinese language. This is why I suggested #2. But, should zh-Latn-pinyin, a way of writing Chinese, be handled any differently than zh-Hant or zh-Hans, also ways of writing Chinese? I'm inclined to say no. And that suggests that for the purposes of Module:lang creating categories, tool tips, and link labels, a variant tag should be validated but otherwise ignored just as Module:lang ignores script and region tags (unless specifically overridden in Module:lang/data as, for example, the various en-?? tags).
Trappist the monk (talk) 15:45, 20 September 2020 (UTC)[reply]

@Trappist the monk, this is all theoretically interesting ... but in practice, what it amounts to for now is that you have emptied nearly 200 categories without discussion, and dumped another set of pages into Special:WantedCategories.

Whatever the merits of your case, this is not the way to pursue it. Please restore the categories pending the outcome of whatever proposal you make at consensus-forming discussion. And please ping me in any reply. --BrownHairedGirl (talk) • (contribs) 15:51, 20 September 2020 (UTC)[reply]

The empty categories were empty before any changes were made to {{Category articles containing non-English-language text}}. They were empty because the {{lang}} and {{lang-??}} don't emit category links with those names. For example, there is a category Category:Articles containing Levantine Arabic-language text. ISO 639 and IANA do not recognize Levantine Arabic as a language name:
{{lang|fn=tag_from_name|Levantine Arabic}} → apc
ISO 639 and IANA do recognize North Levantine Arabic (apc) and South Levantine Arabic (ajp). There are categories for both of those: Category:Articles containing North Levantine Arabic-language text and Category:Articles containing South Levantine Arabic-language text.
The changes that added the empty categories to Category:Lang and lang-xx template errors were made specifically to identify categories that should not exist and to identify categories that are misnamed so that all of these may be deleted. It is not clear to me why those fourteen red-linked categories suddenly appeared. For whatever reason, they are no longer red-linked so no-longer an issue.
The issues remaining are the non-empty categories listed in Category:Lang and lang-xx template errors:
Category:Articles containing Ainu-language text – discussed elsewhere on this talk page
Category:Articles containing Bodo-language text – discussed elsewhere on this talk page
Category:Articles containing explicitly cited English-language text – {{lang|fn=category_from_tag|en}} → Category:Articles containing explicitly cited English-language text; English in any form is not a non-English language
Category:Articles containing Mari-language text – same issue as Ainu & Bodo
Category:Articles containing Marwari-language text – there are three Marwari language codes: Marwari (Pakistan) (mve), Marwari (mwr), and Marwari (India) (rwr); I am not sure why tag_from_name() is returning the wrong name
Category:Articles containing Norwegian Nynorsk-language text – Dimmu Borgir discography has [[Category:Articles containing Norwegian Nynorsk-language text]] at the bottom
Category:Articles containing Pinyin romanization-language text – already discussed
Category:Articles containing simplified Chinese-language text – not a Module:Lang-created category link; likely from Module:Lang-zh
Category:Articles containing Tibetan-language text – should be Category:Articles containing Standard Tibetan-language text; not yet figured out what is populating this category
Category:Articles containing traditional Chinese-language text – not a Module:Lang-created category link; likely from Module:Lang-zh
Category:Articles containing Uyghur-language text – alternate spelling; likely populated by {{lang-ug}} which does not use Module:Lang
Category:Articles containing Valencian-language text – same issue as Pinyin romanization; name from variant tag (ca-valencia)
Trappist the monk (talk) 17:38, 20 September 2020 (UTC)[reply]
Ainu, Bodo, Mari, Marwari fixed.
Tibetan is populated by {{Bo}} which is not a {{lang}}-family template.
Trappist the monk (talk) 21:40, 20 September 2020 (UTC)[reply]

Enough of this

@Trappist the monk, nearly every time I have tried to engage with you over the last few months you have failed to ping me in reply. That leaves me to have to hunt down the conversation and check whether you have replied. Your reply of 17:38[1] is now the second time in one day that you again have chosen to reply to me without a ping, despite on this occasion being specifically asked to ping me.[2] (The first was your reply at 12:30[3].)

I do not know why you choose to persistently engage in this passive-aggressive behavior, but I have had enough of it. I will no longer try to engage with you.

Substantively, your reply doesn't deal with the fact that changes by you have created a situation where {{Lang}} populates a category, but {{Category articles containing non-English-language text}} no longer works to populate it ... and you offer no working remedy.

Over the last few years, I have created hundreds of these categories when they appear in Special:WantedCategories. They are tedious and time-consuming to produce, but I have always strived to do them properly, using the templates to create the relevant links.

However, you have now broken the system, without having something better in place ... and you repeatedly fail to communicate effectively about your changes. This non-communication goes beyond your sustained failure to ping: it includes your failure to use meaningful edit summaries on major changes to modules which affect millions of pages, e.g. this edit[4] by you, which depopulated a set of categories being discussed at Wikipedia:Categories for discussion/Log/2020 August 18#Category:Articles_with_text_from_the_Afro-Asiatic_languages_collective.

I have had enough of the persistent non-communication, and this sustained pattern of changes which screw up the work done by other editors without providing an alternative. I will no longer try to make these categories link into the lang system. When I encounter them at Special:WantedCategories (or my replica at https://quarry.wmflabs.org/query/30916), I will simply take the minimal step to remove the redlink: that is, I will create them with {{Tracking category}}, and move on. --BrownHairedGirl (talk) • (contribs) 00:46, 21 September 2020 (UTC)[reply]

Really?
Trappist the monk (talk) 11:28, 21 September 2020 (UTC)[reply]
Yes, that's how I intend to do it from now on. The Lang system no longer works as it used to; there is no documented workaround in place; and the editor who broke the old system refuses to communicate effectively. So I now have that block of text to paste into any such categories which appear as redlinks. --BrownHairedGirl (talk) • (contribs) 13:47, 21 September 2020 (UTC)[reply]
@BrownHairedGirl: You are being WP:POINTy. Please stop. If you believe Ttm has been so uncommunicative as to deserve that response, then you have WP:ANI to appeal to. I do not think they will look favorably on your actions. --Izno (talk) 15:48, 21 September 2020 (UTC)[reply]
No, I am not being at all WP:POINTy. Get yourself a mirror.
For several years, I have created these categories when they appear at Special:WantedCategories. They are slow to create: open the language article, look up the language code, check how it displays, do a few tests (because the code listed in the article doesn't always match the code which causes {{Lang}} to populate the article), then save.
They are the most time-consuming type of category to appear at SWC, but I have always taken whatever time is needed to do them properly ... and I have done many hundreds of them.
However, you have made changes to the lang system which means that these methods now don't work in some cases. When I asked you to resolve this, you engaged in repeated passive aggression: not pinging me in your replies, and adding lots of detail which doesn't answer my question of how I should now construct the categories. Since you offer no solution, I am not going to waste my time experimenting to find which options still work.
I have had enough of being messed around like this, so I have now started to make life easy for myself: instead of trying to use the lang system of templates, I simply create the category pages using {{Tracking category}}. That is a perfectly valid approach, because they are tracking categories ... and since WP:POINT describes disruption to make a point, this is not POINTy because it is not disruptive. It is clearly a perfectly valid approach to creating a page for a tracking category which already has non-zero population ... and if you or anyone else wants to develop the category further, then of course you should feel free to do so.
That solution works for me, so I will simply route around your passive aggression rather than go to the drama boards. If you don't like that, then of course, feel free to take this to WP:ANI... but beware of WP:BOOMERANG. You will do yourself no favours with a request to "Please punish this editor who has been messed around by my undocumented changes and now won't volunteer her time to play a guessing game after I (Trappist) chose to repeatedly screw her around with useless communications". But if you want to make that trip, it's your choice. --BrownHairedGirl (talk) • (contribs) 16:16, 21 September 2020 (UTC)[reply]
BHG, no one has changed the way the categories are created. The only recent change was adding in the error message to allow fast and easy identification of categories not being populated by the template. Most have been like that for at least 3 years; others even more. The issue you reported about a false positive being added to the error category is being fixed. As these are tracking categories and not user-facing reading material, the incorrect error message does almost no harm. --Gonnym (talk) 16:52, 21 September 2020 (UTC)[reply]
@BrownHairedGirl: You? I have not been involved in any of these changes. Please sort out who you think you're talking to. The talk page warning here was the soft version of my request; please don't make me follow through with a harder version. --Izno (talk) 17:12, 21 September 2020 (UTC)[reply]
Sorry, @Izno. I mistook who I was replying to.
But regardless of who I am replying to, my answer remains the same: there is nothing at all WP:POINTy in my decision to desist from using a broken template system, and if you want to go to ANI, feel free.
The way I used to create those cats has been broken. So I will now create them a simpler way, and if anyone wants to polish the cats afterwards, they are free to do so. If you think there is an ANI case on that, that's up to you. --BrownHairedGirl (talk) • (contribs) 17:29, 21 September 2020 (UTC)[reply]
@BrownHairedGirl: Indeed, I actually don't mind basically mindless filling in these red linked categories. What you shouldn't be doing is besmirching a good-faith editor (if "uncommunicative") in category-space with your edits. That's what I am asking you to stop doing. A simple {{tracking category}} will suffice, any editors who are interested in fixing them beyond that can do so (you may wish to tell them they are now blue-linked rather than assume they will follow you around, but that is your prerogative). --Izno (talk) 17:41, 21 September 2020 (UTC)[reply]
@Izno, I am not gonna create some sort of log of these categs. I create them, then move on.
What exactly is the problem in describing as uncommunicative an editor who you seem to be agreeing is uncommunicative? I leave that note there to avoid having to field questions about why I created the cats in that way. Do you want to suggest an alternative wording? --BrownHairedGirl (talk) • (contribs) 17:52, 21 September 2020 (UTC)[reply]
"Seem to" is funny; the statement is quote-marked for a reason (mostly to use your own words, rather than suggest that I agree with them; some might consider them scare-quotes but that was not my intention). If you believe he is uncommunicative, that is a thing for his talk page or for ANI, not for random (er, systematic) categories. If you are personally asked to answer for the creation of those categories (I am skeptical, but willing to answer the point), I would expect you to say "Please speak with Ttm. I am only filling in the red category.", which I would expect would suffice for most if not all people.

As I said, you may move on as you wish. --Izno (talk) 18:04, 21 September 2020 (UTC)[reply]

A barnstar for you!

The Random Acts of Kindness Barnstar
Thanks for your help on the lists! Is all this coding in your prayer books at the monastery haha? All the best! † Encyclopædius 12:05, 22 September 2020 (UTC)[reply]

CS1 suggestions auto-fixed

Now that languages are out of the way, I was looking at the other error/maint categories of CS1. Would it be wise to have the bot try and fix some of the errors here while using regex to change all (some? - maybe the most obvious typos?) suggestions here into their correct form? What could go wrong, if anything, while doing this? - Klein Muçi (talk) 11:19, 17 September 2020 (UTC)[reply]

Can we also safely auto-remove |class=<value> to fix errors for this category? - Klein Muçi (talk) 11:38, 17 September 2020 (UTC)[reply]
Yes, iff the value assigned to |arxiv= or |eprint= in that cs1|2 template has the form that does not support |class=.
Trappist the monk (talk) 14:33, 17 September 2020 (UTC)[reply]
What could go wrong? You and several of your friends are sitting around the campfire, drinking your favorite libations and swapping lies when one of them says, "Hey! Guys!, Watch this!" What could possibly go wrong?
The suggestions in Module:Citation/CS1/Suggestions are not always guaranteed to be correct. I suspect that sometimes they are mere guesses. Still, the correctly spelled other-language forms of a parameter name might be replaced without too much going wrong. But, perhaps the better solution for you, since sq:Kategoria:Faqe me burime që përmbajnë parametra të palejuar has relatively few members, is to concentrate on the errors that you have and not bother with ~/Suggestions and errors that you may never encounter. The most common error in Kategoria:Faqe me burime që përmbajnë parametra të palejuar seems to be the various forms of |dead-url= (which, alas, is going to plague us for some time to come because the now-unmaintained tools reFill and reFill2 continue to add |deadurl=y). |month= appears to be another one; combine that with the value in |year= to make |date= if |date= is not already present, and delete |month= and |year=.
Trappist the monk (talk) 14:33, 17 September 2020 (UTC)[reply]
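The |month= + |year= → |date= fix suggested above could look something like this in Python. It is a sketch under assumptions, not an existing bot task: the function name is made up, the template is assumed to be already parsed into a dict, and anything ambiguous (a |date= that already exists, or a missing |year=) is deliberately left untouched for a human to look at.

```python
def merge_month_year(params: dict) -> dict:
    """Fold |month= and |year= into |date= when |date= is absent."""
    month = params.get("month", "").strip()
    year = params.get("year", "").strip()
    has_date = bool(params.get("date", "").strip())

    if has_date or not (month and year):
        # Leave anything ambiguous alone rather than guess.
        return dict(params)

    # Drop the two old parameters (and any empty |date=) and add the merged one.
    merged = {k: v for k, v in params.items() if k not in ("month", "year", "date")}
    merged["date"] = f"{month} {year}"
    return merged
```

For example, merge_month_year({"title": "X", "month": "May", "year": "2009"}) gives {"title": "X", "date": "May 2009"}.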
Haha! A "here, hold my beer" scenario was what I had in mind when I made that question. :P Regarding the class error, the way I was imagining the solution was for the bot to check all the pages in that category and simply remove the class parameter. Would that be a safe solution? The logic being that since pages are already in that category, the CS1 module would have already made the needed checks.
As for the suggestions... I was thinking more of typos in English than the foreign not-recognized aliases. That is, as you too say, rare and I certainly don't wanna do another "all languages in all languages" for every parameter citation templates have. My script already got slowed down with more than 24 hours (went from a mere 30 minutes to 25 hours) in completion time after adding all the language regex-es. If you say typos are not safe to fix with a bot, I'm gonna agree with you. What would be the needed changes regarding the |dead-url= parameter? Can those fixes be regex-ified? I don't think the month + year = date deserves to be solved with a bot as I don't think it is a common occurrence, no?
Adding on that, do you think we can apply bot solutions at this category similar to what I said regarding |class=? Check the category, check the mentioned parameters there, remove wiki mark-up if found. Even though, judging by the text there, I don't think that task can be automatized as easily and it still requires human intervention to differentiate between different kinds of media, no? - Klein Muçi (talk) 15:47, 17 September 2020 (UTC)[reply]
If someone else has to hold your beer, you're doing it wrong.
Simply removing |class= from articles that have the error is not correct. You have to evaluate each usage of |class= so that you remove only those that are misused. |class= is valid for |arxiv=YYMM.#### and |arxiv=YYMM.##### forms so should not be removed.
When we replaced |dead-url= with |url-status=, I wrote a bot task; details at the task page.
|month= has been dead a long time but still pops up occasionally. Probably not worth too much effort ...
When we added markup detection, I wrote a bot task; details at the task page.
Trappist the monk (talk) 10:51, 18 September 2020 (UTC)[reply]
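The |class= test described above could be sketched like this. The function name and the dict input are hypothetical; the pattern only encodes the two forms named in the comment (YYMM.#### and YYMM.#####), and the optional version suffix (v2 and the like) is an assumption added here. When the template has no |arxiv= or |eprint= value at all, the sketch declines to decide.

```python
import re

# YYMM.#### or YYMM.#####, optionally followed by a version such as v2
# (the version suffix is an assumption of this sketch).
NEW_STYLE_ARXIV = re.compile(r"\d{4}\.\d{4,5}(?:v\d+)?")


def class_may_be_removed(params: dict) -> bool:
    """True only when |class= sits beside an identifier that cannot use it."""
    if "class" not in params:
        return False
    identifier = (params.get("arxiv") or params.get("eprint") or "").strip()
    if not identifier:
        # No identifier to judge by: leave it for a human.
        return False
    return NEW_STYLE_ARXIV.fullmatch(identifier) is None
```

So a template with |arxiv=1501.00001 |class=cs.CL is left alone, while one with |arxiv=hep-th/9901001 |class=hep-th is a candidate for removal.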
XD The reference was regarding the usual expression one drunk person says before "going on an adventure" without prior skills or information needed about the said "adventure". That's what I want to avoid (but usually end up involved anyway) when working with Smallem lately.
So what you're saying is that there may be articles that are in that category and practically have no real problem with the class parameter? Or that they do have problems but it can't simply be fixed by removing that?
I'm not sure what I should do regarding your bot jobs. Maybe I should be able to adapt them for Smallem? Of course that would have to take the course of recreating them as simple regex-es. I think I can do that for the specific 2-3 transformations that are needed regarding |dead-url= / |url-status= but I'm not sure how I would go about the job regarding markup detection. Maybe I should ask if Monkbot can work outside EnWiki? Although I'd like to have only one specific bot do all the changes regarding citations. - Klein Muçi (talk) 13:20, 18 September 2020 (UTC)[reply]
Now that I think of it... Shouldn't IA Bot take care of the aforementioned transformations? I'm confused. :/ - Klein Muçi (talk) 13:27, 18 September 2020 (UTC)[reply]
And, to end it with the questions regarding CS1 errors, is there a regex I can use for this category? - Klein Muçi (talk) 13:42, 18 September 2020 (UTC)[reply]
I wrote an awb script for that. I don't think that it is bot-able because quite often, the things that get trapped in that category are the result of vandalism or the result of unintentionally adding new article-text in the middle of a cs1|2 template (these for the CRLF errors). Fixing those kinds of errors requires humans. The script is wholly unpolished so I haven't published it but if you want a copy I can give it to you.
Trappist the monk (talk) 14:07, 18 September 2020 (UTC)[reply]
If you mean |dead-url= → |url-status=, I don't know if that is in IABot's remit. You'll have to ask over there.
Trappist the monk (talk) 14:07, 18 September 2020 (UTC)[reply]
Seems sort of silly to me to convert the monkbot tasks to pywikibot smallem. You are listed at sq:Wikipedia:AutoWikiBrowser/CheckPage, so get yourself a bot flag for smallem-awb. Import tasks monkbot 14 and 16, tweak them to name smallem-awb in the edit summaries, test to make sure they are working correctly after the import, and then switch to autosave.
Trappist the monk (talk) 14:07, 18 September 2020 (UTC)[reply]

Oh, well, I guess you could give it to me and I can check after every fix. I must warn you though that Albanian uses the letter Ë/ë a lot and every script fixing CRLF characters in Albanian at least should take care of not removing diacritics. We also use Ç/ç (and that completes the full list of our non-latin letters) but that's more rare.

As for the IABot's subject, I try to check its edits regularly on our project and I've spent a lot of time localizing its pages (userpage/meta page/interface) and I was sure that it did make that switch of parameters whenever it met them. But your doubt on the subject is making me doubt that too now. Maybe Cyber can give us some insight on that.

And finally, I have a problem fully grasping what you mean with "AWB jobs/tasks". I've used AWB in the past. (Even JWB.) But I've never used it with code. I've downloaded the program, set up the find and replace transformations I wanted to make, set up a summary, set up a database dump after downloading it (that's a step I've unfortunately forgotten how to do now) and had to press save manually after every edit. After getting tired of doing that and seeing there were no problems happening, I devised a simple script to press "Ctrl+S" every 2 seconds and that's as close to autosave in that program as I have ever been. :P But I know nothing of using code to operate it. How do you do that? - Klein Muçi (talk) 14:55, 18 September 2020 (UTC)[reply]

Umm, those are Latin-script characters:
Ë U+00CB LATIN CAPITAL LETTER E WITH DIAERESIS
ë U+00EB LATIN SMALL LETTER E WITH DIAERESIS
Ç U+00C7 LATIN CAPITAL LETTER C WITH CEDILLA
ç U+00E7 LATIN SMALL LETTER C WITH CEDILLA
cs1|2 doesn't care about them:
{{cite book |title=Ë/ë and Ç/ç}} → Ë/ë and Ç/ç. – no error.
It is not a matter of doubt. I just don't know because I don't pay much attention to that bot's operation unless it is doing something that it ought not be doing.
I think that if you create sq:User:smallem-awb and then add smallem-awb to sq:Wikipedia:AutoWikiBrowser/CheckPage under §Botët and then login to awb as user smallem-awb, you should see the Bots tab appear between the Skip and Start tabs. Auto-save is a checkbox on the Bots tab.
Trappist the monk (talk) 15:39, 18 September 2020 (UTC)[reply]
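On the worry about losing Ë/ë and Ç/ç while cleaning invisible characters: one way to avoid it, sketched here in Python with a made-up function name, is to filter by Unicode general category instead of by "anything outside ASCII". Control (Cc) and format (Cf) characters are dropped, letters and combining marks are untouched. Keeping newline and tab is a choice made for this sketch; a real cleanup of CRLF errors would decide that per case.

```python
import unicodedata

# Whitespace controls that this sketch chooses to keep.
KEEP = {"\n", "\t"}


def strip_invisible(text: str) -> str:
    """Drop control (Cc) and format (Cf) characters, keep everything else."""
    return "".join(
        ch
        for ch in text
        if ch in KEEP or unicodedata.category(ch) not in ("Cc", "Cf")
    )
```

Because Ë, ë, Ç and ç are in the letter categories (Lu and Ll), they pass through unchanged, as does a decomposed e followed by a combining diaeresis.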
You are right. I shouldn't have called them non-latin but the point is that every page I've found online that removes invisible characters, also removes the diaeresis and that's a big problem (also the cedilla but that can be usually fixed manually in no time).
I literally had no idea about that kind of functionality. Assuming I did this (judging by the other bots already there, I don't think I'll need a new account/userpage for the bot, just to add its current name there), apart from the auto-save checkbox, do I also get a specific page where to import your Monkbot's tasks? - Klein Muçi (talk) 17:13, 18 September 2020 (UTC)[reply]
Put them wherever you want. Settings and code files are stored on your local machine and run from there. I have a folder called Z:\Wikipedia\AWB\Monkbot_tasks (win 10). That folder holds the .xml settings files and the .cs code files. Because en.wiki requires prospective bot tasks to pass through the WP:BRFA gauntlet, I publish the code in Monkbot's userspace; I don't usually publish the settings files but can give you those if you want them.
Trappist the monk (talk) 18:49, 18 September 2020 (UTC)[reply]
Yeah, of course but the main problem is that I still know nothing about what you're saying apparently. I've never had to use files to operate AWB before. Is there somewhere I can learn about it? - Klein Muçi (talk) 09:54, 19 September 2020 (UTC)[reply]

Okay, I added Smallem as a bot user and logged in to AWB. I also saw that you have an option to open a file for settings at "File". So I guess I learned that. I'm supposing that's where you set up the settings files. Where do you open up the code files? Or am I messing it up? - Klein Muçi (talk) 10:11, 19 September 2020 (UTC)

Start simple: User:Monkbot/Task 0: null edit. Copy/paste that to a file on your local machine (on mine it's at Z:\Wikipedia\AWB\Monkbot_tasks\Monkbot_task_0_null_edit.xml) – file extension is important. Start awb. Use the File → Open settings menu to browse to and open your task 0 file. Log in as smallem. Make a list. On the Start tab, click Start. AWB should show you that it has added {{subst:null}} at the start of the page. On the Bots tab check Auto save. On the Start tab, click Start. AWB should start working through the list of articles.
Trappist the monk (talk) 10:41, 19 September 2020 (UTC)[reply]
Yes, it worked perfectly! Thank you! I'll try the other tasks now. I have a question though: I was thinking of adding the fix for |deadurl= -> |url-status= as a permanent regex fix for Smallem, thinking it will continuously pop up every now and then for a while. Am I right on that logic or the one time AWB task will be enough for it too? - Klein Muçi (talk) 11:09, 19 September 2020 (UTC)[reply]
Also, I tried doing the 2 other tasks. I just copy-pasted the script code without paying much attention to it. But it said it had an error. I tried continuing nonetheless and it worked, but none of them made any fixes to the articles in the specific categories. Could this be because of the said error, because I should make some kind of adaptation to the code I just copy-pasted blindly, or because none of our articles could benefit from those scripts at the moment? - Klein Muçi (talk) 11:50, 19 September 2020 (UTC)
Slow down. If you haven't already, copy the c# code from User:Monkbot/task 16: remove replace deprecated dead-url params#script and paste it into notepad++ or some other plain-text editor – if you have, do it again because I just updated it. Monkbot does not want to be responsible for edits that Smallem makes so at line 113, replace the text inside the quotes with an edit summary message that will be meaningful to sq.wiki editors. Save but don't close the file (mine is at Z:\Wikipedia\AWB\Monkbot_tasks\Monkbot_task_16_remove_replace_deprecated_dead-url_params.cs) – file extension is important.
Close awb and then restart so that you start afresh. Monkbot does not want to be responsible for automatic edits that awb makes so I always uncheck Auto tag, Apply general fixes, and Unicodify whole page on the Options tab whenever I start awb. Choose a category where you will find |dead-url= errors (for us that's Category:Pages with citations using unsupported parameters); don't click Make list yet. In Notepad++ Ctrl-A Ctrl-C the c# code. At awb, Tools → Make module. Check Enabled. In the text box at the bottom, Ctrl-A Ctrl-V to paste the c# code into awb (yeah, overwrite what is already there). Click Make module. After a pause you should get the green message Module compiled and loaded. Close the Module window. Back in awb, File → Save settings as ... save these settings (same name as the .cs file except with .xml file extension seems sensible). Next time you load these settings, you won't have to copy/paste the c# code; it is stored in the settings file. Log in as smallem, click Make list. As you did for task 0, run this task manually enough to become comfortable with what it is doing before switching to auto save. Check the edit summary to make sure it looks as you expect it to look.
Trappist the monk (talk) 12:06, 19 September 2020 (UTC)[reply]
Thank you genuinely for spending your time to explain it to me! Before I go on with what you wrote though, I want to ask you something: Will everything you wrote above work with the other task too? The one related to formatting. Of course, not minding the specific instructions about specific lines. The reason I ask is that, as I mentioned above, I decided to write a regex for that task given that I suspect it to be a recurring occurrence for a while (along with ref=harv), so I've made it part of the find-and-replace source code so that it can run automatically (periodically) for a while. - Klein Muçi (talk) 12:52, 19 September 2020 (UTC)
Yeah, it's the same process.
Trappist the monk (talk) 12:54, 19 September 2020 (UTC)[reply]

I was able to do all that except for the "set the summary" step. Can you tell me exactly which line(s) to erase so that I can set up a single general summary for all the needed changes? The summary is this: Rregullime automatike të gabimeve me referimet - Klein Muçi (talk) 13:55, 19 September 2020 (UTC)

I have updated the script so fetch a new copy. Line 5040; you can delete lines 5040 and 5041. Also at the bottom of the file, change the file name listed there to your file name. I've taken to putting the file name there because I sometimes have multiple instances of awb running each with a different module; the file name helps keep me organized.
Trappist the monk (talk) 14:14, 19 September 2020 (UTC)[reply]
Thank you! I was able to make it work like I wanted. What I actually did was remove everything and leave only line 5040. It gave me one error (saying I needed to set up a summary even though it had one) but I was able to trick it by putting a single space in AWB's summary placeholder. - Klein Muçi (talk) 14:46, 19 September 2020 (UTC)
What do you guys need from me?—CYBERPOWER (Around) 12:37, 22 September 2020 (UTC)[reply]
The question for you posed by Editor Klein Muçi was: Shouldn't IA Bot take care of the aforementioned transformations? That question refers to changing the no-longer-supported |dead-url= and |deadurl= parameters to the |url-status= with the appropriate live or dead keywords.
Trappist the monk (talk) 13:02, 22 September 2020 (UTC)[reply]

Category cleanup

Have any idea why the following are producing errors but are valid categories?

Probably will add to the list as I continue with the cleanup. --Gonnym (talk) 18:19, 19 September 2020 (UTC)[reply]

Misnamed categories because the proper language names are overridden in that abomination that is Module:Language/data/wp languages merely for the purpose of suppressing the disambiguation.
From iana:
["aib"] = {"Ainu (China)"},
["ain"] = {"Ainu (Japan)"},
["boy"] = {"Bodo (Central African Republic)"},
["brx"] = {"Bodo (India)"},
from ~/wp languages:
["ain"] = {"Ainu"},
["brx"] = {"Bodo"},
The categories should be:
Category:Articles containing Ainu (China)-language text
Category:Articles containing Ainu (Japan)-language text
Category:Articles containing Bodo (Central African Republic)-language text
Category:Articles containing Bodo (India)-language text
To properly fix this we need to do as I suggested at Template talk:Lang § deprecated ISO 639 language codes so that we have a clean data set. It's my intent to start on that today sometime, real-life permitting.
Trappist the monk (talk) 19:12, 19 September 2020 (UTC)[reply]
Are any of these Chinese names going to be populated by the lang template?
Are any of the English variants going to be populated by the lang template?
Almost done with the cleanup, just a few more left. --Gonnym (talk) 23:38, 23 September 2020 (UTC)[reply]
You want me to predict the future? I can imagine that any of these except 'Hepburn romanization', 'Hong Kong Chinese in traditional script', 'traditional Chinese (HK)', and 'variant English' might be created in some form in the future. I tweaked Module:Lang/sandbox so that it will create categories for the various regional English tags listed in Module:Lang/data:
  • Category:Articles containing explicitly cited Australian English-language text
  • Category:Articles containing explicitly cited Canadian English-language text
  • Category:Articles containing explicitly cited Early Modern English-language text
  • Category:Articles containing explicitly cited British English-language text
  • Category:Articles containing explicitly cited Irish English-language text
  • Category:Articles containing explicitly cited Indian English-language text
  • Category:Articles containing explicitly cited New Zealand English-language text
  • Category:Articles containing explicitly cited American English-language text
  • Category:Articles containing explicitly cited South African English-language text
I haven't looked but it would not surprise me to find many or all of these in the articles listed in Category:Articles containing explicitly cited English-language text. Defer a decision about the regional English cats until after the module update but nuke the others?
Trappist the monk (talk) 00:05, 24 September 2020 (UTC)[reply]
You want me to predict the future? I somehow knew that no matter how I'd phrase my question I'd get something like that :) I meant, in your /sandbox changes if the above categories are valid. --Gonnym (talk) 00:09, 24 September 2020 (UTC)[reply]
It is said that Niels Bohr (who had a much bigger brain than I) once made a remark something like: "Making predictions is difficult; especially about the future." Yeah, probably apocryphal, but ... Only the regional English cats are supported in ~/sandbox.
Trappist the monk (talk) 00:20, 24 September 2020 (UTC)[reply]

CS1 - Meta and Smallem

Hey, Trappist!

I had a question and a fact to give to you.

Question: I was checking the source code of Smallem randomly to see if I could do any regex optimizations on it now that I've completed giving it most of the tasks it can solve. I noticed these 4 lines:

  • "\|\s*language\s*=\s*Abasinisch\b" "|language=abq" \
  • "\|\s*language\s*=\s*Abasinisch\b" "|language=abq-latn" \
  • "\|\s*language\s*=\s*Abaza\b" "|language=abq" \
  • "\|\s*language\s*=\s*Abaza\b" "|language=abq-latn" \

Don't they seem a bit odd? You know what I mean. I don't know how many lines could be like this (I only noticed these because they were at the beginning of the code since I have them lexicographically sorted) or what I should do with them.

Fact: As we discussed some time ago, I took the liberty of creating this Meta discussion about the CS1 module. My hope is to get volunteers to help in practical ways in creating the Meta infrastructure of the CS1 system - you'll understand what I mean by "system" when you read the discussion. Of course, if that happens, your help in guiding us would be necessary, but I'm not too optimistic about the project yet so I didn't ping you on it. Feel free to participate in it though if you want. I hope you agree with everything I've said there (I feel like I've added nothing new that we haven't discussed before). - Klein Muçi (talk) 13:16, 27 September 2020 (UTC)

The tags are legitimate:
German names:
{{#language:abq|de}} → Abasinisch
{{#language:abq-latn|de}} → Abasinisch
English names:
{{#language:abq|en}} → Abaza
{{#language:abq-latn|en}} → Abaza
I don't know how many simple language tags are paired with IETF language tags; likely not all that many. If this duplication is not causing problems, is it worth the effort to 'fix' it?
In the meta discussion, I think that you have stated my position accurately. It would be good to see that produce tangible results. If it does, let me know.
Trappist the monk (talk) 13:59, 27 September 2020 (UTC)[reply]
Glad to hear that. :) As for the tags, the problem is that I don't know how Smallem will react to them but I suspect they will bring problems. The reason for that is because:
  • |language=Abasinisch → |language=abq
AND
  • |language=Abasinisch → |language=abq-latn
Simultaneously
Maybe I should do a manual run and simulate an experiment to see what happens. - Klein Muçi (talk) 14:30, 27 September 2020 (UTC)[reply]
Not simultaneously. One at a time. If done in the order that they are listed here, Smallem will find |language=Abasinisch and replace it with |language=abq. The next search for |language=Abasinisch will find nothing so Smallem will move on to whatever pattern follows next.
Trappist the monk (talk) 14:38, 27 September 2020 (UTC)[reply]
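The one-at-a-time behaviour described above can be sketched in a few lines of Python. This is only an illustration: the `RULES` list and the `apply_rules` helper are hypothetical names, and Smallem's real find-and-replace pass is assumed (not confirmed) to work in the same ordered fashion.

```python
import re

# Ordered (pattern, replacement) pairs, copied from the two Abasinisch lines
# quoted in this thread. The list name and helper are illustrative only.
RULES = [
    (r"\|\s*language\s*=\s*Abasinisch\b", "|language=abq"),
    (r"\|\s*language\s*=\s*Abasinisch\b", "|language=abq-latn"),
]


def apply_rules(text, rules):
    """Apply each rule in list order; return the new text and per-rule hit counts."""
    counts = []
    for pattern, replacement in rules:
        text, hits = re.subn(pattern, replacement, text)
        counts.append(hits)
    return text, counts


sample = "{{cite book |title=X |language = Abasinisch}}"
result, counts = apply_rules(sample, RULES)
print(result)   # the first rule rewrote the parameter
print(counts)   # the second rule found nothing left to match
```

Because the first rule has already consumed the language name, the second rule with the same pattern never fires, which is why the duplicate entries are harmless but also redundant.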
Oh! I see. Well, in that case I should try and remove the "second entry" from each language in the code, to make it a tiny bit faster. That was the initial intention when I randomly found out these details. Would it be wise to search for "-" and therefore manually see cases like this? - Klein Muçi (talk) 14:45, 27 September 2020 (UTC)[reply]

After move issues

--Gonnym (talk) 11:36, 30 September 2020 (UTC)[reply]

No doubt more categories will show up... {{transl}} errors are on the rise too. The ones I checked were ISO 639-2, -3 codes with -1 equivalents.
I think that the Module talk:Lang/testcases/ISO 639-3-3 name from tag fails because Module:Lang/documentor tool/sandbox does not de-dab names taken from the override list. And why is ~/testcases/ISO 639-3-3 name from tag still using the ~/documentor tool/sandbox?
I'll attend to the documentation.
Trappist the monk (talk) 11:45, 30 September 2020 (UTC)[reply]
Manually did a null update on the pages of the smaller categories in the error category. A few bigger ones are now left. We also need to wait for the ~/data/name and ~/wp languages modules to update transclusions and see if they can be sent to TfD.
Category:Articles with Serbo-Croatian-language sources (hbs) and Category:Articles with Moldovan-language sources (mol) were added to the error category, and I was wondering whether {{In lang}} following IANA is correct, or whether we do want to be able to note that a source is in such languages. What do you think? --Gonnym (talk) 17:15, 30 September 2020 (UTC)
The ISO 639-3 custodian says that hbs is active and that its ISO 639-1 equivalent, sh, is deprecated; see iso639-3:hbs. IANA has this for sh so not listed as deprecated (there would be a Deprecated: YYYY-MM-DD item in the record):

Type: language
Subtag: sh
Description: Serbo-Croatian
Added: 2005-10-16
Scope: macrolanguage
Comments: sr, hr, bs are preferred for most modern uses

But, hbs is not listed in the source we use for synonyms. How Module:Lang should handle this oddball is a puzzlement.
Category:Articles with Serbo-Croatian-language sources (hbs) should go away. A lot of the errors accumulating in Category:Lang and lang-xx template errors are hbs errors. I've tweaked my code-promotion awb script to include the hbs → sh promotion.
For mo and mol, both deprecated, we might add support for ro-MD to the override data.
While I was writing the awb script to update the IANA data modules, I began second-guessing my decisions to keep deprecated codes out of the IANA modules; the codes really are in the registry file so they should be included in the data modules. And that brings us back to the question of how to inform editors that the codes they are providing to {{lang}} are deprecated? I intend to restart the deprecated-codes discussion at Template talk:Lang.
I do not think that {{in lang}} should be changed away from the IANA data set. Where would it go? ISO 639 name? Then it would lose the capability to support IETF language tags.
Trappist the monk (talk) 19:05, 30 September 2020 (UTC)[reply]
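The "is there a Deprecated: item in the record" check mentioned above could be sketched like this. It is a minimal illustration, not how the awb update script works: `parse_record` and `is_deprecated` are hypothetical helpers, and the sketch assumes one `Key: value` field per line with no repeated or folded fields.

```python
# The sh record as quoted in this thread.
SH_RECORD = """\
Type: language
Subtag: sh
Description: Serbo-Croatian
Added: 2005-10-16
Scope: macrolanguage
Comments: sr, hr, bs are preferred for most modern uses
"""


def parse_record(text):
    """Split 'Key: value' lines into a dict (assumes single-line, unique fields)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon
            fields[key.strip()] = value.strip()
    return fields


def is_deprecated(fields):
    """A record counts as deprecated when it carries a Deprecated field."""
    return "Deprecated" in fields


sh = parse_record(SH_RECORD)
print(is_deprecated(sh))  # sh carries no Deprecated field
```

A record containing a line such as `Deprecated: YYYY-MM-DD` would make `is_deprecated` return true; that is the signal a data module could use to flag the tag.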
My question regarding In lang was because (and correct me if I'm wrong), the reason we follow IANA for lang is so we follow the html specifications. But In lang is not for html text on page, but for outside sources. So my question was if it should follow it, or enable even more languages (from where? I have no idea. I just thought about it because I saw those two categories enter the error cat). Regarding the deprecated message for the reader. Does it matter to the end-user if they are getting a deprecated language tag? If not, then there is no need for a message. We can add a tracking category for those usages if that is needed (but is it needed? Is there a downside to viewing a deprecated language tag?). --Gonnym (talk) 19:14, 30 September 2020 (UTC)[reply]
No, {{in lang}} uses Module:Lang so that it has support for IETF language tags. IANA has all of the language tags from ISO 639-1, ISO 639-2T, ISO 639-3, and ISO 639-5 except the deprecated language tags, the ISO 639-2B language tags, and any ISO 639-2, -3, -5 language tags that have ISO 639-1 equivalents.
Trappist the monk (talk) 19:37, 30 September 2020 (UTC)[reply]
I've noticed that some pages have the language categories (the subcategories of Category:Articles containing non-English-language text) manually added (see [5]). I was wondering if we have any way to track and fix this? --Gonnym (talk) 09:06, 1 October 2020 (UTC)
This cirrus search:
insource:"Category:Articles containing" insource:/\[Category:Articles containing [A-Za-z ]+\-language text/ → ~480 hits
Trappist the monk (talk) 10:02, 1 October 2020 (UTC)[reply]
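For anyone who wants to run the same check offline (for example over a database dump), the regex part of that search can be tried in Python. This is a sketch only: the helper name is hypothetical, and Python's `re` is not the same engine that CirrusSearch uses, though this particular pattern is assumed to behave the same in both.

```python
import re

# Same pattern as the insource:/.../ part of the search above.
MANUAL_CAT = re.compile(r"\[Category:Articles containing [A-Za-z ]+\-language text")


def has_manual_language_category(wikitext):
    """True when the wikitext contains a hand-written language category link."""
    return MANUAL_CAT.search(wikitext) is not None


print(has_manual_language_category("[[Category:Articles containing French-language text]]"))
print(has_manual_language_category("{{lang|fr|bonjour}}"))
```

Note that `[A-Za-z ]+` only allows letters and spaces, so disambiguated names such as 'Old English (ca. 450-1100)' would not be found by this pattern.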
Thanks! Removed all usages from that and from insource:"Category:Articles with" insource:/\[Category:Articles with [A-Za-z ]+\-language sources/ and insource:"Category:CS1" insource:/\[Category:CS1[A-Za-z ]+\-language sources/. Couldn't get the search to find any in style of "Category:Articles with text from the Berber languages collective" and "Category:Articles with Berber languages-collective sources (ber)", so either my code was wrong or there are none. --Gonnym (talk) 11:26, 1 October 2020 (UTC)[reply]
Is it (not painfully) possible to rename in the code the categories which are disambiguated with a date range to use an en-dash per MOS:DATERANGE? So Category:Articles containing Old English (ca. 450-1100)-language text will become Category:Articles containing Old English (ca. 450–1100)-language text. --Gonnym (talk) 12:00, 1 October 2020 (UTC)[reply]
If we do that then ought we not also enforce compliance with MOS:CIRCA? And MOS:ERA (standardize on one of BC–AD or BCE–CE)?
A quick hunt through the testcases found these:
Old English (ca. 450-1100) → Old English (c. 450 – 1100)
Middle Dutch (ca. 1050-1350) → Middle Dutch (c. 1050 – 1350)
Middle French (ca. 1400-1600) → Middle French (c. 1400 – 1600)
Old French (842-ca. 1400) → Old French (842 – c. 1400)
Middle High German (ca. 1050-1500) → Middle High German (c. 1050 – 1500)
Old High German (ca. 750-1050) → Old High German (c. 750 – 1050)
Middle Irish (900-1200) → Middle Irish (900–1200)
Ottoman Turkish (1500-1928) → Ottoman Turkish (1500–1928)
Old Persian (ca. 600-400 B.C.) → Old Persian (c. 600 – 400 BC)
Jewish Babylonian Aramaic (ca. 200-1200 CE) → Jewish Babylonian Aramaic (c. 200 – 1200 CE)
Do we really need to do this? Is someone complaining?
Trappist the monk (talk) 13:11, 1 October 2020 (UTC)[reply]
Which was why I asked if it wasn't painful to hack the fixes :) And you are right, if we do it we should take into account the two other you pointed out. --Gonnym (talk) 13:17, 1 October 2020 (UTC)[reply]
A few more date categories:
Category:Articles containing Middle English (1100-1500)-language text
Category:Articles containing Old Aramaic (up to 700 BCE)-language text
Category:Articles containing Ancient Greek (to 1453)-language text
Category:Articles containing Old Provençal (to 1500)-language text
Category:Articles containing Old Irish (to 900)-language text
Category:Articles containing Occitan (post 1500)-language text
Category:Articles containing Middle Korean (10th-16th cent.)-language text
Category:Articles containing Old Korean (3rd-9th cent.)-language text
--Gonnym (talk) 13:32, 1 October 2020 (UTC)[reply]
Of those, I have stricken four because nothing to do. For the others:
Middle English (1100-1500) → Middle English (1100–1500)
Old Aramaic (up to 700 BCE) → only an issue if we elect to standardize on BC–AD (not my preference)
Middle Korean (10th-16th cent.) → Middle Korean (10th – 16th cent.) or Middle Korean (10th – 16th century)
Old Korean (3rd-9th cent.) → Old Korean (3rd – 9th cent.) or Old Korean (3rd – 9th century)
Trappist the monk (talk) 13:50, 1 October 2020 (UTC)[reply]
"Old Aramaic (up to 700 BCE)" should change (if we change stuff) anyways as it says "up to" while the 3 stricken categories say "to". --Gonnym (talk) 14:04, 1 October 2020 (UTC)[reply]
Is there some MOS requirement that prohibits the use of 'up to' when the phrase precedes a date? If not then leave it alone; eschew special cases because they are special cases.
Trappist the monk (talk) 14:14, 1 October 2020 (UTC)[reply]
Just consistency between the titles, but it isn't a big issue. Old Persian and Jewish Babylonian Aramaic probably don't need the eras as the dates are in the same era and the other categories aren't using them. --Gonnym (talk) 14:23, 1 October 2020 (UTC)[reply]
Pretty sure you do need the BCE with Old Aramaic else you don't know if 'up to 700' means 700 BCE/BC or 700 CE/AD. Jewish Babylonian Aramaic probably doesn't need CE because current era can be assumed. Only example of the use of CE in a disambiguator? If so, special case...
Trappist the monk (talk) 14:37, 1 October 2020 (UTC)[reply]
Commenting on my deleted text :) I agree that 700 is unclear. However, by that thought so is Old Irish (to 900) and the other 3. --Gonnym (talk) 16:34, 1 October 2020 (UTC)[reply]
That's why I said that the current era can be assumed. Because Old Aramaic (up to 700 BCE) includes the era designator BCE we know that it is not 'up to 700' in the (assumed) current era (CE). All other of those disambiguators do not include an era designator so may be assumed to be in the current era. This appears to be in keeping with MOS:ERA which states, in part: In general, do not use CE or AD unless required to avoid ambiguity. It 'appears' to suggest that the current era may be assumed, though doesn't actually say that. Still, if 'Jewish Babylonian Aramaic (ca. 200-1200 CE)' is the only CE-marked date, it would be a special case to 'fix' it.
Trappist the monk (talk) 17:01, 1 October 2020 (UTC)[reply]
Ah, I see what you mean. Yeah, it makes sense that they are assumed. I'm not sure what the fix entails so can't really comment. But "Jewish Babylonian Aramaic" has both a dash issue and a ca. issue, so it would need fixing anyway, regardless of CE. If it's already being changed, then unless some heavy hack is needed, there's not a lot of reason not to remove the CE to match the others without it. --Gonnym (talk) 17:08, 1 October 2020 (UTC)
Not onerous. One could write stuff like this for each unique disambiguator:
=string.gsub ('Jewish Babylonian Aramaic (ca. 200-1200 CE)', '(%s+%d+)%-(%d+)', '%1 – %2'):gsub ('ca%.', 'c.'):gsub (' CE%)', ')')
Probably better to split the task into a table of appropriate patterns such that the pattern matching and replacement are confined to within the disambiguator...
Trappist the monk (talk) 17:34, 1 October 2020 (UTC)[reply]
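The "confine the replacement to the disambiguator" idea can be sketched in Python rather than Lua. Everything here is an assumption made for illustration: the function names are hypothetical, the spacing rule (spaced en dash when the disambiguator contains a space, unspaced otherwise) is one reading of the examples listed earlier in this thread, and era designators such as CE are deliberately left alone.

```python
import re

# Language name followed by one trailing parenthetical disambiguator.
DAB = re.compile(r"^(?P<name>.+?) \((?P<dab>[^()]*)\)$")


def normalize_disambiguator(dab):
    """Tidy a date-range disambiguator: en dash, 'c.' for 'ca.', 'BC' for 'B.C.'."""
    # Spaced en dash when an endpoint contains a space, unspaced otherwise.
    dash = " \u2013 " if " " in dab else "\u2013"
    dab = re.sub(r"(?<=\w)-(?=\w)", dash, dab)
    dab = re.sub(r"\bca\.", "c.", dab)
    return dab.replace("B.C.", "BC")


def normalize_name(name):
    """Rewrite only the parenthetical; names without one are returned unchanged."""
    m = DAB.match(name)
    if m is None:
        return name
    return f"{m['name']} ({normalize_disambiguator(m['dab'])})"


for example in ("Old English (ca. 450-1100)", "Middle Irish (900-1200)", "Old Irish (to 900)"):
    print(normalize_name(example))
```

Because the hyphen replacement only ever sees the text inside the parentheses, a hyphen in the language name itself is not touched.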

testcases

Added three new /testcases for the deprecated tags and also included the override tags, as I noticed we weren't testing them anywhere. Module:Lang/testcases/ISO 639 deprecated and override tag from name has 24 fails and Module talk:Lang/testcases/ISO 639 deprecated and override name from tag somehow has a very short list, which I don't know how to fix. As an example, "grk-x-proto" is missing from there, but {{Lang|fn=name_from_tag|grk-x-proto}} is valid: Proto-Greek. Any idea? --Gonnym (talk) 10:14, 2 October 2020 (UTC)

For Module talk:Lang/testcases/ISO 639 deprecated and override name from tag, all the tests are two-character; the override table has two-, three-, and IETF tags. Are you constraining the test to two-character tests only?
I must away for a large portion of today.
Trappist the monk (talk) 11:19, 2 October 2020 (UTC)[reply]
Duh! Yeah, you were right :) Only one fail in there: {{Lang|fn=name_from_tag|en-SA}}: English instead of South African English. --Gonnym (talk) 11:39, 2 October 2020 (UTC)[reply]
Further proof that wherever the code/name definitions in Module:Language/data/wp_languages came from, a lot are wrong; en-SA is English as spoken in Saudi Arabia; ZA is South Africa.
Trappist the monk (talk) 20:05, 2 October 2020 (UTC)[reply]
What a mess. So only issue is the 23 fails at Module talk:Lang/testcases/ISO 639 deprecated and override tag from name. Are those because the sandbox has updated code or valid errors? --Gonnym (talk) 21:15, 2 October 2020 (UTC)[reply]
They are errors but not errors. The deprecated test expects that {{lang|fn=tag_from_name|Hebrew}} will return iw but instead gets he. This is because, for Hebrew, iw is deprecated but he is not deprecated. {{lang/sandbox}} is doing the correct thing when it returns the active tag for the language name.
I suspect that a tweak to the reference data set assembled by the testcase is required. When a language name has both active and deprecated tags, skip the language name.
Trappist the monk (talk) 21:40, 2 October 2020 (UTC)[reply]
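The suggested tweak (skip a language name when it has both an active and a deprecated tag) might look something like this. It is a sketch with made-up sample data, not the testcase module's actual code: the function name and both dictionaries are hypothetical.

```python
def deprecated_only_expectations(active, deprecated):
    """Build name -> deprecated tag, skipping names that also have an active tag.

    Both arguments map a tag to its list of language names.
    """
    active_names = {name for names in active.values() for name in names}
    expected = {}
    for tag, names in deprecated.items():
        for name in names:
            if name in active_names:
                continue  # tag_from_name would return the active tag here
            expected.setdefault(name, tag)
    return expected


# Illustrative data only: Hebrew has active 'he' and deprecated 'iw'.
active = {"he": ["Hebrew"]}
deprecated = {"iw": ["Hebrew"], "mo": ["Moldovan"]}
print(deprecated_only_expectations(active, deprecated))
```

With this filter the Hebrew case drops out of the reference set, so a tag_from_name call returning he instead of iw would no longer be counted as a failure.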

Module:Mw lang

Hey, can you take a look at Module:Mw lang. Module talk:Mw lang/testcases shows that "code_from_name" is failing on 245 cases. --Gonnym (talk) 08:18, 6 October 2020 (UTC)[reply]

Fixed. What remains is multiple tags assigned to a common name.
Trappist the monk (talk) 10:24, 6 October 2020 (UTC)[reply]

Question on Module:ISO 639 name and Module:Language/data/ISO 639-1

In Module:ISO 639 name we use Module:Language/data/iana languages for the -1 codes, while we use the ISO-# modules for the other codes. I was wondering why we don't use Module:Language/data/ISO 639-1 here. --Gonnym (talk) 12:29, 26 September 2020 (UTC)[reply]

Before I saw you use Module:Language/data/ISO 639-1, I'm pretty sure that I was unaware of its existence.
Trappist the monk (talk) 12:53, 26 September 2020 (UTC)[reply]
Yeah, had a feeling as I also wasn't aware of it. Is there a reason we shouldn't be using it here? --Gonnym (talk) 13:23, 26 September 2020 (UTC)[reply]
None that I can think of except that it's one more module to update from the IANA language-subtag-registry file when it comes time to do that...
Trappist the monk (talk) 14:00, 26 September 2020 (UTC)[reply]
True, but even though it is currently unused, I don't think it would be deleted as it is part of the ISO 639 set, which means it would need to be updated anyways. Having it used here would at least mean that it would not be out of sync from the others. --Gonnym (talk) 14:09, 26 September 2020 (UTC)[reply]
I've updated Module:Language/data/iana languages/make so that it also extracts an ISO 639-1 table for Module:Language/data/ISO 639-1.
Trappist the monk (talk) 15:03, 26 September 2020 (UTC)[reply]
I've updated the /sandbox with the 639-1 module and also commented out some code that checks for 2 or 3 letters, as it seems (to me) unnecessary now that the lists are split, but I might be mistaken. The /testcases are all green except for 5 in Module talk:ISO 639 name/testcases which now use "not_found" instead of "not_code" for the error message. Can you take a look and see if the commented-out code is still needed or not? --Gonnym (talk) 15:46, 26 September 2020 (UTC)
Yeah, I think that chunk of code can go away.
Trappist the monk (talk) 15:55, 26 September 2020 (UTC)[reply]
In Module:Language/data/ISO 639 name to code/make, switch "~/iana languages" also? Is Module:Language/data/ISO 639-1/make used to make the file or the list done from the iana make? --Gonnym (talk) 16:28, 26 September 2020 (UTC)[reply]
Switched.
Trappist the monk (talk) 16:52, 26 September 2020 (UTC)[reply]
After your edit to the /doc at Module:Language/data/ISO 639-1, I understand that Module:Language/data/ISO 639-1/make can be sent to TfD, correct? --Gonnym (talk) 10:39, 30 September 2020 (UTC)[reply]
I don't think it should be deleted. AWB is a windows-only tool so otherwise qualified editors who edit with other systems would not be able to do the update.
Trappist the monk (talk) 10:45, 30 September 2020 (UTC)[reply]
But you removed all mention of it from the /doc. How would they even know about it? --Gonnym (talk) 11:00, 30 September 2020 (UTC)[reply]
Point. TfD it.
Trappist the monk (talk) 11:29, 30 September 2020 (UTC)[reply]

Talk:Stop Mknai (talk) 21:39, 6 October 2020 (UTC)[reply]

Did you know about the external links in the style of iso639-3:bcp? Do you know where this is created and how? Seems very strange we have external links without any sign these are not Wikipedia links. --Gonnym (talk) 11:59, 7 October 2020 (UTC)[reply]

Interwiki prefix. See meta:Interwiki map.
Trappist the monk (talk) 12:14, 7 October 2020 (UTC)[reply]

Lang error category sort key

Could you change the sort key of the error messages produced by Module:Lang so that the key is the ISO code? That makes it easier to fix groups of pages. I tried doing it but couldn't get to the code without changing too many functions along the way and messing something up. --Gonnym (talk) 14:10, 9 October 2020 (UTC)

Can be done but is it really necessary? Are you expecting that we will be suddenly getting a huge number of errors?
Trappist the monk (talk) 15:28, 9 October 2020 (UTC)[reply]
Well, there are over 100 in the transl category. But I guess you are right that if there isn't going to be an unexpected surge, there is no need. --Gonnym (talk) 16:04, 9 October 2020 (UTC)[reply]