User talk:The Transhumanist/RedlinksRemover.js

This is the workshop support page for the user script RedlinksRemover.js. Comments and requests concerning the program are most welcome. Please post discussion threads below the section titled Discussions. Thank you. By the way, the various scripts I have written are listed at the bottom of the page.[1]
This script is functional

This script processes bulleted lists, removing the redlinked end nodes repeatedly until none are left. (A redlinked end node is a list item that consists of nothing more than a redlink and has no children.) After that, the script delinks the remaining red links and deletes red category links. It doesn't remove list items that have annotations or children (indented entries beneath them).

Script's workshop

This is the work area for developing the script and its documentation. The talk page portion of this page starts at #Discussions, below.

Description / instruction manual

The redlink remover has two major uses (but it is not limited to these):

  1. It can help clean up outlines that have accumulated too many redlinks.
  2. It simplifies creation of outlines using standard templates. A problem with outline generation templates is that they include every possible link that a particular type of topic (say, provinces or cities) might have, which creates outlines with lots of red links. Following up outline creation with this script solves that problem. Tip: it is best to work on the outline with its redlinks for a while before using the redlink remover, because the script delinks those redlinks that have children, leaving them in as informative branches in the outline. Removing redlinks too early creates extra work, as many of the topics may need to be added back in or relinked.
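
The removal pass described at the top of this page can be sketched in plain JavaScript as follows. This is a minimal illustration, not the script's actual code: the function names (bulletDepth, isRedlinkedEndNode, removeRedlinkedEndNodes) are hypothetical, the red link titles are assumed to be available already as an array of strings, and the cap of 10 passes is borrowed from the feature notes further down this page.

```javascript
// Depth of a bulleted wikitext line: the number of leading asterisks (0 if not a list item).
function bulletDepth( line ) {
    var match = /^\*+/.exec( line );
    return match ? match[ 0 ].length : 0;
}

// True if line i is a list item holding nothing but a red link, with no children.
function isRedlinkedEndNode( lines, i, redlinks ) {
    var match = /^\*+\s*\[\[([^\]|]+)(?:\|[^\]]*)?\]\]\s*$/.exec( lines[ i ] );
    if ( !match || redlinks.indexOf( match[ 1 ].trim() ) === -1 ) {
        return false; // annotated, not a bare link, or not a red link
    }
    var nextDepth = i + 1 < lines.length ? bulletDepth( lines[ i + 1 ] ) : 0;
    return nextDepth <= bulletDepth( lines[ i ] ); // a deeper next line would be a child
}

// Repeat the pass until nothing changes, or until 10 passes have run.
function removeRedlinkedEndNodes( wikitext, redlinks ) {
    var lines = wikitext.split( '\n' );
    var changed = true;
    for ( var pass = 0; changed && pass < 10; pass++ ) {
        var kept = lines.filter( function ( line, i ) {
            return !isRedlinkedEndNode( lines, i, redlinks );
        } );
        changed = kept.length !== lines.length;
        lines = kept;
    }
    return lines.join( '\n' );
}
```

Each pass only removes items that are end nodes at that moment, so a redlinked parent is removed on a later pass, once its redlinked children are gone. Annotated items never match the pattern and are left alone.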

How to install this script

Important: this script was developed for use with the Vector skin (it's Wikipedia's default skin), and might not work with other skins. See the top of your Preferences appearance page, to be sure Vector is the chosen skin for your account.

To install this script, add this line to your vector.js page:

importScript("User:The Transhumanist/RedlinksRemover.js");

Save the page and bypass your cache to make sure the changes take effect. By the way, only logged-in users can install scripts.

Explanatory notes (source code walk-through)

This section explains the source code, in detail. It is for JavaScript programmers, and for those who want to learn how to program in JavaScript. Hopefully, this will enable you to adapt existing source code into new user scripts with greater ease, and perhaps even compose user scripts from scratch.

You can only use so many comments in the source code before you start to choke or bury the programming itself. So, I've put short summaries in the source code, and have provided in-depth explanations here.

My intention is threefold:

  1. to thoroughly document the script so that even relatively new JavaScript programmers can understand what it does and how it works, including the underlying programming conventions. This is so that the components and approaches can be modified, or used again and again elsewhere, with confidence. (I often build scripts by copying and pasting code that I don't fully understand, which often leads to getting stuck.) To prevent getting stuck, the notes below include extensive interpretations, explanations, instructions, examples, and links to relevant documentation and tutorials, etc. Hopefully, this will help both you and me grok the source code and the language it is written in (JavaScript).
  2. to refresh my memory of exactly how the script works, in case I don't look at the source code for weeks or months.
  3. to document my understanding, so that it can be corrected. If you see that I have a misconception about something, please let me know!

In addition to plain vanilla JavaScript code, this script relies heavily on the jQuery library.

If you have any comments or questions, feel free to post them at the bottom of this page under Discussions. Be sure to {{ping}} me when you do.

Aliases

An alias is one string defined to mean another. Another term for "alias" is "shortcut". In the script, the following aliases are used:

$ is the alias for jQuery (the jQuery library)

mw is the alias for mediaWiki (the MediaWiki library)

These two aliases are set up like this:

( function ( mw, $ ) {}( mediaWiki, jQuery ) );

That also happens to be a "bodyguard function", which is explained in the section below...

Bodyguard function

The bodyguard function assigns an alias for a name within the function, and reserves that alias for that purpose only. For example, you could use it to make "t" mean only "transhumanist" inside the function.

Since the script uses jQuery, we want to defend jQuery's alias, the "$". The bodyguard function makes it so that "$" means only "jQuery" inside the function, even if it means something else outside the function. That is, it prevents other JavaScript libraries from overwriting the $() shortcut for jQuery within the function. It does this via scoping.

The bodyguard function is used like a wrapper, with the alias-containing source code inside it, typically, wrapping the whole rest of the script. Here's what a jQuery bodyguard function looks like:

( function($) {
    // you put the body of the script here
} ) ( jQuery );

See also: bodyguard function solution.

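To extend that to lock in "mw" to mean "mediaWiki", use the following (this is what the script uses). Note the capital W: the global object is named mediaWiki, and a lowercase "mediawiki" would cause a ReferenceError.

( function(mw, $) {
    // you put the body of the script here
} ) (mediaWiki, jQuery);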
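
To see the scoping at work outside a browser, here is a small stand-alone sketch. It does not use the real jQuery; fakeJQuery and otherLibrary are made-up stand-ins, assumed only for illustration.

```javascript
// Made-up stand-ins so the sketch runs without a browser or the real libraries.
var fakeJQuery = function () { return 'jQuery'; };
var otherLibrary = function () { return 'another library'; };

// Outside the wrapper, some other library has claimed the "$" shortcut.
var $ = otherLibrary;

// Inside the wrapper, "$" is a parameter, so it means whatever was passed in.
var seenInside = ( function ( $ ) {
    return $();
}( fakeJQuery ) );

var seenOutside = $();
// seenInside is 'jQuery'; seenOutside is 'another library'.
```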

For the best explanation of the bodyguard function I've found so far, see: Solving "$(document).ready is not a function" and other problems   (Long live Spartacus!)

Load dependencies

Many of my scripts create menu items using mw.util.addPortletLink, which is provided in a resource module. Therefore, in those scripts it is necessary to make sure the supporting resource module (mediawiki.util) is loaded, otherwise the script could fail (though it could still work if the module happened to already be loaded by some other script). To load the module, use mw.loader, like this:

// For support of mw.util.addPortletLink
mw.loader.using( ['mediawiki.util'], function () {
// Body of script goes here.
} );

mw.loader.using is explained at mw:ResourceLoader/Core modules#mw.loader.using.

For more information, see the API Documentation for mw.loader.

The ready() event listener/handler

The ready() event listener/handler makes the rest of the script wait until the page (and its DOM) is loaded and ready to be worked on. If the script tries to do its thing before the page is loaded, there won't be anything there for the script to work on (for example, mw.util.addPortletLink will have nowhere to place its menu item), and the script will fail.

In jQuery, it looks like this: $( document ).ready(function() {});

You can do that in jQuery shorthand, like this:

$().ready( function() {} );

Or even like this:

$(function() {});

The part of the script that is being made to wait goes inside the curly brackets. But you would generally start that on the next line, and put the ending curly bracket, closing parenthesis, and semicolon on a line of their own, like this:

$(function() {
    // Body of function (or even the rest of the script) goes here, such as a click handler.
});

This is all explained further at the jQuery page for .ready()

For the plain vanilla version see: http://docs.jquery.com/Tutorials:Introducing_$(document).ready()
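
Putting the pieces so far together: the dependency loader and the ready handler are typically nested, with the script body innermost. The sketch below is an assumption about how they can be combined, not a copy of the script. The function name runWhenReady is hypothetical, and mw and $ are passed in as parameters so the nesting can be tried out with stand-in objects.

```javascript
// Hypothetical helper: run `body` once mediawiki.util is loaded and the DOM is ready.
function runWhenReady( mw, $, body ) {
    // 1. Wait for the module that provides mw.util.addPortletLink.
    mw.loader.using( [ 'mediawiki.util' ], function () {
        // 2. Wait for the page's DOM.
        $( function () {
            // 3. Only now run the script proper.
            body();
        } );
    } );
}

// In a user script this would be called with the real globals, for example:
// runWhenReady( mediaWiki, jQuery, function () { /* body of script */ } );
```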

var

This is the reserved word var, which is used to declare variables. A variable is a container you can put a value in. To declare the variable portletlink, write this:

var portletlink

A declared variable has no value (it is undefined) until you assign it one, like this:

portletlink = "yo mama";

You can combine declaration and assignment in the same statement, like this:

var portletlink = mw.util.addPortletLink('p-tb', '#', 'Remove red links');

Caveat: if you assign a value to a variable that has not been declared, the variable will be created automatically, and it will have global scope, even if the assignment happens inside a function. (In strict mode, such an assignment throws a ReferenceError instead.) For user scripts used on Wikipedia, having a variable of global scope means the variable may affect other scripts that are running, as the scripts are technically part of the same program, being called via import from a .js page (.js pages are programs). So, be careful.

mw.util.addPortletLink adds a menu item to one of MediaWiki's menus. Use "p-tb" to signify the toolbox menu in the sidebar.

First you stick it in a variable, for example, "portletlink":

var portletlink = mw.util.addPortletLink('p-tb', '#', 'Remove redlinks');

It has up to 7 parameters. Only 3 are used above.

General usage:

mw.util.addPortletLink( 'portletId', 'href', 'text', 'id', 'tooltip', 'accesskey', 'nextnode');

Its components:

  • mw.util.addPortletLink: the function, provided by the mediawiki.util ResourceLoader module, that adds links to the portlets.
  • portletId: the id of the portlet (that is, menu) where the new menu item is to be placed. The various menus ("portlets") are:
    • p-navigation: Navigation section in left sidebar
    • p-interaction: Interaction section in left sidebar
    • p-tb: Toolbox section in left sidebar
    • coll-print_export: Print/export section in left sidebar
    • p-personal: Personal toolbar at the top of the page
    • p-views: Upper right tabs in Vector only (read, edit, history, watch, etc.)
    • p-cactions: Drop-down menu containing move, etc. (in Vector); subject/talk links and action links in other skins
  • href: Link to a Wikipedia or external page (the initial purpose of portletlink was to link somewhere)
  • text: Text that displays in the menu (the title of the
  • id: HTML id (optional)
  • tooltip: Tooltip to display on mouseover (optional)
  • accesskey: Shortcut key press (optional)
  • nextnode: id of the existing portlet link to place the new portlet link before (optional) (Don't forget: ids have a leading "#")

The optional fields must be included in the above order. To skip a field, pass null (the bare keyword, without quotes) for that parameter.

To place the menu items in alphabetical order, and to keep them from moving around in the menu, specify for your last menu item the id of an existing menu item to anchor it. Then set nextnode for the next-to-last item to the id of the menu item you just added, and so on.

See the complete documentation at https://www.mediawiki.org/wiki/ResourceLoader/Modules#addPortletLink and Help:Customizing toolbars.

Important: All we've done so far above is assign mw.util.addPortletLink to a variable. It won't do anything until we bind the variable to a click handler (see below).

click handler

To make a menu item that does something when you click on it, you have to "bind" mw.util.addPortletLink, via its variable, to a handler. Like this:

(The variable used in this example is "portletlink").

$(portletlink).click( function(e) {
    e.preventDefault();
    //do some stuff
} );

The "handler" is the part between the curly brackets.

To read about function(e), see what does e mean in this function definition?
jQuery's event objects are explained here: http://api.jquery.com/category/events/event-object/
e.preventDefault() is short for event.preventDefault(), a method of jQuery's event object.

What is the default being prevented? Portletlink's default action is to link somewhere. We don't want it to do that, and so that is what e.preventDefault(); is for.
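
The two steps (creating the menu item, then binding the handler) can be wrapped up as follows. This is a sketch under assumptions: installMenuItem and onClick are hypothetical names, and mw and $ are passed in as parameters so the function can be exercised with stand-ins. It uses .on( 'click', ... ), which does the same job as the .click( ... ) shorthand, and it checks the return value because mw.util.addPortletLink returns null when no menu item could be added.

```javascript
// Hypothetical helper: add the menu item and run `onClick` whenever it is clicked.
function installMenuItem( mw, $, onClick ) {
    var portletlink = mw.util.addPortletLink( 'p-tb', '#', 'Remove red links' );
    if ( !portletlink ) {
        return null; // the portlet was not found, so there is nothing to bind
    }
    $( portletlink ).on( 'click', function ( e ) {
        e.preventDefault(); // do not follow the '#' link
        onClick();
    } );
    return portletlink;
}

// In a user script: installMenuItem( mw, $, function () { /* remove the red links */ } );
```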

Calling a function

In JavaScript, a function is a subroutine, essentially, a program within the main program. Functions are usually placed at the end of the program, after its core, but can also be located in a library, like jQuery. You call a function by its name. The function "example" is called like this:

example();

See also: JavaScript Function Invocation.

window.location.href

window.location.href returns the current URL.

The window object represents the current window in the browser, and is at the top of the Browser Object Model hierarchy.

The location object pertains to the URL of the current document, and href is one of its properties.

window.location.href.indexOf

This applies the indexOf method to the URL, to return the index (starting position) of a given string. This can be used to check if the URL contains a specific string.

if (window.location.href.indexOf('action') >= 0) essentially means "if 'action' is in the URL". That is, its position in the URL is equal to or greater than 0 (0 represents the first spot, 1 is the second spot, etc.), telling us that it is in there. If it is not there, indexOf returns -1.
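
As a stand-alone illustration, the same test can be written as a small function that takes the URL as a parameter, so it can be tried outside a browser. The function name urlContains and the example URLs are assumptions made for the sake of the example.

```javascript
// Hypothetical helper: true if `text` occurs anywhere in `url`.
function urlContains( url, text ) {
    return url.indexOf( text ) >= 0;
}

var viewingUrl = 'https://en.wikipedia.org/wiki/Outline_of_France';
var editingUrl = 'https://en.wikipedia.org/w/index.php?title=Outline_of_France&action=edit';

var viewingHasAction = urlContains( viewingUrl, 'action' ); // false: indexOf returned -1
var editingHasAction = urlContains( editingUrl, 'action' ); // true: 'action' was found
```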

window.location.href.substr

Gets part of the URL.

The substr method returns a substring of the string it is applied to, given a start index and a length. If only a start index is provided, the substring runs from that index to the end of the string. In this case, the string is window.location.href (that is, the URL). Note that 0 represents the first character of the string.

So, window.location.href.substr(0,6) would return the first 6 characters of the URL.

That's not particularly useful, as we probably want to manipulate the string based on what is in it. For example...

window.location.href.substr(0, window.location.href.indexOf('#'))

What that returns is the beginning of the URL up to, but not including, the # character, which we can in turn use in concatenation. (If the URL contains no #, indexOf returns -1 and substr returns an empty string.) The following line of code concatenates (adds) ?action=edit to the substring, and then replaces the URL with it:

window.location = window.location.href.substr(0, window.location.href.indexOf('#'))+"?action=edit";

This jumps to the edit page for the current page, as if we clicked on "Edit".
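
The line above assumes that the URL contains a # and has no query string yet. A more defensive variant might look like the sketch below. The name toEditUrl is hypothetical, and the function takes the URL as a parameter rather than reading window.location, so it can be tested anywhere.

```javascript
// Hypothetical helper: build the edit URL for a page URL, with or without a #fragment.
function toEditUrl( href ) {
    var hashAt = href.indexOf( '#' );
    // Keep everything before the '#', or the whole URL if there is no '#'.
    var base = hashAt === -1 ? href : href.slice( 0, hashAt );
    // Use '&' if the URL already has a query string, '?' otherwise.
    var separator = base.indexOf( '?' ) === -1 ? '?' : '&';
    return base + separator + 'action=edit';
}

// In a browser: window.location = toEditUrl( window.location.href );
```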

The script assigns an empty array (represented by opening and closing square brackets, []) to the variable redlinks.

Arrays are ordered sets of items.

We created this array to store all the redlinks that are on the page. (See below).

The following line of code declares the variable "a" and assigns to it all the elements in the document with the tag "<a>", as an array-like collection (an HTMLCollection) that can be indexed and looped over like an array:

var a = document.getElementsByTagName('a');

using a for loop to process an array

https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Loops_and_iteration#for_statement

getAttribute('class')

This method returns the value of the specified attribute of the element it is called on (with a dot, for example someElement.getAttribute('attribute')). This allows elements to be processed according to a particular attribute, such as their class.

https://www.w3schools.com/jsref/met_element_getattribute.asp

https://www.w3schools.com/html/html_attributes.asp
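
Putting the for loop and getAttribute together, collecting the titles of a page's red links might look like the sketch below. It is an illustration, not the script's code: collectRedlinkTitles is a hypothetical name, and the sketch assumes that red links carry the class new and an href of the form /w/index.php?title=Some_title&action=edit&redlink=1.

```javascript
// Hypothetical helper: return the page titles of all red links in an array-like list of <a> elements.
function collectRedlinkTitles( anchors ) {
    var titles = [];
    for ( var i = 0; i < anchors.length; i++ ) {
        var classes = ( anchors[ i ].getAttribute( 'class' ) || '' ).split( /\s+/ );
        if ( classes.indexOf( 'new' ) === -1 ) {
            continue; // not a red link
        }
        var match = /[?&]title=([^&#]+)/.exec( anchors[ i ].getAttribute( 'href' ) || '' );
        if ( match ) {
            // Undo the URL encoding, then turn underscores back into spaces.
            titles.push( decodeURIComponent( match[ 1 ] ).replace( /_/g, ' ' ) );
        }
    }
    return titles;
}

// In a browser: collectRedlinkTitles( document.getElementsByTagName( 'a' ) );
```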

.length

.href.replace

.replace

decodeURIComponent

localStorage

This didn't work:

localStorage.OLUtils_redlinks = JSON.stringify(redlinks);

So I used this, and it worked:

jsonString = JSON.stringify(redlinks);
localStorage.OLUtils_redlinks = jsonString;
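
Later on this page, TheDJ advises wrapping every localStorage call in a try...catch, and Mr. Stradivarius notes that the setItem syntax is preferred. A sketch along those lines follows. saveRedlinks and loadRedlinks are hypothetical names, the key OLUtils_redlinks is the one used above, and the storage object is passed in as a parameter (in a browser it would be window.localStorage).

```javascript
// Store the array as a JSON string. Returns false if storage is full, disabled, or missing.
function saveRedlinks( storage, redlinks ) {
    try {
        storage.setItem( 'OLUtils_redlinks', JSON.stringify( redlinks ) );
        return true;
    } catch ( e ) {
        return false;
    }
}

// Read the array back. Returns an empty array if nothing usable is stored.
function loadRedlinks( storage ) {
    try {
        var stored = storage.getItem( 'OLUtils_redlinks' );
        return stored === null ? [] : JSON.parse( stored );
    } catch ( e ) {
        return [];
    }
}

// In a browser: saveRedlinks( window.localStorage, redlinks );
```

Returning a plain true/false (or an empty array) lets the calling code decide what to do when storage is unavailable, instead of the whole script stopping on an uncaught error.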

JSON.stringify() method

JSON.stringify()

Difference between JSON.stringify and JSON.parse

alert()

alert() is short for "window.alert()".

This command makes a message box with a message appear, with an OK button. The script will not continue until the OK button is pushed.

The message is included within the parentheses. It can be a string, a variable, or an object. If it is a variable, its value is converted to a string and displayed. A plain object shows up as [object Object] unless it is converted first (for example, with JSON.stringify()).

JSON.parse() method

JSON.parse()

Difference between JSON.stringify and JSON.parse

RegExp

Change log

  • 2017-02-09
    • Started script with some pseudocode and feature wish list
    • Added:
      • importScript('User:AlexTheWhovian/script-functions.js');
      • copy/paste User:AlexTheWhovian/script-redlinks.js
    • Removed importScript('User:AlexTheWhovian/script-functions.js')
      • AlexTheWhovian said it wasn't used by script-redlinks.js
  • 2017-02-10 & 2017-02-11
    • In the process of documenting the script with detailed comments.
      • Got down through mw.util.addPortletLink
    • Added "Explanatory notes" section to the talk page, to provide more in-depth scripting support than the comments. Got to mw.util.addPortletLink
  • 2017-02-12
    • Fixed bug that prevented operation and required another version being loaded.
      • It was working weird because of a function invocation being placed out of context, at the start of the script.
      • There was also a function invocation missing from a conditional in the body of the script.
      • This script now runs stand-alone (without the crutch of the other version being run)
    • Changed the menu item to "Remove red links" (it was "remove redlinks")
  • 2017-02-14
    • Add ready function
  • 2017-02-15
    • Wrote pseudocode in script for bullet list item processing.
  • 2017-04-06
    • Worked out incrementing structure for the while/for nested loop pair for processing bullet list items.

Task list

  • Start script  Done
  • Add AlexTheWhovian's redlink script and function library  Done
  • Test it to see if it works  Done (it does not work = Dec 26 2016 15:08 version)
    • Cycle through various versions to see which work with Firefox  Done
      • Feb 28 2015 version  Done (it works)
      • May 27 2016 version  Done (it works)
      • Dec 4 2016 version  Done (it works)
      • Dec 22 2016 version  Done (failed - it appeared in menu, but failed to remove redlinks)
      • Dec 26 2016 15:05 version  Done (it works = chose this one for starting point)
      • Dec 26 2016 15:08 version  Done (failed - no menu item)
  • Determine which of the functions in User:AlexTheWhovian/script-functions.js are called in redlinks.js  Done (none)
    • If none, remove it  Done
  • Test local storage feature  Done
    • With alerts  Done
  • Fix red link removal so it works in originally intended fashion  Done
  • Implement nested loop to remove unannotated non-parent bulleted list entries
    • Situate a for loop inside a while loop  Done
    • Work out increment structure ((done))
    • Write the guts of the for loop (the regex for removing end of branch redlinks without annotation)
      • Study the regex objects used in the forked block of code
      • Review how to match a multiple-line string
  • Figure out why the "outline in title" alert in the function redlinks_removal() gets activated
    • Clear the local memory and test it
    • Study program flow
  • Wrap local storage in a try catch
  • Needs more comments, and more detailed comments
  • Write explanatory notes, on talk page, explaining the programming in-depth
  • Change the title to indicate the single function (redlink removal). Make a separate multi-function script.

Bug reports

  • Missing dependencies? (fixed 2017-02-12)  Done -  Fixed: it wasn't dependencies; it was misplaced and missing function calls.
    • The script won't work.  Fixed But there's a weird work-around:
      1. Run the Feb 28 2016 version at the same time
      2. Use it on a page with redlinks
      3. Go to your RedlinksRemover.js and bypass the cache
        Resolved
        – don't need work-around any more
  • 2017-02-13 Red link removal stopped working  Fixed
    • The script puts the target page into edit mode, but then doesn't edit anything  Fixed
  • 2017-04-06 The script runs the functions at the end of the script, when the "Remove red links" menu has not been clicked, and I don't know why

Desired/completed features

Completed features are marked with  Done
  • Remove redlinked entries in outlines
    • Remove redlinked bullet entries that both have no annotation and have no children. (If one has an annotation, or a child, don't remove it.) Because this could create new candidates, this function needs to be looped.
      • Check for annotation
      • To check for children, see if any bullet entries that follow it have more bullets than it does
      • If no changes are made during a complete loop, stop. (How do you check for changes?)
      • To prevent infinite looping, stop after 10 iterations (it can always be run again)
    • When no more candidates are to be found, remove redcats, and delink the remaining redlinks.
  • Save title to variable.  Done don't have to. Can check title directly.
    • Some features will work only on outlines, and will check the title variable for "Outline of" first.  Done used
    • (if match "Outline of" in title, then do....)  Done used if (document.title.indexOf("Outline ") != -1) {}
  • Integrate anno.js (the annotation toggler).
    • get it working right first
  • For stream editing commands, the script will have an optional interactive mode.
  • For Macro compatibility, all toggles will have an on-"button" and an off-"button".
  • Entry linker (checks unlinked entry names for the existence of non-disambiguation page article titles. If one exists, linkify it.)
  • Entry inserter (checks template for entries missing in the current outline, then checks each title for existence. If one exists, insert it, but not if it is a disambiguation page.)
  • Display a random outline, but not if currently in edit mode.
  • Display next outline in the main list of outlines, but not if currently in edit mode.

Development notes

Trycatch needed, and more

The Transhumanist, where you use localStorage.getItem() or setItem() you should always wrap that in try catch, as it can fail at any moment (even if you checked previously). This can be due to the browser running out of storage space for the domain, or because the browser is running in privacy mode or with an ad blocker extension or something. Also, your new RegExp() calls should be lifted outside of the for loops, so that they aren't continuously recreated. For wpTextbox1.value, realise that sometimes the content might be managed by an editor (the syntax highlighting beta does this, for instance). We use the jquery.textSelection plugin to abstract away from these differences. Don't check document.title, check mw.config.get( 'wgTitle' ) or mw.config.get( 'wgPageName' ). And when you use mw.util.addPortletLink, you have to ensure that the mediawiki.util module is loaded already, which you can do by using mw.loader.using. —TheDJ (talkcontribs) 14:47, 27 October 2017 (UTC)

Rough rough talk through

Script dependencies

(Copy of Wikipedia:Village pump (technical)#Script dependencies)

Let's say a script works for one person, but not another. Or it's working on two machines, but after one is cold booted, it doesn't work on that one.

How would one find the dependencies required by the script?   The Transhumanist 12:16, 12 February 2017 (UTC)

@The Transhumanist: I am guessing that your problems are not caused by a lack of dependencies, but rather by the way you are using the localStorage object. According to the docs, you should be using localStorage.setItem('foo', 'bar'), not localStorage.foo = 'bar'. If you use the API in a non-standard way I wouldn't be surprised if there were differences between the way the various browsers handle it. — Mr. Stradivarius ♪ talk ♪ 13:18, 12 February 2017 (UTC)
Actually, after some more reading, it seems that the localStorage.foo = 'bar' syntax is fine (although the setItem syntax is preferred). That link does give some other suggestions as to things that could be wrong, though - localStorage might not be implemented on old browsers, it might be disabled by users, or it might be full. — Mr. Stradivarius ♪ talk ♪ 15:01, 12 February 2017 (UTC)
Also, I would use a unique prefix for your localStorage keys, maybe olutils_ (so the current key would be olutils_redlinks), to reduce the chance of clashes between your data and other localStorage data saved by MediaWiki or by other gadgets. — Mr. Stradivarius ♪ talk ♪ 13:24, 12 February 2017 (UTC)
@Mr. Stradivarius: It had little to do with memory, but your suggestion provided the essential clue. Since I had 2 versions of the script running simultaneously, the second one worked because of data stored locally by the first one. Without that storage there, the second script failed, which became apparent when I customized the localStorage key per your suggestion. Which led me to a bug. I fixed the bug, and now the second script works on its own. Though there are still some bugs (the menu item has to be clicked again after getting a preview, twice, for it to work, but it does work). Thank you! The Transhumanist 00:59, 13 February 2017 (UTC)
@The Transhumanist: Also, all calls to localStorage should always be wrapped in a try catch. localStorage can easily fail due to being full, or due to being in a privacy mode or some other restriction that the browser is placing. —TheDJ (talkcontribs) 07:32, 13 February 2017 (UTC)
Thanks, I'll look that up. The Transhumanist 20:01, 13 February 2017 (UTC)

 ------------------ End of copy ----------------

Discussions

Nested RegExp

I'm working on a script (User:The Transhumanist/OLUtils.js) to remove redlinks from outlines, and I've run into a problem with regular expressions:

var nodeScoop2 = new RegExp('('+RegExp.quote(redlinks[i])+')','i');
var matchString2 = wpTextbox1.value.match(nodeScoop2);
alert(matchString2);

The above returns two matches, when I was expecting one. The second one is coming from the nested RegExp constructor.

Is there another way to specify a variable within a regular expression? If so, what?

Also, I can't find any documentation on the plus signs as used here. Can you explain them, or point me to an explanation?

What would the RegExp look like in literal notation?

Thank you. The Transhumanist 11:07, 5 May 2017 (UTC)

This is the way Twinkle specifies variables in a regular expression; to my knowledge it's the only way to do it. The plus signs are acting as string concatenation operators (string + string = concatenation). And you couldn't express this in literal notation, because literal notation can't accept variables (it is literal after all).
As an example of using new RegExp, this regexp in literal notation: /^Hello\s+/gi is entirely equivalent to new RegExp('^Hello\\s+', 'gi'). Note the double escaping! This is because character escapes in regular expression are processed separately from character escapes in strings.
As to why it is returning two matches instead of one, I really couldn't tell you. Could you provide a simplified test case or example? — This, that and the other (talk) 12:40, 5 May 2017 (UTC)
Thank you for the explanation. In answer to your question, "yes". Run the script User:The Transhumanist/OLUtils.js on any article with "Outline of" in the title, and that has red links in it, and the alerts will show you. The Transhumanist 15:35, 5 May 2017 (UTC)
(edit conflict)@The Transhumanist: It's difficult to quickly assess exactly what's going on without seeing the data it's being run against and the matches you are seeing. Is it possible that there's actually multiple matches in the input text? E.g. if you look for "apple" in "apple, orange, pineapple", two matches is the expected result. You would need to look for "\bapple\b" to restrict both ends to word boundaries, but that would still give multiple matches against "red apple, green apple, orange". There is nothing about that code snippet which suggests that multiple matches should be unexpected behaviour.
I think your problem here is that you need to deal with the text before and after the thing the regexp is supposed to match. Looking at Alex's original script, I believe you need to use something like his original regular expressions, as it looks like they already deal with the beginning and end of the string. I don't see why you appear to be reinventing the wheel here, as it looks like Alex's script already deals with that issue.
As for "plus signs as used here", do you mean the string concatenation operators? If you don't recognise basic JS operators and string concatenation, I suggest that you may need to learn fundamental JS programming before continuing. Try the tutorials and guides at https://developer.mozilla.org/en-US/docs/Web/JavaScript.
Literal notation? If you feed "apple" into the above snippet, via the "redlinks" array, you'd get the equivalent of /(apple)/i. That's very basic stuff, so you should probably be doing some reading on Mozilla's MDN site (or some other JS learning resource).
Murph9000 (talk) 12:55, 5 May 2017 (UTC)
Thank you for the input. I've been having much difficulty with this script. The answer is "no" on the multiple matches. The original statement was
var nodeScoop2 = new RegExp('\\[\\[\\s*('+RegExp.quote(redlinks[i])+')\\s*\\]\\]','i');
which for example returns [[Geography of France]], Geography of France
So I figure it's the nested RegExp that is the second match. The Transhumanist 15:33, 5 May 2017 (UTC)
Ok, now it's clearer exactly what you are talking about. This is expected behaviour, it's standard regexp group stuff as Syockit explained below. Don't use the term "nested RegExp" like that, as that's not what it is and that term just adds to the confusion here. Murph9000 (talk) 20:50, 5 May 2017 (UTC)
The parentheses create a capturing group. The first match is the whole matched string, while the second one is the captured group. Try with RegExp(RegExp.quote(redlinks[i]),'i') and see if it works. Syockit (talk) 12:57, 5 May 2017 (UTC)
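
A small stand-alone test case shows the effect. RegExp.quote is not part of standard JavaScript, so this sketch defines its own escaping helper (escapeRegExp, a hypothetical name) in its place, and the sample text is made up.

```javascript
// Hypothetical stand-in for RegExp.quote: escape characters that are special in a regex.
function escapeRegExp( text ) {
    return text.replace( /[.*+?^${}()|[\]\\]/g, '\\$&' );
}

var title = 'Geography of France';
var text = '* [[Geography of France]]';

// With parentheses: element 0 is the whole match, element 1 is the captured group.
var withGroup = text.match( new RegExp( '\\[\\[\\s*(' + escapeRegExp( title ) + ')\\s*\\]\\]', 'i' ) );

// Without parentheses: only the whole match is returned.
var withoutGroup = text.match( new RegExp( '\\[\\[\\s*' + escapeRegExp( title ) + '\\s*\\]\\]', 'i' ) );

// alert() converts the result to a string, joining the elements with commas.
var shownWithGroup = String( withGroup );       // '[[Geography of France]],Geography of France'
var shownWithoutGroup = String( withoutGroup ); // '[[Geography of France]]'
```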

Wow. It's been many moons since anyone has asked me for JS help- I thought I'd become just a mostly-faded memory for a few editors. With that being said, Syockit is right as far as I can tell in that the parentheses create a capturing group. I'm not entirely sure why they're there at all- I'd use the same nodeScoop2 you currently have without the parentheses around the RegExp.quote; i.e. try:

var nodeScoop2 = new RegExp('\\[\\[\\s*'+RegExp.quote(redlinks[i])+'\\s*\\]\\]','i');

Best, Kangaroopowah 20:09, 5 May 2017 (UTC)

I tried what you suggested in User:The Transhumanist/redlinkstest.js, and it doesn't seem to work. I'll keep at it, though. The Transhumanist 20:34, 5 May 2017 (UTC)
I forgot the quotes. So I put those back, and adjusted the replace strings to account for the removal of the control group delimiters, and it worked. Now to try it on the current script... The Transhumanist 02:29, 6 May 2017 (UTC)
Glad I could help. Best, --03:34, 6 May 2017 (UTC)
Perhaps you are looking for String.indexOf(). Oftentimes people discover regular expressions and somehow convince themselves that everything must be expressed in terms of regexes. If regex is not working for you, it is ok not to use it. 91.155.195.247 (talk) 20:07, 5 May 2017 (UTC)
According to https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/indexOf , that returns a number. I'm looking for specific strings, not the position (index) of a string. Thank you, as I was unaware of what this method does. The Transhumanist 11:00, 6 May 2017 (UTC)
You will have the starting position of the string and its length (= the length of the substring you are looking for). String.substring() will extract you the matching string - which will be the same as the string you were looking for, except possibly for case. This is how a programmer would do it, not with regexes. 91.155.195.247 (talk) 15:55, 7 May 2017 (UTC)
I cannot clearly see what you want to achieve, but I find this code overkill. MediaWiki adds the title of the actual destination to each link as its title attribute, and the class new to red links.
This jQuery one-liner simply unlinks all red links. It inserts each link's text before the link and then removes the link.
$("a.new").before(function(){ return this.textContent }).remove();
The function passed to before returns what should remain after the link is removed. Inside it, this refers to the element currently being iterated, by jQuery's design. To remove a link completely, make the function return nothing. The following example completely removes red category links and treats other red links as before.
$("a.new").before(function(){
  if (!this.title.startsWith("Category:"))
    return this.textContent;
}).remove();
Chen-Pang He (talk) —Preceding undated comment added 07:40, 6 May 2017 (UTC)[reply]

Sorry, but I don't understand what you are trying to achieve. If you want to remove red links from the DOM (in the generated code of the view), you can use JavaScript (faster) or jQuery (slower) to remove or replace all of them at once, or to do more with each of them in a loop. With JavaScript you need to use either "getElementsByClassName" (for example applied to class="new") or "getElementsByTagName" for all <a> elements, and then you can apply styles ('color', 'cursor', …) or replace them with your own content, such as their "innerHTML" values. With jQuery >= 1.2 you can use something like $(".new").replaceWith(function() { return $(this).text(); }); or $(".new").replaceWith(function() { return this.innerHTML; });, while with jQuery >= 1.4 you can use the unwrap function like this: $(".new").contents().unwrap();. jQuery looks shorter, but only because you do not see all the code that runs behind it, and it is much slower than native JavaScript (when that is well written, of course). All of them, JavaScript and jQuery, should be wrapped in a document-ready function (via JavaScript or jQuery), a setTimeout function, or both. If you need to store their values, you can create a for or a while loop over them and then do whatever you want. Of course, if you are working on the source code, the above does not apply at all. About the regex, I need to know more about the data, plus tests and examples. The reason for its multiple matches has been well explained above. Just a note: if you are sending and parsing a huge quantity of data, for example the whole content of an article, then something like Perl is always the faster and better solution, because it was conceived for reporting on big log files such as those generated by a server. AWK and sed are also good at this. Unfortunately, I do not think that they are available here. –pjoef (talkcontribs) 12:18, 6 May 2017 (UTC)[reply]

The script is User:The Transhumanist/OLUtils.js, and the section we are working on here is for processing outlines, and starts with this:
if (document.title.indexOf("Outline ") != -1) {.
For outlines, the script is supposed to remove list item entries (including bullet and carriage return) that are comprised entirely of redlinks, but only if they have no children. Red end nodes. It goes through several iterations, just in case the removal of a red end node renders other red entries into end nodes. After all those have been removed, then the script deletes any red category links, and finally delinks the remaining embedded red links. I've provided a more in-depth explanation below under #What the script is supposed to do. For non-outlines, it just deletes red cats and delinks the rest of the redlinks. The Transhumanist 04:09, 7 May 2017 (UTC)[reply]

The whole regex

The sample I posted at the beginning of this thread was simplified to show the problem that it was returning 2 matches instead of the expected 1. So, I thought the script might do unexpected replacements, but that has not happened (yet). But I've run into other problems...

The regex from the script is more involved than the sample, and is for matching the line the key topic (redlinks[i]) is included on plus the whole next line:

var nodeScoop2 = new RegExp('\\n((\\*)+)[ ]*?\\[\\[\\s*'+(RegExp.quote(redlinks[i]))+'\\s*\\]\\].*?\\n(.*?\\n)','i');

The reason the whole next line is included is because I'd like to delete entries based upon the type of line that follows (or more accurately, does not follow). If the entry is not followed by a child, then it gets deleted, but should be kept if it does have a child. The weird thing is, that the part matching the whole next line is in the 4th set of parentheses, so you would expect $4 to back reference that. In practice, it is $3 that accesses that capturing group. And I don't know why. Though the solution (ignoring the parentheses around the embedded RegExp, when counting the capturing groups) seems to be working. But, I've run into a worse problem...
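One possible explanation for the $3 numbering, sketched below: the parentheses around RegExp.quote(redlinks[i]) sit outside the quoted string pieces, so they are ordinary JavaScript grouping parentheses and never become part of the pattern. Only parentheses inside the pattern string are counted. The helper escapeRegExp is a hypothetical stand-in for RegExp.quote, and the sample wikitext is made up from the entry names used in this thread.

```javascript
// Hypothetical stand-in for the script's RegExp.quote helper (not a built-in)
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

var title = 'Redlink 3';

// The parentheses around escapeRegExp(title) are JavaScript syntax, not regex
// syntax: they only group the expression being concatenated.
var source = '\\n((\\*)+)[ ]*?\\[\\[\\s*' + (escapeRegExp(title)) + '\\s*\\]\\].*?\\n(.*?\\n)';
var nodeScoop = new RegExp(source, 'i');

// Printing the finished pattern shows three opening parentheses, not four
console.log(nodeScoop.source);

var wikitext = '\n* [[Geology]]\n** [[Redlink 3]]\n*** [[Redlink 4]]\n';
var m = wikitext.match(nodeScoop);

console.log(JSON.stringify(m[1])); // group 1: all of the parent's asterisks
console.log(JSON.stringify(m[2])); // group 2: the last single asterisk matched
console.log(JSON.stringify(m[3])); // group 3: the whole next line
```

Groups are numbered by the position of their opening parenthesis in the pattern, which is why the next line lands in group 3.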

// Here is the regular expression for matching the scoop target (to "scoop up" the redlinked entry with direct (non-piped) link, plus the whole next line)
var nodeScoop2 = new RegExp('\\n((\\*)+)[ ]*?\\[\\[\\s*'+(RegExp.quote(redlinks[i]))+'\\s*\\]\\].*?\\n(.*?\\n)','i');
 
// To actualize the search string above, we create a variable with method:
var matchString2 = wpTextbox1.value.match(nodeScoop2);
alert(matchString2); // for testing

// Declare match patterns
var patt1 = new RegExp(":");
var patt2 = new RegExp(" – ");
var patt3 = /$1\*/;

// Here's the fun part. We use a big set of nested ifs to determine if matchString2 does not match criteria. If it does not match, delete the entry:
// If matchString2 isn't empty
if (matchString2 !== null) {

    // If has no coloned annotation (that is, does not have a ":")
    if (patt1.test(matchString2) === false) {

        // If has no hyphenated annotation (that is, does not have " – ")
        if (patt2.test(matchString2) === false) {

            // ...and if the succeeding line is not a child (that is, does not have more asterisks)
            if (patt3.test(matchString2) === false) {

                // ... then replace nodeScoop2 with the last line in it, thereby removing the end node entry
                wpTextbox1.value = wpTextbox1.value.replace(nodeScoop2,"\n$3");
                incrementer++;
                alert("removed entry");
            }
        }
    }
}

The problem is patt3. I'm trying to check for the asterisks at the beginning of the second line. If there is one more asterisk on that line than in the line before it, it means it is a child. In which case I do not want to delete the parent. But, the above code deletes the parents anyways.

In the example below, $1 should match the asterisk at the beginning of the parent line, and $1\* (patt3) should match the asterisks at the beginning of the child line. But it doesn't seem to be working. And when I add an alert to test for the value of patt3 or $1, the script crashes!

* Parent
** Child

If $1 includes asterisks in it, does it return those asterisks escaped?

Any ideas on how to solve my patt3 problem? The Transhumanist 12:14, 6 May 2017 (UTC)[reply]

Try to double-escape the asterisk \\* in a RegExp constructor or in this way /\*. –pjoef (talkcontribs) 12:26, 6 May 2017 (UTC)[reply]
I did. See the RegExp below. Notice that the double escaped asterisk is inside a capturing group. When you use $1 to refer to that capturing group, will the asterisks in there still be escaped? When I try to use alert to test for $1, it crashes the script.
var nodeScoop2 = new RegExp('\\n((\\*)+)[ ]*?\\[\\[\\s*'+(RegExp.quote(redlinks[i]))+'\\s*\\]\\].*?\\n(.*?\\n)','i');
I look forward to your reply. The Transhumanist 13:58, 6 May 2017 (UTC)[reply]
"*" is a quantifier (a special character) and, like all other special characters, it needs to be escaped when it is part of the pattern of characters that you want to find or replace. See: w3schools.com/jsref/jsref_obj_regexp.asp. As for using alert for debugging, I suggest using the console.log() method instead, to display data directly in the browser's debugger. More at: w3schools.com/js/js_debugging.asp. The debugger itself should also be able to show you what the error is and where it is in your code. As for editing the article through DOM manipulation: it doesn't save the changes by itself, but if a user is in the editor window/view and presses the save button, all changes that have been made to the content will be saved. –pjoef (talkcontribs) 09:26, 7 May 2017 (UTC)[reply]
P.S.: I haven't tested it out but probably $1 is "undefined". In this case you need to check for this before you use it: if ($1) …. –pjoef (talkcontribs) 09:34, 7 May 2017 (UTC)[reply]
Running the code on the generated document seems easier, because we can make use of the HTML structure. A leaf link that is safe to remove is the only child of its li.
$("a.new").replaceWith(function(){
  if (this.title.startsWith("Category:"))
    return null;

  if (this.matches("li > :only-child"))
    return null;

  return this.textContent;
});
Cheers, Chen-Pang He (talk) —Preceding undated comment added 15:19, 6 May 2017 (UTC)[reply]
Hi. Thanks for the suggestions. I have some questions for you: Would the code you provided edit the article, or just affect the view? I'm looking for editing solutions. How could a script remove children list items in the edit window? The Transhumanist 03:57, 7 May 2017 (UTC)[reply]

I got your message. It looks like you may have gotten the help you need. When working with RegExp, I like to try them on some sample strings to see what each one is actually matching, and what it's returning. There's a great website for doing that: regex101. Nathanm mn (talk) 16:12, 6 May 2017 (UTC)[reply]

We still haven't figured it out. The problem I'm trying to solve is how to identify when a list item has a child. A child list item will have one more asterisk at the beginning than the parent. So, I set up a capturing group for the asterisks at the beginning of the parent (so $1 would be the back reference), and then try to match that number of asterisks plus one more in the child (using $1\*). But it isn't working. I am stuck. There are other criteria that an entry must also fail before it is removed; otherwise I wish to keep it. So simply getting rid of all children isn't what I'm after. We already know they are redlinked entries, because the first half of the program puts all redlinks into an array, which we process in the second half of the program. The nested if structure first checks whether the current redlink in the array has a matching list entry. If it does, we check whether it has a colon annotation. If it has no colon separator, we check whether it has a hyphenated annotation. If it has no en dash separator, we check whether it has children. If it has no child, we delete it from the wiki source, modifying the actual article itself.
Once all redlinked entries that fail our tests are removed, then the rest of the program mops up, deleting red category links, and delinking all redlinks that still remain after that. We know, due to the extensive filtering we just subjected them to, that they are all embedded redlinks, the content of which we want to keep. I'll make a sample below that presents examples of the data instances to be processed. The Transhumanist 22:12, 6 May 2017 (UTC)[reply]
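The mop-up step described above could be sketched on the wikitext roughly as follows. This is an illustration under assumptions, not the script's actual code: the function name mopUpRedlinks is made up, escapeRegExp is a hypothetical stand-in for RegExp.quote, and red category titles are assumed to be stored in the array with their "Category:" prefix.

```javascript
// Hypothetical stand-in for the script's RegExp.quote helper (not a built-in)
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Delete red category links and delink every other redlink.
// `redlinks` is assumed to be an array of red page titles.
function mopUpRedlinks(wikitext, redlinks) {
    redlinks.forEach(function (title) {
        var t = escapeRegExp(title);
        if (/^Category:/i.test(title)) {
            // Drop the whole category link (optional sort key, optional line break)
            var cat = new RegExp('\\[\\[\\s*' + t + '\\s*(\\|[^\\]]*)?\\]\\]\\n?', 'gi');
            wikitext = wikitext.replace(cat, '');
        } else {
            // [[Title|label]] becomes label
            var piped = new RegExp('\\[\\[\\s*' + t + '\\s*\\|([^\\]]*)\\]\\]', 'gi');
            wikitext = wikitext.replace(piped, '$1');
            // [[Title]] becomes Title, keeping the capitalisation as written
            var plain = new RegExp('\\[\\[\\s*(' + t + ')\\s*\\]\\]', 'gi');
            wikitext = wikitext.replace(plain, '$1');
        }
    });
    return wikitext;
}

var sample = '* [[Geology]] – includes [[Redlink 1]] and [[Redlink 1|one]]\n[[Category:Red cat]]\n';
console.log(mopUpRedlinks(sample, ['Redlink 1', 'Category:Red cat']));
```

Inside the replacement string, $1 does have its special meaning, which is why the captured label or title can be put back without the brackets.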

What the script is supposed to do

Here is a sample item list:

What we want to do is remove the list entries for which the topic is a redlink, but which do not have annotations, and which do not have children. Then we delete redlinked categories, and delink whatever redlinks are leftover — those will be by definition embedded, such as redlink 1 and redlink 3. Redlink 3 is embedded by virtue of having children.

Redlink 2 is a dead end. It is an end node in the tree structure that contains only a redlink. It gets deleted.

The script goes through the list multiple times, until it no longer finds dead end redlinks. This is because when it removes a redlinked end node, that may cause its redlinked parent to become a dead end node (such as when it has no other children). Multiple iterations catch these. So the entire branch starting with Redlink 10 will be deleted.
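The repeated passes could be sketched like this. It is a simplified illustration rather than the script's own logic: it works line by line instead of with one large regex, treats an entry as removable only when the whole line is a bullet plus a direct (non-piped) redlink, and uses made-up function names plus a hypothetical escapeRegExp helper in place of RegExp.quote. "Redlink 11" in the usage example is an invented child of Redlink 10.

```javascript
// Hypothetical stand-in for the script's RegExp.quote helper (not a built-in)
function escapeRegExp(text) {
    return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// One pass: drop each list line that is nothing but a redlink and whose
// following line is not a child (does not have more asterisks).
function removeRedEndNodesOnce(wikitext, redlinks) {
    if (redlinks.length === 0) {
        return wikitext;
    }
    var titles = redlinks.map(escapeRegExp).join('|');
    // Whole line = bullets + [[redlink]]; any annotation makes this fail
    var bare = new RegExp('^(\\*+)\\s*\\[\\[\\s*(?:' + titles + ')\\s*\\]\\]\\s*$', 'i');
    var lines = wikitext.split('\n');
    var kept = [];
    for (var i = 0; i < lines.length; i++) {
        var m = lines[i].match(bare);
        var next = i + 1 < lines.length ? lines[i + 1] : '';
        var nextDepth = (next.match(/^\*+/) || [''])[0].length;
        var isEndNode = m !== null && nextDepth <= m[1].length;
        if (!isEndNode) {
            kept.push(lines[i]);
        }
    }
    return kept.join('\n');
}

// Repeat until a pass removes nothing, so that parents which become
// end nodes are caught on a later pass.
function removeRedEndNodes(wikitext, redlinks) {
    var previous;
    do {
        previous = wikitext;
        wikitext = removeRedEndNodesOnce(wikitext, redlinks);
    } while (wikitext !== previous);
    return wikitext;
}
```

For example, with the list below, the first pass removes Redlink 4 and Redlink 11, the second pass removes Redlink 10 (now childless), and Redlink 3 survives because Psychology is still its child:

```javascript
var outline = [
    '* [[Geology]]',
    '** [[Redlink 3]]',
    '*** [[Redlink 4]]',
    '*** [[Psychology]]',
    '* [[Redlink 10]]',
    '** [[Redlink 11]]',
    '* [[Redlink 1]] – annotated entry'
].join('\n');
var red = ['Redlink 1', 'Redlink 3', 'Redlink 4', 'Redlink 10', 'Redlink 11'];
console.log(removeRedEndNodes(outline, red));
```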

Here is the problem I've run into: the script currently and erroneously deletes the Redlink 3 list item, because neither $1\* nor $1\\* seems to identify the Redlink 4 list item as having more asterisks in the wikisource than the Redlink 3 list item. I do not know why. What should happen is that Redlink 3 would be retained because of Redlink 4, and after Redlink 4 is removed, then Redlink 3 is checked again and is kept by virtue of having Psychology as a child. But, when Redlink 3 is deleted in error, it makes Psychology a child of Geology, thus ruining the tree structure.

All this processing is to be done in the editor, so that the redlinked entries are actually removed from the article.

I'm stuck! I look forward to your replies. The Transhumanist 23:00, 6 May 2017 (UTC)[reply]

Your patt3 is off for a couple of reasons. First, with the $n regex matches, in general you access them using RegExp.$1 (which will be a string containing the match), not just $1 – except for within String.replace function, when just $1 is used in the replacement string [1]. Secondly, with regex literals, what you type is literally what you get as the regex string. So var patt3 = /$1\*/; will literally be interpreted as /$1\*/ (where $ asserts position at the end of the string; 1 matches the character 1; \* matches the character *).
What you could use instead is var patt3 = new RegExp("\\*{"+(RegExp.$1.length+1)+"}"); which, for example, will give you the regex /\*{3}/ when the RegExp.$1 match is "**" - Evad37 [talk] 04:59, 7 May 2017 (UTC)
I'll try it and will let you know how it works. By the way, what about var patt3 = new RegExp("$1\*");. Why won't that work? (That was the first thing I tried, before going literal). The Transhumanist 23:14, 7 May 2017 (UTC)[reply]
$1 as part of a string doesn't have any special meaning, except within the string .replace function. So var patt3 = new RegExp("$1\*"); would give you the regex /$1*/. To use the actual match instead of $1, you would use var patt3 = new RegExp(RegExp.$1 + "\*"); which would e.g. give you the regex /***/ for a match "**". To actually get valid regex, the match would have to be escaped (note also that the single slash in "\*" doesn't get preserved unless it is double-escaped as "\\*") . - Evad37 [talk] 23:55, 7 May 2017 (UTC)[reply]
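As a variation on the suggestion above, the same child test can be built from the array that match() returns, without relying on RegExp.$1 (a legacy static property that current documentation marks as deprecated). This is a sketch only: the function name nextLineIsChild and the simplified sample pattern are made up for illustration.

```javascript
// Given the parent's captured bullets (for example "**") and the line that
// follows it, report whether that line is a child (has more asterisks).
function nextLineIsChild(parentBullets, nextLine) {
    // Build /^\*{n}/ with n = parent depth + 1. Using the length of the
    // capture avoids having to escape the captured asterisks at all.
    var childPattern = new RegExp('^\\*{' + (parentBullets.length + 1) + '}');
    return childPattern.test(nextLine);
}

var sample = '\n** [[Redlink 3]]\n*** [[Redlink 4]]\n';
var m = sample.match(/\n(\*+) *\[\[Redlink 3\]\].*?\n(.*?\n)/);

// The captured text is the raw characters from the input, not an escaped form
console.log(JSON.stringify(m[1]));
console.log(nextLineIsChild(m[1], m[2]));
console.log(nextLineIsChild('**', '** [[Psychology]]\n'));
```

Anchoring the pattern with ^ and testing only the next line keeps asterisks elsewhere in the match from being mistaken for a child.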
Thank you Evad. Using your code, the script now works, matching about 90% of what it is supposed to. So far, I've cleaned up all the country outlines for Africa. Now working on Asia. I'm not sure why it is skipping some entries that it shouldn't, but I'm sure I'll figure it out by observing as I use it. The Transhumanist 22:39, 11 May 2017 (UTC)[reply]
  1. ^