User:Pearle

NOTE: Pearle is currently not keeping up with all its tasks and needs a complete re-write, which will hopefully happen by August, 2007. -- Beland 16:52, 20 July 2007 (UTC)

Greetings, humans. My name is Pearle Wisebot, and I am an artificial intelligence created by User:Beland. Guess what programming language I am implemented in!

Checkup

 * Category:Pearle edits needing manual cleanup
 * Category:Articles to check for link ordering

Help
I need some tips or help from you an expert for my article HSV Senator Signature. Senators 10:28, 04 November 2006 (UTC) (australia)

Todo

 * The ADD_CFD_TAG template should be updated and tested to point people to the right discussion. (See talk page.)
 * Do another run to clear msg: syntax?
 * After the current meta cleanup on WP:CFD, do another ENFORCE_CFD scan on Category:Categories for deletion, especially R-Z, to see if articles are there that shouldn't be.
 * Unfortunately, this requires some code changes, since the introduction of templates to WP:CFD.
 * Add logic to ENFORCE_CFD command that checks if there are categories on WP:CFD that aren't tagged.
 * If the superlist at User:Pearle/categories-alpha is useful, perhaps it should be ported to a Wikipedia:Namespace? -SV|t 18:09, 17 May 2005 (UTC)
 * Should another msg: run be done? Though deprecated, it appears the software still supports it, and people are still using it. -- Beland 20:32, 28 August 2005 (UTC)

Status
In general:
 * Updates to Template:Opentask occur every day or so.
 * Sorting Category:Wikipedia cleanup now occurs at least weekly.
 * Automation for category-moving is implemented and is in use. WP:CFD is the main source of requests.
 * See also /on-deck.
 * Regular uploads of the alphabetic category listing are occurring (each database dump).

Testing

 * wikify-date sorting, notice posted on Bots/Requests for approvals


 * Algorithm: The same as for cleanup-date. The wikify tag and its known redirects are replaced by a wikify-date tag. Pearle will look at previous versions in the page history, at one-month intervals, to determine the month in which the tag was originally assigned. This is intended to make it easier to use Category:Articles that need to be wikified in a first in, first out fashion.


 * Automatic updates for WP:PNA, request posted on Bots/Requests for approvals

Done

 * The first two large geographic categorization runs for Auto-categorization are complete.
 * Cleanup runs for misclassified CDPs are posted on Auto-categorization. (Fixed by the Rambot.)

A note about WP:CFD and blocking the bot
Pearle has been blocked a few times while in the process of deleting or renaming categories. Typically, an admin will come along, check WP:CFD, not find the discussion in support of the change, and throw up a block. The last time this happened, User:Beland was also auto-blocked, because his edits came from the same IP address. This temporarily prevented discussion about the bot's activities on talk pages.

Pearle does not rename categories without a preceding WP:CFD discussion. Not all such discussions are listed on the main page. Also, there are some "umbrella" renames in which a convention is decided, but not all the categories which need to be renamed are explicitly listed. Please take a careful look through the by-date subpages before concluding there was no consensus to change. There may be legitimate reasons to stop a move - sometimes people using the WP:CFD page don't tag the affected categories as requested, for instance, or there might be a good reason overlooked in the discussion. On the other hand, the bot can just as easily move things back the way they were as it can change them in the first place, so while certainly effective, blocking on sight in these situations might be a bit hasty. Or maybe not. In any case, please do check WP:CFD thoroughly, and make a nomination to undelete or unrename if you think the original decision was wrong.

Kbdank71 does a lot of work on WP:CFD, and has kindly offered to help folks find archived discussions if they are having trouble.

Authorized behavior
Pearle Wisebot has obtained authorization from Wikipedia talk:Bots and has been marked as a bot for the purpose of executing the following tasks. All tasks are performed by User:Beland running Pearle and using data files on his home computer in the following formats.

Alphabetical list of categories

 * 1) Generate a plaintext list, sorted alphabetically, of all categories that existed in the database or were linked to or from an article or subcategory in the latest database dump.  (This is done offline.)
 * 2) Post this list (plus introductory material) to User:Pearle/categories-alpha by completely replacing its contents.  Currently this is 600-700k in length.

Automatically move categories
(Reminder: Run TRANSFER_TEXT_ACTUALLY first!)


 * Parse a file and match commands of the form:
 * MOVE_CONTENTS Category:Name_of_A Category:Name_of_B


 * Download Category:Name_of_A
 * Parse the page to extract all of its member articles and subcategories.
 * For each member, replace all instances of [[Category:Name_of_A]] with [[Category:Name_of_B]], preserving sort fields. Members that contain any nowiki or pre tags in the wikisource will be skipped.
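The steps above can be sketched roughly as follows. This is an illustration in Python, not Pearle's actual code (the bot itself is distributed as pearle.pl); the function names are hypothetical, and the sketch assumes that underscores and spaces are interchangeable in category names and that a sort field is whatever follows a pipe inside the link.

```python
import re

def parse_move_commands(text):
    """Return (old, new) pairs from lines of the form
    'MOVE_CONTENTS Category:Name_of_A Category:Name_of_B'."""
    pairs = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[0] == "MOVE_CONTENTS":
            pairs.append((parts[1], parts[2]))
    return pairs

def _title_pattern(name):
    # Assumption: underscores and spaces are equivalent in a title.
    words = re.split(r"[ _]+", name.strip())
    return r"[ _]+".join(re.escape(word) for word in words)

def move_member(wikitext, old, new):
    """Point links to category `old` at category `new`, keeping any
    sort field. Returns None when the page should be skipped."""
    # Pages containing nowiki or pre tags are left for a human.
    if re.search(r"<\s*(nowiki|pre)\b", wikitext, re.IGNORECASE):
        return None
    old_name = old.split(":", 1)[1]
    new_name = new.split(":", 1)[1].replace("_", " ")
    pattern = re.compile(
        r"\[\[\s*[Cc]ategory\s*:\s*" + _title_pattern(old_name)
        + r"\s*(\|[^\]]*)?\]\]")
    # group(1) is the optional "|sort field" part of the link.
    return pattern.sub(
        lambda m: "[[Category:" + new_name + (m.group(1) or "") + "]]",
        wikitext)
```

For example, a member containing [[Category:Name of A|Smith, John]] would come back with [[Category:Name of B|Smith, John]], and a member containing a nowiki tag would come back as None.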

Moving a category is the equivalent of deletion, so this function will only be run on commands that have been approved by Categories for deletion.

Auto-categorization of articles and categories

 * 1) Parse a file and match commands of the form:
 * ADD_TO_CAT Page_name Category:Category_name


 * 1) Download the wikisource of Page_name
 * 2) Abort if the string "[[Category:Category_name]]" (case insensitive) already appears in the page text
 * 3) Add the string [[Category:Category_name]] on a new line at the bottom of the article.
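A minimal sketch of the abort-or-append logic, in Python for illustration only. The function name is hypothetical, and the exact string that is checked and added is assumed to be the plain category link with underscores shown as spaces.

```python
def add_to_category(wikitext, category):
    """Append a category link at the bottom of the page.
    Returns None (abort) if the link already appears in the text."""
    # Assumption: the link is written with spaces, not underscores.
    link = "[[" + category.replace("_", " ") + "]]"
    # Case-insensitive check, as described in step 2.
    if link.lower() in wikitext.lower():
        return None
    return wikitext.rstrip("\n") + "\n" + link + "\n"
```

Note that a literal string check like this one would not notice an existing link that carries a sort field; a real run would need to decide how to treat that case.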

Advance notice of at least three days on Auto-categorization will be given before automatically-generated lists of articles and categories are fed in.

Remove articles from a category

 * 1) Accept commands of the following form:
 * REMOVE_X_FROM_CAT Page_name Category:Category_name


 * 1) Download the wikisource of Page_name
 * 2) Remove the string [[Category:Category_name]] from the text
 * 3) Post the new text
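The removal step might look something like the sketch below (Python, illustrative only; the function name is hypothetical). It assumes that when the link sits alone on a line the whole line should go, so that no blank line is left behind.

```python
def remove_from_category(wikitext, category):
    """Remove a category link from the page text."""
    # Assumption: the link is written with spaces, not underscores.
    link = "[[" + category.replace("_", " ") + "]]"
    # Drop lines that contain nothing but the link.
    kept = [line for line in wikitext.split("\n") if line.strip() != link]
    # Remove any remaining inline occurrences.
    return "\n".join(kept).replace(link, "")
```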

Tag categories with the CFD tag
Categories nominated to Categories for deletion need to be tagged with the CFD tag or a similar template to inform watchers of the potential deletion or renaming. Pearle can do this with commands of the form:
 * ADD_CFD_TAG Category:Category_name_here

For nominations en masse, the tag should be changed to e.g.:

Old behavior
When a category is renamed, the introductory text in the old category (which also defines the parentage of the category) must be moved to the new location. The command:


 * TRANSFER_TEXT_ACTUALLY Category:Old_category_name Category:New_category_name

...will do this, plus leave a pointer behind in the old category and a short message to let readers know what is going on. If the new category already exists, the command attempts to merge the two lists of parent categories and concatenates the intro texts. For this reason, the output of this command must be manually checked to see if the new intro needs to be fixed.

Some editors have requested that this step be taken before the member articles and subcategories are moved.


 * TRANSFER_TEXT_CHECK Category:Old_category_name Category:New_category_name

can be run to check for intros that will need to be manually fixed when TRANSFER_TEXT_ACTUALLY is run, but without doing any edits.

New behavior

 * TRANSFER_TEXT_ACTUALLY Category:Old_category_name Category:New_category_name

is automatically run before a category rename. A message is left in Category:Old_category_name with a pointer to Category:New_category_name. The introductory text from Category:Old_category_name is moved to Category:New_category_name. If there was already some text in the new category, the new category is tagged and shows up in Category:Pearle edits needing manual cleanup.

REMOVE_CFD_TAG

 * See Wikipedia talk:Bots

New category/interwiki style
Minor changes and bugfixes may occur in response to community complaints or suggestions.

Rules

 * Pearle should attempt to do a category/interwiki cleanup whenever it edits an article, but there will be no mass cleanup run (except for articles already edited by Pearle) unless requested.
 * HTML comments on the same line following a category or interwiki tag will remain there. Any other text there will trigger a flag for review.
 * If a category or interwiki tag is found in the "body text" area, it will be flagged for review.
 * Canonicalize "zh-cn" (Chinese simplified) and "zh-tw" (Chinese traditional) to "zh" because the simplified/traditional distinction is now being solved in software.
 * Canonicalize "minnan" to "zh-min-nan", since only the latter is in the official, automatically updated list.
 * Canonicalize "nb" to "no", since only the latter is in the official, automatically updated list. (Added after observing the need for this in practice. -- Beland 4 July 2005 17:03 (UTC))
 * Canonicalize dk to da. (Same as above. -- Beland 02:48, 25 August 2005 (UTC))
 * Multi-line HTML comments must be preserved
 * Separate category and interwiki links mashed together on the same line.
 * Don't change interwiki link sort order.
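The four canonicalization rules in this list can be expressed as a small lookup table. The following Python sketch is illustrative only; the regular expression for recognizing an interwiki link is an assumption and is deliberately narrow (lowercase prefixes, no piped links).

```python
import re

# Prefix rewrites taken from the rules above.
CANONICAL_PREFIX = {
    "zh-cn": "zh",
    "zh-tw": "zh",
    "minnan": "zh-min-nan",
    "nb": "no",
    "dk": "da",
}

# Assumed shape of an interwiki link: [[prefix:Title]]
INTERWIKI = re.compile(r"\[\[\s*([a-z][a-z-]*)\s*:\s*([^\]|]+?)\s*\]\]")

def canonicalize_interwiki(text):
    """Rewrite non-canonical language prefixes; leave other links alone."""
    def fix(match):
        prefix = match.group(1)
        if prefix not in CANONICAL_PREFIX:
            return match.group(0)
        return "[[" + CANONICAL_PREFIX[prefix] + ":" + match.group(2) + "]]"
    return INTERWIKI.sub(fix, text)
```

Because links are rewritten in place, this sketch also respects the rule that interwiki sort order must not change.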

Algorithm

 * Break the article up into segments, each of which is tagged. Use two arrays, one for content, and one for names.


 * Parse input into segments, each of which is labeled by type.
 * Find nowiki tags everywhere.
 * Find comment tags everywhere else.
 * Find HTML tags everywhere else.
 * Find category links everywhere else.
 * Find interwiki links everywhere else.
 * Find template tags everywhere else.
 * Lump html tags following a category segment (except category and interwiki links) until the next newline into the category segment.
 * Lump everything following an interwiki segment (except category and interwiki links) until the next newline into the interwiki segment.
 * The remainder of the page will be tagged as body text.
 * Move any category or interwiki links at the top of the page to the very bottom.
 * Move before the category links, preserving whatever whitespace preceded or followed them.
 * Delete these comments near the category/interwiki section (case and whitespace insensitive):


 * Determine whether or not the page should be flagged for manual review. Find the last non-category, non-interwiki segment.  If there are any interwiki or category links before this segment, flag the page for manual review by adding a template at the end.
 * If the page has not been flagged: consolidate all interwiki links at the end, preceded by category links, preceded by all other segment types. Be sure to retain the original order of segments in each of the three groups.
 * If there are interwiki links, precede them with a line that says:
 * This practice has been discontinued at the request of User:Cburnett.
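A much-simplified sketch of the segment, flag, and consolidate steps is given below, in Python for illustration. It is not Pearle's implementation: all names are hypothetical, it scans left to right with a single pattern instead of making separate passes per tag type, it ignores HTML and template tags, it does not lump trailing comments into category segments, and it skips whitespace-only text when deciding whether to flag a page.

```python
import re

# One alternative per segment type; unmatched text becomes "body".
TOKEN = re.compile(
    r"(?P<nowiki><nowiki>.*?</nowiki>)"
    r"|(?P<comment><!--.*?-->)"
    r"|(?P<category>\[\[\s*[Cc]ategory\s*:[^\]]*\]\])"
    r"|(?P<interwiki>\[\[\s*[a-z]{2,3}(?:-[a-z-]+)?\s*:[^\]]*\]\])",
    re.DOTALL)

LINKS = ("category", "interwiki")

def segment(wikitext):
    """Split wikitext into a list of (kind, text) pairs."""
    segments, pos = [], 0
    for match in TOKEN.finditer(wikitext):
        if match.start() > pos:
            segments.append(("body", wikitext[pos:match.start()]))
        segments.append((match.lastgroup, match.group(0)))
        pos = match.end()
    if pos < len(wikitext):
        segments.append(("body", wikitext[pos:]))
    return segments

def needs_review(segments):
    """True if a category or interwiki link precedes the last
    segment that is neither of those (blank text is ignored)."""
    significant = [(k, t) for k, t in segments if k in LINKS or t.strip()]
    last_other = max(
        (i for i, (k, _) in enumerate(significant) if k not in LINKS),
        default=0)
    return any(k in LINKS for k, _ in significant[:last_other])

def consolidate(segments):
    """Everything else first, then category links, then interwiki
    links, keeping the original order inside each group."""
    other = "".join(t for k, t in segments if k not in LINKS).rstrip()
    parts = [other]
    for kind in LINKS:
        group = [t for k, t in segments if k == kind]
        if group:
            parts.append("\n".join(group))
    return "\n\n".join(parts) + "\n"
```

A page would only be passed to consolidate when needs_review returns False; otherwise it would receive the review template instead.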

msg: syntax cleanup
The msg: template syntax is deprecated in favor of the plain template syntax without the msg: prefix. Pearle is authorized to make this change wherever it is needed. The msg: syntax was rumored to break in MediaWiki 1.5, though it is apparently still working.
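As a rough illustration, the cleanup amounts to a single substitution. The Python sketch below assumes the old form looks like {{msg:Name}} and the new form like {{Name}}; the function name is hypothetical, and the sketch makes no attempt to skip nowiki or pre sections.

```python
import re

# Assumed old form: "{{msg:Name}}", possibly with stray spaces.
MSG_PREFIX = re.compile(r"\{\{\s*msg\s*:\s*", re.IGNORECASE)

def strip_msg_prefix(wikitext):
    """Rewrite {{msg:Name}} as {{Name}}."""
    return MSG_PREFIX.sub("{{", wikitext)
```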

Template:cleanup-date conversion
I have created Template:cleanup-date to replace Template:cleanup. It is intended to make it easier to use Category:Wikipedia cleanup in a first in, first out fashion. Pearle is now authorized to change cleanup tags to cleanup-date tags. For all remaining uses of the cleanup tag, Pearle will look at previous versions at one-month intervals, to determine the month in which the tag was originally assigned. Per-month pages in Category:Cleanup by month should be sorted through by hand (for many of the articles there are either already cleaned up, or were never tagged in the first place), and the appropriate tags applied. -- Beland 23:39, 17 August 2005 (UTC)


 * Template:cleanup-date has been obsoleted by an upgrade to Template:Cleanup. -- Beland 22:51, 13 December 2005 (UTC)
 * This has been reverted, so that cleanup-date is still needed. -- Beland 03:30, 20 December 2005 (UTC)
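The history scan described above (stepping back through old versions at one-month intervals) could be sketched as follows. This is Python for illustration only; text_as_of is a hypothetical callable standing in for whatever fetches the page text as it stood on a given date, and the tag pattern covers only the bare cleanup tag, not its redirects.

```python
import re
from datetime import date

TAG = re.compile(r"\{\{\s*cleanup\s*\}\}", re.IGNORECASE)

def previous_month(d):
    """First day of the month before the one containing d."""
    if d.month == 1:
        return date(d.year - 1, 12, 1)
    return date(d.year, d.month - 1, 1)

def month_tag_was_added(text_as_of, today, max_months=60):
    """Walk back one month at a time from a page that carries the
    tag today. text_as_of(d) returns the wikitext as of date d, or
    None if the page did not exist yet. Returns the first day of
    the month in which the tag appears to have been added."""
    month = date(today.year, today.month, 1)
    for _ in range(max_months):
        snapshot = text_as_of(month)
        # Tag missing at the start of this month, so it was added
        # at some point during it.
        if snapshot is None or not TAG.search(snapshot):
            return month
        month = previous_month(month)
    return month
```

Sampling once a month keeps the number of page fetches low, at the cost of missing a tag that was removed and re-added between two samples.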

Template:Opentask updates
How it works:
 * After downloading the wikitext for Template:Opentask, Pearle looks for pairs of pre-defined HTML comments. Everything between these comments is replaced by a list of articles. The list is no longer than a pre-defined number of characters, to keep the line width reasonable. The updated wikitext is then posted to the live site. Pearle keeps an offline record of how many times each article is featured for a given category, so that Beland can monitor effectiveness. The update is run once a day, and Beland reviews a report each time to make sure that the bot is working properly and that it has not run out of articles for any of the lines. The report also notes if any articles have been featured more than 7 times without being fixed, so that they can be reported to the Cleanup Taskforce.
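A sketch of the replace-between-comments step is shown below, in Python for illustration. The marker strings, the comma separator, and the function name are all assumptions; the actual comments used on Template:Opentask are not reproduced here.

```python
def replace_between(wikitext, start_marker, end_marker, titles, max_chars):
    """Replace the text between two marker comments with a
    comma-separated list of links no longer than max_chars."""
    start = wikitext.index(start_marker) + len(start_marker)
    end = wikitext.index(end_marker, start)
    links, used = [], 0
    for title in titles:
        link = "[[" + title + "]]"
        cost = len(link) + (2 if links else 0)  # 2 for the ", " separator
        if used + cost > max_chars:
            break
        links.append(link)
        used += cost
    return wikitext[:start] + ", ".join(links) + wikitext[end:]
```

With the hypothetical markers "<!-- BEGIN wikify -->" and "<!-- END wikify -->", only the text between them changes; everything else in the template is passed through untouched.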

Rationale:


 * This template used to be maintained manually, but that was a lot of work, and articles would generally be left on until they were fixed. This meant that editors had to check all the selected articles, every day, to see whether or not they had been fixed. Some lines got "stale" and productivity suffered.


 * Now, Pearle supplies a completely fresh batch of articles every day. Because she remembers which articles have been featured in the past, she can cycle through the whole list for each category, only recycling entries once all articles have gotten a chance to be featured. Hopefully this increases the chances that editors will serendipitously spot an article of interest to them on the template.


 * Instead of relying on interested editors to pick their personal favorites to feature, Pearle draws from a broader pool of candidates, to more evenly distribute publicity.

How articles are chosen:


 * For most lines, there is a specific category to pull from. Pearle looks at the live site to get a complete list of articles in that category.  This is compared to the list of articles that have been featured in the past.  Articles are given a rank, with less-frequently-featured articles first, and after that a random sort.  The highest-rank articles are added to the list, and if there's room for short lower-rank titles, those are also added.


 * The above-described mechanism is used for:
 * Category:Articles to be expanded
 * Category:Wikipedia articles needing style editing
 * Category:Wikipedia articles in need of updating
 * Category:Wikipedia articles needing factual verification
 * Category:Articles that need to be wikified
 * Category:Articles to be merged
 * Category:NPOV disputes


 * The same mechanism is used for cleanup, but it combines several categories:
 * Category:Wikipedia articles needing priority cleanup
 * One or more of the oldest months from Category:Cleanup by month


 * The same selection algorithm is also applied to stubs in an HTML comment-delimited area on Most wanted stubs. Lines that start with anything other than "# [[" are skipped.  This means that stubs which are struck through are not selected.  Pearle relies on human editors to keep this list up to date, but that's easier than maintaining both this page and Template:Opentask.  It's also not much harm if a non-stub gets a little extra attention before it is purged from the Most Wanted page by an update from a new database dump.


 * Requested articles are selected in the same way from Articles requested for more than a year. Only article links appearing at the beginning of a line will be selected.  (This is for compatibility with a different bot that maintains this page.)  When the backlog on this page has been reduced to more manageable levels, Most wanted articles (which is a little harder to parse) will also be added.  A bot removes bluelinks from "Articles requested for more than a year", but human editors are needed to keep "Most wanted articles" free of them.


 * A similar mechanism is used for Category:Wikipedia articles needing copy edit; the only difference is that all articles in the category are given equal priority. This is because this category has very high turnover, and we want articles that are "stuck" to be featured occasionally, so they will eventually be reported to the Cleanup Taskforce if they don't get fixed.
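The ranking described above (least-featured articles first, random order among ties, then filling leftover space with shorter titles) can be sketched like this. The Python below is illustrative only; the function name, the shape of the feature-count record, and the two-character separator cost are assumptions.

```python
import random

def choose_articles(candidates, times_featured, max_chars, rng=random):
    """Pick titles for one line of the template.
    times_featured maps a title to how often it has been featured."""
    shuffled = list(candidates)
    rng.shuffle(shuffled)
    # sorted() is stable, so the shuffle decides order among ties.
    ranked = sorted(shuffled, key=lambda title: times_featured.get(title, 0))
    chosen, used = [], 0
    for title in ranked:
        cost = len(title) + (2 if chosen else 0)  # 2 for a separator
        # Keep going after a title that does not fit, so that a
        # shorter, lower-ranked title can still use the space.
        if used + cost <= max_chars:
            chosen.append(title)
            used += cost
    return chosen
```

For the copy-edit category, where all articles get equal priority, the same function could be called with an empty record so that only the shuffle determines the order.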

Offline reports
Using database dumps, Pearle can make offline reports. The following are uploaded automatically:


 * User:Pearle/categories-alpha (~700k!) contains a plaintext list, sorted alphabetically, of all categories that existed in the database or were linked to from an article or subcategory in the latest database dump.

Beland manually uploads the following:


 * Pearle helps produce the reports found on Auto-categorization.
 * todo.orphaned-categories.txt - Manually posted to Category:Orphaned categories after excluding those recently listed there or on CFD.
 * todo.funny-categories.txt - Categories with unusual characters in their names. After each database dump, Beland marks these as "OK, ignore" or fixes them.  This includes categories with double spaces, because doing this check online would have very little benefit for a lot of additional server load.

By request
Note: Instead of waiting for me to get around to running Pearle to generate category reports, you can now use a live web tool!


 * Pearle can generate reports showing the tree structure of category space from any starting point, to an arbitrary depth. Leave a message on User talk:Beland if you would like a tree generated.  Please specify the starting category, direction (up or down), whether or not you want non-category elements (like articles) included, and you may optionally specify a maximum depth.
 * Reports on categories or articles that have particular terms in a title, or appear in a particular part of category space are also easily made.
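For readers curious how such a tree report might be produced, here is a minimal Python sketch. It assumes the category links from a dump have already been loaded into a dictionary mapping each category to its children (or to its parents, for an upward tree); the function name and output format are hypothetical.

```python
def category_tree(graph, start, max_depth=None):
    """Return indented lines describing the tree below `start`.
    graph maps a category name to an iterable of linked names."""
    lines = []

    def walk(node, depth, path):
        lines.append("  " * depth + node)
        if max_depth is not None and depth >= max_depth:
            return
        for child in sorted(graph.get(node, ())):
            if child in path:
                # Category space can contain loops; note and stop.
                lines.append("  " * (depth + 1) + child + " (cycle)")
            else:
                walk(child, depth + 1, path | {child})

    walk(start, 0, {start})
    return lines
```

Including or excluding articles then comes down to whether article titles are present in the graph that is passed in.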

Download
You cannot run the Pearle bot itself, but you can run a clone, if you apply for permission and follow the other setup instructions in the documentation. Contributions are also welcome.


 * User:Pearle/pearle.pl - Updated 12 Nov 2005
 * User:Pearle/pearle-documentation.txt
 * User:Pearle/cookies.fake.txt

How do I use this thing?
If you wish to move or delete a category, please make a nomination on WP:CFD. If you want to clone Pearle (not a simple, fast thing to do), you can download the code. For all other requests, feel free to leave a note on User talk:Beland asking for a certain command to be run on the original bot.