Universal Coded Character Set

Alias(es): UCS, Unicode
Language(s): International
Standard: ISO/IEC 10646
Encoding formats: UTF-8, UTF-16, GB 18030 (less common: UTF-32, BOCU, SCSU, UTF-7)
Preceded by: ISO/IEC 8859, ISO/IEC 2022, various others

The Universal Coded Character Set (UCS, Unicode) is a standard set of characters defined by the international standard ISO/IEC 10646, Information technology — Universal Coded Character Set (UCS) (plus amendments to that standard). It is the basis of many character encodings and continues to grow as characters from previously unrepresented writing systems are added.

The UCS has over 1.1 million possible code points available for allocation, but only the first 65,536, which make up the Basic Multilingual Plane (BMP), had entered into common use before 2000. This situation began changing when the People's Republic of China (PRC) ruled in 2006 that all software sold in its jurisdiction would have to support GB 18030, which required software intended for sale in the PRC to move beyond the BMP.
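As a rough illustration of the codespace described above, the following sketch computes its size and reports which plane a code point falls in. The figure of 17 planes of 65,536 code points is the layout of the current standard; the function names plane_of and in_bmp are hypothetical helpers introduced only for this example.

```python
# Sketch: locate a code point within the UCS codespace.
PLANE_SIZE = 0x10000                        # 65,536 code points per plane
PLANE_COUNT = 17                            # planes 0 through 16
CODESPACE_SIZE = PLANE_SIZE * PLANE_COUNT   # 1,114,112 code points in total


def plane_of(code_point):
    """Return the plane number (0 is the BMP) that contains code_point."""
    if not 0 <= code_point < CODESPACE_SIZE:
        raise ValueError(f"{code_point:#x} is outside the UCS codespace")
    return code_point // PLANE_SIZE


def in_bmp(code_point):
    """Return True if code_point lies in the Basic Multilingual Plane."""
    return plane_of(code_point) == 0


print(CODESPACE_SIZE)         # size of the whole codespace
print(plane_of(ord("A")))     # a BMP character
print(plane_of(0x20000))      # a code point beyond the BMP
```

Software that assumes every character satisfies in_bmp is the kind that had to change once code points beyond the BMP came into real use.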

The system deliberately leaves many code points not assigned to characters, even in the BMP. It does this to allow for future expansion or to minimise conflicts with other encoding forms.

The original edition of the UCS defined UTF-16, an extension of UCS-2, to represent code points outside the BMP. A range of code points in the S (Special) Zone of the BMP remains unassigned to characters. UCS-2 disallows use of code values for these code points, but UTF-16 allows their use in pairs. Unicode also adopted UTF-16, but in Unicode terminology the high-half zone elements are called "high surrogates" and the low-half zone elements "low surrogates".
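The pairing mechanism can be sketched as follows. This is a minimal illustration of the standard UTF-16 arithmetic, not text from ISO/IEC 10646 itself; the function names to_surrogates and from_surrogates are hypothetical.

```python
# Sketch: split a code point outside the BMP into a UTF-16 surrogate pair.
HIGH_START = 0xD800   # first high surrogate (high-half zone)
LOW_START = 0xDC00    # first low surrogate (low-half zone)


def to_surrogates(code_point):
    """Return the (high, low) pair for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points outside the BMP need a pair")
    offset = code_point - 0x10000            # a 20-bit value
    high = HIGH_START + (offset >> 10)       # top 10 bits
    low = LOW_START + (offset & 0x3FF)       # bottom 10 bits
    return high, low


def from_surrogates(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


high, low = to_surrogates(0x1F600)
print(f"{high:04X} {low:04X}")

# Cross-check against Python's built-in UTF-16 encoder.
assert "\U0001F600".encode("utf-16-be") == bytes.fromhex(f"{high:04X}{low:04X}")
```

Because each half carries 10 bits, the pairs cover exactly 1,048,576 code points above the BMP.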

Another encoding, UTF-32 (previously named UCS-4), uses four bytes (32 bits in total) to encode a single character of the codespace. UTF-32 thereby gives every code point (as of 2024) a fixed-width binary representation that APIs and software applications can use directly.
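The fixed width can be seen with Python's built-in "utf-32-be" codec (big-endian, no byte order mark). The helper name utf32_units is hypothetical and the sample string is arbitrary.

```python
# Sketch: every character occupies exactly four bytes in UTF-32.
def utf32_units(text):
    """Return the 32-bit code units of text as a list of integers."""
    data = text.encode("utf-32-be")
    return [int.from_bytes(data[i:i + 4], "big") for i in range(0, len(data), 4)]


sample = "A\u00e9\U0001F600"   # one ASCII, one Latin-1, one non-BMP character
units = utf32_units(sample)

print(len(sample.encode("utf-32-be")))   # byte length: 4 per character
print([f"{u:08X}" for u in units])       # each unit equals the code point
```

Each unit is simply the code point number, which is why the form is convenient for internal processing even though it is wasteful for storage.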

History

The International Organization for Standardization (ISO) set out to compose the universal character set in 1989, and published the draft of ISO 10646 in 1990. Hugh McGregor Ross was one of its principal architects.

This work happened independently of the development of the Unicode standard, which Xerox and Apple had been developing since 1987.

The original ISO 10646 draft differed markedly from the current standard. It defined:

  • 128 groups of
  • 256 planes of
  • 256 rows of
  • 256 cells,

for an apparent total of 2,147,483,648 characters, but actually the standard could code only 679,477,248 characters, as the policy forbade byte values of C0 and C1 control codes (0x00 to 0x1F and 0x80 to 0x9F, in hexadecimal notation) in any one of the four bytes specifying a group, plane, row and cell. The Latin capital letter A, for example, had a location in group 0x20, plane 0x20, row 0x20, cell 0x41.
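The two totals above can be reproduced with a short calculation. The sketch assumes that the 128 groups correspond to group byte values 0x00 to 0x7F, which is an inference from the group count rather than something stated in the text; the helper name allowed is hypothetical.

```python
# Sketch: character counts of the original ISO 10646 draft.
GROUPS, PLANES, ROWS, CELLS = 128, 256, 256, 256
apparent_total = GROUPS * PLANES * ROWS * CELLS

C0 = range(0x00, 0x20)   # C0 control codes, 0x00 to 0x1F
C1 = range(0x80, 0xA0)   # C1 control codes, 0x80 to 0x9F


def allowed(byte):
    """Return True if byte is not a C0 or C1 control code value."""
    return byte not in C0 and byte not in C1


# Assumption: group bytes run from 0x00 to 0x7F (128 groups).
group_values = sum(allowed(b) for b in range(GROUPS))   # 96 usable values
other_values = sum(allowed(b) for b in range(256))      # 192 usable values

usable_total = group_values * other_values ** 3

print(apparent_total)
print(usable_total)

# The draft location of Latin capital letter A uses only permitted bytes.
assert all(allowed(b) for b in (0x20, 0x20, 0x20, 0x41))
```

The same restriction leaves 192 × 192 = 36,864 usable positions in a single plane.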

One could code the characters of this primordial ISO/IEC 10646 standard in one of three ways:

  1. UCS-4, four bytes for every character, enabling the simple encoding of all characters;
  2. UCS-2, two bytes for every character, which directly encoded the first plane (0x20, the Basic Multilingual Plane, containing the first 36,864 code points) and reached other planes and groups by switching to them with ISO/IEC 2022 escape sequences;
  3. UTF-1, which encodes all the characters in sequences of bytes of varying length (1 to 5 bytes, none of which is a control code).

In 1990, therefore, two initiatives for a universal character set existed: Unicode, with 16 bits for every character (65,536 possible characters), and ISO/IEC 10646. The software companies refused to accept the complexity and size requirement of the ISO standard and were able to convince a number of ISO National Bodies to vote against it. ISO officials realised they could not continue to support the standard in its then-current state and negotiated the unification of their standard with Unicode. Two changes took place: the lifting of the prohibition on control code values, which opened more code points for allocation; and the synchronisation of the repertoire of the Basic Multilingual Plane with that of Unicode.

Meanwhile, the Unicode standard itself changed: 65,536 characters came to appear insufficient, and from version 2.0 onwards the standard supports encoding of 1,112,064 code points from 17 planes by means of the UTF-16 surrogate mechanism. For that reason, ISO/IEC 10646 was limited to contain as many characters as could be encoded by UTF-16 and no more, that is, a little over a million characters instead of over 679 million. The UCS-4 encoding of ISO/IEC 10646 was incorporated into the Unicode standard with the limitation to the UTF-16 range and under the name UTF-32, although it has almost no use outside programs' internal data.
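The figure of 1,112,064 follows from the 17 planes once the surrogate code points, which cannot stand for characters on their own, are subtracted. A short check, using hypothetical variable names:

```python
# Sketch: where the 1,112,064 figure comes from.
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000            # 65,536
SURROGATES = 0xDFFF - 0xD800 + 1           # 2,048 reserved code points

encodable = PLANES * CODE_POINTS_PER_PLANE - SURROGATES
print(encodable)

# Same total built from the UTF-16 side: the BMP minus the surrogates,
# plus every combination of 1,024 high and 1,024 low surrogates.
from_utf16 = (CODE_POINTS_PER_PLANE - SURROGATES) + 1024 * 1024
assert encodable == from_utf16
```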

Rob Pike and Ken Thompson, the designers of the Plan 9 operating system, devised a new, fast and well-designed variable-width encoding that was also backward-compatible with 7-bit ASCII. It came to be called UTF-8,[1] and is currently the most popular UCS encoding.
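Both properties, variable width and ASCII compatibility, are easy to observe with Python's built-in codecs. The sample characters below are arbitrary choices for illustration.

```python
# Sketch: UTF-8 is variable-width and leaves 7-bit ASCII unchanged.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]   # A, e-acute, euro sign, emoji
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)   # number of bytes used for each character

# Text that is pure ASCII produces identical bytes in both encodings.
ascii_text = "plain ASCII text"
assert ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Bytes of a multi-byte sequence all have the high bit set, so they can
# never be mistaken for ASCII characters.
assert all(byte >= 0x80 for byte in "\u20ac".encode("utf-8"))
```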

Differences from Unicode

ISO/IEC 10646 and Unicode have an identical repertoire and numbering: the same characters with the same numbers exist in both standards, although Unicode releases new versions and adds new characters more often. Unicode also has rules and specifications outside the scope of ISO/IEC 10646. ISO/IEC 10646 is a simple character map, an extension of previous standards like ISO/IEC 8859. In contrast, Unicode adds rules for collation, normalisation of forms, and the bidirectional algorithm for right-to-left scripts such as Arabic and Hebrew. For interoperability between platforms, especially if bidirectional scripts are used, it is not enough to support ISO/IEC 10646; Unicode must be implemented.
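Normalisation is one example of a rule that a plain character map cannot express. The sketch below uses Python's standard unicodedata module to show two different code point sequences that Unicode treats as equivalent; the sample character is an arbitrary choice.

```python
# Sketch: two code point sequences for the same text, unified by normalisation.
import unicodedata

composed = "\u00e9"       # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"    # letter e followed by COMBINING ACUTE ACCENT

# As raw code point sequences the two strings differ.
print(composed == decomposed)

# After normalisation to the same form they compare equal.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)
print(nfc == composed, nfd == decomposed)
```

A system that only knows the code point assignments would treat the two spellings as different strings.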

To support these rules and algorithms, Unicode adds many properties to each character in the set such as properties determining a character's default bidirectional class and properties to determine how the character combines with other characters. If the character represents a numeric value such as the European number '8', or the vulgar fraction '¼', that numeric value is also added as a property of the character. Unicode intends these properties to support interoperable text handling with a mixture of languages.
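Several of these properties can be inspected through Python's standard unicodedata module, which exposes data from the Unicode Character Database. The characters queried below are arbitrary examples.

```python
# Sketch: reading Unicode character properties.
import unicodedata

# Default bidirectional class: L (left-to-right), R (right-to-left),
# AL (Arabic letter), EN (European number).
bidi = {
    "A": unicodedata.bidirectional("A"),
    "alef (Hebrew)": unicodedata.bidirectional("\u05d0"),
    "alef (Arabic)": unicodedata.bidirectional("\u0627"),
    "8": unicodedata.bidirectional("8"),
}
print(bidi)

# Canonical combining class: 0 for ordinary characters, non-zero for
# combining marks such as COMBINING ACUTE ACCENT.
print(unicodedata.combining("A"), unicodedata.combining("\u0301"))

# Numeric value, defined for digits and for fractions alike.
print(unicodedata.numeric("8"), unicodedata.numeric("\u00bc"))
```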

Some applications support ISO/IEC 10646 characters but do not fully support Unicode. One such application, Xterm, can properly display all ISO/IEC 10646 characters that have a one-to-one character-to-glyph mapping and a single directionality. It can handle some combining marks by simple overstriking methods, but cannot display Hebrew (bidirectional), Devanagari (one character to many glyphs) or Arabic (both features). Most GUI applications use standard OS text drawing routines which handle such scripts, although the applications themselves still do not always handle them correctly.

Citing the Universal Coded Character Set

ISO/IEC 10646 is a general, informal citation for the ISO/IEC 10646 family of standards and is acceptable in most prose. Although Unicode is a separate standard, the term Unicode is used just as often, informally, when discussing the UCS. However, any normative reference to the UCS as a publication should cite the year of the edition in the form ISO/IEC 10646:{year}, for example: ISO/IEC 10646:2014.

Relationship with Unicode

Since 1991, the Unicode Consortium and the ISO/IEC have developed The Unicode Standard ("Unicode") and ISO/IEC 10646 in tandem. The repertoire, character names, and code points of Unicode Version 2.0 exactly match those of ISO/IEC 10646-1:1993 with its first seven published amendments. After Unicode 3.0 was published in February 2000, corresponding new and updated characters entered the UCS via ISO/IEC 10646-1:2000. In 2003, parts 1 and 2 of ISO/IEC 10646 were combined into a single part, which has since had a number of amendments adding characters to the standard in approximate synchrony with the Unicode standard.

  • ISO/IEC 10646-1:1993 = Unicode 1.1
  • ISO/IEC 10646-1:1993 plus Amendments 5 to 7 = Unicode 2.0
  • ISO/IEC 10646-1:1993 plus Amendments 5 to 7 = Unicode 2.1 excluding Euro sign and Object Replacement Character, which are included in Amendment 18
  • ISO/IEC 10646-1:2000 = Unicode 3.0
  • ISO/IEC 10646-1:2000 and ISO/IEC 10646-2:2001 = Unicode 3.1
  • ISO/IEC 10646-1:2000 plus Amendment 1 and ISO/IEC 10646-2:2001 = Unicode 3.2
  • ISO/IEC 10646:2003 = Unicode 4.0
  • ISO/IEC 10646:2003 plus Amendment 1 = Unicode 4.1
  • ISO/IEC 10646:2003 plus Amendments 1 to 2 = Unicode 5.0 excluding Devanagari letters GGA, JJA, DDDA and BBA, which are included in Amendment 3
  • ISO/IEC 10646:2003 plus Amendments 1 to 4 = Unicode 5.1
  • ISO/IEC 10646:2003 plus Amendments 1 to 6 = Unicode 5.2
  • ISO/IEC 10646:2003 plus Amendments 1 to 8 = ISO/IEC 10646:2011 = Unicode 6.0 excluding Indian rupee sign
  • ISO/IEC 10646:2012 = Unicode 6.1
  • ISO/IEC 10646:2012 = Unicode 6.2 excluding Turkish lira sign, which is included in Amendment 1
  • ISO/IEC 10646:2012 = Unicode 6.3 excluding Turkish lira sign, which is included in Amendment 1, and five bidirectional control characters (Arabic Letter Mark, Left-To-Right Isolate, Right-To-Left Isolate, First Strong Isolate, Pop Directional Isolate), which are included in Amendment 2
  • ISO/IEC 10646:2012 plus Amendments 1 and 2 = Unicode 7.0 excluding the Ruble sign
  • ISO/IEC 10646:2014 plus Amendment 1 = Unicode 8.0 excluding the Lari sign, nine CJK unified ideographs, and 41 emoji characters
  • ISO/IEC 10646:2014 plus Amendments 1 and 2 = Unicode 9.0 excluding Adlam, Newa, Japanese TV symbols, and 74 emoji and symbols
  • ISO/IEC 10646:2017 = Unicode 10.0 excluding 285 Hentaigana characters, 3 Zanabazar Square characters, and 56 emoji symbols
  • ISO/IEC 10646:2017 plus Amendment 1 = Unicode 11.0 excluding 46 Mtavruli Georgian capital letters, 5 CJK unified ideographs, and 66 emoji characters
  • ISO/IEC 10646:2017 plus Amendments 1 and 2 = Unicode 12.0 excluding 62 additional characters
  • ISO/IEC 10646:2020 = Unicode 13.0
  • ISO/IEC 10646:2021 = Unicode 14.0

References

  1. Pike, Rob (2003-04-03). "UTF-8 history". Archived from the original on 2016-05-23.
