Friday, 3 February 2012

Unicode over 60 percent of the web

Posted on 11:52 by Unknown
Computers store every piece of text using a “character encoding,” which assigns a number to each character. For example, the byte 61 (in hexadecimal) stands for ‘a’ and 62 stands for ‘b’ in the ASCII encoding, which was launched in 1963. Before the web, computer systems were siloed, and there were hundreds of different encodings. Depending on the encoding, the byte C1 could mean any of ¡, Ё, Ą, Ħ, ‘, ”, or parts of thousands of characters, from æ to 品. If you brought a file from one computer to another, it could come out as gobbledygook.
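To make that concrete, here is a small Python sketch that decodes the same byte under a few legacy encodings. The four codec names are standard-library codecs picked purely for illustration; they are an assumption of this example and not necessarily the encodings behind the characters listed above.

```python
# The single byte C1 (hexadecimal), as mentioned in the text.
RAW = bytes([0xC1])

# Illustrative choice of Python standard-library codecs (an assumption of this sketch).
ENCODINGS = ["latin-1", "mac_roman", "cp1251", "iso8859_7"]


def decode_everywhere(raw, encodings=ENCODINGS):
    """Map each encoding name to the text that `raw` decodes to under it."""
    return {name: raw.decode(name) for name in encodings}


# In ASCII, the bytes 61 and 62 (hexadecimal) are 'a' and 'b'.
print(bytes([0x61, 0x62]).decode("ascii"))

# The same byte yields a different character in each encoding.
for name, text in decode_everywhere(RAW).items():
    print(f"{name:10} 0xC1 -> {text!r} (U+{ord(text):04X})")
```

Each line of output shows a different character for the identical byte, which is exactly why a file moved between systems could turn into gobbledygook.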

Unicode was invented to solve that problem: to encode all human languages, from Chinese (中文) to Russian (русский) to Arabic (العربية), and even emoji symbols; it encodes nearly 75,000 Chinese ideographs alone. In the ASCII encoding, there wasn’t even enough room for all the English punctuation (like curly quotes), while Unicode has room for over a million characters. Unicode was first published in 1991, coincidentally the year the World Wide Web debuted. Little did anyone realize at the time that they would be so important for each other. Today, people can easily share documents on the web, no matter what their language.

Every January, we look at the percentage of the webpages in our index that are in different encodings. Here’s what our data looks like with the latest figures*:

*Your mileage may vary: these figures may vary somewhat from what other search engines find. The graph lumps together encodings by script. We detect the encoding for each webpage; the ASCII pages just contain ASCII characters, for example. Thanks again to Erik van der Poel for collecting the data.

As you can see, Unicode has experienced an 800 percent increase in “market share” since 2006. Note that we separate out ASCII (~16 percent) since it is a subset of most other encodings. When you include ASCII, nearly 80 percent of web documents are in Unicode (UTF-8). The more documents that are in Unicode, the less likely you will see mangled characters (what Japanese call mojibake) when you’re surfing the web.
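Two of the points above can be checked in a few lines of Python. This is a minimal sketch: the sample strings and the list of encodings are assumptions chosen for illustration, not part of the measurement described in this post. It shows that pure-ASCII text produces the same bytes in UTF-8 and several legacy encodings, and how mojibake appears when UTF-8 bytes are read with the wrong encoding.

```python
ASCII_TEXT = "Unicode over 60 percent of the web"

# Pure-ASCII text encodes to identical bytes in ASCII, UTF-8 and many legacy encodings,
# which is why ASCII-only pages can be counted separately from the rest.
ascii_bytes = ASCII_TEXT.encode("ascii")
same_everywhere = all(
    ASCII_TEXT.encode(name) == ascii_bytes
    for name in ("utf-8", "latin-1", "cp1251", "gbk")
)
print(same_everywhere)  # -> True

# Mojibake: text saved as UTF-8 but decoded as Latin-1.
original = "café"
mangled = original.encode("utf-8").decode("latin-1")
print(mangled)  # -> cafÃ©

# Reversing the wrong step recovers the text, as long as no bytes were lost.
repaired = mangled.encode("latin-1").decode("utf-8")
print(repaired == original)  # -> True
```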

We’ve long used Unicode as the internal format for all the text Google searches and processes: any other encoding is first converted to Unicode. Version 6.1 was just released with over 110,000 characters; soon we’ll be updating to that version and to Unicode’s locale data from CLDR 21 (both via ICU). The continued rise in use of Unicode makes it even easier to do the processing for the many languages that we cover. Without it, our unified index would be nearly impossible. It’d be a bit like not being able to convert between the hundreds of currencies in the world; commerce would be, well, difficult. Thanks to Unicode, Google is able to help people find information in almost any language.
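The "convert everything to Unicode first" idea can be sketched in Python as follows. This is only an illustration under stated assumptions: the `to_unicode` helper, the sample word, and the three encodings are hypothetical, and this is not a description of Google's actual pipeline (which, per the text, relies on ICU).

```python
def to_unicode(raw: bytes, declared_encoding: str) -> str:
    """Decode `raw` to a Unicode string, replacing undecodable bytes with U+FFFD."""
    return raw.decode(declared_encoding, errors="replace")


# The same word arriving as three different byte sequences (illustrative inputs).
WORD = "русский"
incoming = [
    (WORD.encode("koi8_r"), "koi8_r"),
    (WORD.encode("cp1251"), "cp1251"),
    (WORD.encode("utf-8"), "utf-8"),
]

# After conversion, every document holds the same Unicode string,
# so one index and one set of text-processing code can serve all of them.
normalized = [to_unicode(raw, encoding) for raw, encoding in incoming]
print(all(text == WORD for text in normalized))
```

The byte sequences differ on the way in, but once decoded they compare equal, which is what makes a single unified index practical.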

Posted by Mark Davis, International Software Architect