Monday, August 20, 2007

Windows Reboots Triggered Skype Glitch

By MATT MOORE
AP Business Writer
Technology Video
Buy AP Photo Reprints

FRANKFURT, Germany (AP) -- A two-day outage that left millions of Skype users unable to use the popular Internet phone service was caused by an abnormally high number of restarts after people had downloaded a Windows security update, the company said Monday.

The worldwide outage, which began on Thursday and ended on Saturday, left millions of Skype users unable to log on to make phone calls or send instant messages.

Luxembourg-based Skype Ltd., part of online auction giant eBay Inc., has more than 220 million users in total but typically has 5 million to 6 million users online at any given time. In January, Skype reported that it had counted 9 million users online at one time.

In an update to users on Skype's Heartbeat blog, employee Villu Arak said the disruption was not because of hackers or any other malicious activity.

Instead, he said that the disruption "was triggered by a massive restart of our users' computers across the globe within a very short timeframe as they re-booted after receiving a routine set of patches through Windows Update," Arak wrote.

Microsoft Corp. released its monthly patches last Tuesday, and many computers are set to automatically download and install them. Installation requires a computer restart.

"The high number of restarts affected Skype's network resources. This caused a flood of log-in requests, which, combined with the lack of peer-to-peer network resources, prompted a chain reaction that had a critical impact," Arak wrote.

Arak did not blame Microsoft for the troubles and said the outage ultimately rested with Skype. Arak said Skype's network normally has an ability to heal itself in such cases, but a previously unknown glitch in Skype's software prevented that from occurring quickly enough.

In a statement, Microsoft described its patch as routine and reiterated that the disruption resulted from a bug in Skype software.

Users from Vietnam to Brazil to Germany to the United States had complained they could not log on and make phone calls or send instant messages.

The outage was a critical moment for the company, founded in 2003 by Niklas Zennstrom and Janus Friis, and was the first major outage since October 2005 when its service was down only for a few hours.

"This disruption was unprecedented in terms of its impact and scope," Arak wrote. "We would like to point out that very few technologies or communications networks today are guaranteed to operate without interruptions."

Sunday, August 12, 2007

TCP window scaling and broken routers

Every TCP packet includes, in the header, a "window" field which specifies how much data the system which sent the packet is willing and able to receive from the other end. The window is the flow control mechanism used by TCP; it controls the maximum amount of data which can be "in flight" between two communicating systems and keeps one side from overwhelming the other with data.

In the early days of TCP, windows tended to be relatively small. The computers of that age did not have huge amounts of memory to dedicate toward buffering network data, and the available networking technology was not fast enough to make use of a larger window in any case. Modern network interfaces can handle larger packets and keep more of them in flight at any given time; they will perform better with a larger window. Some kinds of high-speed long-haul links can have very high bandwidth, but also high latency. Keeping that sort of pipe filled can require a very large window; if a sending system cannot have a large number of packets in transit at any given time, it will not be able to make use of the bandwidth available. For these reasons, good performance can often require very large windows.

The TCP window field, however, is only 16 bits wide, allowing for a maximum window size of 64KB. The TCP designers must have thought that nobody would ever need a larger window than that. But 64KB is not even close to what is needed in many situations today. The solution to this problem is called "window scaling." It is not new; window scaling was codified in RFC 1323 back in 1992. It is also not complicated: a system wanting to use window scaling sets a TCP option containing an eight-bit scale factor. All window values used by that system thereafter should be left-shifted by that scale factor; a window scale of zero, thus, implies no scaling at all, while a scale factor of five implies that window sizes should be shifted five bits, or multiplied by 32. With this scheme, a 128KB window could be expressed by setting the scale factor to five and putting 4096 in the window field.

To keep from breaking TCP on systems which do not understand window scaling, the TCP option can only be provided in the initial SYN packet which initiates the connection, and scaling can only be used if the SYN+ACK packet sent in response also contains that option. The scale factor is thus set as part of the setup handshake, and cannot be changed thereafter.

The details are still being figured out, but it would appear that some routers on the net are rewriting the window scale TCP option on SYN packets as they pass through. In particular, they seem to be setting the scale factor to zero, but leaving the option in place. The receiving side sees the option, and responds with a window scale factor of its own. At this point, the initiating system believes that its scale factor has been accepted, and scales its windows accordingly. The other end, however, believes that the scale factor is zero. The result is a misunderstanding over the real size of the receive window, with the system behind the firewall believing it to be much smaller than it really is. If the expected scale factor (and thus the discrepancy) is large, the result is, at best, very slow communication. In many cases, the small window can cause no packets to be transmitted at all, breaking TCP between the two affected systems entirely.

In the 2.6.7 kernel, the default scale factor is zero; in Linus's BitKeeper tree and the 2.6.7-mm kernels, instead, it has been increased to seven. This change has brought the broken router behavior to light; suddenly people running current kernels are finding that they cannot talk to a number of systems out there. One of the higher-profile affected sites is packages.gentoo.org. Gentoo users are, unsurprisingly, not pleased.

As a way of making things work, Stephen Hemminger has proposed a patch which adds a calculation to select the smallest scale factor which covers the largest possible window size. The result on most systems is that the scale factor gets set to two. This factor will still be corrupted by broken routers, but the resulting window size (¼ of what it should be) is still large enough to allow communication to happen.

The patch makes networking with systems behind broken routers work again, but it has been rejected anyway. The networking maintainers (and David Miller in particular) believe that the patch simply papers over a problem, and that adding hacks to the Linux network stack to accommodate broken routers is a mistake. If, instead, the situation is left as it is, pressure on the router manufacturers should get the problem fixed relatively quickly. It has been a few years, now, that Linux has a strong enough presence in the networking world that it can get away with taking this sort of position.

In the mean time, anybody running a current kernel who is having trouble connecting to a needed site can work around the problem with a command like:

    echo 0 > /proc/sys/net/ipv4/tcp_default_win_scale

or by adding a line like:

    net.ipv4.tcp_default_win_scale = 0

to /etc/sysctl.conf.