Friday, December 21, 2007

if we build it, will they come

Consider the instant messaging landscape in 1999. ICQ had blazed a narrow trail through a small community, and AOL Instant Messenger was the dominant player (carrying over 430 million messages a day). Yahoo was making good headway, and even Microsoft was throwing its hat into the ring with MSN Messenger.[*] All four major systems used proprietary, non-interoperable, unpublished protocols. AOL was engaged in a war with what it considered "rogue" clients, vigorously blocking any attempts at interoperability.

The entrenched players had little interest in a standard protocol. Only Microsoft -- the upstart -- was promoting the concept of a standard protocol.

In the intervening 8 years, XMPP and SIMPLE have gained significant market and mind share, and it's not hard to imagine eventual interoperability between every viable instant messaging system within the next decade or so.

It's the same curve followed by just about all popular communication technologies: proprietary versions emerge first, and prove that the market exists. Eventually, a standardized protocol for the technology is defined, and the market slowly migrates to the standard. (To see how this plays out in a longer timeframe, consider the slow but complete migration of email from isolated, proprietary islands to fully-interconnected standards-based servers over the past 25 years).

That's what XCON is doing for conferencing. I understand that it's hard to see the curve from this end, and it's difficult to imagine what impact we might have over the next 5, 10, 20 years. But it all needs to start somewhere, and we think this is as good a start as we can come up with right now.

/a

[*] Yes, there were important predecessors in spirit, like zephyr, talk, and their ilk. I'm trying to stick with the UI modality represented by currently popular instant messaging systems.

Wednesday, December 12, 2007

P2P consumes 95% of global nighttime network bandwidth

According to foreign media reports, a German research firm recently studied the distribution of global Internet traffic. The report it published found that Skype accounts for 95% of VoIP traffic, and that P2P applications consume 95% of nighttime bandwidth.

The firm, iPoque, reportedly analyzed nearly 3 TB of anonymized traffic from more than a million Internet users, covering Australia, Eastern Europe, Germany, the Middle East, and Southern Europe during August and September of this year.

The study found that Internet telephony makes up 1% of Internet traffic, and that within that slice, Skype alone accounts for 95%. And although VoIP traffic is small, roughly thirty percent of Internet users worldwide regularly use VoIP software to chat with family and friends.

The firm attributes Skype's success chiefly to its ability to traverse firewalls, NATs, and other obstacles that ordinary VoIP software cannot get past; Skype has multiple built-in mechanisms for automatically working around such barriers.

The report notes that the biggest consumer of bandwidth today is P2P file sharing, which accounts for 49% of traffic in the Middle East and 84% in Eastern Europe. Globally, 95% of nighttime bandwidth is consumed by P2P.

Among P2P tools, BitTorrent is the most popular, though eDonkey dominates in Southern Europe. P2P content is unchanged from last year: mostly video, with newly released films, pornography, and music the most popular. E-books account for a comparatively large share of P2P content in the Middle East, and computer games in Southern Europe.

The firm also says that 20% of BitTorrent and eDonkey traffic is encrypted to evade blocking by network operators.

Monday, August 20, 2007

Windows Reboots Triggered Skype Glitch

By MATT MOORE
AP Business Writer

FRANKFURT, Germany (AP) -- A two-day outage that left millions of Skype users unable to use the popular Internet phone service was caused by an abnormally high number of restarts after people had downloaded a Windows security update, the company said Monday.

The worldwide outage, which began on Thursday and ended on Saturday, left millions of Skype users unable to log on to make phone calls or send instant messages.

Luxembourg-based Skype Ltd., part of online auction giant eBay Inc., has more than 220 million users in total but typically has 5 million to 6 million users online at any given time. In January, Skype reported that it had counted 9 million users online at one time.

In an update to users on Skype's Heartbeat blog, employee Villu Arak said the disruption was not because of hackers or any other malicious activity.

Instead, the disruption "was triggered by a massive restart of our users' computers across the globe within a very short timeframe as they re-booted after receiving a routine set of patches through Windows Update," Arak wrote.

Microsoft Corp. released its monthly patches last Tuesday, and many computers are set to automatically download and install them. Installation requires a computer restart.

"The high number of restarts affected Skype's network resources. This caused a flood of log-in requests, which, combined with the lack of peer-to-peer network resources, prompted a chain reaction that had a critical impact," Arak wrote.

Arak did not blame Microsoft for the troubles and said the fault for the outage ultimately rested with Skype. He said Skype's network normally has an ability to heal itself in such cases, but a previously unknown glitch in Skype's software prevented that from occurring quickly enough.

In a statement, Microsoft described its patch as routine and reiterated that the disruption resulted from a bug in Skype software.

Users from Vietnam to Brazil to Germany to the United States had complained they could not log on and make phone calls or send instant messages.

The outage was a critical moment for the company, founded in 2003 by Niklas Zennstrom and Janus Friis, and was the first major outage since October 2005 when its service was down only for a few hours.

"This disruption was unprecedented in terms of its impact and scope," Arak wrote. "We would like to point out that very few technologies or communications networks today are guaranteed to operate without interruptions."

Sunday, August 12, 2007

TCP window scaling and broken routers

Every TCP packet includes, in the header, a "window" field which specifies how much data the system which sent the packet is willing and able to receive from the other end. The window is the flow control mechanism used by TCP; it controls the maximum amount of data which can be "in flight" between two communicating systems and keeps one side from overwhelming the other with data.

In the early days of TCP, windows tended to be relatively small. The computers of that age did not have huge amounts of memory to dedicate toward buffering network data, and the available networking technology was not fast enough to make use of a larger window in any case. Modern network interfaces can handle larger packets and keep more of them in flight at any given time; they will perform better with a larger window. Some kinds of high-speed long-haul links can have very high bandwidth, but also high latency. Keeping that sort of pipe filled can require a very large window; if a sending system cannot have a large number of packets in transit at any given time, it will not be able to make use of the bandwidth available. For these reasons, good performance can often require very large windows.

The TCP window field, however, is only 16 bits wide, allowing for a maximum window size of 64KB. The TCP designers must have thought that nobody would ever need a larger window than that. But 64KB is not even close to what is needed in many situations today. The solution to this problem is called "window scaling." It is not new; window scaling was codified in RFC 1323 back in 1992. It is also not complicated: a system wanting to use window scaling sets a TCP option containing an eight-bit scale factor. All window values used by that system thereafter should be left-shifted by that scale factor; a window scale of zero, thus, implies no scaling at all, while a scale factor of five implies that window sizes should be shifted five bits, or multiplied by 32. With this scheme, a 128KB window could be expressed by setting the scale factor to five and putting 4096 in the window field.
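
To make the arithmetic concrete, here is a small sketch (in Python, using the values from the example above) of how a scaled window is computed:

    def scaled_window(window_field, scale_factor):
        """Interpret a 16-bit TCP window field under RFC 1323 window
        scaling: the advertised value is left-shifted by the factor
        agreed on during the connection handshake."""
        assert 0 <= window_field <= 0xFFFF  # the header field is 16 bits
        assert 0 <= scale_factor <= 14      # RFC 1323 caps the shift at 14
        return window_field << scale_factor

    # A scale factor of 5 and a window field of 4096 yield a 128KB
    # effective window: 4096 * 32 = 131072 bytes.
    print(scaled_window(4096, 5))  # 131072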

To keep from breaking TCP on systems which do not understand window scaling, the TCP option can only be provided in the initial SYN packet which initiates the connection, and scaling can only be used if the SYN+ACK packet sent in response also contains that option. The scale factor is thus set as part of the setup handshake, and cannot be changed thereafter.

The details are still being figured out, but it would appear that some routers on the net are rewriting the window scale TCP option on SYN packets as they pass through. In particular, they seem to be setting the scale factor to zero, but leaving the option in place. The receiving side sees the option, and responds with a window scale factor of its own. At this point, the initiating system believes that its scale factor has been accepted, and scales its windows accordingly. The other end, however, believes that the scale factor is zero. The result is a misunderstanding over the real size of the receive window, with the system behind the firewall believing it to be much smaller than it really is. If the expected scale factor (and thus the discrepancy) is large, the result is, at best, very slow communication. In many cases, the small window can cause no packets to be transmitted at all, breaking TCP between the two affected systems entirely.

In the 2.6.7 kernel, the default scale factor is zero; in Linus's BitKeeper tree and the 2.6.7-mm kernels, instead, it has been increased to seven. This change has brought the broken router behavior to light; suddenly people running current kernels are finding that they cannot talk to a number of systems out there. One of the higher-profile affected sites is packages.gentoo.org. Gentoo users are, unsurprisingly, not pleased.

As a way of making things work, Stephen Hemminger has proposed a patch which adds a calculation to select the smallest scale factor which covers the largest possible window size. The result on most systems is that the scale factor gets set to two. This factor will still be corrupted by broken routers, but the resulting window size (¼ of what it should be) is still large enough to allow communication to happen.
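
The calculation is simple; here is a sketch of the idea in Python (not the actual kernel code):

    def min_scale_factor(max_window):
        """Smallest RFC 1323 scale factor whose left-shifted 16-bit
        window field can still express max_window bytes."""
        scale = 0
        while (0xFFFF << scale) < max_window and scale < 14:
            scale += 1
        return scale

    # With the common 2.6-era receive-buffer maximum of 174760 bytes
    # (the last value of net.ipv4.tcp_rmem), the smallest sufficient
    # factor is 2: 0xFFFF << 2 = 262140 >= 174760.
    print(min_scale_factor(174760))  # 2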

The patch makes networking with systems behind broken routers work again, but it has been rejected anyway. The networking maintainers (and David Miller in particular) believe that the patch simply papers over a problem, and that adding hacks to the Linux network stack to accommodate broken routers is a mistake. If, instead, the situation is left as it is, pressure on the router manufacturers should get the problem fixed relatively quickly. Linux has had a strong enough presence in the networking world for a few years now that it can get away with taking this sort of position.

In the mean time, anybody running a current kernel who is having trouble connecting to a needed site can work around the problem with a command like:

    echo 0 > /proc/sys/net/ipv4/tcp_default_win_scale

or by adding a line like:

    net.ipv4.tcp_default_win_scale = 0

to /etc/sysctl.conf.

Thursday, July 19, 2007

NAT Classification Test Results


2. Descriptions of Tests



2.1. UDP Mapping



This test sends STUN[1] packets from the same port on three different
internal IP addresses to the same destination. The source port on
the outside of the NAT is observed. The test records whether the
port is preserved or not and whether all the mappings get different
ports.

A second set of tests checks how the NAT maps ports above and
below 1024.

Tests are run with a group of several consecutive ports to see if the
NAT preserves port parity.
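
For concreteness, here is a minimal sketch (in Python) of the kind of probe these mapping tests are built on, assuming a classic RFC 3489 STUN server; the server name and local port are placeholders:

    import os
    import socket
    import struct

    def stun_mapped_address(sock, server=("stun.example.com", 3478)):
        """Send a classic (RFC 3489) STUN Binding Request on an
        already-bound socket and return the (ip, port) mapping the
        server observed."""
        # Header: type 0x0001 (Binding Request), length 0, and a
        # 16-byte transaction ID.
        req = struct.pack("!HH", 0x0001, 0) + os.urandom(16)
        sock.sendto(req, server)
        data, _ = sock.recvfrom(1024)
        # Walk the attributes looking for MAPPED-ADDRESS (type 0x0001).
        pos = 20
        while pos < len(data):
            atype, alen = struct.unpack_from("!HH", data, pos)
            if atype == 0x0001:
                port = struct.unpack_from("!H", data, pos + 6)[0]
                ip = socket.inet_ntoa(data[pos + 8:pos + 12])
                return ip, port
            pos += 4 + alen
        return None

    # The mapping test binds the same local port on each internal IP
    # address (one is shown here) and compares the external ports the
    # NAT hands out.
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("0.0.0.0", 4242))
    print(stun_mapped_address(sock))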

2.2. UDP Filtering



This test sends STUN packets from the same port on three different
internal IP addresses to the same destination. It then tests whether
hosts on the outside with 1) a different port but the same IP
address and 2) a different port and a different IP address can
successfully send a packet back to the sender. The test is based on
the technique described in [2].

2.3. UDP Hairpin

This test sends a STUN packet from the inside to the outside to
create a mapping and discover the external source address, called A.
It does the same thing from a different internal IP address to get a
second external mapping, called B. It then sends a packet from A to B
and from B to A and notes whether these packets are successfully
delivered from one internal IP address to the other.
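
A sketch of the hairpin check itself, reusing the STUN probe sketched above to learn the external mapping of the second socket; whether the packet arrives back inside is the whole test:

    import socket

    def hairpin_works(sock_a, sock_b, mapping_b, timeout=2.0):
        """Send from one internal socket to the *external* (ip, port)
        mapping of another socket behind the same NAT (learned via
        STUN) and report whether the NAT hairpins it back inside."""
        sock_b.settimeout(timeout)
        sock_a.sendto(b"hairpin-probe", mapping_b)
        try:
            data, _ = sock_b.recvfrom(1024)
            return data == b"hairpin-probe"
        except socket.timeout:
            return False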

2.4. ICMP



A device on the inside sends a packet to an external address that
causes an ICMP Destination Unreachable packet to be returned. The
test records whether this packet makes it back through the NAT
correctly.

2.5. Fragmentation



The MTU on the outside of the NAT is set to under 1000; on the inside
it is set to 1500 or over. Then a 1200-byte packet is sent to the
NAT. The test records whether the NAT correctly fragments it when
forwarding. Another test is done with DF=1. An additional test is
done with DF=1 in which the adjacent MTU on the NAT is large enough
that the NAT does not need to fragment the packet, but a link further
on has an MTU small enough that an ICMP packet gets generated. The
test records whether the NAT correctly forwards the ICMP packet.

In the next test, a fragmented packet with the fragments in order is
sent to the outside of the NAT, and the test records whether the
fragments are dropped, reassembled and forwarded, or forwarded
individually. A similar test is done with the fragments out of
order.

2.6. UDP Refresh



A test is done that involves sending out a STUN packet and then
waiting a variable number of minutes before the server sends the
response. The client sends different requests with different times
on several different ports at the start of the test and then watches
the responses to find out how long the NAT keeps the binding alive.

A second test is done with a request that is delayed more than the
binding time but every minute an outbound packet is sent to keep the
binding alive. This test checks that outbound traffic will update
the timer.

A third test is done in which several requests are sent with the
delay less than the binding time and one request with the delay
greater. The early test responses will result in inbound traffic
that may or may not update the binding timer. This test detects
whether the packet with the time greater than the binding time will
traverse the NAT, which provides information about whether the
inbound packets have updated the binding timers.

An additional test is done to multiple different external IP
addresses from the same source, to see if outbound traffic to one
destination updates the timers on each session in that mapping.
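
A rough client-side sketch of the lifetime probe; the cooperating server and its delayed-reply request format are invented here purely for illustration:

    import socket

    def binding_alive_after(sock, server, idle_seconds):
        """Create a mapping, stay silent, and see whether a reply sent
        by the server after idle_seconds still traverses the NAT.  The
        "echo-after" request format is a made-up convention."""
        sock.sendto(b"echo-after:%d" % idle_seconds, server)
        sock.settimeout(idle_seconds + 10)
        try:
            sock.recvfrom(1024)
            return True   # the binding survived the idle period
        except socket.timeout:
            return False  # the NAT dropped the binding

    # Each delay gets a fresh socket, and thus a fresh binding, as the
    # test description requires.
    for delay in (30, 60, 120, 240, 480):
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        print(delay, binding_alive_after(s, ("probe.example.com", 3478), delay))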

2.7. Multicast and IGMP



Multicast traffic is sent to the outside of the NAT, and the test
records whether the NAT forwards it to the inside. Next an IGMP
Membership Report is sent from inside. The test records whether the
NAT correctly forwards it to the outside and whether it allows
incoming multicast traffic. More detail on NATs and IGMP is provided
in [3].

2.8. Multicast Timers



The test records how long the NAT will forward multicast traffic
without receiving any IGMP Membership Reports and whether receiving
Reports refreshes this timer.

2.9. TCP Timers



TBD: Measure time before ACK, after ACK, and after FIN and RST.

2.10. TCP Port Mapping



Multiple SYN packets are sent from the same inside address to
different outside IP addresses, and the source port used on the
outside of the NAT is recorded.

2.11. SYN Filtering



Test that a SYN packet received on the outside interface that does
not match anything gets discarded with no reply being sent. Test
whether an outbound SYN packet will create a binding that allows an
incoming SYN packet.

2.12. DNS



Does the DNS proxy in the NAT successfully pass SRV requests through?

2.13. DHCP



Do any DHCP options received on the WAN side get put into DHCP
answers sent on the LAN side?



2.14. Ping



Do ping requests sent from the LAN side work?



To help organize the NATs by what types of applications they can
support, the following groups are defined. The application of using
a SIP phone with a TLS connection for signaling and using STUN for
media ports is considered. It is assumed the RTP/RTCP media is on
random port pairs as recommended for RTP.

Group A: NATs that are deterministic, not symmetric, and support
hairpin media. These NATs would work with many phones behind
them.
Group B: NATs that are not symmetric on the primary mapping. This
group would work with many IP phones as long as the media ports
did not conflict; that is unlikely to happen often, but it will
occasionally. Because these NATs may not support hairpin media, a
call from one phone behind a NAT to another phone behind the same
NAT may not work.
Group D: NATs of the type Bad. These have the same limitations as
group B, but when the ports conflict, media gets delivered to a
random phone behind the NAT.
Group F: These NATs are symmetric, and phones behind them will not
work.


NAT Traversal

This is the original STUN algorithm.
One situation where it fails is the following:

                  STUN
                 server
                    |
                   NAT
                    |
           ------------------
           |                |
          UA1              UA2

If the NAT does not support hairpinning, then your algorithm will not
work.

[http://tools.ietf.org/html/draft-jennings-behave-test-results-04
shows that most NATs today do not support UDP hairpinning, and
http://www.guha.cc/saikat/pub/imc05-tcpnat.pdf reports that most do not
support TCP hairpinning.]

This situation can occur when the two UAs belong to the same company,
or both are in the same hotel, or both use the same service provider
and the service provider has a NAT in front of its entire network, etc.


ICE does two things to solve this problem. First, UA1 and UA2 exchange
their local addresses and ports as well as their STUN-learned addresses.
They then test to see which path works. Second, they also exchange
TURN addresses, which serve as a backup in case everything else fails.
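
A greatly simplified sketch of the "test which path works" step; real ICE pairs candidates systematically and authenticates its checks with STUN, so this only shows the shape of the idea (all addresses are illustrative):

    import socket

    def first_working_path(sock, candidates, probes=3, timeout=1.0):
        """Probe each remote candidate in order and return the first
        address that echoes back; a stand-in for ICE connectivity
        checks."""
        sock.settimeout(timeout)
        for addr in candidates:
            for _ in range(probes):
                sock.sendto(b"probe", addr)
                try:
                    data, src = sock.recvfrom(1024)
                    if src == addr:
                        return addr
                except socket.timeout:
                    continue
        return None  # fall back to the TURN relay

    # Candidates in the order ICE would try them: the peer's local
    # address, its STUN-learned address, then the TURN relay.
    candidates = [("192.168.1.20", 5062),
                  ("203.0.113.7", 31337),
                  ("198.51.100.5", 3478)]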


David Barrett's algorithm, if I understand it correctly, would work in
this situation. (As far as I can tell, David's is a simplified version
of ICE).

Friday, July 6, 2007

What's wrong with DNS?

DNS was designed in the days when long-term storage was scarce and so
was bandwidth. Technology was difficult and obtuse, with the number of
human implementors measured in the hundreds, and the number of users
in the thousands. I remember that because I was there. There was no
other way to pass this data. Something had to be first and that was
DNS, which reflected the times and the organizations that invented it
in the first place.

Those days are long gone, in this era of the $20 grocery-store 2 GB
flash drive, millions of implementors, and billions of users.

Today's Internet is a consumer-driven multi-modal world which has no
patience for that.
To cripple bootstrap by leashing it to creaky DNS is a disservice to
the future.

Bootstrap should have some criteria requirements for some obvious
things such as satisfying the security AD but the actual
implementation should allow for individual implementations that
reflect a rapidly changing modern world.

Hardcoding is a quaint way of solving the problem, and it is far too
easy for protocol engineers to design in a vacuum. When a single
handheld device can receive inputs by SMS, email, walled and unwalled
IP, and HTTP, the number of sources for a bootstrap is open-ended and
impossible to mandate without creating limitations on functionality
and usefulness in a hybridized multi-modal world.

A MIME message, encrypted by a private key known to that device, could
contain a myriad of bootstraps, be they DNS, hardcoded IP (IPv4
or IPv6? Which?), local names, peer names, SMS addresses, SIP URIs,
what have you. A list based on operating criteria, that can evolve
over time.

This message, more 'carrier pigeon' than bootstrap, can be delivered
via a variety of means, including via C/S SIP or stored from a
previous DHT. A primitive form of unencrypted messaging is called
'paper'.

In this environment a 'friend' network, of names associated with keys,
could be seeded entirely by SMS messages and never ever need DNS, and
it works when zeroconf fails, which is often. It could later use
broadcasts or multicasts to find other 'friends', using one of a
variety of techniques, zeroconf being one.

Create the requirements for the bootstrap that guarantee security,
functionality and interoperability. Create the list of appropriate
beacons, allowing room for more in the future.
//////////////////////////////////////////////////////////////////////////////////
That's certainly my thinking in bringing this up. Just do what's easy and knock off the biggest win -- a low-cost SIP infrastructure.

As I see it, this approach has the following advantages:

1) Zero changes to proxy servers and registrars.
- Non-firewalled peers become proxies and registrars just like before. They should be able to use literally the same code as the open-source proxies and registrars out there, because nothing changes in the lookup.

2) Eliminating the most complicated part of the current architecture -- the DHT. It's also the thorniest to standardize.

3) Lower latency for completing calls. No more waiting around for the DHT to respond. Call completion happens just like it does with any other SIP call (if the DNS is up-to-date -- more below).


Then there are the disadvantages:

1) Delay of DNS propagation when registrars go down or clients move. I think this is the toughest to get a good read on because I don't know how it would play out in the field.

2) You'd need to run a dynamic DNS *client* on every node (at least that seems like the best approach to me).

3) You'd be relying on some servers out there to do the dynamic DNS. Possibly existing dynamic DNS providers?

4) Doesn't work in completely disconnected scenarios.

Keep in mind I'm not necessarily advocating this shift, but it certainly resonated with me once I realized the full implications of Paul's question to me the other day.

///////////////////////////////////////////////////////////////////////////////////
I don't want to beat a dead horse, but:
1. Most of the dynamic DNS implementations I'm familiar with have update
times of a couple of minutes. It's not "real time", but it's pretty
short. This is reflected in the TTL of the answer.
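
Those TTLs are easy to check directly; a quick look using the dnspython library (assuming a recent version; the hostname is a placeholder):

    import dns.resolver  # the dnspython package

    # Dynamic DNS names are published with very short TTLs so that
    # address changes propagate within minutes.
    answer = dns.resolver.resolve("myhost.dyndns.example", "A")
    print(answer.rrset.ttl, [r.address for r in answer])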

2. Any form of DNS requires a set of caching resolvers and authoritative
servers. If you call that "centralized", then you are correct. I always
think of DNS as being very decentralized except for the roots, and most
implementations I know work if connection to the root is lost, but the
answer is in the cache.

3. If you combine a short TTL with a disconnected server, then caching
doesn't work; resolvers need access to the authoritative servers (plural)
that provide the data. It is servers plural, not singular. I think this is
the reason dynamic DNS is not the best choice, although we may have been
hasty in coming to that conclusion.

4. I think identifiers in these systems are in the form of user@domain,
where domain is the "Overlay Id". Humans may never see that, but it's
there. Resolving "domain" to at least a starter set of peers has to happen
somehow. We have only a few tricks in our bag for that:
a. Configuration
b. Multicast
c. DNS
Are there other mechanisms? There are multiple configuration mechanisms,
but they all depend on static information, not dynamic, learned data.
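
For the DNS option, the lookup itself is the easy part; a sketch with the dnspython library, where the service label and overlay domain are purely illustrative:

    import dns.resolver  # the dnspython package

    def bootstrap_peers(overlay_domain):
        """Resolve an overlay's domain to a starter set of peers via
        SRV records; "_p2psip._udp" is an illustrative label, not a
        registered service name."""
        name = "_p2psip._udp." + overlay_domain
        answers = dns.resolver.resolve(name, "SRV")
        return sorted((r.priority, str(r.target), r.port) for r in answers)

    print(bootstrap_peers("example.net"))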

Agree that we explicitly declared dynamic DNS and multicast out of scope.
That leaves basic DNS and Configuration.

When networks become disconnected, local caching resolvers work. If you are
within a domain, and you want a DNS answer from that domain, it works.


Many existing systems work by configuration. I'd rather not do that.
/////////////////////////////////////////////////////////////////////////////////
However, if a kinder, gentler, thousand-points-of-light, better/faster/cheaper DNS is to be implemented,
please look into the on-steroids performance of CoDNS.