Monday, October 22, 2012

An article on server scaling - Nginx/NodeJS vs Apache

Reference: http://erratasec.blogspot.hk/2012/10/scalability-is-systemic-anomaly.html


Scalability is a systemic anomaly inherent to the programming of the matrix


Scalability: it's the question that drives us; it's a systemic anomaly inherent to the programming of the matrix.


Few statements troll me better than saying “Nginx is faster than Apache” (referring to the two most popular web-servers). Nginx is certainly an improvement over Apache, but that's because it's more scalable, not because it's faster.


Scalability is about how performance degrades under heavier loads. Sometimes servers simply need twice the performance to handle twice the traffic. Sometimes they need four times the performance. Sometimes no amount of added performance will handle the increase in traffic.


This can be visualized with the following graph of Apache server performance:



The graph above shows the well-known problem with Apache: it has a limit to the number of simultaneous connections. As the number of connections to the server increases, its ability to handle traffic goes down. With 10,000 simultaneous connections, an Apache server is essentially disabled, unable to service any of the connections.

Naively, we assume that all we need to do to fix this problem is to get a faster server. If Apache runs acceptably around 5,000 connections, then we assume that doubling server speed will make it handle 10,000 connections. This is not so, as shown in the following graph:



The above graph shows how increasing speed by two, four, eight, sixteen, and even thirty-two times still does not enable Apache to handle 10,000 simultaneous connections.

This is why scalability is a magical problem unrelated to performance: more performance just doesn’t fix it.



Spoon boy: Do not try and bend the spoon. That's impossible. Instead... only try to realize the truth. 

Neo: What truth? 
Spoon boy: There is no spoon. 
Neo: There is no spoon? 
Spoon boy: Then you'll see that it is not the spoon that bends, it is only yourself.

The solution to scalability is like the spoon-boy story from The Matrix. It’s impossible to fix the scalability problem until you realize the truth that it’s you who bends. More hardware won’t fix scalability problems; instead, you must change your software. Rather than writing web-apps on top of Apache’s threads, you need to write web-apps in an “asynchronous” or “event-driven” manner.
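
To make the contrast concrete, here is a minimal sketch (not taken from either project) of an event-driven web-app in NodeJS: a single event loop services every connection, instead of one thread being parked per client.

// Minimal event-driven server: one process, one event loop, no thread per connection.
var http = require("http");
http.createServer(function (request, response) {
   // The callback returns immediately; slow I/O is handled asynchronously,
   // so thousands of mostly-idle connections stay cheap.
   response.writeHead(200, { "Content-Type": "text/plain" });
   response.end("hello\n");
}).listen(8000);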


A graph of asynchronous performance looks something like the following:



This graph shows why asynchronous servers seem so magical. Running Nginx or NodeJS on a notebook computer costing $500 will still outperform a $10,000 server running Apache. Sure, the expensive system may be 100 times faster at fewer than 1000 connections, but as connections scale, even a tiny system running Nginx/NodeJS will eventually come out ahead.

Whereas 10,000 connections is a well-known problem for Apache, systems running Nginx or NodeJS have been scaled to 1-million connections. Other systems (such as my IPS code) scale to 10 million connections on cheap desktop hardware.



Morpheus: I'm trying to free your mind, Neo. But I can only show you the door. You're the one that has to walk through it.


The purpose of this post is to show you the last graph, demonstrating how asynchronous/event-driven software like Nginx isn’t “faster” than Apache. Instead, this software is more “scalable”.


This is one of the blue-pill/red-pill choices for programmers. Choosing the red-pill will teach you an entirely new way of seeing the Internet, allowing you to create scalable solutions that seem unreal to those who chose the blue-pill.

Thursday, October 18, 2012

JQuery - disable form inputs

// Disable every input and select in the form that contains #submit_button
$('#submit_button').closest('form').find('input, select').prop('disabled', true);

Wednesday, October 17, 2012

Rails - A note on lib files

Files under lib/ are not auto-reloaded in development, so the Rails server needs to be restarted for any code changes there to take effect.

Tuesday, October 16, 2012

Simpler, Cheaper, Faster: Playtomic's Move From .NET To Node And Heroku


Reference: http://highscalability.com/blog/2012/10/15/simpler-cheaper-faster-playtomics-move-from-net-to-node-and.html

Just over 20,000,000 people hit my API yesterday 700,749,252 times, playing the ~8,000 games my analytics platform is integrated in for a bit under 600 years in total play time. That's just yesterday. There are lots of different bottlenecks waiting for people operating at scale. Heroku and NodeJS, for my use case, eventually alleviated a whole bunch of them very cheaply.
Playtomic began with an almost exclusively Microsoft .NET and Windows architecture which held up for 3 years before being replaced with a complete rewrite using NodeJS.  During its lifetime the entire platform grew from shared space on a single server to a full dedicated server, then spread to a second dedicated server, then the API server was offloaded to a VPS provider and 4 – 6 fairly large VPSs.  Eventually the API server settled on 8 dedicated servers at Hivelocity, each a quad core with hyperthreading + 8 GB of RAM + dual 500 GB disks, running 3 or 4 instances of the API stack.
 
These servers routinely serviced 30,000 to 60,000 concurrent game players and received up to 1500 requests per second, with load balancing done via DNS round robin.
In July the entire fleet of servers was replaced with a NodeJS rewrite hosted at Heroku for a significant saving.

Scaling Playtomic With NodeJS

There were two parts to the migration:
  1. Dedicated to PaaS:  Advantages include price, convenience, leveraging their load balancing and reducing overall complexity.  Disadvantages include no New Relic for NodeJS, very inelegant crashes, and a generally immature platform.
  2. .NET to NodeJS: Switching architecture from ASP.NET / C# with local MongoDB instances and a service preprocessing event data locally and sending it to a centralized server to be completed, to NodeJS on Heroku + Redis and preprocessing on SoftLayer (see Catalyst program).

Dedicated To PaaS

The reduction in complexity is significant; we had 8 dedicated servers each running 3 or 4 instances of the API at our hosting partner Hivelocity.  Each ran a small suite of software including:
  • MongoDB instance
  • log pre-processing service
  • monitoring service
  • IIS with api sites
Deploying was done via an FTP script that uploaded new api site versions to all servers.  Services were more annoying to deploy but changed infrequently.

MongoDB was a poor choice for temporarily holding log data before it was pre-processed and sent off.  It offered a huge speed advantage by just writing to memory initially, which meant write requests were “finished” almost instantly and far superior to common message queues on Windows, but it never reclaimed space left by deleted data, which meant the db size would balloon to 100+ gigabytes if it wasn’t compacted regularly.
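
As a rough illustration (the collection name here is just a placeholder, and this is the stock mongo-shell command rather than Playtomic's own scripts), reclaiming that space meant periodically running something like:

// In the mongo shell: rewrite and defragment a collection to reclaim space
// left behind by deleted documents. Note that compact blocks while it runs.
db.runCommand({ compact: "events" });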

The advantages of PaaS providers are pretty well known, they all seem quite similar although it’s easiest to have confidence in Heroku and Salesforce since they seem the most mature and have broad technology support.

The main challenge in transitioning to PaaS was shaking the mentality that we could run assistive software alongside the website as we did on the dedicated servers.  Most platforms provide some sort of background worker threads you can leverage, but that means you need to route data and tasks from the web threads through a 3rd party service or server, which seems unnecessary.

We eventually settled on a large server at SoftLayer running a dozen purpose-specific Redis instances and some middleware rather than background workers.  Heroku doesn’t charge for outbound bandwidth and SoftLayer doesn’t charge for inbound, which neatly avoided the significant bandwidth involved.

Switching From .NET To NodeJS

Working with JavaScript on the server side is a mixed experience.  On the one hand the lack of formality and boilerplate is liberating.  On the other hand there’s no New Relic and no compiler errors, which makes everything harder than it needs to be.

There are two main advantages that make NodeJS spectacularly useful for our API.
  1. Background workers in the same thread and memory as the web server
  2. Persistent, shared connections to Redis and MongoDB (etc.)

Background Workers

NodeJS has the very useful ability to continue working independently of requests, allowing you to prefetch data and perform other operations that let you terminate a request very early and then finish processing it.

It is particularly advantageous for us to replicate entire MongoDB collections in memory, periodically refreshed, so that entire classes of work have access to current data without having to go to an external database or a local/shared caching layer.

We collectively save 100s – 1000s of database queries per second using this in:
  • Game configuration data on our main api
  • API credentials on our data exporting api
  • GameVars which developers use to store configuration or other data to hotload into their games
  • Leaderboard score tables (excluding scores)
The basic model is:
var cache = {};
module.exports = function(request, response) {
   response.end(cache["x"]);
}
function refresh() {
   // fetch updated data from database, store in cache object
   cache["x"] = "foo";
   setTimeout(refresh, 30000);
}
refresh();
The advantages of this are a single connection (per dyno or instance) to your backend databases instead of per-user, and a very fast local memory cache that always has fresh data.

The caveats are your dataset must be small, and this is operating on the same thread as everything else so you need to be conscious of blocking the thread or doing too-heavy cpu work.

Persistent Connections

The other massive benefit NodeJS offers over .NET for our API is persistent database connections.  The traditional method of connecting in .NET (etc) is to open your connection, do your operation, after which your connection is returned to a pool to be re-used shortly or expired if it’s no longer needed.

This is very common and until you get to a very high concurrency it will Just Work.  At a high concurrency the connection pool can’t re-use the connections fast enough which means it generates new connections that your database servers will have to scale to handle.

At Playtomic we typically have several hundred thousand concurrent game players sending event data that needs to be pushed back to our Redis instances in a different datacenter, which with .NET would require a massive volume of connections – which is why we ran MongoDB locally on each of our old dedicated servers.
With NodeJS we have a single connection per dyno/instance which is responsible for pushing all the event data that particular dyno receives.  It lives outside of the request model, something like this:
var redis = require("redis");
var redisclient = redis.createClient(…);
module.exports = function(request, response) {
   var eventdata = "etc";
   redisclient.lpush("events", eventdata);
}

The End Result

High load:
REQUESTS IN LAST MINUTE

_exceptions: 75 (0.01%)
_failures: 5 (0.00%)
_total: 537,151 (99.99%)
data.custommetric.success: 1,093 (0.20%)
data.levelaveragemetric.success: 2,466 (0.46%)
data.views.success: 105 (0.02%)
events.regular.invalid_or_deleted_game#2: 3,814 (0.71%)
events.regular.success: 527,837 (98.25%)
gamevars.load.success: 1,060 (0.20%)
geoip.lookup.success: 109 (0.02%)
leaderboards.list.success: 457 (0.09%)
leaderboards.save.missing_name_or_source#201: 3 (0.00%)
leaderboards.save.success: 30 (0.01%)
leaderboards.saveandlist.success: 102 (0.02%)
playerlevels.list.success: 62 (0.01%)
playerlevels.load.success: 13 (0.00%)


This data comes from some load monitoring that operates in the background on each instance and pushes counters to Redis, where they’re then aggregated and stored in MongoDB.  You can see it in action at https://api.playtomic.com/load.html.
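
A minimal sketch of that counter-pushing pattern (key and metric names here are illustrative, not Playtomic's actual ones):

// Per-instance counters pushed to Redis; an aggregator can roll them up later.
var redis = require("redis");
var client = redis.createClient();
function count(metric) {
   // Bucket counters by minute so they can be aggregated per minute downstream.
   var bucket = "load:" + Math.floor(Date.now() / 60000);
   client.hincrby(bucket, metric, 1);
   client.expire(bucket, 300); // keep only a few minutes of raw counters around
}
count("events.regular.success");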

There are a few different classes of requests in that data:
  • Events that check the game configuration from MongoDB, perform a GeoIP lookup (open-sourced, very fast implementation at https://github.com/benlowry/node-geoip-native), and then push to Redis
  • GameVars, Leaderboards, and Player Levels all check the game configuration from MongoDB and then query whichever MongoDB database is relevant
  • Data lookups are proxied to a Windows server because of poor NodeJS support for stored procedures
The result is 100,000s of concurrent users causing spectacularly light Redis loads of 500,000 – 700,000 lpush’s per minute (and being pulled out on the other end):


 1  [||                                                                                      1.3%]     Tasks: 83; 4 running
 2  [|||||||||||||||||||                                                                    19.0%]     Load average: 1.28 1.20 1.19
 3  [||||||||||                                                                              9.2%]     Uptime: 12 days, 21:48:33
 4  [||||||||||||                                                                           11.8%]
 5  [||||||||||                                                                              9.9%]
 6  [|||||||||||||||||                                                                      17.7%]
 7  [|||||||||||||||                                                                        14.6%]
 8  [|||||||||||||||||||||                                                                  21.6%]
 9  [||||||||||||||||||                                                                     18.2%]
 10 [|                                                                                       0.6%]
 11 [                                                                                        0.0%]
 12 [||||||||||                                                                              9.8%]
 13 [||||||||||                                                                              9.3%]
 14 [||||||                                                                                  4.6%]
 15 [||||||||||||||||                                                                       16.6%]
 16 [|||||||||                                                                               8.0%]
 Mem[|||||||||||||||                                                                 2009/24020MB]
 Swp[                                                                                    0/1023MB]

 PID USER     PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
12518 redis     20   0 40048  7000   640 S  0.0  0.0  2:21.53  `- /usr/local/bin/redis-server /etc/redis/analytics.conf
12513 redis     20   0 72816 35776   736 S  3.0  0.1  4h06:40  `- /usr/local/bin/redis-server /etc/redis/log7.conf
12508 redis     20   0 72816 35776   736 S  2.0  0.1  4h07:31  `- /usr/local/bin/redis-server /etc/redis/log6.conf
12494 redis     20   0 72816 37824   736 S  1.0  0.2  4h06:08  `- /usr/local/bin/redis-server /etc/redis/log5.conf
12488 redis     20   0 72816 33728   736 S  2.0  0.1  4h09:36  `- /usr/local/bin/redis-server /etc/redis/log4.conf
12481 redis     20   0 72816 35776   736 S  2.0  0.1  4h02:17  `- /usr/local/bin/redis-server /etc/redis/log3.conf
12475 redis     20   0 72816 27588   736 S  2.0  0.1  4h03:07  `- /usr/local/bin/redis-server /etc/redis/log2.conf
12460 redis     20   0 72816 31680   736 S  2.0  0.1  4h10:23  `- /usr/local/bin/redis-server /etc/redis/log1.conf
12440 redis     20   0 72816 33236   736 S  3.0  0.1  4h09:57  `- /usr/local/bin/redis-server /etc/redis/log0.conf
12435 redis     20   0 40048  7044   684 S  0.0  0.0  2:21.71  `- /usr/local/bin/redis-server /etc/redis/redis-servicelog.conf
12429 redis     20   0  395M  115M   736 S 33.0  0.5 60h29:26  `- /usr/local/bin/redis-server /etc/redis/redis-pool.conf
12422 redis     20   0 40048  7096   728 S  0.0  0.0 26:17.38  `- /usr/local/bin/redis-server /etc/redis/redis-load.conf
12409 redis     20   0 40048  6912   560 S  0.0  0.0  2:21.50  `- /usr/local/bin/redis-server /etc/redis/redis-cache.conf

and very light MongoDB loads of 1,800 – 2,500 CRUD operations a minute:

insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time 
    2      9      5      2       0       8       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     0|0     3k     7k   116   01:11:12 
    1      1      5      2       0       6       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     0|0     2k     3k   116   01:11:13 
    0      3      6      2       0       8       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     3k     6k   114   01:11:14 
    0      5      5      2       0      12       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     0|0     3k     5k   113   01:11:15 
    1      9      7      2       0      12       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     0|0     4k     6k   112   01:11:16 
    1     10      6      2       0      15       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     1|0     4k    22k   111   01:11:17 
    1      5      6      2       0      11       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     3k    19k   111   01:11:18 
    1      5      5      2       0      14       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     0|0     3k     3k   111   01:11:19 
    1      2      6      2       0       8       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     3k     2k   111   01:11:20 
    1      7      5      2       0       9       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     0|0     3k     2k   111   01:11:21 
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time 
    2      9      8      2       0       8       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     4k     5k   111   01:11:22 
    3      8      7      2       0       9       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     4k     9k   110   01:11:23 
    2      6      6      2       0      10       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     3k     4k   110   01:11:24 
    2      8      6      2       0      21       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     4k    93k   112   01:11:25 
    1     10      7      2       3      16       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     4k     4m   112   01:11:26 
    3     15      7      2       3      24       0  6.67g  14.8g  1.23g      0      0.2          0       0|0     0|0     6k     1m   115   01:11:27 
    1      4      8      2       0      10       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     4k     2m   115   01:11:28 
    1      6      7      2       0      14       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     4k     3k   115   01:11:29 
    1      3      6      2       0      10       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     0|0     3k   103k   115   01:11:30 
    2      3      6      2       0       8       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     3k    12k   114   01:11:31 
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time 
    0     12      6      2       0       9       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     4k    31k   113   01:11:32 
    2      4      6      2       0       8       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     0|0     3k     9k   111   01:11:33 
    2      9      6      2       0       7       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     0|0     3k    21k   111   01:11:34 
    0      8      7      2       0      14       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     4k     9k   111   01:11:35 
    1      4      7      2       0      11       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     3k     5k   109   01:11:36 
    1     15      6      2       0      19       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     0|0     5k    11k   111   01:11:37 
    2     17      6      2       0      19       1  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     6k   189k   111   01:11:38 
    1     13      7      2       0      15       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     1|0     5k    42k   110   01:11:39 
    2      7      5      2       0      77       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     2|0    10k    14k   111   01:11:40 
    2     10      5      2       0     181       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     0|0    21k    14k   112   01:11:41 
insert  query update delete getmore command flushes mapped  vsize    res faults locked % idx miss %     qr|qw   ar|aw  netIn netOut  conn       time 
    1     11      5      2       0      12       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     0|0     4k    13k   116   01:11:42 
    1     11      5      2       1      33       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     3|0     6k     2m   119   01:11:43 
    0      9      5      2       0      17       0  6.67g  14.8g  1.22g      0      0.1          0       0|0     1|0     5k    42k   121   01:11:44 
    1      8      7      2       0      25       0  6.67g  14.8g  1.22g      0      0.2          0       0|0     0|0     6k    24k   125   01:11:45

A behind-the-scenes look at LinkedIn’s mobile engineering

Reference: http://arstechnica.com/information-technology/2012/10/a-behind-the-scenes-look-at-linkedins-mobile-engineering/

LinkedIn is the career-oriented social network that prides itself on professional excellence. But the company's original mobile offering was anything but, leaving much to be desired. There was an iPhone application, but no support for Android or tablets. The backend was a rickety Ruby on Rails contraption, afflicted with seemingly insurmountable scalability problems. And despite serving only seven or eight percent of the LinkedIn population, the company's original mobile build required approximately 30 servers in order to operate. This was clearly not built to sustain a growing mobile user base.
Now, a little over a year has passed since LinkedIn relaunched its mobile applications and website. And the company recently marked the anniversary by debuting a number of new mobile features, including real-time notifications and support for accessing company pages from mobile apps.
Mobile is gradually becoming a central part of the LinkedIn landscape. The company says roughly 23 percent of its users access the site through one of its mobile applications, up from ten percent last year. As our friends at Wired reported last month, the underlying design language and development philosophy behind the company's mobile experience is playing an influential role as the company works to revamp its website.
No one knows more about this paradigm shift than Kiran Prasad, LinkedIn's director of mobile engineering. Prasad was one of the key players in LinkedIn's efforts to resurrect its mobile offerings, and he was kind enough to walk Ars through the process. LinkedIn's mobile man provided a behind-the-scenes look at how the company built its mobile applications and the associated backend infrastructure, while also describing the design strategy the company used to build its mobile interfaces. As you'd probably expect from the brand, it was all a professional level effort.

Initiating an overhaul

LinkedIn decided it was time for a massive overhaul as the company began recognizing the increasingly important role that mobile access would play for the social network's users. Jet-setting professionals rely on their smartphones to stay in touch while they travel, right?
Mobile soon became a major strategic focus for LinkedIn. In August of 2011, roughly five months after the mobile overhaul began, the company launched an all-new set of mobile apps powered by a completely new backend.
When the effort began, the mobile engineers at LinkedIn had several major goals. They wanted cross-platform compatibility, with support at launch for Android, iOS, and the mobile Web. They also wanted to simplify the user interface, taking the number of icons on the main screen down from 12 to three or four. Finally, they wanted a holistic rewrite—rebuilding both the frontend and backend together with an eye for boosting scalability.
Prasad previously worked on WebOS at Palm before joining LinkedIn. He brought a wealth of knowledge about building native-like user interfaces with modern Web standards. He regards the mobile Web as a platform in its own right, one that LinkedIn had to support well alongside Android and iOS. So Prasad decided that making HTML5 a central part of the company's mobile strategy would make it easier to reach all of those environments.
HTML offers a useful way to reach more screens, but Prasad believes there are still many places where native user interfaces and native code are needed in order to deliver the best possible user experience. LinkedIn set out to use a hybrid model, blending the two to theoretically offer the advantages of both.
This approach made it easier for a small team to support multiple platforms. And crucially, it allowed other LinkedIn developers outside of the mobile team to contribute to the effort using their existing skills.

Everything must be simple

Prasad told me that simplicity is at the heart of LinkedIn's mobile vision. The company's internal definition of simplicity, he said, holds that a good solution must be fast, easy, and reliable—in that specific order. Each of those characteristics is considered ten times more important than the next, making performance the chief concern. That philosophy motivated his team's decisions in almost every area, ranging from visual design to backend engineering.
The thing people value most, according to Prasad, is their time. When users encounter flaws in software, they tend to be more forgiving if the software is extremely fast. A crash is less disruptive, for example, if the application is quick to restart and offers a quick path back to where the user was.
Speed is also especially important for mobile experiences in his eyes, because smartphone users tend to have shorter sessions (often less than three minutes long). Users expect applications to deliver relevant information as quickly as possible.
Ease of use was his next major priority, and Prasad said LinkedIn measures this by counting the number of clicks that it takes for a user to complete a given operation. If it takes more than three, he said, users are going to lose patience and easily be drawn away by push notifications or other things happening on their phone.
Reliability is the final of the three priorities. Basic robustness and stability are important, Prasad told me, but reliability also encompasses other ideas like consistency and predictability. To him, a reliable application is one the user can depend on to behave the same way every time. When the user taps a button and it takes them to another screen, for example, going back and tapping on the same button should take them there again. A surprising number of applications ignore that seemingly obvious design principle, Prasad said.
During the design process, LinkedIn used a simple metaphor to encourage predictability and ease of use. The idea is that an application is like a house—there are rooms with specific functions, and there are corridors that connect those rooms in a practical way. When you are putting together a room, you don't fill it with an incongruous range of functions. You may end up with a living room, a bedroom, a kitchen, and a bathroom. But you don't cook in the living room and you don't sleep in the bathroom.
Continuing this house analogy, when you begin designing an application, you start by defining the structure. You determine what rooms you want, what their purposes will be, and how those rooms will be connected. You don't start with the visuals (in the house analogy, these are like the carpeting or the paintings on the walls). When you introduce a feature, you think about the room in which it belongs.
For the mobile application, LinkedIn decided that it didn't want more than four "rooms" of functionality. You can clearly see the house metaphor in action when you open the LinkedIn iPhone application. The main screen limits itself to four icons: your updates, profile, messages, and groups. It serves as the hallway, allowing the user to tap an item in order to enter a given room.

Building a hybrid app with HTML5 and native code

LinkedIn uses a combination of HTML5 and native code in its user interface. Prasad told me which technologies they use for various parts of the app and gave me a detailed explanation of how LinkedIn has made the two approaches interoperate.
Prasad says that HTML is very effective at rendering what he calls "detail" views, large blocks of informational content consisting largely of rich text and graphical media. The flexible layouts offered by HTML make it useful for such usage scenarios, but there are also major areas where LinkedIn chose to use native controls. Why? Prasad said Web technologies weren't entirely up to the task.
The biggest example he cited of an area where HTML5 still falls short as a user interface layer for data-driven mobile applications is in displaying long lists of content. He said that native widgets are needed in order to achieve smooth and seamless scrolling for list displays with hundreds or thousands of items. Attempting to display such lists with HTML and JavaScript proved impractical. Prasad saw this as especially true for "infinite" lists, where new content is fetched dynamically and appended to the bottom as the user scrolls.
When implementing an infinite list, you can't just keep adding items to the bottom. Memory overhead quickly becomes unacceptable unless you also simultaneously pull items from the top after the user has scrolled past. Prasad explained that manipulating the HTML DOM (Document Object Model) during scrolling caused some stuttering.
It's hard to measure exactly what is going on when that happens, but LinkedIn guessed that page layout computations or JavaScript garbage collection were sapping too much of the device's limited resources. On top of the performance issues, Prasad felt it was exceptionally difficult to implement kinetic scrolling in HTML that felt truly native across platforms.
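For illustration only (this is not LinkedIn's code), the windowing idea described above looks roughly like this in plain DOM JavaScript:

// Append newly fetched rows at the bottom and drop rows from the top,
// capping how many DOM nodes the "infinite" list keeps alive at once.
function appendRows(list, rows, maxRows) {
   rows.forEach(function (text) {
      var item = document.createElement("li");
      item.textContent = text;
      list.appendChild(item);
   });
   while (list.children.length > maxRows) {
      list.removeChild(list.firstElementChild);
   }
}
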
It's worth noting that rival social network Facebook raised very similar concerns about list performance and scrolling in a recent message to a W3C mailing list. Facebook largely ended up swapping HTML for native user interface controls in the latest update of its own mobile application.
For Prasad, however, the limitations of current Web rendering engines don't demand a full return to native code—it's a matter of using the right tool for the job. He dismisses the idea that you have to choose between native controls and HTML, saying that it represents a false dichotomy.
Ultimately, LinkedIn's iPhone application consists of 70 percent HTML and 30 percent native. The Android application is roughly 40 percent native and 60 percent HTML. The Android application relies more extensively on native elements because the platform's HTML rendering engine isn't quite as capable as the one on iOS.
LinkedIn had started with an 80-percent-native Android application, but has been able to gradually increase the amount of HTML over time due to incremental improvements in the platform's HTML renderer. Prasad is hopeful that the greatly enhanced HTML renderer that comes with Chrome for Android will eventually be available in the embeddable HTML display component that Android supplies for third-party applications. He described Chrome for Android as "awesome" and said that its support for hardware-accelerated rendering is great.
Aside from handling some performance-critical user interface elements like lists, the native part of LinkedIn's application also serves a vital role in trapping and responding to errors uncovered in the embedded HTML views. In cases where an embedded HTML pane encounters a fatal error, the native part of the application can cleanly bring it down and then repopulate it, often without the user even knowing.
Prasad briefly discussed his views on Web runtime solutions, such as PhoneGap, that aim to simplify HTML-based mobile application development and provide such applications with native shims to underlying platform functionality. While PhoneGap and similar frameworks are useful for rapid prototyping and for companies with limited resources, he said, it's better to implement the native parts of hybrid mobile software in a way that meets the specific requirements of the individual application. Ideally, he told me, developers should choose the right balance between native and Web for each application and then build their own bridge between the two environments.
I asked him to describe the specific mechanism he uses on each platform to enable communication between native code and the HTML user interface elements. On Android, the LinkedIn application largely relies on the platform's built-in support for exposing specific Java functions into the JavaScript runtime of an embedded HTML view. That feature made it relatively straightforward to bridge the gap.
On iOS, the matter is a bit more complicated. He said the company tried several different methods before settling on one they felt was most effective. Their first approach involved using a platform API to "eval" JavaScript expressions in the embedded WebView. This proved to be too computationally expensive, sometimes introducing undesirable jerkiness.
The second approach that LinkedIn tested was one using WebSockets to establish a connection between the HTML components and native code. This worked exceptionally well from a performance perspective, but wasn't stable enough for practical use. LinkedIn used this method in an actual release, but later replaced it.
Finally LinkedIn tested, and ultimately chose to adopt, something surprising. The company embedded an extremely lightweight HTTP server in the application itself. The HTTP server exposes native functionality through a very simple REST API that can be consumed in the embedded HTML controls through standard JavaScript HTTP requests. Among other things, this server also provides access to the user's contacts, calendar, and other underlying platform functionality.
When I asked Prasad if this approach raised any security issues, he said that the company reviewed it internally and found it to be acceptable. There are a number of precautions taken in the application to prevent it from being abused. The server is bound only to localhost and can't be accessed by other devices on the network. It is also suspended immediately whenever the LinkedIn application is sent to the background.
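Purely as a hypothetical sketch of what consuming such a localhost bridge could look like from the embedded HTML side (the port and endpoint path are invented, not LinkedIn's actual API):

// Hypothetical call from an embedded HTML view to the local-only native bridge.
function loadCalendar(callback) {
   var xhr = new XMLHttpRequest();
   xhr.open("GET", "http://localhost:8080/native/calendar", true);
   xhr.onload = function () {
      callback(JSON.parse(xhr.responseText));
   };
   xhr.send();
}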

Building a more scalable backend

When LinkedIn was deciding how to rebuild its backend, the company used a similar philosophy to the one that had directed the application design decisions. Like everything else, they wanted their backend technology stack to be fast, easy to work with, and reliable.
In a large-scale backend system, Prasad said, companies typically adapt the traditional model-view-controller (MVC) pattern into a three-tier system. The bottom tier consists of the database storage layer, the middle tier handles caching and some business logic, and the top tier serves as the presentation layer, generating your HTML views.
When you are building a mobile application, he said, this structure is no longer applicable. Your presentation layer is on the device itself, often consisting of native user interface elements. LinkedIn wanted a new kind of middle tier that would facilitate more effective communication with a mobile frontend.
In order to minimize the latency introduced by establishing new connections, the LinkedIn developers decided the application should establish as few connections as possible with the server. This led them to a model where the application is essentially piping all of its data through a single connection that is held open for as long as it is needed.
The client application connects to a system on the backend that functions as an aggregator, pulling together all of the data that it needs from various components of the backend stack and combining it all into a unified stream of data that can be piped down to the client application through a single open connection.
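A rough sketch of that aggregator shape (hostnames, paths, and port are placeholders, not LinkedIn's services, and it is written here in NodeJS, which the article goes on to say LinkedIn chose): fan out to backend services in parallel, then answer the mobile client once over its single open connection.

// Illustrative NodeJS aggregator: fetch from two internal services in parallel,
// then return one combined JSON payload to the mobile client.
var http = require("http");
function fetchJson(path, callback) {
   http.get({ host: "backend.internal", path: path }, function (res) {
      var body = "";
      res.on("data", function (chunk) { body += chunk; });
      res.on("end", function () { callback(JSON.parse(body)); });
   });
}
http.createServer(function (request, response) {
   var pending = 2, combined = {};
   function done() {
      if (--pending === 0) response.end(JSON.stringify(combined));
   }
   fetchJson("/profile", function (data) { combined.profile = data; done(); });
   fetchJson("/updates", function (data) { combined.updates = data; done(); });
}).listen(3000);
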
To build this streamlined middle tier, the LinkedIn developers wanted to use a lightweight event-driven framework. They also wanted an environment that would be well-suited to handle the aggregation and string interpolation capabilities required for the service. They tested several candidates, including Ruby with EventMachine, Python with Twisted, and the JavaScript-based Node.js framework.
They found that Node.js, which is based on Google's V8 JavaScript engine, offered substantially better performance and lower memory overhead than the other options being considered. Prasad said that Node.js "blew away" the performance of the alternatives, running as much as 20 times faster in some scenarios.
Another advantage of adopting Node.js for the project is that it allowed LinkedIn's JavaScript engineers to put their existing expertise to use on the backend. Prasad said the company was able to combine its frontend and backend mobile teams into a single unit. The event-driven nature of frontend development made it easier for the user interface programmers to understand the way that Node.js works.
Prasad was practically giddy when he told me just how much the transition from Rails to Node.js improves the scalability of LinkedIn's mobile backend infrastructure. Impressively, the company was able to move from 30 servers down to three, while still having enough headroom to handle roughly ten times their current level of resource utilization.
Firefighting scalability problems on the old infrastructure had been a major distraction, one that forced the company's engineers to spend a lot of time just keeping the system running. With that problem defeated, the engineers were free to spend more time focusing on user-facing product improvements.

Perfection is elusive

The mobile engineers at LinkedIn have done some impressive work to make the application fast, responsive, easy to use, and aesthetically pleasing. But reliability, the third item in LinkedIn's taxonomy of mobile priorities, remains an issue for some users. LinkedIn's iOS application currently has an average rating of two-and-a-half stars out of five, with an average of three stars across all versions. The Android application fares a bit better, with 3.7 stars out of five. LinkedIn rolls out routine updates as it continues to improve its software.
LinkedIn's mobile application has also been the subject of privacy concerns. It faced scrutiny earlier this year when security researchers discovered that it was programmed to send user calendars back to the LinkedIn mothership. The company responded by tweaking this behavior and attempting to make it more transparent to the end-user.
But ultimately, building a mobile experience for a popular social network is not an easy task. Whatever shortcomings currently exist are being monitored, and could potentially be addressed in the same way this redesign built on past feedback. Prasad's description of the underlying technical details not only offers a glimpse into the challenges and complexity of this problem space, but also provides unique insight into how LinkedIn approaches it. When you experience LinkedIn mobile now, you're experiencing the company's engineering culture and the philosophy that guided its design and development process.