Thursday, October 25, 2012
Rails - Human readable name for a class
some_class.model_name.human
Tuesday, October 23, 2012
Nginx over Apache
" I ran a simple test against Nginx v0.5.22 and Apache v2.2.8 using ab (Apache's benchmarking tool). During the tests, I monitored the system with vmstat and top. The results indicate that Nginx outperforms Apache when serving static content. Both servers performed best with a concurrency of 100. Apache used four worker processes (threaded mode), 30% CPU and 17MB of memory to serve 6,500 requests per second. Nginx used one worker, 15% CPU and 1MB of memory to serve 11,500 requests per second."
-- Linux Journal
Monday, October 22, 2012
Right tools for the right task - Reasons for moving from Node.js to Ruby on Rails
reference:http://www.aqee.net/why-we-moved-from-nodejs-to-ror/
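Why we moved from NodeJS to Ruby on Rails
Disclaimer: this article is in no way a polemic about whether NodeJS or Ruby on Rails is better. It only describes some of the thinking behind our decision and the reasons for it. Both frameworks are excellent and do what they were designed to do very well, which is why some of our modules still run on NodeJS.
I am a big fan of NodeJS. I think it is a very exciting technology and I believe it will become more and more popular. I appreciate the technology a great deal, even though we recently moved the Targeter App from NodeJS to Ruby on Rails.
The reason we originally built it with NodeJS was simple. I had a package that could get our app online quickly (we spent 54 hours on it), and I use JavaScript more often than Ruby. Since our stack involved MongoDB, those skills only made sense in a NodeJS environment. However, as the application grew, I realized that choosing NodeJS for this application was a mistake. Let me outline the reasons.
NodeJS is a great fit for applications with a large number of short-lived requests. For traditional CRUD applications it is fine, but not ideal. PHP, Ruby, and Python all have mature, well-optimized frameworks for this kind of application. The NodeJS idea that everything executes asynchronously brings no benefit to a CRUD application. Popular frameworks in other languages offer very good caching and can meet all of your needs, including asynchronous execution.
NodeJS is a very young framework, and the libraries around it are not very mature. I mean no offense to the contributors; they are excellent and have built many great libraries. However, most libraries need improvement, and the fast-moving NodeJS environment means every release brings a lot of changes; when you use a cutting-edge technology, you really need to keep up with the latest version as quickly as possible. This causes a lot of trouble for startups.
Another reason is testing. The testing frameworks in NodeJS are decent, but they still fall short of those on Django or RoR. For an application that gets a large number of commits every day and has to ship within a day or two, it is crucial that things do not break; otherwise the hard work is not worth it. Nobody wants to spend a day fixing silly bugs.
Finally, what we needed was something that could cache everything, and we needed it as soon as possible. Although our application is growing, with tens of thousands of hits per second, there will never be a very large volume of requests; this is not a chat application! The main app peaks at about 1000 RPS, and that load is nothing for Ruby on Rails and Nginx.
If you are still reading, you have seen everything I wanted to say, and you may really want to know where our application still uses NodeJS. Our application has two parts. One is the interface, the part users see; the other handles report management and logging. The latter is an ideal use case for NodeJS, with a large number of short-lived requests. These actions need to finish as quickly as possible, even before our data push has completed. This matters: if the browser keeps waiting for a response while the request is still running, the user experience suffers. The asynchronous nature of NodeJS saved us. The data is either stored in the database or processed, and as soon as the request completes the browser can get on with other important things.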
Ruby - Convert JSON to hash and vice versa
Hash to Json: your_hash.to_json
Json to Hash: JSON.parse( your_json )
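A minimal round-trip sketch of the two calls above, using only Ruby's standard json library. The sample hash and the variable names are made up for illustration. One thing worth knowing: symbol keys come back as strings after a round trip unless you pass symbolize_names: true.

```ruby
require "json"

# Illustrative data only; any hash of JSON-compatible values works
your_hash = { "name" => "widget", "tags" => ["a", "b"], "count" => 2 }

# Hash to JSON string
your_json = your_hash.to_json

# JSON string back to a hash (keys are strings by default)
round_trip = JSON.parse(your_json)

# Same parse, but with symbol keys
symbolized = JSON.parse(your_json, symbolize_names: true)
```

In a plain Ruby script, require "json" is needed before calling to_json; Rails loads its own JSON support for you.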
Scaling PostgreSQL at Braintree: Four Years of Evolution
reference:https://www.braintreepayments.com/braintrust/scaling-postgresql-at-braintree-four-years-of-evolution
Posted on October 16, 2012 by Paul Gross
We love PostgreSQL at Braintree. Although we use many different data stores (such as Riak, MongoDB, Redis, and Memcached), most of our core data is stored in PostgreSQL. It's not as sexy as the new NoSQL databases, but PostgreSQL is consistent and incredibly reliable, two properties we value when storing payment information.
We also love the ad-hoc querying that we get from a relational database. For example, if our traffic looks fishy, we can answer questions like "What is the percentage of Visa declines coming from Europe?" without having to pre-compute views or write complex map/reduce queries.
Our PostgreSQL setup has changed a lot over the last few years. In this post, I'm going to walk you through the evolution of how we host and use PostgreSQL. We've had a lot of help along the way from the very knowledgeable people at Command Prompt.
2008: The beginning
Like most Ruby on Rails apps in 2008, our gateway started out on MySQL. We ran a couple of app servers and two database servers replicated using DRBD. DRBD uses block level replication to mirror partitions between servers. This setup was fine at first, but as our traffic started growing, we began to see problems.
2010: The problems with MySQL
The biggest problem we faced was that schema migrations on large tables took a long time with MySQL. As our dataset grew, our deploys started taking longer and longer. We were iterating quickly, and our schema was evolving. We couldn't keep affording to take downtime while we upgraded or even added a new index to a large table.
We explored various options with MySQL (such as oak-online-alter-table), but decided that we would rather move to a database that supported it directly. We were also starting to see deadlock issues with MySQL, which were on operations we felt shouldn't deadlock. PostgreSQL solved this problem as well.
We migrated from MySQL to PostgreSQL in the fall of 2010. You can read more about the migration on the slides from my PgEast talk. PostgreSQL 9.0 was recently released, but we chose to go with version 8.4 since it had been out longer and was more well known.
2010 - 2011: Initial PostgreSQL
We ran PostgreSQL on modest hardware, and we kept DRBD for replication. This worked fine at first, but as our traffic continued to grow, we needed some upgrades. Unlike most applications, we are much heavier on writes than reads. For every credit card that we charge, we store a lot of data (such as customer information, raw responses from the processing networks, and table audits).
Over the next year, we performed the following upgrades:
- Tweaked our configs around checkpoints, shared buffers, work_mem and more (this is a great start: Tuning Your PostgreSQL Server)
- Moved the Write Ahead Log (WAL) to its own partition (so fsyncs of the WAL don't flush all of the dirty data files)
- Moved the WAL to its own pair of disks (so the sequential writes of the WAL are not slowed down by the random read/write of the data files)
- Added more RAM
- Moved to better servers (24 cores, 16 disks, even more RAM)
- Added more RAM again (kept adding to keep the working set in RAM)
Fall 2011: Sharding
These incremental improvements worked great for a long time, and our database was able to keep up with our ever increasing volume. In the summer of 2011, we started to feel like our traffic was going to outgrow a single server. We could keep buying better hardware, but we knew there was a limit.
We talked about a lot of different solutions, and in the end, we decided to horizontally shard our database by merchant. A merchant's traffic would all live on one shard to make querying easier, but different merchants would live on different shards.
We used data_fabric to introduce sharding into our Rails app. data_fabric lets you specify which models are sharded, and gives you methods for activating a specific shard. In conjunction with data_fabric, we also wrote a fair amount of custom code for sharding. We sharded every table except for a handful of global tables, such as merchants and users. Since almost every URL has the merchant id in it, we were able to activate shards in application_controller.rb for 99% of our traffic with code that looked roughly like:
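class ApplicationController < ActionController::Base
  around_filter :activate_shard

  def activate_shard(&block)
    merchant = Merchant.find_by_public_id(params[:merchant_id])
    DataFabric.activate_shard(:shard => merchant.shard, &block)
  end
end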
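The idea of looking up a merchant's shard in a global table and activating it for the duration of a request can be sketched in plain Ruby, without Rails or data_fabric. This is only an illustration, not Braintree's code: MERCHANT_SHARDS, SHARD_CONNECTIONS, ShardRouter and handle_request are hypothetical names, and the "connections" are just strings.

```ruby
# Hypothetical stand-in for the global merchants table: public id -> shard number
MERCHANT_SHARDS = { "merchant_a" => 0, "merchant_b" => 1 }.freeze

# Hypothetical stand-in for one database connection per shard
SHARD_CONNECTIONS = { 0 => "db-shard-0", 1 => "db-shard-1" }.freeze

module ShardRouter
  # Runs the block with the given shard active, then restores the previous
  # value even if the block raises
  def self.with_shard(shard)
    previous = Thread.current[:active_shard]
    Thread.current[:active_shard] = shard
    yield
  ensure
    Thread.current[:active_shard] = previous
  end

  # Returns the connection for whichever shard is currently active
  def self.current_connection
    shard = Thread.current[:active_shard]
    raise "no shard is active" if shard.nil?
    SHARD_CONNECTIONS.fetch(shard)
  end
end

# Every query made inside the block goes to the merchant's shard
def handle_request(merchant_id)
  shard = MERCHANT_SHARDS.fetch(merchant_id)
  ShardRouter.with_shard(shard) { ShardRouter.current_connection }
end
```

Keeping the merchant-to-shard mapping in a global (unsharded) table is what lets a merchant be moved later by updating a single column, as the next paragraphs describe.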
Making our code work with sharding was only half the battle. We still had to migrate merchants to a different shard (without downtime). We did this with londiste, a statement-based replication tool. We set up the new database servers and used londiste to mirror the entire database between the current cluster (which we renamed to shard 0) and the new cluster (shard 1).
Then, we paused traffic[1], stopped replication, updated the shard column in the global database, and resumed traffic. The whole process was automated using capistrano. At this point, some requests went to the new database servers, and some to the old. Once we were sure everything was working, we removed the shard 0 data from shard 1 and vice versa.
The final cutover was completed in the fall of 2011.
Spring 2012: DRBD Problems
Sharding took care of our performance problems, but in the spring of 2012, we started running into issues with our DRBD replication:
- DRBD made replicating between two servers very easy, but more than two required complex stacked resources that were harder to orchestrate. It also required more moving pieces, like DRBD Proxy to prevent blocking writes between data centers.
- DRBD is block level replication, so the filesystem is shared between servers. This means it can never be unmounted and checked (fsck) without taking downtime. We became increasingly concerned that filesystem corruption would go unnoticed and corrupt all servers in the cluster.
- The filesystem can only be mounted on the primary server, so the standby servers sit idle. It is not possible to run read-only queries on them.
- Failover required unmounting and remounting filesystems, so it was slower than desired. Also, since the filesystem was unmounted on the target server, once mounted, the filesystem cache was empty. This meant that our backup PostgreSQL was slow after failover, and we would see slow requests and sometimes timeouts.
- We saw a couple of issues in our sandbox environment where DRBD issues on the secondary prevented writes on the primary node. Thankfully, these never occurred in production, but we had a lot of trouble tracking down the issue.
- We were still using manual failover because we were scared of the horror stories with Pacemaker and DRBD causing split brain scenarios and data corruption. We wanted to get to automated failover, however.
- DRBD required a kernel module, so we had to build and test a new module every time we upgraded the kernel.
- One upgrade of DRBD caused a huge degradation of write performance. Thankfully, we discovered the issue in our test environment, but it was another reason to be wary of kernel level replication.
Given all of these concerns, we decided to leave DRBD replication and move to PostgreSQL streaming replication (which was new in PostgreSQL 9). We felt like it was a better fit for what we wanted to do. We could replicate to many servers easily, standby servers were queryable letting us offload some expensive queries, and failover was very quick.
We made the switch during the summer of 2012.
Summer 2012: PostgreSQL 9.1
We updated our code to support PostgreSQL 9.1 (which involved very few code changes). Along with the upgrade, we wanted to move to fully automated failover. We decided to use Pacemaker and these great open source scripts for managing PostgreSQL streaming replication: https://github.com/t-matsuo/resource-agents/wiki. These scripts handle promotion, moving the database IPs, and even switching from sync to async mode if there are no more standby servers.
We set up our new database clusters (one per shard). We used two servers per datacenter, with synchronous replication within the datacenter and asynchronous replication between our datacenters. We configured Pacemaker and had the clusters ready to go (but empty). We performed extensive testing on this setup to fully understand the failover scenarios and exactly how Pacemaker would react.
We used londiste again to copy the data. Once the clusters were up to date, we did a similar cutover: we paused traffic, stopped londiste, updated our database.yml, and then resumed traffic. We did this one shard at a time, and the entire procedure was automated with capistrano. Again, we took no downtime.
Fall 2012: Today
Today, we're in a good state with PostgreSQL. We have fully automated failover between servers (within a datacenter). Our cross datacenter failover is still manual since we want to be sure before we give up on an entire datacenter. We have automated capistrano tasks to orchestrate controlled failover using Pacemaker and traffic pausing. This means we can perform database maintenance with zero downtime.
One of our big lessons learned is that we need to continually invest in our PostgreSQL setup. We're always watching our PostgreSQL performance and making adjustments where needed (new indexes, restructuring our data, config tuning, etc). Since our traffic continues to grow and we record more and more data, we know that our PostgreSQL setup will continue to evolve over the coming years.
[1] For more info on how we pause traffic, check out How We Moved Our Data Center 25 Miles Without Downtime and High Availability at Braintree
An article on server scaling - Nginx/NodeJS vs Apache
reference:http://erratasec.blogspot.hk/2012/10/scalability-is-systemic-anomaly.html
Scalability: it's the question that drives us; it's a systemic anomaly inherent to the programming of the matrix.
Few statements troll me better than saying “Nginx is faster than Apache” (referring to the two most popular web-servers). Nginx is certainly an improvement over Apache, but the reason is because it’s more scalable not because it’s faster.
Scalability is about how performance degrades under heavier loads. Sometimes servers simply need twice the performance to handle twice the traffic. Sometimes they need four times the performance. Sometimes no amount of added performance will handle the increase in traffic.
This can be visualized with the following graph of Apache server performance:
The graph above shows the well-known problem with Apache: it has a limit to the number of simultaneous connections. As the number of connections to the server increases, its ability to handle traffic goes down. With 10,000 simultaneous connections, an Apache server is essentially disabled, unable to service any of the connections.
Naively, we assume that all we need to do to fix this problem is to get a faster server. If Apache runs acceptably around 5,000 connections, then we assume that doubling server speed will make it handle 10,000 connections. This is not so, as shown in the following graph:
The above graph shows how increasing speed by two, four, eight, sixteen, and even thirty-two times still does not enable Apache to handle 10,000 simultaneous connections.
This is why scalability is a magical problem unrelated to performance: more performance just doesn’t fix it.
Spoon boy: Do not try and bend the spoon. That's impossible. Instead... only try to realize the truth.
Neo: What truth?
Spoon boy: There is no spoon.
Neo: There is no spoon?
Spoon boy: Then you'll see, that it is not the spoon that bends, it is only yourself.
The solution to scalability is like the spoon-boy story from The Matrix. It’s impossible to fix the scalability problem until you realize the truth that it’s you who bends. More hardware won’t fix scalability problems, instead you must change your software. Instead of writing web-apps using Apache’s threads, you need to write web-apps in an “asynchronous” or “event-driven” manner.
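To make "event-driven" a little more concrete, here is a small Ruby sketch in which one thread watches many connections and only touches the ones that have data waiting, instead of dedicating a thread to each. It is an illustration under assumptions, not how Nginx or NodeJS are implemented: UNIXSocket pairs stand in for network connections (so it needs a Unix-like system), and serve_once is a made-up helper name.

```ruby
require "socket"

# Waits until at least one socket is readable (or the timeout expires),
# answers only the readable sockets, and returns how many were handled
def serve_once(server_sides, timeout = 1)
  readable, = IO.select(server_sides, nil, nil, timeout)
  return 0 if readable.nil? # timed out, nothing to do

  readable.each do |sock|
    request = sock.readpartial(1024)
    sock.write("echo: #{request}")
  end
  readable.size
end

# 100 idle "connections", all owned by a single thread
pairs   = Array.new(100) { UNIXSocket.pair }
clients = pairs.map(&:first)
servers = pairs.map(&:last)

# Only two clients actually send something
clients[3].write("hello")
clients[42].write("world")

handled = serve_once(servers)
```

The 98 idle connections cost nothing beyond their entry in the list passed to IO.select; a real server would run serve_once in a loop and typically use a more scalable readiness API than select.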
A graph of asynchronous performance looks something like the following:
This graph shows why asynchronous servers seem so magical. Running Nginx or NodeJS on a notebook computer costing $500 will still outperform a $10,000 server running Apache. Sure, the expensive system may be 100 times faster at fewer than 1000 connections, but as connections scale, even a tiny system running Nginx/NodeJS will eventually come out ahead.
Whereas 10,000 connections is a well-known problem for Apache, systems running Nginx or NodeJS have been scaled to 1-million connections. Other systems (such as my IPS code) scale to 10 million connections on cheap desktop hardware.
Morpheus: I'm trying to free your mind, Neo. But I can only show you the door. You're the one that has to walk through it.
The purpose of this post is to show you the last graph, demonstrating how asynchronous/event-driven software like Nginx isn’t “faster” than Apache. Instead, this software is more “scalable”.
This is one of the blue-pill/red-pill choices for programmers. Choosing the red-pill will teach you an entirely new way of seeing the Internet, allowing you to create scalable solutions that seem unreal to those who chose the blue-pill.
Scalability is a systemic anomaly inherent to the programming of the matrix
Posted by Robert David Graham (@ErrataRob)
Friday, October 19, 2012
Rails - html form input element name from object
ActiveModel::Naming.singular(object)
Thursday, October 18, 2012
JQuery - disable form inputs
// Disable every input and select in the form that contains the submit button
$('#submit_button').closest('form').find('input, select').prop('disabled', true); // pass false to re-enable
Wednesday, October 17, 2012
Rails - A note on lib files
The Rails server needs to be restarted to pick up any code changes made to files in lib.
Tuesday, October 16, 2012
Simpler, Cheaper, Faster: Playtomic's Move From .NET To Node And Heroku
Reference: http://highscalability.com/blog/2012/10/15/simpler-cheaper-faster-playtomics-move-from-net-to-node-and.html?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+HighScalability+%28High+Scalability%29
Just over 20,000,000 people hit my API yesterday 700,749,252 times, playing the ~8,000 games my analytics platform is integrated in for a bit under 600 years in total play time. That's just yesterday. There are lots of different bottlenecks waiting for people operating at scale. Heroku and NodeJS, for my use case, eventually alleviated a whole bunch of them very cheaply.
Playtomic began with an almost exclusively Microsoft.NET and Windows architecture which held up for 3 years before being replaced with a complete rewrite using NodeJS. During its lifetime the entire platform grew from shared space on a single server to a full dedicated, then spread to second dedicated, then the API server was offloaded to a VPS provider and 4 – 6 fairly large VPSs. Eventually the API server settled on 8 dedicated servers at Hivelocity, each a quad core with hyperthreading + 8gb of ram + dual 500gb disks running 3 or 4 instances of the API stack.
These servers routinely serviced 30,000 to 60,000 concurrent game players and received up to 1500 requests per second, with load balancing done via DNS round robin.
In July the entire fleet of servers was replaced with a NodeJS rewrite hosted at Heroku for a significant saving.
Scaling Playtomic With NodeJS
There were two parts to the migration:
- Dedicated to PaaS: Advantages include price, convenience, leveraging their load balancing and reducing overall complexity. Disadvantages include no New Relic for NodeJS, very inelegant crashes, and a generally immature platform.
- .NET to NodeJS: Switching architecture from ASP.NET / C# with local MongoDB instances and a service preprocessing event data locally and sending it to centralized server to be completed; to NodeJS on Heroku + Redis and preprocessing on SoftLayer (see Catalyst program).
Dedicated To PaaS
The reduction in complexity is significant; we had 8 dedicated servers each running 3 or 4 instances of the API at our hosting partner Hivelocity. Each ran a small suite of software including:
- MongoDB instance
- log pre-processing service
- monitoring service
- IIS with api sites
Deploying was done via an FTP script that uploaded new api site versions to all servers. Services were more annoying to deploy but changed infrequently.
MongoDB was a poor choice for temporarily holding log data before it was pre-processed and sent off. It offered a huge speed advantage of just writing to memory initially which meant write requests were “finished” almost instantly which was far superior to common message queues on Windows, but it never reclaimed space left from deleted data which meant the db size would balloon to 100+ gigabytes if it wasn’t compacted regularly.
The advantages of PaaS providers are pretty well known. They all seem quite similar, although it’s easiest to have confidence in Heroku and Salesforce since they seem the most mature and have broad technology support.
The main challenge in transitioning to PaaS was shaking the mentality that we could run assistive software alongside the website as we did on the dedicated servers. Most platforms provide some sort of background worker threads you can leverage, but that means you need to route data and tasks from the web threads through a 3rd party service or server, which seems unnecessary.
We eventually settled on a large server at Softlayer running a dozen purpose-specific Redis instances and some middleware rather than background workers. Heroku doesn’t charge for outbound bandwidth and Softlayer doesn’t charge for inbound, which neatly avoided paying for the significant bandwidth involved.
Switching From .NET To NodeJS
Working with JavaScript on the serverside is a mixed experience. On the one hand the lack of formality and boilerplate is liberating. On the other hand there’s no New Relic and no compiler errors which makes everything harder than it needs to be.
There are two main advantages that make NodeJS spectacularly useful for our API.
- Background workers in the same thread and memory as the web server
- Persistent, shared connections to redis and mongodb (etc)
Background Workers
NodeJS has the very useful ability to continue working independently of requests, allowing you to prefetch data and other operations that allow you to terminate a request very early and then finish processing it.
It is particularly advantageous for us to replicate entire MongoDB collections in memory, periodically refreshed, so that entire classes of work have access to current data without having to go to an external database or local/shared caching layer.
We collectively save 100s – 1000s of database queries per second using this in:
- Game configuration data on our main api
- API credentials on our data exporting api
- GameVars which developers use to store configuration or other data to hotload into their games
- Leaderboard score tables (excluding scores)
The basic model is:
var cache = {};

// The request handler only ever reads from the in-memory cache
module.exports = function(request, response) {
    response.end(cache["x"]);
};

function refresh() {
    // fetch updated data from database, store in cache object
    cache["x"] = "foo";
    // schedule the next refresh in 30 seconds
    setTimeout(refresh, 30000);
}

refresh();
The advantages of this are a single connection (per dyno or instance) to your backend databases instead of per-user, and a very fast local memory cache that always has fresh data.
The caveats are your dataset must be small, and this is operating on the same thread as everything else so you need to be conscious of blocking the thread or doing too-heavy cpu work.
Persistent Connections
The other massive benefit NodeJS offers over .NET for our API is persistent database connections. The traditional method of connecting in .NET (etc) is to open your connection, do your operation, after which your connection is returned to a pool to be re-used shortly or expired if it’s no longer needed.
This is very common and until you get to a very high concurrency it will Just Work. At a high concurrency the connection pool can’t re-use the connections fast enough which means it generates new connections that your database servers will have to scale to handle.
At Playtomic we typically have several hundred thousand concurrent game players that are sending event data which needs to be pushed back to our Redis instances in a different datacenter which with .NET would require a massive volume of connections – which is why we ran MongoDB locally on each of our old dedicated servers.
With NodeJS we have a single connection per dyno/instance which is responsible for pushing all the event data that particular dyno receives. It lives outside of the request model something like this:
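var redis = require("redis");

// One client per dyno/instance, created once outside the request handler
var redisclient = redis.createClient(/* …. */);

module.exports = function(request, response) {
    var eventdata = "etc";
    // every request pushes onto the same shared connection
    redisclient.lpush("events", eventdata);
};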
The End Result
High load:
REQUESTS IN LAST MINUTE
_exceptions: 75 (0.01%)
_failures: 5 (0.00%)
_total: 537,151 (99.99%)
data.custommetric.success: 1,093 (0.20%)
data.levelaveragemetric.success: 2,466 (0.46%)
data.views.success: 105 (0.02%)
events.regular.invalid_or_deleted_game#2: 3,814 (0.71%)
events.regular.success: 527,837 (98.25%)
gamevars.load.success: 1,060 (0.20%)
geoip.lookup.success: 109 (0.02%)
leaderboards.list.success: 457 (0.09%)
leaderboards.save.missing_name_or_source#201: 3 (0.00%)
leaderboards.save.success: 30 (0.01%)
leaderboards.saveandlist.success: 102 (0.02%)
playerlevels.list.success: 62 (0.01%)
playerlevels.load.success: 13 (0.00%)
This data comes from load monitoring that runs in the background on each instance and pushes counters to Redis, where they’re aggregated and stored in MongoDB. You can see it in action at https://api.playtomic.com/load.html.
There are a few different classes of requests in that data:
- Events that check the game configuration from MongoDB, perform a GeoIP lookup (opensourced very fast implementation at https://github.com/benlowry/node-geoip-native), and then push to Redis
- GameVars, Leaderboards, Player Levels all check game configuration from MongoDB and then whatever relevant MongoDB database
- Data lookups are proxied to a Windows server because of poor NodeJS support for stored procedures
The result is 100,000s of concurrent users causing spectacularly light Redis loads for 500,000 – 700,000 lpush’s per minute (and being pulled out on the other end):
1 [|| 1.3%] Tasks: 83; 4 running
2 [||||||||||||||||||| 19.0%] Load average: 1.28 1.20 1.19
3 [|||||||||| 9.2%] Uptime: 12 days, 21:48:33
4 [|||||||||||| 11.8%]
5 [|||||||||| 9.9%]
6 [||||||||||||||||| 17.7%]
7 [||||||||||||||| 14.6%]
8 [||||||||||||||||||||| 21.6%]
9 [|||||||||||||||||| 18.2%]
10 [| 0.6%]
11 [ 0.0%]
12 [|||||||||| 9.8%]
13 [|||||||||| 9.3%]
14 [|||||| 4.6%]
15 [|||||||||||||||| 16.6%]
16 [||||||||| 8.0%]
Mem[||||||||||||||| 2009/24020MB]
Swp[ 0/1023MB]
PID USER PRI NI VIRT RES SHR S CPU% MEM% TIME+ Command
12518 redis 20 0 40048 7000 640 S 0.0 0.0 2:21.53 `- /usr/local/bin/redis-server /etc/redis/analytics.conf
12513 redis 20 0 72816 35776 736 S 3.0 0.1 4h06:40 `- /usr/local/bin/redis-server /etc/redis/log7.conf
12508 redis 20 0 72816 35776 736 S 2.0 0.1 4h07:31 `- /usr/local/bin/redis-server /etc/redis/log6.conf
12494 redis 20 0 72816 37824 736 S 1.0 0.2 4h06:08 `- /usr/local/bin/redis-server /etc/redis/log5.conf
12488 redis 20 0 72816 33728 736 S 2.0 0.1 4h09:36 `- /usr/local/bin/redis-server /etc/redis/log4.conf
12481 redis 20 0 72816 35776 736 S 2.0 0.1 4h02:17 `- /usr/local/bin/redis-server /etc/redis/log3.conf
12475 redis 20 0 72816 27588 736 S 2.0 0.1 4h03:07 `- /usr/local/bin/redis-server /etc/redis/log2.conf
12460 redis 20 0 72816 31680 736 S 2.0 0.1 4h10:23 `- /usr/local/bin/redis-server /etc/redis/log1.conf
12440 redis 20 0 72816 33236 736 S 3.0 0.1 4h09:57 `- /usr/local/bin/redis-server /etc/redis/log0.conf
12435 redis 20 0 40048 7044 684 S 0.0 0.0 2:21.71 `- /usr/local/bin/redis-server /etc/redis/redis-servicelog.conf
12429 redis 20 0 395M 115M 736 S 33.0 0.5 60h29:26 `- /usr/local/bin/redis-server /etc/redis/redis-pool.conf
12422 redis 20 0 40048 7096 728 S 0.0 0.0 26:17.38 `- /usr/local/bin/redis-server /etc/redis/redis-load.conf
12409 redis 20 0 40048 6912 560 S 0.0 0.0 2:21.50 `- /usr/local/bin/redis-server /etc/redis/redis-cache.conf
and very light MongoDB loads for 1,800 – 2,500 CRUD operations a minute:
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time
2 9 5 2 0 8 0 6.67g 14.8g 1.22g 0 0.1 0 0|0 0|0 3k 7k 116 01:11:12
1 1 5 2 0 6 0 6.67g 14.8g 1.22g 0 0.1 0 0|0 0|0 2k 3k 116 01:11:13
0 3 6 2 0 8 0 6.67g 14.8g 1.22g 0 0.2 0 0|0 0|0 3k 6k 114 01:11:14
0 5 5 2 0 12 0 6.67g 14.8g 1.22g 0 0.1 0 0|0 0|0 3k 5k 113 01:11:15
1 9 7 2 0 12 0 6.67g 14.8g 1.22g 0 0.1 0 0|0 0|0 4k 6k 112 01:11:16
1 10 6 2 0 15 0 6.67g 14.8g 1.22g 0 0.1 0 0|0 1|0 4k 22k 111 01:11:17
1 5 6 2 0 11 0 6.67g 14.8g 1.22g 0 0.2 0 0|0 0|0 3k 19k 111 01:11:18
1 5 5 2 0 14 0 6.67g 14.8g 1.22g 0 0.1 0 0|0 0|0 3k 3k 111 01:11:19
1 2 6 2 0 8 0 6.67g 14.8g 1.22g 0 0.2 0 0|0 0|0 3k 2k 111 01:11:20
1 7 5 2 0 9 0 6.67g 14.8g 1.22g 0 0.1 0 0|0 0|0 3k 2k 111 01:11:21
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time
2 9 8 2 0 8 0 6.67g 14.8g 1.22g 0 0.2 0 0|0 0|0 4k 5k 111 01:11:22
3 8 7 2 0 9 0 6.67g 14.8g 1.22g 0 0.2 0 0|0 0|0 4k 9k 110 01:11:23
2 6 6 2 0 10 0 6.67g 14.8g 1.22g 0 0.2 0 0|0 0|0 3k 4k 110 01:11:24
2 8 6 2 0 21 0 6.67g 14.8g 1.22g 0 0.2 0 0|0 0|0 4k 93k 112 01:11:25
1 10 7 2 3 16 0 6.67g 14.8g 1.22g 0 0.2 0 0|0 0|0 4k 4m 112 01:11:26
3 15 7 2 3 24 0 6.67g 14.8g 1.23g 0 0.2 0 0|0 0|0 6k 1m 115 01:11:27
1 4 8 2 0 10 0 6.67g 14.8g 1.22g 0 0.2 0 0|0 0|0 4k 2m 115 01:11:28
1 6 7 2 0 14 0 6.67g 14.8g 1.22g 0 0.2 0 0|0 0|0 4k 3k 115 01:11:29
1 3 6 2 0 10 0 6.67g 14.8g 1.22g 0 0.1 0 0|0 0|0 3k 103k 115 01:11:30
2 3 6 2 0 8 0 6.67g 14.8g 1.22g 0 0.2 0 0|0 0|0 3k 12k 114 01:11:31
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time
0 12 6 2 0 9 0 6.67g 14.8g 1.22g 0 0.2 0 0|0 0|0 4k 31k 113 01:11:32
2 4 6 2 0 8 0 6.67g 14.8g 1.22g 0 0.1 0 0|0 0|0 3k 9k 111 01:11:33
2 9 6 2 0 7 0 6.67g 14.8g 1.22g 0 0.1 0 0|0 0|0 3k 21k 111 01:11:34
0 8 7 2 0 14 0 6.67g 14.8g 1.22g 0 0.2 0 0|0 0|0 4k 9k 111 01:11:35
1 4 7 2 0 11 0 6.67g 14.8g 1.22g 0 0.2 0 0|0 0|0 3k 5k 109 01:11:36
1 15 6 2 0 19 0 6.67g 14.8g 1.22g 0 0.1 0 0|0 0|0 5k 11k 111 01:11:37
2 17 6 2 0 19 1 6.67g 14.8g 1.22g 0 0.2 0 0|0 0|0 6k 189k 111 01:11:38
1 13 7 2 0 15 0 6.67g 14.8g 1.22g 0 0.2 0 0|0 1|0 5k 42k 110 01:11:39
2 7 5 2 0 77 0 6.67g 14.8g 1.22g 0 0.1 0 0|0 2|0 10k 14k 111 01:11:40
2 10 5 2 0 181 0 6.67g 14.8g 1.22g 0 0.1 0 0|0 0|0 21k 14k 112 01:11:41
insert query update delete getmore command flushes mapped vsize res faults locked % idx miss % qr|qw ar|aw netIn netOut conn time
1 11 5 2 0 12 0 6.67g 14.8g 1.22g 0 0.1 0 0|0 0|0 4k 13k 116 01:11:42
1 11 5 2 1 33 0 6.67g 14.8g 1.22g 0 0.1 0 0|0 3|0 6k 2m 119 01:11:43
0 9 5 2 0 17 0 6.67g 14.8g 1.22g 0 0.1 0 0|0 1|0 5k 42k 121 01:11:44
1 8 7 2 0 25 0 6.67g 14.8g 1.22g 0 0.2 0 0|0 0|0 6k 24k 125 01:11:45