December 8th, 2010 · 4 Comments
I had a very pleasant experience recently contributing some net/http documentation improvements to Ruby. Ruby Core has made contributing very easy these days; it’s really as simple as:
- Fork http://github.com/ruby/ruby
- Make and commit your changes in your repo
- Send a pull request
In other words, contributing to Ruby is now just as easy as contributing to any other GitHub project! This is a huge deal. Obviously there are caveats: if you are making significant changes to runtime code, you should probably enter an issue in Redmine and/or send an email to the ruby-dev mailing list. But for small cleanups and RDoc improvements that have been sorely missing for years, there’s no longer any excuse!
Tags: Ruby
Eric Hodel disagreed with my recent point. I didn’t present much of an argument because I didn’t think there was much disagreement with my point. Let me elaborate.
1) It’s full of unnecessary libraries that should be separately distributed
The Ruby community, at least in the US, has proven to be very open to change; see Rails 1, 2 and 3 for example. Sometimes you have to break compatibility to advance the state of the art. Continuing to bundle inferior implementations means that we have this compatibility albatross around our neck while stronger third-party libraries don’t pop to the top of the ecosystem. Some random dude said it better than me. Some other dude agreed: REXML in this case should be unbundled.
I was proposing unbundling DRb and Tk not because their codebase sucks but simply because they aren’t used by the vast majority of the Ruby community. treetop and shoes are very useful libraries also but they don’t belong in stdlib either.
I’m not proposing we do this in the next 1.9.2 patch, but for 2.0, sure. Now that rubygems is in core (thanks Eric!), I think we should be aggressive in unbundling everything that can be unbundled.
2) It’s (still) too hard to contribute
The Net::HTTP docs haven’t been touched AFAICT in 5 years. They’re full of broken English and rudimentary examples. I hear people complain about it all the time, so why hasn’t anyone fixed it? Maybe everyone is lazy, or maybe the contribution process remains too difficult for most.
Net::HTTP had performance issues in the 1.8.6 era; it sounds like they have been fixed. The API is still poor and poorly documented IMO. Just about every HTTP library has a better API, e.g. Typhoeus, httparty and rest-client. As someone else pointed out, Ruby really does need an HTTP API in core though, if only for Rubygems to use.
I would love to see ruby-core treat git as a first class citizen and allow pull requests and git email patches against http://github.com/ruby/ruby. If this is the case today, please let me know. I will be the first to submit a pull request for better net/http docs.
Tags: Ruby
November 22nd, 2010 · 5 Comments
Want to wreck your afternoon? Just have a poorly configured WordPress install linked from Hacker News. Here’s the postmortem.
In my case, my slice was freezing. I didn’t know what the problem was until I ran top and saw this. Yikes.
The problem was that Apache is configured by default to allow up to 150 processes. Each process took 5-10MB of real memory, so my slice’s 512MB was quickly overwhelmed. But why was it creating 150 processes in the first place? Shouldn’t WP-SuperCache respond very quickly, such that each process can serve many requests per second? Yes, but…
Keep-Alives
Keep-Alives try to help client performance. This is a performance tweak that will kill you. By default, Apache is configured to hold the process locked for a given socket for 15 seconds (!!?) in case that socket makes another request. That’s a terrible, terrible default: you should never lock resources waiting for human input. So in 15 seconds, Hacker News delivered me 50-100 requests. These requests all generated their own process, quickly overwhelming my RAM and swap and effectively freezing my slice.
I lowered the maximum number of processes (MaxClients) to 20 and the keep-alive timeout from 15 to 2 seconds. Before the change I was seeing load averages in the 100s; since the reconfiguration, my slice’s load average has been under 1 all afternoon. Here’s the config I changed:
#
# KeepAliveTimeout: Number of seconds to wait for the next request from the
# same client on the same connection.
#
KeepAliveTimeout 2
##
## Server-Pool Size Regulation (MPM specific)
##
# prefork MPM
# StartServers: number of server processes to start
# MinSpareServers: minimum number of server processes which are kept spare
# MaxSpareServers: maximum number of server processes which are kept spare
# MaxClients: maximum number of server processes allowed to start
<IfModule mpm_prefork_module>
StartServers 5
MinSpareServers 5
MaxSpareServers 10
MaxClients 20
</IfModule>
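To see why 150 processes were fatal and 20 are not, here is a rough back-of-the-envelope sketch in Ruby. It uses only the figures quoted above (512MB slice, 5-10MB per Apache process); the helper name is made up for illustration, and the estimate ignores memory used by MySQL and the rest of the system.

```ruby
# Figures taken from the post: a 512MB slice, 5-10MB of real memory per process.
SLICE_RAM_MB = 512
PER_PROCESS_MB = (5..10)

# Hypothetical helper: returns [best_case, worst_case] memory use in MB
# if every allowed Apache process is running at once.
def apache_memory_range_mb(max_clients, per_process_mb = PER_PROCESS_MB)
  [max_clients * per_process_mb.first, max_clients * per_process_mb.last]
end

[150, 20].each do |max_clients|
  low, high = apache_memory_range_mb(max_clients)
  verdict = high <= SLICE_RAM_MB ? "fits in RAM" : "can exceed RAM"
  puts "MaxClients #{max_clients}: #{low}-#{high}MB (#{verdict})"
end
```

With the default of 150, even the best case (750MB) is larger than the slice, so swapping is guaranteed once traffic fills the pool. With 20, the worst case is 200MB, which leaves headroom for everything else on the box.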
Tags: Software
Much of Ruby’s standard library (the set of classes shipped with the Ruby VM itself) is old and crufty. For laughs, go look at the code for some of the classes that you’ve never used. Chances are it’s from 2000-2003 and doesn’t even look like idiomatic Ruby. I’m wondering what classes should be removed from the standard library or deprecated so that higher quality replacements can take their place.
The canonical example is Ruby’s net/http library. Its performance and API are just terrible. (Side note: how do you know if an API is terrible? If you have to consult the docs even after having used the API for the past 5 years.) But because it’s in the standard library, most people use it as the base for higher-level API abstractions (e.g. httparty, rest-client).
So looking at Ruby’s core RDoc, my suggested list for removal (where removal means move to a rubygem):
- Net::*
- DRb
- REXML
- RSS
- Rinda
- WEBrick
- XML
Any others I missed? Will Ruby 1.9.3 or 2.0 get a good spring cleaning or will we have to live with these classes forever?
Tags: Ruby
I’ve been working on a complex telecom system recently with a codebase that is hard to trace and learn. Given several tickets to fix, my morale flagged a bit as I waded through code last week. Then I remembered an easy morale booster for me: close at least one ticket a day.
As an engineer, it makes me feel good to know my efforts are improving the system. A complex ticket can take days to reproduce and fix, often with little noticeable payoff in the end. So I grabbed two lower priority issues and fixed them: both were UI cleanups that led to a nicer user experience. I left the office that day with a spring in my step and a smile on my face, ready to tackle the complex ticket again the next morning.
Tags: Software
September 19th, 2010 · 1 Comment
I did an interesting experiment to compare memcache-client and Dalli performance this morning. I wanted to understand which library allocated more objects in order to know which library would have more GC overhead. Ruby 1.9 has a new module GC::Profiler which will generate a report with stats about each GC run. Since both gems have an identical benchmark suite, I ran the GC Profiler on the benchmark suite for each:
| | Runs | GC Time | Total Time |
|---|---|---|---|
| memcache-client | 596 | 3.40 | 18.35 |
| dalli | 153 | 1.73 | 15.29 |
memcache-client runs the GC about 4x as often as Dalli, and roughly half of Dalli’s speed improvement over memcache-client is due to more efficient object allocation requiring less garbage collection. Note that each of Dalli’s GC runs seems to take about twice as long as a memcache-client run. Anyone know Ruby 1.9’s GC implementation and have an idea why this might be?
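For anyone who wants to try a similar measurement, here is a minimal sketch of wrapping a workload in GC::Profiler. It is not the benchmark suite I ran: the `profile_gc` helper and the string-allocating workload are stand-ins invented for illustration, and GC.count is used as an approximation of the number of GC runs.

```ruby
# Hypothetical workload: allocate many short-lived strings to create garbage.
def allocate_garbage(iterations)
  iterations.times { "x" * 1_000 }
end

# Hypothetical helper: runs the block with GC::Profiler enabled and returns
# the number of GC runs, the time spent in GC, and the total wall-clock time.
def profile_gc
  GC::Profiler.clear
  GC::Profiler.enable
  runs_before = GC.count
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  total_time = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  {
    runs: GC.count - runs_before,
    gc_time: GC::Profiler.total_time, # seconds spent in GC while enabled
    total_time: total_time
  }
ensure
  GC::Profiler.disable
end

STATS = profile_gc { allocate_garbage(200_000) }
puts format("runs=%d gc_time=%.3fs total_time=%.3fs",
            STATS[:runs], STATS[:gc_time], STATS[:total_time])
```

Calling `GC::Profiler.report` before disabling the profiler prints a per-run breakdown, which is the place to look when individual runs seem unusually slow.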
Tags: Ruby
Dalli is my brand new memcached client for Ruby. I’ve maintained Ruby’s memcache-client for two years now and been dissatisfied with the codebase for a while.

Coincidentally, NorthScale approached me recently about building a pure Ruby memcached client which used the new binary protocol defined in memcached 1.4. We worked out an arrangement to sponsor the OSS project which became Dalli.
My goals for Dalli were threefold:
- Clean sheet codebase using the binary protocol
- Drop-in replacement for memcache-client in Rails for a very simple upgrade path for Rails developers
- Equivalent or faster performance than memcache-client
I’m happy to say that Dalli meets all those goals. For one, the Dalli core is almost half the size of the memcache-client core, 700 vs 1250 LOC! But wait, there’s more! Using Rails 3? Dalli drops right in! Using Heroku? Dalli works without any additional configuration! Take a look at the README for more details.
Please file an issue if you find a bug or have a feature you’d like to see. In the meantime, happy caching!
Tags: Ruby
The recent memcached security exposé highlighted the fact that simple vulnerabilities require constant vigilance and education for new developers.
Rule #1 of Network Security: Don’t expose services which are not designed to be exposed.
Web and app servers will usually have 2-3 ports open to the public: ssh, http and https. All others should be vetted to determine if they should be public or not. Here’s the current state of mikeperham.com:
mike@perham:~$ netstat -a | grep LIST
tcp 0 0 localhost:mysql *:* LISTEN
tcp 0 0 *:www *:* LISTEN
tcp 0 0 *:ssh *:* LISTEN
tcp 0 0 localhost:smtp *:* LISTEN
There are two types of entries in this list. ‘localhost’ means that my database is just listening locally:
localhost:mysql
whereas the star indicates my web server is listening on all network interfaces, including the public:
*:www
In the case of memcached, you want to configure it to listen locally only if you just have a single memcached instance. In Ubuntu/Debian, you would edit /etc/memcached.conf and ensure that:
-l 127.0.0.1
is in the file. Otherwise memcached will by default listen on all interfaces and be exposed publicly.
Firewall configuration brings another dimension of variability into the mix, but I prefer to configure my services to listen correctly first and then determine any additional firewall rules necessary based on the network topology. Using memcached servers on multiple machines might require some fancy firewall rules to ensure that they can talk to each other while not being exposed publicly. One nice thing about Amazon’s EC2 service is that it forces you to explicitly open ports to the public via firewall rules; everything else is internal by default.
In summary, I always perform a quick port audit of all machines after I’m done configuring them to ensure that they are as secure as possible before putting them in production. A quick netstat command can go a long way to ensure a sound night’s sleep.
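The audit is easy to script. Here is a small Ruby sketch that takes netstat output in the format shown above and lists the services listening on something other than the loopback interface. The method name is invented for illustration, and it only recognizes `localhost` and `127.0.0.1` as local, so treat it as a starting point rather than a complete check.

```ruby
# Sample data: the listening sockets from the netstat output shown above.
NETSTAT_OUTPUT = <<~EOM
  tcp 0 0 localhost:mysql *:* LISTEN
  tcp 0 0 *:www *:* LISTEN
  tcp 0 0 *:ssh *:* LISTEN
  tcp 0 0 localhost:smtp *:* LISTEN
EOM

# Assumption: only these prefixes count as "listening locally".
LOCAL_PREFIXES = ["localhost:", "127.0.0.1:"].freeze

# Hypothetical helper: returns the local addresses of listening sockets
# that are reachable from outside the machine.
def public_listeners(netstat_output)
  netstat_output.each_line
                .map(&:split)
                .select { |fields| fields.last == "LISTEN" }
                .map { |fields| fields[3] } # fourth column is the local address
                .reject { |addr| addr.start_with?(*LOCAL_PREFIXES) }
end

puts public_listeners(NETSTAT_OUTPUT)
```

For this machine it prints `*:www` and `*:ssh`, which are exactly the two services meant to be public. Anything else showing up in that list deserves a second look.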
Tags: Software
It’s safe to say that RVM and Bundler have completely changed how I interact with my Ruby applications and gems. It’s pretty well understood how to use each by itself, but I didn’t have a good idea how to use them in tandem until recently. Parts of this post are based on Derek Kastner’s great post on using Bundler.
When I grab the source for a random rubygem from github and want to run its tests or test drive it, I use RVM and Bundler to create a sandbox so I don’t pollute the gems used by other Ruby projects on my box:
rvm use 1.9.2@<gemname> --create
gem install bundler --pre
# Would love to see this cleaned up for bundler 1.0
# e.g. bundle install --from-gemspec
cat > Gemfile <<EOM
source 'http://rubygems.org'
gemspec :path => '.'
EOM
bundle install
rake
The only trick here is using Bundler’s support for gemspecs to avoid the need for a separate (and redundant) Gemfile. But Andre Arko suggests that we prefer Bundler to Jeweler and I agree with him. Jeweler should check for an existing Gemfile and defer to it for dependencies when generating the gemspec:
require 'bundler'
Gem::Specification.new do |s|
  s.add_bundler_dependencies
end
This means that gems should check their gemspec into git, and jeweler (or whatever you use to generate your gemspec) should be declared as a development dependency. Do your gems pass this simple test? Any thoughts on how to make this even simpler?
Tags: Ruby
Recently I was given a ticket to implement a “near-duplicate” image detector. Look at these three images:
The original image files have different bytesizes and different sizes but they show essentially the same thing. This is what we call a “near-duplicate” and the problem was that when displaying an automatically generated image gallery for a given subject, we were sometimes showing duplicate images due to slight differences in the images.
Obviously we can’t use something like an MD5 or SHA1 fingerprint: we have to create a fingerprint based on the content of the image, not the exact bytes. This is what the pHash library does. A “perceptual hash” is a 64-bit value based on the discrete cosine transform of the image’s frequency spectrum data. Similar images will have hashes that are close in terms of Hamming distance, which is the number of bit positions in which two values differ. For example, a binary hash value of 1000 is closer to 0000 than 0011 is, because 1000 differs in only one bit whereas 0011 differs in two. The duplicate threshold defines how many bits must be different between two hashes for the two associated images to be considered different images. Our testing showed that 15 bits is a good value to start with: it detected all duplicates with a minimum of false positives.
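To make the Hamming distance comparison concrete, here is a plain Ruby sketch that works on integer hashes. It does not use pHash or Phashion; the method names are invented for illustration, and treating a distance strictly below the threshold as a duplicate is my reading of the description above, not a statement of how the library draws the line.

```ruby
# Threshold from the post: 15 differing bits as a starting point.
DUPLICATE_THRESHOLD = 15

# Number of bit positions in which two non-negative integers differ.
def hamming_distance(a, b)
  (a ^ b).to_s(2).count("1")
end

# Assumption: hashes that differ in fewer bits than the threshold
# are treated as near-duplicates.
def near_duplicate?(hash_a, hash_b, threshold = DUPLICATE_THRESHOLD)
  hamming_distance(hash_a, hash_b) < threshold
end

puts hamming_distance(0b1000, 0b0000) # one bit differs
puts hamming_distance(0b0011, 0b0000) # two bits differ
puts near_duplicate?(0b1000, 0b0000)
```

XOR leaves a 1 in every position where the two values disagree, so counting the 1s in the result gives the distance directly.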
Phashion is my new Ruby wrapper for the pHash library and wraps just enough of the pHash API to implement the described functionality. Here’s the test in the test suite which verifies that Phashion considers the images to be duplicates:
def assert_duplicate(a, b)
  assert a.duplicate?(b), "#{a.filename} not dupe of #{b.filename}"
end

def test_duplicate_detection
  files = %w(86x86-0a1e.jpeg 86x86-83d6.jpeg 86x86-a855.jpeg)
  images = files.map {|f| Phashion::Image.new("#{File.dirname(__FILE__) + '/../test/'}#{f}")}
  assert_duplicate images[0], images[1]
  assert_duplicate images[1], images[2]
  assert_duplicate images[0], images[2]
end
pHash does have much more functionality, including video and audio support and persistent MVP tree support for similarity searches based on previously processed files, but I have not wrapped any of those APIs. Try it out and let me know what you think!
Tags: Ruby · Software