On Tuesday 08 December 2009 04:25:07 pm Seebs wrote:
> On 2009-12-08, David Masover <ninja / slaphack.com> wrote:
> > Compare any of these to C. You probably could write a web app in C. You
> > probably could be about as efficient with it. You could be disciplined
> > enough to never do pointer arithmetic,
> 
> This is hardly necessary.  Pointer arithmetic can certainly be done safely.

Can be. However, the fact that it exists opens the door to a whole class of 
weird and hard-to-pin-down crashes (and possible vulnerabilities) that simply 
don't happen if you don't (or can't) do it.

But that wasn't my point. My point was that if you consider garbage 
collection, the absence of pointer arithmetic, and similar features to be 
selling points of higher-level languages, you can get all of that in C, and 
you can make it _almost_ automatic in C++.

> > Think about that for a moment. In languages like Ruby and PHP, a buffer
> > overflow is actually not possible. You might get it in a third-party
> > library written in another language (like C), but you can't do it
> > yourself. But in C, it's not only possible, it's a very easy mistake to
> > make, and a hard one to avoid.
> 
> I'm not sold on this.  I don't think I've had any buffer overflows in my
> code in years.  It's pretty easy -- if I'm about to use a buffer, I make
> sure I know what I'm using it for and that I cap any copies and/or report
> failure if there's not enough space.

My favorite example is here:

http://joelonsoftware.com/articles/fog0000000319.html

I like this both for the ludicrous example, when he finally decides to figure 
out how much to allocate:

char* bigString;
int i = 0;
i = strlen("John, ") 
     + strlen("Paul, ") 
     + strlen("George, ") 
     + strlen("Joel ");
bigString = (char*) malloc (i + 1); 

...and for the ludicrous inefficiency. He's going to scan through each string 
at least twice, and that's with a customized strcat -- it gets much worse with 
the real strcat.

And remember, his next step is:

char *p = bigString;
bigString[0] = '\0';
p = mystrcat(p,"John, ");
p = mystrcat(p,"Paul, ");
p = mystrcat(p,"George, ");
p = mystrcat(p,"Joel ");

It's still a bit sloppy -- that initial null assignment makes me cringe -- but 
think about this. Ignore the fact that each string literal is duplicated 
here, and let's say they're variables instead:

int i = 0;
i = strlen(a) + strlen(b) + strlen(c) + strlen(d);
bigString = (char*) malloc (i+1);

char *p = bigString;
bigString[0] = '\0';
p = mystrcat(p,a);
p = mystrcat(p,b);
p = mystrcat(p,c);
p = mystrcat(p,d);

Now suppose you add a string to that, or remove it. If you add it to one place 
and not the other, or remove it from one place and not the other, you're 
either wasting RAM or hitting a buffer overrun every time.

Then again, this kind of malloc is probably inefficient, as the article points 
out. Instead, you probably want to allocate some power of 2 -- at which point, 
you want to make sure you've always allocated a power of two that's more than 
you need, not less than you need.
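
To make the "more than you need, not less" arithmetic concrete, here is a 
minimal sketch. It's in Ruby only to match the rest of this post; 
capacity_for is a hypothetical helper of my own, not something from the 
article, and it only computes the size you would hand to malloc.

```ruby
# Hypothetical helper: the smallest power of two that can hold `length`
# bytes of text plus the trailing NUL byte a C string needs.
def capacity_for(length)
  needed = length + 1                      # +1 for the terminating "\0"
  capacity = 1
  capacity <<= 1 while capacity < needed   # double until it fits
  capacity
end

parts = ["John, ", "Paul, ", "George, ", "Joel "]
length = parts.sum(&:bytesize)             # 6 + 6 + 8 + 5 = 25
puts capacity_for(length)                  # prints 32
```

The easy mistake is the off-by-one: 31 bytes of text still fit in 32, but 32 
bytes of text need 64 once you count the terminator.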

Are you sure you never make a mistake here?

Because this is the kind of thing that I don't have to think about. Yes, it's 
less efficient, but if I have a bunch of strings in Ruby, I can just do this:

big_string = a + b + c + d

There are other, more efficient ways, like:

big_string = a.dup << b << c << d

or

big_string = "#{a}#{b}#{c}#{d}"

The point is, though, that while these have varying degrees of efficiency, 
none of them leaves room for me to forget something and open myself up to a 
vulnerability or a crash. Worst case, I waste a bit of RAM, and 100% of the 
RAM I waste here can be garbage-collected later, whereas in C, if I waste it, 
it's wasted, possibly even leaked.

So not only is it ridiculously easier, it's also safer.
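
One note on the a.dup << b form: the dup matters, because << appends in 
place. A small sketch with made-up values (I use String.new so the example 
string is mutable regardless of frozen-literal settings):

```ruby
a = String.new("John, ")   # a mutable string
b = "Paul, "

joined = a.dup << b        # appends to a copy; a is untouched
before = a.dup             # snapshot of a at this point

mutated = a << b           # appends to a itself and returns a

puts joined                # the concatenation, built on the copy
puts before                # a as it was before the in-place append
puts a                     # a after the in-place append
```

Even the mistake here (forgetting dup) gets you a surprising value, not a 
buffer overrun.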

It's also possibly faster: since it's a higher-level abstraction, the 
runtime might (in theory; I bet Ruby doesn't) notice that these are all 
strings and that you're just concatenating them, so it could use some sort of 
StringBuilder automatically.

Even if it doesn't, it still has the option of storing the length of a string 
separately, rather than using null-terminated strings -- thus saving you at 
least half your time in an operation like ("a" + "b").

Am I being unrealistic? Is this the kind of thing you'd never do?

> I agree that it requires actual effort, as opposed to being implicit.

The point here is that the implicit version also implicitly handles all the 
safety for you. Another example might be SQL manipulation. To keep myself 
sane, let's do this with Ruby:

execute "select hashed_password from users where username = '#{name}'"

The problem with that code should be blindingly obvious. Of course, I should 
probably be doing something like this:

execute "select hashed_password from users where username = '#{escape name}'"

The problem is, this requires me to always, always remember to do it. This is 
how a lot of PHP stuff is written, though I'm told it's changing, and those in 
the know use libraries that allow you to do it the Right Way. How would the 
Right Way look?

execute 'select hashed_password from users where username = ?', name

Can you see why that's safer? I can develop a much easier-to-maintain habit 
of using only single-quoted strings as my queries. Since the actual values 
are always passed separately, the library always escapes (or binds) them for 
me -- I don't have to remember anything special to make that work.

So I can develop a very, very simple habit (use single-quoted strings) that I 
can almost unconsciously apply everywhere, and I will never be subject to a 
SQL injection attack.
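
Here is a rough sketch of the difference. quote and bind are hypothetical 
stand-ins I made up for whatever the database library does behind execute; 
they are not any real driver's API. A real driver typically sends the values 
to the server separately rather than splicing text, and this naive version 
would also mangle a literal ? inside the query.

```ruby
# Hypothetical: turn a value into a SQL string literal by doubling any
# single quotes inside it.
def quote(value)
  "'" + value.to_s.gsub("'", "''") + "'"
end

# Hypothetical: replace each ? placeholder with the next quoted value.
def bind(sql, *values)
  unless sql.count("?") == values.size
    raise ArgumentError, "placeholder count does not match value count"
  end
  remaining = values.dup
  sql.gsub("?") { quote(remaining.shift) }
end

name = "x' OR '1'='1"

# Interpolated: the attacker's quotes become part of the SQL itself.
unsafe = "select hashed_password from users where username = '#{name}'"

# Placeholder: the value stays one string literal, quotes and all.
safe = bind('select hashed_password from users where username = ?', name)

puts unsafe
puts safe
```

The interpolated query ends in username = 'x' OR '1'='1', which matches 
every row. The bound one compares username against a single odd-looking 
string.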

Or I can try to develop a habit of manually escaping -- the problem is that 
sooner or later, mistakes WILL happen. Even in the best case, I develop such 
muscle memory for escaping that I end up accidentally doing this:

puts "Hello, #{escape name}!"

Then, at worst, it goes unnoticed for months until someone named O'Harris 
signs up and wonders why the system thinks their name is O''Harris or 
O\'Harris.
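
To see that in action, here is a tiny sketch. The escape below is a 
hypothetical helper that doubles single quotes, which is one common SQL 
convention; a backslash-style escape would give the O\'Harris variant 
instead.

```ruby
# Hypothetical escape helper: doubles single quotes, as you would for a
# SQL string literal.
def escape(value)
  value.gsub("'", "''")
end

name = "O'Harris"

# Escaping for SQL in a context that is not SQL at all.
greeting = "Hello, #{escape name}!"
puts greeting   # prints Hello, O''Harris!
```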

The point is that higher levels of abstraction do allow us to abstract away 
opportunities to screw things up. This is true in the language itself, and in 
the libraries.

And if I've convinced you of that, don't worry, low-level skill is still 
needed. Another of my favorite articles:

http://joelonsoftware.com/articles/LeakyAbstractions.html

It helps to understand what's going on at the C level, even if I never want to 
actually touch it, because that might give me some insight as to why

"Hello, #{name}!"

is more efficient than

'Hello, '+name+'!'

Try it yourself:

require 'benchmark'
name = 'steve'
Benchmark.bm do |x|
  x.report { 10000000.times { "Hello, #{name}!" }}
  x.report { 10000000.times { 'Hello, '+name+'!' }}
end

My results:

      user     system      total        real
  6.010000   0.020000   6.030000 (  6.104799)
  7.500000   0.010000   7.510000 (  7.505193)

It only gets better, the more interpolated values you have. a+b is more 
efficient than "#{a}#{b}", but a+b+c+d is less efficient than 
"#{a}#{b}#{c}#{d}".

This was very surprising to me. Then I went back and read that article, and 
thought a bit about the concept of a string builder. Now it makes sense, even 
though it's still a bit counterintuitive.
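
A sketch of the intuition, with made-up values. This is only how I picture 
it, not a claim about what the interpreter actually does with interpolation: 
each + hands back a brand-new string, so a+b+c+d builds throwaway 
intermediates, while a builder appends everything into one buffer.

```ruby
a, b, c, d = "John, ", "Paul, ", "George, ", "Joel "

# a + b + c + d, spelled out: every + allocates a new String.
step1 = a + b
step2 = step1 + c          # step1 is now garbage
plus_result = step2 + d    # and so is step2

# Builder style: one mutable buffer, appended to in place.
buffer = "".dup
buffer_id = buffer.object_id
[a, b, c, d].each { |s| buffer << s }

puts plus_result == buffer               # same text either way
puts step1.equal?(step2)                 # different objects
puts buffer.object_id == buffer_id       # still the same buffer
```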

So I'm glad I sort of know C, and I'm just as glad I don't have to use it 
much.

> The killer for me was
>  discovering that there was a thing like a function pointer which could be
>  used only for user-defined functions, not built-in functions.

I could live with that, but I'm guessing it might've been the last straw...

For me, I'm spoiled by blocks now. I can fake them in JavaScript, and even 
(though less effectively) in Java, but not in PHP, that I know of.