pts.blog: 2009-11

2009-11-17

How fast does 8g in Google Go compile?

To get an idea how fast Google Go code can be compiled, I've compiled a big chunk of the Google Go library code with the default 8g compiler. Here are the results:

90 8g compiler invocations
no linking
21MB .8 object data generated
89540 lines of .go source from goroot/src/pkg
345305 words of .go source
2.22MB of .go source
17908 lines per second compilation speed
3676757 .8 bytes per second compilation speed
389473 source bytes per second compilation speed
average 5.7s real time
average 4.8s user time
average 0.9s system time
3G RAM and Intel Core2 Duo CPU T5250 @ 1.50GHz

As a quick comparison, I've compiled Ruby 1.8.6 with gcc:

38 invocations of gcc -s -O2 (GCC 4.2.4)
38 .c files of Ruby 1.8.6 source code (without extensions)
15 .h files
1044075 bytes generated in .o files
92543 source lines
283554 source words
2.15MB of source code
62190 source bytes per second compilation speed
27916 .o bytes per second compilation speed
37.4s real time
35.9s user time
1.0s system time
0.4626 .o bytes generated per source byte

What I see here is that 8g is 131 times faster than GCC if we count object bytes per second, and it is 6.26 times faster if we count source bytes per second. But we cannot learn much from these ratios, because not the same code was compiled. What is obvious is that Go object files are much larger than C object files (relative to their source files) in this experiment.

How to write a program with multiple packages in Google Go

This blog post explains how to write a Google Go program which contains multiple packages. Let's suppose your software contains multiple packages: a and b, a consisting of two source files: a1.go and a2.go, and b contains only 1 source file importing package a. Here is an example:

$ cat a1.go
package a
type Rectangle struct {
  width int;
  Height int;
}

$ cat a2.go
package a
func (r *Rectangle) GetArea() int { return r.width * r.Height }
func (r *Rectangle) Populate() {
  r.width = 2;
  r.Height = 3;
}

$ cat b.go
package main
import (
  "a";
)
func main() {
  r := new(a.Rectangle);
  r.Populate();
  // Fields (etc.) starting with a small letter are undefined.
  // (private): b.go:9: r.width undefined (type a.Rectangle has no field width)
  // print(r.width, "\n");
  print(r.GetArea(), "\n");
  print(r.Height, "\n");
  print(r, "\n");
  print("done\n")
}

$ cat abc.sh
#! /bin/bash --
set -ex
rm -f *.6 6.out
6g -o a.6 a1.go a2.go
6g -I. b.go
6l b.6  # links a.6 as well

$ ./abc.sh
...
$ ./6.out
6
3
0x7f18309691a8
done

Notes:

Symbols and structure fields starting with lower case are not visible in other packages.
You have to specify -I. for the compiler so it will find a.6 in the current directory.
The linker figures out that it has to load prerequisite a.6 when linking b.6.

2009-11-15

FUSE protocol tutorial for Linux 2.6

Introduction and copyright

This is a tutorial on writing FUSE (Filesystem in UserSpacE for Linux and other systems) servers speaking the FUSE protocol and the wire format, as used on the /dev/fuse device for communication between the Linux 2.6 kernel (FUSE client) and the userspace filesystem implementation (FUSE server). This tutorial is useful for writing a FUSE server witout using libfuse. This tutorial is not a reference: it doesn't contain everything, but you should be able to figure out the rest yourself.

This document has been written by Péter Szabó on 2009-11-15. It can be used and redistributed under a Creative Commons Attribution-Share Alike 2.5 Switzerland License.

This tutorial was based on versions 7.5 and 7.8 of the FUSE protocol (FUSE_KERNEL_VERSION . FUSE_KERNEL_MINOR_VERSION), but it should work with newer versions in the 7 major series.

This tutorial doesn't give all the details. For that see the sample Python source code for a FUSE server.

Requirements

Linux 2.6 (even though FUSE has been ported to many operating systems, this tutorial focuses on the default Linux implementatation);
the fuse kernel module 7.5 or later compiled and loaded;
you being familiar with writing a FUSE server (with libfuse or one of the Perl, Python, Ruby or other scripting language bindings);
support for running external programs (like with system(3)) in the programming language;
support for creating socketpairs (socketpair(2)) in the programming language;
support for receiving filehandles (with recvmsg(2)) in the programming language (this is tricky, see below).

Overview of the operation of a FUSE server

This overview assumes that the FUSE server is single-threaded.

Fetch the mount point and the mount options from the command-line.
Optionally, create the directory $MOUNT_POINT (libfuse doesn't do this).
Optionally, do a fusermount -u $MOUNT_POINT to clean up an existing or stale FUSE filesystem in the mount point directory. (libfuse doesn't do this).
Run fusermount(1) to mount a filesystem.
Receive the /dev/fuse filehandle from fusermount(1).
Receive, process and rely o the FUSE_INIT message.
In an infinite loop:
1. Receive a message (with a buffer or ≤ 8192 bytes, recommended 65536 + 100 bytes) from the FUSE client on the file descriptor.
2. If ENODEV is the result, or FUSE_DESTROY is received, break from the loop.
3. Process the message.
4. Send the reply to on the file descriptor, except for FUSE_FORGET.
Clean up so your backend filesystem remains in consistent state.
Exit from the FUSE server process.

fusermount(1), when called above does the following:

Uses its setuid bit to run as root.
Opens the character device /dev/fuse.
Mounts the filesystem with the mount(2) system call, passing the file descripto of /dev/fuse to it.

Steps of running fusermount(1) and obtaining the /dev/fuse file descriptor:

Create a socketpair (AF_UNIX, SOCK_STREAM) with fd0 and fd1 as file descriptors.
Run (with system(3)): export _FUSE_COMMFD=$FD0; fusermount -o $OPTS $MOUNT_POINT . Example: export _FUSE_COMMFD=3; fusermount -o ro /tmp/foo .
Receive the /dev/fuse file descriptor from fd1. This is tricky. See receive_fd function in the sample receive_fd.c for this. The sample fuse0.py contains a Python implementation (using the ctypes or the dl module to call C code).
Close fd0 and fd1.

Wire format and communication

Once you have received the /dev/fuse file descriptor, do a read(dev_fuse_fd, bug, 8192) on it to read the FUSE_INIT message, and you have to send your reply. After that, you should be reading more messages, and reply to all of them in sequence (except for FUSE_DESTROY and FUSE_FORGET messages, which don't require a reply). All input (FUSE client → server) message types share the same, fixed-length header format, but the message may contain optional, possible variable-length parts as well, depending the message type (opcode). Nevertheless, the whole message must be read in a single read(3), so you have to preallocate a buffer for that (at least 8192 bytes, may be larger based on FUSE_INIT negotiation, preallocate 65536 + 100 bytes to be safe). All integers in messages are unsigned (except for the negative of errno).

The input message header is:

uint32 size; size of the message in bytes, including the header;
uint32 opcode; one of the FUSE_* constants describing the message type and the interpretation of the rest of the header;
uint64 unique; unique identifier of the message, must be repeated in the reply;
uint64 nodeid; nodeid (describing a file or directory) this message applies to (can be FUSE_ROOT_ID == 1, or a larger number, what you have returned in a previous FUSE_LOOKUP repy);
uint32 uid; the fsuid (user ID) of the process initiating the operation (use this for access control checks if needed);
uint32 gid; the fsgid (group ID) of the process initiating the operation (use this for access control checks if needed);
uint32 gid; the PID (process ID) of the process initiating the operation;
uint32 padding; zeroes to pad up to 64-bits.

The interpretation of the rest of the input message depends on the opcode. The most common input message types are:

FUSE_LOOKUP = 1: input is a '\0'-terminated filename without slashes (relative to nodeid), output is struct fuse_entry_out;
FUSE_FORGET = 2: input is a struct fuse_forget_in, there is no output message;
FUSE_GETATTR = 3: input is empty, output is struct fuse_attr_out;
FUSE_OPEN = 14: input is struct fuse_open_in, output is struct fuse_open_out;
FUSE_READ = 15: input is struct fuse_read_in, output is the byte sequence read;
FUSE_RELEASE = 18: input is struct fuse_release_in, output is empty;
FUSE_INIT = 26: input is struct fuse_init_in, output is struct fuse_init_out;
FUSE_OPENDIR = 27: input is struct fuse_open_in, output is struct fuse_open_out;
FUSE_READDIR = 28: input is struct fuse_read_in, output is the byte sequence read (serialized as FUSE-specific dirents);
FUSE_RELEASEDIR = 29: input is struct fuse_release_in, output is empty;
FUSE_DESTROY = 38: input is empty; there is no output message.

For a read-only filesystem with some files and directories, it is enough to implement only the opcodes above. See more opcodes and their coressponding C structs in the table. The linked document contains more details about some of the message fields. The complete up-to-date opcodes and message structs can be found in fuse_kernel.h. Each reply output message (FUSE server → client) starts with this header:

uint32 size; size of the message in bytes, including the header;
int32 error; zero for successful completion, a negative errno value (such as -EIO or -ENOENT) on failure; upon failure, only the reply header is sent;
uint64 unique; unique identifier copied from the input message;

Please note that you have to write the whole reply at once (one write(2) call). Using any kind of buffered IO (such as stdio.h or C++ streams) can lead to problems, so don't do that.

Feel free to experiment: whatever junk you write as a reply, it won't make the kernel crash, but you'll get an EINVAL errno for the write(2) call.

Your FUSE server doesn't have to implement all possible operations (opcodes). By default, you can just return ENOSYS as errno for any operation (except for FUSE_INIT, FUSE_DESTROY and FUSE_FORGET) you don't want to implement.

Common errno values the FUSE server can return:

ENOSYS: The operation (opcode) is not implemented.
EIO: Generic I/O error, if other errno values are not appropriate.
EACCES: Permission denied.
EPERM: Operation not permitted. Most of the time you need EACCES instead.
ENOENT: No such file or directory.
ENOTDIR: Not a directoy. Return it if a directory operation was attempted on a nodeid which is not a directory.

The format of struct fuse_init_in used in FUSE_INIT:

uint32 init_major; the FUSE_KERNEL_VERSION in the kernel; must be exactly the same your code supports;
uint32 init_minor; the FUSE_KERNEL_MINOR_VERSION in the kernel; must be at least what your code supports;
uint32 init_readahead; ??;
uint32 init_flags; ??;

The format of struct fuse_init_out reply used in FUSE_INIT:

uint32 major; the same as FUSE_KERNEL_VERSION in the input;
uint32 minor; at most FUSE_KERNEL_MINOR_VERSION (init_minor) in the input, feel free to set it to less if you don't support the newest version;
uint32 max_readahead; ?? set it to 65536;
uint32 flags; ?? set it to 0;
uint32 unused; set it to 0;
uint32 max_write; ?? set it to 65536;

You have to implement FUSE_GETATTR to make the user able to do an ls -l (or stat(2)) on the mount point. It will be caled with nodeid FUSE_ROOT_ID (== 1) for the mount point.

The format of struct fuse_attr_out reply used in FUSE_GETATTR:

uint64 attr_value; number of seconds the kernel is allowed to cache the attributes returned, without issuing a FUSE_GETATTR call again; a zero value is OK; for non-networking filesystems you can set a very high value, since nobody else would change the attributes anyway;
uint32 attr_value_ns; number of nanoseconds to add to attr_value;
uint32 padding; to 64 bits;
struct fuse_attr attr; node attributes (permissions, owners etc.).

The format of struct fuse_attr reply used in FUSE_GETATTR and FUSE_LOOKUP:

uint64 ino; inode number copied to st_ino; can be any positive integer, the kernel doesn't depend on its uniqueness; it has no releation to nodeids used in FUSE (except for the name);
uint64 size; file size in bytes (or 0 for devices); make sure you set it correctly, because the kernel would truncate rads at this size even if your FUSE_READ returns more; be aware of the size being cached (using attr_value);
uint64 blocks; number of 512-byte blocks occupied on disk; you can safely set it to zero or any arbitrary value;
uint64 atime; the last access (read) time, in seconds since the Unix epoch;
uint64 mtime; the last content modification (write) time, in seconds since the Unix epoch;
uint64 ctime; the last attribute (inode) change time, in seconds since the Unix epoch;
uint32 atime_ns; nanoseconds part of atime;
uint32 mtime_ns; nanoseconds part of mtime;
uint32 ctime_ns; nanoseconds part of ctime;
uint32 more; file type and permissions; example file: S_IFREG | 0644; example directory: S_IFDIR | 0755;
uint32 nlink; total number of hard links; set it to 1 for both files and directories by default; for directories, you can speed up some listing operations (such as find(1)) by setting it to 2 + the number of subdirectories;
uint32 uid; user ID of the owner
uint32 gid; group ID of the owner
uint32 rdev; device major and minor number for device for character devices (mode & S_IFCHR) and block devices (mode & S_IFBLK).

Nodeid and generation number rules

In FUSE_LOOKUP you should return entry_nodeid and generation numbers. If I undestand correctly, the following rules hold:

When a (nodeid, name) pair selected which you have never returned before, you can return any entry nodeid and generation number (except for those which are in use, see below). These two numbers uniquely identify the node for the kernel.
When called again for the same (nodeid, name) pair, you must return the same entry_nodeid and generation numbers. (So you must remember what numbers you have returned previously).
You should count the number of FUSE_LOOKUP requests on the same (nodeid, name). When you receive a FUSE_FORGET request for the specified entry nodeid, you must decrement the counter by the nlookups field of the FUSE_FORGET request. Once the counter is 0, you may safely forget about the entry nodeid (so it no longer considered to be in use), and next time you may return the same or a different nodeid at your choice for the same (nodeid, name) -- but with an increased generation number.
You must never return the same nodeid with the same generation number again for a different inode, even after FUSE_FORGET dropped the reference counter to 0. That is: nodeids that have been released by the kernel may be recycled with a different generation number (but not with the same one!).

How to list the entries in a directory

TODO

How to read the contents of a file

TODO

Other TODO

This tutorial is work in progress. In the meantime, please see the sample Python source code for a FUSE server.

Open questions

How does nodeid and generation allocation and deallocation work?
How to run a multithreaded FUSE server?

2009-11-12

Buffered IO speed test of Google Go

This blog post presents the buffered IO speed test I've done with the standard implementation of the programming language Google Go, as compared to C on Linux amd64. The program I've run just copies everything from standard input to standard output, character by character using the default buffered IO. The conclusion: C (with the stdio in glibc) is about 2.68 times faster than Google Go (with its bufio).

$ cat cat.c
/* by pts@fazekas.hu at Thu Nov 12 12:04:49 CET 2009 */
#include 

int main(int argc, char **argv) {
  int c;
  while ((c = getchar()) >= 0 && putchar(c) >= 0) {}
  fflush(stdout);
  if (ferror(stdin)) {
    perror("error reading stdin");
    return 1;
  }
  if (ferror(stdout)) {
    perror("error reading stdin");
    return 1;
  }
  return 0;
}

$ cat cat.go
// by pts@fazekas.hu at Thu Nov 12 11:56:06 CET 2009
package main

import (
  "bufio";
  "os";
  "syscall";
  "unsafe";
)

func Isatty(f *os.File) bool {
  const TCGETS = 0x5401;  // Linux-specific
  var b [256]byte;
  _, _, e1 := syscall.Syscall(
      syscall.SYS_IOCTL, uintptr(f.Fd()), uintptr(TCGETS),
      uintptr(unsafe.Pointer(&b)));
  return e1 == 0;
}

func main() {
  r := bufio.NewReader(os.Stdin);
  w := bufio.NewWriter(os.Stdout);
  defer w.Flush();
  wisatty := Isatty(os.Stdout);
  for {
    c, err := r.ReadByte();
    if err == os.EOF { break }
    if err != nil { panic("error reading stdin: " + err.String()) }
    err = w.WriteByte(c);
    if err != nil { panic("error writing stdout: " + err.String()) }
    if (wisatty && c == '\n') {
      err = w.Flush();
      if err != nil { panic("error writing stdout: " + err.String()) }
    }
  }
}

Compilation and preparation:

$ gcc -s -O2 cat.c  # create a.out
$ ls -l a.out
-rwxr-xr-x 1 pts pts 4968 Nov 12 11:59 a.out
$ 6g cat.go && 6l cat.6  # create 6.out
$ ls -l 6.out
-rwxr-xr-x 1 pts pts 325257 Nov 12 11:55 6.out
$ dd if=/dev/urandom of=/tmp/data bs=1M count=256
$ time ./6.out /dev/null  # median of multiple runs
./6.out < /tmp/data > /dev/null  15.31s user 0.16s system 99% cpu 15.497 total
$ time ./6.out /dev/null  # median of multiple runs
./a.out < /tmp/data > /dev/null  5.92s user 0.20s system 99% cpu 6.153 total

Update: My naive implementation of zcat (gzip decompressor) in Google Go is only about 2.42 times slower than the C implementation (gcc -O3), and the C implementation is about 5.2 times slower than zcat(1).

2009-11-10

How to read a whole file to String in Java

Reading a whole file to a String in Java is tricky, one has to pay attention to many aspects:

Read with the proper character set (encoding).
Don't ignore the newline at the end of the file.
Don't waste CPU and memory by adding String objects in a loop (use a StringBuffer or an ArrayList<String> instead).
Don't waste memory (by line-buffering or double-buffering).

See my solution at http://stackoverflow.com/questions/1656797/how-to-read-a-file-into-string-in-java/1708115#1708115

For your convenience, here it is my code:

// charsetName can be null to use the default charset.    
public static String readFileAsString(String fileName, String charsetName)    
    throws java.io.IOException {    
  java.io.InputStream is = new java.io.FileInputStream(fileName);    
  try {    
    final int bufsize = 4096;    
    int available = is.available();    
    byte data[] = new byte[available < bufsize ? bufsize : available];    
    int used = 0;    
    while (true) {    
      if (data.length - used < bufsize) {    
        byte newData[] = new byte[data.length << 1];    
        System.arraycopy(data, 0, newData, 0, used);    
        data = newData;    
      }    
      int got = is.read(data, used, data.length - used);    
      if (got <= 0) break;    
      used += got;    
    }    
    return charsetName != null ? new String(data, 0, used, charsetName)    
                               : new String(data, 0, used);    
  } finally {
    is.close();  
  }
}

2009-11-08

How to convert a Unix timestamp to a civil date

Here is how to convert a Unix timestamp to a Gregorian (western) civil date and back in Ruby:

# Convert a Unix timestamp to a civil date ([year, month, day, hour, minute,
# second]) in GMT.
def timestamp_to_gmt_civil(ts)
  t = Time.at(ts).utc
  [t.year, t.month, t.mday, t.hour, t.min, t.sec]
end

# Convert a civil date in GMT to a Unix timestamp.
def gmt_civil_to_timestamp(y, m, d, h, mi, s)
  Time.utc(y, m, d, h, mi, s).to_i
end

Here is a pure algorithmic solution, which doesn't use the Time built-in class:

# Convert a Unix timestamp to a civil date ([year, month, day, hour, minute,
# second]) in GMT.
def timestamp_to_gmt_civil(ts)
  s = ts%86400
  ts /= 86400
  h = s/3600
  m = s/60%60
  s = s%60
  x = (ts*4+102032)/146097+15
  b = ts+2442113+x-(x/4)
  c = (b*20-2442)/7305
  d = b-365*c-c/4
  e = d*1000/30601
  f = d-e*30-e*601/1000
  (e < 14 ? [c-4716,e-1,f,h,m,s] : [c-4715,e-13,f,h,m,s])
end

# Convert a civil date in GMT to a Unix timestamp.
def gmt_civil_to_timestamp(y, m, d, h, mi, s)
  if m <= 2; y -= 1; m += 12; end
  (365*y + y/4 - y/100 + y/400 + 3*(m+1)/5 + 30*m + d - 719561) *
      86400 + 3600 * h + 60 * mi + s
end
t = Time.now.to_i
fail if t != gmt_civil_to_timestamp(*timestamp_to_gmt_civil(t))

Please note that most programming languages have a built-in function named gmtime for doing timestamp_to_gmt_civil's work. Please note that Python has calendar.timegm for doing gmt_civil_to_timestamp's work. Please note that most programming languages have a built-in function named mktime for converting a local time to a timestamp (compare with gmt_civil_to_timestamp, which accepts GMT time as input).

Please be aware of the rounding direction of division for negative numbers when porting the Ruby code above to other programming languages. For example, Ruby has 7 / (-4) == -2, but many other programming languages have 7 / (-4) == -1.

The algorithmic solution above is part of the programming folklore. See more examples at http://www.google.com/codesearch?q=2442+%2F\+*7305

2009-11-07

AES encpytion in Python using C extensions

This blog post gives example code how to do AES encryption and decryption in Python. The crypto implementations shown are written in C, but they have a Python interface. For pure Python implementations, have a look at Python_AES in tlslite or SlowAES.

Example Python code using the AES implementation in alo-aes:

#! /usr/bin/python2.4
# by pts@fazekas.hu at Sat Nov  7 11:45:38 CET 2009

# Download installer from http://www.louko.com/alo-aes/alo-aes-0.3.tar.gz
import aes

# Must be 16, 24 or 32 bytes.
key = '(Just an example for testing :-)'

# Must be a multiple of 16 bytes.
plaintext = 'x' * 16 * 3

o = aes.Keysetup(key)
# This uses ECB (simplest). alo-aes supports CBC as well (with o.cbcencrypt).
ciphertext = o.encrypt(plaintext)

assert (ciphertext.encode('hex') == 'fe4877546196cf4d9b14c6835fdeab1a' * 3)
assert len(ciphertext) == len(plaintext)
assert ciphertext != plaintext  # almost always true
decrypted = o.decrypt(ciphertext)
assert plaintext == decrypted

print 'benchmarking'
plaintext = ''.join(map(chr, xrange(237)) + map(chr, xrange(256))) * 9 * 16
assert len(plaintext) == 70992
for i in xrange(1000):
  assert plaintext == o.decrypt(o.encrypt(plaintext))

Example Python code using the AES implementation in PyCrypto:

#! /usr/bin/python2.4
# by pts@fazekas.hu at Sat Nov  7 12:04:44 CET 2009
# based on
# http://www.codekoala.com/blog/2009/aes-encryption-python-using-pycrypto/

# Download installer from
# http://ftp.dlitz.net/pub/dlitz/crypto/pycrypto/pycrypto-2.1.0b1.tar.gz   
from Crypto.Cipher import AES

# Must be 16, 24 or 32 bytes.
key = '(Just an example for testing :-)'

# Must be a multiple of 16 bytes.
plaintext = 'x' * 16 * 3

o = AES.new(key)
# This uses ECB (simplest). PyCrypto aes supports CBC (with AES.new(key,
# aes.MODE_ECB)) as well.
ciphertext = o.encrypt(plaintext)

assert (ciphertext.encode('hex') == 'fe4877546196cf4d9b14c6835fdeab1a' * 3)
assert len(ciphertext) == len(plaintext)
assert ciphertext != plaintext  # almost always true
decrypted = o.decrypt(ciphertext)
assert plaintext == decrypted

print 'benchmarking'
plaintext = ''.join(map(chr, xrange(237)) + map(chr, xrange(256))) * 9 * 16
assert len(plaintext) == 70992
for i in xrange(1000):
  assert plaintext == o.decrypt(o.encrypt(plaintext))

Please note how similar the two example codes are. (Only the import and the object creation are different.)

On an Intel Centrino T2300 1.66GHz system, PyCrypto seems to be about 8% faster than alo-aes. The alo-aes implementation is smaller, because it contains only AES.

2009-11-03

How to use \showhyphens with beamer

When using \showhyphens with the beamer LaTeX class, the output of \showhypens doesn't appear on LaTeX's terminal output or in the .log file. The solution is to specify \rightskip=0pt, e.g.

\showhyphens{\rightskip0pt pseudopseudohypoparathyroidism}

results

Underfull \hbox (badness 10000) in paragraph at lines 9--9
[] \OT1/cmss/m/n/10.95 pseu-dopseu-do-hy-poparathy-roidism

2009-11-01

Tiny, self-contained C compiler using TCC + uClibc

This post presents how I played with a small C compiler and a small libc on Linux. I've combined TCC (the Tiny C Compiler) 0.9.25 and uClibc 0.9.26, compressed with UPX 3.03 to a tiny, self-contained C compiler for Linux i386. You don't need any external files (apart from the compiler executable) to compile (or interpret) C code, and the compiled executable will be self-contained as well.

Here is how to download and use it for compilation on a Linux x86 or x86_64 system:

$ uname
Linux
$ wget -O pts-tcc https://raw.githubusercontent.com/pts/pts-tcc/master/pts-tcc-0.9.26-uclibc-0.9.30.1
$ chmod +x pts-tcc
$ ls -l pts-tcc
-rwxrwxr-- 1 pts pts 349640 Nov  1 13:07 pts-tcc
$ wget -O example1.c https://raw.githubusercontent.com/pts/pts-tcc/master/pts-tcc/example1.c
$ cat example1.c
#! ./pts-tcc -run
int printf(char const*fmt, ...);
double sqrt(double x);
int main() {
  printf("Hello, World!\n");
  return sqrt(36) * 7;
}
$ ./pts-tcc example1.c
$ file a.out
a.out: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), statically linked, not stripped
$ ls -l a.out
-rwxrwxr-x 1 pts pts 17124 Nov  1 13:17 a.out
$ ./a.out; echo "$?"
Hello, World!
42
$ strace -e open ./pts-tcc example1.c
open("/proc/self/mem", O_RDONLY)        = 3
open("/proc/self/mem", O_RDONLY)        = 3
open("example1.c", O_RDONLY)            = 3
open("/proc/self/mem", O_RDONLY)        = 3
open("/proc/self/mem", O_RDONLY)        = 3
open("a.out", O_WRONLY|O_CREAT|O_TRUNC, 0777) = 3
$ strace ./a.out
execve("./a.out", ["./a.out"], [/* 47 vars */]) = 0
ioctl(0, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, {B38400 opost isig icanon echo ...}) = 0
write(1, "Hello, World!\n", 14Hello, World!
)         = 14
_exit(42)                               = ?

As you can see above, my version the compiler is less than 250k in size, and it doesn't need any library files (such as libc.so) for compilation (as indicated by the strace output above), and the compiled binary doesn't need library files either.

TCC can run the C program without creating any binaries:

$ rm -f a.out
$ ./pts-tcc -run example1.c; echo "$?"
Hello, World!
42
$ head -1 example1.c
#! ./pts-tcc -run
$ chmod +x example1.c
$ ./example1.c; echo "$?"
Hello, World!

TCC is a very fast compiler (see the speed comparisons on it web site), it is versatile (it can compile the Linux kernel), but it doesn't optimize much: the code it produces is usually larger and slower than the output of a traditional, optimizing C compiler such as gcc -O2.

The shell script pts-tcc-0.9.26-compile.sh builds TCC from source, adding uClibc, and compressing with UPX.

Update: TCC 0.9.25 and uClibc 0.9.29 (or older)

Please note that TCC 0.9.25 is buggy, don't use it!

$ wget -O pts-tcc https://raw.githubusercontent.com/pts/pts-tcc/master/pts-tcc/pts-tcc-0.9.25
$ chmod +x pts-tcc
$ ls -l pts-tcc
-rwxr-xr-x 1 pts eng 241728 Apr  8 16:53 pts-tcc

Update: TCC 0.9.25 and uClibc 0.9.30.1