Benchmarking Foundation Models with Language-Model-as-an-Examiner