Topic Types

We will combine factoid and complex questions into a single task.

Type | # | Example Question | Related NTCIR QA Task
DEFINITION | 10 | What is the Human Genome Project? | ACLIA
BIOGRAPHY | 10 | Who is Howard Dean? | ACLIA
RELATIONSHIP | 20 | What is the relationship between Saddam Hussein and Jacques Chirac? | ACLIA
EVENT | 20 | What are the major conflicts between India and China on border issues? | ACLIA
WHY | 20 | Why doesn't the U.S. ratify the Kyoto Protocol? | QAC-4
PERSON | 5 | Who is Finland's first woman president? | QAC 1-3, CLQA 1,2
ORGANIZATION | 5 | What is the name of the company that produced the first Fairtrade coffee? | QAC 1-3, CLQA 1,2
LOCATION | 5 | What is the name of the river that separates North Korea from China? | QAC 1-3, CLQA 1,2
DATE | 5 | When did Queen Victoria die? | QAC 1-3, CLQA 1,2

(The topic counts sum to 100.)

IR4QA Evaluation

In IR4QA, we will evaluate how good an IR system is, on average, at returning documents relevant to the information need of each topic, given a set of natural language questions or question analysis results.

We will provide both human and automatic evaluation for IR (and QA).

Relevance Judgment

In relevance judgment, human evaluators will read the content of each returned document and judge whether it satisfies the information need of that particular topic. Each returned document receives a relevance grade.

Official Evaluation Metrics

Based on the relevance judgment results, ACLIA organizers will provide MAP (Mean Average Precision) scores as the official evaluation metric, since MAP is commonly used in the IR community.
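For reference, the sketch below shows how MAP is typically computed from a ranked result list and binary relevance judgments; the run/qrels data structures and function names are illustrative assumptions, not the official scoring pipeline.

```python
# Illustrative computation of Average Precision and MAP over binary relevance judgments.
# Data structures (run, qrels) and names are assumptions for this sketch.

def average_precision(ranked_doc_ids, relevant_doc_ids):
    """AP for one topic: sum of precision@k at each rank k holding a relevant document,
    divided by the total number of relevant documents for the topic."""
    if not relevant_doc_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant_doc_ids)

def mean_average_precision(run, qrels):
    """run: {topic_id: ranked list of doc ids}; qrels: {topic_id: set of relevant doc ids}."""
    return sum(average_precision(run[t], qrels.get(t, set())) for t in run) / len(run)
```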

Unofficial Evaluation Metrics

ACLIA organizers may also release additional evaluation results produced by the trec_eval script. The additional metrics it reports (e.g. R-prec, MRR, and interpolated recall) will help you understand your system's behavior in more detail.
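As a rough illustration (not the trec_eval implementation itself), two of these per-topic quantities can be sketched as follows; MRR and mean R-prec are their averages over topics.

```python
# Illustrative per-topic reciprocal rank and R-precision;
# averaged over topics they give MRR and R-prec.

def reciprocal_rank(ranked_doc_ids, relevant_doc_ids):
    """1/rank of the first relevant document in the ranking, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            return 1.0 / rank
    return 0.0

def r_precision(ranked_doc_ids, relevant_doc_ids):
    """Precision at rank R, where R is the number of relevant documents for the topic."""
    r = len(relevant_doc_ids)
    if r == 0:
        return 0.0
    return sum(1 for doc_id in ranked_doc_ids[:r] if doc_id in relevant_doc_ids) / r
```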

CCLQA Evaluation

In CCLQA, we will evaluate how good a QA system is, on average, at returning answers that satisfy the information needs, given a set of natural language questions.

Official Evaluation Metrics

1. Definition

Just as in NTCIR-7 ACLIA CCLQA, we will use the nugget pyramid human-in-the-loop evaluation method (Lin and Demner-Fushman, 2006) on CCLQA results. The pyramid method has also been used in various other QA tasks (e.g. the TREC 2006-2007 QA Main Task "other" questions, the ciQA Task, and the TAC 2008 Opinion Question Answering Task).

The key concepts are the answer nuggets for a question, their human-voted weights, the concatenated system response, and the per-nugget length allowance C.

Each system response per question will be assigned an F-score, calculated as sketched below.
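This follows the nugget F-score of (Lin and Demner-Fushman, 2006); the notation below is illustrative, and the exact ACLIA parameterization may differ slightly.

```latex
% Nugget pyramid F-score (Lin and Demner-Fushman, 2006); illustrative notation.
\begin{align*}
  \text{recall } R &= \frac{\sum_{n \in \text{matched nuggets}} w(n)}{\sum_{n \in \text{all nuggets}} w(n)}\\[4pt]
  \text{allowance } \alpha &= C \times |\text{matched nuggets}|\\[4pt]
  \text{precision } P &=
    \begin{cases}
      1 & \text{if } l \le \alpha\\
      1 - \dfrac{l - \alpha}{l} & \text{otherwise}
    \end{cases}
  \qquad (l = \text{character length of the system response})\\[4pt]
  F_{\beta} &= \frac{(\beta^{2} + 1)\,P\,R}{\beta^{2}\,P + R}, \qquad \beta = 3
\end{align*}
```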

C is set to 100 non-whitespace characters for English in (Lin and Demner-Fushman, 2006). In ACLIA, we will estimate the C value for each language and each answer type based on training data, since it varies considerably. The value is proportional to the average length of the answer nuggets, so the C value for factoid questions is expected to be very small.

Note that precision is an approximation, reflecting a length penalty on the system response. This is because "nugget precision is much more difficult to compute since there is no effective way of enumerating all the concepts in a response" (Voorhees, 2005).

The F-score emphasizes recall over precision, parameterized by a beta value of 3, meaning that recall is treated as three times as important as precision. Historically, a value of 5 was estimated in a pilot study on definitional QA evaluation (Voorhees, 2004); in later TREC tasks the value was empirically adjusted to 3. We evaluate a submitted run by its average F-score over questions.

2. Case study

As a case study, consider the following two cases.

(1) Complex Question

Suppose there is a definition question with five corresponding answer nuggets whose human-voted weight vector is [1.0, 0.4, 0.2, 0.5, 0.7]. The system returns N answers in Japanese in response to the question, and the answers are concatenated into one "system response" that is 300 characters long in total. An evaluator finds that the 2nd nugget matches the system response; assuming the C value is defined to be 50, the evaluation result on this particular question is 0.146.

Note that N, the number of top answers to return, is up to each participant; however, please avoid returning an extraordinarily large number of answers, out of consideration for the human evaluation effort.

(2) Factoid Question

Suppose there is a factoid question with one or multiple information needs to be satisfied; for example, assume there exist five answers to the question "Which companies did Google acquire in 2004?". By the nature of these questions, the human votes are expected to be close to 1, so let us fix the nugget weight to 1 for questions of this kind. Assume that the system returns an unordered list of N answers with a total character length of 10, that 3 of them are correct, and that the C value is defined to be 4; the resulting score is worked out in the sketch below.
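A minimal sketch of this computation, assuming the standard formulation of (Lin and Demner-Fushman, 2006) given above; the function and its argument names are illustrative, not the official evaluation script. Under these assumptions the factoid case works out to 0.625, and the complex-question case to roughly 0.145, close to the 0.146 reported above.

```python
# Sketch of the nugget-based F-score (beta = 3), following Lin and Demner-Fushman (2006).
# Function and argument names are illustrative; the official ACLIA scorer may differ.

def nugget_f_score(all_weights, matched_weights, response_length, c, beta=3.0):
    """all_weights: weights of every nugget for the question;
    matched_weights: weights of the nuggets the assessor matched in the system response;
    response_length: character length of the concatenated system response;
    c: per-nugget character allowance (estimated per language and answer type)."""
    recall = sum(matched_weights) / sum(all_weights)
    allowance = c * len(matched_weights)
    if response_length <= allowance:
        precision = 1.0
    else:
        precision = 1.0 - (response_length - allowance) / response_length
    if precision == 0.0 and recall == 0.0:  # no nuggets matched
        return 0.0
    return ((beta ** 2 + 1) * precision * recall) / (beta ** 2 * precision + recall)

# Case (1), complex question: 2nd nugget (weight 0.4) matched, 300-char response, C = 50.
print(nugget_f_score([1.0, 0.4, 0.2, 0.5, 0.7], [0.4], 300, 50))  # ~0.145
# Case (2), factoid question: 3 of 5 unit-weight answers matched, 10-char response, C = 4.
print(nugget_f_score([1.0] * 5, [1.0] * 3, 10, 4))                # 0.625
```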

Unofficial Evaluation Metrics

As in NTCIR-7, ACLIA organizers will release automatic evaluation results for unofficial runs.

References
