DatasetReader

class openicl.DatasetReader(dataset: Dataset | DatasetDict | str, input_columns: List[str] | str, output_column: str, name: str | None = None, data_files: str | None = None, input_template: PromptTemplate | None = None, output_template: PromptTemplate | None = None, input_output_template: PromptTemplate | None = None, ds_size: None | int | float = None, split: NamedSplit | None = None, test_split: str | None = 'test')
In-context Learning Dataset Reader Class

Generates a DatasetReader instance from the given dataset.

dataset

The dataset to be read. The constructor signature also accepts a string in place of a loaded Dataset or DatasetDict.

Type:

Dataset or DatasetDict

input_columns

The column name, or list of column names, in the dataset that make up the input field.

Type:

List[str] or str

output_column

A column name in the dataset that represents the prediction field.

Type:

str

ds_size

The number of entries to return. If ds_size is an integer greater than or equal to 1, ds_size entries are returned at random. If 0 < ds_size < 1, int(len(dataset) * ds_size) entries are returned at random. Intended for testing.

Type:

int or float, optional
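
The two ds_size cases can be illustrated with a minimal sketch. This is not the library's implementation: resolve_sample_size and sample_rows are hypothetical helpers, plain dictionaries stand in for dataset rows, and the handling of None and of values larger than the dataset is an assumption.

```python
import random
from typing import Dict, List, Optional, Union


def resolve_sample_size(total: int, ds_size: Optional[Union[int, float]]) -> int:
    """Turn a ds_size value into a concrete row count (hypothetical helper)."""
    if ds_size is None:
        return total  # assumption: no ds_size means every row is kept
    if ds_size >= 1:
        return min(int(ds_size), total)  # assumption: capped at the dataset length
    if 0 < ds_size < 1:
        return int(total * ds_size)  # fraction of the dataset, rounded down
    raise ValueError("ds_size must be positive")


def sample_rows(rows: List[Dict], ds_size, seed: int = 0) -> List[Dict]:
    """Pick rows at random, without replacement."""
    k = resolve_sample_size(len(rows), ds_size)
    return random.Random(seed).sample(rows, k)


rows = [{"text": f"example {i}", "label": i % 2} for i in range(10)]
small = sample_rows(rows, 3)      # integer ds_size: a fixed number of rows
half = sample_rows(rows, 0.5)     # fractional ds_size: int(10 * 0.5) rows
print(len(small), len(half))
```

A fixed seed is used here only to make the sketch repeatable.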

references

The list of references, initialized from self.dataset[self.test_split][self.output_column].

Type:

list, optional
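
As a rough illustration of that lookup, the sketch below uses a nested dictionary in place of a DatasetDict (split name, then column name, then values). The collect_references helper and the sample data are hypothetical.

```python
from typing import Dict, List

# Stand-in for a DatasetDict: split name -> column name -> column values.
dataset: Dict[str, Dict[str, List]] = {
    "train": {"text": ["good film", "dull plot"], "label": [1, 0]},
    "test": {"text": ["great cast", "weak ending"], "label": [1, 0]},
}


def collect_references(data, output_column: str, test_split: str = "test") -> List:
    """Read the output column of the test split (hypothetical helper)."""
    return list(data[test_split][output_column])


references = collect_references(dataset, "label")
print(references)
```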

input_template

A PromptTemplate instance used to format the input field content during retrieval. Only some retrieval methods use it.

Type:

PromptTemplate, optional

output_template

A PromptTemplate instance used to format the output field content during retrieval. Only some learnable retrieval methods use it.

Type:

PromptTemplate, optional

input_output_template

A PromptTemplate instance used to format the input-output field content during retrieval. Only some retrieval methods use it.

Type:

PromptTemplate, optional

generate_input_field_corpus(dataset: Dataset | DatasetDict, split: str | None = None) → List[str]

Generate the corpus for the input field.

Parameters:
  • dataset (Dataset or DatasetDict) – A datasets.Dataset or datasets.DatasetDict instance.

  • split (str, optional) – The split of the dataset to use. If None, the entire dataset will be used. Defaults to None.

Returns:

A list of generated input field prompts.

Return type:

List[str]

generate_input_field_prompt(entry: Dict) → str

Generate a prompt for the input field based on the provided entry data.

Parameters:

entry (Dict) – A piece of data to be used for generating the prompt.

Returns:

The generated prompt.

Return type:

str
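
The relationship between the per-entry prompt and the corpus can be sketched as follows. This is an illustration under stated assumptions, not the library's code: input_field_prompt and input_field_corpus are hypothetical functions, a plain callable stands in for PromptTemplate, and joining the input columns with a space when no template is given is an assumption.

```python
from typing import Callable, Dict, List, Optional

Template = Callable[[Dict], str]  # stand-in for PromptTemplate (assumption)


def input_field_prompt(entry: Dict, input_columns: List[str],
                       template: Optional[Template] = None) -> str:
    """Build one input-field prompt from a single entry."""
    if template is not None:
        return template(entry)
    # Assumed fallback: join the raw input column values with a space.
    return " ".join(str(entry[col]) for col in input_columns)


def input_field_corpus(rows: List[Dict], input_columns: List[str],
                       template: Optional[Template] = None) -> List[str]:
    """Build the corpus by generating one prompt per entry."""
    return [input_field_prompt(row, input_columns, template) for row in rows]


rows = [
    {"premise": "A dog runs.", "hypothesis": "An animal moves.", "label": "entailment"},
    {"premise": "It is raining.", "hypothesis": "The sky is clear.", "label": "contradiction"},
]
columns = ["premise", "hypothesis"]

plain = input_field_corpus(rows, columns)
templated = input_field_corpus(
    rows, columns,
    template=lambda e: f"Premise: {e['premise']} Hypothesis: {e['hypothesis']}",
)
print(plain[0])
print(templated[0])
```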

generate_input_output_field_corpus(dataset: Dataset | DatasetDict, split: str | None = None) → List[str]

Generate the corpus for the input-output field.

Parameters:
  • dataset (Dataset or DatasetDict) – A datasets.Dataset or datasets.DatasetDict instance.

  • split (str, optional) – The split of the dataset to use. If None, the entire dataset will be used. Defaults to None.

Returns:

A list of generated input-output field prompts.

Return type:

List[str]

generate_input_output_field_prompt(entry: Dict) → str

Generate a prompt for the input-output field based on the provided entry data.

Parameters:

entry (Dict) – A piece of data to be used for generating the prompt.

Returns:

The generated prompt.

Return type:

str

generate_ouput_field_prompt(entry: Dict) → str

Generate a prompt for the output field based on the provided entry data.

Parameters:

entry (Dict) – A piece of data to be used for generating the prompt.

Returns:

The generated prompt.

Return type:

str

generate_output_field_corpus(dataset: Dataset | DatasetDict, split: str | None = None) → List[str]

Generate the corpus for the output field.

Parameters:
  • dataset (Dataset or DatasetDict) – A datasets.Dataset or datasets.DatasetDict instance.

  • split (str, optional) – The split of the dataset to use. If None, the entire dataset will be used. Defaults to None.

Returns:

A list of generated output field prompts.

Return type:

List[str]
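
The effect of the split parameter can be sketched with plain Python structures. The helpers select_rows and output_field_corpus are hypothetical; merging all splits when split is None and converting each output value with str() are assumptions made for illustration, not confirmed library behavior.

```python
from typing import Dict, List, Optional

Rows = List[Dict]


def select_rows(dataset: Dict[str, Rows], split: Optional[str] = None) -> Rows:
    """Return the rows of one split, or of every split when split is None."""
    if split is not None:
        return dataset[split]
    merged: Rows = []
    for split_rows in dataset.values():  # assumption: all splits are merged
        merged.extend(split_rows)
    return merged


def output_field_corpus(dataset: Dict[str, Rows], output_column: str,
                        split: Optional[str] = None) -> List[str]:
    """Build one output-field prompt per selected row."""
    # Assumption: without an output template, the raw value is stringified.
    return [str(row[output_column]) for row in select_rows(dataset, split)]


# Stand-in for a DatasetDict: split name -> list of row dictionaries.
dataset = {
    "train": [{"text": "good film", "label": 1}, {"text": "dull plot", "label": 0}],
    "test": [{"text": "great cast", "label": 1}],
}

test_only = output_field_corpus(dataset, "label", split="test")
everything = output_field_corpus(dataset, "label")
print(test_only, everything)
```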

set_references(column: str, split: str | None = None) → None

Set self.references based on column and optional split.

Parameters:
  • column (str) – The name of the column to read references from.

  • split (str, optional) – The name of the dataset split to read from. Defaults to None.