For TPUs:
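A minimal sketch, assuming the repo's download script takes the model name plus a
gs:// destination bucket (the script name and arguments here are assumptions):

```bash
python3 download_model.py YourModelName gs://your-bucket
```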
This will create two directories, one named after the model and another named
"encoder". Change the "model_dir" and "encoder_path" parameters in the .json
corresponding to your model to point to these paths, respectively.
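For example, the relevant fields might then look like this (paths are placeholders):

```json
{
  "model_dir": "gs://your-bucket/YourModelName",
  "encoder_path": "gs://your-bucket/encoder"
}
```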
## Generating Text
To generate predictions, you can either pass the prompt directly on the command
line or have it read from a file (useful for prompts that include newlines). Text
is output to the console and to the file specified in the "predict_path" parameter.
For this to work, you need a model checkpoint and a copy of the BPE encoder at an
accessible location (change the "model_dir" and "encoder_path" parameters in the
.json accordingly).
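Directly on the command line, a minimal sketch (assuming a main.py entry point
with --model and --predict_text flags; the entry-point name and flags are
assumptions):

```bash
python3 main.py --model YourModel.json --predict_text "Hello there! My name is"
```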
From file:
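Under the same assumptions, with the prompt read from a file (the --predict_file
flag name is likewise an assumption):

```bash
python3 main.py --model YourModel.json --predict_file prompt.txt
```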
The optional top_k parameter restricts the model to sampling from only the k most
likely tokens at each step. Setting it to around 40 tends to produce better
results, but with less variety.
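For example, under the same assumptions as above, restricting sampling to the 40
most likely tokens (the --top_k flag name is an assumption):

```bash
python3 main.py --model YourModel.json --top_k 40 --predict_text "Hello there! My name is"
```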
## Training
To train a model, define its parameters in a .json file (see examples) and then
simply call:
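A minimal sketch, again assuming a main.py entry point that takes the config
through a --model flag (both are assumptions):

```bash
python3 main.py --model YourModel.json
# On TPUs, presumably a flag naming the TPU is also needed (assumption):
# python3 main.py --model YourModel.json --tpu your-tpu-name
```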
This assumes you have a version of the openwebtext corpus stored in an accessible
location. If you don't, see below for how to generate your own version.
## Explanation of Parameters
Because passing two dozen parameters over the command line would be tedious, you
pass all the model parameters in a .json file. Note that any path parameters also
accept Google Storage paths, and *must* be gs:// paths if you're running on TPUs.
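Pieced together from the parameters documented below, a sketch of such a .json
(values are the listed defaults where one exists; the step counts and paths are
placeholders, and a real config also needs model-architecture parameters not
covered in this section):

```json
{
  "model_dir": "gs://your-bucket/YourModelName",
  "encoder_path": "gs://your-bucket/encoder",
  "precision": "float32",
  "input": "openwebtext",
  "lr": 0.00025,
  "warmup_steps": 2000,
  "opt_name": "adam",
  "beta1": 0.9,
  "beta2": 0.98,
  "epsilon": 1e-9,
  "weight_decay": 0.01,
  "train_steps": 10000,
  "eval_steps": 100,
  "max_steps": 500000,
  "iterations": 100,
  "embed_dropout": 0.1,
  "attn_dropout": 0.1,
  "res_dropout": 0.1
}
```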
Training parameters:
* **precision**: Whether to use float32 or bfloat16 variables (use "bfloat16" when
training very large models) (optional, defaults to float32)
* **input**: Which input function to use (default: "openwebtext")
* **lr**: Learning rate (default: 0.00025)
* **warmup_steps**: Number of warmup steps. If this is set, a linear warmup +
cosine decay schedule is used (see the sketch after this list) (default: 2000)
(optional)
* **opt_name**: Name of optimizer, currently there are "adam" and "adafactor"
(default: "adam")
* **weight_decay**: Weight decay parameter; if not present, no weight decay is
used (the weight decay fix for Adam is applied) (default: 0.01) (optional)
* **beta1**: Adam/Adafactor beta1 parameter (adam default: 0.9, adafactor default:
0.0)
* **beta2**: Adam/Adafactor beta2 parameter (default: 0.98) (optional for adafactor
with pow decay type)
* **epsilon**: Adam epsilon parameter (default: 1e-9)
* **decay_type**: Adafactor decay type, either "pow" or "adam" (default: "pow")
* **decay_exponent**: Adafactor pow decay exponent (default: 0.8)
* **train_steps**: Number of training steps to take between evaluations
* **eval_steps**: Number of steps per evaluation
* **max_steps**: The maximum number of training steps (important for the learning
rate decay schedule)
* **iterations**: Number of iterations to perform on TPUs (Default: 100) (Only
required for TPUs)
* **embed_dropout**: Dropout chance on the word embedding, set to 0 to disable
(default: 0.1)
* **attn_dropout**: Dropout chance on attention layers, set to 0 to disable
(default: 0.1)
* **res_dropout**: Dropout chance on residual connections, set to 0 to disable
(default: 0.1)
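To make the warmup_steps schedule above concrete, here is a minimal sketch of
linear warmup followed by cosine decay, using the defaults listed in this section.
This is an illustration, not the repository's actual code; the max_steps value is
a placeholder.

```python
import math

def learning_rate(step, lr=0.00025, warmup_steps=2000, max_steps=500000):
    """Illustrative linear-warmup + cosine-decay schedule."""
    if step < warmup_steps:
        # Linear warmup: ramp from 0 up to the base learning rate.
        return lr * step / warmup_steps
    # Cosine decay: fall from the base learning rate to 0 at max_steps.
    progress = min(1.0, (step - warmup_steps) / (max_steps - warmup_steps))
    return lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```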