ML Engine Experiment eval tf.summary.scalar not displaying in tensorboard
I am trying to output some summary scalars in an ML Engine experiment at both train and eval time. `tf.summary.scalar('loss', loss)` correctly outputs the summary scalars for both training and evaluation on the same plot in TensorBoard. However, I am also trying to output other metrics at both train and eval time, and they only appear at train time. The code immediately follows `tf.summary.scalar('loss', loss)` but does not appear to work. For example, the following code outputs only for TRAIN, not EVAL. The only difference is that these use custom accuracy functions, yet they do work for TRAIN:
```python
if mode in (Modes.TRAIN, Modes.EVAL):
    loss = tf.contrib.legacy_seq2seq.sequence_loss(logits, outputs, weights)
    tf.summary.scalar('loss', loss)
    sequence_accuracy = sequence_accuracy(targets, predictions, weights)
    tf.summary.scalar('sequence_accuracy', sequence_accuracy)
```
Does it make any sense why `loss` would plot in TensorBoard for both TRAIN and EVAL, while `sequence_accuracy` would only plot for TRAIN?
Could this behavior somehow be related to the warning I received: "Found more than one metagraph event per run. Overwriting the metagraph with the newest event."?
Answer
The issue you're encountering, where `loss` plots for both training and evaluation but `sequence_accuracy` only plots during training, could be due to several factors. Here's a detailed breakdown:
Potential Causes and Solutions
1. **TensorFlow Graph Context**

   TensorFlow operations like `tf.summary.scalar` are added to the computation graph for specific modes (e.g., TRAIN or EVAL). If the graph built for EVAL mode doesn't include the `sequence_accuracy` metric, it won't produce summaries. Ensure that the `sequence_accuracy` calculation and the corresponding `tf.summary.scalar` call are executed within the same graph for both modes.

   **Solution:** Confirm that the `sequence_accuracy` logic is not inadvertently excluded in EVAL mode by debugging the control flow in your code:

   ```python
   if mode == Modes.TRAIN:
       with tf.name_scope("train"):
           # Bind the result to a new name so it doesn't shadow the
           # sequence_accuracy function itself.
           seq_acc = sequence_accuracy(targets, predictions, weights)
           tf.summary.scalar('sequence_accuracy', seq_acc)
   elif mode == Modes.EVAL:
       with tf.name_scope("eval"):
           seq_acc = sequence_accuracy(targets, predictions, weights)
           tf.summary.scalar('sequence_accuracy', seq_acc)
   ```
2. **Custom Accuracy Function**

   If your custom `sequence_accuracy` function uses operations that are specific to the training graph (e.g., operations on placeholders or variables that aren't present in the EVAL graph), it might fail to compute during evaluation.

   **Solution:** Validate that `sequence_accuracy` is independent of any training-specific components, and check that all required tensors (`targets`, `predictions`, `weights`) are accessible during EVAL. A mode-independent sketch is shown after this list.
3. **Summary Writer Configuration**

   Each mode (TRAIN and EVAL) should have its own summary writer, so that summaries are written to separate subdirectories or prefixed appropriately rather than overwriting one another.

   **Solution:** Use distinct log directories or prefixes for TRAIN and EVAL:

   ```python
   if mode == Modes.TRAIN:
       summary_writer = tf.summary.FileWriter(train_log_dir,
                                              graph=tf.get_default_graph())
   elif mode == Modes.EVAL:
       summary_writer = tf.summary.FileWriter(eval_log_dir)
   ```
4. **Summary Operations**

   Summaries must be explicitly evaluated and written to disk in both TRAIN and EVAL modes. This typically involves running the merged summary op and passing the result to the writer.

   **Solution:** Ensure you evaluate and write the summary operations in both modes (if the `Experiment`/`Estimator` framework drives your eval loop, see the `eval_metric_ops` note after this list):

   ```python
   merged_summary_op = tf.summary.merge_all()
   summary = sess.run(merged_summary_op, feed_dict=feed_dict)
   summary_writer.add_summary(summary, global_step)
   ```
5. **Metagraph Warning**

   The warning "Found more than one metagraph event per run. Overwriting the metagraph with the newest event." occurs when multiple metagraph definitions (the structure of the TensorFlow computation graph) are saved to the same TensorBoard log directory. This may indicate that your TRAIN and EVAL modes are sharing one log directory.

   **Solution:** Separate the TRAIN and EVAL log directories to avoid overwriting; a directory-layout sketch follows this list.
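To make cause 2 concrete, here is a minimal, mode-independent `sequence_accuracy` sketch. This is an assumption about your function's intent, not your actual implementation; it presumes `targets` and `predictions` are integer tensors of shape `[batch, time]` and `weights` is a float padding mask of the same shape. Because it is purely functional (no placeholders, variables, or queues), it builds identically in the TRAIN and EVAL graphs:

```python
import tensorflow as tf

def sequence_accuracy(targets, predictions, weights):
    # Token-level accuracy over non-padded positions; a hypothetical
    # stand-in for your custom function.
    correct = tf.cast(tf.equal(targets, predictions), tf.float32)
    # Mask out padding so it doesn't inflate the score, and guard
    # against an all-zero mask to avoid division by zero.
    return tf.reduce_sum(correct * weights) / tf.maximum(
        tf.reduce_sum(weights), 1e-12)
```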
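There is also an ML Engine-specific possibility worth checking, since `Experiment`/`Estimator` drives the eval loop for you: during EVAL, those frameworks typically only record the loss plus whatever metrics the model function returns through `eval_metric_ops`, which would explain why `loss` shows up while your extra scalar doesn't. A hedged sketch, using element-wise `tf.metrics.accuracy` as a stand-in for a true sequence accuracy:

```python
eval_metric_ops = None
if mode == Modes.EVAL:
    # tf.metrics.* return (value_op, update_op) pairs, which is the
    # shape eval_metric_ops expects; the framework records these as
    # eval summaries for TensorBoard.
    eval_metric_ops = {
        'sequence_accuracy': tf.metrics.accuracy(
            labels=targets, predictions=predictions, weights=weights),
    }
# Return eval_metric_ops from your model_fn (via ModelFnOps or
# EstimatorSpec, depending on which API you are on).
```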
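And for the metagraph warning, a common layout (illustrative paths; on ML Engine the base directory would be the job's output path rather than a hard-coded string) is sibling subdirectories under one base directory, which also lets TensorBoard overlay the two runs on the same plot:

```python
import os

# Hypothetical base directory for this run.
output_dir = '/tmp/my_experiment'
train_log_dir = os.path.join(output_dir, 'train')
eval_log_dir = os.path.join(output_dir, 'eval')
```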
Debugging Steps
1. **Verify Summary Addition**

   Print a confirmation that the `tf.summary.scalar('sequence_accuracy', ...)` call is actually reached in both TRAIN and EVAL modes:

   ```python
   print(f"Adding sequence_accuracy summary in mode: {mode}")
   ```
2. **Check TensorBoard Logs**

   Inspect the TensorBoard logs to confirm that `sequence_accuracy` appears in the graph definitions for both TRAIN and EVAL.
3. **Simplify the Code**

   Temporarily replace the custom `sequence_accuracy` function with a simpler computation to rule out issues in the function itself:

   ```python
   dummy_accuracy = tf.reduce_mean(
       tf.cast(tf.equal(targets, predictions), tf.float32))
   tf.summary.scalar('dummy_accuracy', dummy_accuracy)
   ```
4. **Inspect the Feed Dict**

   Ensure that all required tensors (e.g., `targets`, `predictions`, `weights`) are provided during EVAL; see the sanity-check sketch after this list.
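As a quick sanity check for step 4 (illustrative; assumes you have a `sess` and the EVAL `feed_dict` in hand), confirm the EVAL graph can actually produce the tensors the accuracy summary depends on:

```python
# Fetch the metric's inputs directly; if this raises, the EVAL graph
# is missing one of them and the summary op can never run.
vals = sess.run([targets, predictions, weights], feed_dict=feed_dict)
for name, val in zip(['targets', 'predictions', 'weights'], vals):
    print(name, getattr(val, 'shape', None))
```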
If these steps don't resolve the issue, consider sharing more details about your custom `sequence_accuracy` function and the context in which you evaluate the summaries. This will help narrow down the root cause.