ML Engine Experiment eval tf.summary.scalar not displaying in tensorboard

I am trying to output summary scalars in an ML Engine experiment at both train and eval time. tf.summary.scalar('loss', loss) correctly outputs the loss for both training and evaluation on the same plot in TensorBoard. However, other metrics that I output the same way at both train and eval time only show up for training. The code immediately follows tf.summary.scalar('loss', loss), yet it does not appear to work: for example, the code below only produces output for TRAIN, not EVAL. The only difference is that these metrics use custom accuracy functions, and those functions do work for TRAIN.

if mode in (Modes.TRAIN, Modes.EVAL):
    loss = tf.contrib.legacy_seq2seq.sequence_loss(logits, outputs, weights)
    tf.summary.scalar('loss', loss)

    sequence_accuracy = sequence_accuracy(targets, predictions, weights)
    tf.summary.scalar('sequence_accuracy', sequence_accuracy)

Does it make sense that loss would plot in TensorBoard for both TRAIN & EVAL, while sequence_accuracy would only plot for TRAIN?

Could this behavior somehow be related to the warning I received "Found more than one metagraph event per run. Overwriting the metagraph with the newest event."?

Answer

The issue you're encountering, where loss plots for both training and evaluation but sequence_accuracy only plots during training, could be due to several factors. Here's a detailed breakdown:


Potential Causes and Solutions

  1. TensorFlow Graph Context

    • TensorFlow operations like tf.summary.scalar are added to the computation graph for specific modes (e.g., TRAIN or EVAL). If the graph built for EVAL mode doesn't include the sequence_accuracy metric, it won't produce summaries.
    • Ensure that the sequence_accuracy calculation and the corresponding tf.summary.scalar call are executed within the same graph for both modes.

    Solution: Confirm that the sequence_accuracy logic is not inadvertently excluded in EVAL mode by debugging the control flow in your code.

    if mode == Modes.TRAIN:
        with tf.name_scope("train"):
            # Bind the result to a new name so the sequence_accuracy
            # function itself is not shadowed by its own return value
            seq_acc = sequence_accuracy(targets, predictions, weights)
            tf.summary.scalar('sequence_accuracy', seq_acc)
    elif mode == Modes.EVAL:
        with tf.name_scope("eval"):
            seq_acc = sequence_accuracy(targets, predictions, weights)
            tf.summary.scalar('sequence_accuracy', seq_acc)
    
  2. Custom Accuracy Function

    • If your custom sequence_accuracy function uses operations that are specific to the training graph (e.g., operations with placeholders or variables that aren't used in the EVAL graph), it might fail to compute during evaluation.

    Solution: Validate that sequence_accuracy is independent of any training-specific components. Check that all required tensors (targets, predictions, weights) are accessible during EVAL.
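
    For reference, here is a minimal, hypothetical version of such a function built only from stateless graph ops, so it constructs identically in TRAIN and EVAL (it assumes targets and predictions are integer token ids of the same shape and weights is a float padding mask):

    import tensorflow as tf

    def sequence_accuracy(targets, predictions, weights):
        # Hypothetical sketch: per-token accuracy, masked by weights so
        # padded positions are ignored. No placeholders or training-only
        # variables, so it behaves the same way in both modes.
        correct = tf.cast(tf.equal(targets, predictions), tf.float32)
        return tf.reduce_sum(correct * weights) / tf.maximum(
            tf.reduce_sum(weights), 1e-8)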

  3. Summary Writer Configuration

    • Each mode (TRAIN and EVAL) should have its own summary writer, ensuring summaries are written to separate subdirectories or prefixed appropriately to avoid overwriting.

    Solution: Use distinct log directories or prefixes for TRAIN and EVAL:

    if mode == Modes.TRAIN:
        # Write training summaries (and the graph) to their own directory
        summary_writer = tf.summary.FileWriter(train_log_dir, graph=tf.get_default_graph())
    elif mode == Modes.EVAL:
        # Write evaluation summaries to a separate directory
        summary_writer = tf.summary.FileWriter(eval_log_dir)
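
    TensorBoard treats each subdirectory of its log directory as a separate run, which is what lets the TRAIN and EVAL curves for the same tag appear together on one plot.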
    
  4. Summary Operations

    • Summaries must be explicitly evaluated and written to disk in both TRAIN and EVAL modes. This typically involves running the merged_summary_op and passing it to the writer.

    Solution: Ensure you evaluate and write the summary operations:

    # merge_all gathers every summary op registered in the default graph
    merged_summary_op = tf.summary.merge_all()
    summary = sess.run(merged_summary_op, feed_dict=feed_dict)
    summary_writer.add_summary(summary, global_step)
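
    Note that tf.summary.merge_all() only picks up summaries already present in the graph's SUMMARIES collection when it is called, so call it after every tf.summary.scalar has been defined, and call it separately for the TRAIN and EVAL graphs if they are built independently.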
    
  5. Metagraph Warning

    • The warning "Found more than one metagraph event per run. Overwriting the metagraph with the newest event." occurs when multiple metagraph definitions (the structure of the TensorFlow computation graph) are saved in the same TensorBoard log directory. This might indicate that TRAIN and EVAL modes are sharing the same log directory.

    Solution: Separate TRAIN and EVAL log directories to avoid overwriting.
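
    One common layout is a subdirectory per mode under the job's output directory (similar in spirit to how tf.estimator writes evaluation events to an eval subdirectory of model_dir). A minimal sketch, where output_dir stands in for your job's output path:

    import os

    # Hypothetical paths: one subdirectory per mode under the job output dir
    train_log_dir = os.path.join(output_dir, 'train')
    eval_log_dir = os.path.join(output_dir, 'eval')

    Pointing TensorBoard at output_dir then shows train and eval as separate runs, and keeps each run's metagraph events out of the other's directory.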


Debugging Steps

  1. Verify Summary Addition

    • Print a confirmation that tf.summary.scalar('sequence_accuracy') is being added in both TRAIN and EVAL modes.
    print(f"Adding sequence_accuracy summary in mode: {mode}")
    
  2. Check TensorBoard Logs

    • Inspect the TensorBoard event files to confirm that a sequence_accuracy tag is actually being written for both TRAIN and EVAL, as shown in the sketch below.
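    To verify this outside of the TensorBoard UI, you can read the event files directly and print the scalar tags they contain. A minimal sketch using the TF 1.x tf.train.summary_iterator API (eval_log_dir is assumed to be the EVAL log directory from above):

    import glob
    import tensorflow as tf

    # Print every summary tag written to the EVAL event files
    for path in glob.glob(eval_log_dir + '/events.out.tfevents.*'):
        for event in tf.train.summary_iterator(path):
            for value in event.summary.value:
                print(value.tag)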
  3. Simplify the Code

    • Temporarily replace the custom sequence_accuracy function with a simpler computation to rule out issues in the function itself.
    # Unmasked token-level accuracy as a sanity check
    dummy_accuracy = tf.reduce_mean(tf.cast(tf.equal(targets, predictions), tf.float32))
    tf.summary.scalar('dummy_accuracy', dummy_accuracy)
    
  4. Inspect Feed Dict

    • Ensure that all required tensors (e.g., targets, predictions, weights) are provided during EVAL, as in the sketch below.
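
    A quick check is to run these tensors directly in EVAL mode; if any of them cannot be produced, this surfaces the same error that prevents the summary from being computed. A sketch, assuming sess and feed_dict come from your evaluation loop:

    # Fetch the inputs the accuracy op depends on; a failure here means the
    # sequence_accuracy summary could not have been computed either
    t, p, w = sess.run([targets, predictions, weights], feed_dict=feed_dict)
    print(t.shape, p.shape, w.shape)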

If these steps don't resolve the issue, consider sharing more details about your custom sequence_accuracy function and the context in which you evaluate the summaries. This will help narrow down the root cause.