GELU Backward Bug: Accuracy Issues In TTNN

by Editorial Team

Introduction: The GELU Backward Function Under Scrutiny

Hey folks! Let's dive into a critical issue affecting the ttnn.gelu_bw() function in Tenstorrent's TT-Metal framework. We've uncovered accuracy problems serious enough to affect the reliability of any model that uses this function: severe ULP (Units in the Last Place) errors and, even worse, a "wrong sign" bug. In other words, the gradient that drives training is sometimes computed with the wrong sign, which can throw backpropagation off course and even cause training runs to fail. In this post we break down the details, explain the impact, and point toward a solution. The problem lies in the ttnn.gelu_bw() implementation, which computes the derivative of the Gaussian Error Linear Unit (GELU) activation function. GELU is a popular choice in modern neural networks, especially in models like BERT and GPT, so an accurate implementation is paramount. Let's get into the nitty-gritty of what's going on.

The Problem: ULP Errors and the Wrong Sign Bug

So, what exactly are we dealing with here? The primary concern revolves around the accuracy of the ttnn.gelu_bw() function, particularly when operating on BF16 (bfloat16) inputs. We've observed several key issues:

  • Severe ULP Errors: Approximately 10.69% of all BF16 values tested exhibited ULP errors greater than 1000. This indicates significant deviations from the expected, correct values. The maximum ULP error reached a staggering 32,460. This is really bad.

  • Wrong Sign Bug: The most critical issue is the "wrong sign" bug. At specific input values (around x ≈ -3.7), the function returns a positive derivative when it should be negative. This is a fundamental flaw, and it can cause the training process to diverge or produce incorrect results. Think of it like trying to climb a hill but being pushed down instead. It’s counter-productive.

  • High ULP Errors in Specific Regions: We've also noted particularly high ULP errors in the "deep negative" (x < -5) and "large positive" (x > 5) regions of the input space. These areas are critical, as they can represent regions where the network is very confident, or where gradients are large, making accuracy paramount. We have analyzed the ULP errors by region to give you a clear view, check it out:

    Region                       Count     Mean ULP    Max ULP
    Deep negative (x < -5)       16,095    6,255.61    32,460
    Moderate negative [-5, -2]   160       1,885.86    29,756
    Near negative [-2, -0.5]     256       5.98        146
    Near zero [-0.5, 0.5]        32,003    0.11        4
    Near positive [0.5, 2]       256       0.55        2
    Moderate positive [2, 5]     160       0.38        1
    Large positive (x > 5)       16,096    2,748.22    16,203

This table illustrates the severity of the problem: a large share of the tested values carry significant ULP errors, and in the moderate-negative region the error is large enough to flip the sign of the result. The implications of these errors are substantial, and we'll discuss them more below.
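
For readers who want to reproduce measurements like these, here is a minimal sketch of how a ULP distance between two bfloat16 values can be computed: round each float to bfloat16, map the 16-bit pattern to a monotonically ordered integer, and take the difference. This is one common way to define bf16 ULP distance; the report's exact test harness may differ.

```python
import struct

def bf16_bits(x: float) -> int:
    """Round x to bfloat16 (round-to-nearest-even) and return its 16-bit pattern."""
    u = struct.unpack("<I", struct.pack("<f", x))[0]
    u = (u + 0x7FFF + ((u >> 16) & 1)) & 0xFFFFFFFF  # RNE on the dropped 16 bits
    return u >> 16

def ulp_distance_bf16(a: float, b: float) -> int:
    """ULP distance between a and b in bfloat16, counting across zero."""
    def key(bits: int) -> int:
        # Map the sign-magnitude bit pattern to a monotonically ordered integer
        return -(bits & 0x7FFF) if bits & 0x8000 else bits
    return abs(key(bf16_bits(a)) - key(bf16_bits(b)))

# Adjacent bf16 values are 1 ULP apart; a sign flip on a small value
# crosses thousands of representable values.
print(ulp_distance_bf16(1.0, 1.0078125))  # 1.0078125 = 1 + 2^-7, the next bf16 above 1.0
```

Note how a wrong-sign result on a small-magnitude value automatically produces a huge ULP error, because the distance is counted across every representable value between the two signs.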

Diving Deeper: The Wrong Sign Bug

Let's zoom in on this "wrong sign" bug, as it's the most critical issue. This is where the derivative, which represents the gradient, gets its sign flipped. This can happen around the input value x ≈ -3.7, where the expected derivative is negative, but the hardware erroneously returns a positive value. The gradient's direction is reversed in this case, preventing the network from learning correctly.

Here’s a table that highlights the problem, showing the input value, expected derivative, actual output, and ULP error:

Input x    Expected GELU'(x)    Actual Output    ULP Error
-3.700     -1.526e-03           +4.349e-04       29,742
-3.719     -1.373e-03           +5.112e-04       29,756

As you can see, the "Actual Output" has the wrong sign, which shows how serious this issue is.

What's the consequence of this? During backpropagation, the gradients are used to adjust the network's weights. If the gradient's sign is incorrect, the weights will be updated in the wrong direction. The model may converge to the wrong solution or fail to converge at all. Imagine trying to steer a car while someone is constantly turning the wheel in the wrong direction – you're not going to get where you want to go.
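
The steering-wheel analogy can be made concrete with a toy example (purely illustrative, not TT-Metal code): gradient descent on a one-parameter quadratic converges when the gradient sign is correct and runs away from the minimum when the sign is flipped.

```python
def descend(steps: int, lr: float, flip_sign: bool) -> float:
    """Minimize f(w) = (w - 2)^2 by gradient descent; optionally flip the gradient sign."""
    w = 0.0
    for _ in range(steps):
        grad = 2.0 * (w - 2.0)  # exact gradient of f
        if flip_sign:
            grad = -grad        # simulate a wrong-sign backward pass
        w -= lr * grad
    return w

w_correct = descend(steps=50, lr=0.1, flip_sign=False)
w_flipped = descend(steps=50, lr=0.1, flip_sign=True)
print(w_correct)  # converges near the minimum at w = 2
print(w_flipped)  # diverges away from w = 2
```

With the correct sign, the error shrinks geometrically each step; with the flipped sign, the same update rule amplifies the error geometrically instead.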

The Root Cause: Potential Issues in erf_tile() Implementation

So, what's causing these issues? The most likely culprit is the erf_tile() function, which computes the error function erf(x), a core component of the GELU derivative. Recall that GELU'(x) = Φ(x) + x·φ(x), where Φ is the standard normal CDF and φ its PDF. The backward kernel computes this in the following steps, which is where erf_tile() comes into play:

  • Step 1: erf(x / sqrt(2)) — the input is scaled by 1/√2 and passed through the error function.
  • Step 2: cdf_term = 0.5 * (1.0 + erf(x / sqrt(2))) — this yields the Gaussian CDF Φ(x).
  • Step 3: ... pdf_term calculation ... — the Gaussian PDF φ(x) is computed.
  • Step 4: multiply by x — pdf_term is multiplied by x and combined with cdf_term to form the derivative.
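
The steps above can be checked against a double-precision reference using only the standard library (this is a sanity-check sketch, not the TT-Metal kernel):

```python
import math

def gelu_grad_reference(x: float) -> float:
    """Exact GELU derivative: GELU'(x) = Phi(x) + x * phi(x),
    where Phi is the standard normal CDF and phi its PDF."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf

# Around x = -3.7 the derivative is small but clearly negative,
# so the hardware's positive output has the wrong sign.
print(gelu_grad_reference(-3.7))
```

Evaluating this reference at the problematic inputs confirms that the true derivative is negative there, which is what the "Expected GELU'(x)" column in the table above reflects.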

It is possible that the erf_tile() implementation misbehaves for inputs where |x/√2| ≳ 2.6. For x = -3.7:

  • erf(-3.7/√2) = erf(-2.616) should be ≈ -0.99978
  • If erf_tile() saturates to exactly -1.0, then 1 + erf() = 0 and cdf_term = 0
  • The pdf_term calculation should still produce a negative result via x * pdf
  • But the actual output (+4.349e-04) is suspiciously close to just pdf without the x multiplication

The most probable explanation for the wrong sign bug is that erf_tile() does not behave as expected for inputs deep in the negative region. The function may be saturating, overflowing, or losing precision, especially when dealing with the bfloat16 data type. Whatever the mechanism, the observed output matches pdf_term alone — as if the multiplication by x (and the negative sign it carries) were lost along the way — which would explain both the wrong sign and why the magnitude is so close to the pdf value.
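
The saturation hypothesis is easy to illustrate outside the hardware. In the sketch below (assuming simple round-to-nearest bfloat16 behavior, which may not match the SFPU exactly), erf(-3.7/√2) ≈ -0.9998 rounds to exactly -1.0 at bfloat16 precision, so the 1 + erf(...) path collapses to zero, while the erfc formulation keeps the small positive tail:

```python
import math
import struct

def to_bf16(x: float) -> float:
    """Round a float to bfloat16 precision (round-to-nearest-even)."""
    u = struct.unpack("<I", struct.pack("<f", x))[0]
    u = (u + 0x7FFF + ((u >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", u))[0]

x = -3.7
inv_sqrt2 = 1.0 / math.sqrt(2.0)

# erf path: erf(-2.616...) ~ -0.9998 rounds to exactly -1.0 in bf16,
# so 1 + erf(...) collapses to 0 and the cdf term vanishes.
cdf_erf = to_bf16(0.5 * (1.0 + to_bf16(math.erf(x * inv_sqrt2))))

# erfc path: erfc(2.616...) ~ 2.2e-4 is directly representable,
# so the small positive tail survives.
cdf_erfc = to_bf16(0.5 * to_bf16(math.erfc(-x * inv_sqrt2)))

print(cdf_erf)   # 0.0
print(cdf_erfc)  # small positive value
```

This is the classic catastrophic-cancellation pattern: computing a tiny quantity as 1 + (something very close to -1) destroys it in low precision, whereas erfc() delivers the tiny quantity directly.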

Proposed Solutions: Moving Forward

Alright, so what can we do to fix this? Here are a few proposed solutions:

  • Use erfc() for negative inputs: This approach involves using the complementary error function, erfc(), for negative input values. The erfc() function is numerically more stable for negative inputs and can mitigate precision issues. This modification is the recommended approach.

    // Assumes tile[1] initially holds x and kAlpha = 1 / sqrt(2).
    // For negative x, use erfc for numerical stability.
    v_if(x < 0.0f) {
        // cdf = 0.5 * erfc(-x / sqrt(2))
        fill_tile(3, -kAlpha);
        mul_binary_tile(1, 3, 1);  // tile[1] = -x / sqrt(2) = |x| / sqrt(2)
        erfc_tile(1);              // tile[1] = erfc(|x| / sqrt(2))
        fill_tile(3, 0.5f);
        mul_binary_tile(1, 3, 1);  // tile[1] = cdf
    }
    v_else {
        // cdf = 0.5 * (1 + erf(x / sqrt(2)))
        fill_tile(3, kAlpha);
        mul_binary_tile(1, 3, 1);  // tile[1] = x / sqrt(2)
        erf_tile(1);               // tile[1] = erf(x / sqrt(2))
        fill_tile(3, 1.0f);
        add_binary_tile(1, 3, 1);  // tile[1] = 1 + erf(x / sqrt(2))
        fill_tile(3, 0.5f);
        mul_binary_tile(1, 3, 1);  // tile[1] = cdf
    }
    v_endif;
    
  • Investigate and fix the tile corruption issue: Determine whether intermediate tile registers are being clobbered between the cdf and pdf computations, and fix any corruption found.

  • Use a different computation order: Rearranging the order of calculations might resolve the issue by ensuring the original value of x is preserved until the final multiplication.

Impact and Implications: What Does This Mean?

The consequences of these accuracy issues are significant:

  • Training Accuracy: The wrong sign bug will lead to the gradients flowing in the incorrect direction. This impacts the training process, causing the model not to learn correctly.
  • Training Stability: High ULP errors can cause instability in the training process. This is particularly problematic in modern neural networks.
  • Convergence Issues: The model might not converge to a good solution or may converge to the wrong solution.

Conclusion: Addressing the GELU Backward Accuracy Crisis

This report has highlighted significant accuracy issues in the ttnn.gelu_bw() function, particularly the "wrong sign" bug. The recommended solution is to use the erfc() function for negative values to improve numerical stability. By addressing these issues, we can ensure that models using the TT-Metal framework perform as expected and deliver reliable results. Keep an eye out for updates and patches to resolve these issues, and feel free to use the provided information to reproduce the issues. This is a critical step toward ensuring the reliability and accuracy of our models.

Hopefully, this detailed analysis has provided some valuable insights into the problem and the path towards a solution. We encourage you to contribute to resolving this problem, as fixing this is crucial for the reliability of all models using GELU activation with backpropagation. Thanks for reading, and happy coding, everyone!