deeplearnjs is a new deep learning framework running on browser accelerated by WebGL hardware acceleration as I introduced before in this post. I was interested in the library from the beginning and submitted some patches including new optimizer implementations. I learned some internal codebase of the library during this process. So I want to explain some points interesting to me in this post. This time is regarding the component about mathematical calculation in deeplearnjs.

Mathematical calculation

NDArrayMath is a component which is responsible for doing a tensor calculation. This class provides kernel interface like exp, add and convolution etc. NDArrayMath is an abstract component. Actual calculation will be delegated to implementations on CPU and GPU, NDArrayMathCPU and NDArrayMathGPU respectively. These implementations are delegated from template method of NDArrayMath abstract class.

  /**
   * Adds two NDArrays element-wise, A + B. Supports broadcasting.
   * For a stricter version without broadcasting use math.addStrict().
   *
   * @param a The first NDArray to add element-wise.
   * @param b The second NDArray to add element-wise.
   */
  add(a: NDArray, b: NDArray): NDArray {
    util.assertAndGetBroadcastedShape(a.shape, b.shape);
    return this.executeOp('add', () => this.addInternal(a, b));
  }
  protected abstract addInternal(a: NDArray, b: NDArray): NDArray;

NDArrayMathCPU and NDArrayMathGPU should implement addInternal method to provide this kernel function in their platform, CPU and GPU. CPU implementation is very simple.

  protected addInternal<T extends NDArray>(a: T, b: T): T {
    return this.scaledArrayAddInternal<T>(Scalar.ONE, a, Scalar.ONE, b);
  }

It’s delegated to scaledArrayAddInternal further.

  protected scaledArrayAddInternal<T extends NDArray>(
      c1: Scalar, a: T, c2: Scalar, b: T): T {
    const newShape = util.assertAndGetBroadcastedShape(a.shape, b.shape);
    const newValues = new Float32Array(util.sizeFromShape(newShape));

    const aValues = a.getValues();
    const bValues = b.getValues();
    const c1Val = c1.get();
    const c2Val = c2.get();
    for (let i = 0; i < newValues.length; ++i) {
      newValues[i] = c1Val * aValues[i % a.size] + c2Val * bValues[i % b.size];
    }
    return NDArray.make(newShape, {values: newValues}) as T;
  }

In contrast, understanding GPU implementation may require a little WebGL familiarity. This is the implementation in NDArrayMathGPU.

  protected addInternal<T extends NDArray>(a: T, b: T): T {
    const program = new BinaryOpProgram(binaryop_gpu.ADD, a.shape, b.shape);
    return this.compileAndRun<NDArray, T>(program, [a, b]);
  }

It writes a shader program in a plain string. BinaryOpProgram just keeps the shader source program string and some metadata. NDArrayMathGPU compiles the program and send it to GPU through WebGL API. If you are familiar with the shader pipeline of WebGL, it is not so difficult to understand the process.

NDArray is a data treated as a tensor entity in deeplearnjs. If you touch Python numpy, the interface may look similar to numpy array. We learned how to run kernel program in WebGL GPU for now but how deeplearnjs send a data in NDArray to GPU?

The answer is texture.

Texture Manager

A data in NDArray is copied to GPU frame buffer as a texture. Texture manager is responsible for managing the data in GPU frame buffer.

function createAndConfigureTexture(
    gl: WebGLRenderingContext, width: number, height: number,
    numChannels: number): WebGLTexture {
  webgl_util.validateTextureSize(gl, width, height);
  const texture = webgl_util.createTexture(gl);

  const tex2d = gl.TEXTURE_2D;
  const internalFormat = getTextureInternalFormat(gl, numChannels);
  const format = getTextureFormat(gl, numChannels);
  webgl_util.callAndCheck(gl, () => gl.bindTexture(tex2d, texture));
  webgl_util.callAndCheck(
      gl, () => gl.texParameteri(tex2d, gl.TEXTURE_WRAP_S, gl.CLAMP_TO_EDGE));
  webgl_util.callAndCheck(
      gl, () => gl.texParameteri(tex2d, gl.TEXTURE_WRAP_T, gl.CLAMP_TO_EDGE));
  webgl_util.callAndCheck(
      gl, () => gl.texParameteri(tex2d, gl.TEXTURE_MIN_FILTER, gl.NEAREST));
  webgl_util.callAndCheck(
      gl, () => gl.texParameteri(tex2d, gl.TEXTURE_MAG_FILTER, gl.NEAREST));
  webgl_util.callAndCheck(
      gl,
      () => gl.texImage2D(
          tex2d, 0, internalFormat, width, height, 0, format, gl.FLOAT, null));
  webgl_util.callAndCheck(gl, () => gl.bindTexture(gl.TEXTURE_2D, null));
  return texture;
}

createAndConfigureTexture create texture frame buffer and bind it to the context. A frame buffer in a GPU memory space is allocated. Who send the data into that space?

NDArray

runProgram method in gpgpu_math.ts gets inputs and outputs of the program.

const outTex = output.getTexture();

getTexture() actually copies the data into frame buffer allocated in advance. This is a method in NDArray.

  getTexture(preferredShapeRC?: [number, number]): WebGLTexture {
    if (this.data.texture == null) {
      this.uploadToGPU(preferredShapeRC);
    }
    return this.data.texture;
  }

Finally it is delegated to uploadDataToTexture method in gpgpu_util.ts.

function uploadDataToTexture(
    gl: WebGLRenderingContext, texture: WebGLTexture, width: number,
    height: number, data: Float32Array, numChannels: number) {
  const textureFormat = getTextureFormat(gl, numChannels);

  webgl_util.validateTextureSize(gl, width, height);
  webgl_util.callAndCheck(gl, () => gl.bindTexture(gl.TEXTURE_2D, texture));
  webgl_util.callAndCheck(
      gl,
      () => gl.texSubImage2D(
          gl.TEXTURE_2D, 0, 0, 0, width, height, textureFormat, gl.FLOAT,
          data));
  webgl_util.callAndCheck(gl, () => gl.bindTexture(gl.TEXTURE_2D, null));
}

gl.bindTexture send a command to GPU to select the frame buffer allocated in advance for this kernel program. Then it copies data with texSubImage2D command as 2D texture image internally. (But it is 2D tensor actually). So the kernel program can find the data in frame buffer after this method is called.

Problem

Copying the data into GPU memory can be an overhead. It is desirable to copy data as much as possible at the same time. The data should not be changed in a batch training in a deeplearning framework. Bulk copying still more important. deeplearnjs now copies the data every time each kernel program runs. There is some room to be improved regarding copying data into GPU memory.