AVX-512 development proposal

AVX-512 development proposal

Knapp, Rashawn L

Hello Julian and Valgrind developers,

 

I wish you a happy new year.  I am part of a small team that would like to help enable AVX-512 support in Valgrind.  We have drafted a preliminary statement of work covering what we think the work might comprise, along with a list of implementation steps we think we can start with.  Could you let me know whether the following is reasonable, and offer any advice you have for us on contributing to this development?

 

We propose to extend Valgrind’s VEX infrastructure to support the Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions for Intel® Xeon Phi™ for the Knights Landing microarchitecture.

 

Specifically, this includes:

-    the Foundation Instructions, which extend AVX and AVX2 with 64 512-bit instruction mnemonics (Combined Volume Set of Intel® 64 and IA-32 Architectures Software Developer’s Manuals, section 5.19, December, 2016. Retrieved January 2017) (avx512f),

-    the Prefetch Instructions, which include eight 512-bit mnemonics (Combined Volume Set of Intel® 64 and IA-32 Architectures Software Developer’s Manuals, section 5.19, December, 2016. Retrieved January 2017) (avx512pf),

-    the Exponential and Reciprocal Instructions, which include six 512-bit mnemonics (Combined Volume Set of Intel® 64 and IA-32 Architectures Software Developer’s Manuals, section 5.19, December, 2016. Retrieved January 2017) (avx512er), and

-    the Conflict Detection Instructions, which include three 512-bit instruction mnemonics that are not AVX or AVX2 (Combined Volume Set of Intel® 64 and IA-32 Architectures Software Developer’s Manuals, section 5.19, December, 2016. Retrieved January 2017) (avx512cd).

 

We expect to use the EVEX encoding mnemonics. 
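
As a rough illustration of what EVEX recognition involves, the sketch below splits the four-byte prefix (0x62 followed by three payload bytes) into its fields, following the layout in the 2016-era Intel documentation. The `EvexPrefix` struct and the `decode_evex` function are hypothetical names for this example and are not part of VEX. Later ISA extensions reuse some of the bits treated as reserved here, so this is only a starting point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical holder for decoded EVEX fields (not a Valgrind type).
   Fields that the encoding stores inverted are un-inverted here. */
typedef struct {
    uint8_t mm;    /* opcode map: 1 = 0F, 2 = 0F 38, 3 = 0F 3A           */
    uint8_t pp;    /* implied prefix: 0 = none, 1 = 66, 2 = F3, 3 = F2   */
    uint8_t vvvv;  /* low 4 bits of the extra source register specifier */
    uint8_t ll;    /* L'L: 0 = 128, 1 = 256, 2 = 512 bits in the common
                      case (reused for rounding control in some forms)  */
    uint8_t aaa;   /* opmask register number, k0..k7                     */
    bool r, x, b;  /* register-extension bits, as in REX/VEX             */
    bool r_hi;     /* R': high extension bit for ModRM.reg               */
    bool v_hi;     /* V': high extension bit for vvvv                    */
    bool w;        /* operand-size / opcode-extension bit                */
    bool z;        /* zeroing (1) versus merging (0) masking             */
    bool bcast;    /* broadcast / rounding-control / SAE context bit     */
} EvexPrefix;

/* Returns true and fills *out if code[0..3] looks like an EVEX prefix.
   In 64-bit mode the byte 0x62 has no other meaning. */
bool decode_evex(const uint8_t *code, size_t len, EvexPrefix *out)
{
    assert(code != NULL && out != NULL);
    if (len < 4 || code[0] != 0x62)
        return false;

    const uint8_t p0 = code[1], p1 = code[2], p2 = code[3];

    /* P0 bits 3:2 are reserved as 00 and P1 bit 2 is fixed at 1. */
    if ((p0 & 0x0C) != 0 || (p1 & 0x04) == 0)
        return false;

    out->r     = !(p0 & 0x80);
    out->x     = !(p0 & 0x40);
    out->b     = !(p0 & 0x20);
    out->r_hi  = !(p0 & 0x10);
    out->mm    = p0 & 0x03;

    out->w     = (p1 & 0x80) != 0;
    out->vvvv  = (uint8_t)(~(p1 >> 3) & 0x0F);
    out->pp    = p1 & 0x03;

    out->z     = (p2 & 0x80) != 0;
    out->ll    = (p2 >> 5) & 0x03;
    out->bcast = (p2 & 0x10) != 0;
    out->v_hi  = !(p2 & 0x08);
    out->aaa   = p2 & 0x07;
    return true;
}
```

For example, the byte sequence 62 F1 74 48 58 C2 (vaddps zmm0, zmm1, zmm2) decodes to map 0F, no implied prefix, vvvv = 1, and a 512-bit vector length.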

 

We will enhance the test suite with Intel® AVX-512 specific tests for the above Knights Landing instruction groups.

 

A stretch goal is to include the remaining Intel® AVX-512 instructions, for which hardware is not currently available for us to test on.  These may include the following instruction groups:  AVX-512DQ, AVX-512BW, AVX-512VL, AVX512IFMA, AVX512VBMI, AVX512_4FMAPS, and AVX512_4VNNIW.

 

Our proposed implementation starting steps are the following:

1.  Implement EVEX prefix recognition and minimal parsing (Intel® Architecture Instruction Set Extensions Programming Reference, December, 2016. Retrieved December 2016).

2.  Implement a handful of avx512f instructions using the existing AVX2 implementations as a reference (more can be added once these starting steps have been completed successfully).

3.  Implement stubs for avx512pf instructions; we propose stubs because these instructions only affect performance, and we believe they can be ignored safely.

4.  Implement the avx512er instructions.

5.  Implement the vpconflictd instructions.

6.  Test with an AVX512 benchmark and micro benchmarks.
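
For step 5, a scalar reference model can help when checking an emulated result against the architectural definition. The sketch below models the unmasked 512-bit form of vpconflictd as described in the Intel documentation: each destination element gets one bit set for every lower-indexed source element that holds the same value. The function name `ref_vpconflictd` is hypothetical, and masking and the 128-bit and 256-bit forms are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { ZMM_DWORDS = 16 };  /* sixteen 32-bit elements in a 512-bit register */

/* Scalar model of unmasked "vpconflictd zmm_dst, zmm_src".
   Bit j of dst[i] is set when j < i and src[j] == src[i]. */
void ref_vpconflictd(uint32_t dst[ZMM_DWORDS], const uint32_t src[ZMM_DWORDS])
{
    assert(dst != NULL && src != NULL);

    /* Work on a copy so that dst and src may refer to the same array. */
    uint32_t in[ZMM_DWORDS];
    memcpy(in, src, sizeof in);

    for (size_t i = 0; i < ZMM_DWORDS; i++) {
        uint32_t conflicts = 0;
        for (size_t j = 0; j < i; j++) {
            if (in[j] == in[i])
                conflicts |= UINT32_C(1) << j;
        }
        dst[i] = conflicts;
    }
}
```

With a source vector that starts 5, 5, 7, 5 and is otherwise unique, the first four result elements are 0, 1, 0 and 3.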

 

 

We look forward to working with the Valgrind community on this.

 

Regards,

 

Rashawn Knapp

Software Development Engineer, Intel Corporation


------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
_______________________________________________
Valgrind-developers mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/valgrind-developers

Re: AVX-512 development proposal

Julian Seward-2

Hi Rashawn,

Thank you for the offer of adding AVX-512 support, and sorry for the
slow response.  Some of the Valgrind developers discussed this briefly
at Fosdem in Brussels last weekend and there was general agreement
that this would be a good thing to do.

I would be happy to be a point of contact for technical and process
assistance.  I have both technical and process comments regarding
your proposal.

From a process point of view:

* This is likely to take several months and may involve more than one
  round of review and iteration.  That's based on experience from
  other large chunks of instruction-set development work.

* As an example, have a look at the following 5 bugs, which show a staged
  approach to implementation of the recent POWER ISA 3.0 extensions:

    https://bugs.kde.org/show_bug.cgi?id=359767
    https://bugs.kde.org/show_bug.cgi?id=361207
    https://bugs.kde.org/show_bug.cgi?id=362329
    https://bugs.kde.org/show_bug.cgi?id=363858
    https://bugs.kde.org/show_bug.cgi?id=364948

* Patches should go on the bug tracker, as per the examples above, and
  will be reviewed there.

* All contributions to the tree need to be licensed "GNU GPL 2 or
  later".  Are you OK with that?  GPL 2-only is not possible.

* There is a general, although largely unstated, expectation that parties
  who contribute large chunks of code continue afterwards to provide at
  least some minimal level of support/bugfixing, especially around
  release-time.  We've had problems in the past with large bits of the
  code going into the tree and the developers later simply disappearing,
  and would prefer to avoid that in future.  Would you be able to
  provide that level of support going forward?

* Similarly, there is an expectation that you have some machine which
  can run nightly tests (from our framework) and send results to the
  valgrind-testresults mailing list.  Since none of the developers
  (AFAIK) have AVX512 capable hardware, we have no other way to know
  whether the support is working.

* VEX is basically a mini-compiler for basic blocks.  Not essential,
  but it will help if your developer(s) have a bit of basic background
  in compiler internals.

Regarding your proposed implementation steps, they sound plausible.
However:

* You need a step zero, which is to extend Valgrind's HW capabilities
  detection (coregrind/m_machine.c) to detect AVX512 support and tell
  VEX about it.  That has to happen before any insns get implemented.

* Also, you will need to extend the implementation of XSAVE and XRSTOR
  to cover the new register state.  Given the inflexibility of VEX's
  IR (intermediate representation), the current AVX2-level XSAVE and
  XRSTOR was difficult to implement and is hard to understand, so this
  is likely to be a challenge.  I suggest you deal with it sooner
  rather than later, since we've found that runtime libraries rely on
  XSAVE and XRSTOR and so you won't be able to run any real code with
  AVX512 until those two are working.

* I assume (although you didn't say this) that you are doing this for
  the 64-bit instruction set only.  Our 32 bit insn set support is
  essentially legacy, having stopped at SSSE3, and doesn't have a
  proper prefix decoder in the same way that the 64 bit front end
  does.

* Write test cases for the insns first, and make sure they are
  comprehensive enough and work well.  This reduces the general stress
  and difficulty of implementing the instructions.  Bear in mind that
  incorrect instruction emulation can corrupt program state in a way
  that isn't apparent until hundreds of millions of instructions
  later, by which time it is impossible to figure out what went wrong.
  So a good test suite is essential.  See for example
  none/tests/amd64/avx2-1.c and many others in the same directory.

* Some of the existing AVX256 insn implementations are less than
  ideal, in the sense that they generate very verbose IR that performs
  operations a lane at a time, rather than as a vector as a whole.
  That gives rise to problems like
    https://bugs.kde.org/show_bug.cgi?id=375839
  The practical consequence is that (often) you won't be able to just
  implement a 512-bit variant of an existing 256-bit insn by doubling
  up the IR -- we'll have to do something better (wider and shallower)
  here.

* If -- as seems likely -- you need to add new IROps to facilitate
  this support, then you will also need to add support for them in
  memcheck/mc_translate.c.

* Since you are adding register state, you'll need to futz with
  memcheck/mc_machine.c too.

* You will need to be careful to ensure that the back end provides
  SIMD integer support capable of supporting Memcheck's instrumentation
  of the front end's SIMD FP IR.  Without that, you'll wind up in a
  situation where you can run AVX512 code with the 'none' tool but not
  with 'memcheck'.  This is an arcane but important detail.  We can
  come back to it later.
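
To make the last point concrete, here is a simplified model of how definedness ("V bit") tracking for a lanewise 32-bit FP operation can be expressed with integer operations only: an OR of the two operand shadows, followed by a per-lane compare against zero that smears any undefined bit across the whole lane. This is a sketch of the general idea rather than Memcheck's actual code, the name `shadow_fp32_binop` is hypothetical, and it assumes the convention that a set shadow bit means "undefined". The point is that both integer steps have to be available at the full vector width in the back end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { LANES_512 = 16 };  /* sixteen 32-bit lanes in a 512-bit vector */

/* Shadow value for a lanewise 32-bit FP binary operation.
   Assumed convention: a set bit in a shadow word means "undefined". */
void shadow_fp32_binop(uint32_t vres[LANES_512],
                       const uint32_t va[LANES_512],
                       const uint32_t vb[LANES_512])
{
    assert(vres != NULL && va != NULL && vb != NULL);
    for (size_t i = 0; i < LANES_512; i++) {
        /* Undefined if either input bit is undefined: integer OR. */
        uint32_t either = va[i] | vb[i];
        /* Any undefined input bit makes the whole result lane
           undefined: integer compare-not-equal-to-zero per lane. */
        vres[i] = (either != 0) ? UINT32_MAX : 0;
    }
}
```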

J



Re: AVX-512 development proposal

Knapp, Rashawn L

Hello Julian and Valgrind developers,

 

We were very happy to receive your detailed response regarding this proposal. I have met with my team and the managers involved to coordinate our response to the questions you posed and the process and technical points you outlined. Our responses are inline, prefixed with [RLK].  We look forward to working with Valgrind.

 

Best regards,

 

-Rashawn

 

-----Original Message-----
From: Julian Seward [mailto:[hidden email]]
Sent: Thursday, February 09, 2017 10:01 AM
To: Knapp, Rashawn L <[hidden email]>
Cc: [hidden email]
Subject: Re: [Valgrind-developers] AVX-512 development proposal

 

 

Hi Rashawn,

 

Thank you for the offer of adding AVX-512 support, and sorry for the slow response.  Some of the Valgrind developers discussed this briefly at Fosdem in Brussels last weekend and there was general agreement that this would be a good thing to do.

 

I would be happy to be a point of contact for technical and process assistance.  I have both technical and process comments regarding your proposal.

 

From a process point of view:

 

* This is likely to take several months and may involve more than one
  round of review and iteration.  That's based on experience from
  other large chunks of instruction-set development work.

[RLK] We understand this may take several months.

 

* As an example, have a look at the following 5 bugs, which show a staged
  approach to implementation of the recent POWER ISA 3.0 extensions:

    https://bugs.kde.org/show_bug.cgi?id=359767
    https://bugs.kde.org/show_bug.cgi?id=361207
    https://bugs.kde.org/show_bug.cgi?id=362329
    https://bugs.kde.org/show_bug.cgi?id=363858
    https://bugs.kde.org/show_bug.cgi?id=364948

* Patches should go on the bug tracker, as per the examples above, and
  will be reviewed there.

[RLK] We will follow a staged approach, with all patches submitted on
the bug tracker.

 

* All contributions to the tree need to be licensed "GNU GPL 2 or
  later".  Are you OK with that?  GPL 2-only is not possible.

[RLK] This will not be problematic.

 

* There is a general, although largely unstated, expectation that parties
  who contribute large chunks of code continue afterwards to provide at
  least some minimal level of support/bugfixing, especially around
  release-time.  We've had problems in the past with large bits of the
  code going into the tree and the developers later simply disappearing,
  and would prefer to avoid that in future.  Would you be able to
  provide that level of support going forward?

[RLK] Our intention is to support this; we agreed in our meeting that not
doing so may risk this work becoming deprecated.

 

* Similarly, there is an expectation that you have some machine which
  can run nightly tests (from our framework) and send results to the
  valgrind-testresults mailing list.  Since none of the developers
  (AFAIK) have AVX512 capable hardware, we have no other way to know
  whether the support is working.

[RLK] We have internal machines for developing and testing.  I will
inquire about funding options for Valgrind to invest in a machine
which Valgrind would host. We will acquaint ourselves with running nightly
tests.

 

* VEX is basically a mini-compiler for basic blocks.  Not essential,
  but it will help if your developer(s) have a bit of basic background
  in compiler internals.

[RLK] We are ramping up on these skills.

 

Regarding your proposed implementation steps, they sound plausible.
However:

* You need a step zero, which is to extend Valgrind's HW capabilities
  detection (coregrind/m_machine.c) to detect AVX512 support and tell
  VEX about it.  That has to happen before any insns get implemented.

[RLK] We have started with this step in our internal work thus far.
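
For illustration, the detection described in step zero might look like the sketch below. It checks the CPUID leaf 7 feature bits for the four Knights Landing groups and also confirms, via XCR0, that the OS has enabled the opmask and ZMM register state. All names here (`Avx512Caps`, `avx512_caps_from_regs`, `probe_avx512`) are hypothetical, and the probe uses the GCC/Clang `<cpuid.h>` helpers rather than whatever mechanism coregrind/m_machine.c uses internally. The decision logic is kept separate from the hardware probe so it can be exercised on any machine.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID.(EAX=7,ECX=0):EBX feature bits for the Knights Landing groups. */
#define CPUID7_EBX_AVX512F   (UINT32_C(1) << 16)
#define CPUID7_EBX_AVX512PF  (UINT32_C(1) << 26)
#define CPUID7_EBX_AVX512ER  (UINT32_C(1) << 27)
#define CPUID7_EBX_AVX512CD  (UINT32_C(1) << 28)

/* XCR0 bits the OS must have enabled: SSE (1), AVX (2), opmask (5),
   ZMM_Hi256 (6) and Hi16_ZMM (7). */
#define XCR0_AVX512_MASK     UINT64_C(0xE6)

/* Hypothetical result type for this sketch. */
typedef struct { bool f, pf, er, cd; } Avx512Caps;

/* Pure decision logic: usable only if the OS saves the state and the
   Foundation group is present; the other groups build on Foundation. */
Avx512Caps avx512_caps_from_regs(uint32_t cpuid7_ebx, bool osxsave,
                                 uint64_t xcr0)
{
    Avx512Caps caps = { false, false, false, false };
    bool os_ok = osxsave && (xcr0 & XCR0_AVX512_MASK) == XCR0_AVX512_MASK;
    if (!os_ok || !(cpuid7_ebx & CPUID7_EBX_AVX512F))
        return caps;
    caps.f  = true;
    caps.pf = (cpuid7_ebx & CPUID7_EBX_AVX512PF) != 0;
    caps.er = (cpuid7_ebx & CPUID7_EBX_AVX512ER) != 0;
    caps.cd = (cpuid7_ebx & CPUID7_EBX_AVX512CD) != 0;
    return caps;
}

#if defined(__x86_64__) && defined(__GNUC__)
#include <cpuid.h>

/* Hardware probe for 64-bit x86 with GCC or Clang. */
Avx512Caps probe_avx512(void)
{
    unsigned int eax, ebx, ecx, edx;
    uint32_t lo = 0, hi = 0;

    if (__get_cpuid_max(0, 0) < 7)
        return avx512_caps_from_regs(0, false, 0);

    __cpuid(1, eax, ebx, ecx, edx);
    bool osxsave = ((ecx >> 27) & 1u) != 0;   /* CPUID.1:ECX.OSXSAVE */
    if (osxsave)                               /* read XCR0 */
        __asm__ volatile ("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));

    __cpuid_count(7, 0, eax, ebx, ecx, edx);
    return avx512_caps_from_regs(ebx, osxsave, ((uint64_t)hi << 32) | lo);
}
#endif
```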

 

* Also, you will need to extend the implementation of XSAVE and XRSTOR
  to cover the new register state.  Given the inflexibility of VEX's
  IR (intermediate representation), the current AVX2-level XSAVE and
  XRSTOR was difficult to implement and is hard to understand, so this
  is likely to be a challenge.  I suggest you deal with it sooner
  rather than later, since we've found that runtime libraries rely on
  XSAVE and XRSTOR and so you won't be able to run any real code with
  AVX512 until those two are working.

[RLK] We have started with this step in our internal work thus far.
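
As a small worked example of the new register state involved: AVX-512 adds three XSAVE state components, selected by bits 5, 6 and 7 of the requested-feature bitmap. Their sizes follow directly from the register counts, as the sketch below computes. The table and the function name `avx512_xsave_bytes` are hypothetical; the offsets of the components inside the save area are not hard-coded here because they are enumerated through CPUID leaf 0DH.

```c
#include <stddef.h>
#include <stdint.h>

/* One AVX-512 related XSAVE state component. */
typedef struct {
    unsigned    bit;    /* bit position in XCR0 / the XSAVE feature bitmap */
    const char *name;
    unsigned    bytes;  /* size of the component in the save area */
} XsaveComponent;

static const XsaveComponent avx512_components[] = {
    { 5, "opmask: k0..k7, 8 x 64 bits",             8 * 8   },
    { 6, "ZMM_Hi256: upper 256 bits of zmm0..15",   16 * 32 },
    { 7, "Hi16_ZMM: all 512 bits of zmm16..31",     16 * 64 },
};

/* Bytes of AVX-512 state selected by a requested-feature bitmap. */
unsigned avx512_xsave_bytes(uint64_t rfbm)
{
    unsigned total = 0;
    size_t n = sizeof avx512_components / sizeof avx512_components[0];
    for (size_t i = 0; i < n; i++) {
        if (rfbm & (UINT64_C(1) << avx512_components[i].bit))
            total += avx512_components[i].bytes;
    }
    return total;
}
```

With all three bits set the extra state comes to 64 + 512 + 1024 = 1600 bytes, on top of the legacy, header and AVX regions that the existing AVX2-level implementation already handles.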

 

* I assume (although you didn't say this) that you are doing this for
  the 64-bit instruction set only.  Our 32 bit insn set support is
  essentially legacy, having stopped at SSSE3, and doesn't have a
  proper prefix decoder in the same way that the 64 bit front end
  does.

[RLK] Our intention was to do this for the 64-bit instruction set.

 

* Write test cases for the insns first, and make sure they are
  comprehensive enough and work well.  This reduces the general stress
  and difficulty of implementing the instructions.  Bear in mind that
  incorrect instruction emulation can corrupt program state in a way
  that isn't apparent until hundreds of millions of instructions
  later, by which time it is impossible to figure out what went wrong.
  So a good test suite is essential.  See for example
  none/tests/amd64/avx2-1.c and many others in the same directory.

[RLK] We will follow this advice; we have started with several
instructions.

 

* Some of the existing AVX256 insn implementations are less than
  ideal, in the sense that they generate very verbose IR that performs
  operations a lane at a time, rather than as a vector as a whole.
  That gives rise to problems like
    https://bugs.kde.org/show_bug.cgi?id=375839
  The practical consequence is that (often) you won't be able to just
  implement a 512-bit variant of an existing 256-bit insn by doubling
  up the IR -- we'll have to do something better (wider and shallower)
  here.

[RLK] We have reviewed this bug report and are seeking to implement
vector-wide IR in the future.

 

* If -- as seems likely -- you need to add new IROps to facilitate
  this support, then you will also need to add support for them in
  memcheck/mc_translate.c.

* Since you are adding register state, you'll need to futz with
  memcheck/mc_machine.c too.

[RLK] We will update the memcheck files as appropriate.

 

* You will need to be careful to ensure that the back end provides
  SIMD integer support capable of supporting Memcheck's instrumentation
  of the front end's SIMD FP IR.  Without that, you'll wind up in a
  situation where you can run AVX512 code with the 'none' tool but not
  with 'memcheck'.  This is an arcane but important detail.  We can
  come back to it later.

[RLK] We have added this to our success metrics for this work.

 

J

 

