Parallel Programming Directives and Concepts

The parallel directive

#pragma omp parallel [clauses]
   {
       code_block
   }

Defines a parallel region, i.e. the code that will be executed by multiple threads in parallel.
Example: the parallel directive

// omp_parallel.cpp
// compile with: /openmp
#include <stdio.h>
#include <omp.h>

int main()  {
    #pragma omp parallel num_threads(4)
    {
        int i = omp_get_thread_num();
        printf_s("Hello from thread %d\n", i);
    }
}
The parallel directive

By default, the number of threads equals the number of logical processors on the machine. For example, if you have a machine with one physical processor with hyperthreading enabled, it will have two logical processors and therefore two threads.

Hyperthreading - simulates two logical cores on a single physical core: each logical core gets its own programmable interrupt controller and its own set of registers. The remaining resources of the physical core (memory cache, arithmetic logic unit, buses) are shared between the logical cores, so the system looks as if it had two physical cores.
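As a quick check of this default, the following minimal sketch (assuming a compiler with OpenMP enabled, e.g. gcc -fopenmp) queries the number of logical processors and the default team size using standard OpenMP runtime calls:

// omp_defaults.c - check the default number of threads (illustrative sketch)
#include <stdio.h>
#include <omp.h>

int main(void) {
    // Logical processors visible to the OpenMP runtime.
    printf("Logical processors : %d\n", omp_get_num_procs());
    // Default team size of the next parallel region (normally the same value).
    printf("Default max threads: %d\n", omp_get_max_threads());
    return 0;
}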
The omp_get_thread_num() function

omp_get_thread_num() returns the number of the calling thread within its team of parallel threads.

Hello from thread 0
Hello from thread 1
Hello from thread 2
Hello from thread 3

Note that the output order may vary from machine to machine.

Do not confuse it with omp_get_num_threads(), which returns the number of threads currently in the team executing the parallel region from which it is called.
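A minimal sketch contrasting the two calls (the interleaving of the output will vary from run to run):

// omp_thread_ids.c - omp_get_thread_num vs omp_get_num_threads
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel num_threads(4)
    {
        // Each thread reports its own id and the size of its team.
        printf("I am thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}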
The OpenMP for directive

#pragma omp [parallel] for [clauses]
   for_statement

Causes the work done in a for loop inside a parallel region to be divided among threads.
#pragma omp for
    for (i = nStart; i <= nEnd; ++i)  {
        #pragma omp atomic
        nSum += i;
    }

atomic directive - specifies a memory location that will be updated in a single step (atomically) with respect to other threads. An operation acting on shared memory is atomic if it completes in a single step relative to other threads.

See the OpenMP atomic directive.
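The snippet above is only a fragment; a minimal self-contained version (the bounds 1..100 are chosen here purely for illustration) could look like this:

// omp_for_atomic.c - work-sharing loop with an atomic update
#include <stdio.h>
#include <omp.h>

int main(void) {
    int nStart = 1, nEnd = 100, nSum = 0, i;

    #pragma omp parallel
    {
        #pragma omp for
        for (i = nStart; i <= nEnd; ++i) {
            // Each increment of the shared sum is performed atomically.
            #pragma omp atomic
            nSum += i;
        }
    }
    printf("Sum of %d..%d = %d\n", nStart, nEnd, nSum);  /* expected 5050 */
    return 0;
}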
The master directive

The master directive specifies that a section of code is to be executed only by the master thread (thread 0) of the team. (A block that should run on a single, but not necessarily the master, thread is what the single directive provides.)
Example: master and barrier directives

int main( )
{
    int a[5], i;

    #pragma omp parallel
    {
        // Perform some computation.
        #pragma omp for
        for (i = 0; i < 5; i++)
            a[i] = i * i;
The barrier directive

Synchronizes all the threads in a team: every thread pauses at the barrier until all threads of the team have reached it.
Example: master and barrier directives (continued)

        // Print intermediate results in a single thread.
        #pragma omp master
        for (i = 0; i < 5; i++)
            printf_s("a[%d] = %d\n", i, a[i]);

        // Wait.
        #pragma omp barrier

        // Continue with the computation.
        #pragma omp for
        for (i = 0; i < 5; i++)
            a[i] += i;
    }
}
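Since the worksharing for ends with an implicit barrier, all of a[] is already filled in when the master block runs, so the intermediate output printed by the master thread should be:

a[0] = 0
a[1] = 1
a[2] = 4
a[3] = 9
a[4] = 16

The explicit barrier after the master block is needed because, unlike for, the master construct has no implied barrier of its own.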
The schedule clause

By default, OpenMP statically assigns loop iterations to threads.

#include <stdio.h>
#include <omp.h>

#define THREADS 8
#define N 100

int main ( ) {
    int i;
    #pragma omp parallel for num_threads(THREADS)
    for (i = 0; i < N; i++)   {
        printf( "Thread %d is doing iteration %d.\n",
                omp_get_thread_num(), i );
    }
    /* all threads done */
    printf("All done!\n");
    return 0;
}
A static schedule can be non-optimal, however. This is the case when the different iterations take different amounts of time.

#include <stdio.h>
#include <unistd.h>
#include <omp.h>

#define THREADS 4
#define N 16

int main ( ) {
    int i;
    #pragma omp parallel for schedule(static) num_threads(THREADS)
    for (i = 0; i < N; i++)  {
        /* wait for i seconds */
        sleep(i);
        printf( "Thread %d has completed iteration %d.\n",
                omp_get_thread_num( ), i);
    }
    /* all threads done */
    printf("All done!\n");
    return 0;
}

This program specifies static scheduling explicitly in the parallel for directive. It can be greatly improved with a dynamic schedule.
How much faster does this program run with a dynamic schedule?

#include <stdio.h>
#include <unistd.h>
#include <omp.h>

#define THREADS 4
#define N 16

int main ( ) {
    int i;
    #pragma omp parallel for schedule(dynamic) num_threads(THREADS)
    for (i = 0; i < N; i++)  {
        /* wait for i seconds */
        sleep(i);
        printf( "Thread %d has completed iteration %d.\n",
                omp_get_thread_num( ), i );
    }
    /* all threads done */
    printf("All done!\n");
    return 0;
}
Dynamic Schedule Overhead

Dynamic scheduling is better when the iterations may take very different amounts of time. However, there is some overhead to dynamic scheduling: after each iteration, the threads must stop and receive a new value of the loop variable to use for their next iteration.
The following program demonstrates this overhead:

#include <stdio.h>

#define THREADS 16
#define N 100000000

int main ( )  {
    int i;
    printf( "Running %d iterations on %d threads dynamically.\n", N, THREADS);
    #pragma omp parallel for schedule(dynamic) num_threads(THREADS)
    for (i = 0; i < N; i++)  {
        /* a loop that doesn't take very long */
    }
    /* all threads done */
    printf("All done!\n");
    return 0;
}

How long does this program take to execute?
If we specify static scheduling, the program will run faster:

#include <stdio.h>

#define THREADS 16
#define N 100000000

int main ( )  {
    int i;
    printf( "Running %d iterations on %d threads statically.\n", N, THREADS);
    #pragma omp parallel for schedule(static) num_threads(THREADS)
    for (i = 0; i < N; i++)  {
        /* a loop that doesn't take very long */
    }
    /* all threads done */
    printf("All done!\n");
    return 0;
}
Chunk Sizes

We can split the difference between static and dynamic scheduling by using chunks in a dynamic schedule. Here, each thread takes a set number of iterations, called a "chunk", executes it, and is then assigned a new chunk when it is done.
By specifying a chunk size of 100 in the program below, we markedly improve the performance:

#include <stdio.h>

#define THREADS 16
#define N 100000000
#define CHUNK 100

int main ( ) {
    int i;
    printf("Running %d iterations on %d threads dynamically.\n", N, THREADS);
    #pragma omp parallel for schedule(dynamic, CHUNK) num_threads(THREADS)
    for (i = 0; i < N; i++) {
        /* a loop that doesn't take very long */
    }
    /* all threads done */
    printf("All done!\n");
    return 0;
}
Increasing or decreasing the chunk size

Increasing the chunk size makes the scheduling more static, and decreasing it makes it more dynamic.
Guided Schedules

Instead of static or dynamic, we can specify guided as the schedule. This scheduling policy is similar to a dynamic schedule, except that the chunk size changes as the program runs: it begins with big chunks, but then adjusts to smaller chunk sizes if the workload is imbalanced.
How does the program above perform with a guided schedule?

#include <stdio.h>

#define THREADS 16
#define N 100000000

int main ( )  {
    int i;
    printf("Running %d iterations on %d threads guided.\n", N, THREADS);
    #pragma omp parallel for schedule(guided) num_threads(THREADS)
    for (i = 0; i < N; i++)  {
        /* a loop that doesn't take very long */
    }
    /* all threads done */
    printf("All done!\n");
    return 0;
}
How does our program with iterations that take different amounts of time perform with guided scheduling?

#include <stdio.h>
#include <unistd.h>
#include <omp.h>

#define THREADS 4
#define N 16

int main ( ) {
    int i;
    #pragma omp parallel for schedule(guided) num_threads(THREADS)
    for (i = 0; i < N; i++)  {
        /* wait for i seconds */
        sleep(i);
        printf("Thread %d has completed iteration %d.\n",
               omp_get_thread_num( ), i);
    }
    /* all threads done */
    printf("All done!\n");
    return 0;
}
Conclusion

OpenMP for automatically splits for loop iterations for us. But, depending on our program, the default behavior may not be ideal.

For loops where each iteration takes roughly equal time, static schedules work best, as they have little overhead.
Schedule Conclusion

For loops where each iteration can take very different amounts of time, dynamic schedules work best, as the work will be split more evenly across threads.

Specifying chunks, or using a guided schedule, provides a trade-off between the two. Choosing the best schedule depends on understanding your loop.
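One practical way to act on that last point is to time the loop under different schedules. The sketch below (the loop body is a placeholder for your own work) uses the standard schedule(runtime) clause, so the policy can be switched through the OMP_SCHEDULE environment variable without recompiling, and omp_get_wtime() for timing:

// omp_schedule_timing.c - time a loop under a schedule chosen at run time
#include <stdio.h>
#include <omp.h>

#define THREADS 16
#define N 100000000

int main(void) {
    int i;
    double start = omp_get_wtime();

    // The policy comes from the OMP_SCHEDULE environment variable,
    // e.g.  OMP_SCHEDULE="dynamic,100" ./a.out   or   OMP_SCHEDULE="guided" ./a.out
    #pragma omp parallel for schedule(runtime) num_threads(THREADS)
    for (i = 0; i < N; i++) {
        /* the loop body under test */
    }

    printf("Elapsed: %f seconds\n", omp_get_wtime() - start);
    return 0;
}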
private

Two problems: initialization + final value! With private(soma), each thread works on an uninitialized private copy, and the original soma is left untouched after the loop, so the printed value is wrong:

int soma = 0 ;
#pragma omp parallel for schedule(static) private(soma)
for (i=0 ; i < 10000 ; i++)
   soma += a[i];
printf("Done - soma = %d\n", soma);
Using firstprivate and lastprivate instead:

int soma = 0 ;
#pragma omp parallel for schedule(static) firstprivate(soma) lastprivate(soma)
for (i=0 ; i < 10000 ; i++)
   soma += a[i];
printf("Done\n");

This fixes the initialization and final-value problems: each private copy starts from the value soma had before the loop, and after the loop soma holds the copy of the thread that executed the final iteration. (For a correct global sum across all threads, the reduction clause shown below is what is actually needed.)
Firstprivate

Specifies that each thread should have its own instance of a variable, and that this instance should be initialized with the value the variable has before the parallel construct.
Lastprivate

Specifies that the enclosing context's version of the variable is set equal to the private version of whichever thread executes the final iteration (of the loop construct).
reduction

Specifies that one or more variables that are private to each thread are the subject of a reduction operation at the end of the parallel region.

reduction(op : list)

Used for "all-to-one" operations, for example op = '+':
each thread gets its own copy of the variable(s) listed in 'list', with the proper initialization;
each thread accumulates its partial sum into its own copy;
on leaving the parallel region, the partial sums are automatically added into the global variable.
Example: reduction

#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 4

int main( ) {
    int i, tmp, res = 0;
    #pragma omp parallel for reduction(+:res) private(tmp) num_threads(NUM_THREADS)
    for (i=0 ; i< 10000 ; i++)
    {
        tmp = Calculo( );   /* Calculo( ) stands for some per-iteration computation */
        res += tmp ;
    }
    printf("The result is %d\n", res);
    return 0;
}

Note: loop indices are always private.
nowait

nowait - Overrides the implicit barrier at the end of a directive.

If there are several independent loops within a parallel region, you can use nowait to avoid the implicit barrier at the end of each for, as follows:

#include <stdio.h>
#define SIZE 5

void test( int *a, int *b, int *c, int size )
{
   int i;
   #pragma omp parallel
   {
      #pragma omp for nowait
      for (i = 0; i < size; i++)
         b[i] = a[i] * a[i];

      #pragma omp for nowait
      for (i = 0; i < size; i++)
         c[i] = a[i]/2;
   }
}
Parallel sections

#pragma omp [parallel] sections [clauses]
{
    #pragma omp section
      { code_block }
}
OMP SECTIONS

omp sections can be used when there is no loop to share:

#pragma omp parallel
#pragma omp sections
{
        Calculo1( );
    #pragma omp section
        Calculo2( );
    #pragma omp section
        Calculo3( );
}

The sections are distributed among the different threads. Each section runs different logic (on different threads).
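A minimal self-contained sketch of the same idea (the Calculo1/2/3 bodies here are placeholder functions invented for illustration):

// omp_sections.c - independent tasks run as sections (illustrative sketch)
#include <stdio.h>
#include <omp.h>

/* Placeholder computations standing in for Calculo1/2/3. */
static void Calculo1(void) { printf("Calculo1 on thread %d\n", omp_get_thread_num()); }
static void Calculo2(void) { printf("Calculo2 on thread %d\n", omp_get_thread_num()); }
static void Calculo3(void) { printf("Calculo3 on thread %d\n", omp_get_thread_num()); }

int main(void) {
    #pragma omp parallel sections num_threads(3)
    {
        #pragma omp section
        Calculo1( );
        #pragma omp section
        Calculo2( );
        #pragma omp section
        Calculo3( );
    }
    return 0;
}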