Parallel Programming Directives and Concepts

The parallel directive

#pragma omp parallel [clauses]
   {
       code_block
   }

Defines a parallel region, i.e. the code that will be executed by multiple threads in parallel.
Example: the parallel directive

// omp_parallel.cpp
// compile with: /openmp
#include <stdio.h>
#include <omp.h>

int main()  {
    #pragma omp parallel num_threads(4)
    {
        int i = omp_get_thread_num();
        printf_s("Hello from thread %d\n", i);
    }
}
The parallel directive

By default, the number of threads equals the number of logical processors on the machine. For example, if you have a machine with one physical processor with hyperthreading enabled, it will have two logical processors and therefore two threads.

Hyperthreading - simulates two logical cores on a single physical core: each logical core gets its own programmable interrupt controller and its own set of registers. The remaining resources of the physical core (memory cache, arithmetic logic unit, buses) are shared between the logical cores, so the system looks as if it had two physical cores.
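As a quick check of this default, the following minimal sketch (assuming a compiler with OpenMP enabled, e.g. gcc -fopenmp) queries the number of logical processors and the default team size using standard OpenMP runtime calls:

// omp_defaults.c - check the default number of threads (illustrative sketch)
#include <stdio.h>
#include <omp.h>

int main(void) {
    // Logical processors visible to the OpenMP runtime.
    printf("Logical processors : %d\n", omp_get_num_procs());
    // Default team size of the next parallel region (normally the same value).
    printf("Default max threads: %d\n", omp_get_max_threads());
    return 0;
}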
The omp_get_thread_num() function

omp_get_thread_num() returns the number of the calling thread within its team of parallel threads.

Hello from thread 0
Hello from thread 1
Hello from thread 2
Hello from thread 3

Note that the output order may vary from machine to machine.

Do not confuse it with omp_get_num_threads(), which returns the number of threads currently in the team executing the parallel region from which it is called.
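A minimal sketch contrasting the two calls (the interleaving of the output will vary from run to run):

// omp_thread_ids.c - omp_get_thread_num vs omp_get_num_threads
#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel num_threads(4)
    {
        // Each thread reports its own id and the size of its team.
        printf("I am thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}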
The OpenMP for directive

#pragma omp [parallel] for [clauses]
   for_statement

Causes the work done in a for loop inside a parallel region to be divided among threads.
#pragma omp for
    for (i = nStart; i <= nEnd; ++i)  {
        #pragma omp atomic
        nSum += i;
    }

atomic directive - specifies a memory location that will be updated in a single step (atomically) with respect to other threads. An operation acting on shared memory is atomic if it completes in a single step relative to other threads.

See the OpenMP atomic directive.
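The snippet above is only a fragment; a minimal self-contained version (the bounds 1..100 are chosen here purely for illustration) could look like this:

// omp_for_atomic.c - work-sharing loop with an atomic update
#include <stdio.h>
#include <omp.h>

int main(void) {
    int nStart = 1, nEnd = 100, nSum = 0, i;

    #pragma omp parallel
    {
        #pragma omp for
        for (i = nStart; i <= nEnd; ++i) {
            // Each increment of the shared sum is performed atomically.
            #pragma omp atomic
            nSum += i;
        }
    }
    printf("Sum of %d..%d = %d\n", nStart, nEnd, nSum);  /* expected 5050 */
    return 0;
}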
The master directive

The master directive specifies that a section of code is to be executed only by the master thread (thread 0) of the team. (A block that should run on a single, but not necessarily the master, thread is what the single directive provides.)
Example: master and barrier directives

int main( )
{
    int a[5], i;

    #pragma omp parallel
    {
        // Perform some computation.
        #pragma omp for
        for (i = 0; i < 5; i++)
            a[i] = i * i;
The barrier directive

Synchronizes all the threads in a team: every thread pauses at the barrier until all threads of the team have reached it.
Example: master and barrier directives (continued)

        // Print intermediate results in a single thread.
        #pragma omp master
        for (i = 0; i < 5; i++)
            printf_s("a[%d] = %d\n", i, a[i]);

        // Wait.
        #pragma omp barrier

        // Continue with the computation.
        #pragma omp for
        for (i = 0; i < 5; i++)
            a[i] += i;
    }
}
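Since the worksharing for ends with an implicit barrier, all of a[] is already filled in when the master block runs, so the intermediate output printed by the master thread should be:

a[0] = 0
a[1] = 1
a[2] = 4
a[3] = 9
a[4] = 16

The explicit barrier after the master block is needed because, unlike for, the master construct has no implied barrier of its own.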
The schedule clause

By default, OpenMP statically assigns loop iterations to threads.

#include <stdio.h>
#include <omp.h>

#define THREADS 8
#define N 100

int main ( ) {
    int i;
    #pragma omp parallel for num_threads(THREADS)
    for (i = 0; i < N; i++)   {
        printf( "Thread %d is doing iteration %d.\n",
                omp_get_thread_num(), i );
    }
    /* all threads done */
    printf("All done!\n");
    return 0;
}
A static schedule can be non-optimal, however. This is the case when the different iterations take different amounts of time.

#include <stdio.h>
#include <unistd.h>
#include <omp.h>

#define THREADS 4
#define N 16

int main ( ) {
    int i;
    #pragma omp parallel for schedule(static) num_threads(THREADS)
    for (i = 0; i < N; i++)  {
        /* wait for i seconds */
        sleep(i);
        printf( "Thread %d has completed iteration %d.\n",
                omp_get_thread_num( ), i);
    }
    /* all threads done */
    printf("All done!\n");
    return 0;
}

This program specifies static scheduling explicitly in the parallel for directive. It can be greatly improved with a dynamic schedule.
How much faster does this program run with a dynamic schedule?

#include <stdio.h>
#include <unistd.h>
#include <omp.h>

#define THREADS 4
#define N 16

int main ( ) {
    int i;
    #pragma omp parallel for schedule(dynamic) num_threads(THREADS)
    for (i = 0; i < N; i++)  {
        /* wait for i seconds */
        sleep(i);
        printf( "Thread %d has completed iteration %d.\n",
                omp_get_thread_num( ), i );
    }
    /* all threads done */
    printf("All done!\n");
    return 0;
}
Dynamic Schedule Overhead

Dynamic scheduling is better when the iterations may take very different amounts of time. However, there is some overhead to dynamic scheduling: after each iteration, the threads must stop and receive a new value of the loop variable to use for their next iteration.
The following program demonstrates this overhead:

#include <stdio.h>

#define THREADS 16
#define N 100000000

int main ( )  {
    int i;
    printf( "Running %d iterations on %d threads dynamically.\n", N, THREADS);
    #pragma omp parallel for schedule(dynamic) num_threads(THREADS)
    for (i = 0; i < N; i++)  {
        /* a loop that doesn't take very long */
    }
    /* all threads done */
    printf("All done!\n");
    return 0;
}

How long does this program take to execute?
If we specify static scheduling, the program will run faster:

#include <stdio.h>

#define THREADS 16
#define N 100000000

int main ( )  {
    int i;
    printf( "Running %d iterations on %d threads statically.\n", N, THREADS);
    #pragma omp parallel for schedule(static) num_threads(THREADS)
    for (i = 0; i < N; i++)  {
        /* a loop that doesn't take very long */
    }
    /* all threads done */
    printf("All done!\n");
    return 0;
}
Chunk Sizes

We can split the difference between static and dynamic scheduling by using chunks in a dynamic schedule. Here, each thread takes a set number of iterations, called a "chunk", executes it, and is then assigned a new chunk when it is done.
By specifying a chunk size of 100 in the program below, we markedly improve the performance:

#include <stdio.h>

#define THREADS 16
#define N 100000000
#define CHUNK 100

int main ( ) {
    int i;
    printf("Running %d iterations on %d threads dynamically.\n", N, THREADS);
    #pragma omp parallel for schedule(dynamic, CHUNK) num_threads(THREADS)
    for (i = 0; i < N; i++) {
        /* a loop that doesn't take very long */
    }
    /* all threads done */
    printf("All done!\n");
    return 0;
}
Increasing or decreasing the chunk size

Increasing the chunk size makes the scheduling more static, and decreasing it makes it more dynamic.
Guided Schedules

Instead of static or dynamic, we can specify guided as the schedule. This scheduling policy is similar to a dynamic schedule, except that the chunk size changes as the program runs: it begins with big chunks, but then adjusts to smaller chunk sizes if the workload is imbalanced.
How does the program above perform with a guided schedule?

#include <stdio.h>

#define THREADS 16
#define N 100000000

int main ( )  {
    int i;
    printf("Running %d iterations on %d threads guided.\n", N, THREADS);
    #pragma omp parallel for schedule(guided) num_threads(THREADS)
    for (i = 0; i < N; i++)  {
        /* a loop that doesn't take very long */
    }
    /* all threads done */
    printf("All done!\n");
    return 0;
}
How does our program with iterations that take different amounts of time perform with guided scheduling?

#include <stdio.h>
#include <unistd.h>
#include <omp.h>

#define THREADS 4
#define N 16

int main ( ) {
    int i;
    #pragma omp parallel for schedule(guided) num_threads(THREADS)
    for (i = 0; i < N; i++)  {
        /* wait for i seconds */
        sleep(i);
        printf("Thread %d has completed iteration %d.\n",
               omp_get_thread_num( ), i);
    }
    /* all threads done */
    printf("All done!\n");
    return 0;
}
Conclusion

OpenMP for automatically splits for loop iterations for us. But, depending on our program, the default behavior may not be ideal.

For loops where each iteration takes roughly equal time, static schedules work best, as they have little overhead.
Schedule Conclusion

For loops where each iteration can take very different amounts of time, dynamic schedules work best, as the work will be split more evenly across threads.

Specifying chunks, or using a guided schedule, provides a trade-off between the two. Choosing the best schedule depends on understanding your loop.
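One practical way to act on that last point is to time the loop under different schedules. The sketch below (the loop body is a placeholder for your own work) uses the standard schedule(runtime) clause, so the policy can be switched through the OMP_SCHEDULE environment variable without recompiling, and omp_get_wtime() for timing:

// omp_schedule_timing.c - time a loop under a schedule chosen at run time
#include <stdio.h>
#include <omp.h>

#define THREADS 16
#define N 100000000

int main(void) {
    int i;
    double start = omp_get_wtime();

    // The policy comes from the OMP_SCHEDULE environment variable,
    // e.g.  OMP_SCHEDULE="dynamic,100" ./a.out   or   OMP_SCHEDULE="guided" ./a.out
    #pragma omp parallel for schedule(runtime) num_threads(THREADS)
    for (i = 0; i < N; i++) {
        /* the loop body under test */
    }

    printf("Elapsed: %f seconds\n", omp_get_wtime() - start);
    return 0;
}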
private

Two problems: initialization + final value! With private(soma), each thread works on an uninitialized private copy, and the original soma is left untouched after the loop, so the printed value is wrong:

int soma = 0 ;
#pragma omp parallel for schedule(static) private(soma)
for (i=0 ; i < 10000 ; i++)
   soma += a[i];
printf("Done - soma = %d\n", soma);
Using firstprivate and lastprivate instead:

int soma = 0 ;
#pragma omp parallel for schedule(static) firstprivate(soma) lastprivate(soma)
for (i=0 ; i < 10000 ; i++)
   soma += a[i];
printf("Done\n");

This fixes the initialization and final-value problems: each private copy starts from the value soma had before the loop, and after the loop soma holds the copy of the thread that executed the final iteration. (For a correct global sum across all threads, the reduction clause shown below is what is actually needed.)
Firstprivate

Specifies that each thread should have its own instance of a variable, and that this instance should be initialized with the value the variable has before the parallel construct.
Lastprivate

Specifies that the enclosing context's version of the variable is set equal to the private version of whichever thread executes the final iteration (of the loop construct).
reduction

Specifies that one or more variables that are private to each thread are the subject of a reduction operation at the end of the parallel region.

reduction(op : list)

Used for "all-to-one" operations, for example op = '+':
each thread gets its own copy of the variable(s) listed in 'list', with the proper initialization;
each thread accumulates its partial sum into its own copy;
on leaving the parallel region, the partial sums are automatically added into the global variable.
Example: reduction

#include <stdio.h>
#include <omp.h>
#define NUM_THREADS 4

int main( ) {
    int i, tmp, res = 0;
    #pragma omp parallel for reduction(+:res) private(tmp) num_threads(NUM_THREADS)
    for (i=0 ; i< 10000 ; i++)
    {
        tmp = Calculo( );   /* Calculo( ) stands for some per-iteration computation */
        res += tmp ;
    }
    printf("The result is %d\n", res);
    return 0;
}

Note: loop indices are always private.
nowait

nowait - Overrides the implicit barrier at the end of a directive.

If there are several independent loops within a parallel region, you can use nowait to avoid the implicit barrier at the end of each for, as follows:

#include <stdio.h>
#define SIZE 5

void test( int *a, int *b, int *c, int size )
{
   int i;
   #pragma omp parallel
   {
      #pragma omp for nowait
      for (i = 0; i < size; i++)
         b[i] = a[i] * a[i];

      #pragma omp for nowait
      for (i = 0; i < size; i++)
         c[i] = a[i]/2;
   }
}
Parallel sections

#pragma omp [parallel] sections [clauses]
{
    #pragma omp section
      { code_block }
}
OMP SECTIONS

omp sections can be used when there is no loop to share:

#pragma omp parallel
#pragma omp sections
{
        Calculo1( );
    #pragma omp section
        Calculo2( );
    #pragma omp section
        Calculo3( );
}

The sections are distributed among the different threads. Each section runs different logic (on different threads).
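A minimal self-contained sketch of the same idea (the Calculo1/2/3 bodies here are placeholder functions invented for illustration):

// omp_sections.c - independent tasks run as sections (illustrative sketch)
#include <stdio.h>
#include <omp.h>

/* Placeholder computations standing in for Calculo1/2/3. */
static void Calculo1(void) { printf("Calculo1 on thread %d\n", omp_get_thread_num()); }
static void Calculo2(void) { printf("Calculo2 on thread %d\n", omp_get_thread_num()); }
static void Calculo3(void) { printf("Calculo3 on thread %d\n", omp_get_thread_num()); }

int main(void) {
    #pragma omp parallel sections num_threads(3)
    {
        #pragma omp section
        Calculo1( );
        #pragma omp section
        Calculo2( );
        #pragma omp section
        Calculo3( );
    }
    return 0;
}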