LING 388: Computers and Language

LING 388: Computers and

Language

Lecture 4

Administrivia

•

Quick Homework 3 graded

•

Homework 3 Review

•

What's underneath Python: a deep dive …

Homework 3 Review

•

Euler’s Identity:

•

Arbitrary precision:

>>> from decimal import *

>>> getcontext()

Context(prec=28, rounding=ROUND_HALF_EVEN, Emin=-999999, Emax=999999, capitals=1, clamp=0, flags=[],

traps=[InvalidOperation, DivisionByZero, Overflow])

>>> import cmath

>>> i = cmath.sqrt(-1)

>>> pi = Decimal('3.141592653589793238462643383279’)

>>> pi

Decimal('3.141592653589793238462643383279’)

>> pi / 2

Decimal('1.570796326794896619231321692')

>>> i*pi

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

TypeError: unsupported operand type(s) for *: 'complex' and 'decimal.Decimal'

The underlying architecture

•

Python is a high-level language

•

Beneath Python lies binary…

interesting use of binary notation:

6 bit Leica camera lens encoding

The 6-bit code shown on the right

 is 101011 (1 = black, 0 = white),

read starting from the philips

screw head

Introduction

•

Memory Representation

•

binary: zeros and ones (1 bit)

•

organized into bytes (8 bits)

•

memory is byte-addressable

•

word (32 bits)

•

e.g. integer

•

(64 bits: floating point number)

•

big-endian/little-endian

•

most significant byte first or least significant byte

•

communication …

addressable

Memory

(RAM)

FFFFFFFF

your Intel and

ARM CPUs

array

a[23]

Introduction

•

A typical notebook computer

•

Example

: a 2013 Macbook Air

•

CPU: Core i5-4250U

•

1.3 billion transistors

•

built-in GPU

•

TDP: 15W (1.3 GHz)

•

Dual core (Turbo: 2.6 GHz)

•

Hyper-Threaded (4 logical CPUs, 2

physical)

•

64 bit

•

64 KB (32 KB Instruction + 32 KB Data)

L1 cache

•

256 KB L2 cache per core

•

12MB L3 cache shared

•

16GB max RAM

Increased

address space

and 64-bit

registers

anandtech.com

A 4 core machine: 8 virtual

Introduction

•

Machine Language

•

A CPU understands only one language: machine language

•

all

other

languages

 must be translated into machine language

•

Primitive instructions include:

•

MOV

•

PUSH

•

POP

•

ADD / SUB

•

INC / DEC

•

IMUL / IDIV

•

AND / OR / XOR / NOT

•

NEG

•

SHL / SHR

•

JMP

•

CMP

•

JE / JNE / JZ / JG / JGE / JL / JLE

•

CALL / RET

Assembly Language: (this notation)

by definition, nothing built on it is more powerful

http://www.cs.virginia.edu/~evans/cs216/guides/x86.html

Introduction

•

Not all the machine instructions are conceptually necessary

•

many provided for speed/efficiency

•

Theoretical Computer Science

•

All mechanical computation can be carried out using a

TURING MACHINE

•

Finite state table + (infinite) tape

•

Tape instructions:

•

at the tape head: Erase, Write, Move (Left/Right/NoMove)

•

Finite state table:

•

Current state x Tape symbol --> new state x New Tape symbol x Move

Introduction

•

Storage:

•

based on digital logic

•

binary (base 2) – everything is a power of 2

•

Byte: 8 bits

•

01011011

•

= 2

+2

+2

+2

+2

•

= 64 + 16 + 8 + 2 + 1

•

= 91 (in decimal)

•

Hexadecimal (base 16)

•

0-9,A,B,C,D,E,F (need 4 bits)

•

5B (= 1 byte)

•

= 5*16

 + 11

•

= 80 + 11

•

= 91

Introduction: data types

•

Integers

•

In one byte (= 8 bits), what’s the largest and smallest

number, we can represent?

•

Answer

: -128 .. 0 .. 127

•

Why? -2

8-1

.. 0 .. 2

8-1

 – 1

•

00000000 = 0

•

01111111 = 127

•

10000000 = -128

•

11111111 = -1

+ 2

+ 2

+ 2

+ 2

+ 2

+ 2

+ 2

 – 1 = 255

 – 1 = 127

Introduction: data types

•

Integers

•

In one byte (= 8 bits), what’s the largest and smallest

number, we can represent?

•

00000000 = 0

•

01111111 = 127

•

10000000 = -128

•

11111111 = -1

00000000

11111111

2’s complement representation

Introduction: data types

•

Integers

•

In one byte, what’s the largest and smallest number, we can

represent?

•

Answer

: -128 .. 0 .. 127 using the

2’s complement

representation

•

Why? super-convenient for arithmetic operations

•

“to convert a positive integer X to its negative counterpart, flip

all the bits, and add 1”

•

Example

•

00001010 = 2

+ 2

 = 10 (decimal)

•

11110101 + 1 = 11110110 = -10 (decimal)

•

11110110 flip + 1 = 00001001 + 1 = 00001010

Addition

-10 + 10

= 11110110

+ 00001010 = 0 (ignore overflow)

Introduction: data types

•

Typically 32 bits (4 bytes) are used to store an integer

•

range: -2,147,483,648 (2

31-1

 -1) to 2,147,483,647 (2

32-1

-1)

•

what if you want to store even larger numbers?

•

Binary Coded Decimal (BCD)

•

code each decimal digit separately, use a string (sequence) of decimal digits …

C:

int

Introduction: data types

•

what if you want to store even larger numbers?

•

Binary Coded Decimal (BCD)

•

1 byte can code two digits (0-9 requires 4 bits)

•

1 nibble (4 bits) codes the sign (+/-), e.g. hex C/D

2 bytes (= 4 nibbles)

2.5 bytes (= 5 nibbles)

credit (+)

debit (-)

Introduction: data types

•

Typically, 64 bits (8 bytes) are used to represent floating point

numbers (double precision)

•

 = 2.99792458 x 10

(m/s)

•

coefficient: 52 bits (implied 1, therefore treat as 53)

•

exponent: 11 bits (not 2’s complement, unsigned with bias)

•

sign: 1 bit (+/-)

C:

float

double

wikipedia

x86 CPUs have a built-in

floating point coprocessor (x87)

80 bit long registers

Introduction: data types

•

How about letters, punctuation, etc.?

•

ASCII

•

American Standard Code for Information Interchange

•

Based on English alphabet (upper and lower case) + space + digits +

punctuation + control (Teletype Model 33)

•

Question

: how many bits do we need?

•

7 bits + 1 bit parity

•

Remember everything is in binary …

C:

char

Teletype Model 33 ASR

Teleprinter (Wikipedia)

Introduction: data types

order is important in sorting!

0-9: there’s a connection with BCD.

Notice:

code 30 (hex) through 39 (hex)

Introduction: data types

•

Parity bit:

•

transmission can be noisy

•

parity bit can be added to ASCII code

•

can spot single bit transmission errors

•

even/odd parity:

•

receiver understands each byte should be even/odd

•

Example:

•

0 (zero) is ASCII 30 (hex) = 011000

•

even parity: 011000

, odd parity: 011000

•

Checking parity:

•

Exclusive or (XOR): basic machine instruction

•

A xor B true if either A or B true but not both

•

Example:

•

(even parity 0) 011000

xor bit by bit

•

xor

 = 1 xor

 = 0 xor

 = 0 xor

 = 0 xor

 = 0 xor

 = 0 xor

= 0

x86 assemby language

1.

PF: even parity flag set by

      arithmetic ops.

2.

TEST: AND (don’t store

result), sets PF

3.

JP: jump if PF set

Example

MOV al,<char>

TEST al, al

JP <location if even>

<go here if odd>

Introduction: data types

•

UTF-8

•

standard in the post-ASCII world

•

backwards compatible with ASCII

•

previously, different languages had multi-byte character sets

that clashed

•

Universal Character Set (UCS) Transformation Format 8-bits

(Wikipedia)

Introduction: data types

•

Example:

•

あ Hiragana letter A: UTF-8: E38182

•

Byte 1: E =

, 3 = 0011

•

Byte 2: 8 =

00, 1 = 0001

•

Byte 3: 8 =

00, 2 = 0010

•

い Hiragana letter I: UTF-8: E38184

Shift-JIS (Hex):

あ

82A0

い

82A2

Introduction: data types

•

How can you tell what encoding your file is using?

•

Detecting UTF-8

•

Microsoft:

•

st

 three bytes in the file is EF BB BF

•

not all software understands this; not everybody uses it

•

HTML:

•

meta

 http-equiv="Content-Type" content="text/html;charset=UTF-8" >

•

not always present

•

Analyze the file:

•

Find non-valid UTF-8 sequences: if found, not UTF-8…

•

Interesting paper:

•

http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html

Introduction: data types

•

Text files:

•

text files have lines:

how do we mark the end of a line

•

End of line (EOL) control character(s):

•

LF

0x0A

(Mac/Linux),

•

CR

0x0D

(Old Macs),

•

CR+LF  0x0D0A (Windows)

•

End of file (EOF) control character:

•

(EOT) 0x04 (

aka

 Control-D)

binaryvision.nl

programming languages:

NUL used to mark

the end of a string

Slide Note

Embed Share

Download

The lecture explores topics such as Python's underlying architecture, memory representation, increased address space, 64-bit registers, and machine language instructions. It delves into binary notation, memory organization, CPU architecture specifics, and the translation of programming languages into machine code. Dive into the complexities of computer systems and their language processing capabilities.

vanorman_s Follow

Uploaded on Feb 22, 2025 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

LING 388: Computers and Language Lecture 4

Administrivia Quick Homework 3 graded Homework 3 Review What's underneath Python: a deep dive

Homework 3 Review Euler s Identity: Arbitrary precision: >>> from decimal import * >>> getcontext() Context(prec=28, rounding=ROUND_HALF_EVEN, Emin=-999999, Emax=999999, capitals=1, clamp=0, flags=[], traps=[InvalidOperation, DivisionByZero, Overflow]) >>> import cmath >>> i = cmath.sqrt(-1) >>> pi = Decimal('3.141592653589793238462643383279 ) >>> pi Decimal('3.141592653589793238462643383279 ) >> pi / 2 Decimal('1.570796326794896619231321692') >>> i*pi Traceback (most recent call last): File "<stdin>", line 1, in <module> TypeError: unsupported operand type(s) for *: 'complex' and 'decimal.Decimal'

The underlying architecture Python is a high-level language Beneath Python lies binary interesting use of binary notation: 6 bit Leica camera lens encoding The 6-bit code shown on the right is 101011 (1 = black, 0 = white), read starting from the philips screw head

Introduction array a[23] 0 Memory Representation binary: zeros and ones (1 bit) organized into bytes (8 bits) memory is byte-addressable word (32 bits) e.g. integer (64 bits: floating point number) big-endian/little-endian most significant byte first or least significant byte communication addressable Memory (RAM) your Intel and ARM CPUs FFFFFFFF

Introduction Increased address space and 64-bit registers A typical notebook computer Example: a 2013 Macbook Air CPU: Core i5-4250U 1.3 billion transistors built-in GPU TDP: 15W (1.3 GHz) Dual core (Turbo: 2.6 GHz) Hyper-Threaded (4 logical CPUs, 2 physical) 64 bit 64 KB (32 KB Instruction + 32 KB Data) L1 cache 256 KB L2 cache per core 12MB L3 cache shared 16GB max RAM anandtech.com A 4 core machine: 8 virtual

Introduction Machine Language A CPU understands only one language: machine language allotherlanguages must be translated into machine language Primitive instructions include: MOV PUSH POP ADD / SUB INC / DEC IMUL / IDIV AND / OR / XOR / NOT NEG SHL / SHR JMP CMP JE / JNE / JZ / JG / JGE / JL / JLE CALL / RET Assembly Language: (this notation) by definition, nothing built on it is more powerful http://www.cs.virginia.edu/~evans/cs216/guides/x86.html

Introduction Not all the machine instructions are conceptually necessary many provided for speed/efficiency Theoretical Computer Science All mechanical computation can be carried out using a TURING MACHINE Finite state table + (infinite) tape Tape instructions: at the tape head: Erase, Write, Move (Left/Right/NoMove) Finite state table: Current state x Tape symbol --> new state x New Tape symbol x Move

Introduction Storage: based on digital logic binary (base 2) everything is a power of 2 Byte: 8 bits 01011011 = 26+24+23+21+20 = 64 + 16 + 8 + 2 + 1 = 91 (in decimal) Hexadecimal (base 16) 0-9,A,B,C,D,E,F (need 4 bits) 5B (= 1 byte) = 5*161 + 11 = 80 + 11 = 91 27 27 26 26 25 25 24 24 23 23 22 22 21 21 20 20 0 1 0 1 1 0 1 1 23 23 23 22 22 22 21 21 21 20 20 20 23 23 23 22 22 22 21 21 21 20 20 20 0 0 1 1 0 0 1 1 1 0 1 1 B 5 161160 161160 5 B

Introduction: data types Integers In one byte (= 8 bits), what s the largest and smallest number, we can represent? Answer: -128 .. 0 .. 127 Why? -28-1.. 0 .. 28-1 1 00000000 = 0 01111111 = 127 10000000 = -128 11111111 = -1 27 27 27 26 26 26 25 25 25 24 24 24 23 23 23 22 22 22 21 21 21 20 20 20 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 27 + 26 + 25 + 24 + 23 + 22 + 21 + 20 28 1 = 255 27 27 26 26 25 25 24 24 23 23 22 22 21 21 20 20 0 0 0 1 0 1 0 1 0 1 0 1 0 1 0 1 27 1 = 127 27 26 25 24 23 22 21 20 1 0 0 0 0 0 0 0

Introduction: data types Integers In one byte (= 8 bits), what s the largest and smallest number, we can represent? 00000000 = 0 01111111 = 127 10000000 = -128 11111111 = -1 00000000 11111111 0 127 -128 -127 -1 2 s complement representation

Introduction: data types Integers In one byte, what s the largest and smallest number, we can represent? Answer: -128 .. 0 .. 127 using the 2 s complement representation Why? super-convenient for arithmetic operations to convert a positive integer X to its negative counterpart, flip all the bits, and add 1 Example: 00001010 = 23 + 21 = 10 (decimal) 11110101 + 1 = 11110110 = -10 (decimal) 11110110 flip + 1 = 00001001 + 1 = 00001010 Addition: -10 + 10 = 11110110 + 00001010 = 0 (ignore overflow)

Introduction: data types Typically 32 bits (4 bytes) are used to store an integer range: -2,147,483,648 (231-1 -1) to 2,147,483,647 (232-1 -1) 231 230 229 228 227 226 225 224 27 26 25 24 23 22 21 20 byte 3 byte 2 byte 1 byte 0 C: int what if you want to store even larger numbers? Binary Coded Decimal (BCD) code each decimal digit separately, use a string (sequence) of decimal digits

Introduction: data types what if you want to store even larger numbers? Binary Coded Decimal (BCD) 1 byte can code two digits (0-9 requires 4 bits) 1 nibble (4 bits) codes the sign (+/-), e.g. hex C/D 23 22 21 20 2 0 1 4 0 0 0 0 0 2 bytes (= 4 nibbles) 23 22 21 20 1 + 2 0 1 4 0 0 0 1 2.5 bytes (= 5 nibbles) 23 22 21 20 9 debit (-) credit (+) 1 0 0 1 23 22 21 20 23 22 21 20 C D 1 1 0 0 1 1 0 1

Introduction: data types Typically, 64 bits (8 bytes) are used to represent floating point numbers (double precision) c = 2.99792458 x 108 (m/s) coefficient: 52 bits (implied 1, therefore treat as 53) exponent: 11 bits (not 2 s complement, unsigned with bias) sign: 1 bit (+/-) C: float double x86 CPUs have a built-in floating point coprocessor (x87) 80 bit long registers wikipedia

Introduction: data types C: char How about letters, punctuation, etc.? ASCII American Standard Code for Information Interchange Based on English alphabet (upper and lower case) + space + digits + punctuation + control (Teletype Model 33) Question: how many bits do we need? 7 bits + 1 bit parity Remember everything is in binary Teletype Model 33 ASR Teleprinter (Wikipedia)

Introduction: data types order is important in sorting! 0-9: there s a connection with BCD. Notice: code 30 (hex) through 39 (hex)

Introduction: data types x86 assemby language: 1. PF: even parity flag set by arithmetic ops. 2. TEST: AND (don t store result), sets PF 3. JP: jump if PF set Parity bit: transmission can be noisy parity bit can be added to ASCII code can spot single bit transmission errors even/odd parity: receiver understands each byte should be even/odd Example: 0 (zero) is ASCII 30 (hex) = 011000 even parity: 0110000, odd parity: 0110001 Checking parity: Exclusive or (XOR): basic machine instruction A xor B true if either A or B true but not both Example: (even parity 0) 0110000 xor bit by bit 0 xor 1 = 1 xor 1 = 0 xor 0 = 0 xor 0 = 0 xor 0 = 0 xor 0 = 0 xor 0 = 0 Example: MOV al,<char> TEST al, al JP <location if even> <go here if odd>

Introduction: data types UTF-8 standard in the post-ASCII world backwards compatible with ASCII (previously, different languages had multi-byte character sets that clashed) Universal Character Set (UCS) Transformation Format 8-bits (Wikipedia)

Introduction: data types Example: Hiragana letter A: UTF-8: E38182 Byte 1: E = 1110, 3 = 0011 Byte 2: 8 = 1000, 1 = 0001 Byte 3: 8 = 1000, 2 = 0010 Hiragana letter I: UTF-8: E38184 Shift-JIS (Hex): : 82A0 : 82A2

Introduction: data types How can you tell what encoding your file is using? Detecting UTF-8 Microsoft: 1st three bytes in the file is EF BB BF (not all software understands this; not everybody uses it) HTML: <meta http-equiv="Content-Type" content="text/html;charset=UTF-8" > (not always present) Analyze the file: Find non-valid UTF-8 sequences: if found, not UTF-8 Interesting paper: http://www-archive.mozilla.org/projects/intl/UniversalCharsetDetection.html

Introduction: data types Text files: text files have lines: how do we mark the end of a line? End of line (EOL) control character(s): LF 0x0A (Mac/Linux), CR 0x0D (Old Macs), CR+LF 0x0D0A (Windows) End of file (EOF) control character: (EOT) 0x04 (aka Control-D) programming languages: NUL used to mark the end of a string binaryvision.nl

LING 388: Computers and Language

Download Presentation

Presentation Transcript

Related

More Related Content