Ask Sawal

Discussion Forum

Hrishitaa Danyluk




Posted Questions


No Question(s) posted yet!

Posted Answers



Answer


Following the treatment plan can help a person stay healthy, but it's not a cure for diabetes. Right now, there's no cure for diabetes, so people with type 1.


Answer is posted for the following question.

How to remove diabetes mellitus type 1?

Answer


Kansas, United States About Youtuber Beauty blogger & Youtube guru. Confidence and beauty from the inside out, with the help of a few makeup


Answer is posted for the following question.

What youtube beauty guru?

Answer


3 syllables


Answer is posted for the following question.

Bountiful how many syllables?

Answer


The new JTBC drama "My ID Is Gangnam Beauty" explores the two sides of having plastic surgery in Korea ― becoming more beautiful and


Answer is posted for the following question.

What does gangnam beauty mean?

Answer


Why do you want to work at our company · Do you have any questions · How much salary do you expect · Rate me as an interviewer · What was the toughest


Answer is posted for the following question.

Why do you want to join our company?

Answer


Ceiba pentandra is a tropical tree of the order Malvales and the family Malvaceae, native to Mexico, Central America and the Caribbean, northern South America, and West Africa. A somewhat smaller variety was introduced to South and Southeast Asia, where it is cultivated.


Answer is posted for the following question.

What is silk cotton?

Answer


A steering committee is a group of people, usually managers. It is formed to oversee and support a project from management level. Committee members are


Answer is posted for the following question.

Who chairs a project steering committee?

Answer


From a coder's performance standpoint, it makes little sense why CPU technologies have evolved the way they have, particularly on the x86 architecture. Inside every x86 CPU (and, to some extent, most other architectures), several different technologies try to speed up code execution. Note that this is not an exhaustive list and these are all woefully inadequate definitions, but I didn't want to write a book when I wrote this post.

The above technologies seem awesome, and any program utilizing them all should theoretically run hundreds of times faster than it would without them. Together, they make CPUs, particularly AMD and Intel x86 processors (which feature all of the above), really, really fast. However, there are lots of problems with this performance that severely reduce the speedup in most software. Again, there is some specialized software that benefits from these technologies. My complaint is that practical everyday software like web browsers, text editors, and games doesn't receive a massive speedup from them, because they are largely inapplicable to the greatest bottlenecks in the code. If you make 80% of the code run four times faster, then overall the code will only be 2.5x as fast (the untouched 20% plus the accelerated 80%/4 = 20% still take 40% of the original time), and that's the problem. These CPU technologies can speed up large sections of the code, but there will still be portions that gain no advantage, and those portions make the software run slowly.
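The 80% / 4x arithmetic above is an instance of Amdahl's law. As a minimal sketch (the function name `amdahl_speedup` is just an illustrative choice, not something from the post), the calculation looks like this:

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction `p` of the run time is made
 * `s` times faster and the remaining (1 - p) is left unchanged.
 * The name and signature are hypothetical, chosen for this illustration. */
double amdahl_speedup(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

Calling `amdahl_speedup(0.8, 4.0)` gives the 2.5x figure, and no matter how large `s` gets, the result for `p = 0.8` can never exceed 5x, because the untouched 20% always remains.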

I believe the solution is to create an extremely generic form of hardware coroutines that can be applied to any code to speed it up.

Some downsides of hardware coroutines are:

So, here's an example implementation of the atoll code above rewritten to use hardware coroutines (at least this is what I imagine it would look like):

Observe how the hardware coroutines do about the same number of things. It even looks like the hardware coroutines would be slower in this case. However, the hardware coroutines would be much faster in this case because the pipeline is busy and several instructions can be executed in parallel without the overhead of needing to prove that they have no side effects. Another way of looking at hardware coroutines is that they are similar to speculative execution except that they carefully control and optimize eager execution and they are explicit, so no work needs to be done to try to reverse side effects.

Next, let me explain one possible way the registers, variables, and stack would work. We add a new series of registers local to each coroutine, say C0-C15, where C0-C7 hold the parent coroutine's state and C8-C15 hold the child coroutine's state. Entering a coroutine automatically remaps C8-C15 into C0-C7, such that any writes to C0-C7 from the child coroutine are visible across all other coroutines spawned by the parent, forming a queue of waiting coroutines when necessary. R0-R15 are unique to each coroutine and start off as a copy of their parent's state. This copying is necessary to ensure backward compatibility with existing libraries.

Next, let's discuss how the stack would work. Well, it's going to be a complete gagglefuck if we try to implement it with a software solution, because we need high performance and we need backward compatibility with code in library function calls not compiled with support for hardware coroutines. So, here's how it would work: each coroutine processor has its own private lookup table correlating the perceived view of the stack to the actual view of the stack. This lookup table is only consulted whenever we offset the rbp register and whenever an assembly push/pop/call/ret instruction is used. The CPU handles the actual stack pointer location and manages pushing things onto the stack. The CPU accesses memory as if it had the privileges of the program, so the usual stack overflow and segmentation fault rules still apply.

Observe the below slow scalar code:
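A minimal sketch of the kind of scalar loop meant here, assuming (from the later discussion of 32-bit additions into `sum`) that it sums an array of 32-bit integers. The function name `sum_scalar` and its signature are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the scalar loop: one addition per element,
 * and every addition depends on the previous value of `sum`. */
uint32_t sum_scalar(const uint32_t *data, size_t n)
{
    assert(data != NULL || n == 0);
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += data[i];  /* unsigned arithmetic wraps on overflow */
    return sum;
}
```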

We can not-so-easily accelerate it with SSE2, for example:
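A rough sketch of what an SSE2 version might look like, again assuming the task is summing an array of 32-bit integers. The function name `sum_sse2` is hypothetical; the intrinsics used (`_mm_setzero_si128`, `_mm_loadu_si128`, `_mm_add_epi32`, `_mm_storeu_si128`) are the standard SSE2 ones from `<emmintrin.h>`, and the code falls back to the plain loop when SSE2 is not available at compile time:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* Hypothetical SSE2 summation: four 32-bit lanes are accumulated in
 * parallel, then the lanes and any leftover elements are added up. */
uint32_t sum_sse2(const uint32_t *data, size_t n)
{
    assert(data != NULL || n == 0);
    uint32_t sum = 0;
    size_t i = 0;
#if defined(__SSE2__)
    __m128i acc = _mm_setzero_si128();
    for (; i + 4 <= n; i += 4) {
        /* unaligned load, so `data` needs no special alignment */
        __m128i v = _mm_loadu_si128((const __m128i *)(data + i));
        acc = _mm_add_epi32(acc, v);
    }
    uint32_t lanes[4];
    _mm_storeu_si128((__m128i *)lanes, acc);
    sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    for (; i < n; i++)  /* tail elements (or everything, without SSE2) */
        sum += data[i];
    return sum;
}
```

Even in this simple case the vector version needs a separate accumulator, a horizontal reduction, and a tail loop, which is the "not-so-easily" part.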

With coroutines, the number of parallel summations scales to the concurrency of the CPU, enabling potentially drastically better performance than SSE.

The above code is locked into serially adding each element because every operation inside the hardware coroutine is atomic with respect to all other coroutines, so the coroutines are stuck waiting in line for the previous coroutine to finish adding to sum. Nevertheless, this will rival and perhaps surpass the performance of SIMD. The reason why is superscalar execution. While the coroutines are waiting in line, they have time to investigate adjacent queued coroutines. Upon seeing that there are two adjacent 32-bit additions waiting in line and recognizing that integer addition is associative, adjacent queue slots will be added out of order while they're in the queue, enabling a high degree of parallelism.
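The reordering described above can be modelled in ordinary software. This is only a sketch of the idea, not of any real hardware: the array `pending` stands in for the queue of additions waiting on `sum`, and the name `sum_pairwise` is hypothetical. Because unsigned 32-bit addition is associative (and wraps consistently), adjacent slots can be combined in any grouping and still give the same total as the serial loop:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Software model only. Each pass combines adjacent queue slots; the pairs
 * within one pass do not depend on each other, so they could in principle
 * be added at the same time. Overwrites `pending` as scratch space. */
uint32_t sum_pairwise(uint32_t *pending, size_t n)
{
    assert(pending != NULL || n == 0);
    if (n == 0)
        return 0;
    while (n > 1) {
        size_t half = n / 2;
        for (size_t i = 0; i < half; i++)
            pending[i] = pending[2 * i] + pending[2 * i + 1];
        if (n % 2 != 0) {
            pending[half] = pending[n - 1];  /* odd slot carries over */
            n = half + 1;
        } else {
            n = half;
        }
    }
    return pending[0];
}
```

A serial chain of n additions takes n - 1 dependent steps, while this grouping needs only about log2(n) passes of independent additions.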

You are probably thinking "it's too complicated." Well, let me ask you, is speculative execution too complicated? If you asked anyone 30 years ago whether mainstream consumer computers were going to support speculative execution they would have laughed at you. And, here we are today.

Also, an important thing to note is that I imagine the assembly for these hardware coroutines being very primitive and very simple in order to maximize performance. The fancy syntax I wrote in my C code snippets is what I imagine the syntactic sugar would look like, much in the same way that if/for/while/switch statements are syntactic sugar for complex conditional jumps.

Yes, CPUs already do a similar superscalar parallelization to some extent. However, don't let the hype fool you. CPUs are really dumb. The coder understands his/her code far better than the compiler, which understands the code far better than the CPU. Allowing the coder to assist the CPU with parallelization and out-of-order processing will, I believe, have drastic positive effects on performance, especially because more coroutine concurrency means a faster computer, so CPU manufacturers will eventually start producing consumer CPUs with hundreds of parallel coroutines. They don't produce those CPUs right now because no-one would buy them because so much everyday software is stuck in a single thread.

There's been a lot of talk in the comments section about GPUs, so let's clear up the confusion here:

Basically, I believe the confusion many people have is between the concepts of GPU and GPGPU. When we use the hardware as a GPU, we are using it to render lots of triangles in 2D or 3D space onto a 2D projection (our screen). When we use it as a GPGPU, we do all sorts of crazy tricks to reimagine this triangle rendering as data processing so that we can do tons of things in parallel. There is no one GPGPU. There are tons of ways to use the GPU as a GPGPU, and all these ways involve "tricking" the GPU into executing a task in parallel and then gathering the results. One very primitive example might be to use various buffers to feed data into the GPU, where a shader determines the output color at each pixel on the screen. Then, we capture the screen as a bitmap instead of showing it to the user in order to gather the result. Real-world GPGPU workloads are a lot more complicated and typically use an extremely low-level interface that directly manipulates the GPU, practically discarding the entire concept of triangles and rendering.

Knowing the difference between GPU and GPGPU, I hope you now understand why this section is titled "GPGPU." So, let's discuss the disadvantages of the GPGPU:

Basically, the GPGPU is even more application-specific than SIMD and AVX(-512). And it requires a ton of knowledge and time to get right, but even then you still might not gain any performance. You need to be performing millions of complex, completely independent computations in parallel before the GPGPU becomes a practical option, and that's extremely rare in real workloads.

Hardware coroutines aim to speed up general-purpose microtasks lasting 100ns or less. Hardware coroutines also aim to make the CPU more efficient at resolving the critical path by streamlining the critical pipeline so that it can run super-speed fast. These two objectives exist in a different paradigm than the GPU.


Answer is posted for the following question.

Why avx 512 is bad?

Answer


Portarlington Swimming Beach

Address: 1A Harding St, Portarlington VIC 3223, Australia


Answer is posted for the following question.

Where are the best beach to swim in Geelong, Australia?

Answer


How do you say Pevensey and Westham? Listen to the audio pronunciation of Pevensey and Westham on pronouncekiwi.


Answer is posted for the following question.

How to pronounce pevensey?

Answer


The business may be a sole proprietorship, a Partnership (general or limited), the Mortgages and notes payable in less than one year and determine whether


Answer is posted for the following question.

How to calculate k-1 income for mortgage?

Answer


After successfully passing the relevant SIA Training course with our training provider you are eligible to apply for the SIA Licence This is your


Answer is posted for the following question.

How to become sia?

Answer


The Food Co-op Shop and Cafe

Address: 3 Kingsley St, Canberra ACT 2601, Australia


Answer is posted for the following question.

Where can I find best fruit and veg shop in Canberra, Australia?

Answer


The Uffizi is normally open Tuesday to Sunday, with entrance bookable between 8:30 AM and 12:30 PM (the gallery closes at 1:30 PM) during the week, and 8:30


Answer is posted for the following question.

How to book uffizi tickets?

